Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens...
Transcript of Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens...
![Page 1: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/1.jpg)
Programming Multi-Core Systemswith OpenMP
Clemens Grelck
University of Amsterdam
UvA / SURFsaraHigh Performance Computing and Big Data
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 2: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/2.jpg)
Programming Multi-Core Systems with OpenMP
Targeted Architectures
OpenMP at a Glance
Loop Parallelization
Scheduling
Outlook
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 3: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/3.jpg)
Target Multi-core Systems
Small-scale general-purpose (x86) multicore processors:
I Intel / AMD commodity processors with 2, 4, 6 or 8 cores
I potentially hyperthreaded
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 4: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/4.jpg)
Target Multi-core Systems
Medium-scale server systems:
I multiple (2 or 4 in practice) identical processors
I each processor with several cores
I high bandwidth data path between processors
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 5: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/5.jpg)
Target Multi-core Systems
Large-scale shared address space compute systems:
I large number of slightly simpler cores
I SUN MicroSystems / Oracle Niagara / UltraSparc T series
I up to 512 hardware threads (T3-4 server)
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 6: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/6.jpg)
Symmetric Multiprocessor Architecture Model
core core corecore corecore corecore core core corecorecore core corecore
L2
L3 cache
L2
L1
Shared Address Space
L3 cache L3 cache L3 cache
L1 L1DI
L2
L1 L1DI
L2
L1 L1DI
L2
L1 L1DI
L2
L1 L1DI
L2
L1 L1DI
L2
L1DI
L1 L1DI
L2
L1 L1DI
L2
L1 L1DI
L2
L1 L1DI
L2
L1 L1DI
L2
L1 L1DI
L2
L1 L1DI
L2
L1 L1DI
L2
L1 L1DI
Characteristics:
I Shared address space notion of shared memory
I Multiple levels of hardware-coherent caches
I Multiple processors
I Each processor has multiple cores
I Each core has multiple hardware threads
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 7: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/7.jpg)
Programming Multi-Core Systems with OpenMP
Targeted Architectures
OpenMP at a Glance
Loop Parallelization
Scheduling
Outlook
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 8: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/8.jpg)
Design Rationale of OpenMP
Ideal:
I Automatic parallelisation of sequential code.
I No additional parallelisation e↵ort for development,debugging, maintenance, etc.
Problem:
I Data dependences are di�cult to assess.
I Compilers must be conservative in their assumptions.
Way out:
I Take or write ordinary sequential program.
I Add annotations/pragmas/compiler directives that guideparallelisation.
I Let the compiler generate the corresponding code.
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 9: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/9.jpg)
Design Rationale of OpenMP
Ideal:
I Automatic parallelisation of sequential code.
I No additional parallelisation e↵ort for development,debugging, maintenance, etc.
Problem:
I Data dependences are di�cult to assess.
I Compilers must be conservative in their assumptions.
Way out:
I Take or write ordinary sequential program.
I Add annotations/pragmas/compiler directives that guideparallelisation.
I Let the compiler generate the corresponding code.
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 10: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/10.jpg)
Design Rationale of OpenMP
Ideal:
I Automatic parallelisation of sequential code.
I No additional parallelisation e↵ort for development,debugging, maintenance, etc.
Problem:
I Data dependences are di�cult to assess.
I Compilers must be conservative in their assumptions.
Way out:
I Take or write ordinary sequential program.
I Add annotations/pragmas/compiler directives that guideparallelisation.
I Let the compiler generate the corresponding code.
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 11: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/11.jpg)
OpenMP at a Glance
OpenMP as a programming interface:
I Compiler directivesI Library functionsI Environment variables
C/C++ version:
#pragma omp name [clause]*
structured block
Fortran version:
!$ OMP name [ clause [, clause]*]
code block
!$ OMP END name
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 12: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/12.jpg)
OpenMP at a Glance
OpenMP as a programming interface:
I Compiler directivesI Library functionsI Environment variables
C/C++ version:
#pragma omp name [clause]*
structured block
Fortran version:
!$ OMP name [ clause [, clause]*]
code block
!$ OMP END name
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 13: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/13.jpg)
Hello World with OpenMP
Example program:
#include "omp.h"
#include <stdio.h>
int main()
{
printf( "Starting execution with %d threads .\n",
omp_get_num_threads ());
printf( "Hello world says thread %d of %d.\n",
omp_get_thread_num (),
omp_get_num_threads ());
printf( "Execution of %d threads terminated .\n",
omp_get_num_threads ());
return 0;
}
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 14: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/14.jpg)
Hello World with OpenMP
#include "omp.h"
#include <stdio.h>
int main()
{
printf( "Starting execution with %d threads .\n",
omp_get_num_threads ());
printf( "Hello world says thread %d of %d.\n",
omp_get_thread_num (),
omp_get_num_threads ());
printf( "Execution of %d threads terminated .\n",
omp_get_num_threads ());
return 0;
}
Compiling the code:
gcc -fopenmp hello_world.c
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 15: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/15.jpg)
Hello World with OpenMP#include "omp.h"
#include <stdio.h>
int main()
{
printf( "Starting execution with %d threads .\n",
omp_get_num_threads ());
printf( "Hello world says thread %d of %d.\n",
omp_get_thread_num (),
omp_get_num_threads ());
printf( "Execution of %d threads terminated .\n",
omp_get_num_threads ());
return 0;
}
Running the compiled code:
Starting execution with 1 threads.
Hello world says thread 0 of 1.
Execution of 1 threads terminated.
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 16: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/16.jpg)
Hello World with OpenMP — now in parallel
#include "omp.h"
#include <stdio.h>
int main()
{
printf( "Starting execution with %d threads .\n",
omp_get_num_threads ());
#pragma omp parallel
{
printf( "Hello world says thread %d of %d.\n",
omp_get_thread_num (),
omp_get_num_threads ());
}
printf( "Execution of %d threads terminated .\n",
omp_get_num_threads ());
return 0;
}
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 17: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/17.jpg)
Hello World with OpenMP — now in parallel
Running the code with 4 threads:
Starting execution with 1 threads.
Hello world says thread 2 of 4.
Hello world says thread 3 of 4.
Hello world says thread 1 of 4.
Hello world says thread 0 of 4.
Execution of 1 threads terminated.
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 18: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/18.jpg)
Hello World with OpenMP — now in parallel
Running the code with 4 threads:
Starting execution with 1 threads.
Hello world says thread 2 of 4.
Hello world says thread 3 of 4.
Hello world says thread 1 of 4.
Hello world says thread 0 of 4.
Execution of 1 threads terminated.
Who determines number of threads ?
I Environment variable: export OMP_NUM_THREADS=4
I Library function: void omp_set_num_threads( int)
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 19: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/19.jpg)
OpenMP Execution Model
Classical Fork/Join:
Master and slave threads concurrently
Master thread executes serial code.
Master thread encounters parallel directive.
Implicit barrier, wait for all threads to finish.
Master thread resumes serial execution.
execute parallel block.
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 20: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/20.jpg)
Programming Multi-Core Systems with OpenMP
Targeted Architectures
OpenMP at a Glance
Loop Parallelization
Scheduling
Outlook
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 21: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/21.jpg)
Simple Loop Parallelisation
Example: element-wise vector product:
void elem_prod( double *c, double *a, double *b, int len)
{
int i;
for (i=0; i<len; i++)
{
c[i] = a[i] * b[i];
}
}
Idea of parallelisation:
I Have each thread compute some disjoint part of the vectors:
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 22: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/22.jpg)
Simple Loop Parallelisation
Example: element-wise vector product:
void elem_prod( double *c, double *a, double *b, int len)
{
int i;
for (i=0; i<len; i++)
{
c[i] = a[i] * b[i];
}
}
Idea of parallelisation:
I Have each thread compute some disjoint part of the vectors:
*= = = =
c
b
a
* * *
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 23: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/23.jpg)
Simple Loop Parallelisation
Example: parallelised element-wise vector product:
void elem_prod( double *c, double *a, double *b, int len)
{
int i;
#pragma omp parallel for
for (i=0; i<len; i++)
{
c[i] = a[i] * b[i];
}
}
Prerequisite:
I No data dependence between any two iterations.
I Caution: YOU claim this property !!
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 24: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/24.jpg)
Simple Loop Parallelisation
Example: parallelised element-wise vector product:
void elem_prod( double *c, double *a, double *b, int len)
{
int i;
#pragma omp parallel for
for (i=0; i<len; i++)
{
c[i] = a[i] * b[i];
}
}
Prerequisite:
I No data dependence between any two iterations.
I Caution: YOU claim this property !!
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 25: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/25.jpg)
Simple Loop Parallelisation
Example: parallelised element-wise vector product:
void elem_prod( double *c, double *a, double *b, int len)
{
int i;
#pragma omp parallel for
for (i=0; i<len; i++)
{
c[i] = a[i] * b[i];
}
}
Prerequisite:
I No data dependence between any two iterations.
I Caution: YOU claim this property !!
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 26: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/26.jpg)
Directive #pragma omp parallel for
What the compiler directive does for you:
I It starts additional worker threads depending onOMP NUM THREADS or omp set num threads().
I It divides the iteration space among all threads.
I It lets all threads execute loop restricted to their mutuallydisjoint subsets.
I It synchronizes all threads at an implicit barrier.
I It terminates the worker threads.
Restrictions:
I The directive must directly precede for-loop.
I The for-loop must match a constrained pattern.
I The trip-count of the for-loop must be known in advance.
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 27: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/27.jpg)
Directive #pragma omp parallel for
What the compiler directive does for you:
I It starts additional worker threads depending onOMP NUM THREADS or omp set num threads().
I It divides the iteration space among all threads.
I It lets all threads execute loop restricted to their mutuallydisjoint subsets.
I It synchronizes all threads at an implicit barrier.
I It terminates the worker threads.
Restrictions:
I The directive must directly precede for-loop.
I The for-loop must match a constrained pattern.
I The trip-count of the for-loop must be known in advance.
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 28: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/28.jpg)
Shared and Private Variables
Example:
#pragma omp parallel for
for (i=0; i<len; i++)
{
c[i] = a[i] * b[i];
}
Private variables:
I One private instance for each threadI No communication between threads within parallel sectionI No communication between parallel and sequential sections
Shared variables:
I One shared instance for all threadsI Allows communication between threads within parallel sectionI Allows communication between parallel and sequential sections
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 29: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/29.jpg)
Shared and Private Variables
Example:
#pragma omp parallel for
for (i=0; i<len; i++)
{
c[i] = a[i] * b[i];
}
Who decides that a, b, c and len are shared variables,whereas i is private ??
Default rules:
I All variables are shared.
I Only loop variables of parallel loops are private.
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 30: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/30.jpg)
Shared and Private Variables
Example:
#pragma omp parallel for
for (i=0; i<len; i++)
{
c[i] = a[i] * b[i];
}
Who decides that a, b, c and len are shared variables,whereas i is private ??
Default rules:
I All variables are shared.
I Only loop variables of parallel loops are private.
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 31: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/31.jpg)
Shared and Private Variables
The default rule is not what you want ?
I The shared clause determines shared variables
I The private clause determines private variables
Example with explicit clauses:
#pragma omp parallel for private( i) shared( c, a, b, len)
for (i=0; i<len; i++)
{
c[i] = a[i] * b[i];
}
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 32: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/32.jpg)
From Vector Product to Scalar Product
Scalar product:double scalar_prod( double *a, double *b, int len)
{
int i; double ep , sp=0.0;
for (i=0; i<len; i++)
{
ep = a[i] * b[i];
sp = sp + ep;
}
return sp;
}
sp
a
* * * *b
ep
+
= = = =
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 33: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/33.jpg)
From Vector Product to Scalar Product
sp
a
* * * *b
ep
+
= = = =
Reduction operations:
I Reductions reduce a set of elements to one
I Using some function
I Examples: sum or product of a set of numbers
Can we parallelise reductions ?
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 34: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/34.jpg)
From Vector Product to Scalar Product
sp
a
* * * *b
ep
+
= = = =
Reduction operations:
I Reductions reduce a set of elements to one
I Using some function
I Examples: sum or product of a set of numbers
Can we parallelise reductions ?
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 35: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/35.jpg)
Parallel Reduction
Scalar product:double scalar_prod( double *a, double *b, int len)
{
int i; double ep , sp=0.0;
for (i=0; i<len; i++)
{
ep = a[i] * b[i];
sp = sp + ep;
}
return sp;
}
Properties:
I Reduction introduces a loop-carried dependence.
I Good news: operation is associativeI Order of reductions irrelevant for final resultI (If we ignore deficiencies of computer arithmetic...)
Private variables ?
Shared variables ?
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 36: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/36.jpg)
Parallel Reduction
Scalar product:double scalar_prod( double *a, double *b, int len)
{
int i; double ep , sp=0.0;
for (i=0; i<len; i++)
{
ep = a[i] * b[i];
sp = sp + ep;
}
return sp;
}
Properties:
I Reduction introduces a loop-carried dependence.I Good news: operation is associative
I Order of reductions irrelevant for final resultI (If we ignore deficiencies of computer arithmetic...)
Private variables ?
Shared variables ?
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 37: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/37.jpg)
Parallel Reduction
Scalar product:double scalar_prod( double *a, double *b, int len)
{
int i; double ep , sp=0.0;
for (i=0; i<len; i++)
{
ep = a[i] * b[i];
sp = sp + ep;
}
return sp;
}
Properties:
I Reduction introduces a loop-carried dependence.I Good news: operation is associativeI Order of reductions irrelevant for final result
I (If we ignore deficiencies of computer arithmetic...)
Private variables ?
Shared variables ?
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 38: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/38.jpg)
Parallel Reduction
Scalar product:double scalar_prod( double *a, double *b, int len)
{
int i; double ep , sp=0.0;
for (i=0; i<len; i++)
{
ep = a[i] * b[i];
sp = sp + ep;
}
return sp;
}
Properties:
I Reduction introduces a loop-carried dependence.I Good news: operation is associativeI Order of reductions irrelevant for final resultI (If we ignore deficiencies of computer arithmetic...)
Private variables ?
Shared variables ?
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 39: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/39.jpg)
Parallel Reduction
Scalar product:double scalar_prod( double *a, double *b, int len)
{
int i; double ep , sp=0.0;
for (i=0; i<len; i++)
{
ep = a[i] * b[i];
sp = sp + ep;
}
return sp;
}
Properties:
I Reduction introduces a loop-carried dependence.I Good news: operation is associativeI Order of reductions irrelevant for final resultI (If we ignore deficiencies of computer arithmetic...)
Private variables ?
Shared variables ?
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 40: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/40.jpg)
Parallel Reduction
Parallel scalar product:
double scalar_prod( double *a, double *b, int len)
{
int i; double ep , sp=0.0;
#pragma omp parallel for shared( a, b, len , sp) \
private( i, ep)
for (i=0; i<len; i++)
{
ep = a[i] * b[i];
sp = sp + ep;
}
return sp;
}
Is this really correct ?
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 41: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/41.jpg)
Parallel Reduction
Parallel scalar product:
double scalar_prod( double *a, double *b, int len)
{
int i; double ep , sp=0.0;
#pragma omp parallel for shared( a, b, len , sp) \
private( i, ep)
for (i=0; i<len; i++)
{
ep = a[i] * b[i];
sp = sp + ep;
}
return sp;
}
Is this really correct ?
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 42: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/42.jpg)
A Look into Assembly
The troublemaker in C:
sp = sp + ep;
The troublemaker in (pseudo) assembly:
load sp -> reg1
load ep -> reg2
add reg1 , reg2 -> reg1
store reg1 -> sp
Problem:
I Same code executed by multiple threads simultaneously
I Any interleaving of assembly instructions possible
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 43: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/43.jpg)
A Look into Assembly
The troublemaker in C:
sp = sp + ep;
The troublemaker in (pseudo) assembly:
load sp -> reg1
load ep -> reg2
add reg1 , reg2 -> reg1
store reg1 -> sp
Problem:
I Same code executed by multiple threads simultaneously
I Any interleaving of assembly instructions possible
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 44: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/44.jpg)
Interleaved Execution of Threads
Thread 1:load sp -> reg1
load ep -> reg2
add reg1, reg2 -> reg1
store reg1 -> sp
Thread 2:load sp -> reg1
load ep -> reg2
add reg1, reg2 -> reg1
store reg1 -> sp
Thread 1 private Shared Thread 2 privateep reg1 reg2 sp ep reg1 reg28 0 0 29 5 0 0
8 29 0 29 5 0 08 29 8 29 5 0 08 37 8 29 5 0 08 37 8 37 5 0 08 37 8 37 5 37 08 37 8 37 5 37 58 37 8 37 5 42 58 37 8 42 5 42 5
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 45: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/45.jpg)
Interleaved Execution of Threads
Thread 1:load sp -> reg1
load ep -> reg2
add reg1, reg2 -> reg1
store reg1 -> sp
Thread 2:load sp -> reg1
load ep -> reg2
add reg1, reg2 -> reg1
store reg1 -> sp
Thread 1 private Shared Thread 2 privateep reg1 reg2 sp ep reg1 reg28 0 0 29 5 0 08 29 0 29 5 0 0
8 29 8 29 5 0 08 37 8 29 5 0 08 37 8 37 5 0 08 37 8 37 5 37 08 37 8 37 5 37 58 37 8 37 5 42 58 37 8 42 5 42 5
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 46: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/46.jpg)
Interleaved Execution of Threads
Thread 1:load sp -> reg1
load ep -> reg2
add reg1, reg2 -> reg1
store reg1 -> sp
Thread 2:load sp -> reg1
load ep -> reg2
add reg1, reg2 -> reg1
store reg1 -> sp
Thread 1 private Shared Thread 2 privateep reg1 reg2 sp ep reg1 reg28 0 0 29 5 0 08 29 0 29 5 0 08 29 8 29 5 0 0
8 37 8 29 5 0 08 37 8 37 5 0 08 37 8 37 5 37 08 37 8 37 5 37 58 37 8 37 5 42 58 37 8 42 5 42 5
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 47: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/47.jpg)
Interleaved Execution of Threads
Thread 1:load sp -> reg1
load ep -> reg2
add reg1, reg2 -> reg1
store reg1 -> sp
Thread 2:load sp -> reg1
load ep -> reg2
add reg1, reg2 -> reg1
store reg1 -> sp
Thread 1 private Shared Thread 2 privateep reg1 reg2 sp ep reg1 reg28 0 0 29 5 0 08 29 0 29 5 0 08 29 8 29 5 0 08 37 8 29 5 0 0
8 37 8 37 5 0 08 37 8 37 5 37 08 37 8 37 5 37 58 37 8 37 5 42 58 37 8 42 5 42 5
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 48: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/48.jpg)
Interleaved Execution of Threads
Thread 1:load sp -> reg1
load ep -> reg2
add reg1, reg2 -> reg1
store reg1 -> sp
Thread 2:load sp -> reg1
load ep -> reg2
add reg1, reg2 -> reg1
store reg1 -> sp
Thread 1 private Shared Thread 2 privateep reg1 reg2 sp ep reg1 reg28 0 0 29 5 0 08 29 0 29 5 0 08 29 8 29 5 0 08 37 8 29 5 0 08 37 8 37 5 0 0
8 37 8 37 5 37 08 37 8 37 5 37 58 37 8 37 5 42 58 37 8 42 5 42 5
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 49: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/49.jpg)
Interleaved Execution of Threads
Thread 1:load sp -> reg1
load ep -> reg2
add reg1, reg2 -> reg1
store reg1 -> sp
Thread 2:load sp -> reg1
load ep -> reg2
add reg1, reg2 -> reg1
store reg1 -> sp
Thread 1 private Shared Thread 2 privateep reg1 reg2 sp ep reg1 reg28 0 0 29 5 0 08 29 0 29 5 0 08 29 8 29 5 0 08 37 8 29 5 0 08 37 8 37 5 0 08 37 8 37 5 37 0
8 37 8 37 5 37 58 37 8 37 5 42 58 37 8 42 5 42 5
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 50: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/50.jpg)
Interleaved Execution of Threads
Thread 1:load sp -> reg1
load ep -> reg2
add reg1, reg2 -> reg1
store reg1 -> sp
Thread 2:load sp -> reg1
load ep -> reg2
add reg1, reg2 -> reg1
store reg1 -> sp
Thread 1 private Shared Thread 2 privateep reg1 reg2 sp ep reg1 reg28 0 0 29 5 0 08 29 0 29 5 0 08 29 8 29 5 0 08 37 8 29 5 0 08 37 8 37 5 0 08 37 8 37 5 37 08 37 8 37 5 37 5
8 37 8 37 5 42 58 37 8 42 5 42 5
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 51: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/51.jpg)
Interleaved Execution of Threads
Thread 1:load sp -> reg1
load ep -> reg2
add reg1, reg2 -> reg1
store reg1 -> sp
Thread 2:load sp -> reg1
load ep -> reg2
add reg1, reg2 -> reg1
store reg1 -> sp
Thread 1 private Shared Thread 2 privateep reg1 reg2 sp ep reg1 reg28 0 0 29 5 0 08 29 0 29 5 0 08 29 8 29 5 0 08 37 8 29 5 0 08 37 8 37 5 0 08 37 8 37 5 37 08 37 8 37 5 37 58 37 8 37 5 42 5
8 37 8 42 5 42 5
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 52: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/52.jpg)
Interleaved Execution of Threads
Thread 1:load sp -> reg1
load ep -> reg2
add reg1, reg2 -> reg1
store reg1 -> sp
Thread 2:load sp -> reg1
load ep -> reg2
add reg1, reg2 -> reg1
store reg1 -> sp
Thread 1 private Shared Thread 2 privateep reg1 reg2 sp ep reg1 reg28 0 0 29 5 0 08 29 0 29 5 0 08 29 8 29 5 0 08 37 8 29 5 0 08 37 8 37 5 0 08 37 8 37 5 37 08 37 8 37 5 37 58 37 8 37 5 42 58 37 8 42 5 42 5
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 53: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/53.jpg)
Interleaved Execution of Threads
Thread 1:load sp -> reg1
load ep -> reg2
add reg1, reg2 -> reg1
store reg1 -> sp
Thread 2:load sp -> reg1
load ep -> reg2
add reg1, reg2 -> reg1
store reg1 -> sp
Thread 1 private Shared Thread 2 privateep reg1 reg2 sp ep reg1 reg28 0 0 29 5 0 0
8 0 0 29 5 29 08 0 0 29 5 29 58 0 0 29 5 34 58 0 0 34 5 34 58 34 0 34 5 34 58 34 8 34 5 34 58 42 8 34 5 34 58 42 8 42 5 34 5
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 54: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/54.jpg)
Interleaved Execution of Threads
Thread 1:load sp -> reg1
load ep -> reg2
add reg1, reg2 -> reg1
store reg1 -> sp
Thread 2:load sp -> reg1
load ep -> reg2
add reg1, reg2 -> reg1
store reg1 -> sp
Thread 1 private Shared Thread 2 privateep reg1 reg2 sp ep reg1 reg28 0 0 29 5 0 08 0 0 29 5 29 0
8 0 0 29 5 29 58 0 0 29 5 34 58 0 0 34 5 34 58 34 0 34 5 34 58 34 8 34 5 34 58 42 8 34 5 34 58 42 8 42 5 34 5
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 55: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/55.jpg)
Interleaved Execution of Threads
Thread 1:load sp -> reg1
load ep -> reg2
add reg1, reg2 -> reg1
store reg1 -> sp
Thread 2:load sp -> reg1
load ep -> reg2
add reg1, reg2 -> reg1
store reg1 -> sp
Thread 1 private Shared Thread 2 privateep reg1 reg2 sp ep reg1 reg28 0 0 29 5 0 08 0 0 29 5 29 08 0 0 29 5 29 5
8 0 0 29 5 34 58 0 0 34 5 34 58 34 0 34 5 34 58 34 8 34 5 34 58 42 8 34 5 34 58 42 8 42 5 34 5
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 56: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/56.jpg)
Interleaved Execution of Threads
Thread 1:load sp -> reg1
load ep -> reg2
add reg1, reg2 -> reg1
store reg1 -> sp
Thread 2:load sp -> reg1
load ep -> reg2
add reg1, reg2 -> reg1
store reg1 -> sp
Thread 1 private Shared Thread 2 privateep reg1 reg2 sp ep reg1 reg28 0 0 29 5 0 08 0 0 29 5 29 08 0 0 29 5 29 58 0 0 29 5 34 5
8 0 0 34 5 34 58 34 0 34 5 34 58 34 8 34 5 34 58 42 8 34 5 34 58 42 8 42 5 34 5
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 57: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/57.jpg)
Interleaved Execution of Threads
Thread 1:load sp -> reg1
load ep -> reg2
add reg1, reg2 -> reg1
store reg1 -> sp
Thread 2:load sp -> reg1
load ep -> reg2
add reg1, reg2 -> reg1
store reg1 -> sp
Thread 1 private Shared Thread 2 privateep reg1 reg2 sp ep reg1 reg28 0 0 29 5 0 08 0 0 29 5 29 08 0 0 29 5 29 58 0 0 29 5 34 58 0 0 34 5 34 5
8 34 0 34 5 34 58 34 8 34 5 34 58 42 8 34 5 34 58 42 8 42 5 34 5
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 58: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/58.jpg)
Interleaved Execution of Threads
Thread 1:load sp -> reg1
load ep -> reg2
add reg1, reg2 -> reg1
store reg1 -> sp
Thread 2:load sp -> reg1
load ep -> reg2
add reg1, reg2 -> reg1
store reg1 -> sp
Thread 1 private Shared Thread 2 privateep reg1 reg2 sp ep reg1 reg28 0 0 29 5 0 08 0 0 29 5 29 08 0 0 29 5 29 58 0 0 29 5 34 58 0 0 34 5 34 58 34 0 34 5 34 5
8 34 8 34 5 34 58 42 8 34 5 34 58 42 8 42 5 34 5
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 59: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/59.jpg)
Interleaved Execution of Threads
Thread 1:load sp -> reg1
load ep -> reg2
add reg1, reg2 -> reg1
store reg1 -> sp
Thread 2:load sp -> reg1
load ep -> reg2
add reg1, reg2 -> reg1
store reg1 -> sp
Thread 1 private Shared Thread 2 privateep reg1 reg2 sp ep reg1 reg28 0 0 29 5 0 08 0 0 29 5 29 08 0 0 29 5 29 58 0 0 29 5 34 58 0 0 34 5 34 58 34 0 34 5 34 58 34 8 34 5 34 5
8 42 8 34 5 34 58 42 8 42 5 34 5
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 60: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/60.jpg)
Interleaved Execution of Threads
Thread 1:load sp -> reg1
load ep -> reg2
add reg1, reg2 -> reg1
store reg1 -> sp
Thread 2:load sp -> reg1
load ep -> reg2
add reg1, reg2 -> reg1
store reg1 -> sp
Thread 1 private Shared Thread 2 privateep reg1 reg2 sp ep reg1 reg28 0 0 29 5 0 08 0 0 29 5 29 08 0 0 29 5 29 58 0 0 29 5 34 58 0 0 34 5 34 58 34 0 34 5 34 58 34 8 34 5 34 58 42 8 34 5 34 5
8 42 8 42 5 34 5
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 61: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/61.jpg)
Interleaved Execution of Threads
Thread 1:load sp -> reg1
load ep -> reg2
add reg1, reg2 -> reg1
store reg1 -> sp
Thread 2:load sp -> reg1
load ep -> reg2
add reg1, reg2 -> reg1
store reg1 -> sp
Thread 1 private Shared Thread 2 privateep reg1 reg2 sp ep reg1 reg28 0 0 29 5 0 08 0 0 29 5 29 08 0 0 29 5 29 58 0 0 29 5 34 58 0 0 34 5 34 58 34 0 34 5 34 58 34 8 34 5 34 58 42 8 34 5 34 58 42 8 42 5 34 5
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 62: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/62.jpg)
Interleaved Execution of Threads
Thread 1:load sp -> reg1
load ep -> reg2
add reg1, reg2 -> reg1
store reg1 -> sp
Thread 2:load sp -> reg1
load ep -> reg2
add reg1, reg2 -> reg1
store reg1 -> sp
Thread 1 private Shared Thread 2 privateep reg1 reg2 sp ep reg1 reg28 0 0 29 5 0 0
8 29 0 29 5 0 08 29 0 29 5 29 08 29 8 29 5 29 08 29 8 29 5 29 58 37 8 29 5 29 58 37 8 29 5 34 58 37 8 37 5 34 58 37 8 34 5 34 5
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 63: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/63.jpg)
Interleaved Execution of Threads
Thread 1:load sp -> reg1
load ep -> reg2
add reg1, reg2 -> reg1
store reg1 -> sp
Thread 2:load sp -> reg1
load ep -> reg2
add reg1, reg2 -> reg1
store reg1 -> sp
Thread 1 private Shared Thread 2 privateep reg1 reg2 sp ep reg1 reg28 0 0 29 5 0 08 29 0 29 5 0 0
8 29 0 29 5 29 08 29 8 29 5 29 08 29 8 29 5 29 58 37 8 29 5 29 58 37 8 29 5 34 58 37 8 37 5 34 58 37 8 34 5 34 5
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 64: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/64.jpg)
Interleaved Execution of Threads
Thread 1:load sp -> reg1
load ep -> reg2
add reg1, reg2 -> reg1
store reg1 -> sp
Thread 2:load sp -> reg1
load ep -> reg2
add reg1, reg2 -> reg1
store reg1 -> sp
Thread 1 private Shared Thread 2 privateep reg1 reg2 sp ep reg1 reg28 0 0 29 5 0 08 29 0 29 5 0 08 29 0 29 5 29 0
8 29 8 29 5 29 08 29 8 29 5 29 58 37 8 29 5 29 58 37 8 29 5 34 58 37 8 37 5 34 58 37 8 34 5 34 5
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 65: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/65.jpg)
Interleaved Execution of Threads
Thread 1:load sp -> reg1
load ep -> reg2
add reg1, reg2 -> reg1
store reg1 -> sp
Thread 2:load sp -> reg1
load ep -> reg2
add reg1, reg2 -> reg1
store reg1 -> sp
Thread 1 private Shared Thread 2 privateep reg1 reg2 sp ep reg1 reg28 0 0 29 5 0 08 29 0 29 5 0 08 29 0 29 5 29 08 29 8 29 5 29 0
8 29 8 29 5 29 58 37 8 29 5 29 58 37 8 29 5 34 58 37 8 37 5 34 58 37 8 34 5 34 5
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 66: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/66.jpg)
Interleaved Execution of Threads
Thread 1:load sp -> reg1
load ep -> reg2
add reg1, reg2 -> reg1
store reg1 -> sp
Thread 2:load sp -> reg1
load ep -> reg2
add reg1, reg2 -> reg1
store reg1 -> sp
Thread 1 private Shared Thread 2 privateep reg1 reg2 sp ep reg1 reg28 0 0 29 5 0 08 29 0 29 5 0 08 29 0 29 5 29 08 29 8 29 5 29 08 29 8 29 5 29 5
8 37 8 29 5 29 58 37 8 29 5 34 58 37 8 37 5 34 58 37 8 34 5 34 5
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 67: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/67.jpg)
Interleaved Execution of Threads
Thread 1:load sp -> reg1
load ep -> reg2
add reg1, reg2 -> reg1
store reg1 -> sp
Thread 2:load sp -> reg1
load ep -> reg2
add reg1, reg2 -> reg1
store reg1 -> sp
Thread 1 private Shared Thread 2 privateep reg1 reg2 sp ep reg1 reg28 0 0 29 5 0 08 29 0 29 5 0 08 29 0 29 5 29 08 29 8 29 5 29 08 29 8 29 5 29 58 37 8 29 5 29 5
8 37 8 29 5 34 58 37 8 37 5 34 58 37 8 34 5 34 5
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 68: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/68.jpg)
Interleaved Execution of Threads
Thread 1:load sp -> reg1
load ep -> reg2
add reg1, reg2 -> reg1
store reg1 -> sp
Thread 2:load sp -> reg1
load ep -> reg2
add reg1, reg2 -> reg1
store reg1 -> sp
Thread 1 private Shared Thread 2 privateep reg1 reg2 sp ep reg1 reg28 0 0 29 5 0 08 29 0 29 5 0 08 29 0 29 5 29 08 29 8 29 5 29 08 29 8 29 5 29 58 37 8 29 5 29 58 37 8 29 5 34 5
8 37 8 37 5 34 58 37 8 34 5 34 5
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 69: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/69.jpg)
Interleaved Execution of Threads
Thread 1:load sp -> reg1
load ep -> reg2
add reg1, reg2 -> reg1
store reg1 -> sp
Thread 2:load sp -> reg1
load ep -> reg2
add reg1, reg2 -> reg1
store reg1 -> sp
Thread 1 private Shared Thread 2 privateep reg1 reg2 sp ep reg1 reg28 0 0 29 5 0 08 29 0 29 5 0 08 29 0 29 5 29 08 29 8 29 5 29 08 29 8 29 5 29 58 37 8 29 5 29 58 37 8 29 5 34 58 37 8 37 5 34 5
8 37 8 34 5 34 5
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 70: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/70.jpg)
Interleaved Execution of Threads
Thread 1:load sp -> reg1
load ep -> reg2
add reg1, reg2 -> reg1
store reg1 -> sp
Thread 2:load sp -> reg1
load ep -> reg2
add reg1, reg2 -> reg1
store reg1 -> sp
Thread 1 private Shared Thread 2 privateep reg1 reg2 sp ep reg1 reg28 0 0 29 5 0 08 29 0 29 5 0 08 29 0 29 5 29 08 29 8 29 5 29 08 29 8 29 5 29 58 37 8 29 5 29 58 37 8 29 5 34 58 37 8 37 5 34 58 37 8 34 5 34 5
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 71: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/71.jpg)
Race Condition / Data Race
Definition:
I A race condition / data race exists if the behaviour (themeaning) of a program depends on the execution order ofprogram parts (threads) whose temporal behaviour is beyondcontrol.
Origin of term:
I Electronics
I Two electric signals race against each other.
I Arrival order of input signals at gate determines output signal.
Big question: how can we avoid data races ?
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 72: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/72.jpg)
Race Condition / Data Race
Definition:
I A race condition / data race exists if the behaviour (themeaning) of a program depends on the execution order ofprogram parts (threads) whose temporal behaviour is beyondcontrol.
Origin of term:
I Electronics
I Two electric signals race against each other.
I Arrival order of input signals at gate determines output signal.
Big question: how can we avoid data races ?
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 73: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/73.jpg)
Race Condition / Data Race
Definition:
I A race condition / data race exists if the behaviour (themeaning) of a program depends on the execution order ofprogram parts (threads) whose temporal behaviour is beyondcontrol.
Origin of term:
I Electronics
I Two electric signals race against each other.
I Arrival order of input signals at gate determines output signal.
Big question: how can we avoid data races ?
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 74: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/74.jpg)
How can we Avoid Data Races ?
Solution: critical sections
I Once a thread enters a critical section, it must leave it beforeany other thread can enter; no interleaving.
I Critical sections must be made explicit throughout program.
Example:double scalar_prod( double *a, double *b, int len)
{
int i; double ep , sp=0.0;
#pragma omp parallel for shared( a, b, len , sp) \
private( i, ep)
for (i=0; i<len; i++) {
ep = a[i] * b[i];
#pragma omp critical
{
sp = sp + ep;
} }
return sp;
}
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 75: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/75.jpg)
Critical Sections
Parallel scalar product:double scalar_prod( double *a, double *b, int len)
{
int i; double ep , sp=0.0;
#pragma omp parallel for shared( a, b, len , sp) \
private( i, ep)
for (i=0; i<len; i++) {
ep = a[i] * b[i];
#pragma omp critical
{
sp = sp + ep;
} }
return sp;
}
The critical directive:
I Directive must immediately precede new statement block.I Statement block is executed without interleaving.I Directive implements critical section.
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 76: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/76.jpg)
Named Critical Sections
Disadvantage:
I All critical sections in entire program are synchronised.
I Many might be unrelated.
I Unnecessary synchronisation overhead.
Solution: named critical directive
I Critical sections may be associated with names.
I Critical sections with identical names are synchronised.
I Critical sections with di↵erent names are executedconcurrently.
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 77: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/77.jpg)
Named Critical Sections
Disadvantage:
I All critical sections in entire program are synchronised.
I Many might be unrelated.
I Unnecessary synchronisation overhead.
Solution: named critical directive
I Critical sections may be associated with names.
I Critical sections with identical names are synchronised.
I Critical sections with di↵erent names are executedconcurrently.
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 78: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/78.jpg)
Named Critical Section
Scalar product:
double scalar_prod( double *a, double *b, int len)
{
int i; double ep , sp=0.0;
#pragma omp parallel for shared( a, b, len , sp) \
private( i, ep)
for (i=0; i<len; i++) {
ep = a[i] * b[i];
#pragma omp critical (scalar_prod)
{
sp = sp + ep;
} }
return sp;
}
BUT: Is this really e�cient ?
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 79: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/79.jpg)
Named Critical Section
Scalar product:
double scalar_prod( double *a, double *b, int len)
{
int i; double ep , sp=0.0;
#pragma omp parallel for shared( a, b, len , sp) \
private( i, ep)
for (i=0; i<len; i++) {
ep = a[i] * b[i];
#pragma omp critical (scalar_prod)
{
sp = sp + ep;
} }
return sp;
}
BUT: Is this really e�cient ?
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 80: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/80.jpg)
Reduction Clause
Scalar product:
double scalar_prod( double *a, double *b, int len)
{
int i; double ep , sp=0.0;
#pragma omp parallel for shared( a, b, len) \
private( i, ep) \
reduction( +: sp)
for (i=0; i<len; i++) {
ep = a[i] * b[i];
sp = sp + ep;
}
return sp;
}
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 81: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/81.jpg)
Reduction Clause
Scalar product:
double scalar_prod( double *a, double *b, int len)
{
int i; double ep , sp=0.0;
#pragma omp parallel for shared( a, b, len) \
private( i, ep) \
reduction( +: sp)
for (i=0; i<len; i++) {
ep = a[i] * b[i];
sp = sp + ep;
}
return sp;
}
a
* * * *b
ep
= = = =
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 82: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/82.jpg)
Reduction Clause
Scalar product:double scalar_prod( double *a, double *b, int len)
{
int i; double ep , sp=0.0;
#pragma omp parallel for shared( a, b, len) \
private( i, ep) \
reduction( +: sp)
for (i=0; i<len; i++) {
ep = a[i] * b[i];
sp = sp + ep;
}
return sp;
}
Properties:
I Reduction clause only supports built-in reduction operations:+, *, ^, &, |, &&, ||, min, max.
I Bit accuracy not guaranteed.
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 83: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/83.jpg)
Conditional Parallelisation
Problem:
I Parallel execution of aloop incurs overhead:
I creation of workerthreads
I schedulingI synchronisation barrier
I This overhead must beoutweighed by su�cientworkload.
I Workload depends onI loop body,I trip count.
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 84: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/84.jpg)
Conditional Parallelisation
Problem:
I Parallel execution of aloop incurs overhead:
I creation of workerthreads
I schedulingI synchronisation barrier
I This overhead must beoutweighed by su�cientworkload.
I Workload depends onI loop body,I trip count.
Example:
if (len < 1000) {
for (i=0; i<len; i++)
{
res[i] = a[i] * b[i];
}
}
else {
#pragma omp parallel for
for (i=0; i<len; i++)
{
res[i] = a[i] * b[i];
}
}
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 85: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/85.jpg)
Conditional Parallelisation
Introducing the if-clause:
if (len < 1000) {
for (i=0; i<len; i++) {
res[i] = a[i] * b[i];
}
}
else {
#pragma omp parallel for
for (i=0; i<len; i++) {
res[i] = a[i] * b[i];
}
}
#pragma omp parallel for if (len >= 1000)
for (i=0; i<len; i++) {
res[i] = a[i] * b[i];
}
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 86: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/86.jpg)
Conditional Parallelisation
If-clause:
#pragma omp parallel for if (len >= 1000)
for (i=0; i<len; i++) {
res[i] = a[i] * b[i];
}
Some facts:
I If-clause can contain any kind of C expression
I C expression may refer to all identifiers in scopeI C expression is evaluated first:
I false: sequential executionI true: parallel execution
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 87: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/87.jpg)
Programming Multi-Core Systems with OpenMP
Targeted Architectures
OpenMP at a Glance
Loop Parallelization
Scheduling
Outlook
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 88: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/88.jpg)
Loop Scheduling
Definition:
I Loop scheduling determines which iterations are executed bywhich thread.
Aim:
I Equal workload distribution
timeworkwait
task 4
task 3
task 2
task 1
sy
nc
hro
niz
ati
on
ba
rrie
r
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 89: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/89.jpg)
Loop Scheduling
Problem:
I Di↵erent situations require di↵erent techniques
The schedule clause:
#pragma omp parallel for schedule( <type > [, <chunk >])
for (...)
{
...
}
Properties:
I Clause selects one out of a set of scheduling techniques.
I Optionally, a chunk size can be specified.
I Default chunk size depends on scheduling technique.
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 90: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/90.jpg)
Loop Scheduling
Static scheduling:
#pragma omp parallel for schedule( static)
I Loop is subdivided into as many chunks as threads exist.
I Often called block scheduling.
Illustration:
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 91: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/91.jpg)
Loop Scheduling
Static scheduling with chunk size 1:
#pragma omp parallel for schedule( static , 1)
I Iterations are assigned to threads in a round-robin fashion.
I Also called cyclic scheduling.
Illustration:
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 92: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/92.jpg)
Loop Scheduling
Static scheduling with chunk size ¡n¿:
#pragma omp parallel for schedule( static , <n>)
I Loop is subdivided into chunks of n iterations.
I Chunks are assigned to threads in a round-robin fashion.
I Also called block-cyclic scheduling.
Illustration:
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 93: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/93.jpg)
Loop Scheduling
Dynamic scheduling:
#pragma omp parallel for schedule( dynamic , <n>)
I Loop is subdivided into chunks of n iterations.
I Chunks are dynamically assigned to threads on their demand.
I Also called self scheduling.
I Default chunk size: 1 iteration.
Properties to keep in mind:
I Allows for dynamic load distribution and adjustment.
I Requires additional synchronization.
I Generates more overhead than static scheduling.
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 94: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/94.jpg)
Loop Scheduling
Dynamic scheduling:
#pragma omp parallel for schedule( dynamic , <n>)
I Loop is subdivided into chunks of n iterations.
I Chunks are dynamically assigned to threads on their demand.
I Also called self scheduling.
I Default chunk size: 1 iteration.
Properties to keep in mind:
I Allows for dynamic load distribution and adjustment.
I Requires additional synchronization.
I Generates more overhead than static scheduling.
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 95: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/95.jpg)
Loop Scheduling
Dilemma of chunk size selection:
I Small chunk sizes mean good load balancing, but highsynchronisation overhead.
I Large chunk sizes reduce synchronisation overhead, but resultin poor load balancing.
Rationale of guided scheduling:
I In the beginning, large chunks keep synchronisation overheadsmall.
I When approaching the final barrier, small chunks balanceworkload.
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 96: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/96.jpg)
Loop Scheduling
Dilemma of chunk size selection:
I Small chunk sizes mean good load balancing, but highsynchronisation overhead.
I Large chunk sizes reduce synchronisation overhead, but resultin poor load balancing.
Rationale of guided scheduling:
I In the beginning, large chunks keep synchronisation overheadsmall.
I When approaching the final barrier, small chunks balanceworkload.
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 97: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/97.jpg)
Loop Scheduling
Guided scheduling:#pragma omp parallel for schedule( guided , <n>)
I Chunks are dynamically assigned to threads on their demand.I Initial chunk size is implementation dependent.I Chunk size decreases exponentially with every assignment.I Also called guided self scheduling.I Minimum chunk size: n (default: 1)
Example:
I Total number of iterations: 250I Initial / minimal chunk size: 50 / 5I Current chunk size: 80% of last chunk size:
50 – 40 – 32 – 26 – 21 – 17 – 14 – 12 – 10 – 8 – 6 – 5 – 5 – 4I All properties are implementation-dependent !
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 98: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/98.jpg)
Loop Scheduling
Guided scheduling:#pragma omp parallel for schedule( guided , <n>)
I Chunks are dynamically assigned to threads on their demand.I Initial chunk size is implementation dependent.I Chunk size decreases exponentially with every assignment.I Also called guided self scheduling.I Minimum chunk size: n (default: 1)
Example:
I Total number of iterations: 250I Initial / minimal chunk size: 50 / 5I Current chunk size: 80% of last chunk size:
50 – 40 – 32 – 26 – 21 – 17 – 14 – 12 – 10 – 8 – 6 – 5 – 5 – 4I All properties are implementation-dependent !
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 99: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/99.jpg)
Choice of Scheduling Technique
Which scheduling to choose when ?
I Depends on your code !I Crucial question: is the amount of computational work per
iteration (roughly) the same for each iteration or not?
Static scheduling techniques:
I Preferable for uniform workload distributionsI Minimal overheadI (Block-)Cyclic schedulings may be useful for regular uneven
workload distributionsI (Block-)Cyclic schedulings may run into cache issues
Dynamic scheduling techniques:
I Preferable for irregular workload distributionsI Additional synchronisation overhead needs compensationI Guided self-scheduling usually superior
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 100: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/100.jpg)
Choice of Scheduling Technique
Which scheduling to choose when ?
I Depends on your code !I Crucial question: is the amount of computational work per
iteration (roughly) the same for each iteration or not?
Static scheduling techniques:
I Preferable for uniform workload distributionsI Minimal overheadI (Block-)Cyclic schedulings may be useful for regular uneven
workload distributionsI (Block-)Cyclic schedulings may run into cache issues
Dynamic scheduling techniques:
I Preferable for irregular workload distributionsI Additional synchronisation overhead needs compensationI Guided self-scheduling usually superior
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 101: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/101.jpg)
Choice of Scheduling Technique
Which scheduling to choose when ?
I Depends on your code !I Crucial question: is the amount of computational work per
iteration (roughly) the same for each iteration or not?
Static scheduling techniques:
I Preferable for uniform workload distributionsI Minimal overheadI (Block-)Cyclic schedulings may be useful for regular uneven
workload distributionsI (Block-)Cyclic schedulings may run into cache issues
Dynamic scheduling techniques:
I Preferable for irregular workload distributionsI Additional synchronisation overhead needs compensationI Guided self-scheduling usually superior
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 102: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/102.jpg)
Programming Multi-Core Systems with OpenMP
Targeted Architectures
OpenMP at a Glance
Loop Parallelization
Scheduling
Outlook
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 103: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/103.jpg)
What’s More ?
More in OpenMP-2:
I Decouple parallel regions from work sharing
I Control synchronisation barriers
I Task parallel sections
I Low-level locks and condition variables
I ...
More in OpenMP-3:
I Nested parallel regions
I Spawning and synchronisation of tasks
I ...
More information:
I www.openmp.org
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 104: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/104.jpg)
What’s More ?
More in OpenMP-2:
I Decouple parallel regions from work sharing
I Control synchronisation barriers
I Task parallel sections
I Low-level locks and condition variables
I ...
More in OpenMP-3:
I Nested parallel regions
I Spawning and synchronisation of tasks
I ...
More information:
I www.openmp.org
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 105: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/105.jpg)
What’s More ?
More in OpenMP-2:
I Decouple parallel regions from work sharing
I Control synchronisation barriers
I Task parallel sections
I Low-level locks and condition variables
I ...
More in OpenMP-3:
I Nested parallel regions
I Spawning and synchronisation of tasks
I ...
More information:
I www.openmp.org
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
![Page 106: Programming Multi-Core Systems with OpenMP · Programming Multi-Core Systems with OpenMP Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data](https://reader030.fdocuments.us/reader030/viewer/2022040616/5f146b732592a657336ec3a4/html5/thumbnails/106.jpg)
The End: Questions ?
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP