An Introduction to OpenMP
Mirto Musci, PhD Candidate
Department of Computer Science, University of Pavia
Processor Architecture Class, Fall 2011
Outline
1. Introduction: Motivation; Paradigm Shift
2. OpenMP: General Concepts; Parallelization Constructs; Data Environment; Synchronization
3. Final Considerations: Complete Example; Summary
Recap
Parallel machines are everywhere
Many architectures, many programming models
Multithreading is the natural programming model for multicore and SMP machines (desktop, server, cluster, embedded):
- All processors share the same memory
- Threads in a process see the same address space
- Lots of shared-memory algorithms defined
Multithreading is (correctly) perceived to be hard!
- Lots of expertise necessary
- Synchronization, dependencies, granularity, balancing...
- Non-deterministic behavior makes it hard to debug
What is OpenMP?
- Standard API for defining multi-threaded shared-memory programs
- www.openmp.org - talks, examples, forums, etc.
- Last release: OpenMP 3.1 (July 2011)
High-level API:
- Preprocessor (compiler) directives (~80%)
- Library calls (~19%)
- Environment variables (~1%)
A Programmer's View of OpenMP
OpenMP is a portable, threaded, shared-memory programming specification with "light" syntax
- Exact behavior depends on the OpenMP implementation!
- Requires compiler support (C or Fortran)
OpenMP will:
- Allow a programmer to separate a program into serial regions and parallel regions, rather than T concurrent threads
- Hide stack management
- Provide synchronization constructs
OpenMP will not:
- Parallelize (or detect!) dependencies
- Guarantee speedup
- Provide freedom from data races
Ideal Parallel Programming Paradigm

1. Start with any algorithm
   - Embarrassing parallelism is helpful, but not necessary
2. Implement serially, ignoring:
   - Data races
   - Synchronization
   - Threading syntax
3. Test and debug
4. Automatically (magically?) parallelize
   - Expect linear speedup
Threading Programming Paradigm

1. Start with a parallel algorithm
2. Implement, keeping in mind:
   - Data races
   - Synchronization
   - Threading syntax
3. Test & debug
4. Debug
5. Debug
Problem
Parallelize the following code using threads:
#include <stdio.h>

int main() {
    for (int i = 0; i < 16; i++)
        printf("Hello, World!\n");
}
A lot of work to do a simple thing
Different threading APIs:
- Windows: CreateThread
- UNIX: pthread_create
Problems with the code:
- Different code for serial and parallel version
- No built-in tuning (number of processors, anyone?)
Motivation - Threading Library
#include <pthread.h>
#include <stdio.h>

void *SayHello(void *foo) {
    printf("Hello, world!\n");
    return NULL;
}

int main() {
    pthread_attr_t attr;
    pthread_t threads[16];
    int tn;
    pthread_attr_init(&attr);
    pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);
    for (tn = 0; tn < 16; tn++)
        pthread_create(&threads[tn], &attr, SayHello, NULL);
    for (tn = 0; tn < 16; tn++)
        pthread_join(threads[tn], NULL);
    return 0;
}
Motivation - Threading Library II
Thread libraries are hard to use
- PThreads has many library calls for initialization, synchronization, thread creation, condition variables, etc.
- Programmer must code with multiple threads in mind
Synchronization between threads introduces a new dimensionof program correctness
Wouldn't it be nice to write serial programs and somehow parallelize them "automatically"?
- OpenMP can parallelize many serial programs with relatively few annotations that specify parallelism and independence
- OpenMP is a small API that hides cumbersome threading calls behind simpler directives
Motivation - OpenMP
#include <stdio.h>

int main() {
    // Do this part in parallel
    printf("Hello, World!\n");
    return 0;
}
Motivation - OpenMP
#include <omp.h>
#include <stdio.h>

int main() {
    omp_set_num_threads(16);

    // Do this part in parallel
    #pragma omp parallel
    {
        printf("Hello, World!\n");
    }

    return 0;
}
OpenMP Programming Paradigm

1. Start with a parallelizable algorithm
   - Embarrassing parallelism is good; loop-level parallelism is necessary
2. Implement serially, mostly ignoring:
   - Data races
   - Synchronization
   - Threading syntax
3. Test and debug
4. Annotate the code with parallelization (and synchronization) directives
   - Hope for linear speedup
5. Test and debug
OpenMP Methodology
Parallelization with OpenMP is an optimization process. Proceed with care:
- Start with a working program, then add parallelization
- Measure the changes after every step
- Remember Amdahl's law
- Use the profiler tools available
Programming Model
Thread-like Fork/Join model
- Arbitrary number of logical thread creation/destruction events
- Serial regions by default; annotate to create parallel regions:
  - Generic parallel regions
  - Parallelized loops
  - Sectioned parallel regions
Nested Threading
Fork/Join can be nested
- Nesting complication handled "automagically" at compile time
- Independent of the number of threads actually running
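A minimal sketch of a nested fork (assuming an OpenMP 3.x runtime; enabling nesting via omp_set_nested(1), the num_threads values, and the printf are illustrative choices, not from the slides):

#include <omp.h>
#include <stdio.h>

int main() {
    omp_set_nested(1);                       /* allow nested parallel regions */
    #pragma omp parallel num_threads(2)      /* outer fork: team of 2 */
    {
        #pragma omp parallel num_threads(2)  /* each outer thread forks an inner team of 2 */
        printf("outer thread %d, inner thread %d\n",
               omp_get_ancestor_thread_num(1), omp_get_thread_num());
    }
    return 0;
}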
Thread Identification
Master Thread
- Thread with ID = 0
- Only thread that exists in sequential regions
- Depending on implementation, may have a special purpose inside parallel regions
- Some special directives affect only the master thread (like master)
Data/Control Parallelism
Data parallelism
- Threads perform similar functions, guided by thread identifier

Control parallelism
- Threads perform differing functions
- One thread for I/O, one for computation, etc.
Programming Model - Summary
Fork and Join: Master thread spawns a team of threads as needed
Memory Model
Shared memory communication
- Threads cooperate by accessing shared variables
- The sharing is defined syntactically:
  - Any variable that is seen by two or more threads is shared
  - Any variable that is seen by one thread only is private
Race conditions possible
- Use synchronization to protect from conflicts
- Change how data is stored to minimize the synchronization
OpenMP Syntax
Most of the constructs of OpenMP are pragmas
#pragma omp construct [clause [clause] ...]

(Fortran: !$OMP, not covered here)
An OpenMP construct applies to a structured block (one entry point, one exit point)
In addition:
- Several omp_<something> function calls
- Several OMP_<something> environment variables
Controlling OpenMP Behavior
Function calls and (for each one) matching environment variables:
omp_set_dynamic(int) / omp_get_dynamic()
- Allows the implementation to adjust the number of threads dynamically

omp_set_num_threads(int) / omp_get_num_threads()
- Controls the number of threads used for parallelization (a maximum, in case of dynamic adjustment)
- Must be called from sequential code
- Can also be set via the OMP_NUM_THREADS environment variable
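A small sketch tying these calls together (the team size 4 is an arbitrary choice):

#include <omp.h>
#include <stdio.h>

int main() {
    omp_set_num_threads(4);   /* request up to 4 threads */
    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0)   /* report once, from thread 0 */
            printf("team size: %d\n", omp_get_num_threads());
    }
    return 0;
}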
Controlling OpenMP Behavior II
omp_get_num_procs()
- How many processors are currently available?

omp_get_thread_num()
- Which thread of the team am I? (ID, with the master being 0)

omp_set_nested(int) / omp_get_nested()
- Enable nested parallelism

omp_in_parallel()
- Am I currently running in parallel mode?

omp_get_wtime()
- A portable way to compute wall-clock time
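For instance, omp_get_wtime() can bracket a region to time it (a sketch; the parallel work itself is elided):

double t0 = omp_get_wtime();
#pragma omp parallel
{
    /* ... parallel work ... */
}
printf("elapsed: %f s\n", omp_get_wtime() - t0);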
OpenMP Structure
The OpenMP language extensions fall into five groups:

- Parallel control structures: govern the flow of control in the program (the parallel directive)
- Work sharing: distributes work among threads (the do/parallel do and section directives)
- Data environment: scopes variables (the shared and private clauses)
- Synchronization: coordinates thread execution (the critical, atomic, and barrier directives)
- Runtime functions and environment variables (e.g. omp_set_num_threads(), omp_get_thread_num(), OMP_NUM_THREADS, OMP_SCHEDULE)
Parallel Regions I
Main construct: #pragma omp parallel
- Defines a parallel region over a structured block of code
- Threads are created as the parallel pragma is crossed
- Threads block at the end of the region (implicit barrier)
Parallel Regions II
double D[1000];
#pragma omp parallel
{
    int i; double sum = 0;
    for (i = 0; i < 1000; i++) sum += D[i];
    printf("Thread %d computes %f\n", omp_get_thread_num(), sum);
}
Executes the same code as many times as there are threads
- How many threads do we have? omp_set_num_threads(n)
- What is the use of repeating the same work n times in parallel? Use omp_get_thread_num() to distribute the work between threads
- D is shared between the threads; i and sum are private
Implementing Work Sharing
Sequential code:

for (int i = 0; i < N; i++) { a[i] = b[i] + c[i]; }
(Semi-)manual parallelization:

#pragma omp parallel
{
    int id = omp_get_thread_num();
    int Nthr = omp_get_num_threads();
    int istart = id * N / Nthr;
    int iend = (id + 1) * N / Nthr;
    for (int i = istart; i < iend; i++)
        a[i] = b[i] + c[i];
}
Automatic parallelization:

#pragma omp parallel for schedule(static)
for (int i = 0; i < N; i++)
    a[i] = b[i] + c[i];
Work Sharing: For
- Used to assign each thread an independent set of iterations
- Threads must wait at the end
- Can combine the directives: #pragma omp parallel for
- Only simple kinds of for loops:
  - Only one signed integer variable
  - Initialization: var = init
  - Comparison: var op last, with op one of <, >, <=, >=
  - Increment: var++, var--, var += incr, var -= incr, etc.
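For instance (a sketch; n, a, and f are placeholders): the first loop is in the canonical form above, the second is not.

/* OK: one signed int, simple test and increment */
#pragma omp parallel for
for (int i = 0; i < n; i += 2)
    a[i] = 0;

/* Not accepted by parallel for: the exit test is not in canonical form */
/* for (int i = 0; f(i) != 0; i++) ... */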
Problems of parallel for
Load balancing
- If all the iterations execute at the same speed, the processors are used optimally
- If some iterations are faster than others, some processors may become idle, reducing the speedup
- We don't always know the distribution of work; it may need to be redistributed dynamically
Granularity
- Thread creation and synchronization take time
- Assigning work to threads at per-iteration resolution may take more time than the execution itself!
- Need to coalesce the work into coarse chunks to overcome the threading overhead

There is a trade-off between load balancing and granularity!
Controlling Granularity
#pragma omp parallel if (expression)
- Can be used to disable parallelization in some cases (when the input is determined to be too small to be beneficially multithreaded)
#pragma omp parallel num_threads(expression)
- Controls the number of threads used for this parallel region
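Both clauses can appear on one directive; a sketch reusing the earlier vector-add loop (the threshold 10000 and team size 4 are arbitrary choices):

#pragma omp parallel for if (n > 10000) num_threads(4)
for (int i = 0; i < n; i++)
    a[i] = b[i] + c[i];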
Work Sharing: Sections
answer1 = long_computation_1();
answer2 = long_computation_2();
if (answer1 != answer2) { ... }

How to parallelize? These are just two independent computations!

#pragma omp sections
{
    #pragma omp section
    answer1 = long_computation_1();
    #pragma omp section
    answer2 = long_computation_2();
}
if (answer1 != answer2) { ... }
Schedule Clause: Controlling Work Distribution
schedule(static [, chunksize])
- Default: chunks of approximately equal size, one to each thread
- If there are more chunks than threads: assigned round-robin to the threads
- Why might we want to use chunks of different size?

schedule(dynamic [, chunksize])
- Threads receive chunk assignments dynamically
- Default chunk size = 1 (why?)

schedule(guided [, chunksize])
- Start with large chunks
- Threads receive chunks dynamically
- Chunk size shrinks exponentially, down to chunksize
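For example (a sketch; process(i) stands for per-iteration work of varying cost):

#pragma omp parallel for schedule(dynamic, 4)  /* chunks of 4 iterations, handed out on demand */
for (int i = 0; i < N; i++)
    process(i);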
Graphic Scheduling

[Figure: iteration-to-thread assignment under static, dynamic, and guided(1) scheduling]
Scheduling Example
The function TestForPrime (usually) takes little time
But it can take long if the number is indeed a prime
#pragma omp parallel for schedule(????)
for (int i = start; i <= end; i += 2)
{
    if (TestForPrime(i)) gPrimesFound++;
}
Solution: use dynamic, but with chunks
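A sketch of the completed directive (the chunk size 100 is an arbitrary choice; note that the shared counter also needs protection, here via the reduction clause described later):

#pragma omp parallel for schedule(dynamic, 100) reduction(+:gPrimesFound)
for (int i = start; i <= end; i += 2)
{
    if (TestForPrime(i)) gPrimesFound++;
}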
Data Visibility
Shared Memory programming model
Most variables (including locals) are shared by default (unlike Pthreads!)

{
    int sum = 0;
    #pragma omp parallel for
    for (int i = 0; i < N; i++) sum += i;
}
Global variables are shared
Some variables can be private
- Automatic variables inside the statement block
- Automatic variables in the called functions
- Variables can be explicitly declared as private; in that case, a local copy is created for each thread
Overriding Storage Attributes
private:
- A copy of the variable is created for each thread
- No connection between the original variable and the private copies
- Can achieve the same using variables inside { }
int i;
#pragma omp parallel for private(i)
for (i = 0; i < n; i++) { ... }
Overriding Storage Attributes II
firstprivate:
- Same, but the initial value is copied from the main copy
lastprivate:
- Same, but the last value is copied to the main copy
int idx = 1;
int x = 10;
#pragma omp parallel for firstprivate(x) lastprivate(idx)
for (i = 0; i < n; i++) {
    if (data[i] == x) idx = i;
}
Threadprivate
Similar to private, but defined per variable:
- Declaration immediately after the variable definition; must be visible in all translation units
- Persistent between parallel sections
- Can be initialized from the master's copy with #pragma omp copyin
- More efficient than private, but a global variable!
Example:
int data[100];
#pragma omp threadprivate(data)
...
#pragma omp parallel for copyin(data)
for (...)
Synchronization
X = 0;
#pragma omp parallel
X = X + 1;
What should the result be (assuming 2 threads)?
- 2 is the expected answer
- But it can be 1 with unfortunate interleaving
OpenMP assumes that the programmer knows what he is doing
- Regions of code that are marked to run in parallel are independent
- If access collisions are possible, it is the programmer's responsibility to insert protection
Synchronization Mechanisms
Many of the existing mechanisms for shared-memory programming:
OpenMP Synchronization
- Nowait (turn synchronization off!)
- Single/Master execution
- Critical sections, atomic updates
- Ordered
- Barriers
- Flush (memory subsystem synchronization)
- Reduction (special case)
Single/Master
#pragma omp single
- Only one of the threads will execute the following block of code
- The rest will wait for it to complete
- Good for non-thread-safe regions of code (such as I/O)
- Must be used inside a parallel region
- Applicable to parallel for sections
Single/Master II
#pragma omp master
- The following block will be executed by the master thread
- No synchronization involved
- Applicable only to parallel sections
#pragma omp parallel
{
    do_preprocessing();

    #pragma omp single
    read_input();
    #pragma omp master
    notify_input_consumed();

    do_processing();
}
Critical Sections
#pragma omp critical [name]
- Standard critical section functionality
- Critical sections are global in the program
  - Can be used to protect a single resource in different functions
- Critical sections are identified by the name
  - All the unnamed critical sections are mutually exclusive throughout the program
  - All the critical sections having the same name are mutually exclusive between themselves
int x = 0;
#pragma omp parallel shared(x)
{
    #pragma omp critical
    x++;
}
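A sketch of a named critical section protecting one resource from two different functions (queue_lock, enqueue, and dequeue are hypothetical names):

void producer(int item) {
    #pragma omp critical(queue_lock)
    enqueue(item);        /* excluded against the dequeue below */
}

int consume(void) {
    int item;
    #pragma omp critical(queue_lock)
    item = dequeue();     /* same name: same mutual exclusion */
    return item;
}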
Atomic Execution
Critical sections on the cheap
- Protects a single variable update
- Can be much more efficient (a dedicated assembly instruction on some architectures)

#pragma omp atomic
update_statement

- The update statement is one of: var = var op expr, var op= expr, var++, var--
- The variable must be a scalar
- The operation op is one of: +, -, *, /, ^, &, |, <<, >>
- The evaluation of expr is not atomic!
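A sketch of that last point (counter and compute() are illustrative): only the update of counter is protected; the call itself may interleave freely.

int counter = 0;
#pragma omp parallel
{
    #pragma omp atomic
    counter += compute();   /* the += is atomic; compute() is not */
}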
Ordered
#pragma omp ordered
statement

- Executes the statement in the sequential order of iterations (the enclosing loop must carry the ordered clause, as in the example below)
Example:
#pragma omp parallel for ordered
for (j = 0; j < N; j++) {
    int result = heavy_computation(j);
    #pragma omp ordered
    printf("computation(%d) = %d\n", j, result);
}
Barrier synchronization
#pragma omp barrier

- Performs a barrier synchronization between all the threads in a team at the given point
Example:
#pragma omp parallel
{
    int result = heavy_computation_part1();
    #pragma omp atomic
    sum += result;
    #pragma omp barrier
    heavy_computation_part2(sum);
}
Explicit Locking
- Lock variables can be passed around (unlike critical sections!)
- Can be used to implement more involved synchronization constructs
Functions:
- omp_init_lock(), omp_destroy_lock(), omp_set_lock(), omp_unset_lock(), omp_test_lock()
- The usual semantics
- Use #pragma omp flush to synchronize memory
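A minimal sketch of the lock lifecycle (my_lock and the protected work are illustrative):

omp_lock_t my_lock;
omp_init_lock(&my_lock);

#pragma omp parallel
{
    omp_set_lock(&my_lock);
    /* ... exclusive access to the shared resource ... */
    omp_unset_lock(&my_lock);
}

omp_destroy_lock(&my_lock);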
Consistency Violation?
#pragma omp parallel for shared(x) private(i)
for (i = 0; i < 100; i++) {
    #pragma omp atomic
    x++;
}
printf("%i", x); /* 100 */

#pragma omp parallel for shared(x) private(i)
for (i = 0; i < 100; i++) {
    omp_set_lock(my_lock);
    x++;
    omp_unset_lock(my_lock);
}
printf("%i", x); /* 96 !! */
Consistency Violation?
#pragma omp parallel for shared(x) private(i)
for (i = 0; i < 100; i++) {
    #pragma omp atomic
    x++;
}
printf("%i", x); /* 100 */

#pragma omp parallel for shared(x) private(i)
for (i = 0; i < 100; i++) {
    omp_set_lock(my_lock);
    x++;
    #pragma omp flush
    omp_unset_lock(my_lock);
}
printf("%i", x); /* 100 */
Reduction
for (j = 0; j < N; j++) {
    sum = sum + a[j] * b[j];
}
How to parallelize this code?
- sum is not private, but accessing it atomically is too expensive
- Better: have a private copy of sum in each thread, then add them up
- Use the reduction clause! #pragma omp parallel for reduction(+: sum)
- Any associative operator may be used: +, -, ||, |, *, etc.
- The private value is initialized automatically (to 0, 1, ~0, ...)
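The parallelized dot product then reads (same variables as the loop above):

#pragma omp parallel for reduction(+: sum)
for (j = 0; j < N; j++) {
    sum = sum + a[j] * b[j];
}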
Synchronization Overhead
Lost time waiting for locks
- Prefer structures that are as lock-free as possible!
- Use parallelization granularity that is as large as possible
#pragma omp parallel
{
    #pragma omp critical
    {
        ...
    }
    ...
}
Numerical Integration
Mathematically, we know that

\int_0^1 \frac{4.0}{1 + x^2} \, dx = \pi

We can approximate the integral as a sum of N rectangles:

\sum_{i=0}^{N} F(x_i) \, \Delta x \approx \pi

where each rectangle has width \Delta x and height F(x_i) at the middle of interval i.
Serial Code
#include <stdio.h>

static long num_steps = 100000;
double step, pi;

int main()
{
    int i;
    double x, sum = 0.0;

    step = 1.0 / (double) num_steps;

    for (i = 0; i < num_steps; i++) {
        x = (i + 0.5) * step;
        sum = sum + 4.0 / (1.0 + x * x);
    }
    pi = step * sum;
    printf("Pi = %f\n", pi);
    return 0;
}
Parallelize the numerical integration code using OpenMP:
- What variables can be shared?
- What variables need to be private?
- What variables should be set up for reductions?
Parallel Code
#include <stdio.h>

static long num_steps = 100000;
double step, pi;

int main()
{
    int i;
    double x, sum = 0.0;

    step = 1.0 / (double) num_steps;

    #pragma omp parallel for private(x) reduction(+:sum)
    for (i = 0; i < num_steps; i++) {
        x = (i + 0.5) * step;
        sum = sum + 4.0 / (1.0 + x * x);
    }
    pi = step * sum;
    printf("Pi = %f\n", pi);
    return 0;
}
- The parallelization code is a one-liner!
- sum is a reduction (hence shared) variable
- i is private since it is the loop variable
Summary
OpenMP: A framework for code parallelization
- Available for C, C++, and Fortran
- Based on a standard
- Implementations from a wide selection of vendors
Easy to use
- Write (and debug!) code first, parallelize later
- Parallelization can be incremental
- Parallelization can be turned off at runtime or compile time
- Code is still correct for a serial machine
Limitations
OpenMP requires compiler support
- Sun, Intel, GCC...
- Embedded systems?
OpenMP does not parallelize dependencies
- Often does not detect dependencies
- Nasty race conditions still exist!
OpenMP is not guaranteed to divide work optimally among threads
- Programmer-tweakable with scheduling clauses
- Still lots of rope available
For Further Reading
Clay Breshears. The Art of Concurrency. O'Reilly, 2009.

Blaise Barney. Introduction to OpenMP, 2011. https://computing.llnl.gov/tutorials/