An Introduction to OpenMP
Mirto Musci, PhD Candidate
Department of Computer Science, University of Pavia
Processor Architecture Class, Fall 2011
Outline
1. Introduction: Motivation; Paradigm Shift
2. OpenMP: General Concepts; Parallelization Constructs; Data Environment; Synchronization
3. Final Considerations: Complete Example; Summary
Recap
Parallel machines are everywhere
Many architectures, many programming models
Multithreading is the natural programming model for multicore and SMP machines (desktop, server, cluster, embedded):
- All processors share the same memory
- Threads in a process see the same address space
- Lots of shared-memory algorithms defined
Multithreading is (correctly) perceived to be hard!
- Lots of expertise necessary
- Synchronization, dependencies, granularity, balancing...
- Non-deterministic behavior makes it hard to debug
What is OpenMP?
- Standard API for defining multi-threaded shared-memory programs
- www.openmp.org - talks, examples, forums, etc.
- Last release: OpenMP 3.1 (July 2011)
High-level API:
- Preprocessor (compiler) directives (~80%)
- Library calls (~19%)
- Environment variables (~1%)
A Programmer's View of OpenMP
OpenMP is a portable, threaded, shared-memory programming specification with "light" syntax
- Exact behavior depends on the OpenMP implementation!
- Requires compiler support (C or Fortran)
OpenMP will:
- Allow a programmer to separate a program into serial regions and parallel regions, rather than T concurrent threads
- Hide stack management
- Provide synchronization constructs
OpenMP will not:
- Parallelize (or detect!) dependencies
- Guarantee speedup
- Provide freedom from data races
Ideal Parallel Programming Paradigm

1. Start with any algorithm
   - Embarrassing parallelism is helpful, but not necessary
2. Implement serially, ignoring:
   - Data races
   - Synchronization
   - Threading syntax
3. Test and debug
4. Automatically (magically?) parallelize
   - Expect linear speedup
Threading Programming Paradigm

1. Start with a parallel algorithm
2. Implement, keeping in mind:
   - Data races
   - Synchronization
   - Threading syntax
3. Test & debug
4. Debug
5. Debug
Problem
Parallelize the following code using threads:
#include <stdio.h>

int main() {
    for (int i = 0; i < 16; i++)
        printf("Hello, World!\n");
}
A lot of work to do a simple thing
Different threading APIs:
- Windows: CreateThread
- UNIX: pthread_create
Problems with the code:
- Different code for serial and parallel version
- No built-in tuning (number of processors, anyone?)
Motivation - Threading Library
#include <pthread.h>
#include <stdio.h>

void *SayHello(void *foo) {
    printf("Hello, world!\n");
    return NULL;
}

int main() {
    pthread_attr_t attr;
    pthread_t threads[16];
    int tn;
    pthread_attr_init(&attr);
    pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);
    for (tn = 0; tn < 16; tn++)
        pthread_create(&threads[tn], &attr, SayHello, NULL);
    for (tn = 0; tn < 16; tn++)
        pthread_join(threads[tn], NULL);
    return 0;
}
Motivation - Threading Library II
Thread libraries are hard to use
- PThreads has many library calls for initialization, synchronization, thread creation, condition variables, etc.
- Programmer must code with multiple threads in mind
Synchronization between threads introduces a new dimensionof program correctness
Wouldn't it be nice to write serial programs and somehow parallelize them "automatically"?
- OpenMP can parallelize many serial programs with relatively few annotations that specify parallelism and independence
- OpenMP is a small API that hides cumbersome threading calls behind simpler directives
Motivation - OpenMP
#include <stdio.h>

int main() {
    // Do this part in parallel
    printf("Hello, World!\n");
    return 0;
}
Motivation - OpenMP
#include <omp.h>
#include <stdio.h>

int main() {
    omp_set_num_threads(16);

    // Do this part in parallel
    #pragma omp parallel
    {
        printf("Hello, World!\n");
    }

    return 0;
}
OpenMP Programming Paradigm

1. Start with a parallelizable algorithm
   - Embarrassing parallelism is good; loop-level parallelism is necessary
2. Implement serially, mostly ignoring:
   - Data races
   - Synchronization
   - Threading syntax
3. Test and debug
4. Annotate the code with parallelization (and synchronization) directives
   - Hope for linear speedup
5. Test and debug
OpenMP Methodology
Parallelization with OpenMP is an optimization process. Proceed with care:
- Start with a working program, then add parallelization
- Measure the changes after every step
- Remember Amdahl's law
- Use the profiler tools available
Programming Model
Thread-like Fork/Join model
- Arbitrary number of logical thread creation/destruction events
- Serial regions by default; annotate to create parallel regions:
  - Generic parallel regions
  - Parallelized loops
  - Sectioned parallel regions
Nested Threading
Fork/Join can be nested
- Nesting complication handled "automagically" at compile time
- Independent of the number of threads actually running
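A minimal sketch of a nested fork (assuming an OpenMP 3.x runtime; enabling nesting via omp_set_nested(1), the num_threads values, and the printf are illustrative choices, not from the slides):

#include <omp.h>
#include <stdio.h>

int main() {
    omp_set_nested(1);                       /* allow nested parallel regions */
    #pragma omp parallel num_threads(2)      /* outer fork: team of 2 */
    {
        #pragma omp parallel num_threads(2)  /* each outer thread forks an inner team of 2 */
        printf("outer thread %d, inner thread %d\n",
               omp_get_ancestor_thread_num(1), omp_get_thread_num());
    }
    return 0;
}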
Thread Identification
Master Thread
- Thread with ID = 0
- Only thread that exists in sequential regions
- Depending on implementation, may have a special purpose inside parallel regions
- Some special directives affect only the master thread (like master)
Data/Control Parallelism
Data parallelism
- Threads perform similar functions, guided by thread identifier

Control parallelism
- Threads perform differing functions
- One thread for I/O, one for computation, etc.
Programming Model - Summary
Fork and Join: Master thread spawns a team of threads as needed
Memory Model
Shared memory communication
- Threads cooperate by accessing shared variables
- The sharing is defined syntactically:
  - Any variable that is seen by two or more threads is shared
  - Any variable that is seen by one thread only is private
Race conditions possible
- Use synchronization to protect from conflicts
- Change how data is stored to minimize the synchronization
OpenMP Syntax
Most of the constructs of OpenMP are pragmas
#pragma omp construct [clause [clause] ...]

(Fortran: !$OMP, not covered here)
An OpenMP construct applies to a structured block (one entry point, one exit point)
In addition:
- Several omp_<something> function calls
- Several OMP_<something> environment variables
Controlling OpenMP Behavior
Function calls and (for each one) matching environment variables:
omp_set_dynamic(int) / omp_get_dynamic()
- Allows the implementation to adjust the number of threads dynamically

omp_set_num_threads(int) / omp_get_num_threads()
- Controls the number of threads used for parallelization (a maximum, in case of dynamic adjustment)
- Must be called from sequential code
- Can also be set via the OMP_NUM_THREADS environment variable
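A small sketch tying these calls together (the team size 4 is an arbitrary choice):

#include <omp.h>
#include <stdio.h>

int main() {
    omp_set_num_threads(4);   /* request up to 4 threads */
    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0)   /* report once, from thread 0 */
            printf("team size: %d\n", omp_get_num_threads());
    }
    return 0;
}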
Controlling OpenMP Behavior II
omp_get_num_procs()
- How many processors are currently available?

omp_get_thread_num()
- Which thread of the team am I? (ID, with the master being 0)

omp_set_nested(int) / omp_get_nested()
- Enable nested parallelism

omp_in_parallel()
- Am I currently running in parallel mode?

omp_get_wtime()
- A portable way to compute wall-clock time
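For instance, omp_get_wtime() can bracket a region to time it (a sketch; the parallel work itself is elided):

double t0 = omp_get_wtime();
#pragma omp parallel
{
    /* ... parallel work ... */
}
printf("elapsed: %f s\n", omp_get_wtime() - t0);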
OpenMP Structure
The OpenMP language extensions fall into five groups:

- Parallel control structures: govern the flow of control in the program (the parallel directive)
- Work sharing: distributes work among threads (the do/parallel do and section directives)
- Data environment: scopes variables (the shared and private clauses)
- Synchronization: coordinates thread execution (the critical, atomic, and barrier directives)
- Runtime functions and environment variables (e.g. omp_set_num_threads(), omp_get_thread_num(), OMP_NUM_THREADS, OMP_SCHEDULE)
Parallel Regions I
Main construct: #pragma omp parallel
- Defines a parallel region over a structured block of code
- Threads are created as the parallel pragma is crossed
- Threads block at the end of the region (implicit barrier)
Parallel Regions II
double D[1000];
#pragma omp parallel
{
    int i; double sum = 0;
    for (i = 0; i < 1000; i++) sum += D[i];
    printf("Thread %d computes %f\n", omp_get_thread_num(), sum);
}
Executes the same code as many times as there are threads
- How many threads do we have? omp_set_num_threads(n)
- What is the use of repeating the same work n times in parallel? Use omp_get_thread_num() to distribute the work between threads
- D is shared between the threads; i and sum are private
Implementing Work Sharing
Sequential code:

for (int i = 0; i < N; i++) { a[i] = b[i] + c[i]; }
(Semi-)manual parallelization:

#pragma omp parallel
{
    int id = omp_get_thread_num();
    int Nthr = omp_get_num_threads();
    int istart = id * N / Nthr;
    int iend = (id + 1) * N / Nthr;
    for (int i = istart; i < iend; i++)
        a[i] = b[i] + c[i];
}
Automatic parallelization:

#pragma omp parallel for schedule(static)
for (int i = 0; i < N; i++)
    a[i] = b[i] + c[i];
Work Sharing: For
- Used to assign each thread an independent set of iterations
- Threads must wait at the end
- Can combine the directives: #pragma omp parallel for
- Only simple kinds of for loops:
  - Only one signed integer variable
  - Initialization: var = init
  - Comparison: var op last, with op one of <, >, <=, >=
  - Increment: var++, var--, var += incr, var -= incr, etc.
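For instance (a sketch; n, a, and f are placeholders): the first loop is in the canonical form above, the second is not.

/* OK: one signed int, simple test and increment */
#pragma omp parallel for
for (int i = 0; i < n; i += 2)
    a[i] = 0;

/* Not accepted by parallel for: the exit test is not in canonical form */
/* for (int i = 0; f(i) != 0; i++) ... */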
Problems of parallel for
Load balancing
- If all the iterations execute at the same speed, the processors are used optimally
- If some iterations are faster than others, some processors may become idle, reducing the speedup
- We don't always know the distribution of work; it may need to be redistributed dynamically
Granularity
- Thread creation and synchronization take time
- Assigning work to threads at per-iteration resolution may take more time than the execution itself!
- Need to coalesce the work into coarse chunks to overcome the threading overhead

There is a trade-off between load balancing and granularity!
Controlling Granularity
#pragma omp parallel if (expression)
- Can be used to disable parallelization in some cases (when the input is determined to be too small to be beneficially multithreaded)
#pragma omp parallel num_threads(expression)
- Controls the number of threads used for this parallel region
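Both clauses can appear on one directive; a sketch reusing the earlier vector-add loop (the threshold 10000 and team size 4 are arbitrary choices):

#pragma omp parallel for if (n > 10000) num_threads(4)
for (int i = 0; i < n; i++)
    a[i] = b[i] + c[i];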
Work Sharing: Sections
answer1 = long_computation_1();
answer2 = long_computation_2();
if (answer1 != answer2) { ... }

How to parallelize? These are just two independent computations!

#pragma omp sections
{
    #pragma omp section
    answer1 = long_computation_1();
    #pragma omp section
    answer2 = long_computation_2();
}
if (answer1 != answer2) { ... }
Schedule Clause: Controlling Work Distribution
schedule(static [, chunksize])
- Default: chunks of approximately equal size, one to each thread
- If there are more chunks than threads: assigned round-robin to the threads
- Why might we want to use chunks of different size?

schedule(dynamic [, chunksize])
- Threads receive chunk assignments dynamically
- Default chunk size = 1 (why?)

schedule(guided [, chunksize])
- Start with large chunks
- Threads receive chunks dynamically
- Chunk size shrinks exponentially, down to chunksize
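For example (a sketch; process(i) stands for per-iteration work of varying cost):

#pragma omp parallel for schedule(dynamic, 4)  /* chunks of 4 iterations, handed out on demand */
for (int i = 0; i < N; i++)
    process(i);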
Graphic Scheduling

[Figure: iteration-to-thread assignment under static, dynamic, and guided(1) scheduling]
Scheduling Example
The function TestForPrime (usually) takes little time
But it can take long if the number is indeed a prime
#pragma omp parallel for schedule(????)
for (int i = start; i <= end; i += 2)
{
    if (TestForPrime(i)) gPrimesFound++;
}
Solution: use dynamic, but with chunks
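A sketch of the completed directive (the chunk size 100 is an arbitrary choice; note that the shared counter also needs protection, here via the reduction clause described later):

#pragma omp parallel for schedule(dynamic, 100) reduction(+:gPrimesFound)
for (int i = start; i <= end; i += 2)
{
    if (TestForPrime(i)) gPrimesFound++;
}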
Data Visibility
Shared Memory programming model
Most variables (including locals) are shared by default (unlike Pthreads!)

{
    int sum = 0;
    #pragma omp parallel for
    for (int i = 0; i < N; i++) sum += i;
}
Global variables are shared
Some variables can be private
- Automatic variables inside the statement block
- Automatic variables in the called functions
- Variables can be explicitly declared as private; in that case, a local copy is created for each thread
Overriding Storage Attributes
private:
- A copy of the variable is created for each thread
- No connection between the original variable and the private copies
- Can achieve the same using variables inside { }
int i;
#pragma omp parallel for private(i)
for (i = 0; i < n; i++) { ... }
Overriding Storage Attributes II
firstprivate:
- Same, but the initial value is copied from the main copy
lastprivate:
- Same, but the last value is copied to the main copy
int idx = 1;
int x = 10;
#pragma omp parallel for firstprivate(x) lastprivate(idx)
for (i = 0; i < n; i++) {
    if (data[i] == x) idx = i;
}
Threadprivate
Similar to private, but defined per variable:
- Declaration immediately after the variable definition; must be visible in all translation units
- Persistent between parallel sections
- Can be initialized from the master's copy with #pragma omp copyin
- More efficient than private, but a global variable!
Example:
int data[100];
#pragma omp threadprivate(data)
...
#pragma omp parallel for copyin(data)
for (...)
Synchronization
X = 0;
#pragma omp parallel
X = X + 1;
What should the result be (assuming 2 threads)?
- 2 is the expected answer
- But it can be 1 with unfortunate interleaving
OpenMP assumes that the programmer knows what he is doing
- Regions of code that are marked to run in parallel are independent
- If access collisions are possible, it is the programmer's responsibility to insert protection
Synchronization Mechanisms
Many of the existing mechanisms for shared-memory programming:
OpenMP Synchronization
- Nowait (turn synchronization off!)
- Single/Master execution
- Critical sections, atomic updates
- Ordered
- Barriers
- Flush (memory subsystem synchronization)
- Reduction (special case)
Single/Master
#pragma omp single
- Only one of the threads will execute the following block of code
- The rest will wait for it to complete
- Good for non-thread-safe regions of code (such as I/O)
- Must be used inside a parallel region
- Applicable to parallel for sections
Single/Master II
#pragma omp master
- The following block will be executed by the master thread
- No synchronization involved
- Applicable only to parallel sections
#pragma omp parallel
{
    do_preprocessing();

    #pragma omp single
    read_input();
    #pragma omp master
    notify_input_consumed();

    do_processing();
}
Critical Sections
#pragma omp critical [name]
- Standard critical section functionality
- Critical sections are global in the program
  - Can be used to protect a single resource in different functions
- Critical sections are identified by the name
  - All the unnamed critical sections are mutually exclusive throughout the program
  - All the critical sections having the same name are mutually exclusive between themselves
int x = 0;
#pragma omp parallel shared(x)
{
    #pragma omp critical
    x++;
}
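A sketch of a named critical section protecting one resource from two different functions (queue_lock, enqueue, and dequeue are hypothetical names):

void producer(int item) {
    #pragma omp critical(queue_lock)
    enqueue(item);        /* excluded against the dequeue below */
}

int consume(void) {
    int item;
    #pragma omp critical(queue_lock)
    item = dequeue();     /* same name: same mutual exclusion */
    return item;
}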
Atomic Execution
Critical sections on the cheap
- Protects a single variable update
- Can be much more efficient (a dedicated assembly instruction on some architectures)

#pragma omp atomic
update_statement

- The update statement is one of: var = var op expr, var op= expr, var++, var--
- The variable must be a scalar
- The operation op is one of: +, -, *, /, ^, &, |, <<, >>
- The evaluation of expr is not atomic!
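A sketch of that last point (counter and compute() are illustrative): only the update of counter is protected; the call itself may interleave freely.

int counter = 0;
#pragma omp parallel
{
    #pragma omp atomic
    counter += compute();   /* the += is atomic; compute() is not */
}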
Ordered
#pragma omp ordered
statement

- Executes the statement in the sequential order of iterations (the enclosing loop must carry the ordered clause, as in the example below)
Example:
#pragma omp parallel for ordered
for (j = 0; j < N; j++) {
    int result = heavy_computation(j);
    #pragma omp ordered
    printf("computation(%d) = %d\n", j, result);
}
Barrier synchronization
#pragma omp barrier

- Performs a barrier synchronization between all the threads in a team at the given point
Example:
#pragma omp parallel
{
    int result = heavy_computation_part1();
    #pragma omp atomic
    sum += result;
    #pragma omp barrier
    heavy_computation_part2(sum);
}
Explicit Locking
- Lock variables can be passed around (unlike critical sections!)
- Can be used to implement more involved synchronization constructs
Functions:
- omp_init_lock(), omp_destroy_lock(), omp_set_lock(), omp_unset_lock(), omp_test_lock()
- The usual semantics
- Use #pragma omp flush to synchronize memory
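A minimal sketch of the lock lifecycle (my_lock and the protected work are illustrative):

omp_lock_t my_lock;
omp_init_lock(&my_lock);

#pragma omp parallel
{
    omp_set_lock(&my_lock);
    /* ... exclusive access to the shared resource ... */
    omp_unset_lock(&my_lock);
}

omp_destroy_lock(&my_lock);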
Consistency Violation?
#pragma omp parallel for shared(x) private(i)
for (i = 0; i < 100; i++) {
    #pragma omp atomic
    x++;
}
printf("%i", x); /* 100 */

#pragma omp parallel for shared(x) private(i)
for (i = 0; i < 100; i++) {
    omp_set_lock(my_lock);
    x++;
    omp_unset_lock(my_lock);
}
printf("%i", x); /* 96 !! */
Consistency Violation?
#pragma omp parallel for shared(x) private(i)
for (i = 0; i < 100; i++) {
    #pragma omp atomic
    x++;
}
printf("%i", x); /* 100 */

#pragma omp parallel for shared(x) private(i)
for (i = 0; i < 100; i++) {
    omp_set_lock(my_lock);
    x++;
    #pragma omp flush
    omp_unset_lock(my_lock);
}
printf("%i", x); /* 100 */
Reduction
for (j = 0; j < N; j++) {
    sum = sum + a[j] * b[j];
}
How to parallelize this code?
- sum is not private, but accessing it atomically is too expensive
- Better: have a private copy of sum in each thread, then add them up
- Use the reduction clause! #pragma omp parallel for reduction(+: sum)
- Any associative operator may be used: +, -, ||, |, *, etc.
- The private value is initialized automatically (to 0, 1, ~0, ...)
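The parallelized dot product then reads (same variables as the loop above):

#pragma omp parallel for reduction(+: sum)
for (j = 0; j < N; j++) {
    sum = sum + a[j] * b[j];
}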
Synchronization Overhead
Lost time waiting for locks
- Prefer structures that are as lock-free as possible!
- Use parallelization granularity that is as large as possible
#pragma omp parallel
{
    #pragma omp critical
    {
        ...
    }
    ...
}
Numerical Integration
Mathematically, we know that

\int_0^1 \frac{4.0}{1 + x^2} \, dx = \pi

We can approximate the integral as a sum of N rectangles:

\sum_{i=0}^{N} F(x_i) \, \Delta x \approx \pi

where each rectangle has width \Delta x and height F(x_i) at the middle of interval i.
Serial Code
#include <stdio.h>

static long num_steps = 100000;
double step, pi;

int main()
{
    int i;
    double x, sum = 0.0;

    step = 1.0 / (double) num_steps;

    for (i = 0; i < num_steps; i++) {
        x = (i + 0.5) * step;
        sum = sum + 4.0 / (1.0 + x * x);
    }
    pi = step * sum;
    printf("Pi = %f\n", pi);
    return 0;
}
Parallelize the numerical integration code using OpenMP:
- What variables can be shared?
- What variables need to be private?
- What variables should be set up for reductions?
Parallel Code
#include <stdio.h>

static long num_steps = 100000;
double step, pi;

int main()
{
    int i;
    double x, sum = 0.0;

    step = 1.0 / (double) num_steps;

    #pragma omp parallel for private(x) reduction(+:sum)
    for (i = 0; i < num_steps; i++) {
        x = (i + 0.5) * step;
        sum = sum + 4.0 / (1.0 + x * x);
    }
    pi = step * sum;
    printf("Pi = %f\n", pi);
    return 0;
}
- The parallelization code is a one-liner!
- sum is a reduction (hence shared) variable
- i is private since it is the loop variable
Summary
OpenMP: A framework for code parallelization
- Available for C, C++, and Fortran
- Based on a standard
- Implementations from a wide selection of vendors
Easy to use
- Write (and debug!) code first, parallelize later
- Parallelization can be incremental
- Parallelization can be turned off at runtime or compile time
- Code is still correct for a serial machine
Limitations
OpenMP requires compiler support
- Sun, Intel, GCC...
- Embedded systems?
OpenMP does not parallelize dependencies
- Often does not detect dependencies
- Nasty race conditions still exist!
OpenMP is not guaranteed to divide work optimally among threads
- Programmer-tweakable with scheduling clauses
- Still lots of rope available
For Further Reading
Clay Breshears. The Art of Concurrency. O'Reilly, 2009.

Blaise Barney. Introduction to OpenMP, 2011. https://computing.llnl.gov/tutorials/