
Parallel Programming Models (Shared Address Space)

5th week

OpenMP Is …

An Application Program Interface (API) for explicitly directing multi-threaded, shared-memory parallelism
  Three API components
    Compiler directives
    Runtime library routines
    Environment variables

Portable
  APIs for C/C++ and Fortran
  Multiple platforms: most Unix platforms and Windows NT

OpenMP Is … (Cont’d)

Standardized
  Jointly proposed by a group of major computer hardware and software vendors
  Expected to become an ANSI standard

What does OpenMP stand for?
  Open specifications for Multi Processing
  Collaborative work with interested parties from the hardware and software industry, government, and academia

OpenMP Is Not …

Meant for distributed-memory parallel systems (by itself)
Implemented identically by all vendors
Guaranteed to make the most efficient use of shared memory
  There are no data locality constructs

History

Directive-based Fortran programming extensions
  Introduced in the early 1990s by vendors of shared-memory machines
  A serial Fortran program is augmented with directives that specify the loops to be parallelized
  The compiler is responsible for parallelizing such loops across the SMP processors
  Implementations were all functionally similar, but were diverging (as usual)

History (Cont’d)

ANSI X3H5
  In 1994
  Rejected due to waning interest as distributed-memory machines became popular

OpenMP
  In the spring of 1997
  Took over where ANSI X3H5 had left off, as newer shared-memory machine architectures became popular

Goals

Standardization
  Provide a standard among a variety of shared-memory architectures (platforms)
  High-level interfaces to thread programming

Lean and mean
  A simple and limited set of directives for shared address space programming
  Just 3 or 4 directives are enough to represent significant parallelism

Hello World Program: Pthread Version

#include <pthread.h>
#include <stdio.h>

void* thrfunc(void* arg)
{
    printf("hello from thread %d\n", *(int*)arg);
    return NULL;
}

int main(void)
{
    pthread_t thread[4];
    pthread_attr_t attr;
    int arg[4] = {0, 1, 2, 3};
    int i;

    // set up joinable threads with system scope
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
    pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);

    // create N threads
    for (i = 0; i < 4; i++)
        pthread_create(&thread[i], &attr, thrfunc, (void*)&arg[i]);

    // wait for the N threads to finish
    for (i = 0; i < 4; i++)
        pthread_join(thread[i], NULL);

    pthread_attr_destroy(&attr);
    return 0;
}

Hello World: OpenMP Version

#include <omp.h>
#include <stdio.h>

int main(void)
{
    #pragma omp parallel
    printf("hello from thread %d\n", omp_get_thread_num());
    return 0;
}

Goals (Cont’d)

Ease of use
  Incrementally parallelize a serial program
    Unlike the all-or-nothing approach of message passing
  Implement both coarse-grain and fine-grain parallelism

Portability
  Fortran (77, 90, and 95), C, and C++
  Public forum for API and membership

Matrix Multiplication: Sequential Version

for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        temp = 0;
        for (k = 0; k < N; k++)
            temp += a[i][k] * b[k][j];
        c[i][j] = temp;
    }
}

Matrix Multiplication: MPI Version

/* Determine block size.
   NumProcs stands for the slide's "# of processors"; N is assumed to be divisible by it.
   a, b, and c are assumed to be contiguous N x N int arrays. */
BlkSz = N / NumProcs;          /* rows per process */
start = BlkSz * Rank;
end   = start + BlkSz;

/* Distribute blocks: all of b, and BlkSz rows of a, to every process */
MPI_Bcast(b, N * N, MPI_INT, 0, MPI_COMM_WORLD);
if (Rank == 0) {
    for (i = 1; i < NumProcs; i++)
        MPI_Send(&a[BlkSz * i][0], BlkSz * N, MPI_INT, i, TAG_INIT, MPI_COMM_WORLD);
} else {
    MPI_Recv(&a[start][0], BlkSz * N, MPI_INT, 0, TAG_INIT, MPI_COMM_WORLD, &status);
}

/* Calculate partial matrix multiplication */
for (i = start; i < end; i++) {
    for (j = 0; j < N; j++) {
        temp = 0;
        for (k = 0; k < N; k++)
            temp += a[i][k] * b[k][j];
        c[i][j] = temp;
    }
}

/* Gather partial results */
if (Rank == 0) {
    for (i = 1; i < NumProcs; i++)
        MPI_Recv(&c[BlkSz * i][0], BlkSz * N, MPI_INT, i, TAG_END, MPI_COMM_WORLD, &status);
} else {
    MPI_Send(&c[start][0], BlkSz * N, MPI_INT, 0, TAG_END, MPI_COMM_WORLD);
}

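Matrix Multiplication: OpenMP Version

/* Add one directive to the sequential version.
   The loop index i is private automatically; j, k, and temp must be listed as private. */
#pragma omp parallel for private(j, k, temp) schedule(static)
for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        temp = 0;
        for (k = 0; k < N; k++)
            temp += a[i][k] * b[k][j];
        c[i][j] = temp;
    }
}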

Programming Model

Thread-based parallelism
  A shared-memory process with multiple threads
  Based upon multiple threads in the shared-memory programming paradigm

Explicit parallelism
  Explicit (not automatic) programming model
  Offers the programmer full control over parallelization

Programming Model (Cont’d)

Fork-join model
  All OpenMP programs begin as a single sequential process: the master thread
  Fork at the beginning of parallel constructs
    The master thread creates a team of parallel threads
    The statements enclosed by the parallel region construct are executed in parallel
  Join at the end of parallel constructs
    The threads synchronize and terminate after completing the statements in the parallel construct
    Only the master thread remains

Fork-Join Model
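
A minimal sketch of the fork-join model described above. The function name count_team_members is hypothetical. The code relies only on directives, so it also compiles and runs serially (returning 1) when OpenMP support is switched off.

```c
/* Hypothetical helper: returns how many threads executed the parallel region. */
int count_team_members(void)
{
    int members = 0;    /* declared before the fork, so shared by the team */

    /* Fork: the master thread creates a team of threads here. */
    #pragma omp parallel
    {
        /* Every team member executes this block once. */
        #pragma omp atomic
        members++;
    }   /* Join: implied barrier; only the master thread continues. */

    return members;
}
```

The returned value depends on the team size chosen by the runtime, so it is at least 1 but otherwise not fixed.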

Programming Model (Cont’d)

Compiler directive based
  Parallelism is specified through the use of compiler directives embedded in C/C++ or Fortran source code

Nested parallelism support
  Parallel constructs may include other parallel constructs inside
  Implementation dependent

Dynamic threads
  The number of threads used to execute parallel regions can be altered
  Implementation dependent

General Code Structure

#include <omp.h>

main () {
    int var1, var2, var3;

    Serial code
    ...

    /* Beginning of parallel section. Fork a team of threads. Specify variable scoping */
    #pragma omp parallel private(var1, var2) shared(var3)
    {
        Parallel section executed by all threads
        ...
        All threads join master thread and disband
    }

    Resume serial code
}

Terms

Construct
  A statement that consists of a directive and the subsequent structured block
Directive
  A C or C++ #pragma followed by the omp identifier, other text, and a new line
  The directive specifies program behavior
Structured block
  A statement that has a single entry and a single exit
  A compound statement is a structured block if its execution always begins at the opening { and always ends at the closing }

Terms (Cont’d)

Lexical extent
  The code textually enclosed between the beginning and the end of a structured block following a directive
  The static extent of a directive does not span multiple routines or code files

Orphaned directive
  An OpenMP directive that appears independently from another enclosing directive
  It exists outside of another directive's static (lexical) extent
  Will span routines and possibly code files

Terms (Cont’d)

Dynamic extent (region)
  All statements in the lexical extent, plus any statement inside a function that is executed as a result of the execution of statements within the lexical extent
  The dynamic extent of a directive includes both its static (lexical) extent and the extents of its orphaned directives

Master thread
  The thread that creates a team when a parallel region is entered

Team
  One or more threads cooperating in the execution of a construct

Lexical/Orphan/Dynamic Extent

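#pragma omp parallel
{                                   /* static (lexical) extent begins */
    …
    #pragma omp for
    for (i = 0; i < n; i++) {
        for (j = 0; j < m; j++)
            sub1();
        sub2();
    }
}                                   /* static (lexical) extent ends */

sub1()
{
    #pragma omp critical            /* orphaned directive */
    …
}

sub2()
{
    #pragma omp sections            /* orphaned directive */
    …
}

Static extent: the code textually enclosed by the parallel construct
Orphaned directives: the critical and sections directives inside sub1() and sub2()
Dynamic extent: the static extent plus the bodies of sub1() and sub2()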

Terms (Cont’d)

Parallel region
  Statements that bind to an OpenMP parallel construct and may be executed by multiple threads
Serial region
  Statements executed only by the master thread outside of the dynamic extent of any parallel region
Private
  A private variable names a block of storage that is unique to the thread making the reference
Shared
  A shared variable names a single block of storage
  All threads in a team that access this variable will access this single block of storage

OpenMP Components

Directives
  Work-sharing constructs
  Data environment clauses
  Synchronization constructs
Runtime libraries
Environment variables

Directive Format

#pragma omp  directive-name  [clause, …]  newline

#pragma omp       Start of all OpenMP C/C++ directives
directive-name    A valid OpenMP directive; appears after the pragma and before any clauses
[clause, …]       Clauses can be in any order and can be repeated
newline           Required; precedes the structured block enclosed by this directive

Ex) #pragma omp parallel default(shared) private(beta, pi)

Directive Format (Cont’d)

General rules
  Directives follow the conventions of the C/C++ standards
  Case sensitive
  Only one directive-name per directive
  Each directive applies to at most one succeeding structured block
  A long directive can be extended across multiple lines by escaping the newline character with a backslash ("\") at the end of a directive line

Parallel Directive

Purpose
  A block of code to be executed by multiple threads
  The fundamental OpenMP parallel construct

Format
  #pragma omp parallel [clause ...] newline
      if (scalar_expression)
      private (list)
      shared (list)
      default (shared | none)
      firstprivate (list)
      reduction (operator: list)
      copyin (list)
    structured_block

Parallel Directive (Cont’d)

Description
  On reaching a PARALLEL directive, a thread creates a team of threads and becomes the master
  The master is a member of that team (id = 0)
  The code is duplicated and all threads execute that code
  There is an implied barrier at the end of a parallel section
  Only the master thread continues execution past this point


Number of threads
  Determined by the following factors, in order of precedence:
    omp_set_num_threads() library function
    OMP_NUM_THREADS environment variable
    Implementation default
  Threads are numbered from 0 (master thread) to N-1


Clauses
  IF clause
    If present, it must evaluate to .TRUE. (Fortran) or non-zero (C/C++) in order for a team of threads to be created; otherwise the region is executed serially by the master thread
  Data scope attribute clauses

Restrictions
  A parallel region must be a structured block that does not span multiple routines or code files
  Only a single IF clause is permitted


Dynamic threads
  By default, a program uses the same number of threads to execute each parallel region
  The run-time system can dynamically adjust the number of threads when this is enabled through
    omp_set_dynamic() library function
    OMP_DYNAMIC environment variable

Nested parallel regions
  A parallel region nested within another parallel region results in the creation of a new team, consisting of one thread, by default
  Implementation dependent
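
A small sketch of how the thread count can be requested from code, following the precedence rules above. The helper name team_size_for is hypothetical. The runtime calls are guarded with _OPENMP so that the sketch still builds (and reports a team of 1) without OpenMP support.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: request `requested` threads and report the size of
   the team that actually executes the next parallel region. */
int team_size_for(int requested)
{
    int size = 1;                       /* serial fallback */
    (void)requested;                    /* unused when OpenMP is disabled */

#ifdef _OPENMP
    omp_set_dynamic(0);                 /* ask the runtime not to adjust the count */
    omp_set_num_threads(requested);     /* takes precedence over OMP_NUM_THREADS */
#endif

    #pragma omp parallel
    {
        #pragma omp master
        {
#ifdef _OPENMP
            size = omp_get_num_threads();   /* written by thread 0 only */
#endif
        }
    }
    return size;
}
```

The runtime may still deliver fewer threads than requested, for example when a system limit is reached, so the result should be treated as an upper bound rather than a guarantee.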

Example of Parallel Region

#include <omp.h>
#include <stdio.h>

int main(void)
{
    int nthreads, tid;

    /* Fork a team of threads, giving them their own copies of variables */
    #pragma omp parallel private(nthreads, tid)
    {
        /* Obtain and print thread id */
        tid = omp_get_thread_num();
        printf("Hello World from thread = %d\n", tid);

        /* Only master thread does this */
        if (tid == 0) {
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
        }
    } /* All threads join master thread and terminate */

    return 0;
}

Work-Sharing Constructs

Description
  Divides the execution of the enclosed region among the members of the team
  There is an implied barrier at the end of the construct
  There is no implied barrier upon entry to the construct
  Work-sharing constructs do not launch new threads

Construct Types

#pragma omp for
  Shares iterations of a loop across the team
  Represents a type of data parallelism

#pragma omp single
  Serializes a section of code

#pragma omp sections
  Breaks work into separate, discrete sections
  Each section is executed by a thread
  Can be used to implement a type of functional parallelism

Construct Types (Cont’d)

#pragma omp parallel for
  Simplified form of #pragma omp parallel + #pragma omp for

#pragma omp parallel sections
  Simplified form of #pragma omp parallel + #pragma omp sections

Work-Sharing Constructs

Restrictions
  Must be enclosed dynamically within a parallel region for parallel execution
  Must be encountered by all members of a team or none at all
  Successive work-sharing constructs must be encountered in the same order by all members of a team

#pragma omp for

Purpose
  The iterations of the loop immediately following this directive must be executed in parallel by the team
  This assumes a parallel region has already been initiated
  Otherwise the loop executes serially on a single processor


#pragma omp for (Cont’d)

Format
  #pragma omp for [clause ...] newline
      schedule (type [,chunk])
      ordered
      private (list)
      firstprivate (list)
      lastprivate (list)
      shared (list)
      reduction (operator: list)
      nowait
    for_loop

Clauses

SCHEDULE clause
  Describes how iterations of the loop are divided among the threads in the team
  The default schedule is implementation dependent

STATIC
  Loop iterations are divided into pieces of size chunk and then statically assigned to threads
  If chunk is not specified, the iterations are divided evenly (if possible) and contiguously among the threads

SCHEDULE Clause

DYNAMIC
  Loop iterations are divided into pieces of size chunk and dynamically scheduled among the threads
  When a thread finishes one chunk, it is dynamically assigned another
  The default chunk size is 1

GUIDED
  The chunk size is exponentially reduced with each dispatched piece of the iteration space
  The chunk size specifies the minimum number of iterations to dispatch each time
  The default chunk size is 1

SCHEDULE Clause (Cont’d)

RUNTIME
  The scheduling decision is deferred until run time and is taken from the environment variable OMP_SCHEDULE
  It is illegal to specify a chunk size for this clause

ORDERED clause
  Must be present when ORDERED directives are enclosed within the for directive

NOWAIT clause
  Threads do not synchronize at the end of the parallel loop
  Threads proceed directly to the next statements after the loop
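
To make the schedule kinds concrete, here is a small sketch that runs three dependent loops with static, dynamic, and guided scheduling. The function name, array size, and chunk sizes are illustrative assumptions. The schedule only changes which thread runs which iterations, so the result is the same in every case.

```c
enum { WORK_SIZE = 1000 };   /* illustrative problem size */

/* Hypothetical helper: returns the sum of (i + 1) for i in [0, WORK_SIZE). */
long sum_with_schedules(void)
{
    static int work[WORK_SIZE];
    long total = 0;
    int i;

    #pragma omp parallel private(i)
    {
        /* STATIC: chunks of 100 iterations are assigned to threads up front. */
        #pragma omp for schedule(static, 100)
        for (i = 0; i < WORK_SIZE; i++)
            work[i] = i;
        /* The implied barrier above makes work[] complete before the next loop. */

        /* DYNAMIC: each thread fetches a new chunk of 50 when it finishes one. */
        #pragma omp for schedule(dynamic, 50)
        for (i = 0; i < WORK_SIZE; i++)
            work[i] += 1;

        /* GUIDED: chunk sizes shrink, but never below 10 iterations. */
        #pragma omp for schedule(guided, 10) reduction(+:total)
        for (i = 0; i < WORK_SIZE; i++)
            total += work[i];
    }
    return total;
}
```

None of the loops uses nowait, because each one reads data written by the previous loop.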

SCHEDULE Clause (Cont’d)

Restrictions
  The for loop cannot be a do-while loop or a loop without loop control
  The loop iteration variable must be an integer, and the loop control parameters must be the same for all threads
  Program correctness must not depend upon which thread executes a particular iteration
  The chunk size must be specified as a loop-invariant integer expression
  The C/C++ for directive requires that the for loop have canonical shape
  ORDERED and SCHEDULE clauses may appear once each

Example of For Directive

#include <omp.h>
#define CHUNKSIZE 100
#define N 1000

int main(void)
{
    int i, chunk;
    float a[N], b[N], c[N];

    /* Some initializations */
    for (i = 0; i < N; i++)
        a[i] = b[i] = i * 1.0;
    chunk = CHUNKSIZE;

    #pragma omp parallel shared(a,b,c,chunk) private(i)
    {
        #pragma omp for schedule(dynamic,chunk) nowait
        for (i = 0; i < N; i++)
            c[i] = a[i] + b[i];
    } /* end of parallel section */

    return 0;
}

#pragma omp sections

Purpose
  A non-iterative work-sharing construct
  The enclosed section(s) of code are to be divided among the threads in the team
  Independent SECTION directives are nested within a SECTIONS directive
  Each SECTION is executed once by a thread in the team
  Different sections may be executed by different threads

#pragma omp sections (Cont’d)


Format
  #pragma omp sections [clause ...] newline
      private (list)
      firstprivate (list)
      lastprivate (list)
      reduction (operator: list)
      nowait
    {
    #pragma omp section newline
        structured_block
    #pragma omp section newline
        structured_block
    }


Clauses
  There is an implied barrier at the end of a SECTIONS directive, unless the nowait clause is used

Questions
  What happens if the number of threads and the number of SECTIONs are different?
    More threads than SECTIONs?
    Fewer threads than SECTIONs?
  Which thread executes which SECTION?

Restriction
  SECTION directives must occur within the lexical extent of an enclosing SECTIONS directive

Example of Sections Directive

#include <omp.h>
#define N 1000

int main(void)
{
    int i;
    float a[N], b[N], c[N];

    /* Some initializations */
    for (i = 0; i < N; i++)
        a[i] = b[i] = i * 1.0;

    #pragma omp parallel shared(a,b,c) private(i)
    {
        #pragma omp sections nowait
        {
            #pragma omp section
            for (i = 0; i < N/2; i++)
                c[i] = a[i] + b[i];

            #pragma omp section
            for (i = N/2; i < N; i++)
                c[i] = a[i] + b[i];
        } /* end of sections */
    } /* end of parallel section */

    return 0;
}

#pragma omp single

Purpose
  The enclosed code is to be executed by only one thread in the team
  May be useful when dealing with sections of code that are not thread safe (such as I/O)


#pragma omp single (Cont’d)

Format
  #pragma omp single [clause ...] newline
      private (list)
      firstprivate (list)
      nowait
    structured_block

Clauses
  Threads in the team that do not execute the SINGLE directive wait at the end of the enclosed code block, unless a nowait clause is specified
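
A brief sketch of the single construct. The helper name run_single_once is hypothetical. However many threads are in the team, the enclosed block runs exactly once, and the other threads wait at its end because nowait is not used.

```c
/* Hypothetical helper: counts how often the single block is executed. */
int run_single_once(void)
{
    int executions = 0;      /* shared by the team */

    #pragma omp parallel
    {
        #pragma omp single
        {
            /* Executed by exactly one thread, so no race on `executions`. */
            executions++;
        }   /* implied barrier: the other threads wait here */
    }
    return executions;
}
```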

#pragma omp parallel for

#include <omp.h>
#define N 1000
#define CHUNKSIZE 100

int main(void)
{
    int i, chunk;
    float a[N], b[N], c[N];

    /* Some initializations */
    for (i = 0; i < N; i++)
        a[i] = b[i] = i * 1.0;
    chunk = CHUNKSIZE;

    #pragma omp parallel for shared(a,b,c,chunk) private(i) schedule(static,chunk)
    for (i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    return 0;
}

Data Environment

#pragma omp threadprivate
Data scope clauses

#pragma omp threadprivate

Purpose
  Makes global file-scope variables local and persistent to a thread through the execution of multiple parallel regions

Format
  #pragma omp threadprivate (list)

#pragma omp threadprivate (Cont’d)

Notes
  The directive must appear after the declaration of the listed variables/common blocks
  Each thread gets its own copy, so data written by one thread is not visible to other threads
  On first entry to a parallel region, data in THREADPRIVATE variables should be assumed undefined, unless a COPYIN clause is specified in the PARALLEL directive
  THREADPRIVATE variables differ from PRIVATE variables because they are persistent

#pragma omp threadprivate (Cont’d)

Restrictions
  Data in THREADPRIVATE objects is guaranteed to persist only if the dynamic threads mechanism is "turned off" and the number of threads in different parallel regions remains constant
  The default setting of dynamic threads is undefined
  The directive must appear after every declaration of a threadprivate variable/common block

Example of Threadprivate Directive

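#include <omp.h>
#include <stdio.h>

int alpha[10], beta[10], i;
#pragma omp threadprivate(alpha)

int main(void)
{
    /* First parallel region: each thread fills its own alpha and a private beta */
    #pragma omp parallel private(i,beta)
    for (i = 0; i < 10; i++)
        alpha[i] = beta[i] = i;

    /* Second parallel region: alpha persists per thread, the private beta does not */
    #pragma omp parallel
    printf("alpha[3]= %d and beta[3]= %d\n", alpha[3], beta[3]);

    return 0;
}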

Data Scope Clauses

Data scope attribute clauses
  Explicitly define how variables should be scoped
  An important consideration for OpenMP programming is the understanding and use of data scoping
  Most variables are shared by default

Global variables include
  File scope variables, static variables

Private variables include
  Loop index variables
  Stack variables in subroutines called from parallel regions

Kinds of Data Scope Clauses

#pragma … private
#pragma … firstprivate
#pragma … lastprivate
#pragma … shared
#pragma … default
#pragma … reduction
#pragma … copyin

Data Scope Clauses (Cont’d)

Used in conjunction with several directives to control the scoping of enclosed variables
Control the data environment during execution of parallel constructs
  How and which data variables in the serial section of the program are transferred to the parallel sections of the program (and back)
  Which variables will be visible to all threads in the parallel sections and which variables will be privately allocated to each thread
Effective only within their lexical/static extent

PRIVATE Clause

Purpose
  Declares variables in its list to be private to each thread

Format
  private (list)

Behavior
  A new object of the same type is declared once for each thread in the team
  All references to the original object are replaced with references to the new object
  The new object is uninitialized for each thread

Comparison Between PRIVATE and THREADPRIVATE

                 PRIVATE                                      THREADPRIVATE
Data item        C/C++: variable                              C/C++: variable
Where declared   At start of region or work-sharing group     In declarations of each routine using block or global file scope
Persistent       No                                           Yes
Extent           Lexical only, unless passed as an argument   Dynamic
                 to a subroutine
Initialized      FIRSTPRIVATE                                 COPYIN

Shared Clause

Purpose
  Declares variables in its list to be shared among all threads in the team

Format
  shared (list)

Notes
  A shared variable exists in only one memory location, and all threads can read or write to that address
  It is the programmer's responsibility to ensure that multiple threads properly access SHARED variables

Default Clause

Purpose
  Allows the user to specify a default scope for all variables in the lexical extent of any parallel region
  The C/C++ format shown below accepts SHARED or NONE

Format
  default (shared | none)

Notes
  Specific variables can be exempted from the default using the PRIVATE, SHARED, FIRSTPRIVATE, LASTPRIVATE, and REDUCTION clauses

Default Clause (Cont’d)

Restrictions
  Only one DEFAULT clause can be specified on a PARALLEL directive

Firstprivate Clause

Purpose
  Combines the behavior of the PRIVATE clause with automatic initialization of the variables in its list

Format
  firstprivate (list)

Notes
  Listed variables are initialized according to the value of their original objects prior to entry into the parallel or work-sharing construct
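
A short sketch of firstprivate. The function name and the value 100 are illustrative assumptions. Each thread's private copy of offset starts at the value the original held before the construct, and the original is left untouched.

```c
/* Hypothetical helper: fills out[i] = 100 + i and returns the original offset. */
int firstprivate_demo(int n, int *out)
{
    int offset = 100;    /* original object, set before the construct */
    int i;

    #pragma omp parallel for firstprivate(offset)
    for (i = 0; i < n; i++) {
        /* `offset` here is the thread's private copy, initialized to 100. */
        out[i] = offset + i;
    }

    return offset;       /* the original object is unchanged */
}
```

With plain private(offset) instead, the private copies would be uninitialized and the values stored in out would be undefined.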

Lastprivate Clause

Purpose
  Combines the behavior of the PRIVATE clause with a copy from the last loop iteration or section to the original variable object

Format
  lastprivate (list)

Note
  The value copied back into the original variable object is obtained from the sequentially last iteration or section of the enclosing construct
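
A short sketch of lastprivate, with a hypothetical function name. Whichever thread happens to run the sequentially last iteration, its value is the one copied back to the original variable.

```c
/* Hypothetical helper: returns (n-1)*(n-1) for n >= 1. */
int last_square(int n)
{
    int last = -1;
    int i;

    #pragma omp parallel for lastprivate(last)
    for (i = 0; i < n; i++) {
        last = i * i;    /* each thread writes its own private copy */
    }

    /* `last` now holds the value from iteration i == n - 1. */
    return last;
}
```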

Copyin Clause

Purpose
  Provides a means for assigning the same value to THREADPRIVATE variables for all threads in the team

Format
  copyin (list)

Notes
  The list contains the names of variables to copy
  The master thread's variable is used as the copy source
  The team threads are initialized with its value upon entry into the parallel construct
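
A minimal sketch combining threadprivate and copyin. The variable seed_value and the function copyin_demo are hypothetical names. Every thread checks that its own copy received the master's value on entry to the parallel region.

```c
int seed_value = 0;
#pragma omp threadprivate(seed_value)

/* Hypothetical helper: returns how many threads saw a value other than 42. */
int copyin_demo(void)
{
    int mismatches = 0;    /* shared by the team */

    seed_value = 42;       /* sets the master thread's copy */

    #pragma omp parallel copyin(seed_value)
    {
        /* Each thread reads its own threadprivate copy. */
        if (seed_value != 42) {
            #pragma omp atomic
            mismatches++;
        }
    }
    return mismatches;
}
```

Without the copyin clause, the copies belonging to the other threads would keep their own earlier contents instead of receiving 42.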

Reduction Clause

Purpose
  Performs a reduction on the variables that appear in its list
  A private copy of each list variable is created for each thread
  At the end of the reduction, the reduction operator is applied to all private copies of the shared variable, and the final result is written to the global shared variable

Format
  reduction (operator: list)

Reduction Clause (Cont’d)

Restrictions
  Variables in the list must be named scalar variables
  They must also be declared SHARED in the enclosing context
  Reduction operations may not be associative for real numbers, so floating-point results can vary with the order in which the private copies are combined

Reduction Clause (Cont’d)

The reduction variable is used only in statements that have one of the following forms:
  x = x op expr
  x = expr op x (except subtraction)
  x binop= expr
  x++
  ++x
  x--
  --x

Reduction Example

#include <omp.h>
#include <stdio.h>

int main(void)
{
    int i, n, chunk;
    float a[100], b[100], result;

    /* Some initializations */
    n = 100;
    chunk = 10;
    result = 0.0;
    for (i = 0; i < n; i++) {
        a[i] = i * 1.0;
        b[i] = i * 2.0;
    }

    #pragma omp parallel for default(shared) private(i) schedule(static,chunk) \
            reduction(+:result)
    for (i = 0; i < n; i++)
        result = result + (a[i] * b[i]);

    printf("Final result= %f\n", result);
    return 0;
}

Synchronization Constructs

#pragma omp master
#pragma omp critical
#pragma omp barrier
#pragma omp atomic
#pragma omp flush
#pragma omp ordered

Race Condition

Thread A                    Thread B

increment(x)                increment(x)
{                           {
    x = x + 1;                  x = x + 1;
}                           }

One possible execution sequence (each thread has its own register):
  1. Thread A loads the value of x into its register.
  2. Thread B loads the value of x into its register.
  3. Thread A adds 1 to its register.
  4. Thread B adds 1 to its register.
  5. Thread A stores its register at location x.
  6. Thread B stores its register at location x.

Both threads store the same value, so x ends up incremented once instead of twice.

Race Condition (Cont’d)

Solutions
  The increment of x must be synchronized between the two threads
  OpenMP provides a variety of synchronization constructs to control how the execution of each thread proceeds relative to other team threads

#pragma omp master

Purpose
  Specifies a region to be executed only by the master thread of the team
  All other threads in the team skip this section of code
  There is no implied barrier associated with this directive

Format
  #pragma omp master newline
    structured_block
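
Because master has no implied barrier, other threads can run ahead of the master's block. The sketch below, with hypothetical names, adds an explicit barrier so that every thread sees the value written by the master.

```c
/* Hypothetical helper: returns how many threads failed to see config == 7. */
int master_then_barrier(void)
{
    int config = 0;    /* shared by the team */
    int wrong = 0;     /* shared by the team */

    #pragma omp parallel
    {
        #pragma omp master
        config = 7;                /* executed by thread 0 only */

        /* No barrier is implied by master, so one is added explicitly. */
        #pragma omp barrier

        if (config != 7) {
            #pragma omp atomic
            wrong++;
        }
    }
    return wrong;
}
```

If the barrier were removed, a thread could read config before the master writes it, which is a data race.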

#pragma omp critical

Purpose
  Specifies a region of code that must be executed by only one thread at a time

Format
  #pragma omp critical [ name ] newline
    structured_block

#pragma omp critical (Cont’d)

Notes
  Used to prevent race conditions
    If a thread is executing inside a CRITICAL region, any other thread that reaches it will block until the first thread exits the CRITICAL region
  The optional name enables multiple different CRITICAL regions to exist
    Different CRITICAL regions with the same name are treated as the same region
    All unnamed CRITICAL sections are treated as the same section

Example of Critical Directive

#include <omp.h>

int main(void)
{
    int x;
    x = 0;

    #pragma omp parallel shared(x)
    {
        #pragma omp critical
        x = x + 1;
    } /* end of parallel section */

    return 0;
}

#pragma omp atomic

Purpose
  Specifies that a specific memory location must be updated atomically
  A mini-CRITICAL section

Format
  #pragma omp atomic newline
    statement_expression

#pragma omp atomic (Cont’d)

Restriction
  An atomic statement must have one of the following forms:
    x binop= expr
    x++
    ++x
    x--
    --x
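
A small sketch of atomic updates on array elements, using the x++ form listed above. The function histogram_mod4 is a hypothetical example. Each update protects only the single memory location being incremented.

```c
/* Hypothetical helper: counts how many i in [0, n) fall into each class i % 4. */
void histogram_mod4(int n, long hist[4])
{
    int i;

    for (i = 0; i < 4; i++)
        hist[i] = 0;

    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        /* Several threads may hit the same bin, so the update must be atomic. */
        #pragma omp atomic
        hist[i % 4]++;
    }
}
```

A critical section would also be correct here, but it would serialize updates to different bins as well.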

#pragma omp ordered

Purpose
  Specifies that iterations of the enclosed loop will be executed in the same order as if they were executed on a serial processor

Format
  #pragma omp ordered newline
    structured_block

#pragma omp ordered (Cont’d)

Restrictions
  May appear only in the dynamic extent of the following directives:
    for or parallel for
  Only one thread is allowed in an ordered section at any time
  An iteration of a loop must not execute the same ORDERED directive more than once, and it must not execute more than one ORDERED directive
  A loop that contains an ORDERED directive must be a loop with an ORDERED clause
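
A sketch of the ordered directive inside a loop that carries the ordered clause. The function ordered_append and the dynamic schedule are illustrative assumptions. The computation of value can overlap across threads, while the append step runs strictly in iteration order.

```c
/* Hypothetical helper: appends 2*i to out[] for i in [0, n), in order.
   Returns the number of values appended. */
int ordered_append(int n, int *out)
{
    int next = 0;    /* shared write position */
    int i;

    #pragma omp parallel for ordered schedule(dynamic, 1)
    for (i = 0; i < n; i++) {
        int value = i * 2;        /* may run in parallel, in any order */

        #pragma omp ordered
        {
            /* One thread at a time, in sequential iteration order. */
            out[next] = value;
            next++;
        }
    }
    return next;
}
```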

Directive Binding Rules

The for, SECTIONS, SINGLE, MASTER, and BARRIER directives bind to the dynamically enclosing PARALLEL, if one exists.
If no parallel region is currently being executed, the directives have no effect.
The ORDERED directive binds to the dynamically enclosing for.
The ATOMIC directive enforces exclusive access with respect to ATOMIC directives in all threads, not just the current team.

Directive Binding Rules (Cont’d)

The CRITICAL directive enforces exclusive access with respect to CRITICAL directives in all threads, not just the current team.
A directive can never bind to any directive outside the closest enclosing PARALLEL.

Directive Nesting Rules

A PARALLEL directive dynamically inside another PARALLEL directive logically establishes a new team, which is composed of only the current thread unless nested parallelism is enabled.
The for, SECTIONS, and SINGLE directives that bind to the same PARALLEL are not allowed to be nested inside of each other.
The for, SECTIONS, and SINGLE directives are not permitted in the dynamic extent of CRITICAL, ORDERED, and MASTER regions.

Directive Nesting Rules (Cont’d)

CRITICAL directives with the same name are not permitted to be nested inside of each other.
BARRIER directives are not permitted in the dynamic extent of DO/for, ORDERED, SECTIONS, SINGLE, MASTER, and CRITICAL regions.
MASTER directives are not permitted in the dynamic extent of DO/for, SECTIONS, and SINGLE directives.

Directive Nesting Rules (Cont’d)

ORDERED directives are not permitted in the dynamic extent of CRITICAL regions.
Any directive that is permitted when executed dynamically inside a PARALLEL region is also legal when executed outside a parallel region.
When executed dynamically outside a user-specified parallel region, the directive is executed with respect to a team composed of only the master thread.

Environment Variables

All environment variable names are uppercase. The values assigned to them are not case sensitive.

OMP_SCHEDULE
  Applies only to for and parallel for directives that have their schedule clause set to RUNTIME
    setenv OMP_SCHEDULE "guided, 4"
    setenv OMP_SCHEDULE "dynamic"

Environment Variables (Cont’d)

OMP_NUM_THREADS
  Sets the maximum number of threads to use during execution
    setenv OMP_NUM_THREADS 8

OMP_DYNAMIC
  Enables or disables dynamic adjustment of the number of threads available for execution of parallel regions
    setenv OMP_DYNAMIC TRUE

OMP_NESTED
  Enables or disables nested parallelism
    setenv OMP_NESTED TRUE
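
One way to check what the environment variables resolved to is to query the runtime from code. The helper below is a hypothetical sketch. It uses only omp_get_max_threads() and omp_get_dynamic(), and falls back to serial defaults when the program is built without OpenMP.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: prints the settings in effect and returns the
   maximum team size the next parallel region could use. */
int report_runtime_settings(void)
{
    int max_threads = 1;    /* serial defaults */
    int dynamic = 0;

#ifdef _OPENMP
    max_threads = omp_get_max_threads();   /* reflects OMP_NUM_THREADS */
    dynamic = omp_get_dynamic();           /* reflects OMP_DYNAMIC */
#endif

    printf("max threads = %d, dynamic adjustment = %d\n", max_threads, dynamic);
    return max_threads;
}
```

The printed values depend on the environment and on the implementation defaults, so no particular output should be expected.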