Parallelization programming

Transcript of the lecture slides "Parallelization programming"

Page 1: Parallelization - XS4ALL Klantenservice


Challenge the future

Delft University of Technology

Parallelization programming

Jan Thorbecke

Page 2: Parallelization - XS4ALL Klantenservice

!2

Contents

• Introduction • Programming

• OpenMP • MPI

• Issues with parallel computing • Running programs • Examples

Page 3: Parallelization - XS4ALL Klantenservice

• Describes the relation between the parallel portion of your code and the expected speedup

• P = parallel portion • N = number of processors used in parallel part

• The formula gives the ideal parallel speed-up; in practice the achieved speed-up will always be less

Amdahl’s Law

!3

speedup = 1 / ((1 - P) + P/N)
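As a quick illustration (a minimal sketch of my own, not from the slides), the formula can be evaluated directly; for P = 0.99 and N = 16 the speed-up is only about 13.9:

  #include <stdio.h>

  /* Amdahl's law: speedup = 1 / ((1 - P) + P/N) */
  double amdahl_speedup(double P, int N)
  {
      return 1.0 / ((1.0 - P) + P / (double) N);
  }

  int main(void)
  {
      printf("P = 0.99, N = 16  ->  speedup = %.2f\n", amdahl_speedup(0.99, 16));
      return 0;
  }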

Page 4: Parallelization - XS4ALL Klantenservice

(Figure: speed-up versus number of processors (0-60) for parallel fractions P = 1.0, 0.99 and 0.98.)

Amdahl’s Law

!4

Page 5: Parallelization - XS4ALL Klantenservice

(Figure: maximum speed-up as a function of the sequential portion in % (1-90); the maximum speed-up drops rapidly as the sequential portion grows.)

Amdahl’s Law

!5

Page 6: Parallelization - XS4ALL Klantenservice

(Figure: speed-up versus number of processors (0-60) for P = 1.0, 0.99 and 0.96, together with a super-linear speed-up curve that lies above the ideal line.)

Super Linear speed up

!6

Page 7: Parallelization - XS4ALL Klantenservice

Concurrency and Parallelism

• Concurrency and parallelism are often used synonymously.

Concurrency: the independence of parts of an algorithm (they do not depend on each other).

Parallelism (also parallel execution): two or more parts of a program are executed at the same moment in time.

Concurrency is a necessary prerequisite for parallel execution, but parallel execution is only one possible consequence of concurrency.

!7

Page 8: Parallelization - XS4ALL Klantenservice

Concurrency vs. Parallelism

• Concurrency: two or more threads are in progress at the same time:

• Parallelism: two or more threads are executing at the same time

Multiple cores needed

(Figure: with concurrency, Thread 1 and Thread 2 are interleaved on one core; with parallelism they execute simultaneously on separate cores.)

Page 9: Parallelization - XS4ALL Klantenservice

Learning

Classical CPUs are sequential.

An enormous amount of sequential-programming knowledge is built into compilers and known by most programmers.

Parallel programming requires new skills and new tools.

Start by parallelising simple problems and keep learning along the way towards complex real-world problems.

!9

Page 10: Parallelization - XS4ALL Klantenservice

Recognizing Sequential Processes

• Time is inherently sequential
  • Dynamic, real-time, event-driven applications are often difficult to parallelize effectively
  • Many games fall into this category

• Iterative processes
  • The results of an iteration depend on the preceding iteration
  • Audio encoders fall into this category

Page 11: Parallelization - XS4ALL Klantenservice

Parallel Programming Models

• Parallel programming models exist as an abstraction above hardware and memory architectures.

• Which model to use is often a combination of what is available and personal choice. There is no "best" model, although there certainly are better implementations of some models over others.

Page 12: Parallelization - XS4ALL Klantenservice

Parallel Programming Models

• Shared Memory
  • tasks share a common address space, which they read and write asynchronously.

• Threads (functional)
  • a single process can have multiple, concurrent execution paths.
    Example implementations: POSIX threads & OpenMP

• Message Passing
  • tasks exchange data through communications by sending and receiving messages.
    Example: MPI-2 specification.

• Data Parallel languages
  • tasks perform the same operation on their partition of work.
    Example: Co-array Fortran (CAF), Unified Parallel C (UPC), Chapel

• Hybrid

Page 13: Parallelization - XS4ALL Klantenservice

Programming Models

!13

(Diagram: the programming model (shared memory, message passing, or hybrid) sits between the user, whether new to parallel programming or experienced, and the system software (operating system, compilers), which in turn runs on the hardware layer: shared-memory or distributed-memory nodes connected by an interconnect.)

Page 14: Parallelization - XS4ALL Klantenservice

Parallel Programming Concepts

• Work distribution
  • Which parallel task is doing what?
  • Which data is used by which task?

• Synchronization
  • Do the different parallel tasks meet?

• Communication
  • Is communication between parallel parts needed?

• Load Balance
  • Does every task have the same amount of work?
  • Are all processors of the same speed?

!14

Page 15: Parallelization - XS4ALL Klantenservice

• Work decomposition
  • based on the loop counter

• Data decomposition
  • all work for a local portion of the data is done by the local processor

• Domain decomposition
  • decomposition of both work and data

Distributing Work and/or Data

!15

Work decomposition of do i=1,100 over 4 processors:   1: i=1,25   2: i=26,50   3: i=51,75   4: i=76,100

Data decomposition of a 20 x 50 array:   A(1:10,1:25)   A(1:10,26:50)   A(11:20,1:25)   A(11:20,26:50)

Page 16: Parallelization - XS4ALL Klantenservice

Synchronization

• Synchronization
  • causes overhead
  • idle time, when not all tasks are finished at the same time

!16


• Synchronization
  • is necessary
  • may cause:
    • idle time on some processors
    • overhead to execute the synchronization primitive

Example: a BARRIER synchronization between two loops, each distributed over 4 processors (i=1..25 | 26..50 | 51..75 | 76..100):

  Do i=1,100
    a(i) = b(i)+c(i)
  Enddo
                        <-- BARRIER synchronization
  Do i=1,100
    d(i) = 2*a(101-i)
  Enddo

Page 17: Parallelization - XS4ALL Klantenservice

Communication

• communication is necessary on the boundaries

• domain decomposition

!17

  do i=2,99
    b(i) = b(i) + h*(a(i-1)-2*a(i)+a(i+1))
  end do

e.g. b(26) = b(26) + h*(a(25)-2*a(26)+a(27))

A(1:25)   A(26:50)   A(51:75)   A(76:100)


Replication versus Communication

• Normally replicate the values
  – Consider how many calculations you can execute while sending only 1 bit from one process to another (at 6 µs latency and 1.0 Gflop/s that is about 6,000 operations)
  – Sending 16 kByte (20x20x5 doubles) at 300 MB/s bandwidth takes 53.3 µs, about 53,300 operations
  – Very often blocks have to wait for their neighbours
  – But the extra (replicated) work limits parallel efficiency

• Communication should only be used if one is quite sure that it is the best solution

(Figures: 2-dimensional domain decomposition with two halo cells; mesh partitioning with one subdomain per process.)

Page 18: Parallelization - XS4ALL Klantenservice


Unstructured Grids

• Mesh partitioning with special load-balancing libraries
  – Metis (George Karypis, University of Minnesota)
  – ParMetis (internally parallel version of Metis)
    • http://www.cs.umn.edu/~karypis/metis/metis.html
    • http://www.hlrs.de/organization/par/services/tools/loadbalancer/metis.html
  – Jostle (Chris Walshaw, University of Greenwich)
    • http://www.gre.ac.uk/jostle
    • http://www.hlrs.de/organization/par/services/tools/loadbalancer/jostle.html
  – Goals:
    • Same work load in each sub-domain
    • Minimizing the maximal number of neighbour connections between sub-domains

(Figure: example partitioning of an unstructured mesh into sub-domains, grid points numbered 0-23.)


Halo

• Stencil:
  – To calculate a new grid point, old data from the stencil grid points are needed
  – E.g., a 9-point stencil

• Halo
  – To calculate the new grid points of a sub-domain, additional grid points from other sub-domains are needed.
  – They are stored in halos (ghost cells, shadows)
  – The halo depends on the form of the stencil

Load Imbalance

• Load imbalance is the time that some processors in the system are idle due to:
  • less parallelism than processors
  • unequal sized tasks together with too little parallelism
  • unequal processors

!18

Page 19: Parallelization - XS4ALL Klantenservice

Examples of work distribution

• Domain Decomposition
• Master-Slave
• Task Decomposition

!19

Page 20: Parallelization - XS4ALL Klantenservice

Domain Decomposition

• First, decide how data elements should be divided among processors

• Second, decide which tasks each processor should be doing

Page 21: Parallelization - XS4ALL Klantenservice

Domain Decomposition

5 13 1 9 26 13 6 49 34 50 22 12 68 12 98 16 78 31 83 51 94 27 74 64

Find the largest element of an array

Page 22: Parallelization - XS4ALL Klantenservice

5 13 1 9 26 13 6 49 34 50 22 12 68 12 98 16 78 31 83 51 94 27 74 64

Domain DecompositionFind the largest element of an array

Core 0 Core 1 Core 2 Core 3

Page 23: Parallelization - XS4ALL Klantenservice

Domain DecompositionFind the largest element of an array

5 6 68 83

Core 0 Core 1 Core 2 Core 3

5 13 1 9 26 13 6 49 34 50 22 12 68 12 98 16 78 31 83 51 94 27 74 64

Page 24: Parallelization - XS4ALL Klantenservice

5 13 1 9 26 13 6 49 34 50 22 12 68 12 98 16 78 31 83 51 94 27 74 64

Domain DecompositionFind the largest element of an array

Core 0 Core 1 Core 2 Core 3

13 49 12 51

13 49 68 83

Page 25: Parallelization - XS4ALL Klantenservice

Domain DecompositionFind the largest element of an array

5 13 1 9 26 13 6 49 34 50 22 12 68 12 98 16 78 31 83 51 94 27 74 64

Core 0 Core 1 Core 2 Core 3

1 34 98 94

13 49 98 94

Page 26: Parallelization - XS4ALL Klantenservice

Domain DecompositionFind the largest element of an array

5 13 1 9 26 13 6 49 34 50 22 12 68 12 98 16 78 31 83 51 94 27 74 64

Core 0 Core 1 Core 2 Core 3

9 50 16 27

13 50 98 94

Page 27: Parallelization - XS4ALL Klantenservice

Domain DecompositionFind the largest element of an array

5 13 1 9 26 13 6 49 34 50 22 12 68 12 98 16 78 31 83 51 94 27 74 64

Core 0 Core 1 Core 2 Core 3

26 22 78 74

26 50 98 94

Page 28: Parallelization - XS4ALL Klantenservice

5 13 1 9 26 13 6 49 34 50 22 12 68 12 98 16 78 31 83 51 94 27 74 64

Domain DecompositionFind the largest element of an array

Core 0 Core 1 Core 2 Core 3

26 50 98 94

13 12 31 64

Page 29: Parallelization - XS4ALL Klantenservice

Domain DecompositionFind the largest element of an array

26

5 13 1 9 26 13 6 49 34 50 22 12 68 12 98 16 78 31 83 51 94 27 74 64

Core 0 Core 1 Core 2 Core 3

26 50 98 94

Page 30: Parallelization - XS4ALL Klantenservice

Domain DecompositionFind the largest element of an array

5 13 1 9 26 13 6 49 34 50 22 12 68 12 98 16 78 31 83 51 94 27 74 64

Core 0 Core 1 Core 2 Core 3

26 50 98 94

50

Page 31: Parallelization - XS4ALL Klantenservice

Domain DecompositionFind the largest element of an array

5 13 1 9 26 13 6 49 34 50 22 12 68 12 98 16 78 31 83 51 94 27 74 64

Core 0 Core 1 Core 2 Core 3

26 50 98 94

98

Page 32: Parallelization - XS4ALL Klantenservice

Domain DecompositionFind the largest element of an array

5 13 1 9 26 13 6 49 34 50 22 12 68 12 98 16 78 31 83 51 94 27 74 64

Core 0 Core 1 Core 2 Core 3

26 50 98 94

98
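A minimal OpenMP sketch of this reduction strategy (my own illustration, not code from the slides; the max reduction needs OpenMP 3.1 or newer): each thread scans its block of the array and the partial maxima are combined at the end.

  #include <stdio.h>

  #define N 24

  int main(void)
  {
      int a[N] = {  5, 13,  1,  9, 26, 13,  6, 49, 34, 50, 22, 12,
                   68, 12, 98, 16, 78, 31, 83, 51, 94, 27, 74, 64 };
      int i, vmax = a[0];

      /* each thread keeps a private running maximum of its part of the array;
         OpenMP combines the partial maxima at the end of the loop */
      #pragma omp parallel for reduction(max:vmax)
      for (i = 1; i < N; i++)
          if (a[i] > vmax) vmax = a[i];

      printf("largest element = %d\n", vmax);   /* prints 98 */
      return 0;
  }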

Page 33: Parallelization - XS4ALL Klantenservice

Master-Slave

!33

Get a heap of work done

(Figure: Core 0 acts as master and hands out tasks T to the slave cores 1, 2 and 3.)

Page 34: Parallelization - XS4ALL Klantenservice

• The problem is decomposed according to the work that must be done. Each task then performs a portion of the overall work.

• Divide the computation based on a natural set of independent tasks
  ‣ Assign data to each task as needed

• Example: a seismic data pre-processing pipeline
  • static correction
  • deconvolution
  • NMO correction
  • stacking
  • ....

Functional Decomposition

Page 35: Parallelization - XS4ALL Klantenservice

Task/Functional Decomposition

(Figure: dependency graph of the tasks f(), g(), h(), q(), r(), s().)

Page 36: Parallelization - XS4ALL Klantenservice

Task Decomposition

(Figure: the tasks f(), g(), h(), q(), r(), s() from the dependency graph are assigned to Core 0, Core 1 and Core 2.)

Page 37: Parallelization - XS4ALL Klantenservice


Page 38: Parallelization - XS4ALL Klantenservice


Page 39: Parallelization - XS4ALL Klantenservice


Page 40: Parallelization - XS4ALL Klantenservice


Page 41: Parallelization - XS4ALL Klantenservice

Shared Memory Parallelism

• Introduction to Threads
  • Exercise: Race condition

• OpenMP Programming Model
  • Scope of Variables: Exercise 1
  • Synchronisation: Exercise 2

• Scheduling
  • Exercise: OpenMP scheduling

• Reduction
  • Exercise: Pi

• Shared variables
  • Exercise: CacheTrash

• Tasks
• Future of OpenMP

!41

Page 42: Parallelization - XS4ALL Klantenservice

Threads: “processes” sharing memory

• Process == address space • Thread == program counter / stream of instructions • Two examples

• Three processes, each with one thread • One process with three threads

!42

(Figure: system space with the kernel; user space with three processes that have one thread each, and one process with three threads.)

Page 43: Parallelization - XS4ALL Klantenservice

The Shared-Memory Model

(Figure: several cores, each with its own private memory, all connected to a common shared memory.)

Page 44: Parallelization - XS4ALL Klantenservice

What Are Threads Good For?

Making programs easier to understand

Overlapping computation and I/O

Improving responsiveness of GUIs

Improving performance through parallel execution ‣ with the help of OpenMP

Page 45: Parallelization - XS4ALL Klantenservice

Fork/Join Programming Model

• When program begins execution, only master thread active

• Master thread executes sequential portions of program

• For parallel portions of program, master thread forks (creates or awakens) additional threads

• At join (end of parallel section of code), extra threads are suspended or die

Page 46: Parallelization - XS4ALL Klantenservice

Relating Fork/Join to Code (OpenMP)

(Figure: the program alternates between sequential code, a parallel for loop, sequential code, another parallel for loop, and sequential code again.)
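A minimal C sketch (my own, not from the slides) of how the fork/join picture maps onto OpenMP code: the master thread runs the sequential parts, and each parallel for region forks a team of threads that joins again at its end.

  #include <stdio.h>

  #define N 1000

  int main(void)
  {
      double a[N], b[N];
      int i;

      /* sequential code: master thread only */
      for (i = 0; i < N; i++) a[i] = i;

      /* parallel code: a team of threads is forked, loop iterations are shared */
      #pragma omp parallel for
      for (i = 0; i < N; i++) b[i] = 2.0 * a[i];

      /* implicit join: back to the master thread */
      printf("b[%d] = %f\n", N - 1, b[N - 1]);

      /* second parallel region: fork again */
      #pragma omp parallel for
      for (i = 0; i < N; i++) a[i] = a[i] + b[i];

      /* sequential code again */
      printf("a[%d] = %f\n", N - 1, a[N - 1]);
      return 0;
  }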

Page 47: Parallelization - XS4ALL Klantenservice

Domain Decomposition Using Threads

(Figure: threads 0, 1 and 2 each run f() on their own part of the data in shared memory.)

Page 48: Parallelization - XS4ALL Klantenservice

Task Decomposition Using Threads

(Figure: thread 0 and thread 1 execute the tasks e(), f(), g() and h(), all operating on shared memory.)

Page 49: Parallelization - XS4ALL Klantenservice

Shared versus Private Variables

(Figure: each thread has its own private variables, and both threads can access the shared variables.)

Page 50: Parallelization - XS4ALL Klantenservice

Parallel threads can “race” against each other to update resources

Race conditions occur when execution order is assumed but not guaranteed

Example: un-synchronised access to bank account

Race Conditions

Deposits $100 into account

Withdraws $100 from account

Initial balance = $1000 Final balance = ?

Page 51: Parallelization - XS4ALL Klantenservice

Race Conditions

Deposits $100 into account

Withdraws $100 from account

Initial balance = $1000 Final balance = ?

Time   Withdrawal                   Deposit
T0     Load (balance = $1000)
T1     Subtract $100                Load (balance = $1000)
T2     Store (balance = $900)       Add $100
T3                                  Store (balance = $1100)
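A small OpenMP sketch of the same interleaving problem (my own illustration): two threads update the balance without synchronization, so the load/modify/store sequences can overlap exactly as in the table above.

  #include <stdio.h>

  int main(void)
  {
      int balance = 1000;

      /* two "transactions" race on the shared balance */
      #pragma omp parallel sections shared(balance)
      {
          #pragma omp section
          balance = balance - 100;   /* withdrawal: load, subtract, store */

          #pragma omp section
          balance = balance + 100;   /* deposit: load, add, store */
      }
      /* expected 1000, but 900 or 1100 are possible if the updates interleave */
      printf("final balance = %d\n", balance);
      return 0;
  }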

Page 52: Parallelization - XS4ALL Klantenservice

Code Example in OpenMP exercise: RaceCondition

for (i=0; i<NMAX; i++) {
  a[i] = 1; b[i] = 2;
}
#pragma omp parallel for shared(a,b)
for (i=0; i<12; i++) {
  a[i+1] = a[i]+b[i];
}

1 thread:  a= 1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0, 17.0, 19.0, 21.0, 23.0
4 threads: a= 1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0,  3.0,  5.0,  7.0,  9.0, 11.0
4 threads: a= 1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0, 17.0, 19.0, 21.0, 23.0
4 threads: a= 1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0, 17.0, 19.0, 21.0, 23.0

!52

Page 53: Parallelization - XS4ALL Klantenservice

Code Example in OpenMP

thread   computation
0        a[1]  = a[0]  + b[0]
0        a[2]  = a[1]  + b[1]
0        a[3]  = a[2]  + b[2]    <--| Problem
1        a[4]  = a[3]  + b[3]    <--| Problem
1        a[5]  = a[4]  + b[4]
1        a[6]  = a[5]  + b[5]    <--| Problem
2        a[7]  = a[6]  + b[6]    <--| Problem
2        a[8]  = a[7]  + b[7]
2        a[9]  = a[8]  + b[8]    <--| Problem
3        a[10] = a[9]  + b[9]    <--| Problem
3        a[11] = a[10] + b[10]

!53

Page 54: Parallelization - XS4ALL Klantenservice

How to Avoid Data Races

• Scope variables to be local to threads
  • Variables declared within threaded functions
  • Allocate on the thread's stack
  • TLS (Thread Local Storage)

• Control shared access with critical regions
  • Mutual exclusion and synchronization
  • Lock, semaphore, event, critical section, mutex, ...

Page 55: Parallelization - XS4ALL Klantenservice

Examples: variables

!55

Page 56: Parallelization - XS4ALL Klantenservice

Domain Decomposition

Sequential Code:

int a[1000], i; for (i = 0; i < 1000; i++) a[i] = foo(i);

Page 57: Parallelization - XS4ALL Klantenservice

Domain Decomposition

Sequential Code:

int a[1000], i; for (i = 0; i < 1000; i++) a[i] = foo(i);

Thread 0: for (i = 0; i < 500; i++) a[i] = foo(i);

Thread 1: for (i = 500; i < 1000; i++) a[i] = foo(i);

Page 58: Parallelization - XS4ALL Klantenservice

Domain Decomposition

Sequential Code:

int a[1000], i; for (i = 0; i < 1000; i++) a[i] = foo(i);

Thread 0: for (i = 0; i < 500; i++) a[i] = foo(i);

Thread 1: for (i = 500; i < 1000; i++) a[i] = foo(i);

Shared: a    Private: i (each thread has its own loop index)

Page 59: Parallelization - XS4ALL Klantenservice

Task Decompositionint e;

main () { int x[10], j, k, m; j = f(k); m = g(k); ... }

int f(int *x, int k) { int a; a = e * x[k] * x[k]; return a; }

int g(int *x, int k) { int a; k = k-1; a = e / x[k]; return a; }

Page 60: Parallelization - XS4ALL Klantenservice

Task Decompositionint e;

main () { int x[10], j, k, m; j = f(k); m = g(k); }

int f(int *x, int k) { int a; a = e * x[k] * x[k]; return a; }

int g(int *x, int k) { int a; k = k-1; a = e / x[k]; return a; }

Thread 0

Thread 1

Page 61: Parallelization - XS4ALL Klantenservice

Task Decompositionint e;

main () { int x[10], j, k, m; j = f(k); m = g(k); }

int f(int *x, int k) { int a; a = e * x[k] * x[k]; return a; }

int g(int *x, int k) { int a; k = k-1; a = e / x[k]; return a; }

Thread 0

Thread 1

Static variable: Shared

Page 62: Parallelization - XS4ALL Klantenservice

Task Decompositionint e;

main () { int x[10], j, k, m; j = f(x, k); m = g(x, k); }

int f(int *x, int k) { int a; a = e * x[k] * x[k]; return a; }

int g(int *x, int k) { int a; k = k-1; a = e / x[k]; return a; }

Thread 0

Thread 1

Heap variable: Shared

Page 63: Parallelization - XS4ALL Klantenservice

Task Decompositionint e;

main () { int x[10], j, k, m; j = f(k); m = g(k); }

int f(int *x, int k) { int a; a = e * x[k] * x[k]; return a; }

int g(int *x, int k) { int a; k = k-1; a = e / x[k]; return a; }

Thread 0

Thread 1

Function’s local variables: Private

Page 64: Parallelization - XS4ALL Klantenservice

Shared and Private Variables

• Shared variables • Static variables • Heap variables • Contents of run-time stack at time of call

• Private variables • Loop index variables • Run-time stack of functions invoked by thread

Page 65: Parallelization - XS4ALL Klantenservice

!65

Page 66: Parallelization - XS4ALL Klantenservice


What Is OpenMP?

• Compiler directives for multithreaded programming

• Easy to create threaded Fortran and C/C++ codes

• Supports data parallelism model

• Portable and Standard

• Incremental parallelism ➡Combines serial and parallel code in single source

Page 67: Parallelization - XS4ALL Klantenservice

OpenMP is not ...

Not automatic parallelization

– User explicitly specifies parallel execution

– Compiler does not ignore user directives, even if they are wrong

Not just loop-level parallelism

– Functionality to enable general parallelism

Not a new language

– Structured as extensions to the base language

– Minimal functionality with opportunities for extension

!67

Page 68: Parallelization - XS4ALL Klantenservice

Directive based

• Directives are special comments in the language

– Fortran fixed form: !$OMP, C$OMP, *$OMP

– Fortran free form: !$OMP

Special comments are interpreted by OpenMP compilers

w = 1.0/n

sum = 0.0

!$OMP PARALLEL DO PRIVATE(x) REDUCTION(+:sum)

do I=1,n

x = w*(I-0.5)

sum = sum + f(x)

end do

pi = w*sum

print *,pi

end

!68

Comment in Fortran but interpreted by OpenMP compilers

Page 69: Parallelization - XS4ALL Klantenservice

C example

#pragma omp directives in C

– Ignored by non-OpenMP compilers

w = 1.0/n;
sum = 0.0;
#pragma omp parallel for private(x) reduction(+:sum)
for (i=0; i<n; i++) {
  x = w*((double)i+0.5);
  sum += f(x);
}
pi = w*sum;
printf("pi=%g\n", pi);

!69

Page 70: Parallelization - XS4ALL Klantenservice

Architecture of OpenMP

Directives, Pragmas
  • Control structures
  • Work sharing
  • Synchronization
  • Data scope attributes
    • private
    • shared
    • reduction
  • Orphaning

Runtime library routines
  • Control & query routines
    • number of threads
    • throughput mode
    • nested parallelism
  • Lock API

Environment variables
  • Control runtime
    • schedule type
    • max threads
    • nested parallelism
    • throughput mode

!70

Page 71: Parallelization - XS4ALL Klantenservice

Programming Model

• Fork-join parallelism:
  ‣ The master thread spawns a team of threads as needed
  ‣ Parallelism is added incrementally: the sequential program evolves into a parallel program

(Figure: the master thread forks a team of threads for each parallel region and continues alone in between.)

!71

Page 72: Parallelization - XS4ALL Klantenservice

Threads are assigned an independent set of iterations

Threads must wait at the end of work-sharing construct

(Figure: #pragma omp parallel forks three threads T; #pragma omp for distributes the iterations i = 0..11 over them; an implicit barrier follows the loop.)

Work-sharing Construct

#pragma omp parallel
#pragma omp for
for (i = 0; i < 12; i++)
    c[i] = a[i] + b[i];

!72

Page 73: Parallelization - XS4ALL Klantenservice

Combining pragmas

These two code segments are equivalent

#pragma omp parallel
{
  #pragma omp for
  for (i=0; i<MAX; i++) {
    res[i] = huge();
  }
}

#pragma omp parallel for
for (i=0; i<MAX; i++) {
  res[i] = huge();
}

!73

Page 74: Parallelization - XS4ALL Klantenservice


Example - Matrix times vector

(Figure: a = B * c with two threads; TID 0 computes rows i = 0..4 and TID 1 computes rows i = 5..9; for each row i the thread computes sum = Σ_j b[i][j]*c[j] and stores a[i] = sum.)

Matrix-vector example

!74
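A cleaned-up C sketch of that loop (my own reconstruction, assuming the usual row-wise distribution and a double-pointer matrix; not verbatim from the IWOMP slide): the outer loop over the rows i is shared among the threads, and each thread computes complete elements of a.

  /* a[i] = sum over j of b[i][j] * c[j], rows distributed over the threads */
  void matvec(int m, int n, double *a, double **b, double *c)
  {
      int i, j;

      #pragma omp parallel for private(j)
      for (i = 0; i < m; i++) {
          double sum = 0.0;              /* private: declared inside the loop */
          for (j = 0; j < n; j++)
              sum += b[i][j] * c[j];
          a[i] = sum;
      }
  }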

Page 75: Parallelization - XS4ALL Klantenservice


OpenMP Performance Example

(Figure: performance in Mflop/s versus memory footprint in KByte for the matrix-vector product; the OpenMP version scales for large matrices, but performance degrades when the matrix is too small.*)

*) With the IF-clause in OpenMP this performance degradation can be avoided

Performance is matrix size dependent

!75

Page 76: Parallelization - XS4ALL Klantenservice

OpenMP: EXERCISE_1

• Matrix-matrix multiplication

• Learn to use the compiler and insert OpenMP directives

• Modern compilers can be very ‘smart’ in optimising code

!76

Page 77: Parallelization - XS4ALL Klantenservice

OpenMP parallelization

• OpenMP Team := Master + Workers
• A Parallel Region is a block of code executed by all threads simultaneously
  • The master thread always has thread ID 0
  • Thread adjustment (if enabled) is only done before entering a parallel region
  • Parallel regions can be nested, but support for this is implementation dependent
  • An "if" clause can be used to guard the parallel region; in case the condition evaluates to "false", the code is executed serially
• A work-sharing construct divides the execution of the enclosed code region among the members of the team; in other words: they split up the work

!77

Page 78: Parallelization - XS4ALL Klantenservice

Data Environment

• OpenMP uses a shared-memory programming model

• Most variables are shared by default.

• Global variables are shared among threads (C/C++: file-scope variables, static variables)

• Not everything is shared, there is often a need for “local” data as well

�78

Page 79: Parallelization - XS4ALL Klantenservice

... not everything is shared...

• Stack variables in functions called from parallel regions are PRIVATE

• Automatic variables within a statement block are PRIVATE

• Loop index variables are private (with exceptions)
  • C/C++: the first loop index variable in nested loops following a #pragma omp for

Data Environment

�79

Page 80: Parallelization - XS4ALL Klantenservice

About Variables in SMP

• Shared variables: can be accessed by every thread. Independent read/write operations can take place.

• Private variables: every thread has its own copy of the variables, which are created/destroyed upon entering/leaving the procedure. They are not visible to other threads.

!80

serial code     global     auto (local)    static           dynamic
parallel code   shared     local           use with care    use with care

Page 81: Parallelization - XS4ALL Klantenservice

Data Scope attribute clauses

• default(shared)
• shared(varname, ...)
• private(varname, ...)

!81

Page 82: Parallelization - XS4ALL Klantenservice

The Private Clause

Reproduces the variable for each thread
• Variables are un-initialised; a C++ object is default-constructed
• Any value external to the parallel region is undefined

void work(float *c, int N)
{
  float x, y;
  int i;
  #pragma omp parallel for private(x,y)
  for (i=0; i<N; i++) {
    x = a[i];          /* a[] and b[] assumed to be global arrays here */
    y = b[i];
    c[i] = x + y;
  }
}

�82

Page 83: Parallelization - XS4ALL Klantenservice

Synchronization

• Barriers

• Critical sections

• Lock library routines

!83

#pragma omp barrier

#pragma omp critical [(name)]

omp_set_lock(omp_lock_t *lock)

omp_unset_lock(omp_lock_t *lock)

....

Page 84: Parallelization - XS4ALL Klantenservice

#pragma omp critical [(lock_name)]

Defines a critical region on a structured block

OpenMP Critical Construct

float R1, R2;
#pragma omp parallel
{
  float A, B;
  #pragma omp for
  for (int i=0; i<niters; i++) {
    B = big_job(i);
    #pragma omp critical (R1_lock)
    consum(B, &R1);
    A = bigger_job(i);
    #pragma omp critical (R2_lock)
    consum(A, &R2);
  }
}

All threads execute the code, but only one at a time executes a critical region; only one thread at a time calls consum(), thereby protecting R1 and R2 from race conditions. Naming the critical constructs is optional, but may increase performance.

!84

Page 85: Parallelization - XS4ALL Klantenservice

OpenMP Critical

!85


OpenMP critical directive: Explicit Synchronization

• Race conditions can be avoided by controlling access to shared variables by allowing threads to have exclusive access to the variables

• Exclusive access to shared variables allows the thread to atomically perform read, modify and update operations on the variable.

• Mutual exclusion synchronization is provided by the critical directive of OpenMP

• Code block within the critical region defined by critical /end critical directives can be executed only by one thread at a time.

• Other threads in the group must wait until the current thread exits the critical region. Thus only one thread can manipulate values in the critical region.

(Figure: the threads fork, each passes one at a time through the critical region, then they join.)

int x;
x = 0;
#pragma omp parallel shared(x)
{
  #pragma omp critical
  x = 2*x + 1;
} /* omp end parallel */

All threads execute the code, but only one at a time executes the critical region.

Page 86: Parallelization - XS4ALL Klantenservice


Simple Example: critical

cnt = 0;
f = 7;
#pragma omp parallel
{
  #pragma omp for
  for (i=0; i<20; i++) {
    if (b[i] == 0) {
      #pragma omp critical
      cnt++;
    } /* end if */
    a[i] = b[i] + f*(i+1);
  } /* end for */
} /* omp end parallel */

(Figure: four threads take the iteration blocks i=0..4, 5..9, 10..14 and 15..19; each evaluates the if, the cnt++ updates pass one at a time through the critical region, while the a[i]=b[i]+... statements run in parallel.)

Critical Example 1

!86

Page 87: Parallelization - XS4ALL Klantenservice

Critical Example 2

int i;

#pragma omp parallel for
for (i = 0; i < 100; i++) {
  s = s + a[i];
}

!87

Page 88: Parallelization - XS4ALL Klantenservice


Synchronization (2/4)

Pseudo-code, here with 4 threads: the sequential loop

  do i = 0, 99
    s = s + a(i)
  end do

is split into four loops

  do i = 0, 24     do i = 25, 49     do i = 50, 74     do i = 75, 99
    s = s + a(i)     s = s + a(i)      s = s + a(i)      s = s + a(i)
  end do           end do            end do            end do

that all read A(0) ... A(99) and update the shared variable S in memory.

Critical Example 2

!88

Page 89: Parallelization - XS4ALL Klantenservice

OpenMP Single Construct

• Only one thread in the team executes the enclosed code

• The format is:

  #pragma omp single [nowait][clause, ...]
  {
    "block"
  }

• The supported clauses on the single directive are:
    private (list)
    firstprivate (list)

• NOWAIT: the other threads will not wait at the end of the single directive

!89

Page 90: Parallelization - XS4ALL Klantenservice

OpenMP Master directive

• All threads but the master skip the enclosed section of code and continue

• There is no implicit barrier on entry or exit!

  #pragma omp master
  {
    "code"
  }

!90

• A barrier makes each thread wait until all others in the team have reached this point:

  #pragma omp barrier

Page 91: Parallelization - XS4ALL Klantenservice

#pragma omp parallel
{
  ....
  #pragma omp single [nowait]
  {
    ....
  }
  #pragma omp master
  {
    ....
  }
  ....
  #pragma omp barrier
}

Work Sharing: Single Master

!91

Page 92: Parallelization - XS4ALL Klantenservice

Single processor

!92


(Figure: over time, only one thread executes the single-processor region; the other threads wait if there is a barrier at its end.)

Page 93: Parallelization - XS4ALL Klantenservice

OpenMP: EXERCISE_2

• Dot product

• Code is parallel, but something is wrong….

• Try to fix the code and maintain good performance

!93

Page 94: Parallelization - XS4ALL Klantenservice

Work Sharing: Orphaning

• Worksharing constructs may be outside lexical scope of the parallel region

!94

#pragma omp parallel
{
  ....
  dowork( )
  ....
}

void dowork( )
{
  #pragma omp for
  for (i=0; i<n; i++) {
    ....
  }
}

Page 95: Parallelization - XS4ALL Klantenservice

Scheduling the work

• schedule ( static | dynamic | guided | auto [, chunk] )
  schedule ( runtime )

static [, chunk]
  • Distribute iterations in blocks of size "chunk" over the threads in a round-robin fashion
  • In absence of "chunk", each thread executes one block of approx. N/P iterations for a loop of length N and P threads

!95


The schedule clause: example static schedule for a loop of length 16 on 4 threads

  Thread       0          1          2          3
  static       1-4        5-8        9-12       13-16
  static,2     1-2, 9-10  3-4, 11-12 5-6, 13-14 7-8, 15-16

*) The precise distribution is implementation defined

Page 96: Parallelization - XS4ALL Klantenservice

!96

• dynamic [, chunk]
  • Fixed portions of work; the size is controlled by the value of chunk
  • When a thread finishes, it starts on the next portion of work

• guided [, chunk]
  • Same dynamic behaviour as "dynamic", but the size of the portion of work decreases exponentially

• runtime
  • The iteration scheduling scheme is set at runtime through the environment variable OMP_SCHEDULE
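A short sketch (my own) of how the schedule clause is attached to a loop; expensive_work() and the arrays are placeholders, and the same loops could also use schedule(runtime) and be controlled through OMP_SCHEDULE.

  #include <math.h>

  #define N 500

  /* placeholder for an iteration whose cost varies strongly with i */
  static double expensive_work(int i) { return sin((double) i) * i; }

  void schedule_demo(double *a, double *b, double *c, double *result)
  {
      int i;

      /* irregular work: dynamic scheduling with chunks of 5 keeps the threads busy */
      #pragma omp parallel for schedule(dynamic, 5)
      for (i = 0; i < N; i++)
          result[i] = expensive_work(i);

      /* uniform work: static scheduling has the lowest overhead */
      #pragma omp parallel for schedule(static)
      for (i = 0; i < N; i++)
          c[i] = a[i] + b[i];
  }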

Page 97: Parallelization - XS4ALL Klantenservice


The experiment

(Figure: which thread (ID 0-3) executes which iteration of a 500-iteration loop, for schedule static, dynamic,5 and guided,5; static gives four contiguous blocks, dynamic,5 interleaves small chunks, and guided,5 starts with large chunks that shrink.)

500 iterations on 4 threads: example scheduling

!97

Page 98: Parallelization - XS4ALL Klantenservice

Environment Variables

• The names of the OpenMP environment variables must be UPPERCASE

• The values assigned to them are case insensitive

!98

OMP_NUM_THREADS

OMP_SCHEDULE “schedule [chunk]”

OMP_NESTED { TRUE | FALSE }

Page 99: Parallelization - XS4ALL Klantenservice

Exercise: OpenMP scheduling

• Download code from: http://www.xs4all.nl/~janth/HPCourse/OMP_schedule.tar

• Two loops
• Parallel code with omp sections
• Check what the auto-parallelisation of the compiler has done
• Insert OpenMP directives to try out different scheduling strategies
  • c$omp& schedule(runtime)
  • export OMP_SCHEDULE="static,10"
  • export OMP_SCHEDULE="guided,100"
  • export OMP_SCHEDULE="dynamic,1"

!99

Page 100: Parallelization - XS4ALL Klantenservice

OpenMP Reduction Clause

reduction (op : list)

The variables in "list" must be shared in the enclosing parallel region.

Inside a parallel or work-sharing construct:
  ‣ A PRIVATE copy of each list variable is created and initialized depending on the "op"
  ‣ These copies are updated locally by the threads
  ‣ At the end of the construct, the local copies are combined through "op" into a single value and combined with the value in the original SHARED variable

Page 101: Parallelization - XS4ALL Klantenservice


OpenMP: Reduction

• Performs a reduction on the shared variables in "list" based on the operator provided.
• For C/C++ the operator can be any one of: +, *, -, ^, |, ||, & or &&
• At the end of a reduction, the shared variable contains the result obtained upon combination of the list of variables processed using the specified operator.

sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (i=0; i < 20; i++)
    sum = sum + (a[i] * b[i]);

(Figure: sum is initialised to 0; each thread works on iterations i=0..4, 5..9, 10..14 or 15..19 with a local copy of sum; at the end all local copies are added together and stored in the "global" variable.)

Reduction Example

#pragma omp parallel for reduction(+:sum)
for (i=0; i<N; i++) {
    sum += a[i] * b[i];
}

!101

Page 102: Parallelization - XS4ALL Klantenservice

A range of associative and commutative operators can be used with reduction. The initial values are the ones that make sense.

C/C++ Reduction Operations

  Operator   Initial Value        Operator   Initial Value
  +          0                    &          ~0
  *          1                    |          0
  -          0                    &&         1
  ^          0                    ||         0

FORTRAN:
  intrinsic is one of MAX, MIN, IAND, IOR, IEOR
  operator is one of +, *, -, .AND., .OR., .EQV., .NEQV.

!102

Page 103: Parallelization - XS4ALL Klantenservice

Numerical Integration Example

(Figure: the area under f(x) = 4/(1+x^2) between x = 0 and x = 1 is approximated by rectangles.)

Integral from 0 to 1 of 4/(1+x^2) dx = pi

static long num_steps = 100000;
double step, pi;

void main()
{
  int i;
  double x, sum = 0.0;

  step = 1.0/(double) num_steps;
  for (i=0; i<num_steps; i++) {
    x = (i+0.5)*step;
    sum = sum + 4.0/(1.0 + x*x);
  }
  pi = step * sum;
  printf("Pi = %f\n", pi);
}

!103

Page 104: Parallelization - XS4ALL Klantenservice

static long num_steps = 100000;
double step, pi;

void main()
{
  int i;
  double x, sum = 0.0;

  step = 1.0/(double) num_steps;
  for (i=0; i<num_steps; i++) {
    x = (i+0.5)*step;
    sum = sum + 4.0/(1.0 + x*x);
  }
  pi = step * sum;
  printf("Pi = %f\n", pi);
}

Numerical Integration to Compute Pi

Parallelize the numerical integration code using OpenMP

What variables can be shared?                      step, num_steps
What variables need to be private?                 x, i
What variables should be set up for reductions?    sum

!104

Page 105: Parallelization - XS4ALL Klantenservice

Solution to Computing Pi

static long num_steps = 100000;
double step, pi;

void main()
{
  int i;
  double x, sum = 0.0;

  step = 1.0/(double) num_steps;
  #pragma omp parallel for private(x) reduction(+:sum)
  for (i=0; i<num_steps; i++) {
    x = (i+0.5)*step;
    sum = sum + 4.0/(1.0 + x*x);
  }
  pi = step * sum;
  printf("Pi = %f\n", pi);
}

!105

Page 106: Parallelization - XS4ALL Klantenservice

Let’s try it out

• Go to example MPI_pi.tar and work with openmp_pi2.c

!106

Page 107: Parallelization - XS4ALL Klantenservice

Exercise: PI with MPI and OpenMP

!107

cores OpenMP

1 9.617728

2 4.874539

4 2.455036

6 1.627149

8 1.214713

12 0.820746

16 0.616482

Page 108: Parallelization - XS4ALL Klantenservice

(Figure: Pi scaling; speed-up versus number of CPUs (1-16) for the OpenMP version on a 2.2 GHz Barcelona, compared with the Amdahl curve for P = 1.0.)

Exercise: PI with MPI and OpenMP

!108

Page 109: Parallelization - XS4ALL Klantenservice

CUDA: computing PI

!109

Page 110: Parallelization - XS4ALL Klantenservice

Exercise: Shared Cache Trashing

• Let’s do the exercise: CacheTrash

!110

Page 111: Parallelization - XS4ALL Klantenservice

About local and shared data

• Consider the following example:

• Let’s assume we run this on 2 processors:

• processor 1 for i=0,2,4,6,8 • processor 2 for i=1,3,5,7,9

!111

for (i=0; i<10; i++){ a[i] = b[i] + c[i];}

Page 112: Parallelization - XS4ALL Klantenservice

About local and shared data

!112

Processor 1:                        Processor 2:
  for (i1=0,2,4,6,8){                 for (i2=1,3,5,7,9){
    a[i1] = b[i1] + c[i1];              a[i2] = b[i2] + c[i2];
  }                                   }

(Figure: i1 and i2 live in the private areas of processor 1 and 2; the arrays A, B and C live in the shared area.)

Page 113: Parallelization - XS4ALL Klantenservice

About local and shared data

processor 1 for i=0,2,4,6,8 processor 2 for i=1,3,5,7,9

• This is not an efficient way to do this!

Why?

!113

Page 114: Parallelization - XS4ALL Klantenservice

Doing it the bad way

• Because of cache line usage

• b[] and c[]: we use half of the data

• a[]: false sharing

!114

for (i=0; i<10; i++){ a[i] = b[i] + c[i];}

Page 115: Parallelization - XS4ALL Klantenservice

False sharing and scalability

• The Cause: Updates on independent data elements that happen to be part of the same cache line.

• The Impact: Non-scalable parallel applications

• The Remedy: False sharing is often quite simple to solve

!115
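One common remedy (a sketch of my own, not from the slides) is to pad per-thread data so that each thread's element lives on its own cache line; the 64-byte line size and the 16-thread limit below are assumptions.

  #include <omp.h>

  #define MAX_THREADS 16
  #define CACHE_LINE  64            /* assumed cache-line size in bytes */

  /* one counter per thread, each padded out to a full cache line */
  struct padded_counter {
      long value;
      char pad[CACHE_LINE - sizeof(long)];
  };

  long count_events(int n, const int *events)
  {
      struct padded_counter cnt[MAX_THREADS] = { {0} };
      long total = 0;
      int i, t;

      #pragma omp parallel
      {
          int tid = omp_get_thread_num();   /* assumes at most MAX_THREADS threads */
          #pragma omp for
          for (i = 0; i < n; i++)
              if (events[i]) cnt[tid].value++;   /* updates no longer share a line */
      }
      for (t = 0; t < MAX_THREADS; t++) total += cnt[t].value;
      return total;
  }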

Page 116: Parallelization - XS4ALL Klantenservice

Poor cache line utilization

!116

(Figure: both processors read the same cache lines of array B, elements B(0)..B(9), but each uses only half of the data in every line; the same holds for array C.)

Page 117: Parallelization - XS4ALL Klantenservice

False Sharing

!117

Processor 1                                      Processor 2

a[0] = b[0] + c[0];
  writes into the line containing a[0];
  this marks the cache line as 'dirty'
                                                 a[1] = b[1] + c[1];
                                                   detects that the line with a[0] is 'dirty',
                                                   gets a fresh copy (from processor 1),
                                                   writes into the line containing a[1];
                                                   this marks the cache line as 'dirty'
a[2] = b[2] + c[2];
  detects that the line is 'dirty',
  gets a fresh copy (from processor 2),
  writes into the line containing a[2];
  this marks the cache line as 'dirty'
                                                 a[3] = b[3] + c[3];
                                                   detects that the line is 'dirty', and so on:
                                                   the cache line ping-pongs between the
                                                   processors (time runs downward)

Page 118: Parallelization - XS4ALL Klantenservice

False Sharing results

!118

(Figure: measured time in seconds (3-10 s) versus the number of iterations per thread (1 ... 256K) for 1 to 10 threads.)

Page 119: Parallelization - XS4ALL Klantenservice

OpenMP tasks

• What are tasks?
  • Tasks are independent units of work
  • Threads are assigned to perform the work of each task
    - Tasks may be deferred
    - Tasks may be executed immediately
    - The runtime system decides which of the above

• Why tasks?
  • The basic idea is to set up a task queue: when a thread encounters a task directive, it arranges for some thread to execute the associated block at some later time. The first thread can continue.

!119

Page 120: Parallelization - XS4ALL Klantenservice

!120


The Tasking Example

The developer specifies tasks in the application; the run-time system executes them: the encountering thread adds task(s) to a pool, and the threads of the team execute tasks from the pool.

Page 121: Parallelization - XS4ALL Klantenservice

OpenMP tasks

Tasks make it possible to parallelize irregular problems:
  – Unbounded loops
  – Recursive algorithms
  – Manager/worker schemes

A task has:
  – Code to execute
  – A data environment (it owns its data)
  – Internal control variables
  – An assigned thread that executes the code on the data

!121

Page 122: Parallelization - XS4ALL Klantenservice

OpenMP has always had tasks, but they were not called "tasks":
  – A thread encountering a parallel construct, e.g. "for", packages up a set of implicit tasks, one per thread.
  – A team of threads is created.
  – Each thread is assigned to one of the tasks.
  – A barrier holds the master thread until all implicit tasks are finished.

!122

Page 123: Parallelization - XS4ALL Klantenservice

OpenMP tasks

#pragma omp parallel      -> a parallel region creates a team of threads
#pragma omp single        -> one thread enters the execution; the other
{                            threads wait at the end of the single
  ...
  #pragma omp task        -> the waiting threads pick up tasks
  {                          "from the work queue"
    ...
  }
  ...
  #pragma omp taskwait
}

!123
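A small sketch of this pattern (my own, OpenMP 3.0 style; process() and the list type are placeholders): one thread walks a linked list and creates a task per element, and the rest of the team picks the tasks up from the queue.

  #include <stdio.h>

  typedef struct node { int data; struct node *next; } node_t;

  /* placeholder for the work done on one element */
  static void process(node_t *p) { printf("%d\n", p->data); }

  void walk_list(node_t *head)
  {
      #pragma omp parallel          /* create the team */
      {
          #pragma omp single        /* one thread generates the tasks */
          {
              node_t *p;
              for (p = head; p != NULL; p = p->next) {
                  #pragma omp task firstprivate(p)
                  process(p);       /* any thread in the team may execute this */
              }
              #pragma omp taskwait  /* wait until all generated tasks have finished */
          }
      }                             /* implicit barrier at the end of the parallel region */
  }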

Page 124: Parallelization - XS4ALL Klantenservice

OpenMP: EXERCISE_4

• Jacobi method

• Use Intel's profiler to find out the most time-consuming parts (2)
  • amplxe-cl and amplxe-gui from Intel (VTune Amplifier)

• Parallelise the most time-consuming loop
  • directly on the loop itself (3)
  • making the parallel region larger and more efficient (4)

!124

Page 125: Parallelization - XS4ALL Klantenservice

Summary

• First tune single-processor performance

• Tuning parallel programs
  • Has the program been properly parallelized?
    • Is enough of the program parallelized (Amdahl's law)?
    • Is the load well-balanced?
  • Location of memory
    • Cache-friendly programs: no special placement needed
    • Non-cache-friendly programs
    • False sharing?
  • Use of OpenMP
    • try to avoid synchronization (barrier, critical, single, ordered)

!125

Page 126: Parallelization - XS4ALL Klantenservice


Plenty of Other OpenMP Stuff

Scheduling clauses

Atomic

Barrier

Master & Single

Sections

Tasks (OpenMP 3.0)

API routines

Page 127: Parallelization - XS4ALL Klantenservice

Compiling and running OpenMP

• Compile with -openmp flag (intel compiler) or -fopenmp (GNU)

• Run program with variable:

export OMP_NUM_THREADS=4

!127
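For example (a sketch only; pi.c is a placeholder file name, and newer Intel compilers use -qopenmp instead of -openmp):

  icc -openmp  pi.c -o pi        (Intel)
  gcc -fopenmp pi.c -o pi        (GNU)
  export OMP_NUM_THREADS=4
  ./pi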

Page 128: Parallelization - XS4ALL Klantenservice

OpenACC

• A new set of directives to support accelerators
  • GPUs
  • Intel's MIC
  • AMD Fusion processors

!128

Page 129: Parallelization - XS4ALL Klantenservice

OpenACC example

void convolution_SM_N(typeToUse A[M][N], typeToUse B[M][N]) {
    int i, j, k;
    int m = M, n = N;
    // OpenACC kernel region
    // Define a region of the program to be compiled into a sequence of kernels
    // for execution on the accelerator device
    #pragma acc kernels pcopyin(A[0:m]) pcopy(B[0:m])
    {
        typeToUse c11, c12, c13, c21, c22, c23, c31, c32, c33;
        c11 = +2.0f; c21 = +5.0f; c31 = -8.0f;
        c12 = -3.0f; c22 = +6.0f; c32 = -9.0f;
        c13 = +4.0f; c23 = +7.0f; c33 = +10.0f;
        // The OpenACC loop gang clause tells the compiler that the iterations of the loops
        // are to be executed in parallel across the gangs.
        // The argument specifies how many gangs to use to execute the iterations of this loop.
        #pragma acc loop gang(64)
        for (int i = 1; i < M-1; ++i) {
            // The OpenACC loop worker clause specifies that the iterations of the associated loop
            // are to be executed in parallel across the workers within the gangs created.
            // The argument specifies how many workers to use to execute the iterations of this loop.
            #pragma acc loop worker(128)
            for (int j = 1; j < N-1; ++j) {
                B[i][j] = c11*A[i-1][j-1] + c12*A[i+0][j-1] + c13*A[i+1][j-1]
                        + c21*A[i-1][j+0] + c22*A[i+0][j+0] + c23*A[i+1][j+0]
                        + c31*A[i-1][j+1] + c32*A[i+0][j+1] + c33*A[i+1][j+1];
            }
        }
    } // kernels region
}

!129

Page 130: Parallelization - XS4ALL Klantenservice

MPI

!130

Page 131: Parallelization - XS4ALL Klantenservice

Message Passing

• Point-to-Point
  • Requires explicit commands in the program: Send, Receive
  • Must be synchronized among different processors
    • Sends and Receives must match
    • Avoid deadlock: all processors waiting, none able to communicate

• Multi-processor communications
  • e.g. broadcast, reduce

Page 132: Parallelization - XS4ALL Klantenservice

!132

MPI advantages

• Mature and well understood
  • Backed by a widely-supported formal standard (1992)
  • Porting is "easy"

• Efficiently matches the hardware
  • Vendor and public implementations available

• User interface:
  • Efficient and simple
  • Buffer handling
  • Allows high-level abstractions

• Performance

Page 133: Parallelization - XS4ALL Klantenservice

!133

MPI disadvantages

• MPI 2.0 includes many features beyond message passing

• The execution control environment depends on the implementation

• Learning curve

Page 134: Parallelization - XS4ALL Klantenservice


Programming Model

• Explicit parallelism:
  ‣ All processes start at the same time at the same point in the code
  ‣ Full parallelism: there is no sequential part in the program

(Figure: all processes execute the whole program as one parallel region.)

Page 135: Parallelization - XS4ALL Klantenservice

Work Distribution

• All processors run the same executable.

• Parallel work distribution must be explicitly done by the programmer:
  • domain decomposition
  • master-slave

!135

Page 136: Parallelization - XS4ALL Klantenservice

!136

A Minimal MPI Program (C)

#include "mpi.h"#include <stdio.h>

int main( int argc, char *argv[] ){ MPI_Init( &argc, &argv ); printf( "Hello, world!\n" ); MPI_Finalize(); return 0;}

Page 137: Parallelization - XS4ALL Klantenservice

!137

A Minimal MPI Program (Fortran 90)

program main
use MPI
integer ierr

call MPI_INIT( ierr )
print *, 'Hello, world!'
call MPI_FINALIZE( ierr )
end

Page 138: Parallelization - XS4ALL Klantenservice

Starting the MPI Environment

• MPI_INIT( )
  Initializes the MPI environment. This function must be called and must be the first MPI function called in a program (exception: MPI_INITIALIZED).

  Syntax:
    C:       int MPI_Init( int *argc, char ***argv )
    Fortran: MPI_INIT( IERROR )
             INTEGER IERROR

  NOTE: both C and Fortran return error codes for all calls.

!138

Page 139: Parallelization - XS4ALL Klantenservice

Exiting the MPI Environment

• MPI_FINALIZE( )
  Cleans up all MPI state. Once this routine has been called, no MPI routine (even MPI_INIT) may be called.

  Syntax:
    C:       int MPI_Finalize( )
    Fortran: MPI_FINALIZE( IERROR )
             INTEGER IERROR

  You MUST call MPI_FINALIZE when you exit from an MPI program.

!139

Page 140: Parallelization - XS4ALL Klantenservice

C and Fortran Language Considerations

• Bindings

  – C
    • All MPI names have an MPI_ prefix
    • Defined constants are in all capital letters
    • Defined types and functions have one capital letter after the prefix; the remaining letters are lowercase

  – Fortran
    • All MPI names have an MPI_ prefix
    • No capitalization rules apply
    • The last argument is a returned error value

!140

Page 141: Parallelization - XS4ALL Klantenservice

!141

Finding Out About the Environment

• Two important questions that arise early in a parallel program are:
  • How many processes are participating in this computation?
  • Which one am I?

• MPI provides functions to answer these questions:
  – MPI_Comm_size reports the number of processes
  – MPI_Comm_rank reports the rank, a number between 0 and size-1, identifying the calling process

Page 142: Parallelization - XS4ALL Klantenservice

!142

Better Hello (C)

#include "mpi.h"#include <stdio.h>

int main( int argc, char *argv[] ){ int rank, size; MPI_Init( &argc, &argv ); MPI_Comm_rank( MPI_COMM_WORLD, &rank ); MPI_Comm_size( MPI_COMM_WORLD, &size ); printf( "I am %d of %d\n", rank, size ); MPI_Finalize(); return 0;}

Page 143: Parallelization - XS4ALL Klantenservice

!143

Better Hello (Fortran)

program main
use MPI
integer ierr, rank, size

call MPI_INIT( ierr )
call MPI_COMM_RANK( MPI_COMM_WORLD, rank, ierr )
call MPI_COMM_SIZE( MPI_COMM_WORLD, size, ierr )
print *, 'I am ', rank, ' of ', size
call MPI_FINALIZE( ierr )
end

Page 144: Parallelization - XS4ALL Klantenservice

!144

Some Basic Concepts

• Processes can be collected into groups.
• Each message is sent in a context, and must be received in the same context.
• A group and a context together form a communicator.
• A process is identified by its rank in the group associated with a communicator.
• There is a default communicator whose group contains all initial processes, called MPI_COMM_WORLD.


MPI Communicators

• A communicator is an internal object
• MPI programs are made up of communicating processes
• Each process has its own address space containing its own attributes such as rank, size (and argc, argv, etc.)
• MPI provides functions to interact with it
• The default communicator is MPI_COMM_WORLD
  – All processes are its members
  – It has a size (the number of processes)
  – Each process has a rank within it
  – One can think of it as an ordered list of processes
• Additional communicator(s) can co-exist
• A process can belong to more than one communicator
• Within a communicator, each process has a unique rank

(Figure: MPI_COMM_WORLD containing processes with ranks 0-7.)

Page 145: Parallelization - XS4ALL Klantenservice

Communicator

• Communication in MPI takes place with respect to communicators

• MPI_COMM_WORLD is one such predefined communicator (something of type "MPI_Comm") and contains group and context information

• MPI_COMM_RANK and MPI_COMM_SIZE return information based on the communicator passed in as the first argument

• Processes may belong to many different communicators

(Figure: MPI_COMM_WORLD with ranks 0 1 2 3 4 5 6 7.)

!145

Page 146: Parallelization - XS4ALL Klantenservice

MPI Basic Send/Receive

• Basic message-passing process: send data from one process to another

• Questions
  – To whom is the data sent?
  – Where is the data?
  – What type of data is sent?
  – How much data is sent?
  – How does the receiver identify it?

(Figure: process 0 sends buffer A; process 1 receives it into buffer B.)

!146

Page 147: Parallelization - XS4ALL Klantenservice

!147

MPI Basic Send/Receive

• Data transfer plus synchronization

• Requires co-operation of sender and receiver
  • The co-operation is not always apparent in the code
  • Communication and synchronization are combined

(Figure: process 0 holds the data and asks process 1 "May I send?"; after the "Yes" the data items are transferred one after another over time.)

Page 148: Parallelization - XS4ALL Klantenservice


Message Envelope

• Communication across processes is performed using messages.

• Each message consists of a fixed number of fields that are used to distinguish them, called the Message Envelope:
  – The envelope comprises source, destination, tag, communicator
  – The message comprises envelope + data

• The communicator refers to the namespace associated with the group of related processes

  Example envelope:
    Source:        process 0
    Destination:   process 1
    Tag:           1234
    Communicator:  MPI_COMM_WORLD

Message Organization in MPI

• A message is divided into data and envelope

• data
  – buffer
  – count
  – datatype

• envelope
  – process identifier (source/destination)
  – message tag
  – communicator

!148

Page 149: Parallelization - XS4ALL Klantenservice

MPI Basic Send/Receive

• Thus the basic (blocking) send has become:
    MPI_Send( start, count, datatype, dest, tag, comm )
  – Blocking means the function does not return until it is safe to reuse the data in the buffer. The message may not have been received by the target process.

• And the receive has become:
    MPI_Recv( start, count, datatype, source, tag, comm, status )
  – The source, tag, and count of the message actually received can be retrieved from status

!149
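A minimal C sketch of a matching send/receive pair (my own example, not from the slides; it needs at least two processes):

  #include "mpi.h"
  #include <stdio.h>

  int main(int argc, char *argv[])
  {
      int rank, value;
      MPI_Status status;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0) {
          value = 42;
          MPI_Send(&value, 1, MPI_INT, 1, 99, MPI_COMM_WORLD);       /* dest=1, tag=99 */
      } else if (rank == 1) {
          MPI_Recv(&value, 1, MPI_INT, 0, 99, MPI_COMM_WORLD, &status);
          printf("rank 1 received %d\n", value);
      }
      MPI_Finalize();
      return 0;
  }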

Page 150: Parallelization - XS4ALL Klantenservice

MPI C Datatypes

!150
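The datatype table itself was a slide image; for reference, some of the most frequently used standard C datatypes and the C types they correspond to are:

  MPI_CHAR       signed char
  MPI_INT        int
  MPI_LONG       long
  MPI_FLOAT      float
  MPI_DOUBLE     double
  MPI_BYTE       (raw bytes, no C equivalent)
  MPI_PACKED     (packed data, no C equivalent)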

Page 151: Parallelization - XS4ALL Klantenservice

MPI Fortran Datatypes

!151

Page 152: Parallelization - XS4ALL Klantenservice

!152

MPI Tags

• Messages are sent with an accompanying user-defined integer tag, to assist the receiving process in identifying the message.

• Messages can be screened at the receiving end by specifying a specific tag, or not screened by specifying MPI_ANY_TAG as the tag in a receive.

Page 153: Parallelization - XS4ALL Klantenservice

Process Naming and Message Tags

• Naming a process
  – A destination is specified by ( rank, group )
  – Processes are named according to their rank in the group
  – Groups are defined by their distinct "communicator"
  – The MPI_ANY_SOURCE wildcard rank is permitted in a receive

• Tags are integer variables or constants used to uniquely identify individual messages
  • Tags allow programmers to deal with the arrival of messages in an orderly manner
  • MPI tags are guaranteed to range from 0 to 32767 by MPI-1
    – Vendors are free to increase the range in their implementations
  • MPI_ANY_TAG can be used as a wildcard value

!153

Page 154: Parallelization - XS4ALL Klantenservice

!154

Retrieving Further Information

• Status is a data structure allocated in the user's program.

• In C:

    int recvd_tag, recvd_from, recvd_count;
    MPI_Status status;

    MPI_Recv(..., MPI_ANY_SOURCE, MPI_ANY_TAG, ..., &status);
    recvd_tag  = status.MPI_TAG;
    recvd_from = status.MPI_SOURCE;
    MPI_Get_count( &status, datatype, &recvd_count );

• In Fortran:

    integer recvd_tag, recvd_from, recvd_count
    integer status(MPI_STATUS_SIZE)

    call MPI_RECV(..., MPI_ANY_SOURCE, MPI_ANY_TAG, ..., status, ierr)
    recvd_tag  = status(MPI_TAG)
    recvd_from = status(MPI_SOURCE)
    call MPI_GET_COUNT(status, datatype, recvd_count, ierr)

Page 155: Parallelization - XS4ALL Klantenservice

Getting Information About a Message (Fortran)

• The following functions can be used to get information about a message:

    INTEGER status(MPI_STATUS_SIZE)
    call MPI_Recv( . . . , status, ierr )
    tag_of_received_message = status(MPI_TAG)
    src_of_received_message = status(MPI_SOURCE)
    call MPI_Get_count(status, datatype, count, ierr)

• MPI_TAG and MPI_SOURCE are primarily of use when MPI_ANY_TAG and/or MPI_ANY_SOURCE is used in the receive

• The function MPI_GET_COUNT may be used to determine how much data of a particular type was received

!155

Page 156: Parallelization - XS4ALL Klantenservice

!156

Simple Fortran Example - 1/3

      program main
      use MPI

      integer rank, size, to, from, tag, count, i, ierr
      integer src, dest
      integer st_source, st_tag, st_count
      integer status(MPI_STATUS_SIZE)
      double precision data(10)

      call MPI_INIT( ierr )
      call MPI_COMM_RANK( MPI_COMM_WORLD, rank, ierr )
      call MPI_COMM_SIZE( MPI_COMM_WORLD, size, ierr )
      print *, 'Process ', rank, ' of ', size, ' is alive'
      dest = size - 1
      src = 0

Page 157: Parallelization - XS4ALL Klantenservice

!157

      if (rank .eq. 0) then
         do i=1,10
            data(i) = i
         enddo
         call MPI_SEND( data, 10, MPI_DOUBLE_PRECISION,
     +                  dest, 2001, MPI_COMM_WORLD, ierr)
      else if (rank .eq. dest) then
         tag = MPI_ANY_TAG
         src = MPI_ANY_SOURCE
         call MPI_RECV( data, 10, MPI_DOUBLE_PRECISION,
     +                  src, tag, MPI_COMM_WORLD,
     +                  status, ierr)

Simple Fortran Example - 2/3

Page 158: Parallelization - XS4ALL Klantenservice

!158

         call MPI_GET_COUNT( status, MPI_DOUBLE_PRECISION, st_count, ierr )
         st_source = status( MPI_SOURCE )
         st_tag = status( MPI_TAG )
         print *, 'status info: source = ', st_source,
     +            ' tag = ', st_tag, ' count = ', st_count
      endif

      call MPI_FINALIZE( ierr )
      end

Simple Fortran Example - 3/3

Page 159: Parallelization - XS4ALL Klantenservice

Is MPI Large or Small?

• Is MPI large (128 functions) or small (6 functions)?
  – MPI's extensive functionality requires many functions
  – The number of functions is not necessarily a measure of complexity
  – Many programs can be written with just 6 basic functions:
      MPI_INIT      MPI_COMM_SIZE   MPI_SEND
      MPI_FINALIZE  MPI_COMM_RANK   MPI_RECV

• MPI is just right
  – A small number of concepts
  – A large number of functions provides flexibility, robustness, efficiency, modularity, and convenience
  – One need not master all parts of MPI to use it

!159

Page 160: Parallelization - XS4ALL Klantenservice

#include "mpi.h"#include <math.h>int main(int argc, char *argv[]){

int done = 0, n, myid, numprocs, i, rc;double PI25DT = 3.141592653589793238462643;double mypi, pi, h, sum, x, a;MPI_Init(&argc,&argv);MPI_Comm_size(MPI_COMM_WORLD,&numprocs);MPI_Comm_rank(MPI_COMM_WORLD,&myid);while (!done) {

if (myid == 0) {printf("Enter the number of intervals: (0 quits) ");scanf("%d",&n);

}MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);if (n == 0) break;h = 1.0 / (double) n;sum = 0.0;for (i = myid + 1; i <= n; i += numprocs) {

x = h * ((double)i - 0.5);sum += 4.0 / (1.0 + x*x);

}mypi = h * sum;MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0,MPI_COMM_WORLD);if (myid == 0)

printf("pi is approximately %.16f, Error is %.16f\n",pi, fabs(pi - PI25DT));}MPI_Finalize();return 0;

}

!160

Example: PI in Fortran 90 and C

work distribution

Page 161: Parallelization - XS4ALL Klantenservice

work distribution

      program main
      use MPI
      double precision  PI25DT
      parameter        (PI25DT = 3.141592653589793238462643d0)
      double precision  mypi, pi, h, sum, x, f, a
      integer n, myid, numprocs, i, ierr
c     function to integrate
      f(a) = 4.d0 / (1.d0 + a*a)

      call MPI_INIT( ierr )
      call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
      call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )

 10   if ( myid .eq. 0 ) then
         write(6,98)
 98      format('Enter the number of intervals: (0 quits)')
         read(5,'(i10)') n
      endif
      call MPI_BCAST( n, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
      if ( n .le. 0 ) goto 30
      h = 1.0d0/n
      sum = 0.0d0
      do 20 i = myid+1, n, numprocs
         x = h * (dble(i) - 0.5d0)
         sum = sum + f(x)
 20   continue
      mypi = h * sum
      call MPI_REDUCE( mypi, pi, 1, MPI_DOUBLE_PRECISION,
     +                 MPI_SUM, 0, MPI_COMM_WORLD, ierr)
      if (myid .eq. 0) then
         write(6, 97) pi, abs(pi - PI25DT)
 97      format(' pi is approximately: ', F18.16,
     +          ' Error is: ', F18.16)
      endif
      goto 10
 30   call MPI_FINALIZE(ierr)
      end

!161

Page 162: Parallelization - XS4ALL Klantenservice

Exercise: PI with MPI and OpenMP

• Compile and run the code for computing PI in parallel

• Download code from: http://www.xs4all.nl/~janth/HPCourse/MPI_pi.tar

• Unpack tar file and check the README for instructions.

• It is assumed that you use mpich1 and have PBS installed as a job scheduler. If you have mpich2, let me know.

• Use qsub to submit the mpi job (an example script is provided) to a queue.

!162

Page 163: Parallelization - XS4ALL Klantenservice

Exercise: PI with MPI and OpenMP

!163

cores   OpenMP      marken (MPI)   linux (MPI)
  1     9.617728    14.10798       22.15252
  2     4.874539     7.071287       9.661745
  4     2.455036     3.532871       5.730912
  6     1.627149     2.356928       3.547961
  8     1.214713     1.832055       2.804715
 12     0.820746     1.184123
 16     0.616482     0.955704

Page 164: Parallelization - XS4ALL Klantenservice

[Plot: Pi Scaling — speedup versus number of CPUs (1 to 16) for Amdahl 1.0, OpenMP on Barcelona 2.2 GHz, MPI on marken (Xeon 3.1 GHz) and MPI on linux (Xeon 2.4 GHz).]

Exercise: PI with MPI and OpenMP

!164

Page 165: Parallelization - XS4ALL Klantenservice

!165

Page 166: Parallelization - XS4ALL Klantenservice

Blocking Communication

• So far we have discussed blocking communication
– MPI_SEND does not complete until the buffer is empty (available for reuse)
– MPI_RECV does not complete until the buffer is full (available for use)

• A process sending data will be blocked until the data in the send buffer is emptied
• A process receiving data will be blocked until the receive buffer is filled
• Completion of communication generally depends on the message size and the system buffer size
• Blocking communication is simple to use but can be prone to deadlocks:

      If (my_proc.eq.0) Then
         Call mpi_send(..)
         Call mpi_recv(..)      ! Usually deadlocks -->
      Else
         Call mpi_send(..)      ! <--- UNLESS you reverse send/recv
         Call mpi_recv(..)
      Endif

!166

Page 167: Parallelization - XS4ALL Klantenservice

!167

• Send a large message from process 0 to process 1
• If there is insufficient storage at the destination, the send must wait for the user to provide the memory space (through a receive)

• What happens with:

Sources of Deadlocks

Process 0

Send(1) Recv(1)

Process 1

Send(0) Recv(0)

• This is called “unsafe” because it depends on the availability of system buffers.

Page 168: Parallelization - XS4ALL Klantenservice

!168

Some Solutions to the “unsafe” Problem

• Order the operations more carefully:

Process 0

Send(1) Recv(1)

Process 1

Recv(0) Send(0)

• Use non-blocking operations:

Process 0

Isend(1) Irecv(1) Waitall

Process 1

Isend(0) Irecv(0) Waitall
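A minimal C sketch of the second solution (my own illustration, not course code): both processes post the non-blocking send and receive first and then wait on both requests, so no ordering of send and receive is needed and the exchange cannot deadlock. It assumes exactly two processes.

#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, other, sendbuf, recvbuf;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    other   = 1 - rank;          /* partner rank, assuming exactly 2 processes */
    sendbuf = rank;

    MPI_Isend(&sendbuf, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(&recvbuf, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);   /* both complete, in either order */

    MPI_Finalize();
    return 0;
}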

Page 169: Parallelization - XS4ALL Klantenservice

Blocking Send-Receive Diagram (Receive before Send)

[Timing diagram, send side / receive side: at T0 MPI_Recv is called and the receive buffer becomes unavailable to the user; at T1 MPI_Send is called; the sender returns at T2 and the send buffer can be reused; the transfer completes at T3; MPI_Recv returns at T4 with the buffer filled (internal completion is soon followed by the return of MPI_Recv). It is important to receive before sending, for highest performance.]

!169

Page 170: Parallelization - XS4ALL Klantenservice

Non-Blocking Send-Receive Diagram

[Timing diagram, send side / receive side: at T0 MPI_Irecv is called and the receive buffer becomes unavailable to the user; Irecv returns at T1. At T2 MPI_Isend is called; the sender returns at T3 with the send buffer unavailable. MPI_Wait is called at T4 (receive side) and T6 (send side); the sender completes at T5 and its buffer is available again after MPI_Wait; the transfer finishes at T7; the receive-side MPI_Wait returns at T8 with the receive buffer filled, and the send-side Wait returns at T9. Internal completion is soon followed by the return of MPI_Wait. High-performance implementations offer low overhead for non-blocking calls.]

!170

Page 171: Parallelization - XS4ALL Klantenservice

Message Completion and Buffering

• A send has completed when the user supplied buffer can be reused

• The send mode used (standard, ready, synchronous, buffered) may provide additional information

• Just because the send completes does not mean that the receive has completed – Message may be buffered by the system – Message may still be in transit

*buf = 3;
MPI_Send ( buf, 1, MPI_INT, ... );
*buf = 4;   /* OK, receiver will always receive 3 */

*buf = 3;
MPI_Isend( buf, 1, MPI_INT, ... );
*buf = 4;   /* Undefined whether the receiver will get 3 or 4 */
MPI_Wait ( ... );

!171

Page 172: Parallelization - XS4ALL Klantenservice

Non-Blocking Communication

• Non-blocking (asynchronous) operations return (immediately) "request handles" that can be waited on and queried:

MPI_ISEND( start, count, datatype, dest, tag, comm, request )
MPI_IRECV( start, count, datatype, src, tag, comm, request )
MPI_WAIT( request, status )

• Non-blocking operations allow overlapping computation and communication.

• One can also test without waiting using MPI_TEST:

MPI_TEST( request, flag, status )

• Anywhere you use MPI_Send or MPI_Recv, you can use the pair of MPI_Isend/MPI_Wait or MPI_Irecv/MPI_Wait

• Combinations of blocking and non-blocking sends/receives can be used to synchronize execution instead of barriers

!172
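A small sketch of MPI_TEST in use (illustrative, not from the course code): rank 1 posts an MPI_Irecv and keeps doing local work, polling with MPI_Test until the message from rank 0 has arrived.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, data = 0, flag = 0, iterations = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        data = 42;
        MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Request req;
        MPI_Status status;
        MPI_Irecv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
        while (!flag) {
            iterations++;                      /* stands in for useful local work */
            MPI_Test(&req, &flag, &status);    /* returns immediately */
        }
        printf("got %d after %d polling iterations\n", data, iterations);
    }
    MPI_Finalize();
    return 0;
}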

Page 173: Parallelization - XS4ALL Klantenservice

Multiple Completions

• It is often desirable to wait on multiple requests

• An example is a worker/manager program, where the manager waits for one or more workers to send it a message:

MPI_WAITALL( count, array_of_requests, array_of_statuses )
MPI_WAITANY( count, array_of_requests, index, status )
MPI_WAITSOME( incount, array_of_requests, outcount, array_of_indices, array_of_statuses )

• There are corresponding versions of test for each of these

!173
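A hedged sketch of the manager side with MPI_WAITANY (the names and the "work" are illustrative): the manager posts one receive per worker and handles the results in whatever order they arrive.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        int nworkers = size - 1;
        double *results = malloc(nworkers * sizeof(double));
        MPI_Request *reqs = malloc(nworkers * sizeof(MPI_Request));
        for (int w = 0; w < nworkers; w++)     /* post a receive for every worker */
            MPI_Irecv(&results[w], 1, MPI_DOUBLE, w + 1, 0, MPI_COMM_WORLD, &reqs[w]);
        for (int done = 0; done < nworkers; done++) {
            int idx;
            MPI_Status st;
            MPI_Waitany(nworkers, reqs, &idx, &st);   /* whichever finishes first */
            printf("manager: got %f from rank %d\n", results[idx], st.MPI_SOURCE);
        }
        free(results); free(reqs);
    } else {
        double my_result = 3.0 * rank;         /* placeholder "work" */
        MPI_Send(&my_result, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}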

Page 174: Parallelization - XS4ALL Klantenservice

Collective Communications in MPI

• Communication is coordinated among a group of processes, as specified by the communicator, not on all processes

• All collective operations are blocking and no message tags are used (in MPI-1)

• All processes in the communicator group must call the collective operation

• Collective and point-to-point messaging are separated by different “contexts”

• Three classes of collective operations

– Data movement

– Collective computation

– Synchronization

!174

Page 175: Parallelization - XS4ALL Klantenservice

Collective calls

!175

• MPI has several collective communication calls, the most frequently used are:

• Synchronization • Barrier

• Communication • Broadcast • Gather Scatter • All Gather

• Reduction • Reduce • All Reduce

Page 176: Parallelization - XS4ALL Klantenservice


MPI Collective Calls: Barrier

Function: MPI_Barrier()

int MPI_Barrier ( MPI_Comm comm )

Description: Creates barrier synchronization in a communicator group comm. Each process, when reaching the MPI_Barrier call, blocks until all the processes in the group reach the same MPI_Barrier call.

http://www-unix.mcs.anl.gov/mpi/www/www3/MPI_Barrier.html

[Diagram: processes P0–P3 each reach MPI_Barrier() at different times and block until all four have arrived.]

!176


Page 177: Parallelization - XS4ALL Klantenservice

MPI Basic Collective Operations

• Two simple collective operations

MPI_BCAST( start, count, datatype, root, comm )
MPI_REDUCE( start, result, count, datatype, operation, root, comm )

• The routine MPI_BCAST sends data from one process to all others

• The routine MPI_REDUCE combines data from all processes, using a specified operation, and returns the result to a single process

• In many numerical algorithms, SEND/RECEIVE can be replaced by BCAST/REDUCE, improving both simplicity and efficiency.

!177

Page 178: Parallelization - XS4ALL Klantenservice

Broadcast and Reduce

[Diagram: Bcast(root=0) copies the root's send buffer A into the buffers of all process ranks (A, ?, ?, ? becomes A, A, A, A). Reduce(root=0, op) combines the send buffers A, B, C, D of ranks 0–3 into X = A op B op C op D in the receive buffer of the root. AllReduce(comm_world, op) delivers the combined value X to every rank.]

!178

Page 179: Parallelization - XS4ALL Klantenservice

Scatter and Gather

[Diagram: Scatter(root=0) distributes the chunks A, B, C, D of the root's send buffer over the receive buffers of ranks 0–3. Gather(root=0) collects one chunk from each rank's send buffer into the root's receive buffer (A, B, C, D). AllGather(comm_world) gives every rank the complete set A, B, C, D.]

!179

Page 180: Parallelization - XS4ALL Klantenservice

MPI Collective Routines

• Several routines:

MPI_ALLGATHER   MPI_ALLGATHERV   MPI_BCAST
MPI_ALLTOALL    MPI_ALLTOALLV
MPI_GATHER      MPI_GATHERV
MPI_REDUCE      MPI_ALLREDUCE    MPI_REDUCE_SCATTER
MPI_SCATTER     MPI_SCATTERV

• The "ALL" versions deliver results to all participating processes
• The "V" versions allow the chunks to have different sizes
• MPI_ALLREDUCE, MPI_REDUCE, and MPI_REDUCE_SCATTER take both built-in and user-defined combination functions

!180

Page 181: Parallelization - XS4ALL Klantenservice

Built-In Collective Computation Operations

!181
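The table of predefined operations did not survive the transcript; as a small reminder, here is a sketch (illustration only) using two of the built-in combination operations, MPI_SUM and MPI_MAX, with MPI_Allreduce so that every rank gets the result.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, sum, max;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);  /* built-in sum */
    MPI_Allreduce(&rank, &max, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);  /* built-in max */
    printf("rank %d of %d: sum of ranks = %d, max rank = %d\n", rank, size, sum, max);
    MPI_Finalize();
    return 0;
}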

Page 182: Parallelization - XS4ALL Klantenservice

!182

Extending the Message-Passing Interface

• Dynamic Process Management • Dynamic process startup • Dynamic establishment of connections

• One-sided communication • Put/get • Other operations

• Parallel I/O

• Other MPI-2 features • Generalized requests • Bindings for C++/ Fortran-90; inter-language issues

Page 183: Parallelization - XS4ALL Klantenservice

!183

When to use MPI

• Portability and Performance

• Irregular Data Structures

• Building Tools for Others • Libraries

• Need to Manage memory on a per processor basis

Page 184: Parallelization - XS4ALL Klantenservice

!184

When not to use MPI

• Number of cores is limited and OpenMP is doing well on that number of cores

• typically 16-32 cores in SMP

Page 185: Parallelization - XS4ALL Klantenservice

Domain Decomposition

1. Break up the domain into blocks.

2. Assign blocks to MPI-processes one-to-one.

3. Provide a "map" of neighbors to each process.

4. Write or modify your code so it only updates a single block.

5. Insert communication subroutine calls where needed.

6. Adjust the boundary conditions code.

7. Use "guard cells".

!185

for example: wave equation in a 2D domain.

Page 186: Parallelization - XS4ALL Klantenservice

A Heat-Transfer Example with MPI (Rolf Rabenseifner, Höchstleistungsrechenzentrum Stuttgart)

Heat: How to parallelize with MPI

• Block data decomposition of arrays phi, phin
– array sizes are variable, they depend on the number of processors

• Halo (shadow, ghost cells, ...)
– to store values of the neighbors
– width of the halo := needs of the numerical formula (heat: 1)

• Communication:
– fill the halo with the values of the neighbors
– after each iteration (dt)

• Global evaluation of the abort criterion
– collective operation


Block data decomposition — in which dimensions?

Splitting in
• one dimension:   communication = n^2 * 2w * 1
• two dimensions:  communication = n^2 * 2w * 2 / p^(1/2)
• three dimensions: communication = n^2 * 2w * 3 / p^(2/3)

w = width of halo, n^3 = size of matrix, p = number of processors;
cyclic boundary —> two neighbors in each direction

optimal for p > 11

-1- Break up domain in blocks

!186

Page 187: Parallelization - XS4ALL Klantenservice

-2- Assign blocks to MPI processes

!187

[Diagram: the nx × nz domain is split into 2 × 2 blocks of size nx_block = nx/2 by nz_block = nz/2; block [0,0] is assigned to MPI 0, [1,0] to MPI 1, [0,1] to MPI 2 and [1,1] to MPI 3.]

Page 188: Parallelization - XS4ALL Klantenservice

-3- Make a map of neighbors

!188

[Diagram: each block stores the ranks of its neighbors, e.g. left=10, right=40, bottom=26, top=24 for the block surrounded by MPI 10, MPI 11, MPI 24, MPI 26, MPI 40 and MPI 41.]

Page 189: Parallelization - XS4ALL Klantenservice

-4- Write code for single MPI block

!189

for (ix=0; ix<nx_block; ix++) {
    for (iz=0; iz<nz_block; iz++) {
        vx[ix][iz] -= rox[ix][iz]*(
            c1*(p[ix][iz]   - p[ix-1][iz]) +
            c2*(p[ix+1][iz] - p[ix-2][iz]));
    }
}

Page 190: Parallelization - XS4ALL Klantenservice

-5- Communication calls

• Examine the algorithm • Does it need data from its neighbors? • Yes? => Insert communication calls to get this data

• Need p[-1][iz] and p[-2][iz] to update vx[0][iz]

• p[-1][iz] corresponds to p[nx_block-1][iz] from left neighbor • p[-2][iz] corresponds to p[nx_block-2][iz] from left neighbor

• Get this data from the neighbor with MPI.

!190

Page 191: Parallelization - XS4ALL Klantenservice

communication

!191

for (ix=0; ix<nx_block; ix++) {
    for (iz=0; iz<nz_block; iz++) {
        vx[ix][iz] -= rox[ix][iz]*(
            c1*(p[ix][iz]   - p[ix-1][iz]) +
            c2*(p[ix+1][iz] - p[ix-2][iz]));
    }
}

[Diagram: at the left edge of block MPI 12 (ix = 0), the update
   vx[0][iz] -= rox[0][iz]*( c1*(p[0][iz] - p[-1][iz]) + c2*(p[1][iz] - p[-2][iz]) )
needs p[-1][iz] and p[-2][iz], which correspond to p[nx_block-1][iz] and p[nx_block-2][iz] of the left neighbor MPI 10.]

Page 192: Parallelization - XS4ALL Klantenservice

pseudocode

if (my_rank==12) {
    call MPI_Receive(p[-2:-1][iz], .., 10, ..);

    vx[ix][iz] -= rox[ix][iz]*(
        c1*(p[ix][iz]   - p[ix-1][iz]) +
        c2*(p[ix+1][iz] - p[ix-2][iz]));
}

if (my_rank==10) {
    call MPI_Send(p[nx_block-2:nx_block-1][iz], .., 12, ..);
}

!192

Page 193: Parallelization - XS4ALL Klantenservice

Update of all blocks [i][j]

!193

[i-1][j]  send(nx-1) send(nx-2)   receive(-1) receive(-2)
[i][j]    receive(-1) receive(-2)  send(nx-1) send(nx-2)
[i+1][j]  send(nx-1) send(nx-2)   receive(-1) receive(-2)
[i+2][j]  receive(-1) receive(-2)  send(nx-1) send(nx-2)

mpi_sendrecv(p(nx-1:nx-2), 2, .., my_neighbor_right, ...,
             p(-2:-1),     2, .., my_neighbor_left,  ...)
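A self-contained C sketch of this exchange (my own illustration; the block sizes, names and the 1D decomposition in x are assumptions): the rightmost interior columns go to the right neighbour while the left guard cells are received from the left neighbour, and vice versa. With MPI_PROC_NULL neighbours the calls become no-ops, which is convenient at non-periodic boundaries (discussed a few slides further on).

#include <mpi.h>

#define NXB  64     /* nx_block: interior width of the local block (assumed) */
#define NZB  64     /* nz_block                                              */
#define HALO  2     /* width of the guard-cell region                        */

/* local block stored with HALO extra columns on each side:
   columns 0..HALO-1 and HALO+NXB..HALO+NXB+HALO-1 are guard cells */
static float p[NXB + 2*HALO][NZB];

static void exchange_left_right(int left, int right, MPI_Comm comm)
{
    /* send rightmost interior columns to the right, receive left guard cells */
    MPI_Sendrecv(&p[NXB][0],        HALO*NZB, MPI_FLOAT, right, 0,
                 &p[0][0],          HALO*NZB, MPI_FLOAT, left,  0,
                 comm, MPI_STATUS_IGNORE);
    /* send leftmost interior columns to the left, receive right guard cells */
    MPI_Sendrecv(&p[HALO][0],       HALO*NZB, MPI_FLOAT, left,  1,
                 &p[HALO + NXB][0], HALO*NZB, MPI_FLOAT, right, 1,
                 comm, MPI_STATUS_IGNORE);
}

int main(int argc, char *argv[])
{
    int rank, size, left, right;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;   /* non-periodic ends */
    right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;
    exchange_left_right(left, right, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}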

Page 194: Parallelization - XS4ALL Klantenservice

-6- Domain boundaries

• What if my block is at a physical boundary?

• Periodic boundary conditions

• Non-periodic boundary conditions

!194

Page 195: Parallelization - XS4ALL Klantenservice

Periodic Boundary Conditions

!195

[Diagram: with periodic boundary conditions the blocks wrap around, e.g. MPI 47 has left=38 and right=0, and MPI 0 has left=47 and right=10.]

Page 196: Parallelization - XS4ALL Klantenservice

Non-Periodic Boundary Conditions

!196

[Diagram: with non-periodic boundary conditions the outermost blocks have no neighbor on one side, e.g. MPI 47 has left=38 and right=MPI_PROC_NULL, MPI 0 has left=MPI_PROC_NULL and right=10.]

1. Avoid sending to or receiving from a non-existent block: use MPI_PROC_NULL.

2. Apply boundary conditions:

   if (my_neighbor_right == MPI_PROC_NULL) &
      [ APPLY APPROPRIATE BOUNDARY CONDITIONS ]

   if (my_rank == 12) my_neighbor_right = MPI_PROC_NULL
   ...
   call mpi_send(...... my_neighbor_right ...)   ! no data is sent by proc. 12

Page 197: Parallelization - XS4ALL Klantenservice


Heat: How to parallelize with MPI - Block Data Decomposition

• Block data decomposition
• Cartesian process topology
– two-dimensional: size = idim * kdim
– process coordinates: icoord = 0 .. idim-1, kcoord = 0 .. kdim-1
– see heat-mpi-big.f, lines 18-42

• Static array declaration –> dynamic array allocation
– phi(0:80,0:80) –> phi(is:ie,ks:ke)
– additional subroutine "algorithm"
– see heat-mpi-big.f, lines 186+192

• Computation of is, ie, ks, ke
– see heat-mpi-big.f, lines 44-98

[Diagram: the local array phi(is:ie,ks:ke) inside the global 0..80 range, with halo width b1=1. The halo (outer area) holds values read in each iteration; the inner area holds values computed in each iteration.]


Heat: Communication Method - Non-blocking

• Communicate halo in each iteration
• Ranks via MPI_CART_SHIFT:

  114  call MPI_CART_SHIFT(comm, 0, 1, left, right, ierror)
  115  call MPI_CART_SHIFT(comm, 1, 1, lower, upper, ierror)

• Non-blocking point-to-point: from left to right, from right to left, from upper to lower, and from lower to upper
• Caution: corners must not be sent simultaneously in two directions
• Vertical boundaries:
– non-contiguous data
– MPI derived datatypes used

  118  call MPI_TYPE_VECTOR( kinner, b1, iouter, MPI_DOUBLE_PRECISION, vertical_border, ierror)
  120  call MPI_TYPE_COMMIT( vertical_border, ierror)
  221  call MPI_IRECV(phi(is,ks+b1), 1, vertical_border, left, MPI_ANY_TAG, comm, req(1), ierror)
  225  call MPI_ISEND(phi(ie-b1,ks+b1), 1, vertical_border, right, 0, comm, req(3), ierror)

• Allocate local block including overlapping areas to store values received from neighbor communication.

• This allows tidy coding of your loops

-7- guard cells

!197

[Diagram: array P[i][iz] with guard cells at indices i = -2 and i = -1, to the left of the first interior cell i = 0.]
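A tiny C sketch of the idea (1D for clarity, my own illustration): allocate the block with room for the halo and shift the pointer, so that indices -halo .. nx_block-1+halo are all valid and the update loops need no special cases at the edges.

#include <stdlib.h>

/* returns a pointer whose valid indices run from -halo to nx_block-1+halo;
   the cells outside 0..nx_block-1 are the guard cells filled by MPI */
float *alloc_with_guard(int nx_block, int halo)
{
    float *raw = calloc(nx_block + 2*halo, sizeof(float));
    return raw + halo;
}

void free_with_guard(float *p, int halo)
{
    free(p - halo);    /* undo the shift before freeing */
}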

Page 198: Parallelization - XS4ALL Klantenservice

Memory use increases with guard

• Example: a 1000x1000 grid, N tasks, H the width of a (square) task, h the number of halo points. The guard size per domain is

• halosize = 4*h*H + 4*h*h (the four corners)

• The halo size limits the minimum task size, H >= 2h, and hence the maximum number of tasks, N <= (Nx/2h)**2

• N = 1089 = 33*33, then H = 1000/33 ~ 30, h = 8
• halosize = 4*8*30 + 4*8*8 = 1216
• actual work is 30*30 = 900 points

!198

Page 199: Parallelization - XS4ALL Klantenservice

Exercise: MPI_Cart

• Do I have to make a map of MPI processes myself?

• You can use MPI_Cart to create a domain decomposition for you.

• The exercise sets up a 2D domain with one periodic and one non-periodic boundary condition

• Download code from: http://www.xs4all.nl/~janth/HPCourse/code/MPI_Cart.tar

• Unpack tar file and check the README for instructions

!199
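A compact sketch of what such a setup looks like (an illustration, not the code in the tar file): a 2D process grid, periodic in one dimension and non-periodic in the other; MPI_Cart_shift returns MPI_PROC_NULL where there is no neighbour.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int size, rank, dims[2] = {0, 0}, periods[2] = {1, 0}, coords[2];
    int left, right, top, bottom;
    MPI_Comm cart;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Dims_create(size, 2, dims);                 /* factor size into a 2D grid */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart);
    MPI_Comm_rank(cart, &rank);
    MPI_Cart_coords(cart, rank, 2, coords);
    MPI_Cart_shift(cart, 0, 1, &left, &right);      /* periodic dimension */
    MPI_Cart_shift(cart, 1, 1, &top, &bottom);      /* non-periodic: may be MPI_PROC_NULL */
    printf("PE %d: x=%d, y=%d: left=%d right=%d top=%d bottom=%d\n",
           rank, coords[0], coords[1], left, right, top, bottom);
    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}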

Page 200: Parallelization - XS4ALL Klantenservice

PE- 6: x=1, y=2: Sum = 32 neighbors: left=2 right=10 top=5 bottom=7

PE- 4: x=1, y=0: Sum = 24 neighbors: left=0 right=8 top=-1 bottom=5

PE- 2: x=0, y=2: Sum = 32 neighbors: left=14 right=6 top=1 bottom=3

PE- 7: x=1, y=3: Sum = 36 neighbors: left=3 right=11 top=6 bottom=-1

PE- 0: x=0, y=0: Sum = 24 neighbors: left=12 right=4 top=-1 bottom=1

PE- 5: x=1, y=1: Sum = 28 neighbors: left=1 right=9 top=4 bottom=6

PE- 1: x=0, y=1: Sum = 28 neighbors: left=13 right=5 top=0 bottom=2

PE- 3: x=0, y=3: Sum = 36 neighbors: left=15 right=7 top=2 bottom=-1

PE- 14: x=3, y=2: Sum = 32 neighbors: left=10 right=2 top=13 bottom=15

PE- 9: x=2, y=1: Sum = 28 neighbors: left=5 right=13 top=8 bottom=10

PE- 8: x=2, y=0: Sum = 24 neighbors: left=4 right=12 top=-1 bottom=9

PE- 11: x=2, y=3: Sum = 36 neighbors: left=7 right=15 top=10 bottom=-1

PE- 15: x=3, y=3: Sum = 36 neighbors: left=11 right=3 top=14 bottom=-1

PE- 10: x=2, y=2: Sum = 32 neighbors: left=6 right=14 top=9 bottom=11

PE- 12: x=3, y=0: Sum = 24 neighbors: left=8 right=0 top=-1 bottom=13

PE- 13: x=3, y=1: Sum = 28 neighbors: left=9 right=1 top=12 bottom=14

answer 16 MPI tasks

!200

Page 201: Parallelization - XS4ALL Klantenservice

!201

The corresponding 4 x 4 process grid:

 0  4  8 12
 1  5  9 13
 2  6 10 14
 3  7 11 15

Page 202: Parallelization - XS4ALL Klantenservice

Example: 3D-Finite Difference

• 3D domain decomposition to distribute the work • MPI_Cart_create to set up cartesian domain

• Hide communication of halo-areas with computational work

• MPI-IO for writing snapshots to disk
• use local and global derived datatypes to exclude the halo areas when writing into the global snapshot file

• Tutorial about domain decomposition: http://www.nccs.nasa.gov/tutorials/mpi_tutorial2/

!202

Page 203: Parallelization - XS4ALL Klantenservice

MPI_Comm_size(MPI_COMM_WORLD, &size);

/* divide 3D cube among available processors */
dims[0]=0; dims[1]=0; dims[2]=0;
MPI_Dims_create(size, 3, dims);
npz = 2*halo+nz/dims[0];
npx = 2*halo+nx/dims[1];
npy = 2*halo+ny/dims[2];

/* create cartesian domain decomposition */
MPI_Cart_create(MPI_COMM_WORLD, 3, dims, period, reorder, &COMM_CART);

/* find out neighbors of the rank, MPI_PROC_NULL is a hard boundary of the model */
MPI_Cart_shift(COMM_CART, 1, 1, &left, &right);
MPI_Cart_shift(COMM_CART, 2, 1, &top, &bottom);
MPI_Cart_shift(COMM_CART, 0, 1, &front, &back);

fprintf(stderr, "Rank %d has LR neighbours %d %d FB %d %d TB %d %d\n",
        my_rank, left, right, front, back, top, bottom);

!203

Page 204: Parallelization - XS4ALL Klantenservice

!204

[Diagram: the global nx × ny array, distributed over px × py processes, and one local lnx × lny block; local and global subarrays are used for the MPI-IO of each block into the global file.]

Page 205: Parallelization - XS4ALL Klantenservice

!205

/* create subarrays (excluding halo areas) to write to file with MPI-IO */
/* data in the local array */
sizes[0]=npz; sizes[1]=npx; sizes[2]=npy;
subsizes[0]=sizes[0]-2*halo;
subsizes[1]=sizes[1]-2*halo;
subsizes[2]=sizes[2]-2*halo;
starts[0]=halo; starts[1]=halo; starts[2]=halo;
MPI_Type_create_subarray(3, sizes, subsizes, starts, MPI_ORDER_C, MPI_FLOAT, &local_array);
MPI_Type_commit(&local_array);

/* data in the global array */
gsizes[0]=nz; gsizes[1]=nx; gsizes[2]=ny;
gstarts[0]=subsizes[0]*coord[0];
gstarts[1]=subsizes[1]*coord[1];
gstarts[2]=subsizes[2]*coord[2];
MPI_Type_create_subarray(3, gsizes, subsizes, gstarts, MPI_ORDER_C, MPI_FLOAT, &global_array);
MPI_Type_commit(&global_array);

Page 206: Parallelization - XS4ALL Klantenservice

!206

for (it=0; it<nt; it++) {

    /* copy of halo areas */
    copyHalo(vx, vz, p, npx, npy, npz, halo,
             leftSend, rightSend, frontSend, backSend, topSend, bottomSend);

    /* start asynchronous communication of halo areas to neighbours */
    exchangeHalo(leftRecv, leftSend, rightRecv, rightSend, 3*halosizex, left, right, &reqRecv[0], &reqSend[0], 0);
    exchangeHalo(frontRecv, frontSend, backRecv, backSend, 3*halosizey, front, back, &reqRecv[2], &reqSend[2], 4);
    exchangeHalo(topRecv, topSend, bottomRecv, bottomSend, 3*halosizez, top, bottom, &reqRecv[4], &reqSend[4], 8);

    /* update of grid values excluding halo areas */
    updateVelocities(vx, vy, vz, p, ro, ixs, ixe, iys, iye, izs, ize, npx, npy, npz);
    updatePressure(vx, vy, vz, p, l2m, ixs, ixe, iys, iye, izs, ize, npx, npy, npz);

Page 207: Parallelization - XS4ALL Klantenservice

!207

    /* wait for halo communications to be finished */
    waitForHalo(&reqRecv[0], &reqSend[0]);
    t1 = wallclock_time(); tcomm += t1-t2;

    newHalo(vx, vz, p, npx, npy, npz, halo,
            leftRecv, rightRecv, frontRecv, backRecv, topRecv, bottomRecv);
    t2 = wallclock_time(); tcopy += t2-t1;

    /* compute field on the grid of the halo areas */
    updateVelocities(vx, vy, vz, p, ro, ixs, ixe, iys, iye, izs, ize, npx, npy, npz);
    updatePressure(vx, vy, vz, p, l2m, ixs, ixe, iys, iye, izs, ize, npx, npy, npz);

Page 208: Parallelization - XS4ALL Klantenservice

!208

    /* write snapshots to file */
    if (time == snapshottime) {
        MPI_Info_create(&fileinfo);
        MPI_Info_set(fileinfo, "striping_factor", STRIPE_COUNT);
        MPI_Info_set(fileinfo, "striping_unit", STRIPE_SIZE);
        MPI_File_delete(filename, MPI_INFO_NULL);

        rc = MPI_File_open(MPI_COMM_WORLD, filename, MPI_MODE_RDWR|MPI_MODE_CREATE, fileinfo, &fh);

        disp = 0;
        rc = MPI_File_set_view(fh, disp, MPI_FLOAT, global_array, "native", fileinfo);

        rc = MPI_File_write_all(fh, p, 1, local_array, status);

        MPI_File_close(&fh);
    }

} /* end of time loop */

Page 209: Parallelization - XS4ALL Klantenservice

Summary

!209

Page 210: Parallelization - XS4ALL Klantenservice

MPI Summary

• MPI Standard widely accepted by vendors and programmers

• MPI implementations available on most modern platforms
• Several MPI applications deployed
• Several tools exist to trace and tune MPI applications

• Simple applications use point-to-point and collective communication operations

• Advanced applications use point-to-point, collective, communicators, datatypes, one-sided, and topology operations

!210

Page 211: Parallelization - XS4ALL Klantenservice

Hybrid systems programming hierarchy

Hybrid System
• Pure MPI
• Hybrid MPI/OpenMP: OpenMP in SMP nodes, MPI across the nodes
  – No overlapping computation/communication: MPI only outside the OpenMP code; only the master thread does MPI communication.
  – Overlapping computation/communication: MPI inside the OpenMP code.
• Pure OpenMP (shared memory)

Page 212: Parallelization - XS4ALL Klantenservice

Hybrid OpenMP/MPI

• Natural paradigm for clusters of SMPs
• May offer considerable advantages when application mapping and load balancing are tough
• Benefits with slower interconnection networks (overlapping computation/communication)

‣ Requires work and code analysis to change pure MPI codes
‣ Start with auto-parallelization?
‣ Link shared-memory libraries; check various thread/MPI process combinations
‣ Study carefully the underlying architecture

• What is the future of this model? Could it be consolidated in new languages?

• Connection with many-core?

!212

Page 213: Parallelization - XS4ALL Klantenservice

!213

[Diagram (C. Bekas): a 2D discretization mesh distributed over SMP nodes P0–P3.]

Suppose we wish to solve the PDE with the Jacobi method: the value of u at each discretization point is given by a certain average among its neighbors, repeated until convergence. Distributing the mesh over the SMP cluster by domain decomposition, the green (interior) nodes can proceed without any communication, while the blue (boundary) nodes have to communicate first and calculate later.

Overlapping computation/communication: Example

Page 214: Parallelization - XS4ALL Klantenservice

MPI/OpenMP: Overlapping computation/communication

!214

Not only the master but other threads communicate. Call MPI primitives in OpenMP code regions.

if (my_thread_id < #) {
    MPI_...   (communicate needed data)
} else {
    /* Perform computations that do not need communication */
    ...
}
/* All threads execute code that requires communication */
...

Page 215: Parallelization - XS4ALL Klantenservice

!215

for (k=0; k < MAXITER; k++) {
    /* Start parallel region here */
    #pragma omp parallel private(...)
    {
        my_id = omp_get_thread_num();

        if (my_id is given "halo points")
            MPI_SendRecv("From neighboring MPI process");
        else {
            for (i=0; i < # allocated points; i++)
                newval[i] = avg(oldval[i]);
        }

        if (there are still points I need to do)
            for (i=0; i < # remaining points; i++)
                newval[i] = avg(oldval[i]);

        for (i=0; i < (all_my_points); i++)
            oldval[i] = newval[i];
    }
    MPI_Barrier();   /* Synchronize all MPI processes here */

}
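For comparison, here is a minimal compilable sketch of the funneled hybrid style discussed above (an illustration, not course code): MPI is initialised for threaded use, the OpenMP threads do the computation, and only the single thread outside the parallel region calls MPI.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int provided, rank;
    double local = 0.0, global = 0.0;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel reduction(+:local)
    {
        local += omp_get_thread_num() + 1;   /* stands in for real work */
    }
    /* outside the parallel region only one thread remains: safe to call MPI */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) printf("global sum = %f\n", global);
    MPI_Finalize();
    return 0;
}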

Page 216: Parallelization - XS4ALL Klantenservice


Hiding IO with IO-Server

!216

Compute Node:

   do i=1,time_steps
      compute(j)
      checkpoint(data)
   end do

   subroutine checkpoint(data)
      MPI_Wait(send_req)
      buffer = data
      MPI_Isend(IO_SERVER, buffer)
   end subroutine

I/O Server:

   do i=1,time_steps
      do j=1,compute_nodes
         MPI_Recv(j, buffer)
         write(buffer)
      end do
   end do

Use more nodes to act as IO-Servers (pseudo code)
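A C sketch of the same pattern (an illustration with made-up sizes and file name; for simplicity it uses blocking MPI_Send where the pseudo code overlaps the send with MPI_Isend/MPI_Wait): the last rank acts as the I/O server, the other ranks send it their checkpoint buffers and continue computing.

#include <mpi.h>
#include <stdio.h>

#define NSTEPS 10        /* assumed number of time steps       */
#define BUFLEN 1024      /* assumed checkpoint buffer length   */

int main(int argc, char *argv[])
{
    int rank, size;
    double buffer[BUFLEN];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int io_server = size - 1;             /* last rank does the writing */

    if (rank == io_server) {
        FILE *fp = fopen("checkpoint.bin", "wb");
        for (int step = 0; step < NSTEPS; step++)
            for (int node = 0; node < size - 1; node++) {
                MPI_Recv(buffer, BUFLEN, MPI_DOUBLE, node, step,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                fwrite(buffer, sizeof(double), BUFLEN, fp);
            }
        fclose(fp);
    } else {
        for (int step = 0; step < NSTEPS; step++) {
            for (int i = 0; i < BUFLEN; i++) buffer[i] = rank + step;  /* "compute" */
            MPI_Send(buffer, BUFLEN, MPI_DOUBLE, io_server, step, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}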

Page 217: Parallelization - XS4ALL Klantenservice

PGAS

• What is PGAS?

• How to make use of PGAS as a programmer?

!217

Page 218: Parallelization - XS4ALL Klantenservice

PRACE Winter School 2009 — Montse Farreras

• Explicitly parallel, shared-memory-like programming model

• Global addressable space
  – Allows programmers to declare and "directly" access data distributed across the machine

• Partitioned address space
  – Memory is logically partitioned between local and remote (a two-level hierarchy)
  – Forces the programmer to pay attention to data locality, by exposing the inherent NUMA-ness of current architectures

• Single Program Multiple Data (SPMD) execution model
  – All threads of control execute the same program
  – Number of threads fixed at startup
  – Newer languages such as X10 escape this model, allowing fine-grain threading

• Different language implementations:
  – UPC (C-based), CoArray Fortran (Fortran-based), Titanium and X10 (Java-based)

Partitioned Global Address Space

!218

Page 219: Parallelization - XS4ALL Klantenservice


• Computation is performed in multiple places.
• A place contains data that can be operated on remotely.
• Data lives in the place it was created, for its lifetime.
• A datum in one place may point to a datum in another place.
• Data structures (e.g. arrays) may be distributed across many places.

A place expresses locality.

[Diagram: programming models arranged by address space — shared memory (OpenMP), PGAS (UPC, CAF, X10) and message passing (MPI), each with its own process/thread organisation.]

Partitioned Global Address Space

!219

Page 220: Parallelization - XS4ALL Klantenservice

Shared Memory (OpenMP)

• Multiple threads share global memory
• Most common variant: OpenMP

• Program loop iterations distributed to threads, more recent task features
• Each thread has a means to refer to private objects within a parallel context

• Terminology
  • Thread, thread team

• Implementation
  • Threads map to user threads running on one SMP node
  • Extensions to multiple servers not so successful

!220

Page 221: Parallelization - XS4ALL Klantenservice

OpenMP

!221

[Diagram: several threads attached to one shared memory.]

Page 222: Parallelization - XS4ALL Klantenservice

OpenMP: work distribution

!222

[Diagram: four threads sharing one memory; the 32 loop iterations are distributed over the threads as 1-8, 9-16, 17-24 and 25-32.]

!$OMP PARALLEL DO
      do i=1,32
         a(i)=a(i)*2
      end do

Page 223: Parallelization - XS4ALL Klantenservice

OpenMP: implementation

!223

[Diagram: one process with multiple threads, mapped onto the CPUs of a shared-memory node.]

Page 224: Parallelization - XS4ALL Klantenservice

Message Passing (MPI)

• Participating processes communicate using a message-passing API

• Remote data can only be communicated (sent or received) via the API.

• MPI (the Message Passing Interface) is the standard

• Implementation:
  • MPI processes map to processes within one SMP node or across multiple networked nodes
  • API provides process numbering, point-to-point and collective messaging operations

!224

Page 225: Parallelization - XS4ALL Klantenservice

MPI

!225

[Diagram: several processes, each with its own memory and CPU, connected only by message passing.]

Page 226: Parallelization - XS4ALL Klantenservice

MPI

!226

[Diagram: process 0 and process 1, each with private memory and CPU; process 0 executes MPI_Send(a,...,1,…) and process 1 executes the matching MPI_Recv(a,...,0,…).]

Page 227: Parallelization - XS4ALL Klantenservice

Partitioned Global Address Space

• Shortened to PGAS

• Participating processes/threads have access to local memory via standard program mechanisms

• Access to remote memory is directly supported by the PGAS language

!227

Page 228: Parallelization - XS4ALL Klantenservice

PGAS

!228

[Diagram: several processes, each with local memory and CPU; together their memories form one partitioned global address space that every process can address.]

Page 229: Parallelization - XS4ALL Klantenservice

Partitioned Global Address Space (PGAS):

• Global address space – any process can address memory on any processor

• Partitioned GAS – retain information about locality

• Core idea – the hardest part of writing parallel code is managing data distribution and communication; make that simple and explicit

• PGAS languages try to simplify parallel programming (increase programmer productivity).


PGAS languages

Page 230: Parallelization - XS4ALL Klantenservice

Data Parallel Languages

• Unified Parallel C (UPC) is an extension of the C programming language designed for high performance computing on large-scale parallel machines. http://upc.lbl.gov/

• Co-array Fortran (CAF) is part of Fortran 2008 standard. It is a simple, explicit notation for data decomposition, such as that often used in message-passing models, expressed in a natural Fortran-like syntax. http://www.co-array.org

• both need a global address space (which is not equal to SMP)

!230

Page 231: Parallelization - XS4ALL Klantenservice

• Unified Parallel C:
  – An extension of C99
  – An evolution of AC, PCP, and Split-C

• Features
  – SPMD parallelism via replication of threads
  – Shared and private address spaces
  – Multiple memory consistency models

• Benefits
  – Global view of data
  – One-sided communication


UPC

Page 232: Parallelization - XS4ALL Klantenservice

• Co-Array Fortran:
  – An extension of Fortran 95 and part of "Fortran 2008"
  – The language formerly known as F--

• Features
  – SPMD parallelism via replication of images
  – Co-arrays for distributed shared data

• Benefits
  – Syntactically transparent communication
  – One-sided communication
  – Multi-dimensional arrays
  – Array operations


Co-Array Fortran

Page 233: Parallelization - XS4ALL Klantenservice

Basic execution model co-array F--

• Program executes as if replicated to multiple copies with each copy executing asynchronously (SPMD)

• Each copy (called an image) executes as a normal Fortran application

• New object indexing with [] can be used to access objects on other images.

• New features to inquire about image index, number of images and to synchronize

!233

Page 234: Parallelization - XS4ALL Klantenservice

CAF (F--)

REAL, DIMENSION(N)[*] :: X,Y
X(:) = Y(:)[Q]

Array indices in parentheses follow the normal Fortran rules within one memory image.

Array indices in square brackets provide an equally convenient notation for accessing objects across images and follow similar rules.

!234

parallel

Page 235: Parallelization - XS4ALL Klantenservice

Coarray execution model

!235

[Diagram: images 1, 2 and 3, each with its own memory and CPU; the coarrays in each image's memory can be accessed remotely with square-bracket indexing, e.g. a(:)[2].]

Page 236: Parallelization - XS4ALL Klantenservice

Basic coarray declaration and usage

!236

integer :: b
integer :: a(4)[*]   !coarray

[Diagram: image 1 has a = (1, 8, 1, 5) and b = 1; image 2 has a = (1, 7, 9, 9) and b = 3; image 3 has a = (1, 7, 9, 4) and b = 6. The statement to execute is:]

b = a(2)

Page 237: Parallelization - XS4ALL Klantenservice

Basic coarray declaration and usage

!237

integer :: b
integer :: a(4)[*]   !coarray

b = a(2)

[Diagram: after the statement each image has read its own local a(2): image 1 gets b = 8, image 2 gets b = 7, image 3 gets b = 7.]

Page 238: Parallelization - XS4ALL Klantenservice

Basic coarray declaration and usage

!238

integer :: b
integer :: a(4)[*]   !coarray

b = a(4)[3]

[Diagram: before the statement, image 1 has a = (1, 8, 1, 5) and b = 1; image 2 has a = (1, 7, 9, 9) and b = 3; image 3 has a = (1, 7, 9, 4) and b = 6.]

Page 239: Parallelization - XS4ALL Klantenservice

Basic coarray declaration and usage

!239

integer :: b
integer :: a(4)[*]   !coarray

b = a(4)[3]

[Diagram: after the statement every image has read a(4) from image 3, so b = 4 on image 1, image 2 and image 3; the arrays a are unchanged.]

Page 240: Parallelization - XS4ALL Klantenservice

!240

Page 241: Parallelization - XS4ALL Klantenservice


Performance tuning

[Plot: speedup S versus number of processors p — a first attempt, a second attempt, and the ideal curve.]

Performance tuning of parallel applications is an iterative process

Performance Tuning

!241

Page 242: Parallelization - XS4ALL Klantenservice

Compiling MPI Programs

!242

Page 243: Parallelization - XS4ALL Klantenservice

Compiling and Starting MPI Jobs

• Compiling:

• Need to link with appropriate MPI and communication subsystem libraries and set path to MPI Include files

• Most vendors provide scripts or wrappers for this (mpxlf, mpif77, mpicc, etc)

• Starting jobs:

• Most implementations use a special loader named mpirun

– mpirun -np <no_of_processors> <prog_name>

• In MPI-2 it is recommended to use

– mpiexec -n <no_of_processors> <prog_name>

!243

Page 244: Parallelization - XS4ALL Klantenservice

!244

MPICH: a Portable MPI Environment

• MPICH is a high-performance portable implementation of MPI (both 1 and 2).

• It runs on MPP's, clusters, and heterogeneous networks of workstations.

• The CH comes from Chameleon, the portability layer used in the original MPICH to provide portability to the existing message-passing systems.

• In a wide variety of environments, one can do:

   mpicc -mpitrace myprog.c
   mpirun -np 10 myprog
   upshot myprog.log

to build, compile, run, and analyze performance.

Page 245: Parallelization - XS4ALL Klantenservice

MPICH2

MPICH2 is an all-new implementation of the MPI Standard, designed to implement all of the MPI-2 additions to MPI.

‣ separation between process management and communications ‣ use daemons (mpd) on nodes ‣ dynamic process management, ‣ one-sided operations, ‣ parallel I/O, and others

• http://www.mcs.anl.gov/research/projects/mpich2/

!245

Page 246: Parallelization - XS4ALL Klantenservice

!246

Compiling MPI programs

• From a command line: mpicc -o prog prog.c

• Use profiling options (specific to mpich):
  • -mpilog     Generate log files of MPI calls
  • -mpitrace   Trace execution of MPI calls
  • -mpianim    Real-time animation of MPI (not available on all systems)
  • --help      Find list of available options

• The environment variables MPICH_CC, MPICH_CXX, MPICH_F77, and MPICH_F90 may be used to specify alternate C, C++, Fortran 77, and Fortran 90 compilers, respectively.

Page 247: Parallelization - XS4ALL Klantenservice

!247

example hello.c

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int myrank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    fprintf(stdout, "Hello World, I am process %d\n", myrank);
    MPI_Finalize();
    return 0;
}

Page 248: Parallelization - XS4ALL Klantenservice

!248

Example: using MPICH1

-bash-3.1$ mpicc -o hello hello.c
-bash-3.1$ mpirun -np 4 hello
Hello World, I am process 0
Hello World, I am process 2
Hello World, I am process 3
Hello World, I am process 1

Page 249: Parallelization - XS4ALL Klantenservice

!249

Example: details

• If using frontend and compute nodes in the machines file, use:
   mpirun -np 2 -machinefile machines hello

• If using only compute nodes in the machines file, use:
   mpirun -nolocal -np 2 -machinefile machines hello

• -nolocal — don't start the job on the frontend
• -np 2 — start 2 MPI processes
• -machinefile machines — the nodes are specified in the machines file
• hello — the program to start

Page 250: Parallelization - XS4ALL Klantenservice

!250

Notes on clusters

• Make sure you have access to the compute nodes (ssh keys are generated with ssh-keygen) and ask your system administrator.

• Check which mpicc you are using:
  $ which mpicc

• Command-line arguments are not always passed on by mpirun/mpiexec (depending on your version). In that case make a script which calls your program with all its arguments.

Page 251: Parallelization - XS4ALL Klantenservice

MPICH2 daemons

• mpdtrace: output a list of nodes on which you can run MPI programs (nodes running mpd daemons).
  ‣ The -l option lists full hostnames and the port where the mpd is listening.

• mpd starts an mpd daemon.

• mpdboot starts a set of mpd’s on a list of machines.

• mpdlistjobs lists the jobs that the mpd’s are running.

• mpdkilljob kills a job specified by the name returned by mpdlistjobs

• mpdsigjob delivers a signal to the named job.

!251

Page 252: Parallelization - XS4ALL Klantenservice

MPI - Message Passing Interface

• MPI or MPI-1 is a library specification for message-passing.

• MPI-2: Adds in Parallel I/O, Dynamic Process management, Remote Memory Operation, C++ & F90 extension …

• MPI Standard: http://www-unix.mcs.anl.gov/mpi/standard.html

• MPI Standard 1.1 Index: http://www.mpi-forum.org/docs/mpi-11-html/node182.html

• MPI-2 Standard Index: http://www-unix.mcs.anl.gov/mpi/mpi-standard/mpi-report-2.0/node306.htm

• MPI Forum Home Page: http://www.mpi-forum.org/index.html

Page 253: Parallelization - XS4ALL Klantenservice

MPI tutorials

• http://www.nccs.nasa.gov/tutorials/mpi_tutorial2/mpi_II_tutorial.html

• https://fs.hlrs.de/projects/par/par_prog_ws/

!253

Page 254: Parallelization - XS4ALL Klantenservice

!254

MPI Sources

• The Standard (3.0) itself:
  • at http://www.mpi-forum.org
  • All MPI official releases, in both PDF and HTML

• Books:
  – Using MPI: Portable Parallel Programming with the Message-Passing Interface, by Gropp, Lusk, and Skjellum, MIT Press, 1994.
  – Parallel Programming with MPI, by Peter Pacheco, Morgan-Kaufmann, 1997.

• Other information on the Web:
  • at http://www.mcs.anl.gov/mpi
  • pointers to lots of stuff, including other talks and tutorials, a FAQ, other MPI pages
  • http://mpi.deino.net/mpi_functions/index.htm

Page 255: Parallelization - XS4ALL Klantenservice

Job Management and queuing

!255

Page 256: Parallelization - XS4ALL Klantenservice

Job Management and queuing

• On a large system many users are running simultaneously.

• What to do when:

  • The system is full and you want to run your 512-CPU job?
  • You want to run 16 jobs — should others have to wait for that?
  • Should bigger jobs have priority over smaller jobs?
  • Should longer jobs have lower or higher priority?

• The job manager and queue system takes care of it.

!256

Page 257: Parallelization - XS4ALL Klantenservice

MPICH1 in PBS

#!/bin/bash
#PBS -N Flank
#PBS -V
#PBS -l nodes=11:ppn=2
#PBS -j eo

cd $PBS_O_WORKDIR
export nb=`wc -w < $PBS_NODEFILE`
echo $nb

mpirun -np $nb -machinefile $PBS_NODEFILE ~/bin/migr_mpi \
    file_vel=$PBS_O_WORKDIR/grad_salt_rot.su \
    file_shot=$PBS_O_WORKDIR/cfp2_sx.su

!257

Page 258: Parallelization - XS4ALL Klantenservice

Job Management and queuing

• PBS (maui): • http://www.clusterresources.com/pages/products/maui-cluster-scheduler.php • Torque resource manager http://www.clusterresources.com/pages/products/

torque-resource-manager.php

• Sun Grid Engine • http://gridengine.sunsource.net/

• SLURM • http://slurm.schedmd.com

• Luckily their interface is very similar (qsub, qstat, ...)

!258

Page 259: Parallelization - XS4ALL Klantenservice

queuing commands

• qsub

• qstat

• qdel

• xpbsmon

Page 260: Parallelization - XS4ALL Klantenservice

qsub

#PBS -l nodes=10:ppn=1
#PBS -l mem=20mb
#PBS -l walltime=1:00:00
#PBS -j eo

submit:

   qsub -q normal job.scr

output:

   jobname.ejobid
   jobname.ojobid

Page 261: Parallelization - XS4ALL Klantenservice

qstat

• available queue and resources

qstat -q

• queued and running jobs

qstat (-a)

!261

Page 262: Parallelization - XS4ALL Klantenservice

qdel

• deletes job from queue and stops all running executables

qdel jobid

!262

Page 263: Parallelization - XS4ALL Klantenservice

Submitting many jobs in one script

#!/bin/bash -f
#
export xsrc1=93
export xsrc2=1599
export dxsrc=3
xsrc=$xsrc1

while (( xsrc <= xsrc2 ))
do
echo ' modeling shot at x=' $xsrc

cat << EOF > jobs/pbs_${xsrc}.job
#!/bin/bash
#
#PBS -N fahud_${xsrc}
#PBS -q verylong
#PBS -l nodes=1:ppn=1
#PBS -V

program with arguments

EOF

qsub jobs/pbs_${xsrc}.job
(( xsrc = $xsrc + $dxsrc ))
done

!263

Be careful !

Page 264: Parallelization - XS4ALL Klantenservice

!264

Page 265: Parallelization - XS4ALL Klantenservice

Exercise: PI with MPI and OpenMP

• Compile and run the code for computing PI in parallel

• Download code from: http://www.xs4all.nl/~janth/HPCourse/code/MPI_pi.tar

• Unpack tar file and check the README for instructions.

• It is assumed that you use mpich1 and have PBS installed as a job scheduler. If you have mpich2, let me know.

• Use qsub to submit the mpi job (an example script is provided) to a queue.

!265

Page 266: Parallelization - XS4ALL Klantenservice

Exercise: PI with MPI and OpenMP

!266

cores   OpenMP      marken (MPI)   linux (MPI)
  1     9.617728    14.10798       22.15252
  2     4.874539     7.071287       9.661745
  4     2.455036     3.532871       5.730912
  6     1.627149     2.356928       3.547961
  8     1.214713     1.832055       2.804715
 12     0.820746     1.184123
 16     0.616482     0.955704

Page 267: Parallelization - XS4ALL Klantenservice

[Plot: Pi Scaling — speedup versus number of CPUs (1 to 16) for Amdahl 1.0, OpenMP on Barcelona 2.2 GHz, MPI on marken (Xeon 3.1 GHz) and MPI on linux (Xeon 2.4 GHz).]

Exercise: PI with MPI and OpenMP

!267

Page 268: Parallelization - XS4ALL Klantenservice

Exercise: OpenMP Max

• Compile and run for computing PI parallel

• Download code from: http://www.xs4all.nl/~janth/HPCourse/code/MPI_pi.tar

• Unpack tar file and check the README for instructions.

• The exercise focuses on using the available number of cores in an efficient way.

• Also inspect the code and see how the reductions are done, is there another way of doing the reductions?

!268

Page 269: Parallelization - XS4ALL Klantenservice

Exercise: OpenMP details

• More details on using OpenMP and shared-memory parallelisation

• Download code from: http://www.xs4all.nl/~janth/HPCourse/code/OpenMP.tar

• Unpack tar file and check the README for instructions.

• These are ‘old’ exercises from SGI and give insight into problems you can encounter using OpenMP.

• They already require some knowledge of OpenMP. The OpenMP F-Summary.pdf from the website can be helpful.

!269

Page 270: Parallelization - XS4ALL Klantenservice

!270

Extend the Exercises

• Modify the pi program to use send/receive instead of bcast/reduce.

• Write a program that sends a message around a ring. That is, process 0 reads a line from the terminal and sends it to process 1, who sends it to process 2, etc. The last process sends it back to process 0, who prints it.

• Time programs with MPI_WTIME().
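A minimal sketch of the timing idea (illustration only): bracket the section you want to measure with MPI_Wtime(); MPI_Wtick() gives the timer resolution.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    /* ... the work you want to time goes here ... */
    MPI_Barrier(MPI_COMM_WORLD);       /* optional: measure the slowest process */
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("elapsed: %f s (resolution %g s)\n", t1 - t0, MPI_Wtick());
    MPI_Finalize();
    return 0;
}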

Page 271: Parallelization - XS4ALL Klantenservice

END

!271