Parallelization programming

Transcript of the lecture slides "Parallelization programming"

Page 1: Parallelization - XS4ALL Klantenservice


Challenge the future

Delft University of Technology

Parallelization programming

Jan Thorbecke

Page 2: Parallelization - XS4ALL Klantenservice

!2

Contents

• Introduction • Programming

• OpenMP • MPI

• Issues with parallel computing • Running programs • Examples

Page 3: Parallelization - XS4ALL Klantenservice

• Describes the relation between the parallel portion of your code and the expected speedup

• P = parallel portion • N = number of processors used in parallel part

• The formula gives the ideal parallel speed-up; in practice the achieved speed-up will always be less

Amdahl’s Law

!3

speedup = 1 / ((1 - P) + P/N)
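As a quick illustration (a minimal sketch of my own, not from the slides), the formula can be evaluated directly; for P = 0.99 and N = 16 the speed-up is only about 13.9:

  #include <stdio.h>

  /* Amdahl's law: speedup = 1 / ((1 - P) + P/N) */
  double amdahl_speedup(double P, int N)
  {
      return 1.0 / ((1.0 - P) + P / (double) N);
  }

  int main(void)
  {
      printf("P = 0.99, N = 16  ->  speedup = %.2f\n", amdahl_speedup(0.99, 16));
      return 0;
  }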

Page 4: Parallelization - XS4ALL Klantenservice

(Figure: speed-up versus number of processors (0-60) for parallel fractions P = 1.0, 0.99 and 0.98.)

Amdahl’s Law

!4

Page 5: Parallelization - XS4ALL Klantenservice

(Figure: maximum speed-up as a function of the sequential portion in % (1-90); the maximum speed-up drops rapidly as the sequential portion grows.)

Amdahl’s Law

!5

Page 6: Parallelization - XS4ALL Klantenservice

(Figure: speed-up versus number of processors (0-60) for P = 1.0, 0.99 and 0.96, together with a super-linear speed-up curve that lies above the ideal line.)

Super Linear speed up

!6

Page 7: Parallelization - XS4ALL Klantenservice

Concurrency and Parallelism

• Concurrency and parallelism are often used synonymously.

Concurrency: the independence of parts of an algorithm (they do not depend on each other).

Parallelism (also parallel execution): two or more parts of a program are executed at the same moment in time.

Concurrency is a necessary prerequisite for parallel execution, but parallel execution is only one possible consequence of concurrency.

!7

Page 8: Parallelization - XS4ALL Klantenservice

Concurrency vs. Parallelism

• Concurrency: two or more threads are in progress at the same time:

• Parallelism: two or more threads are executing at the same time

Multiple cores needed

(Figure: with concurrency, Thread 1 and Thread 2 are interleaved on one core; with parallelism they execute simultaneously on separate cores.)

Page 9: Parallelization - XS4ALL Klantenservice

Learning

Classical CPUs are sequential.

An enormous amount of sequential-programming knowledge is built into compilers and known by most programmers.

Parallel programming requires new skills and new tools.

Start by parallelising simple problems and keep learning along the way towards complex real-world problems.

!9

Page 10: Parallelization - XS4ALL Klantenservice

Recognizing Sequential Processes

• Time is inherently sequential
  • Dynamic, real-time, event-driven applications are often difficult to parallelize effectively
  • Many games fall into this category

• Iterative processes
  • The results of an iteration depend on the preceding iteration
  • Audio encoders fall into this category

Page 11: Parallelization - XS4ALL Klantenservice

Parallel Programming Models

• Parallel programming models exist as an abstraction above hardware and memory architectures.

• Which model to use is often a combination of what is available and personal choice. There is no "best" model, although there certainly are better implementations of some models over others.

Page 12: Parallelization - XS4ALL Klantenservice

Parallel Programming Models

• Shared Memory
  • tasks share a common address space, which they read and write asynchronously.

• Threads (functional)
  • a single process can have multiple, concurrent execution paths.
    Example implementations: POSIX threads & OpenMP

• Message Passing
  • tasks exchange data through communications by sending and receiving messages.
    Example: MPI-2 specification.

• Data Parallel languages
  • tasks perform the same operation on their partition of work.
    Example: Co-array Fortran (CAF), Unified Parallel C (UPC), Chapel

• Hybrid

Page 13: Parallelization - XS4ALL Klantenservice

Programming Models

!13

(Diagram: the programming model (shared memory, message passing, or hybrid) sits between the user, whether new to parallel programming or experienced, and the system software (operating system, compilers), which in turn runs on the hardware layer: shared-memory or distributed-memory nodes connected by an interconnect.)

Page 14: Parallelization - XS4ALL Klantenservice

Parallel Programming Concepts

• Work distribution
  • Which parallel task is doing what?
  • Which data is used by which task?

• Synchronization
  • Do the different parallel tasks meet?

• Communication
  • Is communication between parallel parts needed?

• Load Balance
  • Does every task have the same amount of work?
  • Are all processors of the same speed?

!14

Page 15: Parallelization - XS4ALL Klantenservice

• Work decomposition
  • based on the loop counter

• Data decomposition
  • all work for a local portion of the data is done by the local processor

• Domain decomposition
  • decomposition of both work and data

Distributing Work and/or Data

!15

Work decomposition of do i=1,100 over 4 processors:   1: i=1,25   2: i=26,50   3: i=51,75   4: i=76,100

Data decomposition of a 20 x 50 array:   A(1:10,1:25)   A(1:10,26:50)   A(11:20,1:25)   A(11:20,26:50)

Page 16: Parallelization - XS4ALL Klantenservice

Synchronization

• Synchronization
  • causes overhead
  • idle time, when not all tasks are finished at the same time

!16


• Synchronization
  • is necessary
  • may cause:
    • idle time on some processors
    • overhead to execute the synchronization primitive

Example: a BARRIER synchronization between two loops, each distributed over 4 processors (i=1..25 | 26..50 | 51..75 | 76..100):

  Do i=1,100
    a(i) = b(i)+c(i)
  Enddo
                        <-- BARRIER synchronization
  Do i=1,100
    d(i) = 2*a(101-i)
  Enddo

Page 17: Parallelization - XS4ALL Klantenservice

Communication

• communication is necessary on the boundaries

• domain decomposition

!17

  do i=2,99
    b(i) = b(i) + h*(a(i-1)-2*a(i)+a(i+1))
  end do

e.g. b(26) = b(26) + h*(a(25)-2*a(26)+a(27))

A(1:25)   A(26:50)   A(51:75)   A(76:100)


Replication versus Communication

• Normally replicate the values
  – Consider how many calculations you can execute while sending only 1 bit from one process to another (at 6 µs latency and 1.0 Gflop/s that is about 6,000 operations)
  – Sending 16 kByte (20x20x5 doubles) at 300 MB/s bandwidth takes 53.3 µs, about 53,300 operations
  – Very often blocks have to wait for their neighbours
  – But the extra (replicated) work limits parallel efficiency

• Communication should only be used if one is quite sure that it is the best solution

(Figures: 2-dimensional domain decomposition with two halo cells; mesh partitioning with one subdomain per process.)

Page 18: Parallelization - XS4ALL Klantenservice


Unstructured Grids

• Mesh partitioning with special load-balancing libraries
  – Metis (George Karypis, University of Minnesota)
  – ParMetis (internally parallel version of Metis)
    • http://www.cs.umn.edu/~karypis/metis/metis.html
    • http://www.hlrs.de/organization/par/services/tools/loadbalancer/metis.html
  – Jostle (Chris Walshaw, University of Greenwich)
    • http://www.gre.ac.uk/jostle
    • http://www.hlrs.de/organization/par/services/tools/loadbalancer/jostle.html
  – Goals:
    • Same work load in each sub-domain
    • Minimizing the maximal number of neighbour connections between sub-domains

(Figure: example partitioning of an unstructured mesh into sub-domains, grid points numbered 0-23.)


Halo

• Stencil:
  – To calculate a new grid point, old data from the stencil grid points are needed
  – E.g., a 9-point stencil

• Halo
  – To calculate the new grid points of a sub-domain, additional grid points from other sub-domains are needed.
  – They are stored in halos (ghost cells, shadows)
  – The halo depends on the form of the stencil

Load Imbalance

• Load imbalance is the time that some processors in the system are idle due to:
  • less parallelism than processors
  • unequal sized tasks together with too little parallelism
  • unequal processors

!18

Page 19: Parallelization - XS4ALL Klantenservice

Examples of work distribution

• Domain Decomposition
• Master-Slave
• Task Decomposition

!19

Page 20: Parallelization - XS4ALL Klantenservice

Domain Decomposition

• First, decide how data elements should be divided among processors

• Second, decide which tasks each processor should be doing

Page 21: Parallelization - XS4ALL Klantenservice

Domain Decomposition

5 13 1 9 26 13 6 49 34 50 22 12 68 12 98 16 78 31 83 51 94 27 74 64

Find the largest element of an array

Page 22: Parallelization - XS4ALL Klantenservice

5 13 1 9 26 13 6 49 34 50 22 12 68 12 98 16 78 31 83 51 94 27 74 64

Domain DecompositionFind the largest element of an array

Core 0 Core 1 Core 2 Core 3

Page 23: Parallelization - XS4ALL Klantenservice

Domain DecompositionFind the largest element of an array

5 6 68 83

Core 0 Core 1 Core 2 Core 3

5 13 1 9 26 13 6 49 34 50 22 12 68 12 98 16 78 31 83 51 94 27 74 64

Page 24: Parallelization - XS4ALL Klantenservice

5 13 1 9 26 13 6 49 34 50 22 12 68 12 98 16 78 31 83 51 94 27 74 64

Domain DecompositionFind the largest element of an array

Core 0 Core 1 Core 2 Core 3

13 49 12 51

13 49 68 83

Page 25: Parallelization - XS4ALL Klantenservice

Domain DecompositionFind the largest element of an array

5 13 1 9 26 13 6 49 34 50 22 12 68 12 98 16 78 31 83 51 94 27 74 64

Core 0 Core 1 Core 2 Core 3

1 34 98 94

13 49 98 94

Page 26: Parallelization - XS4ALL Klantenservice

Domain DecompositionFind the largest element of an array

5 13 1 9 26 13 6 49 34 50 22 12 68 12 98 16 78 31 83 51 94 27 74 64

Core 0 Core 1 Core 2 Core 3

9 50 16 27

13 50 98 94

Page 27: Parallelization - XS4ALL Klantenservice

Domain DecompositionFind the largest element of an array

5 13 1 9 26 13 6 49 34 50 22 12 68 12 98 16 78 31 83 51 94 27 74 64

Core 0 Core 1 Core 2 Core 3

26 22 78 74

26 50 98 94

Page 28: Parallelization - XS4ALL Klantenservice

5 13 1 9 26 13 6 49 34 50 22 12 68 12 98 16 78 31 83 51 94 27 74 64

Domain DecompositionFind the largest element of an array

Core 0 Core 1 Core 2 Core 3

26 50 98 94

13 12 31 64

Page 29: Parallelization - XS4ALL Klantenservice

Domain DecompositionFind the largest element of an array

26

5 13 1 9 26 13 6 49 34 50 22 12 68 12 98 16 78 31 83 51 94 27 74 64

Core 0 Core 1 Core 2 Core 3

26 50 98 94

Page 30: Parallelization - XS4ALL Klantenservice

Domain DecompositionFind the largest element of an array

5 13 1 9 26 13 6 49 34 50 22 12 68 12 98 16 78 31 83 51 94 27 74 64

Core 0 Core 1 Core 2 Core 3

26 50 98 94

50

Page 31: Parallelization - XS4ALL Klantenservice

Domain DecompositionFind the largest element of an array

5 13 1 9 26 13 6 49 34 50 22 12 68 12 98 16 78 31 83 51 94 27 74 64

Core 0 Core 1 Core 2 Core 3

26 50 98 94

98

Page 32: Parallelization - XS4ALL Klantenservice

Domain DecompositionFind the largest element of an array

5 13 1 9 26 13 6 49 34 50 22 12 68 12 98 16 78 31 83 51 94 27 74 64

Core 0 Core 1 Core 2 Core 3

26 50 98 94

98
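A minimal OpenMP sketch of this reduction strategy (my own illustration, not code from the slides; the max reduction needs OpenMP 3.1 or newer): each thread scans its block of the array and the partial maxima are combined at the end.

  #include <stdio.h>

  #define N 24

  int main(void)
  {
      int a[N] = {  5, 13,  1,  9, 26, 13,  6, 49, 34, 50, 22, 12,
                   68, 12, 98, 16, 78, 31, 83, 51, 94, 27, 74, 64 };
      int i, vmax = a[0];

      /* each thread keeps a private running maximum of its part of the array;
         OpenMP combines the partial maxima at the end of the loop */
      #pragma omp parallel for reduction(max:vmax)
      for (i = 1; i < N; i++)
          if (a[i] > vmax) vmax = a[i];

      printf("largest element = %d\n", vmax);   /* prints 98 */
      return 0;
  }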

Page 33: Parallelization - XS4ALL Klantenservice

Master-Slave

!33

Get a heap of work done

(Figure: Core 0 acts as master and hands out tasks T to the slave cores 1, 2 and 3.)

Page 34: Parallelization - XS4ALL Klantenservice

• The problem is decomposed according to the work that must be done. Each task then performs a portion of the overall work.

• Divide the computation based on a natural set of independent tasks
  ‣ Assign data to each task as needed

• Example: a seismic data pre-processing pipeline
  • static correction
  • deconvolution
  • NMO correction
  • stacking
  • ....

Functional Decomposition

Page 35: Parallelization - XS4ALL Klantenservice

Task/Functional Decomposition

(Figure: dependency graph of the tasks f(), g(), h(), q(), r(), s().)

Page 36: Parallelization - XS4ALL Klantenservice

Task Decomposition

(Figure: the tasks f(), g(), h(), q(), r(), s() from the dependency graph are assigned to Core 0, Core 1 and Core 2.)

Page 37: Parallelization - XS4ALL Klantenservice


Page 38: Parallelization - XS4ALL Klantenservice


Page 39: Parallelization - XS4ALL Klantenservice


Page 40: Parallelization - XS4ALL Klantenservice


Page 41: Parallelization - XS4ALL Klantenservice

Shared Memory Parallelism

• Introduction to Threads
  • Exercise: Race condition

• OpenMP Programming Model
  • Scope of Variables: Exercise 1
  • Synchronisation: Exercise 2

• Scheduling
  • Exercise: OpenMP scheduling

• Reduction
  • Exercise: Pi

• Shared variables
  • Exercise: CacheTrash

• Tasks
• Future of OpenMP

!41

Page 42: Parallelization - XS4ALL Klantenservice

Threads: “processes” sharing memory

• Process == address space • Thread == program counter / stream of instructions • Two examples

• Three processes, each with one thread • One process with three threads

!42

(Figure: system space with the kernel; user space with three processes that have one thread each, and one process with three threads.)

Page 43: Parallelization - XS4ALL Klantenservice

The Shared-Memory Model

(Figure: several cores, each with its own private memory, all connected to a common shared memory.)

Page 44: Parallelization - XS4ALL Klantenservice

What Are Threads Good For?

Making programs easier to understand

Overlapping computation and I/O

Improving responsiveness of GUIs

Improving performance through parallel execution ‣ with the help of OpenMP

Page 45: Parallelization - XS4ALL Klantenservice

Fork/Join Programming Model

• When program begins execution, only master thread active

• Master thread executes sequential portions of program

• For parallel portions of program, master thread forks (creates or awakens) additional threads

• At join (end of parallel section of code), extra threads are suspended or die

Page 46: Parallelization - XS4ALL Klantenservice

Relating Fork/Join to Code (OpenMP)

(Figure: the program alternates between sequential code, a parallel for loop, sequential code, another parallel for loop, and sequential code again.)
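A minimal C sketch (my own, not from the slides) of how the fork/join picture maps onto OpenMP code: the master thread runs the sequential parts, and each parallel for region forks a team of threads that joins again at its end.

  #include <stdio.h>

  #define N 1000

  int main(void)
  {
      double a[N], b[N];
      int i;

      /* sequential code: master thread only */
      for (i = 0; i < N; i++) a[i] = i;

      /* parallel code: a team of threads is forked, loop iterations are shared */
      #pragma omp parallel for
      for (i = 0; i < N; i++) b[i] = 2.0 * a[i];

      /* implicit join: back to the master thread */
      printf("b[%d] = %f\n", N - 1, b[N - 1]);

      /* second parallel region: fork again */
      #pragma omp parallel for
      for (i = 0; i < N; i++) a[i] = a[i] + b[i];

      /* sequential code again */
      printf("a[%d] = %f\n", N - 1, a[N - 1]);
      return 0;
  }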

Page 47: Parallelization - XS4ALL Klantenservice

Domain Decomposition Using Threads

(Figure: threads 0, 1 and 2 each run f() on their own part of the data in shared memory.)

Page 48: Parallelization - XS4ALL Klantenservice

Task Decomposition Using Threads

(Figure: thread 0 and thread 1 execute the tasks e(), f(), g() and h(), all operating on shared memory.)

Page 49: Parallelization - XS4ALL Klantenservice

Shared versus Private Variables

(Figure: each thread has its own private variables, and both threads can access the shared variables.)

Page 50: Parallelization - XS4ALL Klantenservice

Parallel threads can “race” against each other to update resources

Race conditions occur when execution order is assumed but not guaranteed

Example: un-synchronised access to bank account

Race Conditions

Deposits $100 into account

Withdraws $100 from account

Initial balance = $1000 Final balance = ?

Page 51: Parallelization - XS4ALL Klantenservice

Race Conditions

Deposits $100 into account

Withdraws $100 from account

Initial balance = $1000 Final balance = ?

Time   Withdrawal                   Deposit
T0     Load (balance = $1000)
T1     Subtract $100                Load (balance = $1000)
T2     Store (balance = $900)       Add $100
T3                                  Store (balance = $1100)
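A small OpenMP sketch of the same interleaving problem (my own illustration): two threads update the balance without synchronization, so the load/modify/store sequences can overlap exactly as in the table above.

  #include <stdio.h>

  int main(void)
  {
      int balance = 1000;

      /* two "transactions" race on the shared balance */
      #pragma omp parallel sections shared(balance)
      {
          #pragma omp section
          balance = balance - 100;   /* withdrawal: load, subtract, store */

          #pragma omp section
          balance = balance + 100;   /* deposit: load, add, store */
      }
      /* expected 1000, but 900 or 1100 are possible if the updates interleave */
      printf("final balance = %d\n", balance);
      return 0;
  }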

Page 52: Parallelization - XS4ALL Klantenservice

Code Example in OpenMP exercise: RaceCondition

for (i=0; i<NMAX; i++) {
  a[i] = 1; b[i] = 2;
}
#pragma omp parallel for shared(a,b)
for (i=0; i<12; i++) {
  a[i+1] = a[i]+b[i];
}

1 thread:  a= 1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0, 17.0, 19.0, 21.0, 23.0
4 threads: a= 1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0,  3.0,  5.0,  7.0,  9.0, 11.0
4 threads: a= 1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0, 17.0, 19.0, 21.0, 23.0
4 threads: a= 1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0, 17.0, 19.0, 21.0, 23.0

!52

Page 53: Parallelization - XS4ALL Klantenservice

Code Example in OpenMP

thread   computation
0        a[1]  = a[0]  + b[0]
0        a[2]  = a[1]  + b[1]
0        a[3]  = a[2]  + b[2]    <--| Problem
1        a[4]  = a[3]  + b[3]    <--| Problem
1        a[5]  = a[4]  + b[4]
1        a[6]  = a[5]  + b[5]    <--| Problem
2        a[7]  = a[6]  + b[6]    <--| Problem
2        a[8]  = a[7]  + b[7]
2        a[9]  = a[8]  + b[8]    <--| Problem
3        a[10] = a[9]  + b[9]    <--| Problem
3        a[11] = a[10] + b[10]

!53

Page 54: Parallelization - XS4ALL Klantenservice

How to Avoid Data Races

• Scope variables to be local to threads
  • Variables declared within threaded functions
  • Allocate on the thread's stack
  • TLS (Thread Local Storage)

• Control shared access with critical regions
  • Mutual exclusion and synchronization
  • Lock, semaphore, event, critical section, mutex, ...

Page 55: Parallelization - XS4ALL Klantenservice

Examples: variables

!55

Page 56: Parallelization - XS4ALL Klantenservice

Domain Decomposition

Sequential Code:

int a[1000], i; for (i = 0; i < 1000; i++) a[i] = foo(i);

Page 57: Parallelization - XS4ALL Klantenservice

Domain Decomposition

Sequential Code:

int a[1000], i; for (i = 0; i < 1000; i++) a[i] = foo(i);

Thread 0: for (i = 0; i < 500; i++) a[i] = foo(i);

Thread 1: for (i = 500; i < 1000; i++) a[i] = foo(i);

Page 58: Parallelization - XS4ALL Klantenservice

Domain Decomposition

Sequential Code:

int a[1000], i; for (i = 0; i < 1000; i++) a[i] = foo(i);

Thread 0: for (i = 0; i < 500; i++) a[i] = foo(i);

Thread 1: for (i = 500; i < 1000; i++) a[i] = foo(i);

Shared: a    Private: i (each thread has its own loop index)

Page 59: Parallelization - XS4ALL Klantenservice

Task Decompositionint e;

main () { int x[10], j, k, m; j = f(k); m = g(k); ... }

int f(int *x, int k) { int a; a = e * x[k] * x[k]; return a; }

int g(int *x, int k) { int a; k = k-1; a = e / x[k]; return a; }

Page 60: Parallelization - XS4ALL Klantenservice

Task Decompositionint e;

main () { int x[10], j, k, m; j = f(k); m = g(k); }

int f(int *x, int k) { int a; a = e * x[k] * x[k]; return a; }

int g(int *x, int k) { int a; k = k-1; a = e / x[k]; return a; }

Thread 0

Thread 1

Page 61: Parallelization - XS4ALL Klantenservice

Task Decompositionint e;

main () { int x[10], j, k, m; j = f(k); m = g(k); }

int f(int *x, int k) { int a; a = e * x[k] * x[k]; return a; }

int g(int *x, int k) { int a; k = k-1; a = e / x[k]; return a; }

Thread 0

Thread 1

Static variable: Shared

Page 62: Parallelization - XS4ALL Klantenservice

Task Decompositionint e;

main () { int x[10], j, k, m; j = f(x, k); m = g(x, k); }

int f(int *x, int k) { int a; a = e * x[k] * x[k]; return a; }

int g(int *x, int k) { int a; k = k-1; a = e / x[k]; return a; }

Thread 0

Thread 1

Heap variable: Shared

Page 63: Parallelization - XS4ALL Klantenservice

Task Decompositionint e;

main () { int x[10], j, k, m; j = f(k); m = g(k); }

int f(int *x, int k) { int a; a = e * x[k] * x[k]; return a; }

int g(int *x, int k) { int a; k = k-1; a = e / x[k]; return a; }

Thread 0

Thread 1

Function’s local variables: Private

Page 64: Parallelization - XS4ALL Klantenservice

Shared and Private Variables

• Shared variables • Static variables • Heap variables • Contents of run-time stack at time of call

• Private variables • Loop index variables • Run-time stack of functions invoked by thread

Page 65: Parallelization - XS4ALL Klantenservice

!65

Page 66: Parallelization - XS4ALL Klantenservice


What Is OpenMP?

• Compiler directives for multithreaded programming

• Easy to create threaded Fortran and C/C++ codes

• Supports data parallelism model

• Portable and Standard

• Incremental parallelism ➡Combines serial and parallel code in single source

Page 67: Parallelization - XS4ALL Klantenservice

OpenMP is not ...

Not automatic parallelization

– User explicitly specifies parallel execution

– Compiler does not ignore user directives, even if they are wrong

Not just loop-level parallelism

– Functionality to enable general parallelism

Not a new language

– Structured as extensions to the base language

– Minimal functionality with opportunities for extension

!67

Page 68: Parallelization - XS4ALL Klantenservice

Directive based

• Directives are special comments in the language

– Fortran fixed form: !$OMP, C$OMP, *$OMP

– Fortran free form: !$OMP

Special comments are interpreted by OpenMP compilers

w = 1.0/n

sum = 0.0

!$OMP PARALLEL DO PRIVATE(x) REDUCTION(+:sum)

do I=1,n

x = w*(I-0.5)

sum = sum + f(x)

end do

pi = w*sum

print *,pi

end

!68

Comment in Fortran but interpreted by OpenMP compilers

Page 69: Parallelization - XS4ALL Klantenservice

C example

#pragma omp directives in C

– Ignored by non-OpenMP compilers

w = 1.0/n;
sum = 0.0;
#pragma omp parallel for private(x) reduction(+:sum)
for (i=0; i<n; i++) {
  x = w*((double)i+0.5);
  sum += f(x);
}
pi = w*sum;
printf("pi=%g\n", pi);

!69

Page 70: Parallelization - XS4ALL Klantenservice

Architecture of OpenMP

Directives, Pragmas
  • Control structures
  • Work sharing
  • Synchronization
  • Data scope attributes
    • private
    • shared
    • reduction
  • Orphaning

Runtime library routines
  • Control & query routines
    • number of threads
    • throughput mode
    • nested parallelism
  • Lock API

Environment variables
  • Control runtime
    • schedule type
    • max threads
    • nested parallelism
    • throughput mode

!70

Page 71: Parallelization - XS4ALL Klantenservice

Programming Model

• Fork-join parallelism:
  ‣ The master thread spawns a team of threads as needed
  ‣ Parallelism is added incrementally: the sequential program evolves into a parallel program

(Figure: the master thread forks a team of threads for each parallel region and continues alone in between.)

!71

Page 72: Parallelization - XS4ALL Klantenservice

Threads are assigned an independent set of iterations

Threads must wait at the end of work-sharing construct

(Figure: #pragma omp parallel forks three threads T; #pragma omp for distributes the iterations i = 0..11 over them; an implicit barrier follows the loop.)

Work-sharing Construct

#pragma omp parallel
#pragma omp for
for (i = 0; i < 12; i++)
    c[i] = a[i] + b[i];

!72

Page 73: Parallelization - XS4ALL Klantenservice

Combining pragmas

These two code segments are equivalent

#pragma omp parallel
{
  #pragma omp for
  for (i=0; i<MAX; i++) {
    res[i] = huge();
  }
}

#pragma omp parallel for
for (i=0; i<MAX; i++) {
  res[i] = huge();
}

!73

Page 74: Parallelization - XS4ALL Klantenservice


Example - Matrix times vector

(Figure: a = B * c with two threads; TID 0 computes rows i = 0..4 and TID 1 computes rows i = 5..9; for each row i the thread computes sum = Σ_j b[i][j]*c[j] and stores a[i] = sum.)

Matrix-vector example

!74
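A cleaned-up C sketch of that loop (my own reconstruction, assuming the usual row-wise distribution and a double-pointer matrix; not verbatim from the IWOMP slide): the outer loop over the rows i is shared among the threads, and each thread computes complete elements of a.

  /* a[i] = sum over j of b[i][j] * c[j], rows distributed over the threads */
  void matvec(int m, int n, double *a, double **b, double *c)
  {
      int i, j;

      #pragma omp parallel for private(j)
      for (i = 0; i < m; i++) {
          double sum = 0.0;              /* private: declared inside the loop */
          for (j = 0; j < n; j++)
              sum += b[i][j] * c[j];
          a[i] = sum;
      }
  }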

Page 75: Parallelization - XS4ALL Klantenservice


OpenMP Performance Example

(Figure: performance in Mflop/s versus memory footprint in KByte for the matrix-vector product; the OpenMP version scales for large matrices, but performance degrades when the matrix is too small.*)

*) With the IF-clause in OpenMP this performance degradation can be avoided

Performance is matrix size dependent

!75

Page 76: Parallelization - XS4ALL Klantenservice

OpenMP: EXERCISE_1

• Matrix-matrix multiplication

• Learn to use the compiler and insert OpenMP directives

• Modern compilers can be very ‘smart’ in optimising code

!76

Page 77: Parallelization - XS4ALL Klantenservice

OpenMP parallelization

• OpenMP Team := Master + Workers
• A Parallel Region is a block of code executed by all threads simultaneously
  • The master thread always has thread ID 0
  • Thread adjustment (if enabled) is only done before entering a parallel region
  • Parallel regions can be nested, but support for this is implementation dependent
  • An "if" clause can be used to guard the parallel region; in case the condition evaluates to "false", the code is executed serially
• A work-sharing construct divides the execution of the enclosed code region among the members of the team; in other words: they split up the work

!77

Page 78: Parallelization - XS4ALL Klantenservice

Data Environment

• OpenMP uses a shared-memory programming model

• Most variables are shared by default.

• Global variables are shared among threads (C/C++: file-scope variables, static variables)

• Not everything is shared, there is often a need for “local” data as well

�78

Page 79: Parallelization - XS4ALL Klantenservice

... not everything is shared...

• Stack variables in functions called from parallel regions are PRIVATE

• Automatic variables within a statement block are PRIVATE

• Loop index variables are private (with exceptions)
  • C/C++: the first loop index variable in nested loops following a #pragma omp for

Data Environment

�79

Page 80: Parallelization - XS4ALL Klantenservice

About Variables in SMP

• Shared variables: can be accessed by every thread. Independent read/write operations can take place.

• Private variables: every thread has its own copy of the variables, which are created/destroyed upon entering/leaving the procedure. They are not visible to other threads.

!80

serial code     global     auto (local)    static           dynamic
parallel code   shared     local           use with care    use with care

Page 81: Parallelization - XS4ALL Klantenservice

Data Scope attribute clauses

• default(shared)
• shared(varname, ...)
• private(varname, ...)

!81

Page 82: Parallelization - XS4ALL Klantenservice

The Private Clause

Reproduces the variable for each thread
• Variables are un-initialised; a C++ object is default-constructed
• Any value external to the parallel region is undefined

void work(float *c, int N)
{
  float x, y;
  int i;
  #pragma omp parallel for private(x,y)
  for (i=0; i<N; i++) {
    x = a[i];          /* a[] and b[] assumed to be global arrays here */
    y = b[i];
    c[i] = x + y;
  }
}

�82

Page 83: Parallelization - XS4ALL Klantenservice

Synchronization

• Barriers

• Critical sections

• Lock library routines

!83

#pragma omp barrier

#pragma omp critical [(name)]

omp_set_lock(omp_lock_t *lock)

omp_unset_lock(omp_lock_t *lock)

....

Page 84: Parallelization - XS4ALL Klantenservice

#pragma omp critical [(lock_name)]

Defines a critical region on a structured block

OpenMP Critical Construct

float R1, R2;
#pragma omp parallel
{
  float A, B;
  #pragma omp for
  for (int i=0; i<niters; i++) {
    B = big_job(i);
    #pragma omp critical (R1_lock)
    consum(B, &R1);
    A = bigger_job(i);
    #pragma omp critical (R2_lock)
    consum(A, &R2);
  }
}

All threads execute the code, but only one at a time executes a critical region; only one thread at a time calls consum(), thereby protecting R1 and R2 from race conditions. Naming the critical constructs is optional, but may increase performance.

!84

Page 85: Parallelization - XS4ALL Klantenservice

OpenMP Critical

!85


OpenMP critical directive: Explicit Synchronization

• Race conditions can be avoided by controlling access to shared variables by allowing threads to have exclusive access to the variables

• Exclusive access to shared variables allows the thread to atomically perform read, modify and update operations on the variable.

• Mutual exclusion synchronization is provided by the critical directive of OpenMP

• Code block within the critical region defined by critical /end critical directives can be executed only by one thread at a time.

• Other threads in the group must wait until the current thread exits the critical region. Thus only one thread can manipulate values in the critical region.

(Figure: the threads fork, each passes one at a time through the critical region, then they join.)

int x;
x = 0;
#pragma omp parallel shared(x)
{
  #pragma omp critical
  x = 2*x + 1;
} /* omp end parallel */

All threads execute the code, but only one at a time executes the critical region.

Page 86: Parallelization - XS4ALL Klantenservice


Simple Example: critical

cnt = 0;
f = 7;
#pragma omp parallel
{
  #pragma omp for
  for (i=0; i<20; i++) {
    if (b[i] == 0) {
      #pragma omp critical
      cnt++;
    } /* end if */
    a[i] = b[i] + f*(i+1);
  } /* end for */
} /* omp end parallel */

(Figure: four threads take the iteration blocks i=0..4, 5..9, 10..14 and 15..19; each evaluates the if, the cnt++ updates pass one at a time through the critical region, while the a[i]=b[i]+... statements run in parallel.)

Critical Example 1

!86

Page 87: Parallelization - XS4ALL Klantenservice

Critical Example 2

int i;

#pragma omp parallel for
for (i = 0; i < 100; i++) {
  s = s + a[i];
}

!87

Page 88: Parallelization - XS4ALL Klantenservice


Synchronization (2/4)

Pseudo-code, here with 4 threads: the sequential loop

  do i = 0, 99
    s = s + a(i)
  end do

is split into four loops

  do i = 0, 24     do i = 25, 49     do i = 50, 74     do i = 75, 99
    s = s + a(i)     s = s + a(i)      s = s + a(i)      s = s + a(i)
  end do           end do            end do            end do

that all read A(0) ... A(99) and update the shared variable S in memory.

Critical Example 2

!88

Page 89: Parallelization - XS4ALL Klantenservice

OpenMP Single Construct

• Only one thread in the team executes the enclosed code

• The format is:

  #pragma omp single [nowait][clause, ...]
  {
    "block"
  }

• The supported clauses on the single directive are:
    private (list)
    firstprivate (list)

• NOWAIT: the other threads will not wait at the end of the single directive

!89

Page 90: Parallelization - XS4ALL Klantenservice

OpenMP Master directive

• All threads but the master skip the enclosed section of code and continue

• There is no implicit barrier on entry or exit!

  #pragma omp master
  {
    "code"
  }

!90

• A barrier makes each thread wait until all others in the team have reached this point:

  #pragma omp barrier

Page 91: Parallelization - XS4ALL Klantenservice

#pragma omp parallel
{
  ....
  #pragma omp single [nowait]
  {
    ....
  }
  #pragma omp master
  {
    ....
  }
  ....
  #pragma omp barrier
}

Work Sharing: Single Master

!91

Page 92: Parallelization - XS4ALL Klantenservice

Single processor

!92


(Figure: over time, only one thread executes the single-processor region; the other threads wait if there is a barrier at its end.)

Page 93: Parallelization - XS4ALL Klantenservice

OpenMP: EXERCISE_2

• Dot product

• Code is parallel, but something is wrong….

• Try to fix the code and maintain good performance

!93

Page 94: Parallelization - XS4ALL Klantenservice

Work Sharing: Orphaning

• Worksharing constructs may be outside lexical scope of the parallel region

!94

#pragma omp parallel
{
  ....
  dowork( )
  ....
}

void dowork( )
{
  #pragma omp for
  for (i=0; i<n; i++) {
    ....
  }
}

Page 95: Parallelization - XS4ALL Klantenservice

Scheduling the work

• schedule ( static | dynamic | guided | auto [, chunk] )
  schedule ( runtime )

static [, chunk]
  • Distribute iterations in blocks of size "chunk" over the threads in a round-robin fashion
  • In absence of "chunk", each thread executes one block of approx. N/P iterations for a loop of length N and P threads

!95


The schedule clause: example static schedule for a loop of length 16 on 4 threads

  Thread       0          1          2          3
  static       1-4        5-8        9-12       13-16
  static,2     1-2, 9-10  3-4, 11-12 5-6, 13-14 7-8, 15-16

*) The precise distribution is implementation defined

Page 96: Parallelization - XS4ALL Klantenservice

!96

• dynamic [, chunk]
  • Fixed portions of work; the size is controlled by the value of chunk
  • When a thread finishes, it starts on the next portion of work

• guided [, chunk]
  • Same dynamic behaviour as "dynamic", but the size of the portion of work decreases exponentially

• runtime
  • The iteration scheduling scheme is set at runtime through the environment variable OMP_SCHEDULE
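A short sketch (my own) of how the schedule clause is attached to a loop; expensive_work() and the arrays are placeholders, and the same loops could also use schedule(runtime) and be controlled through OMP_SCHEDULE.

  #include <math.h>

  #define N 500

  /* placeholder for an iteration whose cost varies strongly with i */
  static double expensive_work(int i) { return sin((double) i) * i; }

  void schedule_demo(double *a, double *b, double *c, double *result)
  {
      int i;

      /* irregular work: dynamic scheduling with chunks of 5 keeps the threads busy */
      #pragma omp parallel for schedule(dynamic, 5)
      for (i = 0; i < N; i++)
          result[i] = expensive_work(i);

      /* uniform work: static scheduling has the lowest overhead */
      #pragma omp parallel for schedule(static)
      for (i = 0; i < N; i++)
          c[i] = a[i] + b[i];
  }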

Page 97: Parallelization - XS4ALL Klantenservice


The experiment

(Figure: which thread (ID 0-3) executes which iteration of a 500-iteration loop, for schedule static, dynamic,5 and guided,5; static gives four contiguous blocks, dynamic,5 interleaves small chunks, and guided,5 starts with large chunks that shrink.)

500 iterations on 4 threads: example scheduling

!97

Page 98: Parallelization - XS4ALL Klantenservice

Environment Variables

• The names of the OpenMP environment variables must be UPPERCASE

• The values assigned to them are case insensitive

!98

OMP_NUM_THREADS

OMP_SCHEDULE “schedule [chunk]”

OMP_NESTED { TRUE | FALSE }

Page 99: Parallelization - XS4ALL Klantenservice

Exercise: OpenMP scheduling

• Download code from: http://www.xs4all.nl/~janth/HPCourse/OMP_schedule.tar

• Two loops
• Parallel code with omp sections
• Check what the auto-parallelisation of the compiler has done
• Insert OpenMP directives to try out different scheduling strategies
  • c$omp& schedule(runtime)
  • export OMP_SCHEDULE="static,10"
  • export OMP_SCHEDULE="guided,100"
  • export OMP_SCHEDULE="dynamic,1"

!99

Page 100: Parallelization - XS4ALL Klantenservice

OpenMP Reduction Clause

reduction (op : list)

The variables in "list" must be shared in the enclosing parallel region.

Inside a parallel or work-sharing construct:
  ‣ A PRIVATE copy of each list variable is created and initialized depending on the "op"
  ‣ These copies are updated locally by the threads
  ‣ At the end of the construct, the local copies are combined through "op" into a single value and combined with the value in the original SHARED variable

Page 101: Parallelization - XS4ALL Klantenservice


OpenMP: Reduction

• Performs a reduction on the shared variables in "list" based on the operator provided.
• For C/C++ the operator can be any one of: +, *, -, ^, |, ||, & or &&
• At the end of a reduction, the shared variable contains the result obtained upon combination of the list of variables processed using the specified operator.

sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (i=0; i < 20; i++)
    sum = sum + (a[i] * b[i]);

(Figure: sum is initialised to 0; each thread works on iterations i=0..4, 5..9, 10..14 or 15..19 with a local copy of sum; at the end all local copies are added together and stored in the "global" variable.)

Reduction Example

#pragma omp parallel for reduction(+:sum)
for (i=0; i<N; i++) {
    sum += a[i] * b[i];
}

!101

Page 102: Parallelization - XS4ALL Klantenservice

A range of associative and commutative operators can be used with reduction. The initial values are the ones that make sense.

C/C++ Reduction Operations

  Operator   Initial Value        Operator   Initial Value
  +          0                    &          ~0
  *          1                    |          0
  -          0                    &&         1
  ^          0                    ||         0

FORTRAN:
  intrinsic is one of MAX, MIN, IAND, IOR, IEOR
  operator is one of +, *, -, .AND., .OR., .EQV., .NEQV.

!102

Page 103: Parallelization - XS4ALL Klantenservice

Numerical Integration Example

(Figure: the area under f(x) = 4/(1+x^2) between x = 0 and x = 1 is approximated by rectangles.)

Integral from 0 to 1 of 4/(1+x^2) dx = pi

static long num_steps = 100000;
double step, pi;

void main()
{
  int i;
  double x, sum = 0.0;

  step = 1.0/(double) num_steps;
  for (i=0; i<num_steps; i++) {
    x = (i+0.5)*step;
    sum = sum + 4.0/(1.0 + x*x);
  }
  pi = step * sum;
  printf("Pi = %f\n", pi);
}

!103

Page 104: Parallelization - XS4ALL Klantenservice

static long num_steps = 100000;
double step, pi;

void main()
{
  int i;
  double x, sum = 0.0;

  step = 1.0/(double) num_steps;
  for (i=0; i<num_steps; i++) {
    x = (i+0.5)*step;
    sum = sum + 4.0/(1.0 + x*x);
  }
  pi = step * sum;
  printf("Pi = %f\n", pi);
}

Numerical Integration to Compute Pi

Parallelize the numerical integration code using OpenMP

What variables can be shared?                      step, num_steps
What variables need to be private?                 x, i
What variables should be set up for reductions?    sum

!104

Page 105: Parallelization - XS4ALL Klantenservice

Solution to Computing Pi

static long num_steps = 100000;
double step, pi;

void main()
{
  int i;
  double x, sum = 0.0;

  step = 1.0/(double) num_steps;
  #pragma omp parallel for private(x) reduction(+:sum)
  for (i=0; i<num_steps; i++) {
    x = (i+0.5)*step;
    sum = sum + 4.0/(1.0 + x*x);
  }
  pi = step * sum;
  printf("Pi = %f\n", pi);
}

!105

Page 106: Parallelization - XS4ALL Klantenservice

Let’s try it out

• Go to example MPI_pi.tar and work with openmp_pi2.c

!106

Page 107: Parallelization - XS4ALL Klantenservice

Exercise: PI with MPI and OpenMP

!107

cores OpenMP

1 9.617728

2 4.874539

4 2.455036

6 1.627149

8 1.214713

12 0.820746

16 0.616482

Page 108: Parallelization - XS4ALL Klantenservice

(Figure: Pi scaling; speed-up versus number of CPUs (1-16) for the OpenMP version on a 2.2 GHz Barcelona, compared with the Amdahl curve for P = 1.0.)

Exercise: PI with MPI and OpenMP

!108

Page 109: Parallelization - XS4ALL Klantenservice

CUDA: computing PI

!109

Page 110: Parallelization - XS4ALL Klantenservice

Exercise: Shared Cache Trashing

• Let’s do the exercise: CacheTrash

!110

Page 111: Parallelization - XS4ALL Klantenservice

About local and shared data

• Consider the following example:

• Let’s assume we run this on 2 processors:

• processor 1 for i=0,2,4,6,8 • processor 2 for i=1,3,5,7,9

!111

for (i=0; i<10; i++){ a[i] = b[i] + c[i];}

Page 112: Parallelization - XS4ALL Klantenservice

About local and shared data

!112

Processor 1:                        Processor 2:
  for (i1=0,2,4,6,8){                 for (i2=1,3,5,7,9){
    a[i1] = b[i1] + c[i1];              a[i2] = b[i2] + c[i2];
  }                                   }

(Figure: i1 and i2 live in the private areas of processor 1 and 2; the arrays A, B and C live in the shared area.)

Page 113: Parallelization - XS4ALL Klantenservice

About local and shared data

processor 1 for i=0,2,4,6,8 processor 2 for i=1,3,5,7,9

• This is not an efficient way to do this!

Why?

!113

Page 114: Parallelization - XS4ALL Klantenservice

Doing it the bad way

• Because of cache line usage

• b[] and c[]: we use half of the data

• a[]: false sharing

!114

for (i=0; i<10; i++){ a[i] = b[i] + c[i];}

Page 115: Parallelization - XS4ALL Klantenservice

False sharing and scalability

• The Cause: Updates on independent data elements that happen to be part of the same cache line.

• The Impact: Non-scalable parallel applications

• The Remedy: False sharing is often quite simple to solve

!115
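One common remedy (a sketch of my own, not from the slides) is to pad per-thread data so that each thread's element lives on its own cache line; the 64-byte line size and the 16-thread limit below are assumptions.

  #include <omp.h>

  #define MAX_THREADS 16
  #define CACHE_LINE  64            /* assumed cache-line size in bytes */

  /* one counter per thread, each padded out to a full cache line */
  struct padded_counter {
      long value;
      char pad[CACHE_LINE - sizeof(long)];
  };

  long count_events(int n, const int *events)
  {
      struct padded_counter cnt[MAX_THREADS] = { {0} };
      long total = 0;
      int i, t;

      #pragma omp parallel
      {
          int tid = omp_get_thread_num();   /* assumes at most MAX_THREADS threads */
          #pragma omp for
          for (i = 0; i < n; i++)
              if (events[i]) cnt[tid].value++;   /* updates no longer share a line */
      }
      for (t = 0; t < MAX_THREADS; t++) total += cnt[t].value;
      return total;
  }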

Page 116: Parallelization - XS4ALL Klantenservice

Poor cache line utilization

!116

(Figure: both processors read the same cache lines of array B, elements B(0)..B(9), but each uses only half of the data in every line; the same holds for array C.)

Page 117: Parallelization - XS4ALL Klantenservice

False Sharing

!117

Processor 1                                      Processor 2

a[0] = b[0] + c[0];
  writes into the line containing a[0];
  this marks the cache line as 'dirty'
                                                 a[1] = b[1] + c[1];
                                                   detects that the line with a[0] is 'dirty',
                                                   gets a fresh copy (from processor 1),
                                                   writes into the line containing a[1];
                                                   this marks the cache line as 'dirty'
a[2] = b[2] + c[2];
  detects that the line is 'dirty',
  gets a fresh copy (from processor 2),
  writes into the line containing a[2];
  this marks the cache line as 'dirty'
                                                 a[3] = b[3] + c[3];
                                                   detects that the line is 'dirty', and so on:
                                                   the cache line ping-pongs between the
                                                   processors (time runs downward)

Page 118: Parallelization - XS4ALL Klantenservice

False Sharing results

!118

(Figure: measured time in seconds (3-10 s) versus the number of iterations per thread (1 ... 256K) for 1 to 10 threads.)

Page 119: Parallelization - XS4ALL Klantenservice

OpenMP tasks

• What are tasks?
  • Tasks are independent units of work
  • Threads are assigned to perform the work of each task
    - Tasks may be deferred
    - Tasks may be executed immediately
    - The runtime system decides which of the above

• Why tasks?
  • The basic idea is to set up a task queue: when a thread encounters a task directive, it arranges for some thread to execute the associated block at some later time. The first thread can continue.

!119

Page 120: Parallelization - XS4ALL Klantenservice

!120


The Tasking Example

The developer specifies tasks in the application; the run-time system executes them: the encountering thread adds task(s) to a pool, and the threads of the team execute tasks from the pool.

Page 121: Parallelization - XS4ALL Klantenservice

OpenMP tasks

Tasks make it possible to parallelize irregular problems:
  – Unbounded loops
  – Recursive algorithms
  – Manager/worker schemes

A task has:
  – Code to execute
  – A data environment (it owns its data)
  – Internal control variables
  – An assigned thread that executes the code on the data

!121

Page 122: Parallelization - XS4ALL Klantenservice

OpenMP has always had tasks, but they were not called "tasks":
  – A thread encountering a parallel construct, e.g. "for", packages up a set of implicit tasks, one per thread.
  – A team of threads is created.
  – Each thread is assigned to one of the tasks.
  – A barrier holds the master thread until all implicit tasks are finished.

!122

Page 123: Parallelization - XS4ALL Klantenservice

OpenMP tasks

#pragma omp parallel      -> a parallel region creates a team of threads
#pragma omp single        -> one thread enters the execution; the other
{                            threads wait at the end of the single
  ...
  #pragma omp task        -> the waiting threads pick up tasks
  {                          "from the work queue"
    ...
  }
  ...
  #pragma omp taskwait
}

!123
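A small sketch of this pattern (my own, OpenMP 3.0 style; process() and the list type are placeholders): one thread walks a linked list and creates a task per element, and the rest of the team picks the tasks up from the queue.

  #include <stdio.h>

  typedef struct node { int data; struct node *next; } node_t;

  /* placeholder for the work done on one element */
  static void process(node_t *p) { printf("%d\n", p->data); }

  void walk_list(node_t *head)
  {
      #pragma omp parallel          /* create the team */
      {
          #pragma omp single        /* one thread generates the tasks */
          {
              node_t *p;
              for (p = head; p != NULL; p = p->next) {
                  #pragma omp task firstprivate(p)
                  process(p);       /* any thread in the team may execute this */
              }
              #pragma omp taskwait  /* wait until all generated tasks have finished */
          }
      }                             /* implicit barrier at the end of the parallel region */
  }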

Page 124: Parallelization - XS4ALL Klantenservice

OpenMP: EXERCISE_4

• Jacobi method

• Use Intel's profiler to find out the most time-consuming parts (2)
  • amplxe-cl and amplxe-gui from Intel (VTune Amplifier)

• Parallelise the most time-consuming loop
  • directly on the loop itself (3)
  • making the parallel region larger and more efficient (4)

!124

Page 125: Parallelization - XS4ALL Klantenservice

Summary

• First tune single-processor performance

• Tuning parallel programs
  • Has the program been properly parallelized?
    • Is enough of the program parallelized (Amdahl's law)?
    • Is the load well-balanced?
  • Location of memory
    • Cache-friendly programs: no special placement needed
    • Non-cache-friendly programs
    • False sharing?
  • Use of OpenMP
    • try to avoid synchronization (barrier, critical, single, ordered)

!125

Page 126: Parallelization - XS4ALL Klantenservice


Plenty of Other OpenMP Stuff

Scheduling clauses

Atomic

Barrier

Master & Single

Sections

Tasks (OpenMP 3.0)

API routines

Page 127: Parallelization - XS4ALL Klantenservice

Compiling and running OpenMP

• Compile with -openmp flag (intel compiler) or -fopenmp (GNU)

• Run program with variable:

export OMP_NUM_THREADS=4

!127
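For example (a sketch only; pi.c is a placeholder file name, and newer Intel compilers use -qopenmp instead of -openmp):

  icc -openmp  pi.c -o pi        (Intel)
  gcc -fopenmp pi.c -o pi        (GNU)
  export OMP_NUM_THREADS=4
  ./pi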

Page 128: Parallelization - XS4ALL Klantenservice

OpenACC

• A new set of directives to support accelerators
  • GPUs
  • Intel's MIC
  • AMD Fusion processors

!128

Page 129: Parallelization - XS4ALL Klantenservice

OpenACC example

void convolution_SM_N(typeToUse A[M][N], typeToUse B[M][N]) {
    int i, j, k;
    int m = M, n = N;
    // OpenACC kernel region
    // Define a region of the program to be compiled into a sequence of kernels
    // for execution on the accelerator device
    #pragma acc kernels pcopyin(A[0:m]) pcopy(B[0:m])
    {
        typeToUse c11, c12, c13, c21, c22, c23, c31, c32, c33;
        c11 = +2.0f; c21 = +5.0f; c31 = -8.0f;
        c12 = -3.0f; c22 = +6.0f; c32 = -9.0f;
        c13 = +4.0f; c23 = +7.0f; c33 = +10.0f;
        // The OpenACC loop gang clause tells the compiler that the iterations of the loops
        // are to be executed in parallel across the gangs.
        // The argument specifies how many gangs to use to execute the iterations of this loop.
        #pragma acc loop gang(64)
        for (int i = 1; i < M-1; ++i) {
            // The OpenACC loop worker clause specifies that the iterations of the associated loop
            // are to be executed in parallel across the workers within the gangs created.
            // The argument specifies how many workers to use to execute the iterations of this loop.
            #pragma acc loop worker(128)
            for (int j = 1; j < N-1; ++j) {
                B[i][j] = c11*A[i-1][j-1] + c12*A[i+0][j-1] + c13*A[i+1][j-1]
                        + c21*A[i-1][j+0] + c22*A[i+0][j+0] + c23*A[i+1][j+0]
                        + c31*A[i-1][j+1] + c32*A[i+0][j+1] + c33*A[i+1][j+1];
            }
        }
    } // kernels region
}

!129

Page 130: Parallelization - XS4ALL Klantenservice

MPI

!130

Page 131: Parallelization - XS4ALL Klantenservice

Message Passing

• Point-to-Point
  • Requires explicit commands in the program: Send, Receive
  • Must be synchronized among different processors
    • Sends and Receives must match
    • Avoid deadlock: all processors waiting, none able to communicate

• Multi-processor communications
  • e.g. broadcast, reduce

Page 132: Parallelization - XS4ALL Klantenservice

!132

MPI advantages

• Mature and well understood
  • Backed by a widely-supported formal standard (1992)
  • Porting is "easy"

• Efficiently matches the hardware
  • Vendor and public implementations available

• User interface:
  • Efficient and simple
  • Buffer handling
  • Allows high-level abstractions

• Performance

Page 133: Parallelization - XS4ALL Klantenservice

!133

MPI disadvantages

• MPI 2.0 includes many features beyond message passing

• The execution control environment depends on the implementation

• Learning curve

Page 134: Parallelization - XS4ALL Klantenservice


Programming Model

• Explicit parallelism:
  ‣ All processes start at the same time at the same point in the code
  ‣ Full parallelism: there is no sequential part in the program

(Figure: all processes execute the whole program as one parallel region.)

Page 135: Parallelization - XS4ALL Klantenservice

Work Distribution

• All processors run the same executable.

• Parallel work distribution must be explicitly done by the programmer:
  • domain decomposition
  • master-slave

!135

Page 136: Parallelization - XS4ALL Klantenservice

!136

A Minimal MPI Program (C)

#include "mpi.h"#include <stdio.h>

int main( int argc, char *argv[] ){ MPI_Init( &argc, &argv ); printf( "Hello, world!\n" ); MPI_Finalize(); return 0;}

Page 137: Parallelization - XS4ALL Klantenservice

!137

A Minimal MPI Program (Fortran 90)

program main
use MPI
integer ierr

call MPI_INIT( ierr )
print *, 'Hello, world!'
call MPI_FINALIZE( ierr )
end

Page 138: Parallelization - XS4ALL Klantenservice

Starting the MPI Environment

• MPI_INIT( )
  Initializes the MPI environment. This function must be called and must be the first MPI function called in a program (exception: MPI_INITIALIZED).

  Syntax:
    C:       int MPI_Init( int *argc, char ***argv )
    Fortran: MPI_INIT( IERROR )
             INTEGER IERROR

  NOTE: both C and Fortran return error codes for all calls.

!138

Page 139: Parallelization - XS4ALL Klantenservice

Exiting the MPI Environment

• MPI_FINALIZE( )
  Cleans up all MPI state. Once this routine has been called, no MPI routine (even MPI_INIT) may be called.

  Syntax:
    C:       int MPI_Finalize( )
    Fortran: MPI_FINALIZE( IERROR )
             INTEGER IERROR

  You MUST call MPI_FINALIZE when you exit from an MPI program.

!139

Page 140: Parallelization - XS4ALL Klantenservice

C and Fortran Language Considerations

• Bindings

  – C
    • All MPI names have an MPI_ prefix
    • Defined constants are in all capital letters
    • Defined types and functions have one capital letter after the prefix; the remaining letters are lowercase

  – Fortran
    • All MPI names have an MPI_ prefix
    • No capitalization rules apply
    • The last argument is a returned error value

!140

Page 141: Parallelization - XS4ALL Klantenservice

!141

Finding Out About the Environment

• Two important questions that arise early in a parallel program are:
  • How many processes are participating in this computation?
  • Which one am I?

• MPI provides functions to answer these questions:
  – MPI_Comm_size reports the number of processes
  – MPI_Comm_rank reports the rank, a number between 0 and size-1, identifying the calling process

Page 142: Parallelization - XS4ALL Klantenservice

!142

Better Hello (C)

#include "mpi.h"#include <stdio.h>

int main( int argc, char *argv[] ){ int rank, size; MPI_Init( &argc, &argv ); MPI_Comm_rank( MPI_COMM_WORLD, &rank ); MPI_Comm_size( MPI_COMM_WORLD, &size ); printf( "I am %d of %d\n", rank, size ); MPI_Finalize(); return 0;}

Page 143: Parallelization - XS4ALL Klantenservice

!143

Better Hello (Fortran)

program main
use MPI
integer ierr, rank, size

call MPI_INIT( ierr )
call MPI_COMM_RANK( MPI_COMM_WORLD, rank, ierr )
call MPI_COMM_SIZE( MPI_COMM_WORLD, size, ierr )
print *, 'I am ', rank, ' of ', size
call MPI_FINALIZE( ierr )
end

Page 144: Parallelization - XS4ALL Klantenservice

!144

Some Basic Concepts

• Processes can be collected into groups.
• Each message is sent in a context, and must be received in the same context.
• A group and a context together form a communicator.
• A process is identified by its rank in the group associated with a communicator.
• There is a default communicator whose group contains all initial processes, called MPI_COMM_WORLD.


MPI Communicators

• A communicator is an internal object
• MPI programs are made up of communicating processes
• Each process has its own address space containing its own attributes such as rank, size (and argc, argv, etc.)
• MPI provides functions to interact with it
• The default communicator is MPI_COMM_WORLD
  – All processes are its members
  – It has a size (the number of processes)
  – Each process has a rank within it
  – One can think of it as an ordered list of processes
• Additional communicator(s) can co-exist
• A process can belong to more than one communicator
• Within a communicator, each process has a unique rank

(Figure: MPI_COMM_WORLD containing processes with ranks 0-7.)

Page 145: Parallelization - XS4ALL Klantenservice

Communicator

• Communication in MPI takes place with respect to communicators

• MPI_COMM_WORLD is one such predefined communicator (something of type "MPI_Comm") and contains group and context information

• MPI_COMM_RANK and MPI_COMM_SIZE return information based on the communicator passed in as the first argument

• Processes may belong to many different communicators

(Figure: MPI_COMM_WORLD with ranks 0 1 2 3 4 5 6 7.)

!145

Page 146: Parallelization - XS4ALL Klantenservice

MPI Basic Send/Receive

• Basic message-passing process: send data from one process to another

• Questions
  – To whom is the data sent?
  – Where is the data?
  – What type of data is sent?
  – How much data is sent?
  – How does the receiver identify it?

(Figure: process 0 sends buffer A; process 1 receives it into buffer B.)

!146

Page 147: Parallelization - XS4ALL Klantenservice

!147

MPI Basic Send/Receive

• Data transfer plus synchronization

• Requires co-operation of sender and receiver
  • The co-operation is not always apparent in the code
  • Communication and synchronization are combined

(Figure: process 0 holds the data and asks process 1 "May I send?"; after the "Yes" the data items are transferred one after another over time.)

Page 148: Parallelization - XS4ALL Klantenservice


Message Envelope

• Communication across processes is performed using messages.

• Each message consists of a fixed number of fields that are used to distinguish them, called the Message Envelope:
  – The envelope comprises source, destination, tag, communicator
  – The message comprises envelope + data

• The communicator refers to the namespace associated with the group of related processes

  Example envelope:
    Source:        process 0
    Destination:   process 1
    Tag:           1234
    Communicator:  MPI_COMM_WORLD

Message Organization in MPI

• A message is divided into data and envelope

• data
  – buffer
  – count
  – datatype

• envelope
  – process identifier (source/destination)
  – message tag
  – communicator

!148

Page 149: Parallelization - XS4ALL Klantenservice

MPI Basic Send/Receive

• Thus the basic (blocking) send has become:
    MPI_Send( start, count, datatype, dest, tag, comm )
  – Blocking means the function does not return until it is safe to reuse the data in the buffer. The message may not have been received by the target process.

• And the receive has become:
    MPI_Recv( start, count, datatype, source, tag, comm, status )
  – The source, tag, and count of the message actually received can be retrieved from status

!149
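A minimal C sketch of a matching send/receive pair (my own example, not from the slides; it needs at least two processes):

  #include "mpi.h"
  #include <stdio.h>

  int main(int argc, char *argv[])
  {
      int rank, value;
      MPI_Status status;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0) {
          value = 42;
          MPI_Send(&value, 1, MPI_INT, 1, 99, MPI_COMM_WORLD);       /* dest=1, tag=99 */
      } else if (rank == 1) {
          MPI_Recv(&value, 1, MPI_INT, 0, 99, MPI_COMM_WORLD, &status);
          printf("rank 1 received %d\n", value);
      }
      MPI_Finalize();
      return 0;
  }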

Page 150: Parallelization - XS4ALL Klantenservice

MPI C Datatypes

!150
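The datatype table itself was a slide image; for reference, some of the most frequently used standard C datatypes and the C types they correspond to are:

  MPI_CHAR       signed char
  MPI_INT        int
  MPI_LONG       long
  MPI_FLOAT      float
  MPI_DOUBLE     double
  MPI_BYTE       (raw bytes, no C equivalent)
  MPI_PACKED     (packed data, no C equivalent)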

Page 151: Parallelization - XS4ALL Klantenservice

MPI Fortran Datatypes

!151

Page 152: Parallelization - XS4ALL Klantenservice

!152

MPI Tags

• Messages are sent with an accompanying user-defined integer tag, to assist the receiving process in identifying the message.

• Messages can be screened at the receiving end by specifying a specific tag, or not screened by specifying MPI_ANY_TAG as the tag in a receive.

Page 153: Parallelization - XS4ALL Klantenservice

Process Naming and Message Tags

• Naming a process
  – A destination is specified by ( rank, group )
  – Processes are named according to their rank in the group
  – Groups are defined by their distinct "communicator"
  – The MPI_ANY_SOURCE wildcard rank is permitted in a receive

• Tags are integer variables or constants used to uniquely identify individual messages
  • Tags allow programmers to deal with the arrival of messages in an orderly manner
  • MPI tags are guaranteed to range from 0 to 32767 by MPI-1
    – Vendors are free to increase the range in their implementations
  • MPI_ANY_TAG can be used as a wildcard value

!153

Page 154: Parallelization - XS4ALL Klantenservice

!154

Retrieving Further Information

• Status is a data structure allocated in the user's program.

• In C:

    int recvd_tag, recvd_from, recvd_count;
    MPI_Status status;

    MPI_Recv(..., MPI_ANY_SOURCE, MPI_ANY_TAG, ..., &status);
    recvd_tag  = status.MPI_TAG;
    recvd_from = status.MPI_SOURCE;
    MPI_Get_count( &status, datatype, &recvd_count );

• In Fortran:

    integer recvd_tag, recvd_from, recvd_count
    integer status(MPI_STATUS_SIZE)

    call MPI_RECV(..., MPI_ANY_SOURCE, MPI_ANY_TAG, ..., status, ierr)
    recvd_tag  = status(MPI_TAG)
    recvd_from = status(MPI_SOURCE)
    call MPI_GET_COUNT(status, datatype, recvd_count, ierr)

Page 155: Parallelization - XS4ALL Klantenservice

Getting Information About a Message (Fortran)

• The following functions can be used to get information about a message:

    INTEGER status(MPI_STATUS_SIZE)
    call MPI_Recv( . . . , status, ierr )
    tag_of_received_message = status(MPI_TAG)
    src_of_received_message = status(MPI_SOURCE)
    call MPI_Get_count(status, datatype, count, ierr)

• MPI_TAG and MPI_SOURCE are primarily of use when MPI_ANY_TAG and/or MPI_ANY_SOURCE is used in the receive

• The function MPI_GET_COUNT may be used to determine how much data of a particular type was received

!155

Page 156: Parallelization - XS4ALL Klantenservice

!156

Simple Fortran Example - 1/3

      program main
      use MPI

      integer rank, size, to, from, tag, count, i, ierr
      integer src, dest
      integer st_source, st_tag, st_count
      integer status(MPI_STATUS_SIZE)
      double precision data(10)

      call MPI_INIT( ierr )
      call MPI_COMM_RANK( MPI_COMM_WORLD, rank, ierr )
      call MPI_COMM_SIZE( MPI_COMM_WORLD, size, ierr )
      print *, 'Process ', rank, ' of ', size, ' is alive'
      dest = size - 1
      src = 0

Page 157: Parallelization - XS4ALL Klantenservice

!157

      if (rank .eq. 0) then
         do i=1,10
            data(i) = i
         enddo
         call MPI_SEND( data, 10, MPI_DOUBLE_PRECISION,
     +                  dest, 2001, MPI_COMM_WORLD, ierr)
      else if (rank .eq. dest) then
         tag = MPI_ANY_TAG
         src = MPI_ANY_SOURCE
         call MPI_RECV( data, 10, MPI_DOUBLE_PRECISION,
     +                  src, tag, MPI_COMM_WORLD,
     +                  status, ierr)

Simple Fortran Example - 2/3

Page 158: Parallelization - XS4ALL Klantenservice

!158

         call MPI_GET_COUNT( status, MPI_DOUBLE_PRECISION, st_count, ierr )
         st_source = status( MPI_SOURCE )
         st_tag = status( MPI_TAG )
         print *, 'status info: source = ', st_source,
     +            ' tag = ', st_tag, ' count = ', st_count
      endif

      call MPI_FINALIZE( ierr )
      end

Simple Fortran Example - 3/3

Page 159: Parallelization - XS4ALL Klantenservice

Is MPI Large or Small?

• Is MPI large (128 functions) or small (6 functions)?
  – MPI's extensive functionality requires many functions
  – The number of functions is not necessarily a measure of complexity
  – Many programs can be written with just 6 basic functions:
      MPI_INIT      MPI_COMM_SIZE   MPI_SEND
      MPI_FINALIZE  MPI_COMM_RANK   MPI_RECV

• MPI is just right
  – A small number of concepts
  – A large number of functions provides flexibility, robustness, efficiency, modularity, and convenience
  – One need not master all parts of MPI to use it

!159

Page 160: Parallelization - XS4ALL Klantenservice

#include "mpi.h"#include <math.h>int main(int argc, char *argv[]){

int done = 0, n, myid, numprocs, i, rc;double PI25DT = 3.141592653589793238462643;double mypi, pi, h, sum, x, a;MPI_Init(&argc,&argv);MPI_Comm_size(MPI_COMM_WORLD,&numprocs);MPI_Comm_rank(MPI_COMM_WORLD,&myid);while (!done) {

if (myid == 0) {printf("Enter the number of intervals: (0 quits) ");scanf("%d",&n);

}MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);if (n == 0) break;h = 1.0 / (double) n;sum = 0.0;for (i = myid + 1; i <= n; i += numprocs) {

x = h * ((double)i - 0.5);sum += 4.0 / (1.0 + x*x);

}mypi = h * sum;MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0,MPI_COMM_WORLD);if (myid == 0)

printf("pi is approximately %.16f, Error is %.16f\n",pi, fabs(pi - PI25DT));}MPI_Finalize();return 0;

}

!160

Example: PI in Fortran 90 and C

work distribution

Page 161: Parallelization - XS4ALL Klantenservice

work distribution

      program main
      use MPI
      double precision  PI25DT
      parameter        (PI25DT = 3.141592653589793238462643d0)
      double precision  mypi, pi, h, sum, x, f, a
      integer n, myid, numprocs, i, ierr
c     function to integrate
      f(a) = 4.d0 / (1.d0 + a*a)

      call MPI_INIT( ierr )
      call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
      call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )

 10   if ( myid .eq. 0 ) then
         write(6,98)
 98      format('Enter the number of intervals: (0 quits)')
         read(5,'(i10)') n
      endif
      call MPI_BCAST( n, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
      if ( n .le. 0 ) goto 30
      h = 1.0d0/n
      sum = 0.0d0
      do 20 i = myid+1, n, numprocs
         x = h * (dble(i) - 0.5d0)
         sum = sum + f(x)
 20   continue
      mypi = h * sum
      call MPI_REDUCE( mypi, pi, 1, MPI_DOUBLE_PRECISION,
     +                 MPI_SUM, 0, MPI_COMM_WORLD, ierr)
      if (myid .eq. 0) then
         write(6, 97) pi, abs(pi - PI25DT)
 97      format(' pi is approximately: ', F18.16,
     +          ' Error is: ', F18.16)
      endif
      goto 10
 30   call MPI_FINALIZE(ierr)
      end

!161

Page 162: Parallelization - XS4ALL Klantenservice

Exercise: PI with MPI and OpenMP

• Compile and run the code for computing PI in parallel

• Download code from: http://www.xs4all.nl/~janth/HPCourse/MPI_pi.tar

• Unpack tar file and check the README for instructions.

• It is assumed that you use mpich1 and have PBS installed as a job scheduler. If you have mpich2, let me know.

• Use qsub to submit the mpi job (an example script is provided) to a queue.

!162

Page 163: Parallelization - XS4ALL Klantenservice

Exercise: PI with MPI and OpenMP

!163

cores   OpenMP      marken (MPI)   linux (MPI)
  1     9.617728    14.10798       22.15252
  2     4.874539     7.071287       9.661745
  4     2.455036     3.532871       5.730912
  6     1.627149     2.356928       3.547961
  8     1.214713     1.832055       2.804715
 12     0.820746     1.184123
 16     0.616482     0.955704

Page 164: Parallelization - XS4ALL Klantenservice

[Plot: Pi Scaling — speedup versus number of CPUs (1 to 16) for Amdahl 1.0, OpenMP on Barcelona 2.2 GHz, MPI on marken (Xeon 3.1 GHz) and MPI on linux (Xeon 2.4 GHz).]

Exercise: PI with MPI and OpenMP

!164

Page 165: Parallelization - XS4ALL Klantenservice

!165

Page 166: Parallelization - XS4ALL Klantenservice

Blocking Communication

• So far we have discussed blocking communication
– MPI_SEND does not complete until the buffer is empty (available for reuse)
– MPI_RECV does not complete until the buffer is full (available for use)

• A process sending data will be blocked until the data in the send buffer is emptied
• A process receiving data will be blocked until the receive buffer is filled
• Completion of communication generally depends on the message size and the system buffer size
• Blocking communication is simple to use but can be prone to deadlocks:

      If (my_proc.eq.0) Then
         Call mpi_send(..)
         Call mpi_recv(..)      ! Usually deadlocks -->
      Else
         Call mpi_send(..)      ! <--- UNLESS you reverse send/recv
         Call mpi_recv(..)
      Endif

!166

Page 167: Parallelization - XS4ALL Klantenservice

!167

• Send a large message from process 0 to process 1
• If there is insufficient storage at the destination, the send must wait for the user to provide the memory space (through a receive)

• What happens with:

Sources of Deadlocks

Process 0

Send(1) Recv(1)

Process 1

Send(0) Recv(0)

• This is called “unsafe” because it depends on the availability of system buffers.

Page 168: Parallelization - XS4ALL Klantenservice

!168

Some Solutions to the “unsafe” Problem

• Order the operations more carefully:

Process 0

Send(1) Recv(1)

Process 1

Recv(0) Send(0)

• Use non-blocking operations:

Process 0

Isend(1) Irecv(1) Waitall

Process 1

Isend(0) Irecv(0) Waitall
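A minimal C sketch of the second solution (my own illustration, not course code): both processes post the non-blocking send and receive first and then wait on both requests, so no ordering of send and receive is needed and the exchange cannot deadlock. It assumes exactly two processes.

#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, other, sendbuf, recvbuf;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    other   = 1 - rank;          /* partner rank, assuming exactly 2 processes */
    sendbuf = rank;

    MPI_Isend(&sendbuf, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(&recvbuf, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);   /* both complete, in either order */

    MPI_Finalize();
    return 0;
}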

Page 169: Parallelization - XS4ALL Klantenservice

Blocking Send-Receive Diagram (Receive before Send)

[Timing diagram, send side / receive side: at T0 MPI_Recv is called and the receive buffer becomes unavailable to the user; at T1 MPI_Send is called; the sender returns at T2 and the send buffer can be reused; the transfer completes at T3; MPI_Recv returns at T4 with the buffer filled (internal completion is soon followed by the return of MPI_Recv). It is important to receive before sending, for highest performance.]

!169

Page 170: Parallelization - XS4ALL Klantenservice

Non-Blocking Send-Receive Diagram

[Timing diagram, send side / receive side: at T0 MPI_Irecv is called and the receive buffer becomes unavailable to the user; Irecv returns at T1. At T2 MPI_Isend is called; the sender returns at T3 with the send buffer unavailable. MPI_Wait is called at T4 (receive side) and T6 (send side); the sender completes at T5 and its buffer is available again after MPI_Wait; the transfer finishes at T7; the receive-side MPI_Wait returns at T8 with the receive buffer filled, and the send-side Wait returns at T9. Internal completion is soon followed by the return of MPI_Wait. High-performance implementations offer low overhead for non-blocking calls.]

!170

Page 171: Parallelization - XS4ALL Klantenservice

Message Completion and Buffering

• A send has completed when the user supplied buffer can be reused

• The send mode used (standard, ready, synchronous, buffered) may provide additional information

• Just because the send completes does not mean that the receive has completed – Message may be buffered by the system – Message may still be in transit

*buf = 3;
MPI_Send ( buf, 1, MPI_INT, ... );
*buf = 4;   /* OK, receiver will always receive 3 */

*buf = 3;
MPI_Isend( buf, 1, MPI_INT, ... );
*buf = 4;   /* Undefined whether the receiver will get 3 or 4 */
MPI_Wait ( ... );

!171

Page 172: Parallelization - XS4ALL Klantenservice

Non-Blocking Communication

• Non-blocking (asynchronous) operations return (immediately) "request handles" that can be waited on and queried:

MPI_ISEND( start, count, datatype, dest, tag, comm, request )
MPI_IRECV( start, count, datatype, src, tag, comm, request )
MPI_WAIT( request, status )

• Non-blocking operations allow overlapping computation and communication.

• One can also test without waiting using MPI_TEST:

MPI_TEST( request, flag, status )

• Anywhere you use MPI_Send or MPI_Recv, you can use the pair of MPI_Isend/MPI_Wait or MPI_Irecv/MPI_Wait

• Combinations of blocking and non-blocking sends/receives can be used to synchronize execution instead of barriers

!172
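A small sketch of MPI_TEST in use (illustrative, not from the course code): rank 1 posts an MPI_Irecv and keeps doing local work, polling with MPI_Test until the message from rank 0 has arrived.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, data = 0, flag = 0, iterations = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        data = 42;
        MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Request req;
        MPI_Status status;
        MPI_Irecv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
        while (!flag) {
            iterations++;                      /* stands in for useful local work */
            MPI_Test(&req, &flag, &status);    /* returns immediately */
        }
        printf("got %d after %d polling iterations\n", data, iterations);
    }
    MPI_Finalize();
    return 0;
}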

Page 173: Parallelization - XS4ALL Klantenservice

Multiple Completions

• It is often desirable to wait on multiple requests

• An example is a worker/manager program, where the manager waits for one or more workers to send it a message:

MPI_WAITALL( count, array_of_requests, array_of_statuses )
MPI_WAITANY( count, array_of_requests, index, status )
MPI_WAITSOME( incount, array_of_requests, outcount, array_of_indices, array_of_statuses )

• There are corresponding versions of test for each of these

!173
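A hedged sketch of the manager side with MPI_WAITANY (the names and the "work" are illustrative): the manager posts one receive per worker and handles the results in whatever order they arrive.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        int nworkers = size - 1;
        double *results = malloc(nworkers * sizeof(double));
        MPI_Request *reqs = malloc(nworkers * sizeof(MPI_Request));
        for (int w = 0; w < nworkers; w++)     /* post a receive for every worker */
            MPI_Irecv(&results[w], 1, MPI_DOUBLE, w + 1, 0, MPI_COMM_WORLD, &reqs[w]);
        for (int done = 0; done < nworkers; done++) {
            int idx;
            MPI_Status st;
            MPI_Waitany(nworkers, reqs, &idx, &st);   /* whichever finishes first */
            printf("manager: got %f from rank %d\n", results[idx], st.MPI_SOURCE);
        }
        free(results); free(reqs);
    } else {
        double my_result = 3.0 * rank;         /* placeholder "work" */
        MPI_Send(&my_result, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}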

Page 174: Parallelization - XS4ALL Klantenservice

Collective Communications in MPI

• Communication is coordinated among a group of processes, as specified by the communicator, not on all processes

• All collective operations are blocking and no message tags are used (in MPI-1)

• All processes in the communicator group must call the collective operation

• Collective and point-to-point messaging are separated by different “contexts”

• Three classes of collective operations

– Data movement

– Collective computation

– Synchronization

!174

Page 175: Parallelization - XS4ALL Klantenservice

Collective calls

!175

• MPI has several collective communication calls, the most frequently used are:

• Synchronization • Barrier

• Communication • Broadcast • Gather Scatter • All Gather

• Reduction • Reduce • All Reduce

Page 176: Parallelization - XS4ALL Klantenservice


MPI Collective Calls: Barrier

Function: MPI_Barrier()

int MPI_Barrier ( MPI_Comm comm )

Description: Creates barrier synchronization in a communicator group comm. Each process, when reaching the MPI_Barrier call, blocks until all the processes in the group reach the same MPI_Barrier call.

http://www-unix.mcs.anl.gov/mpi/www/www3/MPI_Barrier.html

[Diagram: processes P0–P3 each reach MPI_Barrier() at different times and block until all four have arrived.]

!176


Page 177: Parallelization - XS4ALL Klantenservice

MPI Basic Collective Operations

• Two simple collective operations

MPI_BCAST( start, count, datatype, root, comm )
MPI_REDUCE( start, result, count, datatype, operation, root, comm )

• The routine MPI_BCAST sends data from one process to all others

• The routine MPI_REDUCE combines data from all processes, using a specified operation, and returns the result to a single process

• In many numerical algorithms, SEND/RECEIVE can be replaced by BCAST/REDUCE, improving both simplicity and efficiency.

!177

Page 178: Parallelization - XS4ALL Klantenservice

Broadcast and Reduce

[Diagram: Bcast(root=0) copies the root's send buffer A into the buffers of all process ranks (A, ?, ?, ? becomes A, A, A, A). Reduce(root=0, op) combines the send buffers A, B, C, D of ranks 0–3 into X = A op B op C op D in the receive buffer of the root. AllReduce(comm_world, op) delivers the combined value X to every rank.]

!178

Page 179: Parallelization - XS4ALL Klantenservice

Scatter and Gather

[Diagram: Scatter(root=0) distributes the chunks A, B, C, D of the root's send buffer over the receive buffers of ranks 0–3. Gather(root=0) collects one chunk from each rank's send buffer into the root's receive buffer (A, B, C, D). AllGather(comm_world) gives every rank the complete set A, B, C, D.]

!179

Page 180: Parallelization - XS4ALL Klantenservice

MPI Collective Routines

• Several routines:

MPI_ALLGATHER   MPI_ALLGATHERV   MPI_BCAST
MPI_ALLTOALL    MPI_ALLTOALLV
MPI_GATHER      MPI_GATHERV
MPI_REDUCE      MPI_ALLREDUCE    MPI_REDUCE_SCATTER
MPI_SCATTER     MPI_SCATTERV

• The "ALL" versions deliver results to all participating processes
• The "V" versions allow the chunks to have different sizes
• MPI_ALLREDUCE, MPI_REDUCE, and MPI_REDUCE_SCATTER take both built-in and user-defined combination functions

!180

Page 181: Parallelization - XS4ALL Klantenservice

Built-In Collective Computation Operations

!181
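The table of predefined operations did not survive the transcript; as a small reminder, here is a sketch (illustration only) using two of the built-in combination operations, MPI_SUM and MPI_MAX, with MPI_Allreduce so that every rank gets the result.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, sum, max;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);  /* built-in sum */
    MPI_Allreduce(&rank, &max, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);  /* built-in max */
    printf("rank %d of %d: sum of ranks = %d, max rank = %d\n", rank, size, sum, max);
    MPI_Finalize();
    return 0;
}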

Page 182: Parallelization - XS4ALL Klantenservice

!182

Extending the Message-Passing Interface

• Dynamic Process Management • Dynamic process startup • Dynamic establishment of connections

• One-sided communication • Put/get • Other operations

• Parallel I/O

• Other MPI-2 features • Generalized requests • Bindings for C++/ Fortran-90; inter-language issues

Page 183: Parallelization - XS4ALL Klantenservice

!183

When to use MPI

• Portability and Performance

• Irregular Data Structures

• Building Tools for Others • Libraries

• Need to Manage memory on a per processor basis

Page 184: Parallelization - XS4ALL Klantenservice

!184

When not to use MPI

• Number of cores is limited and OpenMP is doing well on that number of cores

• typically 16-32 cores in SMP

Page 185: Parallelization - XS4ALL Klantenservice

Domain Decomposition

1. Break up the domain into blocks.

2. Assign blocks to MPI-processes one-to-one.

3. Provide a "map" of neighbors to each process.

4. Write or modify your code so it only updates a single block.

5. Insert communication subroutine calls where needed.

6. Adjust the boundary conditions code.

7. Use "guard cells".

!185

for example: wave equation in a 2D domain.

Page 186: Parallelization - XS4ALL Klantenservice

A Heat-Transfer Example with MPI (Rolf Rabenseifner, Höchstleistungsrechenzentrum Stuttgart)

Heat: How to parallelize with MPI

• Block data decomposition of arrays phi, phin
– array sizes are variable, they depend on the number of processors

• Halo (shadow, ghost cells, ...)
– to store values of the neighbors
– width of the halo := needs of the numerical formula (heat: 1)

• Communication:
– fill the halo with the values of the neighbors
– after each iteration (dt)

• Global evaluation of the abort criterion
– collective operation


Block data decomposition — in which dimensions?

Splitting in
• one dimension:   communication = n^2 * 2w * 1
• two dimensions:  communication = n^2 * 2w * 2 / p^(1/2)
• three dimensions: communication = n^2 * 2w * 3 / p^(2/3)

w = width of halo, n^3 = size of matrix, p = number of processors;
cyclic boundary —> two neighbors in each direction

optimal for p > 11

-1- Break up domain in blocks

!186

Page 187: Parallelization - XS4ALL Klantenservice

-2- Assign blocks to MPI processes

!187

[Diagram: the nx × nz domain is split into 2 × 2 blocks of size nx_block = nx/2 by nz_block = nz/2; block [0,0] is assigned to MPI 0, [1,0] to MPI 1, [0,1] to MPI 2 and [1,1] to MPI 3.]

Page 188: Parallelization - XS4ALL Klantenservice

-3- Make a map of neighbors

!188

[Diagram: each block stores the ranks of its neighbors, e.g. left=10, right=40, bottom=26, top=24 for the block surrounded by MPI 10, MPI 11, MPI 24, MPI 26, MPI 40 and MPI 41.]

Page 189: Parallelization - XS4ALL Klantenservice

-4- Write code for single MPI block

!189

for (ix=0; ix<nx_block; ix++) {
    for (iz=0; iz<nz_block; iz++) {
        vx[ix][iz] -= rox[ix][iz]*(
            c1*(p[ix][iz]   - p[ix-1][iz]) +
            c2*(p[ix+1][iz] - p[ix-2][iz]));
    }
}

Page 190: Parallelization - XS4ALL Klantenservice

-5- Communication calls

• Examine the algorithm • Does it need data from its neighbors? • Yes? => Insert communication calls to get this data

• Need p[-1][iz] and p[-2][iz] to update vx[0][iz]

• p[-1][iz] corresponds to p[nx_block-1][iz] from left neighbor • p[-2][iz] corresponds to p[nx_block-2][iz] from left neighbor

• Get this data from the neighbor with MPI.

!190

Page 191: Parallelization - XS4ALL Klantenservice

communication

!191

for (ix=0; ix<nx_block; ix++) {
    for (iz=0; iz<nz_block; iz++) {
        vx[ix][iz] -= rox[ix][iz]*(
            c1*(p[ix][iz]   - p[ix-1][iz]) +
            c2*(p[ix+1][iz] - p[ix-2][iz]));
    }
}

[Diagram: at the left edge of block MPI 12 (ix = 0), the update
   vx[0][iz] -= rox[0][iz]*( c1*(p[0][iz] - p[-1][iz]) + c2*(p[1][iz] - p[-2][iz]) )
needs p[-1][iz] and p[-2][iz], which correspond to p[nx_block-1][iz] and p[nx_block-2][iz] of the left neighbor MPI 10.]

Page 192: Parallelization - XS4ALL Klantenservice

pseudocode

if (my_rank==12) {
    call MPI_Receive(p[-2:-1][iz], .., 10, ..);

    vx[ix][iz] -= rox[ix][iz]*(
        c1*(p[ix][iz]   - p[ix-1][iz]) +
        c2*(p[ix+1][iz] - p[ix-2][iz]));
}

if (my_rank==10) {
    call MPI_Send(p[nx_block-2:nx_block-1][iz], .., 12, ..);
}

!192

Page 193: Parallelization - XS4ALL Klantenservice

Update of all blocks [i][j]

!193

[i-1][j]  send(nx-1) send(nx-2)   receive(-1) receive(-2)
[i][j]    receive(-1) receive(-2)  send(nx-1) send(nx-2)
[i+1][j]  send(nx-1) send(nx-2)   receive(-1) receive(-2)
[i+2][j]  receive(-1) receive(-2)  send(nx-1) send(nx-2)

mpi_sendrecv(p(nx-1:nx-2), 2, .., my_neighbor_right, ...,
             p(-2:-1),     2, .., my_neighbor_left,  ...)
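A self-contained C sketch of this exchange (my own illustration; the block sizes, names and the 1D decomposition in x are assumptions): the rightmost interior columns go to the right neighbour while the left guard cells are received from the left neighbour, and vice versa. With MPI_PROC_NULL neighbours the calls become no-ops, which is convenient at non-periodic boundaries (discussed a few slides further on).

#include <mpi.h>

#define NXB  64     /* nx_block: interior width of the local block (assumed) */
#define NZB  64     /* nz_block                                              */
#define HALO  2     /* width of the guard-cell region                        */

/* local block stored with HALO extra columns on each side:
   columns 0..HALO-1 and HALO+NXB..HALO+NXB+HALO-1 are guard cells */
static float p[NXB + 2*HALO][NZB];

static void exchange_left_right(int left, int right, MPI_Comm comm)
{
    /* send rightmost interior columns to the right, receive left guard cells */
    MPI_Sendrecv(&p[NXB][0],        HALO*NZB, MPI_FLOAT, right, 0,
                 &p[0][0],          HALO*NZB, MPI_FLOAT, left,  0,
                 comm, MPI_STATUS_IGNORE);
    /* send leftmost interior columns to the left, receive right guard cells */
    MPI_Sendrecv(&p[HALO][0],       HALO*NZB, MPI_FLOAT, left,  1,
                 &p[HALO + NXB][0], HALO*NZB, MPI_FLOAT, right, 1,
                 comm, MPI_STATUS_IGNORE);
}

int main(int argc, char *argv[])
{
    int rank, size, left, right;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;   /* non-periodic ends */
    right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;
    exchange_left_right(left, right, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}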

Page 194: Parallelization - XS4ALL Klantenservice

-6- Domain boundaries

• What if my block is at a physical boundary?

• Periodic boundary conditions

• Non-periodic boundary conditions

!194

Page 195: Parallelization - XS4ALL Klantenservice

Periodic Boundary Conditions

!195

[Diagram: with periodic boundary conditions the blocks wrap around, e.g. MPI 47 has left=38 and right=0, and MPI 0 has left=47 and right=10.]

Page 196: Parallelization - XS4ALL Klantenservice

Non-Periodic Boundary Conditions

!196

[Diagram: with non-periodic boundary conditions the outermost blocks have no neighbor on one side, e.g. MPI 47 has left=38 and right=MPI_PROC_NULL, MPI 0 has left=MPI_PROC_NULL and right=10.]

1. Avoid sending to or receiving from a non-existent block: use MPI_PROC_NULL.

2. Apply boundary conditions:

   if (my_neighbor_right == MPI_PROC_NULL) &
      [ APPLY APPROPRIATE BOUNDARY CONDITIONS ]

   if (my_rank == 12) my_neighbor_right = MPI_PROC_NULL
   ...
   call mpi_send(...... my_neighbor_right ...)   ! no data is sent by proc. 12

Page 197: Parallelization - XS4ALL Klantenservice


Heat: How to parallelize with MPI - Block Data Decomposition

• Block data decomposition
• Cartesian process topology
– two-dimensional: size = idim * kdim
– process coordinates: icoord = 0 .. idim-1, kcoord = 0 .. kdim-1
– see heat-mpi-big.f, lines 18-42

• Static array declaration –> dynamic array allocation
– phi(0:80,0:80) –> phi(is:ie,ks:ke)
– additional subroutine "algorithm"
– see heat-mpi-big.f, lines 186+192

• Computation of is, ie, ks, ke
– see heat-mpi-big.f, lines 44-98

[Diagram: the local array phi(is:ie,ks:ke) inside the global 0..80 range, with halo width b1=1. The halo (outer area) holds values read in each iteration; the inner area holds values computed in each iteration.]


Heat: Communication Method - Non-blocking

• Communicate halo in each iteration
• Ranks via MPI_CART_SHIFT:

  114  call MPI_CART_SHIFT(comm, 0, 1, left, right, ierror)
  115  call MPI_CART_SHIFT(comm, 1, 1, lower, upper, ierror)

• Non-blocking point-to-point: from left to right, from right to left, from upper to lower, and from lower to upper
• Caution: corners must not be sent simultaneously in two directions
• Vertical boundaries:
– non-contiguous data
– MPI derived datatypes used

  118  call MPI_TYPE_VECTOR( kinner, b1, iouter, MPI_DOUBLE_PRECISION, vertical_border, ierror)
  120  call MPI_TYPE_COMMIT( vertical_border, ierror)
  221  call MPI_IRECV(phi(is,ks+b1), 1, vertical_border, left, MPI_ANY_TAG, comm, req(1), ierror)
  225  call MPI_ISEND(phi(ie-b1,ks+b1), 1, vertical_border, right, 0, comm, req(3), ierror)

• Allocate local block including overlapping areas to store values received from neighbor communication.

• This allows tidy coding of your loops

-7- guard cells

!197

[Diagram: array P[i][iz] with guard cells at indices i = -2 and i = -1, to the left of the first interior cell i = 0.]
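A tiny C sketch of the idea (1D for clarity, my own illustration): allocate the block with room for the halo and shift the pointer, so that indices -halo .. nx_block-1+halo are all valid and the update loops need no special cases at the edges.

#include <stdlib.h>

/* returns a pointer whose valid indices run from -halo to nx_block-1+halo;
   the cells outside 0..nx_block-1 are the guard cells filled by MPI */
float *alloc_with_guard(int nx_block, int halo)
{
    float *raw = calloc(nx_block + 2*halo, sizeof(float));
    return raw + halo;
}

void free_with_guard(float *p, int halo)
{
    free(p - halo);    /* undo the shift before freeing */
}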

Page 198: Parallelization - XS4ALL Klantenservice

Memory use increases with guard

• Example: a 1000x1000 grid, N tasks, H the width of a (square) task, h the number of halo points. The guard size per domain is

• halosize = 4*h*H + 4*h*h (the four corners)

• The halo size limits the minimum task size, H >= 2h, and hence the maximum number of tasks, N <= (Nx/2h)**2

• N = 1089 = 33*33, then H = 1000/33 ~ 30, h = 8
• halosize = 4*8*30 + 4*8*8 = 1216
• actual work is 30*30 = 900 points

!198

Page 199: Parallelization - XS4ALL Klantenservice

Exercise: MPI_Cart

• Do I have to make a map of MPI processes myself?

• You can use MPI_Cart to create a domain decomposition for you.

• The exercise sets up a 2D domain with one periodic and one non-periodic boundary condition

• Download code from: http://www.xs4all.nl/~janth/HPCourse/code/MPI_Cart.tar

• Unpack tar file and check the README for instructions

!199
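A compact sketch of what such a setup looks like (an illustration, not the code in the tar file): a 2D process grid, periodic in one dimension and non-periodic in the other; MPI_Cart_shift returns MPI_PROC_NULL where there is no neighbour.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int size, rank, dims[2] = {0, 0}, periods[2] = {1, 0}, coords[2];
    int left, right, top, bottom;
    MPI_Comm cart;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Dims_create(size, 2, dims);                 /* factor size into a 2D grid */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart);
    MPI_Comm_rank(cart, &rank);
    MPI_Cart_coords(cart, rank, 2, coords);
    MPI_Cart_shift(cart, 0, 1, &left, &right);      /* periodic dimension */
    MPI_Cart_shift(cart, 1, 1, &top, &bottom);      /* non-periodic: may be MPI_PROC_NULL */
    printf("PE %d: x=%d, y=%d: left=%d right=%d top=%d bottom=%d\n",
           rank, coords[0], coords[1], left, right, top, bottom);
    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}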

Page 200: Parallelization - XS4ALL Klantenservice

PE- 6: x=1, y=2: Sum = 32 neighbors: left=2 right=10 top=5 bottom=7

PE- 4: x=1, y=0: Sum = 24 neighbors: left=0 right=8 top=-1 bottom=5

PE- 2: x=0, y=2: Sum = 32 neighbors: left=14 right=6 top=1 bottom=3

PE- 7: x=1, y=3: Sum = 36 neighbors: left=3 right=11 top=6 bottom=-1

PE- 0: x=0, y=0: Sum = 24 neighbors: left=12 right=4 top=-1 bottom=1

PE- 5: x=1, y=1: Sum = 28 neighbors: left=1 right=9 top=4 bottom=6

PE- 1: x=0, y=1: Sum = 28 neighbors: left=13 right=5 top=0 bottom=2

PE- 3: x=0, y=3: Sum = 36 neighbors: left=15 right=7 top=2 bottom=-1

PE- 14: x=3, y=2: Sum = 32 neighbors: left=10 right=2 top=13 bottom=15

PE- 9: x=2, y=1: Sum = 28 neighbors: left=5 right=13 top=8 bottom=10

PE- 8: x=2, y=0: Sum = 24 neighbors: left=4 right=12 top=-1 bottom=9

PE- 11: x=2, y=3: Sum = 36 neighbors: left=7 right=15 top=10 bottom=-1

PE- 15: x=3, y=3: Sum = 36 neighbors: left=11 right=3 top=14 bottom=-1

PE- 10: x=2, y=2: Sum = 32 neighbors: left=6 right=14 top=9 bottom=11

PE- 12: x=3, y=0: Sum = 24 neighbors: left=8 right=0 top=-1 bottom=13

PE- 13: x=3, y=1: Sum = 28 neighbors: left=9 right=1 top=12 bottom=14

answer 16 MPI tasks

!200

Page 201: Parallelization - XS4ALL Klantenservice

!201

The corresponding 4 x 4 process grid:

 0  4  8 12
 1  5  9 13
 2  6 10 14
 3  7 11 15

Page 202: Parallelization - XS4ALL Klantenservice

Example: 3D-Finite Difference

• 3D domain decomposition to distribute the work • MPI_Cart_create to set up cartesian domain

• Hide communication of halo-areas with computational work

• MPI-IO for writing snapshots to disk
• use local and global derived datatypes to exclude the halo areas when writing into the global snapshot file

• Tutorial about domain decomposition: http://www.nccs.nasa.gov/tutorials/mpi_tutorial2/

!202

Page 203: Parallelization - XS4ALL Klantenservice

MPI_Comm_size(MPI_COMM_WORLD, &size);

/* divide 3D cube among available processors */
dims[0]=0; dims[1]=0; dims[2]=0;
MPI_Dims_create(size, 3, dims);
npz = 2*halo+nz/dims[0];
npx = 2*halo+nx/dims[1];
npy = 2*halo+ny/dims[2];

/* create cartesian domain decomposition */
MPI_Cart_create(MPI_COMM_WORLD, 3, dims, period, reorder, &COMM_CART);

/* find out neighbors of the rank, MPI_PROC_NULL is a hard boundary of the model */
MPI_Cart_shift(COMM_CART, 1, 1, &left, &right);
MPI_Cart_shift(COMM_CART, 2, 1, &top, &bottom);
MPI_Cart_shift(COMM_CART, 0, 1, &front, &back);

fprintf(stderr, "Rank %d has LR neighbours %d %d FB %d %d TB %d %d\n",
        my_rank, left, right, front, back, top, bottom);

!203

Page 204: Parallelization - XS4ALL Klantenservice

!204

[Diagram: the global nx × ny array, distributed over px × py processes, and one local lnx × lny block; local and global subarrays are used for the MPI-IO of each block into the global file.]

Page 205: Parallelization - XS4ALL Klantenservice

!205

/* create subarrays (excluding halo areas) to write to file with MPI-IO */
/* data in the local array */
sizes[0]=npz; sizes[1]=npx; sizes[2]=npy;
subsizes[0]=sizes[0]-2*halo;
subsizes[1]=sizes[1]-2*halo;
subsizes[2]=sizes[2]-2*halo;
starts[0]=halo; starts[1]=halo; starts[2]=halo;
MPI_Type_create_subarray(3, sizes, subsizes, starts, MPI_ORDER_C, MPI_FLOAT, &local_array);
MPI_Type_commit(&local_array);

/* data in the global array */
gsizes[0]=nz; gsizes[1]=nx; gsizes[2]=ny;
gstarts[0]=subsizes[0]*coord[0];
gstarts[1]=subsizes[1]*coord[1];
gstarts[2]=subsizes[2]*coord[2];
MPI_Type_create_subarray(3, gsizes, subsizes, gstarts, MPI_ORDER_C, MPI_FLOAT, &global_array);
MPI_Type_commit(&global_array);

Page 206: Parallelization - XS4ALL Klantenservice

!206

for (it=0; it<nt; it++) {

    /* copy of halo areas */
    copyHalo(vx, vz, p, npx, npy, npz, halo,
             leftSend, rightSend, frontSend, backSend, topSend, bottomSend);

    /* start asynchronous communication of halo areas to neighbours */
    exchangeHalo(leftRecv, leftSend, rightRecv, rightSend, 3*halosizex, left, right, &reqRecv[0], &reqSend[0], 0);
    exchangeHalo(frontRecv, frontSend, backRecv, backSend, 3*halosizey, front, back, &reqRecv[2], &reqSend[2], 4);
    exchangeHalo(topRecv, topSend, bottomRecv, bottomSend, 3*halosizez, top, bottom, &reqRecv[4], &reqSend[4], 8);

    /* update of grid values excluding halo areas */
    updateVelocities(vx, vy, vz, p, ro, ixs, ixe, iys, iye, izs, ize, npx, npy, npz);
    updatePressure(vx, vy, vz, p, l2m, ixs, ixe, iys, iye, izs, ize, npx, npy, npz);

Page 207: Parallelization - XS4ALL Klantenservice

!207

    /* wait for halo communications to be finished */
    waitForHalo(&reqRecv[0], &reqSend[0]);
    t1 = wallclock_time(); tcomm += t1-t2;

    newHalo(vx, vz, p, npx, npy, npz, halo,
            leftRecv, rightRecv, frontRecv, backRecv, topRecv, bottomRecv);
    t2 = wallclock_time(); tcopy += t2-t1;

    /* compute field on the grid of the halo areas */
    updateVelocities(vx, vy, vz, p, ro, ixs, ixe, iys, iye, izs, ize, npx, npy, npz);
    updatePressure(vx, vy, vz, p, l2m, ixs, ixe, iys, iye, izs, ize, npx, npy, npz);

Page 208: Parallelization - XS4ALL Klantenservice

!208

    /* write snapshots to file */
    if (time == snapshottime) {
        MPI_Info_create(&fileinfo);
        MPI_Info_set(fileinfo, "striping_factor", STRIPE_COUNT);
        MPI_Info_set(fileinfo, "striping_unit", STRIPE_SIZE);
        MPI_File_delete(filename, MPI_INFO_NULL);

        rc = MPI_File_open(MPI_COMM_WORLD, filename, MPI_MODE_RDWR|MPI_MODE_CREATE, fileinfo, &fh);

        disp = 0;
        rc = MPI_File_set_view(fh, disp, MPI_FLOAT, global_array, "native", fileinfo);

        rc = MPI_File_write_all(fh, p, 1, local_array, status);

        MPI_File_close(&fh);
    }

} /* end of time loop */

Page 209: Parallelization - XS4ALL Klantenservice

Summary

!209

Page 210: Parallelization - XS4ALL Klantenservice

MPI Summary

• MPI Standard widely accepted by vendors and programmers

• MPI implementations available on most modern platforms
• Several MPI applications deployed
• Several tools exist to trace and tune MPI applications

• Simple applications use point-to-point and collective communication operations

• Advanced applications use point-to-point, collective, communicators, datatypes, one-sided, and topology operations

!210

Page 211: Parallelization - XS4ALL Klantenservice

Hybrid systems programming hierarchy

Hybrid System
• Pure MPI
• Hybrid MPI/OpenMP: OpenMP in SMP nodes, MPI across the nodes
  – No overlapping computation/communication: MPI only outside the OpenMP code; only the master thread does MPI communication.
  – Overlapping computation/communication: MPI inside the OpenMP code.
• Pure OpenMP (shared memory)

Page 212: Parallelization - XS4ALL Klantenservice

Hybrid OpenMP/MPI

• Natural paradigm for clusters of SMPs
• May offer considerable advantages when application mapping and load balancing are tough
• Benefits with slower interconnection networks (overlapping computation/communication)

‣ Requires work and code analysis to change pure MPI codes
‣ Start with auto-parallelization?
‣ Link shared-memory libraries; check various thread/MPI process combinations
‣ Study carefully the underlying architecture

• What is the future of this model? Could it be consolidated in new languages?

• Connection with many-core?

!212

Page 213: Parallelization - XS4ALL Klantenservice

!213

[Diagram (C. Bekas): a 2D discretization mesh distributed over SMP nodes P0–P3.]

Suppose we wish to solve the PDE with the Jacobi method: the value of u at each discretization point is given by a certain average among its neighbors, repeated until convergence. Distributing the mesh over the SMP cluster by domain decomposition, the green (interior) nodes can proceed without any communication, while the blue (boundary) nodes have to communicate first and calculate later.

Overlapping computation/communication: Example

Page 214: Parallelization - XS4ALL Klantenservice

MPI/OpenMP: Overlapping computation/communication

!214

Not only the master but other threads communicate. Call MPI primitives in OpenMP code regions.

if (my_thread_id < #) {
    MPI_...   (communicate needed data)
} else {
    /* Perform computations that do not need communication */
    ...
}
/* All threads execute code that requires communication */
...

Page 215: Parallelization - XS4ALL Klantenservice

!215

for (k=0; k < MAXITER; k++) {
    /* Start parallel region here */
    #pragma omp parallel private(...)
    {
        my_id = omp_get_thread_num();

        if (my_id is given "halo points")
            MPI_SendRecv("From neighboring MPI process");
        else {
            for (i=0; i < # allocated points; i++)
                newval[i] = avg(oldval[i]);
        }

        if (there are still points I need to do)
            for (i=0; i < # remaining points; i++)
                newval[i] = avg(oldval[i]);

        for (i=0; i < (all_my_points); i++)
            oldval[i] = newval[i];
    }
    MPI_Barrier();   /* Synchronize all MPI processes here */

}
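For comparison, here is a minimal compilable sketch of the funneled hybrid style discussed above (an illustration, not course code): MPI is initialised for threaded use, the OpenMP threads do the computation, and only the single thread outside the parallel region calls MPI.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int provided, rank;
    double local = 0.0, global = 0.0;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel reduction(+:local)
    {
        local += omp_get_thread_num() + 1;   /* stands in for real work */
    }
    /* outside the parallel region only one thread remains: safe to call MPI */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) printf("global sum = %f\n", global);
    MPI_Finalize();
    return 0;
}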

Page 216: Parallelization - XS4ALL Klantenservice


Hiding IO with IO-Server

!216

Compute Node:

   do i=1,time_steps
      compute(j)
      checkpoint(data)
   end do

   subroutine checkpoint(data)
      MPI_Wait(send_req)
      buffer = data
      MPI_Isend(IO_SERVER, buffer)
   end subroutine

I/O Server:

   do i=1,time_steps
      do j=1,compute_nodes
         MPI_Recv(j, buffer)
         write(buffer)
      end do
   end do

Use more nodes to act as IO-Servers (pseudo code)
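A C sketch of the same pattern (an illustration with made-up sizes and file name; for simplicity it uses blocking MPI_Send where the pseudo code overlaps the send with MPI_Isend/MPI_Wait): the last rank acts as the I/O server, the other ranks send it their checkpoint buffers and continue computing.

#include <mpi.h>
#include <stdio.h>

#define NSTEPS 10        /* assumed number of time steps       */
#define BUFLEN 1024      /* assumed checkpoint buffer length   */

int main(int argc, char *argv[])
{
    int rank, size;
    double buffer[BUFLEN];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int io_server = size - 1;             /* last rank does the writing */

    if (rank == io_server) {
        FILE *fp = fopen("checkpoint.bin", "wb");
        for (int step = 0; step < NSTEPS; step++)
            for (int node = 0; node < size - 1; node++) {
                MPI_Recv(buffer, BUFLEN, MPI_DOUBLE, node, step,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                fwrite(buffer, sizeof(double), BUFLEN, fp);
            }
        fclose(fp);
    } else {
        for (int step = 0; step < NSTEPS; step++) {
            for (int i = 0; i < BUFLEN; i++) buffer[i] = rank + step;  /* "compute" */
            MPI_Send(buffer, BUFLEN, MPI_DOUBLE, io_server, step, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}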

Page 217: Parallelization - XS4ALL Klantenservice

PGAS

• What is PGAS?

• How to make use of PGAS as a programmer?

!217

Page 218: Parallelization - XS4ALL Klantenservice

PRACE Winter School 2009 — Montse Farreras

• Explicitly parallel, shared-memory-like programming model

• Global addressable space
  – Allows programmers to declare and "directly" access data distributed across the machine

• Partitioned address space
  – Memory is logically partitioned between local and remote (a two-level hierarchy)
  – Forces the programmer to pay attention to data locality, by exposing the inherent NUMA-ness of current architectures

• Single Program Multiple Data (SPMD) execution model
  – All threads of control execute the same program
  – Number of threads fixed at startup
  – Newer languages such as X10 escape this model, allowing fine-grain threading

• Different language implementations:
  – UPC (C-based), CoArray Fortran (Fortran-based), Titanium and X10 (Java-based)

Partitioned Global Address Space

!218

Page 219: Parallelization - XS4ALL Klantenservice


• Computation is performed in multiple places.
• A place contains data that can be operated on remotely.
• Data lives in the place it was created, for its lifetime.
• A datum in one place may point to a datum in another place.
• Data structures (e.g. arrays) may be distributed across many places.

A place expresses locality.

[Diagram: programming models arranged by address space — shared memory (OpenMP), PGAS (UPC, CAF, X10) and message passing (MPI), each with its own process/thread organisation.]

Partitioned Global Address Space

!219

Page 220: Parallelization - XS4ALL Klantenservice

Shared Memory (OpenMP)

• Multiple threads share global memory
• Most common variant: OpenMP

• Program loop iterations distributed to threads, more recent task features
• Each thread has a means to refer to private objects within a parallel context

• Terminology
  • Thread, thread team

• Implementation
  • Threads map to user threads running on one SMP node
  • Extensions to multiple servers not so successful

!220

Page 221: Parallelization - XS4ALL Klantenservice

OpenMP

!221

[Diagram: several threads attached to one shared memory.]

Page 222: Parallelization - XS4ALL Klantenservice

OpenMP: work distribution

!222

[Diagram: four threads sharing one memory; the 32 loop iterations are distributed over the threads as 1-8, 9-16, 17-24 and 25-32.]

!$OMP PARALLEL DO
      do i=1,32
         a(i)=a(i)*2
      end do

Page 223: Parallelization - XS4ALL Klantenservice

OpenMP: implementation

!223

[Diagram: one process with multiple threads, mapped onto the CPUs of a shared-memory node.]

Page 224: Parallelization - XS4ALL Klantenservice

Message Passing (MPI)

• Participating processes communicate using a message-passing API

• Remote data can only be communicated (sent or received) via the API.

• MPI (the Message Passing Interface) is the standard

• Implementation:
  • MPI processes map to processes within one SMP node or across multiple networked nodes
  • API provides process numbering, point-to-point and collective messaging operations

!224

Page 225: Parallelization - XS4ALL Klantenservice

MPI

!225

[Diagram: several processes, each with its own memory and CPU, connected only by message passing.]

Page 226: Parallelization - XS4ALL Klantenservice

MPI

!226

[Diagram: process 0 and process 1, each with private memory and CPU; process 0 executes MPI_Send(a,...,1,…) and process 1 executes the matching MPI_Recv(a,...,0,…).]

Page 227: Parallelization - XS4ALL Klantenservice

Partitioned Global Address Space

• Shortened to PGAS

• Participating processes/threads have access to local memory via standard program mechanisms

• Access to remote memory is directly supported by the PGAS language

!227

Page 228: Parallelization - XS4ALL Klantenservice

PGAS

!228

[Diagram: several processes, each with local memory and CPU; together their memories form one partitioned global address space that every process can address.]

Page 229: Parallelization - XS4ALL Klantenservice

Partitioned Global Address Space (PGAS):

• Global address space – any process can address memory on any processor

• Partitioned GAS – retain information about locality

• Core idea – the hardest part of writing parallel code is managing data distribution and communication; make that simple and explicit

• PGAS languages try to simplify parallel programming (increase programmer productivity).


PGAS languages

Page 230: Parallelization - XS4ALL Klantenservice

Data Parallel Languages

• Unified Parallel C (UPC) is an extension of the C programming language designed for high performance computing on large-scale parallel machines. http://upc.lbl.gov/

• Co-array Fortran (CAF) is part of Fortran 2008 standard. It is a simple, explicit notation for data decomposition, such as that often used in message-passing models, expressed in a natural Fortran-like syntax. http://www.co-array.org

• both need a global address space (which is not equal to SMP)

!230

Page 231: Parallelization - XS4ALL Klantenservice

• Unified Parallel C:
  – An extension of C99
  – An evolution of AC, PCP, and Split-C

• Features
  – SPMD parallelism via replication of threads
  – Shared and private address spaces
  – Multiple memory consistency models

• Benefits
  – Global view of data
  – One-sided communication


UPC

Page 232: Parallelization - XS4ALL Klantenservice

• Co-Array Fortran:
  – An extension of Fortran 95 and part of "Fortran 2008"
  – The language formerly known as F--

• Features
  – SPMD parallelism via replication of images
  – Co-arrays for distributed shared data

• Benefits
  – Syntactically transparent communication
  – One-sided communication
  – Multi-dimensional arrays
  – Array operations


Co-Array Fortran

Page 233: Parallelization - XS4ALL Klantenservice

Basic execution model co-array F--

• Program executes as if replicated to multiple copies with each copy executing asynchronously (SPMD)

• Each copy (called an image) executes as a normal Fortran application

• New object indexing with [] can be used to access objects on other images.

• New features to inquire about image index, number of images and to synchronize

!233

Page 234: Parallelization - XS4ALL Klantenservice

CAF (F--)

REAL, DIMENSION(N)[*] :: X,Y
X(:) = Y(:)[Q]

Array indices in parentheses follow the normal Fortran rules within one memory image.

Array indices in square brackets provide an equally convenient notation for accessing objects across images and follow similar rules.

!234

parallel

Page 235: Parallelization - XS4ALL Klantenservice

Coarray execution model

!235

[Diagram: images 1, 2 and 3, each with its own memory and CPU; the coarrays in each image's memory can be accessed remotely with square-bracket indexing, e.g. a(:)[2].]

Page 236: Parallelization - XS4ALL Klantenservice

Basic coarray declaration and usage

!236

integer :: b
integer :: a(4)[*]   !coarray

[Diagram: image 1 has a = (1, 8, 1, 5) and b = 1; image 2 has a = (1, 7, 9, 9) and b = 3; image 3 has a = (1, 7, 9, 4) and b = 6. The statement to execute is:]

b = a(2)

Page 237: Parallelization - XS4ALL Klantenservice

Basic coarray declaration and usage

!237

integer :: b
integer :: a(4)[*]   !coarray

b = a(2)

[Diagram: after the statement each image has read its own local a(2): image 1 gets b = 8, image 2 gets b = 7, image 3 gets b = 7.]

Page 238: Parallelization - XS4ALL Klantenservice

Basic coarray declaration and usage

!238

integer :: b
integer :: a(4)[*]   !coarray

b = a(4)[3]

[Diagram: before the statement, image 1 has a = (1, 8, 1, 5) and b = 1; image 2 has a = (1, 7, 9, 9) and b = 3; image 3 has a = (1, 7, 9, 4) and b = 6.]

Page 239: Parallelization - XS4ALL Klantenservice

Basic coarray declaration and usage

!239

integer :: b
integer :: a(4)[*]   !coarray

b = a(4)[3]

[Diagram: after the statement every image has read a(4) from image 3, so b = 4 on image 1, image 2 and image 3; the arrays a are unchanged.]

Page 240: Parallelization - XS4ALL Klantenservice

!240

Page 241: Parallelization - XS4ALL Klantenservice


Performance tuning

[Plot: speedup S versus number of processors p — a first attempt, a second attempt, and the ideal curve.]

Performance tuning of parallel applications is an iterative process

Performance Tuning

!241

Page 242: Parallelization - XS4ALL Klantenservice

Compiling MPI Programs

!242

Page 243: Parallelization - XS4ALL Klantenservice

Compiling and Starting MPI Jobs

• Compiling:

• Need to link with appropriate MPI and communication subsystem libraries and set path to MPI Include files

• Most vendors provide scripts or wrappers for this (mpxlf, mpif77, mpicc, etc)

• Starting jobs:

• Most implementations use a special loader named mpirun

– mpirun -np <no_of_processors> <prog_name>

• In MPI-2 it is recommended to use

– mpiexec -n <no_of_processors> <prog_name>

!243

Page 244: Parallelization - XS4ALL Klantenservice

!244

MPICH: a Portable MPI Environment

• MPICH is a high-performance portable implementation of MPI (both 1 and 2).

• It runs on MPP's, clusters, and heterogeneous networks of workstations.

• The CH comes from Chameleon, the portability layer used in the original MPICH to provide portability to the existing message-passing systems.

• In a wide variety of environments, one can do:

   mpicc -mpitrace myprog.c
   mpirun -np 10 myprog
   upshot myprog.log

to build, compile, run, and analyze performance.

Page 245: Parallelization - XS4ALL Klantenservice

MPICH2

MPICH2 is an all-new implementation of the MPI Standard, designed to implement all of the MPI-2 additions to MPI.

‣ separation between process management and communications ‣ use daemons (mpd) on nodes ‣ dynamic process management, ‣ one-sided operations, ‣ parallel I/O, and others

• http://www.mcs.anl.gov/research/projects/mpich2/

!245

Page 246: Parallelization - XS4ALL Klantenservice

!246

Compiling MPI programs

• From a command line: mpicc -o prog prog.c

• Use profiling options (specific to mpich):
  • -mpilog     Generate log files of MPI calls
  • -mpitrace   Trace execution of MPI calls
  • -mpianim    Real-time animation of MPI (not available on all systems)
  • --help      Find list of available options

• The environment variables MPICH_CC, MPICH_CXX, MPICH_F77, and MPICH_F90 may be used to specify alternate C, C++, Fortran 77, and Fortran 90 compilers, respectively.

Page 247: Parallelization - XS4ALL Klantenservice

!247

example hello.c

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int myrank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    fprintf(stdout, "Hello World, I am process %d\n", myrank);
    MPI_Finalize();
    return 0;
}

Page 248: Parallelization - XS4ALL Klantenservice

!248

Example: using MPICH1

-bash-3.1$ mpicc -o hello hello.c
-bash-3.1$ mpirun -np 4 hello
Hello World, I am process 0
Hello World, I am process 2
Hello World, I am process 3
Hello World, I am process 1

Page 249: Parallelization - XS4ALL Klantenservice

!249

Example: details

• If using frontend and compute nodes in the machines file, use:
   mpirun -np 2 -machinefile machines hello

• If using only compute nodes in the machines file, use:
   mpirun -nolocal -np 2 -machinefile machines hello

• -nolocal — don't start the job on the frontend
• -np 2 — start 2 MPI processes
• -machinefile machines — the nodes are specified in the machines file
• hello — the program to start

Page 250: Parallelization - XS4ALL Klantenservice

!250

Notes on clusters

• Make sure you have access to the compute nodes (ssh keys are generated with ssh-keygen) and ask your system administrator.

• Check which mpicc you are using:
  $ which mpicc

• Command-line arguments are not always passed on by mpirun/mpiexec (depending on your version). In that case make a script which calls your program with all its arguments.

Page 251: Parallelization - XS4ALL Klantenservice

MPICH2 daemons

• mpdtrace: output a list of nodes on which you can run MPI programs (nodes running mpd daemons).
  ‣ The -l option lists full hostnames and the port where the mpd is listening.

• mpd starts an mpd daemon.

• mpdboot starts a set of mpd’s on a list of machines.

• mpdlistjobs lists the jobs that the mpd’s are running.

• mpdkilljob kills a job specified by the name returned by mpdlistjobs

• mpdsigjob delivers a signal to the named job.

!251

Page 252: Parallelization - XS4ALL Klantenservice

MPI - Message Passing Interface

• MPI or MPI-1 is a library specification for message-passing.

• MPI-2: Adds in Parallel I/O, Dynamic Process management, Remote Memory Operation, C++ & F90 extension …

• MPI Standard: http://www-unix.mcs.anl.gov/mpi/standard.html

• MPI Standard 1.1 Index: http://www.mpi-forum.org/docs/mpi-11-html/node182.html

• MPI-2 Standard Index: http://www-unix.mcs.anl.gov/mpi/mpi-standard/mpi-report-2.0/node306.htm

• MPI Forum Home Page: http://www.mpi-forum.org/index.html

Page 253: Parallelization - XS4ALL Klantenservice

MPI tutorials

• http://www.nccs.nasa.gov/tutorials/mpi_tutorial2/mpi_II_tutorial.html

• https://fs.hlrs.de/projects/par/par_prog_ws/

!253

Page 254: Parallelization - XS4ALL Klantenservice

!254

MPI Sources

• The Standard (3.0) itself:
  • at http://www.mpi-forum.org
  • All MPI official releases, in both PDF and HTML

• Books:
  – Using MPI: Portable Parallel Programming with the Message-Passing Interface, by Gropp, Lusk, and Skjellum, MIT Press, 1994.
  – Parallel Programming with MPI, by Peter Pacheco, Morgan-Kaufmann, 1997.

• Other information on the Web:
  • at http://www.mcs.anl.gov/mpi
  • pointers to lots of stuff, including other talks and tutorials, a FAQ, other MPI pages
  • http://mpi.deino.net/mpi_functions/index.htm

Page 255: Parallelization - XS4ALL Klantenservice

Job Management and queuing

!255

Page 256: Parallelization - XS4ALL Klantenservice

Job Management and queuing

• On a large system many users are running simultaneously.

• What to do when:

  • The system is full and you want to run your 512-CPU job?
  • You want to run 16 jobs — should others have to wait for that?
  • Should bigger jobs have priority over smaller jobs?
  • Should longer jobs have lower or higher priority?

• The job manager and queue system takes care of it.

!256

Page 257: Parallelization - XS4ALL Klantenservice

MPICH1 in PBS

#!/bin/bash
#PBS -N Flank
#PBS -V
#PBS -l nodes=11:ppn=2
#PBS -j eo

cd $PBS_O_WORKDIR
export nb=`wc -w < $PBS_NODEFILE`
echo $nb

mpirun -np $nb -machinefile $PBS_NODEFILE ~/bin/migr_mpi \
    file_vel=$PBS_O_WORKDIR/grad_salt_rot.su \
    file_shot=$PBS_O_WORKDIR/cfp2_sx.su

!257

Page 258: Parallelization - XS4ALL Klantenservice

Job Management and queuing

• PBS (maui): • http://www.clusterresources.com/pages/products/maui-cluster-scheduler.php • Torque resource manager http://www.clusterresources.com/pages/products/

torque-resource-manager.php

• Sun Grid Engine • http://gridengine.sunsource.net/

• SLURM • http://slurm.schedmd.com

• Luckily their interface is very similar (qsub, qstat, ...)

!258

Page 259: Parallelization - XS4ALL Klantenservice

queuing commands

• qsub

• qstat

• qdel

• xpbsmon

Page 260: Parallelization - XS4ALL Klantenservice

qsub

#PBS -l nodes=10:ppn=1
#PBS -l mem=20mb
#PBS -l walltime=1:00:00
#PBS -j eo

submit:

   qsub -q normal job.scr

output:

   jobname.ejobid
   jobname.ojobid

Page 261: Parallelization - XS4ALL Klantenservice

qstat

• available queue and resources

qstat -q

• queued and running jobs

qstat (-a)

!261

Page 262: Parallelization - XS4ALL Klantenservice

qdel

• deletes job from queue and stops all running executables

qdel jobid

!262

Page 263: Parallelization - XS4ALL Klantenservice

Submitting many jobs in one script

#!/bin/bash -f
#
export xsrc1=93
export xsrc2=1599
export dxsrc=3
xsrc=$xsrc1

while (( xsrc <= xsrc2 ))
do
echo ' modeling shot at x=' $xsrc

cat << EOF > jobs/pbs_${xsrc}.job
#!/bin/bash
#
#PBS -N fahud_${xsrc}
#PBS -q verylong
#PBS -l nodes=1:ppn=1
#PBS -V

program with arguments

EOF

qsub jobs/pbs_${xsrc}.job
(( xsrc = $xsrc + $dxsrc ))
done

!263

Be careful !

Page 264: Parallelization - XS4ALL Klantenservice

!264

Page 265: Parallelization - XS4ALL Klantenservice

Exercise: PI with MPI and OpenMP

• Compile and run the code for computing PI in parallel

• Download code from: http://www.xs4all.nl/~janth/HPCourse/code/MPI_pi.tar

• Unpack tar file and check the README for instructions.

• It is assumed that you use mpich1 and have PBS installed as a job scheduler. If you have mpich2, let me know.

• Use qsub to submit the mpi job (an example script is provided) to a queue.

!265

Page 266: Parallelization - XS4ALL Klantenservice

Exercise: PI with MPI and OpenMP

!266

cores   OpenMP      marken (MPI)   linux (MPI)
  1     9.617728    14.10798       22.15252
  2     4.874539     7.071287       9.661745
  4     2.455036     3.532871       5.730912
  6     1.627149     2.356928       3.547961
  8     1.214713     1.832055       2.804715
 12     0.820746     1.184123
 16     0.616482     0.955704

Page 267: Parallelization - XS4ALL Klantenservice

[Plot: Pi Scaling — speedup versus number of CPUs (1 to 16) for Amdahl 1.0, OpenMP on Barcelona 2.2 GHz, MPI on marken (Xeon 3.1 GHz) and MPI on linux (Xeon 2.4 GHz).]

Exercise: PI with MPI and OpenMP

!267

Page 268: Parallelization - XS4ALL Klantenservice

Exercise: OpenMP Max

• Compile and run for computing PI parallel

• Download code from: http://www.xs4all.nl/~janth/HPCourse/code/MPI_pi.tar

• Unpack tar file and check the README for instructions.

• The exercise focuses on using the available number of cores in an efficient way.

• Also inspect the code and see how the reductions are done, is there another way of doing the reductions?

!268

Page 269: Parallelization - XS4ALL Klantenservice

Exercise: OpenMP details

• More details on using OpenMP and shared-memory parallelisation

• Download code from: http://www.xs4all.nl/~janth/HPCourse/code/OpenMP.tar

• Unpack tar file and check the README for instructions.

• These are ‘old’ exercises from SGI and give insight into problems you can encounter using OpenMP.

• They already require some knowledge of OpenMP. The OpenMP F-Summary.pdf from the website can be helpful.

!269

Page 270: Parallelization - XS4ALL Klantenservice

!270

Extend the Exercises

• Modify the pi program to use send/receive instead of bcast/reduce.

• Write a program that sends a message around a ring. That is, process 0 reads a line from the terminal and sends it to process 1, who sends it to process 2, etc. The last process sends it back to process 0, who prints it.

• Time programs with MPI_WTIME().
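A minimal sketch of the timing idea (illustration only): bracket the section you want to measure with MPI_Wtime(); MPI_Wtick() gives the timer resolution.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    /* ... the work you want to time goes here ... */
    MPI_Barrier(MPI_COMM_WORLD);       /* optional: measure the slowest process */
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("elapsed: %f s (resolution %g s)\n", t1 - t0, MPI_Wtick());
    MPI_Finalize();
    return 0;
}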

Page 271: Parallelization - XS4ALL Klantenservice

END

!271