Intel® MPI Library e OpenMP* - Intel Software Conference 2013


Description

Talk given by Werner Krotz-Vogel at the Intel Software Conference on August 6 (NCC/UNESP/SP) and August 12 (COPPE/UFRJ/RJ).

Transcript of Intel® MPI Library e OpenMP* - Intel Software Conference 2013

Page 1: Intel® MPI Library e OpenMP* - Intel Software Conference 2013


MPI and OpenMP

Reducing effort for parallel software development

August, 2013


Werner Krotz-Vogel

Page 2: Intel® MPI Library e OpenMP* - Intel Software Conference 2013

© 2009 Mathew J. Sottile, Timothy G. Mattson, and Craig E. Rasmussen

Objectives

• Design parallel applications from serial codes

• Determine appropriate decomposition strategies for applications

• Choose applicable parallel model for implementation

• MPI and OpenMP

Page 3: Intel® MPI Library e OpenMP* - Intel Software Conference 2013


Why MPI and OpenMP?

• Performance ~ die area
– 4x the silicon die area gives 2x the performance in one core, but 4x the performance when dedicated to 4 cores

• Power ~ Voltage² (voltage is roughly proportional to clock frequency)

• Conclusion (from Pollack's rule above)
– Multiple cores are a powerful handle to adjust "Performance/Watt"

Parallel hardware → parallel software
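A rough back-of-the-envelope reading of Pollack's rule (single-core performance scales roughly with the square root of die area) illustrates the trade-off; the numbers below are the generic textbook form of the argument, not measurements from the slide:

  1 core on 4x die area:       performance ≈ sqrt(4) = 2x
  4 cores on 1x die area each: aggregate   ≈ 4 × sqrt(1) = 4x   (assuming ideal parallel scaling)

Combined with Power ~ Voltage², running more cores at moderate clocks rather than one core at a high clock is what improves "Performance/Watt", provided the software can actually exploit the parallelism.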

Page 4: Intel® MPI Library e OpenMP* - Intel Software Conference 2013


Parallel Programming: Algorithms

Distributed Versus Shared Memory

[Figure: shared memory (several CPUs attached to one memory over a shared bus) versus distributed memory (separate CPU/memory nodes connected by a network)]

Message Passing (MPI*)
– Multiple processes
– Share data with messages

Threads (explicit threads, OpenMP*)
– Single process
– Concurrent execution
– Shared memory and resources

Page 5: Intel® MPI Library e OpenMP* - Intel Software Conference 2013


Parallel Programming: Algorithms

Designing Parallel Programs

• Partition
– Divide problem into tasks

• Communicate
– Determine amount and pattern of communication

• Agglomerate
– Combine tasks

• Map
– Assign agglomerated tasks to physical processors

[Figure: design flow from The Problem through Initial tasks, Communication, and Combined Tasks to the Final Program]

Page 6: Intel® MPI Library e OpenMP* - Intel Software Conference 2013


Parallel Programming: Algorithms

1. Partitioning

•Discover as much parallelism as possible

• Independent computations and/or data

• Maximize number of primitive tasks

•Functional decomposition

• Divide the computation, then associate the data

•Domain decomposition

• Divide the data into pieces, then associate computation

Initial tasks

Page 7: Intel® MPI Library e OpenMP* - Intel Software Conference 2013


Parallel Programming: Algorithms

Decomposition Methods

•Functional decomposition

– Focusing on computations can reveal structure in a problem

Grid reprinted with permission of Dr. Phu V. Luong, Coastal and Hydraulics

Laboratory, Engineer Research and Development Center (ERDC).

• Domain decomposition

• Focus on largest or most frequently accessed data structure

• Data parallelism
– Same operation(s) applied to all data

[Figure: functional decomposition of a climate simulation into Atmosphere, Ocean, Land Surface, and Hydrology models]

Page 8: Intel® MPI Library e OpenMP* - Intel Software Conference 2013


Parallel Programming: Algorithms

2. Communication

•Determine the communication pattern between primitive tasks

• What data need to be shared?

• Point-to-point
– One thread to another

• Collective
– Groups of threads sharing data

•Execution order dependencies are communication

Communication

Page 9: Intel® MPI Library e OpenMP* - Intel Software Conference 2013


Parallel Programming: Algorithms

3. Agglomeration

•Group primitive tasks in order to:

• Improve performance/granularity

– Localize communication

• Put tasks that communicate in the same group

– Maintain scalability of design

• Gracefully handle changes in data set size or number of processors

– Simplify programming and maintenance

Combined Tasks

Page 10: Intel® MPI Library e OpenMP* - Intel Software Conference 2013


Parallel Programming: Algorithms

4. Mapping

• Assign tasks to processors in order to:
– Maximize processor utilization

– Minimize inter-processor communication

•One task or multiple tasks per processor?

•Static or dynamic assignment?

• Most applicable to message passing
– Programmer can map tasks to threads

Final Program

Page 11: Intel® MPI Library e OpenMP* - Intel Software Conference 2013


Parallel Programming: Algorithms

What Is Not Parallel

• Subprograms with "state" or with side effects

– Pseudo-random number generators
– File I/O routines
– Output on screen

•Loops with data dependencies

– Variables written in one iteration and read in another
– Quick test: reverse the loop iterations; if the results could change, the loop carries a dependence

Loop carried – Value carried from one iteration to the next

Induction variables – Incremented each trip through loop

Reductions – Summation; collapse array to single value

Recurrence – Feed information forward
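As a concrete illustration (not from the slides), the C program below contrasts a reduction, which OpenMP can parallelize by giving each thread a private partial sum, with a recurrence, whose loop-carried dependence prevents parallelization as written. Built with an OpenMP-enabled compiler (e.g. gcc -fopenmp) the first loop runs in parallel; without OpenMP the pragma is simply ignored.

#include <stdio.h>
#define N 1000

int main(void) {
    double a[N], b[N], x[N], sum = 0.0;
    for (int i = 0; i < N; i++) { a[i] = 1.0; b[i] = 1.0; }
    x[0] = 0.0;

    /* Reduction: iterations only accumulate into sum, so each thread can
       keep a private partial sum that is combined at the end. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    /* Recurrence: x[i] depends on x[i-1] from the previous iteration
       (a loop-carried dependence), so this loop must stay serial. */
    for (int i = 1; i < N; i++)
        x[i] = x[i-1] + b[i];

    printf("sum = %.1f, x[N-1] = %.1f\n", sum, x[N-1]);
    return 0;
}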

Page 12: Intel® MPI Library e OpenMP* - Intel Software Conference 2013

Introduction to MPI

What is MPI?

[Figure: a cluster of nodes (Node 0, Node 1, …, Node n), each with its own CPU and private memory]

Page 13: Intel® MPI Library e OpenMP* - Intel Software Conference 2013

Introduction to MPI

The Distributed-Memory Model

•Characteristics of distributed memory machines

• No common address space

• High-latency interconnection network

• Explicit message exchange

Page 14: Intel® MPI Library e OpenMP* - Intel Software Conference 2013

Introduction to MPI

Message Passing Interface (MPI)

•Depending on the interconnection network, clusters exhibit different interfaces to the network, e.g.

• Ethernet: UNIX sockets

• InfiniBand: OFED, Verbs

•MPI provides an abstraction to these interfaces

• Generic communication interface

• Logical ranks (no physical addresses)

• Supportive functions (e.g. parallel file I/O)

Page 15: Intel® MPI Library e OpenMP* - Intel Software Conference 2013

Introduction to MPI

“Hello World” in Fortran

program hello
  include 'mpif.h'
  integer mpierr, rank, procs
  call MPI_Init(mpierr)
  call MPI_Comm_size(MPI_COMM_WORLD, procs, mpierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, mpierr)
  write (*,*) 'Hello world from ', rank, 'of', procs
  call MPI_Finalize(mpierr)
end program hello
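The next slide compiles both hello.f and a C source hello.c; the C version is not shown in the slides, but a minimal equivalent would look like this:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, procs;
    MPI_Init(&argc, &argv);                 /* initialize the MPI runtime   */
    MPI_Comm_size(MPI_COMM_WORLD, &procs);  /* total number of processes    */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's logical rank  */
    printf("Hello world from %d of %d\n", rank, procs);
    MPI_Finalize();                         /* shut down the MPI runtime    */
    return 0;
}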

Page 16: Intel® MPI Library e OpenMP* - Intel Software Conference 2013

Introduction to MPI

Compilation and Execution

•MPI implementations ship with a compiler wrapper:

• mpiicc -o helloc hello.c

• mpiifort -o hellof hello.f

•Wrapper correctly calls native C/Fortran compiler and passes along MPI specifics (e.g. library)

•Wrappers usually accept the same compiler options as the underlying native compiler, e.g.

• mpiicc -O2 -fast -o module.o -c module.c

Page 17: Intel® MPI Library e OpenMP* - Intel Software Conference 2013

Introduction to MPI

Compilation and Execution

•To run the “Hello World”, use:

• mpirun -np 8 helloc

• mpirun provides portable, transparent application start-up:

– connects to the cluster nodes for execution

– launches processes on the nodes

– passes along information on how the processes can reach each other

• When mpirun returns, execution has completed.

•Note: mpirun is implementation-specific

Page 18: Intel® MPI Library e OpenMP* - Intel Software Conference 2013

Introduction to MPI

Output of “Hello World”

• Hello world from 0 of 8

• Hello world from 1 of 8

• Hello world from 4 of 8

• Hello world from 6 of 8

• Hello world from 5 of 8

• Hello world from 7 of 8

• Hello world from 2 of 8

• Hello world from 3 of 8

No particular ordering of process execution!

If needed, the programmer must ensure ordering by explicit communication.

Page 19: Intel® MPI Library e OpenMP* - Intel Software Conference 2013

Introduction to MPI

Sending Messages (Blocking)

subroutine master(array, length)
  include 'mpif.h'
  double precision array(*)
  integer length
  double precision sum, globalsum
  integer rank, procs, mpierr, size
  call MPI_Comm_size(MPI_COMM_WORLD, procs, mpierr)
  size = length / procs
  do rank = 1, procs-1
    call MPI_Send(size, 1, MPI_INTEGER, rank, 0,
 &                MPI_COMM_WORLD, mpierr)
    call MPI_Send(array(rank*size+1:rank*size+size), size,
 &                MPI_DOUBLE_PRECISION, rank, 1, MPI_COMM_WORLD, mpierr)
  enddo

The example is only correct if length is a multiple of procs.

Page 20: Intel® MPI Library e OpenMP* - Intel Software Conference 2013

Introduction to MPI

MPI_Send

int MPI_Send(void* buf, int count, MPI_Datatype dtype,
             int dest, int tag, MPI_Comm comm)

MPI_SEND(BUF, COUNT, DTYPE, DEST, TAG, COMM, IERR)
  <type> BUF(*)
  INTEGER COUNT, DTYPE, DEST, TAG, COMM, IERR

• Blocking message delivery

• returns only once the send buffer can safely be reused; for large messages this typically means the matching receive has begun, effectively synchronizing sender and receiver (small messages may be buffered by the library and complete earlier)

Page 21: Intel® MPI Library e OpenMP* - Intel Software Conference 2013

Introduction to MPI

MPI_Send

buf Pointer to message data (e.g. pointer to first element of an array)

count Length of the message in elements

dtype Data type of the message content (size of data type x count = message size)

dest Rank of the destination process

tag “Type” of the message

comm Handle to the communication group

ierr Fortran: OUT argument for error code

return value C/C++: error code
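MPI_Recv is the matching receive call (its own parameter slide is not included here). As a usage sketch in C, not taken from the slides, a minimal matched pair between ranks 0 and 1 looks like this; run it with at least two processes, e.g. mpirun -np 2:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, data = 0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        data = 42;
        /* send one int with tag 0 to rank 1 */
        MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* blocking receive of one int with tag 0 from rank 0 */
        MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("rank 1 received %d\n", data);
    }

    MPI_Finalize();
    return 0;
}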

Page 22: Intel® MPI Library e OpenMP* - Intel Software Conference 2013

Introduction to MPI

MPI Data Type C Data Type

MPI_BYTE

MPI_CHAR signed char

MPI_DOUBLE double

MPI_FLOAT float

MPI_INT int

MPI_LONG long

MPI_LONG_DOUBLE long double

MPI_PACKED

MPI_SHORT short

MPI_UNSIGNED_SHORT unsigned short

MPI_UNSIGNED unsigned int

MPI_UNSIGNED_LONG unsigned long

MPI_UNSIGNED_CHAR unsigned char

MPI provides predefined data types that must be specified when passing messages.

MPI Data Types for C

Page 23: Intel® MPI Library e OpenMP* - Intel Software Conference 2013

Introduction to MPI

Communication Wildcards

•MPI defines a set of wildcards to be specified with communication primitives:

MPI_ANY_SOURCE Matches any logical rank when receiving a message with MPI_Recv (message status contains actual sender)

MPI_ANY_TAG Matches any message tag when receiving a message (message status contains actual tag)

MPI_PROC_NULL Special value indicating non-existent process rank (messages are not delivered or received for this special rank)
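A short sketch (not from the slides) of how the wildcards are typically used, inside an already-initialized MPI program: receive from whoever sends first, then read the actual sender and tag out of the status:

int value;
MPI_Status status;
/* accept one int from any rank with any tag */
MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
/* the status records who actually sent the message and which tag it carried */
printf("got %d from rank %d (tag %d)\n", value, status.MPI_SOURCE, status.MPI_TAG);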

Page 24: Intel® MPI Library e OpenMP* - Intel Software Conference 2013

Introduction to MPI

Blocking Communication

• MPI_Send and MPI_Recv are blocking operations

[Figure: timeline of Process A calling MPI_Send and Process B calling MPI_Recv; computation pauses while the blocking communication completes]

Page 25: Intel® MPI Library e OpenMP* - Intel Software Conference 2013

Introduction to MPI

Non-blocking Communication

• MPI_Isend and MPI_Irecv are non-blocking operations

[Figure: timeline of Process A calling MPI_Isend and Process B calling MPI_Irecv; both continue computing while the communication proceeds and later call MPI_Wait to complete it]
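A minimal, self-contained sketch (assumed, not from the slides) of the non-blocking pattern: post MPI_Isend/MPI_Irecv first, do independent work while the messages are in flight, then MPI_Wait before reusing the buffers:

#include <stdio.h>
#include <mpi.h>
#define N 1024

int main(int argc, char **argv) {
    double outbuf[N], inbuf[N];
    int rank, procs;
    MPI_Request sreq, rreq;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &procs);
    for (int i = 0; i < N; i++) outbuf[i] = rank;

    int to   = (rank + 1) % procs;          /* ring neighbours */
    int from = (rank + procs - 1) % procs;

    /* post the transfers first ... */
    MPI_Isend(outbuf, N, MPI_DOUBLE, to,   0, MPI_COMM_WORLD, &sreq);
    MPI_Irecv(inbuf,  N, MPI_DOUBLE, from, 0, MPI_COMM_WORLD, &rreq);

    /* ... independent computation could overlap with the communication here ... */

    /* ... and complete both requests before the buffers are reused or read */
    MPI_Wait(&sreq, MPI_STATUS_IGNORE);
    MPI_Wait(&rreq, MPI_STATUS_IGNORE);

    printf("rank %d received data from rank %d\n", rank, (int)inbuf[0]);
    MPI_Finalize();
    return 0;
}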

Page 26: Intel® MPI Library e OpenMP* - Intel Software Conference 2013

Introduction to MPI

‘Collectives’, e.g. MPI_Reduce

int MPI_Reduce(void* sendbuf, void* recvbuf, int count,
               MPI_Datatype dtype, MPI_Op op, int root, MPI_Comm comm)

MPI_REDUCE(SENDBUF, RECVBUF, COUNT, DTYPE, OP, ROOT, COMM, IERR)
  <type> SENDBUF(*), RECVBUF(*)
  INTEGER COUNT, DTYPE, OP, ROOT, COMM, IERR

• Global operation that accumulates data from all processes into a single result at the root process.

• All processes have to reach the same MPI_Reduce invocation.

• Otherwise deadlocks and undefined behavior may occur.
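For illustration (a sketch, not taken from the slides), every rank contributes one value and rank 0 receives the sum:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank;
    double local, global = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    local = rank + 1.0;   /* each rank contributes its own value */

    /* every rank must make this call; the combined sum arrives only at root 0 */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %.1f\n", global);

    MPI_Finalize();
    return 0;
}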

Page 27: Intel® MPI Library e OpenMP* - Intel Software Conference 2013

Introduction to MPI

MPI_Reduce – Operators

MPI_MAX maximum

MPI_MIN minimum

MPI_SUM sum

MPI_PROD product

MPI_LAND / MPI_BAND logical and / bit-wise and

MPI_LOR / MPI_BOR logical or / bit-wise or

MPI_LXOR / MPI_BXOR logical exclusive or / bit-wise exclusive or

MPI_MAXLOC max value and location

MPI_MINLOC min value and location

Page 28: Intel® MPI Library e OpenMP* - Intel Software Conference 2013

Introduction to MPI

MPI_Barrier

int MPI_Barrier(MPI_Comm comm)

MPI_BARRIER(COMM, IERROR)
  INTEGER COMM, IERROR

•Global operation that synchronizes all participating processes.

• All processes have to reach an MPI_Barrier invocation.

• Otherwise deadlocks and undefined behavior may occur.

Page 29: Intel® MPI Library e OpenMP* - Intel Software Conference 2013

Introduction to MPI

Stencil Computation example

• Some algorithms (e.g. Jacobi, Gauss-Seidel) process data with a stencil:

• grid(i,j) = 0.25 * (grid(i+1,j) + grid(i-1,j) + grid(i,j+1) + grid(i,j-1))

• Data access pattern: updating point (i,j) reads its four neighbours (i-1,j), (i+1,j), (i,j-1), and (i,j+1)
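A serial C sketch of one such sweep (illustrative only; the slides give just the update formula). In an MPI version each process would own a block of the grid and exchange the boundary rows/columns (the halo) with its neighbours before every sweep:

#define NX 128
#define NY 128

/* One Jacobi sweep: every interior point becomes the average of its four
   neighbours from the previous iterate (oldg -> newg). */
void jacobi_sweep(const double oldg[NX][NY], double newg[NX][NY]) {
    for (int i = 1; i < NX - 1; i++)
        for (int j = 1; j < NY - 1; j++)
            newg[i][j] = 0.25 * (oldg[i+1][j] + oldg[i-1][j] +
                                 oldg[i][j+1] + oldg[i][j-1]);
}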

Page 30: Intel® MPI Library e OpenMP* - Intel Software Conference 2013

Introduction to MPI

MPI features not covered

• One-sided communication

– MPI_Put, MPI_Get

– Uses Remote Memory Access (RMA)

– Separates communication from synchronization

• User-defined datatypes, strided messages

• Dynamic process spawning: MPI_Spawn

– Collective communication can be used across disjoint intra-communicators

• Parallel I/O

• MPI 3.0 (released Sept 21, 2012)

Page 31: Intel® MPI Library e OpenMP* - Intel Software Conference 2013


What Is OpenMP?

• Portable, shared-memory threading API
– Fortran, C, and C++
– Multi-vendor support for both Linux and Windows

• Standardizes task & loop-level parallelism

• Supports coarse-grained parallelism

• Combines serial and parallel code in single source

• Standardizes ~ 20 years of compiler-directed threading experience

http://www.openmp.org
Current spec is OpenMP 4.0 (July 31, 2013; combined C/C++ and Fortran)

Introduction to OpenMP

Page 32: Intel® MPI Library e OpenMP* - Intel Software Conference 2013


OpenMP Programming Model

Fork-Join Parallelism:
• Master thread spawns a team of threads as needed

• Parallelism is added incrementally: that is, the sequential program evolves into a parallel program

[Figure: the master thread forks a team of threads at each parallel region and joins them afterwards]

Introduction to OpenMP

Page 33: Intel® MPI Library e OpenMP* - Intel Software Conference 2013


A Few Syntax Details to Get Started

• Most of the constructs in OpenMP are compiler directives or pragmas

– For C and C++, the pragmas take the form:
#pragma omp construct [clause [clause]…]

– For Fortran, the directives take one of the forms:

C$OMP construct [clause [clause]…]

!$OMP construct [clause [clause]…]

*$OMP construct [clause [clause]…]

• Header file or Fortran 90 module:
#include "omp.h"
use omp_lib

Introduction to OpenMP

Page 34: Intel® MPI Library e OpenMP* - Intel Software Conference 2013


Worksharing

• Worksharing is the general term used in OpenMP to describe distribution of work across threads.

• Three examples of worksharing in OpenMP are:

• omp for construct

• omp sections construct

• omp task construct

Automatically divides work among threads

Introduction to OpenMP

Page 35: Intel® MPI Library e OpenMP* - Intel Software Conference 2013


‘omp for’ Construct

• Threads are assigned an independent set of iterations

• Threads must wait at the end of work-sharing construct

[Figure: an omp parallel region containing an omp for construct; the 12 iterations i = 1 … 12 are divided among the threads, which wait at the implicit barrier at the end of the work-sharing construct]

// assume N=12
#pragma omp parallel
#pragma omp for
for (i = 1; i < N+1; i++)
    c[i] = a[i] + b[i];

Introduction to OpenMP

Page 36: Intel® MPI Library e OpenMP* - Intel Software Conference 2013


New Additions to OpenMP

Tasks – main change for OpenMP 3.0

• Allows parallelization of irregular problems

• unbounded loops

• recursive algorithms

• producer/consumer

Device Constructs – main change for OpenMP 4.0

• Allows describing regions of code where data and/or computation should be moved to another computing device.

Introduction to OpenMP

Page 37: Intel® MPI Library e OpenMP* - Intel Software Conference 2013


What are tasks?

• Tasks are independent units of work

• Threads are assigned to perform the work of each task

– Tasks may be deferred

• Tasks may be executed immediately

• The runtime system decides which of the above

– Tasks are composed of:

• code to execute

• data environment

• internal control variables (ICV)


Introduction to OpenMP

Page 38: Intel® MPI Library e OpenMP* - Intel Software Conference 2013


Simple Task Example

A pool of 8 threads is created at the parallel construct; one thread gets to execute the while loop, and that single "while loop" thread creates a task for each instance of processwork():

#pragma omp parallel
// assume 8 threads
{
    #pragma omp single private(p)
    {
        while (p) {
            #pragma omp task
            {
                processwork(p);
            }
            p = p->next;
        }
    }
}

Introduction to OpenMP

Page 39: Intel® MPI Library e OpenMP* - Intel Software Conference 2013


Task Construct – Explicit Task View

– A team of threads is created at the omp parallel construct

– A single thread is chosen to execute the while loop – let's call this thread "L"

– Thread L operates the while loop, creates tasks, and fetches next pointers

– Each time L crosses the omp task construct it generates a new task and has a thread assigned to it

– Each task runs in its own thread

– All tasks complete at the barrier at the end of the parallel region’s single construct

#pragma omp parallel
{
    #pragma omp single
    { // block 1
        node * p = head;
        while (p) { // block 2
            #pragma omp task
            process(p);
            p = p->next; // block 3
        }
    }
}

Introduction to OpenMP

Page 40: Intel® MPI Library e OpenMP* - Intel Software Conference 2013


OpenMP* Reduction Clause

• reduction (op : list)

• The variables in “list” must be shared in the enclosing parallel region

• Inside parallel or work-sharing construct:

• A PRIVATE copy of each list variable is created and initialized depending on the “op”

• These copies are updated locally by threads

• At end of construct, local copies are combined through “op” into a single value and combined with the value in the original SHARED variable

Introduction to OpenMP

Page 41: Intel® MPI Library e OpenMP* - Intel Software Conference 2013


Reduction Example

• Local copy of sum for each thread

• All local copies of sum added together and stored in “global” variable

#pragma omp parallel for reduction(+:sum)
for (i = 0; i < N; i++) {
    sum += a[i] * b[i];
}

Introduction to OpenMP

Page 42: Intel® MPI Library e OpenMP* - Intel Software Conference 2013


Why Hybrid Programming? OpenMP/MPI

[Chart: runtime in seconds (10 to 5120) versus number of nodes (1 to 128) for different combinations of processes per node and threads per process: 1 PPN, 1 PPN / 2 TPP, 1 PPN / 4 TPP, 1 PPN / 8 TPP, 2 PPN, 2 PPN / 2 TPP, 2 PPN / 4 TPP, 4 PPN, 4 PPN / 2 TPP, 8 PPN]

PPN = processes per node, TPP = threads per process

53% improvement over MPI

Simulation of free-surface flows, finite element CFD solver written in Fortran and C. Figure kindly provided by the HPC group of the Center of Computing and Communication, RWTH Aachen, Germany.

Page 43: Intel® MPI Library e OpenMP* - Intel Software Conference 2013


The Good, the Bad, and the Ugly

The Good

• OpenMP and MPI blend well with each other if certain rules are respected by programmers.

The Bad

• Programmers need to be aware of the issues of hybrid programming, e.g. using thread-safe libraries and a thread-safe MPI library.

The Ugly

• What’s the best setting for PPN and TPP for a given machine?

MPI and OpenMP hybrid programs can greatly improve performance of parallel codes!
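A minimal hybrid sketch (assumed, not shown in the slides): MPI splits the work across processes and nodes while OpenMP threads share each process's chunk. MPI_Init_thread with MPI_THREAD_FUNNELED is the usual request when only the master thread makes MPI calls. Launching combines both runtimes, e.g. mpirun -np for the process count with OMP_NUM_THREADS set to the desired threads per process.

#include <stdio.h>
#include <mpi.h>
#include <omp.h>
#define N 1000000

int main(int argc, char **argv) {
    int provided, rank, procs;
    double local = 0.0, global = 0.0;

    /* FUNNELED: threads exist, but only the master thread calls MPI */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &procs);

    /* each MPI process takes a contiguous block of the index range ... */
    int chunk = N / procs;
    int lo = rank * chunk;
    int hi = (rank == procs - 1) ? N : lo + chunk;

    /* ... and its OpenMP threads share that block */
    #pragma omp parallel for reduction(+:local)
    for (int i = lo; i < hi; i++)
        local += 1.0 / N;

    /* combine the per-process partial sums on rank 0 */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("result = %.3f using %d process(es) x up to %d thread(s)\n",
               global, procs, omp_get_max_threads());

    MPI_Finalize();
    return 0;
}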



Page 45: Intel® MPI Library e OpenMP* - Intel Software Conference 2013


INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright © Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, Xeon Phi, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Legal Disclaimer & Optimization Notice

Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
