3D Parallel FEM (IV) (OpenMP + MPI): Hybrid Parallel Programming Model
Kengo Nakajima, Information Technology Center

Transcript of the lecture slides (66 pages): nkl.cc.u-tokyo.ac.jp/16w/04-pFEM/pFEM3D-OMP.pdf

Page 1: Title

3D Parallel FEM (IV) (OpenMP + MPI)
Hybrid Parallel Programming Model

Kengo Nakajima
Information Technology Center

Technical & Scientific Computing II (4820-1028)
Seminar on Computer Science II (4810-1205)
Hybrid Distributed Parallel Computing (3747-111)

Page 2: Hybrid Parallel Programming Model

• Message Passing (e.g. MPI) + Multi-Threading (e.g. OpenMP, CUDA, OpenCL, OpenACC, etc.)
• On the K computer and FX10, hybrid parallel programming is recommended
  – MPI + automatic parallelization by Fujitsu's compiler
    • Personally, I do not like to call this "hybrid" !!!
• Expectations for Hybrid
  – Number of MPI processes (and sub-domains) can be reduced
  – O(10^8)- to O(10^9)-way MPI might not scale on exascale systems
  – Easily extended to heterogeneous architectures
    • CPU+GPU, CPU+manycore (e.g. Intel MIC/Xeon Phi)
    • MPI+X: X = OpenMP, OpenACC, CUDA, OpenCL
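
As a minimal illustration of the MPI + multi-threading combination described above (not part of the original slides; file and program names are arbitrary), each MPI process starts one team of OpenMP threads:

/* Minimal MPI+OpenMP "hybrid hello" sketch (illustration only).
   Compile e.g. with:  mpicc -fopenmp hello_hybrid.c               */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  int provided, rank, nprocs;

  /* MPI_THREAD_FUNNELED: only the master thread calls MPI, which is
     sufficient for the hybrid model used in this lecture            */
  MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

#pragma omp parallel
  {
    printf("process %d of %d, thread %d of %d\n",
           rank, nprocs, omp_get_thread_num(), omp_get_num_threads());
  }

  MPI_Finalize();
  return 0;
}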

Page 3: Flat MPI vs. Hybrid

• Hybrid: hierarchical structure (the cores sharing a memory form one MPI process running multiple threads)
• Flat MPI: each core runs an independent MPI process

[Figure: schematic comparing the two models, drawn as groups of cores attached to memories; in Flat MPI each core has its own memory image, in Hybrid each memory is shared by a group of cores.]

Page 4: Background

• Multicore/manycore processors
  – Low power consumption, various types of programming models
• OpenMP
  – Directive based, (seems to be) easy
  – Many books
• Data dependency (S1/S2 semester)
  – Conflicts between reading from and writing to memory
  – Appropriate reordering of data is needed for "consistent" parallel computing
  – No detailed information in OpenMP books: very complicated
• OpenMP/MPI hybrid parallel programming model for multicore/manycore clusters

Page 5: SMP

• SMP
  – Symmetric Multi-Processors
  – Multiple CPUs (cores) share a single memory space

[Figure: several CPUs connected to one MEMORY.]

Page 6: What is OpenMP? (http://www.openmp.org)

• An API for multi-platform shared-memory parallel programming in C/C++ and Fortran
  – Current version: 4.0
• Background
  – Merger of Cray and SGI in 1996
  – ASCI project (DOE) started
• The C/C++ version and the Fortran version were developed separately until ver. 2.5.
• Fork-join parallel execution model
• Users have to specify everything by directives.
  – Nothing happens if there are no directives.

Page 7: Fork-Join Parallel Execution Model

[Figure: the master thread runs serially, forks a team of threads at each PARALLEL directive, and the team joins back into the master thread at END PARALLEL; the fork-join cycle repeats for every parallel region.]
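
A minimal C sketch of the fork-join model shown in the figure (illustration only, not from the slides): the master thread forks a team at the parallel directive and the team joins again at its end.

/* Fork-join sketch: serial before and after, a team of threads in between. */
#include <omp.h>
#include <stdio.h>

int main(void)
{
  printf("before: master thread only\n");

#pragma omp parallel                      /* fork */
  {
    printf("inside: thread %d\n", omp_get_thread_num());
  }                                       /* join (implicit barrier) */

  printf("after: master thread only\n");
  return 0;
}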

Page 8: Number of Threads

• OMP_NUM_THREADS
  – How to change it?
    • bash (.bashrc):  export OMP_NUM_THREADS=8
    • csh (.cshrc):    setenv OMP_NUM_THREADS 8
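
A small check of the effect of OMP_NUM_THREADS (illustration only, not from the slides): omp_get_max_threads reports the thread count that will be used for the next parallel region.

/* Run e.g.:  export OMP_NUM_THREADS=8 ; ./a.out */
#include <omp.h>
#include <stdio.h>

int main(void)
{
  printf("max threads = %d\n", omp_get_max_threads());

#pragma omp parallel
  {
#pragma omp master
    printf("team size   = %d\n", omp_get_num_threads());
  }
  return 0;
}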

Page 9: Information about OpenMP

• OpenMP Architecture Review Board (ARB)
  – http://www.openmp.org
• References
  – Chandra, R. et al., "Parallel Programming in OpenMP" (Morgan Kaufmann)
  – Quinn, M.J., "Parallel Programming in C with MPI and OpenMP" (McGraw-Hill)
  – Mattson, T.G. et al., "Patterns for Parallel Programming" (Addison-Wesley)
  – Ushijima, "Parallel Programming and Numerical Methods with OpenMP" (Maruzen, in Japanese)
  – Chapman, B. et al., "Using OpenMP" (MIT Press)
• Japanese version of the OpenMP 3.0 specification (Fujitsu etc.)
  – http://www.openmp.org/mp-documents/OpenMP30spec-ja.pdf

Page 10: Features of OpenMP

• Directives
  – The loop right after a directive is parallelized.
  – If the compiler does not support OpenMP, directives are treated as mere comments.

Page 11: OpenMP Directives: Array Operations

Simple substitution:

!$omp parallel do
do i= 1, NP
  W(i,1)= 0.d0
  W(i,2)= 0.d0
enddo
!$omp end parallel do

Dot products:

!$omp parallel do private(iS,iE,i)
!$omp& reduction(+:RHO)
do ip= 1, PEsmpTOT
  iS= STACKmcG(ip-1) + 1
  iE= STACKmcG(ip  )
  do i= iS, iE
    RHO= RHO + W(i,R)*W(i,Z)
  enddo
enddo
!$omp end parallel do

DAXPY:

!$omp parallel do
do i= 1, NP
  Y(i)= ALPHA*X(i) + Y(i)
enddo
!$omp end parallel do

Page 12: OpenMP Directives: Matrix-Vector Products

!$omp parallel do private(ip,iS,iE,i,j)
do ip= 1, PEsmpTOT
  iS= STACKmcG(ip-1) + 1
  iE= STACKmcG(ip  )
  do i= iS, iE
    W(i,Q)= D(i)*W(i,P)
    do j= 1, INL(i)
      W(i,Q)= W(i,Q) + W(IAL(j,i),P)
    enddo
    do j= 1, INU(i)
      W(i,Q)= W(i,Q) + W(IAU(j,i),P)
    enddo
  enddo
enddo
!$omp end parallel do

Page 13: Features of OpenMP

• Directives
  – The loop right after a directive is parallelized.
  – If the compiler does not support OpenMP, directives are treated as mere comments.
• Nothing happens without explicit directives
  – Different from "automatic parallelization/vectorization"
  – Improper usage can produce wrong results
  – Data configuration, ordering, etc. are the user's responsibility
• "Threads" are created according to the number of cores on the node
  – A thread plays the role that a "process" plays in MPI
  – Generally, "# threads = # cores"; Xeon Phi supports 4 threads per core (hardware multithreading)

Page 14: Memory Contention

[Figure: several CPUs connected to one MEMORY.]

• During a complicated process, multiple threads may simultaneously try to update data at the same address in memory.
  – e.g., multiple cores update a single component of an array.
  – This situation is possible.
  – Results may differ from the serial case with a single core (thread).
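
The following sketch (illustration only, not from the lecture code) shows the kind of contention described above and one way to protect the update:

/* Many threads update the same address.  Without protection the result is
   nondeterministic; "#pragma omp atomic" (or a reduction clause) makes the
   update safe.                                                              */
#include <stdio.h>

int main(void)
{
  const int N = 1000000;
  double sum_racy = 0.0, sum_safe = 0.0;
  int i;

#pragma omp parallel for private(i)
  for (i = 0; i < N; i++) {
    sum_racy += 1.0;              /* data race: result may differ from N */
  }

#pragma omp parallel for private(i)
  for (i = 0; i < N; i++) {
#pragma omp atomic
    sum_safe += 1.0;              /* protected update: always equals N   */
  }

  printf("racy=%f  safe=%f  expected=%d\n", sum_racy, sum_safe, N);
  return 0;
}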

Page 15: Memory Contention (cont.)

[Figure: several CPUs connected to one MEMORY.]

• In this lecture, no such case occurs, thanks to reordering etc.
  – In OpenMP, users are responsible for such issues (e.g. proper data configuration, reordering, etc.)
• Generally speaking, performance per core decreases as the number of cores used (thread count) increases.
  – Memory access performance: STREAM benchmark

Page 16: Features of OpenMP (cont.)

• "!$omp parallel do" - "!$omp end parallel do"
• Global (shared) variables, private variables
  – Default: global (shared)
  – Dot products: reduction

In the example below, W(:,:), R, Z and PEsmpTOT are global (shared):

!$omp parallel do private(iS,iE,i)
!$omp& reduction(+:RHO)
do ip= 1, PEsmpTOT
  iS= STACKmcG(ip-1) + 1
  iE= STACKmcG(ip  )
  do i= iS, iE
    RHO= RHO + W(i,R)*W(i,Z)
  enddo
enddo
!$omp end parallel do

Page 17: Fortran & C

C:

#include <omp.h>
...
{
#pragma omp parallel for default(none) shared(n,x,y) private(i)
  for (i=0; i<n; i++)
    x[i] += y[i];
}

Fortran:

use omp_lib
...
!$omp parallel do shared(n,x,y) private(i)
do i= 1, n
  x(i)= x(i) + y(i)
enddo
!$omp end parallel do

Page 18: In this class ...

• OpenMP has many capabilities.
• In this class, only the few features needed to parallelize the parallel FEM code are shown.

Page 19: First things to be done (after OpenMP 3.0)

• Fortran:  use omp_lib
• C:        #include <omp.h>

Page 20: OpenMP Directives (Fortran)

• No distinction between upper and lower case.
• Sentinel
  – Fortran: !$OMP, C$OMP, *$OMP
  – Only !$OMP can be used in free format
  – Continuation lines follow the same rules as the Fortran compiler

General form:

  sentinel directive_name [clause[[,] clause]...]

Example: !$OMP PARALLEL DO SHARED(A,B,C) with a continuation line:

Fixed format:
  !$OMP PARALLEL DO
  !$OMP+SHARED (A,B,C)

Free format:
  !$OMP PARALLEL DO &
  !$OMP SHARED (A,B,C)

Page 21: OpenMP Directives (C)

• "\" for continuation lines
• Lower case only (except names of variables)

General form:

  #pragma omp directive_name [clause[[,] clause]...]

Example:

  #pragma omp parallel for shared (a,b,c)

Page 22: PARALLEL DO

• Parallelizes DO/for loops
• Examples of "clause"
  – PRIVATE(list)
  – SHARED(list)
  – DEFAULT(PRIVATE|SHARED|NONE)
  – REDUCTION({operation|intrinsic}: list)

Fortran:

  !$OMP PARALLEL DO [clause[[,] clause] ... ]
  (do_loop)
  !$OMP END PARALLEL DO

C:

  #pragma omp parallel for [clause[[,] clause] ... ]
  (for_loop)
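
A combined usage sketch of the clauses listed above (illustration only, not from the slides); default(none) forces every variable to be declared shared or private explicitly:

#include <stdio.h>

int main(void)
{
  int    n = 1000, i;
  double a[1000], b[1000], dot = 0.0;

  for (i = 0; i < n; i++) { a[i] = 1.0; b[i] = 2.0; }

#pragma omp parallel for default(none) shared(n, a, b) private(i) reduction(+:dot)
  for (i = 0; i < n; i++) {
    dot += a[i] * b[i];          /* each thread accumulates a private copy of dot */
  }

  printf("dot = %f\n", dot);     /* copies are combined with "+" at the join      */
  return 0;
}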

Page 23: REDUCTION

• Similar to "MPI_Reduce"
• Operators
  – +, *, -, .AND., .OR., .EQV., .NEQV.
• Intrinsics
  – MAX, MIN, IAND, IOR, IEOR

Fortran:  REDUCTION ({operator|intrinsic}: list)
C:        reduction ({operator|intrinsic}: list)
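
A sketch using a different operator and an intrinsic-style reduction (illustration only; the max reduction in C requires OpenMP 3.1 or later):

#include <stdio.h>

int main(void)
{
  const int n = 8;
  double x[8] = {3.0, -1.0, 7.0, 2.0, 9.0, 0.0, 4.0, 5.0};
  double prod = 1.0, xmax = x[0];
  int i;

#pragma omp parallel for reduction(*:prod) reduction(max:xmax)
  for (i = 0; i < n; i++) {
    prod *= (1.0 + 0.1 * x[i]);   /* "*"  reduction: product over all i  */
    if (x[i] > xmax) xmax = x[i]; /* "max" reduction: largest component  */
  }

  printf("prod = %f, max = %f\n", prod, xmax);
  return 0;
}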

Page 24: Example-1: A Simple Loop

!$OMP PARALLEL DO
do i= 1, N
  B(i)= (A(i) + B(i)) * 0.50
enddo
!$OMP END PARALLEL DO

• The loop variable ("i" in this case) is private by default, so an explicit declaration is not needed.
• "END PARALLEL DO" is not required.
  – In C, there is no corresponding "end" directive.

Page 25: Example-1: REDUCTION

!$OMP PARALLEL DO DEFAULT(PRIVATE) REDUCTION(+:A,B)
do i= 1, N
  call WORK (Alocal, Blocal)
  A= A + Alocal
  B= B + Blocal
enddo
!$OMP END PARALLEL DO

• "END PARALLEL DO" is not required.

Page 26: Functions which can be used with OpenMP

  Name                                         | Function
  ---------------------------------------------+------------------
  int omp_get_num_threads (void)               | Total thread #
  int omp_get_thread_num (void)                | Thread ID
  double omp_get_wtime (void)                  | = MPI_Wtime
  void omp_set_num_threads (int num_threads)   | Set thread #
  call omp_set_num_threads (num_threads)       |
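
A usage sketch of these runtime functions (illustration only, not from the slides):

#include <omp.h>
#include <stdio.h>

int main(void)
{
  double t0, t1;

  omp_set_num_threads(4);              /* set thread # (like OMP_NUM_THREADS=4) */

  t0 = omp_get_wtime();                /* wall-clock time, like MPI_Wtime       */
#pragma omp parallel
  {
    int myid = omp_get_thread_num();   /* thread ID: 0 ... (team size - 1)      */
    int nthr = omp_get_num_threads();  /* total thread #                        */
    printf("thread %d of %d\n", myid, nthr);
  }
  t1 = omp_get_wtime();

  printf("elapsed = %e sec.\n", t1 - t0);
  return 0;
}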

Page 27: OpenMP for Dot Products

Serial code:

VAL= 0.d0
do i= 1, N
  VAL= VAL + W(i,R) * W(i,Z)
enddo

Page 28: OpenMP for Dot Products

Serial code:

VAL= 0.d0
do i= 1, N
  VAL= VAL + W(i,R) * W(i,Z)
enddo

Directives are just inserted:

VAL= 0.d0
!$OMP PARALLEL DO PRIVATE(i) REDUCTION(+:VAL)
do i= 1, N
  VAL= VAL + W(i,R) * W(i,Z)
enddo
!$OMP END PARALLEL DO

Page 29: OpenMP for Dot Products

Serial code:

VAL= 0.d0
do i= 1, N
  VAL= VAL + W(i,R) * W(i,Z)
enddo

Directives are just inserted:

VAL= 0.d0
!$OMP PARALLEL DO PRIVATE(i) REDUCTION(+:VAL)
do i= 1, N
  VAL= VAL + W(i,R) * W(i,Z)
enddo
!$OMP END PARALLEL DO

Multiple loops (PEsmpTOT: number of threads). An additional array INDEX(:) is needed; efficiency is not necessarily good, but users can assign specific components of the data to each thread:

VAL= 0.d0
!$OMP PARALLEL DO PRIVATE(ip,i) REDUCTION(+:VAL)
do ip= 1, PEsmpTOT
  do i= index(ip-1)+1, index(ip)
    VAL= VAL + W(i,R) * W(i,Z)
  enddo
enddo
!$OMP END PARALLEL DO

Page 30: OpenMP for Dot Products

VAL= 0.d0
!$OMP PARALLEL DO PRIVATE(ip,i) REDUCTION(+:VAL)
do ip= 1, PEsmpTOT
  do i= index(ip-1)+1, index(ip)
    VAL= VAL + W(i,R) * W(i,Z)
  enddo
enddo
!$OMP END PARALLEL DO

e.g. N=100, PEsmpTOT=4:

INDEX(0)=   0
INDEX(1)=  25
INDEX(2)=  50
INDEX(3)=  75
INDEX(4)= 100

Multiple loops (PEsmpTOT: number of threads). An additional array INDEX(:) is needed; efficiency is not necessarily good, but users can assign specific components of the data to each thread. This form is NOT good for GPUs.
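
A sketch of how such an INDEX(:) array can be built (illustration only; this simple blocked decomposition reproduces the values above and is one possible choice, not necessarily the lecture's):

#include <stdio.h>

int main(void)
{
  enum { N = 100, PEsmpTOT = 4 };           /* matches the example above        */
  int INDEX[PEsmpTOT + 1];
  int ip;

  /* Thread ip handles entries INDEX[ip-1]+1 ... INDEX[ip] (1-based, as in the
     Fortran loops above).                                                       */
  INDEX[0] = 0;
  for (ip = 1; ip <= PEsmpTOT; ip++) {
    INDEX[ip] = INDEX[ip - 1] + N / PEsmpTOT;
    if (ip <= N % PEsmpTOT) INDEX[ip] += 1;  /* spread any remainder             */
  }

  for (ip = 0; ip <= PEsmpTOT; ip++)
    printf("INDEX(%d)= %d\n", ip, INDEX[ip]);
  return 0;
}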

Page 31: Matrix-Vector Multiply

Serial code:

do i= 1, N
  VAL= D(i)*W(i,P)
  do k= indexL(i-1)+1, indexL(i)
    VAL= VAL + AL(k)*W(itemL(k),P)
  enddo
  do k= indexU(i-1)+1, indexU(i)
    VAL= VAL + AU(k)*W(itemU(k),P)
  enddo
  W(i,Q)= VAL
enddo

Page 32: Matrix-Vector Multiply

!$omp parallel do private(ip,i,VAL,k)
do ip= 1, PEsmpTOT
  do i = INDEX(ip-1)+1, INDEX(ip)
    VAL= D(i)*W(i,P)
    do k= indexL(i-1)+1, indexL(i)
      VAL= VAL + AL(k)*W(itemL(k),P)
    enddo
    do k= indexU(i-1)+1, indexU(i)
      VAL= VAL + AU(k)*W(itemU(k),P)
    enddo
    W(i,Q)= VAL
  enddo
enddo
!$omp end parallel do

Page 33: Matrix-Vector Multiply: Other Approach

This is rather better for GPUs and (very) many-core architectures: simpler loop structure.

!$omp parallel do private(i,VAL,k)
do i = 1, N
  VAL= D(i)*W(i,P)
  do k= indexL(i-1)+1, indexL(i)
    VAL= VAL + AL(k)*W(itemL(k),P)
  enddo
  do k= indexU(i-1)+1, indexU(i)
    VAL= VAL + AU(k)*W(itemU(k),P)
  enddo
  W(i,Q)= VAL
enddo
!$omp end parallel do

Page 34: omp parallel (do)

• Each "omp parallel - omp end parallel" pair starts and stops threads: fork-join
• If you have many loops, these thread operations can become overhead
• omp parallel + omp do / omp for

C (schematic):

#pragma omp parallel
{
  ...
#pragma omp for
  for (...) { ... }
  ...
#pragma omp for
  for (...) { ... }
}

Fortran (schematic):

!$omp parallel
...
!$omp do
do i= 1, N
...
!$omp do
do i= 1, N
...
!$omp end parallel      (required)
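
A complete, compilable version of this structure (illustration only; array names are arbitrary): the threads are forked once for the whole region and reused by each work-sharing loop.

#include <stdio.h>

int main(void)
{
  const int N = 1000;
  double x[1000], y[1000];
  int i;

#pragma omp parallel            /* threads are forked once here ...              */
  {
#pragma omp for                 /* ... and reused by each work-sharing loop      */
    for (i = 0; i < N; i++) x[i] = 1.0;

#pragma omp for                 /* implicit barrier after the previous loop      */
    for (i = 0; i < N; i++) y[i] = 2.0 * x[i];
  }                             /* single join at the end of the parallel region */

  printf("y(1)= %f, y(N)= %f\n", y[0], y[N - 1]);
  return 0;
}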

Page 35: Exercise !!

• Apply multi-threading by OpenMP to the parallel FEM code that uses MPI
  – CG solver (solver_CG, solver_SR)
  – Matrix assembly (mat_ass_main, mat_ass_bc)
• Hybrid parallel programming model
• Evaluate the effects of
  – problem size, parallel programming model, thread #

Page 36: OpenMP (Only Solver) (F, C)

>$ cd <$O-TOP>/pfem3d/src1
>$ make
>$ cd ../run
>$ ls sol1
   sol1

>$ cd ../pmesh
   <parallel mesh generation>

>$ cd ../run
   <modify go1.sh>
>$ pjsub go1.sh

Page 37: Makefile (Fortran)

F90       = mpiifort
F90LINKER = $(F90)
LIB_DIR   =
INC_DIR   =
OPTFLAGS  = -O3 -xCORE-AVX2 -align array32byte -qopenmp
FFLAGS    = $(OPTFLAGS)
FLIBS     =
F90LFLAGS =
#
TARGET = ../run/sol1
default: $(TARGET)
OBJS = \
pfem_util.o ...

$(TARGET): $(OBJS)
	$(F90LINKER) $(OPTFLAGS) -o $(TARGET) $(OBJS) $(F90LFLAGS)

clean:
	/bin/rm -f *.o $(TARGET) *~ *.mod

.f.o:
	$(F90) $(FFLAGS) $(INC_DIR) -c $*.f

.f90.o:
	$(F90) $(FFLAGS) $(INC_DIR) -c $*.f90

.SUFFIXES: .f90 .f

Page 38: Makefile (C)

CC       = mpiicc
LIB_DIR  =
INC_DIR  =
OPTFLAGS = -O3 -xCORE-AVX2 -align -qopenmp
LIBS     =
LFLAGS   =
#
TARGET = ../run/sol1
default: $(TARGET)
OBJS = \
test1.o \
...

$(TARGET): $(OBJS)
	$(CC) $(OPTFLAGS) -o $@ $(OBJS) $(LFLAGS)

.c.o:
	$(CC) $(OPTFLAGS) -c $*.c

clean:
	/bin/rm -f *.o $(TARGET) *~ *.mod

Page 39: HB M x N

• M: number of OpenMP threads per MPI process
• N: number of MPI processes per "socket"

[Figure: a node with Socket #0 and Socket #1.]

Page 40: 4 nodes / 8 sockets: 128 MPI processes
Flat MPI, 32 MPI processes/node (16 of 18 cores per socket used)

[Figure: Node#0-Node#3, each with Socket #0 and Socket #1.]

mesh.inp:
  256 128 64
  16 8 1
  pcube

inp_kmetis:
  cube.02
  128
  pcube

inp_mg:
  256 128 64

Job settings:
  select=4:mpiprocs=32
  I_MPI_PERHOST=32

Page 41: Flat MPI: 16 MPI Processes/Socket

#!/bin/sh
#PBS -q u-lecture
#PBS -N hybrid
#PBS -l select=4:mpiprocs=32     <- node #, MPI proc. #/node
#PBS -Wgroup_list=gt16
#PBS -l walltime=00:05:00
#PBS -e err
#PBS -o test.lst

cd $PBS_O_WORKDIR
. /etc/profile.d/modules.sh

export I_MPI_PIN_DOMAIN=socket
export I_MPI_PERHOST=32          <- MPI proc. #/node

mpirun ./impimap.sh ./sol

Page 42: 4 nodes: 16 threads x 8 MPI processes
HB 16x1, 2 MPI processes/node (16 of 18 cores per socket used)

[Figure: Node#0-Node#3, each with Socket #0 and Socket #1.]

mesh.inp:
  256 128 64
  4 2 1
  pcube

inp_kmetis:
  cube.02
  8
  pcube

inp_mg:
  256 128 64

Job settings:
  select=4:mpiprocs=2
  I_MPI_PERHOST=2
  OMP_NUM_THREADS=16

Page 43: HB 16x1

#!/bin/sh
#PBS -q u-lecture
#PBS -N hybrid
#PBS -l select=4:mpiprocs=2      <- node #, MPI proc. #/node
#PBS -Wgroup_list=gt16
#PBS -l walltime=00:05:00
#PBS -e err
#PBS -o test.lst

cd $PBS_O_WORKDIR
. /etc/profile.d/modules.sh

export OMP_NUM_THREADS=16        <- thread #/MPI process
export I_MPI_PIN_DOMAIN=socket
export I_MPI_PERHOST=2           <- MPI proc. #/node

mpirun ./impimap.sh ./sol1

Page 44: 4 nodes: 8 threads x 16 MPI processes
HB 8x2, 4 MPI processes/node (16 of 18 cores per socket used)

[Figure: Node#0-Node#3, each with Socket #0 and Socket #1.]

mesh.inp:
  256 128 64
  4 4 1
  pcube

inp_kmetis:
  cube.02
  16
  pcube

inp_mg:
  256 128 64

Job settings:
  select=4:mpiprocs=4
  I_MPI_PERHOST=4
  OMP_NUM_THREADS=8

Page 45: HB 8x2

#!/bin/sh
#PBS -q u-lecture
#PBS -N hybrid
#PBS -l select=4:mpiprocs=4      <- node #, MPI proc. #/node
#PBS -Wgroup_list=gt16
#PBS -l walltime=00:05:00
#PBS -e err
#PBS -o test.lst

cd $PBS_O_WORKDIR
. /etc/profile.d/modules.sh

export OMP_NUM_THREADS=8         <- thread #/MPI process
export I_MPI_PIN_DOMAIN=socket
export I_MPI_PERHOST=4           <- MPI proc. #/node

mpirun ./impimap.sh ./sol1

Page 46: 4 nodes: 4 threads x 32 MPI processes
HB 4x4, 8 MPI processes/node (16 of 18 cores per socket used)

[Figure: Node#0-Node#3, each with Socket #0 and Socket #1.]

mesh.inp:
  256 128 64
  8 4 1
  pcube

inp_kmetis:
  cube.02
  32
  pcube

inp_mg:
  256 128 64

Job settings:
  select=4:mpiprocs=8
  I_MPI_PERHOST=8
  OMP_NUM_THREADS=4

Page 47: HB 4x4

#!/bin/sh
#PBS -q u-lecture
#PBS -N hybrid
#PBS -l select=4:mpiprocs=8      <- node #, MPI proc. #/node
#PBS -Wgroup_list=gt16
#PBS -l walltime=00:05:00
#PBS -e err
#PBS -o test.lst

cd $PBS_O_WORKDIR
. /etc/profile.d/modules.sh

export OMP_NUM_THREADS=4         <- thread #/MPI process
export I_MPI_PIN_DOMAIN=socket
export I_MPI_PERHOST=8           <- MPI proc. #/node

mpirun ./impimap.sh ./sol1

Page 48: 4 nodes: 2 threads x 64 MPI processes
HB 2x8, 16 MPI processes/node (16 of 18 cores per socket used)

[Figure: Node#0-Node#3, each with Socket #0 and Socket #1.]

mesh.inp:
  256 128 64
  8 8 1
  pcube

inp_kmetis:
  cube.02
  64
  pcube

inp_mg:
  256 128 64

Job settings:
  select=4:mpiprocs=16
  I_MPI_PERHOST=16
  OMP_NUM_THREADS=2

Page 49: HB 2x8

#!/bin/sh
#PBS -q u-lecture
#PBS -N hybrid
#PBS -l select=4:mpiprocs=16     <- node #, MPI proc. #/node
#PBS -Wgroup_list=gt16
#PBS -l walltime=00:05:00
#PBS -e err
#PBS -o test.lst

cd $PBS_O_WORKDIR
. /etc/profile.d/modules.sh

export OMP_NUM_THREADS=2         <- thread #/MPI process
export I_MPI_PIN_DOMAIN=socket
export I_MPI_PERHOST=16          <- MPI proc. #/node

mpirun ./impimap.sh ./sol1

Page 50: 4 nodes: 18 threads x 8 MPI processes
HB 18x1, 2 MPI processes/node (18 of 18 cores per socket used)

[Figure: Node#0-Node#3, each with Socket #0 and Socket #1.]

mesh.inp:
  256 128 64
  4 2 1
  pcube

inp_kmetis:
  cube.02
  8
  pcube

inp_mg:
  256 128 64

Job settings:
  select=4:mpiprocs=2
  I_MPI_PERHOST=2
  OMP_NUM_THREADS=18

Page 51: HB 18x1

#!/bin/sh
#PBS -q u-lecture
#PBS -N hybrid
#PBS -l select=4:mpiprocs=2      <- node #, MPI proc. #/node
#PBS -Wgroup_list=gt16
#PBS -l walltime=00:05:00
#PBS -e err
#PBS -o test.lst

cd $PBS_O_WORKDIR
. /etc/profile.d/modules.sh

export OMP_NUM_THREADS=18        <- thread #/MPI process
export I_MPI_PIN_DOMAIN=socket
export I_MPI_PERHOST=2           <- MPI proc. #/node

mpirun ./impimap.sh ./sol1

Page 52: 4 nodes / 8 sockets: 144 MPI processes
Flat MPI, 36 MPI processes/node (18 of 18 cores per socket used)

[Figure: Node#0-Node#3, each with Socket #0 and Socket #1.]

inp_kmetis:
  cube.02
  144
  pcube

inp_mg:
  256 128 64

Job settings:
  select=4:mpiprocs=36
  I_MPI_PERHOST=36

Page 53: Flat MPI: 18 MPI Processes/Socket

#!/bin/sh
#PBS -q u-lecture
#PBS -N hybrid
#PBS -l select=4:mpiprocs=36     <- node #, MPI proc. #/node
#PBS -Wgroup_list=gt16
#PBS -l walltime=00:05:00
#PBS -e err
#PBS -o test.lst

cd $PBS_O_WORKDIR
. /etc/profile.d/modules.sh

export I_MPI_PIN_DOMAIN=socket
export I_MPI_PERHOST=36          <- MPI proc. #/node

mpirun ./impimap.sh ./sol

Page 54: How to apply multi-threading

• CG solver
  – Just insert OpenMP directives
  – ILU/IC preconditioning is much more difficult
• MAT_ASS (mat_ass_main, mat_ass_bc)
  – Data dependency
  – Avoid accumulating the contributions of multiple elements to a single node simultaneously (in parallel)
    • results may change
    • deadlock may occur
  – Coloring (a structural sketch is given below)
    • Elements of the same color do not share a node
    • Parallel operations are possible for the elements within each color
    • In this case, only 8 colors are needed for 3D problems (4 colors for 2D problems)
    • The coloring part itself is very expensive and difficult to parallelize
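
A structural sketch of the colored assembly loop in C (hypothetical names, illustration only; the lecture's actual Fortran routine appears on Pages 64-66): the outer loop over colors is serial, the inner loop over the elements of one color is parallel, because elements of the same color share no node.

void assemble_colored(int ncolor, const int *color_index, const int *color_item)
{
  int icol, ie0;

  for (icol = 1; icol <= ncolor; icol++) {        /* colors, one by one          */
#pragma omp parallel for private(ie0)
    for (ie0 = color_index[icol - 1]; ie0 < color_index[icol]; ie0++) {
      int icel = color_item[ie0];                 /* element of this color        */
      /* ... compute the element matrix of icel and add it to the global matrix:
             safe, since no other element of this color touches the same node    */
      (void)icel;
    }
  }
}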

Page 55: Fortran (solver_CG)

!$omp parallel do private(i)
do i= 1, N
  X(i)   = X(i)    + ALPHA * WW(i,P)
  WW(i,R)= WW(i,R) - ALPHA * WW(i,Q)
enddo

DNRM20= 0.d0
!$omp parallel do private(i) reduction (+:DNRM20)
do i= 1, N
  DNRM20= DNRM20 + WW(i,R)**2
enddo

!$omp parallel do private(j,k,i,WVAL)
do j= 1, N
  WVAL= D(j)*WW(j,P)
  do k= index(j-1)+1, index(j)
    i= item(k)
    WVAL= WVAL + AMAT(k)*WW(i,P)
  enddo
  WW(j,Q)= WVAL
enddo

Page 56: C (solver_CG)

#pragma omp parallel for private (i)
for(i=0;i<N;i++){
  X [i]    += ALPHA *WW[P][i];
  WW[R][i] += -ALPHA *WW[Q][i];
}

DNRM20= 0.e0;
#pragma omp parallel for private (i) reduction (+:DNRM20)
for(i=0;i<N;i++){
  DNRM20+= WW[R][i]*WW[R][i];
}

#pragma omp parallel for private (j,i,k,WVAL)
for( j=0;j<N;j++){
  WVAL= D[j] * WW[P][j];
  for(k=indexLU[j];k<indexLU[j+1];k++){
    i= itemLU[k];
    WVAL+= AMAT[k] * WW[P][i];
  }
  WW[Q][j]= WVAL;
}
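
In the hybrid model the OpenMP reduction above produces only the per-process partial sum; in the distributed CG solver the partial sums still have to be combined across MPI processes, typically with MPI_Allreduce. A minimal sketch of this assumed pattern (not the lecture's actual solver code):

#include <mpi.h>

double global_dot(const double *r, const double *z, int n)
{
  double local = 0.0, global = 0.0;
  int i;

  /* thread-level reduction within one MPI process */
#pragma omp parallel for private(i) reduction(+:local)
  for (i = 0; i < n; i++) local += r[i] * z[i];

  /* process-level reduction across all MPI processes */
  MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  return global;
}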

Page 57: solver_SR (send)

C:

for( neib=1;neib<=NEIBPETOT;neib++){
  istart= EXPORT_INDEX[neib-1];
  inum  = EXPORT_INDEX[neib]-istart;
#pragma omp parallel for private (k,ii)
  for( k=istart;k<istart+inum;k++){
    ii= EXPORT_ITEM[k];
    WS[k]= X[ii-1];
  }
  MPI_Isend(&WS[istart],inum,MPI_DOUBLE,
            NEIBPE[neib-1],0,MPI_COMM_WORLD,&req1[neib-1]);
}

Fortran:

do neib= 1, NEIBPETOT
  istart= EXPORT_INDEX(neib-1)
  inum  = EXPORT_INDEX(neib  ) - istart
!$omp parallel do private(k,ii)
  do k= istart+1, istart+inum
    ii   = EXPORT_ITEM(k)
    WS(k)= X(ii)
  enddo

  call MPI_Isend (WS(istart+1), inum, MPI_DOUBLE_PRECISION, &
 &                NEIBPE(neib), 0, MPI_COMM_WORLD, req1(neib), &
 &                ierr)
enddo

Page 58: Example: Strong Scaling: Fortran

• 256 x 128 x 128 nodes
  – 4,194,304 nodes, 4,112,895 elements
• 32 to 864 cores; HB 16x1, HB 18x1, Flat MPI
• Linear solver
• Speed-up is normalized so that the performance of Flat-pmesh/16 with 32 cores equals 32.0

mesh.inp examples:
  256 128 128 / 2 1 1 / pcube
  256 128 128 / 2 1 2 / pcube
  256 128 128 / 4 2 2 / pcube

Job-size examples:
  select=1:mpiprocs=2
  select=2:mpiprocs=4
  select=8:mpiprocs=16

[Figure: speed-up vs. core # (up to ~900 cores) for HB-pmesh/16, HB-pmesh/18, HB-kmetis/16, Flat-pmesh/16 and Flat-pmetis/16, compared with the ideal line.]

Page 59: Example: Strong Scaling: Fortran

• 256 x 128 x 128 nodes
  – 4,194,304 nodes, 4,112,895 elements
• 32 to 864 cores; HB 16x1, HB 18x1, Flat MPI
• Linear solver
• Speed-up is normalized so that the performance of Flat-pmesh/16 with 32 cores equals 32.0

[Figures: the speed-up vs. core # plot from the previous page, plus a zoomed view (512-768 cores, speed-up 400-1000) comparing HB-pmesh/16, HB-kmetis/16, Flat-pmesh/16 and Flat-pmetis/16.]

Page 60: Computation Time using 16 nodes

• kmetis
• Linear solver

[Figure: bar chart of elapsed time (0.00-0.40 sec.) vs. thread #/MPI process for Flat MPI/16, HB 2x8, HB 4x4, HB 8x2, HB 16x1 and HB 18x1.]

Page 61: Flat MPI vs. Hybrid

• Depends on the application, problem size, hardware, etc.
• Flat MPI is generally better for sparse linear solvers if the number of compute nodes is not so large.
  – Memory contention
• Hybrid becomes better as the number of compute nodes grows.
  – Fewer MPI processes
• 1 MPI process per node is also possible: NUMA (A1/A2)

[Figure: a node with Socket #0 and Socket #1.]

Page 62: How to apply multi-threading

• CG solver
  – Just insert OpenMP directives
  – ILU/IC preconditioning is much more difficult
• MAT_ASS (mat_ass_main, mat_ass_bc)
  – Data dependency
  – Avoid accumulating the contributions of multiple elements to a single node simultaneously (in parallel)
    • results may change
    • deadlock may occur
  – Coloring
    • Elements of the same color do not share a node
    • Parallel operations are possible for the elements within each color
    • In this case, only 8 colors are needed for 3D problems (4 colors for 2D problems)
    • The coloring part itself is very expensive and difficult to parallelize

Page 63: Multi-Threading: Mat_Ass

Parallel operations are possible for the elements of the same color (they are independent).

Page 64: Coloring (1/2)

allocate (ELMCOLORindex(0:NP))      ! number of elements in each color
allocate (ELMCOLORitem (ICELTOT))   ! element IDs renumbered according to "color"
if (allocated (IWKX)) deallocate (IWKX)
allocate (IWKX(0:NP,3))

IWKX= 0
icou= 0
do icol= 1, NP
  do i= 1, NP
    IWKX(i,1)= 0
  enddo
  do icel= 1, ICELTOT
    if (IWKX(icel,2).eq.0) then
      in1= ICELNOD(icel,1)
      in2= ICELNOD(icel,2)
      in3= ICELNOD(icel,3)
      in4= ICELNOD(icel,4)
      in5= ICELNOD(icel,5)
      in6= ICELNOD(icel,6)
      in7= ICELNOD(icel,7)
      in8= ICELNOD(icel,8)

      ip1= IWKX(in1,1)
      ip2= IWKX(in2,1)
      ip3= IWKX(in3,1)
      ip4= IWKX(in4,1)
      ip5= IWKX(in5,1)
      ip6= IWKX(in6,1)
      ip7= IWKX(in7,1)
      ip8= IWKX(in8,1)

Page 65: Coloring (2/2)

      isum= ip1 + ip2 + ip3 + ip4 + ip5 + ip6 + ip7 + ip8
      if (isum.eq.0) then               ! none of the nodes has been touched in this color
        icou= icou + 1
        IWKX(icol,3)= icou              ! (current) number of elements in this color
        IWKX(icel,2)= icol
        ELMCOLORitem(icou)= icel        ! ID of the icou-th element = icel

        IWKX(in1,1)= 1                  ! the nodes of this element can no longer be
        IWKX(in2,1)= 1                  ! accessed within the same color
        IWKX(in3,1)= 1
        IWKX(in4,1)= 1
        IWKX(in5,1)= 1
        IWKX(in6,1)= 1
        IWKX(in7,1)= 1
        IWKX(in8,1)= 1
        if (icou.eq.ICELTOT) goto 100   ! continue until all elements are colored
      endif
    endif
  enddo
enddo

100 continue
ELMCOLORtot= icol                       ! number of colors
IWKX(0          ,3)= 0
IWKX(ELMCOLORtot,3)= ICELTOT

do icol= 0, ELMCOLORtot
  ELMCOLORindex(icol)= IWKX(icol,3)
enddo

Page 66: Multi-Threaded Matrix Assembling Procedure

do icol= 1, ELMCOLORtot
!$omp parallel do private (icel0,icel)                                 &
!$omp&            private (in1,in2,in3,in4,in5,in6,in7,in8)            &
!$omp&            private (nodLOCAL,ie,je,ip,jp,kk,iiS,iiE,k)          &
!$omp&            private (DETJ,PNX,PNY,PNZ,QVC,QV0,COEFij,coef,SHi)   &
!$omp&            private (PNXi,PNYi,PNZi,PNXj,PNYj,PNZj,ipn,jpn,kpn)  &
!$omp&            private (X1,X2,X3,X4,X5,X6,X7,X8)                    &
!$omp&            private (Y1,Y2,Y3,Y4,Y5,Y6,Y7,Y8)                    &
!$omp&            private (Z1,Z2,Z3,Z4,Z5,Z6,Z7,Z8,COND0)
  do icel0= ELMCOLORindex(icol-1)+1, ELMCOLORindex(icol)
    icel= ELMCOLORitem(icel0)
    in1= ICELNOD(icel,1)
    in2= ICELNOD(icel,2)
    in3= ICELNOD(icel,3)
    in4= ICELNOD(icel,4)
    in5= ICELNOD(icel,5)
    in6= ICELNOD(icel,6)
    in7= ICELNOD(icel,7)
    in8= ICELNOD(icel,8)
    ...