
Porting Telemac–Mascaret to OpenPower and experimenting GPU offloading to accelerate the Tomawac module

TUC 2019, 16-17th October, CERFACS, Toulouse, France

Judicael Grasset(1), Stephen Longshaw(1), Charles Moulinec(1), David R. Emerson(1)

Yoann Audouin(2), Pablo Tassi(2)

October 17, 2019

(1) STFC, Daresbury Laboratory, Warrington, United Kingdom

(2) EDF R&D, Chatou, France

Computing used

OpenPower architecture in a nutshell:

• IBM POWER processors

• NVIDIA GPUs

• NVIDIA NVLink

The machine used for this work is Paragon. Each of its nodes consists of:

• 2 IBM POWER8 processors, with 8 cores each

• Each core has simultaneous multithreading (SMT) capability

• In this case the cores are able to run either 1 thread (SMT1), 2 threads (SMT2), 4 threads (SMT4) or 8 threads (SMT8) at the same time

• 4 NVIDIA P100 GPUs

• NVIDIA NVLink for GPU–GPU and GPU–CPU interconnections

Porting to OpenPower

• Why? Summit and Sierra, the two most powerful clusters in the world, are based on the OpenPower architecture (Top500, June 2019)

• Porting to a different architecture might reveal bugs in the code (increased robustness)


Porting to OpenPower

Status of the port:

Version              PGI 18.10           GCC 9.1     XL 16.1.1.1
v8p0r2               compiles            compiles    does not compile*
trunk (Oct. 2019)    does not compile*   compiles    does not compile*

*Problem known and solved; it compiles when a small patch is applied.

All tests were done with the Spectrum MPI library.


Experimenting with GPUs

Or trying to port Telemac to the architecture of the future (present)


The test case

Test case used: tomawac/fetch_limited/tom_test6.cas

• This is a limited test with a small mesh: 75k elements, 32k points.

• It spends all of its time in a single Fortran subroutine: qnlin3.f

• This subroutine was reported to be a bottleneck by some users during the 2018 TELEMAC User Conference.


qnlin3.f

In a nutshell:

• do loop
  • init some variables
  • do loop
    • init some variables
    • do loop
      • init some variables
      • do loop
        • tmp_array(x,y,z) = tmp_array(x,y,z) + k
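In other words, four nested loops whose innermost statement accumulates into a shared temporary array. A minimal Fortran sketch of that structure (the array name, loop bounds, index expressions and the increment are invented placeholders, not the actual qnlin3.f code):

program qnlin3_structure_sketch
  ! Hedged sketch of the loop-nest shape described above; all names and sizes are assumptions
  implicit none
  integer, parameter :: n1 = 4, n2 = 4, n3 = 4, n4 = 4
  real :: tmp_array(n1, n2, n3)
  integer :: i1, i2, i3, i4, x, y, z
  real :: k

  tmp_array = 0.0
  do i1 = 1, n1
     ! init some variables (placeholder)
     do i2 = 1, n2
        ! init some variables (placeholder)
        do i3 = 1, n3
           ! init some variables (placeholder)
           do i4 = 1, n4
              ! x, y, z and k would be computed from i1..i4; different i4
              ! iterations can update the same entry of tmp_array
              x = i1; y = i2; z = i3
              k = 1.0
              tmp_array(x, y, z) = tmp_array(x, y, z) + k
           end do
        end do
     end do
  end do
end program qnlin3_structure_sketch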


Porting to GPUs, methods

Different solutions exist:

• Pragma-based: OpenMP, OpenACC

• Library-based: Magma, cuBLAS...

• Language extensions: CUDA, OpenCL


MPI+OpenACC (PGI compiler) on GPU

Move the data to the GPU and execute the loop nest on it:

• !$acc data copy(array)

• !$acc parallel loop collapse(4)

• do loop
  • do loop
    • do loop
      • do loop
        • !$acc atomic
        • array(x,y,z) = array(x,y,z) + k

• ...

• !$acc end data

Elsewhere, during the initialisation of the code, each MPI task has been linked to a specific GPU (see the sketch below).
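A self-contained sketch of how this MPI+OpenACC pattern could look in Fortran (the array name, loop bounds and the rank-to-GPU binding are illustrative assumptions, not the actual Telemac code; OpenACC device numbering may differ between implementations):

program mpi_openacc_sketch
  use mpi
  use openacc
  implicit none
  integer, parameter :: n = 8
  real :: tmp_array(n, n, n)
  integer :: i1, i2, i3, i4, ierr, rank, ngpus

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  ! Done once at initialisation: bind each MPI task to one of the node's GPUs
  ngpus = acc_get_num_devices(acc_device_nvidia)
  if (ngpus > 0) call acc_set_device_num(mod(rank, ngpus), acc_device_nvidia)

  tmp_array = 0.0

  ! Move the array to the GPU, run the collapsed loop nest there,
  ! and protect the concurrent updates with an atomic
  !$acc data copy(tmp_array)
  !$acc parallel loop collapse(4)
  do i1 = 1, n
     do i2 = 1, n
        do i3 = 1, n
           do i4 = 1, n
              !$acc atomic
              tmp_array(i1, i2, i3) = tmp_array(i1, i2, i3) + 1.0
           end do
        end do
     end do
  end do
  !$acc end data

  call MPI_Finalize(ierr)
end program mpi_openacc_sketch

Building such a sketch would require the PGI compiler with OpenACC enabled (e.g. -acc -ta=tesla).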


MPI+OpenACC (PGI compiler) on GPU

[Results figure]

MPI+OpenMP (IBM compiler) on GPU

Move the data to the GPU and execute the loop nest on it:

• !$omp target data map(array)

• !$omp target teams distribute parallel do collapse(4)

• do loop
  • do loop
    • do loop
      • do loop
        • !$omp atomic
        • array(x,y,z) = array(x,y,z) + k

• ...

• !$omp end target data

Elsewhere, during the initialisation of the code, each MPI task has been linked to a specific GPU (see the sketch below).
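An equivalent sketch with OpenMP target offloading (again, the names and the rank-to-device binding are illustrative assumptions, not the actual Telemac code):

program mpi_openmp_target_sketch
  use mpi
  use omp_lib
  implicit none
  integer, parameter :: n = 8
  real :: tmp_array(n, n, n)
  integer :: i1, i2, i3, i4, ierr, rank, ngpus

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  ! Done once at initialisation: bind each MPI task to one of the node's GPUs
  ngpus = omp_get_num_devices()
  if (ngpus > 0) call omp_set_default_device(mod(rank, ngpus))

  tmp_array = 0.0

  ! Map the array to the GPU, distribute the collapsed loop nest over
  ! teams and threads, and protect the concurrent updates with an atomic
  !$omp target data map(tofrom: tmp_array)
  !$omp target teams distribute parallel do collapse(4)
  do i1 = 1, n
     do i2 = 1, n
        do i3 = 1, n
           do i4 = 1, n
              !$omp atomic update
              tmp_array(i1, i2, i3) = tmp_array(i1, i2, i3) + 1.0
           end do
        end do
     end do
  end do
  !$omp end target data

  call MPI_Finalize(ierr)
end program mpi_openmp_target_sketch

With the IBM XL compiler this would be built with OpenMP offloading enabled (e.g. -qsmp=omp -qoffload).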


MPI+OpenMP (IBM compiler) on GPU

[Results figure]

Somme test case

• Somme, 7-day simulation

• Telemac2d-Tomawac-Sisyphe coupling

[Pie chart: breakdown of runtime by subroutine, with shares between 6% and 21.1%, over semimp, qwind1, propa, fremoy, schar41_per_4d, log, qnlin1, bief_interp and other subroutines.]


Inclusion in the codebase

• OpenACC and OpenMP directives are redundant (two sets of directives for the same loops)

• This could be solved with pragmas in this case (see the sketch after this list)

• But that might not always be possible

• Usage of the optional directory
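Because both OpenACC and OpenMP directives are plain Fortran comments unless the corresponding compiler option is given, one way the redundancy could be handled is to keep both sets of sentinels on the same loop nest and let the enabled backend pick up its own directives. A hedged sketch under that assumption (not necessarily the form adopted in the Telemac sources):

program dual_directive_sketch
  ! Whichever backend is enabled at compile time interprets its own sentinel;
  ! the other directive line is ignored as an ordinary Fortran comment
  implicit none
  integer, parameter :: n = 8
  real :: a(n, n, n)
  integer :: i1, i2, i3, i4

  a = 0.0
  !$omp target teams distribute parallel do collapse(4) map(tofrom: a)
  !$acc parallel loop collapse(4) copy(a)
  do i1 = 1, n
     do i2 = 1, n
        do i3 = 1, n
           do i4 = 1, n
              !$omp atomic update
              !$acc atomic
              a(i1, i2, i3) = a(i1, i2, i3) + 1.0
           end do
        end do
     end do
  end do
  print *, 'sum =', sum(a)
end program dual_directive_sketch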


Conclusion

Results achieved:

• Telemac-Mascaret ported to OpenPower

• The port revealed bugs in Telemac-Mascaret and in some compilers

• Good improvement when using the GPU for the qnlin3 subroutine

• Work is still ongoing, but will be more difficult for real-world test cases


Acknowledgements

• This work is supported by the Hartree Centre through the Innovation Return on Research (IROR) programme.


Thank you for your attention

If you think the code is too slow, or uses too much memory for you (partel, Telemac, Tomawac...), please contact us.

Contact:

judicael.grasset@stfc.ac.uk charles.moulinec@stfc.ac.uk
