
From PeleC to PeleACC, to PeleC++
P3HPC Forum 2020, September 2
Jon Rood, Marc Henry de Frahan, Ray Grout


Page 1:

From PeleC to PeleACC, to PeleC++

P3HPC Forum 2020, September 2
Jon Rood, Marc Henry de Frahan, Ray Grout

Page 2:

NREL | 2

The Pele Project

Solves reacting Navier-Stokes on a structured grid using AMR and embedded boundaries, based on the AMReX library

PeleC
– Compressible combustion simulations
– Explicit time stepping

PeleLM
– Low-Mach combustion simulations
– Implicit, requiring a linear solver

PelePhysics
– Shared code for chemistry/reactions

Page 3:

NREL | 3

PeleC Overview

• 50k LOC
• 11373 lines of C++
• 38905 lines of Fortran (including duplicate dimension-specific code)
• High-level C++ orchestration with Fortran kernels
• Source code generator used for chemistry to unroll code (see the sketch below)
• C++ -> Fortran -> C
– Mixed languages pose many issues
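The "unroll" step can be pictured with a minimal, hypothetical sketch of generator output for a made-up 3-species mechanism. The function name productionRate, the rate constants, and the reactions are assumptions for illustration, not actual Fuego/PelePhysics output.

#include <cmath>

// Hypothetical sketch of "unrolled" generated chemistry code: instead of
// looping over a runtime reaction list, the generator emits one explicit
// statement per reaction. Species indices and constants are placeholders.
void productionRate(const double T, const double C[3], double wdot[3])
{
  // Arrhenius forward rates k = A * exp(-Ea / (R*T)); placeholder values.
  const double k0 = 1.0e10 * std::exp(-15000.0 / (8.314 * T));
  const double k1 = 5.0e08 * std::exp(-8000.0 / (8.314 * T));

  // Reaction 0: S0 + S1 -> S2 (fully unrolled, no loop over reactions)
  const double q0 = k0 * C[0] * C[1];
  // Reaction 1: S2 -> S0 + S1
  const double q1 = k1 * C[2];

  // Net species production rates
  wdot[0] = -q0 + q1;
  wdot[1] = -q0 + q1;
  wdot[2] =  q0 - q1;
}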

Page 4:

NREL | 4

Original PeleC Programming Model

• MPI + OpenMP
• Ranks operate in bulk-synchronous data-parallel fashion
• Threads operate on independent tiles
• Originally focused on KNL and vectorization (lowered loops)

#pragma omp parallel
for (MFIter mfi(F, true); mfi.isValid(); ++mfi) {
  const Box& box = mfi.tilebox();
  Array4<Real const> const& u = U.const_array(mfi);
  Array4<Real> const& f = F.array(mfi);
  f2(box, u, f); // Call Fortran kernel
}

AMReX FAB data structures [1]
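The f2 call above crosses the C++/Fortran boundary. As a rough, hypothetical illustration of why "mixed languages pose many issues": in practice the Box and Array4 arguments are unpacked into raw bounds and pointers before crossing into Fortran, and the Fortran kernel is seen by C++ only through a hand-written prototype roughly like the one below. The argument list is an assumption for illustration, not the actual PeleC interface.

// Hypothetical interface sketch, not the actual PeleC declaration: argument
// types, array bounds, and ordering must be kept in sync by hand on both sides.
extern "C" {
  void f2(const int* lo, const int* hi,                       // tile box bounds
          const double* u, const int* u_lo, const int* u_hi,  // input state and its bounds
          double* f, const int* f_lo, const int* f_hi);       // output array and its bounds
}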

Page 5:

NREL | 5

PeleC on GPUs

• Xeon Phi discontinued; GPUs become the focus for the birth of Exascale
• Quickest way to utilize GPUs
– Offload kernels to device
• OpenACC was the most mature Fortran GPU programming model at the time
– Tied to the PGI compiler
– Introduced in 2011
– Used in production since ~2014
• OpenMP 4 introduced accelerator support in 2013
– Jeff Larkin (NVIDIA), GTC March 2018: OpenMP on GPUs, First Experiences and Best Practices
• OpenACC pragmas have a straightforward mapping to OpenMP pragmas (see the sketch below)
• Minimizes the need to modify current PeleC code
• No need to remove the current OpenMP pragmas
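As a generic sketch of that pragma correspondence (not PeleC code; the saxpy-style loop and clause choices are assumptions for illustration), the same loop written with OpenACC and with OpenMP offload directives maps over almost mechanically:

// A generic loop written once with OpenACC and once with OpenMP offload
// directives; the data clauses and the loop construct translate one-for-one.
void saxpy_acc(int n, double a, const double* x, double* y)
{
  #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
  for (int i = 0; i < n; ++i)
    y[i] += a * x[i];
}

void saxpy_omp(int n, double a, const double* x, double* y)
{
  #pragma omp target teams distribute parallel for map(to: x[0:n]) map(tofrom: y[0:n])
  for (int i = 0; i < n; ++i)
    y[i] += a * x[i];
}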

Page 6:

NREL | 6

PeleC Call Graph

• do_mol_advance – 90%
– getMOLSrcTerm – 64%
– react_state – 26%

Page 7:

NREL | 7

OpenACC Effort

• 90% of runtime under one routine
• Around 5 kernel routines under getMOLSrcTerm to parallelize on GPU
– Around 50 routines to label as seq
– Wrote a Fortran version of the Fuego code generator for these routines
• react_state is an implicit ODE solver with thousands of if conditions
– Implemented a simpler explicit solver instead (sketched below)
– Explicit solver written in C and CUDA
– Explicit solver 6x slower on CPU
• Completely dominates runtime (react_state now around 90%)
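The explicit-solver idea can be sketched as a sub-stepped forward Euler integration of the species ODEs. This is a minimal illustration under assumed names (reactCellExplicit, and the hypothetical productionRate from the earlier sketch), not the production C/CUDA solver.

// Minimal sketch of replacing the implicit ODE solver with an explicit one:
// sub-stepped forward Euler on dC/dt = wdot(T, C) for a single cell.
void productionRate(double T, const double C[3], double wdot[3]);

void reactCellExplicit(double T, double C[3], double dt, int nsub)
{
  const double h = dt / nsub;   // fixed sub-step; a real solver would adapt this
  double wdot[3];
  for (int s = 0; s < nsub; ++s) {
    productionRate(T, C, wdot);
    for (int k = 0; k < 3; ++k) {
      C[k] += h * wdot[k];      // no Jacobian, no branching: maps well to
    }                           // one GPU thread per cell
  }
}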

Page 8:

NREL | 8

PeleC OpenACC Programming Model

• Memory management originally done explicitly

• Later used AMReX's GPU memory management
– Use default(present)

• Just need to make sure every routine called under a kernel is decorated as a seq device routine (illustrated below)

• Run with MPS, 7 ranks per Summit GPU, to obtain asynchronous kernels
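A generic sketch of that decoration pattern (the helper eosPressure and the loop are assumptions, not PeleC source): device data is presumed resident via default(present), and every helper called inside the offloaded loop carries a sequential device-routine directive.

// Hypothetical helper, marked so the compiler builds a sequential device version.
#pragma acc routine seq
inline double eosPressure(double rho, double e)
{
  const double gamma = 1.4;            // placeholder ideal-gas EOS
  return (gamma - 1.0) * rho * e;
}

void computePressure(int n, const double* rho, const double* e, double* p)
{
  // default(present): arrays are expected to already live on the device,
  // e.g. via AMReX's GPU memory management, rather than being copied here.
  #pragma acc parallel loop default(present)
  for (int i = 0; i < n; ++i)
    p[i] = eosPressure(rho[i], e[i]);
}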

Page 9:

NREL | 9

Test Case – Pre-mixed Flame

Page 10:

NREL | 10

OpenACC Results

[Bar chart: PeleC GPU Port – Summit vs Cori Node – PMF 3D Case. Runtime on a single Summit or Cori KNL node for the KNL CPU, OpenACC, ACC + CUDA, and Full CUDA configurations.]

• Initial OpenACC port over 3x faster than Cori KNL

• 8x faster with CUDA react_state()

• 2 people, 5 weeks of development time

• 1 major bug found and reported to PGI

Page 11:

NREL | 11

C++ Effort

• AMReX GPU strategy was emerging alongside our OpenACC effort
– Much like Kokkos, using C++ lambdas, but need not be as general
• Steven Reeves, a graduate student at LBL, prototyped PeleC on the GPU by porting every necessary routine to C++
– Performance much better than the OpenACC prototype

• However, once AMReX's memory management was used in OpenACC, the performance advantage over OpenACC became a toss-up (mostly due to sharing of the react_state routine)

• Performance in general was 16-18x faster than KNL

Page 12:

NREL | 12

OpenACC vs C++ Prototype

Page 13:

NREL | 13

C++ Effort

• MPI+CUDA for GPUs
• Essentially one thread per cell
• Focus on maximum parallelism in kernel (hoisted perfectly nested loops)
• 1 rank per GPU with CUDA streams for asynchronous behavior

#pragma omp parallel if (amrex::Gpu::notInLaunchRegion())
for (MFIter mfi(mf, TilingIfNotGPU()); mfi.isValid(); ++mfi) {
  const Box& bx = mfi.tilebox();
  Array4<Real> const& fab = mf.array(mfi);
  amrex::ParallelFor(bx, ncomp,
    [=] AMREX_GPU_DEVICE (int i, int j, int k, int n)
    {
      fab(i,j,k,n) += 1.;
    });
}

AMReX GPU strategy [2]

Page 14:

NREL | 14

C++ Results

[Figure: PeleC strong scaling on Summit and Eagle. Strong scaling of the PMF case with drm19 chemistry on the Summit and Eagle machines; 360M cells with 2 levels of AMR. Run time (seconds) vs. nodes for the original Fortran kernels on Summit CPUs (GCC), PeleC CPU on Summit (GCC), PeleC CPU on Eagle (Intel), and PeleC GPU on Summit (GCC), each against an ideal-scaling reference.]

[Figure: PeleC weak scaling on Summit GPUs. Weak scaling of the PMF case with drm19 chemistry with no AMR; 2^22 cells per node. Time per time step (seconds) vs. nodes.]

• 2x faster on CPU
• 18x faster than the fastest CPU case using the Intel compiler
• 56x faster than GCC CPU on Summit
• 124x faster than original Fortran on Summit CPUs

24576 GPUs on 90% of Summit

Page 15:

NREL | 15

Conclusions

• PeleC now 19363 lines of C++
• Fortran appears not to be beneficial to PeleC in any way
• Even 2x faster on the CPU
• Easier to debug and profile
• Kernels easier to write and to read
• Much less duplicate code necessary for dimensions
• Ability to use many compilers
• Good performance portability
• 1 graduate student for 6 months + 2 staff for 12 weeks to completely move to C++

• OpenACC allowed us to prototype PeleC on the GPU very quickly
• Performance can be similar to CUDA
• Code quickly became displeasing
• Mixed languages cause problems for readability, debugging, profiling, and compiler optimizations
• Non-ubiquitous programming models lack support, robustness, and flexibility
• Fortran was holding us back

Page 16:

NREL | 16

References

1. https://amrex-codes.github.io/amrex/docs_html/Basics.html#mfiter-and-tiling
2. https://amrex-codes.github.io/amrex/docs_html/GPU.html#overview-of-amrex-gpu-strategy

Page 17:

www.nrel.gov

Q&A

This work was authored by the National Renewable Energy Laboratory, operated by Alliance for Sustainable Energy, LLC, for the U.S. Department of Energy (DOE) under Contract No. DE-AC36-08GO28308. Funding provided by the Exascale Computing Project. The views expressed in the article do not necessarily represent the views of the DOE or the U.S. Government. The U.S. Government retains and the publisher, by accepting the article for publication, acknowledges that the U.S. Government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this work, or allow others to do so, for U.S. Government purposes.

This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering, and early testbed platforms, in support of the nation's exascale computing imperative.