From PeleC to PeleACC, to PeleC++
P3HPC Forum 2020, September 2
Jon Rood, Marc Henry de Frahan, Ray Grout
The Pele Project
Solves reacting Navier-Stokes on structured grids using AMR and embedded boundaries, based on the AMReX library

PeleC
– Compressible combustion simulations
– Explicit time stepping

PeleLM
– Low-Mach combustion simulations
– Implicit, requiring a linear solver

PelePhysics
– Shared code for chemistry/reactions
PeleC Overview
• ~50k LOC: 11,373 lines of C++ and 38,905 lines of Fortran (including duplicate dimension-specific code)
• High-level C++ orchestration with Fortran kernels
• Source code generator used for chemistry to unroll code
• C++ -> Fortran -> C
– Mixed languages pose many issues
Original PeleC Programming Model
• MPI + OpenMP
• Ranks operate in bulk-synchronous data-parallel fashion
• Threads operate on independent tiles
• Originally focused on KNL and vectorization (lowered loops)

AMReX FAB data structures [1]:

#pragma omp parallel
for (MFIter mfi(F, true); mfi.isValid(); ++mfi) {
  const Box& box = mfi.tilebox();
  Array4<Real const> const& u = U.const_array(mfi);
  Array4<Real> const& f = F.array(mfi);
  f2(box, u, f); // Call Fortran kernel
}
PeleC on GPUs
• Xeon Phi discontinued; GPUs become the focus for the birth of exascale
• Quickest way to utilize GPUs: offload kernels to the device
• OpenACC was the most mature Fortran GPU programming model at the time
– Tied to the PGI compiler
– Introduced in 2011; used in production since ~2014
• OpenMP 4 introduced accelerator support in 2013
– Jeff Larkin (NVIDIA), GTC March 2018: "OpenMP on GPUs, First Experiences and Best Practices"
• OpenACC pragmas have a straightforward mapping to OpenMP pragmas
• Minimizes the need to modify current PeleC code
• No need to remove current OpenMP pragmas
PeleC Call Graph
• do_mol_advance – 90%
– getMOLSrcTerm – 64%
– react_state – 26%
OpenACC Effort
• 90% of runtime under one routine
• Around 5 kernel routines under getMOLSrcTerm to parallelize on GPU
– Around 50 routines to label as seq
– Wrote a Fortran version of the Fuego code generator for these routines
• react_state is an implicit ODE solver with thousands of if conditions
– Implemented a simpler explicit solver instead
– Explicit solver written in C and CUDA
– Explicit solver 6x slower on CPU
• react_state now completely dominates runtime (around 90%)
PeleC OpenACC Programming Model
• Memory management originally done explicitly
• Later used AMReX's GPU memory management
– Use default(present)
• Just need to make sure every routine under a kernel is decorated as a seq device routine
• Run with MPS, 7 ranks per Summit GPU, to obtain asynchronous kernels
Test Case – Pre-mixed Flame
OpenACC Results
[Figure: PeleC GPU Port - Summit vs Cori Node - PMF 3D Case. Runtime on a single Summit or Cori KNL node for KNL CPU, OpenACC, ACC + CUDA, and Full CUDA.]
• Initial OpenACC port over 3x faster than Cori KNL
• 8x faster with CUDA react_state()
• 2 people, 5 weeks of development time
• 1 major bug found and reported to PGI
C++ Effort
• AMReX GPU strategy was emerging alongside our OpenACC effort
– Much like Kokkos using C++ lambdas, but need not be as general
• Steven Reeves, a graduate student at LBL, prototyped PeleC on the GPU by porting every necessary routine to C++
– Performance much better than the OpenACC prototype
• However, once AMReX's memory management was used in OpenACC, performance versus OpenACC seemed to be a toss-up (mostly due to sharing of the react_state routine)
• Performance in general was 16-18x faster than KNL
OpenACC vs C++ Prototype
C++ Effort
• MPI + CUDA for GPUs
• Essentially one thread per cell
• Focus on maximum parallelism in kernel (hoisted perfectly nested loops)
• 1 rank per GPU with CUDA streams for asynchronous behavior

AMReX GPU strategy [2]:

#pragma omp parallel if (amrex::Gpu::notInLaunchRegion())
for (MFIter mfi(mf, TilingIfNotGPU()); mfi.isValid(); ++mfi) {
  const Box& bx = mfi.tilebox();
  Array4<Real> const& fab = mf.array(mfi);
  amrex::ParallelFor(bx, ncomp,
    [=] AMREX_GPU_DEVICE (int i, int j, int k, int n) {
      fab(i,j,k,n) += 1.;
    });
}
C++ Results
[Figure: PeleC strong scaling on Summit and Eagle. Run time (seconds, log scale) vs. number of nodes (2^5 to 2^12) for PeleC original kernels (CPU, GCC, Summit), PeleC CPU (GCC, Summit), PeleC CPU (Intel, Eagle), and PeleC GPU (GCC, Summit), each plotted against ideal scaling. Strong scaling of PMF case with drm19 chemistry; 360M cells with 2 levels of AMR.]
[Figure: PeleC weak scaling on Summit GPUs. Time per timestep (seconds) vs. number of nodes (2^2 to 2^12). Weak scaling of PMF case with drm19 chemistry with no AMR; 2^22 cells per node.]
• 2x faster on CPU
• 18x faster than the fastest CPU case using the Intel compiler
• 56x faster than GCC CPU on Summit
• 124x faster than original Fortran on Summit CPUs
24576 GPUs on 90% of Summit
Conclusions
• PeleC is now 19,363 lines of C++
• Fortran appears to be not beneficial to PeleC in any way
• Even 2x faster on the CPU
• Easier to debug and profile
• Kernels easier to write and to read
• Much less duplicate code necessary for dimensions
• Ability to use many compilers
• Good performance portability
• 1 graduate student for 6 months + 2 staff for 12 weeks to completely move to C++
• OpenACC allowed us to prototype PeleC on GPU very quickly
• Performance can be similar to CUDA
• Code quickly became displeasing
• Mixed languages cause problems for readability, debugging, profiling, and compiler optimizations
• Non-ubiquitous programming models lack support, robustness, and flexibility
• Fortran was holding us back
References
1. https://amrex-codes.github.io/amrex/docs_html/Basics.html#mfiter-and-tiling
2. https://amrex-codes.github.io/amrex/docs_html/GPU.html#overview-of-amrex-gpu-strategy
www.nrel.gov
Q&A
This work was authored by the National Renewable Energy Laboratory, operated by Alliance for Sustainable Energy, LLC, for the U.S. Department of Energy (DOE) under Contract No. DE-AC36-08GO28308. Funding provided by the Exascale Computing Project. The views expressed in the article do not necessarily represent the views of the DOE or the U.S. Government. The U.S. Government retains and the publisher, by accepting the article for publication, acknowledges that the U.S. Government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this work, or allow others to do so, for U.S. Government purposes.
This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering, and early testbed platforms, in support of the nation's exascale computing imperative.