
Kokkos and Fortran in the Exascale Computing Project plasma physics code XGC

A. Scheinberg,1* G. Chen,2 S. Ethier,1 S. Slattery,3 R. Bird,2 P. Worley, C.S. Chang1

1 Princeton Plasma Physics Laboratory, 2 Los Alamos National Laboratory, 3 Oak Ridge National Laboratory

*Corresponding author. Contact: [email protected]

Introduction


Numerical plasma physics models such as the particle-in-cell XGC code are important tools for understanding phenomena encountered in experimental fusion devices. Here we adapt XGC to use Kokkos, a programming model for portable supercomputer performance, and Cabana, a recent library built on Kokkos as part of the ECP-CoPA project that provides kernels and operations typically necessary for particle-based scientific codes. We summarize XGC's use of the execution and data layout patterns offered by this framework. We then show that this approach can provide a single, portable code base that performs well on both GPUs and multicore machines.

Conclusion

• Using the Kokkos/Cabana framework, we produced a production-ready code that matches the performance of an already highly optimized HPC code.
• Due to Cabana's ease of use, other parts of the code that were previously unoptimized were rapidly vectorized and ported to GPU.
• The framework will allow other scientific programmers to focus more on science and less on optimization, though its flexibility leaves plenty of opportunities for optimizations within and between kernels.
• Our macros and simple examples of the Fortran Cabana approach can be found at github.com/ECP-copa/Cabana; however, we suggest converting to C++ for maximum portability on upcoming architectures.
• In either language, attention to modularization, data layout, and hierarchical parallelism is key (a minimal sketch of hierarchical parallelism follows this list).
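To make the last point concrete, here is a minimal Kokkos sketch of hierarchical (team-based) parallelism. It is illustrative only and is not taken from XGC; the cell and particle counts and the view name are placeholders.

  #include <Kokkos_Core.hpp>

  int main(int argc, char* argv[]) {
    Kokkos::ScopeGuard guard(argc, argv);
    {
      const int n_cells = 256, n_ptl_per_cell = 64;
      Kokkos::View<double**> work("work", n_cells, n_ptl_per_cell);

      using Policy = Kokkos::TeamPolicy<>;
      Kokkos::parallel_for("hierarchical_example", Policy(n_cells, Kokkos::AUTO),
          KOKKOS_LAMBDA(const Policy::member_type& team) {
            const int c = team.league_rank();  // outer level: one team per cell
            // Inner level: the team's threads/vector lanes share the particles in the cell.
            Kokkos::parallel_for(Kokkos::TeamThreadRange(team, n_ptl_per_cell),
                [&](const int p) { work(c, p) += 1.0; });
          });
      Kokkos::fence();
    }
    return 0;
  }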

Model: XGC


XGC is a particle-in-cell plasma physics code which uses a gyrokinetic model to simulate plasma turbulence in the edge and core of magnetically confined fusion devices. Markers denoting the ion and electron distribution functions are distributed in configuration space. The electric field, stored on an unstructured grid, is evaluated at the marker positions, which are updated ("pushed") accordingly. The new charge distribution is then mapped back to the grid, where the electric field is solved for the next time step. Electrons are "sub-cycled" at 60 steps per ion step and comprise the bulk of the computation.
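To make the time-step structure concrete, the following C++ sketch outlines one gather-push-scatter cycle with electron sub-cycling. It is a schematic illustration only; the types and function names are placeholders, not XGC's actual routines.

  struct Grid {};       // field mesh (placeholder for XGC's unstructured grid)
  struct Particles {};  // marker particles (placeholder)

  void solve_field(Grid&) {}                        // solve the electric field on the grid
  void gather_and_push(const Grid&, Particles&) {}  // evaluate E at the markers, update ("push") them
  void scatter_charge(const Particles&, Grid&) {}   // map the new charge distribution to the grid

  void time_step(Grid& grid, Particles& ions, Particles& electrons) {
    solve_field(grid);                   // field for this step
    gather_and_push(grid, ions);         // one ion push per time step
    for (int s = 0; s < 60; ++s)         // electrons are sub-cycled: 60 pushes per ion step,
      gather_and_push(grid, electrons);  // which is the bulk of the computation
    scatter_charge(ions, grid);
    scatter_charge(electrons, grid);
  }

  int main() {
    Grid grid;
    Particles ions, electrons;
    time_step(grid, ions, electrons);
    return 0;
  }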


Performance

Performance was evaluated on two machines: the KNL partition of Cori and Summit. The Cabana version is compared against previous versions of XGC: an unvectorized OpenMP version and a vectorized OpenMP version on Cori, and the OpenMP version and a CUDA Fortran version on Summit. In both cases, the Cabana implementation of the costly electron push kernel performs about as well as the previous, architecture-specific implementations. Weak scaling studies on both machines demonstrate that supercomputer-scale simulations can be done with the Cabana version without loss of performance.

[Figure: XGC weak scaling, 370k mesh, 12M particles/node, n_planes = n_nodes/256; normalized time vs. node count (up to 4500 nodes), 50 billion electrons in total. Timing breakdown: Electron push 48%, Other 17%, Ion shift 12%, Collisions 9%, Ion scatter 6%, Electron shift 5%, Electron scatter 2%, Ion push 1%.]

[Figure: XGC versions on Summit, 1M mesh, 25.2M particles/GPU, 8 planes. Stacked bars of normalized time (Push, Other, Electron shift, Electron scatter, f - Collisions, f - Search, f - Other, Ion shift, Ion scatter) for Old XGC (CPU only), Old XGC (with GPU), and Cabana XGC; speed-ups of 14.4 (with GPU) and 15.5 (Cabana XGC) relative to the CPU-only version.]

[Figure: XGC weak scaling, 1M mesh, 50.4M particles/GPU, n_planes = n_nodes/32; normalized time vs. node count (up to 4500 nodes), 1.24 trillion electrons in total. Timing breakdown: Push (electrons on GPU, ions on CPU) 29%, f - Other (CPU) 15%, Ion scatter (CPU) 14%, Ion shift (CPU) 13%, Electron shift (CPU) 8%, f - Collisions (GPU) 7%, Other (CPU) 7%, Electron scatter (GPU+CPU) 5%, f - Search (CPU) 3%.]

[Figure: XGC versions on Cori KNL, 370k mesh, 12M particles/node, 2 planes, 512 KNL nodes. Stacked bars of normalized time (Other, Electron shift, Ion shift, Electron push, Ion push, Collisions, Electron scatter, Ion scatter) for Old XGC (Original), Old XGC (Vectorized), and Cabana XGC; speed-ups of 1.5 (Vectorized) and 1.8 (Cabana XGC) relative to the original.]

Asynchronous execution


Idle time on both the host and device must be minimized for efficient resource use. Increased modularization, in our case separation of ion tasks from electron tasks, can remove unnecessary sync points and enable asynchronous CPU and GPU operation, which reduces idle time. Communication can also be done asynchronously, but this was not prioritized since, due to fast data transfer on Summit, host-device communication is only ~1% of total runtime.

[Diagram: operations in the XGC time step: field solver, gather operation, particle push, MPI particle transfer, scatter operation, and collisions/sources/diagnostics.]
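As an illustration only (this is not XGC's actual scheduling code), the Kokkos sketch below overlaps device and host work by relying on the asynchronous dispatch of device kernels; the view names and problem sizes are placeholders. On a host-only build, both loops simply run on the CPU.

  #include <Kokkos_Core.hpp>

  int main(int argc, char* argv[]) {
    Kokkos::ScopeGuard guard(argc, argv);
    {
      const int n_electrons = 1 << 20;
      const int n_ions      = 1 << 16;
      Kokkos::View<double*> e_work("e_work", n_electrons);                // default memory space (device on GPU builds)
      Kokkos::View<double*, Kokkos::HostSpace> i_work("i_work", n_ions);  // host memory

      // Device kernels are dispatched asynchronously: this call returns to the
      // host as soon as the electron kernel has been launched.
      Kokkos::parallel_for("electron_work",
          Kokkos::RangePolicy<Kokkos::DefaultExecutionSpace>(0, n_electrons),
          KOKKOS_LAMBDA(const int i) { e_work(i) += 1.0; });

      // Meanwhile the host threads advance the ion work.
      Kokkos::parallel_for("ion_work",
          Kokkos::RangePolicy<Kokkos::DefaultHostExecutionSpace>(0, n_ions),
          [=](const int i) { i_work(i) += 1.0; });

      // Synchronize only where the results must be combined.
      Kokkos::fence();
    }
    return 0;
  }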

Kernel execution

The Fortran main program calls a Kokkos C++ function, which in turn calls a Fortran kernel using ISO C binding:

  void electron_push_dispatch(int N) {        // called by the Fortran main program
    auto push_lambda = KOKKOS_LAMBDA(const int idx) {
      electron_push(idx);                     // call to the Fortran kernel
    };
    Kokkos::parallel_for(range_policy, push_lambda);
  }

This interface and others were shortened to convenient macros:

  KOKKOS_OP(electron_push_dispatch, electron_push)
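For reference, a KOKKOS_OP-style macro could plausibly expand along the lines below. This is a hypothetical sketch, not the actual macro from the ECP-copa/Cabana repository; it assumes the Fortran kernel is made interoperable with bind(C), and for GPU builds the Fortran kernel must itself be compiled as device code (e.g., with CUDA Fortran) and declared accordingly.

  #include <Kokkos_Core.hpp>

  using ExecutionSpace = Kokkos::DefaultExecutionSpace;

  // Declares the Fortran kernel (a bind(C) subroutine taking an index by value)
  // and defines a C-callable dispatch routine the Fortran main program can invoke.
  #define KOKKOS_OP(dispatch_name, kernel_name)                           \
    extern "C" void kernel_name(int idx);                                 \
    extern "C" void dispatch_name(int n)                                  \
    {                                                                     \
      Kokkos::parallel_for(#dispatch_name,                                \
          Kokkos::RangePolicy<ExecutionSpace>(0, n),                      \
          KOKKOS_LAMBDA(const int idx) { kernel_name(idx); });            \
    }

  KOKKOS_OP(electron_push_dispatch, electron_push)

Invoking the macro generates electron_push_dispatch, which the Fortran main program then calls through its own ISO_C_BINDING interface block.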

Data layout and transfer

The particles use Cabana's flexible Array-of-Structures-of-Arrays (AoSoA) data layout. On GPU, the parallel_for loops over particles; on CPU, it loops over vectors:

  N = USE_GPU ? N_PTL : N_PTL / VEC_LEN;
  Kokkos::RangePolicy<ExecutionSpace> range_policy(0, N);
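The host-only sketch below shows, under stated assumptions, what such an AoSoA container and the two loop granularities can look like with Cabana and Kokkos; the field list, VEC_LEN, and particle count are illustrative rather than XGC's actual layout.

  #include <Cabana_Core.hpp>
  #include <Kokkos_Core.hpp>

  int main(int argc, char* argv[]) {
    Kokkos::ScopeGuard guard(argc, argv);
    {
      constexpr int VEC_LEN = 16;    // inner structure-of-arrays (vector) length
      constexpr int N_PTL   = 1024;  // number of particles
      using Fields = Cabana::MemberTypes<double[3], double>;  // e.g. position, weight
      using HostDevice = Kokkos::Device<Kokkos::DefaultHostExecutionSpace, Kokkos::HostSpace>;

      Cabana::AoSoA<Fields, HostDevice, VEC_LEN> particles("particles", N_PTL);
      auto x = Cabana::slice<0>(particles);  // View-like access to one field

      // GPU-style traversal: one work item per particle (N = N_PTL).
      Kokkos::parallel_for("per_particle",
          Kokkos::RangePolicy<Kokkos::DefaultHostExecutionSpace>(0, N_PTL),
          [=](const int i) { x(i, 0) += 1.0; });

      // CPU-style traversal: one work item per SoA chunk (N = N_PTL / VEC_LEN),
      // with an inner loop over the chunk that the compiler can vectorize.
      Kokkos::parallel_for("per_vector",
          Kokkos::RangePolicy<Kokkos::DefaultHostExecutionSpace>(0, N_PTL / VEC_LEN),
          [=](const int s) {
            for (int j = 0; j < VEC_LEN; ++j)
              x(s * VEC_LEN + j, 0) += 1.0;
          });
      Kokkos::fence();
    }
    return 0;
  }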

Host-device data transfer is still handled with CUDA Fortran. Data is shared between the main program and the kernel via shared Fortran modules rather than passed arguments.

Figure credit: Steve Abbott