Particle-Particle Particle-Mesh (P3M) on Knights Landing Processors
William McDoniel Ahmed E. Ismail Paolo Bientinesi
SIAM CSE '17, Atlanta
Thanks to: Klaus-Dieter Oertel, Georg Zitzlsberger, and Mike Brown
Funded as part of an Intel Parallel Computing Center
Intermolecular Forces
The forces on atoms are commonly taken to be the result of independent pair-wise interactions.
Lennard-Jones potential:
$$\Phi_{LJ} = \sum_{i<j} 4\epsilon\left[\left(\frac{\sigma}{r_{ij}}\right)^{12} - \left(\frac{\sigma}{r_{ij}}\right)^{6}\right]$$
where the force on an atom is given by:
$$\vec{F} = -\nabla\Phi$$
But long-range forces can be important!
The electrical potential only decreases as 1/r and doesn’t perfectly cancel for polar molecules.
Interfaces can also create asymmetries that inhibit cancellation.
[Figure: Lennard-Jones potential, V/ε versus r/σ, showing the repulsive and attractive regions and the cutoff r_c.]
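As a minimal sketch of the pair-wise picture on this slide, the following Python evaluates the Lennard-Jones potential, the corresponding radial force, and a truncated pair sum. The function names, the reduced units (ε = σ = 1 by default), and the `r_cut` parameter standing in for the cutoff r_c are assumptions for illustration, not code from LAMMPS.

```python
import math


def lj_potential(r, epsilon=1.0, sigma=1.0):
    """Lennard-Jones pair potential 4*eps*[(sigma/r)^12 - (sigma/r)^6]."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def lj_force(r, epsilon=1.0, sigma=1.0):
    """Radial force -dPhi/dr; positive values are repulsive."""
    sr6 = (sigma / r) ** 6
    return 24.0 * epsilon * (2.0 * sr6 * sr6 - sr6) / r


def truncated_lj_energy(positions, r_cut, epsilon=1.0, sigma=1.0):
    """Sum the pair potential over all pairs i < j closer than r_cut."""
    total = 0.0
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            r = math.dist(positions[i], positions[j])
            if r < r_cut:  # pairs beyond the cutoff are simply ignored
                total += lj_potential(r, epsilon, sigma)
    return total
```

Everything beyond `r_cut` is dropped here, which is acceptable for the rapidly decaying Lennard-Jones terms but not for a 1/r potential; that is the gap PPPM fills.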
Particle-Particle Particle-Mesh
PPPM¹ approximates long-range forces without requiring pair-wise calculations.
Four Steps:
1. Determine the charge distribution ρ by mapping particle charges to a grid.
2. Take the Fourier transform of the charge distribution to find the potential: $\nabla^2\Phi = -\rho/\epsilon_0$
3. Obtain forces due to all interactions as the gradient of Φ by inverse Fourier transform: $\vec{F} = -\nabla\Phi$
4. Map forces back to the particles.
¹ Hockney and Eastwood, 1988
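The four steps can be sketched in one dimension with a periodic box. This is an illustrative toy, not the LAMMPS implementation: it assumes a two-point linear stencil, a naive O(n²) DFT in place of an FFT library, and a uniform neutralizing background (the k = 0 mode is dropped). All function names and the `eps0` parameter are hypothetical.

```python
import cmath
import math


def dft(values, inverse=False):
    """Naive O(n^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(values)
    sign = 1.0 if inverse else -1.0
    out = []
    for m in range(n):
        acc = 0j
        for j, v in enumerate(values):
            acc += v * cmath.exp(sign * 2j * math.pi * m * j / n)
        out.append(acc / n if inverse else acc)
    return out


def solve_field(rho, box_length, eps0=1.0):
    """Steps 2 and 3: solve Poisson's equation in k-space, return E = -dPhi/dx."""
    n = len(rho)
    rho_k = dft(rho)
    field_k = [0j] * n
    for m in range(1, n):  # m = 0 skipped: assumes a neutralizing background
        if 2 * m == n:
            continue  # drop the Nyquist mode so the result stays real
        freq = m if m < n / 2 else m - n
        k = 2.0 * math.pi * freq / box_length
        phi_k = rho_k[m] / (eps0 * k * k)  # from -k^2 Phi_k = -rho_k / eps0
        field_k[m] = -1j * k * phi_k       # E_k = -i k Phi_k
    return [z.real for z in dft(field_k, inverse=True)]


def pppm_forces_1d(positions, charges, n_grid, box_length, eps0=1.0):
    h = box_length / n_grid
    rho = [0.0] * n_grid
    stencil = []
    # Step 1: map each charge onto its two neighboring grid points.
    for x, q in zip(positions, charges):
        s = (x % box_length) / h
        left = int(s) % n_grid
        frac = s - int(s)
        right = (left + 1) % n_grid
        rho[left] += q * (1.0 - frac) / h
        rho[right] += q * frac / h
        stencil.append((left, right, frac))
    field = solve_field(rho, box_length, eps0)
    # Step 4: interpolate the grid field back with the same weights.
    return [q * ((1.0 - f) * field[l] + f * field[r])
            for q, (l, r, f) in zip(charges, stencil)]
```

Using the same weights for mapping charges and for interpolating the field makes the forces on a pair of particles equal and opposite, so the sketch conserves momentum up to round-off.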
Profiling LAMMPS
We use the USER-OMP implementation of LAMMPS as a baseline.
Typically: rc is 6 angstroms, relative error is 0.0001, and stencil size is 5.
The work in FFTs increases rapidly at low cutoffs.
The non-FFT work in PPPM is insensitive to grid size.
Sometimes the FFTs take surprisingly long.
Water benchmark: 40.5k atoms, 884k FFT grid points
Charge Mapping
Stencil coefficients are polynomials of order stencil size.
3 × [stencil size] of them are computed.
Loop over cubic stencil and contribute to grid points
Loop over atoms in MPI rank
USER-OMP Implementation
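The loop structure described above (per-atom stencil coefficients, then a loop over the cubic stencil) might look like the following sketch. It is not the USER-OMP code: it assumes a stencil size of 3 with the standard triangular-shaped-cloud polynomials, a periodic cubic grid, and hypothetical names such as `TSC_COEFF`, `stencil_weights`, and `map_charges`.

```python
# Polynomial coefficients (constant, linear, quadratic) in the offset d from
# the nearest grid point, d in [-0.5, 0.5]. One polynomial per stencil point.
TSC_COEFF = [
    (0.125, -0.5, 0.5),   # grid point to the left
    (0.75, 0.0, -1.0),    # nearest grid point
    (0.125, 0.5, 0.5),    # grid point to the right
]


def stencil_weights(d, coeff=TSC_COEFF):
    """Evaluate every stencil polynomial at offset d using Horner's rule."""
    weights = []
    for poly in coeff:
        acc = 0.0
        for c in reversed(poly):
            acc = acc * d + c
        weights.append(acc)
    return weights


def map_charges(positions, charges, n, h):
    """Deposit charges onto a periodic n x n x n grid with spacing h."""
    grid = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    half = len(TSC_COEFF) // 2
    for pos, q in zip(positions, charges):          # loop over atoms
        nearest = [round(c / h) for c in pos]
        # 3 x [stencil size] coefficients: one set per dimension.
        wx, wy, wz = [stencil_weights(c / h - m) for c, m in zip(pos, nearest)]
        for i, wi in enumerate(wx):                 # loop over cubic stencil
            gi = (nearest[0] + i - half) % n
            for j, wj in enumerate(wy):
                gj = (nearest[1] + j - half) % n
                for k, wk in enumerate(wz):
                    gk = (nearest[2] + k - half) % n
                    grid[gi][gj][gk] += q * wi * wj * wk
    return grid
```

Because the weights in each dimension sum to one, the total charge on the grid equals the total particle charge, which is a convenient sanity check for any stencil.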
Charge Mapping
Our Implementation
Thread over atoms
#pragma simd for coefficients
USER-OMP Implementation
Charge Mapping
Innermost loop vectorized with bigger stencil.
Private grids prevent race conditions.
Our Implementation
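The private-grid idea can be illustrated with a small sketch: each thread deposits its share of the atoms onto its own grid, and the grids are summed afterwards, so no two threads ever write to the same memory. This is a structural illustration only, assuming a 1D periodic grid with a two-point stencil and positions given in units of the grid spacing; the function names are hypothetical, and Python threads will not show the speedup that OpenMP threads would.

```python
from concurrent.futures import ThreadPoolExecutor


def deposit(atoms, n_grid):
    """Deposit (position, charge) pairs onto a fresh grid owned by the caller."""
    grid = [0.0] * n_grid
    for x, q in atoms:
        s = x % n_grid                 # position in units of grid spacing
        left = int(s) % n_grid
        frac = s - int(s)
        grid[left] += q * (1.0 - frac)
        grid[(left + 1) % n_grid] += q * frac
    return grid


def deposit_threaded(atoms, n_grid, n_threads=4):
    """Thread over atoms; each thread writes only to its private grid."""
    chunks = [atoms[t::n_threads] for t in range(n_threads)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        private = list(pool.map(lambda chunk: deposit(chunk, n_grid), chunks))
    # Reduction: sum the private grids point by point.
    return [sum(cell) for cell in zip(*private)]
```

The cost of this design is the extra memory for one grid per thread plus the final reduction, which is the trade made to avoid atomic updates or locks in the stencil loop.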
Distributing Forces ik
Very similar to charge mapping:
Computes stencil coefficients
Loops over stencil points
More work and accesses more memory
#pragma simd around atom loop
Update 3 force components
Water benchmark: 40.5k atoms, 884k FFT grid points
Distributing Forces ik: Inner SIMD
10% faster with tripcount instead of 7
50% faster with 8 instead of 7
Reduction of force component arrays
16-iteration loops are faster on KNL, even with extra 0s
Repacking vdx and vdy into vdxy, vdz into vdz0 (done outside atom loop)
3 vector operations instead of 4: 60% faster
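One way to read the padding and repacking described above is sketched below for a single row of the stencil. A 7-point row is padded with zeros to 8, and the three field components are packed into two 16-entry arrays so that each inner loop has 16 iterations. The exact memory layout used in the talk is not given, so the concatenated layout here (x then y in `vdxy`, z then zeros in `vdz0`) is an assumption, as are the function names. In a real code the repacking would be hoisted outside the atom loop, as the slide notes.

```python
def pad(values, width):
    """Append zeros so that the row has exactly `width` entries."""
    return list(values) + [0.0] * (width - len(values))


def field_unpacked(weights, vdx, vdy, vdz):
    """Reference version: three separate loops over the 7-point row."""
    ex = sum(w * v for w, v in zip(weights, vdx))
    ey = sum(w * v for w, v in zip(weights, vdy))
    ez = sum(w * v for w, v in zip(weights, vdz))
    return ex, ey, ez


def field_packed(weights, vdx, vdy, vdz, width=8):
    """Padded and repacked version: two loops of 2 * width iterations."""
    w2 = pad(weights, width) * 2                  # weights repeated for both halves
    vdxy = pad(vdx, width) + pad(vdy, width)      # x in lower half, y in upper half
    vdz0 = pad(vdz, width) + [0.0] * width        # z in lower half, zeros above
    acc_xy = [w2[l] * vdxy[l] for l in range(2 * width)]
    acc_z0 = [w2[l] * vdz0[l] for l in range(2 * width)]
    ex = sum(acc_xy[:width])
    ey = sum(acc_xy[width:])
    ez = sum(acc_z0[:width])
    return ex, ey, ez
```

The padded zeros contribute nothing to the sums, so both versions return the same field components; the packed version simply presents the compiler with loops whose trip count matches the vector width.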
Distributing Forces ad
Different “flavors” of PPPM have the same overall structure
6 coefficients are computed for each stencil point
Only one set of grid values is used to compute every component of the potential by choosing different combinations of coefficients
Work is done after the stencil loop to convert potential for each atom into force
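The structure of the ad flavor can be sketched as follows: per dimension, each stencil index gets a weight and a derivative weight (six coefficient sets in total), a single grid of potential values is read once, and the three components are formed from different combinations of those coefficients. The conversion to a force happens after the stencil loop. This sketch assumes a stencil size of 3 with triangular-shaped-cloud polynomials and omits any further corrections a production code may apply; `TSC`, `weights_and_derivatives`, and `force_ad` are hypothetical names.

```python
# (constant, linear, quadratic) coefficients in the offset d, one per stencil point.
TSC = [(0.125, -0.5, 0.5), (0.75, 0.0, -1.0), (0.125, 0.5, 0.5)]


def weights_and_derivatives(d):
    """Return stencil weights and their derivatives with respect to d."""
    w = [c0 + c1 * d + c2 * d * d for c0, c1, c2 in TSC]
    dw = [c1 + 2.0 * c2 * d for _, c1, c2 in TSC]
    return w, dw


def force_ad(pos, q, phi, h):
    """Force on one atom from a periodic grid of potential values phi."""
    n = len(phi)
    nearest = [round(c / h) for c in pos]
    (wx, dwx), (wy, dwy), (wz, dwz) = [
        weights_and_derivatives(c / h - m) for c, m in zip(pos, nearest)
    ]
    gx = gy = gz = 0.0
    for i in range(3):
        gi = (nearest[0] + i - 1) % n
        for j in range(3):
            gj = (nearest[1] + j - 1) % n
            for k in range(3):
                gk = (nearest[2] + k - 1) % n
                p = phi[gi][gj][gk]            # one grid value, used three times
                gx += dwx[i] * wy[j] * wz[k] * p
                gy += wx[i] * dwy[j] * wz[k] * p
                gz += wx[i] * wy[j] * dwz[k] * p
    # Work after the stencil loop: convert the gradient into a force.
    return (-q * gx / h, -q * gy / h, -q * gz / h)
```

For a potential that varies linearly in space, this interpolation reproduces the gradient exactly, which makes a simple correctness check.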
Subroutine Speedup
Overall Speedup (1 core / 1 thread)
Together with optimization of the pair interactions (by Mike Brown of Intel), we achieve overall speedups of 2-3x.
PPPM speedup shifts the optimal cutoff lower, while pair interaction speedup shifts it higher.
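The direction of these shifts can be seen in a toy cost model. The functional form is an assumption made purely for illustration, not a fit to the measurements in the talk: pair work is taken to grow as rc³, and mesh work is taken to grow as 1/rc³ as the cutoff shrinks. The constants `pair_cost` and `mesh_cost` are hypothetical.

```python
def runtime(rc, pair_cost, mesh_cost):
    """Toy model: pair work grows with rc^3, mesh work with 1 / rc^3."""
    return pair_cost * rc ** 3 + mesh_cost / rc ** 3


def optimal_cutoff(pair_cost, mesh_cost):
    """Cutoff minimizing the toy runtime: d/drc = 0 gives rc^6 = mesh / pair."""
    return (mesh_cost / pair_cost) ** (1.0 / 6.0)


baseline = optimal_cutoff(pair_cost=1.0, mesh_cost=64.0)
faster_pppm = optimal_cutoff(pair_cost=1.0, mesh_cost=32.0)   # mesh work halved
faster_pair = optimal_cutoff(pair_cost=0.5, mesh_cost=64.0)   # pair work halved
```

In this model, making the mesh part cheaper lowers the optimal cutoff and making the pair part cheaper raises it, matching the qualitative statement above.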
Overall Speedup (parallel)
In parallel, the FFTs become more expensive: other than occasionally communicating atoms moving through the domain, they are the only communication.
The runtime-optimal cutoff rises and work should be shifted into pair interactions.
If we choose a cutoff based on few processors, scalability is very bad!
Scalability worsens across multiple nodes (64 to 128 cores).
We end up with worse overall scaling but better real performance because everything except the FFTs is much faster.
You can pick cutoffs that make scalability look good but this is misleading.
A better-scaling method for solving Poisson’s Equation is needed (MSM?).