Improving PME on distributed computer systems
Transcript of Improving PME on distributed computer systems
Carsten Kutzner
Max-Planck-Institut für Biophysikalische Chemie, Göttingen
Theoretical and Computational Biophysics Department
Outline
how to get optimum performance from MD simulations with PME
Recap: Particle-Mesh-Ewald (PME)
parallel PME in GROMACS
understanding the PME parameters
tuning the performance of an example benchmark
Recap: Particle-Mesh-Ewald (PME)
‣ N atoms at r_i with charges q_i in a neutral box (PBC)
‣ electrostatic potential energy (conditionally convergent, s-l-o-w):

  U = (1/2) Σ_{i,j=1..N} Σ'_{n ∈ Z³} q_i q_j / |r_ij + n L|   (1)

‣ Coulomb: strong variation at small r, smooth at large r
‣ Ewald summation: split up U with

  1/r = f(r)/r + (1 − f(r))/r,   choosing f(r) = erfc(βr)  (β: Ewald splitting parameter)

  - f(r)/r ≈ 0 beyond the cutoff r_c → short range (SR), summed directly
  - (1 − f(r))/r is a slowly varying function that needs only a few wave vectors in Fourier space → long range (LR)
‣ the LR part is evaluated on a discrete mesh (DFT / FFT)
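The splitting can be checked numerically. A minimal sketch (the β value is illustrative, not a GROMACS default): the erfc term vanishes beyond the cutoff, the erf term stays smooth.

```python
# Numerical check of the Ewald splitting: 1/r = erfc(beta*r)/r + erf(beta*r)/r.
# The erfc term decays fast (short range); the erf term is smooth (long range).
# beta = 3.0 nm^-1 is just an illustrative splitting parameter.
from math import erf, erfc

beta = 3.0
for r in (0.2, 0.5, 1.0, 2.0):           # distances in nm
    sr = erfc(beta * r) / r              # short-range part, summed directly
    lr = erf(beta * r) / r               # long-range part, handled in Fourier space
    assert abs(sr + lr - 1.0 / r) < 1e-12
    print(f"r={r:.1f}  SR={sr:.3e}  LR={lr:.3e}")
```

At r = 2 nm the SR term is already below 1e-17, which is why it can be cut off at r_c.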
PME time step
1. the SR part can be calculated directly
2. the LR part needs the Fourier-transformed charge density:
   a. spread the charges on a grid: each charge is spread over pme_order grid points in x, y, z
   b. FFT to reciprocal space
   c. solve PME
   d. FFT back yields the potential on the grid points
   e. interpolate the forces on the particles from the grid with splines of order pme_order
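The mesh loop above can be sketched in 1D. This is not GROMACS code: it uses order-2 (linear) charge spreading for brevity where GROMACS defaults to pme_order = 4, and a plain 1/k² Green's function instead of the PME influence function.

```python
# Minimal 1D sketch of the mesh part: spread charges on a grid, FFT,
# solve in reciprocal space, FFT back to get the potential on the grid.
import numpy as np

def mesh_potential(positions, charges, L=10.0, ngrid=64):
    h = L / ngrid
    rho = np.zeros(ngrid)
    # spread each charge onto its 2 nearest grid points (linear weights)
    for x, q in zip(positions, charges):
        u = x / h
        i0 = int(np.floor(u))
        f = u - i0
        rho[i0 % ngrid] += q * (1.0 - f) / h
        rho[(i0 + 1) % ngrid] += q * f / h
    # FFT to reciprocal space
    rho_k = np.fft.fft(rho)
    # "solve": multiply by the periodic Green's function 1/k^2 (k != 0)
    k = 2.0 * np.pi * np.fft.fftfreq(ngrid, d=h)
    phi_k = np.zeros_like(rho_k)
    phi_k[1:] = rho_k[1:] / k[1:] ** 2
    # FFT back yields the potential on the grid points
    return np.fft.ifft(phi_k).real

phi = mesh_potential([2.0, 7.0], [1.0, -1.0])
```

The potential peaks at the positive charge and dips at the negative one; force interpolation back to the particles (step e) is omitted here.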
Parallel PME time step
‣ particle redistribution: xyz-domains → x-slabs
‣ communicate grid overlaps
‣ parallel data transpose within the FFT needs all-to-all communication (2x: to and from reciprocal space)
‣ force redistribution: x-slabs → domains

[figure: PME grid decomposed into x-slabs 0–5, one per processor]
Parallel PME: scaling bottlenecks
‣ all-to-all communication becomes increasingly expensive: N processors exchange N·(N−1) messages per transpose
‣ all-to-all might overload the network; MPI_Alltoall performance = f(N, MPI lib, network type, ...)
‣ the PME grid cannot always be distributed evenly among all processors: 90 points → 64 CPUs?

[figure: all-to-all pattern for N = 3 and N = 6, N·(N−1) messages]
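The uneven-grid problem can be made concrete: distributing 90 grid rows over 64 CPUs as evenly as possible still leaves some CPUs with twice the rows of others.

```python
# 90 PME grid rows over 64 CPUs: the most even distribution still gives
# 26 CPUs two rows and 38 CPUs one row, i.e. ~1.4x mesh load imbalance.
points, cpus = 90, 64
rows = [points // cpus + (1 if r < points % cpus else 0) for r in range(cpus)]
assert sum(rows) == points
imbalance = max(rows) / (points / cpus)   # busiest CPU vs. perfect average
print(max(rows), min(rows), round(imbalance, 2))
```

The CPUs holding two rows set the pace for every mesh step, so the extra hardware beyond the imbalance point is largely wasted.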
Multiple-Program Multiple-Data PME
‣ direct and reciprocal space contributions can be computed independently
‣ assign a subset of the processors to reciprocal space with -npme:
  mpirun -np 5 mdrun -npme 2
‣ PP (particle-particle) processors compute the direct space contribution, PME processors the Fourier space contribution; the PP processors send coordinates & charges to the PME processors and receive forces back

[figure: CPUs 1–3 as PP processors, CPUs 4–5 as PME processors]

+ MPI_Alltoall runs on only 1/4 to 1/2 of the processors, reducing the communication to 1/16 to 1/4
+ better distribution of the PME grid onto the processors
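The 1/16 to 1/4 figure follows from the N·(N−1) ≈ N² message count: shrinking the all-to-all group shrinks the message count quadratically. A quick check:

```python
# All-to-all among N ranks needs N*(N-1) messages; restricting it to the
# PME subset therefore reduces the count roughly quadratically.
def alltoall_msgs(n):
    return n * (n - 1)

total = 64
for pme_frac in (0.25, 0.5):
    npme = int(total * pme_frac)
    ratio = alltoall_msgs(npme) / alltoall_msgs(total)
    print(f"{npme}/{total} PME ranks -> {ratio:.3f} of the messages")
```

With 1/4 of the ranks doing PME the transpose uses about 6% of the messages of the all-ranks case, slightly better than the quoted 1/16 because of the −1 term.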
‣ today typically 2, 4, or 8 CPUs share a network interconnect
‣ PP/PME node interleaving: higher bandwidth on multi-CPU nodes, and the LR processors can use all available network bandwidth
Multiple-Program Multiple-Data PME

mpirun -np 8 mdrun -npme 4 -ddorder pp_pme
mpirun -np 8 mdrun -npme 4 -ddorder interleave

[figure: node placement behind the switch with -ddorder pp_pme (PP and PME ranks on separate nodes) vs. -ddorder interleave (PP and PME ranks alternating across nodes)]
Knobs to turn

same accuracy:
‣ node separation yes/no, # of PME nodes
‣ node order: pp_pme, interleave, ...
‣ DD grid, dlb yes/no
‣ compiler & MPI library!
‣ shift real/reciprocal load at the same accuracy: multiply both rc and fourierspacing by the same factor

affects accuracy:
‣ PME grid points
‣ cutoff radius
‣ PME order 4, 6, 8, ... but we don’t want to enlarge the grid overlap in the parallel case!

[figure: speedup vs. number of processors for MVAPICH 1.0.0, OpenMPI 1.2.5, and IMPI 3.1: the MPI library alone changes the scaling]
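The load-shifting rule can be sketched numerically; the factors below match settings A/B/C of the benchmark later in the talk (grid dimensions are then rounded to FFT-friendly sizes, which is presumably why C ends up at 0.160 nm rather than exactly 1.20 × 0.135).

```python
# Scaling both rc and the PME grid spacing by the same factor keeps the
# Ewald accuracy roughly constant while moving work from the mesh (LR)
# to the direct sum (SR). Base values are setting A of the benchmark.
base_rc, base_spacing = 1.0, 0.135   # nm
for factor in (1.00, 1.07, 1.20):
    rc = base_rc * factor
    spacing = base_spacing * factor
    print(f"factor {factor:.2f}: rc = {rc:.2f} nm, spacing = {spacing:.3f} nm")
```

A larger rc means more direct-space pair interactions per PP rank; a coarser grid means a smaller FFT, so the ratio of PP to PME work can be tuned without changing the accuracy.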
Example benchmark system
‣ SBMV in water, 4.5 M particles
‣ cutoffs 1 nm
‣ PME order 4, grid spacing 0.135 nm (275^3 grid)
‣ Intel 2.66 GHz, 8 cores / node
‣ InfiniBand interconnect
Easy GROMACS benchmarking
‣ performance measurement in GROMACS: built-in timers
‣ e.g. the ns/day output in the log file
‣ make a “short” test tpr ... but not too short!
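The ns/day figure follows directly from the time step and the wall time per step; a quick conversion (the 2 fs time step is assumed for illustration, not taken from the benchmark):

```python
# Convert wall-clock seconds per MD step into the ns/day performance
# figure that the GROMACS log reports. A 2 fs time step is assumed.
def ns_per_day(dt_fs, wall_s_per_step):
    ps_per_s = (dt_fs / 1000.0) / wall_s_per_step   # simulated ps per wall second
    return ps_per_s * 86400 / 1000.0                # convert to ns per day

print(round(ns_per_day(2.0, 0.05), 2))  # -> 3.46
```

This is also why a benchmark must average over enough steps: a single slow step at 0.06 s instead of 0.05 s would change the reported figure by almost 20%.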
[figure: time for 10 steps (normalized to t0) vs. starting time step, for 8 CPUs (DD 8x1x1) up to 128 CPUs (DD 8x4x4); SBMV benchmark, steps 1–10, 2–20, ... (step 0 omitted)]
Performance tuning example

mpirun -np N mdrun

[figure: speedup and total CPU time per step vs. number of processors for the automatic settings (“gmx-guess”); points labeled with the chosen DD grid + npme: 8x1x1, 4x3x1+4, 8x3x1+8, 16x3x1+16, 16x3x2+32, 4x5x5+92]
Performance tuning example

mpirun -np N mdrun -v

[figure: same speedup / CPU-time-per-step plots; with -v, mdrun reports “load imbalance force: 5.4 %”]
Performance tuning example

mpirun -np N mdrun -dlb yes

[figure: with dynamic load balancing (-dlb yes) the reported force load imbalance drops to 0.1 %]
Performance tuning example

[figure: at higher processor counts a PME mesh/force load imbalance is reported, growing with the processor count: 1.2, 1.7, 2.0, 2.0, 1.9]
Performance tuning example

mpirun -np N mdrun -npme 0

[figure: additional curve without PP/PME splitting (-npme 0), which removes the PME mesh/force imbalance]
Performance tuning example

‣ tune the number of PME nodes: gmx-guess chose 92, 32, 16, 8, 4; optimized values are 96, 48, 28, 0, 0
‣ remaining PME mesh/force imbalance: 1.1
‣ optimized DD grid: 8x2x2 (auto) → 8x4x1 (opt)

[figure: speedup and total CPU time per step for gmx-guess, no PP/PME splitting, and optimized settings]
Shifting load to real space

      rc / nm   grid spacing / nm   grid dimensions
  A   1.00      0.135               275^3
  B   1.07      0.144               256^3
  C   1.20      0.160               231^3

[figure: total CPU time per step vs. PME mesh spacing for 8 and 192 processors; labels give the number of processors taking part in the LR calculation (96, 88, 64, 48, ...)]
[figure: speedup and total CPU time per step for settings A, B, and C (real/reciprocal load shifted), with the gmx-guess curve for reference]
Performance tuning example

[figure: final comparison of speedup and total CPU time per step; percentage labels: gmx-guess 75–78 %, optimized settings 95–101 %]
Tuning Howto
‣ make a short (>>100 steps) benchmark tpr, run with -v
‣ start with the automatic settings, then apply the performance hints
‣ few CPUs:
  - turn on dlb if there is a load imbalance in the force
  - turn PME nodes on/off
‣ many CPUs:
  - adjust the # of PME nodes if a PME mesh/force imbalance is reported
  - adjust the real/reciprocal space workload by multiplying both rc and fourierspacing by the same factor near 1
‣ you might want to try other DD grids:
  - a high # of x-slabs mimics the data organization for PME
  - it can be favourable NOT to decompose in one direction (no x/f communication pulse is needed in that direction)
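The candidate DD grids from the last bullet are easy to enumerate; a sketch (plain factorization only, ignoring GROMACS's own minimum-cell-size constraints):

```python
# List all x*y*z decompositions of npp PP ranks. Grids with z == 1
# (or y == z == 1) skip the communication pulse in the undivided
# direction(s), as noted above.
def dd_grids(npp):
    grids = []
    for x in range(1, npp + 1):
        if npp % x:
            continue
        for y in range(1, npp // x + 1):
            if (npp // x) % y == 0:
                grids.append((x, y, npp // (x * y)))
    return grids

print(dd_grids(12))
```

For 12 PP ranks this yields 18 ordered grids, among them 12x1x1 (slab-like, PME-friendly) and 4x3x1 (no decomposition in z), which is the kind of grid the benchmark above ended up preferring.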
Acknowledgments
‣ The GROMACS developers
‣ David van der Spoel
‣ Erik Lindahl
‣ Berk Hess
‣ The Theoretical and Computational Biophysics Department
‣ Mareike Zink
‣ Gerrit Groenhof
‣ Bert de Groot
‣ Helmut Grubmüller