Improving PME on distributed computer systems
Transcript of Improving PME on distributed computer systems
Carsten Kutzner
Max-Planck-Institut für Biophysikalische Chemie, Göttingen
Theoretical and Computational Biophysics Department
Outline
how to get optimum performance from MD simulations with PME
Recap: Particle-Mesh-Ewald (PME)
parallel PME in GROMACS
understanding the PME parameters
tuning the performance of an example benchmark
Recap: Particle-Mesh-Ewald (PME)
‣ N atoms at r_i with charges q_i in a neutral box (PBC)
‣ electrostatic potential energy (conditionally convergent, s-l-o-w):

  U = (1/2) Σ_{i,j=1..N} Σ'_{n ∈ Z³} q_i q_j / |r_ij + n L|   (1)

‣ Coulomb: strong variation at small r, smooth at large r
‣ Ewald summation: split up U with

  1/r = f(r)/r + (1 − f(r))/r,   choosing f(r) = erfc(βr)  (β: Ewald splitting parameter)

  - f(r)/r ≈ 0 beyond the cutoff r_c → short range (SR), summed directly
  - (1 − f(r))/r is a slowly varying function that needs only a few wave vectors in Fourier space → long range (LR)
‣ the LR part is evaluated on a discrete mesh (DFT / FFT)
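The splitting can be checked numerically. A minimal sketch (the β value is illustrative, not a GROMACS default): the erfc term vanishes beyond the cutoff, the erf term stays smooth.

```python
# Numerical check of the Ewald splitting: 1/r = erfc(beta*r)/r + erf(beta*r)/r.
# The erfc term decays fast (short range); the erf term is smooth (long range).
# beta = 3.0 nm^-1 is just an illustrative splitting parameter.
from math import erf, erfc

beta = 3.0
for r in (0.2, 0.5, 1.0, 2.0):           # distances in nm
    sr = erfc(beta * r) / r              # short-range part, summed directly
    lr = erf(beta * r) / r               # long-range part, handled in Fourier space
    assert abs(sr + lr - 1.0 / r) < 1e-12
    print(f"r={r:.1f}  SR={sr:.3e}  LR={lr:.3e}")
```

At r = 2 nm the SR term is already below 1e-17, which is why it can be cut off at r_c.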
PME time step
1. the SR part can be calculated directly
2. the LR part needs the Fourier-transformed charge density:
   a. spread the charges on a grid: each charge is spread over pme_order grid points in x, y, z
   b. FFT to reciprocal space
   c. solve PME
   d. FFT back yields the potential on the grid points
   e. interpolate the forces on the particles from the grid with splines of order pme_order
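The mesh loop above can be sketched in 1D. This is not GROMACS code: it uses order-2 (linear) charge spreading for brevity where GROMACS defaults to pme_order = 4, and a plain 1/k² Green's function instead of the PME influence function.

```python
# Minimal 1D sketch of the mesh part: spread charges on a grid, FFT,
# solve in reciprocal space, FFT back to get the potential on the grid.
import numpy as np

def mesh_potential(positions, charges, L=10.0, ngrid=64):
    h = L / ngrid
    rho = np.zeros(ngrid)
    # spread each charge onto its 2 nearest grid points (linear weights)
    for x, q in zip(positions, charges):
        u = x / h
        i0 = int(np.floor(u))
        f = u - i0
        rho[i0 % ngrid] += q * (1.0 - f) / h
        rho[(i0 + 1) % ngrid] += q * f / h
    # FFT to reciprocal space
    rho_k = np.fft.fft(rho)
    # "solve": multiply by the periodic Green's function 1/k^2 (k != 0)
    k = 2.0 * np.pi * np.fft.fftfreq(ngrid, d=h)
    phi_k = np.zeros_like(rho_k)
    phi_k[1:] = rho_k[1:] / k[1:] ** 2
    # FFT back yields the potential on the grid points
    return np.fft.ifft(phi_k).real

phi = mesh_potential([2.0, 7.0], [1.0, -1.0])
```

The potential peaks at the positive charge and dips at the negative one; force interpolation back to the particles (step e) is omitted here.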
Parallel PME time step
‣ particle redistribution: xyz-domains → x-slabs
‣ communicate grid overlaps
‣ parallel data transpose within the FFT needs all-to-all communication (2x: to and from reciprocal space)
‣ force redistribution: x-slabs → domains

[figure: PME grid decomposed into x-slabs 0–5, one per processor]
Parallel PME: scaling bottlenecks
‣ all-to-all communication becomes increasingly expensive: N processors exchange N·(N−1) messages per transpose
‣ all-to-all might overload the network; MPI_Alltoall performance = f(N, MPI lib, network type, ...)
‣ the PME grid cannot always be distributed evenly among all processors: 90 points → 64 CPUs?

[figure: all-to-all pattern for N = 3 and N = 6, N·(N−1) messages]
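The uneven-grid problem can be made concrete: distributing 90 grid rows over 64 CPUs as evenly as possible still leaves some CPUs with twice the rows of others.

```python
# 90 PME grid rows over 64 CPUs: the most even distribution still gives
# 26 CPUs two rows and 38 CPUs one row, i.e. ~1.4x mesh load imbalance.
points, cpus = 90, 64
rows = [points // cpus + (1 if r < points % cpus else 0) for r in range(cpus)]
assert sum(rows) == points
imbalance = max(rows) / (points / cpus)   # busiest CPU vs. perfect average
print(max(rows), min(rows), round(imbalance, 2))
```

The CPUs holding two rows set the pace for every mesh step, so the extra hardware beyond the imbalance point is largely wasted.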
Multiple-Program Multiple-Data PME
‣ direct and reciprocal space contributions can be computed independently
‣ assign a subset of the processors to reciprocal space with -npme:
  mpirun -np 5 mdrun -npme 2
‣ PP (particle-particle) processors compute the direct space contribution, PME processors the Fourier space contribution; the PP processors send coordinates & charges to the PME processors and receive forces back

[figure: CPUs 1–3 as PP processors, CPUs 4–5 as PME processors]

+ MPI_Alltoall runs on only 1/4 to 1/2 of the processors, reducing the communication to 1/16 to 1/4
+ better distribution of the PME grid onto the processors
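The 1/16 to 1/4 figure follows from the N·(N−1) ≈ N² message count: shrinking the all-to-all group shrinks the message count quadratically. A quick check:

```python
# All-to-all among N ranks needs N*(N-1) messages; restricting it to the
# PME subset therefore reduces the count roughly quadratically.
def alltoall_msgs(n):
    return n * (n - 1)

total = 64
for pme_frac in (0.25, 0.5):
    npme = int(total * pme_frac)
    ratio = alltoall_msgs(npme) / alltoall_msgs(total)
    print(f"{npme}/{total} PME ranks -> {ratio:.3f} of the messages")
```

With 1/4 of the ranks doing PME the transpose uses about 6% of the messages of the all-ranks case, slightly better than the quoted 1/16 because of the −1 term.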
‣ today typically 2, 4, or 8 CPUs share a network interconnect
‣ PP/PME node interleaving: higher bandwidth on multi-CPU nodes, and the LR processors can use all available network bandwidth
Multiple-Program Multiple-Data PME

mpirun -np 8 mdrun -npme 4 -ddorder pp_pme
mpirun -np 8 mdrun -npme 4 -ddorder interleave

[figure: node placement behind the switch with -ddorder pp_pme (PP and PME ranks on separate nodes) vs. -ddorder interleave (PP and PME ranks alternating across nodes)]
Knobs to turn

same accuracy:
‣ node separation yes/no, # of PME nodes
‣ node order: pp_pme, interleave, ...
‣ DD grid, dlb yes/no
‣ compiler & MPI library!
‣ shift real/reciprocal load at the same accuracy: multiply both rc and fourierspacing by the same factor

affects accuracy:
‣ PME grid points
‣ cutoff radius
‣ PME order 4, 6, 8, ... but we don’t want to enlarge the grid overlap in the parallel case!

[figure: speedup vs. number of processors for MVAPICH 1.0.0, OpenMPI 1.2.5, and IMPI 3.1: the MPI library alone changes the scaling]
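The load-shifting rule can be sketched numerically; the factors below match settings A/B/C of the benchmark later in the talk (grid dimensions are then rounded to FFT-friendly sizes, which is presumably why C ends up at 0.160 nm rather than exactly 1.20 × 0.135).

```python
# Scaling both rc and the PME grid spacing by the same factor keeps the
# Ewald accuracy roughly constant while moving work from the mesh (LR)
# to the direct sum (SR). Base values are setting A of the benchmark.
base_rc, base_spacing = 1.0, 0.135   # nm
for factor in (1.00, 1.07, 1.20):
    rc = base_rc * factor
    spacing = base_spacing * factor
    print(f"factor {factor:.2f}: rc = {rc:.2f} nm, spacing = {spacing:.3f} nm")
```

A larger rc means more direct-space pair interactions per PP rank; a coarser grid means a smaller FFT, so the ratio of PP to PME work can be tuned without changing the accuracy.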
Example benchmark system
‣ SBMV in water, 4.5 M particles
‣ cutoffs 1 nm
‣ PME order 4, grid spacing 0.135 nm (275^3 grid)
‣ Intel 2.66 GHz, 8 cores / node
‣ InfiniBand interconnect
Easy GROMACS benchmarking
‣ performance measurement in GROMACS: built-in timers
‣ e.g. the ns/day output in the log file
‣ make a “short” test tpr ... but not too short!
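The ns/day figure follows directly from the time step and the wall time per step; a quick conversion (the 2 fs time step is assumed for illustration, not taken from the benchmark):

```python
# Convert wall-clock seconds per MD step into the ns/day performance
# figure that the GROMACS log reports. A 2 fs time step is assumed.
def ns_per_day(dt_fs, wall_s_per_step):
    ps_per_s = (dt_fs / 1000.0) / wall_s_per_step   # simulated ps per wall second
    return ps_per_s * 86400 / 1000.0                # convert to ns per day

print(round(ns_per_day(2.0, 0.05), 2))  # -> 3.46
```

This is also why a benchmark must average over enough steps: a single slow step at 0.06 s instead of 0.05 s would change the reported figure by almost 20%.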
[figure: time for 10 steps (normalized to t0) vs. starting time step, for 8 CPUs (DD 8x1x1) up to 128 CPUs (DD 8x4x4); SBMV benchmark, steps 1–10, 2–20, ... (step 0 omitted)]
Performance tuning example

mpirun -np N mdrun

[figure: speedup and total CPU time per step vs. number of processors for the automatic settings (“gmx-guess”); points labeled with the chosen DD grid + npme: 8x1x1, 4x3x1+4, 8x3x1+8, 16x3x1+16, 16x3x2+32, 4x5x5+92]
Performance tuning example

mpirun -np N mdrun -v

[figure: same speedup / CPU-time-per-step plots; with -v, mdrun reports “load imbalance force: 5.4 %”]
Performance tuning example

mpirun -np N mdrun -dlb yes

[figure: with dynamic load balancing (-dlb yes) the reported force load imbalance drops to 0.1 %]
Performance tuning example

[figure: at higher processor counts a PME mesh/force load imbalance is reported, growing with the processor count: 1.2, 1.7, 2.0, 2.0, 1.9]
Performance tuning example

mpirun -np N mdrun -npme 0

[figure: additional curve without PP/PME splitting (-npme 0), which removes the PME mesh/force imbalance]
Performance tuning example

‣ tune the number of PME nodes: gmx-guess chose 92, 32, 16, 8, 4; optimized values are 96, 48, 28, 0, 0
‣ remaining PME mesh/force imbalance: 1.1
‣ optimized DD grid: 8x2x2 (auto) → 8x4x1 (opt)

[figure: speedup and total CPU time per step for gmx-guess, no PP/PME splitting, and optimized settings]
Shifting load to real space

      rc / nm   grid spacing / nm   grid dimensions
  A   1.00      0.135               275^3
  B   1.07      0.144               256^3
  C   1.20      0.160               231^3

[figure: total CPU time per step vs. PME mesh spacing for 8 and 192 processors; labels give the number of processors taking part in the LR calculation (96, 88, 64, 48, ...)]
[figure: speedup and total CPU time per step for settings A, B, and C (real/reciprocal load shifted), with the gmx-guess curve for reference]
Performance tuning example

[figure: final comparison of speedup and total CPU time per step; percentage labels: gmx-guess 75–78 %, optimized settings 95–101 %]
Tuning Howto
‣ make a short (>>100 steps) benchmark tpr, run with -v
‣ start with the automatic settings, then apply the performance hints
‣ few CPUs:
  - turn on dlb if there is a load imbalance in the force
  - turn PME nodes on/off
‣ many CPUs:
  - adjust the # of PME nodes if a PME mesh/force imbalance is reported
  - adjust the real/reciprocal space workload by multiplying both rc and fourierspacing by the same factor near 1
‣ you might want to try other DD grids:
  - a high # of x-slabs mimics the data organization for PME
  - it can be favourable NOT to decompose in one direction (no x/f communication pulse is needed in that direction)
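The candidate DD grids from the last bullet are easy to enumerate; a sketch (plain factorization only, ignoring GROMACS's own minimum-cell-size constraints):

```python
# List all x*y*z decompositions of npp PP ranks. Grids with z == 1
# (or y == z == 1) skip the communication pulse in the undivided
# direction(s), as noted above.
def dd_grids(npp):
    grids = []
    for x in range(1, npp + 1):
        if npp % x:
            continue
        for y in range(1, npp // x + 1):
            if (npp // x) % y == 0:
                grids.append((x, y, npp // (x * y)))
    return grids

print(dd_grids(12))
```

For 12 PP ranks this yields 18 ordered grids, among them 12x1x1 (slab-like, PME-friendly) and 4x3x1 (no decomposition in z), which is the kind of grid the benchmark above ended up preferring.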
Acknowledgments
‣ The GROMACS developers
‣ David van der Spoel
‣ Erik Lindahl
‣ Berk Hess
‣ The Theoretical and Computational Biophysics Department
‣ Mareike Zink
‣ Gerrit Groenhof
‣ Bert de Groot
‣ Helmut Grubmüller