PERFORMANCE STUDIES OF A MOLECULAR DYNAMICS CODE
Evaluating Serial, Thread and Process-Parallel Performance
Technische Universität Dresden
Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH)
01062 Dresden, Germany
E-mail: [email protected]
Web: http://www.vampir.eu/

IU Bloomington
Pervasive Technology Institute
2719 East 10th Street
Bloomington, Indiana 47408
E-mail: [email protected]
Web: http://pti.iu.edu/
In 2008, a collaboration was founded between the Center for Information Services and High Performance Computing (ZIH) at Technische Universität Dresden, Germany, and the Pervasive Technology Institute (PTI) at Indiana University, to mutually benefit from common research areas.
These research topics include:
‣ Data-centric computing
‣ Computing for biological and life sciences
‣ Wide area distributed file systems
‣ Parallel computing performance
As part of this collaboration, ZIH maintains a stack of performance evaluation tools inside the NSF FutureGrid project. For the analysis, an HPC system called Xray was used, a Cray XT5m that is part of the FutureGrid hardware. Xray consists of quad-core 64-bit AMD Opteron series 2000 processors. PGI compilers version 9.0.4 and the optimized xtpe-barcelona module provided by Cray were used.
The first step in the code analysis is to identify the best-performing parameter set. This configuration represents the most promising candidate for further parallel analysis.
Serial Analysis
Collaboration between PTI and ZIH
Molecular Dynamics
A molecular dynamics code simulating diffusion in the dense nuclear matter of white dwarf stars is analyzed in this collaboration. The code is highly configurable, allowing MPI, OpenMP, or hybrid runs and additional fine-tuning with a range of parameters.
Parallel Analysis
The aim of the parallel analysis is to measure the scalability limits of the different parallel code implementations and to detect bottlenecks that may prevent further gains in parallel efficiency. This work was done with the parallel analysis framework Vampir.
This document was developed with support from the National Science Foundation (NSF) under Grant No. 0910812 to Indiana University for "FutureGrid: An Experimental, High-Performance Grid Test-bed." Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF.
T. William, M. Weber, D. Röhrig - ZIH, TU Dresden
D. K. Berry, R. Henschel - UITS, IU Bloomington
J. Hughto, A. S. Schneider, C. J. Horowitz - Physics, IU Bloomington
Authors
PERFORMANCE STUDIES OF A MOLECULAR DYNAMICS CODE
Evaluating Serial, Thread and Process-Parallel Performance
Planning → Implementation with VampirTrace → Testing → Evaluation with Vampir
The analyzed molecular dynamics code (MD) has been developed at Indiana University for simulating dense nuclear matter in white dwarf and neutron stars.
Matter in these compact stellar objects consists of either completely ionized atoms, or else free neutrons and protons.
The MD code models these systems as classical particles interacting via a screened Coulomb interaction. The electrons are not modeled explicitly, but are treated as a uniform background charge that serves to screen the Coulomb interaction.
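The screened interaction presumably takes the standard Yukawa form used in nuclear-matter MD studies (notation ours, not quoted from the poster):

\[ V_{ij}(r) = \frac{Z_i Z_j e^2}{r} \, e^{-r/\lambda} \]

where $Z_i$ and $Z_j$ are the ion charges and $\lambda$ is the electron screening length set by the background electrons.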
The movie shows a 5,120-particle (5k) simulation of an ion mixture forming microcrystals:
‣ 1280 Carbon ions
‣ 3740 Oxygen ions
‣ 100 Neon ions
By studying the motion of ions over a long simulation time, one can calculate the self-diffusion coefficient of ions, an important quantity for understanding the behavior of white dwarfs.
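For reference, a self-diffusion coefficient of this kind is typically extracted from the long-time mean-square displacement via the Einstein relation (a standard definition, not taken from the poster):

\[ D = \lim_{t \to \infty} \frac{\langle \, | \mathbf{r}_i(t) - \mathbf{r}_i(0) |^2 \, \rangle}{6t} \]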
MOLECULAR DYNAMICS CODE: Overview
MD-Code Ion Mix Movements
MD is a hybrid MPI/OpenMP parallel code written in Fortran. Its main features are:
‣ Simple particle-particle algorithm
‣ Long-range interaction with a large cut-off sphere
‣ Cut-off radius is half the box edge length
‣ Each ion interacts with about half the others
‣ Interactions are calculated in a pair of nested do loops
‣ Outer loop goes over "target" particles
‣ Inner loop goes over "source" particles
‣ Targets are assigned in a round-robin fashion to MPI processes
‣ Within each MPI process, the work is shared among OpenMP threads (see the sketch below)
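A minimal sketch of this decomposition, with illustrative names (myrank, nprocs, and pair_force are our assumptions, not the authors' identifiers):

      ! Targets distributed round-robin over MPI ranks; within a rank,
      ! OpenMP shares the strided target loop among threads.
      !$omp parallel do private(j)
      do i = myrank + 1, n, nprocs      ! "target" particles, round-robin
         do j = 1, n                    ! "source" particles
            if (j /= i) then
               call pair_force(i, j)    ! accumulate force of source j on target i
            end if
         end do
      end do
      !$omp end parallel do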
At compile time, the input parameter PP0x decides which file is used for the simulation. Inside the PP0x files, the parameters Ax and Bx denote different code blocks that implement the same semantics in different ways. Additionally, the file PP03 provides the define NBS to influence the vector access length. These parameters have been used over the years to re-implement code using newer Fortran 95 constructs that were not available in Fortran 77.
PP01:
‣ Original implementation
‣ No division into Ax, Bx, or NBS
‣ Baseline for other measurements

PP02:
‣ A0, A1, A2
‣ B0, B1, B2
‣ No NBS

PP03:
‣ A0, A1, A2
‣ B0, B1, B2, B3, B4, B5, B6, B7
‣ NBS
Code block in the PP03 file showing different implementations of the same loop selectable by Ax preprocessor macros
The MD code is highly modular, consisting of several different implementations of the same simulation logic. The particle-particle interaction semantics have been implemented multiple times in different ways. Each implementation is stored in an individual source file, named PP01 through PP03.
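A hedged sketch of how this compile-time selection typically looks (macro spellings and file names are assumptions, not taken from the MD source):

      ! Illustrative only: PP0x selects the implementation file,
      ! Ax/Bx select code blocks inside it, NBS sets the blocking length.
      #if defined(PP03)
      ! research version: A0-A2, B0-B7, with NBS blocking
      #define NBS 32
      #include "pp03.F"
      #elif defined(PP02)
      ! production version: A0-A2, B0-B2, no NBS
      #include "pp02.F"
      #else
      ! original Fortran 77 implementation (baseline)
      #include "pp01.F"
      #endif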
MOLECULAR DYNAMICS CODE: Implementation Details
SCOPE OF THE CODE ANALYSIS: Hardware Details & Parameter Sweep
Parallelism
serial: md
OpenMP: md_omp
MPI: md_mpi
Hybrid: md_mpi_omp
All measurements done on FutureGrid machine Xray:
‣ Cray XT5m
‣ 672 cores
‣ 84 nodes
‣ 2 CPUs per node
‣ Quad-Core AMD Opteron(tm) Processor 23 (C2)
‣ 2.4 GHz
‣ ...
Serial analysis:
‣ 55k particles
‣ Comparing optimization flags: -O2, -O3, -fastsse
‣ -fastsse == -fast
‣ -fast enables common optimizations:
  -O2 -Munroll=c:1 -Mnoframe -Mlre
  -Mautoinline -Mvect=sse -Mscalarsse
  -Mcache_align -Mflushz
Loop combination
Block A: A0-A2
Block B: B1-B7
Code-blocking: NBS

Particle type
‣ nucleon-nucleon
‣ ion pure
‣ ion mix

Time estimate per measurement and particle count:
5k: ~300 seconds (~5 minutes)
27k: ~4,000 seconds (~1 h)
55k: ~35,000 seconds (~10 h)

Input parameters
‣ simulation type
‣ number of particles
‣ number of time steps

Compiler flags
‣ O2
‣ O3
‣ SSE
‣ ...

Code version
Original: PP01
Production: PP02
Research: PP03

>8,100 possible combinations
SERIAL ANALYSIS: Identifying the Best Configuration for Parallel Runs
Runtime for all code combinations
SERIAL ANALYSIS: Identifying the Best Configuration for Parallel Runs
Runtime in seconds
SERIAL ANALYSIS: PAPI-Aided Analysis of PP02 Code Versions
PAPI_FP_INS
SERIAL ANALYSIS: PAPI-Aided Analysis of PP02 Code Versions
FPU Idle
SERIAL ANALYSIS: PAPI-Aided Analysis of PP02 Code Versions
Branches mispredicted
SERIAL ANALYSIS: PAPI-Aided Analysis of PP02 Code Versions
PAPI_VEC_INS
SERIAL ANALYSIS: PAPI-Aided Analysis of PP02 Code Versions
L1 hit ratio
SERIAL ANALYSIS: PAPI-Aided Analysis of PP02 Code Versions
PAPI_TOT_CYC
SOURCE CODE ANALYSIS: Code Blocks Ax & Bx for Ion Mix (PP02)
A0: Branching
A1: Arithmetic
A2: Array syntax
Block A
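As a hedged illustration of the three Block-A styles (variable names and the coefficient logic are assumptions, not the actual PP02 source):

      ! A0: branching - select the pair coefficient with an IF
      if (itype(j) == itype(i)) then
         c = c_same
      else
         c = c_mixed
      end if

      ! A1: arithmetic - branch-free blend of the two coefficients
      w = merge(1.0d0, 0.0d0, itype(j) == itype(i))
      c = w*c_same + (1.0d0 - w)*c_mixed

      ! A2: array syntax - Fortran 95 whole-array expression over all sources
      cvec(1:n) = merge(c_same, c_mixed, itype(1:n) == itype(i))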
SOURCE CODE ANALYSIS: Code Blocks Ax & Bx for Ion Mix (PP02)
B0: Loop
B1: No Cut-off Sphere
B2: Cut-off Sphere
Block B
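Again as an illustrative sketch (names and the force expression are assumptions), the B variants differ mainly in how the cut-off sphere is treated:

      ! B1: no cut-off sphere - evaluate the screened term for every pair
      f = c/r2 * exp(-r/rlambda)

      ! B2: cut-off sphere - a branch skips pairs beyond half the box edge
      if (r2 < rcut2) then
         f = c/r2 * exp(-r/rlambda)
      else
         f = 0.0d0
      end if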
SOURCE CODE ANALYSIS: Code Blocks Ax & Bx for Ion Mix (PP02)
B0: Loop
B1: No Cut-off Sphere
B2: Cut-off Sphere
Result: "Calculation does not beat branching"
Block B
The MD code is used in production with particle counts between 27k and 55k. Production runs currently use the PP02 implementation of the particle-particle interaction semantics. The selected code blocks are A0 and B2, and the vector length (NBS) is set to 32. Measurements show that this configuration is very efficient and yields good performance.
So far, runtimes of larger runs have only been analyzed using the hybrid version. The chart shows speedups on a large Cray XT5 (Kraken) and the resulting efficiency for 27k and 55k particle runs. Up to 1152 cores, the efficiency is higher than 80%. At 4608 cores, the efficiency drops below 50% for 27k particles. Increasing the particle count exploits weak scaling: with more ions, an efficiency of 58% is achieved even on 4608 cores.
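Parallel efficiency here is presumably the standard ratio of speedup to core count,

\[ E(p) = \frac{S(p)}{p} = \frac{T(1)}{p \, T(p)}, \]

where $T(p)$ is the runtime on $p$ cores.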
Although these results are very promising, Kraken has 112,896 compute cores that could be used for computation. The mid-term goal is therefore to optimize the MD code to achieve 50% parallel efficiency on 16,000 cores. To achieve this, we will take a more detailed look at the four different versions of the code (serial, OpenMP, MPI and hybrid).
The code is very efficient on multi-core processors such as the 6-core processors of Kraken, the Cray XT5 at the National Institute for Computational Sciences. Each processor is assigned one MPI process, whose work is then shared among 6 OpenMP threads.
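On a Cray XT5 this placement is typically expressed through aprun; a hypothetical launch for the 1152-core case mentioned above (the process count is our illustration, not a documented run):

      # one MPI process per 6-core socket, 6 OpenMP threads per process:
      # 192 processes x 6 threads = 1152 cores
      export OMP_NUM_THREADS=6
      aprun -n 192 -N 2 -d 6 ./md_mpi_omp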
PARALLEL ANALYSIS: First Scalability Evaluation
ZIH has more than a decade of expertise in analyzing sequential and parallel codes. As part of the collaboration with IU, ZIH applied its analysis software Vampir to the different versions of MD.
Vampir consists of a tracing framework called VampirTrace that is able to monitor all levels of parallelism in the MD code. Additionally, PAPI counters are recorded in order to analyze the cache behavior of the main calculation loop.
In a first step, traces of the four versions were generated using only 5k particles to get a broad overview. The next step will be to refine the tracing mechanism using manually instrumented code, to avoid the rather large overhead introduced by OpenMP tracing. This will enable monitoring of at least 4096 cores and an understanding of why the efficiency drops below 50% at this core count.
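A hedged sketch of such a tracing setup (standard VampirTrace usage; the source file name, metric selection and buffer size are our assumptions):

      # instrument at compile time with the VampirTrace Fortran wrapper
      vtf90 -vt:f90 ftn -fastsse -mp -o md_mpi_omp md.F90

      # record PAPI counters with the trace; a larger buffer delays flushes
      export VT_METRICS=PAPI_FP_INS:PAPI_TOT_CYC
      export VT_BUFFER_SIZE=64M
      aprun -n 8 -N 1 ./md_mpi_omp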
PARALLEL ANALYSIS WITH VAMPIR
OPENMP CODE OPTIMIZATION
Comparison of OpenMP Versions using 1 Node and 1-8 Cores
OPENMP CODE OPTIMIZATION
Comparison of OpenMP Versions using 1 Node and 1-8 Cores
PARALLEL ANALYSIS WITH VAMPIR
55k particles, MPI only, 8 nodes, 1 process per node
PARALLEL ANALYSIS WITH VAMPIR
20+ loops of the A0_B2 block
PARALLEL ANALYSIS WITH VAMPIR
55k particles, MPI only, 1 node, 8 processes per node
PARALLEL ANALYSIS WITH VAMPIR
20+ loops of the A0_B2 block
PARALLEL ANALYSIS WITH VAMPIR
55k particles, MPI only, 84 nodes, 8 processes per node (672 cores == full-machine run)
PARALLEL ANALYSIS WITH VAMPIR
Switching of groups and flushing of VT buffers
PARALLEL ANALYSIS WITH VAMPIR
55k particles, OpenMP only, 1 node, 8 threads
PARALLEL ANALYSIS WITH VAMPIR
55k particles, Hybrid, 84 MPI processes (1 per node), 8 threads per process (672 cores == full-machine run)
PARALLEL ANALYSIS WITH VAMPIR
Second group, 100 runs
SCALABILITY STUDIES ON XRAY
Parallel Efficiency (MPI-only runs)
1 Node 8 Nodes
SCALABILITY STUDIES ON XRAY
Parallel Efficiency (Hybrid runs)
To find the optimal serial version of MD, different combinations of the code blocks and compiler flags were tested. The results of the serial analysis formed the starting point for the parallel analysis.
Vampir was then successfully applied to the code, making it possible to visualize MPI and OpenMP parallelism. An important step for future parallel analysis is to embed PAPI counters directly in the MD code to avoid the overhead caused by VampirTrace. This enables a comparison of the actually achieved performance with the theoretical peak performance at larger core counts.
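A minimal sketch of such direct instrumentation using PAPI's high-level Fortran API (the counter choice and placement are our assumptions):

      ! Illustrative direct PAPI instrumentation around the main loop
      include 'f90papi.h'
      integer, parameter :: nev = 2
      integer :: events(nev) = (/ PAPI_FP_INS, PAPI_TOT_CYC /)
      integer :: check
      integer*8 :: counts(nev)

      call PAPIF_start_counters(events, nev, check)
      ! ... main particle-particle interaction loop ...
      call PAPIF_stop_counters(counts, nev, check)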
Additional tests with larger particle counts are planned to exploit the weak-scaling capabilities of the code. The hybrid implementation is to be run in different process/thread configurations to investigate whether fitting the MPI/OpenMP usage to the hardware characteristics can further increase performance.
Analysis tests will be re-run on Kraken with up to 16,384 cores. The goal is to detect scalability issues and to significantly improve parallel efficiency.
The MD code is currently being ported to GPGPUs. Vampir already supports the analysis of CUDA-parallelized code, which will help ensure that the ported code exploits the full potential of GPGPUs.
CONCLUSION AND FUTURE WORK
Planning → Implementation with VampirTrace → Testing → Evaluation with Vampir