PERFORMANCE STUDIES OF A MOLECULAR DYNAMICS CODE
Evaluating Serial, Thread and Process-Parallel Performance
Technische Universität Dresden
Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH)
01062 Dresden, Germany
E-mail: [email protected]
Web: http://www.vampir.eu/

IU Bloomington
Pervasive Technology Institute
2719 East 10th Street
Bloomington, Indiana 47408
E-mail: [email protected]
Web: http://pti.iu.edu/
In 2008, a collaboration was founded between the Center for Information Services and High Performance Computing (ZIH) at Technische Universität Dresden, Germany, and the Pervasive Technology Institute (PTI) at Indiana University, to mutually benefit from common research areas.
These research topics include:
‣ Data-centric computing
‣ Computing for biological and life sciences
‣ Wide area distributed file systems
‣ Parallel computing performance
As part of this collaboration, ZIH maintains a stack of performance evaluation tools inside the NSF FutureGrid project. For the analysis, an HPC system called Xray was used, a Cray XT5m that is part of the FutureGrid hardware. Xray consists of quad-core 64-bit AMD Opteron series 2000 processors. PGI compilers version 9.0.4 and the optimized xtpe-barcelona module provided by Cray were used.
The first step in the code analysis is to identify the best-performing parameter set. This configuration represents the most promising candidate for further parallel analysis.
Serial Analysis
Collaboration between PTI and ZIH
Molecular Dynamics
A molecular dynamics code simulating diffusion in the dense nuclear matter of white dwarf stars is analyzed in this collaboration. The code is highly configurable, allowing MPI, OpenMP, or hybrid runs and additional fine-tuning with a range of parameters.
Parallel Analysis
The aim of the parallel analysis is to measure the scalability limits of the different parallel code implementations and to detect bottlenecks that may prevent further gains in parallel efficiency. This work was done with the parallel analysis framework Vampir.
This document was developed with support from the National Science Foundation (NSF) under Grant No. 0910812 to Indiana University for "FutureGrid: An Experimental, High-Performance Grid Test-bed." Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF.
T. William, M. Weber, D. Röhrig - ZIH, TU Dresden
D. K. Berry, R. Henschel - UITS, IU Bloomington
J. Hughto, A. S. Schneider, C. J. Horowitz - Physics, IU Bloomington
Authors
PERFORMANCE STUDIES OF A MOLECULAR DYNAMICS CODE
Evaluating Serial, Thread and Process-Parallel Performance
Planning → Implementation with VampirTrace → Testing → Evaluation with Vampir
The analyzed molecular dynamics code (MD) has been developed at Indiana University for simulating dense nuclear matter in white dwarf and neutron stars.
Matter in these compact stellar objects consists of either completely ionized atoms, or else free neutrons and protons.
The MD code models these systems as classical particles interacting via a screened Coulomb interaction. The electrons are not modeled explicitly, but are treated as a uniform background charge that serves to screen the Coulomb interaction.
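The screened interaction presumably takes the standard Yukawa form used in nuclear-matter MD studies (notation ours, not quoted from the poster):

\[ V_{ij}(r) = \frac{Z_i Z_j e^2}{r} \, e^{-r/\lambda} \]

where $Z_i$ and $Z_j$ are the ion charges and $\lambda$ is the electron screening length set by the background electrons.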
The movie shows a 5,120-particle (5k) simulation of an ion mixture forming microcrystals:
‣ 1280 Carbon ions
‣ 3740 Oxygen ions
‣ 100 Neon ions
By studying the motion of ions over a long simulation time, one can calculate the self-diffusion coefficient of ions, an important quantity for understanding the behavior of white dwarfs.
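For reference, a self-diffusion coefficient of this kind is typically extracted from the long-time mean-square displacement via the Einstein relation (a standard definition, not taken from the poster):

\[ D = \lim_{t \to \infty} \frac{\langle \, | \mathbf{r}_i(t) - \mathbf{r}_i(0) |^2 \, \rangle}{6t} \]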
MOLECULAR DYNAMICS CODE: Overview
MD-Code Ion Mix Movements
MD is a hybrid MPI/OpenMP parallel code written in Fortran. Its main features are:
‣ Simple particle-particle algorithm
‣ Long-range interaction with a large cut-off sphere
‣ Cut-off radius is half the box edge length
‣ Each ion interacts with about half the others
‣ Interactions are calculated in a pair of nested do loops
‣ Outer loop goes over "target" particles
‣ Inner loop goes over "source" particles
‣ Targets are assigned in a round-robin fashion to MPI processes
‣ Within each MPI process, the work is shared among OpenMP threads (see the sketch below)
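A minimal sketch of this decomposition, with illustrative names (myrank, nprocs, and pair_force are our assumptions, not the authors' identifiers):

      ! Targets distributed round-robin over MPI ranks; within a rank,
      ! OpenMP shares the strided target loop among threads.
      !$omp parallel do private(j)
      do i = myrank + 1, n, nprocs      ! "target" particles, round-robin
         do j = 1, n                    ! "source" particles
            if (j /= i) then
               call pair_force(i, j)    ! accumulate force of source j on target i
            end if
         end do
      end do
      !$omp end parallel do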
At compile time, the input parameter PP0x decides which file is used for the simulation. Inside the PP0x files, the parameters Ax and Bx denote different code blocks that implement the same semantics in different ways. Additionally, the file PP03 provides the define NBS to influence the vector access length. These parameters have been used over the years to re-implement code using newer Fortran 95 constructs that were not available in Fortran 77.
PP01:
‣ Original implementation
‣ No division into Ax, Bx, or NBS
‣ Baseline for other measurements

PP02:
‣ A0, A1, A2
‣ B0, B1, B2
‣ No NBS

PP03:
‣ A0, A1, A2
‣ B0, B1, B2, B3, B4, B5, B6, B7
‣ NBS
Code block in the PP03 file showing different implementations of the same loop selectable by Ax preprocessor macros
The MD code is highly modular, consisting of several different implementations of the same simulation logic. The particle-particle interaction semantics have been implemented multiple times in different ways. Each implementation is stored in an individual source file, named PP01 through PP03.
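A hedged sketch of how this compile-time selection typically looks (macro spellings and file names are assumptions, not taken from the MD source):

      ! Illustrative only: PP0x selects the implementation file,
      ! Ax/Bx select code blocks inside it, NBS sets the blocking length.
      #if defined(PP03)
      ! research version: A0-A2, B0-B7, with NBS blocking
      #define NBS 32
      #include "pp03.F"
      #elif defined(PP02)
      ! production version: A0-A2, B0-B2, no NBS
      #include "pp02.F"
      #else
      ! original Fortran 77 implementation (baseline)
      #include "pp01.F"
      #endif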
MOLECULAR DYNAMICS CODE: Implementation Details
SCOPE OF THE CODE ANALYSIS: Hardware Details & Parameter Sweep
Parallelism
serial: md
OpenMP: md_omp
MPI: md_mpi
Hybrid: md_mpi_omp
All measurements done on FutureGrid machine Xray:
‣ Cray XT5m
‣ 672 cores
‣ 84 nodes
‣ 2 CPUs per node
‣ Quad-Core AMD Opteron(tm) Processor 23 (C2)
‣ 2.4 GHz
‣ ...
Serial analysis:
‣ 55k particles
‣ Comparing optimization flags: -O2, -O3, -fastsse
‣ -fastsse == -fast
‣ -fast enables common optimizations:
  -O2 -Munroll=c:1 -Mnoframe -Mlre
  -Mautoinline -Mvect=sse -Mscalarsse
  -Mcache_align -Mflushz
Loop combination
Block A: A0-A2
Block B: B1-B7
Code-blocking: NBS

Particle type
‣ nucleon-nucleon
‣ ion pure
‣ ion mix

Time estimate per measurement and particle count:
5k: ~300 seconds (~5 minutes)
27k: ~4,000 seconds (~1 h)
55k: ~35,000 seconds (~10 h)

Input parameters
‣ simulation type
‣ number of particles
‣ number of time steps

Compiler flags
‣ O2
‣ O3
‣ SSE
‣ ...

Code version
Original: PP01
Production: PP02
Research: PP03

>8,100 possible combinations
SERIAL ANALYSIS: Identifying the Best Configuration for Parallel Runs
Runtime for all code combinations
SERIAL ANALYSIS: Identifying the Best Configuration for Parallel Runs
Runtime in seconds
SERIAL ANALYSIS: PAPI-Aided Analysis of PP02 Code Versions
PAPI_FP_INS
SERIAL ANALYSIS: PAPI-Aided Analysis of PP02 Code Versions
FPU Idle
SERIAL ANALYSIS: PAPI-Aided Analysis of PP02 Code Versions
Branches mispredicted
SERIAL ANALYSIS: PAPI-Aided Analysis of PP02 Code Versions
PAPI_VEC_INS
SERIAL ANALYSIS: PAPI-Aided Analysis of PP02 Code Versions
L1 hit ratio
SERIAL ANALYSIS: PAPI-Aided Analysis of PP02 Code Versions
PAPI_TOT_CYC
SOURCE CODE ANALYSIS: Code Blocks Ax & Bx for Ion Mix (PP02)
A0: Branching
A1: Arithmetic
A2: Array syntax
Block A
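As a hedged illustration of the three Block-A styles (variable names and the coefficient logic are assumptions, not the actual PP02 source):

      ! A0: branching - select the pair coefficient with an IF
      if (itype(j) == itype(i)) then
         c = c_same
      else
         c = c_mixed
      end if

      ! A1: arithmetic - branch-free blend of the two coefficients
      w = merge(1.0d0, 0.0d0, itype(j) == itype(i))
      c = w*c_same + (1.0d0 - w)*c_mixed

      ! A2: array syntax - Fortran 95 whole-array expression over all sources
      cvec(1:n) = merge(c_same, c_mixed, itype(1:n) == itype(i))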
SOURCE CODE ANALYSIS: Code Blocks Ax & Bx for Ion Mix (PP02)
B0: Loop
B1: No Cut-off Sphere
B2: Cut-off Sphere
Block B
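Again as an illustrative sketch (names and the force expression are assumptions), the B variants differ mainly in how the cut-off sphere is treated:

      ! B1: no cut-off sphere - evaluate the screened term for every pair
      f = c/r2 * exp(-r/rlambda)

      ! B2: cut-off sphere - a branch skips pairs beyond half the box edge
      if (r2 < rcut2) then
         f = c/r2 * exp(-r/rlambda)
      else
         f = 0.0d0
      end if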
SOURCE CODE ANALYSIS: Code Blocks Ax & Bx for Ion Mix (PP02)
B0: Loop
B1: No Cut-off Sphere
B2: Cut-off Sphere
Result: "Calculation does not beat branching"
Block B
The MD code is used in production with particle counts between 27k and 55k. Production runs currently use the PP02 implementation of the particle-particle interaction semantics. The selected code blocks are A0 and B2, and the vector length (NBS) is set to 32. Measurements show that this configuration is very efficient and yields good performance.
So far, runtimes of larger runs have only been analyzed using the hybrid version. The chart shows speedups on a large Cray XT5 (Kraken) and the resulting efficiency for 27k and 55k particle runs. Up to 1152 cores, the efficiency is higher than 80%. At 4608 cores, the efficiency drops below 50% for 27k particles. Increasing the particle count exploits weak scaling: with more ions, an efficiency of 58% is achieved even on 4608 cores.
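Parallel efficiency here is presumably the standard ratio of speedup to core count,

\[ E(p) = \frac{S(p)}{p} = \frac{T(1)}{p \, T(p)}, \]

where $T(p)$ is the runtime on $p$ cores.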
Although these results are very promising, Kraken has 112,896 compute cores that could be used for computation. The mid-term goal is therefore to optimize the MD code to achieve 50% parallel efficiency on 16,000 cores. To achieve this, we will take a more detailed look at the four different versions of the code (serial, OpenMP, MPI and hybrid).
The code is very efficient on multi-core processors such as the 6-core processors of Kraken, the Cray XT5 at the National Institute for Computational Sciences. Each processor is assigned one MPI process, whose work is then shared among 6 OpenMP threads.
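On a Cray XT5 this placement is typically expressed through aprun; a hypothetical launch for the 1152-core case mentioned above (the process count is our illustration, not a documented run):

      # one MPI process per 6-core socket, 6 OpenMP threads per process:
      # 192 processes x 6 threads = 1152 cores
      export OMP_NUM_THREADS=6
      aprun -n 192 -N 2 -d 6 ./md_mpi_omp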
PARALLEL ANALYSIS: First Scalability Evaluation
ZIH has more than a decade of expertise in analyzing sequential and parallel codes. As part of the collaboration with IU, ZIH applied its analysis software Vampir to the different versions of MD.
Vampir consists of a tracing framework called VampirTrace that is able to monitor all levels of parallelism in the MD code. Additionally, PAPI counters are recorded in order to analyze the cache behavior of the main calculation loop.
In a first step, traces of the four versions were generated using only 5k particles to get a broad overview. The next step will be to refine the tracing mechanism using manually instrumented code, to avoid the rather large overhead introduced by OpenMP tracing. This will enable monitoring of at least 4096 cores and an understanding of why the efficiency drops below 50% at this core count.
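A hedged sketch of such a tracing setup (standard VampirTrace usage; the source file name, metric selection and buffer size are our assumptions):

      # instrument at compile time with the VampirTrace Fortran wrapper
      vtf90 -vt:f90 ftn -fastsse -mp -o md_mpi_omp md.F90

      # record PAPI counters with the trace; a larger buffer delays flushes
      export VT_METRICS=PAPI_FP_INS:PAPI_TOT_CYC
      export VT_BUFFER_SIZE=64M
      aprun -n 8 -N 1 ./md_mpi_omp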
PARALLEL ANALYSIS WITH VAMPIR
OPENMP CODE OPTIMIZATION
Comparison of OpenMP Versions using 1 Node and 1-8 Cores
OPENMP CODE OPTIMIZATION
Comparison of OpenMP Versions using 1 Node and 1-8 Cores
PARALLEL ANALYSIS WITH VAMPIR
55k particles, MPI only, 8 nodes, 1 process per node
PARALLEL ANALYSIS WITH VAMPIR
20+ loops of the A0_B2 block
PARALLEL ANALYSIS WITH VAMPIR
55k particles, MPI only, 1 node, 8 processes per node
PARALLEL ANALYSIS WITH VAMPIR
20+ loops of the A0_B2 block
PARALLEL ANALYSIS WITH VAMPIR
55k particles, MPI only, 84 nodes, 8 processes per node (672 cores == full-machine run)
PARALLEL ANALYSIS WITH VAMPIR
Switching of groups and flushing of VT buffers
PARALLEL ANALYSIS WITH VAMPIR
55k particles, OpenMP only, 1 node, 8 threads
PARALLEL ANALYSIS WITH VAMPIR
55k particles, Hybrid, 84 MPI processes (1 per node), 8 threads per process (672 cores == full-machine run)
PARALLEL ANALYSIS WITH VAMPIR
Second group, 100 runs
SCALABILITY STUDIES ON XRAY
Parallel Efficiency (MPI-only runs)
1 Node 8 Nodes
SCALABILITY STUDIES ON XRAY
Parallel Efficiency (Hybrid runs)
To find the optimal serial version of MD, different combinations of the code blocks and compiler flags were tested. The results of the serial analysis formed the starting point for the parallel analysis.
Vampir was then successfully applied to the code, making it possible to visualize MPI and OpenMP parallelism. An important step for future parallel analysis is to embed PAPI counters directly in the MD code to avoid the overhead caused by VampirTrace. This enables a comparison of the actually achieved performance with the theoretical peak performance at larger core counts.
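A minimal sketch of such direct instrumentation using PAPI's high-level Fortran API (the counter choice and placement are our assumptions):

      ! Illustrative direct PAPI instrumentation around the main loop
      include 'f90papi.h'
      integer, parameter :: nev = 2
      integer :: events(nev) = (/ PAPI_FP_INS, PAPI_TOT_CYC /)
      integer :: check
      integer*8 :: counts(nev)

      call PAPIF_start_counters(events, nev, check)
      ! ... main particle-particle interaction loop ...
      call PAPIF_stop_counters(counts, nev, check)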
Additional tests with larger particle counts are planned to exploit the weak-scaling capabilities of the code. The hybrid implementation is to be run in different process/thread configurations to investigate whether fitting the MPI/OpenMP usage to the hardware characteristics can further increase performance.
Analysis tests will be re-run on Kraken with up to 16,384 cores. The goal is to detect scalability issues and to significantly improve parallel efficiency.
The MD code is currently being ported to GPGPUs. Vampir already supports the analysis of CUDA-parallelized code, which will help ensure that the ported code exploits the full potential of GPGPUs.
CONCLUSION AND FUTURE WORK
Planning → Implementation with VampirTrace → Testing → Evaluation with Vampir