Performance Evaluation of Hybrid MPI/OpenMP Implementation of a Lattice Boltzmann Application on...
Ashraf Bah Rabiou, Dr. Valerie E. Taylor, Dr. Xingfu Wu
- Computationally intensive applications can be parallelized to be executed on multicore systems to achieve better performance
- MPI and OpenMP are two popular programming paradigms that can be used for that purpose
- MPI and OpenMP can be combined in order to explore multiple levels of parallelism on multicore systems
Results on Hydra
- LBM is based on kinetic theory, which studies the fluid at a more fundamental level than the Navier-Stokes equations
- LBM is used for simulating fluid flows in computational physics, aerodynamics, material science, chemical engineering and aerospace engineering
- LBM is computationally intensive
- LBM easily exploits features of parallelism
- The MPI-only LBM was developed by the Aerospace department at TAMU
- It uses D3Q19 (a cubic solid with 19 velocities) as shown in this figure
MPI:
- Message passing model
- Process-level parallelism
- Communication library

OpenMP:
- Shared memory model
- Thread-level parallelism
- Compiler directives

Hybrid Implementation
- Scrutinize the original MPI-only program to detect the computationally intensive loops
- Use OpenMP to parallelize the loops to construct hybrid LBM
- Avoid data dependencies within the loops to be parallelized
- Determine the right scopes of the variables in order to maintain the accuracy of the program
- Hybrid LBM uses MPI for inter-node communication and OpenMP for intra-node parallelization to achieve multiple levels of parallelism
- Evaluate the performance of hybrid LBM and compare it with MPI LBM with increasing number of cores on two multicore systems
- Three datasets were used: 64x64x64, 128x128x128 and 256x256x256
- Use three performance metrics: execution time, speedup and efficiency for comparison
- Use PowerPack to collect power profiling data for energy consumption analysis
Configuration     Dori (CS Department, Virginia Tech)   Hydra (Supercomputing Facility at Texas A&M)
Number of nodes   8                                     52
CPUs per node     4                                     16
Cores per chip    2                                     2
CPU type          1.8 GHz AMD Opteron                   1.9 GHz IBM Power5+
Memory per node   6 GB                                  32 GB/node for 49 nodes; 64 GB/node for 3 nodes
MPI vs Hybrid execution times using 64x64x64
Summary and Conclusion
Chip Architecture of Hydra (IBM p5-575)
Chip Architecture of Dori
Specifications of Both Clusters
Acknowledgment
Experiment Platforms
Hybrid MPI/OpenMP Lattice Boltzmann Application
Methodology
Lattice Boltzmann Method (LBM)
Motivations and Goals
Results on Dori
MPI vs Hybrid on Dori using 64x64x64 dataset
MPI vs Hybrid execution times using 128x128x128
MPI vs Hybrid on Dori using 256x256x256 dataset
- The results above show that MPI LBM outperforms the hybrid on Hydra
- Because of strong scaling, the execution time decreases with increasing number of cores, hence the speedup increases with the number of cores
- For MPI LBM with the 64x64x64 dataset executed on more than 32 cores, the execution time starts increasing because of communication overhead
- Some data points are missing for 128x128x128 because of large memory requirements
- Because of large memory requirements, both hybrid and MPI LBM could not be run for the problem size of 256x256x256
- Implement a hybrid MPI/OpenMP Lattice Boltzmann application to explore multiple levels of parallelism on multicore systems
- Evaluate the performance of this hybrid implementation and compare with the existing MPI-only version on two different multicore systems, and analyze energy consumption
Goals
MPI vs Hybrid speedups using 64x64x64 MPI vs Hybrid speedups using 128x128x128
- The results above show that MPI-only outperforms hybrid on Dori, except using 32 cores for 64x64x64 dataset
- For each programming paradigm, the execution time decreases with increasing number of cores, hence the speedup increases with the number of cores
- For MPI LBM with 64x64x64 executed on 32 cores, the execution time starts increasing
- The energy consumption data shows that MPI LBM consumes less energy than hybrid LBM
Energy consumption data using 64x64x64 dataset
Motivations
- A hybrid version of the parallel LBM program was developed
- Our experimental results show that MPI performs better than hybrid on both multicore systems, Hydra and Dori
- Energy consumption results show that MPI consumes less energy than hybrid on Dori
- Due to large memory requirements, both hybrid and MPI LBM could not be run for large problem sizes such as 256x256x256 and 512x512x512
- Through this project, we learned parallel programming using OpenMP and MPI as well as performance analysis techniques
- I would like to thank Dr. Valerie E. Taylor, Dr. Xingfu Wu and Charles Lively for being awesome mentors and for providing me with a great deal of information and help necessary for the project.
- This research was supported by the Distributed Research Experience for Undergraduates (DREU) program, as well as the Research Experience for Undergraduates (REU) program at the Texas A&M University's Computer Science and Engineering department.
Cores  Version  Time    System (kJ)  CPU (kJ)  Memory (kJ)  Hard disk (kJ)  Motherboard (kJ)
1      MPI      66.666  11.242       6.766     1.234        0.555           0.916
1      Hybrid   76.066  12.741       7.669     1.406        0.638           1.044
2      MPI      37.934  6.337        3.787     0.703        0.307           0.518
2      Hybrid   43.782  8.158        4.708     1.068        0.365           0.601
4      MPI      22.418  3.710        2.224     0.421        0.190           0.310
4      Hybrid   30.022  6.337        3.682     0.818        0.243           0.411
8      MPI      17.724  6.189        3.731     0.667        0.296           0.489
8      Hybrid   21.045  8.629        5.246     0.916        0.354           0.584
16     MPI      12.524  9.529        5.595     1.177        0.412           0.693
16     Hybrid   13.248  10.534       6.276     1.229        0.455           0.738
32     MPI      15.161  21.637       12.784    2.526        1.039           1.683
32     Hybrid   11.929  17.903       10.723    2.088        0.822           1.327