
Performance Evaluation of Hybrid MPI/OpenMP Implementation of a Lattice Boltzmann Application on Multicore Systems
Department of Computer Science and Engineering, Texas A&M University

Ashraf Bah Rabiou ([email protected])   Dr. Valerie E. Taylor ([email protected])   Dr. Xingfu Wu ([email protected])

Motivations and Goals

Motivations
- Computationally intensive applications can be parallelized to run on multicore systems and achieve better performance.
- MPI and OpenMP are two popular programming paradigms that can be used for this purpose.
- MPI and OpenMP can be combined in order to explore multiple levels of parallelism on multicore systems.

Goals
- Implement a hybrid MPI/OpenMP Lattice Boltzmann application to explore multiple levels of parallelism on multicore systems.
- Evaluate the performance of this hybrid implementation, compare it with the existing MPI-only version on two different multicore systems, and analyze energy consumption.

Lattice Boltzmann Method (LBM)

- LBM is based on kinetic theory, which describes the fluid at a more fundamental level than the Navier-Stokes equations.
- LBM is used for simulating fluid flows in computational physics, aerodynamics, materials science, chemical engineering, and aerospace engineering.
- LBM is computationally intensive.
- LBM readily exposes parallelism.
- An MPI-only LBM code was developed by the Aerospace department at TAMU.
- It uses the D3Q19 lattice (a cubic cell with 19 discrete velocities), as shown in the figure; a sketch of the velocity set is given below.
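For reference, below is a minimal sketch of the D3Q19 discrete velocity set in C: one rest vector plus 6 axis-aligned and 12 face-diagonal directions. The array name and ordering are illustrative and are not taken from the TAMU code.

    /* D3Q19 velocity set: 1 rest vector, 6 axis-aligned vectors,
     * and 12 face-diagonal vectors (19 in total). Ordering is illustrative. */
    static const int c_d3q19[19][3] = {
        { 0, 0, 0},                                          /* rest           */
        { 1, 0, 0}, {-1, 0, 0}, { 0, 1, 0}, { 0,-1, 0},
        { 0, 0, 1}, { 0, 0,-1},                              /* axis-aligned   */
        { 1, 1, 0}, {-1,-1, 0}, { 1,-1, 0}, {-1, 1, 0},
        { 1, 0, 1}, {-1, 0,-1}, { 1, 0,-1}, {-1, 0, 1},
        { 0, 1, 1}, { 0,-1,-1}, { 0, 1,-1}, { 0,-1, 1}       /* face diagonals */
    };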

Hybrid MPI/OpenMP Lattice Boltzmann Application

MPI
- Message-passing model
- Process-level parallelism
- Communication library

OpenMP
- Shared-memory model
- Thread-level parallelism
- Compiler directives
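To make the contrast concrete, here is a minimal hybrid sketch (not taken from the LBM code) in which MPI provides process-level parallelism and OpenMP provides thread-level parallelism inside each process:

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank;

        /* Request an MPI library that tolerates OpenMP threads inside each rank. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Process-level parallelism across nodes (MPI ranks);
         * thread-level parallelism within a node (OpenMP threads). */
        #pragma omp parallel
        {
            printf("MPI rank %d, OpenMP thread %d of %d\n",
                   rank, omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }

A typical launch uses one MPI rank per node and one OpenMP thread per core, e.g. compile with mpicc -fopenmp and set OMP_NUM_THREADS before running mpirun.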

Hybrid Implementation
- Scrutinize the original MPI-only program to identify the computationally intensive loops.
- Use OpenMP to parallelize those loops to construct the hybrid LBM.
- Avoid data dependencies within the loops to be parallelized.
- Determine the right scope (shared or private) for each variable in order to maintain the accuracy of the program (see the sketch after this list).
- The hybrid LBM uses MPI for inter-node communication and OpenMP for intra-node parallelization to achieve multilevel parallelism.
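As an illustration of this approach, the sketch below shows how one computationally intensive lattice loop might be annotated with OpenMP; the array names, sizes, and the simplified relaxation step are placeholders, not the actual TAMU LBM code. Loop indices and temporaries are private, the lattice arrays are shared, and the iterations carry no data dependencies.

    /* Illustrative update loop over the local lattice block owned by one MPI rank. */
    void update_block(double *restrict f, double *restrict f_new,
                      int NX, int NY, int NZ, int Q, double omega)
    {
        #pragma omp parallel for collapse(2) default(none) \
                shared(f, f_new, NX, NY, NZ, Q, omega)
        for (int x = 0; x < NX; x++) {
            for (int y = 0; y < NY; y++) {
                for (int z = 0; z < NZ; z++) {
                    for (int q = 0; q < Q; q++) {
                        long idx = (((long)x * NY + y) * NZ + z) * Q + q;
                        /* Placeholder relaxation; the real collision operator
                         * would compute equilibrium distributions here. */
                        f_new[idx] = (1.0 - omega) * f[idx];
                    }
                }
            }
        }
    }

Using default(none) forces every variable's scope to be stated explicitly, which is one way to catch the scoping mistakes mentioned above.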

Methodology
- Evaluate the performance of the hybrid LBM and compare it with the MPI-only LBM for increasing numbers of cores on two multicore systems.
- Three datasets were used: 64x64x64, 128x128x128, and 256x256x256.
- Use three performance metrics for the comparison: execution time, speedup, and efficiency (defined below).
- Use PowerPack to collect power profiling data for the energy consumption analysis.
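For reference, the latter two metrics are computed from the measured execution times in the usual way, taking T_1 as the execution time on one core and T_p as the execution time on p cores:

    speedup:    S(p) = T_1 / T_p
    efficiency: E(p) = S(p) / p = T_1 / (p * T_p)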

Experiment Platforms

Specifications of Both Clusters

Configuration     Dori (CS Department, Virginia Tech)   Hydra (Supercomputing Facility at Texas A&M)
Number of nodes   8                                     52
CPUs per node     4                                     16
Cores per chip    2                                     2
CPU type          1.8 GHz AMD Opteron                   1.9 GHz IBM Power 5+
Memory per node   6 GB                                  32 GB/node for 49 nodes; 64 GB/node for 3 nodes

(Figures: chip architecture of Hydra (IBM p5-575); chip architecture of Dori.)

Results on Hydra

(Figures: MPI vs. hybrid execution times using 64x64x64; MPI vs. hybrid execution times using 128x128x128; MPI vs. hybrid speedups using 64x64x64; MPI vs. hybrid speedups using 128x128x128.)

- The results show that the MPI-only LBM outperforms the hybrid LBM on Hydra.
- Because this is a strong-scaling study, the execution time decreases as the number of cores increases, and hence the speedup increases with the number of cores.
- For the MPI-only LBM with the 64x64x64 dataset on more than 32 cores, the execution time starts to increase because of communication overhead.
- Some data points are missing for 128x128x128 because of large memory requirements.
- Because of large memory requirements, neither the hybrid nor the MPI-only LBM could be run for the 256x256x256 problem size.
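A rough back-of-the-envelope estimate shows why communication overhead takes over for the small dataset (assuming an even domain decomposition, which may differ from the actual code):

    64 x 64 x 64 = 262,144 lattice sites
    262,144 sites / 32 cores = 8,192 sites per core

Beyond roughly 32 cores, each core's share of the 64x64x64 lattice is so small that the per-step halo exchanges and MPI messages no longer shrink in proportion to the computation, so the total execution time starts to rise.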


Results on Dori

(Figures: MPI vs. hybrid on Dori using the 64x64x64 dataset; MPI vs. hybrid on Dori using the 256x256x256 dataset; MPI vs. hybrid speedups using 64x64x64 and 128x128x128. Energy consumption data for the 64x64x64 dataset are given in the table at the end.)

- The results show that the MPI-only LBM outperforms the hybrid LBM on Dori, except on 32 cores with the 64x64x64 dataset.
- For each programming paradigm, the execution time decreases as the number of cores increases, and hence the speedup increases with the number of cores.
- For the MPI-only LBM with the 64x64x64 dataset on 32 cores, the execution time starts to increase.
- The energy consumption data show that the MPI-only LBM consumes less energy than the hybrid LBM.

Summary and Conclusion

- A hybrid version of the parallel LBM program was developed.
- Our experimental results show that the MPI-only LBM performs better than the hybrid LBM on both multicore systems, Hydra and Dori.
- Energy consumption results show that the MPI-only LBM consumes less energy than the hybrid LBM on Dori.
- Due to large memory requirements, neither the hybrid nor the MPI-only LBM could be run for large problem sizes such as 256x256x256 and 512x512x512.
- Through this project, we learned parallel programming with OpenMP and MPI as well as performance analysis techniques.

Acknowledgment

- I would like to thank Dr. Valerie E. Taylor, Dr. Xingfu Wu, and Charles Lively for being awesome mentors and for providing me with a great deal of information and help for this project.
- This research was supported by the Distributed Research Experience for Undergraduates (DREU) program, as well as the Research Experience for Undergraduates (REU) program at Texas A&M University's Department of Computer Science and Engineering.

Energy consumption data using the 64x64x64 dataset on Dori

Cores  Version  Time    System (kJ)  CPU (kJ)  Memory (kJ)  Hard disk (kJ)  Motherboard (kJ)
1      MPI      66.666  11.242        6.766    1.234        0.555           0.916
1      Hybrid   76.066  12.741        7.669    1.406        0.638           1.044
2      MPI      37.934   6.337        3.787    0.703        0.307           0.518
2      Hybrid   43.782   8.158        4.708    1.068        0.365           0.601
4      MPI      22.418   3.710        2.224    0.421        0.190           0.310
4      Hybrid   30.022   6.337        3.682    0.818        0.243           0.411
8      MPI      17.724   6.189        3.731    0.667        0.296           0.489
8      Hybrid   21.045   8.629        5.246    0.916        0.354           0.584
16     MPI      12.524   9.529        5.595    1.177        0.412           0.693
16     Hybrid   13.248  10.534        6.276    1.229        0.455           0.738
32     MPI      15.161  21.637       12.784    2.526        1.039           1.683
32     Hybrid   11.929  17.903       10.723    2.088        0.822           1.327
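As a quick sanity check on the PowerPack data (assuming the time column is in seconds, which the poster does not state), energy is average power multiplied by execution time, so for the single-core MPI run:

    P_avg = E_system / T = 11.242 kJ / 66.666 s ≈ 169 W (average system power)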