AcuSolve Performance Benchmark and Profiling
October 2011
2
Note
• The following research was performed under the HPC
Advisory Council activities
– Participating vendors: AMD, Dell, Mellanox, Altair
– Compute resource: HPC Advisory Council Cluster Center
• For more info please refer to
– http://www.amd.com
– http://www.dell.com
– http://www.mellanox.com
– http://www.altairhyperworks.com/Product,54,AcuSolve.aspx
3
AcuSolve
• AcuSolve
– AcuSolve™ is a leading general-purpose finite element-based
Computational Fluid Dynamics (CFD) flow solver with superior robustness,
speed, and accuracy
– AcuSolve can be used by designers and research engineers with all levels
of expertise, either as a standalone product or seamlessly integrated into a
powerful design and analysis application
– With AcuSolve, users can quickly obtain quality solutions without iterating
on solution procedures or worrying about mesh quality or topology
4
Objectives
• The following was done to provide best practices
– AcuSolve performance benchmarking
– Understanding AcuSolve communication patterns
– Ways to increase AcuSolve productivity
– Network interconnects comparisons
• The presented results will demonstrate
– The scalability of the compute environment
– The capability of AcuSolve to achieve scalable productivity
– Considerations for performance optimizations
5
Test Cluster Configuration
• Dell™ PowerEdge™ C6145 6-node Quad-socket (288-core) cluster
– AMD™ Opteron™ 6174 (code name “Magny-Cours”) 12-core CPUs @ 2.2 GHz
– Memory: 128GB memory per node DDR3 1066MHz
• Mellanox ConnectX-3 VPI adapters for 56Gb/s FDR InfiniBand and 40Gb/s Ethernet
• Mellanox MTS3600Q 36-Port 40Gb/s QDR InfiniBand switch
• Fulcrum-based 10Gb/s Ethernet Switch
• OS: RHEL 6.1, MLNX-OFED 1.5.3 InfiniBand SW stack
• MPI: Platform MPI 7.1
• Application: AcuSolve 1.8a
• Benchmark workload: Pipe_fine, 2 meshes
– 350 axial nodes, 1.52 million mesh points total, 8.89 million tetrahedral elements
– 700 axial nodes, 3.04 million mesh points total, 17.8 million tetrahedral elements
• The pipe_fine test computes the steady state flow conditions for the turbulent flow (Re = 30000)
of water in a pipe with heat transfer. The pipe is 1 meter in length and 150 cm in diameter. Water
enters the inlet at room temperature conditions.
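For context, the Reynolds number quoted above follows the standard pipe-flow definition (the symbols below are the usual fluid properties, not values taken from the benchmark input):

Re = ρ U D / μ

where ρ is the water density, U the mean inlet velocity, D the pipe diameter, and μ the dynamic viscosity. Re = 30,000 is well above the commonly cited laminar-turbulent transition range (roughly Re ≈ 2,300–4,000), so treating the flow as fully turbulent is appropriate.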
6
About Dell PowerEdge™ Platform Advantages
• Best-of-breed technologies and partners
– The combination of the AMD™ Opteron™ 6100 series platform and Mellanox ConnectX InfiniBand on Dell HPC solutions provides the ultimate platform for speed and scale
– Dell PowerEdge C6145 system delivers 8-socket performance in a dense 2U form factor
– Up to 48 cores/32 DIMMs per server – 2,016 cores in a 42U enclosure
• Integrated stacks designed to deliver the best price/performance/watt
– 2x more memory and processing power in half of the space
– Energy-optimized low-flow fans, improved power supplies and dual SD modules
• Optimized for long-term capital and operating investment protection
– System expansion
– Component upgrades and feature releases
7
AcuSolve Performance – Threads Per Node
• AcuSolve allows running in MPI-thread hybrid mode
– Allows the MPI process to focus on message passing while threads handle the computation (see the sketch below)
• The optimal thread count differs between the datasets
– Using 12 threads per node is the most optimal for the dataset with 350 axial nodes
– Using 24 threads per node is the most optimal for the dataset with 700 axial nodes
[Chart: AcuSolve performance vs. threads per node – 6 nodes, InfiniBand QDR; higher is better]
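Conceptually, the hybrid mode works as in the minimal mpi4py sketch below. This is a generic illustration, not AcuSolve's actual implementation; the thread count, mesh partition, and compute function are invented for the example.

```python
# Minimal sketch of the MPI-thread hybrid pattern: each MPI process handles
# message passing while a local thread pool performs the computation.
# Generic mpi4py/NumPy illustration -- not AcuSolve code.
from concurrent.futures import ThreadPoolExecutor

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

THREADS_PER_PROCESS = 12                 # e.g. the sweet spot found for the 350-axial-node case
local_part = np.random.rand(1_000_000)   # stand-in for this rank's mesh partition

def compute(block):
    # Stand-in for per-thread element work (NumPy releases the GIL for this)
    return np.sqrt(block).sum()

with ThreadPoolExecutor(max_workers=THREADS_PER_PROCESS) as pool:
    partial = sum(pool.map(compute, np.array_split(local_part, THREADS_PER_PROCESS)))

# The MPI process itself handles all message passing, e.g. a global reduction
total = comm.allreduce(partial, op=MPI.SUM)
if rank == 0:
    print("global result:", total)
```

Run under an MPI launcher (e.g. one process per node with mpirun), this mirrors the "MPI for messages, threads for compute" split described above.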
8
AcuSolve Performance – Interconnect
[Chart: AcuSolve performance by interconnect – 48 cores/node; higher is better]
• InfiniBand QDR delivers the best performance for AcuSolve
– Up to 75% better performance than 10GigE on 6 nodes (12 threads per node)
– Up to 99% better performance than 1GigE on 6 nodes (12 threads per node)
• Network bandwidth enables AcuSolve to scale
– Higher throughput allows AcuSolve to achieve higher productivity
9
AcuSolve Performance – CPU Frequency
• Higher CPU core frequency enables higher job performance
– 28% more jobs are produced by running the CPU cores at 2200MHz instead of 1800MHz
– Increases in CPU core frequency directly improve overall job performance
[Chart: AcuSolve performance at 1800MHz vs. 2200MHz – 48 threads/node; higher is better]
10
AcuSolve Profiling – MPI/User Time Ratio
• Communication time plays a major role for AcuSolve
– Communication occupies the majority of the run time beyond 4 nodes for this benchmark (see the measurement sketch below)
– A high-speed interconnect becomes crucial as the node count grows
[Chart: MPI time vs. user time ratio by node count – 48 threads/node]
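The MPI/user ratio shown here can be approximated in one's own runs by timing communication and compute sections separately, as in the generic mpi4py sketch below (the compute and reduction placeholders are invented for the illustration):

```python
# Rough sketch of splitting wall-clock time into "user" (compute) time and
# "MPI" (communication) time inside an iterative loop -- similar in spirit
# to what an MPI profiler reports. Generic illustration, not AcuSolve code.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
mpi_time = 0.0
user_time = 0.0
global_res = np.zeros(1)

for it in range(100):
    t0 = MPI.Wtime()
    local_res = np.linalg.norm(np.random.rand(100_000))   # stand-in for solver work
    user_time += MPI.Wtime() - t0

    t0 = MPI.Wtime()
    comm.Allreduce(np.array([local_res]), global_res, op=MPI.SUM)  # stand-in for communication
    mpi_time += MPI.Wtime() - t0

if comm.Get_rank() == 0:
    print(f"MPI fraction of run time: {mpi_time / (mpi_time + user_time):.1%}")
```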
11
AcuSolve Profiling – MPI/User Run Time
• InfiniBand reduces the CPU overhead of processing network data
– Better network communication reduces the time spent in both computation and communication
– InfiniBand offloads network transfers to the HCA, which allows the CPU to focus on computation
• The Ethernet solutions cause the job to run slower
[Chart: MPI vs. user run time by interconnect – 12 threads/node]
12
AcuSolve Profiling – Number of MPI Calls
• The most used MPI functions are for data transfers – MPI_Recv and MPI_Isend
– Reflects that AcuSolve is communication-intensive and requires good network throughput (see the exchange sketch below)
• The number of calls increases proportionally as the cluster scales
[Chart: number of MPI calls by function and node count – 48 threads/node]
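A heavy MPI_Isend/MPI_Recv count is characteristic of neighbor (halo) exchanges in a domain-decomposed solver; the sketch below shows that point-to-point pattern in generic mpi4py form (the ring neighbors and buffer size are invented for the illustration):

```python
# Minimal sketch of a non-blocking send / blocking receive exchange between
# neighboring ranks -- the point-to-point pattern behind the MPI_Isend/MPI_Recv
# counts above. Generic illustration, not AcuSolve code.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

left = (rank - 1) % size                  # made-up ring neighbors for the example
right = (rank + 1) % size

send_halo = np.full(32, rank, dtype=np.float64)   # small buffer (256 B of doubles)
recv_halo = np.empty(32, dtype=np.float64)

req = comm.Isend(send_halo, dest=right, tag=0)    # MPI_Isend: post the send, don't wait
comm.Recv(recv_halo, source=left, tag=0)          # MPI_Recv: blocking receive from neighbor
req.Wait()                                        # complete the outstanding send

print(f"rank {rank} received halo data from rank {left}")
```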
13
AcuSolve Profiling – Time Spent in MPI Calls
• The time in communication is spent in the following MPI functions:
– InfiniBand: MPI_Allreduce (41%), MPI_Recv (30%), MPI_Barrier (24%)
– 10GigE: MPI_Allreduce (58%), MPI_Recv (32%), MPI_Barrier (9%)
– 1GigE: MPI_Recv (54%), MPI_Barrier (29%), MPI_Allreduce (16%)
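MPI_Allreduce typically corresponds to global reductions such as residual norms or dot products in an iterative solver; a minimal generic mpi4py sketch of that collective follows (the local contribution is invented for the illustration):

```python
# Minimal sketch of the MPI_Allreduce pattern that dominates the InfiniBand profile:
# every rank contributes a local partial sum and all ranks receive the global result.
# Generic illustration, not AcuSolve code.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

local_sq = np.array([(np.random.rand(100_000) ** 2).sum()])  # stand-in: local sum of squared residuals
global_sq = np.zeros(1)

comm.Allreduce(local_sq, global_sq, op=MPI.SUM)   # every rank gets the global sum
residual_norm = float(np.sqrt(global_sq[0]))

if comm.Get_rank() == 0:
    print("global residual norm:", residual_norm)
```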
14
AcuSolve Profiling – MPI Message Sizes
• The majority of the MPI messages are small to medium in size
– In the range between 0B and 256B
• The message size distributions are very similar between the two datasets
– The dataset with 700 axial nodes has a much larger number of messages
15
AcuSolve Profiling – Data Transfer By Process
• The data transferred to each MPI rank is not evenly distributed
– Data transfer to the ranks is “mirrored” according to the rank numbers
– The amount of data grows as the cluster scales
– From around 20GB max per rank on 4 nodes up to around 80GB per rank on 6 nodes
16
AcuSolve Profiling – Aggregated Data Transfer
• Aggregated data transfer refers to:
– The total amount of data transferred over the network between all MPI ranks collectively
• The total data transfer increases sharply as the cluster scales
– For both datasets, a sizable amount of data is sent and received across the network
– As compute nodes are added, more data communication generally takes place
[Chart: aggregated data transfer by node count – InfiniBand QDR]
17
Summary
• AcuSolve is a CFD application that has the capability to scale to many nodes
• MPI-thread hybrid mode:
– Allows the MPI process to focus on message passing while threads handle the computation
– Selecting a suitable thread count can have a huge impact on performance and productivity
• CPU:
– AcuSolve has a high demand for good CPU utilization
– Higher CPU core frequency allows AcuSolve to achieve higher performance
• Interconnects:
– InfiniBand QDR can deliver great network throughput needed for scaling to many nodes
– 10GigE and 1GigE take away CPU runtime for handling network transfers
– Interconnect becomes crucial after 4 nodes as more time is spent on MPI for these datasets
• Profiling:
– A sizable amount of data is exchanged over the network
– MPI calls are mostly concentrated on data transfers rather than on data synchronization
18
Thank You HPC Advisory Council
All trademarks are property of their respective owners. All information is provided “As-Is” without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy and completeness of the information contained herein. The HPC Advisory Council and Mellanox undertake no duty and assume no obligation to update or correct any information presented herein.