Page 1:

AcuSolve Performance Benchmark and Profiling

October 2011

Page 2:

Note

• The following research was performed under the HPC Advisory Council activities

– Participating vendors: AMD, Dell, Mellanox, Altair

– Compute resource: HPC Advisory Council Cluster Center

• For more information, please refer to

– http://www.amd.com

– http://www.dell.com

– http://www.mellanox.com

– http://www.altairhyperworks.com/Product,54,AcuSolve.aspx

Page 3:

AcuSolve

• AcuSolve

– AcuSolve™ is a leading general-purpose finite element-based Computational Fluid Dynamics (CFD) flow solver with superior robustness, speed, and accuracy

– AcuSolve can be used by designers and research engineers of all levels of expertise, either as a standalone product or seamlessly integrated into a powerful design and analysis application

– With AcuSolve, users can quickly obtain quality solutions without iterating on solution procedures or worrying about mesh quality or topology

Page 4:

Objectives

• The following was done to provide best practices

– AcuSolve performance benchmarking

– Understanding AcuSolve communication patterns

– Ways to increase AcuSolve productivity

– Network interconnect comparisons

• The presented results will demonstrate

– The scalability of the compute environment

– The capability of AcuSolve to achieve scalable productivity

– Considerations for performance optimizations

Page 5:

Test Cluster Configuration

• Dell™ PowerEdge™ C6145 6-node quad-socket (288-core) cluster

– AMD™ Opteron™ 6174 (code name "Magny-Cours") 12-core CPUs @ 2.2 GHz

– Memory: 128GB DDR3 1066MHz per node

• Mellanox ConnectX-3 VPI adapters for 56Gb/s FDR InfiniBand and 40Gb/s Ethernet

• Mellanox MTS3600Q 36-port 40Gb/s QDR InfiniBand switch

• Fulcrum-based 10Gb/s Ethernet switch

• OS: RHEL 6.1, MLNX-OFED 1.5.3 InfiniBand software stack

• MPI: Platform MPI 7.1

• Application: AcuSolve 1.8a

• Benchmark workload: Pipe_fine, 2 meshes

– 350 axial nodes: 1.52 million mesh points, 8.89 million tetrahedral elements

– 700 axial nodes: 3.04 million mesh points, 17.8 million tetrahedral elements

• The pipe_fine test computes the steady-state flow conditions for the turbulent flow (Re = 30000) of water in a pipe with heat transfer. The pipe is 1 meter in length and 150 cm in diameter. Water enters the inlet at room temperature conditions.

Page 6:

About Dell PowerEdge™ Platform Advantages

Best of breed technologies and partners

• The combination of the AMD™ Opteron™ 6100 series platform and Mellanox ConnectX InfiniBand on Dell HPC Solutions provides the ultimate platform for speed and scale

• Dell PowerEdge C6145 system delivers 8-socket performance in a dense 2U form factor

• Up to 48 cores/32 DIMMs per server – 2,016 cores in a 42U enclosure

Integrated stacks designed to deliver the best price/performance/watt

• 2x more memory and processing power in half the space

• Energy-optimized low-flow fans, improved power supplies and dual SD modules

Optimized for long-term capital and operating investment protection

• System expansion

• Component upgrades and feature releases

Page 7:

AcuSolve Performance – Threads Per Node

• AcuSolve allows running in MPI-thread hybrid mode

– Allows the MPI processes to focus on message passing while threads handle the computation (see the sketch below)

• The optimal thread count differs between the datasets

– Using 12 threads per node is optimal for the dataset with 350 axial nodes

– Using 24 threads per node is optimal for the dataset with 700 axial nodes

[Chart: jobs per day vs. threads per node on 6 nodes, InfiniBand QDR; higher is better]
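As an illustration of the hybrid split described above, here is a minimal MPI+OpenMP sketch in C. It shows the generic pattern only, not AcuSolve's actual threading code: MPI_THREAD_FUNNELED keeps message passing on the main thread while an OpenMP team handles the computation.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    /* FUNNELED: only the main thread makes MPI calls, matching the
       "MPI for messaging, threads for computation" division of labor. */
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED)
        MPI_Abort(MPI_COMM_WORLD, 1);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Threads handle the compute-heavy loop (illustrative work only)... */
    double local_sum = 0.0;
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < 1000000; i++)
        local_sum += 1e-6 * i;

    /* ...while the MPI rank performs the message passing. */
    double global_sum;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %g, threads per rank = %d\n",
               global_sum, omp_get_max_threads());

    MPI_Finalize();
    return 0;
}

Launched with one rank per node and OMP_NUM_THREADS set to 12, 24, or 48, this mirrors the per-node thread counts compared in the chart above.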

Page 8:

AcuSolve Performance – Interconnect


• InfiniBand QDR delivers the best performance for AcuSolve

– Up to 75% better performance than 10GigE at 6 nodes (12 threads per node)

– Up to 99% better performance than 1GigE at 6 nodes (12 threads per node)

• Network bandwidth enables AcuSolve to scale

– Higher throughput allows AcuSolve to achieve higher productivity (see the bandwidth sketch below)

[Chart: normalized productivity for 1GigE, 10GigE, and InfiniBand QDR by node count, 48 cores/node; higher is better; callouts: 34%, 67%, 75%, 99%]
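The throughput differences behind these numbers are typically quantified with a point-to-point ping-pong microbenchmark between two ranks on different nodes. A minimal C sketch follows; the message size and iteration count are arbitrary illustrative choices, not values from this study.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MSG_BYTES (1 << 20)   /* 1 MiB per message (illustrative) */
#define ITERS     100

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Run with 2 ranks, one per node, to exercise the interconnect. */
    char *buf = malloc(MSG_BYTES);
    memset(buf, 0, MSG_BYTES);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double dt = MPI_Wtime() - t0;

    if (rank == 0)   /* 2 transfers of MSG_BYTES per iteration */
        printf("bandwidth: %.1f MB/s\n",
               2.0 * ITERS * MSG_BYTES / dt / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}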

Page 9:

AcuSolve Performance – CPU Frequency

• Higher CPU core frequency enables higher job performance

– 28% more jobs produced when running the CPU cores at 2200MHz instead of 1800MHz, slightly more than the roughly 22% frequency increase itself

– Increases in CPU core frequency directly improve the overall job performance

[Chart: jobs per day at 1800MHz vs. 2200MHz core frequency, 48 threads/node; higher is better; 28% gain shown]

Page 10:

AcuSolve Profiling – MPI/User Time Ratio

• Communication time plays a major role in AcuSolve's run time

– Communication occupies the majority of the run time beyond 4 nodes for this benchmark

– A high-speed interconnect becomes crucial as the node count grows (a timing sketch follows the chart below)

[Chart: MPI time vs. user time percentage by node count, 48 threads/node]
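The MPI/user split shown in this chart is what MPI profilers report; the same ratio can be approximated by hand with MPI_Wtime stopwatches around the compute and communication phases, as in this minimal C sketch (the loop bounds are arbitrary illustrative work, not AcuSolve's solver loop).

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t_user = 0.0, t_mpi = 0.0;
    double x = 0.0, sum = 0.0;

    for (int iter = 0; iter < 50; iter++) {
        /* "User" time: local computation. */
        double t0 = MPI_Wtime();
        for (int i = 0; i < 2000000; i++)
            x += 1e-9 * i;
        t_user += MPI_Wtime() - t0;

        /* "MPI" time: everything spent inside MPI calls. */
        t0 = MPI_Wtime();
        MPI_Allreduce(&x, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        t_mpi += MPI_Wtime() - t0;
    }

    if (rank == 0)
        printf("MPI %.0f%% / user %.0f%% of run time\n",
               100.0 * t_mpi / (t_mpi + t_user),
               100.0 * t_user / (t_mpi + t_user));

    MPI_Finalize();
    return 0;
}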

Page 11:

AcuSolve Profiling – MPI/User Run Time

• InfiniBand reduces the CPU overhead of processing network data

– Better network communication reduces the time spent in both computation and communication

– InfiniBand offloads network transfers to the HCA, freeing the CPU to focus on computation

• The Ethernet solutions cause the job to run slower

[Chart: MPI and user run time for 1GigE, 10GigE, and InfiniBand QDR, 12 threads/node]

Page 12:

AcuSolve Profiling – Number of MPI Calls

• The most used MPI functions are for data transfers: MPI_Recv and MPI_Isend

– Reflects that AcuSolve is communication-intensive and requires good network throughput (a sketch of the Isend/Recv pattern follows below)

• The number of calls increases proportionally as the cluster scales

[Chart: number of MPI calls by function and node count, 48 threads/node]
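The dominance of MPI_Isend/MPI_Recv is characteristic of neighbor (halo) exchange in domain-decomposed solvers. Here is a minimal C sketch of that pairing in a ring of ranks; it is an illustrative toy, not AcuSolve's actual communication layer.

#include <mpi.h>
#include <stdio.h>

#define N 1024   /* boundary values per exchange (illustrative) */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double send_buf[N], recv_buf[N];
    for (int i = 0; i < N; i++) send_buf[i] = rank;

    int next = (rank + 1) % size;
    int prev = (rank - 1 + size) % size;

    /* Post the non-blocking send first, then block on the matching
       receive: the Isend/Recv pairing that dominates the call counts. */
    MPI_Request req;
    MPI_Isend(send_buf, N, MPI_DOUBLE, next, 0, MPI_COMM_WORLD, &req);
    MPI_Recv(recv_buf, N, MPI_DOUBLE, prev, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    printf("rank %d got boundary data from rank %d\n", rank, prev);
    MPI_Finalize();
    return 0;
}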

Page 13:

AcuSolve Profiling – Time Spent in MPI Calls

• Communication time is concentrated in the following MPI functions (see the MPI_Allreduce sketch below):

– InfiniBand: MPI_Allreduce (41%), MPI_Recv (30%), MPI_Barrier (24%)

– 10GigE: MPI_Allreduce (58%), MPI_Recv (32%), MPI_Barrier (9%)

– 1GigE: MPI_Recv (54%), MPI_Barrier (29%), MPI_Allreduce (16%)
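The large MPI_Allreduce share is consistent with the global dot products and residual norms that iterative solvers evaluate every iteration; every rank needs the result to test convergence, hence Allreduce rather than Reduce. A minimal C sketch of that use, with illustrative sizes and values:

#include <mpi.h>
#include <math.h>
#include <stdio.h>

#define N_LOCAL 100000   /* local slice of the residual (illustrative) */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each rank owns a slice of the distributed residual vector. */
    static double r[N_LOCAL];
    for (int i = 0; i < N_LOCAL; i++) r[i] = 1e-3;

    /* Local partial dot product... */
    double local = 0.0;
    for (int i = 0; i < N_LOCAL; i++) local += r[i] * r[i];

    /* ...combined into a global L2 norm available on every rank.
       Called once per solver iteration, so its cost accumulates. */
    double global;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("||r|| = %e\n", sqrt(global));

    MPI_Finalize();
    return 0;
}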

Page 14:

AcuSolve Profiling – MPI Message Sizes

• The majority of MPI messages are small to medium in size

– In the range between 0B and 256B

• The message-size distributions are very similar for the two datasets

– The dataset with 700 axial nodes generates a much larger number of messages

Page 15:

AcuSolve Profiling – Data Transfer By Process

• Data transferred to each MPI rank is not evenly distributed

– Data transfer per rank is "mirrored" according to the rank numbers

– The amount of data grows as the cluster scales

– From around 20GB max per rank at 4 nodes up to around 80GB per rank at 6 nodes

Page 16:

AcuSolve Profiling – Aggregated Data Transfer

• Aggregated data transfer refers to:

– The total amount of data transferred over the network between all MPI ranks collectively

• The total data transfer grows sharply as the cluster scales

– For both datasets, a sizable amount of data is sent and received across the network

– As compute nodes are added, more data communication generally takes place

[Chart: aggregated data transfer by node count for both datasets, InfiniBand QDR]

Page 17:

Summary

• AcuSolve is a CFD application that has the capability to scale to many nodes

• MPI-thread hybrid mode:

– Allows the MPI processes to focus on message passing while threads handle the computation

– Selecting a suitable thread count can have a huge impact on performance and productivity

• CPU:

– AcuSolve has a high demand for good CPU utilization

– Higher CPU core frequency allows AcuSolve to achieve higher performance

• Interconnects:

– InfiniBand QDR can deliver great network throughput needed for scaling to many nodes

– 10GigE and 1GigE take away CPU time for handling network transfers

– Interconnect becomes crucial after 4 nodes as more time is spent on MPI for these datasets

• Profiling:

– A sizable amount of data is exchanged over the network

– MPI calls are mostly concentrated in data transfers rather than data synchronization

Page 18:

Thank You HPC Advisory Council

All trademarks are property of their respective owners. All information is provided "As-Is" without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy and completeness of the information contained herein. The HPC Advisory Council undertakes no duty and assumes no obligation to update or correct any information presented herein.