RADIOSS 13.0 Performance Benchmark and Profiling
April 2015
2
Note
• The following research was performed under the HPC Advisory Council activities
– Participating vendors: Intel, Dell, Mellanox
– Compute resource - HPC Advisory Council Cluster Center
• The following was done to provide best practices
– RADIOSS performance overview
– Understanding RADIOSS communication patterns
– Ways to increase RADIOSS productivity
– MPI libraries comparisons
• For more info please refer to
– http://www.altair.com
– http://www.dell.com
– http://www.intel.com
– http://www.mellanox.com
3
Objectives
• The following was done to provide best practices
– RADIOSS performance benchmarking
– Interconnect performance comparisons
– MPI performance comparison
– Understanding RADIOSS communication patterns
• The presented results will demonstrate
– The scalability of the compute environment to provide nearly linear application scalability
– The capability of RADIOSS to achieve scalable productivity
4
RADIOSS by Altair
• Altair RADIOSS
– Structural analysis solver for highly non-linear problems under dynamic loadings
– Consists of features for:
• multiphysics simulation and advanced materials such as composites
– Highly differentiated for Scalability, Quality and Robustness
• RADIOSS is used across all industries worldwide
– Improves crashworthiness, safety, and manufacturability of structural designs
• RADIOSS has established itself as an industry standard
– for automotive crash and impact analysis for over 20 years
5
Test Cluster Configuration
• Dell™ PowerEdge™ R730 32-node (896-core) “Thor” cluster
– Dual-Socket 14-core Intel E5-2697v3 @ 2.60 GHz CPUs (Turbo on, Max Perf set in BIOS)
– OS: RHEL 6.5, OFED MLNX_OFED_LINUX-2.4-1.0.5 InfiniBand SW stack
– Memory: 64GB memory, DDR3 2133 MHz
– Hard Drives: 1TB 7.2K RPM SATA 2.5”
• Mellanox Switch-IB SB7700 100Gb/s InfiniBand VPI switch
• Mellanox ConnectX-4 EDR 100Gb/s InfiniBand VPI adapters
• Mellanox ConnectX-3 40/56Gb/s QDR/FDR InfiniBand VPI adapters
• Mellanox SwitchX SX6036 56Gb/s FDR InfiniBand VPI switch
• MPI: Intel MPI 5.0.2, Mellanox HPC-X v1.2.0
• Application: Altair RADIOSS 13.0
• Benchmark datasets:
– Neon benchmarks: 1 million elements (8ms, Double Precision), unless otherwise stated
6
About Intel® Cluster Ready
• Intel® Cluster Ready systems make it practical to use a cluster to increase
your simulation and modeling productivity
– Simplifies selection, deployment, and operation of a cluster
• A single architecture platform supported by many OEMs, ISVs, cluster
provisioning vendors, and interconnect providers
– Focus on your work productivity, spend less management time on the cluster
• Select Intel Cluster Ready
– Where the cluster is delivered ready to run
– Hardware and software are integrated and configured together
– Applications are registered, validating execution on the Intel Cluster Ready
architecture
– Includes Intel® Cluster Checker tool, to verify functionality and periodically check
cluster health
• RADIOSS is Intel Cluster Ready
7
PowerEdge R730: Massive flexibility for data-intensive operations
• Performance and efficiency
– Intelligent hardware-driven systems management
with extensive power management features
– Innovative tools including automation for
parts replacement and lifecycle manageability
– Broad choice of networking technologies from GbE to IB
– Built in redundancy with hot plug and swappable PSU, HDDs and fans
• Benefits
– Designed for performance workloads
• from big data analytics, distributed storage or distributed computing
where local storage is key to classic HPC and large scale hosting environments
• High performance scale-out compute and low cost dense storage in one package
• Hardware Capabilities
– Flexible compute platform with dense storage capacity
• 2S/2U server, 6 PCIe slots
– Large memory footprint (Up to 768GB / 24 DIMMs)
– High I/O performance and optional storage configurations
• HDD options: 12 x 3.5” - or - 24 x 2.5” + 2 x 2.5” HDDs in rear of server
• Up to 26 HDDs with 2 hot plug drives in rear of server for boot or scratch
8
RADIOSS Performance – Interconnect (MPP)
• EDR InfiniBand provides better scalability performance than Ethernet
– 70 times better performance than 1GbE at 16 nodes / 448 cores
– 4.8x better performance than 10GbE at 16 nodes / 448 cores
– Ethernet solutions do not scale beyond 4 nodes with pure MPI
[Chart: MPP interconnect comparison; Intel MPI, 28 processes/node; higher is better; 70x vs 1GbE, 4.8x vs 10GbE]
9
RADIOSS Performance – Interconnect (MPP)
• EDR InfiniBand provides better scalability performance
– EDR InfiniBand improves over QDR IB by 28% at 16 nodes / 448 cores
– Similarly, EDR InfiniBand outperforms FDR InfiniBand by 25% at 16 nodes
[Chart: MPP interconnect comparison; 28 processes/node; higher is better; 28% vs QDR IB, 25% vs FDR IB]
10
RADIOSS Performance – CPU Cores
• Running more cores per node generally improves overall performance
– An improvement of 18% was seen from 20 to 28 cores per node at 8 nodes
– The improvement is not as consistent at higher node counts
• Guideline: the optimal workload distribution is ~4000 elements per process
– For this test case of 1 million elements, the optimal size is ~256 cores
– 4000 elements per process provides sufficient workload for each process
– Hybrid MPP (HMPP) provides a way to achieve additional scalability on more CPUs
[Chart: cores-per-node comparison; Intel MPI; higher is better; 6% and 18% gains shown]
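The ~4000 elements/process guideline above can be turned into a quick sizing calculation; a minimal sketch (variable names are illustrative, and the result is then typically rounded to a convenient core count such as ~256):

```shell
# Sketch: estimate the MPI process count from the ~4000
# elements/process guideline quoted above (names are illustrative).
elements=1000000           # model size, e.g. the 1M-element Neon case
elements_per_rank=4000     # workload guideline from this study
ranks=$(( elements / elements_per_rank ))
echo "Suggested MPI ranks: ${ranks}"   # prints 250 for this case
```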
11
RADIOSS Performance – Simulation Time
• Increasing simulation time increases the run time at a faster rate
– Extending an 8ms simulation to 80ms can result in a much longer runtime
– A 10x longer simulation can result in a 14x increase in runtime
– Contacts usually become more severe in the middle of the run, which increases complexity and CPU utilization, so the cost per cycle rises
[Chart: simulation-time comparison; Intel MPI; higher is better; 13x to 14x runtime increases shown]
12
RADIOSS Profiling – % Time Spent on MPI
• RADIOSS utilizes point-to-point communications for most data transfers
• The most time-consuming MPI calls are MPI_Recv() and MPI_Waitany()
– MPI_Recv (55%), MPI_Waitany (23%), MPI_Allreduce (13%)
[Charts: MPI time breakdown; MPP mode, 28 processes/node]
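A breakdown like the one above can be gathered with Intel MPI's built-in statistics facility; a hedged sketch, assuming an Intel MPI 5.x environment (the mpirun arguments and the RADIOSS executable/input names are illustrative only):

```shell
# Hedged sketch: collect an IPM-style MPI time breakdown with
# Intel MPI's native statistics. The RADIOSS binary and input
# deck names below are illustrative, not from this study.
export I_MPI_STATS=ipm
export I_MPI_STATS_FILE=radioss_mpi_profile.txt
mpirun -ppn 28 -np 448 ./e_13.0_linux64 -i NEON1M_0000.rad
# The profile file then lists per-call time shares
# (e.g. MPI_Recv, MPI_Waitany, MPI_Allreduce).
```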
13
RADIOSS Performance – IMPI Tuning (MPP)
• Tuning the Intel MPI collective algorithms can improve performance
– MPI profile shows about 20% of runtime spent on MPI_Allreduce communications
– Default algorithm in Intel MPI is Recursive Doubling
– The default algorithm is the best among all tested for MPP
[Chart: Intel MPI Allreduce algorithm comparison; 28 processes/node; higher is better]
14
RADIOSS Performance – MPI Libraries (MPP)
• HPC-X and Intel MPI perform similarly
– The MPI profile shows about 20% of MPI time is spent on Allreduce communications
– MPI collective operations (such as Allreduce) can potentially be optimized
– Support for Open MPI is new in RADIOSS
– HPC-X is a tuned MPI distribution based on the latest Open MPI
[Chart: HPC-X vs Intel MPI; 28 processes/node; higher is better]
15
RADIOSS Hybrid MPP Parallelization
• Highly parallel code
– Multi-level parallelization
– Domain decomposition MPI parallelization
– Multithreading OpenMP
• Enhanced performance
– Best scalability in the marketplace
– High efficiency on large HPC clusters
– Unique, proven method for rich scalability over thousands of cores for FEA
– Flexibility -- easy tuning of MPI & OpenMP
– Robustness -- parallel arithmetic allows perfect repeatability in parallel
16
RADIOSS Performance – Hybrid MPP version
• Enabling Hybrid MPP mode unlocks RADIOSS scalability
– At larger scale, productivity improves as more threads are involved
– As more threads are used, the amount of inter-process communication is reduced
– At 32 nodes / 896 cores, the best configuration is 1 process per socket spawning 14 threads each
– 28 threads at 1 PPN is not advised, as it breaks data locality across CPU sockets
• The following environment setting and tuned flags are used for Intel MPI:
– I_MPI_PIN_DOMAIN auto
– I_MPI_ADJUST_ALLREDUCE 5
– I_MPI_ADJUST_BCAST 1
– KMP_AFFINITY compact
– KMP_STACKSIZE 400m
– ulimit -s unlimited
[Chart: MPP vs Hybrid MPP scalability; Intel MPI, EDR InfiniBand; 3.7x, 32%, and 70% gains shown]
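Put together, the environment settings listed above map to a launch script along these lines; a sketch only: the RADIOSS executable name, input deck, and node count are assumptions to adapt to your installation:

```shell
#!/bin/sh
# Hybrid MPP launch sketch using the Intel MPI settings listed above.
export I_MPI_PIN_DOMAIN=auto        # pin each rank to its own domain
export I_MPI_ADJUST_ALLREDUCE=5     # tuned Allreduce algorithm
export I_MPI_ADJUST_BCAST=1         # tuned Bcast algorithm
export KMP_AFFINITY=compact         # pack OpenMP threads close together
export KMP_STACKSIZE=400m
export OMP_NUM_THREADS=14           # 14 threads per rank (1 rank/socket)
ulimit -s unlimited

# 32 nodes x 2 ranks/node x 14 threads = 896 cores
# (binary and input names below are illustrative):
mpirun -ppn 2 -np 64 ./e_13.0_linux64 -i NEON1M_0000.rad
```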
17
RADIOSS Profiling – Number of MPI Calls
• MPP mostly uses non-blocking calls for communications
– MPI_Recv, MPI_Waitany, and MPI_Allreduce are used most of the time
• For HMPP, the communication behavior changes
– A higher percentage of time is spent in MPI_Waitany, MPI_Allreduce, and MPI_Recv
• MPI communication behavior has changed from the previous RADIOSS version
– Most likely due to more CPU cores being available on the current cluster
[Charts: MPI call counts at 32 nodes; MPP 28 PPN vs HMPP 2 PPN / 14 threads]
18
RADIOSS Profiling – MPI Message Sizes
• The most time consuming MPI communications are:
– MPI_Recv: Messages concentrated at 640B, 1KB, 320B, 1280B
– MPI_Waitany: Messages are: 48B, 8B, 384B
– MPI_Allreduce: Most messages appear at 80B
[Charts: MPI message-size distribution; MPP 28 PPN vs HMPP 2 PPN / 14 threads]
19
RADIOSS Performance – Intel MPI Tuning (DP)
• For Hybrid MPP DP, tuning MPI_Allreduce shows more gain than MPP
– For DAPL provider, Binomial gather+scatter #5 improved perf by 27% over default
– For OFA provider, tuned MPI_Allreduce algorithm improves by 44% over default
– Both OFA and DAPL improved by tuning I_MPI_ADJUST_ALLREDUCE=5
– Flags for OFA: I_MPI_OFA_USE_XRC 1. For DAPL: ofa-v2-mlx5_0-1u provider
[Chart: Intel MPI Allreduce tuning, DP; 2 PPN / 14 OpenMP; higher is better; 27% (DAPL), 44% (OFA)]
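The two provider setups compared above would be selected roughly as follows; a hedged sketch, since fabric-selection syntax should be verified against the installed Intel MPI release:

```shell
# OFA provider with XRC, plus the tuned Allreduce algorithm:
export I_MPI_FABRICS=shm:ofa
export I_MPI_OFA_USE_XRC=1
export I_MPI_ADJUST_ALLREDUCE=5

# ...or the DAPL provider named on the slide:
# export I_MPI_FABRICS=shm:dapl
# export I_MPI_DAPL_PROVIDER=ofa-v2-mlx5_0-1u
# export I_MPI_ADJUST_ALLREDUCE=5
```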
20
RADIOSS Performance – Interconnect (HMPP)
• EDR InfiniBand provides better scalability performance than Ethernet
– 214% better performance than 1GbE at 16 nodes
– 104% better performance than 10GbE at 16 nodes
– InfiniBand typically outperforms other interconnects in collective operations
[Chart: HMPP interconnect comparison; Intel MPI, 2 PPN / 14 OpenMP; higher is better; 214% vs 1GbE, 104% vs 10GbE]
21
RADIOSS Performance – Interconnect (HMPP)
• EDR InfiniBand provides better scalability performance than FDR IB
– EDR IB outperforms FDR IB by 27% at 32 nodes
– Improvement for EDR InfiniBand occurs at high node count
[Chart: EDR vs FDR InfiniBand, HMPP; Intel MPI, 2 PPN / 14 OpenMP; higher is better; 27% at 32 nodes]
22
RADIOSS Performance–Floating Point Precision
• Single precision jobs run faster than double precision
– SP provides a 47% speedup over DP
– Similar scalability is seen for double precision tests
[Chart: SP vs DP; Intel MPI, 2 PPN / 14 OpenMP; higher is better; 47% speedup]
23
RADIOSS Performance – CPU Frequency
• Increasing CPU core frequency enables higher job efficiency
– 18% performance jump from 2.3GHz to 2.6GHz (a 13% increase in clock speed)
– 29% performance jump from 2.0GHz to 2.6GHz (a 30% increase in clock speed)
• The performance gain is comparable to, or exceeds, the increase in CPU frequency
– CPU-bound applications see a higher benefit from CPUs with higher frequencies
[Chart: CPU frequency comparison; Intel MPI, 2 PPN / 14 OpenMP; higher is better; 18% and 29% gains]
24
RADIOSS Performance – System Generations
• The Intel E5-2697v3 (Haswell) cluster outperforms prior generations
– Performs faster by 100% vs Jupiter and by 238% vs Janus at 16 nodes
• System components used:
– Thor: 2-socket Intel [email protected], 2133MHz DIMMs, EDR IB, v13.0
– Jupiter: 2-socket Intel [email protected], 1600MHz DIMMs, FDR IB, v12.0
– Janus: 2-socket Intel [email protected], 1333MHz DIMMs, QDR IB, v12.0
[Chart: system-generation comparison, single precision; 100% vs Jupiter, 238% vs Janus]
25
RADIOSS Profiling – Memory Required
• Differences in memory consumption between MPP and HMPP
– MPP: Memory required to run the workload is ~5GB per node
– HMPP: Approximately 400MB per node is needed (as there are 2 PPN)
– This is considered a small workload, but it is sufficient to observe application behavior
[Charts: memory use per node at 32 nodes, 28 CPU cores/node; MPP 28 PPN vs HMPP 2 PPN / 14 threads]
26
RADIOSS – Summary
• RADIOSS is designed to perform in large-scale HPC environments
– Shows excellent scalability over 896 cores / 32 nodes and beyond with Hybrid MPP
– The Hybrid MPP version enhances RADIOSS scalability
• 2 MPI processes per node (1 MPI process per socket), 14 threads each
– Additional CPU cores generally accelerate time to solution
– The Intel E5-2697v3 (Haswell) cluster outperforms prior generations
• Performs faster by 100% vs “Sandy Bridge” and by 238% vs “Westmere” at 16 nodes
• Network and MPI Tuning
– EDR InfiniBand outperforms other Ethernet-based interconnects in scalability
– EDR InfiniBand delivers higher scalability performance than FDR and QDR IB
– Tuning environment parameters is important to maximize performance
– Tuning MPI collective ops helps RADIOSS to achieve even better scalability
27
Thank You
HPC Advisory Council
All trademarks are property of their respective owners. All information is provided “As-Is” without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy and completeness of the information contained herein. The HPC Advisory Council undertakes no duty and assumes no obligation to update or correct any information presented herein.