RADIOSS 13.0 Performance Benchmark and Profiling
April 2015
2
Note
• The following research was performed under the HPC Advisory Council activities
– Participating vendors: Intel, Dell, Mellanox
– Compute resource - HPC Advisory Council Cluster Center
• The following was done to provide best practices
– RADIOSS performance overview
– Understanding RADIOSS communication patterns
– Ways to increase RADIOSS productivity
– MPI libraries comparisons
• For more info please refer to
– http://www.altair.com
– http://www.dell.com
– http://www.intel.com
– http://www.mellanox.com
3
Objectives
• The following was done to provide best practices
– RADIOSS performance benchmarking
– Interconnect performance comparisons
– MPI performance comparison
– Understanding RADIOSS communication patterns
• The presented results will demonstrate
– The scalability of the compute environment to provide nearly linear application scalability
– The capability of RADIOSS to achieve scalable productivity
4
RADIOSS by Altair
• Altair RADIOSS
– Structural analysis solver for highly non-linear problems under dynamic loadings
– Consists of features for:
• multiphysics simulation and advanced materials such as composites
– Highly differentiated for Scalability, Quality and Robustness
• RADIOSS is used across all industries worldwide
– Improves crashworthiness, safety, and manufacturability of structural designs
• RADIOSS has established itself as an industry standard
– for automotive crash and impact analysis for over 20 years
5
Test Cluster Configuration
• Dell™ PowerEdge™ R730 32-node (896-core) “Thor” cluster
– Dual-Socket 14-core Intel E5-2697v3 @ 2.60 GHz CPUs (Turbo on, Max Perf set in BIOS)
– OS: RHEL 6.5, OFED MLNX_OFED_LINUX-2.4-1.0.5 InfiniBand SW stack
– Memory: 64GB memory, DDR3 2133 MHz
– Hard Drives: 1TB 7.2K RPM SATA 2.5”
• Mellanox Switch-IB SB7700 100Gb/s InfiniBand VPI switch
• Mellanox ConnectX-4 EDR 100Gb/s InfiniBand VPI adapters
• Mellanox ConnectX-3 40/56Gb/s QDR/FDR InfiniBand VPI adapters
• Mellanox SwitchX SX6036 56Gb/s FDR InfiniBand VPI switch
• MPI: Intel MPI 5.0.2, Mellanox HPC-X v1.2.0
• Application: Altair RADIOSS 13.0
• Benchmark datasets:
– Neon benchmarks: 1 million elements (8ms, Double Precision), unless otherwise stated
6
About Intel® Cluster Ready
• Intel® Cluster Ready systems make it practical to use a cluster to increase
your simulation and modeling productivity
– Simplifies selection, deployment, and operation of a cluster
• A single architecture platform supported by many OEMs, ISVs, cluster
provisioning vendors, and interconnect providers
– Focus on your work productivity, spend less management time on the cluster
• Select Intel Cluster Ready
– Where the cluster is delivered ready to run
– Hardware and software are integrated and configured together
– Applications are registered, validating execution on the Intel Cluster Ready
architecture
– Includes Intel® Cluster Checker tool, to verify functionality and periodically check
cluster health
• RADIOSS is Intel Cluster Ready
7
PowerEdge R730: Massive flexibility for data-intensive operations
• Performance and efficiency
– Intelligent hardware-driven systems management
with extensive power management features
– Innovative tools including automation for
parts replacement and lifecycle manageability
– Broad choice of networking technologies from GbE to IB
– Built in redundancy with hot plug and swappable PSU, HDDs and fans
• Benefits
– Designed for performance workloads
• from big data analytics, distributed storage or distributed computing
where local storage is key to classic HPC and large scale hosting environments
• High performance scale-out compute and low cost dense storage in one package
• Hardware Capabilities
– Flexible compute platform with dense storage capacity
• 2S/2U server, 6 PCIe slots
– Large memory footprint (Up to 768GB / 24 DIMMs)
– High I/O performance and optional storage configurations
• HDD options: 12 x 3.5” - or - 24 x 2.5” + 2 x 2.5” HDDs in rear of server
• Up to 26 HDDs with 2 hot plug drives in rear of server for boot or scratch
8
RADIOSS Performance – Interconnect (MPP)
• EDR InfiniBand provides better scalability performance than Ethernet
– 70 times better performance than 1GbE at 16 nodes / 448 cores
– 4.8x better performance than 10GbE at 16 nodes / 448 cores
– Ethernet solutions do not scale beyond 4 nodes with pure MPI
[Chart: MPP interconnect comparison; Intel MPI, 28 processes/node; higher is better; 70x vs 1GbE, 4.8x vs 10GbE]
9
RADIOSS Performance – Interconnect (MPP)
• EDR InfiniBand provides better scalability performance
– EDR InfiniBand improves over QDR IB by 28% at 16 nodes / 448 cores
– Similarly, EDR InfiniBand outperforms FDR InfiniBand by 25% at 16 nodes
[Chart: MPP interconnect comparison; 28 processes/node; higher is better; 28% vs QDR IB, 25% vs FDR IB]
10
RADIOSS Performance – CPU Cores
• Running more cores per node generally improves overall performance
– An improvement of 18% was seen from 20 to 28 cores per node at 8 nodes
– The improvement is not as consistent at higher node counts
• Guideline: the optimal workload distribution is ~4000 elements per process
– For this test case of 1 million elements, the optimal size is ~256 cores
– 4000 elements per process provides sufficient workload for each process
– Hybrid MPP (HMPP) provides a way to achieve additional scalability on more CPUs
[Chart: cores-per-node comparison; Intel MPI; higher is better; 6% and 18% gains shown]
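The ~4000 elements/process guideline above can be turned into a quick sizing calculation; a minimal sketch (variable names are illustrative, and the result is then typically rounded to a convenient core count such as ~256):

```shell
# Sketch: estimate the MPI process count from the ~4000
# elements/process guideline quoted above (names are illustrative).
elements=1000000           # model size, e.g. the 1M-element Neon case
elements_per_rank=4000     # workload guideline from this study
ranks=$(( elements / elements_per_rank ))
echo "Suggested MPI ranks: ${ranks}"   # prints 250 for this case
```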
11
RADIOSS Performance – Simulation Time
• Increasing simulation time increases the run time at a faster rate
– Extending an 8ms simulation to 80ms can result in a much longer runtime
– A 10x longer simulation can result in a 14x increase in runtime
– Contacts usually become more severe in the middle of the run, which increases complexity and CPU utilization, so the cost per cycle rises
[Chart: simulation-time comparison; Intel MPI; higher is better; 13x to 14x runtime increases shown]
12
RADIOSS Profiling – % Time Spent on MPI
• RADIOSS utilizes point-to-point communications for most data transfers
• The most time-consuming MPI calls are MPI_Recv() and MPI_Waitany()
– MPI_Recv (55%), MPI_Waitany (23%), MPI_Allreduce (13%)
[Charts: MPI time breakdown; MPP mode, 28 processes/node]
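A breakdown like the one above can be gathered with Intel MPI's built-in statistics facility; a hedged sketch, assuming an Intel MPI 5.x environment (the mpirun arguments and the RADIOSS executable/input names are illustrative only):

```shell
# Hedged sketch: collect an IPM-style MPI time breakdown with
# Intel MPI's native statistics. The RADIOSS binary and input
# deck names below are illustrative, not from this study.
export I_MPI_STATS=ipm
export I_MPI_STATS_FILE=radioss_mpi_profile.txt
mpirun -ppn 28 -np 448 ./e_13.0_linux64 -i NEON1M_0000.rad
# The profile file then lists per-call time shares
# (e.g. MPI_Recv, MPI_Waitany, MPI_Allreduce).
```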
13
RADIOSS Performance – IMPI Tuning (MPP)
• Tuning the Intel MPI collective algorithms can improve performance
– MPI profile shows about 20% of runtime spent on MPI_Allreduce communications
– Default algorithm in Intel MPI is Recursive Doubling
– The default algorithm is the best among all tested for MPP
[Chart: Intel MPI Allreduce algorithm comparison; 28 processes/node; higher is better]
14
RADIOSS Performance – MPI Libraries (MPP)
• HPC-X and Intel MPI perform similarly
– The MPI profile shows about 20% of MPI time is spent on Allreduce communications
– MPI collective operations (such as Allreduce) can potentially be optimized
– Support for Open MPI is new in RADIOSS
– HPC-X is a tuned MPI distribution based on the latest Open MPI
[Chart: HPC-X vs Intel MPI; 28 processes/node; higher is better]
15
RADIOSS Hybrid MPP Parallelization
• Highly parallel code
– Multi-level parallelization
– Domain decomposition MPI parallelization
– Multithreading OpenMP
• Enhanced performance
– Best scalability in the marketplace
– High efficiency on large HPC clusters
– Unique, proven method for rich scalability over thousands of cores for FEA
– Flexibility -- easy tuning of MPI & OpenMP
– Robustness -- parallel arithmetic allows perfect repeatability in parallel
16
RADIOSS Performance – Hybrid MPP version
• Enabling Hybrid MPP mode unlocks RADIOSS scalability
– At larger scale, productivity improves as more threads are involved
– As more threads are used, the amount of inter-process communication is reduced
– At 32 nodes / 896 cores, the best configuration is 1 process per socket spawning 14 threads each
– 28 threads at 1 PPN is not advised, as it breaks data locality across CPU sockets
• The following environment setting and tuned flags are used for Intel MPI:
– I_MPI_PIN_DOMAIN auto
– I_MPI_ADJUST_ALLREDUCE 5
– I_MPI_ADJUST_BCAST 1
– KMP_AFFINITY compact
– KMP_STACKSIZE 400m
– ulimit -s unlimited
[Chart: MPP vs Hybrid MPP scalability; Intel MPI, EDR InfiniBand; 3.7x, 32%, and 70% gains shown]
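Put together, the environment settings listed above map to a launch script along these lines; a sketch only: the RADIOSS executable name, input deck, and node count are assumptions to adapt to your installation:

```shell
#!/bin/sh
# Hybrid MPP launch sketch using the Intel MPI settings listed above.
export I_MPI_PIN_DOMAIN=auto        # pin each rank to its own domain
export I_MPI_ADJUST_ALLREDUCE=5     # tuned Allreduce algorithm
export I_MPI_ADJUST_BCAST=1         # tuned Bcast algorithm
export KMP_AFFINITY=compact         # pack OpenMP threads close together
export KMP_STACKSIZE=400m
export OMP_NUM_THREADS=14           # 14 threads per rank (1 rank/socket)
ulimit -s unlimited

# 32 nodes x 2 ranks/node x 14 threads = 896 cores
# (binary and input names below are illustrative):
mpirun -ppn 2 -np 64 ./e_13.0_linux64 -i NEON1M_0000.rad
```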
17
RADIOSS Profiling – Number of MPI Calls
• MPP mostly uses non-blocking calls for communications
– MPI_Recv, MPI_Waitany, and MPI_Allreduce are used most of the time
• For HMPP, the communication behavior changes
– A higher percentage of time is spent in MPI_Waitany, MPI_Allreduce, and MPI_Recv
• MPI communication behavior has changed from the previous RADIOSS version
– Most likely due to more CPU cores being available on the current cluster
[Charts: MPI call counts at 32 nodes; MPP 28 PPN vs HMPP 2 PPN / 14 threads]
18
RADIOSS Profiling – MPI Message Sizes
• The most time consuming MPI communications are:
– MPI_Recv: Messages concentrated at 640B, 1KB, 320B, 1280B
– MPI_Waitany: Messages are: 48B, 8B, 384B
– MPI_Allreduce: Most messages appear at 80B
[Charts: MPI message-size distribution; MPP 28 PPN vs HMPP 2 PPN / 14 threads]
19
RADIOSS Performance – Intel MPI Tuning (DP)
• For Hybrid MPP DP, tuning MPI_Allreduce shows more gain than MPP
– For DAPL provider, Binomial gather+scatter #5 improved perf by 27% over default
– For OFA provider, tuned MPI_Allreduce algorithm improves by 44% over default
– Both OFA and DAPL improved by tuning I_MPI_ADJUST_ALLREDUCE=5
– Flags for OFA: I_MPI_OFA_USE_XRC 1. For DAPL: ofa-v2-mlx5_0-1u provider
[Chart: Intel MPI Allreduce tuning, DP; 2 PPN / 14 OpenMP; higher is better; 27% (DAPL), 44% (OFA)]
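The two provider setups compared above would be selected roughly as follows; a hedged sketch, since fabric-selection syntax should be verified against the installed Intel MPI release:

```shell
# OFA provider with XRC, plus the tuned Allreduce algorithm:
export I_MPI_FABRICS=shm:ofa
export I_MPI_OFA_USE_XRC=1
export I_MPI_ADJUST_ALLREDUCE=5

# ...or the DAPL provider named on the slide:
# export I_MPI_FABRICS=shm:dapl
# export I_MPI_DAPL_PROVIDER=ofa-v2-mlx5_0-1u
# export I_MPI_ADJUST_ALLREDUCE=5
```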
20
RADIOSS Performance – Interconnect (HMPP)
• EDR InfiniBand provides better scalability performance than Ethernet
– 214% better performance than 1GbE at 16 nodes
– 104% better performance than 10GbE at 16 nodes
– InfiniBand typically outperforms other interconnects in collective operations
[Chart: HMPP interconnect comparison; Intel MPI, 2 PPN / 14 OpenMP; higher is better; 214% vs 1GbE, 104% vs 10GbE]
21
RADIOSS Performance – Interconnect (HMPP)
• EDR InfiniBand provides better scalability performance than FDR IB
– EDR IB outperforms FDR IB by 27% at 32 nodes
– Improvement for EDR InfiniBand occurs at high node count
[Chart: EDR vs FDR InfiniBand, HMPP; Intel MPI, 2 PPN / 14 OpenMP; higher is better; 27% at 32 nodes]
22
RADIOSS Performance–Floating Point Precision
• Single precision jobs run faster than double precision
– SP provides a 47% speedup over DP
– Similar scalability is seen for double precision tests
[Chart: SP vs DP; Intel MPI, 2 PPN / 14 OpenMP; higher is better; 47% speedup]
23
RADIOSS Performance – CPU Frequency
• Increasing CPU core frequency enables higher job efficiency
– 18% performance jump from 2.3GHz to 2.6GHz (a 13% increase in clock speed)
– 29% performance jump from 2.0GHz to 2.6GHz (a 30% increase in clock speed)
• The performance gain is comparable to, or exceeds, the increase in CPU frequency
– CPU-bound applications see a higher benefit from CPUs with higher frequencies
[Chart: CPU frequency comparison; Intel MPI, 2 PPN / 14 OpenMP; higher is better; 18% and 29% gains]
24
RADIOSS Performance – System Generations
• The Intel E5-2697v3 (Haswell) cluster outperforms prior generations
– Performs faster by 100% vs Jupiter and by 238% vs Janus at 16 nodes
• System components used:
– Thor: 2-socket Intel [email protected], 2133MHz DIMMs, EDR IB, v13.0
– Jupiter: 2-socket Intel [email protected], 1600MHz DIMMs, FDR IB, v12.0
– Janus: 2-socket Intel [email protected], 1333MHz DIMMs, QDR IB, v12.0
[Chart: system-generation comparison, single precision; 100% vs Jupiter, 238% vs Janus]
25
RADIOSS Profiling – Memory Required
• Differences in memory consumption between MPP and HMPP
– MPP: Memory required to run the workload is ~5GB per node
– HMPP: Approximately 400MB per node is needed (as there are 2 PPN)
– This is considered a small workload, but it is sufficient to observe application behavior
[Charts: memory use per node at 32 nodes, 28 CPU cores/node; MPP 28 PPN vs HMPP 2 PPN / 14 threads]
26
RADIOSS – Summary
• RADIOSS is designed to perform in large-scale HPC environments
– Shows excellent scalability over 896 cores / 32 nodes and beyond with Hybrid MPP
– The Hybrid MPP version enhances RADIOSS scalability
• 2 MPI processes per node (1 MPI process per socket), 14 threads each
– Additional CPU cores generally accelerate time to solution
– The Intel E5-2697v3 (Haswell) cluster outperforms prior generations
• Performs faster by 100% vs “Sandy Bridge” and by 238% vs “Westmere” at 16 nodes
• Network and MPI Tuning
– EDR InfiniBand outperforms other Ethernet-based interconnects in scalability
– EDR InfiniBand delivers higher scalability performance than FDR and QDR IB
– Tuning environment parameters is important to maximize performance
– Tuning MPI collective ops helps RADIOSS to achieve even better scalability
27
Thank You
HPC Advisory Council
All trademarks are property of their respective owners. All information is provided “As-Is” without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy and completeness of the information contained herein. The HPC Advisory Council undertakes no duty and assumes no obligation to update or correct any information presented herein.