
Performance and Scalability of FLOW-3D/MP 5.0 in HPC Cluster Environment

Pak Lui, Application Performance Manager, HPC Advisory Council


Agenda

• Introduction to HPC Advisory Council
• Benchmark Configuration
• Performance Benchmark Testing and Results
• Summary
• Q&A / For More Information


The HPC Advisory Council

• World-wide HPC organization (360+ members)
• Bridges the gap between HPC usage and its full potential
• Provides best practices and a support/development center
• Explores future technologies and future developments
• Working Groups – HPC|Cloud, HPC|Scale, HPC|GPU, HPC|Storage
• Leading edge solutions and technology demonstrations


HPC Advisory Council Members


HPC Council Board

• HPC Advisory Council Chairman: Gilad Shainer - [email protected]
• HPC Advisory Council Media Relations and Events Director: Brian Sparks - [email protected]
• HPC Advisory Council China Events Manager: Blade Meng - [email protected]
• Director of the HPC Advisory Council, Asia: Tong Liu - [email protected]
• HPC Advisory Council HPC|Works SIG Chair and Cluster Center Manager: Pak Lui - [email protected]
• HPC Advisory Council Director of Educational Outreach: Scot Schultz - [email protected]
• HPC Advisory Council Programming Advisor: Tarick Bedeir - [email protected]
• HPC Advisory Council HPC|Scale SIG Chair: Richard Graham - [email protected]
• HPC Advisory Council HPC|Cloud SIG Chair: William Lu - [email protected]
• HPC Advisory Council HPC|GPU SIG Chair: Sadaf Alam - [email protected]
• HPC Advisory Council India Outreach: Goldi Misra - [email protected]
• Director of the HPC Advisory Council Switzerland Center of Excellence and HPC|Storage SIG Chair: Hussein Harake - [email protected]
• HPC Advisory Council Workshop Program Director: Eric Lantz - [email protected]
• HPC Advisory Council Research Steering Committee Director: Cydney Stevens - [email protected]


HPC Advisory Council HPC Center

[Diagram: HPC Advisory Council Cluster Center – clusters Juniper, Dodecas, Plutus, Janus, Athena, Vesta, Mercury and Mala, ranging from 64 to 704 cores plus 16 GPUs, with InfiniBand-based Lustre storage]


Special Interest Subgroups Missions

• HPC|Scale
  – To explore usage of commodity HPC as a replacement for multi-million dollar mainframes and proprietary-based supercomputers, with networks and clusters of microcomputers acting in unison to deliver high-end computing services

• HPC|Cloud
  – To explore usage of HPC components as part of the creation of external/public/internal/private cloud computing environments

• HPC|Works
  – To provide best practices for building balanced and scalable HPC systems, performance tuning and application guidelines

• HPC|Storage
  – To demonstrate how to build high-performance storage solutions and their effect on application performance and productivity. One of the main interests of the HPC|Storage subgroup is to explore Lustre-based solutions and to expose more users to the potential of Lustre over high-speed networks

• HPC|GPU
  – To explore usage models of GPU components as part of next-generation compute environments and potential optimizations for GPU-based computing

• HPC|FSI
  – To explore the usage of high-performance computing solutions for low-latency trading, more productive simulations (such as Monte Carlo) and overall more efficient financial services


HPC Advisory Council

• HPC Advisory Council (HPCAC)
  – 360+ members
  – http://www.hpcadvisorycouncil.com/
  – Application best practices, case studies (over 150)
  – Benchmarking center with remote access for users
  – World-wide workshops
  – Value add for your customers to stay up to date and in tune with the HPC market

• 2013 Workshops
  – USA (Stanford University) – January 2013
  – Switzerland – March 2013
  – Germany (ISC’13) – June 2013
  – Spain – Sep 2013
  – China (HPC China) – Oct 2013

• For more information
  – www.hpcadvisorycouncil.com
  – [email protected]


ISC’14 – Student Cluster Competition

• University-based teams to compete and demonstrate the incredible capabilities of state-of-the-art HPC systems and applications on the ISC’14 show-floor

• The Student Cluster Competition is designed to introduce the next generation of students to the high performance computing world and community


FLOW-3D/MP Performance Study

• Research performed under the HPC Advisory Council activities
  – Participating vendors: Intel, Dell, Mellanox
  – Compute resource: HPC Advisory Council Cluster Center

• Objectives
  – Give an overview of FLOW-3D/MP performance
  – Compare different MPI libraries, network interconnects and other configuration options
  – Understand FLOW-3D/MP communication patterns
  – Provide best practices to increase FLOW-3D/MP productivity


FLOW-3D/MP

• FLOW-3D/MP is a powerful and highly accurate CFD software
  – Provides engineers valuable insight into many physical flow processes

• FLOW-3D/MP is the ideal computational fluid dynamics software
  – To use in the design phase as well as in improving production processes
  – Provides special capabilities for accurately predicting free-surface flows

• FLOW-3D/MP is a standalone, all-inclusive CFD package
  – Includes an integrated GUI that ties components from problem setup to post-processing


Test Cluster Configuration

• Dell™ PowerEdge™ R720xd 32-node “Jupiter” cluster

– 16-node Dual-Socket Eight-Core Intel E5-2680 @ 2.70 GHz CPUs

– 16-node Dual-Socket Ten-Core Intel E5-2680 V2 @ 2.80 GHz CPUs

– Memory: 64GB DDR3 1600 MHz

– OS: RHEL 6.2, OFED 1.5.3-3.1.0 (for v4.2) and 2.0-3.0.0 (for v5.0) InfiniBand SW stack

– Hard Drives: 24x 250GB 7.2K RPM SATA 2.5” on RAID 0

• Mellanox Connect-IB FDR InfiniBand adapters and ConnectX-3 Ethernet adapters

• Mellanox SwitchX SX6036 InfiniBand switch

• MPI (vendor provided): Intel MPI 3.2.0.011 (v4.2) and 4.0.0.025 (v5.0)

• Application: FLOW-3D/MP 4.2 and 5.0

• Benchmarks:

– Lid Driven Cavity Flow - Completely filled with fluid (in this case air)

– P2 Engine Block - A casting application with free-surface (one-fluid simulation with sharp interface tracking)


About Intel® Cluster Ready

• Intel® Cluster Ready systems make it practical to use a cluster to increase your simulation and modeling productivity
  – Simplifies selection, deployment, and operation of a cluster

• A single architecture platform supported by many OEMs, ISVs, cluster provisioning vendors, and interconnect providers
  – Focus on your work productivity, spend less management time on the cluster

• Select Intel Cluster Ready
  – Where the cluster is delivered ready to run
  – Hardware and software are integrated and configured together
  – Applications are registered, validating execution on the Intel Cluster Ready architecture
  – Includes the Intel® Cluster Checker tool, to verify functionality and periodically check cluster health

• FLOW-3D/MP is Intel Cluster Ready certified


PowerEdge R720xd: Massive flexibility for data-intensive operations

• Performance and efficiency
  – Intelligent hardware-driven systems management with extensive power management features
  – Innovative tools including automation for parts replacement and lifecycle manageability
  – Broad choice of networking technologies from 1GbE to IB
  – Built-in redundancy with hot plug and swappable PSU, HDDs and fans

• Benefits
  – Designed for performance workloads
    • From big data analytics, distributed storage or distributed computing where local storage is key, to classic HPC and large scale hosting environments
    • High performance scale-out compute and low cost dense storage in one package

• Hardware capabilities
  – Flexible compute platform with dense storage capacity
    • 2S/2U server, 6 PCIe slots
  – Large memory footprint (up to 768GB / 24 DIMMs)
  – High I/O performance and optional storage configurations
    • HDD options: 12 x 3.5” - or - 24 x 2.5” + 2 x 2.5” HDDs in rear of server
    • Up to 26 HDDs with 2 hot plug drives in rear of server for boot or scratch


FLOW-3D/MP Performance – MPI vs Hybrid

• FLOW-3D/MP supports an OpenMP Hybrid mode
  – Up to 327% higher performance than MPI mode at 16 nodes

• Hybrid version enables higher scalability versus the pure MPI version
  – Hybrid version delivers better scalability after 4 nodes
  – MPI processes spawn OpenMP threads for computation on CPU cores
  – Streamlines and reduces communication endpoints to improve scalability

[Chart: Performance rating (jobs/day), Intel E5-2680 V2, FDR InfiniBand – Hybrid mode up to 327% faster]


FLOW-3D/MP Performance – Hybrid Mode

• Default for FLOW-3D/MP Hybrid mode is to run 1 PPN of 16 threads
  – Which uses “-genv I_MPI_PIN_DOMAIN node” as specified in the runhyd_par script

• For best performance on 2P platforms (see the launch sketch below):
  – Use “-genv I_MPI_PIN_DOMAIN socket” in the runhyd_par script
  – 2 PPN of 8/10 threads can yield better performance than 1 PPN of 16/20 threads
  – The flag allows each MPI process to spawn threads within its own socket, instead of both MPI processes sharing the same socket

[Chart: FLOW-3D/MP 5.0, FDR InfiniBand – gains of 110% and 178% shown]
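For illustration, a minimal, hedged launch sketch of the two pinning layouts above using Intel MPI's mpirun. The host file, the thread counts shown and the solver executable name are assumptions for the sketch; the actual FLOW-3D/MP launch is wrapped by the vendor-provided runhyd_par script.

    # Default hybrid layout: 1 MPI process per node, OpenMP threads span the whole node
    mpirun -f ./hosts -ppn 1 \
           -genv I_MPI_PIN_DOMAIN node \
           -genv OMP_NUM_THREADS 20 \
           ./flow3d_solver                    # hypothetical executable name

    # Tuned layout for dual-socket (2P) nodes: 2 MPI processes per node,
    # each process and its OpenMP threads pinned to their own socket
    mpirun -f ./hosts -ppn 2 \
           -genv I_MPI_PIN_DOMAIN socket \
           -genv OMP_NUM_THREADS 10 \
           ./flow3d_solver

With 2 PPN and socket pinning, each process's threads stay on one socket's cores and local memory, which is where the gains shown above come from.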


FLOW-3D/MP Performance – Hybrid Mode

• Default for FLOW-3D/MP Hybrid mode is to run 1 PPN of 16 threads
  – With “I_MPI_PIN_DOMAIN=node” specified in the runhyd_par script

• 2 PPN of 8 threads can provide better performance than 1 PPN of 16 threads
  – With “I_MPI_PIN_DOMAIN=socket” specified in the runhyd_par script, each MPI process spawns its threads within the same socket
  – The flag is changed to “socket” to allow spawning of threads within a socket for the 2 PPN of 8 threads case

[Chart: FLOW-3D/MP 4.2, FDR InfiniBand – gains of 9% and 35% shown]


FLOW-3D/MP Performance – Processors

• Intel E5-2680 (Sandy Bridge) cluster outperforms the Intel Xeon X5670 cluster
  – Performs 70% better than the X5670 cluster at 16 nodes

• System components used:
  – Sandy Bridge: 2-socket Intel E5-2680 @ 2.7GHz, 1600MHz DIMMs, FDR IB, 24 disks
  – Westmere: 2-socket Intel X5670 @ 2.93GHz, 1333MHz DIMMs, QDR IB, 1 disk

[Chart: FLOW-3D/MP 4.2, FDR InfiniBand – Sandy Bridge cluster up to 70% faster]


FLOW-3D/MP Performance – Processors

• Intel E5-2680 v2 (Ivy Bridge) cluster outperforms the Intel E5-2680 cluster
  – Performs 8% better than the E5-2680 (Sandy Bridge) cluster at 16 nodes

• System components used:
  – Sandy Bridge: 2-socket 8-core Intel E5-2680 @ 2.7GHz
  – Ivy Bridge: 2-socket 10-core Intel E5-2680 V2 @ 2.8GHz

[Chart: FLOW-3D/MP 5.0, FDR InfiniBand – gains of 5% and 8% shown]


FLOW-3D/MP Performance – Software Versions

• FLOW-3D/MP v5.0 outperforms v4.2 in scalability in Hybrid mode
  – Up to 37% faster in Hybrid mode at 16 nodes

• Running in MPI mode shows slightly longer runtime than the previous version
  – Appears to be caused by a change in the communication algorithm
  – More MPI collective operations are used compared to the prior version

[Chart: Performance rating (jobs/day), 16 MPI processes/node – v5.0 up to 37% faster than v4.2]


FLOW-3D/MP Performance – Network

• InfiniBand FDR provides better scalability performance than Ethernet
  – The scalability gap widens as more nodes are involved in the simulation
  – FDR InfiniBand provides up to 246% better performance than 1GbE
  – FDR InfiniBand delivers up to 39% better performance than 10GbE
  – Hybrid mode is shown

[Chart: Performance rating (jobs/day), 16 MPI processes/node – FDR InfiniBand up to 39% faster than 10GbE and 246% faster than 1GbE]


FLOW-3D/MP Profiling – Time Ratio

• InfiniBand FDR reduces the communication time at scale
  – InfiniBand FDR consumes about 17% of total runtime on a 16-node Hybrid job
  – 10GbE consumes 39% of total time, while 1GbE consumes about 75%

• IB RDMA technology allows communication to bypass CPU involvement
  – Reduces CPU overhead in handling communication
  – Which leaves more time for application processing


FLOW-3D/MP Profiling – # of MPI Calls

• Overall runtime reduces as more nodes take part in the MPI job
  – More compute nodes reduce runtime by spreading out the workload

• Computation time drops while the communication time stays flat
  – As the cluster scales, MPI time stays roughly constant at the same level

[Chart: 16 MPI processes/node, pure MPI mode]


FLOW-3D/MP Profiling – Communication Time

• The most time-consuming MPI functions are:
  – FDR: MPI_Allreduce (51%), MPI_Bcast (24%), MPI_Waitall (16%)

• InfiniBand reduces time in collective operations more than Ethernet does
  – Collective communications are the most used in FLOW-3D v5.0
  – Those communications account for the highest communication time
  – (A hedged profiling sketch follows below)
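As a hedged sketch of how such a per-function breakdown can be reproduced, assuming a recent Intel MPI is in use (the study's own profiling tool is not named here), Intel MPI's built-in statistics can summarize MPI time by call; the host file, process count, file name and executable name below are illustrative.

    # Enable Intel MPI's built-in statistics to get a per-function MPI time
    # summary (MPI_Allreduce, MPI_Bcast, MPI_Waitall, ...) at the end of the job.
    export I_MPI_STATS=ipm                      # IPM-style summary (assumption:
                                                # supported by the Intel MPI in use)
    export I_MPI_STATS_FILE=flow3d_stats.ipm    # assumed output file name
    mpirun -f ./hosts -ppn 16 ./flow3d_solver   # pure MPI mode, 16 PPN
    grep -Ei "allreduce|bcast|waitall" flow3d_stats.ipm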


FLOW-3D/MP Profiling – # of MPI Calls

• There is a wide range of message sizes seen:
  – MPI_Allreduce: concentration between 4B and 16B
  – MPI_Waitall: around 4MB
  – MPI_Bcast: around 1-4MB


FLOW-3D/MP Performance – File System

• Storing data files on a local file system or tmpfs improves performance (see the staging sketch below)
  – Scalability is limited by NFS when running at scale beyond 8 nodes
  – The NFS used in this case runs over a 1GbE network

[Chart: InfiniBand FDR, higher is better – local storage up to 25% faster]
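A minimal staging sketch of that idea, assuming a node-local scratch directory is available; the paths, the copy-everything approach and the cleanup step are assumptions rather than FLOW-3D/MP defaults, and for multi-node runs the inputs must be staged on (or visible to) every node.

    # Run from node-local storage (or tmpfs such as /dev/shm) instead of NFS,
    # then copy the results back to the NFS home directory afterwards.
    CASE_DIR=$HOME/cases/p2_engine_block        # case directory on NFS (illustrative)
    SCRATCH=/local/scratch/flow3d_$$            # node-local disk or /dev/shm
    mkdir -p "$SCRATCH"
    cp -r "$CASE_DIR"/. "$SCRATCH"/             # stage the input files locally
    cd "$SCRATCH"
    runhyd_par                                  # vendor-provided launch script
    cp -r "$SCRATCH"/. "$CASE_DIR"/             # copy results back to NFS
    rm -rf "$SCRATCH"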


FLOW-3D/MP Profiling – Disk IO Time

• File IO occurs at certain periods during the MPI solver run
  – Large spikes correspond to writing the restart and spatial data
  – Files are written to local storage instead of NFS to avoid the IO bottleneck


FLOW-3D/MP – Summary

• Scalability
  – FLOW-3D/MP v5.0 Hybrid mode enables higher scalability than the pure MPI version
    • Hybrid version delivers good scalability to 16 nodes (320 cores); >37% faster than v4.2

• Performance
  – Intel Ivy Bridge-EP and FDR InfiniBand enable FLOW-3D/MP to scale to 320 cores
  – Allocating MPI processes to the “proper” socket in Hybrid mode boosts performance by up to 178%
  – Hybrid mode allows FLOW-3D/MP to scale at 16 nodes, up to 327% faster than MPI mode

• Network
  – InfiniBand FDR delivers the best scalability performance with its 56Gb/s rate
    • Outperforms 1GbE by 246% at 16 nodes (320 cores)
    • Outperforms 10GbE by 39% at 16 nodes (320 cores)
  – RDMA technology in InfiniBand allows bypassing the CPU for network transfers
    • This offload reduces CPU overhead in handling communication, so the CPU can focus on the application

• Profiling
  – MPI communication time is spent mostly in MPI_Allreduce, at 51% of overall MPI time
  – InfiniBand can process collective operations in the network faster than Ethernet
  – Large concentration of small messages, typical for latency-sensitive HPC applications


All trademarks are property of their respective owners. All information is provided “As-Is” without any kind of warranty. The HPC Advisory Council makes no representation to the accuracy and completeness of the information contained herein. HPC Advisory Council undertakes no duty and assumes no obligation to update or correct any information presented herein.

Thank You

• Special thanks to Anup Gokarn of Flow Science

• Questions?
  – Pak Lui
  – [email protected]