Performance and Scalability of FLOW-3D/MP 5.0 in HPC Cluster Environment
Pak Lui, Application Performance Manager, HPC Advisory Council
2
Agenda
• Introduction to HPC Advisory Council
• Benchmark Configuration
• Performance Benchmark Testing and Results
• Summary
• Q&A / For More Information
3
The HPC Advisory Council
• World-wide HPC organization (360+ members)
• Bridges the gap between HPC usage and its full potential
• Provides best practices and a support/development center
• Explores future technologies and future developments
• Working Groups – HPC|Cloud, HPC|Scale, HPC|GPU, HPC|Storage
• Leading edge solutions and technology demonstrations
4
HPC Advisory Council Members
5
HPC Council Board
• HPC Advisory Council Chairman: Gilad Shainer - [email protected]
• HPC Advisory Council Media Relations and Events Director: Brian Sparks - [email protected]
• HPC Advisory Council China Events Manager: Blade Meng - [email protected]
• Director of the HPC Advisory Council, Asia: Tong Liu - [email protected]
• HPC Advisory Council HPC|Works SIG Chair and Cluster Center Manager: Pak Lui - [email protected]
• HPC Advisory Council Director of Educational Outreach: Scot Schultz - [email protected]
• HPC Advisory Council Programming Advisor: Tarick Bedeir - [email protected]
• HPC Advisory Council HPC|Scale SIG Chair: Richard Graham - [email protected]
• HPC Advisory Council HPC|Cloud SIG Chair: William Lu - [email protected]
• HPC Advisory Council HPC|GPU SIG Chair: Sadaf Alam - [email protected]
• HPC Advisory Council India Outreach: Goldi Misra - [email protected]
• Director of the HPC Advisory Council Switzerland Center of Excellence and HPC|Storage SIG Chair: Hussein Harake - [email protected]
• HPC Advisory Council Workshop Program Director: Eric Lantz - [email protected]
• HPC Advisory Council Research Steering Committee Director: Cydney Stevens - [email protected]
6
HPC Advisory Council HPC Center
[Figure: HPC Advisory Council cluster center. Systems shown: Juniper, Dodecas, Plutus, Janus, Athena, Vesta, Mercury, Mala, plus InfiniBand-based storage (Lustre FS). Size labels: 512, 456, 192, 704, 384, 256 and 64 cores; 16 GPUs.]
7
Special Interest Subgroups Missions
• HPC|Scale
– To explore usage of commodity HPC as a replacement for multi-million dollar mainframes and proprietary-based supercomputers with networks and clusters of microcomputers acting in unison to deliver high-end computing services.
• HPC|Cloud
– To explore usage of HPC components as part of the creation of external/public/internal/private cloud computing environments.
• HPC|Works
– To provide best practices for building balanced and scalable HPC systems, performance tuning and application guidelines.
• HPC|Storage
– To demonstrate how to build high-performance storage solutions and their effect on application performance and productivity. One of the main interests of the HPC|Storage subgroup is to explore Lustre-based solutions, and to expose more users to the potential of Lustre over high-speed networks.
• HPC|GPU
– To explore usage models of GPU components as part of next generation compute environments and potential optimizations for GPU-based computing.
• HPC|FSI
– To explore the usage of high-performance computing solutions for low latency trading, more productive simulations (such as Monte Carlo) and overall more efficient financial services.
8
HPC Advisory Council
• HPC Advisory Council (HPCAC)
– 360+ members
– http://www.hpcadvisorycouncil.com/
– Application best practices, case studies (over 150)
– Benchmarking center with remote access for users
– World-wide workshops
– Value add for your customers to stay up to date and in tune with the HPC market
• 2013 Workshops
– USA (Stanford University) – January 2013
– Switzerland – March 2013
– Germany (ISC’13) – June 2013
– Spain – Sep 2013
– China (HPC China) – Oct 2013
• For more information
– www.hpcadvisorycouncil.com
– [email protected]
9
ISC’14 – Student Cluster Competition
• University-based teams to compete and demonstrate the incredible capabilities of state-of-the-art HPC systems and applications on the ISC’14 show-floor
• The Student Cluster Competition is designed to introduce the next generation of students to the high performance computing world and community
10
FLOW-3D/MP Performance Study
• Research performed under the HPC Advisory Council activities
– Participating vendors: Intel, Dell, Mellanox
– Compute resource - HPC Advisory Council Cluster Center
• Objectives
– Give overview of FLOW-3D/MP performance
– Compare different MPI libraries, network interconnects and others
– Understand FLOW-3D/MP communication patterns
– Provide best practices to increase FLOW-3D/MP productivity
11
FLOW-3D/MP
• FLOW-3D/MP is a powerful and highly accurate CFD software
– Provides engineers valuable insight into many physical flow processes
• FLOW-3D/MP is the ideal computational fluid dynamics software
– To use in the design phase as well as in improving production processes
– Provides special capabilities for accurately predicting free-surface flows
• FLOW-3D/MP is a standalone, all-inclusive CFD package
– Includes an integrated GUI that ties components from problem setup to post-processing
12
Test Cluster Configuration
• Dell™ PowerEdge™ R720xd 32-node “Jupiter” cluster
– 16-node Dual-Socket Eight-Core Intel E5-2680 @ 2.70 GHz CPUs
– 16-node Dual-Socket Ten-Core Intel E5-2680 V2 @ 2.80 GHz CPUs
– Memory: 64GB memory, DDR3 1600 MHz
– OS: RHEL 6.2, OFED 1.5.3-3.1.0 (for v4.2) and 2.0-3.0.0 (for v5.0) InfiniBand SW stack
– Hard Drives: 24x 250GB 7.2K RPM SATA 2.5” on RAID 0
• Mellanox Connect-IB FDR InfiniBand adapters and ConnectX-3 Ethernet adapters
• Mellanox SwitchX SX6036 InfiniBand switch
• MPI (vendor provided): Intel MPI 3.2.0.011 (v4.2) and 4.0.0.025 (v5.0)
• Application: FLOW-3D/MP 4.2 and 5.0
• Benchmarks:
– Lid Driven Cavity Flow - Completely filled with fluid (in this case air)
– P2 Engine Block - A casting application with free-surface (one-fluid simulation with sharp interface tracking)
13
About Intel® Cluster Ready
• Intel® Cluster Ready systems make it practical to use a cluster to increase your simulation and modeling productivity
– Simplifies selection, deployment, and operation of a cluster
• A single architecture platform supported by many OEMs, ISVs, cluster provisioning vendors, and interconnect providers
– Focus on your work productivity, spend less management time on the cluster
• Select Intel Cluster Ready
– Where the cluster is delivered ready to run
– Hardware and software are integrated and configured together
– Applications are registered, validating execution on the Intel Cluster Ready architecture
– Includes Intel® Cluster Checker tool, to verify functionality and periodically check cluster health
• FLOW-3D/MP is Intel Cluster Ready certified
14
PowerEdge R720xd: Massive flexibility for data intensive operations
• Performance and efficiency
– Intelligent hardware-driven systems management with extensive power management features
– Innovative tools including automation for parts replacement and lifecycle manageability
– Broad choice of networking technologies from 1GbE to IB
– Built-in redundancy with hot plug and swappable PSU, HDDs and fans
• Benefits
– Designed for performance workloads
  • From big data analytics, distributed storage or distributed computing where local storage is key, to classic HPC and large scale hosting environments
  • High performance scale-out compute and low cost dense storage in one package
• Hardware Capabilities
– Flexible compute platform with dense storage capacity
  • 2S/2U server, 6 PCIe slots
– Large memory footprint (Up to 768GB / 24 DIMMs)
– High I/O performance and optional storage configurations
  • HDD options: 12 x 3.5” - or - 24 x 2.5” + 2 x 2.5” HDDs in rear of server
  • Up to 26 HDDs with 2 hot plug drives in rear of server for boot or scratch
15
FLOW-3D/MP Performance – MPI vs Hybrid
• FLOW-3D/MP supports OpenMP Hybrid mode
– Up to 327% higher performance than MPI mode at 16 nodes
• Hybrid version enables higher scalability versus pure MPI version
– Hybrid version delivers better scalability after 4 nodes
– Each MPI process spawns OpenMP threads for computation on the CPU cores
– Fewer communication endpoints streamline communication and improve scalability
[Chart: MPI vs Hybrid mode on Intel E5-2680 V2 with FDR InfiniBand; labeled gain of 327%]
*Performance Rating = Jobs/Day
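The Jobs/Day rating used on the charts can be turned into percentage gains with a few lines of arithmetic. The following is a minimal sketch, assuming the rating is simply the number of back-to-back runs that fit in 24 hours; the slides do not state the exact formula, and the function names and runtimes below are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def jobs_per_day(wall_clock_seconds: float) -> float:
    """Number of back-to-back runs of one job that fit in 24 hours (assumed definition)."""
    if wall_clock_seconds <= 0:
        raise ValueError("wall-clock time must be positive")
    return SECONDS_PER_DAY / wall_clock_seconds


def percent_gain(rating: float, baseline_rating: float) -> float:
    """Relative gain of `rating` over `baseline_rating`, in percent."""
    return (rating / baseline_rating - 1.0) * 100.0


# Hypothetical runtimes, chosen only to illustrate the arithmetic.
mpi_rating = jobs_per_day(3600.0)    # one-hour job: 24 jobs/day
hybrid_rating = jobs_per_day(900.0)  # 15-minute job: 96 jobs/day
print(percent_gain(hybrid_rating, mpi_rating))  # 300.0
```

Read this way, a gain of 327% would correspond to roughly 4.27 times the baseline rating.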
16
FLOW-3D/MP Performance – Hybrid Mode
• Default for FLOW-3D/MP Hybrid mode is to run 1 PPN of 16 threads
– Set by “-genv I_MPI_PIN_DOMAIN node” in the runhyd_par script
• For best performance on 2P platforms:
– Use “-genv I_MPI_PIN_DOMAIN socket” in the runhyd_par script
– 2 PPN of 8/10 threads can yield better performance than 1 PPN of 16/20 threads
– The flag lets each MPI process spawn its threads within its own socket, instead of both MPI processes sharing the same socket
[Chart: FLOW-3D/MP 5.0, FDR InfiniBand; labeled gains of 110% and 178%]
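The socket-pinning recommendation can be sketched as a launch line. This is a minimal sketch, not the contents of the runhyd_par script: the helper name `hybrid_launch_command` and the solver path `./solver_placeholder` are hypothetical, and passing the thread count through `OMP_NUM_THREADS` is an assumption. The `-n`, `-ppn` and `-genv` options are Intel MPI launcher options.

```python
import shlex


def hybrid_launch_command(nodes, sockets_per_node, cores_per_socket,
                          solver="./solver_placeholder"):
    """Build an Intel MPI launch line with one MPI process per socket.

    The solver path is a placeholder; the real executable is not named in the slides.
    """
    ranks_per_node = sockets_per_node  # 2 PPN on a dual-socket node
    return [
        "mpirun",
        "-n", str(nodes * ranks_per_node),           # total MPI processes
        "-ppn", str(ranks_per_node),                 # MPI processes per node
        "-genv", "I_MPI_PIN_DOMAIN", "socket",       # confine each rank to one socket
        "-genv", "OMP_NUM_THREADS", str(cores_per_socket),  # assumed thread setting
        solver,
    ]


# 16 dual-socket nodes with ten-core CPUs: 32 ranks of 10 threads = 320 cores.
print(shlex.join(hybrid_launch_command(16, 2, 10)))
```

For the eight-core E5-2680 nodes, the same call with `cores_per_socket=8` gives the 2 PPN of 8 threads layout.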
17
FLOW-3D/MP Performance – Hybrid Mode
• 2 PPN of 8 threads can provide better performance than 1 PPN of 16 threads
– With “I_MPI_PIN_DOMAIN=socket” specified in the runhyd_par script, each MPI process spawns its threads within a single socket
• Default for FLOW-3D/MP hybrid is to run 1 PPN of 16 threads
– With “I_MPI_PIN_DOMAIN=node” specified in the runhyd_par script
• The flag is changed to “socket” to allow threads to spawn within a socket
– For the case of 2 PPN of 8 threads
[Chart: FLOW-3D/MP 4.2, FDR InfiniBand; labeled gains of 9% and 35%]
18
FLOW-3D/MP Performance – Processors
• Intel E5-2680 (Sandy Bridge) cluster outperforms Intel Xeon X5670 (Westmere) cluster
– Performs 70% better than the X5670 cluster at 16 nodes
• System components used:
– Sandy Bridge: 2-socket Intel E5-2680 @ 2.7GHz, 1600MHz DIMMs, FDR IB, 24 disks
– Westmere: 2-socket Intel X5670 @ 2.93GHz, 1333MHz DIMMs, QDR IB, 1 disk
[Chart labels: FLOW-3D/MP 4.2, FDR InfiniBand; labeled gain of 70%]
19
FLOW-3D/MP Performance – Processors
• Intel E5-2680 v2 (Ivy Bridge) cluster outperforms the Intel E5-2680 (Sandy Bridge) cluster
– Performs 8% better than the E5-2680 cluster at 16 nodes
• System components used:
– Sandy Bridge: 2-socket 8-core Intel E5-2680 @ 2.7GHz
– Ivy Bridge: 2-socket 10-core Intel E5-2680 V2 @ 2.8GHz
[Chart labels: FLOW-3D/MP 5.0, FDR InfiniBand; labeled gains of 8% and 5%]
20
FLOW-3D/MP Performance – Software Versions
• FLOW-3D/MP v5.0 scales better than v4.2 in Hybrid mode
– Up to 37% faster in Hybrid mode at 16 nodes
• Running in MPI mode shows slightly longer runtime than the previous version
– Appears to be caused by a change in communication algorithm
– More MPI collective operations are used than in the prior version
[Chart labels: 16 MPI Processes/Node; labeled gain of 37%; Performance Rating = Jobs/Day]
21
FLOW-3D/MP Performance – Network
• FDR InfiniBand scales better than Ethernet
– The scalability gap widens as more nodes take part in the simulation
– FDR InfiniBand provides up to 246% better performance than 1GbE
– FDR InfiniBand delivers up to 39% better performance than 10GbE
– Hybrid mode is shown
[Chart labels: 16 MPI Processes/Node; labeled gains of 39% and 246%; Performance Rating = Jobs/Day]
22
FLOW-3D/MP Profiling – Time Ratio
• InfiniBand FDR reduces the communication time at scale
– With InfiniBand FDR, communication accounts for about 17% of total runtime on a 16-node Hybrid job
– With 10GbE it accounts for 39% of total time, and with 1GbE about 75%
• IB RDMA technology allows communication to bypass CPU involvement
– Reduces CPU overhead in handling communication
– Which leaves more time for application processing
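The communication shares quoted here can be cross-checked against the network speedups reported earlier. The sketch below is a rough consistency check, assuming every network spends the same absolute time on computation so that total runtime equals compute time divided by (1 - communication share). That assumption and the function name are not from the slides.

```python
def implied_gain_percent(comm_share_fast: float, comm_share_slow: float) -> float:
    """Percent gain of the faster network implied by two communication shares.

    Assumes identical compute time on both networks, so
    total = compute / (1 - comm_share) and the speed ratio is
    (1 - comm_share_fast) / (1 - comm_share_slow).
    """
    for share in (comm_share_fast, comm_share_slow):
        if not 0.0 <= share < 1.0:
            raise ValueError("communication share must be in [0, 1)")
    speed_ratio = (1.0 - comm_share_fast) / (1.0 - comm_share_slow)
    return (speed_ratio - 1.0) * 100.0


# Shares from the slide: FDR InfiniBand 17%, 10GbE 39%, 1GbE 75%.
print(round(implied_gain_percent(0.17, 0.39), 1))  # FDR vs 10GbE
print(round(implied_gain_percent(0.17, 0.75), 1))  # FDR vs 1GbE
```

Under this assumption the implied gains come out at roughly 36% over 10GbE and 232% over 1GbE, in the same range as the 39% and 246% measured on the network slide.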
23
FLOW-3D/MP Profiling – # of MPI Calls
• Overall runtime drops as more nodes take part in the MPI job
– More compute nodes reduce runtime by spreading out the workload
• Computation time drops while the communication time stays flat
– As the cluster scales, MPI time stays at roughly the same level
[Chart labels: 16 MPI Processes/Node, Pure MPI Mode]
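The pattern described here, with computation shrinking as nodes are added while MPI time stays flat, can be illustrated with a toy model. This is a minimal sketch under the assumption that compute time divides evenly across nodes and communication time is constant; the timings are hypothetical and are not measurements from the study.

```python
def runtime(nodes: int, compute_one_node: float, comm_time: float) -> float:
    """Toy model: compute time splits evenly across nodes, MPI time stays constant."""
    if nodes < 1:
        raise ValueError("need at least one node")
    return compute_one_node / nodes + comm_time


def mpi_time_share(nodes: int, compute_one_node: float, comm_time: float) -> float:
    """Fraction of the modeled runtime spent in MPI communication."""
    return comm_time / runtime(nodes, compute_one_node, comm_time)


# Hypothetical timings: 1600 s of compute on one node, 100 s of MPI time.
for n in (1, 2, 4, 8, 16):
    total = runtime(n, 1600.0, 100.0)
    share = mpi_time_share(n, 1600.0, 100.0)
    print(f"{n:2d} nodes: {total:7.1f} s total, {share:.0%} in MPI")
```

In this model the MPI share grows from about 6% on one node to 50% on 16 nodes even though the MPI time itself never changes, which is why flat communication time eventually limits pure MPI scaling.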
24
FLOW-3D/MP Profiling – Communication Time
• The most time-consuming MPI functions are:
– FDR: MPI_Allreduce (51%), MPI_Bcast (24%), MPI_Waitall (16%)
• InfiniBand spends less time in collective operations than Ethernet
– Collective communications are the most used in FLOW-3D/MP v5.0
– They account for the largest share of communication time
25
FLOW-3D/MP Profiling – # of MPI Calls
• There is a wide range of message sizes seen:
– MPI_Allreduce: Concentration between 4B and 16B
– MPI_Waitall: Around 4MB
– MPI_Bcast: Around 1-4MB
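Why the tiny MPI_Allreduce messages make the application latency sensitive can be seen with the common latency-plus-bandwidth cost model. This is a minimal sketch: the 1 microsecond latency and 6 GB/s bandwidth are illustrative assumptions, not measured values for any interconnect in this study, and the function names are hypothetical.

```python
def transfer_time(message_bytes: float, latency_s: float, bandwidth_bytes_per_s: float) -> float:
    """Simple cost model: fixed latency plus size divided by bandwidth."""
    return latency_s + message_bytes / bandwidth_bytes_per_s


def latency_share(message_bytes: float, latency_s: float = 1e-6,
                  bandwidth_bytes_per_s: float = 6e9) -> float:
    """Fraction of the modeled transfer time that is pure latency.

    The default latency and bandwidth are illustrative assumptions only.
    """
    return latency_s / transfer_time(message_bytes, latency_s, bandwidth_bytes_per_s)


small = latency_share(16)               # MPI_Allreduce-sized message
large = latency_share(4 * 1024 * 1024)  # MPI_Waitall / MPI_Bcast-sized message
print(f"16 B message: {small:.1%} of the time is latency")
print(f"4 MB message: {large:.1%} of the time is latency")
```

With these assumed figures, a 16-byte message is almost entirely latency while a 4MB message is almost entirely bandwidth, so the collective-heavy small-message traffic benefits most from a low-latency network.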
26
FLOW-3D/MP Performance – File System
• Storing data files on a local FS or tmpfs improves performance
– NFS limits scalability when running beyond 8 nodes
– The NFS used in this case runs over a 1GbE network
[Chart labels: Higher is better; InfiniBand FDR; labeled gain of 25%]
27
FLOW-3D/MP Profiling – Disk IO Time
• File IO access occurs during certain periods of the MPI solver run
– Large spikes occur when writing the restart and spatial data
– Files are written to local disk instead of NFS to avoid an IO bottleneck
28
FLOW-3D/MP – Summary
• Scalability
– FLOW-3D/MP v5.0 Hybrid mode enables higher scalability versus pure MPI version
  • Hybrid version delivers good scalability to 16 nodes (320 cores); up to 37% faster than v4.2
• Performance
– Intel Ivy Bridge-EP and FDR InfiniBand enable FLOW-3D/MP to scale to 320 cores
– Pinning each MPI process to its own socket in Hybrid mode improves performance by up to 178%
– Hybrid mode lets FLOW-3D/MP scale at 16 nodes, with up to 327% higher performance than MPI mode
• Network
– InfiniBand FDR delivers the best scalability, with a 56Gbps rate
  • Outperforms 1GbE by 246% at 16 nodes (320 cores)
  • Outperforms 10GbE by 39% at 16 nodes (320 cores)
– RDMA technology in InfiniBand allows bypassing the CPU for network transfer
  • This offload reduces CPU overhead in handling communication, so the CPU can focus on the application
• Profiling
– MPI communication time is spent mostly in MPI_Allreduce, at 51% of overall MPI time
– InfiniBand can process collective operations in the network faster than Ethernet
– Large concentration of small messages, typical for latency sensitive HPC applications
29
All trademarks are property of their respective owners. All information is provided “As-Is” without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy and completeness of the information contained herein. HPC Advisory Council undertakes no duty and assumes no obligation to update or correct any information presented herein.
Thank You
• Special thanks to Anup Gokarn of Flow Science
• Questions?
– Pak Lui - [email protected]