© 2015 ANSYS, Inc. June 18, 2015 1
Understanding Hardware Selection to Speedup Your CFD and FEA Simulations
Agenda

• Why Talk About Hardware
• HPC Terminology
• ANSYS Workflow
• Hardware Considerations
• Additional Resources
Most Users Constrained by Hardware
Source: HPC Usage survey with over 1,800 ANSYS respondents
Problem Statement

"I am not achieving the performance and throughput I was expecting from my hardware & software."

Image courtesy of Intel Corporation
Building A Balanced System Is The Key To Improving Your Experience

If your system is slow, so are your engineers and analysts. A balanced system considers:
• Processors
• Memory
• Storage
• Networks
What Hardware Configuration to Select?

The right combination of hardware and software leads to maximum efficiency.
• SMP vs. DMP
• HDD vs. SSD
• Interconnects?
• Clusters?
• GPUs? CPUs?
HPC Hardware Terminology

A cluster consists of Machine 1 (or Node 1) through Machine N (or Node N), connected by an interconnect (GigE or InfiniBand). Each machine contains Processor 1 (or Socket 1), Processor 2 (or Socket 2), and optionally a GPU.
Shared Memory Parallel

• Shared memory parallel (SMP) systems share a single global memory image that may be distributed physically across multiple cores, but is globally addressable.
• OpenMP is the industry standard.
Distributed Memory Parallel

• Distributed memory parallel processing (DMP) assumes that the physical memory of each process is separate from that of all other processes.
• Parallel processing on such a system requires some form of message-passing software to exchange data between the cores.
• MPI (Message Passing Interface) is the industry standard for this.
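The SMP/DMP distinction can be illustrated with Python's standard library: threads share one address space, while separate processes exchange data only through explicit messages, much as MPI ranks do in DMP. This is an illustrative analogy, not ANSYS code; `dmp_style_sum` and `worker` are hypothetical names.

```python
from multiprocessing import Process, Pipe

def worker(conn):
    # Each process has its own private memory; data arrives only
    # via explicit message passing, as with MPI ranks in DMP.
    part = conn.recv()          # receive a sub-domain of the work
    conn.send(sum(part))        # send the partial result back

def dmp_style_sum(data, nprocs=2):
    chunk = len(data) // nprocs
    conns, procs = [], []
    for i in range(nprocs):
        parent, child = Pipe()
        p = Process(target=worker, args=(child,))
        p.start()
        parent.send(data[i * chunk:(i + 1) * chunk])
        conns.append(parent)
        procs.append(p)
    total = sum(c.recv() for c in conns)  # "master" gathers partial sums
    for p in procs:
        p.join()
    return total

if __name__ == "__main__":
    print(dmp_style_sum(list(range(100))))  # 4950
```

The communication pattern (scatter sub-domains, gather partial results) is the same shape a distributed solver uses, just at toy scale.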
Typical HPC Growth Path

Desktop user → workstation and/or server users → cluster users → cloud solution
Remote Visualization

• Ideal for:
  – remote users submitting jobs from a Windows machine to a Linux cluster, or local users submitting jobs to a Linux cluster
  – users that do not have enough power (memory or graphics) on their local workstation to build large meshes or view graphics
• ANSYS 16.0 supports the following remote visualization applications:
  – NICE Desktop Cloud Visualization (DCV) 2013 (Linux server + Linux/Windows client)
  – OpenText Exceed onDemand 8 SP2/SP3 (Linux server + Linux/Windows client)
  – RealVNC Enterprise Edition 5.0.4 with VirtualGL (Linux server + Linux/Windows client)
  – on a Windows cluster: Microsoft Remote Desktop
• Hardware requirements for remote visualization servers:
  – GPU-capable video cards
  – large amounts of RAM, available to multiple users running ANSYS applications and pre/post-processing
Virtual Desktop (VDI) Support

• Key focus area at ANSYS (internal use & software QA)
• Focus on GPU pass-through: one GPU per VM, up to 8 VMs per machine (K1, K2 cards); memory constraints will limit this in any case
• vGPU (NVIDIA GRID) as it matures; being tested internally
• Not software rendering, not shared GPU (too slow)
• Supported at R16.0
ANSYS Remote Solve Manager (RSM)

Desktop → server → cluster (with 3rd-party scheduler)

The Remote Solve Manager (RSM) is a GUI-based job queuing system that distributes simulation tasks to (shared) computing resources. RSM enables tasks to be:
• run in background mode on the local machine
• sent to a remote compute machine
• broken into a series of jobs for parallel processing across a variety of computers

RSM as a scheduler:
• Submits to RSM itself.
• Unit of recognition: jobs (e.g. a run of a solver such as CFX, Fluent or Mechanical)

RSM as a transport mechanism:
• Submits through RSM to a high-level scheduler such as LSF, PBS Pro, Windows HPC Server 2008 R2 / 2012, or Univa Grid Engine (at R15.0).
• Unit of recognition: cores
RSM Usage Scenarios

1. Submission from a client to a centralized (shared) compute resource, allowing:
   • background queuing on a centralized machine
   • multiple users to share a common, usually large-memory/fast machine (compared to the client machine)

2. Submission from a client to multiple (shared) compute resources, allowing:
   • background queuing on a centralized machine that submits to other machines (compute servers)
   • multiple users to share user workstations (often at night) using the RSM "Limit Times for Job Submission" feature

3. Submission from a client to a centralized (shared) compute resource with a job scheduler, allowing:
   • background queuing on a centralized machine that submits to a job scheduler (e.g. LSF)
   • multiple users to run multi-node jobs on shared compute resources
Recent Enhancements in RSM

• Improved robustness and scalability
• Added support for Univa Grid Engine
• Added support for Mechanical/MAPDL restart
• Non-root users on Linux can now use the RSM wizard
• Enriched support for RSM customization
• Added component override for design point update
• Improved efficiency of design point updates

Example: parametric optimization of an intake manifold (initial vs. optimized)
• Design objectives: equal fresh and exhaust gas mass flow distribution to each cylinder; minimize the overall pressure drop
• Input parameters: radii of 3 fillets near the inlet (8 design points)
• ~5.0x speed-up over sequential execution
Guidelines:

• Know your hardware lifecycle.
• Have a goal in mind for what you want to achieve.
• Use licensing productively.
• Use ANSYS-provided processes effectively.
What Hardware Configuration to Select?

• SMP vs. DMP
• HDD vs. SSD
• Interconnects?
• Clusters?
• CPUs? GPU/Phi?
Understanding the effect of clock speed

• Generally, ANSYS applications scale with clock frequency.
• Cost/performance argues for a high clock (but maybe not the top bin).
• In ANSYS DMP benchmarks (8 cores), the clock effect is highest for the sparse solver.

Using a higher clock speed is always helpful to realize productivity gains.
Understanding the effect of memory bandwidth - Is 24 Cores Equal to 24 Cores?

• 3 x (2 x 4) = 24 cores: three nodes, each with two 4-core Xeon X5570 sockets
• 2 x (2 x 6) = 24 cores: two nodes, each with two 6-core Xeon X5670 sockets

Consider memory per core!
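The "24 cores vs. 24 cores" point can be made concrete with a back-of-the-envelope calculation. The per-socket bandwidth figure below is an assumed, illustrative number (roughly what triple-channel DDR3 delivers), not a measurement from the slides:

```python
def bandwidth_per_core(socket_bw_gbs, cores_per_socket):
    """GB/s of memory bandwidth available to each core on one socket."""
    return socket_bw_gbs / cores_per_socket

# Assumed illustrative figure: ~32 GB/s per socket of triple-channel
# DDR3 for both the Xeon X5570 (4 cores) and X5670 (6 cores).
x5570 = bandwidth_per_core(32.0, 4)   # fewer cores -> more GB/s per core
x5670 = bandwidth_per_core(32.0, 6)   # more cores share the same channels
print(round(x5570, 1), round(x5670, 1))
```

With the same socket bandwidth shared by more cores, each X5670 core gets roughly two-thirds the bandwidth of an X5570 core, which is why the two 24-core configurations are not equal for bandwidth-bound solvers.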
Understanding the effect of memory bandwidth - Is 16 Cores Equal to 16 Cores?

• 2 x (2 x 4) = 16 cores on Xeon X5570 nodes (all 4 cores per socket in use)
• 2 x (2 x 4) = 16 cores on Xeon X5670 nodes (4 of 6 cores per socket in use)

Using fewer cores per node can be helpful to realize productivity gains.
Understanding the effect of memory bandwidth - ANSYS Mechanical
Consider memory per core!
Understanding the effect of memory speed

• We can see here the effect of memory speed; this has implications for how you build your hardware.
• Some processor types have slower memory speeds by default.
• On other processors, non-optimally filling the memory channels can slow the memory speed.
• Memory speed has a direct effect on memory bandwidth.

Using a higher memory speed can be helpful to realize productivity gains.
Turbo Boost (Intel) / Turbo Core (AMD) - ANSYS CFD

• Turbo Boost (Intel) / Turbo Core (AMD) is a form of over-clocking that allows individual processors to run at a higher clock when others are idle.
• With Intel processors we have seen variable performance from this, ranging between 0-8% improvement depending on the number of cores in use.
• The graph below shows CFX on an Intel X5550, which sees a maximum improvement of only 2.5%.
Turbo Boost (Intel) / Turbo Core (AMD) - ANSYS Mechanical

• Relative to 1 core, Turbo Boost delivers good performance gains in many cases on the E5 processor family.

Using Turbo Boost / Turbo Core can be helpful to realize productivity gains, particularly at lower core counts.
Hyper-threading - Evaluation of Hyper-threading on ANSYS Fluent Performance

iDataPlex M3 (Intel Xeon X5670, 2.93 GHz), Turbo: ON. The chart compares HT OFF (12 threads on 12 physical cores) against HT ON (24 threads on 12 physical cores), measured as improvement relative to hyper-threading OFF (higher is better) across the eddy_417K, turbo_500K, aircraft_2M, sedan_4M and truck_14M Fluent models. The measured change stays within roughly 0.90x to 1.10x.

Hyper-threading is NOT recommended.
Generation to Generation - ANSYS Mechanical

Optimized for Intel Xeon E5 v3 processors:
• ANSYS Mechanical 16.0 performs well on the latest Intel processor architecture.
• A Haswell processor-based system is 20% to 40% faster than a Sandy Bridge processor-based system across a variety of benchmarks.
ANSYS Fluent on Intel Ivy Bridge - Ivy Bridge vs. Sandy Bridge, Single Node

Ivy Bridge = "tick" release of Sandy Bridge:
• Similar micro-architecture, more cores, reduced power
• Expect similar core-to-core performance on Ivy Bridge and Sandy Bridge; improved node-to-node performance

Single-node performance of ANSYS Fluent 14.5 over six benchmark cases (2x8-core Sandy Bridge vs. 2x12-core Ivy Bridge):
• 50% performance boost matches the 50% core count increase
• Scaling maintained at the higher core density
• Achieved via efficient memory use (and higher RAM speed)

| Case | Ivy Bridge | Sandy Bridge | Ratio |
|---|---|---|---|
| turbo_500k | 11755.1 | 7926.6 | 1.5 |
| eddy_417k | 2981.9 | 2192.9 | 1.4 |
| aircraft_2m | 2668.7 | 1797.2 | 1.5 |
| sedan_4m | 2070.7 | 1466.9 | 1.4 |
| truck_14m | 215.0 | 146.1 | 1.5 |
| truck_poly_14m | 233.1 | 156.7 | 1.5 |
ANSYS Fluent - Ivy Bridge vs. Sandy Bridge, Scaling

Multi-node performance of ANSYS Fluent 14.5, up to 192 cores. The chart plots solver rating against number of cores (0 to 384) for the truck_14m case on Fluent 14.5, with one curve each for Sandy Bridge and Ivy Bridge.

Nearly identical core-to-core scaling confirms system "balance" for Fluent.
Per Node vs. Per Core Comparisons

• This is a 4-socket vs. 2-socket node comparison:
  – Xeon E7-4890 v2, 2.80 GHz (4 sockets)
  – Xeon E5-2697 v2, 2.70 GHz (2 sockets)
• From the per-node comparison you would assume it is better to go with the 4-socket node.
• Per core, however, the 2-socket node is the better choice.
• Neither shows linear scalability, as both are running on all cores per node (bandwidth constrained).
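The per-node vs. per-core trap can be sketched numerically. The ratings below are hypothetical (the slide gives no numbers); only the core counts of the named processors are real:

```python
def per_core_rating(node_rating, cores_per_node):
    """Throughput (e.g. jobs/day) normalized per core."""
    return node_rating / cores_per_node

# Hypothetical node ratings, for illustration only.
four_socket = {"rating": 100.0, "cores": 60}  # 4 x E7-4890 v2 (15 cores each)
two_socket  = {"rating":  60.0, "cores": 24}  # 2 x E5-2697 v2 (12 cores each)

# Per node the 4-socket box looks better...
assert four_socket["rating"] > two_socket["rating"]
# ...but per core the 2-socket box wins.
assert per_core_rating(two_socket["rating"], two_socket["cores"]) > \
       per_core_rating(four_socket["rating"], four_socket["cores"])
```

Since HPC licensing and purchase cost usually scale with cores, the per-core figure is often the one that matters.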
Generation to Generation - ANSYS Fluent (Application Example)

Case details:
• Flow through a combustor
• Number of cells: 12 million
• Cell type: polyhedra
• Models used: Realizable k-ε turbulence
• Solver: pressure-based coupled, species transport, least-squares cell based, pseudo transient
Generation to Generation - ANSYS Fluent (Application Example)

Case details:
• External flow over a passenger sedan
• Number of cells: 4 million
• Cell type: mixed
• Models used: Standard k-ε turbulence
• Solver: pressure-based coupled, steady, Green-Gauss cell based
Recap

• Faster cores mean a faster solution.
• Faster memory means a faster solution.
• Memory bandwidth is an important factor for (linear) scalability.
• Turbo Boost / Turbo Core modes do give some benefit, especially at low core counts per node.
• In general, hyper-threading should not be used because of licensing implications.
• Be careful when looking at comparisons! Make sure you are comparing like with like.
What Hardware Configuration to Select?

• SMP vs. DMP
• HDD vs. SSD
• Interconnects?
• Clusters?
• CPUs? GPU/Phi?
Understanding the effect of the interconnect

• You need fast interconnects to feed fast processors.
• Two main characteristics for each interconnect: latency and bandwidth.
• Distributed ANSYS is highly bandwidth bound.

    +--------- D I S T R I B U T E D   A N S Y S   S T A T I S T I C S ------------+
    Release: 14.5        Build: UP20120802        Platform: LINUX x64
    Date Run: 08/09/2012        Time: 23:07
    Processor Model: Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz

    Total number of cores available    : 32
    Number of physical cores available : 32
    Number of cores requested          : 4 (Distributed Memory Parallel)
    MPI Type: INTELMPI

    Core  Machine Name   Working Directory
    ----------------------------------------------------
     0    hpclnxsmc00    /data1/ansyswork
     1    hpclnxsmc00    /data1/ansyswork
     2    hpclnxsmc01    /data1/ansyswork
     3    hpclnxsmc01    /data1/ansyswork

    Latency time from master to core 1 = 1.171 microseconds
    Latency time from master to core 2 = 2.251 microseconds
    Latency time from master to core 3 = 2.225 microseconds

    Communication speed from master to core 1 = 7934.49 MB/sec   (same machine)
    Communication speed from master to core 2 = 3011.09 MB/sec   (QDR InfiniBand)
    Communication speed from master to core 3 = 3235.00 MB/sec   (QDR InfiniBand)
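When comparing interconnects it helps to pull the latency and bandwidth figures out of the solver's statistics block programmatically. A minimal sketch, assuming the line format shown in the excerpt above (field wording may differ between releases):

```python
import re

# Sample lines in the format of the Distributed ANSYS statistics block.
STATS = """
Latency time from master to core 1 =    1.171 microseconds
Latency time from master to core 2 =    2.251 microseconds
Communication speed from master to core 1 =  7934.49 MB/sec
Communication speed from master to core 2 =  3011.09 MB/sec
"""

def parse_interconnect(text):
    """Return ({core: latency_us}, {core: bandwidth_mb_s}) from the stats text."""
    latency = {int(c): float(v) for c, v in re.findall(
        r"Latency time from master to core (\d+) =\s+([\d.]+)", text)}
    bandwidth = {int(c): float(v) for c, v in re.findall(
        r"Communication speed from master to core (\d+) =\s+([\d.]+)", text)}
    return latency, bandwidth

lat, bw = parse_interconnect(STATS)
# Core 1 sits on the same machine as the master, so it shows far
# higher bandwidth than core 2, which is reached over QDR InfiniBand.
print(bw[1] > 2 * bw[2])  # True
```

The same-machine vs. InfiniBand gap in these numbers is exactly why "Distributed ANSYS is highly bandwidth bound" matters for node count choices.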
Understanding the effect of the interconnect - ANSYS Fluent

ANSYS Fluent performance on iDataPlex M3 (Intel Xeon X5670, 12C, 2.93 GHz); networks: Gigabit, 10-Gigabit, 4X QDR InfiniBand (QLogic, Voltaire); hyper-threading OFF, Turbo ON; model: truck_14M. The chart plots Fluent rating (higher is better) against the number of cores used by a single job (12 to 768) for the QLogic, Voltaire, 10-Gigabit and Gigabit networks.
Understanding the effect of the interconnect - ANSYS Fluent

Exhaust model: 7.6M cells, transient simulation with explicit time stepping for an engine start-up cycle, run on Fujitsu PRIMERGY CX250 HPC systems (E5-2690v2 with 20 and E5-2697v2 with 24 cores per node, respectively). For CFD we can see the performance of InfiniBand vs. GigE: GigE starts to drop off after 2 nodes.
Understanding the effect of the interconnect - ANSYS Fluent

For CFD, 10 GigE starts to taper off after 8 nodes.
Understanding the effect of the interconnect - ANSYS Mechanical

V13sp-5 model: turbine geometry, 2,100K DOF, SOLID187 finite elements, static nonlinear, one iteration, direct sparse solver, on a Linux cluster (8 cores per node). The chart plots rating (runs/day, 0 to 60) at 8, 16, 32, 64 and 128 cores for Gigabit Ethernet vs. DDR InfiniBand.
Understanding the effect of the interconnect - ANSYS Mechanical

For ANSYS Mechanical, GigE does not scale beyond 1 node!
Understanding the effect of the interconnect - ANSYS Mechanical

• GigE (Gigabit Ethernet): 1 Gbit/s (~100 MB/s) - not recommended!
• 10 GigE: 10 Gbit/s (~1000 MB/s) - bare minimum!
• Myrinet (Myricom, Inc.): 2 Gbit/s (~250 MB/s); Myri-10G: 10 Gbit/s (4th-generation Myrinet)
• InfiniBand (many vendors/speeds): SDR/DDR/QDR, 1x, 4x, 12x - see http://en.wikipedia.org/wiki/List_of_device_bandwidths

RECOMMENDATION: over 1000 MB/s, especially when running on more than 4 nodes.
Recap

• 10 GigE and InfiniBand are recommended for HPC clusters.
• Currently, InfiniBand is the only recommendation for large clusters.
• QDR should be more than adequate for small to medium clusters; FDR for large clusters.
• For more than 1 node you will see performance decrease using GigE.
• Mechanical users should not use GigE at all if their jobs span more than one node.
What Hardware Configuration to Select?

• SMP vs. DMP
• HDD vs. SSD
• Interconnects?
• Clusters?
• CPUs? GPU/Phi?
Parallel file systems

• NFS: the NFS server and/or master node causes an I/O bottleneck.
• Local disk: the master node causes an I/O bottleneck.
• Parallel file system: I/O scales with the cluster.
Parallel file systems - ANSYS Mechanical

• The example here uses GPFS for ANSYS Mechanical.
• Notice how it is very similar in speed to a local RAID 0 configuration (4 x 15k SAS).
Understanding the effect of I/O - ANSYS Fluent

• Parallel I/O is based on MPI-IO.
• Implemented for data file read and write: a single file is written collectively by the nodes.
• Suited for parallel file systems; does not work on NFS.
• Support for Panasas, PVFS2, HP/SFS, IBM/GPFS, EMC/MPFS2, Lustre.
• Files cannot be written directly compressed, but can be compressed asynchronously.
Understanding the effect of I/O - ANSYS Fluent

Truck-111m write throughput at 176 cores (Truck-111million uses the DES model with the segregated implicit solver; Panasas layout available with MPI-IO hints in Fluent 14.5):

| Configuration | Write data file throughput (MB/s) |
|---|---|
| Legacy NAS | 229.42 |
| Serial IO | 386.72 |
| Parallel IO | 1182.60 |
| Parallel IO (RAID-10, CW) | 1644.15 |

Parallel IO = 7x (Legacy NAS); Parallel IO = 4x (Serial IO).
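The 7x and 4x figures fall straight out of the throughput numbers in the table, as this quick check shows:

```python
# Write throughput (MB/s) for the Truck-111m case at 176 cores,
# taken from the table above.
throughput = {
    "Legacy NAS": 229.42,
    "Serial IO": 386.72,
    "Parallel IO": 1182.60,
    "Parallel IO (RAID-10, CW)": 1644.15,
}

best = throughput["Parallel IO (RAID-10, CW)"]
print(round(best / throughput["Legacy NAS"]))  # 7  (x over legacy NAS)
print(round(best / throughput["Serial IO"]))   # 4  (x over serial I/O)
```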
Understanding the effect of I/O - ANSYS Fluent

Landing gear noise predictions using scale-resolving simulations (180M-cell model using the pressure-based segregated solver).
Understanding the effect of I/O - ANSYS Fluent

Asynchronous I/O for Linux Fluent: total write time is 3-5x quicker over NFS, with even larger speed-ups on bigger cases and local disk (up to 10x).

| Mesh | File | Location | Async I/O | Time |
|---|---|---|---|---|
| 15M | Cas | NFS | OFF | 217 s |
| 15M | Cas | NFS | ON | 62 s |
| 15M | Dat | NFS | OFF | 113 s |
| 15M | Dat | NFS | ON | 8 s |
| 30M | Cas | NFS | OFF | 207 s |
| 30M | Cas | NFS | ON | 75 s |
| 30M | Dat | NFS | OFF | 144 s |
| 30M | Dat | NFS | ON | 10 s |
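The "3-5x total" claim can be verified directly from the write times in the table:

```python
# Write times in seconds over NFS (async I/O OFF, ON), from the table above.
times = {
    ("15M", "Cas"): (217, 62),
    ("15M", "Dat"): (113, 8),
    ("30M", "Cas"): (207, 75),
    ("30M", "Dat"): (144, 10),
}

def total_speedup(mesh):
    """Total (case + data file) write-time speed-up from async I/O."""
    off = sum(times[(mesh, kind)][0] for kind in ("Cas", "Dat"))
    on = sum(times[(mesh, kind)][1] for kind in ("Cas", "Dat"))
    return off / on

print(round(total_speedup("15M"), 1))  # 4.7
print(round(total_speedup("30M"), 1))  # 4.1
```

Both totals land inside the slide's 3-5x range; the data-file writes alone improve by roughly 14x.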
Understanding the effect of I/O - ANSYS Mechanical

SP-5 (in-core) R14.5 benchmark results. Rating (jobs/day) by configuration (#machines x #cores; memory used per configuration: 29 GB, 33 GB, 35.6 GB, 40.8 GB, 47.8 GB):

| Storage | 1x1 | 1x2 | 1x4 | 1x8 | 1x16 |
|---|---|---|---|---|---|
| 4x SSD RAID 0, SATA 3 Gb/s | 89 | 145 | 180 | 301 | 419 |
| 2x SSD RAID 0, SATA 3 Gb/s | 89 | 146 | 180 | 275 | 384 |
| SSD, SATA 6 Gb/s | 88 | 144 | 180 | 283 | 368 |
| HD (7.2K RPM), SATA 6 Gb/s | 88 | 124 | 118 | 95 | 52 |
Recap

• I/O is very important for the Mechanical solver:
  – RAID 0 is mandatory for multiple disks
  – SSDs recommended for speed; otherwise 15k SAS drives
• Fluent and CFX, for most customers, will not require fast local disk access (for most types of job).
• Parallel file systems can meet the requirements of both types of solver.
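Why RAID 0 is mandatory for multiple disks can be sketched with simple arithmetic: striping aggregates the throughput of the member disks. The per-device speeds and the efficiency factor below are assumed, illustrative values, not benchmark data:

```python
def raid0_throughput(per_disk_mbs, n_disks, efficiency=0.9):
    """Rough RAID 0 aggregate throughput: striping scales with disk
    count, minus some controller overhead (efficiency is an assumed
    factor, not a measured one)."""
    return per_disk_mbs * n_disks * efficiency

# Assumed per-device sequential write speeds, for illustration only:
print(raid0_throughput(120, 4))   # 4 x 15k SAS HDDs -> 432.0 MB/s
print(raid0_throughput(250, 2))   # 2 x SATA SSDs    -> 450.0 MB/s
```

This is also why the 4x-SSD RAID 0 row in the benchmark table above pulls ahead at the higher core counts, where the solver's I/O demand is greatest.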
Is Your Hardware Ready for HPC? - ANSYS Mechanical

The chart maps model size to required resources: I/O throughput (100 to 1200 MB/s, achievable with 1x SAS, 2x SAS, 1x SSD or 2x SSD) against RAM (8, 16, 32, 48, 64, 96 or 128 GB), with bands for 0.2 MDOF, 2 MDOF, 4 MDOF and > 6 MDOF models.
![Page 61: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/61.jpg)
© 2015 ANSYS, Inc. June 18, 2015 62
HDD vs. SSD
What Hardware Configuration to Select?
SMP vs. DMP Interconnects? Clusters?
CPUs? GPU/Phi?
![Page 62: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/62.jpg)
© 2015 ANSYS, Inc. June 18, 2015 63
DMP Outperforming SMP
6 million degrees of freedom; plasticity, contact, bolt pretension; 4 load steps
![Page 63: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/63.jpg)
© 2015 ANSYS, Inc. June 18, 2015 64
DMP: Good Performance at High Core Counts
[Two charts: solver rating vs. number of cores. Left: 10.7 million DOF, static, linear, structural, 1 load step. Right: 1 million DOF, harmonic, linear, structural, 4 frequencies. Hardware: Intel Xeon E5-2690 processors (2.9 GHz, 16 cores total), 128 GB of RAM.]
![Page 64: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/64.jpg)
© 2015 ANSYS, Inc. June 18, 2015 65
ANSYS Mechanical 14.5 DMP Enabling Scalability at High Core Counts
Minimum time to solution is more important than scaling
[Chart, "Solution Scalability": speedup (0-25x) vs. number of cores (0-64). V14sp-5 model: turbine geometry, 2.1 million DOF, static nonlinear analysis, 1 load step, 7 substeps, 25 equilibrium iterations; 8-node Linux cluster with 8 cores per node.]
![Page 65: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/65.jpg)
© 2015 ANSYS, Inc. June 18, 2015 66
ANSYS Mechanical 15.0 Faster Performance at Higher Core Counts
Improved Scaling at 8 cores by an enhanced domain decomposition method
[Bar chart: speedup over R14.5: Engine (9 MDOF) 1.3x, Stent (520 KDOF) 1.7x, Clutch (160 KDOF) 2.7x, Bracket (45 KDOF) 2.4x.]
8-node Linux cluster (8 cores and 48 GB of RAM per node, InfiniBand DDR)
![Page 66: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/66.jpg)
© 2015 ANSYS, Inc. June 18, 2015 67
ANSYS Mechanical 15.0 Faster Performance at Higher Core Counts
Improved Scaling at 16 cores by an enhanced domain decomposition method
[Bar chart: speedup over R14.5: Engine (9 MDOF) 1.6x, Stent (520 KDOF) 1.8x, Clutch (160 KDOF) 3.8x, Bracket (45 KDOF) 4.0x.]
8-node Linux cluster (8 cores and 48 GB of RAM per node, InfiniBand DDR)
![Page 67: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/67.jpg)
© 2015 ANSYS, Inc. June 18, 2015 68
ANSYS Mechanical 15.0 Faster Performance at Higher Core Counts
Improved Scaling at 32 cores by an enhanced domain decomposition method
[Bar chart: speedup over R14.5: Engine (9 MDOF) 1.8x, Stent (520 KDOF) 2.2x, Clutch (160 KDOF) 3.9x, Bracket (45 KDOF) 5.0x.]
8-node Linux cluster (8 cores and 48 GB of RAM per node, InfiniBand DDR)
![Page 68: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/68.jpg)
© 2015 ANSYS, Inc. June 18, 2015 70
ANSYS Mechanical 16.0 Faster Performance at Higher Core Counts
Continually improving Core Solver Rating to 128 cores
Courtesy of HP
![Page 69: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/69.jpg)
© 2015 ANSYS, Inc. June 18, 2015 71
ANSYS Mechanical 15.0 HPC & Solver Technology Improvements
• Improved scalability of the distributed solver at higher core counts
• NEW Subspace eigensolver supports shared and distributed parallel technology
• NEW MSUP harmonic method for unsymmetric systems, e.g. vibro-acoustics
Coupled Acoustic, 1.2 M DOF, Full Harmonic Response
2.09 MDOFs first 20 modes
![Page 70: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/70.jpg)
© 2015 ANSYS, Inc. June 18, 2015 72
GPU/Phi? HDD vs. SSD
What Hardware Configuration to Select?
SMP vs. DMP Interconnects? Clusters?
CPUs?
![Page 71: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/71.jpg)
© 2015 ANSYS, Inc. June 18, 2015 73
Some Basics: ANSYS Software on NVIDIA GPUs
• GPUs are accelerators and can significantly speed up your simulations; GPUs work hand in hand with CPUs
• Most ANSYS GPU acceleration is user-transparent; the only requirement is to inform ANSYS of how many GPUs to use
• Schematic of a CPU with an attached GPU accelerator: the CPU begins/ends the job, the GPU manages the heavy computations
![Page 72: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/72.jpg)
© 2015 ANSYS, Inc. June 18, 2015 74
GPU-based Model: Radiation Heat Transfer using OptiX
GPU-based Solver: Coupled Algebraic Multigrid (AMG) PBNS linear solver
Operating Systems: Both Linux and Win64 for workstations and servers
Parallel Methods: Shared and distributed memory
Supported GPUs: Tesla K40, Tesla K80, and Quadro 6000
Multi-GPU Support: Full multi-GPU and multi-node support
Model Suitability: Unlimited (hardware dependent)
GPU Accelerator Capability - ANSYS Fluent
![Page 73: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/73.jpg)
© 2015 ANSYS, Inc. June 18, 2015 75
ANSYS Fluent on GPU Performance of Pressure-Based Solver
Sedan Model
Sedan geometry 3.6M mixed cells Steady, turbulent External aerodynamics Coupled PBNS, DP CPU: Intel Xeon E5-2680; 8 cores GPU: 2 X Tesla K40
[Bar chart (higher is better), jobs/day: segregated solver, CPU only: 12 jobs/day; coupled solver, CPU only: 15 jobs/day; coupled solver, CPU + GPU: 27 jobs/day, a 1.9x gain over the coupled CPU-only run.]
Convergence criteria: 1e-03 for all variables. Iterations until convergence: segregated CPU: 2798 iterations (7070 s); coupled CPU: 967 iterations (5900 s); coupled CPU + GPU: 985 iterations (3150 s)
NOTE: Times for total solution until convergence
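The jobs/day ratings on this slide follow directly from the reported wall-clock solve times (rating = seconds per day / seconds per job). A small sketch to reproduce them:

```python
# Convert a total solve time into a jobs/day rating, as used on the
# benchmark slides: one day of wall clock divided by time per job.
def jobs_per_day(solve_seconds):
    return 86400 / solve_seconds

for name, secs in [("segregated CPU", 7070),
                   ("coupled CPU", 5900),
                   ("coupled CPU+GPU", 3150)]:
    print(f"{name}: {jobs_per_day(secs):.1f} jobs/day")
# Rounded, these reproduce the 12 / 15 / 27 jobs/day bars.
```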
![Page 74: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/74.jpg)
© 2015 ANSYS, Inc. June 18, 2015 76
ANSYS Fluent on GPU Performance of Pressure-Based Solver
Truck Model: external aerodynamics, 14 million cells, steady, k-ε turbulence, coupled PBNS, DP; 2 nodes each with dual Intel Xeon E5-2698 V3 (16 CPU cores) and dual Tesla K80 GPUs.
[Bar chart (higher is better), simulation productivity with an HPC Workgroup 64 license: 64 CPU cores: 11 jobs/day; 56 CPU cores + 4 Tesla K80 GPUs: 33 jobs/day. Relative to the CPU-only solution cost, the GPUs add roughly 40% cost for roughly 200% additional productivity.]
![Page 75: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/75.jpg)
© 2015 ANSYS, Inc. June 18, 2015 77
ANSYS Fluent on GPU Better Speedup on Larger Models
Truck Model: external aerodynamics, steady, k-ε turbulence, double-precision solver. CPU: Intel Xeon E5-2667, 12 cores per node; GPU: Tesla K40, 4 per node. NOTE: reported times are per iteration.
[Bar chart (lower is better), ANSYS Fluent time per iteration: 14 million cells: 13 s on 36 CPU cores vs. 9.5 s on 36 CPU cores + 12 GPUs (1.4x); 111 million cells: 36 s on 144 CPU cores vs. 18 s on 144 CPU cores + 48 GPUs (2x).]
![Page 76: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/76.jpg)
© 2015 ANSYS, Inc. June 18, 2015 78
NVIDIA-GPU Solution Fit for ANSYS Fluent
[Decision flow: CFD analysis → Is it single-phase and flow dominant? (no → not ideal for GPUs) → Is it a steady-state analysis? → Pressure-based coupled solver? If yes, best fit for GPUs; if using the segregated solver, consider switching to the pressure-based coupled solver for better performance (faster convergence) and further speedups with GPUs. Please see the next slide.]
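The decision flow above can be sketched as a small function. The branch order is simplified and the function name is illustrative, not part of any ANSYS tool:

```python
# Hypothetical sketch of the Fluent GPU-suitability decision flow:
# single-phase, flow-dominant, steady-state cases with the
# pressure-based coupled solver are the best fit for GPUs.
def fluent_gpu_fit(single_phase_flow_dominant, steady_state, coupled_solver):
    if not (single_phase_flow_dominant and steady_state):
        return "Not ideal for GPUs"
    if coupled_solver:
        return "Pressure-based coupled solver: best fit for GPUs"
    return ("Segregated solver: consider switching to the pressure-based "
            "coupled solver for faster convergence and GPU speedups")

print(fluent_gpu_fit(True, True, True))
```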
![Page 77: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/77.jpg)
© 2015 ANSYS, Inc. June 18, 2015 79
NVIDIA-GPU Solution Fit for ANSYS Fluent - Supported Hardware Configurations
[Diagram of supported hardware configurations. Valid configurations require:
● Homogeneous process distribution
● Homogeneous GPU selection
● Number of processes an exact multiple of the number of GPUs
Invalid examples: some nodes with 16 processes and some with 12; some nodes with 2 GPUs and some with 1; 15 processes not divisible by 2 GPUs.]
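These validity rules can be expressed as a quick pre-flight check. The function below is a minimal sketch; the name and argument shapes are illustrative, not part of any ANSYS tool:

```python
# Check the three Fluent GPU configuration rules from the slide:
# homogeneous process counts per node, homogeneous GPU counts per
# node, and total processes divisible by total GPUs.
def valid_gpu_config(procs_per_node, gpus_per_node):
    if len(set(procs_per_node)) != 1:
        return False  # process distribution must be homogeneous
    if len(set(gpus_per_node)) != 1:
        return False  # GPU selection must be homogeneous
    total_procs = sum(procs_per_node)
    total_gpus = sum(gpus_per_node)
    # number of processes must be an exact multiple of number of GPUs
    return total_gpus > 0 and total_procs % total_gpus == 0

print(valid_gpu_config([16, 12], [2, 2]))  # False: mixed process counts
print(valid_gpu_config([8, 8], [2, 1]))    # False: mixed GPU counts
print(valid_gpu_config([15], [2]))         # False: 15 not divisible by 2
print(valid_gpu_config([8, 8], [2, 2]))    # True
```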
![Page 78: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/78.jpg)
© 2015 ANSYS, Inc. June 18, 2015 80
ANSYS Fluent - Power Consumption Study
• Adding GPUs to a CPU-only node resulted in 2.1x speed up while reducing energy consumption by 38%
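The two figures are consistent with energy = average power x runtime. The implied power draw of the GPU node below is a back-of-envelope estimate derived from the slide's numbers, not a measured value:

```python
# With a 2.1x speedup (runtime ratio 1/2.1) and 38% less energy,
# energy = power * time implies the GPU node draws ~1.3x the power
# of the CPU-only node while finishing in less than half the time.
speedup = 2.1
energy_ratio = 1 - 0.38          # GPU-node energy / CPU-only energy
power_ratio = energy_ratio * speedup
print(f"Implied average power ratio: {power_ratio:.2f}x")
```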
![Page 79: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/79.jpg)
© 2015 ANSYS, Inc. June 18, 2015 81
NVIDIA-GPU Solution Fit for ANSYS Fluent
GPUs accelerate the AMG solver portion of the CFD analysis, and thus benefit problems with a relatively high %AMG • Coupled solvers have a high %AMG, in the range of 60-70% • Fine meshes and low-dissipation problems have a high %AMG
In some cases, pressure-based coupled solvers offer faster convergence compared to segregated solvers (problem-dependent)
The whole problem must fit in GPU memory for the calculations to proceed • With the pressure-based coupled solver, each million cells needs approx. 4 GB of GPU memory • High-memory cards such as Tesla K40 or Quadro K6000 are ideal
Moving scalar equations such as turbulence to the GPU (the 'scalar yes' option under 'amg-options') may not provide much benefit because their workloads are small
Better performance on lower CPU core counts • A ratio of 3 or 4 CPU cores to 1 GPU is recommended
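The 4 GB-per-million-cells rule gives a quick memory-fit check before committing a job to GPUs. The helper below is illustrative, assuming that rule of thumb:

```python
# Rough GPU memory-fit check for the pressure-based coupled solver,
# using the slide's ~4 GB per million cells rule of thumb.
def fits_on_gpus(cells_millions, gpu_mem_gb_each, n_gpus):
    required_gb = 4.0 * cells_millions
    return required_gb <= gpu_mem_gb_each * n_gpus

# A 3.6M-cell model on two 12 GB Tesla K40s:
print(fits_on_gpus(3.6, 12, 2))  # True: ~14.4 GB needed, 24 GB available
# The same model on a single 6 GB card:
print(fits_on_gpus(3.6, 6, 1))   # False
```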
![Page 80: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/80.jpg)
© 2015 ANSYS, Inc. June 18, 2015 82
GPU Accelerator Capability - ANSYS Mechanical
Supports the majority of ANSYS structural mechanics solvers:
• Covers both sparse direct and PCG iterative solvers
• Only a few minor limitations
Ease of use:
• Requires at least one supported GPU card to be installed
• No rebuild, no additional installation steps
Performance:
• Offers significantly faster time to solution
• Should never slow down your simulation
V14sp-5 Model
![Page 81: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/81.jpg)
© 2015 ANSYS, Inc. June 18, 2015 83
Influence of GPU Accelerator on Speedup
[Charts: Impeller model, ~2M DOF solid FEs, normal modes analysis using cyclic symmetry, ANSYS Mechanical SMP Block-Lanczos solver: 4 cores + GPU gives a 2.4x speedup vs. 4 cores. Speaker model, ~0.7M DOF solid FEs, vibroacoustic harmonic analysis for one frequency, ANSYS Mechanical distributed sparse solver: 4 cores + GPU gives a 2.7x speedup vs. 4 cores.]
![Page 82: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/82.jpg)
© 2015 ANSYS, Inc. June 18, 2015 84
NVIDIA-GPU Solution Fit for ANSYS Mechanical
GPUs accelerate the solver part of the analysis, so problems with high solver workloads benefit the most from GPUs
• Characterized by both high DOF and high factorization requirements
• Models with solid elements (such as castings) and >500K DOF experience good speedups
Better performance when run in DMP mode rather than SMP mode
GPU and system memory both play important roles in performance
• Sparse solver:
– Bulkier and/or higher-order FE models are good candidates and will be accelerated
– If the model exceeds 5M DOF, either add a second GPU with 5-6 GB of memory (Tesla K20 or K20X) or use a single GPU with 12 GB of memory (Tesla K40 or Quadro K6000)
• PCG/JCG solver:
– The memory-saving (MSAVE) option must be turned off to enable GPUs
– Models with a lower Level of Difficulty value (Lev_Diff) are better suited for GPUs
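The sparse-solver sizing guidance can be restated as a tiny helper. This is a hedged sketch assuming the 5M DOF threshold above; it is not an ANSYS utility:

```python
# Illustrative restatement of the sparse-solver GPU sizing advice:
# beyond ~5M DOF, add a second 5-6 GB GPU or use one 12 GB GPU.
def sparse_gpu_advice(dof_millions):
    if dof_millions <= 5:
        return "A single 5-6 GB GPU (e.g. Tesla K20/K20X) is sufficient"
    return ("Add a second 5-6 GB GPU (Tesla K20/K20X) "
            "or use a single 12 GB GPU (Tesla K40 / Quadro K6000)")

print(sparse_gpu_advice(2.0))
print(sparse_gpu_advice(6.0))
```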
![Page 83: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/83.jpg)
© 2015 ANSYS, Inc. June 18, 2015 87
GPU Achievements ANSYS Mechanical 16.0 Supporting Newest GPUs
[Bar charts (higher is better), jobs/day: V15sp-4: 135 jobs/day on 8 CPU cores vs. 247 jobs/day on 6 CPU cores + K80 GPU (1.8x); V15sp-5: 159 jobs/day on 8 CPU cores vs. 371 jobs/day on 6 CPU cores + K80 GPU (2.3x).]
V15sp-4 Model
Turbine geometry 3.2 million DOF SOLID187 elements Static, nonlinear analysis Sparse direct solver
V15sp-5 Model
Ball grid array geometry 6.0 million DOF Static, nonlinear analysis Sparse direct solver
Distributed ANSYS Mechanical 16.0 with Intel Xeon E5-2697v2 2.7 GHz 8-core CPU; Tesla K80 GPU with boost clocks.
![Page 84: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/84.jpg)
© 2015 ANSYS, Inc. June 18, 2015 89
GPUs can offer significantly faster time to solution. Lower core counts favor a single GPU; higher core counts favor multiple GPUs.
Courtesy of HP
GPU Achievements ANSYS Mechanical 15.0 Supporting Newest GPUs
![Page 85: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/85.jpg)
© 2015 ANSYS, Inc. June 18, 2015 92
GPU Achievements ANSYS Mechanical 16.0 Supporting Xeon Phi
Background:
• ANSYS Mechanical 15.0 was the first commercial FEA program to support the Intel Xeon Phi coprocessor
• Support was limited to shared-memory parallelism (SMP) on Linux only
Intel Xeon Phi coprocessor support:
• R16 now supports distributed-memory parallelism (DMP) and Windows
[Bar chart: speedup vs. core count (1, 2, 4, 8, 16 cores) with and without Xeon Phi; the coprocessor substantially increases speedup at every core count (e.g. 3.6x vs. 1.8x at low core counts), reaching 14.4x at 16 cores.]
![Page 86: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/86.jpg)
© 2015 ANSYS, Inc. June 18, 2015 93
GPU Achievements ANSYS License Scheme for GPU and Phi
Licensing Examples:
• 1 x ANSYS HPC Pack: total 8 HPC tasks (4 GPU/Phi max). Examples of valid configurations: 6 CPU cores + 2 GPU/Phi, or 4 CPU cores + 4 GPU/Phi
• 2 x ANSYS HPC Pack: total 32 HPC tasks (16 GPU/Phi max). Example of a valid configuration: 24 CPU cores + 8 GPU/Phi (total use of 2 compute nodes)
(Applies to all schemes: ANSYS HPC, ANSYS HPC Pack, ANSYS HPC Workgroup)
![Page 87: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/87.jpg)
© 2015 ANSYS, Inc. June 18, 2015 95
HDD vs. SSD
Maximizing Performance – Putting it Together
The right combination of hardware and software
leads to maximum efficiency
SMP vs. DMP
Interconnects?
Clusters?
CPUs? GPU/Phi?
![Page 88: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/88.jpg)
© 2015 ANSYS, Inc. June 18, 2015 96
#1 Rule: Avoid waiting for I/O to complete
• Always check whether the job is I/O bound or compute bound
– Check the output file for CPU and Elapsed times
• When Elapsed time >> main-thread CPU time → I/O bound
– Consider adding more RAM or a faster hard drive configuration
• When Elapsed time ≈ main-thread CPU time → compute bound
– Consider moving the simulation to a machine with newer, faster processors
– Consider using Distributed ANSYS (DMP) instead of SMP
– Consider running on more CPU cores or possibly using GPU(s)
Total CPU time for main thread : 159.8 seconds
. . . . . .
Elapsed Time (sec) = 398.000 Date = 03/21/2013
Maximizing Performance – ANSYS Mechanical
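This check can be automated by scanning the solver output file. The line patterns below follow the sample output on the slide; the 1.5x ratio threshold is an illustrative assumption, not an ANSYS-documented value:

```python
import re

# Classify a run as I/O bound or compute bound from the solver
# output, per the rule above: Elapsed >> CPU time means I/O bound.
def classify(output_text, ratio_threshold=1.5):
    cpu = float(re.search(r"Total CPU time for main thread\s*:\s*([\d.]+)",
                          output_text).group(1))
    elapsed = float(re.search(r"Elapsed Time \(sec\)\s*=\s*([\d.]+)",
                              output_text).group(1))
    return "I/O bound" if elapsed / cpu > ratio_threshold else "compute bound"

sample = """Total CPU time for main thread : 159.8 seconds
Elapsed Time (sec) = 398.000 Date = 03/21/2013"""
print(classify(sample))  # I/O bound (398 / 159.8 is ~2.5)
```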
![Page 89: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/89.jpg)
© 2015 ANSYS, Inc. June 18, 2015 97
Maximizing Performance – ANSYS Mechanical
How to improve an I/O bound simulation
– First consider adding more RAM
• Always the best option for optimal performance
• Allows the operating system to cache file data in memory
– Next consider improving the I/O configuration
• Need fast hard drives to feed fast processors
– Consider SSDs: higher bandwidths and extremely low seek times
– Consider RAID configurations: RAID 0 for speed; RAID 1/5 for redundancy; RAID 10 for speed and redundancy
![Page 90: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/90.jpg)
© 2015 ANSYS, Inc. June 18, 2015 98
Maximizing Performance – ANSYS Mechanical
Example of an I/O bound simulation
[Bar chart, relative speedup, "Benefits of SSD and RAM", for 2 cores + HDD, 8 cores + HDD, and 8 cores + SSD with 16 GB vs. 128 GB RAM. Lack of RAM and a slow HDD ruin scaling (8 cores + HDD run at 0.8x with 16 GB); a single SSD allows some scaling, not as helpful as RAM but cheaper; adding RAM gives the biggest gains and allows good scaling, with speedups up to 5.9x at 128 GB. Reported bars: 0.8x, 2.9x, 2.7x, 5.9x, 5.9x relative to the 2-core HDD baseline.]
• 2.1 million DOF • Nonlinear static analysis • Direct sparse solver (DSPARSE) • 2 Intel Xeon E5-2670 (2.6 GHz, 16 cores total) • One 10k rpm HDD, one SSD • Windows 7
![Page 91: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/91.jpg)
© 2015 ANSYS, Inc. June 18, 2015 99
Maximizing Performance – ANSYS Mechanical
How to improve a compute bound simulation
– First consider using newer, faster processors
• New CPU architectures and faster clock speeds always help
– Next consider using parallel processing
• DMP is virtually always recommended over SMP
• More computations are performed in parallel with DMP
• Significantly faster speedups are achieved using DMP
• DMP can take advantage of all resources on a cluster
• A whole new class of problems can be solved!
– Last consider using GPU acceleration
• Can help accelerate critical, time-consuming computations
![Page 92: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/92.jpg)
© 2015 ANSYS, Inc. June 18, 2015 100
Maximizing Performance – ANSYS Mechanical
Example of a compute bound simulation
[Bar chart, relative speedup, "Benefits of DMP and GPU", comparing Xeon X5675 and Xeon E5-2670 at 2 cores, 8 cores, and 8 cores + GPU. Using newer Xeons gives a big gain (1.8x at 2 cores); using 8 cores gives faster performance (4.0x); maximum performance is found by adding a GPU (11.0x).]
• 2.1 million DOF • Nonlinear static analysis • Direct sparse solver (DSPARSE) • 2 Intel Xeon E5-2670 (2.6 GHz, 16 cores total) • 128 GB RAM • 1 Tesla K20c • Windows 7
![Page 93: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/93.jpg)
© 2015 ANSYS, Inc. June 18, 2015 101
Balanced System for Overall Optimum Performance
Maximizing Performance – ANSYS Mechanical
[Bar chart, "Balanced Performance", relative speedup (I/O bound, 16 GB RAM): 2 cores 1.0x, 8 cores 2.7x, 8 cores + GPU 5.2x, 8 cores + GPU + SSD 12.5x.]
• 2.1 million DOF • Nonlinear static analysis • Direct sparse solver (DSPARSE) • 2 Intel Xeon E5-2670 (2.6 GHz, 16 cores total) • 16 GB RAM • SSD and SATA disks • 1 Tesla K20c • Windows 7
![Page 94: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/94.jpg)
© 2015 ANSYS, Inc. June 18, 2015 102
Balanced System for Overall Optimum Performance
Maximizing Performance – ANSYS Mechanical
• 2.1 million DOF • Nonlinear static analysis • Direct sparse solver (DSPARSE) • 2 Intel Xeon E5-2670 (2.6 GHz, 16 cores total) • 128 GB RAM • SSD and SATA disks • 1 Tesla K20c • Windows 7
[Bar chart, "Balanced Performance", relative speedup across 2 cores / 8 cores / 8 cores + GPU / 8 cores + GPU + SSD: I/O bound (16 GB RAM): 1.0x, 2.7x, 5.2x, 12.5x; compute bound (128 GB RAM): 5.7x, 12.0x, 24.8x, 27.3x.]
![Page 95: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/95.jpg)
© 2015 ANSYS, Inc. June 18, 2015 103
• Why Talking About Hardware
• HPC Terminology
• ANSYS Work-flow
• Hardware Considerations
• Additional resources
Agenda
![Page 96: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/96.jpg)
© 2015 ANSYS, Inc. June 18, 2015 104
• An important part of specifying an HPC system is to purchase a balanced system.
• There is no point in spending all your money on the processor if the I/O is your biggest bottleneck.
• You are only as good as your slowest component!
Wrap-up - Hardware
![Page 97: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/97.jpg)
© 2015 ANSYS, Inc. June 18, 2015 105
Scalable HPC Licensing
ANSYS HPC (per-process)
ANSYS HPC Pack
• HPC product rewarding volume parallel processing for high-fidelity simulations
• Each simulation consumes one or more Packs
• Parallel capability increases quickly with added Packs
ANSYS HPC Workgroup
• HPC product rewarding volume parallel processing for increased simulation throughput within a single co-located workgroup
• 16 to 32768 parallel cores shared across any number of simulations on a single server
• "Enterprise" options available to deploy and use anywhere in the world
Single HPC solution for FEA/CFD/FSI and any level of fidelity
Parallel Enabled (Cores) by HPC Packs per Simulation:
1 Pack: 8; 2 Packs: 32; 3 Packs: 128; 4 Packs: 512; 5 Packs: 2048; 6 Packs: 8192; 7 Packs: 32768
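The Pack-to-cores progression quadruples with each added Pack. A small illustrative function (not an ANSYS licensing API) reproduces the published table:

```python
# Parallel-enabled cores per simulation for p HPC Packs:
# the count quadruples with each Pack, cores(p) = 8 * 4**(p - 1).
def hpc_pack_cores(packs):
    return 8 * 4 ** (packs - 1)

print([hpc_pack_cores(p) for p in range(1, 8)])
# [8, 32, 128, 512, 2048, 8192, 32768]
```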
![Page 98: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/98.jpg)
© 2015 ANSYS, Inc. June 18, 2015 106
• ANSYS HPC and ANSYS HPC Workgroup give flexible use of a pool of licenses.
• ANSYS HPC Pack gives quick scale-up but is more restrictive in how it can be used.
• This greater flexibility is why the HPC Workgroup options cost more than the HPC Packs.
Which type of Licensing is right for me?
![Page 99: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/99.jpg)
© 2015 ANSYS, Inc. June 18, 2015 107
Number of Simultaneous Design Points Enabled: HPC license for running parametric FEA or CFD simulations on multiple CPU cores simultaneously, and more cost effectively.
HPC Parametric Pack Licenses: 1, 2, 3, 4, 5
Simultaneous Design Points Enabled: 4, 8, 16, 32, 64
ANSYS HPC Parametric Pack License
Key Benefits
• Ability to automatically and simultaneously execute design points while consuming just one set of application licenses
• Scalable: the number of simultaneous design points enabled increases quickly with added packs
• Amplifies the complete workflow: design points can include execution of multiple applications (pre, meshing, solve, HPC, post)
![Page 100: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/100.jpg)
© 2015 ANSYS, Inc. June 18, 2015 108
Click on webinars related to HPC/IT for more and upcoming ones!
Additional Resources - IT Webinars
Watch recorded webinars by clicking below: • Understanding Hardware Selection for ANSYS 15.0 • How to Speed Up ANSYS 15.0 with GPUs • Intel Technologies Enabling Faster, More Effective Simulation • Optimizing Remote Access to Simulation
![Page 101: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/101.jpg)
© 2015 ANSYS, Inc. June 18, 2015 109
White Papers by clicking below: • Optimizing Business Value in High-Performance Engineering Computing
• IBM Application Ready Solutions Reference Architecture for ANSYS
• Intel Solid-State Drives Increase Productivity of Product Design and Simulation
• Value of HPC for Ensuring Product Integrity
Additional Resources - IT White Papers & Technical Briefs
Technical Briefs by clicking below: • Parallel Scalability of ANSYS 15.0 on Hewlett-Packard Systems
• SGI Technology Guide for ANSYS Mechanical Analysts
• SGI Technology Guide for ANSYS Fluent Analysts
• Accelerating ANSYS Fluent 15.0 Using NVIDIA GPUs
![Page 102: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/102.jpg)
© 2015 ANSYS, Inc. June 18, 2015 110
Additional Resources - ANSYS IT Webcast Series
On-demand webinars: • Understanding Hardware Selection for ANSYS 15.0 • How to Speed Up ANSYS 15.0 with GPUs • Cloud Hosting of ANSYS: Gompute On-Demand Solutions • Simplified HPC Clusters for ANSYS Users • Intel Technologies Enabling Faster, More Effective Simulation • Accelerating Time-to-Results with Parallel I/O • Extreme Scalability for High-Fidelity CFD Simulations • Methodology and Tools for Compute Performance at Any Scale • Understanding Hardware Selection for Structural Mechanics • Optimizing Remote Access to Simulation • Scalable Storage and Data Management for Engineering Simulation
http://www.ansys.com/Support/Platform+Support/IT+Solutions+for+ANSYS+Webcast+Series
![Page 103: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/103.jpg)
© 2015 ANSYS, Inc. June 18, 2015 111
Additional Resources
ANSYS Platform Support • http://www.ansys.com/Support/Platform+Support
– Platform Support Policies – Supported Platforms – Supported Hardware – Tested Systems – ANSYS Benchmarks
![Page 104: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/104.jpg)
© 2015 ANSYS, Inc. June 18, 2015 112
ANSYS Partner Solutions – http://www.ansys.com/About+ANSYS/Partner+Programs/HPC+Partners
• Reference configurations • Performance data • White papers • Sales contact points
Performance Data – http://www.ansys.com/benchmarks
Additional Resources
![Page 105: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/105.jpg)
© 2015 ANSYS, Inc. June 18, 2015 113
Additional Resources
The Manual
• Sections on best practices and parallel processing for various solvers
• Performance Guide for Mechanical
• Installation walkthroughs for installing the products, parallel processing, licensing, and RSM (Remote Solve Manager)
ANSYS Advantage • Online Magazine
![Page 106: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/106.jpg)
© 2015 ANSYS, Inc. June 18, 2015 114
• Connect with Me – [email protected]
• Connect with ANSYS, Inc.
– LinkedIn ANSYSInc – Twitter @ANSYS_Inc – Facebook ANSYSInc
• Follow our Blog
– ansys-blog.com
Thank You!