© 2015 ANSYS, Inc. June 18, 2015 1
Understanding Hardware Selection to Speedup Your CFD and FEA Simulations
Agenda

• Why Talk About Hardware
• HPC Terminology
• ANSYS Workflow
• Hardware Considerations
• Additional Resources
Most Users Constrained by Hardware
Source: HPC Usage survey with over 1,800 ANSYS respondents
Problem Statement

"I am not achieving the performance and throughput I was expecting from my hardware & software."

Image courtesy of Intel Corporation
Building A Balanced System Is The Key To Improving Your Experience

If your system is slow, so are your engineers and analysts. A balanced system considers:
• Processors
• Memory
• Storage
• Networks
What Hardware Configuration to Select?

The right combination of hardware and software leads to maximum efficiency.
• SMP vs. DMP
• HDD vs. SSD
• Interconnects?
• Clusters?
• GPUs? CPUs?
HPC Hardware Terminology

A cluster consists of Machine 1 (or Node 1) through Machine N (or Node N), connected by an interconnect (GigE or InfiniBand). Each machine contains Processor 1 (or Socket 1), Processor 2 (or Socket 2), and optionally a GPU.
Shared Memory Parallel

• Shared memory parallel (SMP) systems share a single global memory image that may be distributed physically across multiple cores, but is globally addressable.
• OpenMP is the industry standard.
Distributed Memory Parallel

• Distributed memory parallel processing (DMP) assumes that the physical memory of each process is separate from that of all other processes.
• Parallel processing on such a system requires some form of message-passing software to exchange data between the cores.
• MPI (Message Passing Interface) is the industry standard for this.
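The SMP/DMP distinction can be illustrated with Python's standard library: threads share one address space, while separate processes exchange data only through explicit messages, much as MPI ranks do in DMP. This is an illustrative analogy, not ANSYS code; `dmp_style_sum` and `worker` are hypothetical names.

```python
from multiprocessing import Process, Pipe

def worker(conn):
    # Each process has its own private memory; data arrives only
    # via explicit message passing, as with MPI ranks in DMP.
    part = conn.recv()          # receive a sub-domain of the work
    conn.send(sum(part))        # send the partial result back

def dmp_style_sum(data, nprocs=2):
    chunk = len(data) // nprocs
    conns, procs = [], []
    for i in range(nprocs):
        parent, child = Pipe()
        p = Process(target=worker, args=(child,))
        p.start()
        parent.send(data[i * chunk:(i + 1) * chunk])
        conns.append(parent)
        procs.append(p)
    total = sum(c.recv() for c in conns)  # "master" gathers partial sums
    for p in procs:
        p.join()
    return total

if __name__ == "__main__":
    print(dmp_style_sum(list(range(100))))  # 4950
```

The communication pattern (scatter sub-domains, gather partial results) is the same shape a distributed solver uses, just at toy scale.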
Typical HPC Growth Path

Desktop user → workstation and/or server users → cluster users → cloud solution
Remote Visualization

• Ideal for:
  – remote users submitting jobs from a Windows machine to a Linux cluster, or local users submitting jobs to a Linux cluster
  – users that do not have enough power (memory or graphics) on their local workstation to build large meshes or view graphics
• ANSYS 16.0 supports the following remote visualization applications:
  – NICE Desktop Cloud Visualization (DCV) 2013 (Linux server + Linux/Windows client)
  – OpenText Exceed onDemand 8 SP2/SP3 (Linux server + Linux/Windows client)
  – RealVNC Enterprise Edition 5.0.4 with VirtualGL (Linux server + Linux/Windows client)
  – on a Windows cluster: Microsoft Remote Desktop
• Hardware requirements for remote visualization servers:
  – GPU-capable video cards
  – large amounts of RAM, available to multiple users running ANSYS applications and pre/post-processing
Virtual Desktop (VDI) Support

• Key focus area at ANSYS (internal use & software QA)
• Focus on GPU pass-through: one GPU per VM, up to 8 VMs per machine (K1, K2 cards); memory constraints will limit this in any case
• vGPU (NVIDIA GRID) as it matures; being tested internally
• Not software rendering, not shared GPU (too slow)
• Supported at R16.0
ANSYS Remote Solve Manager (RSM)

Desktop → server → cluster (with 3rd-party scheduler)

The Remote Solve Manager (RSM) is a GUI-based job queuing system that distributes simulation tasks to (shared) computing resources. RSM enables tasks to be:
• run in background mode on the local machine
• sent to a remote compute machine
• broken into a series of jobs for parallel processing across a variety of computers

RSM as a scheduler:
• Submits to RSM itself.
• Unit of recognition: jobs (e.g. a run of a solver such as CFX, Fluent or Mechanical)

RSM as a transport mechanism:
• Submits through RSM to a high-level scheduler such as LSF, PBS Pro, Windows HPC Server 2008 R2 / 2012, or Univa Grid Engine (at R15.0).
• Unit of recognition: cores
RSM Usage Scenarios

1. Submission from a client to a centralized (shared) compute resource, allowing:
   • background queuing on a centralized machine
   • multiple users to share a common, usually large-memory/fast machine (compared to the client machine)

2. Submission from a client to multiple (shared) compute resources, allowing:
   • background queuing on a centralized machine that submits to other machines (compute servers)
   • multiple users to share user workstations (often at night) using the RSM "Limit Times for Job Submission" feature

3. Submission from a client to a centralized (shared) compute resource with a job scheduler, allowing:
   • background queuing on a centralized machine that submits to a job scheduler (e.g. LSF)
   • multiple users to run multi-node jobs on shared compute resources
Recent Enhancements in RSM

• Improved robustness and scalability
• Added support for Univa Grid Engine
• Added support for Mechanical/MAPDL restart
• Non-root users on Linux can now use the RSM wizard
• Enriched support for RSM customization
• Added component override for design point update
• Improved efficiency of design point updates

Example: parametric optimization of an intake manifold (initial vs. optimized)
• Design objectives: equal fresh and exhaust gas mass flow distribution to each cylinder; minimize the overall pressure drop
• Input parameters: radii of 3 fillets near the inlet (8 design points)
• ~5.0x speed-up over sequential execution
Guidelines:

• Know your hardware lifecycle.
• Have a goal in mind for what you want to achieve.
• Use licensing productively.
• Use ANSYS-provided processes effectively.
What Hardware Configuration to Select?

• SMP vs. DMP
• HDD vs. SSD
• Interconnects?
• Clusters?
• CPUs? GPU/Phi?
Understanding the effect of clock speed

• Generally, ANSYS applications scale with clock frequency.
• Cost/performance argues for a high clock (but maybe not the top bin).
• In ANSYS DMP benchmarks (8 cores), the clock effect is highest for the sparse solver.

Using a higher clock speed is always helpful to realize productivity gains.
Understanding the effect of memory bandwidth - Is 24 Cores Equal to 24 Cores?

• 3 x (2 x 4) = 24 cores: three nodes, each with two 4-core Xeon X5570 sockets
• 2 x (2 x 6) = 24 cores: two nodes, each with two 6-core Xeon X5670 sockets

Consider memory per core!
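The "24 cores vs. 24 cores" point can be made concrete with a back-of-the-envelope calculation. The per-socket bandwidth figure below is an assumed, illustrative number (roughly what triple-channel DDR3 delivers), not a measurement from the slides:

```python
def bandwidth_per_core(socket_bw_gbs, cores_per_socket):
    """GB/s of memory bandwidth available to each core on one socket."""
    return socket_bw_gbs / cores_per_socket

# Assumed illustrative figure: ~32 GB/s per socket of triple-channel
# DDR3 for both the Xeon X5570 (4 cores) and X5670 (6 cores).
x5570 = bandwidth_per_core(32.0, 4)   # fewer cores -> more GB/s per core
x5670 = bandwidth_per_core(32.0, 6)   # more cores share the same channels
print(round(x5570, 1), round(x5670, 1))
```

With the same socket bandwidth shared by more cores, each X5670 core gets roughly two-thirds the bandwidth of an X5570 core, which is why the two 24-core configurations are not equal for bandwidth-bound solvers.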
Understanding the effect of memory bandwidth - Is 16 Cores Equal to 16 Cores?

• 2 x (2 x 4) = 16 cores on Xeon X5570 nodes (all 4 cores per socket in use)
• 2 x (2 x 4) = 16 cores on Xeon X5670 nodes (4 of 6 cores per socket in use)

Using fewer cores per node can be helpful to realize productivity gains.
Understanding the effect of memory bandwidth - ANSYS Mechanical
Consider memory per core!
Understanding the effect of memory speed

• We can see here the effect of memory speed; this has implications for how you build your hardware.
• Some processor types have slower memory speeds by default.
• On other processors, non-optimally filling the memory channels can slow the memory speed.
• Memory speed has a direct effect on memory bandwidth.

Using a higher memory speed can be helpful to realize productivity gains.
Turbo Boost (Intel) / Turbo Core (AMD) - ANSYS CFD

• Turbo Boost (Intel) / Turbo Core (AMD) is a form of over-clocking that allows individual processors to run at a higher clock when others are idle.
• With Intel processors we have seen variable performance from this, ranging between 0-8% improvement depending on the number of cores in use.
• The graph below shows CFX on an Intel X5550, which sees a maximum improvement of only 2.5%.
Turbo Boost (Intel) / Turbo Core (AMD) - ANSYS Mechanical

• Relative to 1 core, Turbo Boost delivers good performance gains in many cases on the E5 processor family.

Using Turbo Boost / Turbo Core can be helpful to realize productivity gains, particularly at lower core counts.
Hyper-threading - Evaluation of Hyper-threading on ANSYS Fluent Performance

iDataPlex M3 (Intel Xeon X5670, 2.93 GHz), Turbo: ON. The chart compares HT OFF (12 threads on 12 physical cores) against HT ON (24 threads on 12 physical cores), measured as improvement relative to hyper-threading OFF (higher is better) across the eddy_417K, turbo_500K, aircraft_2M, sedan_4M and truck_14M Fluent models. The measured change stays within roughly 0.90x to 1.10x.

Hyper-threading is NOT recommended.
Generation to Generation - ANSYS Mechanical

Optimized for Intel Xeon E5 v3 processors:
• ANSYS Mechanical 16.0 performs well on the latest Intel processor architecture.
• A Haswell processor-based system is 20% to 40% faster than a Sandy Bridge processor-based system across a variety of benchmarks.
ANSYS Fluent on Intel Ivy Bridge - Ivy Bridge vs. Sandy Bridge, Single Node

Ivy Bridge = "tick" release of Sandy Bridge:
• Similar micro-architecture, more cores, reduced power
• Expect similar core-to-core performance on Ivy Bridge and Sandy Bridge; improved node-to-node performance

Single-node performance of ANSYS Fluent 14.5 over six benchmark cases (2x8-core Sandy Bridge vs. 2x12-core Ivy Bridge):
• 50% performance boost matches the 50% core count increase
• Scaling maintained at the higher core density
• Achieved via efficient memory use (and higher RAM speed)

| Case | Ivy Bridge | Sandy Bridge | Ratio |
|---|---|---|---|
| turbo_500k | 11755.1 | 7926.6 | 1.5 |
| eddy_417k | 2981.9 | 2192.9 | 1.4 |
| aircraft_2m | 2668.7 | 1797.2 | 1.5 |
| sedan_4m | 2070.7 | 1466.9 | 1.4 |
| truck_14m | 215.0 | 146.1 | 1.5 |
| truck_poly_14m | 233.1 | 156.7 | 1.5 |
ANSYS Fluent - Ivy Bridge vs. Sandy Bridge, Scaling

Multi-node performance of ANSYS Fluent 14.5, up to 192 cores. The chart plots solver rating against number of cores (0 to 384) for the truck_14m case on Fluent 14.5, with one curve each for Sandy Bridge and Ivy Bridge.

Nearly identical core-to-core scaling confirms system "balance" for Fluent.
Per Node vs. Per Core Comparisons

• This is a 4-socket vs. 2-socket node comparison:
  – Xeon E7-4890 v2, 2.80 GHz (4 sockets)
  – Xeon E5-2697 v2, 2.70 GHz (2 sockets)
• From the per-node comparison you would assume it is better to go with the 4-socket node.
• Per core, however, the 2-socket node is the better choice.
• Neither shows linear scalability, as both are running on all cores per node (bandwidth constrained).
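The per-node vs. per-core trap can be sketched numerically. The ratings below are hypothetical (the slide gives no numbers); only the core counts of the named processors are real:

```python
def per_core_rating(node_rating, cores_per_node):
    """Throughput (e.g. jobs/day) normalized per core."""
    return node_rating / cores_per_node

# Hypothetical node ratings, for illustration only.
four_socket = {"rating": 100.0, "cores": 60}  # 4 x E7-4890 v2 (15 cores each)
two_socket  = {"rating":  60.0, "cores": 24}  # 2 x E5-2697 v2 (12 cores each)

# Per node the 4-socket box looks better...
assert four_socket["rating"] > two_socket["rating"]
# ...but per core the 2-socket box wins.
assert per_core_rating(two_socket["rating"], two_socket["cores"]) > \
       per_core_rating(four_socket["rating"], four_socket["cores"])
```

Since HPC licensing and purchase cost usually scale with cores, the per-core figure is often the one that matters.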
Generation to Generation - ANSYS Fluent (Application Example)

Case details:
• Flow through a combustor
• Number of cells: 12 million
• Cell type: polyhedra
• Models used: Realizable k-ε turbulence
• Solver: pressure-based coupled, species transport, least-squares cell based, pseudo transient
Generation to Generation - ANSYS Fluent (Application Example)

Case details:
• External flow over a passenger sedan
• Number of cells: 4 million
• Cell type: mixed
• Models used: Standard k-ε turbulence
• Solver: pressure-based coupled, steady, Green-Gauss cell based
Recap

• Faster cores mean a faster solution.
• Faster memory means a faster solution.
• Memory bandwidth is an important factor for (linear) scalability.
• Turbo Boost / Turbo Core modes do give some benefit, especially at low core counts per node.
• In general, hyper-threading should not be used because of licensing implications.
• Be careful when looking at comparisons! Make sure you are comparing like with like.
What Hardware Configuration to Select?

• SMP vs. DMP
• HDD vs. SSD
• Interconnects?
• Clusters?
• CPUs? GPU/Phi?
Understanding the effect of the interconnect

• You need fast interconnects to feed fast processors.
• Two main characteristics for each interconnect: latency and bandwidth.
• Distributed ANSYS is highly bandwidth bound.

    +--------- D I S T R I B U T E D   A N S Y S   S T A T I S T I C S ------------+
    Release: 14.5        Build: UP20120802        Platform: LINUX x64
    Date Run: 08/09/2012        Time: 23:07
    Processor Model: Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz

    Total number of cores available    : 32
    Number of physical cores available : 32
    Number of cores requested          : 4 (Distributed Memory Parallel)
    MPI Type: INTELMPI

    Core  Machine Name   Working Directory
    ----------------------------------------------------
     0    hpclnxsmc00    /data1/ansyswork
     1    hpclnxsmc00    /data1/ansyswork
     2    hpclnxsmc01    /data1/ansyswork
     3    hpclnxsmc01    /data1/ansyswork

    Latency time from master to core 1 = 1.171 microseconds
    Latency time from master to core 2 = 2.251 microseconds
    Latency time from master to core 3 = 2.225 microseconds

    Communication speed from master to core 1 = 7934.49 MB/sec   (same machine)
    Communication speed from master to core 2 = 3011.09 MB/sec   (QDR InfiniBand)
    Communication speed from master to core 3 = 3235.00 MB/sec   (QDR InfiniBand)
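When comparing interconnects it helps to pull the latency and bandwidth figures out of the solver's statistics block programmatically. A minimal sketch, assuming the line format shown in the excerpt above (field wording may differ between releases):

```python
import re

# Sample lines in the format of the Distributed ANSYS statistics block.
STATS = """
Latency time from master to core 1 =    1.171 microseconds
Latency time from master to core 2 =    2.251 microseconds
Communication speed from master to core 1 =  7934.49 MB/sec
Communication speed from master to core 2 =  3011.09 MB/sec
"""

def parse_interconnect(text):
    """Return ({core: latency_us}, {core: bandwidth_mb_s}) from the stats text."""
    latency = {int(c): float(v) for c, v in re.findall(
        r"Latency time from master to core (\d+) =\s+([\d.]+)", text)}
    bandwidth = {int(c): float(v) for c, v in re.findall(
        r"Communication speed from master to core (\d+) =\s+([\d.]+)", text)}
    return latency, bandwidth

lat, bw = parse_interconnect(STATS)
# Core 1 sits on the same machine as the master, so it shows far
# higher bandwidth than core 2, which is reached over QDR InfiniBand.
print(bw[1] > 2 * bw[2])  # True
```

The same-machine vs. InfiniBand gap in these numbers is exactly why "Distributed ANSYS is highly bandwidth bound" matters for node count choices.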
Understanding the effect of the interconnect - ANSYS Fluent

ANSYS Fluent performance on iDataPlex M3 (Intel Xeon X5670, 12C, 2.93 GHz); networks: Gigabit, 10-Gigabit, 4X QDR InfiniBand (QLogic, Voltaire); hyper-threading OFF, Turbo ON; model: truck_14M. The chart plots Fluent rating (higher is better) against the number of cores used by a single job (12 to 768) for the QLogic, Voltaire, 10-Gigabit and Gigabit networks.
Understanding the effect of the interconnect - ANSYS Fluent

Exhaust model: 7.6M cells, transient simulation with explicit time stepping for an engine start-up cycle, run on Fujitsu PRIMERGY CX250 HPC systems (E5-2690v2 with 20 and E5-2697v2 with 24 cores per node, respectively). For CFD we can see the performance of InfiniBand vs. GigE: GigE starts to drop off after 2 nodes.
Understanding the effect of the interconnect - ANSYS Fluent

For CFD, 10 GigE starts to taper off after 8 nodes.
Understanding the effect of the interconnect - ANSYS Mechanical

V13sp-5 model: turbine geometry, 2,100K DOF, SOLID187 finite elements, static nonlinear, one iteration, direct sparse solver, on a Linux cluster (8 cores per node). The chart plots rating (runs/day, 0 to 60) at 8, 16, 32, 64 and 128 cores for Gigabit Ethernet vs. DDR InfiniBand.
Understanding the effect of the interconnect - ANSYS Mechanical

For ANSYS Mechanical, GigE does not scale beyond 1 node!
Understanding the effect of the interconnect - ANSYS Mechanical

• GigE (Gigabit Ethernet): 1 Gbit/s (~100 MB/s) - not recommended!
• 10 GigE: 10 Gbit/s (~1000 MB/s) - bare minimum!
• Myrinet (Myricom, Inc.): 2 Gbit/s (~250 MB/s); Myri-10G: 10 Gbit/s (4th-generation Myrinet)
• InfiniBand (many vendors/speeds): SDR/DDR/QDR, 1x, 4x, 12x - see http://en.wikipedia.org/wiki/List_of_device_bandwidths

RECOMMENDATION: over 1000 MB/s, especially when running on more than 4 nodes.
Recap

• 10 GigE and InfiniBand are recommended for HPC clusters.
• Currently, InfiniBand is the only recommendation for large clusters.
• QDR should be more than adequate for small to medium clusters; FDR for large clusters.
• For more than 1 node you will see performance decrease using GigE.
• Mechanical users should not use GigE at all if their jobs span more than one node.
What Hardware Configuration to Select?

• SMP vs. DMP
• HDD vs. SSD
• Interconnects?
• Clusters?
• CPUs? GPU/Phi?
Parallel file systems

• NFS: the NFS server and/or master node causes an I/O bottleneck.
• Local disk: the master node causes an I/O bottleneck.
• Parallel file system: I/O scales with the cluster.
Parallel file systems - ANSYS Mechanical

• The example here uses GPFS for ANSYS Mechanical.
• Notice how it is very similar in speed to a local RAID 0 configuration (4 x 15k SAS).
Understanding the effect of I/O - ANSYS Fluent

• Parallel I/O is based on MPI-IO.
• Implemented for data file read and write: a single file is written collectively by the nodes.
• Suited for parallel file systems; does not work on NFS.
• Support for Panasas, PVFS2, HP/SFS, IBM/GPFS, EMC/MPFS2, Lustre.
• Files cannot be written directly compressed, but can be compressed asynchronously.
Understanding the effect of I/O - ANSYS Fluent

Truck-111m write throughput at 176 cores (Truck-111million uses the DES model with the segregated implicit solver; Panasas layout available with MPI-IO hints in Fluent 14.5):

| Configuration | Write data file throughput (MB/s) |
|---|---|
| Legacy NAS | 229.42 |
| Serial IO | 386.72 |
| Parallel IO | 1182.60 |
| Parallel IO (RAID-10, CW) | 1644.15 |

Parallel IO = 7x (Legacy NAS); Parallel IO = 4x (Serial IO).
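The 7x and 4x figures fall straight out of the throughput numbers in the table, as this quick check shows:

```python
# Write throughput (MB/s) for the Truck-111m case at 176 cores,
# taken from the table above.
throughput = {
    "Legacy NAS": 229.42,
    "Serial IO": 386.72,
    "Parallel IO": 1182.60,
    "Parallel IO (RAID-10, CW)": 1644.15,
}

best = throughput["Parallel IO (RAID-10, CW)"]
print(round(best / throughput["Legacy NAS"]))  # 7  (x over legacy NAS)
print(round(best / throughput["Serial IO"]))   # 4  (x over serial I/O)
```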
Understanding the effect of I/O - ANSYS Fluent

Landing gear noise predictions using scale-resolving simulations (180M-cell model using the pressure-based segregated solver).
Understanding the effect of I/O - ANSYS Fluent

Asynchronous I/O for Linux Fluent: total write time is 3-5x quicker over NFS, with even larger speed-ups on bigger cases and local disk (up to 10x).

| Mesh | File | Location | Async I/O | Time |
|---|---|---|---|---|
| 15M | Cas | NFS | OFF | 217 s |
| 15M | Cas | NFS | ON | 62 s |
| 15M | Dat | NFS | OFF | 113 s |
| 15M | Dat | NFS | ON | 8 s |
| 30M | Cas | NFS | OFF | 207 s |
| 30M | Cas | NFS | ON | 75 s |
| 30M | Dat | NFS | OFF | 144 s |
| 30M | Dat | NFS | ON | 10 s |
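The "3-5x total" claim can be verified directly from the write times in the table:

```python
# Write times in seconds over NFS (async I/O OFF, ON), from the table above.
times = {
    ("15M", "Cas"): (217, 62),
    ("15M", "Dat"): (113, 8),
    ("30M", "Cas"): (207, 75),
    ("30M", "Dat"): (144, 10),
}

def total_speedup(mesh):
    """Total (case + data file) write-time speed-up from async I/O."""
    off = sum(times[(mesh, kind)][0] for kind in ("Cas", "Dat"))
    on = sum(times[(mesh, kind)][1] for kind in ("Cas", "Dat"))
    return off / on

print(round(total_speedup("15M"), 1))  # 4.7
print(round(total_speedup("30M"), 1))  # 4.1
```

Both totals land inside the slide's 3-5x range; the data-file writes alone improve by roughly 14x.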
Understanding the effect of I/O - ANSYS Mechanical

SP-5 (in-core) R14.5 benchmark results. Rating (jobs/day) by configuration (#machines x #cores; memory used per configuration: 29 GB, 33 GB, 35.6 GB, 40.8 GB, 47.8 GB):

| Storage | 1x1 | 1x2 | 1x4 | 1x8 | 1x16 |
|---|---|---|---|---|---|
| 4x SSD RAID 0, SATA 3 Gb/s | 89 | 145 | 180 | 301 | 419 |
| 2x SSD RAID 0, SATA 3 Gb/s | 89 | 146 | 180 | 275 | 384 |
| SSD, SATA 6 Gb/s | 88 | 144 | 180 | 283 | 368 |
| HD (7.2K RPM), SATA 6 Gb/s | 88 | 124 | 118 | 95 | 52 |
Recap

• I/O is very important for the Mechanical solver:
  – RAID 0 is mandatory for multiple disks
  – SSDs recommended for speed; otherwise 15k SAS drives
• Fluent and CFX, for most customers, will not require fast local disk access (for most types of job).
• Parallel file systems can meet the requirements of both types of solver.
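Why RAID 0 is mandatory for multiple disks can be sketched with simple arithmetic: striping aggregates the throughput of the member disks. The per-device speeds and the efficiency factor below are assumed, illustrative values, not benchmark data:

```python
def raid0_throughput(per_disk_mbs, n_disks, efficiency=0.9):
    """Rough RAID 0 aggregate throughput: striping scales with disk
    count, minus some controller overhead (efficiency is an assumed
    factor, not a measured one)."""
    return per_disk_mbs * n_disks * efficiency

# Assumed per-device sequential write speeds, for illustration only:
print(raid0_throughput(120, 4))   # 4 x 15k SAS HDDs -> 432.0 MB/s
print(raid0_throughput(250, 2))   # 2 x SATA SSDs    -> 450.0 MB/s
```

This is also why the 4x-SSD RAID 0 row in the benchmark table above pulls ahead at the higher core counts, where the solver's I/O demand is greatest.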
Is Your Hardware Ready for HPC? - ANSYS Mechanical

The chart maps model size to required resources: I/O throughput (100 to 1200 MB/s, achievable with 1x SAS, 2x SAS, 1x SSD or 2x SSD) against RAM (8, 16, 32, 48, 64, 96 or 128 GB), with bands for 0.2 MDOF, 2 MDOF, 4 MDOF and > 6 MDOF models.
![Page 61: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/61.jpg)
© 2015 ANSYS, Inc. June 18, 2015 62
HDD vs. SSD
What Hardware Configuration to Select?
SMP vs. DMP Interconnects? Clusters?
CPUs? GPU/Phi?
![Page 62: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/62.jpg)
© 2015 ANSYS, Inc. June 18, 2015 63
DMP Outperforming SMP
6 million degrees of freedom; plasticity, contact, bolt pretension; 4 load steps
![Page 63: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/63.jpg)
© 2015 ANSYS, Inc. June 18, 2015 64
DMP: Good Performance at High Core Counts
[Two charts: solver rating vs. number of cores. Left: 10.7 million DOF, static, linear, structural, 1 load step. Right: 1 million DOF, harmonic, linear, structural, 4 frequencies. Hardware: Intel Xeon E5-2690 processors (2.9 GHz, 16 cores total), 128 GB of RAM.]
![Page 64: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/64.jpg)
© 2015 ANSYS, Inc. June 18, 2015 65
ANSYS Mechanical 14.5 DMP Enabling Scalability at High Core Counts
Minimum time to solution is more important than scaling
[Chart, "Solution Scalability": speedup (0-25x) vs. number of cores (0-64). V14sp-5 model: turbine geometry, 2.1 million DOF, static nonlinear analysis, 1 load step, 7 substeps, 25 equilibrium iterations; 8-node Linux cluster with 8 cores per node.]
![Page 65: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/65.jpg)
© 2015 ANSYS, Inc. June 18, 2015 66
ANSYS Mechanical 15.0 Faster Performance at Higher Core Counts
Improved Scaling at 8 cores by an enhanced domain decomposition method
[Bar chart: speedup over R14.5: Engine (9 MDOF) 1.3x, Stent (520 KDOF) 1.7x, Clutch (160 KDOF) 2.7x, Bracket (45 KDOF) 2.4x.]
8-node Linux cluster (8 cores and 48 GB of RAM per node, InfiniBand DDR)
![Page 66: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/66.jpg)
© 2015 ANSYS, Inc. June 18, 2015 67
ANSYS Mechanical 15.0 Faster Performance at Higher Core Counts
Improved Scaling at 16 cores by an enhanced domain decomposition method
[Bar chart: speedup over R14.5: Engine (9 MDOF) 1.6x, Stent (520 KDOF) 1.8x, Clutch (160 KDOF) 3.8x, Bracket (45 KDOF) 4.0x.]
8-node Linux cluster (8 cores and 48 GB of RAM per node, InfiniBand DDR)
![Page 67: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/67.jpg)
© 2015 ANSYS, Inc. June 18, 2015 68
ANSYS Mechanical 15.0 Faster Performance at Higher Core Counts
Improved Scaling at 32 cores by an enhanced domain decomposition method
[Bar chart: speedup over R14.5: Engine (9 MDOF) 1.8x, Stent (520 KDOF) 2.2x, Clutch (160 KDOF) 3.9x, Bracket (45 KDOF) 5.0x.]
8-node Linux cluster (8 cores and 48 GB of RAM per node, InfiniBand DDR)
![Page 68: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/68.jpg)
© 2015 ANSYS, Inc. June 18, 2015 70
ANSYS Mechanical 16.0 Faster Performance at Higher Core Counts
Continually improving Core Solver Rating to 128 cores
Courtesy of HP
![Page 69: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/69.jpg)
© 2015 ANSYS, Inc. June 18, 2015 71
ANSYS Mechanical 15.0 HPC & Solver Technology Improvements
• Improved scalability of the distributed solver at higher core counts
• NEW Subspace eigensolver supports shared and distributed parallel technology
• NEW MSUP harmonic method for unsymmetric systems, e.g. vibro-acoustics
Coupled Acoustic, 1.2 M DOF, Full Harmonic Response
2.09 MDOFs first 20 modes
![Page 70: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/70.jpg)
© 2015 ANSYS, Inc. June 18, 2015 72
GPU/Phi? HDD vs. SSD
What Hardware Configuration to Select?
SMP vs. DMP Interconnects? Clusters?
CPUs?
![Page 71: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/71.jpg)
© 2015 ANSYS, Inc. June 18, 2015 73
Some Basics: ANSYS Software on NVIDIA GPUs
• GPUs are accelerators and can significantly speed up your simulations; GPUs work hand in hand with CPUs
• Most ANSYS GPU acceleration is user-transparent; the only requirement is to inform ANSYS of how many GPUs to use
• Schematic of a CPU with an attached GPU accelerator: the CPU begins/ends the job, the GPU manages the heavy computations
![Page 72: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/72.jpg)
© 2015 ANSYS, Inc. June 18, 2015 74
GPU-based Model: Radiation Heat Transfer using OptiX
GPU-based Solver: Coupled Algebraic Multigrid (AMG) PBNS linear solver
Operating Systems: Both Linux and Win64 for workstations and servers
Parallel Methods: Shared and distributed memory
Supported GPUs: Tesla K40, Tesla K80, and Quadro 6000
Multi-GPU Support: Full multi-GPU and multi-node support
Model Suitability: Unlimited (hardware dependent)
GPU Accelerator Capability - ANSYS Fluent
![Page 73: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/73.jpg)
© 2015 ANSYS, Inc. June 18, 2015 75
ANSYS Fluent on GPU Performance of Pressure-Based Solver
Sedan Model
Sedan geometry 3.6M mixed cells Steady, turbulent External aerodynamics Coupled PBNS, DP CPU: Intel Xeon E5-2680; 8 cores GPU: 2 X Tesla K40
[Bar chart (higher is better), jobs/day: segregated solver, CPU only: 12 jobs/day; coupled solver, CPU only: 15 jobs/day; coupled solver, CPU + GPU: 27 jobs/day, a 1.9x gain over the coupled CPU-only run.]
Convergence criteria: 1e-03 for all variables. Iterations until convergence: segregated CPU: 2798 iterations (7070 s); coupled CPU: 967 iterations (5900 s); coupled CPU + GPU: 985 iterations (3150 s)
NOTE: Times for total solution until convergence
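The jobs/day ratings on this slide follow directly from the reported wall-clock solve times (rating = seconds per day / seconds per job). A small sketch to reproduce them:

```python
# Convert a total solve time into a jobs/day rating, as used on the
# benchmark slides: one day of wall clock divided by time per job.
def jobs_per_day(solve_seconds):
    return 86400 / solve_seconds

for name, secs in [("segregated CPU", 7070),
                   ("coupled CPU", 5900),
                   ("coupled CPU+GPU", 3150)]:
    print(f"{name}: {jobs_per_day(secs):.1f} jobs/day")
# Rounded, these reproduce the 12 / 15 / 27 jobs/day bars.
```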
![Page 74: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/74.jpg)
© 2015 ANSYS, Inc. June 18, 2015 76
ANSYS Fluent on GPU Performance of Pressure-Based Solver
Truck Model: external aerodynamics, 14 million cells, steady, k-ε turbulence, coupled PBNS, DP; 2 nodes each with dual Intel Xeon E5-2698 V3 (16 CPU cores) and dual Tesla K80 GPUs.
[Bar chart (higher is better), simulation productivity with an HPC Workgroup 64 license: 64 CPU cores: 11 jobs/day; 56 CPU cores + 4 Tesla K80 GPUs: 33 jobs/day. Relative to the CPU-only solution cost, the GPUs add roughly 40% cost for roughly 200% additional productivity.]
![Page 75: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/75.jpg)
© 2015 ANSYS, Inc. June 18, 2015 77
ANSYS Fluent on GPU Better Speedup on Larger Models
Truck Model: external aerodynamics, steady, k-ε turbulence, double-precision solver. CPU: Intel Xeon E5-2667, 12 cores per node; GPU: Tesla K40, 4 per node. NOTE: reported times are per iteration.
[Bar chart (lower is better), ANSYS Fluent time per iteration: 14 million cells: 13 s on 36 CPU cores vs. 9.5 s on 36 CPU cores + 12 GPUs (1.4x); 111 million cells: 36 s on 144 CPU cores vs. 18 s on 144 CPU cores + 48 GPUs (2x).]
![Page 76: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/76.jpg)
© 2015 ANSYS, Inc. June 18, 2015 78
NVIDIA-GPU Solution Fit for ANSYS Fluent
[Decision flow: CFD analysis → Is it single-phase and flow dominant? (no → not ideal for GPUs) → Is it a steady-state analysis? → Pressure-based coupled solver? If yes, best fit for GPUs; if using the segregated solver, consider switching to the pressure-based coupled solver for better performance (faster convergence) and further speedups with GPUs. Please see the next slide.]
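The decision flow above can be sketched as a small function. The branch order is simplified and the function name is illustrative, not part of any ANSYS tool:

```python
# Hypothetical sketch of the Fluent GPU-suitability decision flow:
# single-phase, flow-dominant, steady-state cases with the
# pressure-based coupled solver are the best fit for GPUs.
def fluent_gpu_fit(single_phase_flow_dominant, steady_state, coupled_solver):
    if not (single_phase_flow_dominant and steady_state):
        return "Not ideal for GPUs"
    if coupled_solver:
        return "Pressure-based coupled solver: best fit for GPUs"
    return ("Segregated solver: consider switching to the pressure-based "
            "coupled solver for faster convergence and GPU speedups")

print(fluent_gpu_fit(True, True, True))
```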
![Page 77: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/77.jpg)
© 2015 ANSYS, Inc. June 18, 2015 79
NVIDIA-GPU Solution Fit for ANSYS Fluent - Supported Hardware Configurations
[Diagram of supported hardware configurations. Valid configurations require:
● Homogeneous process distribution
● Homogeneous GPU selection
● Number of processes an exact multiple of the number of GPUs
Invalid examples: some nodes with 16 processes and some with 12; some nodes with 2 GPUs and some with 1; 15 processes not divisible by 2 GPUs.]
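These validity rules can be expressed as a quick pre-flight check. The function below is a minimal sketch; the name and argument shapes are illustrative, not part of any ANSYS tool:

```python
# Check the three Fluent GPU configuration rules from the slide:
# homogeneous process counts per node, homogeneous GPU counts per
# node, and total processes divisible by total GPUs.
def valid_gpu_config(procs_per_node, gpus_per_node):
    if len(set(procs_per_node)) != 1:
        return False  # process distribution must be homogeneous
    if len(set(gpus_per_node)) != 1:
        return False  # GPU selection must be homogeneous
    total_procs = sum(procs_per_node)
    total_gpus = sum(gpus_per_node)
    # number of processes must be an exact multiple of number of GPUs
    return total_gpus > 0 and total_procs % total_gpus == 0

print(valid_gpu_config([16, 12], [2, 2]))  # False: mixed process counts
print(valid_gpu_config([8, 8], [2, 1]))    # False: mixed GPU counts
print(valid_gpu_config([15], [2]))         # False: 15 not divisible by 2
print(valid_gpu_config([8, 8], [2, 2]))    # True
```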
![Page 78: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/78.jpg)
© 2015 ANSYS, Inc. June 18, 2015 80
ANSYS Fluent - Power Consumption Study
• Adding GPUs to a CPU-only node resulted in 2.1x speed up while reducing energy consumption by 38%
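The two figures are consistent with energy = average power x runtime. The implied power draw of the GPU node below is a back-of-envelope estimate derived from the slide's numbers, not a measured value:

```python
# With a 2.1x speedup (runtime ratio 1/2.1) and 38% less energy,
# energy = power * time implies the GPU node draws ~1.3x the power
# of the CPU-only node while finishing in less than half the time.
speedup = 2.1
energy_ratio = 1 - 0.38          # GPU-node energy / CPU-only energy
power_ratio = energy_ratio * speedup
print(f"Implied average power ratio: {power_ratio:.2f}x")
```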
![Page 79: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/79.jpg)
© 2015 ANSYS, Inc. June 18, 2015 81
NVIDIA-GPU Solution Fit for ANSYS Fluent
GPUs accelerate the AMG solver portion of the CFD analysis, and thus benefit problems with a relatively high %AMG • Coupled solvers have a high %AMG, in the range of 60-70% • Fine meshes and low-dissipation problems have a high %AMG
In some cases, pressure-based coupled solvers offer faster convergence compared to segregated solvers (problem-dependent)
The whole problem must fit in GPU memory for the calculations to proceed • With the pressure-based coupled solver, each million cells needs approx. 4 GB of GPU memory • High-memory cards such as Tesla K40 or Quadro K6000 are ideal
Moving scalar equations such as turbulence to the GPU (the 'scalar yes' option under 'amg-options') may not provide much benefit because their workloads are small
Better performance on lower CPU core counts • A ratio of 3 or 4 CPU cores to 1 GPU is recommended
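The 4 GB-per-million-cells rule gives a quick memory-fit check before committing a job to GPUs. The helper below is illustrative, assuming that rule of thumb:

```python
# Rough GPU memory-fit check for the pressure-based coupled solver,
# using the slide's ~4 GB per million cells rule of thumb.
def fits_on_gpus(cells_millions, gpu_mem_gb_each, n_gpus):
    required_gb = 4.0 * cells_millions
    return required_gb <= gpu_mem_gb_each * n_gpus

# A 3.6M-cell model on two 12 GB Tesla K40s:
print(fits_on_gpus(3.6, 12, 2))  # True: ~14.4 GB needed, 24 GB available
# The same model on a single 6 GB card:
print(fits_on_gpus(3.6, 6, 1))   # False
```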
![Page 80: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/80.jpg)
© 2015 ANSYS, Inc. June 18, 2015 82
GPU Accelerator Capability - ANSYS Mechanical
Supports the majority of ANSYS structural mechanics solvers:
• Covers both sparse direct and PCG iterative solvers
• Only a few minor limitations
Ease of use:
• Requires at least one supported GPU card to be installed
• No rebuild, no additional installation steps
Performance:
• Offers significantly faster time to solution
• Should never slow down your simulation
V14sp-5 Model
![Page 81: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/81.jpg)
© 2015 ANSYS, Inc. June 18, 2015 83
Influence of GPU Accelerator on Speedup
[Charts: Impeller model, ~2M DOF solid FEs, normal modes analysis using cyclic symmetry, ANSYS Mechanical SMP Block-Lanczos solver: 4 cores + GPU gives a 2.4x speedup vs. 4 cores. Speaker model, ~0.7M DOF solid FEs, vibroacoustic harmonic analysis for one frequency, ANSYS Mechanical distributed sparse solver: 4 cores + GPU gives a 2.7x speedup vs. 4 cores.]
![Page 82: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/82.jpg)
© 2015 ANSYS, Inc. June 18, 2015 84
NVIDIA-GPU Solution Fit for ANSYS Mechanical
GPUs accelerate the solver part of the analysis, so problems with high solver workloads benefit the most from GPUs
• Characterized by both high DOF and high factorization requirements
• Models with solid elements (such as castings) and >500K DOF experience good speedups
Better performance when run in DMP mode rather than SMP mode
GPU and system memory both play important roles in performance
• Sparse solver:
– Bulkier and/or higher-order FE models are good candidates and will be accelerated
– If the model exceeds 5M DOF, either add a second GPU with 5-6 GB of memory (Tesla K20 or K20X) or use a single GPU with 12 GB of memory (Tesla K40 or Quadro K6000)
• PCG/JCG solver:
– The memory-saving (MSAVE) option must be turned off to enable GPUs
– Models with a lower Level of Difficulty value (Lev_Diff) are better suited for GPUs
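The sparse-solver sizing guidance can be restated as a tiny helper. This is a hedged sketch assuming the 5M DOF threshold above; it is not an ANSYS utility:

```python
# Illustrative restatement of the sparse-solver GPU sizing advice:
# beyond ~5M DOF, add a second 5-6 GB GPU or use one 12 GB GPU.
def sparse_gpu_advice(dof_millions):
    if dof_millions <= 5:
        return "A single 5-6 GB GPU (e.g. Tesla K20/K20X) is sufficient"
    return ("Add a second 5-6 GB GPU (Tesla K20/K20X) "
            "or use a single 12 GB GPU (Tesla K40 / Quadro K6000)")

print(sparse_gpu_advice(2.0))
print(sparse_gpu_advice(6.0))
```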
![Page 83: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/83.jpg)
© 2015 ANSYS, Inc. June 18, 2015 87
GPU Achievements ANSYS Mechanical 16.0 Supporting Newest GPUs
[Bar charts (higher is better), jobs/day: V15sp-4: 135 jobs/day on 8 CPU cores vs. 247 jobs/day on 6 CPU cores + K80 GPU (1.8x); V15sp-5: 159 jobs/day on 8 CPU cores vs. 371 jobs/day on 6 CPU cores + K80 GPU (2.3x).]
V15sp-4 Model
Turbine geometry 3.2 million DOF SOLID187 elements Static, nonlinear analysis Sparse direct solver
V15sp-5 Model
Ball grid array geometry 6.0 million DOF Static, nonlinear analysis Sparse direct solver
Distributed ANSYS Mechanical 16.0 with Intel Xeon E5-2697v2 2.7 GHz 8-core CPU; Tesla K80 GPU with boost clocks.
![Page 84: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/84.jpg)
© 2015 ANSYS, Inc. June 18, 2015 89
GPUs can offer significantly faster time to solution. Lower core counts favor a single GPU; higher core counts favor multiple GPUs.
Courtesy of HP
GPU Achievements ANSYS Mechanical 15.0 Supporting Newest GPUs
![Page 85: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/85.jpg)
© 2015 ANSYS, Inc. June 18, 2015 92
GPU Achievements ANSYS Mechanical 16.0 Supporting Xeon Phi
Background:
• ANSYS Mechanical 15.0 was the first commercial FEA program to support the Intel Xeon Phi coprocessor
• Support was limited to shared-memory parallelism (SMP) on Linux only
Intel Xeon Phi coprocessor support:
• R16 now supports distributed-memory parallelism (DMP) and Windows
[Bar chart: speedup vs. core count (1, 2, 4, 8, 16 cores) with and without Xeon Phi; the coprocessor substantially increases speedup at every core count (e.g. 3.6x vs. 1.8x at low core counts), reaching 14.4x at 16 cores.]
![Page 86: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/86.jpg)
© 2015 ANSYS, Inc. June 18, 2015 93
GPU Achievements ANSYS License Scheme for GPU and Phi
Licensing Examples:
• 1 x ANSYS HPC Pack: total 8 HPC tasks (4 GPU/Phi max). Examples of valid configurations: 6 CPU cores + 2 GPU/Phi, or 4 CPU cores + 4 GPU/Phi
• 2 x ANSYS HPC Pack: total 32 HPC tasks (16 GPU/Phi max). Example of a valid configuration: 24 CPU cores + 8 GPU/Phi (total use of 2 compute nodes)
(Applies to all schemes: ANSYS HPC, ANSYS HPC Pack, ANSYS HPC Workgroup)
![Page 87: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/87.jpg)
© 2015 ANSYS, Inc. June 18, 2015 95
HDD vs. SSD
Maximizing Performance – Putting it Together
The right combination of hardware and software
leads to maximum efficiency
SMP vs. DMP
Interconnects?
Clusters?
CPUs? GPU/Phi?
![Page 88: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/88.jpg)
© 2015 ANSYS, Inc. June 18, 2015 96
#1 Rule: Avoid waiting for I/O to complete
• Always check whether the job is I/O bound or compute bound
– Check the output file for CPU and Elapsed times
• When Elapsed time >> main-thread CPU time → I/O bound
– Consider adding more RAM or a faster hard drive configuration
• When Elapsed time ≈ main-thread CPU time → compute bound
– Consider moving the simulation to a machine with newer, faster processors
– Consider using Distributed ANSYS (DMP) instead of SMP
– Consider running on more CPU cores or possibly using GPU(s)
Total CPU time for main thread : 159.8 seconds
. . . . . .
Elapsed Time (sec) = 398.000 Date = 03/21/2013
Maximizing Performance – ANSYS Mechanical
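This check can be automated by scanning the solver output file. The line patterns below follow the sample output on the slide; the 1.5x ratio threshold is an illustrative assumption, not an ANSYS-documented value:

```python
import re

# Classify a run as I/O bound or compute bound from the solver
# output, per the rule above: Elapsed >> CPU time means I/O bound.
def classify(output_text, ratio_threshold=1.5):
    cpu = float(re.search(r"Total CPU time for main thread\s*:\s*([\d.]+)",
                          output_text).group(1))
    elapsed = float(re.search(r"Elapsed Time \(sec\)\s*=\s*([\d.]+)",
                              output_text).group(1))
    return "I/O bound" if elapsed / cpu > ratio_threshold else "compute bound"

sample = """Total CPU time for main thread : 159.8 seconds
Elapsed Time (sec) = 398.000 Date = 03/21/2013"""
print(classify(sample))  # I/O bound (398 / 159.8 is ~2.5)
```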
![Page 89: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/89.jpg)
© 2015 ANSYS, Inc. June 18, 2015 97
Maximizing Performance – ANSYS Mechanical
How to improve an I/O bound simulation
– First consider adding more RAM
• Always the best option for optimal performance
• Allows the operating system to cache file data in memory
– Next consider improving the I/O configuration
• Need fast hard drives to feed fast processors
– Consider SSDs: higher bandwidths and extremely low seek times
– Consider RAID configurations: RAID 0 for speed; RAID 1/5 for redundancy; RAID 10 for speed and redundancy
![Page 90: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/90.jpg)
© 2015 ANSYS, Inc. June 18, 2015 98
Maximizing Performance – ANSYS Mechanical
Example of an I/O bound simulation
[Bar chart, relative speedup, "Benefits of SSD and RAM", for 2 cores + HDD, 8 cores + HDD, and 8 cores + SSD with 16 GB vs. 128 GB RAM. Lack of RAM and a slow HDD ruin scaling (8 cores + HDD run at 0.8x with 16 GB); a single SSD allows some scaling, not as helpful as RAM but cheaper; adding RAM gives the biggest gains and allows good scaling, with speedups up to 5.9x at 128 GB. Reported bars: 0.8x, 2.9x, 2.7x, 5.9x, 5.9x relative to the 2-core HDD baseline.]
• 2.1 million DOF • Nonlinear static analysis • Direct sparse solver (DSPARSE) • 2 Intel Xeon E5-2670 (2.6 GHz, 16 cores total) • One 10k rpm HDD, one SSD • Windows 7
![Page 91: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/91.jpg)
© 2015 ANSYS, Inc. June 18, 2015 99
Maximizing Performance – ANSYS Mechanical
How to improve a compute bound simulation
– First consider using newer, faster processors
• New CPU architectures and faster clock speeds always help
– Next consider using parallel processing
• DMP is virtually always recommended over SMP
• More computations are performed in parallel with DMP
• Significantly faster speedups are achieved using DMP
• DMP can take advantage of all resources on a cluster
• A whole new class of problems can be solved!
– Last consider using GPU acceleration
• Can help accelerate critical, time-consuming computations
![Page 92: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/92.jpg)
© 2015 ANSYS, Inc. June 18, 2015 100
Maximizing Performance – ANSYS Mechanical
Example of a compute bound simulation
[Bar chart, relative speedup, "Benefits of DMP and GPU", comparing Xeon X5675 and Xeon E5-2670 at 2 cores, 8 cores, and 8 cores + GPU. Using newer Xeons gives a big gain (1.8x at 2 cores); using 8 cores gives faster performance (4.0x); maximum performance is found by adding a GPU (11.0x).]
• 2.1 million DOF • Nonlinear static analysis • Direct sparse solver (DSPARSE) • 2 Intel Xeon E5-2670 (2.6 GHz, 16 cores total) • 128 GB RAM • 1 Tesla K20c • Windows 7
![Page 93: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/93.jpg)
© 2015 ANSYS, Inc. June 18, 2015 101
Balanced System for Overall Optimum Performance
Maximizing Performance – ANSYS Mechanical
[Bar chart, "Balanced Performance", relative speedup (I/O bound, 16 GB RAM): 2 cores 1.0x, 8 cores 2.7x, 8 cores + GPU 5.2x, 8 cores + GPU + SSD 12.5x.]
• 2.1 million DOF • Nonlinear static analysis • Direct sparse solver (DSPARSE) • 2 Intel Xeon E5-2670 (2.6 GHz, 16 cores total) • 16 GB RAM • SSD and SATA disks • 1 Tesla K20c • Windows 7
![Page 94: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/94.jpg)
© 2015 ANSYS, Inc. June 18, 2015 102
Balanced System for Overall Optimum Performance
Maximizing Performance – ANSYS Mechanical
• 2.1 million DOF • Nonlinear static analysis • Direct sparse solver (DSPARSE) • 2 Intel Xeon E5-2670 (2.6 GHz, 16 cores total) • 128 GB RAM • SSD and SATA disks • 1 Tesla K20c • Windows 7
[Bar chart, "Balanced Performance", relative speedup across 2 cores / 8 cores / 8 cores + GPU / 8 cores + GPU + SSD: I/O bound (16 GB RAM): 1.0x, 2.7x, 5.2x, 12.5x; compute bound (128 GB RAM): 5.7x, 12.0x, 24.8x, 27.3x.]
![Page 95: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/95.jpg)
© 2015 ANSYS, Inc. June 18, 2015 103
• Why Talking About Hardware
• HPC Terminology
• ANSYS Work-flow
• Hardware Considerations
• Additional resources
Agenda
![Page 96: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/96.jpg)
© 2015 ANSYS, Inc. June 18, 2015 104
• An important part of specifying an HPC system is to purchase a balanced system.
• There is no point in spending all your money on the processor if the I/O is your biggest bottleneck.
• You are only as good as your slowest component!
Wrap-up - Hardware
![Page 97: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/97.jpg)
© 2015 ANSYS, Inc. June 18, 2015 105
Scalable HPC Licensing
ANSYS HPC (per-process)
ANSYS HPC Pack
• HPC product rewarding volume parallel processing for high-fidelity simulations
• Each simulation consumes one or more Packs
• Parallel capability increases quickly with added Packs
ANSYS HPC Workgroup
• HPC product rewarding volume parallel processing for increased simulation throughput within a single co-located workgroup
• 16 to 32768 parallel cores shared across any number of simulations on a single server
• "Enterprise" options available to deploy and use anywhere in the world
Single HPC solution for FEA/CFD/FSI and any level of fidelity
Parallel Enabled (Cores) by HPC Packs per Simulation:
1 Pack: 8; 2 Packs: 32; 3 Packs: 128; 4 Packs: 512; 5 Packs: 2048; 6 Packs: 8192; 7 Packs: 32768
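The Pack-to-cores progression quadruples with each added Pack. A small illustrative function (not an ANSYS licensing API) reproduces the published table:

```python
# Parallel-enabled cores per simulation for p HPC Packs:
# the count quadruples with each Pack, cores(p) = 8 * 4**(p - 1).
def hpc_pack_cores(packs):
    return 8 * 4 ** (packs - 1)

print([hpc_pack_cores(p) for p in range(1, 8)])
# [8, 32, 128, 512, 2048, 8192, 32768]
```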
![Page 98: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/98.jpg)
© 2015 ANSYS, Inc. June 18, 2015 106
• ANSYS HPC and ANSYS HPC Workgroup give flexible use of a pool of licenses.
• ANSYS HPC Pack gives quick scale-up but is more restrictive in how it can be used.
• This greater flexibility is why the HPC Workgroup options cost more than the HPC Packs.
Which type of Licensing is right for me?
![Page 99: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/99.jpg)
© 2015 ANSYS, Inc. June 18, 2015 107
Number of Simultaneous Design Points Enabled: HPC license for running parametric FEA or CFD simulations on multiple CPU cores simultaneously, and more cost effectively.
HPC Parametric Pack Licenses: 1, 2, 3, 4, 5
Simultaneous Design Points Enabled: 4, 8, 16, 32, 64
ANSYS HPC Parametric Pack License
Key Benefits
• Ability to automatically and simultaneously execute design points while consuming just one set of application licenses
• Scalable: the number of simultaneous design points enabled increases quickly with added packs
• Amplifies the complete workflow: design points can include execution of multiple applications (pre, meshing, solve, HPC, post)
![Page 100: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/100.jpg)
© 2015 ANSYS, Inc. June 18, 2015 108
Click on webinars related to HPC/IT for more and upcoming ones!
Additional Resources - IT Webinars
Watch recorded webinars by clicking below: • Understanding Hardware Selection for ANSYS 15.0 • How to Speed Up ANSYS 15.0 with GPUs • Intel Technologies Enabling Faster, More Effective Simulation • Optimizing Remote Access to Simulation
![Page 101: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/101.jpg)
© 2015 ANSYS, Inc. June 18, 2015 109
White Papers by clicking below: • Optimizing Business Value in High-Performance Engineering Computing
• IBM Application Ready Solutions Reference Architecture for ANSYS
• Intel Solid-State Drives Increase Productivity of Product Design and Simulation
• Value of HPC for Ensuring Product Integrity
Additional Resources - IT White Papers & Technical Briefs
Technical Briefs by clicking below: • Parallel Scalability of ANSYS 15.0 on Hewlett-Packard Systems
• SGI Technology Guide for ANSYS Mechanical Analysts
• SGI Technology Guide for ANSYS Fluent Analysts
• Accelerating ANSYS Fluent 15.0 Using NVIDIA GPUs
![Page 102: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/102.jpg)
© 2015 ANSYS, Inc. June 18, 2015 110
Additional Resources - ANSYS IT Webcast Series
On-demand webinars: • Understanding Hardware Selection for ANSYS 15.0 • How to Speed Up ANSYS 15.0 with GPUs • Cloud Hosting of ANSYS: Gompute On-Demand Solutions • Simplified HPC Clusters for ANSYS Users • Intel Technologies Enabling Faster, More Effective Simulation • Accelerating Time-to-Results with Parallel I/O • Extreme Scalability for High-Fidelity CFD Simulations • Methodology and Tools for Compute Performance at Any Scale • Understanding Hardware Selection for Structural Mechanics • Optimizing Remote Access to Simulation • Scalable Storage and Data Management for Engineering Simulation
http://www.ansys.com/Support/Platform+Support/IT+Solutions+for+ANSYS+Webcast+Series
![Page 103: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/103.jpg)
© 2015 ANSYS, Inc. June 18, 2015 111
Additional Resources
ANSYS Platform Support • http://www.ansys.com/Support/Platform+Support
– Platform Support Policies – Supported Platforms – Supported Hardware – Tested Systems – ANSYS Benchmarks
![Page 104: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/104.jpg)
© 2015 ANSYS, Inc. June 18, 2015 112
ANSYS Partner Solutions – http://www.ansys.com/About+ANSYS/Partner+Programs/HPC+Partners
• Reference configurations • Performance data • White papers • Sales contact points
Performance Data – http://www.ansys.com/benchmarks
Additional Resources
![Page 105: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/105.jpg)
© 2015 ANSYS, Inc. June 18, 2015 113
Additional Resources
The Manual
• Sections on best practices and parallel processing for various solvers
• Performance Guide for Mechanical
• Installation walkthroughs for installing the products, parallel processing, licensing, and RSM (Remote Solve Manager)
ANSYS Advantage • Online Magazine
![Page 106: Understanding Hardware Selection to Speedup Your CFD and ...](https://reader031.fdocuments.us/reader031/viewer/2022012504/617e8ea946954543180b2dd9/html5/thumbnails/106.jpg)
© 2015 ANSYS, Inc. June 18, 2015 114
• Connect with Me – [email protected]
• Connect with ANSYS, Inc.
– LinkedIn ANSYSInc – Twitter @ANSYS_Inc – Facebook ANSYSInc
• Follow our Blog
– ansys-blog.com
Thank You!