Transcript of a presentation by Stan Posey, NVIDIA, Santa Clara, CA, USA (sposey@nvidia).
Slide 2 — Agenda: GPU Progress and Directions for CAE
- Introduction of GPUs in HPC
- Progress of CFD on GPUs
- Review of OpenFOAM on GPUs
- Discussion on WRF Developments
Slide 3 — Real Application Speedups
- 146X  Medical Imaging (U of Utah)
- 36X   Molecular Dynamics (U of Illinois, Urbana)
- 18X   Video Transcoding (Elemental Tech)
- 50X   Matlab Computing (AccelerEyes)
- 100X  Astrophysics (RIKEN)
- 149X  Financial Simulation (Oxford)
- 47X   Linear Algebra (Universidad Jaime)
- 20X   3D Ultrasound (Techniscan)
- 130X  Quantum Chemistry (U of Illinois, Urbana)
- 30X   Gene Sequencing (U of Maryland)
Slide 4 — Real Application Speedups (figures repeated from the previous slide)
NOTE: Missing context is often the fault of NVIDIA, not of the organizations referenced.
Always demand context!
- Full application? Often kernel-only, without data transfer...
- What is the reference CPU? Often an old and dusty x86...
- How many CPU cores in the comparison? Often 1 core... but who uses only 1 core nowadays?
Slide 5 — Example: ANSYS Fluent GPU Acceleration
Preview of ANSYS Fluent 14.5 performance – by ANSYS, Aug 2012.
ANSYS Fluent AMG solver time in seconds (lower is better), Dual Socket CPU vs. Dual Socket CPU + Tesla C2075:
- 2 x Xeon X5650, only 1 core used: 2832 s CPU-only vs. 517 s with GPU (5.5x)
- 2 x Xeon X5650, all 12 cores used: 933 s CPU-only vs. 517 s with GPU (1.8x)
Helix model: helix geometry; 1.2M tet cells; unsteady, laminar; coupled PBNS, DP; AMG F-cycle on CPU, AMG V-cycle on GPU.
NOTE: All jobs report solver time only.
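The quoted speedups follow directly from the solver times on the chart; a quick arithmetic check (variable names are illustrative):

```python
# Solver times (seconds) from the ANSYS Fluent 14.5 preview chart.
cpu_1_core = 2832     # 2 x Xeon X5650, only 1 core used
cpu_12_cores = 933    # 2 x Xeon X5650, all 12 cores used
cpu_plus_gpu = 517    # same host plus one Tesla C2075

# Speedup = CPU-only time / (CPU + GPU) time, since lower time is better.
print(round(cpu_1_core / cpu_plus_gpu, 1))    # 5.5
print(round(cpu_12_cores / cpu_plus_gpu, 1))  # 1.8
```

Note that the honest comparison (all 12 cores, per the slide's own "demand context" advice) is the 1.8x figure, not the 5.5x one.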
Slide 6 — GPU Computing is Mainstream
Example adopters across sectors (Edu/Research, Government, Oil & Gas, Life Sciences, Finance, Manufacturing): Chinese Academy of Sciences, Air Force Research Laboratory, Naval Research Laboratory, Max Planck Institute, Mass General Hospital.
Slide 7 — GPUs Now as Common to Servers as CPUs
Slide 8 — Supercomputing Momentum With GPUs
Chart: number of GPU-accelerated systems on the Top500, 2007–2012 (0 to 60). Milestones along the curve: Tesla GPUs launched (2007), first double-precision GPU (2008), Tesla Fermi 20-series launched (2010), Kepler launched (2012). 52 Tesla-accelerated systems in the June 2012 Top500 list.
Slide 9 — ORNL TITAN: #1 on Top500 List of Supercomputers
- 18,688 Tesla K20X GPUs
- 27 Petaflops peak, 17.59 Petaflops on Linpack
- 90% of performance from GPUs
- #3 on Green500
Slide 10 — NVIDIA GPUs Accelerate HPC at Any Scale
The same GPU technology spans from MAXIMUS workstations to TITAN at ORNL (20+ Petaflops, 18,688 NVIDIA Tesla K20X), the leader of the Top500 at Top500.org.
Slide 11 — Over 20 GPU Applications on ORNL Titan
- WL-LSMS: Role of material disorder, statistics, and fluctuations in nanoscale materials and systems.
- S3D: How are we going to efficiently burn next-generation diesel/bio fuels?
- CAM-SE: Answer questions about specific climate change adaptation and mitigation scenarios; realistically represent features like precipitation patterns/statistics and tropical storms.
- Denovo: Unprecedented high-fidelity radiation transport calculations for a variety of nuclear energy and technology applications.
- LAMMPS: Biofuels: an atomistic model of cellulose (blue) surrounded by lignin molecules, comprising a total of 3.3 million atoms (water not shown).
- NRDF: Radiation transport – critical to astrophysics, laser fusion, combustion, atmospheric dynamics, and medical imaging.
Slide 12 — Leadership HPC Sites Now GPU Accelerated
- United States: Lawrence Livermore National Labs, Oak Ridge National Labs, Sandia National Labs, NOAA, NCSA BlueWaters
- Germany: Juelich, HLRS, Max Planck, TU Dresden
- UK: Cambridge, EPCC, Oxford, STFC
- Japan: Tokyo Tech, RIKEN, Tsukuba
- China: NSC Shenzhen, NSC Tianjin, CAS IPE
- Rest of Europe: BSC (Spain), CINECA (Italy), CEA (France), CSCS (Switzerland)
- Rest of World: MSU (Russia), RAS (Russia), IITs (India)
Slide 13 — Tsubame 2.0, Tokyo Institute of Technology
TiTech won the 2011 Gordon Bell Prize (Special Achievement in Scalability and Time-to-Solution), achieved with NVIDIA Tesla GPUs: "Peta-scale Phase-Field Simulation for Dendritic Solidification on the TSUBAME 2.0 Supercomputer" – T. Shimokawabe, T. Aoki, et al. The run used 4,224 Tesla GPUs + 2,816 x86 CPUs.
Slide 14 — World's Most Energy Efficient Supercomputer
CINECA Eurora, a "liquid-cooled" Eurotech Aurora Tigon with 128 Tesla K20 accelerators:
- 3150 MFLOPS/Watt – greener than the greenest Xeon Phi system (NICS Beacon) and the greenest CPU system (C-DAC), per the chart (MFLOPS/Watt, 0–3000 scale)
- $100k energy savings / yr
- 300 tons of CO2 saved / yr
Slide 15 — Accelerated Computing: Multi-core plus Many-core
- CPU: optimized for serial tasks
- GPU accelerator: optimized for many parallel tasks
- Result: 10x performance, 5x energy efficiency
Slide 16 — Intel Agrees: Future HPC is Hybrid Computing
Performance is constrained by power, and it is impossible to optimize for both single-thread performance and power efficiency, so the future is hybrid: a few cores optimized for serial work, most cores optimized for throughput. A Xeon provides fast single threads (serial work) alongside an accelerator over PCIe for extreme power efficiency (throughput work) – a GPU in NVIDIA's model, an Intel Xeon Phi in Intel's.
Slide 17 — NVIDIA HPC Technology and Strategy
Technology:
- Development of professional GPUs as co-processing accelerators for x86 CPUs; GPUs provide a cost-effective and power-efficient approach to application speedups.
Strategy:
- Established industry alliances to develop HPC solutions: alliances with ISVs, customers who develop HPC software, and research organizations.
- Technical collaborations in applications engineering: investment in PhD engineers who work with HPC software to optimize for GPUs.
- GPU integration with systems from major hardware vendors: HP and several others; Kepler K20-based systems available since 1Q 2013.
Slide 18 — Strong Growth of GPU Accelerated Applications
Top scientific apps by domain:
- Computational Chemistry: AMBER, CHARMM, GROMACS, LAMMPS, NAMD, DL_POLY
- Material Science: QMCPACK, Quantum Espresso, GAMESS-US, Gaussian, NWChem, VASP
- Climate & Weather: COSMO, GEOS-5, CAM-SE, NIM, WRF
- Physics: Chroma, Denovo, GTC, GTS, ENZO, MILC
- CAE: ANSYS Mechanical, MSC Nastran, SIMULIA Abaqus, ANSYS Fluent, OpenFOAM, LS-DYNA
Chart: number of apps (accelerated and in development), 2010–2012, with year-over-year increases of 61% and 40%.
Slide 19 — 207 GPU-Accelerated Applications: www.nvidia.com/appscatalog
Slide 20 — Developer Momentum Continues to Grow
- 2008: 150K CUDA downloads; 60 university courses; 4,000 academic papers; 100M CUDA-capable GPUs; 1 supercomputer
- 2013: 1.6M CUDA downloads; 640 university courses; 37,000 academic papers; 430M CUDA-capable GPUs; 50 supercomputers
Slide 21 — How GPU Acceleration is Developed
Application code is split between GPU and CPU: the compute-intensive functions – often just 5% of the code – form the hot spot that accounts for 50%–75% of the profile time and moves to the GPU, while the rest of the sequential code stays on the CPU.
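A hot spot that is 50%–75% of profile time also caps the achievable whole-application speedup, per Amdahl's law. A minimal sketch (function name is illustrative):

```python
def amdahl_speedup(parallel_fraction, accel_factor):
    """Overall speedup when only `parallel_fraction` of the runtime
    is accelerated by `accel_factor` (Amdahl's law)."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / accel_factor)

# Even with an effectively infinite GPU speedup on the kernel,
# accelerating 75% of the profile time caps the application at 4x.
print(round(amdahl_speedup(0.75, 1e12), 2))  # 4.0
# A 10x kernel speedup on a 50% hot spot yields only ~1.8x overall.
print(round(amdahl_speedup(0.50, 10.0), 2))  # 1.82
```

This is why the honest whole-application numbers on earlier slides (1.8x–2.1x) are far below kernel-only marketing figures.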
Slide 22 — Programming Strategies for GPU Acceleration
Three routes into applications, trading portability for flexibility and development effort:
- Libraries: "drop-in" acceleration
- OpenACC directives: GPU acceleration in standard languages (Fortran, C, C++)
- Programming languages: maximum flexibility, less portability, more development
Slide 23 — GPU Accelerated Libraries: "Drop-in" Acceleration for your Applications
- Linear Algebra (FFT, BLAS, SPARSE, matrix): NVIDIA cuFFT, cuBLAS, cuSPARSE
- Numerical & Math (RAND, statistics): NVIDIA Math Lib, NVIDIA cuRAND
- Data Structures & AI (sort, scan, zero sum): GPU AI – board games, GPU AI – path finding
- Visual Processing (image & video): NVIDIA NPP, NVIDIA Video Encode
Slide 24 — Select Developments using Directives and OpenACC (www.openacc-standard.org)
Software | Domain | Collaborators
LS-DYNA | CAE | LSTC, NVIDIA
Abaqus/Explicit | CAE | SIMULIA, NVIDIA
PAM-CRASH | CAE | ESI, CAPS
WRF | Climate/NWP | Cray, NVIDIA
COSMO | Climate/NWP | CSCS, NVIDIA
GEOS-5 | Climate/NWP | NASA GSFC, PGI
NIM | Climate/NWP | NOAA, PGI, CAPS, NVIDIA
S3D | Combustion | Cray, ORNL, Sandia NL, NVIDIA
Slide 25 — ANSYS and NVIDIA Collaboration Roadmap
Release | ANSYS Mechanical | ANSYS Fluent | ANSYS EM
13.0, Dec 2010 | SMP, single GPU, sparse and PCG/JCG solvers | – | ANSYS Nexxim
14.0, Dec 2011 | + Distributed ANSYS; + multi-node support | Radiation heat transfer (beta) | ANSYS Nexxim
14.5, Nov 2012 | + Multi-GPU support; + hybrid PCG; + Kepler GPU support | + Radiation HT; + GPU AMG solver (beta), single GPU | ANSYS Nexxim
15.0, Q4-2013 | + CUDA 5 Kepler tuning | + Multi-GPU AMG solver; + CUDA 5 Kepler tuning | ANSYS Nexxim; ANSYS HFSS (Transient)
Slide 26 — ANSYS Focus on Implicit Sparse Solvers
The application software splits across CPU and GPU. The CPU handles reading input and matrix set-up, the global solution, and writing output. The matrix operations – 50%–75% of profile time but a small % of lines of code – run on the GPU via hand-written CUDA, GPU libraries (cuBLAS), and OpenACC directives. (ANSYS is investigating OpenACC for moving more tasks onto the GPU.)
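To see why the matrix operations dominate, consider the kernel of an iterative solver such as PCG: every iteration is one sparse matrix-vector product plus a handful of vector updates. This is not the ANSYS solver, just a minimal pure-Python sketch of unpreconditioned conjugate gradient over a hypothetical dict-of-rows sparse format:

```python
def spmv(rows, x):
    """Sparse matrix-vector product; `rows` maps row index -> {col: value}."""
    return [sum(v * x[j] for j, v in cols.items()) for cols in rows]

def conjugate_gradient(rows, b, iters=50, tol=1e-10):
    """Unpreconditioned CG for a symmetric positive-definite matrix."""
    n = len(b)
    x = [0.0] * n
    r = b[:]              # residual b - A*x, with x = 0 initially
    p = r[:]
    rs = sum(ri * ri for ri in r)
    for _ in range(iters):
        Ap = spmv(rows, p)                       # dominant cost per iteration
        alpha = rs / sum(pi * ai for pi, ai in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * ai for ri, ai in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:
            break
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

# 1D Laplacian (tridiagonal [-1, 2, -1]), a toy stand-in for an FE matrix.
n = 5
rows = [{j: (2.0 if i == j else -1.0)
         for j in (i - 1, i, i + 1) if 0 <= j < n} for i in range(n)]
b = [1.0] * n
x = conjugate_gradient(rows, b)
print([round(xi, 6) for xi in x])  # [2.5, 4.0, 4.5, 4.0, 2.5]
```

The spmv and dot-product lines are a tiny fraction of the code but essentially all of the runtime, which is exactly the small-LoC / large-profile-time split the slide describes.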
Slide 27 — ANSYS Mechanical 14.5 GPU Acceleration
(Build slide: shows only the CPU-only bars of the chart – 164 and 210 jobs per day – which is completed with GPU results on the next slide.)
Slide 28 — ANSYS Mechanical 14.5 GPU Acceleration
ANSYS Mechanical number of jobs per day (higher is better), Distributed ANSYS 14.5 with 8-core CPUs and single GPUs:
- Westmere: Xeon X5690 3.47 GHz, 8 cores: 164 CPU-only vs. 341 with Tesla C2075 (C2075 = 2.1x acceleration)
- Sandy Bridge: Xeon E5-2687W 3.10 GHz, 8 cores: 210 CPU-only vs. 395 with Tesla K20 (K20 = 1.9x acceleration)
V14sp-5 model: turbine geometry; 2,100,000 DOF; SOLID187 FEs; static, nonlinear; one iteration (final solution requires 25); direct sparse solver. Results from Supermicro X9DR3-F, 64 GB memory.
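Because the metric here is throughput (jobs per day), acceleration is the GPU bar divided by the CPU-only bar; a quick check of the slide's ratios:

```python
# Jobs per day from the ANSYS Mechanical 14.5 chart (higher is better).
westmere_cpu, westmere_gpu = 164, 341          # Xeon X5690 vs. + Tesla C2075
sandybridge_cpu, sandybridge_gpu = 210, 395    # Xeon E5-2687W vs. + Tesla K20

print(round(westmere_gpu / westmere_cpu, 1))        # 2.1 (C2075)
print(round(sandybridge_gpu / sandybridge_cpu, 1))  # 1.9 (K20)
```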
Slide 29 — NVIDIA Use of ANSYS Software in Product Engineering
- ANSYS Icepak: active and passive cooling of IC packages
- ANSYS Mechanical: large-deflection bending of PCBs
- ANSYS Mechanical: comfort and fit of 3D emitter glasses
- ANSYS Mechanical: shock & vibration of solder-ball assemblies
Slide 30 — GPU Developments for Automotive CAE
Select Application | Select CAE Software (ISV) | GPU Status
CSM: Durability (stress) and fatigue | MSC Nastran | Available today
CSM: Road handling and VPG | Adams (for MBD) | Evaluation
CSM: Powertrain stress analysis | Abaqus/Standard | Available today
CSM: Body NVH | MSC Nastran | Available today
CSM: Crashworthiness and safety | LS-DYNA | Implicit only, beta
CFD: Aerodynamics / Thermal UH | ANSYS Fluent | Available today
CFD: IC engine combustion | STAR-CCM+ | Evaluation
CFD: Aerodynamics / HVAC | OpenFOAM | Available today
CFD: Plastic mold injection | Moldflow | Available today
Slide 31 — GPU Developments for Turbine Engine CFD
(Covering turbomachinery, combustor, and nozzle/noise codes; green color on the original slide indicates CUDA-ready during 2013.)
Developer | Location | Software
Turbostream | England, UK | Turbostream 3.0
Oxford / Rolls Royce | England, UK | OP2 / Hydra
ANSYS | USA | ANSYS CFD 15.0 (Fluent + CFX)
ANSYS | USA | ANSYS Fluent 15.0
FluiDyna | Germany | Culises for OpenFOAM 2.2.0
Vratis | Poland | Speed-IT for OpenFOAM 2.2.0
Cascade Technologies | USA | CHARLES
Convergent Science | USA | Converge CFD
Sandia NL / Oak Ridge NL | USA | S3D
Naval Research Lab | USA | JENRE
Aviadvigatel OJSC | Russia | GHOST CFD
Slide 32 — Kepler: Fastest, Most Efficient HPC Architecture Ever
- SMX: 3x performance per watt
- Hyper-Q: easy speed-up for legacy MPI apps
- Dynamic Parallelism: parallel programming made easier than ever
Slide 33 — Tesla Kepler Family: World's Fastest and Most Efficient HPC Accelerators
GPU | Single Precision Peak (SGEMM) | Double Precision Peak (DGEMM) | Memory Size | Memory Bandwidth (ECC off) | System Solution
K20X | 3.95 TF (2.90 TF) | 1.32 TF (1.22 TF) | 6 GB | 250 GB/s | Server only
K20 | 3.52 TF (2.61 TF) | 1.17 TF (1.10 TF) | 5 GB | 208 GB/s | Server + workstation
K10 | 4.58 TF | 0.19 TF | 8 GB | 320 GB/s | Server only
K20X and K20 target weather & climate, physics, biochemistry, CAE, and material science; K10 targets image, signal, video, and seismic processing.
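The parenthesized GEMM figures can be read as achieved fraction of peak; for the K20X, for instance:

```python
# Peak vs. measured GEMM throughput (TFLOPS) from the Tesla Kepler table.
k20x_sp_peak, k20x_sgemm = 3.95, 2.90
k20x_dp_peak, k20x_dgemm = 1.32, 1.22

print(round(100 * k20x_sgemm / k20x_sp_peak))  # 73 (% of SP peak)
print(round(100 * k20x_dgemm / k20x_dp_peak))  # 92 (% of DP peak)
```

That double-precision GEMM efficiency is what matters for the implicit CAE solvers discussed earlier, which run in DP.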
Slide 34 — NVIDIA CUDA GPU Roadmap
Chart: DP GFLOPS per watt (log scale, 0.5–32), 2008–2014, by architecture and signature feature: T10 (CUDA), Fermi (FP64), Kepler (Dynamic Parallelism), Maxwell (Unified Virtual Memory), Volta (stacked DRAM). Projected efficiency: roughly 7x Fermi / 2x Kepler, then roughly 13x Fermi / 4x Kepler.
What to expect from NVIDIA:
- Increasing number of more flexible cores
- Larger and faster memories (6 GB today)
- Enhanced programming and standards
- Tighter integration with systems
Slide 35 — NVIDIA GRID Enabled Virtual Desktop
VDI stack: virtual desktops run in virtual machines with the NVIDIA driver, on an NVIDIA GRID-enabled hypervisor backed by NVIDIA GRID GPUs.