Transcript of a presentation by Stan Posey, NVIDIA, Santa Clara, CA, USA (sposey@nvidia).
Slide 2 — Agenda: GPU Progress and Directions for CAE
- Introduction of GPUs in HPC
- Progress of CFD on GPUs
- Review of OpenFOAM on GPUs
- Discussion on WRF Developments
Slide 3 — Real Application Speedups
- 146X  Medical Imaging (U of Utah)
- 36X   Molecular Dynamics (U of Illinois, Urbana)
- 18X   Video Transcoding (Elemental Tech)
- 50X   Matlab Computing (AccelerEyes)
- 100X  Astrophysics (RIKEN)
- 149X  Financial Simulation (Oxford)
- 47X   Linear Algebra (Universidad Jaime)
- 20X   3D Ultrasound (Techniscan)
- 130X  Quantum Chemistry (U of Illinois, Urbana)
- 30X   Gene Sequencing (U of Maryland)
Slide 4 — Real Application Speedups (figures repeated from the previous slide)
NOTE: Missing context is often the fault of NVIDIA, not of the organizations referenced.
Always demand context!
- Full application? Often kernel-only, without data transfer...
- What is the reference CPU? Often an old and dusty x86...
- How many CPU cores in the comparison? Often 1 core... but who uses only 1 core nowadays?
Slide 5 — Example: ANSYS Fluent GPU Acceleration
Preview of ANSYS Fluent 14.5 performance – by ANSYS, Aug 2012.
ANSYS Fluent AMG solver time in seconds (lower is better), Dual Socket CPU vs. Dual Socket CPU + Tesla C2075:
- 2 x Xeon X5650, only 1 core used: 2832 s CPU-only vs. 517 s with GPU (5.5x)
- 2 x Xeon X5650, all 12 cores used: 933 s CPU-only vs. 517 s with GPU (1.8x)
Helix model: helix geometry; 1.2M tet cells; unsteady, laminar; coupled PBNS, DP; AMG F-cycle on CPU, AMG V-cycle on GPU.
NOTE: All jobs report solver time only.
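The quoted speedups follow directly from the solver times on the chart; a quick arithmetic check (variable names are illustrative):

```python
# Solver times (seconds) from the ANSYS Fluent 14.5 preview chart.
cpu_1_core = 2832     # 2 x Xeon X5650, only 1 core used
cpu_12_cores = 933    # 2 x Xeon X5650, all 12 cores used
cpu_plus_gpu = 517    # same host plus one Tesla C2075

# Speedup = CPU-only time / (CPU + GPU) time, since lower time is better.
print(round(cpu_1_core / cpu_plus_gpu, 1))    # 5.5
print(round(cpu_12_cores / cpu_plus_gpu, 1))  # 1.8
```

Note that the honest comparison (all 12 cores, per the slide's own "demand context" advice) is the 1.8x figure, not the 5.5x one.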
Slide 6 — GPU Computing is Mainstream
Example adopters across sectors (Edu/Research, Government, Oil & Gas, Life Sciences, Finance, Manufacturing): Chinese Academy of Sciences, Air Force Research Laboratory, Naval Research Laboratory, Max Planck Institute, Mass General Hospital.
Slide 7 — GPUs Now as Common to Servers as CPUs
Slide 8 — Supercomputing Momentum With GPUs
Chart: number of GPU-accelerated systems on the Top500, 2007–2012 (0 to 60). Milestones along the curve: Tesla GPUs launched (2007), first double-precision GPU (2008), Tesla Fermi 20-series launched (2010), Kepler launched (2012). 52 Tesla-accelerated systems in the June 2012 Top500 list.
Slide 9 — ORNL TITAN: #1 on Top500 List of Supercomputers
- 18,688 Tesla K20X GPUs
- 27 Petaflops peak, 17.59 Petaflops on Linpack
- 90% of performance from GPUs
- #3 on Green500
Slide 10 — NVIDIA GPUs Accelerate HPC at Any Scale
The same GPU technology spans from MAXIMUS workstations to TITAN at ORNL (20+ Petaflops, 18,688 NVIDIA Tesla K20X), the leader of the Top500 at Top500.org.
Slide 11 — Over 20 GPU Applications on ORNL Titan
- WL-LSMS: Role of material disorder, statistics, and fluctuations in nanoscale materials and systems.
- S3D: How are we going to efficiently burn next-generation diesel/bio fuels?
- CAM-SE: Answer questions about specific climate change adaptation and mitigation scenarios; realistically represent features like precipitation patterns/statistics and tropical storms.
- Denovo: Unprecedented high-fidelity radiation transport calculations for a variety of nuclear energy and technology applications.
- LAMMPS: Biofuels: an atomistic model of cellulose (blue) surrounded by lignin molecules, comprising a total of 3.3 million atoms (water not shown).
- NRDF: Radiation transport – critical to astrophysics, laser fusion, combustion, atmospheric dynamics, and medical imaging.
Slide 12 — Leadership HPC Sites Now GPU Accelerated
- United States: Lawrence Livermore National Labs, Oak Ridge National Labs, Sandia National Labs, NOAA, NCSA BlueWaters
- Germany: Juelich, HLRS, Max Planck, TU Dresden
- UK: Cambridge, EPCC, Oxford, STFC
- Japan: Tokyo Tech, RIKEN, Tsukuba
- China: NSC Shenzhen, NSC Tianjin, CAS IPE
- Rest of Europe: BSC (Spain), CINECA (Italy), CEA (France), CSCS (Switzerland)
- Rest of World: MSU (Russia), RAS (Russia), IITs (India)
Slide 13 — Tsubame 2.0, Tokyo Institute of Technology
TiTech won the 2011 Gordon Bell Prize (Special Achievement in Scalability and Time-to-Solution), achieved with NVIDIA Tesla GPUs: "Peta-scale Phase-Field Simulation for Dendritic Solidification on the TSUBAME 2.0 Supercomputer" – T. Shimokawabe, T. Aoki, et al. The run used 4,224 Tesla GPUs + 2,816 x86 CPUs.
Slide 14 — World's Most Energy Efficient Supercomputer
CINECA Eurora, a "liquid-cooled" Eurotech Aurora Tigon with 128 Tesla K20 accelerators:
- 3150 MFLOPS/Watt – greener than the greenest Xeon Phi system (NICS Beacon) and the greenest CPU system (C-DAC), per the chart (MFLOPS/Watt, 0–3000 scale)
- $100k energy savings / yr
- 300 tons of CO2 saved / yr
Slide 15 — Accelerated Computing: Multi-core plus Many-core
- CPU: optimized for serial tasks
- GPU accelerator: optimized for many parallel tasks
- Result: 10x performance, 5x energy efficiency
Slide 16 — Intel Agrees: Future HPC is Hybrid Computing
Performance is constrained by power, and it is impossible to optimize for both single-thread performance and power efficiency, so the future is hybrid: a few cores optimized for serial work, most cores optimized for throughput. A Xeon provides fast single threads (serial work) alongside an accelerator over PCIe for extreme power efficiency (throughput work) – a GPU in NVIDIA's model, an Intel Xeon Phi in Intel's.
Slide 17 — NVIDIA HPC Technology and Strategy
Technology:
- Development of professional GPUs as co-processing accelerators for x86 CPUs; GPUs provide a cost-effective and power-efficient approach to application speedups.
Strategy:
- Established industry alliances to develop HPC solutions: alliances with ISVs, customers who develop HPC software, and research organizations.
- Technical collaborations in applications engineering: investment in PhD engineers who work with HPC software to optimize for GPUs.
- GPU integration with systems from major hardware vendors: HP and several others; Kepler K20-based systems available since 1Q 2013.
Slide 18 — Strong Growth of GPU Accelerated Applications
Top scientific apps by domain:
- Computational Chemistry: AMBER, CHARMM, GROMACS, LAMMPS, NAMD, DL_POLY
- Material Science: QMCPACK, Quantum Espresso, GAMESS-US, Gaussian, NWChem, VASP
- Climate & Weather: COSMO, GEOS-5, CAM-SE, NIM, WRF
- Physics: Chroma, Denovo, GTC, GTS, ENZO, MILC
- CAE: ANSYS Mechanical, MSC Nastran, SIMULIA Abaqus, ANSYS Fluent, OpenFOAM, LS-DYNA
Chart: number of apps (accelerated and in development), 2010–2012, with year-over-year increases of 61% and 40%.
Slide 19 — 207 GPU-Accelerated Applications: www.nvidia.com/appscatalog
Slide 20 — Developer Momentum Continues to Grow
- 2008: 150K CUDA downloads; 60 university courses; 4,000 academic papers; 100M CUDA-capable GPUs; 1 supercomputer
- 2013: 1.6M CUDA downloads; 640 university courses; 37,000 academic papers; 430M CUDA-capable GPUs; 50 supercomputers
Slide 21 — How GPU Acceleration is Developed
Application code is split between GPU and CPU: the compute-intensive functions – often just 5% of the code – form the hot spot that accounts for 50%–75% of the profile time and moves to the GPU, while the rest of the sequential code stays on the CPU.
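A hot spot that is 50%–75% of profile time also caps the achievable whole-application speedup, per Amdahl's law. A minimal sketch (function name is illustrative):

```python
def amdahl_speedup(parallel_fraction, accel_factor):
    """Overall speedup when only `parallel_fraction` of the runtime
    is accelerated by `accel_factor` (Amdahl's law)."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / accel_factor)

# Even with an effectively infinite GPU speedup on the kernel,
# accelerating 75% of the profile time caps the application at 4x.
print(round(amdahl_speedup(0.75, 1e12), 2))  # 4.0
# A 10x kernel speedup on a 50% hot spot yields only ~1.8x overall.
print(round(amdahl_speedup(0.50, 10.0), 2))  # 1.82
```

This is why the honest whole-application numbers on earlier slides (1.8x–2.1x) are far below kernel-only marketing figures.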
Slide 22 — Programming Strategies for GPU Acceleration
Three routes into applications, trading portability for flexibility and development effort:
- Libraries: "drop-in" acceleration
- OpenACC directives: GPU acceleration in standard languages (Fortran, C, C++)
- Programming languages: maximum flexibility, less portability, more development
Slide 23 — GPU Accelerated Libraries: "Drop-in" Acceleration for your Applications
- Linear Algebra (FFT, BLAS, SPARSE, matrix): NVIDIA cuFFT, cuBLAS, cuSPARSE
- Numerical & Math (RAND, statistics): NVIDIA Math Lib, NVIDIA cuRAND
- Data Structures & AI (sort, scan, zero sum): GPU AI – board games, GPU AI – path finding
- Visual Processing (image & video): NVIDIA NPP, NVIDIA Video Encode
Slide 24 — Select Developments using Directives and OpenACC (www.openacc-standard.org)
Software | Domain | Collaborators
LS-DYNA | CAE | LSTC, NVIDIA
Abaqus/Explicit | CAE | SIMULIA, NVIDIA
PAM-CRASH | CAE | ESI, CAPS
WRF | Climate/NWP | Cray, NVIDIA
COSMO | Climate/NWP | CSCS, NVIDIA
GEOS-5 | Climate/NWP | NASA GSFC, PGI
NIM | Climate/NWP | NOAA, PGI, CAPS, NVIDIA
S3D | Combustion | Cray, ORNL, Sandia NL, NVIDIA
Slide 25 — ANSYS and NVIDIA Collaboration Roadmap
Release | ANSYS Mechanical | ANSYS Fluent | ANSYS EM
13.0, Dec 2010 | SMP, single GPU, sparse and PCG/JCG solvers | – | ANSYS Nexxim
14.0, Dec 2011 | + Distributed ANSYS; + multi-node support | Radiation heat transfer (beta) | ANSYS Nexxim
14.5, Nov 2012 | + Multi-GPU support; + hybrid PCG; + Kepler GPU support | + Radiation HT; + GPU AMG solver (beta), single GPU | ANSYS Nexxim
15.0, Q4-2013 | + CUDA 5 Kepler tuning | + Multi-GPU AMG solver; + CUDA 5 Kepler tuning | ANSYS Nexxim; ANSYS HFSS (Transient)
Slide 26 — ANSYS Focus on Implicit Sparse Solvers
The application software splits across CPU and GPU. The CPU handles reading input and matrix set-up, the global solution, and writing output. The matrix operations – 50%–75% of profile time but a small % of lines of code – run on the GPU via hand-written CUDA, GPU libraries (cuBLAS), and OpenACC directives. (ANSYS is investigating OpenACC for moving more tasks onto the GPU.)
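To see why the matrix operations dominate, consider the kernel of an iterative solver such as PCG: every iteration is one sparse matrix-vector product plus a handful of vector updates. This is not the ANSYS solver, just a minimal pure-Python sketch of unpreconditioned conjugate gradient over a hypothetical dict-of-rows sparse format:

```python
def spmv(rows, x):
    """Sparse matrix-vector product; `rows` maps row index -> {col: value}."""
    return [sum(v * x[j] for j, v in cols.items()) for cols in rows]

def conjugate_gradient(rows, b, iters=50, tol=1e-10):
    """Unpreconditioned CG for a symmetric positive-definite matrix."""
    n = len(b)
    x = [0.0] * n
    r = b[:]              # residual b - A*x, with x = 0 initially
    p = r[:]
    rs = sum(ri * ri for ri in r)
    for _ in range(iters):
        Ap = spmv(rows, p)                       # dominant cost per iteration
        alpha = rs / sum(pi * ai for pi, ai in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * ai for ri, ai in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:
            break
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

# 1D Laplacian (tridiagonal [-1, 2, -1]), a toy stand-in for an FE matrix.
n = 5
rows = [{j: (2.0 if i == j else -1.0)
         for j in (i - 1, i, i + 1) if 0 <= j < n} for i in range(n)]
b = [1.0] * n
x = conjugate_gradient(rows, b)
print([round(xi, 6) for xi in x])  # [2.5, 4.0, 4.5, 4.0, 2.5]
```

The spmv and dot-product lines are a tiny fraction of the code but essentially all of the runtime, which is exactly the small-LoC / large-profile-time split the slide describes.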
Slide 27 — ANSYS Mechanical 14.5 GPU Acceleration
(Build slide: shows only the CPU-only bars of the chart – 164 and 210 jobs per day – which is completed with GPU results on the next slide.)
Slide 28 — ANSYS Mechanical 14.5 GPU Acceleration
ANSYS Mechanical number of jobs per day (higher is better), Distributed ANSYS 14.5 with 8-core CPUs and single GPUs:
- Westmere: Xeon X5690 3.47 GHz, 8 cores: 164 CPU-only vs. 341 with Tesla C2075 (C2075 = 2.1x acceleration)
- Sandy Bridge: Xeon E5-2687W 3.10 GHz, 8 cores: 210 CPU-only vs. 395 with Tesla K20 (K20 = 1.9x acceleration)
V14sp-5 model: turbine geometry; 2,100,000 DOF; SOLID187 FEs; static, nonlinear; one iteration (final solution requires 25); direct sparse solver. Results from Supermicro X9DR3-F, 64 GB memory.
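Because the metric here is throughput (jobs per day), acceleration is the GPU bar divided by the CPU-only bar; a quick check of the slide's ratios:

```python
# Jobs per day from the ANSYS Mechanical 14.5 chart (higher is better).
westmere_cpu, westmere_gpu = 164, 341          # Xeon X5690 vs. + Tesla C2075
sandybridge_cpu, sandybridge_gpu = 210, 395    # Xeon E5-2687W vs. + Tesla K20

print(round(westmere_gpu / westmere_cpu, 1))        # 2.1 (C2075)
print(round(sandybridge_gpu / sandybridge_cpu, 1))  # 1.9 (K20)
```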
Slide 29 — NVIDIA Use of ANSYS Software in Product Engineering
- ANSYS Icepak: active and passive cooling of IC packages
- ANSYS Mechanical: large-deflection bending of PCBs
- ANSYS Mechanical: comfort and fit of 3D emitter glasses
- ANSYS Mechanical: shock & vibration of solder-ball assemblies
Slide 30 — GPU Developments for Automotive CAE
Select Application | Select CAE Software (ISV) | GPU Status
CSM: Durability (stress) and fatigue | MSC Nastran | Available today
CSM: Road handling and VPG | Adams (for MBD) | Evaluation
CSM: Powertrain stress analysis | Abaqus/Standard | Available today
CSM: Body NVH | MSC Nastran | Available today
CSM: Crashworthiness and safety | LS-DYNA | Implicit only, beta
CFD: Aerodynamics / Thermal UH | ANSYS Fluent | Available today
CFD: IC engine combustion | STAR-CCM+ | Evaluation
CFD: Aerodynamics / HVAC | OpenFOAM | Available today
CFD: Plastic mold injection | Moldflow | Available today
Slide 31 — GPU Developments for Turbine Engine CFD
(Covering turbomachinery, combustor, and nozzle/noise codes; green color on the original slide indicates CUDA-ready during 2013.)
Developer | Location | Software
Turbostream | England, UK | Turbostream 3.0
Oxford / Rolls Royce | England, UK | OP2 / Hydra
ANSYS | USA | ANSYS CFD 15.0 (Fluent + CFX)
ANSYS | USA | ANSYS Fluent 15.0
FluiDyna | Germany | Culises for OpenFOAM 2.2.0
Vratis | Poland | Speed-IT for OpenFOAM 2.2.0
Cascade Technologies | USA | CHARLES
Convergent Science | USA | Converge CFD
Sandia NL / Oak Ridge NL | USA | S3D
Naval Research Lab | USA | JENRE
Aviadvigatel OJSC | Russia | GHOST CFD
Slide 32 — Kepler: Fastest, Most Efficient HPC Architecture Ever
- SMX: 3x performance per watt
- Hyper-Q: easy speed-up for legacy MPI apps
- Dynamic Parallelism: parallel programming made easier than ever
Slide 33 — Tesla Kepler Family: World's Fastest and Most Efficient HPC Accelerators
GPU | Single Precision Peak (SGEMM) | Double Precision Peak (DGEMM) | Memory Size | Memory Bandwidth (ECC off) | System Solution
K20X | 3.95 TF (2.90 TF) | 1.32 TF (1.22 TF) | 6 GB | 250 GB/s | Server only
K20 | 3.52 TF (2.61 TF) | 1.17 TF (1.10 TF) | 5 GB | 208 GB/s | Server + workstation
K10 | 4.58 TF | 0.19 TF | 8 GB | 320 GB/s | Server only
K20X and K20 target weather & climate, physics, biochemistry, CAE, and material science; K10 targets image, signal, video, and seismic processing.
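The parenthesized GEMM figures can be read as achieved fraction of peak; for the K20X, for instance:

```python
# Peak vs. measured GEMM throughput (TFLOPS) from the Tesla Kepler table.
k20x_sp_peak, k20x_sgemm = 3.95, 2.90
k20x_dp_peak, k20x_dgemm = 1.32, 1.22

print(round(100 * k20x_sgemm / k20x_sp_peak))  # 73 (% of SP peak)
print(round(100 * k20x_dgemm / k20x_dp_peak))  # 92 (% of DP peak)
```

That double-precision GEMM efficiency is what matters for the implicit CAE solvers discussed earlier, which run in DP.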
Slide 34 — NVIDIA CUDA GPU Roadmap
Chart: DP GFLOPS per watt (log scale, 0.5–32), 2008–2014, by architecture and signature feature: T10 (CUDA), Fermi (FP64), Kepler (Dynamic Parallelism), Maxwell (Unified Virtual Memory), Volta (stacked DRAM). Projected efficiency: roughly 7x Fermi / 2x Kepler, then roughly 13x Fermi / 4x Kepler.
What to expect from NVIDIA:
- Increasing number of more flexible cores
- Larger and faster memories (6 GB today)
- Enhanced programming and standards
- Tighter integration with systems
Slide 35 — NVIDIA GRID Enabled Virtual Desktop
VDI stack: virtual desktops run in virtual machines with the NVIDIA driver, on an NVIDIA GRID-enabled hypervisor backed by NVIDIA GRID GPUs.