May 2018
TESLA V100 PERFORMANCE GUIDE
TESLA V100: The Fastest and Most Productive GPU for AI and HPC

Volta Architecture: Most Productive GPU
Tensor Core: 125 Programmable TFLOPS for Deep Learning
Improved SIMT Model: New Algorithms
Volta MPS: Inference Utilization
Improved NVLink & HBM2: Efficient Bandwidth
WEAK NODES: Lots of Nodes Interconnected with Vast Network Overhead
STRONG NODES: Few Lightning-Fast Nodes with the Performance of Hundreds of Weak Nodes
CUSTOMER REFERENCE POINT IS THE SERVER NODE

The Server Node They've Known for 20 Years:
2x CPU: $4K
Memory: $1K
Interconnect NIC: $1K
Misc: $1K
Server Cost: $7K

Server Node with GPUs:
Server: $7K
4x V100 GPU: $28K
GPU Server Cost: $35K

CUSTOMERS WANT TO COMPARE NODE COST
GPU NODE vs. CPU NODE COST
GPU node: $35K + $5K (core networking & OEM margins) = $40K total
CPU node: $7K + $2K (core networking & OEM margins) = $9K total

App   | # CPU nodes to match 1 GPU node | $ spend on CPU nodes | $ saved with GPU node
AMBER | 95 CPU nodes                    | $855K                | $815K (95% savings)
GTC   | 16 CPU nodes                    | $144K                | $104K (72% savings)
HOW TESLA SAVES MONEY
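The savings arithmetic above can be reproduced directly from the per-node prices. This is a minimal sketch; the node costs and CPU-node counts come from the table, and the percentage is savings relative to the total CPU-node spend:

```python
# Node costs from the table above (USD).
GPU_NODE_COST = 35_000 + 5_000   # GPU server + core networking & OEM margins = $40K
CPU_NODE_COST = 7_000 + 2_000    # CPU server + core networking & OEM margins = $9K

def savings(cpu_nodes_to_match: int) -> tuple[int, int, float]:
    """Return (CPU spend, dollars saved, % savings) for one GPU node."""
    cpu_spend = cpu_nodes_to_match * CPU_NODE_COST
    saved = cpu_spend - GPU_NODE_COST
    return cpu_spend, saved, 100 * saved / cpu_spend

print(savings(95))  # AMBER: $855K spend, $815K saved (~95%)
print(savings(16))  # GTC:   $144K spend, $104K saved (~72%)
```

Running the two cases reproduces the table's $815K (95%) and $104K (72%) figures.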
550+ GPU-ACCELERATED APPLICATIONS DEFINING THE NEXT GIANT WAVE IN HPC
EVERY DEEP LEARNING FRAMEWORK ACCELERATED
All Top 15 Apps Are Accelerated

Oak Ridge Summit: the US's next fastest supercomputer (200+ PetaFLOPS of HPC; 3+ ExaFLOPS of AI)
ABCI Supercomputer (AIST): Japan's fastest AI supercomputer
Piz Daint: Europe's fastest supercomputer

#1 IN AI AND HPC ACCELERATION
V100 PCIE PERFORMANCE SUMMARY
Performance Equivalency of a Single GPU Server vs. Skylake CPU Servers

LIFE SCIENCE – AMBER: a V100 GPU node outperforms 190 CPU nodes
MATERIALS SCIENCE – QUANTUM ESPRESSO: a V100 GPU node outperforms 12 CPU nodes
PHYSICS – MILC: a V100 GPU node outperforms 25 CPU nodes
GEOPHYSICS – SPECFEM3D: a V100 GPU node outperforms 80 CPU nodes

For benchmark details and input models, see the following slides in this guide.
MOLECULAR DYNAMICS
AMBER Performance Equivalency
Single GPU Server vs. Multiple Skylake CPU-Only Servers

CPU Server: Dual Xeon Gold (Skylake); GPU Servers: same CPU server with V100 PCIe, or V100 SXM2 for the 8x V100 config. CUDA Version: CUDA 9.0.176; Dataset: PME-Cellulose_NVE. To arrive at CPU-node equivalence, we use the measured benchmark with up to 8 CPU nodes, then apply linear scaling beyond 8 nodes.
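The CPU-node-equivalence methodology used throughout this guide (measure up to 8 CPU nodes, then extrapolate linearly) can be sketched as follows. The function and the timing figures are illustrative assumptions, not measured data from the guide:

```python
def cpu_node_equivalence(gpu_time: float, cpu_times: dict[int, float]) -> float:
    """Estimate how many CPU-only nodes match one GPU node.

    cpu_times maps a measured CPU node count (up to 8) to wall-clock time.
    Beyond the largest measured count, linear scaling is assumed:
    nodes needed grows proportionally with the required speedup.
    """
    base = cpu_times[1]                 # single CPU node baseline
    gpu_speedup = base / gpu_time       # GPU node speedup vs one CPU node
    n = max(cpu_times)
    measured_speedup = base / cpu_times[n]  # speedup measured at n nodes
    # Extrapolate from the largest measured point.
    return n * gpu_speedup / measured_speedup

# Hypothetical illustration: 1 CPU node takes 100 s, 8 CPU nodes take 14 s,
# and the GPU node takes 1.7 s.
print(round(cpu_node_equivalence(1.7, {1: 100.0, 8: 14.0}), 1))
```

Note that this extrapolation is optimistic for the CPU cluster: real multi-node scaling is sublinear, so the true number of CPU nodes needed would typically be higher.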
AMBER – Molecular Dynamics
A suite of programs to simulate molecular dynamics on biomolecules

VERSION: 16.12
ACCELERATED FEATURES: PMEMD Explicit Solvent & GB; Explicit & Implicit Solvent; REMD; aMD
SCALABILITY: Multi-GPU and Single-Node
MORE INFORMATION: http://ambermd.org/gpus

CPU-node equivalence (speedup vs. a single CPU server):
1 server with 2x V100 GPUs ≈ 48 CPU-only servers (30x)
1 server with 4x V100 GPUs ≈ 95 CPU-only servers (59x)
1 server with 8x V100 GPUs ≈ 202 CPU-only servers (126x)
HOOMD-Blue Performance Equivalency
Single GPU Server vs. Multiple Broadwell CPU-Only Servers

CPU Server: Dual Xeon E5-2690 v4 @ 2.6GHz; GPU Servers: same CPU server with V100 PCIe, or V100 SXM2 for the 8x V100 config. CUDA Version: CUDA 9.0.176; Dataset: microsphere. To arrive at CPU-node equivalence, we use the measured benchmark with up to 8 CPU nodes, then apply linear scaling beyond 8 nodes.
HOOMD-Blue – Molecular Dynamics
A particle dynamics package written from the ground up for GPUs

VERSION: 2.2.2
ACCELERATED FEATURES: CPU & GPU versions available
SCALABILITY: Multi-GPU and Multi-Node
MORE INFORMATION: http://codeblue.umich.edu/hoomd-blue/index.html

CPU-node equivalence (speedup vs. a single CPU server):
1 server with 2x V100 GPUs ≈ 34 CPU-only servers (30x)
1 server with 4x V100 GPUs ≈ 43 CPU-only servers (38x)
1 server with 8x V100 GPUs ≈ 79 CPU-only servers (70x)
LAMMPS Performance Equivalency
Single GPU Server vs. Multiple Skylake CPU-Only Servers

CPU Server: Dual Xeon Gold (Skylake); GPU Servers: same CPU server with V100 PCIe, or V100 SXM2 for the 8x V100 config. CUDA Version: CUDA 9.0.176; Dataset: Atomic-Fluid Lennard-Jones 2.5 Cutoff. To arrive at CPU-node equivalence, we use the measured benchmark with up to 8 CPU nodes, then apply linear scaling beyond 8 nodes.
LAMMPS – Molecular Dynamics
A classical molecular dynamics package

VERSION: 2018
ACCELERATED FEATURES: Lennard-Jones, Gay-Berne, Tersoff, and many more potentials
SCALABILITY: Multi-GPU and Multi-Node
MORE INFORMATION: http://lammps.sandia.gov/index.html

CPU-node equivalence (speedup vs. a single CPU server):
1 server with 2x V100 GPUs ≈ 7 CPU-only servers (6x)
1 server with 4x V100 GPUs ≈ 11 CPU-only servers (10x)
1 server with 8x V100 GPUs ≈ 20 CPU-only servers (18x)
NAMD Performance Equivalency
Single GPU Server vs. Multiple Skylake CPU-Only Servers

CPU Server: Dual Xeon Gold (Skylake); GPU Servers: same CPU server with V100 PCIe. CUDA Version: CUDA 9.0.176; Dataset: STMV. To arrive at CPU-node equivalence, we use the measured benchmark with up to 8 CPU nodes, then apply linear scaling beyond 8 nodes.
NAMD – Molecular Dynamics
Designed for high-performance simulation of large molecular systems

VERSION: 2.13
ACCELERATED FEATURES: Full electrostatics with PME and most simulation features
SCALABILITY: Up to 100M-atom capable; multi-GPU; scales to 2x P100
MORE INFORMATION: http://www.ks.uiuc.edu/Research/namd/

CPU-node equivalence (speedup vs. a single CPU server):
1 server with 2x V100 GPUs ≈ 11 CPU-only servers (9x)
1 server with 4x V100 GPUs ≈ 14 CPU-only servers (12x)
MATERIALS SCIENCE
Quantum Espresso Performance Equivalency
Single GPU Server vs. Multiple Skylake CPU-Only Servers

CPU Server: Dual Xeon Gold (Skylake); GPU Servers: same CPU server with V100 PCIe. CUDA Version: CUDA 9.0.176; Dataset: AUSURF112-jR. To arrive at CPU-node equivalence, we use the measured benchmark with up to 8 CPU nodes, then apply linear scaling beyond 8 nodes.
Quantum Espresso – Materials Science (Quantum Chemistry)
An open-source suite of computer codes for electronic-structure calculations and materials modeling at the nanoscale

VERSION: 6.2.1
ACCELERATED FEATURES: Linear algebra (matrix multiply), explicit computational kernels, 3D FFTs
SCALABILITY: Multi-GPU and Multi-Node
MORE INFORMATION: http://www.quantum-espresso.org

CPU-node equivalence (speedup vs. a single CPU server):
1 server with 2x V100 GPUs ≈ 7 CPU-only servers (4x)
1 server with 4x V100 GPUs ≈ 13 CPU-only servers (8x)
PHYSICS
Chroma Performance Equivalency
Single GPU Server vs. Multiple Skylake CPU-Only Servers

CPU Server: Dual Xeon Gold (Skylake); GPU Servers: same CPU server with V100 PCIe. CUDA Version: CUDA 9.0.176; Dataset: szscl21_24_128. To arrive at CPU-node equivalence, we use the measured benchmark with up to 8 CPU nodes, then apply linear scaling beyond 8 nodes.
Chroma – Physics
Lattice Quantum Chromodynamics (LQCD)

VERSION: 2018
ACCELERATED FEATURES: Wilson-clover fermions, Krylov solvers, domain decomposition
SCALABILITY: Multi-GPU
MORE INFORMATION: http://jeffersonlab.github.io/chroma/

CPU-node equivalence (speedup vs. a single CPU server):
1 server with 2x V100 GPUs ≈ 6 CPU-only servers (6x)
1 server with 4x V100 GPUs ≈ 13 CPU-only servers (16x)
GTC Performance Equivalency
Single GPU Server vs. Multiple Skylake CPU-Only Servers

CPU Server: Dual Xeon Gold (Skylake); GPU Servers: same CPU server with V100 PCIe. CUDA Version: CUDA 9.0.176; Dataset: mpi#proc.in. To arrive at CPU-node equivalence, we use the measured benchmark with up to 8 CPU nodes, then apply linear scaling beyond 8 nodes.
GTC – Physics
GTC is used for gyrokinetic particle simulation of turbulent transport in burning plasmas

VERSION: 2017
ACCELERATED FEATURES: Push, shift, and collision
SCALABILITY: Multi-GPU
MORE INFORMATION: www.nvidia.com/gtc-p

CPU-node equivalence (speedup vs. a single CPU server):
1 server with 2x V100 GPUs ≈ 9 CPU-only servers (9x)
1 server with 4x V100 GPUs ≈ 18 CPU-only servers (16x)
MILC Performance Equivalency
Single GPU Server vs. Multiple Skylake CPU-Only Servers

CPU Server: Dual Xeon Gold (Skylake); GPU Servers: Dual Xeon Gold (Skylake) with Tesla V100 PCIe (16GB). CUDA Version: CUDA 9.0.176; Dataset: MILC APEX Medium. To arrive at CPU-node equivalence, we use the measured benchmark with up to 8 CPU nodes, then apply linear scaling beyond 8 nodes.
MILC – Physics
Lattice Quantum Chromodynamics (LQCD) codes simulate how elemental particles are formed and bound by the "strong force" to create larger particles like protons and neutrons

VERSION: 2018
ACCELERATED FEATURES: Staggered fermions, Krylov solvers, gauge-link fattening
SCALABILITY: Multi-GPU and Multi-Node
MORE INFORMATION: www.nvidia.com/milc

CPU-node equivalence (speedup vs. a single CPU server):
1 server with 2x V100 GPUs ≈ 15 CPU-only servers (14x)
1 server with 4x V100 GPUs ≈ 28 CPU-only servers (27x)
QUDA Performance Equivalency
Single GPU Server vs. Multiple Broadwell CPU-Only Servers

CPU Server: Dual Xeon E5-2690 v4 @ 2.6GHz; GPU Servers: same CPU server with V100 PCIe, or V100 SXM2 for the 8x V100 config. CUDA Version: CUDA 9.0.103; Dataset: Dslash Wilson-Clover; Precision: Single; Gauge Compression/Recon: 12; Problem Size: 32x32x32x64. To arrive at CPU-node equivalence, we use the measured benchmark with up to 8 CPU nodes, then apply linear scaling beyond 8 nodes.
QUDA – Physics
A library for Lattice Quantum Chromodynamics on GPUs

VERSION: 2017
ACCELERATED FEATURES: All
SCALABILITY: Multi-GPU and Multi-Node
MORE INFORMATION: www.nvidia.com/quda

CPU-node equivalence (speedup vs. a single CPU server):
1 server with 2x V100 GPUs ≈ 38 CPU-only servers (29x)
1 server with 4x V100 GPUs ≈ 68 CPU-only servers (51x)
1 server with 8x V100 GPUs ≈ 89 CPU-only servers (68x)
GEOSCIENCE
RTM Performance Equivalency
Single GPU Server vs. Multiple Broadwell CPU-Only Servers

CPU Server: Dual Xeon E5-2690 v4 @ 2.6GHz; GPU Servers: same CPU server with V100 PCIe, or V100 SXM2 for the 8x V100 config. CUDA Version: CUDA 9.0.103; Dataset: TTI RX 2pass mgpu. To arrive at CPU-node equivalence, we use linear scaling beyond 1 node.
RTM – Geoscience (Oil & Gas)
Reverse time migration (RTM) modeling is a critical component in the seismic processing workflow of oil and gas exploration

VERSION: 2017
ACCELERATED FEATURES: Batch algorithm
SCALABILITY: Multi-GPU and Multi-Node

CPU-node equivalence (speedup vs. a single CPU server):
1 server with 2x V100 GPUs ≈ 6 CPU-only servers (5x)
1 server with 4x V100 GPUs ≈ 11 CPU-only servers (10x)
1 server with 8x V100 GPUs ≈ 21 CPU-only servers (20x)
SPECFEM3D Performance Equivalency
Single GPU Server vs. Multiple Skylake CPU-Only Servers

CPU Server: Dual Xeon Gold (Skylake); GPU Servers: same CPU server with V100 PCIe. CUDA Version: CUDA 9.0.176; Dataset: four_material_simple_model. To arrive at CPU-node equivalence, we use linear scaling beyond 1 node.
SPECFEM3D – Geoscience
Simulates seismic wave propagation

VERSION: 2.0.2
SCALABILITY: Multi-GPU and Single-Node
MORE INFORMATION: https://geodynamics.org/cig/software/specfem3d_globe/

CPU-node equivalence (speedup vs. a single CPU server):
1 server with 2x V100 GPUs ≈ 28 CPU-only servers (27x)
1 server with 4x V100 GPUs ≈ 54 CPU-only servers (52x)
ENGINEERING
SIMULIA Abaqus Performance Equivalency
Single GPU Server vs. Multiple Broadwell CPU-Only Servers

CPU Server: Dual Xeon Gold (Skylake); GPU Servers: same CPU server with V100 PCIe. CUDA Version: CUDA 7.5; Dataset: LS-EPP-Combined-WC-Mkl (RR). To arrive at CPU-node equivalence, we use the measured benchmark with up to 8 CPU nodes, then apply linear scaling beyond 8 nodes.
SIMULIA Abaqus – Engineering
A simulation tool for analysis of structures

VERSION: 2017
ACCELERATED FEATURES: Direct Sparse Solver; AMS Eigen Solver; Steady-state Dynamics Solver
SCALABILITY: Multi-GPU and Multi-Node
MORE INFORMATION: http://www.nvidia.com/simulia-abaqus

CPU-node equivalence (speedup vs. a single CPU server):
1 server with 2x V100 GPUs ≈ 5 CPU-only servers (2.2x)
1 server with 4x V100 GPUs ≈ 10 CPU-only servers (3.0x)
ANSYS Fluent Performance Equivalency
Single GPU Server vs. Multiple Broadwell CPU-Only Servers

CPU Server: Dual Xeon Gold (Skylake); GPU Servers: same CPU server with V100 PCIe. CUDA Version: CUDA 6.0; Dataset: Water Jacket. To arrive at CPU-node equivalence, we use the measured benchmark with up to 8 CPU nodes, then apply linear scaling beyond 8 nodes.
ANSYS Fluent – Engineering
General-purpose software for the simulation of fluid dynamics

VERSION: 18
ACCELERATED FEATURES: Pressure-based Coupled Solver and Radiation Heat Transfer
SCALABILITY: Multi-GPU and Multi-Node
MORE INFORMATION: http://www.nvidia.com/ansys-fluent

CPU-node equivalence (speedup vs. a single CPU server):
1 server with 2x V100 GPUs ≈ 3 CPU-only servers (2x)
1 server with 4x V100 GPUs ≈ 4 CPU-only servers (3x)
COMPUTATIONAL FINANCE
STAC-A2 – Computational Finance
A financial risk-management benchmark created by leading global banks working with the Securities Technology Analysis Center (STAC), used to assess compute solutions for finance

VERSION: STAC-A2 (Beta 2)
ACCELERATED FEATURES: Full framework accelerated
SCALABILITY: Multi-GPU
MORE INFORMATION: www.STACresearch.com/nvidia

STAC-A2 Benchmark Performance Results: 8x V100 GPU Server vs. Dual Skylake Platinum 8180 Server
Speedups of 8.9x, 6.2x, and 2.9x on the three reported STAC-A2 metrics (see configuration note)

System Configuration: GPU Server SUT ID: NVDA171020 | STAC-A2 Pack for CUDA (Rev D) | GPU Server: 8x NVIDIA Tesla V100 (Volta) GPUs, 2x Intel Xeon E5-2680 v4 @ 2.4GHz | CUDA Version 9.0 | Compared to: CPU Server SUT ID: INTC170920 | STAC-A2 Pack for Intel Composer XE (Rev K) | 2x 28-core Intel Xeon Platinum 8180 @ 2.5GHz. "Throughput" is STAC-A2.β2.HPORTFOLIO.SPEED, "Latency" is STAC-A2.β2.GREEKS.WARM, and "Energy Efficiency" is STAC-A2.β2.ENERG_EFF. "STAC" and all STAC names are trademarks or registered trademarks of the Securities Technology Analysis Center, LLC.
BENCHMARKS
Cloverleaf Performance Equivalency
Single GPU Server vs. Multiple Broadwell CPU-Only Servers

CPU Server: Dual Xeon E5-2690 v4 @ 2.6GHz; GPU Servers: same CPU server with V100 PCIe. CUDA Version: CUDA 9.0.103; Dataset: bm32. To arrive at CPU-node equivalence, we use the measured benchmark with up to 8 CPU nodes, then apply linear scaling beyond 8 nodes.
Cloverleaf – Benchmark (Mini-App)
Hydrodynamics

VERSION: 1.3
ACCELERATED FEATURES: Lagrangian-Eulerian explicit hydrodynamics mini-application
SCALABILITY: Multi-Node (MPI)
MORE INFORMATION: http://uk-mac.github.io/CloverLeaf/

CPU-node equivalence (speedup vs. a single CPU server):
1 server with 2x V100 GPUs ≈ 19 CPU-only servers (25x)
1 server with 4x V100 GPUs ≈ 22 CPU-only servers (29x)
HPCG Performance Equivalency
Single GPU Server vs. Multiple Skylake CPU-Only Servers

CPU Server: Dual Xeon Gold (Skylake); GPU Servers: same CPU server with V100 PCIe. CUDA Version: CUDA 9.0.103; Dataset: 256x256x256 local size. To arrive at CPU-node equivalence, we use the measured benchmark with up to 8 CPU nodes, then apply linear scaling beyond 8 nodes.
HPCG – Benchmark
Exercises computational and data-access patterns that closely match a broad set of important HPC applications

VERSION: 3
ACCELERATED FEATURES: All
SCALABILITY: Multi-GPU and Multi-Node
MORE INFORMATION: http://www.hpcg-benchmark.org/index.html

CPU-node equivalence (speedup vs. a single CPU server):
1 server with 2x V100 GPUs ≈ 12 CPU-only servers (19x)
1 server with 4x V100 GPUs ≈ 23 CPU-only servers (22x)
Linpack Performance Equivalency
Single GPU Server vs. Multiple Skylake CPU-Only Servers

CPU Server: Dual Xeon Gold (Skylake); GPU Servers: same CPU server with V100 PCIe. CUDA Version: CUDA 9.0.103; Dataset: HPL.dat. To arrive at CPU-node equivalence, we use the measured benchmark with up to 8 CPU nodes, then apply linear scaling beyond 8 nodes.
Linpack – Benchmark
Measures floating-point computing power

VERSION: 2.1
ACCELERATED FEATURES: All
SCALABILITY: Multi-GPU and Multi-Node
MORE INFORMATION: https://www.top500.org/project/linpack/

CPU-node equivalence:
1 server with 2x V100 GPUs ≈ 6 CPU-only servers
1 server with 4x V100 GPUs ≈ 11 CPU-only servers
Measured speedups vs. a single CPU server: 9x, 18x, 22x
MiniFE Performance Equivalency
Single GPU Server vs. Multiple Skylake CPU-Only Servers

CPU Server: Dual Xeon Gold (Skylake); GPU Servers: same CPU server with V100 PCIe. CUDA Version: CUDA 9.0.103; Dataset: 350x350x350. To arrive at CPU-node equivalence, we use the measured benchmark with up to 8 CPU nodes, then apply linear scaling beyond 8 nodes.
MiniFE – Benchmark (Mini-App)
Finite Element Analysis

VERSION: 0.3
ACCELERATED FEATURES: All
SCALABILITY: Multi-GPU
MORE INFORMATION: https://mantevo.org/about/applications/

CPU-node equivalence (speedup vs. a single CPU server):
1 server with 2x V100 GPUs ≈ 8 CPU-only servers (7x)
1 server with 4x V100 GPUs ≈ 15 CPU-only servers (15x)
DEEP LEARNING
Caffe2 – Deep Learning
A popular, GPU-accelerated deep learning framework built on Caffe (UC Berkeley) and developed at Facebook

VERSION: 1.0
ACCELERATED FEATURES: Full framework accelerated
SCALABILITY: Multi-GPU
MORE INFORMATION: research.fb.com/downloads/caffe2/

Caffe2 Deep Learning Framework: Training on an 8x V100 GPU Server vs. an 8x P100 GPU Server
Server with 8x V100 SXM2 16GB: 2.2x avg. speedup
Server with 8x V100 16GB PCIe: 2.0x avg. speedup
SXM2 up to 27% faster than PCIe

CPU Server: Dual Xeon E5-2698 v4 @ 2.2GHz; GPU servers as shown. Ubuntu 14.04.5; CUDA Version: CUDA 9.0.176; NCCL 2.0.5; cuDNN 7.0.2.43; Driver 384.66. Dataset: ImageNet. Batch sizes: Inception V3 and ResNet-50: 64 for P100 SXM2, 128 for Tesla V100; Seq2Seq: 192; VGG16: 96.
NVIDIA TENSORRT 3
Massive Throughput and Amazing Efficiency at Low Latency

[Chart] CNN – Images: ResNet-50 Throughput, in images/sec at a 7ms target latency, comparing CPU + Caffe (17ms), P100 + TensorRT, P4 + TensorRT, and V100 + TensorRT

CPU throughput based on measured inference throughput on a Broadwell-based Xeon E5-2690 v4 CPU, doubled to reflect Intel's stated claim that Xeon Scalable Processors will deliver 2x the performance of Broadwell-based Xeon CPUs on deep learning inference.
[Chart] CNN – Images: GoogleNet Throughput, in images/sec at a 7ms target latency, comparing CPU + Caffe (8ms), P100 + TensorRT, P4 + TensorRT, and V100 + TensorRT (GPU configurations at or under the 7ms target)
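The TensorRT charts above fix a 7ms latency target and then compare images/sec. The trade-off they encode, larger batches raise throughput but also raise per-batch latency, can be sketched as follows; the batch sizes and latencies are illustrative assumptions, not measured data:

```python
def throughput_at_target(batch: int, batch_latency_ms: float,
                         target_ms: float = 7.0):
    """Images/sec if the per-batch latency meets the target, else None.

    Inference benchmarks pin a latency target before comparing throughput,
    because otherwise arbitrarily large batches would inflate images/sec
    while making each individual request unacceptably slow.
    """
    if batch_latency_ms > target_ms:
        return None  # this batch size violates the latency target
    return batch * 1000.0 / batch_latency_ms

# Hypothetical: a batch of 8 finishing in 6.5 ms meets the 7 ms target.
print(throughput_at_target(8, 6.5))
# A batch of 64 finishing in 17 ms (like the charted CPU config) does not.
print(throughput_at_target(64, 17.0))
```

Under this framing, a configuration whose latency exceeds the target, such as the 17ms CPU bar in the ResNet-50 chart, contributes no valid throughput at all at that service level.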