May 2018
TESLA V100 PERFORMANCE GUIDE
TESLA V100: The Fastest and Most Productive GPU for AI and HPC

Volta Architecture: Most Productive GPU
Tensor Core: 125 Programmable TFLOPS for Deep Learning
Improved SIMT Model: New Algorithms
Volta MPS: Inference Utilization
Improved NVLink & HBM2: Efficient Bandwidth
WEAK NODES: Lots of Nodes Interconnected with Vast Network Overhead
STRONG NODES: Few Lightning-Fast Nodes with the Performance of Hundreds of Weak Nodes
CUSTOMER REFERENCE POINT IS THE SERVER NODE

The Server Node They've Known for 20 Years:
2x CPU: $4K
Memory: $1K
Interconnect NIC: $1K
Misc: $1K
Server Cost: $7K

Server Node with GPUs:
Server: $7K
4x V100 GPU: $28K
GPU Server Cost: $35K

CUSTOMERS WANT TO COMPARE NODE COST
GPU NODE vs. CPU NODE COST
GPU node: $35K + $5K (core networking & OEM margins) = $40K total
CPU node: $7K + $2K (core networking & OEM margins) = $9K total

App   | # CPU nodes to match 1 GPU node | $ spend on CPU nodes | $ saved with GPU node
AMBER | 95 CPU nodes                    | $855K                | $815K (95% savings)
GTC   | 16 CPU nodes                    | $144K                | $104K (72% savings)
HOW TESLA SAVES MONEY
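The savings arithmetic above can be reproduced directly from the per-node prices. This is a minimal sketch; the node costs and CPU-node counts come from the table, and the percentage is savings relative to the total CPU-node spend:

```python
# Node costs from the table above (USD).
GPU_NODE_COST = 35_000 + 5_000   # GPU server + core networking & OEM margins = $40K
CPU_NODE_COST = 7_000 + 2_000    # CPU server + core networking & OEM margins = $9K

def savings(cpu_nodes_to_match: int) -> tuple[int, int, float]:
    """Return (CPU spend, dollars saved, % savings) for one GPU node."""
    cpu_spend = cpu_nodes_to_match * CPU_NODE_COST
    saved = cpu_spend - GPU_NODE_COST
    return cpu_spend, saved, 100 * saved / cpu_spend

print(savings(95))  # AMBER: $855K spend, $815K saved (~95%)
print(savings(16))  # GTC:   $144K spend, $104K saved (~72%)
```

Running the two cases reproduces the table's $815K (95%) and $104K (72%) figures.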
550+ GPU-ACCELERATED APPLICATIONS DEFINING THE NEXT GIANT WAVE IN HPC
EVERY DEEP LEARNING FRAMEWORK ACCELERATED
All Top 15 Apps Are Accelerated

Oak Ridge Summit: the US's next fastest supercomputer (200+ PetaFLOPS of HPC; 3+ ExaFLOPS of AI)
ABCI Supercomputer (AIST): Japan's fastest AI supercomputer
Piz Daint: Europe's fastest supercomputer

#1 IN AI AND HPC ACCELERATION
V100 PCIE PERFORMANCE SUMMARY
Performance Equivalency of a Single GPU Server vs. Skylake CPU Servers

LIFE SCIENCE – AMBER: a V100 GPU node outperforms 190 CPU nodes
MATERIALS SCIENCE – QUANTUM ESPRESSO: a V100 GPU node outperforms 12 CPU nodes
PHYSICS – MILC: a V100 GPU node outperforms 25 CPU nodes
GEOPHYSICS – SPECFEM3D: a V100 GPU node outperforms 80 CPU nodes

For benchmark details and input models, see the following slides in this guide.
MOLECULAR DYNAMICS
AMBER Performance Equivalency
Single GPU Server vs. Multiple Skylake CPU-Only Servers

CPU Server: Dual Xeon Gold (Skylake); GPU Servers: same CPU server with V100 PCIe, or V100 SXM2 for the 8x V100 config. CUDA Version: CUDA 9.0.176; Dataset: PME-Cellulose_NVE. To arrive at CPU-node equivalence, we use the measured benchmark with up to 8 CPU nodes, then apply linear scaling beyond 8 nodes.
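The CPU-node-equivalence methodology used throughout this guide (measure up to 8 CPU nodes, then extrapolate linearly) can be sketched as follows. The function and the timing figures are illustrative assumptions, not measured data from the guide:

```python
def cpu_node_equivalence(gpu_time: float, cpu_times: dict[int, float]) -> float:
    """Estimate how many CPU-only nodes match one GPU node.

    cpu_times maps a measured CPU node count (up to 8) to wall-clock time.
    Beyond the largest measured count, linear scaling is assumed:
    nodes needed grows proportionally with the required speedup.
    """
    base = cpu_times[1]                 # single CPU node baseline
    gpu_speedup = base / gpu_time       # GPU node speedup vs one CPU node
    n = max(cpu_times)
    measured_speedup = base / cpu_times[n]  # speedup measured at n nodes
    # Extrapolate from the largest measured point.
    return n * gpu_speedup / measured_speedup

# Hypothetical illustration: 1 CPU node takes 100 s, 8 CPU nodes take 14 s,
# and the GPU node takes 1.7 s.
print(round(cpu_node_equivalence(1.7, {1: 100.0, 8: 14.0}), 1))
```

Note that this extrapolation is optimistic for the CPU cluster: real multi-node scaling is sublinear, so the true number of CPU nodes needed would typically be higher.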
AMBER – Molecular Dynamics
A suite of programs to simulate molecular dynamics on biomolecules

VERSION: 16.12
ACCELERATED FEATURES: PMEMD Explicit Solvent & GB; Explicit & Implicit Solvent; REMD; aMD
SCALABILITY: Multi-GPU and Single-Node
MORE INFORMATION: http://ambermd.org/gpus

CPU-node equivalence (speedup vs. a single CPU server):
1 server with 2x V100 GPUs ≈ 48 CPU-only servers (30x)
1 server with 4x V100 GPUs ≈ 95 CPU-only servers (59x)
1 server with 8x V100 GPUs ≈ 202 CPU-only servers (126x)
HOOMD-Blue Performance Equivalency
Single GPU Server vs. Multiple Broadwell CPU-Only Servers

CPU Server: Dual Xeon E5-2690 v4 @ 2.6GHz; GPU Servers: same CPU server with V100 PCIe, or V100 SXM2 for the 8x V100 config. CUDA Version: CUDA 9.0.176; Dataset: microsphere. To arrive at CPU-node equivalence, we use the measured benchmark with up to 8 CPU nodes, then apply linear scaling beyond 8 nodes.
HOOMD-Blue – Molecular Dynamics
A particle dynamics package written from the ground up for GPUs

VERSION: 2.2.2
ACCELERATED FEATURES: CPU & GPU versions available
SCALABILITY: Multi-GPU and Multi-Node
MORE INFORMATION: http://codeblue.umich.edu/hoomd-blue/index.html

CPU-node equivalence (speedup vs. a single CPU server):
1 server with 2x V100 GPUs ≈ 34 CPU-only servers (30x)
1 server with 4x V100 GPUs ≈ 43 CPU-only servers (38x)
1 server with 8x V100 GPUs ≈ 79 CPU-only servers (70x)
LAMMPS Performance Equivalency
Single GPU Server vs. Multiple Skylake CPU-Only Servers

CPU Server: Dual Xeon Gold (Skylake); GPU Servers: same CPU server with V100 PCIe, or V100 SXM2 for the 8x V100 config. CUDA Version: CUDA 9.0.176; Dataset: Atomic-Fluid Lennard-Jones 2.5 Cutoff. To arrive at CPU-node equivalence, we use the measured benchmark with up to 8 CPU nodes, then apply linear scaling beyond 8 nodes.
LAMMPS – Molecular Dynamics
A classical molecular dynamics package

VERSION: 2018
ACCELERATED FEATURES: Lennard-Jones, Gay-Berne, Tersoff, and many more potentials
SCALABILITY: Multi-GPU and Multi-Node
MORE INFORMATION: http://lammps.sandia.gov/index.html

CPU-node equivalence (speedup vs. a single CPU server):
1 server with 2x V100 GPUs ≈ 7 CPU-only servers (6x)
1 server with 4x V100 GPUs ≈ 11 CPU-only servers (10x)
1 server with 8x V100 GPUs ≈ 20 CPU-only servers (18x)
NAMD Performance Equivalency
Single GPU Server vs. Multiple Skylake CPU-Only Servers

CPU Server: Dual Xeon Gold (Skylake); GPU Servers: same CPU server with V100 PCIe. CUDA Version: CUDA 9.0.176; Dataset: STMV. To arrive at CPU-node equivalence, we use the measured benchmark with up to 8 CPU nodes, then apply linear scaling beyond 8 nodes.
NAMD – Molecular Dynamics
Designed for high-performance simulation of large molecular systems

VERSION: 2.13
ACCELERATED FEATURES: Full electrostatics with PME and most simulation features
SCALABILITY: Up to 100M-atom capable; multi-GPU; scales to 2x P100
MORE INFORMATION: http://www.ks.uiuc.edu/Research/namd/

CPU-node equivalence (speedup vs. a single CPU server):
1 server with 2x V100 GPUs ≈ 11 CPU-only servers (9x)
1 server with 4x V100 GPUs ≈ 14 CPU-only servers (12x)
MATERIALS SCIENCE
Quantum Espresso Performance Equivalency
Single GPU Server vs. Multiple Skylake CPU-Only Servers

CPU Server: Dual Xeon Gold (Skylake); GPU Servers: same CPU server with V100 PCIe. CUDA Version: CUDA 9.0.176; Dataset: AUSURF112-jR. To arrive at CPU-node equivalence, we use the measured benchmark with up to 8 CPU nodes, then apply linear scaling beyond 8 nodes.
Quantum Espresso – Materials Science (Quantum Chemistry)
An open-source suite of computer codes for electronic-structure calculations and materials modeling at the nanoscale

VERSION: 6.2.1
ACCELERATED FEATURES: Linear algebra (matrix multiply), explicit computational kernels, 3D FFTs
SCALABILITY: Multi-GPU and Multi-Node
MORE INFORMATION: http://www.quantum-espresso.org

CPU-node equivalence (speedup vs. a single CPU server):
1 server with 2x V100 GPUs ≈ 7 CPU-only servers (4x)
1 server with 4x V100 GPUs ≈ 13 CPU-only servers (8x)
PHYSICS
Chroma Performance Equivalency
Single GPU Server vs. Multiple Skylake CPU-Only Servers

CPU Server: Dual Xeon Gold (Skylake); GPU Servers: same CPU server with V100 PCIe. CUDA Version: CUDA 9.0.176; Dataset: szscl21_24_128. To arrive at CPU-node equivalence, we use the measured benchmark with up to 8 CPU nodes, then apply linear scaling beyond 8 nodes.
Chroma – Physics
Lattice Quantum Chromodynamics (LQCD)

VERSION: 2018
ACCELERATED FEATURES: Wilson-clover fermions, Krylov solvers, domain decomposition
SCALABILITY: Multi-GPU
MORE INFORMATION: http://jeffersonlab.github.io/chroma/

CPU-node equivalence (speedup vs. a single CPU server):
1 server with 2x V100 GPUs ≈ 6 CPU-only servers (6x)
1 server with 4x V100 GPUs ≈ 13 CPU-only servers (16x)
GTC Performance Equivalency
Single GPU Server vs. Multiple Skylake CPU-Only Servers

CPU Server: Dual Xeon Gold (Skylake); GPU Servers: same CPU server with V100 PCIe. CUDA Version: CUDA 9.0.176; Dataset: mpi#proc.in. To arrive at CPU-node equivalence, we use the measured benchmark with up to 8 CPU nodes, then apply linear scaling beyond 8 nodes.
GTC – Physics
GTC is used for gyrokinetic particle simulation of turbulent transport in burning plasmas

VERSION: 2017
ACCELERATED FEATURES: Push, shift, and collision
SCALABILITY: Multi-GPU
MORE INFORMATION: www.nvidia.com/gtc-p

CPU-node equivalence (speedup vs. a single CPU server):
1 server with 2x V100 GPUs ≈ 9 CPU-only servers (9x)
1 server with 4x V100 GPUs ≈ 18 CPU-only servers (16x)
MILC Performance Equivalency
Single GPU Server vs. Multiple Skylake CPU-Only Servers

CPU Server: Dual Xeon Gold (Skylake); GPU Servers: Dual Xeon Gold (Skylake) with Tesla V100 PCIe (16GB). CUDA Version: CUDA 9.0.176; Dataset: MILC APEX Medium. To arrive at CPU-node equivalence, we use the measured benchmark with up to 8 CPU nodes, then apply linear scaling beyond 8 nodes.
MILC – Physics
Lattice Quantum Chromodynamics (LQCD) codes simulate how elemental particles are formed and bound by the "strong force" to create larger particles like protons and neutrons

VERSION: 2018
ACCELERATED FEATURES: Staggered fermions, Krylov solvers, gauge-link fattening
SCALABILITY: Multi-GPU and Multi-Node
MORE INFORMATION: www.nvidia.com/milc

CPU-node equivalence (speedup vs. a single CPU server):
1 server with 2x V100 GPUs ≈ 15 CPU-only servers (14x)
1 server with 4x V100 GPUs ≈ 28 CPU-only servers (27x)
QUDA Performance Equivalency
Single GPU Server vs. Multiple Broadwell CPU-Only Servers

CPU Server: Dual Xeon E5-2690 v4 @ 2.6GHz; GPU Servers: same CPU server with V100 PCIe, or V100 SXM2 for the 8x V100 config. CUDA Version: CUDA 9.0.103; Dataset: Dslash Wilson-Clover; Precision: Single; Gauge Compression/Recon: 12; Problem Size: 32x32x32x64. To arrive at CPU-node equivalence, we use the measured benchmark with up to 8 CPU nodes, then apply linear scaling beyond 8 nodes.
QUDA – Physics
A library for Lattice Quantum Chromodynamics on GPUs

VERSION: 2017
ACCELERATED FEATURES: All
SCALABILITY: Multi-GPU and Multi-Node
MORE INFORMATION: www.nvidia.com/quda

CPU-node equivalence (speedup vs. a single CPU server):
1 server with 2x V100 GPUs ≈ 38 CPU-only servers (29x)
1 server with 4x V100 GPUs ≈ 68 CPU-only servers (51x)
1 server with 8x V100 GPUs ≈ 89 CPU-only servers (68x)
GEOSCIENCE
RTM Performance Equivalency
Single GPU Server vs. Multiple Broadwell CPU-Only Servers

CPU Server: Dual Xeon E5-2690 v4 @ 2.6GHz; GPU Servers: same CPU server with V100 PCIe, or V100 SXM2 for the 8x V100 config. CUDA Version: CUDA 9.0.103; Dataset: TTI RX 2pass mgpu. To arrive at CPU-node equivalence, we use linear scaling beyond 1 node.
RTM – Geoscience (Oil & Gas)
Reverse time migration (RTM) modeling is a critical component in the seismic processing workflow of oil and gas exploration

VERSION: 2017
ACCELERATED FEATURES: Batch algorithm
SCALABILITY: Multi-GPU and Multi-Node

CPU-node equivalence (speedup vs. a single CPU server):
1 server with 2x V100 GPUs ≈ 6 CPU-only servers (5x)
1 server with 4x V100 GPUs ≈ 11 CPU-only servers (10x)
1 server with 8x V100 GPUs ≈ 21 CPU-only servers (20x)
SPECFEM3D Performance Equivalency
Single GPU Server vs. Multiple Skylake CPU-Only Servers

CPU Server: Dual Xeon Gold (Skylake); GPU Servers: same CPU server with V100 PCIe. CUDA Version: CUDA 9.0.176; Dataset: four_material_simple_model. To arrive at CPU-node equivalence, we use linear scaling beyond 1 node.
SPECFEM3D – Geoscience
Simulates seismic wave propagation

VERSION: 2.0.2
SCALABILITY: Multi-GPU and Single-Node
MORE INFORMATION: https://geodynamics.org/cig/software/specfem3d_globe/

CPU-node equivalence (speedup vs. a single CPU server):
1 server with 2x V100 GPUs ≈ 28 CPU-only servers (27x)
1 server with 4x V100 GPUs ≈ 54 CPU-only servers (52x)
ENGINEERING
SIMULIA Abaqus Performance Equivalency
Single GPU Server vs. Multiple Broadwell CPU-Only Servers

CPU Server: Dual Xeon Gold (Skylake); GPU Servers: same CPU server with V100 PCIe. CUDA Version: CUDA 7.5; Dataset: LS-EPP-Combined-WC-Mkl (RR). To arrive at CPU-node equivalence, we use the measured benchmark with up to 8 CPU nodes, then apply linear scaling beyond 8 nodes.
SIMULIA Abaqus – Engineering
A simulation tool for analysis of structures

VERSION: 2017
ACCELERATED FEATURES: Direct Sparse Solver; AMS Eigen Solver; Steady-state Dynamics Solver
SCALABILITY: Multi-GPU and Multi-Node
MORE INFORMATION: http://www.nvidia.com/simulia-abaqus

CPU-node equivalence (speedup vs. a single CPU server):
1 server with 2x V100 GPUs ≈ 5 CPU-only servers (2.2x)
1 server with 4x V100 GPUs ≈ 10 CPU-only servers (3.0x)
ANSYS Fluent Performance Equivalency
Single GPU Server vs. Multiple Broadwell CPU-Only Servers

CPU Server: Dual Xeon Gold (Skylake); GPU Servers: same CPU server with V100 PCIe. CUDA Version: CUDA 6.0; Dataset: Water Jacket. To arrive at CPU-node equivalence, we use the measured benchmark with up to 8 CPU nodes, then apply linear scaling beyond 8 nodes.
ANSYS Fluent – Engineering
General-purpose software for the simulation of fluid dynamics

VERSION: 18
ACCELERATED FEATURES: Pressure-based Coupled Solver and Radiation Heat Transfer
SCALABILITY: Multi-GPU and Multi-Node
MORE INFORMATION: http://www.nvidia.com/ansys-fluent

CPU-node equivalence (speedup vs. a single CPU server):
1 server with 2x V100 GPUs ≈ 3 CPU-only servers (2x)
1 server with 4x V100 GPUs ≈ 4 CPU-only servers (3x)
COMPUTATIONAL FINANCE
STAC-A2 – Computational Finance
A financial risk-management benchmark created by leading global banks working with the Securities Technology Analysis Center (STAC), used to assess compute solutions for finance

VERSION: STAC-A2 (Beta 2)
ACCELERATED FEATURES: Full framework accelerated
SCALABILITY: Multi-GPU
MORE INFORMATION: www.STACresearch.com/nvidia

STAC-A2 Benchmark Performance Results: 8x V100 GPU Server vs. Dual Skylake Platinum 8180 Server
Speedups of 8.9x, 6.2x, and 2.9x on the three reported STAC-A2 metrics (see configuration note)

System Configuration: GPU Server SUT ID: NVDA171020 | STAC-A2 Pack for CUDA (Rev D) | GPU Server: 8x NVIDIA Tesla V100 (Volta) GPUs, 2x Intel Xeon E5-2680 v4 @ 2.4GHz | CUDA Version 9.0 | Compared to: CPU Server SUT ID: INTC170920 | STAC-A2 Pack for Intel Composer XE (Rev K) | 2x 28-core Intel Xeon Platinum 8180 @ 2.5GHz. "Throughput" is STAC-A2.β2.HPORTFOLIO.SPEED, "Latency" is STAC-A2.β2.GREEKS.WARM, and "Energy Efficiency" is STAC-A2.β2.ENERG_EFF. "STAC" and all STAC names are trademarks or registered trademarks of the Securities Technology Analysis Center, LLC.
BENCHMARKS
Cloverleaf Performance Equivalency
Single GPU Server vs. Multiple Broadwell CPU-Only Servers

CPU Server: Dual Xeon E5-2690 v4 @ 2.6GHz; GPU Servers: same CPU server with V100 PCIe. CUDA Version: CUDA 9.0.103; Dataset: bm32. To arrive at CPU-node equivalence, we use the measured benchmark with up to 8 CPU nodes, then apply linear scaling beyond 8 nodes.
Cloverleaf – Benchmark (Mini-App)
Hydrodynamics

VERSION: 1.3
ACCELERATED FEATURES: Lagrangian-Eulerian explicit hydrodynamics mini-application
SCALABILITY: Multi-Node (MPI)
MORE INFORMATION: http://uk-mac.github.io/CloverLeaf/

CPU-node equivalence (speedup vs. a single CPU server):
1 server with 2x V100 GPUs ≈ 19 CPU-only servers (25x)
1 server with 4x V100 GPUs ≈ 22 CPU-only servers (29x)
HPCG Performance Equivalency
Single GPU Server vs. Multiple Skylake CPU-Only Servers

CPU Server: Dual Xeon Gold (Skylake); GPU Servers: same CPU server with V100 PCIe. CUDA Version: CUDA 9.0.103; Dataset: 256x256x256 local size. To arrive at CPU-node equivalence, we use the measured benchmark with up to 8 CPU nodes, then apply linear scaling beyond 8 nodes.
HPCG – Benchmark
Exercises computational and data-access patterns that closely match a broad set of important HPC applications

VERSION: 3
ACCELERATED FEATURES: All
SCALABILITY: Multi-GPU and Multi-Node
MORE INFORMATION: http://www.hpcg-benchmark.org/index.html

CPU-node equivalence (speedup vs. a single CPU server):
1 server with 2x V100 GPUs ≈ 12 CPU-only servers (19x)
1 server with 4x V100 GPUs ≈ 23 CPU-only servers (22x)
Linpack Performance Equivalency
Single GPU Server vs. Multiple Skylake CPU-Only Servers

CPU Server: Dual Xeon Gold (Skylake); GPU Servers: same CPU server with V100 PCIe. CUDA Version: CUDA 9.0.103; Dataset: HPL.dat. To arrive at CPU-node equivalence, we use the measured benchmark with up to 8 CPU nodes, then apply linear scaling beyond 8 nodes.
Linpack – Benchmark
Measures floating-point computing power

VERSION: 2.1
ACCELERATED FEATURES: All
SCALABILITY: Multi-GPU and Multi-Node
MORE INFORMATION: https://www.top500.org/project/linpack/

CPU-node equivalence:
1 server with 2x V100 GPUs ≈ 6 CPU-only servers
1 server with 4x V100 GPUs ≈ 11 CPU-only servers
Measured speedups vs. a single CPU server: 9x, 18x, 22x
MiniFE Performance Equivalency
Single GPU Server vs. Multiple Skylake CPU-Only Servers

CPU Server: Dual Xeon Gold (Skylake); GPU Servers: same CPU server with V100 PCIe. CUDA Version: CUDA 9.0.103; Dataset: 350x350x350. To arrive at CPU-node equivalence, we use the measured benchmark with up to 8 CPU nodes, then apply linear scaling beyond 8 nodes.
MiniFE – Benchmark (Mini-App)
Finite Element Analysis

VERSION: 0.3
ACCELERATED FEATURES: All
SCALABILITY: Multi-GPU
MORE INFORMATION: https://mantevo.org/about/applications/

CPU-node equivalence (speedup vs. a single CPU server):
1 server with 2x V100 GPUs ≈ 8 CPU-only servers (7x)
1 server with 4x V100 GPUs ≈ 15 CPU-only servers (15x)
DEEP LEARNING
Caffe2 – Deep Learning
A popular, GPU-accelerated deep learning framework built on Caffe (UC Berkeley) and developed at Facebook

VERSION: 1.0
ACCELERATED FEATURES: Full framework accelerated
SCALABILITY: Multi-GPU
MORE INFORMATION: research.fb.com/downloads/caffe2/

Caffe2 Deep Learning Framework: Training on an 8x V100 GPU Server vs. an 8x P100 GPU Server
Server with 8x V100 SXM2 16GB: 2.2x avg. speedup
Server with 8x V100 16GB PCIe: 2.0x avg. speedup
SXM2 up to 27% faster than PCIe

CPU Server: Dual Xeon E5-2698 v4 @ 2.2GHz; GPU servers as shown. Ubuntu 14.04.5; CUDA Version: CUDA 9.0.176; NCCL 2.0.5; cuDNN 7.0.2.43; Driver 384.66. Dataset: ImageNet. Batch sizes: Inception V3 and ResNet-50: 64 for P100 SXM2, 128 for Tesla V100; Seq2Seq: 192; VGG16: 96.
NVIDIA TENSORRT 3
Massive Throughput and Amazing Efficiency at Low Latency

[Chart] CNN – Images: ResNet-50 Throughput, in images/sec at a 7ms target latency, comparing CPU + Caffe (17ms), P100 + TensorRT, P4 + TensorRT, and V100 + TensorRT

CPU throughput based on measured inference throughput on a Broadwell-based Xeon E5-2690 v4 CPU, doubled to reflect Intel's stated claim that Xeon Scalable Processors will deliver 2x the performance of Broadwell-based Xeon CPUs on deep learning inference.
[Chart] CNN – Images: GoogleNet Throughput, in images/sec at a 7ms target latency, comparing CPU + Caffe (8ms), P100 + TensorRT, P4 + TensorRT, and V100 + TensorRT (GPU configurations at or under the 7ms target)
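The TensorRT charts above fix a 7ms latency target and then compare images/sec. The trade-off they encode, larger batches raise throughput but also raise per-batch latency, can be sketched as follows; the batch sizes and latencies are illustrative assumptions, not measured data:

```python
def throughput_at_target(batch: int, batch_latency_ms: float,
                         target_ms: float = 7.0):
    """Images/sec if the per-batch latency meets the target, else None.

    Inference benchmarks pin a latency target before comparing throughput,
    because otherwise arbitrarily large batches would inflate images/sec
    while making each individual request unacceptably slow.
    """
    if batch_latency_ms > target_ms:
        return None  # this batch size violates the latency target
    return batch * 1000.0 / batch_latency_ms

# Hypothetical: a batch of 8 finishing in 6.5 ms meets the 7 ms target.
print(throughput_at_target(8, 6.5))
# A batch of 64 finishing in 17 ms (like the charted CPU config) does not.
print(throughput_at_target(64, 17.0))
```

Under this framing, a configuration whose latency exceeds the target, such as the 17ms CPU bar in the ResNet-50 chart, contributes no valid throughput at all at that service level.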