Transcript of ENDURING DIFFERENTIATION (Enduring_Differentiation-2x2.pdf, gilesm/cuda/lecs)

Timothy Lanfear

ENDURING DIFFERENTIATION

2

WHERE ARE WE?

3

LIFE AFTER DENNARD SCALING

[Chart: 40 Years of Microprocessor Trend Data, 1980-2020, log scale 10^2-10^7. Transistor counts (thousands) keep climbing, while single-threaded performance grew 1.5X per year until the mid-2000s and only 1.1X per year since. Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten; new plot and data for 2010-2015 collected by K. Rupp.]

4

GPU-ACCELERATED PERFORMANCE

[Chart: AMBER Performance (ns/day), scale 0-40: K20 (2013, AMBER 12 / CUDA 4), K40 (2014, AMBER 14 / CUDA 5), K80 (2015, AMBER 14 / CUDA 6), P100 (2016, AMBER 16 / CUDA 8).]

[Chart: GoogleNet Performance (i/s), scale 0-12,000: 8x K80 (2014, cuDNN 2 / CUDA 6), 8x Maxwell (2015, cuDNN 4 / CUDA 7), DGX-1 (2016, cuDNN 6 / CUDA 8 / NCCL 1.6), DGX-1V (2017, cuDNN 7 / CUDA 9 / NCCL 2).]


5

TESLA PLATFORM ADVANTAGE

Delivered value grows over time:
Life Sciences (AMBER): 10X perf in 8 years
Oil & Gas (RTM): 6X perf in 6 years

TESLA VALUE: $15-20M COST SAVINGS DELIVERED

AMBER performance: nanoseconds per day delivered on one server with GPUs and CPUs.

6

GPU-ACCELERATED EFFICIENCY

On Track To Meet Exascale Goal

[Chart: GFLOPS per watt, scale 0-35, for top GPU systems in Green500 lists and NVIDIA projections for V100: Eurotech Aurora (K20) 3.2; Tsubame-KFC (K20X) 4.4; TiTech W780I (K80) 5.3; SaturnV (P100) 9.5; TSUBAME 3.0 (P100) 14.1; V100 projected toward the 35 GF/W exascale goal.]

13/13 Greenest Supercomputers Powered by Tesla P100: TSUBAME 3.0, Kukai, AIST AI Cloud, RAIDEN GPU subsystem, Piz Daint, Wilkes-2, GOSAT-2 (RCF2), DGX Saturn V, Reedbush-H, JADE, Facebook Cluster, Cedar, DAVIDE

7

HOW ARE WE DOING THIS?

• What are the most important dimensions of our differentiation?

• Why are GPUs so much more efficient than CPUs?

• How can we continue scaling performance/efficiency as Moore’s Law fades?

• Why can’t competitors replicate GPU efficiency, performance, scaling, etc., with lots of weak CPU cores? (e.g., Intel KNC/KNL/KNM)

• How is optimizing GPUs for AI affecting their suitability for HPC?

And, is our differentiation sustainable?

8

ENERGY EFFICIENCY


9

COMPUTATION VERSUS COMMUNICATIONS

[Diagram: energy per operation on a 28 nm IC with a 20 mm die; movement costs quoted per 256 bits:
64-bit DP operation: 20 pJ
256-bit access to 8 kB SRAM: 50 pJ
On-chip wires: 26 pJ, 256 pJ, 1000 pJ with increasing distance across the die
Efficient off-chip link: 500 pJ
256-bit DRAM read/write: 16,000 pJ]
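The scale of that gap is easy to check with quick arithmetic; a minimal sketch using the per-operation energies quoted on this slide:

```python
# Per-operation energy at 28 nm, as quoted on the slide (picojoules).
DP_FLOP_PJ = 20          # one 64-bit double-precision operation
SRAM_256BIT_PJ = 50      # 256-bit access to an 8 kB SRAM
OFFCHIP_LINK_PJ = 500    # 256 bits over an efficient off-chip link
DRAM_256BIT_PJ = 16000   # 256-bit DRAM read/write

# One DRAM access costs as much energy as hundreds of DP operations,
# which is why data movement, not arithmetic, dominates the power budget.
flops_per_dram_access = DRAM_256BIT_PJ / DP_FLOP_PJ
print(flops_per_dram_access)   # 800.0

# Even an efficient off-chip link costs 25 DP ops, and DRAM is 32x the link.
print(OFFCHIP_LINK_PJ / DP_FLOP_PJ, DRAM_256BIT_PJ / OFFCHIP_LINK_PJ)   # 25.0 32.0
```

In other words, a kernel needs on the order of 800 flops per 256-bit DRAM access before arithmetic, rather than memory traffic, dominates its energy.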

10

CPU: 126 pJ/flop (SP)
Optimized for Latency
Deep Cache Hierarchy
Broadwell E5 v4, 14 nm

GPU: 28 pJ/flop (SP)
Optimized for Throughput
Explicit Management of On-chip Memory
Pascal, 16 nm

11

HOW IS POWER SPENT IN A CPU?

Out of Order, High Performance (Natarajan [2003], Alpha 21264): Clock + Pins 45%, RF 14%, Fetch 11%, Rename 10%, Data Supply 5%, ALU 4%

In Order, Embedded (Dally [2008], embedded in-order CPU): Instruction Supply 42%, Clock + Control Logic 24%, Data Supply 17%, Register File 11%, Issue 11%, ALU 6%

12

Overhead: 985 pJ
Payload arithmetic: 15 pJ


13

SIMPLER CORES = ENERGY EFFICIENCY

Source: Azizi [PhD 2010]

14

Overhead: 15 pJ
Payload arithmetic: 15 pJ
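Comparing the two overhead/payload breakdowns makes the case for simpler cores concrete; a minimal sketch using the 985 pJ and 15 pJ overhead figures from these slides:

```python
# Payload fraction: how much of each instruction's energy reaches the ALU.
def payload_fraction(overhead_pj, payload_pj):
    return payload_pj / (overhead_pj + payload_pj)

ooo_cpu = payload_fraction(985, 15)          # out-of-order CPU: 15 pJ of 1000 pJ
throughput_core = payload_fraction(15, 15)   # simple throughput core: 15 pJ of 30 pJ

print(f"{ooo_cpu:.1%} vs {throughput_core:.1%}")   # 1.5% vs 50.0%
```

The same 15 pJ of arithmetic goes from 1.5% of the energy budget to 50% simply by stripping out fetch, rename, and other control overhead.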

15

THROUGHPUT PROCESSORS

16

RISE OF LEAKAGE


17

FREQUENCY VS. LEAKAGE

Source: Gordon Moore, Intel; IEEE

19

OPTIMIZED FOR DATACENTER EFFICIENCY
40% More Performance in a Rack

Computer Vision, V100 Max Performance: 13 kW rack, 4 nodes of 8x V100, 13 ResNet-50 networks trained per day

Computer Vision, V100 Max Efficiency: 13 kW rack, 7 nodes of 8x V100, 18 ResNet-50 networks trained per day

80% Perf at Half the Power

ResNet-50 training, Max Efficiency run with V100 @ 160 W | V100 performance measured on pre-production hardware.

20

SP ENERGY EFFICIENCY @ 28 NM

[Chart: GFLOPS/watt, scale 0-25, for Fermi, Kepler, and Maxwell.]

21

HETEROGENEOUS COMPUTING


22

OPTIMIZING SERIAL/PARALLEL EXECUTION

Application Code splits across both processors:
GPU: Parallel Work (Majority of Ops)
CPU: Serial Work (System and Sequential Ops)

23

TWO TYPES OF ACCELERATORS

CPU: Optimized for Serial Tasks
GPU Accelerator: Optimized for Parallel Tasks

Heterogeneous Computing Model: Complementary Processors Work Together
Many-Weak-Cores (MWC) Model: Single CPU Core for Both Serial & Parallel Work

Xeon Phi (And Others): Many Weak Serial Cores
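The contrast between the two models is Amdahl's law in action; a quick sketch (the 95% parallel fraction is an illustrative assumption, not a figure from the slide):

```python
# Amdahl's law: overall speedup when only the parallel fraction p
# of the work is accelerated by a factor s.
def amdahl(p, s):
    return 1.0 / ((1.0 - p) + p / s)

# With 95% of ops parallel (illustrative), a 50x-faster parallel engine
# gets close to the ceiling set by the serial 5%, which caps the total
# speedup at 1 / 0.05 = 20x no matter how fast the parallel part runs.
print(round(amdahl(0.95, 50), 1))    # 14.5
print(round(amdahl(0.95, 1e9), 1))   # 20.0
```

This is why pairing a strong serial CPU with a parallel accelerator beats many weak cores: weak cores slow the serial fraction, which is exactly the term that limits the whole application.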

24

EXTENSIBILITY

25

NVLINK: A MEMORY FABRIC, NOT A NETWORK
DGX-1: 8 NVLink-Connected GPUs


26

LATENCY HIDING FOR LOAD/STORE/ATOMICS
Where are the NICs? There are no NICs.

27

STRONG SCALING

AlexNet Training: DGX-1 faster than 128 Knights Landing servers.

[Chart: speed-up vs 1x KNL server, scale 0x-40x, across 1, 4, 8, 16, 32, 64, and 128 Knights Landing servers vs one GPU-accelerated server (NVIDIA DGX-1).]

28

LAMMPS: Molecular Dynamics. 8x Tesla P100 PCIe server faster than 32 KNL servers.
[Chart: speed-up vs 1 KNL server node, scale 0x-40x, across 0-32 KNL nodes: 2x P100 13x faster, 4x P100 21x faster, 8x P100 35x faster.]

GTC-P: Plasma Turbulence. 8x Tesla P100 PCIe server faster than 64 KNL servers.
[Chart: speed-up vs 1 KNL server node, scale 0x-10x, across 0-64 KNL nodes: 2x P100, 4x P100, 8x P100 8x faster.]

STRONG SCALING

29

LQCD: High Energy Physics. SATURNV DGX servers vs SuperMUC supercomputer.
[Chart: Gflop/s, scale 0-25,000, vs number of CPU nodes (0-144) in the SuperMUC supercomputer: CPU nodes reach 2K, 3K, 5K, and 7K Gflop/s; 1x DGX-1 delivers 8K Gflop/s, 2x DGX-1 15K Gflop/s, 4x DGX-1 20K Gflop/s.]
QUDA version 0.9beta, using double-half mixed precision; DDalphaAMG using double-single.

Number of CPU servers to match the performance of SATURNV:
S3D (discovering new fuel for engines): 2,300 CPU servers
SPECFEM3D (simulating earthquakes): 3,800 CPU servers

STRONG SCALING


30

NEW TENSOR CORE

New CUDA TensorOp instructions and data formats

4×4 matrix processing array

D(FP32) = A(FP16) × B(FP16) + C(FP32)

Optimized for deep learning

Activation inputs, weight inputs, output results
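The mixed-precision contract above can be emulated with NumPy to make the dataflow concrete; this is a sketch only, not the hardware path (the real unit performs the full 4x4 multiply-accumulate in a single instruction):

```python
import numpy as np

rng = np.random.default_rng(0)

# Operands: activations and weights stored in FP16, accumulator in FP32.
A = rng.standard_normal((4, 4)).astype(np.float16)   # activation inputs
B = rng.standard_normal((4, 4)).astype(np.float16)   # weight inputs
C = rng.standard_normal((4, 4)).astype(np.float32)   # accumulator

# FP16 multiply, FP32 accumulate: widen the FP16 operands to FP32 before
# the matrix product, so sums never round back to half precision.
D = A.astype(np.float32) @ B.astype(np.float32) + C

assert D.dtype == np.float32   # output results stay in FP32
```

Keeping the accumulator in FP32 is what lets training tolerate FP16 inputs: individual products lose precision, but the running sums do not.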

31

TESLA PLATFORM

32

TESLA IS A PLATFORM
World’s Leading Data Center Platform for Accelerating HPC and AI

TESLA GPU & SYSTEMS: Tesla GPU, NVIDIA DGX-1, NVIDIA HGX-1, system OEM, cloud

NVIDIA SDK: Deep Learning SDK (cuBLAS, cuSPARSE, cuDNN, NCCL, TensorRT, DeepStream SDK), ComputeWorks

INDUSTRY FRAMEWORKS & TOOLS: frameworks, ecosystem tools

APPLICATIONS: HPC (450+ applications), internet services, enterprise applications (manufacturing, automotive, healthcare, finance, retail, defense)

33

MULTIPLE GROWTH MARKETS

10M users, 40 years of video/day

TESLA PLATFORM growth markets: HPC, AI Training, AI Inference, Video Transcoding, Desktop Virtualization


34

CONCLUSION

35

PASCAL TO VOLTA

• Area: ~600 mm2 → ~800 mm2 (~33% more area)

• Process: small improvement from Pascal to Volta (a few percent)

• Clocks: similar dynamic range, power limited

• Memory BW (sustained): 50% improvement

• Communications (NVLink): 160 GB/s → 300 GB/s (almost double!)

• AI (Tensor Cores): ~20 TFLOPS → 120 TFLOPS (~6x!)

Architecture with Technology

36

REVOLUTIONARY PERFORMANCE FOR HPC AND AI
Single Platform for Data Science and Computational Science

1.5X HPC Performance in 1 Year | 3X AI Performance in 1 Year

[Chart: LSTM (neural machine translation) training time: 2x CPU 15 days, 1x P100 18 hours, 1x V100 6 hours.]

[Chart: V100 performance normalized to P100, roughly 1.5x-1.8x across Molecular Dynamics (VMD), Physics (QUDA), Seismic (RTM), and STREAM.]

NMT training for 13 epochs | German -> English, WMT15 subset | System config: CPU = 2x Xeon E5-2699 v4 w/ P100s or V100s
QUDA, RTM, STREAM | System config: 2x Xeon E5-2690 v4, 2.6 GHz, w/ 1x Tesla P100 or V100
VMD | System config: Xeon E5-2698 v3 w/ 1x Tesla P100 and E5-2697A v4 w/ 1x Tesla V100 | V100 measured on pre-production hardware

37

GPU PERFORMANCE COMPARISON

                         P100           V100            Ratio
Training acceleration    10 TOPS        120 TOPS        12x
Inference acceleration   21 TFLOPS      120 TOPS        6x
FP64/FP32                5/10 TFLOPS    7.5/15 TFLOPS   1.5x
HBM2 bandwidth           720 GB/s       900 GB/s        1.2x
NVLink bandwidth         160 GB/s       300 GB/s        1.9x
L2 cache                 4 MB           6 MB            1.5x
L1 caches                1.3 MB         10 MB           7.7x


38

GPU TRAJECTORY

[Chart: microprocessor trend data, 1980-2020 (original data up to 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten; 2010-2015 data by K. Rupp), annotated with the GPU trajectory: single-threaded perf grew 1.5X per year, then 1.1X per year; GPU-computing perf grows 1.5X per year, projecting 1000X by 2025.]

APPLICATIONS

SYSTEMS

ALGORITHMS

CUDA

ARCHITECTURE