TESLA V100

• 21B transistors, 815 mm²
• 80 SMs*, 5120 CUDA Cores, 640 Tensor Cores
• 16 GB HBM2 @ 900 GB/s
• 300 GB/s NVLink

*full GV100 chip contains 84 SMs
GPU PERFORMANCE COMPARISON

                     P100            V100             Ratio
DL Training          10 TFLOPS       120 TFLOPS       12x
DL Inferencing       21 TFLOPS       120 TFLOPS       6x
FP64/FP32            5/10 TFLOPS     7.5/15 TFLOPS    1.5x
HBM2 Bandwidth       720 GB/s        900 GB/s         1.2x
STREAM Triad Perf    557 GB/s        855 GB/s         1.5x
NVLink Bandwidth     160 GB/s        300 GB/s         1.9x
L2 Cache             4 MB            6 MB             1.5x
L1 Caches            1.3 MB          10 MB            7.7x
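As a back-of-envelope check on the headline figure (a hedged calculation, assuming the ~1.455 GHz boost clock quoted at launch, which is not stated on this slide): each of the 640 Tensor Cores performs one 4x4x4 matrix multiply-accumulate per clock, i.e. 64 FMAs = 128 flops, so 640 × 128 flops × 1.455 GHz ≈ 119 TFLOPS, rounded to the 120 TFLOPS above.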
NVLINK - PROGRAMMABILITY

CPU Mastering
• Direct Load/Store access for all agents
• Flat shared address space

GPU & CPU Atomics
• Atomic completion at destination
• Read-Modify-Write primitive for unsupported atomics

Address Translation Services
• GPU uses CPU page tables directly
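A minimal sketch of what that buys a programmer, assuming a system where CPU and GPU share a flat address space (managed memory here; the kernel name and counts are illustrative, and the CPU side uses a GCC/Clang builtin): both agents issue atomics to the same location, with system scope requested via the _system suffix (CUDA 8+, sm_60+).

#include <cuda_runtime.h>
#include <cstdio>

__global__ void gpu_side(int *counter) {
    // System-scope read-modify-write, visible to the CPU as well.
    atomicAdd_system(counter, 1);
}

int main() {
    int *counter;
    cudaMallocManaged(&counter, sizeof(int));  // one flat shared address
    *counter = 0;
    gpu_side<<<1, 32>>>(counter);
    cudaDeviceSynchronize();
    __sync_fetch_and_add(counter, 1);  // CPU atomic on the same address
    printf("count = %d\n", *counter);  // 33
    cudaFree(counter);
    return 0;
}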
NVLINK – PERFORMANCE AND POWER

Bandwidth
• 25 Gbps signaling
• 6 NVLinks for GV100
• 1.9x bandwidth improvement over GP100

Coherence
• Latency-sensitive: the CPU caches GPU memory (GMEM)
• Fast access in the local cache hierarchy
• Probe filter in the GPU

Power Savings
• Reduce the number of active lanes for lightly loaded links
NVLINK NODES

DL – Hybrid Cube Mesh – DGX-1 with Volta
HPC – P9 CORAL Node – Summit

[Diagrams: DGX-1 connects eight V100s in a hybrid cube mesh; the Summit node pairs two POWER9 CPUs with six V100s over NVLink]
VOLTA GV100 SM

Redesigned for Productivity and Accessible Performance
• Twice the schedulers
• Simplified issue logic
• Large, fast L1 cache
• Improved SIMT model
• Tensor acceleration
• +50% energy efficiency vs GP100 SM

[Diagram: SM block with a shared L1 I$, four sub-cores, and a shared TEX / L1 D$ & SMEM block]
SM MICROARCHITECTURE

• Shared L1 I$ (4 warp instructions/clk)
• 4 independently scheduled sub-cores (1 warp instruction/clk each)
• Shared MIO: TEX (1 quad/clk) and L1 D$ & SMEM (128 KB, 128 B/clk), backed by the L2 $
• 64 B/clk between each sub-core and the MIO
SUB-CORE

Warp Scheduler
• 1 warp instruction/clk
• L0 I$, constant $, branch unit (BRU, 1 branch/clk)

Math Dispatch Unit
• Keeps 2+ datapaths busy
• Per clk: 8 FP64 DFMA, 16 FP32 FFMA, 16 INT, 4 MUFU

Two 4x4x4 Tensor Cores
• Two 4x4x4 tensor ops/clk

MIO Instruction Queue
• Holds Load/Store/TEX instructions for later scheduling
• MIO scheduler: 1 warp instruction / 2 clk; MIO datapath: 64 B/clk

Register File: 512 × 32 threads × 32 bits
L1 AND SHARED MEMORY

Streaming L1$
• Unlimited cache misses in flight
• Low cache hit latency
• 4x bandwidth vs GP100
• 4x capacity vs GP100

Shared Memory
• Unified storage with L1$ (128 KB total)
• Configurable up to 96 KB (see the sketch below)

[Diagram: TEX (1 quad/clk) and the 128 KB L1 D$ & SMEM feed the four sub-cores at 64 B/clk each, 128 B/clk total, backed by the L2 $]
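A minimal sketch of how a kernel opts into that configurable split via the CUDA 9 runtime; the kernel, tile size, and launch shape are illustrative, not from the slides.

#include <cuda_runtime.h>

// Illustrative kernel using dynamic shared memory.
__global__ void copy_through_smem(float *out, const float *in) {
    extern __shared__ float tile[];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];
    __syncthreads();
    out[i] = tile[threadIdx.x];
}

void configure_and_launch(float *out, const float *in) {
    // Prefer the maximum shared-memory carveout of the unified 128 KB
    // L1/SMEM storage (up to 96 KB on GV100); this is a hint.
    cudaFuncSetAttribute(copy_through_smem,
                         cudaFuncAttributePreferredSharedMemoryCarveout,
                         cudaSharedmemCarveoutMaxShared);
    // Blocks requesting more than 48 KB of dynamic shared memory must
    // opt in explicitly.
    cudaFuncSetAttribute(copy_through_smem,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         96 * 1024);
    copy_through_smem<<<1, 256, 256 * sizeof(float)>>>(out, in);
}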
NARROWING THE SHARED MEMORY GAP with the GV100 L1 cache

Cache vs. shared:
• Easier to use
• 90%+ as good

Shared vs. cache:
• Faster atomics
• More banks
• More predictable

[Chart: average shared-memory benefit in directed testing (shared-memory code run against global memory): the L1 path reaches 70% of shared-memory performance on Pascal vs. 93% on Volta]
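A hedged sketch of the kind of directed test implied above (bin count and kernel shape are illustrative): the same histogram written with shared-memory atomics, the case where shared memory still clearly wins over the cached global path.

// Per-block histogram in shared memory, flushed once to global bins.
__global__ void hist_shared(const int *data, int *bins, int n) {
    __shared__ int local[256];
    for (int i = threadIdx.x; i < 256; i += blockDim.x) local[i] = 0;
    __syncthreads();
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&local[data[i] & 255], 1);  // fast shared-memory atomics
    __syncthreads();
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        atomicAdd(&bins[i], local[i]);        // one global update per bin
}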
INDEPENDENT THREAD SCHEDULING

[Diagram: each of the 32 threads in a warp carries its own program counter and call stack (PC,S)]
PROGRAMMING MODEL: MEMORY ADDRESSES

• Programs written without synchronization → convergence must be proven (e.g. Vector)
• Programs written using annotations → convergence is prescribed (e.g. NVIDIA GPUs)
• Programs written as normal, fed through a convergence optimizer in the compiler → convergence is a performance optimization (GPU innovation)

PROGRAMMING MODEL: CONTROL ADDRESSES

• Programs written without synchronization → convergence must be proven (e.g. Scalar + Predicated Vector)
• Programs written using annotations (e.g. vote) → convergence is prescribed (e.g. pre-Volta GPUs)
• Programs written as normal, fed through a convergence optimizer in the compiler → convergence is a performance optimization (Volta innovation: Independent Thread Scheduling)
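A minimal sketch of what "convergence as an optimization" means for warp-level code on Volta (the kernel is illustrative, not from the slides): CUDA 9's *_sync intrinsics name the lanes they synchronize instead of assuming the warp is converged.

__global__ void butterfly(int *out) {
    int lane = threadIdx.x % 32;
    int v = lane;
    if (lane < 16) {
        // Only lanes 0-15 participate; the mask states that explicitly,
        // so correctness does not depend on implicit reconvergence.
        v = __shfl_xor_sync(0x0000ffffu, v, 1);
    }
    __syncwarp();          // explicitly reconverge the full warp
    out[threadIdx.x] = v;
}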
NVIDIA SIMT GPUS: SCALAR THREADS ON SIMD UNITS

NVIDIA GPUs have always used an array of scalar threads:
• C, C++, …, FORTRAN → scalar compiler (50+ year history of optimizations)
• Scalar ISA → thread virtualization tech (30+ year history of TLP-to-ILP conversion)
• SIMD units (where the efficiency comes from)
COOPERATION SYNCHRONIZATION

Warp-synchronizing operation
• e.g. our NEW lane data compare: match.any.sync
• Result distributed across warp

[Diagram: lanes of a warp compare values pairwise; each lane receives the mask of matching lanes]
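A minimal sketch of the corresponding CUDA 9 intrinsic (kernel and printout are illustrative): __match_any_sync returns, to every lane, the mask of lanes holding the same value.

#include <cstdio>

__global__ void group_by_key(const int *keys) {
    int lane = threadIdx.x % 32;
    int key  = keys[threadIdx.x];
    // Mask of active lanes whose key equals this lane's key (sm_70+).
    unsigned peers = __match_any_sync(0xffffffffu, key);
    // One leader per group reports the distributed result.
    if (lane == __ffs(peers) - 1)
        printf("key %d -> lanes 0x%08x\n", key, peers);
}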
EFFICIENT GPU THREAD SYNCHRONIZATION

#include <atomic>

struct mutex {
  void lock() {
    while (true) {
      bool state = false;
      if (locked.compare_exchange_weak(state, true,
                                       std::memory_order_acquire))
        return;
      while (locked.load(std::memory_order_relaxed))
        do_exponential_backoff(); // elided for brevity
    }
  }
  void unlock() {
    locked.store(false, std::memory_order_release);
  }
  std::atomic<bool> locked = ATOMIC_VAR_INIT(false);
};
Note: Unfair spinlock with exponential back-off algorithm, x86-64 Linux 4.8.0, i7-5820K, CUDA 9.0 pre-release software, V100 pre-production hardware, Aug 2017
[Chart: critical-section performance relative to a single thread, scaling from 10's of CPU threads to 10K's of GPU threads]
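For device code, the same spinlock idea is usually written with CUDA atomics (a hedged sketch, not the slide's benchmark code); Volta's independent thread scheduling is what makes intra-warp contention on such a lock safe from deadlock.

__device__ void lock(int *mtx) {
    // Spin until we swap 0 -> 1; would deadlock within a warp pre-Volta.
    while (atomicCAS(mtx, 0, 1) != 0) { }
    __threadfence();   // acquire: order later accesses after the lock
}

__device__ void unlock(int *mtx) {
    __threadfence();   // release: publish protected writes before unlocking
    atomicExch(mtx, 0);
}

__global__ void guarded_increment(int *mtx, int *value) {
    lock(mtx);
    ++*value;          // critical section
    unlock(mtx);
}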
TENSOR CORE: Mixed-Precision Matrix Math on 4x4 Matrices

D = A × B + C

where A and B are FP16 4×4 matrices, and the accumulator/result matrices C and D are FP16 or FP32.
TENSOR SYNCHRONIZATION

Warp-synchronizing operation
• Composed matrix multiply and accumulate for 16x16 matrices
• Full-warp 16x16 matrix math
• Result distributed across warp
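A minimal sketch of that warp-wide view via CUDA 9's WMMA API (tile layouts and leading dimensions here are illustrative): the whole warp cooperates on one 16x16x16 multiply-accumulate, and the result lives distributed across the warp's registers.

#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes D = A*B + C on a 16x16x16 tile (sm_70+).
__global__ void wmma_tile(const half *a, const half *b, float *d) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);          // C = 0 for this sketch
    wmma::load_matrix_sync(a_frag, a, 16);   // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(acc, a_frag, b_frag, acc);
    // Result is distributed across the warp; store it back row-major.
    wmma::store_matrix_sync(d, acc, 16, wmma::mem_row_major);
}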
VOLTA TENSOR OPERATION

• FP16 storage/input
• Full-precision products
• Summed into an FP32 accumulator (more products accumulate each step)
• Converted to an FP32 result
• Also supports an FP16 accumulator mode for inferencing

[Diagram: F16 × F16 full-precision products feed an adder tree with an F32 accumulator, producing F32 results]
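As a reference for those semantics (a hedged scalar emulation, not how the hardware is implemented): FP16 inputs, full-precision products, FP32 accumulation.

#include <cuda_fp16.h>

// Emulates one 4x4x4 Tensor Core op: D = A x B + C with FP16 inputs
// and FP32 accumulation.
__host__ __device__ void tensor_op_4x4(const __half A[4][4],
                                       const __half B[4][4],
                                       const float  C[4][4],
                                       float        D[4][4]) {
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            float acc = C[i][j];
            for (int k = 0; k < 4; ++k)
                acc += __half2float(A[i][k]) * __half2float(B[k][j]);
            D[i][j] = acc;
        }
}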
A GIANT LEAP FOR DEEP LEARNING
Rela
tive P
erf
orm
ance
P100 V100 – Tensor Cores
9.3xfaster
cuBLAS Mixed Precision (FP16 input, FP32 compute)
Matrix Multiply (M=N=K=2048)
(CUDA 8) (CUDA 9)
V100 measured on pre-production hardware.
25
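A minimal sketch of how such a GEMM is requested from cuBLAS (square sizes and leading dimensions are illustrative, error checking is omitted, and the math-mode/algo enum names are as of the CUDA 9/10 toolkits):

#include <cublas_v2.h>
#include <cuda_fp16.h>

// FP16-input, FP32-compute GEMM with Tensor Cores enabled.
void gemm_fp16_in_fp32_out(cublasHandle_t handle, int n,
                           const half *A, const half *B, float *C) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);  // allow Tensor Cores
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                 &alpha, A, CUDA_R_16F, n,
                         B, CUDA_R_16F, n,
                 &beta,  C, CUDA_R_32F, n,
                 CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP);
}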
TESLA V100
The Fastest and Most Productive GPU for Deep Learning and HPC

• Volta Architecture: the most productive GPU
• New SM Core: performance & programmability
• Tensor Core: 120 programmable TFLOPS for deep learning
• Independent Thread Scheduling: new algorithms
• Improved NVLink & HBM2: efficient bandwidth

More V100 features: 2x L2 atomics, INT8, new memory model, copy-engine page migration, MPS acceleration, and more …
QUESTIONS?