HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND...

47
STEVE OBERLIN CTO, TESLA ACCELERATED COMPUTING NVIDIA HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK

Transcript of HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND...

Page 1: HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND …on-demand.gputechconf.com/gtc/2015/presentation/S5649-Steve... · HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK .

STEVE OBERLIN

CTO, TESLA ACCELERATED COMPUTING

NVIDIA

HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK

Page 2: HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND …on-demand.gputechconf.com/gtc/2015/presentation/S5649-Steve... · HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK .

STATE OF THE ART 2012

18,688 Tesla K20X GPUs

27 PetaFLOPS

Page 3: HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND …on-demand.gputechconf.com/gtc/2015/presentation/S5649-Steve... · HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK .

FLAGSHIP SCIENTIFIC APPLICATIONS ON TITAN

Material Science (WL-LSMS)

Combustion (S3D)

Climate Change (CAM-SE)

Nuclear Energy (Denovo)

Biofuels (LAMMPS)

Astrophysics (NRDF).

Page 4: HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND …on-demand.gputechconf.com/gtc/2015/presentation/S5649-Steve... · HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK .

CORAL Summit System 5-10x Faster

1/5th the Nodes, Same Energy Use as Titan

STATE OF THE ART 2017

Just 1 Node in Summit Is the same performance as the

Earth Simulator in 2002

Earth Simulator 40.96 TF

Top500 #1 for 3 years

Page 5: HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND …on-demand.gputechconf.com/gtc/2015/presentation/S5649-Steve... · HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK .

AGENDA: 3 STORIES

Heterogeneous HPC

HPC System Evolution

Architecture and Technology Basis for Heterogeneous Processing

Architectural Optimization and NVLink

Tesla Accelerated Computing Platform

NVLink

CORAL Punctuation Mark

Optimized Fat node performance projections

Page 6: HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND …on-demand.gputechconf.com/gtc/2015/presentation/S5649-Steve... · HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK .

ARE YOU CALIBRATED?

.6 MPH 6 MPH 60 MPH 600 MPH

3 ORDERS OF MAGNITUDE

Page 7: HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND …on-demand.gputechconf.com/gtc/2015/presentation/S5649-Steve... · HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK .

HPC SYSTEM EVOLUTION

MEMORY

CPU

Page 8: HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND …on-demand.gputechconf.com/gtc/2015/presentation/S5649-Steve... · HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK .

CRAY-1 LATENCY-HIDING SINGLE VECTOR CPU, 160 MFLOPS PEAK

Page 9: HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND …on-demand.gputechconf.com/gtc/2015/presentation/S5649-Steve... · HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK .

FIRST LINPACK LIST, JANUARY 1979

Page 10: HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND …on-demand.gputechconf.com/gtc/2015/presentation/S5649-Steve... · HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK .

TOP500 SUPERCOMPUTER LIST 1993-2012

CORAL

9 ORDERS

OF

MAGNITUDE

CRAY-1

TITAN

Page 11: HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND …on-demand.gputechconf.com/gtc/2015/presentation/S5649-Steve... · HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK .

HPC SYSTEM EVOLUTION

MEMORY

CPU

MEMORY

CPU CPU …

Page 12: HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND …on-demand.gputechconf.com/gtc/2015/presentation/S5649-Steve... · HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK .

CRAY-2 4 LATENCY-HIDING VECTOR CPUS, 2 GFLOPS PEAK, 1985

Page 13: HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND …on-demand.gputechconf.com/gtc/2015/presentation/S5649-Steve... · HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK .

HPC SYSTEM EVOLUTION

MEMORY

CPU

MEMORY

CPU CPU …

MEMORY

CPU

MEMORY

CPU …

NETWORK

Page 14: HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND …on-demand.gputechconf.com/gtc/2015/presentation/S5649-Steve... · HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK .

“ATTACK OF THE KILLER MICROS”

Page 15: HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND …on-demand.gputechconf.com/gtc/2015/presentation/S5649-Steve... · HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK .

CRAY T3D DEC ALPHA EV4 MICROPROCESSORS, 1 TFLOPS PEAK, 1993

Page 16: HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND …on-demand.gputechconf.com/gtc/2015/presentation/S5649-Steve... · HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK .

HETEROGENEOUS RESULTS

Page 17: HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND …on-demand.gputechconf.com/gtc/2015/presentation/S5649-Steve... · HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK .

CRAY T3E DEC ALPHA EV5 MICROPROCESSORS, 1 TFLOPS SUSTAINED, 1995

Page 18: HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND …on-demand.gputechconf.com/gtc/2015/presentation/S5649-Steve... · HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK .

FUTURE SHOCK

“There’s hasn’t really been anything truly new for 20 years… except GPUs.” --Thomas Schulthess, Director, CSCS

Page 19: HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND …on-demand.gputechconf.com/gtc/2015/presentation/S5649-Steve... · HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK .

COMING SOON TO A NODE NEAR YOU

Cray T3E

2.4 TFLOPS peak performance (2K processors)

~128 GB/s bisection bandwidth

CRAY T90

56 GFLOPs peak performance (32 processors)

1 TB/s bisection bandwidth

Pascal GPU (1H16) has >T3E peak with T90 memory BW

Page 20: HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND …on-demand.gputechconf.com/gtc/2015/presentation/S5649-Steve... · HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK .

OPTIMIZING SERIAL/PARALLEL EXECUTION Application Code

+

GPU CPU

Parallel Work

Majority of Ops

Serial Work

System and Sequential Ops

Page 21: HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND …on-demand.gputechconf.com/gtc/2015/presentation/S5649-Steve... · HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK .

21

TWO COMPUTING MODELS FOR ACCELERATORS

CPU Optimized for Serial Tasks

GPU Accelerator Optimized for Parallel Tasks

Heterogeneous Computing Model Complementary Processors Work Together

Many-Weak-Cores (MWC) Model Single CPU Core for Both Serial & Parallel Work

Xeon Phi (And Others) Many Weak Serial Cores

Page 22: HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND …on-demand.gputechconf.com/gtc/2015/presentation/S5649-Steve... · HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK .

22

AMDAHL’S LAW ANALYSIS

0 1 2 3 4 5 6 7 8 9 10

Work 1x CPU

2 x MWC (.25x CPU)

1 GPU+1 CPU

Serial

Parallel

98% Parallel Work

Minutes Run Time

Page 23: HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND …on-demand.gputechconf.com/gtc/2015/presentation/S5649-Steve... · HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK .

23

AMDAHL’S LAW ANALYSIS

0 1 2 3 4 5 6 7 8 9 10

Work 1x CPU

2 x MWC (.25x CPU)

1 GPU+1 CPU

Serial

Parallel

90% Parallel Work

Minutes Run Time

Page 24: HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND …on-demand.gputechconf.com/gtc/2015/presentation/S5649-Steve... · HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK .

24

AMDAHL’S LAW ANALYSIS

0 1 2 3 4 5 6 7 8 9 10

Work 1x CPU

2 x MWC (.25x CPU)

1 GPU+1 CPU

Serial

Parallel

80% Parallel Work

Minutes Run Time

Page 25: HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND …on-demand.gputechconf.com/gtc/2015/presentation/S5649-Steve... · HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK .

25

AMDAHL’S LAW ANALYSIS

0 2 4 6 8 10 12 14

Work 1x CPU

2 x MWC (.25x CPU)

1 GPU+1 CPU

Serial

Parallel

70% Parallel Work

Minutes Run Time

Page 26: HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND …on-demand.gputechconf.com/gtc/2015/presentation/S5649-Steve... · HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK .

26

AMDAHL’S LAW ANALYSIS

60% Parallel Work

0 2 4 6 8 10 12 14 16 18

Work 1x CPU

2 x MWC (.25x CPU)

1 GPU+1 CPU

Serial

Parallel

Minutes Run Time

Page 27: HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND …on-demand.gputechconf.com/gtc/2015/presentation/S5649-Steve... · HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK .

27

AMDAHL’S LAW ANALYSIS

50% Parallel Work

0 5 10 15 20 25

Work 1x CPU

2 x MWC (.25x CPU)

1 GPU+1 CPU

Serial

Parallel

Minutes Run Time

Page 28: HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND …on-demand.gputechconf.com/gtc/2015/presentation/S5649-Steve... · HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK .

28

AMDAHL’S LAW ANALYSIS

40% Parallel Work

0 5 10 15 20 25 30

Work 1x CPU

2 x MWC (.25x CPU)

1 GPU+1 CPU

Serial

Parallel

Minutes Run Time

Page 29: HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND …on-demand.gputechconf.com/gtc/2015/presentation/S5649-Steve... · HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK .

29

30% Parallel Work

0 5 10 15 20 25 30

Work 1x CPU

2 x MWC (.25x CPU)

1 GPU+1 CPU

Serial

Parallel

Minutes Run Time

AMDAHL’S LAW ANALYSIS

Page 30: HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND …on-demand.gputechconf.com/gtc/2015/presentation/S5649-Steve... · HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK .

30

AMDAHL’S LAW ANALYSIS

20% Parallel Work

0 5 10 15 20 25 30 35

Work 1x CPU

2 x MWC (.25x CPU)

1 GPU+1 CPU

Serial

Parallel

Minutes Run Time

Page 31: HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND …on-demand.gputechconf.com/gtc/2015/presentation/S5649-Steve... · HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK .

31

AMDAHL’S LAW ANALYSIS

10% Parallel Work

0 5 10 15 20 25 30 35 40

Work 1x CPU

2 x MWC (.25x CPU)

1 GPU+1 CPU

Serial

Parallel

Minutes Run Time

Page 32: HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND …on-demand.gputechconf.com/gtc/2015/presentation/S5649-Steve... · HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK .

32

TESLA K80

WORLD’S FASTEST ACCELERATOR

FOR DATA ANALYTICS AND

SCIENTIFIC COMPUTING

Caffe Benchmark: AlexNet training throughput based on 20 iterations, CPU: E5-2697v2 @ 2.70GHz. 64GB System Memory, CentOS 6.2

Maximum Performance Dynamically Maximize

Performance for Every Application

Double the Memory Designed for Big Data Apps

24GB

Oil & Gas

Data Analytics

HPC Viz

K40 12GB

2x Faster 2.9 TF| 4992 Cores | 480 GB/s

0x

5x

10x

15x

20x

25x

CPU Tesla K40 Tesla K80

Deep Learning: Caffe

Dual-GPU Accelerator for

Max Throughput

Page 33: HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND …on-demand.gputechconf.com/gtc/2015/presentation/S5649-Steve... · HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK .

33

FOCUS ON THROUGHPUT PERFORMANCE

0

500

1000

1500

2000

2500

3000

3500

2008 2009 2010 2011 2012 2013 2014

Peak Double Precision FLOPS

NVIDIA GPU x86 CPU

M2090

M1060

K20

K80

Westmere Sandy Bridge

Haswell

GFLOPS

0

100

200

300

400

500

600

2008 2009 2010 2011 2012 2013 2014

Peak Memory Bandwidth

NVIDIA GPU x86 CPU

GB/s

K20

K80

Westmere Sandy Bridge

Haswell

Ivy Bridge

K40

Ivy Bridge

K40

M2090

M1060

Page 34: HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND …on-demand.gputechconf.com/gtc/2015/presentation/S5649-Steve... · HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK .

34

0x

5x

10x

15x

K80 CPU

10X FASTER THAN CPU ON APPLICATIONS

CPU: 12 cores, E5-2697v2 @ 2.70GHz. 64GB System Memory, CentOS 6.2 GPU: Single Tesla K80, Boost enabled

Quantum Chemistry Molecular Dynamics Physics Benchmarks

Page 35: HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND …on-demand.gputechconf.com/gtc/2015/presentation/S5649-Steve... · HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK .

35

TESLA PLATFORM ENABLES OPTIMIZATION

Ecosystem Industry Standard CPUs and Interconnects

ARM64 POWER x86

NVIDIA

GPU

InfiniBand

Industry-Driven Solutions

Others Cray Ethernet

Page 36: HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND …on-demand.gputechconf.com/gtc/2015/presentation/S5649-Steve... · HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK .

36

LOGICAL VS. PHYSICAL INTEGRATION OPTIMIZING FOR EFFICIENCY + FLEXIBILITY

CPU

System Memory

GPU Memory

GPU

Latency- Optimized

Throughput- Optimized

NVLink

Page 37: HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND …on-demand.gputechconf.com/gtc/2015/presentation/S5649-Steve... · HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK .

37

UNIFIED MEMORY

DRAMATICALLY LOWER DEVELOPER EFFORT

Exposed Developer View (Pre-CUDA 5.5)

Developer View With Unified Memory

Unified Memory System Memory

GPU Memory

Page 38: HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND …on-demand.gputechconf.com/gtc/2015/presentation/S5649-Steve... · HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK .

Latency

Optimized

Throughput

Optimized

NVLINK: NODE INTEGRATION NETWORK

5x PCIe bandwidth

Move data at CPU memory speed

3x lower energy/bit

TESLA

GPU

Power or

ARM CPU

DDR Memory Stacked Memory

NVLink

80 GB/s

DDR4

50-75 GB/s

HBM

1 Terabyte/s

Page 39: HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND …on-demand.gputechconf.com/gtc/2015/presentation/S5649-Steve... · HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK .

NVLINK HIGH-SPEED NODE NETWORK

NVLink

NVLink

POWER CPU

X86 ARM64 POWER CPU

PASCAL GPU KEPLER GPU

2016 2014

PCIe PCIe

X86 ARM64 POWER CPU

Page 40: HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND …on-demand.gputechconf.com/gtc/2015/presentation/S5649-Steve... · HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK .

40

NVLINK MULTI-GPU PERFORMANCE

3D FFT, ANSYS: 2 GPU configuration, All other apps comparing 4 GPU configuration AMBER Cellulose (256x128x128), FFT problem size (256^3)

TESLA

GPU

TESLA

GPU

CPU

5x Faster than

PCIe Gen3 x16

PCIe Switch

GPUs Interconnected with NVLink

1.00x

1.25x

1.50x

1.75x

2.00x

2.25x

ANSYS Fluent Multi-GPU Sort LQCD QUDA AMBER 3D FFT

Over 2x Application Performance Speedup When Next-Gen GPUs Connect via NVLink Versus PCIe

Speedup vs PCIe based Server

Page 41: HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND …on-demand.gputechconf.com/gtc/2015/presentation/S5649-Steve... · HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK .

41

TESLA PLATFORM ENABLES OPTIMIZATION

Scalable Nodes, ISA Choice

x86

NVLink

Page 42: HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND …on-demand.gputechconf.com/gtc/2015/presentation/S5649-Steve... · HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK .

IBM POWER CPU Most Powerful Serial Processor

NVIDIA NVLink Node Integration Interconnect

NVIDIA Volta GPU Most Powerful Parallel Processor

NVLINK-ENABLED HETEROGENEOUS NODE LOGICAL INTEGRATION + FLEXIBILITY + EFFICIENCY

80-200 GB/s

Page 43: HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND …on-demand.gputechconf.com/gtc/2015/presentation/S5649-Steve... · HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK .

43

CORAL SCALABLE HETEROGENEOUS NODE

Approximately 3,400 nodes, each with:

IBM POWER9 CPUs and multiple NVIDIA Tesla® Volta GPUs

CPUs and GPUs integrated on-node with high speed NVLink

Large coherent memory: over 512 GB (HBM + DDR4)

All directly addressable from the CPUs and GPUs

An additional 800 GB of NVRAM, burst buffer or as extended memory

Over 40 TF peak performance/node(!)

NVLink In Practice

Page 44: HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND …on-demand.gputechconf.com/gtc/2015/presentation/S5649-Steve... · HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK .

44

OPTIMIZED HETEROGENEOUS NODE

CORAL Application Performance Projections

Page 45: HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND …on-demand.gputechconf.com/gtc/2015/presentation/S5649-Steve... · HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK .

SUMMARY:

Heterogeneous acceleration is powerful and efficient Latency-optimized cores for serial and system work

Throughput-optimized cores for on-node parallel work

High-performance integration

NVLink provides a compelling node integration advantage Flexible configuration of resources in application-driven proportions

Scalable performance into the future

CORAL is awesome, you can buy nodes just like it. OpenPower is surprisingly affordable.

It’s just one example of optimized architecture for an application set and budget.

Page 46: HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND …on-demand.gputechconf.com/gtc/2015/presentation/S5649-Steve... · HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK .

NVIDIA REGISTERED DEVELOPER PROGRAMS Everything you need to develop with NVIDIA products

Membership is your first step in establishing a working relationship with NVIDIA Engineering

Exclusive access to pre-releases

Submit bugs and features requests

Stay informed about latest releases and training opportunities

Access to exclusive downloads

Exclusive activities and special offers

Interact with other developers in the NVIDIA Developer Forums

REGISTER FOR FREE AT: developer.nvidia.com

Page 47: HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND …on-demand.gputechconf.com/gtc/2015/presentation/S5649-Steve... · HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK .

THANK YOU