NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY


Transcript of NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY

Page 1: NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY

Peter Messmer

[email protected]

NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY

Page 2: NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY


COMPUTATIONAL CHALLENGES IN HEP

Low-Level Trigger · High-Level Trigger · Monte Carlo · Analysis · Lattice QCD

Page 3: NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY


COMPUTATIONAL CHALLENGES IN HEP

Low-Level Trigger · High-Level Trigger · Monte Carlo · Analysis:
Connectivity, Latency control, Power limited, Portability, Legacy codes, Limited parallelism, Geometry processing, Data volume, Computer vision, Unsupervised learning

Lattice QCD:
Connectivity, Low latency, Bandwidth, bandwidth, bandwidth, ...


Page 5: NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY


KEPLER ENABLES NVIDIA GPUDIRECT™ RDMA

[Diagram: Server 1 and Server 2 connected through a network. Each server holds a CPU with system memory, two GPUs with GDDR5 memory, and a network card, all attached via PCI-e. With GPUDirect RDMA, the network card reads and writes GPU memory directly, without staging through system memory.]

http://docs.nvidia.com/cuda/gpudirect-rdma
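
In practice, GPUDirect RDMA is usually consumed through a CUDA-aware MPI library rather than programmed directly. A minimal sketch, assuming an MPI build with CUDA support (e.g. MVAPICH2-GDR or Open MPI with CUDA enabled), where a device pointer is handed straight to MPI:

    // Sketch: exchanging a GPU-resident buffer with CUDA-aware MPI. With
    // GPUDirect RDMA enabled, the transfer can move NIC <-> GPU memory
    // directly, bypassing host staging buffers.
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int N = 1 << 20;
        float *d_buf;                        // device pointer, passed to MPI as-is
        cudaMalloc(&d_buf, N * sizeof(float));

        if (rank == 0)
            MPI_Send(d_buf, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(d_buf, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }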

Page 6: NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY

TESLA K40: WORLD'S FASTEST ACCELERATOR

FASTER: 1.4 TF | 2880 Cores | 288 GB/s

LARGER: 2x Memory (12 GB vs. 6 GB) Enables More Apps: Fluid Dynamics, Seismic Analysis, Rendering

SMARTER: GPU Boost Unlocks Extra Performance Using Power Headroom

[Bar chart: ns/day (0-5) for CPU, K20X, and K40 on the AMBER benchmark]

AMBER Benchmark: SPFP-Nucleosome. CPU: Dual E5-2687W @ 3.10 GHz, 64 GB System Memory, CentOS 6.2. GPU systems: Single Tesla K20X or Single Tesla K40.

Page 7: NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY

GPU BOOST ON TESLA K40

Convert Power Headroom to Higher Performance. Each workload runs within the same 235 W board power:

Workload #1 (worst case, reference app): Base Clock, 745 MHz
Workload #2 (e.g. AMBER): Boost Clock #1, 810 MHz
Workload #3 (e.g. ANSYS Fluent): Boost Clock #2, 875 MHz

Page 8: NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY

WORKLOAD BEHAVIOR WITH GPU BOOST

Non-Tesla GPUs:
• GPU clock switches automatically between Default, Boost, and Base
• Target duration for boost clocks: ~50% of run-time
• Boost interface: Control Panel

Tesla K40: Deterministic Clocks
• Preset options: lock to base clock, or pick one of 3 levels (Base Clock #1, Boost Clock #1, Boost Clock #2)
• Selected clock holds for 100% of workload run time: a must-have for HPC workloads
• Boost interface: NV-SMI, NVML

    nvidia-smi -q -d CLOCK,SUPPORTED_CLOCKS
    nvidia-smi -ac <MEM clock, Graphics clock>
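
The same application clocks can also be set programmatically through NVML. A minimal sketch, assuming the NVML header and library that ship with the driver/CUDA toolkit; the 3004/875 MHz pair is the K40 memory clock plus Boost Clock #2 and should be replaced with a pair reported as supported:

    // Sketch: query the current SM clock and request application clocks via
    // NVML (equivalent to `nvidia-smi -ac 3004,875`; requires admin rights).
    #include <stdio.h>
    #include <nvml.h>

    int main(void) {
        nvmlDevice_t dev;
        unsigned int smClock;

        nvmlInit();
        nvmlDeviceGetHandleByIndex(0, &dev);

        nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &smClock);
        printf("current SM clock: %u MHz\n", smClock);

        // Pin memory clock to 3004 MHz and graphics clock to 875 MHz
        nvmlDeviceSetApplicationsClocks(dev, 3004, 875);

        nvmlShutdown();
        return 0;
    }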

Page 9: NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY


TEGRA K1 IMPOSSIBLY ADVANCED

NVIDIA Kepler Architecture

4-Plus-1 Quad-Core A15

192 NVIDIA CUDA Cores

Compute Capability 3.2

326 GFLOPS

5 Watts

Page 10: NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY

JETSON TK1: THE WORLD'S 1st EMBEDDED SUPERCOMPUTER

Development Platform for Embedded Computer Vision, Robotics, Medical

192 Cores · 326 GFLOPS

CUDA Enabled

Available Now

Page 11: NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY


BATCHED 1D FFT PERFORMANCE ON JETSON

Low power processor on sensor

Host/Dev Bandwidth: 6 GB/s

Dev/Dev FFT up to 780 Msamples/s

Streaming FFT off host: 110 Msamples/s sustained

About 1/13th of K20 performance
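
A batched 1D FFT like the one benchmarked here is only a few calls in cuFFT. A minimal sketch; the transform length and batch count are illustrative, not the benchmark's actual configuration:

    // Sketch: a batch of 1024 complex single-precision 1D FFTs of length
    // 1024, planned once and executed with a single in-place call.
    #include <cufft.h>
    #include <cuda_runtime.h>

    int main(void) {
        const int n = 1024, batch = 1024;
        cufftComplex *d_data;
        cudaMalloc(&d_data, sizeof(cufftComplex) * n * batch);

        cufftHandle plan;
        cufftPlan1d(&plan, n, CUFFT_C2C, batch);            // one plan for the batch

        cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);  // in-place batched FFT
        cudaDeviceSynchronize();

        cufftDestroy(plan);
        cudaFree(d_data);
        return 0;
    }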

Page 12: NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY

COMPUTATIONAL CHALLENGES IN HEP

(Recap of the challenges overview from Page 3.)

Page 13: NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY

TESLA K40: WORLD'S FASTEST ACCELERATOR

(Recap of Page 6.)

Page 14: NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY

STRONG CUDA GPU ROADMAP

[Chart: SGEMM/W, normalized (0-20), vs. year (2008-2016), one point per architecture]

Tesla: CUDA
Fermi: FP64
Kepler: Dynamic Parallelism
Maxwell: DX12
Pascal: Unified Memory, 3D Memory, NVLink

Page 15: NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY

INTRODUCING NVLINK AND STACKED MEMORY

NVLink: GPU high-speed interconnect, 80-200 GB/s, with planned support for POWER CPUs

Stacked Memory: 4x higher bandwidth (~1 TB/s), 3x larger capacity, 4x more energy efficient per bit

Page 16: NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY

NVLINK ENABLES DATA TRANSFER AT THE SPEED OF CPU MEMORY

[Diagram: a Tesla GPU with stacked memory (HBM, 1 Terabyte/s) connected to the CPU over NVLink at 80 GB/s; the CPU's DDR4 memory delivers 50-75 GB/s, so the GPU link runs at CPU-memory speed.]

Page 17: NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY

UNIFIED MEMORY: DRAMATICALLY LOWER DEVELOPER EFFORT

[Diagram: Developer view today: separate System Memory and GPU Memory. Developer view with Unified Memory: one Unified Memory space visible to both CPU and GPU.]

Page 18: NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY

SUPER SIMPLIFIED MEMORY MANAGEMENT CODE

CPU Code:

    void sortfile(FILE *fp, int N) {
        char *data;
        data = (char *)malloc(N);

        fread(data, 1, N, fp);

        qsort(data, N, 1, compare);

        use_data(data);

        free(data);
    }

CUDA 6 Code with Unified Memory:

    void sortfile(FILE *fp, int N) {
        char *data;
        cudaMallocManaged(&data, N);

        fread(data, 1, N, fp);

        qsort<<<...>>>(data, N, 1, compare);
        cudaDeviceSynchronize();

        use_data(data);

        cudaFree(data);
    }

Page 19: NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY

SIMPLER MEMORY MODEL: ELIMINATE DEEP COPIES

[Diagram: CPU memory holds a dataElem struct (prop1, prop2, *text) whose text pointer references the string "Hello World"; GPU memory is still empty.]

    struct dataElem {
        int prop1;
        int prop2;
        char *text;
    };

Page 20: NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY

SIMPLER MEMORY MODEL: ELIMINATE DEEP COPIES

[Diagram: Two copies required: the dataElem struct and its "Hello World" string in CPU memory, plus a second copy of both in GPU memory, with the GPU copy's *text pointer fixed up to point at the GPU-side string.]

Page 21: NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY

SIMPLER MEMORY MODEL: ELIMINATE DEEP COPIES

[Diagram: the same two-copy picture; the code below performs the deep copy by hand.]

    void launch(dataElem *elem) {
        dataElem *g_elem;
        char *g_text;

        int textlen = strlen(elem->text);

        // Allocate storage for struct and text
        cudaMalloc(&g_elem, sizeof(dataElem));
        cudaMalloc(&g_text, textlen);

        // Copy up each piece separately, including
        // new "text" pointer value
        cudaMemcpy(g_elem, elem, sizeof(dataElem), cudaMemcpyHostToDevice);
        cudaMemcpy(g_text, elem->text, textlen, cudaMemcpyHostToDevice);
        cudaMemcpy(&(g_elem->text), &g_text, sizeof(g_text), cudaMemcpyHostToDevice);

        // Finally we can launch our kernel, but
        // CPU & GPU use different copies of "elem"
        kernel<<< ... >>>(g_elem);
    }

Page 22: NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY

SIMPLER MEMORY MODEL: ELIMINATE DEEP COPIES

[Diagram: With Unified Memory, a single copy of the dataElem struct and its "Hello World" string lives in unified memory, accessible from both CPU and GPU.]

    void launch(dataElem *elem) {
        kernel<<< ... >>>(elem);
    }

Page 23: NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY

UNIFIED MEMORY

Call Sort on CPU:

    void sortfile(FILE *in, FILE *out, int N) {
        char *data = (char *)malloc(N);
        fread(data, 1, N, in);
        sort(data, N);
        fwrite(data, 1, N, out);
        free(data);
    }

Call Sort on Kepler (opt-in managed allocator):

    void sortfile(FILE *in, FILE *out, int N) {
        char *data;
        cudaMallocManaged(&data, N);
        fread(data, 1, N, in);
        parallel_sort<<< ... >>>(data, N);
        cudaDeviceSynchronize();   // wait before touching data on the CPU
        fwrite(data, 1, N, out);
        cudaFree(data);
    }

Call Sort on Pascal (no need for an opt-in allocator):

    void sortfile(FILE *in, FILE *out, int N) {
        char *data = (char *)malloc(N);
        fread(data, 1, N, in);
        parallel_sort<<< ... >>>(data, N);
        cudaDeviceSynchronize();
        fwrite(data, 1, N, out);
        free(data);
    }

Memory management becomes a performance optimization.
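
One such optimization knob already exists in CUDA 6: attaching a managed allocation to a stream, so a running kernel does not block CPU access to unrelated managed data. A minimal sketch; the kernel and buffer names are illustrative:

    // Sketch: cudaStreamAttachMemAsync scopes a managed buffer to one stream,
    // letting the CPU keep working on other managed memory while a kernel runs.
    #include <cuda_runtime.h>

    __global__ void process(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    void run(float *managed_work, float *managed_other, int n) {
        cudaStream_t s;
        cudaStreamCreate(&s);

        // Scope managed_work to stream s; keep managed_other host-accessible
        cudaStreamAttachMemAsync(s, managed_work, 0, cudaMemAttachSingle);
        cudaStreamAttachMemAsync(s, managed_other, 0, cudaMemAttachHost);
        cudaStreamSynchronize(s);            // attachments take effect here

        process<<<(n + 255) / 256, 256, 0, s>>>(managed_work, n);

        managed_other[0] += 1.0f;            // legal while the kernel runs

        cudaStreamSynchronize(s);
        cudaStreamDestroy(s);
    }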

Page 24: NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY


C++11 IN CUDA 6.5

Experimental release in CUDA 6.5

nvcc -std=c++11 my_cpp11_code.cu

Support for all C++11 features offered by host compiler in host code

Currently no support for lambdas passed from host to device
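
A minimal sketch of what that means in practice (the kernel and values are illustrative): C++11 features such as auto work inside device code, and lambdas work in host code, but a host lambda cannot yet be handed to a kernel:

    // Compile with: nvcc -std=c++11 example.cu
    #include <cstdio>
    #include <algorithm>
    #include <cuda_runtime.h>

    __global__ void scale(float a, float *x, int n) {
        auto i = blockIdx.x * blockDim.x + threadIdx.x;  // C++11 auto in device code
        if (i < n) x[i] *= a;
    }

    int main() {
        float h[4] = {1.f, 2.f, 3.f, 4.f};
        // C++11 lambda in host code (cannot be passed to a kernel yet)
        std::for_each(h, h + 4, [](float &v) { v += 1.0f; });

        float *d;
        cudaMalloc(&d, sizeof(h));
        cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
        scale<<<1, 4>>>(2.0f, d, 4);
        cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
        cudaFree(d);

        printf("h[0] = %f\n", h[0]);   // (1+1)*2 = 4
        return 0;
    }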

Page 25: NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY

THRUST: STL-LIKE CUDA TEMPLATE LIBRARY

C++ STL features for CUDA.

GPU (device) and CPU (host) vector classes:

    thrust::host_vector<float> H(10, 1.f);
    thrust::device_vector<float> D = H;

Iterators:

    thrust::fill(D.begin(), D.begin() + 5, 42.f);
    float *raw_ptr = thrust::raw_pointer_cast(D.data());

Algorithms (sort, reduce, transform, scan, ...):

    thrust::transform(D1.begin(), D1.end(), D2.begin(), D2.begin(),
                      thrust::plus<float>()); // D2 = D1 + D2
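
Put together, a complete, self-contained sketch (the problem size is arbitrary):

    // Generate data on the host, copy it to the device, then sort and reduce
    // it, all without writing a single kernel.
    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/reduce.h>
    #include <cstdlib>
    #include <iostream>

    int main() {
        thrust::host_vector<float> H(1 << 20);
        for (size_t i = 0; i < H.size(); ++i)
            H[i] = rand() / (float)RAND_MAX;

        thrust::device_vector<float> D = H;              // host-to-device copy

        thrust::sort(D.begin(), D.end());                // parallel sort on the GPU
        float sum = thrust::reduce(D.begin(), D.end());  // parallel reduction

        std::cout << "sum = " << sum << std::endl;
        return 0;
    }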

Page 26: NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY

CUB: CUDA UNBOUND

Building blocks for kernels, delivered as a C++ header library.

Data-parallel primitives for thread blocks.

Collective primitives: scan, sort, histogram, ..., global memory transfer.

http://nvlabs.github.io/cub
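
As an example of a block-level collective, a minimal sketch with cub::BlockReduce (block size and data type are illustrative):

    // Each 128-thread block sums its 128 input ints with cub::BlockReduce.
    #include <cub/cub.cuh>

    __global__ void block_sums(const int *in, int *block_out) {
        typedef cub::BlockReduce<int, 128> BlockReduce;   // specialized per block size
        __shared__ typename BlockReduce::TempStorage temp_storage;

        int item = in[blockIdx.x * blockDim.x + threadIdx.x];
        int sum = BlockReduce(temp_storage).Sum(item);    // block-wide reduction

        if (threadIdx.x == 0)        // only thread 0 holds the valid result
            block_out[blockIdx.x] = sum;
    }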

Page 27: NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY

OPENACC DIRECTIVES

Easy, Open, Powerful

• Simple compiler hints
• Compiler parallelizes the code
• Works on multicore CPUs & many-core GPUs

Your original Fortran or C code; the directive is the hint, the marked loop nest runs on the GPU, the rest stays on the CPU:

    Program myscience
      ... serial code ...
      !$acc kernels
      do k = 1,n1
        do i = 1,n2
          ... parallel code ...
        enddo
      enddo
      !$acc end kernels
      ...
    End Program myscience

http://www.openacc.org

Page 28: NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY

OPENACC: OPEN, SIMPLE, PORTABLE

• Open standard
• Easy, compiler-driven approach
• Portable across GPUs and Xeon Phi

    main() {
      <serial code>
      #pragma acc kernels   /* compiler hint */
      {
        <compute intensive code>
      }
    }

Example: CAM-SE Climate. Top kernel is 50% of runtime; 6x faster on the GPU.
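
For a compilable version of the same pattern, a minimal OpenACC saxpy sketch (names and sizes are illustrative; build with an OpenACC compiler, e.g. pgcc -acc):

    // The single directive offloads the loop; data clauses describe movement.
    #include <stdio.h>

    void saxpy(int n, float a, float *x, float *y) {
        #pragma acc kernels copyin(x[0:n]) copy(y[0:n])
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

    int main(void) {
        enum { N = 1 << 20 };
        static float x[N], y[N];
        for (int i = 0; i < N; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        saxpy(N, 3.0f, x, y);

        printf("y[0] = %f\n", y[0]);   // expect 5.0
        return 0;
    }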

Page 29: NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY


ADDITIONS FOR OPENACC 2.0

Procedure calls (see the sketch after this list)

Separate compilation

Nested parallelism

Device-specific tuning, multiple devices

Data management features and global data

Multiple host thread support

Loop directive additions

Asynchronous behavior additions

New API routines for target platforms

(CUDA, OpenCL, Intel Coprocessor Offload Infrastructure)
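
The procedure-call support is exposed through the routine directive. A minimal sketch (function names are illustrative):

    // OpenACC 2.0 `routine`: a function compiled for the device (possibly in
    // a separate compilation unit) and called from inside a parallel loop.
    #pragma acc routine seq
    static float scale(float v) { return 2.0f * v; }

    void apply(float *a, int n) {
        #pragma acc parallel loop copy(a[0:n])
        for (int i = 0; i < n; ++i)
            a[i] = scale(a[i]);
    }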

Page 30: NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY

COMPUTATIONAL CHALLENGES IN HEP

(Recap of the challenges overview from Page 3.)

Page 31: NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY

CUDNN

Deep Neural Network Library

Pre-packaged kernels for convolution, pooling, softmax, activations, ...

developer.nvidia.com/cuDNN
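
A minimal sketch of the library's handle-and-descriptor style (dimensions are illustrative, and the exact call set varies by cuDNN version):

    // Create a cuDNN handle, describe a 4D activation tensor in NCHW layout,
    // then hand both to the routine you need (convolution, pooling, softmax, ...).
    #include <cudnn.h>

    int main(void) {
        cudnnHandle_t handle;
        cudnnCreate(&handle);

        cudnnTensorDescriptor_t x_desc;
        cudnnCreateTensorDescriptor(&x_desc);
        // batch = 32, channels = 16, height = width = 64
        cudnnSetTensor4dDescriptor(x_desc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                                   32, 16, 64, 64);

        // ... allocate device buffers and call, e.g., the pooling or softmax
        // forward routine with `handle` and `x_desc` here ...

        cudnnDestroyTensorDescriptor(x_desc);
        cudnnDestroy(handle);
        return 0;
    }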

Page 32: NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY


IF YOUR APPLICATION LOOKS LIKE THIS..

Page 33: NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY


.. YOU MIGHT BE INTERESTED IN OPTIX

• Ray-tracing framework

• Build your own RT application

• Generic Ray-Geometry interaction

• Rays with arbitrary payloads

• Multi-GPU support

Page 34: NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY

DIFFERENT PROGRAMS GET INVOKED FOR DIFFERENT RAYS

[Diagram: a ray launcher generates rays; depending on the intersection result, OptiX invokes the closest-hit program, the any-hit program, or the miss program.]

Page 35: NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY

OPTIX PRIME: LOW-LEVEL RAY TRACING API

• OptiX simplifies the implementation of RT apps: it manages memory, data transfers, etc.
• Sometimes that is overkill for simple scene queries, e.g. when you just need visibility of triangulated geometries
• OptiX Prime: a low-level tracing API for exactly those cases

Page 36: NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY

SUMMARY

Connectivity: GPUDirect RDMA

Memory limits are becoming a non-issue: large-memory boards, Unified Memory, NVLink

Portability via abstraction: Thrust, CUB, C++11, OpenACC

Frameworks and libraries for HEP tasks: OptiX, cuDNN, ...

GPUs are ready for HEP tasks!