Leveraging GPUs for Application Acceleration – Dan Ernst, Cray, Inc.

Page 1: Leveraging GPUs for  Application Acceleration Dan Ernst Cray, Inc.

Leveraging GPUs for Application Acceleration

Dan Ernst, Cray, Inc.

Page 2: Leveraging GPUs for  Application Acceleration Dan Ernst Cray, Inc.

2

Let’s Make a … Socket!

Your goal is to speed up your code as much as possible, BUT…

…you have a budget for Power...

Do you choose:
  1. 6 processors, each providing N performance, and using P Watts
  2. 450 processors, each providing N/10 performance, and collectively using 2P Watts
  3. It depends!

Page 3: Leveraging GPUs for  Application Acceleration Dan Ernst Cray, Inc.

3

Nvidia Fermi (Jan 2010)

~1.0 TFLOPS (SP) / ~500 GFLOPS (DP)
140+ GB/s DRAM bandwidth

ASCI Red – Sandia National Labs – 1997 (for comparison: the first machine to reach ~1 TFLOPS)

Page 4: Leveraging GPUs for  Application Acceleration Dan Ernst Cray, Inc.

4

Intel P4 Northwood

Page 5: Leveraging GPUs for  Application Acceleration Dan Ernst Cray, Inc.

5

NVIDIA GT200

Page 6: Leveraging GPUs for  Application Acceleration Dan Ernst Cray, Inc.

6

Why GPGPU Processing?

A quiet revolution:
  Calculation: TFLOPS vs. 100 GFLOPS
  Memory bandwidth: ~10x
  GPU in every PC – massive volume

[Figure 1.1: The enlarging performance gap between GPUs (many-core) and CPUs (multi-core).]

Courtesy: John Owens

Page 7: Leveraging GPUs for  Application Acceleration Dan Ernst Cray, Inc.

7

NVIDIA Tesla C2090 Card Specs

512 GPU cores at 1.30 GHz
Single precision floating point performance: 1331 GFLOPs (2 SP flops per clock per core)
Double precision floating point performance: 665 GFLOPs (1 DP flop per clock per core)
Internal RAM: 6 GB GDDR5
Internal RAM speed: 177 GB/sec (compared to ~30 GB/sec for regular host RAM)
Has to be plugged into a PCIe slot (at most 8 GB/sec)

Page 8: Leveraging GPUs for  Application Acceleration Dan Ernst Cray, Inc.

8

NVIDIA Tesla S2050 Server Specs

4 C2050 cards inside a 1U server (looks like a typical CPU node)
1.15 GHz
Single precision (SP) floating point performance: 4121.6 GFLOPs
Double precision (DP) floating point performance: 2060.8 GFLOPs
Internal RAM: 12 GB total (3 GB per GPU card)
Internal RAM speed: 576 GB/sec aggregate
Has to be plugged into two PCIe slots (at most 16 GB/sec)

Page 9: Leveraging GPUs for  Application Acceleration Dan Ernst Cray, Inc.

9

Compare x86 vs S2050

Let's compare a good dual-socket x86 server today vs. the S2050.

                               Dual-socket AMD 2.3 GHz 12-core    NVIDIA Tesla S2050
  Peak DP FLOPs                220.8 GFLOPs                       2060.8 GFLOPs (9.3x)
  Peak SP FLOPs                441.6 GFLOPs                       4121.6 GFLOPs (9.3x)
  Peak RAM BW                  25 GB/sec                          576 GB/sec (23x)
  Peak PCIe BW                 N/A                                16 GB/sec
  Needs x86 server to attach?  No                                 Yes
  Power/Heat                   ~450 W                             ~900 W + ~400 W (~2.9x)
  Code portable?               Yes                                No (CUDA); Yes (PGI, OpenCL)

Page 10: Leveraging GPUs for  Application Acceleration Dan Ernst Cray, Inc.

10

Compare x86 vs S2050

Here are some interesting measures:

                               Dual-socket AMD 2.3 GHz 12-core    NVIDIA Tesla S2050
  DP GFLOPs/Watt               ~0.5                               ~1.6 (~3x)
  SP GFLOPs/Watt               ~1                                 ~3.2 (~3x)
  DP GFLOPs/sq ft              ~590                               ~2750 (4.7x)
  SP GFLOPs/sq ft              ~1180                              ~5500 (4.7x)
  Racks per PFLOP DP           142                                32 (23%)
  Racks per PFLOP SP           71                                 16 (23%)

OU’s Sooner is 34.5 TFLOPs DP, which is just over 1 rack of S2050.

Page 11: Leveraging GPUs for  Application Acceleration Dan Ernst Cray, Inc.

11

These Are Raw Numbers

Do they bear out in practice?

Tianhe-1 – hybrid (GPU-heavy) machine: 55% of peak on HPL

Jaguar – CPU-based machine: 75% of peak on HPL

Page 12: Leveraging GPUs for  Application Acceleration Dan Ernst Cray, Inc.

Stone et al., "Overset Grid/Gridless Methods for Fuselage and Rotor Wakes"

Results

But they do bear out more fully on some applications.

Many of these applications are in computational science and engineering.

Page 13: Leveraging GPUs for  Application Acceleration Dan Ernst Cray, Inc.

13

Previous GPGPU Constraints

Dealing with the graphics API: to get general-purpose code working, you had to use the corner cases of the graphics API.

Essentially, you re-wrote the entire program as a collection of shaders and polygons.

[Diagram of the fragment-shader pipeline: input registers → fragment program (with constants, texture, and temp registers) → output registers → FB memory; resources are scoped per thread, per shader, and per context.]

Page 14: Leveraging GPUs for  Application Acceleration Dan Ernst Cray, Inc.

14

CUDA

"Compute Unified Device Architecture"

General-purpose programming model:
  User kicks off batches of threads on the GPU
  GPU = dedicated super-threaded, massively data-parallel co-processor

Targeted software stack:
  Compute-oriented drivers, language, and tools
  Driver for loading computational programs onto the GPU

Page 15: Leveraging GPUs for  Application Acceleration Dan Ernst Cray, Inc.

15

Overview

CUDA programming model: basic concepts and data types

CUDA application programming interface (API) basics

A couple of simple examples

Some performance issues will be in session #2 (3:30-5pm)

Page 16: Leveraging GPUs for  Application Acceleration Dan Ernst Cray, Inc.

16

CUDA Devices and Threads

A CUDA compute device:
  Is a coprocessor to the CPU (host)
  Has its own DRAM (device memory)
  Runs many threads in parallel
  Is typically a GPU but can also be another type of parallel processing device

Data-parallel portions of an application are expressed as device kernels, which run on many threads.

Differences between GPU and CPU threads:
  GPU threads are extremely lightweight – very little creation overhead
  A GPU needs 1000s of threads for full efficiency; a multi-core CPU needs only a few (and is hurt by having too many)

Page 17: Leveraging GPUs for  Application Acceleration Dan Ernst Cray, Inc.

17

CUDA – C with a Co-processor

One program, two devices:
  Serial or modestly parallel parts in host C code
  Highly parallel parts in device kernel C code

Execution alternates between the two (a fuller sketch follows):

  Serial Code (host)
     . . .
  Parallel Kernel (device):  KernelA<<< nBlk, nTid >>>(args);
  Serial Code (host)
     . . .
  Parallel Kernel (device):  KernelB<<< nBlk, nTid >>>(args);
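As a concrete (but hypothetical) sketch of this structure – the kernels, the array d_data, and the sizes below are placeholders, not code from the slides:

// Host code alternates serial work with parallel kernel launches.
__global__ void KernelA(float *d_data, int n) { /* highly parallel work */ }
__global__ void KernelB(float *d_data, int n) { /* more parallel work   */ }

int main(void)
{
    const int n    = 1 << 20;
    const int nTid = 256;                     // threads per block
    const int nBlk = (n + nTid - 1) / nTid;   // blocks in the grid

    float *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(float));

    /* ... serial host code: read input, initialize d_data ... */

    KernelA<<<nBlk, nTid>>>(d_data, n);       // parallel phase on the device

    /* ... more serial host code ... */

    KernelB<<<nBlk, nTid>>>(d_data, n);       // another parallel phase

    cudaDeviceSynchronize();                  // wait for the device to finish
    cudaFree(d_data);
    return 0;
}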

Page 18: Leveraging GPUs for  Application Acceleration Dan Ernst Cray, Inc.

18

Buzzword: Kernel

In CUDA, a kernel is code (typically a function) that can be run on the GPU.

The kernel code runs on many of the stream processors in the GPU in parallel. Each processor runs the same code over different data (SPMD).

Page 19: Leveraging GPUs for  Application Acceleration Dan Ernst Cray, Inc.

19

Buzzword: Thread

In CUDA, a thread is an execution of a kernel with a given index. Each thread uses its index to access a specific subset of the data, such that the collection of all threads cooperatively processes the entire data set.

Think: MPI process ID.

These operate very much like threads in OpenMP; they even have shared and private variables.

So what's the difference with CUDA? Threads are free.

  threadID:  0  1  2  3  4  5  6  7
  …
  float x = input[threadID];
  float y = func(x);
  output[threadID] = y;
  …
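Fleshing that snippet out into a complete kernel (a sketch; the kernel name, func, and the array names are hypothetical), each thread handles the one element selected by its index:

__device__ float func(float x) { return 2.0f * x + 1.0f; }   // placeholder function

__global__ void apply_func(const float *input, float *output, int n)
{
    // A unique global index per thread, built from block and thread IDs.
    int threadID = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadID < n) {
        float x = input[threadID];
        float y = func(x);
        output[threadID] = y;
    }
}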

Page 20: Leveraging GPUs for  Application Acceleration Dan Ernst Cray, Inc.

20

Buzzword: Block

In CUDA, a block is a group of threads.

Blocks are used to organize threads into manageable (and schedulable) chunks.
  Can organize threads in 1D, 2D, or 3D arrangements
  What best matches your data?
  Some restrictions, based on hardware

Threads within a block can do a bit of synchronization, if necessary (see the sketch below).
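A minimal sketch of that block-level synchronization (the kernel is hypothetical and assumes it is launched with 256 threads per block): threads stage data into shared memory, then meet at a __syncthreads() barrier before reading what their neighbors wrote.

__global__ void reverse_in_block(float *data)
{
    __shared__ float tile[256];              // one tile per block; assumes blockDim.x == 256
    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;

    tile[t] = data[i];                       // each thread stages one element
    __syncthreads();                         // barrier: wait for every thread in the block

    data[i] = tile[blockDim.x - 1 - t];      // safe: the whole tile has been written
}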

Page 21: Leveraging GPUs for  Application Acceleration Dan Ernst Cray, Inc.

21

Buzzword: Grid

In CUDA, a grid is a group of blocks; there is no synchronization at all between the blocks.

Grids are used to organize blocks into manageable (and schedulable) chunks.
  Can organize blocks in 1D or 2D arrangements
  What best matches your data?

A grid is the set of threads created by a single call to a CUDA kernel.

Page 22: Leveraging GPUs for  Application Acceleration Dan Ernst Cray, Inc.

Mapping Buzzwords to Hardware

Grids map to GPUs.

Blocks map to the MultiProcessors (MPs); blocks are never split across MPs, but an MP can have multiple blocks.

Threads map to Stream Processors (SPs).

Warps are groups of (32) threads that execute simultaneously – completely forget these exist until you get good at this.

Image source: NVIDIA CUDA Programming Guide

Page 23: Leveraging GPUs for  Application Acceleration Dan Ernst Cray, Inc.

23

GeForce 8800 (2007)

16 highly threaded SMs, >128 FPUs, 367 GFLOPS, 768 MB DRAM, 86.4 GB/s memory bandwidth, 4 GB/s bandwidth to the CPU

[Block diagram: host → input assembler → thread execution manager feeding 16 SMs, each with a parallel data cache and texture unit, plus load/store units, all sharing global memory.]

Page 24: Leveraging GPUs for  Application Acceleration Dan Ernst Cray, Inc.

24

Transparent Scalability

Hardware is free to assign blocks to any SM (processor).
A kernel scales across any number of parallel processors.
Each block can execute in any order relative to other blocks.

[Diagram: the same kernel grid of 8 blocks (Block 0 … Block 7) runs two blocks at a time on a 2-SM device and four at a time on a 4-SM device.]

Page 25: Leveraging GPUs for  Application Acceleration Dan Ernst Cray, Inc.

25

Block IDs and Thread IDs

Each thread uses IDs to decide what data to work on:
  blockIdx: 1D or 2D
  threadIdx: 1D, 2D, or 3D

This simplifies memory addressing when processing multidimensional data:
  Image processing
  Solving PDEs on volumes
  … (a 2D indexing sketch follows)

[Figure 3.2 (courtesy NVIDIA): an example of CUDA thread organization – the host launches Kernel 1 into Grid 1 and Kernel 2 into Grid 2; Grid 1 is a 2×2 arrangement of blocks, and Block (1,1) is expanded to show its threads indexed as Thread (x, y, z).]
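A sketch of 2D indexing for, say, image processing (the kernel and image layout are hypothetical): each thread derives a (row, col) pair from its block and thread IDs and touches one pixel of a row-major image.

__global__ void brighten(unsigned char *image, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // x position in the image
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // y position in the image

    if (row < height && col < width) {
        int idx = row * width + col;                   // row-major 2D -> 1D address
        image[idx] = (unsigned char)min(255, image[idx] + 10);
    }
}

// Launched with a 2D grid of 2D blocks, e.g.:
//   dim3 block(16, 16);
//   dim3 grid((width + 15) / 16, (height + 15) / 16);
//   brighten<<<grid, block>>>(d_image, width, height);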

Page 26: Leveraging GPUs for  Application Acceleration Dan Ernst Cray, Inc.

26

CUDA Memory Model Overview

Global memory:
  Main means of communicating R/W data between host and device
  Contents visible to all threads
  Long latency access

We will focus on global memory for now; other memories will come later.

[Diagram of the CUDA memory model: the host talks to the device's global memory; within a grid, each block has its own shared memory, and each thread has its own registers. Note: this is the programming model, not hardware!]

Page 27: Leveraging GPUs for  Application Acceleration Dan Ernst Cray, Inc.

27

CUDA Device Memory Allocation

cudaMalloc()
  Allocates an object in device global memory
  Requires two parameters:
    Address of a pointer to the allocated object
    Size of the allocated object

cudaFree()
  Frees an object from device global memory
  Takes the pointer to the freed object

Page 28: Leveraging GPUs for  Application Acceleration Dan Ernst Cray, Inc.

28

CUDA Device Memory Allocation

Code example: allocate a 64 × 64 single-precision float array and attach the allocated storage to a pointer named Md ("d" is often used in naming to indicate a device data structure):

int TILE_WIDTH = 64;
float *Md;
int size = TILE_WIDTH * TILE_WIDTH * sizeof(float);

cudaMalloc((void **)&Md, size);

cudaFree(Md);
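CUDA API calls return a cudaError_t, so it is worth checking the result. A sketch (this error-handling style is an assumption, not from the slides, and needs <stdio.h> and <stdlib.h>):

cudaError_t err = cudaMalloc((void **)&Md, size);
if (err != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
    exit(EXIT_FAILURE);
}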

Page 29: Leveraging GPUs for  Application Acceleration Dan Ernst Cray, Inc.

29

The Physical Reality Behind CUDA

[Diagram: CPU (host) connected to a GPU with its own local DRAM (the device).]

Page 30: Leveraging GPUs for  Application Acceleration Dan Ernst Cray, Inc.

30

CUDA Host-Device Data Transfer

cudaMemcpy()
  Memory data transfer
  Requires four parameters:
    Pointer to destination
    Pointer to source
    Number of bytes copied
    Type of transfer: host to host, host to device, device to host, device to device
  Asynchronous transfers are also available (see the sketch below)
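cudaMemcpy() itself blocks the host until the copy completes; the asynchronous variant is cudaMemcpyAsync(), which queues the copy on a stream. A sketch (the buffer names are hypothetical, and the host buffer must be pinned for the copy to actually overlap with host work):

cudaStream_t stream;
cudaStreamCreate(&stream);

float *h_buf;
cudaMallocHost((void **)&h_buf, size);        // pinned host memory

/* ... fill h_buf ... */

cudaMemcpyAsync(Md, h_buf, size, cudaMemcpyHostToDevice, stream);
/* ... the host can do other work here while the copy proceeds ... */
cudaStreamSynchronize(stream);                // wait for the copy to finish

cudaFreeHost(h_buf);
cudaStreamDestroy(stream);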


Page 31: Leveraging GPUs for  Application Acceleration Dan Ernst Cray, Inc.

31

CUDA Host-Device Data Transfer

Code example: transfer a 64 × 64 single-precision float array. M is in host memory and Md is in device memory; cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are symbolic constants.

cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);

cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);

Page 32: Leveraging GPUs for  Application Acceleration Dan Ernst Cray, Inc.

CUDA Kernel Template

In C:

void foo(int a, float b)
{
    // slow code goes here
}

In CUDA C:

__global__ void foo(int a, float b)
{
    // fast code goes here!
}
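Calling the CUDA version then needs an execution configuration (the <<< >>> syntax covered next); the grid and block sizes here are just illustrative:

int nBlocks = 32, nThreadsPerBlock = 256;     // illustrative sizes
foo<<<nBlocks, nThreadsPerBlock>>>(a, b);     // runs 32 * 256 = 8192 threads of foo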

Page 33: Leveraging GPUs for  Application Acceleration Dan Ernst Cray, Inc.

33

Calling a Kernel Function

A kernel function must be called with an execution configuration:

dim3 DimGrid(100, 50); // 5000 thread blocks

dim3 DimBlock(4, 8, 8); // 256 threads per block

KernelFunc(...); // invoke a function

Page 34: Leveraging GPUs for  Application Acceleration Dan Ernst Cray, Inc.

34

Calling a Kernel Function

A kernel function must be called with an execution configuration:

dim3 DimGrid(100, 50); // 5000 thread blocks

dim3 DimBlock(4, 8, 8); // 256 threads per block

KernelFunc(...); // invoke a function

Declare the dimensions for grid/blocks

Page 35: Leveraging GPUs for  Application Acceleration Dan Ernst Cray, Inc.

35

Calling a Kernel Function

A kernel function must be called with an execution configuration:

dim3 DimGrid(100, 50); // 5000 thread blocks

dim3 DimBlock(4, 8, 8); // 256 threads per block

KernelFunc<<<DimGrid, DimBlock>>>(...);

Any call to a kernel function is asynchronous; explicit synchronization is needed for blocking (see the sketch below).

Declare the dimensions for grid/blocks
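A sketch of that explicit synchronization, reusing the configuration above (args stands in for the kernel's real arguments):

dim3 DimGrid(100, 50);    // 5000 thread blocks
dim3 DimBlock(4, 8, 8);   // 256 threads per block

KernelFunc<<<DimGrid, DimBlock>>>(args);   // returns to the host immediately
cudaDeviceSynchronize();                   // block the host until the kernel has finished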

Page 36: Leveraging GPUs for  Application Acceleration Dan Ernst Cray, Inc.

36

C SAXPY

void saxpy_serial(int n, float a, float *x, float *y)
{
    int i;
    for (i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}

// invoke the "kernel"
saxpy_serial(n, 2.0, x, y);

Page 37: Leveraging GPUs for  Application Acceleration Dan Ernst Cray, Inc.

37

SAXPY on a GPU

Doing anything across an entire vector is perfect for massively parallel (GPGPU) computing.

Instead of one function looping over the data set, we'll use many threads, each doing one calculation:

  threadID:  0  1  2  3  4  5  6  7
  …
  y[tid] = a*x[tid] + y[tid];
  …

Page 38: Leveraging GPUs for  Application Acceleration Dan Ernst Cray, Inc.

38

CUDA SAXPY

__global__ void saxpy_cuda(int n, float a, float *x, float *y)
{
    int i = (blockIdx.x * blockDim.x) + threadIdx.x;

    if (i < n)
        y[i] = a * x[i] + y[i];
}

int nblocks = (n + 255) / 256;

// invoke the kernel with 256 threads per block
saxpy_cuda<<<nblocks, 256>>>(n, 2.0, x, y);
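The slide shows only the kernel and its launch; a host-side driver around it might look like this sketch (the array sizes and names are assumptions): allocate device copies, copy the inputs over, launch, and copy the result back.

int n = 1 << 20;
size_t size = n * sizeof(float);
float *x = (float *)malloc(size), *y = (float *)malloc(size);
/* ... fill x and y ... */

float *d_x, *d_y;
cudaMalloc((void **)&d_x, size);
cudaMalloc((void **)&d_y, size);
cudaMemcpy(d_x, x, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, size, cudaMemcpyHostToDevice);

int nblocks = (n + 255) / 256;
saxpy_cuda<<<nblocks, 256>>>(n, 2.0f, d_x, d_y);

cudaMemcpy(y, d_y, size, cudaMemcpyDeviceToHost);   // result comes back in y

cudaFree(d_x);  cudaFree(d_y);
free(x);        free(y);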

Page 39: Leveraging GPUs for  Application Acceleration Dan Ernst Cray, Inc.

39

SAXPY is Pretty Obvious

What kinds of codes are good for GPGPU acceleration?

What kinds of codes are bad?

Page 40: Leveraging GPUs for  Application Acceleration Dan Ernst Cray, Inc.

40

Performance: How Much Is Enough?(CPU Edition)

Could I be getting better performance?
  Probably a little bit. Most of the performance is handled in HW.

How much better?
  If you compile with –O3, you can get faster (maybe 2x).
  If you are careful about tiling your memory accesses, you can get faster on codes that benefit from that (maybe 2–3x); a sketch of the idea follows.

Is that much performance worth the work?
  Compiling with optimizations is a no-brainer (and yet…).
  Tiling is useful, but takes an investment.
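As a rough illustration of the tiling idea (not from the slides; the matrix multiply and tile size are hypothetical), blocking the loops keeps a small working set in cache so data gets reused before it is evicted:

#define TILE 32

/* C += A * B for n x n row-major matrices; C must be zeroed first.
   The loops are blocked so each TILE x TILE piece is reused while it is in cache. */
void matmul_tiled(int n, const float *A, const float *B, float *C)
{
    for (int ii = 0; ii < n; ii += TILE)
      for (int jj = 0; jj < n; jj += TILE)
        for (int kk = 0; kk < n; kk += TILE)
          for (int i = ii; i < ii + TILE && i < n; i++)
            for (int j = jj; j < jj + TILE && j < n; j++) {
                float sum = C[i * n + j];
                for (int k = kk; k < kk + TILE && k < n; k++)
                    sum += A[i * n + k] * B[k * n + j];
                C[i * n + j] = sum;
            }
}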

Page 41: Leveraging GPUs for  Application Acceleration Dan Ernst Cray, Inc.

41

Performance: How Much Is Enough?(GPGPU Edition)

Could I be getting better performance?
  Am I getting near peak GFLOP performance?

How much better?
  Brandon's particle code, using several different code modifications, went from 148 ms per time step to 4 ms per time step.

Is that much worth the work?
  How much work would you do for a 30–40x speedup?
  Most of the modifications are fairly straightforward – you just need to know a bit more about how the hardware works.

Page 42: Leveraging GPUs for  Application Acceleration Dan Ernst Cray, Inc.

42

What's Limiting My Code?

Am I bandwidth bound? (How do I tell?)
  Make sure I have high thread occupancy to tolerate latencies – those extra threads can get some work done while we wait for memory.
  Move re-used values to closer memories: shared, constant/texture.

If I'm not bandwidth bound – what is now my limit?
  Take a closer look at the instruction stream:
    Unroll loops
    Minimize branch divergence (a sketch follows)
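A small sketch of branch divergence (hypothetical kernels, not from the slides): when threads in the same warp take different branches, the warp serializes; restructuring toward a branch-free (predicated) form avoids the split.

// Divergent: odd and even lanes of a warp take different paths.
__global__ void divergent(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (i % 2 == 0) d[i] = d[i] * 2.0f;
        else            d[i] = d[i] + 1.0f;
    }
}

// Branch-free formulation of the same update (the compiler often predicates
// simple cases like this automatically, but the idea carries to bigger branches).
__global__ void uniform(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float is_even = (i % 2 == 0) ? 1.0f : 0.0f;
        d[i] = is_even * (d[i] * 2.0f) + (1.0f - is_even) * (d[i] + 1.0f);
    }
}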