[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Lecture #6: CUDA Ninja Tricks | March 1st, 2011

Nicolas Pinto (MIT, Harvard)

Massively Parallel Computing, CS 264 / CSCI E-292

GPU “Scripting”, Meta-programming, Auto-tuning



During this course, we’ll try to “ ” and use existing material ;-)

(adapted for CS264)

Today: yey!!

Outline

1. Scripting GPUs with PyCUDA

2. Meta-programming and RTCG

3. Case study in brain-inspired AI


Why do Scripting for GPUs?

GPUs are everything that scripting languages are not:

• Highly parallel
• Very architecture-sensitive
• Built for maximum compute/memory throughput

→ complement each other

CPU: largely restricted to control tasks (∼1000/sec)

• Scripting fast enough

Realize a promise: Use Scripting...

• from first prototype
• to full-scale production code.

Nicolas Pinto (MIT) and Andreas Klockner (Brown), PyCuda Tutorial; slide by Andreas Klockner (NYU)


Python + CUDA = PyCUDA

Python + OpenCL = PyOpenCL



How are High-Performance Codes constructed?

“Traditional” construction of high-performance codes:

• C/C++/Fortran
• Libraries

“Alternative” construction of high-performance codes:

• Scripting for ‘brains’
• GPUs for ‘inner loops’

Play to the strengths of each programming environment.



Scripting: Python

One example of a scripting language: Python

Mature

Large and active community

Emphasizes readability

Written in widely-portable C

A ‘multi-paradigm’ language

Rich ecosystem of sci-comp related software



Scripting Languages

Python:

is discoverable and interactive.

has comprehensive built-in functionality.

manages resources automatically.

uses run-time typing.

works well for “gluing” lower-level blocks together.



Scripting: Goals

Scripting languages aim to reduce the load on the programmer:

Reduce required knowledge

Encourage experimentation

Eliminate sources of error

Encourage abstraction wherever possible

Value programmer time over computer time

Think about the tools you use. Use the right tool for the job.





Scripting: Speed

Usual answer to the “Speed Question”: Hybrid (“mixed”) Code.

• Plays to the strengths of each language.
• But: Introduces (some) complexity.

Observation: GPU code is already hybrid.

Consequence: No added complexity through hybrid code.



Whetting your appetite

    import pycuda.driver as cuda
    import pycuda.autoinit, pycuda.compiler
    import numpy

    a = numpy.random.randn(4,4).astype(numpy.float32)
    a_gpu = cuda.mem_alloc(a.nbytes)
    cuda.memcpy_htod(a_gpu, a)

[This is examples/demo.py in the PyCUDA distribution.]




    mod = pycuda.compiler.SourceModule("""
        __global__ void twice(float *a)
        {
            int idx = threadIdx.x + threadIdx.y*4;
            a[idx] *= 2;
        }
        """)

    func = mod.get_function("twice")
    func(a_gpu, block=(4,4,1))

    a_doubled = numpy.empty_like(a)
    cuda.memcpy_dtoh(a_doubled, a_gpu)
    print(a_doubled)
    print(a)

Compute kernel




Whetting your appetite, Part II

Did somebody say “Abstraction is good”?



    import numpy
    import pycuda.autoinit
    from pycuda import gpuarray

    a_cpu = numpy.random.randn(4,4).astype(numpy.float32)
    b_cpu = numpy.random.randn(4,4).astype(numpy.float32)
    c_cpu = a_cpu * b_cpu

    a_gpu = gpuarray.to_gpu(a_cpu)
    b_gpu = gpuarray.to_gpu(b_cpu)
    c_gpu = (a_gpu * b_gpu).get()

    print(c_cpu - c_gpu)



Remember me?

    // trivia
    #include <stdio.h>
    #include <stdlib.h>   // malloc, free, abort

    #define CUDA_CHK(NAME, ARGS) { \
      cudaError_t cuda_err_code = NAME ARGS; \
      if (cuda_err_code != cudaSuccess) { \
        printf("%s failed with code %d\n", #NAME, cuda_err_code); \
        abort(); \
      } \
    }
    // end

    // kernel
    __global__ void square_array(float *a, float *b, int n)
    {
      int i = (blockIdx.x * blockDim.y + threadIdx.y)
        * blockDim.x + threadIdx.x;
      if (i < n)
        a[i] = a[i] * b[i];
    }
    // end

    // main1
    int main()
    {
      cudaSetDevice(0); // EDIT ME

      const int n = 4096;

      float *a_host = (float *) malloc(n*sizeof(float));
      float *b_host = (float *) malloc(n*sizeof(float));

      float *a_device, *b_device;
      CUDA_CHK(cudaMalloc, ((void **) &a_device, n*sizeof(float)));
      CUDA_CHK(cudaMalloc, ((void **) &b_device, n*sizeof(float)));
      // end

      // main2
      for (int i = 0; i < n; i++) { a_host[i] = i; b_host[i] = i+1; }

      CUDA_CHK(cudaMemcpy, (a_device, a_host, n*sizeof(float),
            cudaMemcpyHostToDevice));
      CUDA_CHK(cudaMemcpy, (b_device, b_host, n*sizeof(float),
            cudaMemcpyHostToDevice));

      dim3 block_dim(16, 16);
      int block_size = block_dim.x*block_dim.y;
      int n_blocks = (n + block_size-1) / block_size;
      square_array<<<n_blocks, block_dim>>>(a_device, b_device, n);
      // end

      // main3
      CUDA_CHK(cudaMemcpy, (a_host, a_device, n*sizeof(float),
            cudaMemcpyDeviceToHost));

      for (int i = 0; i < n; i++)
        printf("%.0f ", a_host[i]);
      puts("\n");

      // release both host buffers and both device buffers
      free(a_host);
      free(b_host);
      CUDA_CHK(cudaFree, (a_device));
      CUDA_CHK(cudaFree, (b_device));
    }
    // end
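The launch arithmetic in the listing above (ceiling division for the block count, plus the flattened thread index) can be checked on the CPU. The following is a minimal sketch in plain Python; the helper names `n_blocks_for` and `flat_index` are made up for illustration and are not part of CUDA or PyCUDA.

```python
def n_blocks_for(n, block_size):
    """Ceiling division: smallest block count whose threads cover n elements."""
    return (n + block_size - 1) // block_size


def flat_index(block_idx_x, block_dim, thread_idx):
    """Mirror of the kernel's index expression for a 1D grid of 2D blocks."""
    block_dim_x, block_dim_y = block_dim
    thread_x, thread_y = thread_idx
    return (block_idx_x * block_dim_y + thread_y) * block_dim_x + thread_x


n = 4096
block_dim = (16, 16)
block_size = block_dim[0] * block_dim[1]
n_blocks = n_blocks_for(n, block_size)

# Enumerate every thread of every block and collect the index it would touch.
indices = sorted(
    flat_index(b, block_dim, (tx, ty))
    for b in range(n_blocks)
    for ty in range(block_dim[1])
    for tx in range(block_dim[0])
)
covered = [i for i in indices if i < n]  # the kernel's `if (i < n)` guard
```

Each element is visited by exactly one thread, and when `n` is not a multiple of the block size the guard discards the surplus threads of the last block.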



PyCUDA Philosophy

Provide complete access

Automatically manage resources

Provide abstractions

Check for and report errors automatically

Full documentation

Integrate tightly with numpy



PyCuda: Workflow

[Diagram: Edit → Run → SourceModule("...") → compiler cache (on a miss: nvcc → .cubin) → Upload to GPU → Run on GPU; everything after Run happens inside PyCuda]
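The caching step in this workflow can be illustrated with a small content-addressed cache. This is only a sketch of the idea, not PyCUDA's actual implementation: `CompileCache` and `fake_compile` are hypothetical names, and `fake_compile` stands in for an nvcc invocation.

```python
import hashlib


class CompileCache:
    """Hypothetical cache: maps a hash of (options, source) to a compiled binary."""

    def __init__(self, compile_fn):
        self._compile_fn = compile_fn
        self._binaries = {}
        self.hits = 0
        self.misses = 0

    def get(self, source, options=()):
        key = hashlib.sha256(
            ("\0".join(options) + "\0" + source).encode("utf-8")
        ).hexdigest()
        if key in self._binaries:
            self.hits += 1
        else:
            # Only unseen source text pays the compilation cost.
            self.misses += 1
            self._binaries[key] = self._compile_fn(source, options)
        return self._binaries[key]


def fake_compile(source, options):
    # Placeholder for running the real compiler (assumption for this sketch).
    return ("cubin for %d bytes of source" % len(source)).encode("ascii")


cache = CompileCache(fake_compile)
kernel_src = "__global__ void twice(float *a) { a[threadIdx.x] *= 2; }"
first = cache.get(kernel_src)                        # miss: compiled
second = cache.get(kernel_src)                       # hit: reused
third = cache.get(kernel_src.replace("2", "3"))      # different source: miss
```

Keying on the source text is what makes run-time code generation cheap in practice: regenerating an identical kernel string costs a dictionary lookup rather than a compiler run.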



Automatic Cleanup

• Reachable objects (memory, streams, ...) are never destroyed.
• Once unreachable, released at an unspecified future time.
• Scarce resources (memory) can be explicitly freed. (obj.free())
• Correctly deals with multiple contexts and dependencies.
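The cleanup rules above resemble ordinary Python finalization. Below is a minimal sketch of the pattern using the standard `weakref.finalize`; `DeviceAllocation` is a hypothetical stand-in for a GPU memory handle and does not reflect how PyCUDA is implemented internally.

```python
import weakref

released = []  # records which (pretend) device allocations were released


class DeviceAllocation:
    """Hypothetical stand-in for a GPU memory handle."""

    def __init__(self, name, nbytes):
        self.name = name
        self.nbytes = nbytes
        # Runs at most once: on explicit free() or when the object is collected.
        self._finalizer = weakref.finalize(self, released.append, name)

    def free(self):
        """Explicitly release a scarce resource right now."""
        self._finalizer()

    @property
    def alive(self):
        return self._finalizer.alive


a = DeviceAllocation("a", 1024)
b = DeviceAllocation("b", 2048)

a.free()   # explicit release
a.free()   # harmless: the finalizer has already run
del b      # now unreachable; released whenever the runtime collects it
```

In CPython the `del b` line triggers the release immediately because of reference counting, but the code makes no promise about timing, which matches the "unspecified future time" wording of the slide.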



gpuarray: Simple Linear Algebra

pycuda.gpuarray: Meant to look and feel just like numpy.

• gpuarray.to_gpu(numpy_array)
• numpy_array = gpuarray.get()
• No: nd indexing, slicing, etc. (yet!)
• Yes: +, -, *, /, fill, sin, exp, rand, take, ...
• Random numbers using pycuda.curandom
• Mixed types (int32 + float32 = float64)
• print gpuarray for debugging.
• Memory behind gpuarray available as .gpudata attribute.
  - Use as kernel arguments, textures, etc.



What’s this “numpy”, anyway?

Numpy: package for large, multi-dimensional arrays.

• Vectors, Matrices, ...
• A+B, sin(A), dot(A,B)
• la.solve(A, b), la.eig(A)
• cube[:, :, n-k:n+k], cube+5

All much faster than functional equivalents in Python.

“Python’s MATLAB”: Basis for SciPy, plotting, ...



gpuarray: Elementwise expressions

Avoiding extra store-fetch cycles for elementwise math:

    import numpy.linalg as la
    import pycuda.autoinit
    import pycuda.gpuarray as gpuarray

    from pycuda.curandom import rand as curand
    a_gpu = curand((50,))
    b_gpu = curand((50,))

    from pycuda.elementwise import ElementwiseKernel
    lin_comb = ElementwiseKernel(
        "float a, float *x, float b, float *y, float *z",
        "z[i] = a*x[i] + b*y[i]")

    c_gpu = gpuarray.empty_like(a_gpu)
    lin_comb(5, a_gpu, 6, b_gpu, c_gpu)

    assert la.norm((c_gpu - (5*a_gpu+6*b_gpu)).get()) < 1e-5



gpuarray: Reduction made easy

Example: A scalar product calculation

    import numpy
    import pycuda.autoinit

    from pycuda.reduction import ReductionKernel
    dot = ReductionKernel(dtype_out=numpy.float32, neutral="0",
        reduce_expr="a+b", map_expr="x[i]*y[i]",
        arguments="const float *x, const float *y")

    from pycuda.curandom import rand as curand
    x = curand((1000*1000), dtype=numpy.float32)
    y = curand((1000*1000), dtype=numpy.float32)

    x_dot_y = dot(x, y).get()
    x_dot_y_cpu = numpy.dot(x.get(), y.get())
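What the map expression, reduce expression and neutral element mean can be spelled out sequentially on the CPU. This is a conceptual sketch only: `map_reduce` is a made-up helper, and a GPU reduction combines partial results in a different order (so floating-point results may differ slightly in practice).

```python
from functools import reduce


def map_reduce(map_expr, reduce_expr, neutral, *arrays):
    """Apply map_expr element by element, then fold the results with reduce_expr."""
    mapped = (map_expr(*elems) for elems in zip(*arrays))
    return reduce(reduce_expr, mapped, neutral)


x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]

# Scalar product: map_expr is x[i]*y[i], reduce_expr is a+b, neutral is 0.
x_dot_y = map_reduce(lambda xi, yi: xi * yi, lambda a, b: a + b, 0.0, x, y)

# With no elements at all, the result is just the neutral element.
empty_dot = map_reduce(lambda xi, yi: xi * yi, lambda a, b: a + b, 0.0, [], [])
```

The neutral element is what makes the fold well-defined for empty input and lets partial sums be combined in any grouping.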



Step 3: Usage

• Complex numbers
  - ... in GPUArray
  - ... in user code (pycuda-complex.hpp)
• If/then/else for GPUArrays
• Support for custom device pointers
• Smarter device picking/context creation
• PyFFT: FFT for PyOpenCL and PyCUDA
• scikits.cuda: CUFFT, CUBLAS, CULA



Sparse Matrix-Vector on the GPU

• New feature in 0.94: Sparse matrix-vector multiplication
• Uses “packeted format” by Garland and Bell (also includes parts of their code)
• Integrates with scipy.sparse.
• Conjugate-gradients solver included
  - Deferred convergence checking



Kernel Invocation: Automatic Copies

    mod = pycuda.compiler.SourceModule(
        "__global__ void my_func(float *out, float *in) {...}")

    func = mod.get_function("my_func")

    src = numpy.random.randn(400).astype(numpy.float32)
    dest = numpy.empty_like(src)

    func(
        cuda.Out(dest),
        cuda.In(src),
        block=(400,1,1))

“InOut” exists, too.

Only for immediate invocation style.



Step 4: Debugging

New in 0.94.1: Support for CUDA gdb:

    $ cuda-gdb --args python -m pycuda.debug demo.py

Automatically:

• Sets compiler flags
• Retains source code
• Disables compiler cache



CUDA APIs

[Diagram: software stack, bottom to top: Hardware, Kernel Driver, Driver API; on top of the Driver API sit the Runtime API (used from C/C++) and PyCuda (used from Python)]

CUDA has two Programming Interfaces:

• “Runtime”: high-level (libcudart.so, in the “toolkit”)
• “Driver”: low-level (libcuda.so, comes with GPU driver)

(mutually exclusive)



Runtime vs. Driver API

Runtime ↔ Driver differences:

Explicit initialization.

Code objects (“Modules”) become programming language objects.

Texture handling requires slightly more work.

Only needs nvcc for compiling GPU code.

Driver API:

Conceptually cleaner

Less sugar-coating (provide in Python)

Not very different otherwise



PyCuda: API Tracing

With ./configure --cuda-trace=1:

    import pycuda.driver as cuda
    import pycuda.autoinit
    from pycuda.compiler import SourceModule
    import numpy

    a = numpy.random.randn(4,4).astype(numpy.float32)
    a_gpu = cuda.mem_alloc(a.nbytes)
    cuda.memcpy_htod(a_gpu, a)

    mod = SourceModule("""
        __global__ void doublify(float *a)
        {
            int idx = threadIdx.x + threadIdx.y*4;
            a[idx] *= 2;
        }
        """)

    func = mod.get_function("doublify")
    func(a_gpu, block=(4,4,1))

    a_doubled = numpy.empty_like(a)
    cuda.memcpy_dtoh(a_doubled, a_gpu)
    print(a_doubled)
    print(a)

cuInit

cuDeviceGetCount

cuDeviceGet

cuCtxCreate

cuMemAlloc

cuMemcpyHtoD

cuCtxGetDevice

cuDeviceComputeCapability

cuModuleLoadData

cuModuleGetFunction

cuFuncSetBlockShape

cuParamSetv

cuParamSetSize

cuLaunchGrid

cuMemcpyDtoH

cuCtxPopCurrent

cuCtxPushCurrent

cuMemFree

cuCtxPopCurrent

cuCtxPushCurrent

cuModuleUnload

cuCtxPopCurrent

cuCtxDestroy



PyCUDA: Vital Information

http://mathema.tician.de/software/pycuda

Complete documentation

MIT License

(no warranty, free for all use)

Requires: numpy, Python 2.4+

(Win/OS X/Linux)

Support via mailing list


Sleepy ?

Outline



3. Case study in brain-inspired AI

... too much ?

caching, bank conflicts, coalescing, partition camping, clamping, mixed precision, broadcasting, streams, zero-copy

can’t decide ?


GPU Programming: Implementation Choices

• Many difficult questions
• Insufficient heuristics
• Answers are hardware-specific and have no lasting value

Proposed Solution: Tune automatically for hardware at run time, cache tuning results.

• Decrease reliance on knowledge of hardware internals
• Shift emphasis from tuning results to tuning ideas
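The "tune at run time, cache tuning results" proposal can be sketched as a small lookup keyed by hardware and problem. Everything here is hypothetical scaffolding: `tuned_choice`, the JSON file layout, the variant names and the made-up cost function are illustrative assumptions, and a real system would time actual kernels.

```python
import json
import os
import tempfile


def tune(variants, measure):
    """Return the name of the variant with the smallest measured cost."""
    return min(variants, key=lambda name: measure(variants[name]))


def tuned_choice(cache_path, hardware_id, problem_id, variants, measure):
    """Look up a cached tuning result; tune and store it on a cache miss."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            cache = json.load(f)
    key = "%s|%s" % (hardware_id, problem_id)
    if key not in cache:
        cache[key] = tune(variants, measure)
        with open(cache_path, "w") as f:
            json.dump(cache, f)
    return cache[key]


# Hypothetical variants (block sizes) and a made-up cost in place of real timing.
variants = {"block_64": 64, "block_128": 128, "block_256": 256}
calls = []


def fake_measure(block_size):
    calls.append(block_size)
    return abs(block_size - 128)  # pretend 128 is the fastest setting


cache_path = os.path.join(tempfile.mkdtemp(), "tuning.json")
best_first = tuned_choice(cache_path, "gpu-A", "conv-9x9", variants, fake_measure)
n_measurements = len(calls)
best_again = tuned_choice(cache_path, "gpu-A", "conv-9x9", variants, fake_measure)
```

The second call performs no measurements. Because the key includes the hardware, moving to a different GPU simply triggers a fresh tuning run instead of reusing stale answers.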



Metaprogramming

[Diagram: Idea → Python Code → GPU Code → GPU Compiler → GPU Binary → GPU → Result, with the stages labelled Human or Machine; PyCUDA / PyOpenCL]

In GPU scripting, GPU code does not need to be a compile-time constant.

(Key: Code is data; it wants to be reasoned about at run time)

Good for code generation




Run-Time Code Generation

Writing Code when the most Knowledge is Available





Machine-generated Code

Why machine-generate code?

• Automated Tuning (cf. ATLAS, FFTW)
• Data types
• Specialize code for given problem
• Constants faster than variables (→ register pressure)
• Loop Unrolling
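As an illustration of specializing for data types and baking constants into the source, here is a minimal sketch that builds CUDA C text with the standard library's `string.Template`. The kernel, the `make_scale_kernel` helper and the type table are invented for this example; the resulting string is what one would hand to a run-time compiler such as `SourceModule`.

```python
from string import Template

# Hypothetical kernel template: every ${...} becomes a literal in the source.
KERNEL_TEMPLATE = Template("""\
__global__ void scale(${ctype} *data)
{
    int idx = blockIdx.x * ${block_size} + threadIdx.x;
    if (idx < ${n})
        data[idx] *= ${factor};
}
""")

C_TYPES = {"float32": "float", "float64": "double", "int32": "int"}


def make_scale_kernel(dtype_name, n, block_size, factor_literal):
    """Return CUDA C source specialized for one dtype, size, and constant."""
    return KERNEL_TEMPLATE.substitute(
        ctype=C_TYPES[dtype_name],
        n=n,
        block_size=block_size,
        factor=factor_literal,
    )


source_f32 = make_scale_kernel("float32", n=4096, block_size=256,
                               factor_literal="2.0f")
source_f64 = make_scale_kernel("float64", n=1000, block_size=128,
                               factor_literal="0.5")
```

Because `n`, the block size and the factor are literals in the generated text, the compiler sees constants rather than kernel arguments, which is the point behind "constants faster than variables".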



PyCuda: Support for Metaprogramming

Access properties of compiled code:

• func.{num_regs,shared_size_bytes,local_size_bytes}
• Exact GPU timing via events
• Can calculate hardware-dependent MP occupancy

codepy (by Andreas):

• Build C syntax trees from Python
• Generates readable, indented C

Or use a templating engine (many available, e.g. Cheetah)
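To show how properties such as `num_regs` and `shared_size_bytes` feed an occupancy estimate, here is a simplified sketch. The per-multiprocessor limits below are illustrative assumptions that should be replaced with the values for the actual device, and the model ignores allocation granularity, so treat it as a rough guide rather than the official occupancy calculation.

```python
# Assumed per-multiprocessor limits (illustrative; check your hardware's documentation).
ASSUMED_LIMITS = {
    "max_blocks": 8,
    "max_threads": 768,
    "registers": 8192,
    "shared_bytes": 16384,
}


def blocks_per_mp(threads_per_block, regs_per_thread, smem_per_block,
                  limits=ASSUMED_LIMITS):
    """Resident blocks per multiprocessor: the tightest resource limit wins."""
    candidates = [
        limits["max_blocks"],
        limits["max_threads"] // threads_per_block,
        limits["registers"] // (regs_per_thread * threads_per_block),
    ]
    if smem_per_block > 0:
        candidates.append(limits["shared_bytes"] // smem_per_block)
    return min(candidates)


def occupancy(threads_per_block, regs_per_thread, smem_per_block,
              limits=ASSUMED_LIMITS):
    """Fraction of the multiprocessor's thread slots that are occupied."""
    blocks = blocks_per_mp(threads_per_block, regs_per_thread,
                           smem_per_block, limits)
    return blocks * threads_per_block / float(limits["max_threads"])


occ_light = occupancy(threads_per_block=256, regs_per_thread=10,
                      smem_per_block=4096)
occ_heavy = occupancy(threads_per_block=256, regs_per_thread=16,
                      smem_per_block=4096)
```

Under these assumed limits, going from 10 to 16 registers per thread drops the number of resident blocks from 3 to 2. An auto-tuner can use this kind of effect to prune candidate kernels before timing them.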


Outline



3. Case study in brain-inspired AI (vision)

Motivation

The Problem: Visual Object Recognition

• fast
• accurate
• tolerant to variations
• effortless
• critical to survival

The Approach: Reverse and Forward Engineering the Brain

• REVERSE: Study Natural System
• FORWARD: Build Artificial System

The brain is a massively parallel computer

➡ Big models are paralyzingly slow to run

Why is modeling challenging?

Advice from Dave Cox:

“Don’t run anything that takes longer than a week to complete, because it will just crash halfway through anyways (or you’ll discover a bug) and you’ll never finish your Ph.D.”

Neural data only provides weak constraints

➡ Lots of parameters – hard to explore

Visual Cortex

brain = 20 petaflops !

GPUs (since 2006)

System          Year   GPU API        Host language
7800 GTX        2006   OpenGL/Cg      C++/Python
Monster16GPU    2008   CUDA           Python
Tesla Cluster   2009   CUDA/OpenCL    Python

Build your own!

Cell Broadband Engine (since 2007)

Teraflop Playstation3 clusters: DiCarlo Lab / MIT, Cox Lab / Harvard

A Match Made in Heaven: Brains are parallel, GPUs are parallel

Multiple scales of parallelism:
• “Embarrassingly” parallel: video frames, regions
• Fine-grained: independent “neurons,” operating on overlapping inputs

A Match Made in Heaven: Images In, Images Out

Image processing particularly well-suited:
• Excellent Arithmetic Intensity: very natural to load image patches into shared memory
• Data: 2D / 3D locality

Fukushima (1980)

LeCun et al. (1989)

Riesenhuber & Poggio (1999)

Serre & Poggio (2007)

[Figure: multi-layer model, input → L1 → L2 → L3 → Read-out. Parameters shown per layer: number of filters, kernel size, normalization neighborhood, norm strength, thresh/sat. Learning parameters: Rate, Trace, “Temp. Adv.”, “Auto-reset”, ...]





How to optimize?

Two conflicting requirements

FAST

FLEXIBLE

What’s the bottleneck?

3D Filterbank Convolutions!

Fast vs Flexible: what can you do?

- Make your code accessible
- No focus on raw performance

Examples: by John Moore (1995); MATLAB/CUDA by Jim Mutch (2010)

- Use standard libraries (e.g. CUBLAS, CUFFT, Jacket)
- But: “remap” problem to fit?
- Memory issues (not always optimal)

- Fully optimized, by hand
- But for only a few input configurations...

- Focus on flexibility/accessibility first
- But add strong foundations for raw performance from the beginning

Example: http://deeplearning.net by James Bergstra & Yoshua Bengio (2010), Python/C/CUDA (OpenCL*)

Our answer?

Meta-programming and Auto-tuning

Meta-programming !

Leave the grunt-programming to the computer (i.e. auto-tuning like ATLAS or FFTW)
• Dynamically compile specialized versions of the same kernel for different conditions
• Empirical run-time tuning
• For free: smooth syntactic ugliness: unroll loops, index un-indexable registers, etc.

“Instrument” your solutions:
• Block size
• Work size
• Loop unrolling
• Pre-fetching
• Spilling
• etc.

Let the computer generate and find the optimal code:
• brute-force search with a global objective
• machine-learning approach with local objectives and hidden variables (advanced)
• e.g. PyCuda makes this easy:
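A brute-force search with a single global objective can be sketched in a few lines of plain Python. The parameter names, the validity rule in `build` and the made-up cost model in `fake_measure` are assumptions for illustration; in a real setting `build` would render and compile a kernel and `measure` would time it on the GPU.

```python
import itertools


def brute_force_tune(param_grid, build, measure):
    """Try every parameter combination; return (best_cost, best_params)."""
    names = sorted(param_grid)
    results = []
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        candidate = build(params)
        if candidate is None:  # configuration rejected (e.g. would not compile)
            continue
        results.append((measure(candidate), params))
    return min(results, key=lambda result: result[0])


def build(params):
    # Hypothetical validity rule standing in for a failed compile or launch.
    if params["block_w"] * params["unroll"] > 512:
        return None
    return params


def fake_measure(candidate):
    # Made-up cost: more unrolling helps, block widths far from 128 hurt.
    return 1.0 / candidate["unroll"] + abs(candidate["block_w"] - 128) / 1000.0


grid = {"block_w": [32, 64, 128, 256], "unroll": [1, 2, 4, 8]}
best_cost, best_params = brute_force_tune(grid, build, fake_measure)
```

The search space grows multiplicatively with each instrumented parameter, which is why rejecting invalid configurations early matters, and why the smarter search strategies mentioned above become attractive for larger spaces.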


Basic GPU Meta-programming System

GPU Meta-Programming: A Case Study in Biologically-Inspired Machine Vision [GPU Computing Gems], Pinto N, Cox DD

    texture<float4, 1, cudaReadModeElementType> tex_float4;
    __constant__ float constant[$FILTER_D][$FILTER_W][$N_FILTERS];

    #define IMUL(a, b) __mul24(a, b)
    extern "C" {

    #for j in xrange($FILTER_H)

      __global__ void convolve_beta_j${j}(float4 *input, float4 *output)
      {

    #set INPUT_BLOCK_W = $BLOCK_W+$FILTER_W-1
        __shared__ float shared_in[$INPUT_BLOCK_W][4+1];

        // -- input/output offsets
        const uint in_idx = (blockIdx.y+$j)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
        const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
        float4 input_v4;

        // -- load input to shared memory
    #for i in xrange($LOAD_ITERATIONS)
    #if $i==($LOAD_ITERATIONS-1)
        if((threadIdx.x+$BLOCK_W*$i)<$INPUT_BLOCK_W)
    #end if
        {
          input_v4 = tex1Dfetch(tex_float4, in_idx+$BLOCK_W*$i);
          shared_in[threadIdx.x+$BLOCK_W*$i][0] = input_v4.x;
          shared_in[threadIdx.x+$BLOCK_W*$i][1] = input_v4.y;
          shared_in[threadIdx.x+$BLOCK_W*$i][2] = input_v4.z;
          shared_in[threadIdx.x+$BLOCK_W*$i][3] = input_v4.w;
        }
    #end for

Cheetah








    #include <stdio.h>

    texture<float4, 1, cudaReadModeElementType> tex_float4;
    __constant__ float constant[4][4][4];

    __global__ void convolve_beta_j0(float4 *input, float4 *output)
    {
      __shared__ float shared_in[131][4+1];

      // -- input/output offsets
      const uint in_idx = (blockIdx.y+0)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
      const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
      float4 input_v4;

      // -- load input to shared memory
      {
        input_v4 = tex1Dfetch(tex_float4, in_idx+128*0);
        shared_in[threadIdx.x+128*0][0] = input_v4.x;
        shared_in[threadIdx.x+128*0][1] = input_v4.y;
        shared_in[threadIdx.x+128*0][2] = input_v4.z;
        shared_in[threadIdx.x+128*0][3] = input_v4.w;
      }
      if((threadIdx.x+128*1)<131)
      {
        input_v4 = tex1Dfetch(tex_float4, in_idx+128*1);
        shared_in[threadIdx.x+128*1][0] = input_v4.x;
        shared_in[threadIdx.x+128*1][1] = input_v4.y;
        shared_in[threadIdx.x+128*1][2] = input_v4.z;
        shared_in[threadIdx.x+128*1][3] = input_v4.w;
      }
      __syncthreads();

      // -- compute dot products
      float v, w;

      float sum0 = 0;
      float sum1 = 0;
      float sum2 = 0;
      float sum3 = 0;

      v = shared_in[threadIdx.x+0][0]; w = constant[0][0][0]; sum0 += v*w; w = constant[0][0][1]; sum1 += v*w; w = constant[0][0][2]; sum2 += v*w; w = constant[0][0][3]; sum3 += v*w;
      v = shared_in[threadIdx.x+1][0]; w = constant[0][1][0]; sum0 += v*w; w = constant[0][1][1]; sum1 += v*w; w = constant[0][1][2]; sum2 += v*w; w = constant[0][1][3]; sum3 += v*w;
      v = shared_in[threadIdx.x+2][0]; w = constant[0][2][0]; sum0 += v*w; w = constant[0][2][1]; sum1 += v*w; w = constant[0][2][2]; sum2 += v*w; w = constant[0][2][3]; sum3 += v*w;
      v = shared_in[threadIdx.x+3][0]; w = constant[0][3][0]; sum0 += v*w; w = constant[0][3][1]; sum1 += v*w; w = constant[0][3][2]; sum2 += v*w; w = constant[0][3][3]; sum3 += v*w;
      v = shared_in[threadIdx.x+0][1]; w = constant[1][0][0]; sum0 += v*w; w = constant[1][0][1]; sum1 += v*w; w = constant[1][0][2]; sum2 += v*w; w = constant[1][0][3]; sum3 += v*w;
      v = shared_in[threadIdx.x+1][1]; w = constant[1][1][0]; sum0 += v*w; w = constant[1][1][1]; sum1 += v*w; w = constant[1][1][2]; sum2 += v*w; w = constant[1][1][3]; sum3 += v*w;
      v = shared_in[threadIdx.x+2][1]; w = constant[1][2][0]; sum0 += v*w; w = constant[1][2][1]; sum1 += v*w; w = constant[1][2][2]; sum2 += v*w; w = constant[1][2][3]; sum3 += v*w;
      v = shared_in[threadIdx.x+3][1]; w = constant[1][3][0]; sum0 += v*w; w = constant[1][3][1]; sum1 += v*w; w = constant[1][3][2]; sum2 += v*w; w = constant[1][3][3]; sum3 += v*w;
      v = shared_in[threadIdx.x+0][2]; w = constant[2][0][0]; sum0 += v*w; w = constant[2][0][1]; sum1 += v*w; w = constant[2][0][2]; sum2 += v*w; w = constant[2][0][3]; sum3 += v*w;
      v = shared_in[threadIdx.x+1][2]; w = constant[2][1][0]; sum0 += v*w; w = constant[2][1][1]; sum1 += v*w; w = constant[2][1][2]; sum2 += v*w; w = constant[2][1][3]; sum3 += v*w;
      v = shared_in[threadIdx.x+2][2]; w = constant[2][2][0]; sum0 += v*w; w = constant[2][2][1]; sum1 += v*w; w = constant[2][2][2]; sum2 += v*w; w = constant[2][2][3]; sum3 += v*w;
      v = shared_in[threadIdx.x+3][2]; w = constant[2][3][0]; sum0 += v*w; w = constant[2][3][1]; sum1 += v*w; w = constant[2][3][2]; sum2 += v*w; w = constant[2][3][3]; sum3 += v*w;
      v = shared_in[threadIdx.x+0][3]; w = constant[3][0][0]; sum0 += v*w; w = constant[3][0][1]; sum1 += v*w; w = constant[3][0][2]; sum2 += v*w; w = constant[3][0][3]; sum3 += v*w;
      v = shared_in[threadIdx.x+1][3]; w = constant[3][1][0]; sum0 += v*w; w = constant[3][1][1]; sum1 += v*w; w = constant[3][1][2]; sum2 += v*w; w = constant[3][1][3]; sum3 += v*w;
      v = shared_in[threadIdx.x+2][3]; w = constant[3][2][0]; sum0 += v*w; w = constant[3][2][1]; sum1 += v*w; w = constant[3][2][2]; sum2 += v*w; w = constant[3][2][3]; sum3 += v*w;
      v = shared_in[threadIdx.x+3][3]; w = constant[3][3][0]; sum0 += v*w; w = constant[3][3][1]; sum1 += v*w; w = constant[3][3][2]; sum2 += v*w; w = constant[3][3][3]; sum3 += v*w;

      // -- store output
      output[out_idx].x += sum0;
      output[out_idx].y += sum1;
      output[out_idx].z += sum2;
      output[out_idx].w += sum3;
    }

}

conv_kernel_template.cu generates:

conv_kernel_4x4x4.cu (20 kB)

conv_kernel_8x8x4.cu (64 kB)
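The fully unrolled dot products in the generated file follow a regular pattern, so a generator for them is short. The sketch below uses plain Python string formatting as a stand-in for the Cheetah template (an assumption made so it runs without extra packages), and the loop order is inferred from the generated listing rather than taken from the original template, whose compute section is not shown.

```python
def unrolled_dot_products(filter_d, filter_w, n_filters):
    """Emit one statement per line for the fully unrolled dot products."""
    lines = []
    for d in range(filter_d):          # depth: second index of shared_in
        for i in range(filter_w):      # horizontal offset within the filter
            lines.append("v = shared_in[threadIdx.x+%d][%d];" % (i, d))
            for n in range(n_filters):  # one accumulator per filter
                lines.append("w = constant[%d][%d][%d];" % (d, i, n))
                lines.append("sum%d += v*w;" % n)
    return lines


lines_4x4x4 = unrolled_dot_products(4, 4, 4)
lines_8x8x4 = unrolled_dot_products(4, 8, 4)  # hypothetical wider filter
source_fragment = "\n".join(lines_4x4x4)
```

The amount of generated code scales with the product of the loop bounds, which is consistent with the larger filter producing a much larger .cu file. Writing such code by hand for every filter shape would be impractical; generating it costs one function call.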

Benefits?

Smooth syntactic ugliness

Manipulations that are not easily accessible in CUDA C code:
• variable-length argument lists
• syntax-level code control (e.g. conditionals)
• loop unrolling (possibly fine-controlled)


How about #pragma unroll ? (why don’t you trust the compiler?)

Using GPUs for Signal Correlation

Michael Clark with Paul La Plante and Lincoln Greenhill


Thursday, 27 January 2011

IICS‘2011

we are not alone....

Don’t trust compilers

• Compare these “identical” code fragments

a += b*c + d*c + e*f + g*h;

a += b*c;

a += d*c;

a += e*f;

a += g*h;

1020 GFLOPS

770 GFLOPS



Manipulations that are not easily accessible in CUDA C code:
• index un-indexable resources (e.g. regs)

Explore design decision space more freely






Meta-programming:

• enables efficient learning of the GPU hardware/software

• allows full exploitation of the GPU architecture








conv_kernel_beta_template.cu

version A

    ...
    mad.rn.f32 $r4, s[$ofs3+0x0000], $r4, $r1
    mov.b32 $r1, c0[$ofs2+0x0008]
    mad.rn.f32 $r4, s[$ofs3+0x0008], $r1, $r4
    mov.b32 $r1, c0[$ofs2+0x000c]
    mad.rn.f32 $r4, s[$ofs3+0x000c], $r1, $r4
    mov.b32 $r1, c0[$ofs2+0x0010]
    mad.rn.f32 $r4, s[$ofs3+0x0010], $r1, $r4
    ...

version B

    ...
    mad.rn.f32 $r1, s[$ofs1+0x007c], c0[$ofs1+0x0078], $r1
    mad.rn.f32 $r1, s[$ofs2+0x0000], c0[$ofs2+0x007c], $r1
    mad.rn.f32 $r1, s[$ofs2+0x0008], c0[$ofs2+0x0080], $r1
    mad.rn.f32 $r1, s[$ofs2+0x000c], c0[$ofs2+0x0084], $r1
    mad.rn.f32 $r1, s[$ofs2+0x0010], c0[$ofs2+0x0088], $r1
    ...

2x faster... Why ?

using the decuda disassembler by Wladimir J. van der Laan (Python-based)

When USE_THREAD_PER_FILTER is True
• each thread will access different cmem locations (in order)

When USE_THREAD_PER_FILTER is False
• each thread will access the same cmem locations (broadcast)

2x faster... Why ?

vs.

more registers

thread-dependent data movement

Strategy

• intermediate design decisions can be made explicit

• multiple “forks” in the path can be kept in place

• frees up the developer to revisit past choices (without incurring a combinatorial explosion of separate pieces of code)

• retesting sets of assumptions can be done frequently and programmatically from the “outer” framework of code

http://wiki.tiker.net/PyCuda/Examples/DemoMetaMatrixmulCheetah

Matmul Toy Example



Summary

Meta-programming:

• can assist exploration and manual optimization

• can de-clutter code

• is easy and flexible with the right tools (e.g. Python, Py{CUDA,CL}, Cheetah, decuda)

➡ facilitates auto-tuning!

Need a pause?

How to get to the ninja level?

Practice, practice, practice...

Auto-tuning






The goal is to empirically optimize execution time given:

• the environment

- hardware (GPU, CPU, Memory, Mobo)

- software (SDK, Compiler suite)

• the data (input dimensions, repetitions, etc.)

Basic auto-tuning: pseudo-code (1/3)

Filter-bank Convolution / Correlation

[Pseudo-code slides not preserved. Tools named on them: scripting with Py{CUDA,CL}; Cheetah, Jinja, Mako; NoSQL (CouchDB, MongoDB)]

Optimizing what?

Optimizing strategy

• Like many operations, filter-bank convolution is usually “communication bound” on the GPU:
  - compute is cheap
  - communication is expensive

• We must take advantage of all types of memory:
  - explicit: gmem (global), smem (shared), cmem (constant), tmem (texture)
  - implicit: rmem (registers), bmem (bin-code?) *

• Different optimal access patterns
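One common way to make "communication bound" concrete is a roofline-style bound, which is not taken from the slides but is named here as a swapped-in illustration. The peak throughput and bandwidth figures below are round, hypothetical numbers, not measurements of any particular GPU.

```python
def attainable_gflops(flops_per_byte, peak_gflops, bandwidth_gb_s):
    """Roofline-style bound: limited by compute or by memory traffic."""
    return min(peak_gflops, flops_per_byte * bandwidth_gb_s)


# Hypothetical device: 1000 GFLOP/s peak compute, 150 GB/s global memory.
PEAK_GFLOPS = 1000.0
BANDWIDTH_GB_S = 150.0

# Little reuse: each 4-byte float fetched from global memory feeds one flop.
low_reuse = attainable_gflops(0.25, PEAK_GFLOPS, BANDWIDTH_GB_S)

# Heavy reuse (e.g. via shared or constant memory): 8 flops per byte fetched.
high_reuse = attainable_gflops(8.0, PEAK_GFLOPS, BANDWIDTH_GB_S)
```

With little data reuse the bound sits far below peak compute, which is why staging data in the faster memory types matters more than shaving arithmetic.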

Example: thread gmem output size

stupid float4 xyzw trick

Example: multiple smem loads

Example: using texture fetches

Example: register spilling

Example: register pressure (nvcc)

Example: capitalizing on bmem (bin code) ??

input offset in cubin code?

multiple versions of the same function with different input offsets

Results


GPU / SDK            Input         Filter-bank   Meta-prog default (gflops)   Meta-prog auto-tuned (gflops)   Boost
9600M GT, CUDA3.1    256x256x8     64x9x9x8        6.710 ± 0.005                36.584 ± 0.023                445.2 %
                     512x512x4     32x13x13x4     13.606 ± 0.002                35.582 ± 0.003                161.5 %
                     1024x1024x8   16x5x5x8       20.034 ± 0.113                26.084 ± 6.243                 30.2 %
                     2048x2048x4   4x8x8x4        25.781 ± 0.044                46.945 ± 0.100                 82.1 %
C1060, CUDA2.3       256x256x8     64x9x9x8      104.188 ± 0.051               168.083 ± 0.372                 61.3 %
                     512x512x4     32x13x13x4    125.739 ± 0.109               234.053 ± 0.266                 86.1 %
                     1024x1024x8   16x5x5x8      144.279 ± 0.764               243.697 ± 0.346                 68.9 %
                     2048x2048x4   4x8x8x4       180.060 ± 0.018               322.328 ± 0.348                 79.0 %
GTX285, CUDA2.3      256x256x8     64x9x9x8      123.396 ± 0.016               197.006 ± 0.219                 59.7 %
                     512x512x4     32x13x13x4    143.277 ± 0.044               270.206 ± 0.209                 88.6 %
                     1024x1024x8   16x5x5x8      148.841 ± 0.465               310.276 ± 0.538                108.5 %
                     2048x2048x4   4x8x8x4       205.152 ± 0.015               376.685 ± 0.070                 83.6 %
GTX480, CUDA3.1      256x256x8     64x9x9x8      467.631 ± 19.100              471.902 ± 11.419                 0.9 %
                     512x512x4     32x13x13x4    834.838 ± 8.275               974.266 ± 3.809                 16.7 %
                     1024x1024x8   16x5x5x8      542.808 ± 1.135               614.019 ± 0.904                 13.1 %
                     2048x2048x4   4x8x8x4       378.165 ± 0.537               806.628 ± 0.168                113.3 %

Pinto, Cox (Submitted)
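The Boost column appears to be the relative gain of the auto-tuned throughput over the default one; the table does not state the formula, so this is an inference that the figures happen to confirm. A small sketch, with `boost_percent` as a made-up helper name:

```python
def boost_percent(default_gflops, tuned_gflops):
    """Relative improvement of the tuned over the default throughput, in percent."""
    return (tuned_gflops / default_gflops - 1.0) * 100.0


# (GPU, input, default gflops, auto-tuned gflops), copied from the table.
rows = [
    ("9600M GT", "256x256x8", 6.710, 36.584),
    ("9600M GT", "512x512x4", 13.606, 35.582),
    ("GTX480", "256x256x8", 467.631, 471.902),
    ("GTX480", "2048x2048x4", 378.165, 806.628),
]

boosts = [round(boost_percent(default, tuned), 1)
          for _gpu, _input, default, tuned in rows]
```

Note how uneven the gains are: the same tuning procedure yields under 1 % for one GTX480 configuration and over 100 % for another, which is the argument for tuning empirically per input rather than relying on fixed heuristics.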

Analysis

Empirical results...

Performance (gflops)

Q9450 (Matlab/C) [2008]        0.3
Q9450 (C/SSE) [2008]           9.0
7900GTX (Cg) [2006]           68.2
PS3/Cell (C/ASM) [2007]      111.4
8800GTX (CUDA1.x) [2007]     192.7
GTX280 (CUDA2.x) [2008]      339.3
GTX480 (CUDA3.x) [2010]      974.3

>1000X speedup is game changing...

Summary


• Meta-programming makes developing high-performing code for GPU easier

• Fantastic tools exist (e.g. PyCUDA) to help

• Interesting way to explore/learn about GPUs (hw/sw)

• Coarse auto-tuning yields good results

Future

• More Fermi optimizations (L1 cache, concurrent kernels)

• OpenCL to optimize across vendors

• Smarter auto-tuning techniques (ML)
  - (boosted) decision trees
  - evolutionary programming strategies

• Tue 3/29/11: Algorithm Strategies (W. Hwu, UIUC)

• Thu 3/31/11: PyOpenCL (A. Klockner, NYU), ahh (C. Omar, CMU)

• Tue 4/5/11: Analysis-driven Optimization (C. Wooley, NVIDIA)

• Thu 4/7/11: Irregular Parallelism & Efficient Data Structures (J. Owens, UC Davis)

• Thu 4/14/11: Optimization for Ninjas (D. Merrill, UVirg)

• ...

More ?

iPhD: one more thing or two...

Life/Code Hacking #2.x: Speed {listen,read,writ}ing

accelerated e-learning (c) / massively parallel {learn,programm}ing (c)

Life/Code Hacking #2.2b: Speed writing




RSI ?

Life/Code Hacking #2.3: Speed reading




1. Collect many papers, docs, chapters, etc. (100)

2. Skim through them quickly / select (50)

3. Read w/o full understanding / select (25)

4. Read completely w/ full understanding / select (10)

5. Complete mastery + reproduction (5)



http://readerssoft.com/speed_reading_obstacles.php



normal reading vs. speed reading

like David Guetta, use one finger !
