[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning
Lecture #6: CUDA Ninja Tricks | March 1st, 2011
Nicolas Pinto (MIT, Harvard) [email protected]
Massively Parallel Computing, CS 264 / CSCI E-292
GPU “Scripting”, Meta-programming, Auto-tuning
News
During this course,
we’ll try to
and use existing material ;-)
“ ”
adapted for CS264
Today: yey!!
Outline
1. Scripting GPUs with PyCUDA
2. Meta-programming and RTCG
3. Case study in brain-inspired AI
Why do Scripting for GPUs?
GPUs are everything that scripting languages are not:
- Highly parallel
- Very architecture-sensitive
- Built for maximum compute/memory throughput
→ complement each other
CPU: largely restricted to control tasks (∼1000/sec)
- Scripting fast enough
Realize a promise: Use Scripting...
- from first prototype
- to full-scale production code.
Nicolas Pinto (MIT) and Andreas Klöckner (Brown), PyCuda Tutorial; slide by Andreas Klöckner (NYU)
Python + CUDA = PyCUDA
Python + OpenCL = PyOpenCL
Andreas Klöckner, PyCUDA: Even Simpler GPU Programming with Python; slide by Andreas Klöckner (NYU)
How are High-Performance Codes constructed?
“Traditional” Construction of High-Performance Codes:
- C/C++/Fortran
- Libraries
“Alternative” Construction of High-Performance Codes:
- Scripting for ‘brains’
- GPUs for ‘inner loops’
Play to the strengths of each programming environment.
Scripting: Python
One example of a scripting language: Python
Mature
Large and active community
Emphasizes readability
Written in widely-portable C
A ‘multi-paradigm’ language
Rich ecosystem of sci-comp related software
Scripting Languages
Python:
is discoverable and interactive.
has comprehensive built-in functionality.
manages resources automatically.
uses run-time typing.
works well for “gluing” lower-level blocks together.
Scripting: Goals
Scripting languages aim to reduce the load on the programmer:
Reduce required knowledge
Encourage experimentation
Eliminate sources of error
Encourage abstraction wherever possible
Value programmer time over computer time
Think about the tools you use. Use the right tool for the job.
Scripting: Speed
Usual answer to the “Speed Question”: Hybrid (“mixed”) Code.
Plays to the strengths of each language.
But: Introduces (some) complexity.
Observation: GPU code is already hybrid.
Consequence: No added complexity through hybrid code.
Whetting your appetite
import pycuda.driver as cuda
import pycuda.autoinit, pycuda.compiler
import numpy

a = numpy.random.randn(4,4).astype(numpy.float32)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)
[This is examples/demo.py in the PyCUDA distribution.]
Whetting your appetite
mod = pycuda.compiler.SourceModule("""
__global__ void twice(float *a)
{
  int idx = threadIdx.x + threadIdx.y*4;
  a[idx] *= 2;
}
""")

func = mod.get_function("twice")
func(a_gpu, block=(4,4,1))

a_doubled = numpy.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
print(a_doubled)
print(a)
Compute kernel
Whetting your appetite, Part II
Did somebody say “Abstraction is good”?
import numpy
import pycuda.autoinit
from pycuda import gpuarray

a_cpu = numpy.random.randn(4,4).astype(numpy.float32)
b_cpu = numpy.random.randn(4,4).astype(numpy.float32)
c_cpu = a_cpu * b_cpu

a_gpu = gpuarray.to_gpu(a_cpu)
b_gpu = gpuarray.to_gpu(b_cpu)
c_gpu = (a_gpu * b_gpu).get()

print(c_cpu - c_gpu)
Remember me?
// trivia
#include <stdio.h>
#include <stdlib.h>

#define CUDA_CHK(NAME, ARGS) { \
  cudaError_t cuda_err_code = NAME ARGS; \
  if (cuda_err_code != cudaSuccess) { \
    printf("%s failed with code %d\n", #NAME, cuda_err_code); \
    abort(); \
  } \
}
// end

// kernel
__global__ void square_array(float *a, float *b, int n)
{
  int i = (blockIdx.x * blockDim.y + threadIdx.y)
    * blockDim.x + threadIdx.x;
  if (i < n)
    a[i] = a[i] * b[i];
}
// end

// main1
int main()
{
  cudaSetDevice(0); // EDIT ME

  const int n = 4096;

  float *a_host = (float *) malloc(n*sizeof(float));
  float *b_host = (float *) malloc(n*sizeof(float));

  float *a_device, *b_device;
  CUDA_CHK(cudaMalloc, ((void **) &a_device, n*sizeof(float)));
  CUDA_CHK(cudaMalloc, ((void **) &b_device, n*sizeof(float)));
// end

// main2
  for (int i = 0; i < n; i++) { a_host[i] = i; b_host[i] = i+1; }

  CUDA_CHK(cudaMemcpy, (a_device, a_host, n*sizeof(float),
        cudaMemcpyHostToDevice));
  CUDA_CHK(cudaMemcpy, (b_device, b_host, n*sizeof(float),
        cudaMemcpyHostToDevice));

  dim3 block_dim(16, 16);
  int block_size = block_dim.x*block_dim.y;
  int n_blocks = (n + block_size-1) / block_size;
  square_array<<<n_blocks, block_dim>>>(a_device, b_device, n);
// end

// main3
  CUDA_CHK(cudaMemcpy, (a_host, a_device, n*sizeof(float),
        cudaMemcpyDeviceToHost));

  for (int i = 0; i < n; i++)
    printf("%.0f ", a_host[i]);
  puts("\n");

  free(a_host);
  free(b_host);
  CUDA_CHK(cudaFree, (a_device));
  CUDA_CHK(cudaFree, (b_device));
}
// end
PyCUDA Philosophy
Provide complete access
Automatically manage resources
Provide abstractions
Check for and report errors automatically
Full documentation
Integrate tightly with numpy
PyCuda: Workflow
[Workflow diagram (PyCuda): Edit → Run → SourceModule("...") → Cache! → nvcc → .cubin → Upload to GPU → Run on GPU]
Automatic Cleanup
Reachable objects (memory, streams, ...) are never destroyed.
Once unreachable, released at an unspecified future time.
Scarce resources (memory) can be explicitly freed. (obj.free())
Correctly deals with multiple contexts and dependencies.
gpuarray: Simple Linear Algebra
pycuda.gpuarray: Meant to look and feel just like numpy.
gpuarray.to_gpu(numpy_array)
numpy_array = gpuarray.get()
No: nd indexing, slicing, etc. (yet!)
Yes: +, -, *, /, fill, sin, exp, rand, take, ...
Random numbers using pycuda.curandom
Mixed types (int32 + float32 = float64)
print gpuarray for debugging.
Memory behind gpuarray available as .gpudata attribute.
Use as kernel arguments, textures, etc.
What’s this “numpy”, anyway?
Numpy: package for large, multi-dimensional arrays.
Vectors, Matrices, ...
A+B, sin(A), dot(A,B)
la.solve(A, b), la.eig(A)
cube[:, :, n-k:n+k], cube+5
All much faster than functional equivalents in Python.
“Python’s MATLAB”: Basis for SciPy, plotting, ...
gpuarray: Elementwise expressions
Avoiding extra store-fetch cycles for elementwise math:
import numpy.linalg as la
import pycuda.autoinit
from pycuda import gpuarray

from pycuda.curandom import rand as curand
a_gpu = curand((50,))
b_gpu = curand((50,))

from pycuda.elementwise import ElementwiseKernel
lin_comb = ElementwiseKernel(
    "float a, float *x, float b, float *y, float *z",
    "z[i] = a*x[i] + b*y[i]")

c_gpu = gpuarray.empty_like(a_gpu)
lin_comb(5, a_gpu, 6, b_gpu, c_gpu)

assert la.norm((c_gpu - (5*a_gpu+6*b_gpu)).get()) < 1e-5
gpuarray: Reduction made easy
Example: A scalar product calculation
import numpy
import pycuda.autoinit

from pycuda.reduction import ReductionKernel
dot = ReductionKernel(dtype_out=numpy.float32, neutral="0",
    reduce_expr="a+b", map_expr="x[i]*y[i]",
    arguments="const float *x, const float *y")

from pycuda.curandom import rand as curand
x = curand((1000*1000), dtype=numpy.float32)
y = curand((1000*1000), dtype=numpy.float32)

x_dot_y = dot(x, y).get()
x_dot_y_cpu = numpy.dot(x.get(), y.get())
Step 3: Usage
Complex numbers
. . . in GPUArray
. . . in user code
(pycuda-complex.hpp)
If/then/else for GPUArrays
Support for custom device pointers
Smarter device picking/context creation
PyFFT: FFT for PyOpenCL and PyCUDA
scikits.cuda: CUFFT, CUBLAS, CULA
Sparse Matrix-Vector on the GPU
New feature in 0.94: Sparse matrix-vector multiplication
Uses “packeted format” by Garland and Bell (also includes parts of their code)
Integrates with scipy.sparse.
Conjugate-gradients solver included
Deferred convergence checking
Kernel Invocation: Automatic Copies
import numpy
import pycuda.autoinit, pycuda.compiler
import pycuda.driver as cuda

mod = pycuda.compiler.SourceModule(
    "__global__ void my_func(float *out, float *in){...}")
func = mod.get_function("my_func")
src = numpy.random.randn(400).astype(numpy.float32)
dest = numpy.empty_like(src)
func(
    cuda.Out(dest),
    cuda.In(src),
    block=(400,1,1))
“InOut” exists, too.
Only for immediate invocation style.
Step 4: Debugging
New in 0.94.1: Support for CUDA gdb:
$ cuda-gdb --args python -m pycuda.debug demo.py
Automatically:
Sets Compiler flags
Retains source code
Disables compiler cache
CUDA APIs
[Diagram: software stack, bottom to top: Hardware, Kernel Driver, Driver API, then Runtime API (C/C++) alongside PyCuda (Python)]
CUDA has two Programming Interfaces:
- “Runtime”: high-level (libcudart.so, in the “toolkit”)
- “Driver”: low-level (libcuda.so, comes with GPU driver)
(mutually exclusive)
Runtime vs. Driver API
Runtime ↔ Driver differences:
Explicit initialization.
Code objects (“Modules”) become programming language objects.
Texture handling requires slightly more work.
Only needs nvcc for compiling GPU code.
Driver API:
Conceptually cleaner
Less sugar-coating (provide in Python)
Not very different otherwise
PyCuda: API Tracing
With ./configure --cuda-trace=1:
import pycuda.driver as cuda
import pycuda.autoinit
import numpy
from pycuda.compiler import SourceModule

a = numpy.random.randn(4,4).astype(numpy.float32)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)

mod = SourceModule("""
__global__ void doublify(float *a)
{
  int idx = threadIdx.x + threadIdx.y*4;
  a[idx] *= 2;
}
""")

func = mod.get_function("doublify")
func(a_gpu, block=(4,4,1))

a_doubled = numpy.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
print(a_doubled)
print(a)
cuInit
cuDeviceGetCount
cuDeviceGet
cuCtxCreate
cuMemAlloc
cuMemcpyHtoD
cuCtxGetDevice
cuDeviceComputeCapability
cuModuleLoadData
cuModuleGetFunction
cuFuncSetBlockShape
cuParamSetv
cuParamSetSize
cuLaunchGrid
cuMemcpyDtoH
cuCtxPopCurrent
cuCtxPushCurrent
cuMemFree
cuCtxPopCurrent
cuCtxPushCurrent
cuModuleUnload
cuCtxPopCurrent
cuCtxDestroy
PyCUDA: Vital Information
http://mathema.tician.de/software/pycuda
Complete documentation
MIT License (no warranty, free for all use)
Requires: numpy, Python 2.4+ (Win/OS X/Linux)
Support via mailing list
Sleepy ?
Outline
1. Scripting GPUs with PyCUDA
2. Meta-programming and RTCG
3. Case study in brain-inspired AI
... too much ?
caching, bank conflicts, coalescing, partition camping, clamping, mixed precision, broadcasting, streams, zero-copy
can’t decide ?
GPU Programming: Implementation Choices
Many difficult questions
Insufficient heuristics
Answers are hardware-specific and have no lasting value
Proposed Solution: Tune automatically for hardware at run time, cache tuning results.
Decrease reliance on knowledge of hardware internals
Shift emphasis from tuning results to tuning ideas
Metaprogramming
[Diagram: Idea → Python Code → GPU Code → GPU Compiler → GPU Binary → GPU → Result, with the stages divided between Human and Machine; PyCUDA / PyOpenCL]
In GPU scripting, GPU code does not need to be a compile-time constant.
(Key: Code is data; it wants to be reasoned about at run time)
Good for code generation
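To make "code is data" concrete, here is a minimal sketch that builds kernel source as an ordinary Python string at run time. The kernel name `scale`, the helper `make_kernel_source` and the template parameters are illustrative assumptions, not taken from the slides; only the standard-library `string.Template` is used, so no GPU is needed to run it.

```python
from string import Template

# CUDA C source with placeholders that are filled in at run time.
KERNEL_TEMPLATE = Template("""
__global__ void scale(${ctype} *a)
{
    int idx = threadIdx.x + threadIdx.y * ${width};
    a[idx] *= ${factor};
}
""")


def make_kernel_source(ctype, width, factor):
    """Return CUDA C source specialized for one data type and problem shape."""
    return KERNEL_TEMPLATE.substitute(ctype=ctype, width=width, factor=factor)


source = make_kernel_source("float", 4, 2)
print(source)
```

The resulting string is the kind of value that the earlier demo passes to `pycuda.compiler.SourceModule`; because it is produced at run time, the data type and sizes can depend on the actual problem.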
The News
4 Run-Time Code Generation: Writing Code when the most Knowledge is Available
Showcase
Machine-generated Code
Why machine-generate code?
Automated Tuning (cf. ATLAS, FFTW)
Data types
Specialize code for given problem
Constants faster than variables (→ register pressure)
Loop Unrolling
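Two of these motives, loop unrolling and baking in constants, can be illustrated with a small generator. This is a sketch under assumptions: `unrolled_dot` and the C variable names `in`, `w`, `i` and `sum` are hypothetical, chosen only for the example.

```python
def unrolled_dot(n_taps, weights=None):
    """Generate C statements for an n_taps-term dot product, fully unrolled.

    If `weights` is given, the values are baked into the code as float
    literals; otherwise the generated code reads them from an array `w`.
    """
    lines = ["float sum = 0;"]
    for k in range(n_taps):
        if weights is None:
            coeff = "w[%d]" % k
        else:
            coeff = "%rf" % float(weights[k])
        lines.append("sum += in[i+%d] * %s;" % (k, coeff))
    return "\n".join(lines)


generic = unrolled_dot(3)
specialized = unrolled_dot(3, weights=[0.25, 0.5, 0.25])
print(generic)
print(specialized)
```

The specialized variant contains no loop counter and no loads of `w`, which is the effect the slide attributes to constants and unrolling.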
PyCuda: Support for Metaprogramming
Access properties of compiled code:
func.{num_regs,shared_size_bytes,local_size_bytes}
Exact GPU timing via events
Can calculate hardware-dependent MP occupancy
codepy (by Andreas):
Build C syntax trees from Python
Generates readable, indented C
Or use a templating engine (many available, e.g. Cheetah)
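As a rough illustration of how per-kernel properties feed an occupancy estimate, the sketch below computes how many blocks fit on one multiprocessor. It is a simplified model, not PyCUDA's own calculation: `estimate_occupancy` is a hypothetical helper, allocation granularity is ignored, and the default hardware limits are illustrative values for an early CUDA GPU that should be replaced with those of the actual device. In PyCUDA, the register and shared-memory inputs would come from `func.num_regs` and `func.shared_size_bytes`.

```python
def estimate_occupancy(threads_per_block, regs_per_thread, smem_per_block,
                       max_regs=8192, max_smem=16384,
                       max_threads=768, max_blocks=8, warp_size=32):
    """Return (resident blocks per multiprocessor, fraction of max warps).

    All max_* defaults are assumptions for illustration only.
    """
    warps_per_block = -(-threads_per_block // warp_size)  # ceiling division
    limits = [max_blocks, max_threads // threads_per_block]
    if regs_per_thread > 0:
        limits.append(max_regs // (regs_per_thread * threads_per_block))
    if smem_per_block > 0:
        limits.append(max_smem // smem_per_block)
    blocks = min(limits)  # the scarcest resource decides
    max_warps = max_threads // warp_size
    return blocks, float(blocks * warps_per_block) / max_warps


light = estimate_occupancy(256, regs_per_thread=10, smem_per_block=4096)
heavy = estimate_occupancy(256, regs_per_thread=16, smem_per_block=4096)
print(light, heavy)
```

In this model, raising register use from 10 to 16 per thread drops the resident block count from 3 to 2, which is the kind of hardware-dependent effect an auto-tuner can measure instead of guessing.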
Outline
1. Scripting GPUs with PyCUDA
2. Meta-programming and RTCG
3. Case study in brain-inspired AI (vision)
Motivation
The Problem: Visual Object Recognition
fast, accurate, tolerant to variations, effortless, critical to survival
The Approach: Reverse and Forward Engineering the Brain
[Diagram: REVERSE: Study Natural System; FORWARD: Build Artificial System]
Why is modeling challenging?
The brain is a massively parallel computer
➡ Big models are paralyzingly slow to run
Advice from Dave Cox:
“Don’t run anything that takes longer than a week to complete, because it will just crash halfway through anyways (or you’ll discover a bug) and you’ll never finish your Ph.D.”
Neural data only provides weak constraints
➡ Lots of parameters – hard to explore
Visual Cortex
brain = 20 petaflops !
GPUs (since 2006)
7800 GTX (2006): OpenGL/Cg, C++/Python
Monster 16GPU (2008): CUDA, Python
Tesla Cluster (2009): CUDA/OpenCL, Python
Build your own!
Cell Broadband Engine (since 2007)
Teraflop Playstation3 clusters: DiCarlo Lab / MIT, Cox Lab / Harvard
A Match Made in Heaven: Brains are parallel, GPUs are parallel
Multiple scales of parallelism:
- “Embarrassingly” parallel: video frames, regions
- Fine-grained: independent “neurons,” operating on overlapping inputs
A Match Made in Heaven: Images In, Images Out
Image processing particularly well-suited
- Excellent Arithmetic Intensity: very natural to load image patches into shared memory
- Data: 2D / 3D locality
Fukushima (1980)
LeCun et al. (1989)
Riesenhuber & Poggio (1999)
Serre & Poggio (2007)
[Figure: model architecture, input → L1 → L2 → L3 → Read-out. Parameters per layer: number of filters, kernel size, normalization neighborhood, norm strength, thresh/sat. Learning parameters per layer: Rate, Trace, “Temp. Adv.”, “Auto-reset”, ...]
How to optimize?
Two conflicting requirements
FAST
FLEXIBLE
What’s the bottleneck?
3D Filterbank Convolutions!
Fast vs Flexible: what can you do?
MATLAB/CUDA by Jim Mutch (2010)
- Make your code accessible
- No focus on raw performance
Examples:
by John Moore (1995)
Fast vs Flexible: what can you do?
- Use standard libraries (e.g. CUBLAS, CUFFT, Jacket)
- But: “remap” problem to fit?
- Memory issues (not always optimal)
Fast vs Flexible: what can you do?
- Fully optimized, by hand
- But for only a few input configurations...
Fast vs Flexible: what can you do?
- Focus on flexibility/accessibility first
- But add strong foundations for raw performance from the beginning
Example:
http://deeplearning.net by James Bergstra & Yoshua Bengio (2010)
Python/C/CUDA
(OpenCL*)
Our answer?
Meta-programming and Auto-tuning
What?
Meta-programming !
Leave the grunt-programming to the computer (i.e. auto-tuning like ATLAS or FFTW)
• Dynamically compile specialized versions of the same kernel for different conditions
• Empirical run-time tuning
• For free: smooth syntactic ugliness: unroll loops, index un-indexable registers, etc.
“Instrument” your solutions:
• Block size
• Work size
• Loop unrolling
• Pre-fetching
• Spilling
• etc.
Meta-programming !
Let the computer generate and find the optimal code:
• brute-force search with a global objective
• machine-learning approach with local objectives and hidden variables (advanced)
• e.g. PyCuda makes this easy:
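The brute-force variant can be sketched with the standard library alone. Everything here is a hypothetical stand-in: `run_variant` is a CPU loop standing in for "compile and launch one kernel variant", and the parameter names `block_size` and `unroll` are illustrative. With PyCUDA, the timed call would be a kernel launch, ideally timed with GPU events as mentioned earlier.

```python
import itertools
import time


def run_variant(block_size, unroll):
    """Hypothetical stand-in for compiling and launching one kernel variant."""
    total = 0
    for start in range(0, 20000, block_size * unroll):
        total += start
    return total


def autotune(candidates, repeats=3):
    """Time every candidate and return (best_params, best_seconds)."""
    best_params, best_time = None, float("inf")
    for params in candidates:
        elapsed = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            run_variant(**params)
            # keep the fastest of several runs to reduce timing noise
            elapsed = min(elapsed, time.perf_counter() - t0)
        if elapsed < best_time:
            best_params, best_time = params, elapsed
    return best_params, best_time


search_space = [
    {"block_size": b, "unroll": u}
    for b, u in itertools.product([32, 64, 128, 256], [1, 2, 4])
]
best_params, best_time = autotune(search_space)
print(best_params, best_time)
```

The global objective here is simply wall-clock time; the winning parameters can be cached per device so the search runs only once.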
Meta-programming !
Basic GPU Meta-programming System
GPU Meta-Programming: A Case Study in Biologically-Inspired Machine Vision
[GPU Computing Gems]
Pinto N, Cox DD
texture<float4, 1, cudaReadModeElementType> tex_float4;
__constant__ float constant[$FILTER_D][$FILTER_W][$N_FILTERS];

#define IMUL(a, b) __mul24(a, b)
extern "C" {

#for j in xrange($FILTER_H)

  __global__ void convolve_beta_j${j}(float4 *input, float4 *output) {

#set INPUT_BLOCK_W = $BLOCK_W+$FILTER_W-1
    __shared__ float shared_in[$INPUT_BLOCK_W][4+1];

    // -- input/output offsets
    const uint in_idx = (blockIdx.y+$j)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    float4 input_v4;

    // -- load input to shared memory
#for i in xrange($LOAD_ITERATIONS)
#if $i==($LOAD_ITERATIONS-1)
    if((threadIdx.x+$BLOCK_W*$i)<$INPUT_BLOCK_W)
#end if
    {
      input_v4 = tex1Dfetch(tex_float4, in_idx+$BLOCK_W*$i);
      shared_in[threadIdx.x+$BLOCK_W*$i][0] = input_v4.x;
      shared_in[threadIdx.x+$BLOCK_W*$i][1] = input_v4.y;
      shared_in[threadIdx.x+$BLOCK_W*$i][2] = input_v4.z;
      shared_in[threadIdx.x+$BLOCK_W*$i][3] = input_v4.w;
    }
#end for
Cheetah
#include <stdio.h>

texture<float4, 1, cudaReadModeElementType> tex_float4;
__constant__ float constant[4][4][4];

#define IMUL(a, b) __mul24(a, b)
extern "C" {

  __global__ void convolve_beta_j0(float4 *input, float4 *output) {

    __shared__ float shared_in[131][4+1];

    // -- input/output offsets
    const uint in_idx = (blockIdx.y+0)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    float4 input_v4;

    // -- load input to shared memory
    {
      input_v4 = tex1Dfetch(tex_float4, in_idx+128*0);
      shared_in[threadIdx.x+128*0][0] = input_v4.x;
      shared_in[threadIdx.x+128*0][1] = input_v4.y;
      shared_in[threadIdx.x+128*0][2] = input_v4.z;
      shared_in[threadIdx.x+128*0][3] = input_v4.w;
    }
    if((threadIdx.x+128*1)<131)
    {
      input_v4 = tex1Dfetch(tex_float4, in_idx+128*1);
      shared_in[threadIdx.x+128*1][0] = input_v4.x;
      shared_in[threadIdx.x+128*1][1] = input_v4.y;
      shared_in[threadIdx.x+128*1][2] = input_v4.z;
      shared_in[threadIdx.x+128*1][3] = input_v4.w;
    }
    __syncthreads();

    // -- compute dot products
    float v, w;

    float sum0 = 0;
    float sum1 = 0;
    float sum2 = 0;
    float sum3 = 0;
    v = shared_in[threadIdx.x+0][0]; w = constant[0][0][0]; sum0 += v*w; w = constant[0][0][1]; sum1 += v*w; w = constant[0][0][2]; sum2 += v*w; w = constant[0][0][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+1][0]; w = constant[0][1][0]; sum0 += v*w; w = constant[0][1][1]; sum1 += v*w; w = constant[0][1][2]; sum2 += v*w; w = constant[0][1][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+2][0]; w = constant[0][2][0]; sum0 += v*w; w = constant[0][2][1]; sum1 += v*w; w = constant[0][2][2]; sum2 += v*w; w = constant[0][2][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+3][0]; w = constant[0][3][0]; sum0 += v*w; w = constant[0][3][1]; sum1 += v*w; w = constant[0][3][2]; sum2 += v*w; w = constant[0][3][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+0][1]; w = constant[1][0][0]; sum0 += v*w; w = constant[1][0][1]; sum1 += v*w; w = constant[1][0][2]; sum2 += v*w; w = constant[1][0][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+1][1]; w = constant[1][1][0]; sum0 += v*w; w = constant[1][1][1]; sum1 += v*w; w = constant[1][1][2]; sum2 += v*w; w = constant[1][1][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+2][1]; w = constant[1][2][0]; sum0 += v*w; w = constant[1][2][1]; sum1 += v*w; w = constant[1][2][2]; sum2 += v*w; w = constant[1][2][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+3][1]; w = constant[1][3][0]; sum0 += v*w; w = constant[1][3][1]; sum1 += v*w; w = constant[1][3][2]; sum2 += v*w; w = constant[1][3][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+0][2]; w = constant[2][0][0]; sum0 += v*w; w = constant[2][0][1]; sum1 += v*w; w = constant[2][0][2]; sum2 += v*w; w = constant[2][0][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+1][2]; w = constant[2][1][0]; sum0 += v*w; w = constant[2][1][1]; sum1 += v*w; w = constant[2][1][2]; sum2 += v*w; w = constant[2][1][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+2][2]; w = constant[2][2][0]; sum0 += v*w; w = constant[2][2][1]; sum1 += v*w; w = constant[2][2][2]; sum2 += v*w; w = constant[2][2][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+3][2]; w = constant[2][3][0]; sum0 += v*w; w = constant[2][3][1]; sum1 += v*w; w = constant[2][3][2]; sum2 += v*w; w = constant[2][3][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+0][3]; w = constant[3][0][0]; sum0 += v*w; w = constant[3][0][1]; sum1 += v*w; w = constant[3][0][2]; sum2 += v*w; w = constant[3][0][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+1][3]; w = constant[3][1][0]; sum0 += v*w; w = constant[3][1][1]; sum1 += v*w; w = constant[3][1][2]; sum2 += v*w; w = constant[3][1][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+2][3]; w = constant[3][2][0]; sum0 += v*w; w = constant[3][2][1]; sum1 += v*w; w = constant[3][2][2]; sum2 += v*w; w = constant[3][2][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+3][3]; w = constant[3][3][0]; sum0 += v*w; w = constant[3][3][1]; sum1 += v*w; w = constant[3][3][2]; sum2 += v*w; w = constant[3][3][3]; sum3 += v*w;

    // -- store output
    output[out_idx].x += sum0;
    output[out_idx].y += sum1;
    output[out_idx].z += sum2;
    output[out_idx].w += sum3;
  }
__global__ void convolve_beta_j1(float4 *input, float4 *output) {
  __shared__ float shared_in[131][4+1];

  // -- input/output offsets
  const uint in_idx = (blockIdx.y+1)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
  const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
  float4 input_v4;

  // -- load input to shared memory
  {
    input_v4 = tex1Dfetch(tex_float4, in_idx+128*0);
    shared_in[threadIdx.x+128*0][0] = input_v4.x;
    shared_in[threadIdx.x+128*0][1] = input_v4.y;
    shared_in[threadIdx.x+128*0][2] = input_v4.z;
    shared_in[threadIdx.x+128*0][3] = input_v4.w;
  }
  if((threadIdx.x+128*1)<131)
  {
    input_v4 = tex1Dfetch(tex_float4, in_idx+128*1);
    shared_in[threadIdx.x+128*1][0] = input_v4.x;
    shared_in[threadIdx.x+128*1][1] = input_v4.y;
    shared_in[threadIdx.x+128*1][2] = input_v4.z;
    shared_in[threadIdx.x+128*1][3] = input_v4.w;
  }
  __syncthreads();

  // -- compute dot products
  float v, w;
  float sum0 = 0; float sum1 = 0; float sum2 = 0; float sum3 = 0;
  v = shared_in[threadIdx.x+0][0]; w = constant[0][0][0]; sum0 += v*w; w = constant[0][0][1]; sum1 += v*w; w = constant[0][0][2]; sum2 += v*w; w = constant[0][0][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+1][0]; w = constant[0][1][0]; sum0 += v*w; w = constant[0][1][1]; sum1 += v*w; w = constant[0][1][2]; sum2 += v*w; w = constant[0][1][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+2][0]; w = constant[0][2][0]; sum0 += v*w; w = constant[0][2][1]; sum1 += v*w; w = constant[0][2][2]; sum2 += v*w; w = constant[0][2][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+3][0]; w = constant[0][3][0]; sum0 += v*w; w = constant[0][3][1]; sum1 += v*w; w = constant[0][3][2]; sum2 += v*w; w = constant[0][3][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+0][1]; w = constant[1][0][0]; sum0 += v*w; w = constant[1][0][1]; sum1 += v*w; w = constant[1][0][2]; sum2 += v*w; w = constant[1][0][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+1][1]; w = constant[1][1][0]; sum0 += v*w; w = constant[1][1][1]; sum1 += v*w; w = constant[1][1][2]; sum2 += v*w; w = constant[1][1][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+2][1]; w = constant[1][2][0]; sum0 += v*w; w = constant[1][2][1]; sum1 += v*w; w = constant[1][2][2]; sum2 += v*w; w = constant[1][2][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+3][1]; w = constant[1][3][0]; sum0 += v*w; w = constant[1][3][1]; sum1 += v*w; w = constant[1][3][2]; sum2 += v*w; w = constant[1][3][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+0][2]; w = constant[2][0][0]; sum0 += v*w; w = constant[2][0][1]; sum1 += v*w; w = constant[2][0][2]; sum2 += v*w; w = constant[2][0][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+1][2]; w = constant[2][1][0]; sum0 += v*w; w = constant[2][1][1]; sum1 += v*w; w = constant[2][1][2]; sum2 += v*w; w = constant[2][1][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+2][2]; w = constant[2][2][0]; sum0 += v*w; w = constant[2][2][1]; sum1 += v*w; w = constant[2][2][2]; sum2 += v*w; w = constant[2][2][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+3][2]; w = constant[2][3][0]; sum0 += v*w; w = constant[2][3][1]; sum1 += v*w; w = constant[2][3][2]; sum2 += v*w; w = constant[2][3][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+0][3]; w = constant[3][0][0]; sum0 += v*w; w = constant[3][0][1]; sum1 += v*w; w = constant[3][0][2]; sum2 += v*w; w = constant[3][0][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+1][3]; w = constant[3][1][0]; sum0 += v*w; w = constant[3][1][1]; sum1 += v*w; w = constant[3][1][2]; sum2 += v*w; w = constant[3][1][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+2][3]; w = constant[3][2][0]; sum0 += v*w; w = constant[3][2][1]; sum1 += v*w; w = constant[3][2][2]; sum2 += v*w; w = constant[3][2][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+3][3]; w = constant[3][3][0]; sum0 += v*w; w = constant[3][3][1]; sum1 += v*w; w = constant[3][3][2]; sum2 += v*w; w = constant[3][3][3]; sum3 += v*w;

  // -- store output
  output[out_idx].x += sum0;
  output[out_idx].y += sum1;
  output[out_idx].z += sum2;
  output[out_idx].w += sum3;
}
__global__ void convolve_beta_j2(float4 *input, float4 *output) {
  __shared__ float shared_in[131][4+1];

  // -- input/output offsets
  const uint in_idx = (blockIdx.y+2)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
  const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
  float4 input_v4;

  // -- load input to shared memory
  {
    input_v4 = tex1Dfetch(tex_float4, in_idx+128*0);
    shared_in[threadIdx.x+128*0][0] = input_v4.x;
    shared_in[threadIdx.x+128*0][1] = input_v4.y;
    shared_in[threadIdx.x+128*0][2] = input_v4.z;
    shared_in[threadIdx.x+128*0][3] = input_v4.w;
  }
  if((threadIdx.x+128*1)<131)
  {
    input_v4 = tex1Dfetch(tex_float4, in_idx+128*1);
    shared_in[threadIdx.x+128*1][0] = input_v4.x;
    shared_in[threadIdx.x+128*1][1] = input_v4.y;
    shared_in[threadIdx.x+128*1][2] = input_v4.z;
    shared_in[threadIdx.x+128*1][3] = input_v4.w;
  }
  __syncthreads();

  // -- compute dot products
  float v, w;
  float sum0 = 0; float sum1 = 0; float sum2 = 0; float sum3 = 0;
  v = shared_in[threadIdx.x+0][0]; w = constant[0][0][0]; sum0 += v*w; w = constant[0][0][1]; sum1 += v*w; w = constant[0][0][2]; sum2 += v*w; w = constant[0][0][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+1][0]; w = constant[0][1][0]; sum0 += v*w; w = constant[0][1][1]; sum1 += v*w; w = constant[0][1][2]; sum2 += v*w; w = constant[0][1][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+2][0]; w = constant[0][2][0]; sum0 += v*w; w = constant[0][2][1]; sum1 += v*w; w = constant[0][2][2]; sum2 += v*w; w = constant[0][2][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+3][0]; w = constant[0][3][0]; sum0 += v*w; w = constant[0][3][1]; sum1 += v*w; w = constant[0][3][2]; sum2 += v*w; w = constant[0][3][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+0][1]; w = constant[1][0][0]; sum0 += v*w; w = constant[1][0][1]; sum1 += v*w; w = constant[1][0][2]; sum2 += v*w; w = constant[1][0][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+1][1]; w = constant[1][1][0]; sum0 += v*w; w = constant[1][1][1]; sum1 += v*w; w = constant[1][1][2]; sum2 += v*w; w = constant[1][1][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+2][1]; w = constant[1][2][0]; sum0 += v*w; w = constant[1][2][1]; sum1 += v*w; w = constant[1][2][2]; sum2 += v*w; w = constant[1][2][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+3][1]; w = constant[1][3][0]; sum0 += v*w; w = constant[1][3][1]; sum1 += v*w; w = constant[1][3][2]; sum2 += v*w; w = constant[1][3][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+0][2]; w = constant[2][0][0]; sum0 += v*w; w = constant[2][0][1]; sum1 += v*w; w = constant[2][0][2]; sum2 += v*w; w = constant[2][0][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+1][2]; w = constant[2][1][0]; sum0 += v*w; w = constant[2][1][1]; sum1 += v*w; w = constant[2][1][2]; sum2 += v*w; w = constant[2][1][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+2][2]; w = constant[2][2][0]; sum0 += v*w; w = constant[2][2][1]; sum1 += v*w; w = constant[2][2][2]; sum2 += v*w; w = constant[2][2][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+3][2]; w = constant[2][3][0]; sum0 += v*w; w = constant[2][3][1]; sum1 += v*w; w = constant[2][3][2]; sum2 += v*w; w = constant[2][3][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+0][3]; w = constant[3][0][0]; sum0 += v*w; w = constant[3][0][1]; sum1 += v*w; w = constant[3][0][2]; sum2 += v*w; w = constant[3][0][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+1][3]; w = constant[3][1][0]; sum0 += v*w; w = constant[3][1][1]; sum1 += v*w; w = constant[3][1][2]; sum2 += v*w; w = constant[3][1][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+2][3]; w = constant[3][2][0]; sum0 += v*w; w = constant[3][2][1]; sum1 += v*w; w = constant[3][2][2]; sum2 += v*w; w = constant[3][2][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+3][3]; w = constant[3][3][0]; sum0 += v*w; w = constant[3][3][1]; sum1 += v*w; w = constant[3][3][2]; sum2 += v*w; w = constant[3][3][3]; sum3 += v*w;

  // -- store output
  output[out_idx].x += sum0;
  output[out_idx].y += sum1;
  output[out_idx].z += sum2;
  output[out_idx].w += sum3;
}
__global__ void convolve_beta_j3(float4 *input, float4 *output) {
  __shared__ float shared_in[131][4+1];

  // -- input/output offsets
  const uint in_idx = (blockIdx.y+3)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
  const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
  float4 input_v4;

  // -- load input to shared memory
  {
    input_v4 = tex1Dfetch(tex_float4, in_idx+128*0);
    shared_in[threadIdx.x+128*0][0] = input_v4.x;
    shared_in[threadIdx.x+128*0][1] = input_v4.y;
    shared_in[threadIdx.x+128*0][2] = input_v4.z;
    shared_in[threadIdx.x+128*0][3] = input_v4.w;
  }
  if((threadIdx.x+128*1)<131)
  {
    input_v4 = tex1Dfetch(tex_float4, in_idx+128*1);
    shared_in[threadIdx.x+128*1][0] = input_v4.x;
    shared_in[threadIdx.x+128*1][1] = input_v4.y;
    shared_in[threadIdx.x+128*1][2] = input_v4.z;
    shared_in[threadIdx.x+128*1][3] = input_v4.w;
  }
  __syncthreads();

  // -- compute dot products
  float v, w;
  float sum0 = 0; float sum1 = 0; float sum2 = 0; float sum3 = 0;
  v = shared_in[threadIdx.x+0][0]; w = constant[0][0][0]; sum0 += v*w; w = constant[0][0][1]; sum1 += v*w; w = constant[0][0][2]; sum2 += v*w; w = constant[0][0][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+1][0]; w = constant[0][1][0]; sum0 += v*w; w = constant[0][1][1]; sum1 += v*w; w = constant[0][1][2]; sum2 += v*w; w = constant[0][1][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+2][0]; w = constant[0][2][0]; sum0 += v*w; w = constant[0][2][1]; sum1 += v*w; w = constant[0][2][2]; sum2 += v*w; w = constant[0][2][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+3][0]; w = constant[0][3][0]; sum0 += v*w; w = constant[0][3][1]; sum1 += v*w; w = constant[0][3][2]; sum2 += v*w; w = constant[0][3][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+0][1]; w = constant[1][0][0]; sum0 += v*w; w = constant[1][0][1]; sum1 += v*w; w = constant[1][0][2]; sum2 += v*w; w = constant[1][0][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+1][1]; w = constant[1][1][0]; sum0 += v*w; w = constant[1][1][1]; sum1 += v*w; w = constant[1][1][2]; sum2 += v*w; w = constant[1][1][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+2][1]; w = constant[1][2][0]; sum0 += v*w; w = constant[1][2][1]; sum1 += v*w; w = constant[1][2][2]; sum2 += v*w; w = constant[1][2][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+3][1]; w = constant[1][3][0]; sum0 += v*w; w = constant[1][3][1]; sum1 += v*w; w = constant[1][3][2]; sum2 += v*w; w = constant[1][3][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+0][2]; w = constant[2][0][0]; sum0 += v*w; w = constant[2][0][1]; sum1 += v*w; w = constant[2][0][2]; sum2 += v*w; w = constant[2][0][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+1][2]; w = constant[2][1][0]; sum0 += v*w; w = constant[2][1][1]; sum1 += v*w; w = constant[2][1][2]; sum2 += v*w; w = constant[2][1][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+2][2]; w = constant[2][2][0]; sum0 += v*w; w = constant[2][2][1]; sum1 += v*w; w = constant[2][2][2]; sum2 += v*w; w = constant[2][2][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+3][2]; w = constant[2][3][0]; sum0 += v*w; w = constant[2][3][1]; sum1 += v*w; w = constant[2][3][2]; sum2 += v*w; w = constant[2][3][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+0][3]; w = constant[3][0][0]; sum0 += v*w; w = constant[3][0][1]; sum1 += v*w; w = constant[3][0][2]; sum2 += v*w; w = constant[3][0][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+1][3]; w = constant[3][1][0]; sum0 += v*w; w = constant[3][1][1]; sum1 += v*w; w = constant[3][1][2]; sum2 += v*w; w = constant[3][1][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+2][3]; w = constant[3][2][0]; sum0 += v*w; w = constant[3][2][1]; sum1 += v*w; w = constant[3][2][2]; sum2 += v*w; w = constant[3][2][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+3][3]; w = constant[3][3][0]; sum0 += v*w; w = constant[3][3][1]; sum1 += v*w; w = constant[3][3][2]; sum2 += v*w; w = constant[3][3][3]; sum3 += v*w;

  // -- store output
  output[out_idx].x += sum0;
  output[out_idx].y += sum1;
  output[out_idx].z += sum2;
  output[out_idx].w += sum3;
}
}
conv_kernel_template.cu
conv_kernel_4x4x4.cu
texture<float4, 1, cudaReadModeElementType> tex_float4;
__constant__ float constant[$FILTER_D][$FILTER_W][$N_FILTERS];

#define IMUL(a, b) __mul24(a, b)
extern "C" {

#for j in xrange($FILTER_H)

  __global__ void convolve_beta_j${j}(float4 *input, float4 *output) {

#set INPUT_BLOCK_W = $BLOCK_W+$FILTER_W-1
    __shared__ float shared_in[$INPUT_BLOCK_W][4+1];

    // -- input/output offsets
    const uint in_idx = (blockIdx.y+$j)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    float4 input_v4;

    // -- load input to shared memory
#for i in xrange($LOAD_ITERATIONS)
#if $i==($LOAD_ITERATIONS-1)
    if((threadIdx.x+$BLOCK_W*$i)<$INPUT_BLOCK_W)
#end if
    {
      input_v4 = tex1Dfetch(tex_float4, in_idx+$BLOCK_W*$i);
      shared_in[threadIdx.x+$BLOCK_W*$i][0] = input_v4.x;
      shared_in[threadIdx.x+$BLOCK_W*$i][1] = input_v4.y;
      shared_in[threadIdx.x+$BLOCK_W*$i][2] = input_v4.z;
      shared_in[threadIdx.x+$BLOCK_W*$i][3] = input_v4.w;
    }
#end for
conv_kernel_template.cu
conv_kernel_4x4x4.cu
20 kB
conv_kernel_8x8x4.cu
64 kB
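The way the template's `#for` and `#if` directives expand into the unrolled load code can be sketched in plain Python. This is only an illustrative sketch: it uses ordinary string formatting rather than the template engine shown in the slides, the helper name `render_load_section` is hypothetical, and it assumes that `LOAD_ITERATIONS` is the ceiling of `INPUT_BLOCK_W / BLOCK_W` (which is consistent with the two load blocks in the generated 4x4x4 kernels, but is not stated in the slides).

```python
def render_load_section(block_w=128, filter_w=4):
    """Emit the unrolled shared-memory load code for one kernel.

    Hypothetical helper: mimics the '#for i in xrange($LOAD_ITERATIONS)'
    loop of the template with plain Python string formatting.
    """
    input_block_w = block_w + filter_w - 1           # e.g. 128 + 4 - 1 = 131
    # Assumption: enough iterations to cover the whole input block.
    load_iterations = -(-input_block_w // block_w)   # ceiling division
    lines = []
    for i in range(load_iterations):
        # Only the last iteration can run past the end of shared_in,
        # so only that one gets a bounds check (the template's '#if').
        if i == load_iterations - 1:
            lines.append("if((threadIdx.x+%d*%d)<%d)" % (block_w, i, input_block_w))
        lines.append("{")
        lines.append("  input_v4 = tex1Dfetch(tex_float4, in_idx+%d*%d);" % (block_w, i))
        for c, field in enumerate("xyzw"):
            lines.append("  shared_in[threadIdx.x+%d*%d][%d] = input_v4.%s;"
                         % (block_w, i, c, field))
        lines.append("}")
    return "\n".join(lines)


source = render_load_section()
print(source)
```

Because the loop bounds and the conditional are resolved while the source is being generated, the emitted CUDA code contains only literal constants such as `128*1` and `131`.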
Benefits?
Smooth syntactic ugliness
Manipulations that are not easily accessible in CUDA C code:
• variable-length argument lists
• syntax-level code control (e.g. conditionals)
• loop unrolling (possibly fine-controlled)
(...)
  v = shared_in[threadIdx.x+0][0]; w = constant[0][0][0]; sum0 += v*w; w = constant[0][0][1]; sum1 += v*w; w = constant[0][0][2]; sum2 += v*w; w = constant[0][0][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+1][0]; w = constant[0][1][0]; sum0 += v*w; w = constant[0][1][1]; sum1 += v*w; w = constant[0][1][2]; sum2 += v*w; w = constant[0][1][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+2][0]; w = constant[0][2][0]; sum0 += v*w; w = constant[0][2][1]; sum1 += v*w; w = constant[0][2][2]; sum2 += v*w; w = constant[0][2][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+3][0]; w = constant[0][3][0]; sum0 += v*w; w = constant[0][3][1]; sum1 += v*w; w = constant[0][3][2]; sum2 += v*w; w = constant[0][3][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+0][1]; w = constant[1][0][0]; sum0 += v*w; w = constant[1][0][1]; sum1 += v*w; w = constant[1][0][2]; sum2 += v*w; w = constant[1][0][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+1][1]; w = constant[1][1][0]; sum0 += v*w; w = constant[1][1][1]; sum1 += v*w; w = constant[1][1][2]; sum2 += v*w; w = constant[1][1][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+2][1]; w = constant[1][2][0]; sum0 += v*w; w = constant[1][2][1]; sum1 += v*w; w = constant[1][2][2]; sum2 += v*w; w = constant[1][2][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+3][1]; w = constant[1][3][0]; sum0 += v*w; w = constant[1][3][1]; sum1 += v*w; w = constant[1][3][2]; sum2 += v*w; w = constant[1][3][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+0][2]; w = constant[2][0][0]; sum0 += v*w; w = constant[2][0][1]; sum1 += v*w; w = constant[2][0][2]; sum2 += v*w; w = constant[2][0][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+1][2]; w = constant[2][1][0]; sum0 += v*w; w = constant[2][1][1]; sum1 += v*w; w = constant[2][1][2]; sum2 += v*w; w = constant[2][1][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+2][2]; w = constant[2][2][0]; sum0 += v*w; w = constant[2][2][1]; sum1 += v*w; w = constant[2][2][2]; sum2 += v*w; w = constant[2][2][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+3][2]; w = constant[2][3][0]; sum0 += v*w; w = constant[2][3][1]; sum1 += v*w; w = constant[2][3][2]; sum2 += v*w; w = constant[2][3][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+0][3]; w = constant[3][0][0]; sum0 += v*w; w = constant[3][0][1]; sum1 += v*w; w = constant[3][0][2]; sum2 += v*w; w = constant[3][0][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+1][3]; w = constant[3][1][0]; sum0 += v*w; w = constant[3][1][1]; sum1 += v*w; w = constant[3][1][2]; sum2 += v*w; w = constant[3][1][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+2][3]; w = constant[3][2][0]; sum0 += v*w; w = constant[3][2][1]; sum1 += v*w; w = constant[3][2][2]; sum2 += v*w; w = constant[3][2][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+3][3]; w = constant[3][3][0]; sum0 += v*w; w = constant[3][3][1]; sum1 += v*w; w = constant[3][3][2]; sum2 += v*w; w = constant[3][3][3]; sum3 += v*w;

  // -- store output
  output[out_idx].x += sum0;
  output[out_idx].y += sum1;
  output[out_idx].z += sum2;
  output[out_idx].w += sum3;
}
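The fully unrolled dot products shown above follow a regular pattern, so a few lines of host-side code can produce them. The sketch below is an illustration in plain Python, not the code used in the lecture; the function name `render_dot_products` and its parameter names are assumptions chosen to match the `$FILTER_D`, `$FILTER_W` and `$N_FILTERS` template variables.

```python
def render_dot_products(filter_d=4, filter_w=4, n_filters=4):
    """Return the unrolled multiply-accumulate statements as a list of lines.

    Hypothetical helper: every loop below runs in Python at generation
    time, so the emitted CUDA code contains no loops at all.
    """
    lines = []
    for d in range(filter_d):
        for i in range(filter_w):
            # One shared-memory read per (i, d) pair ...
            lines.append("v = shared_in[threadIdx.x+%d][%d];" % (i, d))
            for n in range(n_filters):
                # ... reused for every filter. The accumulator name
                # (sum0, sum1, ...) is built by string formatting.
                lines.append("w = constant[%d][%d][%d];" % (d, i, n))
                lines.append("sum%d += v*w;" % n)
    return lines


statements = render_dot_products()
print("\n".join(statements[:9]))
print("multiply-accumulates:", sum(1 for s in statements if "+= v*w" in s))
```

The accumulators `sum0` to `sum3` are separate scalar variables whose names are generated on the host, and changing the loop ranges or emitting only part of a loop gives direct control over how much is unrolled.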
How about #pragma unroll? (Why don’t you trust the compiler?)
Using GPUs for Signal Correlation
Michael Clark
with Paul La Plante and Lincoln Greenhill
The Murchison Widefield Array
Daniel A. Mitchell
Figure 3: On the left is an image of the J2107-2526 field produced by integrating 8-second snapshots over
the entire time interval without blanking. On the right is an image of the field after RFI blanking and peeling,
along with contours of the unpeeled image.
occasion, reflect or refract into the receivers at levels that are orders of magnitude above the noise
floor. During deep integrations the MWA real-time system will simply discard dubious data. This
will require a series of data-quality tests, of which the simple median-based detector shown here
will form an integral part.
References
[1] A.E.E. Rogers, RFI Statistics at Boolardy, EDGES Memo, 058, 2010.
[2] D.A. Mitchell, L.J. Greenhill, R.B. Wayth, R.J. Sault, C.J. Lonsdale, R.J. Cappallo, M.F. Morales, and
S.M. Ord, Real-Time Calibration of the Murchison Widefield Array, IEEE Journal of Selected Topics
in Signal Processing, 2 (5), 707–717, 2008, [astro-ph/0807.191
2].
[3] C.J. Lonsdale, et al., The Murchison Widefield Array: Design Overview, Proceedings of the IEEE, 97
(8), 1497–1506, 2009, [astro-ph/0903.182
8].
[4] S.M. Ord, L.J. Greenhill, R.B. Wayth, D.A. Mitchell, K. Dale, H. Pfister, and R.G. Edgar, Graphics
Processing Units for Data Processing in the Murchison Wide-field Array, ASP Conference Series,
411, 127, 2009.
[5] J.P. Hamaker, J.D. Bregman, and R.J. Sault, Understanding radio polarimetry. I. Mathematical
foundations, Astron. Astrophys. Suppl. Ser., 117, 137–147, 1996.
[6] J.P. Hamaker, Understanding radio polarimetry. IV. The full-coherency analogue of scalar
self-calibration: Self-alignment, dynamic range and polarimetric fidelity, Astron. Astrophys. Suppl.
Ser., 143, 515–543, 2000.
[7] S.M. Ord, et al., Wide-field interferometric imaging via the combination of warped snapshots, in prep.
[8] P.A. Fridman, Statistically Stable Estimates of Variance in Radio-Astronomy Observations as Tools
for Radio-Frequency Interference Mitigation, The Astronomical Journal, 135 (5), 1810–1824, 2008.
[9] A.E. Wright and R. Otrupcek, (Eds), Parkes Catalogue, Australia Telescope National Facility, 1990.
[10] J.E. Noordam, LOFAR Calibration Challenges, in Proc. SPIE: Groundbased Telescopes, 5489,
817–825, 2004.
6
Thursday, 27 January 2011
IICS‘2011
we are not alone....
Don’t trust compilers
• Compare these “identical” code fragments
a += b*c + d*c + e*f + g*h;
a += b*c;
a += d*c;
a += e*f;
a += g*h;
1020 GFLOPS
770 GFLOPS
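With a code generator, both spellings of the same accumulation can be emitted from one description and then benchmarked, instead of guessing which one the compiler handles better. The following is a small hypothetical sketch in Python; the helper names `fused_form` and `split_form` are not from the slides.

```python
def fused_form(acc, terms):
    """Emit a single statement that adds all products at once."""
    products = " + ".join("%s*%s" % (x, y) for x, y in terms)
    return ["%s += %s;" % (acc, products)]


def split_form(acc, terms):
    """Emit one multiply-accumulate statement per product."""
    return ["%s += %s*%s;" % (acc, x, y) for x, y in terms]


terms = [("b", "c"), ("d", "c"), ("e", "f"), ("g", "h")]
print("\n".join(fused_form("a", terms)))
print("\n".join(split_form("a", terms)))
```

Note that the two forms add the products in a different order, so in floating point they can differ in the last bits even though they are algebraically the same.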
Smooth syntactic ugliness
Manipulations that are not easily accessible in CUDA C code:
• index un-indexable resources (e.g. regs)
Explore design decision space more freely
Basic GPU Meta-programming System
GPU Meta-Programming: A Case Study
in Biologically-Inspired Machine Vision
[GPU Computing Gems]
Pinto N, Cox DD
Exploring design decision space more freely
Meta-programming:
• enables efficient learning of the GPU hardware/software
• allows full exploitation of the GPU architecture
texture<float4, 1, cudaReadModeElementType> tex_float4;
__constant__ float constant[$FILTER_D][$FILTER_W][$N_FILTERS];

#define IMUL(a, b) __mul24(a, b)
extern "C" {

#for j in xrange($FILTER_H)

  __global__ void convolve_beta_j${j}(float4 *input, float4 *output)
  {

#set INPUT_BLOCK_W = $BLOCK_W+$FILTER_W-1
    __shared__ float shared_in[$INPUT_BLOCK_W][4+1];

    // -- input/output offsets
    const uint in_idx = (blockIdx.y+$j)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    float4 input_v4;

    // -- load input to shared memory
#for i in xrange($LOAD_ITERATIONS)
#if $i==($LOAD_ITERATIONS-1)
    if((threadIdx.x+$BLOCK_W*$i)<$INPUT_BLOCK_W)
#end if
      {
        input_v4 = tex1Dfetch(tex_float4, in_idx+$BLOCK_W*$i);
        shared_in[threadIdx.x+$BLOCK_W*$i][0] = input_v4.x;
        shared_in[threadIdx.x+$BLOCK_W*$i][1] = input_v4.y;
        shared_in[threadIdx.x+$BLOCK_W*$i][2] = input_v4.z;
        shared_in[threadIdx.x+$BLOCK_W*$i][3] = input_v4.w;
      }
#end for
conv_kernel_beta_template.cu
...
mad.rn.f32 $r4, s[$ofs3+0x0000], $r4, $r1
mov.b32 $r1, c0[$ofs2+0x0008]
mad.rn.f32 $r4, s[$ofs3+0x0008], $r1, $r4
mov.b32 $r1, c0[$ofs2+0x000c]
mad.rn.f32 $r4, s[$ofs3+0x000c], $r1, $r4
mov.b32 $r1, c0[$ofs2+0x0010]
mad.rn.f32 $r4, s[$ofs3+0x0010], $r1, $r4
...
version A
version B
...
mad.rn.f32 $r1, s[$ofs1+0x007c], c0[$ofs1+0x0078], $r1
mad.rn.f32 $r1, s[$ofs2+0x0000], c0[$ofs2+0x007c], $r1
mad.rn.f32 $r1, s[$ofs2+0x0008], c0[$ofs2+0x0080], $r1
mad.rn.f32 $r1, s[$ofs2+0x000c], c0[$ofs2+0x0084], $r1
mad.rn.f32 $r1, s[$ofs2+0x0010], c0[$ofs2+0x0088], $r1
...2x faster... Why ?
using decuda by Wladimir J. van der Laan
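The Cheetah template above (conv_kernel_beta_template.cu) uses #for and #if directives to unroll the shared-memory load loop when the source is generated, so the bounds check is emitted only for the last iteration. As a rough illustration of that idea, the sketch below performs the same unrolling with plain Python string formatting instead of Cheetah. The function name render_load_loop and the sizes in the usage line are made up for this example; only the emitted CUDA lines mirror the slide.

```python
def render_load_loop(block_w, filter_w, load_iterations):
    """Emit the unrolled shared-memory load, guarding only the last iteration."""
    # Same quantity as "#set INPUT_BLOCK_W = $BLOCK_W+$FILTER_W-1" in the template.
    input_block_w = block_w + filter_w - 1
    lines = []
    for i in range(load_iterations):
        offset = block_w * i
        if i == load_iterations - 1:
            # Only the final chunk can run past the end of the shared buffer.
            lines.append("if ((threadIdx.x + %d) < %d)" % (offset, input_block_w))
        lines.append("{")
        lines.append("  input_v4 = tex1Dfetch(tex_float4, in_idx + %d);" % offset)
        for k, comp in enumerate("xyzw"):
            lines.append(
                "  shared_in[threadIdx.x + %d][%d] = input_v4.%s;" % (offset, k, comp)
            )
        lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical sizes: 16-wide block, 9-wide filter, 2 load passes.
    print(render_load_loop(block_w=16, filter_w=9, load_iterations=2))
```

Because every offset is a literal in the generated source, the compiler sees constants rather than loop variables, which is the kind of manipulation the slides describe as awkward to express in CUDA C alone.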
Exploring design decision space more freely
When USE_THREAD_PER_FILTER is True
• each thread will access different cmem locations (in order)
using the decuda disassembler by Wladimir J. van der Laan (Python-based)
Exploring design decision space more freely
When USE_THREAD_PER_FILTER is False
• each thread will access the same cmem locations (broadcast)
Exploring design decision space more freely
2x faster... Why ?
vs.
more registers
thread-dependent data movement
Strategy
• intermediate design decisions can be made explicit
• multiple “forks” in the path can be kept in place
• frees up the developer to revisit past choices (without incurring a combinatorial explosion of separate pieces of code)
• retesting sets of assumptions can be done frequently and programmatically from the “outer” framework of code
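One way to picture "forks kept in place" is a single template whose design decisions are ordinary Python arguments, so the outer framework can regenerate every variant on demand. The sketch below is only an illustration: it uses the standard-library string.Template rather than Cheetah, the kernel is a toy 1D correlation rather than the convolution from the slides, and the names KERNEL_TEMPLATE, render_variant and all_variants are invented here. The use_thread_per_filter flag plays the role of USE_THREAD_PER_FILTER: when it is set, neighbouring threads read different constant-memory locations, otherwise they all read the same one (broadcast).

```python
import itertools
from string import Template

# Toy 1D filter-bank correlation; each rendered variant is a separate source file.
KERNEL_TEMPLATE = Template("""\
__constant__ float filters[${n_filters}][${filter_w}];

extern "C" __global__ void ${name}(const float *input, float *output)
{
    const unsigned int filter_idx = ${filter_idx};
    const unsigned int pixel = ${pixel};
    float sum = 0.0f;
    for (int k = 0; k < ${filter_w}; ++k)
        sum += input[pixel + k] * filters[filter_idx][k];
    output[filter_idx * ${n_pixels} + pixel] = sum;
}
""")


def render_variant(use_thread_per_filter, filter_w, n_filters=4, n_pixels=1024):
    """Render one kernel source string for a given pair of design decisions."""
    if use_thread_per_filter:
        # Each thread in a block handles a different filter (distinct cmem reads).
        fork = dict(tag="tpf", filter_idx="threadIdx.x", pixel="blockIdx.x")
    else:
        # Each thread handles a different pixel (same cmem address, broadcast).
        fork = dict(tag="tpp", filter_idx="blockIdx.y",
                    pixel="blockIdx.x * blockDim.x + threadIdx.x")
    return KERNEL_TEMPLATE.substitute(
        name="correlate_%s_w%d" % (fork["tag"], filter_w),
        filter_idx=fork["filter_idx"],
        pixel=fork["pixel"],
        filter_w=filter_w,
        n_filters=n_filters,
        n_pixels=n_pixels,
    )


def all_variants(filter_widths=(5, 9)):
    """Enumerate every combination of forks without duplicating kernel code."""
    forks = itertools.product((True, False), filter_widths)
    return {(flag, w): render_variant(flag, w) for flag, w in forks}


if __name__ == "__main__":
    for key, source in all_variants().items():
        print(key, "->", source.splitlines()[2])
```

Each source string could then be handed to a compiler wrapper such as PyCUDA and timed, which is what makes retesting assumptions from the outer framework cheap.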
http://wiki.tiker.net/PyCuda/Examples/DemoMetaMatrixmulCheetah
Matmul Toy Example
Summary
Meta-programming:
• can assist exploration and manual optimization
• can de-clutter code
• is easy and flexible with the right tools (e.g. Python, Py{CUDA,CL}, Cheetah, decuda)
➡ facilitates auto-tuning!
Need a pause?
How to get to the ninja level?
Practice, practice, practice...
Auto-tuning
The goal is to empirically optimize execution time given:
• the environment
- hardware (GPU, CPU, Memory, Mobo)
- software (SDK, Compiler suite)
• the data (input dimensions, repetitions, etc.)
Basic auto-tuning: pseudo-code (1/3)
Filter-bank Convolution / Correlation
Scripting, Py{CUDA,CL}
NoSQL (CouchDB, MongoDB) ?
Basic auto-tuning: pseudo-code (2/3)
PyCUDA/CL
Cheetah, Jinja, Mako
Basic auto-tuning: pseudo-code (3/3)
PyCUDA/CL
NoSQL (CouchDB, MongoDB)
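The pseudo-code on these three slides did not survive extraction, so what follows is not the authors' code. It is a minimal sketch of a brute-force auto-tuning loop under stated assumptions: enumerate candidate configurations, time each one, record failures, and keep the fastest. The workload cpu_stand_in is a CPU placeholder for "render the template, compile with PyCUDA, launch the kernel", the parameter names block_w, block_h and unroll are hypothetical, and results are kept in a plain list where the slides suggest a NoSQL store.

```python
import itertools
import time


def candidate_configs(search_space):
    """Yield every combination of tuning parameters as a dict."""
    names = sorted(search_space)
    for values in itertools.product(*(search_space[n] for n in names)):
        yield dict(zip(names, values))


def time_once(run, config):
    """Wall-clock time of one call; a GPU version would synchronize first."""
    start = time.perf_counter()
    run(config)
    return time.perf_counter() - start


def autotune(run, search_space, repeats=3, timer=time_once):
    """Return (best_config, results), keeping the minimum time per config."""
    results = []
    for config in candidate_configs(search_space):
        try:
            elapsed = min(timer(run, config) for _ in range(repeats))
        except Exception as exc:  # e.g. a variant that fails to compile or launch
            results.append((config, None, repr(exc)))
            continue
        results.append((config, elapsed, None))
    valid = [r for r in results if r[1] is not None]
    if not valid:
        raise RuntimeError("no candidate configuration ran successfully")
    best = min(valid, key=lambda r: r[1])
    return best[0], results


def cpu_stand_in(config):
    """Placeholder workload standing in for a generated, compiled kernel."""
    if config["block_w"] * config["block_h"] > 512:
        raise ValueError("too many threads per block")
    total = 0.0
    for i in range(0, 20000, config["unroll"]):
        total += i * 0.5
    return total


if __name__ == "__main__":
    space = {"block_w": [8, 16, 32], "block_h": [4, 16, 32], "unroll": [1, 2, 4]}
    best, log = autotune(cpu_stand_in, space)
    print("best:", best, "tried:", len(log))
```

Taking the minimum over a few repeats is one common way to damp timing noise; the timer argument is injectable so that a GPU event-based timer could replace the wall clock.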
Optimizing what?
Optimizing strategy
• Like many operations, filter-bank convolution is usually “communication bound” on the GPU:
  - compute is cheap
  - communication is expensive
• We must take advantage of all types of memory:
  - explicit: gmem (global), smem (shared), cmem (constant), tmem (texture)
  - implicit: rmem (registers), bmem (bin-code?) *
• Different optimal access patterns
Example: thread gmem output size
stupid float4 xyzw trick
Example: multiple smem loads
Example: using texture fetches
Example: register spilling
Example: register pressure (nvcc)
Example: capitalizing on bmem (bin code) ??
input offset in cubin code?
multiple versions of the same function with different input offsets
Results
GPU / SDK           Input         Filter-bank   Meta-prog default (gflops)   Meta-prog auto-tuned (gflops)   Boost
9600M GT, CUDA3.1   256x256x8     64x9x9x8        6.710 ± 0.005               36.584 ± 0.023                 445.2 %
9600M GT, CUDA3.1   512x512x4     32x13x13x4     13.606 ± 0.002               35.582 ± 0.003                 161.5 %
9600M GT, CUDA3.1   1024x1024x8   16x5x5x8       20.034 ± 0.113               26.084 ± 6.243                  30.2 %
9600M GT, CUDA3.1   2048x2048x4   4x8x8x4        25.781 ± 0.044               46.945 ± 0.100                  82.1 %
C1060, CUDA2.3      256x256x8     64x9x9x8      104.188 ± 0.051              168.083 ± 0.372                  61.3 %
C1060, CUDA2.3      512x512x4     32x13x13x4    125.739 ± 0.109              234.053 ± 0.266                  86.1 %
C1060, CUDA2.3      1024x1024x8   16x5x5x8      144.279 ± 0.764              243.697 ± 0.346                  68.9 %
C1060, CUDA2.3      2048x2048x4   4x8x8x4       180.060 ± 0.018              322.328 ± 0.348                  79.0 %
GTX285, CUDA2.3     256x256x8     64x9x9x8      123.396 ± 0.016              197.006 ± 0.219                  59.7 %
GTX285, CUDA2.3     512x512x4     32x13x13x4    143.277 ± 0.044              270.206 ± 0.209                  88.6 %
GTX285, CUDA2.3     1024x1024x8   16x5x5x8      148.841 ± 0.465              310.276 ± 0.538                 108.5 %
GTX285, CUDA2.3     2048x2048x4   4x8x8x4       205.152 ± 0.015              376.685 ± 0.070                  83.6 %
GTX480, CUDA3.1     256x256x8     64x9x9x8      467.631 ± 19.100             471.902 ± 11.419                  0.9 %
GTX480, CUDA3.1     512x512x4     32x13x13x4    834.838 ± 8.275              974.266 ± 3.809                  16.7 %
GTX480, CUDA3.1     1024x1024x8   16x5x5x8      542.808 ± 1.135              614.019 ± 0.904                  13.1 %
GTX480, CUDA3.1     2048x2048x4   4x8x8x4       378.165 ± 0.537              806.628 ± 0.168                 113.3 %
Source: Pinto, Cox (Submitted)
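The "Boost" column is consistent with the relative gain of the auto-tuned throughput over the default one, (tuned / default - 1) x 100. The short check below recomputes it for one row per GPU, using values copied from the table; the helper name boost_percent is introduced here only for illustration.

```python
def boost_percent(default_gflops, tuned_gflops):
    """Relative gain of the auto-tuned kernel over the default one, in percent."""
    return (tuned_gflops / default_gflops - 1.0) * 100.0


# (GPU, input, default gflops, auto-tuned gflops, boost reported in the table)
ROWS = [
    ("9600M GT", "256x256x8", 6.710, 36.584, 445.2),
    ("C1060", "512x512x4", 125.739, 234.053, 86.1),
    ("GTX285", "1024x1024x8", 148.841, 310.276, 108.5),
    ("GTX480", "2048x2048x4", 378.165, 806.628, 113.3),
]

if __name__ == "__main__":
    for gpu, shape, default, tuned, reported in ROWS:
        computed = boost_percent(default, tuned)
        print("%-9s %-12s %6.1f %% (table: %.1f %%)" % (gpu, shape, computed, reported))
```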
Analysis
Empirical results...
Performance (gflops)
Q9450 (Matlab/C) [2008]       0.3
Q9450 (C/SSE) [2008]          9.0
7900GTX (Cg) [2006]          68.2
PS3/Cell (C/ASM) [2007]     111.4
8800GTX (CUDA1.x) [2007]    192.7
GTX280 (CUDA2.x) [2008]     339.3
GTX480 (CUDA3.x) [2010]     974.3
>1000X speedup is game changing...
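The ">1000X" figure follows from the chart's extremes: 0.3 gflops for the Matlab/C baseline on the Q9450 and 974.3 gflops on the GTX480, a ratio of roughly 3,200. The snippet below simply divides each entry by the baseline; the pairing of platform labels with values is read off the chart and is the only assumption.

```python
# Throughput in gflops, as read from the chart.
GFLOPS = {
    "Q9450 (Matlab/C) [2008]": 0.3,
    "Q9450 (C/SSE) [2008]": 9.0,
    "7900GTX (Cg) [2006]": 68.2,
    "PS3/Cell (C/ASM) [2007]": 111.4,
    "8800GTX (CUDA1.x) [2007]": 192.7,
    "GTX280 (CUDA2.x) [2008]": 339.3,
    "GTX480 (CUDA3.x) [2010]": 974.3,
}

BASELINE = GFLOPS["Q9450 (Matlab/C) [2008]"]


def speedup(name):
    """Throughput of one platform relative to the Matlab/C baseline."""
    return GFLOPS[name] / BASELINE


if __name__ == "__main__":
    for name in GFLOPS:
        print("%-26s %8.0fx" % (name, speedup(name)))
```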
Summary
• Meta-programming makes developing high-performing code for GPU easier
• Fantastic tools exist (e.g. PyCUDA) to help
• Interesting way to explore/learn about GPUs (hw/sw)
• Coarse auto-tuning yields good results
Future
• More Fermi optimizations (L1 cache, concurrent kernels)
• OpenCL to optimize across vendors
• Smarter auto-tuning techniques (ML)
  - (boosted) decision trees
  - evolutionary programming strategies
• Tue 3/29/11: Algorithm Strategies (W. Hwu, UIUC)
• Thu 3/31/11: PyOpenCL (A. Klöckner, NYU), ahh (C. Omar, CMU)
• Tue 4/5/11: Analysis-driven Optimization (C. Woolley, NVIDIA)
• Thu 4/7/11: Irregular Parallelism & Efficient Data Structures (J. Owens, UC Davis)
• Thu 4/14/11: Optimization for Ninjas (D. Merrill, UVa)
• ...
More ?
iPhD: one more thing, or two...
Life/Code Hacking #2.x: Speed {listen,read,writ}ing
accelerated e-learning (c) / massively parallel {learn,programm}ing (c)
Life/Code Hacking #2.2b: Speed writing
RSI ?
Life/Code Hacking #2.3: Speed reading
1. Collect many papers, docs, chapters, etc. (100)
2. Skim through them quickly / select (50)
3. Read w/o full understanding / select (25)
4. Read completely w/ full understanding / select (10)
5. Complete mastery + reproduction (5)
http://readerssoft.com/speed_reading_obstacles.php
normal reading vs. speed reading
like David Guetta, use one finger !
COME