PyCUDA: Harnessing the power of GPU with Python
PyCon 4 – Florence 2010 – Fabrizio Milo
Talk Structure
1. Why a GPU?
2. How does it work?
3. How do I program it?
4. Can I use Python?
WHY A GPU?
APPLICATIONS & DEMOS
How does it work?
[Figure: CPU and GPU block diagrams side by side. CPU labels: Control, Cache, four ALUs, DRAM. GPU labels: DRAM.]
CUDA
Compute Unified Device Architecture
A Parallel Computing Architecture for NVIDIA GPUs
DirectX Compute
Execution Model
CUDA Device Model
EXECUTION MODEL
Thread: the smallest unit of logic
Block: a group of threads
Grid: a group of blocks
One block can have many threads.
One grid can have many blocks.
DEVICE MODEL: the hardware
Scalar Processor
Multiprocessor: many scalar processors + register file + shared memory
Device: a group of multiprocessors
Real Example: 10-Series Architecture
- 240 Scalar Processor (SP) cores execute kernel threads
- 30 Streaming Multiprocessors (SMs), each containing:
  - 8 scalar processors
  - 1 double precision unit
  - Shared memory
Software        Hardware
Thread          Scalar Processor
Thread Block    Multiprocessor
Grid            Device
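The 240-core figure for the 10-series follows from the per-multiprocessor numbers. A minimal sketch of that arithmetic in Python; the class and field names are illustrative only and are not part of any CUDA API:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceSpec:
    """Illustrative description of a GPU; all names here are hypothetical."""
    multiprocessors: int            # streaming multiprocessors (SMs) on the device
    scalar_processors_per_sm: int   # scalar processor (SP) cores in each SM
    double_units_per_sm: int        # double precision units in each SM

    @property
    def scalar_processors(self) -> int:
        # Total SP cores on the device
        return self.multiprocessors * self.scalar_processors_per_sm

    @property
    def double_units(self) -> int:
        # Total double precision units on the device
        return self.multiprocessors * self.double_units_per_sm


# Figures from the slide: 30 SMs, each with 8 SPs and 1 double precision unit.
ten_series = DeviceSpec(
    multiprocessors=30,
    scalar_processors_per_sm=8,
    double_units_per_sm=1,
)

print(ten_series.scalar_processors)  # 240
print(ten_series.double_units)       # 30
```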
Global Memory
Host - Device
[Figure: the host (CPU and RAM) connected to the device (Global Memory)]
Host - Multi Device
[Figure: one host (CPU and RAM) connected to several devices]
How do I program it?
Kernel
    __global__ void multiply_them(float *dest, float *a, float *b)
    {
        const int i = threadIdx.x;   // index of this thread within its block
        dest[i] = a[i] * b[i];
    }
Thread: each thread runs the kernel body once.
Block: threadIdx.x identifies a thread within its block.
Kernel
    __global__ void kernel( … )
    {
        const int idx = blockIdx.x * blockDim.x + threadIdx.x;   // index across the whole grid
        …
    }
Grid: blockIdx.x and blockDim.x extend the index across all blocks.
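To see why this formula gives every thread a distinct index, the launch can be simulated on the CPU. This is only a sketch in plain Python: `global_indices`, `grid_dim`, and `block_dim` are names chosen here, and the loops run in sequence, whereas real GPU threads run in parallel and in no guaranteed order.

```python
def global_indices(grid_dim, block_dim):
    """Return the idx value each simulated thread would compute."""
    indices = []
    for block_idx in range(grid_dim):           # plays the role of blockIdx.x
        for thread_idx in range(block_dim):     # plays the role of threadIdx.x
            # block_dim plays the role of blockDim.x
            indices.append(block_idx * block_dim + thread_idx)
    return indices


# A hypothetical launch of 3 blocks with 4 threads each.
idx = global_indices(grid_dim=3, block_dim=4)
print(idx)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# Every element of a 12-element array is covered exactly once.
assert sorted(idx) == list(range(12))
```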
[Figure: how a CUDA program is built. The kernel is compiled by NVCC into a .cubin that runs on the GPU; the main logic is compiled by GCC into a .bin that runs on the CPU.]
Allocate Memory
    cudaMalloc( pointer, size )
Copy to device
    cudaMemcpy( dest, src, size, direction )
Kernel Launch
    Kernel<<< # blocks, # threads >>>( *params )
Get Back the Results
    cudaMemcpy( dest, src, size, direction )
Error Handling
    if (cudaMalloc( pointer, size ) != cudaSuccess) { handle_error(); }
And soon it becomes …
    if (cudaMalloc( pointer, size ) != cudaSuccess) { handle_error(); }
    if (cudaMemcpy( dest, src, size, direction ) != cudaSuccess) { handle_error(); }
    Kernel<<< # blocks, # threads >>>( *params );
    if (cudaGetLastError() != cudaSuccess) { handle_error(); }
    if (cudaMemcpy( dest, src, size, direction ) != cudaSuccess) { handle_error(); }
(the same checks repeat around every allocation, copy, and launch in the program)
Can I use Python?
Python + CUDA = PyCUDA, by Andreas Klöckner
PyCUDA Philosophy
- Provide complete access
- Automatically manage resources
- Check and report errors
- Cross-platform
- Allow interactive use
- NumPy integration
NUMPY - ARRAY
    import numpy
    my_array = numpy.array([1,] * 100)   # 100 ones, at indices 0 to 99
    my_array[3] = 0                      # set the element at index 3 to zero
PyCUDA: Workflow
Memory Allocation
    gpu_mem = cuda.mem_alloc( size_bytes )
Memory Copy
    cuda.memcpy_htod( gpu_mem, cpu_mem )
Kernel
    mod = SourceModule("""
    __global__ void multiply_them(float *dest, float *a, float *b)
    {
        const int i = threadIdx.x;
        dest[i] = a[i] * b[i];
    }
    """)
Kernel Launch
    multiply_them = mod.get_function("multiply_them")
    multiply_them(*args, block=(64, 1, 1))   # one 1-D block: the kernel indexes with threadIdx.x only
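Put together, the workflow above might look like the following. This is a sketch in the style of the PyCUDA tutorial, not the code shown in the talk: the function name `multiply_on_gpu`, the vector length `n`, and the random input data are assumptions made here. It needs PyCUDA, NumPy, and a CUDA-capable GPU to run; the imports are placed inside the function so the file can still be loaded on a machine without them.

```python
KERNEL_SOURCE = """
__global__ void multiply_them(float *dest, float *a, float *b)
{
    const int i = threadIdx.x;
    dest[i] = a[i] * b[i];
}
"""


def multiply_on_gpu(n=64):
    """Multiply two random float32 vectors of length n on the GPU.

    n must not exceed the device's maximum number of threads per block,
    because the kernel is launched as a single 1-D block.
    """
    import numpy
    import pycuda.autoinit  # noqa: F401  (initializes CUDA and creates a context)
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule

    a = numpy.random.randn(n).astype(numpy.float32)
    b = numpy.random.randn(n).astype(numpy.float32)
    dest = numpy.empty_like(a)

    # Memory allocation and copy to the device
    a_gpu = cuda.mem_alloc(a.nbytes)
    b_gpu = cuda.mem_alloc(b.nbytes)
    dest_gpu = cuda.mem_alloc(dest.nbytes)
    cuda.memcpy_htod(a_gpu, a)
    cuda.memcpy_htod(b_gpu, b)

    # Compile the kernel and launch one block of n threads
    mod = SourceModule(KERNEL_SOURCE)
    multiply_them = mod.get_function("multiply_them")
    multiply_them(dest_gpu, a_gpu, b_gpu, block=(n, 1, 1), grid=(1, 1))

    # Copy the result back to the host
    cuda.memcpy_dtoh(dest, dest_gpu)
    return a, b, dest
```

On a machine with a GPU, calling `a, b, dest = multiply_on_gpu()` and then checking `numpy.allclose(dest, a * b)` would be a reasonable way to confirm the result.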
DEMO: Hello GPU
GPUARRAY
PyCUDA: GpuArray
    a_gpu = gpuarray.to_gpu(numpy_array)   # host to device
    numpy_array = a_gpu.get()              # device to host
Supported operations: +, -, *, /, fill, sin, exp, rand, basic indexing, norm, inner product …
PyCUDA: GpuArray: ElementWise
    import numpy.linalg as la
    import pycuda.gpuarray as gpuarray
    from pycuda.elementwise import ElementwiseKernel

    lincomb = ElementwiseKernel(
        "float a, float *x, float b, float *y, float *z",
        "z[i] = a*x[i] + b*y[i]")

    c_gpu = gpuarray.empty_like(a_gpu)
    lincomb(5, a_gpu, 6, b_gpu, c_gpu)
    assert la.norm((c_gpu - (5*a_gpu + 6*b_gpu)).get()) < 1e-5
Meta-Programming
    __kernel_template__ = """
    __global__ void kernel( args ){
        for (int i = 0; i < {{ iterations }}; i++){
            {{ operations }}
        }
    }"""
See, for example, jinja2.
Generate Source!
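A minimal sketch of source generation. It swaps jinja2 for `string.Template` from the Python standard library so that it has no dependencies, which means placeholders are written `$name` rather than `{{ name }}`. The kernel name `scale`, its parameter, and the helper `render_kernel` are made up for illustration.

```python
from string import Template

# CUDA C source with two placeholders: the loop bound and the loop body.
KERNEL_TEMPLATE = Template("""
__global__ void scale(float *data)
{
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = 0; i < $iterations; i++) {
        $operations
    }
}
""")


def render_kernel(iterations, operations):
    """Return CUDA C source with the loop bound and loop body filled in."""
    return KERNEL_TEMPLATE.substitute(
        iterations=int(iterations),   # int() keeps arbitrary text out of the loop bound
        operations=operations,
    )


source = render_kernel(4, "data[idx] *= 2.0f;")
print(source)
```

The resulting string is ordinary CUDA C, so it could be handed to `SourceModule` in the same way as a hand-written kernel.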
Performance?
DEMO: Mandelbrot
PyCUDA: Documentation
PyCUDA
Website: http://mathema.tician.de/software/pycuda
License: X Consortium License (no warranty, free for all use)
Dependencies: Python 2.4+, numpy, Boost
In the Future …
OPENCL
THANK YOU & HAVE FUN!
?