PyCuda: How to harness the power of graphics cards in Python applications

PyCUDA: Harnessing the power of GPU with Python

PyCon 4 – Florence 2010 – Fabrizio Milo

Talk Structure

1. Why a GPU?
2. How does it work?
3. How do I program it?
4. Can I use Python?


WHY A GPU ?


APPLICATIONS & DEMOS


Why GPU?




How does it work?


[Figure: CPU and GPU block diagrams side by side, with components labeled Control, Cache, ALU, and DRAM]


CUDA


Compute Unified Device Architecture


CUDA

A Parallel Computing Architecture for NVIDIA GPUs

Direct X Compute


Execution Model

CUDA Device Model


EXECUTION MODEL


Thread

Smallest unit of logic


A Block

A Group of Threads


A Grid

A Group of Blocks


One Block can have many threads


One Grid can have many blocks


DEVICE MODEL The hardware


Scalar Processor


Many Scalar Processors


+ Register File


+ Shared Memory


Multiprocessor


Device


Real Example: 10-Series Architecture

- 240 Scalar Processor (SP) cores execute kernel threads
- 30 Streaming Multiprocessors (SMs), each containing:
    - 8 scalar processors
    - 1 double precision unit
    - Shared memory


Software        Hardware

Thread          Scalar Processor
Thread Block    Multiprocessor
Grid            Device


Global Memory


Host - Device

[Figure: diagram labeled RAM, CPU, and Global Memory]


Host - Multi Device

[Figure: diagram labeled RAM and CPU]




__global__ void multiply_them( float *dest, float *a, float *b )
{
    const int i = threadIdx.x;
    dest[i] = a[i] * b[i];
}

Kernel

Thread


__global__ void multiply_them( float *dest, float *a, float *b )
{
    const int i = threadIdx.x;
    dest[i] = a[i] * b[i];
}

Kernel

Block


__global__ void kernel( … )
{
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    …
}

Kernel

Grid
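To make the index arithmetic concrete, here is a small pure-Python sketch (no GPU required) that lists the global index each thread of a one-dimensional launch would compute. The name `global_indices` and the launch sizes are illustrative assumptions, not part of the talk, and on a real device the threads run in parallel rather than in nested loops.

```python
def global_indices(grid_dim, block_dim):
    """Return the global index computed by every thread of a 1-D launch."""
    indices = []
    for block_idx in range(grid_dim):          # plays the role of blockIdx.x
        for thread_idx in range(block_dim):    # plays the role of threadIdx.x
            # block_dim plays the role of blockDim.x
            indices.append(block_idx * block_dim + thread_idx)
    return indices


if __name__ == "__main__":
    # 3 blocks of 4 threads cover the indices 0..11, each exactly once.
    print(global_indices(grid_dim=3, block_dim=4))
```

Each thread gets a distinct index, which is why the kernel can use `idx` to pick its own array element.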


How do I Program it ?

[Figure: build workflow. Main Logic: GCC, .bin, CPU. Kernel: NVCC, .cubin, GPU.]


Host - Device

[Figure: diagram labeled RAM, CPU, and Global Memory]


Allocate Memory

cudaMalloc( pointer, size )


Copy to device


cudaMemcpy( dest, src, size, direction)


Kernel Launch



Kernel<<< # blocks, # threads >>>( *params )
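The number of blocks is usually derived from the problem size. Below is a minimal Python sketch of that arithmetic for a one-dimensional launch; `blocks_needed` is a hypothetical helper written for this example, not a CUDA or PyCUDA function.

```python
def blocks_needed(n_elements, threads_per_block):
    """Smallest number of blocks whose threads cover n_elements (ceiling division)."""
    if threads_per_block <= 0:
        raise ValueError("threads_per_block must be positive")
    return (n_elements + threads_per_block - 1) // threads_per_block


if __name__ == "__main__":
    n = 1000
    threads = 256
    blocks = blocks_needed(n, threads)
    # 4 blocks * 256 threads = 1024 threads, so 24 threads fall past the end
    # of the data; the kernel then needs a guard such as `if (idx < n)`.
    print(blocks, blocks * threads - n)
```

Rounding up means the last block may be only partly used, so a kernel launched this way should check its index against the array length before writing.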


Get Back the Results



cudaMemcpy( dest, src, size, direction )



Error Handling

if (cudaMalloc( pointer, size ) != cudaSuccess) { handle_error(); }


And soon it becomes …


if (cudaMemcpy( dest, src, size, direction ) != cudaSuccess) { handle_error(); }

Kernel<<< # blocks, # threads >>>( *params );
if (cudaGetLastError() != cudaSuccess) { handle_error(); }

if (cudaMemcpy( dest, src, size, direction ) != cudaSuccess) { handle_error(); }


Python + CUDA = PyCUDA

& Andreas Klöckner


PyCuda Philosophy

- Provide Complete Access
- Automatically Manage Resources
- Check and Report Errors
- Cross Platform
- Allow Interactive Use
- NumPy Integration


NUMPY - ARRAY


import numpy

my_array = numpy.array([1,] * 100)    # 100 elements, indices 0 … 99, all equal to 1

my_array[3] = 0                       # the element at index 3 is now 0


PyCuda: Workflow


Memory Allocation

import pycuda.autoinit                      # creates a CUDA context on import
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

gpu_mem = cuda.mem_alloc( size_bytes )


Memory Copy

cuda.memcpy_htod( gpu_mem, cpu_mem )        # host to device


Kernel

mod = SourceModule("""
__global__ void multiply_them( float *dest, float *a, float *b )
{
    const int i = threadIdx.x;
    dest[i] = a[i] * b[i];
}
""")


Kernel Launch

multiply_them = mod.get_function("multiply_them")

# Block size as shown on the slide; the product x*y*z must stay
# within the device's maximum number of threads per block.
multiply_them( *args, block=(30, 64, 1) )
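Putting the four steps together, the following is a sketch of the whole round trip, under a few assumptions: the input is non-empty and small enough to fit in a single block, and PyCUDA, NumPy and a CUDA device are available. The function names `multiply_on_gpu` and `multiply_reference` are made up for this example. The GPU imports sit inside the function so the file can still be loaded on a machine without a GPU, and the pure-Python reference gives something to compare against.

```python
def multiply_reference(a, b):
    """CPU reference: element-wise product of two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return [x * y for x, y in zip(a, b)]


def multiply_on_gpu(a, b):
    """Element-wise product on the GPU; needs PyCUDA, NumPy and a CUDA device."""
    # Imported lazily so this file can be loaded on a machine without a GPU.
    import numpy
    import pycuda.autoinit  # noqa: F401  (creates the CUDA context)
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule

    a = numpy.ascontiguousarray(a, dtype=numpy.float32)
    b = numpy.ascontiguousarray(b, dtype=numpy.float32)
    if a.shape != b.shape or a.ndim != 1:
        raise ValueError("inputs must be 1-D and have the same length")
    dest = numpy.empty_like(a)

    # Allocate device memory and copy the inputs from host to device.
    a_gpu = cuda.mem_alloc(a.nbytes)
    b_gpu = cuda.mem_alloc(b.nbytes)
    dest_gpu = cuda.mem_alloc(dest.nbytes)
    cuda.memcpy_htod(a_gpu, a)
    cuda.memcpy_htod(b_gpu, b)

    # Compile the kernel from source at run time.
    mod = SourceModule("""
    __global__ void multiply_them(float *dest, float *a, float *b)
    {
        const int i = threadIdx.x;
        dest[i] = a[i] * b[i];
    }
    """)
    multiply_them = mod.get_function("multiply_them")

    # One block with one thread per element (assumes the length fits in a block).
    multiply_them(dest_gpu, a_gpu, b_gpu, block=(len(a), 1, 1), grid=(1, 1))

    # Copy the result from device back to host.
    cuda.memcpy_dtoh(dest, dest_gpu)
    return dest.tolist()
```

On a machine with a GPU, `multiply_on_gpu([1, 2, 3], [4, 5, 6])` should agree with `multiply_reference` up to float32 rounding.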


DEMO Hello Gpu


GPUARRAY


PyCuda: GpuArray

gpu_array = gpuarray.to_gpu( numpy_array )

numpy_array = gpu_array.get()

+, -, *, /, fill, sin, exp, rand, basic indexing, norm, inner product …


PyCuda: GpuArray: ElementWise

import numpy.linalg as la
import pycuda.gpuarray as gpuarray
from pycuda.elementwise import ElementwiseKernel

# Arguments: the C argument list, then the operation applied at each index i.
lincomb = ElementwiseKernel(
    "float a, float *x, float b, float *y, float *z",
    "z[i] = a*x[i] + b*y[i]"
)

# a_gpu and b_gpu are existing GPU arrays of the same shape.
c_gpu = gpuarray.empty_like(a_gpu)
lincomb(5, a_gpu, 6, b_gpu, c_gpu)

assert la.norm((c_gpu - (5*a_gpu + 6*b_gpu)).get()) < 1e-5


Meta-Programming

__kernel_template__ = """
__global__ void kernel( args )
{
    for (int i = 0; i < {{ iterations }}; i++) {
        {{ operations }}
    }
}
"""

See for example jinja2
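The same idea can be sketched with only the standard library. The example below swaps jinja2 for `string.Template`, so placeholders are written `$name` rather than `{{ name }}`. The helper `render_kernel`, the `float *data` argument and the operation strings are illustrative assumptions. The rendered string is what would then be handed to `SourceModule`.

```python
from string import Template

# Kernel skeleton with two placeholders: the loop count and the loop body.
KERNEL_TEMPLATE = Template("""\
__global__ void kernel(float *data)
{
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = 0; i < $iterations; i++) {
        $operations
    }
}
""")


def render_kernel(iterations, operations):
    """Fill the template with a loop count and a list of C statements."""
    body = "\n        ".join(operations)
    return KERNEL_TEMPLATE.substitute(iterations=int(iterations), operations=body)


if __name__ == "__main__":
    print(render_kernel(4, ["data[idx] *= 2.0f;", "data[idx] += 1.0f;"]))
```

Because the kernel source is an ordinary Python string, loop counts and operations can be chosen at run time before the code is compiled.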


Generate Source!


Performance?


DEMO mandelbrot


PyCuda: Documentation


PyCuda

Website: http://mathema.tician.de/software/pycuda

License: X Consortium License (no warranty, free for all use)

Dependencies: Python 2.4+, numpy, Boost


In the Future …

OPENCL


THANK YOU & HAVE FUN !


?