OpenCL Tutorial - Basics


Transcript of OpenCL Tutorial - Basics

Page 1: OpenCL Tutorial - Basics

OpenCL Tutorial
Guillermo Marcus

Page 2: OpenCL Tutorial - Basics

14:00 Part I: OpenCL Overview, Hello Vector

15:30 Coffee Break

16:00 Part II: Reduction, Matrix Multiply

Overview

Page 3: OpenCL Tutorial - Basics

About me

Dr. Guillermo Marcus, [email protected]

PhD in Computer Science from Heidelberg, 2011
Head of the Scientific Computing Research Group until March 2013
NVIDIA (OptiX Group) from May 2013

Taught the ZITI Master Lecture in GPU Computing between 2011 and 2013

Page 4: OpenCL Tutorial - Basics

OpenCL Overview

Standardized language to program accelerators
http://www.khronos.org/opencl

C-based: the host API is C, and GPU code is C or C-like. Kernel code compiles at runtime.

Supported by multiple hardware vendors: NVIDIA, AMD, ARM, PowerVR, Altera

While code is portable, optimizations are not!

Page 5: OpenCL Tutorial - Basics

OpenCL Basics

Application Models

Execution Model

Memory Model

Page 6: OpenCL Tutorial - Basics

Application Model

Activities are driven by the host computer

Multiple platforms, multiple devices possible

IO is an important part of the model

Page 7: OpenCL Tutorial - Basics

GPU Kernels

- Starts a computation on the GPU
- "Launches" (starts) a collection of threads
- Requires code to execute AND a specification (how the threads are organized)
- Can be blocking or non-blocking

Page 8: OpenCL Tutorial - Basics

Execution Model

Work Items: kernel code, a "serial" execution thread, private variables

Work Groups: synchronization inside the group, data sharing inside the group

Program Grid: collection of Work Groups, no synchronization, no data sharing

int a[N], b[N], c[N];
int i, tid;

tid = getThreadID();
for (i = tid; i < N; i += 4)
    c[i] = a[i] + b[i];

Work Item
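The slide's `getThreadID()` is pseudocode, not an actual OpenCL built-in (a real kernel would call `get_global_id(0)`). As a minimal plain-C sketch, the strided loop can be simulated on the host by running every work item in turn; `run_all_work_items` is an illustrative helper name, not part of any API:

```c
#include <assert.h>

#define N 10
#define NUM_THREADS 4   /* matches the stride of 4 in the slide's loop */

/* Simulate all work items executing the slide's strided loop:
   "thread" tid handles elements tid, tid+4, tid+8, ... so together
   the four threads cover the whole array exactly once. */
static void run_all_work_items(const int *a, const int *b, int *c) {
    for (int tid = 0; tid < NUM_THREADS; tid++)      /* one pass per work item */
        for (int i = tid; i < N; i += NUM_THREADS)   /* the slide's loop body */
            c[i] = a[i] + b[i];
}
```

On a GPU these passes run concurrently; the serialization here only shows which elements each work item touches.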

Page 9: OpenCL Tutorial - Basics

Work Items

A single thread on the GPU. They are normally executed as SIMT.

Thread code is the same for all work items
Work items can have private variables
Each has a unique ID inside the kernel

int a[N], b[N], c[N];
int i, tid;

tid = getThreadID();
for (i = tid; i < N; i += 4)
    c[i] = a[i] + b[i];

Page 10: OpenCL Tutorial - Basics

Single Instruction, Multiple Threads

Combines the flexibility of the thread model with the efficiency of the Single Instruction, Multiple Data architecture.

Normally, there are many more threads than workers.

(Diagram: workers 1-4, each executing a subset of the threads in turn)

int a[N], b[N], c[N];
int tid;

tid = getThreadID();
c[tid] = a[tid] + b[tid];
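The SIMT idea above can be sketched in plain C, assuming a hypothetical machine with 4 workers and 16 threads: the threads run in batches, and within a batch every worker executes the same instruction on its own thread's data (`simt_vector_add` is an illustrative name, not an OpenCL call):

```c
#include <assert.h>

#define NT 16          /* total threads: many more than workers */
#define NUM_WORKERS 4  /* hardware workers executing in lockstep */

/* Simulate SIMT execution of the slide's one-line kernel:
   batches of NUM_WORKERS threads run together; inside a batch,
   all workers perform the same addition on different data. */
static void simt_vector_add(const int *a, const int *b, int *c) {
    for (int base = 0; base < NT; base += NUM_WORKERS)  /* one batch at a time */
        for (int w = 0; w < NUM_WORKERS; w++) {         /* lockstep workers */
            int tid = base + w;                         /* thread handled by worker w */
            c[tid] = a[tid] + b[tid];                   /* same instruction, different data */
        }
}
```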

Page 11: OpenCL Tutorial - Basics

Work Groups

Work Groups are collections of Work Items. Items inside a Work Group ...

are executed in parallel*
share local data
have a local ID
can be organized as 1D, 2D, 3D* arrays

Work Groups ...
are independent of each other
have a unique ID inside the kernel
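In OpenCL the group ID and local ID come from the built-ins `get_group_id(0)` and `get_local_id(0)`; for a 1D range (with no global offset) their relation to the global ID is simple integer arithmetic, sketched here with illustrative helper names:

```c
#include <assert.h>

/* For a 1D range: a work item's group ID and local ID are derived
   from its global ID and the work-group size. In a real kernel these
   come from get_group_id(0) / get_local_id(0); this sketch only
   shows how the three IDs relate. */
static int group_id(int global_id, int local_size) {
    return global_id / local_size;   /* which work group the item is in */
}
static int local_id(int global_id, int local_size) {
    return global_id % local_size;   /* position inside that work group */
}
```

Note that `local_size * group_id + local_id` reconstructs the global ID.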

Page 12: OpenCL Tutorial - Basics

Program Grid

Work Groups are organized as a 1D, 2D, 3D array

Between Work Groups there is ...
no communication
no data synchronization

In fact, often there is not even data coherency between work groups!

Page 13: OpenCL Tutorial - Basics

Memory Model

Hierarchical organization of areas: Host, Global, Local, Registers

Moving data between areas is expensive

Data coherency is not guaranteed at all times or across all areas

Every area has its own constraint set

Controlled by address-space attributes in the code definition (e.g. __global, __local, __constant, __private)

Page 14: OpenCL Tutorial - Basics

Memory Model Overview

Page 15: OpenCL Tutorial - Basics

Host Memory

Main Memory of the Host Computer

Can move data only between the host and the GPU Global Memory

Transfer is always initiated by the Host; it can be synchronous or asynchronous

Bandwidth is limited by the PCIe links

Page 16: OpenCL Tutorial - Basics

Global Memory

Main GPU Memory, available to all threads
Biggest in size, up to several GBs

Huge bandwidth, but also huge latency
typically 400-800 cycles
not always cached

Performance is very dependent on access patterns

Page 17: OpenCL Tutorial - Basics

Local Memory

Available to all threads inside a Work Group
Limited in size (typically 8KB-64KB)

Latency comparable to registers

Constrained by access rules (e.g. bank conflicts), so performance is again limited by access patterns

Used as a scratchpad or as a software-managed cache for global memory
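The scratchpad pattern can be illustrated with a plain-C sketch of a work-group sum, assuming a hypothetical group size of 8: each "work item" copies one element from (slow) global memory into the (fast) scratchpad once, and the tree reduction then touches only the scratchpad. `group_sum` and `scratch` are illustrative stand-ins for a kernel's `__local` buffer, not OpenCL API:

```c
#include <assert.h>

#define GROUP_SIZE 8

/* Simulate one work group summing GROUP_SIZE values via a scratchpad:
   one global read per work item, then a tree reduction that runs
   entirely in the scratchpad (the stand-in for __local memory). */
static int group_sum(const int *global_mem) {
    int scratch[GROUP_SIZE];
    for (int lid = 0; lid < GROUP_SIZE; lid++)
        scratch[lid] = global_mem[lid];              /* stage into "local" memory */
    for (int stride = GROUP_SIZE / 2; stride > 0; stride /= 2)
        for (int lid = 0; lid < stride; lid++)       /* tree reduction step */
            scratch[lid] += scratch[lid + stride];
    return scratch[0];                               /* group result */
}
```

In a real kernel the reduction steps are separated by a work-group barrier; the host-side loop here makes that ordering implicit.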

Page 18: OpenCL Tutorial - Basics

GPU Registers

Private to every thread

Normally hidden, no direct access, optimized by the compiler

Fastest access; constrained only by the number of available registers

Some platforms may use more registers than others... it depends on the hardware architecture

Page 19: OpenCL Tutorial - Basics

Constant Memory

Read only memory

Cached

Good for storing Look Up Tables and non-changeable values

It is normally a small area of the global memory

Page 20: OpenCL Tutorial - Basics

Private Memory

Unique to every Work Item

Normally it is mapped to registers first, then spilled to global memory when no free registers remain

Page 21: OpenCL Tutorial - Basics

Kernel Specification

Defines the number and distribution of threads inside the kernel.

A GPU program can be launched with different specifications, creating different kernels.

The distribution is defined as global and local settings, defining the total number of threads, and the number of threads per work group, respectively, as well as their organization.

Page 22: OpenCL Tutorial - Basics

Global and Local Settings (1D)

// Create kernel specification (ND range)
NDRange global(VECT_SIZE);
NDRange local(1);

// Create kernel specification (ND range)
int groups = VECT_SIZE/64 + ((VECT_SIZE % 64 == 0) ? 0 : 1);
NDRange global(64*groups);
NDRange local(64);
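The rounding expression in the slide can be checked in isolation: the global size must be a multiple of the local size, so a partial last group is padded up. `num_groups` and `global_size` are illustrative helper names, not OpenCL API:

```c
#include <assert.h>

/* Round the total work size up to a whole number of work groups,
   exactly as the slide's expression does: add one extra group
   whenever the division leaves a remainder. */
static int num_groups(int vect_size, int local_size) {
    return vect_size / local_size + ((vect_size % local_size == 0) ? 0 : 1);
}

/* The padded global size: local_size times the rounded group count. */
static int global_size(int vect_size, int local_size) {
    return local_size * num_groups(vect_size, local_size);
}
```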

Page 23: OpenCL Tutorial - Basics

Global and Local Settings (2D)

// Create kernel specification (ND range)

int gX = X_SIZE/4 + ((X_SIZE % 4 == 0) ? 0 : 1);
int gY = Y_SIZE/3 + ((Y_SIZE % 3 == 0) ? 0 : 1);
NDRange global(gX*4, gY*3);
NDRange local(4, 3);
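In the 2D case each dimension is rounded up independently, so the padded range can exceed the problem size in both X and Y; the extra work items are commonly masked inside the kernel with an ID check. A sketch of the per-dimension rounding, with `round_up_dim` as an illustrative helper name:

```c
#include <assert.h>

/* Round one dimension of the problem size up to a multiple of the
   local size in that dimension, mirroring the slide's gX/gY math. */
static int round_up_dim(int size, int local) {
    return local * (size / local + ((size % local == 0) ? 0 : 1));
}
```

For example, a 10x10 problem with 4x3 work groups launches a 12x12 global range, so 44 of the 144 work items must be masked out.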

Page 24: OpenCL Tutorial - Basics

Basic built-in function values