Open Computing Language
Introduction
• OpenCL (Open Computing Language)
• An open, royalty-free standard
• For general-purpose parallel programming across CPUs, GPUs and other processors
OpenCL lets programmers write a single portable program that uses ALL the resources in a heterogeneous platform
OpenCL consists of…
• An API for coordinating parallel computation across heterogeneous processors
• A cross-platform programming language
• Supports both data- and task-based parallel programming models
• Utilizes a subset of ISO C99 with extensions for parallelism
• Defines a configuration profile for handheld and embedded devices
The BIG Idea behind OpenCL
• OpenCL execution model…
• Execute a kernel at each point in a problem domain
  - E.g., process a 1024 × 1024 image with one kernel invocation per pixel,
    i.e. 1024 × 1024 = 1,048,576 kernel executions
To use OpenCL, you must
• Define the platform
• Execute code on the platform
• Move data around in memory
• Write (and build) programs
OpenCL Platform Model
• One Host + one or more Compute Devices
  - Each Compute Device is composed of one or more Compute Units
  - Each Compute Unit is further divided into one or more Processing Elements
OpenCL Execution Model
An OpenCL application runs on a host, which submits work to the compute devices
• Work-item: the basic unit of work on an OpenCL device
• Kernel: the code for a work-item. Basically a C function
• Program: a collection of kernels and other functions (analogous to a dynamic library)
• Context: the environment within which work-items execute… includes devices, their memories and command queues
Applications queue kernel execution instances
• Queued in-order… one queue per device
• Executed in-order or out-of-order
Example of the NDRange organization…
• SIMT: Single Instruction, Multiple Thread. The same code is executed in parallel by different threads, and each thread executes the code on different data.
• Work-item: equivalent to a CUDA thread.
• Work-group: allows communication and cooperation between work-items, and reflects how work-items are organized. Equivalent to a CUDA thread block.
• ND-Range: the next organization level, specifying how work-groups are organized.
OpenCL Memory Model
OpenCL programs
OpenCL programs are divided in two parts:
• One that executes on the device (in our case, the GPU).
  - Here you write the kernels.
  - The device program is the one you will usually be most concerned about.
• One that executes on the host (in our case, the CPU).
  - Offers an API so that you can manage your device's execution.
  - Can be programmed in C or C++, and controls the OpenCL environment (context, command-queue, …).
Sample: a kernel that adds two vectors

This kernel should take four parameters: the two vectors to be added, another vector to store the result, and the vectors' size. A program that solves this problem on the CPU looks like this:

void vector_add_cpu(const float* src_a, const float* src_b, float* res, const int num)
{
    for (int i = 0; i < num; i++)
        res[i] = src_a[i] + src_b[i];
}

On the GPU, however, the logic is slightly different: instead of one thread iterating through all elements, each thread computes one element, whose index equals the thread's ID.

__kernel void vectorAdd(__global const float* src_a, __global const float* src_b, __global float* res, const int num)
{
    /* get_global_id(0) returns the ID of the work-item in execution. As many
       work-items are launched at the same time, all executing the same kernel,
       each one receives a different ID and consequently performs a different
       computation. */
    const int idx = get_global_id(0);

    /* Each work-item asks itself: "is my ID inside the vector's range?"
       If the answer is YES, the work-item performs the corresponding
       computation. */
    if (idx < num)
        res[idx] = src_a[idx] + src_b[idx];
}
Sample: a kernel that adds two vectors
// Some interesting data for the vectors
int InitialData1[20] = {37,50,54,50,56,0,43,43,74,71,32,36,16,43,56,100,50,25,15,17};
int InitialData2[20] = {35,51,54,58,55,32,36,69,27,39,35,40,16,44,55,14,58,75,18,15};

// Number of elements in the vectors to be added
#define SIZE 2048

// Main function
// *********************************************************************
int main(int argc, char **argv)
{
    // Two integer source vectors in Host memory
    int HostVector1[SIZE], HostVector2[SIZE];

    // Initialize with some interesting repeating data
    for (int c = 0; c < SIZE; c++)
    {
        HostVector1[c] = InitialData1[c % 20];
        HostVector2[c] = InitialData2[c % 20];
    }
Sample..
// Create a context to run OpenCL on our CUDA-enabled NVIDIA GPU
cl_context GPUContext = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU, NULL, NULL, NULL);

// Get the list of GPU devices associated with this context
size_t ParmDataBytes;
clGetContextInfo(GPUContext, CL_CONTEXT_DEVICES, 0, NULL, &ParmDataBytes);
cl_device_id* GPUDevices = (cl_device_id*)malloc(ParmDataBytes);
clGetContextInfo(GPUContext, CL_CONTEXT_DEVICES, ParmDataBytes, GPUDevices, NULL);

// Create a command-queue on the first GPU device
cl_command_queue GPUCommandQueue = clCreateCommandQueue(GPUContext, GPUDevices[0], 0, NULL);

// Allocate GPU memory for source vectors AND initialize from CPU memory
cl_mem GPUVector1 = clCreateBuffer(GPUContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                   sizeof(int) * SIZE, HostVector1, NULL);
cl_mem GPUVector2 = clCreateBuffer(GPUContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                   sizeof(int) * SIZE, HostVector2, NULL);
Sample..
// Allocate output memory on GPU
cl_mem GPUOutputVector = clCreateBuffer(GPUContext, CL_MEM_WRITE_ONLY,
                                        sizeof(int) * SIZE, NULL, NULL);

// Create OpenCL program with source code
cl_program OpenCLProgram = clCreateProgramWithSource(GPUContext, 7,
                                                     OpenCLSource, NULL, NULL);

// Build the program (OpenCL JIT compilation)
clBuildProgram(OpenCLProgram, 0, NULL, NULL, NULL, NULL);

// Create a handle to the compiled OpenCL function (Kernel)
cl_kernel OpenCLVectorAdd = clCreateKernel(OpenCLProgram, "VectorAdd", NULL);

// In the next step we associate the GPU memory with the Kernel arguments
clSetKernelArg(OpenCLVectorAdd, 0, sizeof(cl_mem), (void*)&GPUOutputVector);
clSetKernelArg(OpenCLVectorAdd, 1, sizeof(cl_mem), (void*)&GPUVector1);
clSetKernelArg(OpenCLVectorAdd, 2, sizeof(cl_mem), (void*)&GPUVector2);
Sample..
// Launch the Kernel on the GPU
size_t WorkSize[1] = {SIZE}; // one-dimensional range
clEnqueueNDRangeKernel(GPUCommandQueue, OpenCLVectorAdd, 1, NULL,
                       WorkSize, NULL, 0, NULL, NULL);

// Copy the output in GPU memory back to CPU memory
int HostOutputVector[SIZE];
clEnqueueReadBuffer(GPUCommandQueue, GPUOutputVector, CL_TRUE, 0,
                    SIZE * sizeof(int), HostOutputVector, 0, NULL, NULL);

// Cleanup (release memory objects before the context that owns them)
free(GPUDevices);
clReleaseKernel(OpenCLVectorAdd);
clReleaseProgram(OpenCLProgram);
clReleaseCommandQueue(GPUCommandQueue);
clReleaseMemObject(GPUVector1);
clReleaseMemObject(GPUVector2);
clReleaseMemObject(GPUOutputVector);
clReleaseContext(GPUContext);
Sample…
// Print out the results
for (int Rows = 0; Rows < (SIZE / 20); Rows++, printf("\n"))
{
    for (int c = 0; c < 20; c++)
    {
        printf("%c", (char)HostOutputVector[Rows * 20 + c]);
    }
}
Thanks!