Open Computing Language
Introduction
• OpenCL (Open Computing Language)
• An open, royalty-free standard
• For general-purpose parallel programming across CPUs, GPUs and other processors
OpenCL lets programmers write a single portable program that uses ALL the resources in a heterogeneous platform
OpenCL consists of…
• An API for coordinating parallel computation across heterogeneous processors
• A cross-platform programming language
• Supports both data- and task-based parallel programming models
• Utilizes a subset of ISO C99 with extensions for parallelism
• Defines a configuration profile for handheld and embedded devices
The BIG Idea behind OpenCL
• OpenCL execution model…
• Execute a kernel at each point in a problem domain
  - E.g., process a 1024 × 1024 image with one kernel invocation per pixel,
    i.e. 1024 × 1024 = 1,048,576 kernel executions
To use OpenCL, you must
• Define the platform
• Execute code on the platform
• Move data around in memory
• Write (and build) programs
OpenCL Platform Model
• One Host + one or more Compute Devices
  - Each Compute Device is composed of one or more Compute Units
  - Each Compute Unit is further divided into one or more Processing Elements
OpenCL Execution Model
An OpenCL application runs on a host, which submits work to the compute devices
• Work-item: the basic unit of work on an OpenCL device
• Kernel: the code for a work-item. Basically a C function
• Program: a collection of kernels and other functions (analogous to a dynamic library)
• Context: the environment within which work-items execute… includes devices, their memories and command queues
Applications queue kernel execution instances
• Queued in-order… one queue per device
• Executed in-order or out-of-order
Example of the NDRange organization…
• SIMT: Single Instruction, Multiple Thread. The same code is executed in parallel by different threads, and each thread executes the code on different data.
• Work-item: equivalent to a CUDA thread.
• Work-group: allows communication and cooperation between work-items, and reflects how work-items are organized. Equivalent to a CUDA thread block.
• ND-Range: the next organization level, specifying how work-groups are organized.
OpenCL Memory Model
OpenCL programs
OpenCL programs are divided in two parts:
• One that executes on the device (in our case, the GPU).
  - Here you write the kernels.
  - The device program is the one you will usually be most concerned about.
• One that executes on the host (in our case, the CPU).
  - Offers an API so that you can manage your device's execution.
  - Can be programmed in C or C++, and controls the OpenCL environment (context, command-queue, …).
Sample: a kernel that adds two vectors

This kernel should take four parameters: the two vectors to be added, another vector to store the result, and the vectors' size. A program that solves this problem on the CPU looks like this:

void vector_add_cpu(const float* src_a, const float* src_b, float* res, const int num)
{
    for (int i = 0; i < num; i++)
        res[i] = src_a[i] + src_b[i];
}

On the GPU, however, the logic is slightly different: instead of one thread iterating through all elements, each thread computes one element, whose index equals the thread's ID.

__kernel void vectorAdd(__global const float* src_a, __global const float* src_b, __global float* res, const int num)
{
    /* get_global_id(0) returns the ID of the work-item in execution. As many
       work-items are launched at the same time, all executing the same kernel,
       each one receives a different ID and consequently performs a different
       computation. */
    const int idx = get_global_id(0);

    /* Each work-item asks itself: "is my ID inside the vector's range?"
       If the answer is YES, the work-item performs the corresponding
       computation. */
    if (idx < num)
        res[idx] = src_a[idx] + src_b[idx];
}
Sample: a kernel that adds two vectors
// Some interesting data for the vectors
int InitialData1[20] = {37,50,54,50,56,0,43,43,74,71,32,36,16,43,56,100,50,25,15,17};
int InitialData2[20] = {35,51,54,58,55,32,36,69,27,39,35,40,16,44,55,14,58,75,18,15};

// Number of elements in the vectors to be added
#define SIZE 2048

// Main function
// *********************************************************************
int main(int argc, char **argv)
{
    // Two integer source vectors in Host memory
    int HostVector1[SIZE], HostVector2[SIZE];

    // Initialize with some interesting repeating data
    for (int c = 0; c < SIZE; c++)
    {
        HostVector1[c] = InitialData1[c % 20];
        HostVector2[c] = InitialData2[c % 20];
    }
Sample..
// Create a context to run OpenCL on our CUDA-enabled NVIDIA GPU
cl_context GPUContext = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU, NULL, NULL, NULL);

// Get the list of GPU devices associated with this context
size_t ParmDataBytes;
clGetContextInfo(GPUContext, CL_CONTEXT_DEVICES, 0, NULL, &ParmDataBytes);
cl_device_id* GPUDevices = (cl_device_id*)malloc(ParmDataBytes);
clGetContextInfo(GPUContext, CL_CONTEXT_DEVICES, ParmDataBytes, GPUDevices, NULL);

// Create a command-queue on the first GPU device
cl_command_queue GPUCommandQueue = clCreateCommandQueue(GPUContext, GPUDevices[0], 0, NULL);

// Allocate GPU memory for source vectors AND initialize from CPU memory
cl_mem GPUVector1 = clCreateBuffer(GPUContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                   sizeof(int) * SIZE, HostVector1, NULL);
cl_mem GPUVector2 = clCreateBuffer(GPUContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                   sizeof(int) * SIZE, HostVector2, NULL);
Sample..
// Allocate output memory on GPU
cl_mem GPUOutputVector = clCreateBuffer(GPUContext, CL_MEM_WRITE_ONLY,
                                        sizeof(int) * SIZE, NULL, NULL);

// Create OpenCL program with source code
cl_program OpenCLProgram = clCreateProgramWithSource(GPUContext, 7,
                                                     OpenCLSource, NULL, NULL);

// Build the program (OpenCL JIT compilation)
clBuildProgram(OpenCLProgram, 0, NULL, NULL, NULL, NULL);

// Create a handle to the compiled OpenCL function (Kernel)
cl_kernel OpenCLVectorAdd = clCreateKernel(OpenCLProgram, "VectorAdd", NULL);

// In the next step we associate the GPU memory with the Kernel arguments
clSetKernelArg(OpenCLVectorAdd, 0, sizeof(cl_mem), (void*)&GPUOutputVector);
clSetKernelArg(OpenCLVectorAdd, 1, sizeof(cl_mem), (void*)&GPUVector1);
clSetKernelArg(OpenCLVectorAdd, 2, sizeof(cl_mem), (void*)&GPUVector2);
Sample..
// Launch the Kernel on the GPU
size_t WorkSize[1] = {SIZE}; // one-dimensional range
clEnqueueNDRangeKernel(GPUCommandQueue, OpenCLVectorAdd, 1, NULL,
                       WorkSize, NULL, 0, NULL, NULL);

// Copy the output in GPU memory back to CPU memory
int HostOutputVector[SIZE];
clEnqueueReadBuffer(GPUCommandQueue, GPUOutputVector, CL_TRUE, 0,
                    SIZE * sizeof(int), HostOutputVector, 0, NULL, NULL);

// Cleanup (release memory objects before the context that owns them)
free(GPUDevices);
clReleaseKernel(OpenCLVectorAdd);
clReleaseProgram(OpenCLProgram);
clReleaseCommandQueue(GPUCommandQueue);
clReleaseMemObject(GPUVector1);
clReleaseMemObject(GPUVector2);
clReleaseMemObject(GPUOutputVector);
clReleaseContext(GPUContext);
Sample…
// Print out the results
for (int Rows = 0; Rows < (SIZE / 20); Rows++, printf("\n"))
{
    for (int c = 0; c < 20; c++)
    {
        printf("%c", (char)HostOutputVector[Rows * 20 + c]);
    }
}
Thanks!