1 - Introduction to OpenCL
Module Overview
• Overview
• OpenCL Architecture & Programming Model
• Basic components for getting started
• Information on tools
OVERVIEW
OpenCL
• OpenCL – Open Computing Language
• Open Standard
– Royalty free, cross-platform, vendor neutral
• Standard for accessing heterogeneous computational resources
– GPU, CPU, GPU+CPU or multiple GPUs
What is OpenCL: Processor Parallelism
• CPUs: multiple cores driving performance increases
– Multi-processor programming, e.g. OpenMP
• GPUs: increasingly general-purpose data-parallel computing, improving numerical precision
– Graphics APIs and shading languages
• Emerging intersection: OpenCL, heterogeneous computing
• OpenCL – Open Computing Language: open, royalty-free standard for portable, parallel programming of heterogeneous parallel computing CPUs, GPUs, and other processors
Design Goals of OpenCL
• Use all computational resources in system
– Program GPUs, CPUs, Cell, DSP and other processors as peers
– Support both data- and task-parallel compute models
• Low-level, high-performance but portable
– Primarily targeted at expert developers
– Foundation for parallel computing ecosystem
• C-based programming model
• Specify accuracy of floating-point computations
– IEEE 754 compliant rounding behavior
– Define maximum allowable error of math functions
• Defines a configuration profile for handheld and embedded devices
• Close integration with OpenGL and other 3D APIs
OpenCL
• Interface designed as a graphics-free API
• Software stack
– High-level language
• “Extended C” to express parallelism
– Runtime libraries
• Allow GPU memory management
How does it fit with vendor-specific architecture?
OPENCL ARCHITECTURE & PROGRAMMING MODEL
OpenCL Platform Model
• One Host + one or more Compute Devices
– Each Compute Device is composed of one or more Compute Units
• Each Compute Unit is further divided into one or more Processing Elements
OpenCL Platform Model
• Computations on a device occur within the processing elements
• An OpenCL application runs on a host and submits commands from the host to execute computations on the processing elements within a device
GPU as Co-processor
• GPU as compute device
– Has its own DRAM (video memory)
– Can run multiple threads in parallel
• Application runs on host
• The compute-intensive, data-parallel part is sent to the GPU
– Written as C functions called kernels
– The kernel is executed on the device simultaneously by multiple threads
Programming Model
(Diagram: host application on an Opteron CPU with main memory; GPU kernel on a FireStream GPU with GPU memory)
• Load/initialize input data (host, main memory)
• Copy input data from host to GPU memory
• Process input data and write to output (GPU kernel)
• Copy output from GPU to host memory
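The data flow on this slide can be sketched in plain C without an OpenCL runtime. This is only an emulation under stated assumptions: the "device" buffers are ordinary heap allocations, memcpy stands in for the host-to-GPU and GPU-to-host transfers, and the names offload_sum and sum_work_item are hypothetical, not part of any OpenCL API.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-thread function: what one GPU thread would compute. */
static void sum_work_item(size_t id, const float *a, const float *b, float *c)
{
    c[id] = a[id] + b[id];
}

/* Emulates the slide's flow: copy input in, run the kernel, copy output back.
 * Returns 0 on success, -1 if a "device" allocation fails. */
int offload_sum(const float *host_a, const float *host_b, float *host_c, size_t n)
{
    assert(host_a && host_b && host_c);
    if (n == 0)
        return 0;

    size_t bytes = n * sizeof(float);
    float *dev_a = malloc(bytes);   /* stand-ins for GPU memory */
    float *dev_b = malloc(bytes);
    float *dev_c = malloc(bytes);
    int status = -1;

    if (dev_a && dev_b && dev_c) {
        /* Copy input data from host to "GPU" memory. */
        memcpy(dev_a, host_a, bytes);
        memcpy(dev_b, host_b, bytes);

        /* Process input and write output; a real GPU runs these in parallel. */
        for (size_t id = 0; id < n; id++)
            sum_work_item(id, dev_a, dev_b, dev_c);

        /* Copy output from "GPU" back to host memory. */
        memcpy(host_c, dev_c, bytes);
        status = 0;
    }

    free(dev_a);
    free(dev_b);
    free(dev_c);
    return status;
}
```

The host only ever touches host_a, host_b and host_c; everything the kernel reads or writes lives in the separate device buffers, which is why both copies are needed.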
Implicit Data Parallelism
C
void sum(float A[], float B[], float C[], int n)
{
    for (int i = 0; i < n; i++)
    {
        C[i] = A[i] + B[i];
    }
}
C - Rewritten
/* Loop body factored out: computes one output element. */
float sum_kernel(int x, float A[], float B[])
{
    return A[x] + B[x];
}
void sum(float A[], float B[], float C[], int n)
{
    for (int i = 0; i < n; i++)
        C[i] = sum_kernel(i, A, B);
}
Implicit Data Parallelism
C – Rewritten 2
float sum_kernel(int x, float A[], float B[])
{
    return A[x] + B[x];
}
void sum(float A[], float B[], float C[], int n)
{
    /* Pseudocode: launch_thread() and sync_threads() are illustrative, not real C functions. */
    for (int i = 0; i < n; i++)
        launch_thread(C[i] = sum_kernel(i, A, B));
    sync_threads();
}
OpenCL
// Kernel definition
__kernel void vecAdd(__global float* A,
                     __global float* B,
                     __global float* C)
{
    // One work-group of n work-items is launched below, so the local ID
    // equals the global ID; with several work-groups use get_global_id(0).
    int i = get_local_id(0);
    C[i] = A[i] + B[i];
}
int main()
{
    // Kernel invocation
    size_t globalWorkSize[] = {n};
    size_t localWorkSize[] = {n};
    clEnqueueNDRangeKernel(.., 1, NULL,
        globalWorkSize, localWorkSize, 0, NULL, NULL);
}
• Kernel invocation from host: globalWorkSize sets the number of OpenCL threads (work-items)
Kernel
• Each thread has a unique thread ID
__kernel void vecAdd(__global float* A, __global float* B, __global float* C)
{
    int i = get_local_id(0);
    C[i] = A[i] + B[i];
}
• Unique thread ID: accessible within the kernel through an intrinsic function
• Function qualifier: the “__kernel” qualifier declares a function as a kernel
Work-Group
• Work-items are organized into work-groups
• Group can be a 1D, 2D or 3D array of work-items
– Specified during kernel invocation
– Helpful to invoke kernels on matrices, fields
– Each work-item within a group can be identified by a 1D, 2D or 3D id
• Built-in function get_local_id()
Work-Group
(Diagram: a 5 x 3 work-group, each work-item labelled WI(x, y))
WI(0, 0) WI(1, 0) WI(2, 0) WI(3, 0) WI(4, 0)
WI(0, 1) WI(1, 1) WI(2, 1) WI(3, 1) WI(4, 1)
WI(0, 2) WI(1, 2) WI(2, 2) WI(3, 2) WI(4, 2)
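How a local ID relates to a position in the whole problem can be sketched in plain C. In OpenCL the built-ins get_group_id(), get_local_size() and get_global_id() expose these values; with no global offset, the global ID in each dimension is group ID times local size plus local ID. The helper names below are hypothetical and only mirror that arithmetic.

```c
#include <assert.h>
#include <stddef.h>

/* Global ID in one dimension, assuming a zero global offset:
 * global_id = group_id * local_size + local_id. */
size_t global_id_from_local(size_t group_id, size_t local_size, size_t local_id)
{
    assert(local_size > 0 && local_id < local_size);
    return group_id * local_size + local_id;
}

/* Inverse mapping: which work-group a global ID falls into. */
size_t group_id_from_global(size_t global_id, size_t local_size)
{
    assert(local_size > 0);
    return global_id / local_size;
}

/* Inverse mapping: position of a global ID inside its work-group. */
size_t local_id_from_global(size_t global_id, size_t local_size)
{
    assert(local_size > 0);
    return global_id % local_size;
}
```

For a work-group that is 5 work-items wide, as in the diagram, WI(3, y) in the third group along x (group ID 2) has global x ID 2 * 5 + 3 = 13.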
Work-Group
• Example of 2D work-group
// Add two matrices A and B of dimension NxN and store the
// result into C
__kernel void matAdd(int N, __global float* A, __global float* B, __global float* C)
{
    int i = get_local_id(0);
    int j = get_local_id(1);
    C[j * N + i] = A[j * N + i] + B[j * N + i];
}
// host code
int main()
{
    // Declare, allocate and initialize device memory A, B & C
    // Kernel invocation: work_dim is 2 because the sizes below are two-dimensional
    size_t globalWorkSize[] = {N, N};
    size_t localWorkSize[] = {N, N};
    clEnqueueNDRangeKernel(.., 2, NULL, globalWorkSize, localWorkSize, 0, NULL, NULL);
}
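What the N x N launch does for matAdd can be imitated in plain C by calling the kernel body once for every (i, j) pair. This is a sequential stand-in, not how an OpenCL runtime is implemented; mat_add_work_item and mat_add_emulated are hypothetical names introduced for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Body of the matAdd kernel for one work-item with 2D ID (i, j). */
static void mat_add_work_item(size_t i, size_t j, size_t n,
                              const float *a, const float *b, float *c)
{
    c[j * n + i] = a[j * n + i] + b[j * n + i];
}

/* Sequential stand-in for an n x n launch: visits every (i, j) exactly once.
 * A real device may run these work-items in any order or in parallel. */
void mat_add_emulated(size_t n, const float *a, const float *b, float *c)
{
    assert(a && b && c);
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            mat_add_work_item(i, j, n, a, b, c);
}
```

Because each work-item writes a distinct element c[j * n + i], no two work-items touch the same output and the order of execution does not matter.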
An N-dimensional domain of work-items
• Global Dimensions: 1024 x 1024 (whole problem space)
• Local Dimensions: 128 x 128 (executed together)
• Choose the dimensions that are “best” for your algorithm
Example Problem Dimensions
• 1D: 1 million elements in an array:
– global_dim[3] = {1000000, 1, 1};
• 2D: 1920 x 1200 HD video frame, 2.3M pixels:
– global_dim[3] = {1920, 1200, 1};
• 3D: 256 x 256 x 256 volume, 16.7M voxels:
– global_dim[3] = {256, 256, 256};
• Choose the dimensions that are “best” for your algorithm
– Maps well
– Performs well
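The arithmetic behind these dimensions can be checked with a few plain C helpers. The function names are hypothetical, and the local size of 256 used in the note below is an assumed example value, not taken from the slides. One relevant fact: when an explicit local size is passed, OpenCL 1.x requires the global size to be evenly divisible by it in every dimension, so hosts commonly round the global size up and have the kernel ignore the padded IDs.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest multiple of local_size that is >= n. */
size_t round_up_global_size(size_t n, size_t local_size)
{
    assert(local_size > 0);
    return ((n + local_size - 1) / local_size) * local_size;
}

/* Number of work-groups along one dimension. */
size_t num_work_groups(size_t global_size, size_t local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}

/* Total work-items in a 3D domain such as global_dim[3]. */
size_t total_work_items(const size_t dim[3])
{
    return dim[0] * dim[1] * dim[2];
}
```

With 1024 x 1024 global and 128 x 128 local dimensions there are 8 work-groups per dimension, 64 in total. The 1D example of 1,000,000 elements is not a multiple of 256, so under that assumed local size the host would launch 1,000,192 work-items and the kernel would skip IDs at or beyond 1,000,000.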
BASIC COMPONENTS FOR GETTING STARTED
Basic OpenCL Program Structure
• Kernels
– C code with some restrictions and extensions
• Host program
– Query compute devices
– Create contexts
– Create memory objects associated to contexts
– Compile and create kernel program objects
– Issue commands to command-queue
– Synchronization of commands
– Clean up OpenCL resources
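(Slide labels: Language for the kernels; Platform Layer for querying devices and creating contexts; Runtime for the remaining host steps)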
Typical OpenCL Program
• Computation intensive, data parallel function written as kernel
• Host side code
– Context creation
– Allocate memory on device
– Host to device data transfer
– Compilation and creation of kernel program objects
– Bind memory objects to kernel arguments
– Call a kernel function to be executed on device
– Read back result data from device
INFORMATION ON TOOLS
OpenCL Implementation
• AMD’s implementation
– Ships with ATI Stream SDK v2.0
– Released on: 21st Dec, 2009
• Requires ATI GPU >= RV7XX
OpenCL Installation
• ATI Stream SDK
– Environment variables
• $(ATISTREAMSDKROOT) = ATI Stream SDK installation directory
• $(ATISTREAMSDKSAMPLESROOT) = ATI Stream SDK Samples installation directory
ATI OpenCL SDK
• Header files
– cl.h, cl_gl.h, cl_platform.h under $(ATISTREAMSDKROOT)\include\CL
• Library files
– OpenCL.lib under $(ATISTREAMSDKROOT)\lib\x86
• Dynamic Link Library
– OpenCL.dll under $(ATISTREAMSDKROOT)\bin\x86
– Make sure Path contains this directory
Recap and Q&A
• Overview & Programming model
• Basic components for getting started
• Information on tools