1 - Introduction to OpenCL


Module Overview

• Overview
• OpenCL Architecture & Programming Model
• Basic components for getting started
• Information on tools

OVERVIEW

OpenCL

• OpenCL – Open Computing Language
• Open Standard
  – Royalty free, cross-platform, vendor neutral
• Standard for accessing heterogeneous computational resources
  – GPU, CPU, GPU+CPU or multiple GPUs

What is OpenCL : Processor Parallelism

• CPUs: multiple cores driving performance increases
• GPUs: increasingly general-purpose data-parallel computing; improving numerical precision
• Graphics APIs and Shading Languages
• Multi-processor programming – e.g. OpenMP
• Emerging intersection: OpenCL, heterogeneous computing

OpenCL – Open Computing Language
Open, royalty-free standard for portable, parallel programming of heterogeneous parallel computing CPUs, GPUs, and other processors


Design Goals of OpenCL

• Use all computational resources in system
  – Program GPUs, CPUs, Cell, DSP and other processors as peers
  – Support both data- and task-parallel compute models
• Low-level, high-performance but portable
  – Primarily targeted at expert developers
  – Foundation for parallel computing ecosystem
• C-based programming model
• Specify accuracy of floating-point computations
  – IEEE 754 compliant rounding behavior
  – Define maximum allowable error of math functions
• Defines a configuration profile for handheld and embedded devices
• Close integration with OpenGL and other 3D APIs

OpenCL

• Interface designed as a graphics-free API
• Software stack
  – High-level language
    • “Extended C” to express parallelism
  – Runtime libraries
    • Allow GPU memory management

How does it fit with vendor-specific architecture?

OPENCL ARCHITECTURE & PROGRAMMING MODEL

OpenCL Platform Model

• One Host + one or more Compute Devices
  – Each Compute Device is composed of one or more Compute Units
    • Each Compute Unit is further divided into one or more Processing Elements

OpenCL Platform Model

• Computations on a device occur within the processing elements

• An OpenCL application runs on a host and submits commands from the host to execute computations on the processing elements within a device

GPU as Co-processor

• GPU as compute device
  – Has its own DRAM (video memory)
  – Can run multiple threads in parallel
• Application runs on host
• The compute-intensive, data-parallel part is sent to the GPU
  – Written as C functions called kernels
  – The kernel is executed on the device simultaneously by multiple threads

Programming Model

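(Diagram: host application on an Opteron CPU with main memory; GPU kernel on a FireStream GPU with GPU memory)
• Load/initialize input data (host application, main memory)
• Copy input data from host to GPU memory
• Process input data and write to output (GPU kernel, GPU memory)
• Copy output from GPU to host memory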

Implicit Data Parallelism

C
// n is the number of elements in each array
void sum(float A[], float B[], float C[], int n)
{
    for (int i = 0; i < n; i++)
    {
        C[i] = A[i] + B[i];
    }
}

C - Rewritten
// Work for a single element x
float sum_kernel(int x, float A[], float B[])
{
    return A[x] + B[x];
}

// n is the number of elements in each array
void sum(float A[], float B[], float C[], int n)
{
    for (int i = 0; i < n; i++)
        C[i] = sum_kernel(i, A, B);
}

Implicit Data Parallelism

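C – Rewritten 2
// Work for a single element x
float sum_kernel(int x, float A[], float B[])
{
    return A[x] + B[x];
}

// launch_thread() and sync_threads() are pseudocode, not real C functions:
// each loop iteration is handed to its own thread, then all threads are joined
void sum(float A[], float B[], float C[], int n)
{
    for (int i = 0; i < n; i++)
        launch_thread(C[i] = sum_kernel(i, A, B));
    sync_threads();
}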

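OpenCL
// Kernel definition
__kernel void vecAdd(__global float* A, __global float* B, __global float* C)
{
    // The local ID equals the global ID here only because the host
    // launches a single work-group that covers all n work-items
    int i = get_local_id(0);
    C[i] = A[i] + B[i];
}

int main()
{
    // Kernel invocation: n work-items in one work-group of size n
    size_t globalWorkSize[] = {n};
    size_t localWorkSize[] = {n};
    clEnqueueNDRangeKernel(.., 1, NULL, globalWorkSize, localWorkSize, 0, NULL, NULL);
}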

Kernel invocation from host
• Number of OpenCL threads

Kernel

• Each thread has a unique thread ID

__kernel void vecAdd(__global float* A, __global float* B, __global float* C)
{
    int i = get_local_id(0);
    C[i] = A[i] + B[i];
}

Unique Thread ID
• Accessible within the kernel through intrinsic function

Function Qualifier
• “__kernel” qualifier declares a function as a kernel

Work-Group

• Work-items are organized into work-groups

• A group can be a 1D, 2D or 3D array of work-items
  – Specified during kernel invocation
  – Helpful to invoke kernels on matrices, fields
  – Each work-item within a group can be identified by a 1D, 2D or 3D id
    • Built-in function get_local_id()
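To make the relationship between work-items and work-groups concrete, here is a minimal plain-C sketch (CPU only, no OpenCL runtime) of how a 1D global index splits into a group index and a local index. The type and function names are hypothetical. In a real kernel the corresponding values come from the built-ins get_global_id(), get_group_id() and get_local_id(), and the relation shown assumes no global offset.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type for this sketch; not part of the OpenCL API. */
typedef struct {
    size_t group_id; /* which work-group the work-item belongs to */
    size_t local_id; /* position of the work-item inside that group */
} WorkItemId;

/* Split a 1D global ID into (group ID, local ID) for a given group size. */
WorkItemId split_global_id(size_t global_id, size_t local_size)
{
    assert(local_size > 0);
    WorkItemId id;
    id.group_id = global_id / local_size;
    id.local_id = global_id % local_size;
    return id;
}

/* Rebuild the global ID: group_id * local_size + local_id. */
size_t join_global_id(WorkItemId id, size_t local_size)
{
    return id.group_id * local_size + id.local_id;
}
```

With a group size of 128, for instance, global ID 300 lands in group 2 at local position 44. This is why get_local_id() on its own identifies an element uniquely only when the whole problem fits in a single work-group.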

Work-Group

WI(0, 0)  WI(1, 0)  WI(2, 0)  WI(3, 0)  WI(4, 0)
WI(0, 1)  WI(1, 1)  WI(2, 1)  WI(3, 1)  WI(4, 1)
WI(0, 2)  WI(1, 2)  WI(2, 2)  WI(3, 2)  WI(4, 2)

Work-Group

• Example of 2D work-group

// Add two matrices A and B of dimension NxN and store the
// result into C
__kernel void matAdd(int N, __global float* A, __global float* B, __global float* C)
{
    int i = get_local_id(0);
    int j = get_local_id(1);
    C[j * N + i] = A[j * N + i] + B[j * N + i];
}

// host code
int main()
{
    // Declare, allocate and initialize device memory A, B & C

    // Kernel invocation
    size_t globalWorkSize[] = {N, N};
    size_t localWorkSize[] = {N, N};
    // work_dim is 2 because the NDRange is two-dimensional
    clEnqueueNDRangeKernel(.., 2, NULL, globalWorkSize, localWorkSize, 0, NULL, NULL);
}
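The 2D indexing in matAdd can be checked on the CPU before involving a device. The sketch below is a sequential plain-C emulation, not OpenCL code: the function names are hypothetical, and the two loop counters stand in for the values the kernel would read from get_local_id(0) and get_local_id(1).

```c
#include <assert.h>
#include <stddef.h>

/* Body of one work-item: the same arithmetic as the matAdd kernel,
 * with the IDs passed in explicitly instead of read from built-ins. */
void mat_add_work_item(size_t i, size_t j, size_t N,
                       const float *A, const float *B, float *C)
{
    assert(i < N && j < N);
    C[j * N + i] = A[j * N + i] + B[j * N + i];
}

/* Sequential stand-in for the N x N launch: visit every (i, j) once.
 * On a device these iterations would run as parallel work-items. */
void mat_add_emulated(size_t N, const float *A, const float *B, float *C)
{
    for (size_t j = 0; j < N; j++)         /* dimension 1: rows    */
        for (size_t i = 0; i < N; i++)     /* dimension 0: columns */
            mat_add_work_item(i, j, N, A, B, C);
}
```

Because the matrices are stored row-major, element (i, j) lives at offset j * N + i, so dimension 0 walks along a row and dimension 1 selects the row.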

An N-dimension domain of work-items

• Global Dimensions: 1024 x 1024 (whole problem space)
• Local Dimensions: 128 x 128 (executed together)
• Choose the dimensions that are “best” for your algorithm
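As a rough illustration of how the global and local dimensions relate, the plain-C sketch below counts the work-groups implied by a pair of sizes. The function names are hypothetical, and the divisibility check reflects the OpenCL 1.x rule that an explicit local size must evenly divide the global size in every dimension.

```c
#include <assert.h>
#include <stddef.h>

/* Count the work-groups in an NDRange, or return 0 if the local size
 * does not evenly divide the global size in some dimension. */
size_t count_work_groups(const size_t *global, const size_t *local, unsigned dims)
{
    assert(dims >= 1 && dims <= 3); /* this sketch covers 1D to 3D only */
    size_t groups = 1;
    for (unsigned d = 0; d < dims; d++) {
        if (local[d] == 0 || global[d] % local[d] != 0)
            return 0;
        groups *= global[d] / local[d];
    }
    return groups;
}

/* Number of work-items that make up one work-group. */
size_t work_group_size(const size_t *local, unsigned dims)
{
    assert(dims >= 1 && dims <= 3);
    size_t size = 1;
    for (unsigned d = 0; d < dims; d++)
        size *= local[d];
    return size;
}
```

For the numbers above this gives 8 x 8 = 64 work-groups of 16,384 work-items each. Many devices report a smaller maximum work-group size than that, so 128 x 128 is best read as illustrative; the actual limit can be queried with clGetDeviceInfo and CL_DEVICE_MAX_WORK_GROUP_SIZE.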

Example Problem Dimensions

• 1D: 1 million elements in an array:
  – global_dim[3] = {1000000, 1, 1};
• 2D: 1920 x 1200 HD video frame, 2.3M pixels:
  – global_dim[3] = {1920, 1200, 1};
• 3D: 256 x 256 x 256 volume, 16.7M voxels:
  – global_dim[3] = {256, 256, 256};
• Choose the dimensions that are “best” for your algorithm
  – Maps well
  – Performs well
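One practical wrinkle when mapping problem sizes like these: if a work-group size is given explicitly, OpenCL 1.x requires each global dimension to be a multiple of the corresponding local dimension. A common workaround, sketched here in plain C, is to pad the global size and have the kernel skip the extra indices. The function name and the local sizes 256 and 16 are assumptions for illustration, not values from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Round global_size up to the next multiple of local_size. */
size_t round_up_global_size(size_t global_size, size_t local_size)
{
    assert(local_size > 0);
    size_t remainder = global_size % local_size;
    return remainder == 0 ? global_size
                          : global_size + (local_size - remainder);
}
```

With a local size of 256, the 1D example pads 1,000,000 up to 1,000,192, so the kernel would need a guard such as `if (i < n)` before touching memory. The 1920 x 1200 frame needs no padding with 16 x 16 groups.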

BASIC COMPONENTS FOR GETTING STARTED

Basic OpenCL Program Structure

• Kernels
  – C code with some restrictions and extensions
• Host program
  – Query compute devices
  – Create contexts
  – Create memory objects associated with contexts
  – Compile and create kernel program objects
  – Issue commands to command-queue
  – Synchronization of commands
  – Clean up OpenCL resources

(Diagram labels: Language, Platform Layer, Runtime)

Typical OpenCL Program

• Computation intensive, data parallel function written as kernel

• Host-side code
  – Context creation
  – Allocate memory on device
  – Host-to-device data transfer
  – Compilation and creation of kernel program objects
  – Bind memory objects to kernel arguments
  – Call a kernel function to be executed on device
  – Read back result data from device
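The data-movement part of this sequence can be sketched without an OpenCL runtime. The plain-C emulation below uses malloc'd buffers as a stand-in for device memory and a loop as a stand-in for the kernel launch; the function names are hypothetical. The comments name the OpenCL calls that would normally perform each step. Context creation and program compilation are left out because they have no CPU-only analogue here.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the kernel: one call per work-item, gid plays the role of the ID. */
void vec_add_work_item(size_t gid, const float *a, const float *b, float *c)
{
    c[gid] = a[gid] + b[gid];
}

/* Emulates the host-side flow. Returns 0 on success, -1 if an allocation fails. */
int offload_vec_add(const float *host_a, const float *host_b,
                    float *host_c, size_t n)
{
    if (n == 0)
        return 0; /* nothing to do */
    size_t bytes = n * sizeof(float);

    /* Allocate memory on the "device" (real code: clCreateBuffer). */
    float *dev_a = malloc(bytes);
    float *dev_b = malloc(bytes);
    float *dev_c = malloc(bytes);
    if (!dev_a || !dev_b || !dev_c) {
        free(dev_a); free(dev_b); free(dev_c);
        return -1;
    }

    /* Host-to-device transfer (real code: clEnqueueWriteBuffer). */
    memcpy(dev_a, host_a, bytes);
    memcpy(dev_b, host_b, bytes);

    /* Bind arguments and launch n work-items (real code: clSetKernelArg,
     * then clEnqueueNDRangeKernel); here they run one after another. */
    for (size_t gid = 0; gid < n; gid++)
        vec_add_work_item(gid, dev_a, dev_b, dev_c);

    /* Read back the result (real code: clEnqueueReadBuffer). */
    memcpy(host_c, dev_c, bytes);

    /* Clean up (real code: clReleaseMemObject). */
    free(dev_a); free(dev_b); free(dev_c);
    return 0;
}
```

Keeping the "device" buffers separate from the host arrays mirrors the fact that a GPU has its own memory: the kernel never sees host_a or host_c directly, only the copies.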

INFORMATION ON TOOLS

OpenCL Implementation

• AMD’s implementation
  – Ships with ATI Stream SDK v2.0
  – Released on: 21st Dec, 2009
• Requires ATI GPU >= RV7XX

OpenCL Installation

• ATI Stream SDK
  – Environment variables
    • $(ATISTREAMSDKROOT) = ATI Stream SDK installation directory
    • $(ATISTREAMSDKSAMPLESROOT) = ATI Stream SDK Samples installation directory

ATI OpenCL SDK

• Header files
  – cl.h, cl_gl.h, cl_platform.h under $(ATISTREAMSDKROOT)\include\CL
• Library files
  – OpenCL.lib under $(ATISTREAMSDKROOT)\lib\x86
• Dynamic Link Library
  – OpenCL.dll under $(ATISTREAMSDKROOT)\bin\x86
  – Make sure Path contains this directory

Recap and Q&A

• Overview & Programming model
• Basic components for getting started
• Information on tools
