OpenCL (PDF presentation) - Beyond Programmable Shading · s08.idav.ucdavis.edu/munshi-opencl.pdf


OpenCL: Parallel Computing on the GPU and CPU

Aaftab Munshi


Beyond Programmable Shading: Fundamentals

Opportunity: Processor

• Today’s processors are increasingly parallel

• CPUs
■ Multiple cores are driving performance increases

• GPUs
■ Transforming into general purpose data-parallel computational coprocessors
■ Improving numerical precision (single and double)


Challenge: Processor Parallelism

• Writing parallel programs is different for the CPU and the GPU
■ Differing domain-specific techniques
■ Vendor-specific technologies

• A graphics API is not an ideal abstraction for general purpose compute


Introducing OpenCL

• OpenCL – Open Computing Language

• Approachable language for accessing heterogeneous computational resources

• Supports parallel execution on single or multiple processors
■ GPU, CPU, GPU + CPU or multiple GPUs

• Desktop and Handheld Profiles

• Designed to work with graphics APIs such as OpenGL


OpenCL = Open Standard

• Specification under review
■ Royalty free, cross-platform, vendor neutral
■ Khronos OpenCL working group (www.khronos.org)

• Based on a proposal by Apple
■ Developed in collaboration with industry leaders
■ Performance-enhancing technology in Mac OS X Snow Leopard


OpenCL — A Sneak Preview


Design Goals of OpenCL

• Use all computational resources in system
■ GPUs and CPUs as peers
■ Data- and task-parallel compute model

• Efficient parallel programming model
■ Based on C
■ Abstract the specifics of underlying hardware

• Specify accuracy of floating-point computations
■ IEEE 754 compliant rounding behavior
■ Define maximum allowable error of math functions

• Drive future hardware requirements


OpenCL Software Stack

• Platform Layer
■ Query and select compute devices in the system
■ Initialize compute device(s)
■ Create compute contexts and work-queues

• Runtime
■ Resource management
■ Execute compute kernels

• Compiler
■ A subset of ISO C99 with appropriate language additions
■ Compile and build compute program executables, online or offline


OpenCL Execution Model

• Compute Kernel
■ Basic unit of executable code, similar to a C function
■ Data-parallel or task-parallel

• Compute Program
■ Collection of compute kernels and internal functions
■ Analogous to a dynamic library

• Applications queue compute kernel execution instances
■ Queued in-order
■ Executed in-order or out-of-order
■ Events are used to implement appropriate


OpenCL Data-Parallel Execution

• Define N-Dimensional computation domain
■ Each independent element of execution in the N-D domain is called a work-item
■ The N-D domain defines the total number of work-items that execute in parallel: the global work size

• Work-items can be grouped together into a work-group
■ Work-items in a group can communicate with each other
■ Can synchronize execution among work-items in a group to coordinate memory access

• Execute multiple work-groups in parallel

• Mapping of global work size to work-groups
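As a rough illustration of how a global work size maps onto work-groups, the C sketch below splits a 1-D global index into a work-group index and an index within that group. It assumes the local size divides the global size evenly and that indices start at zero. The type and function names are hypothetical and are not part of the OpenCL API.

```c
#include <stddef.h>

/* Hypothetical helper type: where one work-item sits in a 1-D domain. */
typedef struct {
    size_t global_id; /* index within the whole computation domain */
    size_t group_id;  /* which work-group the work-item belongs to */
    size_t local_id;  /* index of the work-item inside its work-group */
} work_item_pos;

/* Number of work-groups needed to cover the domain.
   Assumes local_size divides global_size evenly. */
size_t num_work_groups(size_t global_size, size_t local_size) {
    return global_size / local_size;
}

/* Split a global index into (group, local) coordinates. */
work_item_pos locate_work_item(size_t global_id, size_t local_size) {
    work_item_pos p;
    p.global_id = global_id;
    p.group_id = global_id / local_size;
    p.local_id = global_id % local_size;
    return p;
}
```

With a global work size of 1024 and a work-group size of 64, this gives 16 work-groups, and the work-item with global index 130 is item 2 of group 2.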


OpenCL Task-Parallel Execution

• Data-parallel execution model must be implemented by all OpenCL compute devices

• Some compute devices such as CPUs can also execute task-parallel compute kernels
■ Executes as a single work-item
■ A compute kernel written in OpenCL
■ A native C / C++ function


OpenCL Memory Model

• Implements a relaxed consistency, shared memory model

• Multiple distinct address spaces
■ Address spaces can be collapsed
■ Address Qualifiers: __private, __local, __constant and __global
■ Example: __global float4 *p;

[Figure: a compute device holds compute units 1 to N. Each compute unit runs work-items 1 to M, each with its own private memory, and has one local memory. The diagram also shows a global / constant memory data cache on the compute device, and global memory in compute device memory.]


Language for writing compute

• Derived from ISO C99

• A few restrictions
■ Recursion, function pointers, functions in C99 standard headers ...

• Preprocessing directives defined by C99 are supported

• Built-in Data Types
■ Scalar and vector data types
■ Structs, Pointers
■ Data-type conversion functions: convert_type<_sat><_roundingmode>
■ Image types
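To make the `_sat` and rounding-mode suffixes of the conversion functions concrete, here is a plain C emulation of what a saturating, round-to-nearest-even conversion from `float` to an 8-bit unsigned integer would be expected to compute. It is a sketch in host C, not OpenCL code. The function name is hypothetical, and mapping NaN to 0 is an assumption.

```c
/* Hypothetical emulation of a saturating, round-to-nearest-even
   float -> unsigned 8-bit conversion. */
unsigned char to_uchar_sat_rte(float x) {
    if (x != x) return 0;        /* NaN: assumed to map to 0 */
    if (x <= 0.0f) return 0;     /* saturate at the low end */
    if (x >= 255.0f) return 255; /* saturate at the high end */

    int whole = (int)x;             /* truncate toward zero */
    float frac = x - (float)whole;  /* exact for values in this range */

    /* Round to nearest; exact ties go to the even neighbour. */
    if (frac > 0.5f || (frac == 0.5f && (whole & 1))) {
        whole += 1;
    }
    return (unsigned char)whole;
}
```

Out-of-range inputs clamp instead of wrapping: 300.0f becomes 255 and -5.0f becomes 0. Ties round to the even neighbour, so 2.5f becomes 2 and 3.5f becomes 4.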


Language for writing compute

• Built-in Functions: Required
■ work-item functions
■ math.h
■ read and write image
■ relational
■ geometric functions
■ synchronization functions

• Built-in Functions: Optional
■ double precision
■ atomics to global and local memory
■ selection of rounding mode


OpenCL FFT Example - Host API

// create a compute context with GPU device
context = clCreateContextFromType(CL_DEVICE_TYPE_GPU);

// create a work-queue
queue = clCreateWorkQueue(context, NULL, NULL, 0);

// allocate the buffer memory objects
memobjs[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(float)*2*num_entries, srcA);
memobjs[1] = clCreateBuffer(context, CL_MEM_READ_WRITE, sizeof(float)*2*num_entries, NULL);


OpenCL FFT Example - Host API

// create the compute program
program = clCreateProgramFromSource(context, 1, &fft1D_1024_kernel_src, NULL);

// build the compute program executable
clBuildProgramExecutable(program, false, NULL, NULL);

// create the compute kernel
kernel = clCreateKernel(program, "fft1D_1024");


OpenCL FFT Example - Host API

// create N-D range object with work-item dimensions
global_work_size[0] = n;
local_work_size[0] = 64;
range = clCreateNDRangeContainer(context, 0, 1, global_work_size, local_work_size);

// set the args values
clSetKernelArg(kernel, 0, (void *)&memobjs[0], sizeof(cl_mem), NULL);
clSetKernelArg(kernel, 1, (void *)&memobjs[1], sizeof(cl_mem), NULL);
clSetKernelArg(kernel, 2, NULL, sizeof(float)*(local_work_size[0]+1)*16, NULL);
clSetKernelArg(kernel, 3, NULL, sizeof(float)*(local_work_size[0]+1)*16, NULL);

// execute kernel
clExecuteKernel(queue, kernel, NULL, range, NULL, 0, NULL);


OpenCL FFT Example - Compute

// This kernel computes FFT of length 1024. The 1024 length FFT is decomposed into
// calls to a radix 16 function, another radix 16 function and then a radix 4 function
// Based on "Fitting FFT onto G80 Architecture". Vasily Volkov & Brian Kazian, UC Berkeley CS258 project report, May 2008
__kernel void fft1D_1024 (__global float2 *in, __global float2 *out,
                          __local float *sMemx, __local float *sMemy)
{
    int tid = get_local_id(0);
    int blockIdx = get_group_id(0) * 1024 + tid;
    float2 data[16];

    // starting index of data to/from global memory
    in = in + blockIdx;
    out = out + blockIdx;

    globalLoads(data, in, 64);            // coalesced global reads
    fftRadix16Pass(data);                 // in-place radix-16 pass
    twiddleFactorMul(data, tid, 1024, 0);

    // local shuffle using local memory
    localShuffle(data, sMemx, sMemy, tid, (((tid & 15) * 65) + (tid >> 4)));
    fftRadix16Pass(data);                 // in-place radix-16 pass
    twiddleFactorMul(data, tid, 64, 4);   // twiddle factor multiplication

    localShuffle(data, sMemx, sMemy, tid, (((tid >> 4) * 64) + (tid & 15)));

    // four radix-4 function calls
    fftRadix4Pass(data);
    fftRadix4Pass(data + 4);
    fftRadix4Pass(data + 8);
    fftRadix4Pass(data + 12);

    // coalesced global writes
    globalStores(data, out, 64);
}
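The sizes that appear in the host code and in the kernel fit together as simple arithmetic. The C sketch below recomputes them from values shown on the slides: a work-group of 64 work-items, 16 points held per work-item in `data[16]`, local-memory arguments of `sizeof(float)*(local_work_size[0]+1)*16` bytes, and the index expressions used in the kernel. The helper names are hypothetical and are not part of OpenCL. What `localShuffle` does with its offset argument is not shown in the slides, so only the offset itself is computed here.

```c
#include <stddef.h>

/* Values taken from the slides. */
enum { FFT_LENGTH = 1024, LOCAL_WORK_SIZE = 64 };

/* Points each work-item holds in its private data[] array. */
size_t points_per_work_item(void) {
    return FFT_LENGTH / LOCAL_WORK_SIZE;
}

/* Bytes requested for each __local argument (kernel args 2 and 3). */
size_t local_arg_bytes(size_t local_work_size) {
    return sizeof(float) * (local_work_size + 1) * 16;
}

/* First global element touched by a work-item, as in the kernel:
   blockIdx = get_group_id(0) * 1024 + tid. */
size_t first_global_index(size_t group_id, size_t tid) {
    return group_id * FFT_LENGTH + tid;
}

/* Offset passed to the first localShuffle call in the kernel. */
size_t first_shuffle_offset(size_t tid) {
    return (tid & 15) * 65 + (tid >> 4);
}
```

For a 4-byte `float`, `local_arg_bytes(64)` is 4160 bytes, which is room for 1040 floats. The largest first-pass offset, for `tid` 63, is 978.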


OpenCL and OpenGL

• Sharing OpenGL Resources
■ OpenCL is designed to efficiently share with OpenGL
■ Textures, Buffer Objects and Renderbuffers
■ Data is shared, not copied

• Efficient queuing of OpenCL and OpenGL commands

• Apps can select compute device(s) that will run OpenGL and OpenCL


Summary

• A new compute language that works across GPUs and CPUs
■ C99 with extensions
■ Familiar to developers
■ Includes a rich set of built-in functions
■ Makes it easy to develop data- and task-parallel compute programs

• Defines hardware and numerical precision requirements

• Open standard for heterogeneous parallel computing