OpenCL Slides

OpenCL: The Open Standard for Programming Heterogeneous Parallel Hardware. Master Seminar Winter Term 2008/09, Multicore Parallel Programming. Peter Thoman, 04-12-2008.

Transcript of OpenCL Slides

Page 1: OpenCL Slides

OpenCL: The Open Standard for Programming Heterogeneous Parallel Hardware

Master Seminar Winter Term 2008/09, Multicore Parallel Programming

Peter Thoman, 04-12-2008

Page 2: OpenCL Slides

04-12-2008 OpenCL Peter Thoman 2

Outline
● Introduction & Motivation
● Background:
  ● GPGPU Programming History
  ● Task-based Multicore CPU Programming
● OpenCL:
  ● Design Overview
  ● Components
  ● Execution Model
  ● Memory Model
  ● Examples
● Open Questions & Research Opportunities

Page 3: OpenCL Slides


Introduction
● Recent years: proliferation of parallel computing devices
  – Multicore CPUs, GPUs, Cell, ... soon: manycore CPUs
  → A standardized programming environment is desirable
● OpenCL is intended to be that standard
  ● Allows targeting various computing devices with the same program
  ● Simplifies development for “exotic” hardware
  → Stimulates further growth beyond HPC & research

Page 4: OpenCL Slides


Motivation – why bother?
● Higher levels of parallelism & specialization generally yield higher maximum performance

[Chart: peak computation (GFlop/s) and bandwidth (Gb/s), axis 0–1000, compared for Intel Core 2 Quad Q9450, IBM Cell BE, and NVIDIA GeForce GTX 260]

Page 5: OpenCL Slides


GPGPU History
● Starting around 2003: programmable shaders introduced on GPUs
  – Originally intended for lighting calculations on surfaces etc.
  ● Side effect: allows GPUs to be used as general-purpose computing devices
  → GPGPU born
● Two broad phases so far:
  ● Early GPGPU, 2003–07: graphics APIs (DirectX/OpenGL) used to write GPGPU programs
  ● Current GPU computing, 2007–?: vendor-supplied APIs

Page 6: OpenCL Slides


Early GPGPU
● Graphics APIs used: “rendering” with pixel shaders and ping-ponging
● Disadvantages:
  ● Programmer must know graphics APIs and concepts
  ● Overheads introduced by the graphics pipeline
  ● No communication and synchronization primitives

Page 7: OpenCL Slides


Current GPU Computing
● Vendor-supplied APIs: CUDA, CTM
● CUDA far more popular
  ● CUDA Zone lists 144 projects in a large variety of fields
  ● With speedups (over CPU) from factor 2 to 480
● Advantages:
  ● Standard C with simple extensions
  ● Arbitrary reads/writes from/to memory (no texture restriction)
  ● Small high-speed shared memory as a manual cache or for communication
  ● Traditional CPU functionality like bitwise integer ops
● Disadvantage: vendor/hardware specific

Page 8: OpenCL Slides


Task-based Parallelism on CPUs
● As opposed to a data-parallelism model like on GPUs
● Long history of implicitly task-based systems:
  ● MPI or other message passing
  ● Basic threading
  ● Even fork-join
● Explicitly task-based models are rather new:
  ● OpenMP 3.0
  ● Research projects like Star Superscalar – presented last week!

Page 9: OpenCL Slides


OpenMP 3.0 Task Model
● Simple spawning and synchronization of tasks
● Same memory model as existing OMP constructs
● No dependency handling

Example: Parallel Postorder Tree Traversal
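The traversal code from the slide is not in the transcript; the pattern it illustrates can be sketched as follows (a minimal sketch using the OpenMP 3.0 `task` and `taskwait` constructs, not the exact code shown):

```c
#include <stdlib.h>

typedef struct node {
    struct node *left, *right;
    int value;
} node;

/* Post-order traversal with OpenMP 3.0 tasks: each subtree is spawned
 * as a task, and taskwait ensures both children finish before the
 * parent node is processed.  Compiled without OpenMP, the pragmas are
 * ignored and the traversal simply runs serially. */
static void postorder(node *n, long *sum) {
    long ls = 0, rs = 0;
    if (!n) return;
    #pragma omp task shared(ls)
    postorder(n->left, &ls);
    #pragma omp task shared(rs)
    postorder(n->right, &rs);
    #pragma omp taskwait              /* wait for both child tasks */
    *sum = ls + rs + n->value;        /* "visit" the parent last */
}

long tree_sum(node *root) {
    long total = 0;
    #pragma omp parallel
    #pragma omp single                /* one thread seeds the task tree */
    postorder(root, &total);
    return total;
}
```

Note the `taskwait` before combining the children's results: that is what makes the traversal post-order rather than merely parallel.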

Page 10: OpenCL Slides


OpenCL
● Important: specification not yet released; all information based on public presentations given at SIGGRAPH and SC08
● Timeline: [timeline graphic shown on slide]

Page 11: OpenCL Slides


OpenCL
● Broad industry support
● Next version of Apple OS X will most likely include the first implementation

Page 12: OpenCL Slides


OpenCL – Design Goals
● Enable use of all computational resources in a system
  – Allow programming GPUs, CPUs, Cell, etc.
● Support data- and task-parallel compute models
● Approachable low-level, high-performance abstraction with silicon portability
● Familiar C-like parallel programming model
● Drive future hardware requirements, including floating point precision limits
● Close integration with OpenGL for visualization

Page 13: OpenCL Slides


OpenCL – Design Illustration
● Convergence of both hardware and programming models: [illustration shown on slide]

Page 14: OpenCL Slides


OpenCL Components (1)
● OpenCL consists of 3 components:
  ● Platform layer
  ● Runtime system
  ● Compiler/language specification
● Platform layer:
  ● Query, select and initialize devices
  ● Create compute contexts and command queues
● Runtime system:
  ● Resource management (memory, program scheduling)
  ● Executing compute kernels

Page 15: OpenCL Slides


OpenCL Components (2)
● Compiler (either online or offline compilation)
  ● Builds components written in the compute kernel language
● Language:
  ● Based on ISO C99; no recursion or function pointers
  ● Built-in types:
    – Scalar and vector data types, pointers
    – Data type conversion functions
    – Image-related types
  ● Built-in functions:
    – Required: work-item and synchronization functions
    – Required: math (math.h), relational and geometric functions
    – Required: functions to read and write images
    – Optional: double precision support and rounding modes
    – Optional: atomics to global and shared memory
    – Optional: writes to 3D images

Page 16: OpenCL Slides


OpenCL Execution Model
● Components:
  ● Compute kernels: basic units of computation, similar to C functions
  ● Compute programs: collections of kernels and internal functions
● Components are queued in a command queue to execute on a specific device
● Two different execution models:
  ● Data-parallel
  ● Task-parallel

Page 17: OpenCL Slides


OpenCL Data-Parallel Model
● Programmer specifies an N-dimensional computation domain
● Every element is a work-item
  ● Total number of items = global work size
  ● Global work size is the maximum degree of parallelism for this computation
● Work-items can be grouped into work-groups
  ● Mapped either explicitly or implicitly
  ● Items in a group can communicate and synchronize
  ● Work-groups can also be executed in parallel
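The indexing scheme above can be made concrete with a small kernel; the built-in names follow the OpenCL 1.0 specification as later released (the spec was not yet public at the time of the talk):

```c
/* Each work-item can ask where it sits in the iteration space. */
__kernel void ids(__global int *out)
{
    size_t gid = get_global_id(0);   /* index within the global work size */
    size_t lid = get_local_id(0);    /* index within this work-group */
    size_t grp = get_group_id(0);    /* which work-group this item belongs to */

    /* The identities relate the three: gid == grp * local_size + lid. */
    out[gid] = (int)(grp * get_local_size(0) + lid);
}
```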

Page 18: OpenCL Slides


OpenCL Task-Parallel Model
● Optional for compute devices
  ● Most current GPUs probably won't support it
● Tasks are executed as a single work-item
● Unlike data-parallel, tasks can be written in either the OpenCL kernel language or native C/C++
● No clearer specification for now; conjectured to be similar to the OpenMP 3.0 model

Page 19: OpenCL Slides


OpenCL – Memory Model
● Relaxed-consistency shared memory model
● Multiple distinct address spaces, which can be collapsed on some devices:
  ● Private memory: per work-item
  ● Local memory: per compute unit
  ● Global/constant memory
● Qualifiers: __private, __local, __constant and __global
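A small kernel can show all four qualifiers in use (a sketch in the kernel language as later specified in OpenCL 1.0; the kernel name and arguments are illustrative):

```c
__constant float scale = 2.0f;                 /* read-only, device-wide */

__kernel void scaled_copy(__global const float *in,
                          __global float *out,
                          __local float *tmp)  /* scratch shared by one work-group */
{
    __private size_t i = get_global_id(0);     /* __private is the default space */

    tmp[get_local_id(0)] = in[i] * scale;
    barrier(CLK_LOCAL_MEM_FENCE);              /* sync within the work-group */
    out[i] = tmp[get_local_id(0)];
}
```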

Page 20: OpenCL Slides


OpenCL – Examples (1)
● Simple vector addition kernel (compute device) code:
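The kernel from the slide is not in the transcript; a minimal vector addition kernel in the OpenCL C language (as later released in the 1.0 specification) looks like:

```c
/* Each work-item adds exactly one pair of elements; the global work
 * size is set to the vector length on the host side. */
__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *c)
{
    int i = get_global_id(0);
    c[i] = a[i] + b[i];
}
```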

Page 21: OpenCL Slides


OpenCL – Examples (2)
● Host code – initialization of a GPU device and associated context / command queue:
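The slide's host code is not in the transcript; a sketch following the OpenCL 1.0 host API as later published (error handling omitted for brevity):

```c
#include <CL/cl.h>

cl_platform_id   platform;
cl_device_id     device;
cl_context       context;
cl_command_queue queue;
cl_int           err;

/* Pick the first platform and its first GPU device. */
err = clGetPlatformIDs(1, &platform, NULL);
err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

/* A context groups devices; the command queue feeds work to one device. */
context = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
queue   = clCreateCommandQueue(context, device, 0, &err);
```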

Page 22: OpenCL Slides


OpenCL – Examples (3)
● Host code – allocate device memory buffers and create / build program:
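Again the slide's code is not in the transcript; in the OpenCL 1.0 API as later published, this step looks roughly like (assuming `context` and `device` from the previous step, host arrays `a` and `b` of `N` floats, and the kernel text in the C string `source`):

```c
/* Input buffers are initialized from the host arrays at creation time. */
cl_mem buf_a = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                              N * sizeof(float), a, &err);
cl_mem buf_b = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                              N * sizeof(float), b, &err);
cl_mem buf_c = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                              N * sizeof(float), NULL, &err);

/* Online compilation: the kernel source is built for our device at runtime. */
cl_program program = clCreateProgramWithSource(context, 1, &source, NULL, &err);
err = clBuildProgram(program, 1, &device, NULL, NULL, NULL);
```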

Page 23: OpenCL Slides


OpenCL – Examples (4)
● Host code – create and run compute kernel:
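A sketch of this step in the later-published OpenCL 1.0 API, continuing the vector addition example (buffers and program as created above):

```c
cl_kernel kernel = clCreateKernel(program, "vec_add", &err);

/* Bind the three buffers to the kernel's arguments. */
clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf_a);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &buf_b);
clSetKernelArg(kernel, 2, sizeof(cl_mem), &buf_c);

/* One work-item per element: 1-dimensional domain, global work size N;
 * passing NULL for the local size lets the runtime pick work-groups. */
size_t global = N;
err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL,
                             0, NULL, NULL);

/* Blocking read copies the result back into the host array c. */
err = clEnqueueReadBuffer(queue, buf_c, CL_TRUE, 0, N * sizeof(float), c,
                          0, NULL, NULL);
```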

Page 24: OpenCL Slides


OpenCL – Examples (5)
● Kernel code – matrix transpose:
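The transpose kernel itself is missing from the transcript; a typical tiled version (a sketch, not the slide's exact code) stages each block in __local memory so both the read and the write hit global memory contiguously. It assumes a height × width row-major input, dimensions divisible by TILE, and a TILE × TILE local work size with a matching __local buffer:

```c
#define TILE 16

__kernel void transpose(__global const float *in, __global float *out,
                        int width, int height, __local float *tile)
{
    int gx = get_global_id(0), gy = get_global_id(1);
    int lx = get_local_id(0),  ly = get_local_id(1);

    /* Cooperative load of one TILE x TILE block into local memory. */
    tile[ly * TILE + lx] = in[gy * width + gx];
    barrier(CLK_LOCAL_MEM_FENCE);   /* whole tile loaded before writing */

    /* Write the block out transposed: swap the block coordinates and
     * read the tile with swapped local indices. */
    int ox = get_group_id(1) * TILE + lx;
    int oy = get_group_id(0) * TILE + ly;
    out[oy * height + ox] = tile[lx * TILE + ly];
}
```

The barrier is the synchronization primitive that early graphics-API GPGPU lacked; without it, work-items could read tile entries their neighbors have not written yet.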

Page 25: OpenCL Slides


Open Questions / Research
● Shared code for vastly different hardware enables research opportunities:
● Distribution of kernels across devices
  ● Run a given kernel on GPU or CPU, or maybe split it?
● Requires:
  ● Analysis of kernels, either statically or dynamically
  ● Lookup or benchmarking of available hardware at runtime
  ● Fast decision algorithm using this information
    – Either analytical or machine learning

Page 26: OpenCL Slides


Summary
● Modern and future systems contain massively parallel, heterogeneous hardware
  ● Worth the headache because of the performance potential
● OpenCL:
  ● Open standard platform for programming such systems
  ● Data- and task-parallel execution models
    – In the tradition of GPU programming models for the former, and mainstream CPU parallelization for the latter
  ● Relaxed-consistency shared memory
    – Distinct, collapsible address spaces
  ● Release soon!

Page 27: OpenCL Slides

Thank you!

Consult the accompanying seminar document for a complete list of references.

[Figure labels from the slides: STI Cell, NVIDIA GTX 280, ATI/AMD RV770, AMD Phenom, Intel Nehalem]