OpenCL/OpenMP Offload
Transcript of OpenCL/OpenMP Offload
OpenCL™ & OpenMP® Offload on Sitara™ AM57x Processors
Agenda
• OpenCL
  – Overview of the Platform, Execution, and Memory models
  – Mapping these models to AM57x
• Overview of the OpenMP Offload Model
• Compare and contrast OpenCL and OpenMP Offload
OpenCL
OpenCL Overview
• OpenCL is a framework for parallel processing on heterogeneous devices.
• OpenCL is an open standard and is royalty-free.
• OpenCL consists of two components:
  – An API for the host program to create and submit kernels for execution
  – A cross-platform language for expressing kernels: OpenCL C
    • Based on C99 with some additions, some restrictions, and a set of built-in functions
• OpenCL promotes portability of applications from device to device and across generations of a single device roadmap:
  – Abstracts low-level communication and dispatch mechanisms
  – Uses a descriptive rather than prescriptive model: kernel + enqueue mechanism
OpenCL Platform Model
• One Host is connected to one or more OpenCL compute devices.
• Compute Devices are composed of one or more compute units (for example, a C66x DSP core).
• Compute Units (CUs) are composed of one or more Processing Elements (PEs).
• The Host program:
  – Executes on the host and submits kernels that execute on the device(s)
  – Can run asynchronously with the kernels
  – Defines the context for kernels and manages their execution
[Diagram: a Host connected to Compute Devices; each Compute Device contains Compute Units, and each Compute Unit contains Processing Elements]
OpenCL Execution Model(s)
Data Parallel (NDRange Kernel):
• A kernel is associated with an index space (NDRange, N = 1, 2, or 3 dimensions).
• A Work Item (WI) is an instance of the kernel for a point in the index space; it executes on a PE.
• Work Groups (WGs) are groups of work items; the WIs in a WG execute concurrently on a single CU.
• All WGs finish before another kernel is dispatched.

Task Parallel (Task + Out-of-Order Queue):
• A Task is enqueued.
• OpenCL dispatches the task to one of the CUs.
• OpenCL can accept additional tasks and dispatch them asynchronously.
OpenCL Memory Model
• Private: memory region private to a work item
• Local: memory region local to a work group
• Global: memory region with read/write access for all work items in all work groups
• Constant: a region of global memory that remains constant during kernel execution
[Diagram: Compute Device memory hierarchy — Compute Units 1..N each contain PE 1..PE N with a Private Memory per PE and a Local Memory per CU; a Global/Constant Memory Data Cache sits between the compute units and the device's Global Memory and Constant Memory]
OpenCL on AM572x (CL_DEVICE_TYPE_ACCELERATOR)
• Platform Model:
  – The Host is the dual Cortex-A15 cluster running SMP Linux.
  – One OpenCL device with two C66x DSP cores.
  – A compute unit is a single C66x DSP.
• Memory Model:
  – Global memory: 160MB of DDR and 1MB of OCMC SRAM in the default configuration (configurable)
  – Local memory: 128KB of each DSP's L2SRAM, accessible by kernels via OpenCL Local; of the 288KB L2SRAM per core, another 128KB is cache (configurable) and 32KB is used by the runtime
[Diagram: OpenCL Host — two ARM Cortex-A15 cores with 2MB L2 cache; OpenCL Device — two C66x DSP compute units, each with 288KB L2SRAM; global memory in OCMC SRAM (on-chip) and DDR (off-chip)]
AM572x Data Parallel Execution Model
Data Parallel (NDRangeKernel):
1. An OpenCL application enqueues NDRangeKernels to an OpenCL Queue.
2. Asynchronously, the OpenCL Runtime does the following:
   1. Pulls an NDRangeKernel from the OpenCL Queue.
   2. Creates an appropriate number of Work Groups for the NDRangeKernel.
   3. Places those WGs on a ready list for the DSP cores.
3. Asynchronously, the DSP cores independently do the following:
   1. Pull WGs from the ready list, then execute them.
   2. Repeat until the ready list is empty.
4. The OpenCL Runtime then does the following:
   1. Performs any needed explicit cache coherency operations.
   2. Signals the OpenCL application that the NDRangeKernel has completed.
5. Repeats from step 2 for any other NDRangeKernels in the OpenCL Queue.
[Diagram: the OpenCL Application enqueues NDRangeKernels (NDR) to the OpenCL Queue; the OpenCL Runtime expands each NDR into Work Groups (WG) on the Ready List, from which DSP1 and DSP2 pull WGs and execute them]
AM572x Task Parallel Execution Model
Task Parallel (Out of Order Tasks)1. An OpenCL application enqueues Tasks to an Out of Order OpenCL Queue.2. Asynchronously, the OpenCL Runtime does the following:
1. Pull a Task from the OpenCL Queue.2. Place that Task on a ready list for the DSP cores.3. Repeat while the OpenCL queue is not empty and the ready list is not full.
3. Asynchronously, the DSP cores independently do the following:1. Pull Tasks from the ready list and execute them. 2. Perform any needed explicit cache coherency operations. 3. Signal the OpenCL application that the Task has completed.4. Repeat while the ready list is not empty.
[Diagram: the OpenCL Application enqueues Out-of-Order Tasks (OOT) to the OpenCL Queue; the OpenCL Runtime moves them to the Ready List, from which DSP1 and DSP2 each pull and execute Tasks independently]
AM572x “OpenMP” Execution Model (TI Extension)
OpenMP (TI extension, Task + In-Order Queue):
1. An OpenCL application enqueues Tasks with calls to C code containing OpenMP parallel regions to an In-Order OpenCL Queue.
2. Asynchronously, the OpenCL Runtime does the following:
   1. Pulls a Task from the OpenCL Queue.
   2. Places that Task on a ready list for the DSP cores.
   3. Repeats while the OpenCL queue is not empty and the ready list is not full.
3. Asynchronously, the “master” DSP core (DSP1) independently does the following:
   1. Pulls a Task from the ready list and executes it. OpenMP constructs are used to parallelize code across DSP1 and DSP2.
   2. Performs any needed explicit cache coherency operations.
   3. Signals the OpenCL application that the Task has completed.
   4. Repeats while the ready list is not empty.
[Diagram: the OpenCL Application enqueues In-Order Tasks (IOT) to the OpenCL Queue; the Runtime places them on the Ready List; the “master” DSP1 pulls each Task and uses OpenMP to parallelize its execution across DSP1 and the “worker” DSP2]
TI OpenCL Features
• Takes advantage of on-chip and off-chip shared memory on the AM572x to enable zero-copy data access across the A15s and C66x DSPs
• TI extensions:
  – OpenCL kernels can call into standard C code, including code with OpenMP pragmas (e.g., optimized C66x libraries such as fftlib)
  – On-chip OpenCL buffers in OCMC SRAM
  – Access to C66x intrinsics such as _dcmpy and to EDMA APIs from OpenCL kernels
• OpenCL implementation conformant to v1.1 (full profile):
  – That is, TI's implementation has passed the Conformance Test Suite for OpenCL v1.1.
  – Test results certified by Khronos, the organization governing OpenCL.
  – No support for images; ‘double’ is supported as an extension and is not included in the conformance submission.
• Refer to TI's OpenCL User Guide for details.
OpenMP®
OpenMP Overview
• API for specifying shared-memory parallelism in C, C++, and Fortran
• Consists of compiler directives, library routines, and environment variables:
  – Easy and incremental migration for existing code bases
  – De facto industry standard for shared-memory parallel programming
• Portable across shared-memory architectures
• v4.0 supports programming heterogeneous architectures
OpenMP-DSP on AM572x: Execution Model
• “OpenMP-DSP” refers to the runtime used to enable parallelism across the C66x DSPs on AM572x.
• The master thread creates a team of threads on encountering a parallel region:
  – One OpenMP thread runs on each C66x DSP core.
  – The master thread begins execution on DSP1.
  – DSP2 is a worker core; it participates in executing the parallel region.
• Data Parallel: work-sharing constructs are used to distribute work (e.g., loop iterations) across the DSP cores.
• Task Parallel: the task construct is used to generate tasks, which are executed by one of the DSP cores on the team.
OpenMP-DSP on AM572x: Memory Model
• Threads have access to shared memory:
  – Each thread can have a temporary view of the shared memory (e.g., registers, cache).
  – The temporary view is made consistent with the shared view of memory at synchronization points.
• Threads have private memory:
  – For data local to each thread
• There is no hardware cache coherency across the DSP cores.
• The OpenMP runtime makes a thread's view of memory consistent with the shared view by performing cache operations at synchronization points.
[Diagram: the master thread forks a team at a parallel region and rejoins at synchronization points; each C66x DSP core (288KB L2SRAM) has private memory, with shared memory in OCMC SRAM (on-chip) and DDR (off-chip)]
OpenMP 4.0 “Offload Model”
The “Offload Model” is a subset of the OpenMP 4.0 specification enabling execution on heterogeneous devices. It extends OpenMP by adding:
• A ‘target’ construct to indicate regions to be dispatched
  – Target regions can contain OpenMP constructs
• A map clause to indicate data transfer between host <-> device
• Constructs to indicate that variables/functions reside on the host, the device, or both

void add(int *in1, int *in2, int *out1, int count)
{
    #pragma omp target map(to: in1[0:count], in2[0:count], count) \
                       map(from: out1[0:count])
    {
        #pragma omp parallel shared(in1, in2, out1)
        {
            int i;
            #pragma omp for
            for (i = 0; i < count; i++)
                out1[i] = in1[i] + in2[i];
        }
    }
}

(OpenMP array sections are written [lower-bound:length], so [0:count] maps count elements.)
OpenMP Offload Execution & Memory Model
• Notion of a host device and target device(s)
• Execution Model:
  – Each device has its own threads.
  – No migration of threads across devices.
  – The host device offloads ‘target regions’ to target devices.
  – The host waits until the target region has completed execution.
• Memory Model:
  – Each device, including the host device, has an initial data environment.
  – Data mapping clauses determine how variables are mapped from the host device's data environment to that of the target device.
  – Variables in different data environments may share storage.
target construct
• Variables a, b, c, and size initially reside in host memory.
• On encountering a target construct:
  – Space is allocated in device memory for variables a[0:size], b[0:size], c[0:size], and size.
  – Any variables annotated ‘to’ are mapped from host memory to device memory.
  – The target region is executed on the device.
  – Any variables annotated ‘from’ are mapped from device memory to host memory.

void vadd_openmp(float *a, float *b, float *c, int size)
{
    #pragma omp target map(to: a[0:size], b[0:size], size) \
                       map(from: c[0:size])
    {
        int i;
        #pragma omp parallel for
        for (i = 0; i < size; i++)
            c[i] = a[i] + b[i];
    }
}

[Diagram: a, b, and size are copied from host memory to device memory (‘to’); c is copied back from device memory to host memory (‘from’)]
Offload Model on AM572x
Execution Model:
• Host device: ARM Cortex-A15(s) running SMP Linux
• Target device: one device with two C66x DSP cores
• When an ARM thread encounters a target region, execution begins on the master core (DSP1).
  – An ‘omp parallel’ pragma within the target region starts execution on the worker core (DSP2).
• The Offload Model runtime lowers OpenMP target constructs and data movement clauses to OpenCL API calls (source-to-source lowering):
  – Target regions -> OpenCL kernels
  – Map clauses -> OpenCL data movement APIs
[Diagram: a host program with OpenMP target directives is lowered source-to-source; the contained parallel regions run under the OpenMP-DSP Runtime on the C66x DSPs (288KB L2SRAM each), coordinated by the OpenMP Offload Runtime on the ARM Cortex-A15 cluster (2MB L2 cache) under SMP Linux]
Comparing OpenCL and OpenMP Offload
OpenCL/OpenMP Offload: Making the Choice
• Different approaches to expressing offload and parallelism:
  – OpenCL uses APIs and OpenCL C; the user will need to rework existing code.
  – OpenMP uses pragmas (target, parallel, etc.), which can be used for quick prototyping.
• Which execution/memory model is a better fit for the application code dispatched to the DSP?
• The choice also depends on other factors:
  – Nature of the existing code base:
    • Is it already written to use OpenCL for dispatch?
    • Does it already use OpenMP to go parallel across threads?
  – Control over data movement required; the OpenCL APIs offer more precise control over data movement between the host and device.
  – Programmer expertise and preference
For More Information
• TI OpenCL User Guide
• TI OpenMP-DSP User Guide
• TI OpenMP Offload Model User Guide
• Processor SDK for AM57x
• For questions regarding topics covered in this training, visit the support forums at the TI E2E Community website: http://e2e.ti.com