OpenCL/OpenMP Offload
Transcript of OpenCL/OpenMP Offload
OpenCL™ & OpenMP® Offload on Sitara™ AM57x Processors
Agenda
• OpenCL
  – Overview of the Platform, Execution, and Memory models
  – Mapping these models to AM57x
• Overview of the OpenMP Offload Model
• Compare and contrast OpenCL and OpenMP Offload
OpenCL
OpenCL Overview
• OpenCL is a framework for parallel processing on heterogeneous devices.
• OpenCL is an open standard and is royalty-free.
• OpenCL consists of two components:
  – An API for the host program to create and submit kernels for execution
  – A cross-platform language for expressing kernels: OpenCL C
    • Based on C99 with some additions, some restrictions, and a set of built-in functions
• OpenCL promotes portability of applications from device to device and across generations of a single device roadmap:
  – Abstracts low-level communication and dispatch mechanisms
  – Uses a descriptive rather than prescriptive model: kernel + enqueue mechanism
OpenCL Platform Model
• One Host is connected to one or more OpenCL compute devices.
• Compute Devices are composed of one or more compute units (for example, a C66x DSP core).
• Compute Units (CUs) are composed of one or more Processing Elements (PEs).
• The Host program:
  – Executes on the host and submits kernels that execute on the device(s)
  – Can run asynchronously with the kernels
  – Defines the context for kernels and manages their execution
[Diagram: a Host connected to Compute Devices; each Compute Device contains Compute Units, and each Compute Unit contains Processing Elements]
OpenCL Execution Model(s)
Data Parallel (NDRange Kernel):
• A kernel is associated with an index space (NDRange, N = 1, 2, or 3 dimensions).
• A Work Item (WI) is an instance of the kernel for a point in the index space; it executes on a PE.
• Work Groups (WGs) are groups of work items; the WIs in a WG execute concurrently on a single CU.
• All WGs finish before another kernel is dispatched.

Task Parallel (Task + Out-of-Order Queue):
• A Task is enqueued.
• OpenCL dispatches the task to one of the CUs.
• OpenCL can accept additional tasks and dispatch them asynchronously.
OpenCL Memory Model
• Private: memory region private to a work item
• Local: memory region local to a work group
• Global: memory region with read/write access for all work items in all work groups
• Constant: a region of global memory that remains constant during kernel execution
[Diagram: Compute Device memory hierarchy — Compute Units 1..N each contain PE 1..PE N with a Private Memory per PE and a Local Memory per CU; a Global/Constant Memory Data Cache sits between the compute units and the device's Global Memory and Constant Memory]
OpenCL on AM572x (CL_DEVICE_TYPE_ACCELERATOR)
• Platform Model:
  – The Host is the dual Cortex-A15 cluster running SMP Linux.
  – One OpenCL device with two C66x DSP cores.
  – A compute unit is a single C66x DSP.
• Memory Model:
  – Global memory: 160MB of DDR and 1MB of OCMC SRAM in the default configuration (configurable)
  – Local memory: 128KB of each DSP's L2SRAM, accessible by kernels via OpenCL Local; of the 288KB L2SRAM per core, another 128KB is cache (configurable) and 32KB is used by the runtime
[Diagram: OpenCL Host — two ARM Cortex-A15 cores with 2MB L2 cache; OpenCL Device — two C66x DSP compute units, each with 288KB L2SRAM; global memory in OCMC SRAM (on-chip) and DDR (off-chip)]
AM572x Data Parallel Execution Model
Data Parallel (NDRangeKernel):
1. An OpenCL application enqueues NDRangeKernels to an OpenCL Queue.
2. Asynchronously, the OpenCL Runtime does the following:
   1. Pulls an NDRangeKernel from the OpenCL Queue.
   2. Creates an appropriate number of Work Groups for the NDRangeKernel.
   3. Places those WGs on a ready list for the DSP cores.
3. Asynchronously, the DSP cores independently do the following:
   1. Pull WGs from the ready list, then execute them.
   2. Repeat until the ready list is empty.
4. The OpenCL Runtime then does the following:
   1. Performs any needed explicit cache coherency operations.
   2. Signals the OpenCL application that the NDRangeKernel has completed.
5. Repeats from step 2 for any other NDRangeKernels in the OpenCL Queue.
[Diagram: the OpenCL Application enqueues NDRangeKernels (NDR) to the OpenCL Queue; the OpenCL Runtime expands each NDR into Work Groups (WG) on the Ready List, from which DSP1 and DSP2 pull WGs and execute them]
AM572x Task Parallel Execution Model
Task Parallel (Out of Order Tasks)1. An OpenCL application enqueues Tasks to an Out of Order OpenCL Queue.2. Asynchronously, the OpenCL Runtime does the following:
1. Pull a Task from the OpenCL Queue.2. Place that Task on a ready list for the DSP cores.3. Repeat while the OpenCL queue is not empty and the ready list is not full.
3. Asynchronously, the DSP cores independently do the following:1. Pull Tasks from the ready list and execute them. 2. Perform any needed explicit cache coherency operations. 3. Signal the OpenCL application that the Task has completed.4. Repeat while the ready list is not empty.
[Diagram: the OpenCL Application enqueues Out-of-Order Tasks (OOT) to the OpenCL Queue; the OpenCL Runtime moves them to the Ready List, from which DSP1 and DSP2 each pull and execute Tasks independently]
AM572x “OpenMP” Execution Model (TI Extension)
OpenMP (TI extension, Task + In-Order Queue):
1. An OpenCL application enqueues Tasks with calls to C code containing OpenMP parallel regions to an In-Order OpenCL Queue.
2. Asynchronously, the OpenCL Runtime does the following:
   1. Pulls a Task from the OpenCL Queue.
   2. Places that Task on a ready list for the DSP cores.
   3. Repeats while the OpenCL queue is not empty and the ready list is not full.
3. Asynchronously, the “master” DSP core (DSP1) independently does the following:
   1. Pulls a Task from the ready list and executes it. OpenMP constructs are used to parallelize code across DSP1 and DSP2.
   2. Performs any needed explicit cache coherency operations.
   3. Signals the OpenCL application that the Task has completed.
   4. Repeats while the ready list is not empty.
[Diagram: the OpenCL Application enqueues In-Order Tasks (IOT) to the OpenCL Queue; the Runtime places them on the Ready List; the “master” DSP1 pulls each Task and uses OpenMP to parallelize its execution across DSP1 and the “worker” DSP2]
TI OpenCL Features
• Takes advantage of on-chip and off-chip shared memory on the AM572x to enable zero-copy data access across the A15s and C66x DSPs
• TI extensions:
  – OpenCL kernels can call into standard C code, including code with OpenMP pragmas (e.g., optimized C66x libraries such as fftlib)
  – On-chip OpenCL buffers in OCMC SRAM
  – Access to C66x intrinsics such as _dcmpy and to EDMA APIs from OpenCL kernels
• OpenCL implementation conformant to v1.1 (full profile):
  – That is, TI's implementation has passed the Conformance Test Suite for OpenCL v1.1.
  – Test results certified by Khronos, the organization governing OpenCL.
  – No support for images; ‘double’ is supported as an extension and is not included in the conformance submission.
• Refer to TI's OpenCL User Guide for details.
OpenMP®
OpenMP Overview
• API for specifying shared-memory parallelism in C, C++, and Fortran
• Consists of compiler directives, library routines, and environment variables:
  – Easy and incremental migration for existing code bases
  – De facto industry standard for shared-memory parallel programming
• Portable across shared-memory architectures
• v4.0 supports programming heterogeneous architectures
OpenMP-DSP on AM572x: Execution Model
• “OpenMP-DSP” refers to the runtime used to enable parallelism across the C66x DSPs on AM572x.
• The master thread creates a team of threads on encountering a parallel region:
  – One OpenMP thread runs on each C66x DSP core.
  – The master thread begins execution on DSP1.
  – DSP2 is a worker core; it participates in executing the parallel region.
• Data Parallel: work-sharing constructs are used to distribute work (e.g., loop iterations) across the DSP cores.
• Task Parallel: the task construct is used to generate tasks, which are executed by one of the DSP cores on the team.
OpenMP-DSP on AM572x: Memory Model
• Threads have access to shared memory:
  – Each thread can have a temporary view of the shared memory (e.g., registers, cache).
  – The temporary view is made consistent with the shared view of memory at synchronization points.
• Threads have private memory:
  – For data local to each thread
• There is no hardware cache coherency across the DSP cores.
• The OpenMP runtime makes a thread's view of memory consistent with the shared view by performing cache operations at synchronization points.
[Diagram: the master thread forks a team at a parallel region and rejoins at synchronization points; each C66x DSP core (288KB L2SRAM) has private memory, with shared memory in OCMC SRAM (on-chip) and DDR (off-chip)]
OpenMP 4.0 “Offload Model”
The “Offload Model” is a subset of the OpenMP 4.0 specification enabling execution on heterogeneous devices. It extends OpenMP by adding:
• A ‘target’ construct to indicate regions to be dispatched
  – Target regions can contain OpenMP constructs
• A map clause to indicate data transfer between host <-> device
• Constructs to indicate that variables/functions reside on the host, the device, or both

void add(int *in1, int *in2, int *out1, int count)
{
    #pragma omp target map(to: in1[0:count], in2[0:count], count) \
                       map(from: out1[0:count])
    {
        #pragma omp parallel shared(in1, in2, out1)
        {
            int i;
            #pragma omp for
            for (i = 0; i < count; i++)
                out1[i] = in1[i] + in2[i];
        }
    }
}

(OpenMP array sections are written [lower-bound:length], so [0:count] maps count elements.)
OpenMP Offload Execution & Memory Model
• Notion of a host device and target device(s)
• Execution Model:
  – Each device has its own threads.
  – No migration of threads across devices.
  – The host device offloads ‘target regions’ to target devices.
  – The host waits until the target region has completed execution.
• Memory Model:
  – Each device, including the host device, has an initial data environment.
  – Data mapping clauses determine how variables are mapped from the host device's data environment to that of the target device.
  – Variables in different data environments may share storage.
target construct
• Variables a, b, c, and size initially reside in host memory.
• On encountering a target construct:
  – Space is allocated in device memory for variables a[0:size], b[0:size], c[0:size], and size.
  – Any variables annotated ‘to’ are mapped from host memory to device memory.
  – The target region is executed on the device.
  – Any variables annotated ‘from’ are mapped from device memory to host memory.

void vadd_openmp(float *a, float *b, float *c, int size)
{
    #pragma omp target map(to: a[0:size], b[0:size], size) \
                       map(from: c[0:size])
    {
        int i;
        #pragma omp parallel for
        for (i = 0; i < size; i++)
            c[i] = a[i] + b[i];
    }
}

[Diagram: a, b, and size are copied from host memory to device memory (‘to’); c is copied back from device memory to host memory (‘from’)]
Offload Model on AM572x
Execution Model:
• Host device: ARM Cortex-A15(s) running SMP Linux
• Target device: one device with two C66x DSP cores
• When an ARM thread encounters a target region, execution begins on the master core (DSP1).
  – An ‘omp parallel’ pragma within the target region starts execution on the worker core (DSP2).
• The Offload Model runtime lowers OpenMP target constructs and data movement clauses to OpenCL API calls (source-to-source lowering):
  – Target regions -> OpenCL kernels
  – Map clauses -> OpenCL data movement APIs
[Diagram: a host program with OpenMP target directives is lowered source-to-source; the contained parallel regions run under the OpenMP-DSP Runtime on the C66x DSPs (288KB L2SRAM each), coordinated by the OpenMP Offload Runtime on the ARM Cortex-A15 cluster (2MB L2 cache) under SMP Linux]
Comparing OpenCL and OpenMP Offload
OpenCL/OpenMP Offload: Making the Choice
• Different approaches to expressing offload and parallelism:
  – OpenCL uses APIs and OpenCL C; the user will need to rework existing code.
  – OpenMP uses pragmas (target, parallel, etc.), which can be used for quick prototyping.
• Which execution/memory model is a better fit for the application code dispatched to the DSP?
• The choice also depends on other factors:
  – Nature of the existing code base:
    • Is it already written to use OpenCL for dispatch?
    • Does it already use OpenMP to go parallel across threads?
  – Control over data movement required; the OpenCL APIs offer more precise control over data movement between the host and device.
  – Programmer expertise and preference
For More Information
• TI OpenCL User Guide
• TI OpenMP-DSP User Guide
• TI OpenMP Offload Model User Guide
• Processor SDK for AM57x
• For questions regarding topics covered in this training, visit the support forums at the TI E2E Community website: http://e2e.ti.com