Towards the Transparent Execution of Compound OpenCL Computations in Multi-CPU/Multi-GPU Environments
Fábio Soldado, Fernando Alexandre, Hervé Paulino
CITI / Computer Science Department, Faculty of Science and Technology, NOVA University of Lisbon
HeteroPar 2014 @ Euro-Par 2014, Porto, Portugal, August 25
Motivation
Current computational systems are heterogeneous by nature: CPUs + GPUs
The GPU is increasingly being used in general purpose computing
The programming and execution models for CPUs and GPUs are quite different
  Programmer forced to direct the computation to one kind of processing unit
High-level programming of multiple GPUs + multiple CPUs environments as a whole
Problem
OpenCL provides code but not performance portability
Low-level programming model: no composition support
Host: resource management; orchestration of data transfer and execution requests
Device: SPMD programming model; memory organization
[Diagram: host and device connected by a bus]
Problem
OpenCL provides code but not performance portability
Low-level programming model: no composition support
Host
  ⬆ Resource management
  ⬆ Orchestration of data transfer and execution requests
  + Decompose the computation among the CPUs and GPUs
  + Scheduling and load balancing
  + Device-type specific optimizations
Devices: SPMD programming model; device-type specific memory organization
[Diagram: host and devices connected by a bus]
ALGORITHMIC SKELETONS
The Marrow Framework - Distinguishing Features
C++ algorithmic skeleton framework for the orchestration of OpenCL computations [Euro-Par 2013]
Task and data-parallel skeletons
  Task-parallel: Pipeline and Loop
  Data-parallel: Map(Reduce)
Skeleton nesting
GPU heterogeneity support
GPU-directed optimizations
The Marrow Framework - Programming Example
Fast Fourier Transform (FFT) pipeline
  Adapted from the SHOC benchmark suite
  FFT kernel
  Inverse FFT kernel
[Diagram: Pipeline skeleton with stages FFT and iFFT]
Executable FFT (new KernelWrapper(kernelFile, kernelFunction, inInfo, outInfo));
Executable pipeline (new Pipeline(FFT, iFFT));
new Buffer<cl_float2>()
Proposal
Support the execution of compound OpenCL computations in multi-CPU/multi-GPU environments
Grow the Marrow algorithmic skeleton framework
  Multiple (possibly heterogeneous) GPUs + multiple CPUs
Transparently
  Distribute the load of a Marrow computation across multiple CPUs and GPUs
  Adapt this distribution to different input data-sets and to the CPUs' load fluctuations
Challenges
How to efficiently decompose a Marrow Computation Tree (CT) among the multiple CPU and GPU devices
How to efficiently distribute the work load among the available hardware resources
How to adapt this distribution to different input data-sets and to the CPUs’ load fluctuations
How to integrate these concepts in the programming model in a non-intrusive way
CT Decomposition
Replicating the skeleton tree
  Integrates seamlessly with the SPMD model
  Avoids data migration between devices
  Scales well with the increase of devices
Locality-aware domain decomposition
[Diagram: the input dataset is partitioned between two replicas of the Pipeline skeleton with stages FFT and iFFT]
CT Decomposition
OpenCL Fission: fission of 2
Overlap of computation and communication: factor of 3
[Diagram: the data is split among 4 CPU sub-devices ("Sub CPU") and 6 overlap partitions]
Best fission level?
Best overlap factor?
CT Decomposition
[Diagram: the data is split into fractions f and 1-f, and then among 4 CPU sub-devices ("Sub CPU") and 6 overlap partitions]
Evenly distributed
Distributed according to the relative performance of the devices [SAC 2014]
f?
Best fission level?
Best overlap factor?
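The two-level split in the diagram can be sketched as follows. This is only an illustration under stated assumptions: the slide does not say which device type receives f, so the sketch assumes the GPUs get f and the CPU sub-devices get 1-f, that the CPU share is the one distributed evenly, and that the GPU share is the one distributed by relative performance. The names `Split` and `splitWork` are hypothetical and are not part of Marrow.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical result type: number of elements given to each device.
struct Split {
    std::vector<std::size_t> cpu;  // one entry per CPU sub-device
    std::vector<std::size_t> gpu;  // one entry per GPU
};

// Assumption: fraction f of the data goes to the GPUs, 1-f to the CPU sub-devices.
// The CPU share is divided evenly; the GPU share is divided in proportion to
// gpuPerf, a caller-supplied relative performance value per GPU.
Split splitWork(std::size_t total, double f, std::size_t cpuSubDevices,
                const std::vector<double>& gpuPerf) {
    assert(f >= 0.0 && f <= 1.0);
    assert(cpuSubDevices > 0 && !gpuPerf.empty());

    Split s;
    const std::size_t gpuTotal =
        static_cast<std::size_t>(f * static_cast<double>(total));
    const std::size_t cpuTotal = total - gpuTotal;

    // Even distribution: the first (cpuTotal % n) sub-devices get one extra element.
    for (std::size_t i = 0; i < cpuSubDevices; ++i)
        s.cpu.push_back(cpuTotal / cpuSubDevices +
                        (i < cpuTotal % cpuSubDevices ? 1 : 0));

    // Proportional distribution: the last GPU takes whatever remains.
    const double perfSum = std::accumulate(gpuPerf.begin(), gpuPerf.end(), 0.0);
    assert(perfSum > 0.0);
    std::size_t assigned = 0;
    for (std::size_t i = 0; i + 1 < gpuPerf.size(); ++i) {
        std::size_t share = static_cast<std::size_t>(
            static_cast<double>(gpuTotal) * (gpuPerf[i] / perfSum));
        share = std::min(share, gpuTotal - assigned);
        s.gpu.push_back(share);
        assigned += share;
    }
    s.gpu.push_back(gpuTotal - assigned);
    return s;
}
```

For instance, `splitWork(1000, 0.75, 4, {3.0, 1.0})` gives 250 elements to the four CPU sub-devices and 750 to the two GPUs, the first GPU receiving roughly three times the share of the second.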
Work Distribution – CPUs + GPUs
We are particularly interested in recurrent applications of CTs upon possibly different data-sets with different sizes
Lightweight mechanism to derive a suitable configuration for a CT's execution, given a particular parameterization
Profile-based self-adaptation
  Resort to a profile built from past executions and to the current CPU load information
Work Distribution – CPUs + GPUs
Decision Process
[Flowchart nodes: Execution request; New CT?; CT info?; Train flag?; Perform training; Monitored execution; Compute lbt; Persist result]
Work Distribution – CPUs + GPUs
Training Process
Dimensions to consider
  Fission level
  Overlap factor
Compute the best workload distribution (f) for each considered fission/overlap configuration
  Two approaches: 50/50 split; CPU assisted GPU execution
Final result: the best overall performance
Uniform search over the search space (to improve)
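A uniform search over the training dimensions might look like the sketch below. It is an illustration only: the slides do not detail how f is computed by the two approaches, so the sketch simply samples f at fixed steps, which is a substitute technique. Fission levels and overlap factors are represented as plain integers, and `Config`, `Measure` and `train` are hypothetical names. The `measure` callback stands in for a timed execution of the CT.

```cpp
#include <cassert>
#include <functional>
#include <limits>
#include <vector>

// Hypothetical record of one evaluated configuration.
struct Config {
    int fissionLevel;
    int overlapFactor;
    double f;     // workload distribution
    double time;  // measured execution time
};

// Hypothetical callback: runs the computation with the given configuration
// and returns its execution time.
using Measure = std::function<double(int fission, int overlap, double f)>;

// Try every fission/overlap pair, sample f uniformly in [0, 1] with fSteps
// intervals, and keep the configuration with the lowest measured time.
Config train(const std::vector<int>& fissionLevels,
             const std::vector<int>& overlapFactors,
             int fSteps, const Measure& measure) {
    assert(fSteps > 0);
    Config best{0, 0, 0.0, std::numeric_limits<double>::infinity()};
    for (int fission : fissionLevels) {
        for (int overlap : overlapFactors) {
            for (int i = 0; i <= fSteps; ++i) {
                const double f = static_cast<double>(i) / fSteps;
                const double t = measure(fission, overlap, f);
                if (t < best.time) best = Config{fission, overlap, f, t};
            }
        }
    }
    return best;
}
```

The cost of this search grows with the product of the three dimensions, which is consistent with the slide marking the uniform search as something to improve.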
Work Distribution – CPUs + GPUs
Decision Process
[Flowchart nodes: Execution request; New CT?; CT info?; Train flag?; Derive configuration; Monitored execution; Compute lbt; Persist result]
Distribution Adaptation
Derive an initial work distribution
  Interpolation from past executions: nearest-neighbor
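Nearest-neighbor interpolation over a profile of past executions can be sketched as below. The sketch assumes the profile is keyed only by input data size and stores the work distribution f that was used; the actual profile contents are not described on the slide. `ProfileEntry` and `nearestDistribution` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical profile record for one past execution.
struct ProfileEntry {
    std::size_t dataSize;  // input size of the past execution
    double f;              // work distribution used in that execution
};

// Return the distribution of the past execution whose input size is closest
// to dataSize. Ties are resolved in favor of the earlier profile entry.
double nearestDistribution(const std::vector<ProfileEntry>& profile,
                           std::size_t dataSize) {
    assert(!profile.empty());
    auto distance = [dataSize](const ProfileEntry& e) {
        return e.dataSize > dataSize ? e.dataSize - dataSize
                                     : dataSize - e.dataSize;
    };
    const ProfileEntry* best = &profile.front();
    for (const ProfileEntry& e : profile)
        if (distance(e) < distance(*best)) best = &e;
    return best->f;
}
```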
Work Distribution – CPUs + GPUs
Decision Process
[Flowchart nodes: Execution request; New CT?; CT info?; Train flag?; New data-set?; Derive configuration; Retrieve lbt; Must rebalance?; Adjust distribution; Monitored execution; Compute lbt; Persist result]
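One possible reading of the decision process is sketched below. The edges of the flowchart did not survive in this transcript, so the branch wiring is an assumption: a new CT is either trained or has its configuration derived, a known CT with a new data-set has its configuration derived, and otherwise the retrieved lbt decides whether to adjust the distribution. The "CT info?" decision is left out because its role cannot be recovered. `Action`, `Request` and `decide` are hypothetical names.

```cpp
// Hypothetical outcome of the decision process for one execution request.
enum class Action {
    PerformTraining,
    DeriveConfiguration,
    AdjustDistribution,
    ReuseConfiguration
};

// Hypothetical summary of the questions asked in the flowchart.
struct Request {
    bool newCT;          // "New CT?"
    bool trainFlag;      // "Train flag?"
    bool newDataSet;     // "New data-set?"
    bool mustRebalance;  // "Must rebalance?", decided from the retrieved lbt
};

// Assumed branch wiring; see the lead-in for what is and is not from the slide.
Action decide(const Request& r) {
    if (r.newCT)
        return r.trainFlag ? Action::PerformTraining
                           : Action::DeriveConfiguration;
    if (r.newDataSet)
        return Action::DeriveConfiguration;
    return r.mustRebalance ? Action::AdjustDistribution
                           : Action::ReuseConfiguration;
}
```

In every branch the chosen action would be followed by the monitored execution, the computation of lbt, and the persistence of the result, as listed in the flowchart.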
Distribution Adaptation
Derive an initial work distribution
  Interpolation from past executions: nearest-neighbor
Adjust work distribution
  When lbt(t) ≈ 1
  Two-level approach
    1. Transfer load from the worst performing computing unit type to the best performing
    2. Retrigger the process to find the best configuration for the current fission/overlap configuration
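Step 1 of the two-level approach can be illustrated with the sketch below. It assumes that f is the fraction of work assigned to the GPUs and that a fixed step is moved per adjustment; the slide states neither, and does not say how much load is transferred. `shiftLoad` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>

// Move a fixed step of the workload from the slower device type to the
// faster one. f is assumed to be the fraction assigned to the GPUs.
// cpuTime and gpuTime are the execution times observed for each type.
double shiftLoad(double f, double cpuTime, double gpuTime, double step) {
    assert(f >= 0.0 && f <= 1.0 && step > 0.0);
    if (cpuTime > gpuTime) {
        f += step;  // CPUs finished later: give the GPUs more work
    } else if (gpuTime > cpuTime) {
        f -= step;  // GPUs finished later: give the CPUs more work
    }
    return std::max(0.0, std::min(1.0, f));  // keep f a valid fraction
}
```

When both device types take the same time the distribution is left untouched, which corresponds to a balanced execution.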
Evaluation - Metrics
Speed-up relative to GPU-only executions
Efficiency of the work distribution strategy
Efficiency of the load balancing strategy
Evaluation - Case Studies and Test Platforms
Case Studies
  Image Filter Pipeline: 3-stage pipeline
  FFT (Fast Fourier Transform): 2-stage pipeline
  N-Body (direct-sum, O(N^2)): For loop
  Saxpy: Map
  Segmentation: Map
Test Platform
  CPU: Intel Core i7-3930K @ 3.20 GHz, 6 cores, 12 hardware threads, 6 L1 and L2 caches, 1 L3 cache
  GPUs: 2 AMD HD 7950 (2x PCIe bus)
Evaluation - Speedup
[Figure: speedup of 1 GPU + CPU vs 1 GPU (y-axis from 0.5 to 3). Series: 50/50 split; CPU assisted GPU execution. Benchmarks and input sizes: Image Pipeline (1024x1024, 2048x2048, 4096x4096), FFT (128 MB, 256 MB, 512 MB), NBody (16384, 32768, 65536), Saxpy (1M, 10M, 15M), Segmentation (1 MB, 8 MB, 60 MB)]
Evaluation - Speedup
[Figure: speedup of 2 GPUs + CPU vs 2 GPUs (y-axis from 0.5 to 3). Series: 50/50 split; CPU assisted GPU execution. Benchmarks and input sizes: Filter Pipeline (1024x1024, 2048x2048, 4096x4096), FFT (128 MB, 256 MB, 512 MB), NBody (16384, 32768, 65536), Saxpy (1M, 10M, 15M), Segmentation (1 MB, 8 MB, 60 MB)]
Evaluation – Config. Derivation
[Figure: fraction assigned to the GPUs (y-axis from 80 to 96) for Image 2 to Image 6. Series: with full training; derived configuration]
[Figure: execution time (logarithmic y-axis from 0.1 to 100) for Image 1 to Image 6. Series: with full training; derived configuration]
Evaluation – Load Balancing
[Figure: GPU percentage and CPU percentage (y-axis from 40% to 60%) over 21 points labeled L1 or L2, in three runs of six L1 followed by one L2]
Conclusions
We are able to support the execution of
  Nestable task-parallel skeletons in heterogeneous multi-CPU / multi-GPU environments
  With device-specific optimizations
    CPU: locality via Fission
    GPU: overlap of communication and computation
Transparent work distribution and load balancing in the presence of recurrent executions
The experimental results are promising
The program size is reduced by up to 5.7x (initialization/finalization code; 3.5x in total) for a simple map example (Saxpy)
Future Work
Regarding CPU + GPU
  Optimize configuration derivation
  Conjoin the use of profiling with performance models
Regarding Marrow
  Other types of accelerators
  Cluster of multi-CPU / multi-GPU nodes
  Generate code for kernels and orchestration from higher-level representations
  More skeletons
Questions?
Work Distribution – CPUs + GPUs: 50/50 Split
[Three figure-only slides]
CPU-only execution
[Figure: execution time (y-axis from 0 to 400). Series: with the best fission level; without fission. Benchmarks and input sizes: Image Pipeline (1024x1024, 2048x2048, 4096x4096, 8192x8192), Saxpy (1M, 10M, 50M), Segmentation (1 MB, 8 MB, 60 MB)]
Training: FFT, 256 MB
Execution time per fission level
  L1 cache: 60.7
  L2 cache: 58.1
  L3 cache: 82.2
  none: 197.9
Online Monitoring
[Figure: CPU and GPU execution time in a balanced and in an unbalanced execution]
Evaluation - Distribution Quality
Evaluation - Productivity: Lines of code
Saxpy: Z[i] = alpha * X[i] + Y[i]
            Initialization/Finalization   Orchestration   Total
OpenCL      104                           94              198
Marrow      18                            38              56
Reduction   5.7x                          2.5x            3.5x
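For reference, the Saxpy computation Z[i] = alpha * X[i] + Y[i] is a plain element-wise map. The sequential C++ version below is only a sketch of the computation itself. It is neither the OpenCL kernel nor the Marrow program whose lines of code are counted on the slide, and the function name `saxpy` is chosen here for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential reference: z[i] = alpha * x[i] + y[i] for every index i.
std::vector<float> saxpy(float alpha, const std::vector<float>& x,
                         const std::vector<float>& y) {
    assert(x.size() == y.size());
    std::vector<float> z(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        z[i] = alpha * x[i] + y[i];
    return z;
}
```

Each output element depends only on the inputs at the same index, so any contiguous partition of X and Y can be processed independently on a different device.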
Decomposing Marrow Computations - The Loop Skeleton
[Diagram, replicated for GPU #1 to GPU #N: evaluate the condition on the host; if true, upload/update the partition to the GPU, run the body, download the data to the host, and update the loop state; if false, leave the loop]
Programming Interface - New Features
Control over
  What may and may not be partitioned: PARTITIONABLE, COPY
  The elementary size of a partition
Merge functions
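The role of the elementary partition size can be illustrated with the sketch below, which splits a data-set so that every partition is a whole number of elementary units (for example, one unit per FFT so that no single transform is cut in half). This is an assumption about how the parameter is used, not Marrow's implementation, and `partitionSizes` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Split `total` elements into `parts` partitions whose sizes are all
// multiples of `elementary`. Units are spread as evenly as possible.
std::vector<std::size_t> partitionSizes(std::size_t total,
                                        std::size_t elementary,
                                        std::size_t parts) {
    assert(elementary > 0 && parts > 0);
    assert(total % elementary == 0);  // the data must hold whole units only
    const std::size_t units = total / elementary;
    std::vector<std::size_t> sizes;
    for (std::size_t i = 0; i < parts; ++i) {
        const std::size_t unitCount = units / parts + (i < units % parts ? 1 : 0);
        sizes.push_back(unitCount * elementary);
    }
    return sizes;
}
```

With 10 transforms of 512 points each and 4 devices, the partitions hold 3, 3, 2 and 2 transforms respectively.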
Programming Example - FFT Pipeline Revisited
[Diagram: Pipeline skeleton with stages FFT and iFFT]
shared_ptr<IWorkData> (new BufferData<cl_float2>());
shared_ptr<IWorkData> (new BufferData<cl_float2>(fftSize, IWorkData::PARTITIONABLE));
unique_ptr<Executable> FFT (new KernelWrapper(kernelFile, kernelFunction, inInfo, outInfo));
unique_ptr<Executable> pipeline (new Pipeline(FFT, iFFT));
Callout: Partition elementary size