Heterogeneous C++
Michael Wong, Codeplay Software
VP of Research and Development
Chair of SYCL Heterogeneous Programming Language
ISOCPP.org Director, VP http://isocpp.org/wiki/faq/wg21#michael-wong
Head of Delegation for C++ Standard for Canada
Chair of Programming Languages for Standards Council of Canada
Chair of WG21 SG5 Transactional Memory
Chair of WG21 SG14 Games Dev/Low Latency/Financial Trading/Embedded
Editor: C++ SG5 Transactional Memory Technical Specification
Editor: C++ SG1 Concurrency Technical Specification
http://wongmichael.com/about
ADC++ 2018 – May 2018
© 2018 Codeplay Software Ltd.
Acknowledgement Disclaimer
Numerous people internal and external to the original C++/Khronos group, in industry and academia, have made contributions, influenced ideas, written parts of this presentation, and offered feedback that forms part of this talk. I even lifted this acknowledgement and disclaimer from some of them. But I claim all credit for errors and stupid mistakes. These are mine, all mine!
Legal Disclaimer
This work represents the view of the author and does not necessarily represent the view of Codeplay.
Other company, product, and service names may be trademarks or service marks of others.
Codeplay - Connecting AI to Silicon

Products:
- C++ platform via the SYCL™ open standard, enabling vision & machine learning, e.g. TensorFlow™
- The heart of Codeplay's compute technology, enabling OpenCL™, SPIR™, HSA™ and Vulkan™

Addressable Markets:
- Automotive (ISO 26262), IoT, Smartphones & Tablets
- High Performance Compute (HPC), Medical & Industrial
- Technologies: Vision Processing, Machine Learning, Artificial Intelligence, Big Data Compute

Company:
- High-performance software solutions for custom heterogeneous systems
- Enabling the toughest processor systems with tools and middleware based on open standards
- Established 2002 in Scotland, ~70 employees
- Partners and Customers
Agenda
- A tale of two cities: HPC and Commercial
- Transforming from non-heterogeneous to Heterogeneous
- Heterogeneous C++ status
- Parallel STL on CPU and GPU today
- Vector Loop in C++ and OpenMP
- Collaborating across standards
- Backup: Executors and Affinity in C++
A tale of two cities
Will the two galaxies ever join/collide?
A tale of two/three cities
Programming GPU/Accelerators
- OpenGL
- DirectX
- CUDA
- OpenCL
- OpenMP
- OpenACC
- C++ AMP
- HPX
- HSA
- SYCL
- Vulkan
- Boost.Compute
- Halide
- Kokkos
- Raja
- UPC++
- HCC
- Charm++
Several key workshops:
- P3MA at ISC
- WACCPD at SC
- DHPCC++ (Distributed and Heterogeneous Programming in C/C++) at IWOCL, now in its 2nd year attached to IWOCL
- Repara at EuroPar
- HeteroPar at EuroPar

Covering the latest research in HPX, SYCL, Kokkos, Raja, …, the latest standardization efforts, and workloads in machine learning, vision, and safety-critical systems.
How can we compile source code for sub-architectures?
Separate source
Single source
Benefits of Single Source
• Device code is written in C++ in the same source file as the host CPU code
• Allows compile-time evaluation of device code
• Supports type safety across host CPU and device
• Supports generic programming
• Removes the need to distribute source code
Agenda
- A tale of two cities: HPC and Commercial
- Transforming from non-heterogeneous to Heterogeneous
- Heterogeneous C++ status
- Parallel STL on CPU and GPU today
- Vector Loop in C++ and OpenMP
- Collaborating across standards
- Backup: Executors and Affinity in C++
C++ Directions Group: P0939
What is C++?
C++ is a language for defining and using lightweight abstractions.
C++ supports building resource-constrained applications and software infrastructure.
C++ supports large-scale software development.
How do we want C++ to develop?
Improve support for large-scale dependable software
Improve support for high-level concurrency models
Simplify language use
Address major sources of dissatisfaction
Address major sources of error
C++ rests on two pillars
A direct map to hardware (initially from C)
Zero-overhead abstraction in production code (initially from Simula, where it wasn't zero-overhead)
4.3 Concrete Suggestions
- Pattern matching
- Exception and error returns
- Static reflection
- Modern networking
- Modern hardware: We need better support for modern hardware, such as executors/execution contexts and affinity support in C++ leading to heterogeneous/distributed computing support, SIMD/task blocks, more concurrent data structures, and improved atomics/memory model/lock-free data structure support. The challenge is to turn this (incomplete) laundry list into a coherent set of facilities and to introduce them in a manner that leaves each new standard with a coherent subset of our ideal.
- Simple graphics and interaction
- Anything from the Priorities for C++20 that didn't make C++20
Use the Proper Abstraction today with C++17

Abstraction | How is it supported
Cores | C++11/14/17 threads, async
HW threads | C++11/14/17 threads, async
Vectors | Parallelism TS2
Atomics, fences, lock-free, futures, counters, transactions | C++11/14/17 atomics, Concurrency TS1, Transactional Memory TS1
Parallel loops | async, TBB::parallel_invoke, Parallelism TS1, C++17 parallel algorithms
Heterogeneous offload, FPGA | OpenCL, SYCL, HSA, OpenMP/ACC, Kokkos, Raja
Distributed | HPX, MPI, UPC++
Caches | OpenMP affinity places; not supported in C++
NUMA | OpenMP affinity places; not supported in C++
TLS | C++11 TLS (thread_local)
Exception handling in a concurrent environment | Not supported
Problems of Transforming from a non-Heterogeneous to a Heterogeneous Language

The 3 data problems, even after you have exposed parallelism:
- Data Movement: the COST! But is it implicit or explicit?
- Data Layout: AoS vs SoA, coalesced memory accesses; how about data tiling?
- Data Addressing: partitioned addressing; how do you treat pointers; can you share arrays across CPUs/GPUs?

Other general issues:
- What is a thread? Is it a system thread, a GPU thread, an OpenMP/OpenCL thread, a thread of execution, an execution agent?
- Memory model: is it flat, abstract? Or scoped based on the warp, team, …
- Thread-local storage: do GPUs …
- Scheduling: dependencies
- More specific hardware kinds: FPGAs, DSPs, streaming, cloud, and Tensor Processing Units?
- Forward progress guarantees and weak execution agents
- Dynamic online/offline device …
- How to fill a GPU team, block, warp

More specific C++ issues:
- Separation of concerns: what, where, how, and when is it executed?
- Affinity: both CPU and memory affinity
- Exceptions and error model: how to have concurrent exceptions
- Templates: how do you support compile-time polymorphism, type erasure, AI
- Polymorphism: supporting everything on a GPU vs a CPU
Agenda
- A tale of two cities: HPC and Commercial
- Transforming from non-heterogeneous to Heterogeneous
- Heterogeneous C++ status
- Parallel STL on CPU and GPU today
- Vector Loop in C++ and OpenMP
- Collaborating across standards
- Backup: Executors and Affinity in C++
C++ Std Timeline/status: https://isocpp.org/std/status
Status after Mar JAX C++ Meeting

ISO/IEC TS 19841:2015, Transactional Memory TS
  Status: Published 2015-09-16 (ISO Store). Final draft: n4514 (2015-05-08)
  What is it? Composable lock-free programming that scales
  C++20? No. Already in the GCC 6 release and waiting for subsequent usage experience.

ISO/IEC TS 19217:2015, C++ Extensions for Concepts
  Status: Published 2015-11-13 (ISO Store). Final draft: n4553 (2015-10-02)
  What is it? Constrained templates
  C++20? Merged into C++20 without terse syntax. Already in the GCC 6 release and waiting for subsequent usage experience.
Status after Mar JAX C++ Meeting

ISO/IEC TS 19571:2016, C++ Extensions for Concurrency
  Status: Published 2016-01-19 (ISO Store). Final draft: p0159r0 (2015-10-22)
  What is it? Improvements to future, latches and barriers, atomic smart pointers
  C++20? Latches and atomic<shared_ptr<T>> headed into C++20. Already in the Visual Studio release and Anthony Williams' Just::Thread, and waiting for subsequent usage experience.

ISO/IEC TS 19568:2017, C++ Extensions for Library Fundamentals, Version 2
  Status: Published 2017-03-30 (ISO Store). Draft: n4617 (2016-11-28)
  What is it? Source code information capture and various utilities
  C++20? Published, with parts in C++17

ISO/IEC DTS 21425:2017, Ranges TS
  Status: Published 2017-11. Draft: n4651 (2017-03-15)
  What is it? Range-based algorithms and views
  C++20? Published

ISO/IEC DTS 19216:xxxx, Networking TS
  Status: PDTS. Draft: n4656 (2017-03-17)
  What is it? Sockets library based on Boost.ASIO
  C++20? Publish soon, but must come with executors

ISO/IEC DTS 21544:xxxx, Modules
  Status: Proposed Draft n4689 (2017-07-31) out for ballot
  What is it? A component system to supersede the textual header file inclusion model
  C++20? Voted for publication.
Status after Mar JAX C++ MeetingISO number Name Status What is it? C++20?
Numerics TS Early development. Draft p0101(2015-09-27)
Various numerical facilities,Including DSP rounding types Under active development
ISO/IEC DTS 19571:xxxx Concurrency TS 2 Early developmentExploring , lock-free, hazard pointers, RCU, atomic views,concurrent data structures
Under active development. Possible new clause
ISO/IEC DTS 19570:xxxx Parallelism TS 2 Early development. Draft n4578(2016-02-22)
Exploring task blocks, progress guarantees, SIMD<T> type, vec, no_vec loop based execution policy
PDTS ballot now on remaining part after pushing seq policy into C++20
ISO/IEC DTS 19841:xxxx Transactional Memory TS 2 Early developmentExploring on_commit, in_transaction. Lambda-based executor model.
Under active development.
Graphics TS Early development. Draft p0267r0 (2016-02-12)
2D drawing API using Cairo interface, adding stateless interface
LEWG reviewed but now became controversial
Library Fundamental V3 Initial draft, early development Maybe mdspan and expected<T> Under development
Status after Mar JAX C++ Meeting

ISO/IEC DTS 22277:2017, Coroutines TS
  Status: Published 2017-11. Draft: n4663 (2017-03-25)
  What is it? Resumable functions, based on Microsoft's await design
  C++20? Published

Reflection TS
  Status: Early development. Draft: p0194r2 (2016-10-15) with rationale in p0385r2 (2017-02-06). Alternative: p0590r0 (2017-02-05)
  What is it? Code introspection and (later) reification mechanisms
  C++20? Introspection proposal passed core language design review; next stop is design review of the library components. Targeting a Reflection TS.

Contracts TS
  Status: Unified proposal reviewed favourably.
  What is it? Preconditions, postconditions, etc.
  C++20? Proposal passed core language design review; next stop is design review of the library components. Targeting C++20.

Executors TS
  Status: Separated from the Concurrency TS; now has a unified proposal.
  What is it? Describes the how, where, and when of execution. Enables distributed and heterogeneous computing.
  C++20? Weekly calls; has consensus in SG1 and reviewed in LEWG and LWG. Now considering how to make it into C++20.

Heterogeneous Device TS
  Status: Affinity, execution context, exception handling in a concurrent environment, execution agent local storage
  What is it? Support heterogeneous devices
  C++20? Weekly calls. Affinity and EH progressing
Agenda
- A tale of two cities: HPC and Commercial
- Transforming from non-heterogeneous to Heterogeneous
- Heterogeneous C++ status
- Parallel STL on CPU and GPU today
- Vector Loop in C++ and OpenMP
- Collaborating across standards
- Backup: Executors and Affinity in C++
Parallelize and vectorize example
C++17 Parallel STL: Democratizing Parallelism in C++

What is Parallel STL?
Parallel STL greatly facilitates the use of parallelism in C++ by exposing a parallel interface for the STL algorithms.

Why do I care?
Hardware architectures are becoming increasingly parallel. You cannot escape. See Herb Sutter's "The Free Lunch Is Over", which is now over 10 years old! A more recent follow-up: "Welcome to the Jungle".

What does it include?
It adds wording for parallel execution to the C++ standard, and Execution Policies to the STL interface that enable selecting the appropriate level of parallelism. New parallel algorithms are also added to the interface.

What do I take from this talk?
You will understand what Parallel STL is and learn the basics of using it. You'll also be ready to use the SYCL Parallel STL on your accelerator.

C++ goes parallel!
Sorting with the STL
The normal sequential sort algorithm:

std::vector<int> data = { 8, 9, 1, 4 };
std::sort(std::begin(data), std::end(data));
if (std::is_sorted(std::begin(data), std::end(data))) {
  std::cout << "Data is sorted!" << std::endl;
}

An extra parameter to the STL algorithms enables parallelism:

std::vector<int> data = { 8, 9, 1, 4 };
std::sort(std::execution::par, std::begin(data), std::end(data));
if (std::is_sorted(std::begin(data), std::end(data))) {
  std::cout << "Data is sorted!" << std::endl;
}
The Execution Policy: standard policy classes

- Defined in the std::execution namespace
  - Sequenced policy
    - Never parallel; sequenced, in-order execution
    - constexpr sequenced_policy seq;
  - Parallel policy
    - Can use the caller thread but may spawn others (std::thread)
    - Invocations do not interleave on a single thread
    - constexpr parallel_policy par;
  - Parallel unsequenced policy
    - Can use the caller thread or others (e.g. std::thread)
    - Multiple invocations may be interleaved on a single thread
    - constexpr parallel_unsequenced_policy par_unseq;
Many different implementations exist, available today:
- Microsoft: http://parallelstl.codeplex.com
- HPX: http://stellar-group.github.io/hpx/docs/html/hpx/manual/parallel.html
- HSA: http://www.hsafoundation.com/hsa-for-math-science
- Thibaut Lutz: http://github.com/t-lutz/ParallelSTL
- NVIDIA: https://thrust.github.io/doc/group__execution__policies.html
- Codeplay: http://github.com/KhronosGroup/SyclParallelSTL
- Clang: not yet available

Expect major C++ compilers to implement it soon!
Using execution policies

using namespace std::execution;

// May execute in parallel
std::sort(par, std::begin(data), std::end(data));
// May be parallelized and vectorized
std::sort(par_unseq, std::begin(data), std::end(data));
// Will not be parallelized/vectorized
std::sort(seq, std::begin(data), std::end(data));
// Vendor-specific policy: read their documentation!
std::sort(custom_vendor_policy, std::begin(data), std::end(data));
Propagating the policy to the end user

template <typename Policy, typename Iterator>
void library_function(Policy p, Iterator begin, Iterator end) {
  std::sort(p, begin, end);
  std::for_each(p, begin, end,
                [](typename std::iterator_traits<Iterator>::value_type &e) { e++; });
  std::for_each(std::execution::seq, begin, end, non_parallel_operation);
}
Parallel overloads available
New algorithms in the STL: Parallel For Each

template <class ExecutionPolicy, class ForwardIterator, class Function>
void for_each(ExecutionPolicy&& exec,
              ForwardIterator first, ForwardIterator last, Function f);

template <class ExecutionPolicy, class ForwardIterator, class Size, class Function>
ForwardIterator for_each_n(ExecutionPolicy&& exec,
                           ForwardIterator first, Size n, Function f);

template <class InputIterator, class Size, class Function>
InputIterator for_each_n(InputIterator first, Size n, Function f);

- for_each: applies f to the elements in the range [first, last).
- for_each_n: applies f to the elements in [first, first + n).
New algorithms in the STL: numerical parallel algorithms

template <class InputIterator>
typename iterator_traits<InputIterator>::value_type
reduce(InputIterator first, InputIterator last);

template <class InputIterator, class T>
T reduce(InputIterator first, InputIterator last, T init);

template <class InputIterator, class T, class BinaryOperation>
T reduce(InputIterator first, InputIterator last, T init,
         BinaryOperation binary_op);

Implements a reduction operation (the order in which binary_op is applied is not specified). The sequential equivalent is accumulate.
New algorithms in the STL (serial reduction pattern)

[Diagram: a serial left-to-right reduction over elements 0-6, producing 21]

size_t nElems = 1000u;
std::vector<float> nums(nElems);
std::accumulate(std::begin(nums), std::end(nums), 0.0f);

Only one core is used for the different additions.
New algorithms in the STL (parallel reduction pattern)

[Diagram: a tree-shaped reduction over elements 0-6, producing 21]

size_t nElems = 1000u;
std::vector<float> nums(nElems);
std::reduce(std::execution::par,
            std::begin(nums), std::end(nums), 0.0f);

If the operation is commutative and associative, it can be run in parallel. The reduction uses all cores!
Transform

- transform (a.k.a. map) applies a function to an input range and stores the result in the output range. The operation may execute out of order.

std::transform(std::execution::par, v1.begin(), v1.end(), v2.begin(),
               output.begin(),
               [=](int val1, int val2) { return val1 + val2 + 1; });
Transform Reduce

- transform_reduce applies a function to each element of an input range and then applies a binary operation to reduce the transformed values.
Transform Reduce example

[Diagram: a multiplication is applied to each element of v, then the results are reduced by addition, starting from the neutral element of the reduction, giving 1275.4866…]
What can I do with a Parallel For Each?

Intel Core i7 7th generation:

size_t nElems = 10000u;
std::vector<float> nums(nElems);
std::fill_n(std::begin(nums), nElems, 1);
std::for_each(std::begin(nums), std::end(nums),
              [=](float f) { f * f + f; });

A traditional for_each uses only one core; the rest of the die is unutilized!
[Diagram: all 10000 elements on a single core]
What can I do with a Parallel For Each?

Intel Core i7 7th generation:

size_t nElems = 10000u;
std::vector<float> nums(nElems);
std::fill_n(std::execution::par,
            std::begin(nums), nElems, 1);
std::for_each(std::execution::par,
              std::begin(nums), std::end(nums),
              [=](float f) { f * f + f; });

The workload is distributed across the CPU cores! (mileage may vary; implementation-specific behaviour)
[Diagram: 10000 elements split as 4 x 2500 across four cores]
What can I do with a Parallel For Each?

Intel Core i7 7th generation:

size_t nElems = 10000u;
std::vector<float> nums(nElems);
std::fill_n(std::execution::par,
            std::begin(nums), nElems, 1);
std::for_each(std::execution::par,
              std::begin(nums), std::end(nums),
              [=](float f) { f * f + f; });

The workload is distributed across the CPU cores! (mileage may vary; implementation-specific behaviour)
[Diagram: 10000 elements split as 4 x 2500 across four cores. But what about the GPU part of the die?]
What can I do with a Parallel For Each?

Intel Core i7 7th generation:

size_t nElems = 10000u;
std::vector<float> nums(nElems);
std::fill_n(sycl_policy,
            std::begin(nums), nElems, 1);
std::for_each(sycl_named_policy<class KernelName>,
              std::begin(nums), std::end(nums),
              [=](float f) { f * f + f; });

The workload is distributed on the GPU cores. (mileage may vary; implementation-specific behaviour)
[Diagram: all 10000 elements on the GPU]
What can I do with a Parallel For Each? (Experimental!)

Intel Core i7 7th generation:

size_t nElems = 10000u;
std::vector<float> nums(nElems);
std::fill_n(sycl_heter_policy(cpu, gpu, 0.5),
            std::begin(nums), nElems, 1);
std::for_each(sycl_heter_policy<class kName>(cpu, gpu, 0.5),
              std::begin(nums), std::end(nums),
              [=](float f) { f * f + f; });

The workload is distributed across all cores! (mileage may vary; implementation-specific behaviour)
[Diagram: 5000 elements on the GPU, 4 x 1250 across the CPU cores]
Parallel overloads available in SYCL Parallel STL
SYCL for OpenCL
➢ Cross-platform, single-source, high-level C++ programming layer
➢ Built on top of OpenCL and based on standard C++14
Enabling Machine Learning Frameworks on OpenCL and Modern C++ 11, 14, 17, 20, 23, ...
The SYCL Ecosystem

[Diagram: a C++ Application sits on C++ Template Libraries, which sit on SYCL for OpenCL, which sits on OpenCL, targeting CPU, GPU, APU, FPGA, DSP, and other accelerators]
Example: Vector Add
Example: Vector Add

#include <CL/sycl.hpp>

template <typename T>
void parallel_add(std::vector<T> inA, std::vector<T> inB, std::vector<T> &out) {
}
Example: Vector Add

#include <CL/sycl.hpp>

template <typename T>
void parallel_add(std::vector<T> inA, std::vector<T> inB, std::vector<T> &out) {
  cl::sycl::buffer<T, 1> inputABuf(inA.data(), out.size());
  cl::sycl::buffer<T, 1> inputBBuf(inB.data(), out.size());
  cl::sycl::buffer<T, 1> outputBuf(out.data(), out.size());
}

// The buffers synchronise upon destruction
Example: Vector Add

#include <CL/sycl.hpp>

template <typename T>
void parallel_add(std::vector<T> inA, std::vector<T> inB, std::vector<T> &out) {
  cl::sycl::buffer<T, 1> inputABuf(inA.data(), out.size());
  cl::sycl::buffer<T, 1> inputBBuf(inB.data(), out.size());
  cl::sycl::buffer<T, 1> outputBuf(out.data(), out.size());
  cl::sycl::queue defaultQueue;
}
Example: Vector Add

#include <CL/sycl.hpp>

template <typename T>
void parallel_add(std::vector<T> inA, std::vector<T> inB, std::vector<T> &out) {
  cl::sycl::buffer<T, 1> inputABuf(inA.data(), out.size());
  cl::sycl::buffer<T, 1> inputBBuf(inB.data(), out.size());
  cl::sycl::buffer<T, 1> outputBuf(out.data(), out.size());
  cl::sycl::queue defaultQueue;
  defaultQueue.submit([&] (cl::sycl::handler &cgh) {
  });
}

// Create a command group to define an asynchronous task
Example: Vector Add

#include <CL/sycl.hpp>

template <typename T>
void parallel_add(std::vector<T> inA, std::vector<T> inB, std::vector<T> &out) {
  cl::sycl::buffer<T, 1> inputABuf(inA.data(), out.size());
  cl::sycl::buffer<T, 1> inputBBuf(inB.data(), out.size());
  cl::sycl::buffer<T, 1> outputBuf(out.data(), out.size());
  cl::sycl::queue defaultQueue;
  defaultQueue.submit([&] (cl::sycl::handler &cgh) {
    auto inputAPtr = inputABuf.template get_access<cl::sycl::access::mode::read>(cgh);
    auto inputBPtr = inputBBuf.template get_access<cl::sycl::access::mode::read>(cgh);
    auto outputPtr = outputBuf.template get_access<cl::sycl::access::mode::write>(cgh);
  });
}
Example: Vector Add

#include <CL/sycl.hpp>

template <typename T> class kernel;

template <typename T>
void parallel_add(std::vector<T> inA, std::vector<T> inB, std::vector<T> &out) {
  cl::sycl::buffer<T, 1> inputABuf(inA.data(), out.size());
  cl::sycl::buffer<T, 1> inputBBuf(inB.data(), out.size());
  cl::sycl::buffer<T, 1> outputBuf(out.data(), out.size());
  cl::sycl::queue defaultQueue;
  defaultQueue.submit([&] (cl::sycl::handler &cgh) {
    auto inputAPtr = inputABuf.template get_access<cl::sycl::access::mode::read>(cgh);
    auto inputBPtr = inputBBuf.template get_access<cl::sycl::access::mode::read>(cgh);
    auto outputPtr = outputBuf.template get_access<cl::sycl::access::mode::write>(cgh);
    cgh.parallel_for<kernel<T>>(cl::sycl::range<1>(out.size()),
        [=](cl::sycl::id<1> idx) {
    });
  });
}

// You must provide a name for the lambda
// Create a parallel_for to define the device code
Example: Vector Add

#include <CL/sycl.hpp>

template <typename T> class kernel;

template <typename T>
void parallel_add(std::vector<T> inA, std::vector<T> inB, std::vector<T> &out) {
  cl::sycl::buffer<T, 1> inputABuf(inA.data(), out.size());
  cl::sycl::buffer<T, 1> inputBBuf(inB.data(), out.size());
  cl::sycl::buffer<T, 1> outputBuf(out.data(), out.size());
  cl::sycl::queue defaultQueue;
  defaultQueue.submit([&] (cl::sycl::handler &cgh) {
    auto inputAPtr = inputABuf.template get_access<cl::sycl::access::mode::read>(cgh);
    auto inputBPtr = inputBBuf.template get_access<cl::sycl::access::mode::read>(cgh);
    auto outputPtr = outputBuf.template get_access<cl::sycl::access::mode::write>(cgh);
    cgh.parallel_for<kernel<T>>(cl::sycl::range<1>(out.size()),
        [=](cl::sycl::id<1> idx) {
          outputPtr[idx] = inputAPtr[idx] + inputBPtr[idx];
    });
  });
}
Example: Vector Add

template <typename T>
void parallel_add(std::vector<T> inA, std::vector<T> inB, std::vector<T> &out);

int main() {
  std::vector<float> inputA = { /* input a */ };
  std::vector<float> inputB = { /* input b */ };
  std::vector<float> output = { /* output */ };

  parallel_add(inputA, inputB, output);
  ...
}
Implementing Parallel STL with SYCL

/* sycl_execution_policy.
 * The sycl_execution_policy enables algorithms to be executed using
 * a SYCL implementation.
 */
template <class KernelName = DefaultKernelName>
class sycl_execution_policy {
  cl::sycl::queue m_q;

public:
  // The kernel name when using lambdas
  using kernelName = KernelName;

  sycl_execution_policy() = default;
  // Creates a SYCL policy using an existing queue
  sycl_execution_policy(cl::sycl::queue q) : m_q(q){};
  sycl_execution_policy(const sycl_execution_policy&) = default;

  // Returns the name of the kernel as a string
  // (typeid information is only valid for debugging)
  std::string get_name() const { return typeid(kernelName).name(); };

  // Returns the queue, if any
  cl::sycl::queue get_queue() const { return m_q; }
};
Implementing Parallel STL with SYCL

/* for_each */
template <class Iterator, class UnaryFunction>
void for_each(Iterator b, Iterator e, UnaryFunction f) {
  // The for_each member function on the policy forwards to the
  // implementation. Iterator can be any random-access iterator; functions
  // can take C++ iterators or SYCL-specific iterators.
  impl::for_each(*this, b, e, f);
}
template <class ExecutionPolicy, class Iterator, class UnaryFunction>
void for_each(ExecutionPolicy &sep, Iterator b, Iterator e, UnaryFunction op) {
  // Obtain the queue from the policy
  cl::sycl::queue q(sep.get_queue());
  // Obtain device parameters
  auto device = q.get_device();
  size_t localRange =
      device.get_info<cl::sycl::info::device::max_work_group_size>();
  // Prepare allocations on the device
  auto bufI = sycl::helpers::make_buffer(b, e);
  auto vectorSize = bufI.get_count();
  size_t globalRange = sep.calculateGlobalSize(vectorSize, localRange);
  // Continues...
  // Device lambda
  auto f = [vectorSize, localRange, globalRange, &bufI, op](
               cl::sycl::handler &h) mutable {
    cl::sycl::nd_range<1> r{
        cl::sycl::range<1>{std::max(globalRange, localRange)},
        cl::sycl::range<1>{localRange}};
    auto aI = bufI.template get_access<cl::sycl::access::mode::read_write>(h);
    h.parallel_for<typename ExecutionPolicy::kernelName>(
        r, [aI, op, vectorSize](cl::sycl::nd_item<1> id) {
          if (id.get_global(0) < vectorSize) {
            // User functor
            op(aI[id.get_global(0)]);
          }
        });
  };
  // Submit for execution on the device
  q.submit(f);
}
Demo Results - Running std::sort
(Running on Intel i7 6600 CPU & Intel HD Graphics 520)

size                   2^16       2^17       2^18       2^19
std::seq               0.27031s   0.620068s  0.669628s  1.48918s
std::par               0.259486s  0.478032s  0.444422s  1.83599s
std::unseq             0.24258s   0.413909s  0.456224s  1.01958s
sycl_execution_policy  0.273724s  0.269804s  0.277747s  0.399634s
Agenda
- A tale of two cities: HPC and Commercial
- Transforming from non-heterogeneous to Heterogeneous
- Heterogeneous C++ status
- Parallel STL on CPU and GPU today
- Vector Loop in C++ and OpenMP
- Collaborating across standards
- Backup: Executors and Affinity in C++
The parallel and concurrency planets of C++ today

[Diagram: separate "planets" for the SG1 Parallelism/Concurrency TSs, the SG5 Transactional Memory TS, SG14 Low Latency, and others]
C++1Y (1Y = 17/20/22) SG1/SG5/SG14 TS Plan
(red = C++17, blue = C++20?, black = future?)

Parallelism:
- Parallel Algorithms
- Data-based parallelism (Vector, SIMD, ...)
- Task-based parallelism (cilk, OpenMP, fork-join)
- Loop-based execution policy
- Execution Agents
- Progress guarantees
- MapReduce
- Pipelines

Concurrency:
- Future++ (then, wait_any, wait_all)
- Resumable Functions, await (with futures)
- Lock-free techniques/Transactions
- Synchronics
- Atomic Views
- Counters/Queues
- Concurrent Vector/Unordered Associative Containers
- Latches and Barriers
- upgrade_lock
- Atomic smart pointers
- Executors
- Coroutines
- Networking
Two "Possible" Universes

C++20 (barely possible):
- Executors Lite
- Networking

C++23:
- Executors
- Networking
- More algorithms/policies
- Futures
- Async
- Affinity
- EATLS
- EH in a concurrent environment
Use the Proper Abstraction with C++20

Abstraction | How is it supported
Cores | C++11/14/17 threads, async
HW threads | C++11/14/17 threads, async
Vectors | Parallelism TS2 -> C++20
Atomics, fences, lock-free, futures, counters, transactions | C++11/14/17 atomics, Concurrency TS1 -> C++20, Transactional Memory TS1
Parallel loops | async, TBB::parallel_invoke, Parallelism TS1, C++17 parallel algorithms
Heterogeneous offload, FPGA | OpenCL, SYCL, HSA, OpenMP/ACC, Kokkos, Raja; P0796 on affinity
Distributed | HPX, MPI, UPC++; P0796 on affinity
Caches | OpenMP affinity places (not supported in C++); Executors, Execution Context, Affinity, P0796; P0443 -> Executor TS or C++20 IS
NUMA | OpenMP affinity places (not supported in C++); Executors, Execution Context, Affinity, P0796; P0443 -> Executor TS or C++20 IS
TLS | EATLS, P0772
Exception handling in a concurrent environment | EH reduction properties, P0797
Agenda
- A tale of two cities: HPC and Commercial
- Transforming from non-heterogeneous to Heterogeneous
- Heterogeneous C++ status
- Parallel STL on CPU and GPU today
- Vector Loop in C++ and OpenMP
- Collaborating across standards
- Backup: Executors and Affinity in C++
Problems of Transforming from a non-Heterogeneous to a Heterogeneous Language

The 3 data problems, even after you have exposed parallelism:
- Data Movement: the COST! But is it implicit or explicit?
- Data Layout: AoS vs SoA, coalesced memory accesses; how about data tiling?
- Data Addressing: partitioned addressing; how do you treat pointers; can you share arrays across CPUs/GPUs?

Other general issues:
- What is a thread? Is it a system thread, a GPU thread, an OpenMP/OpenCL thread, a thread of execution, an execution agent?
- Memory model: is it flat, abstract? Or scoped based on the warp, team, …
- Thread-local storage: do GPUs convoy startups?
- Scheduling: dependencies
- More specific hardware kinds: FPGAs, DSPs, streaming, cloud, and Tensor Processing Units?
- Forward progress guarantees and weak execution agents
- Dynamic online/offline device …
- How to fill a GPU team, block, warp

More specific C++ issues:
- Separation of concerns: what, where, how, and when is it executed?
- Affinity: both CPU and memory affinity
- Exceptions and error model: how to have concurrent exceptions
- Templates: how do you support compile-time polymorphism, type erasure, AI
- Polymorphism: supporting everything on a GPU vs a CPU
Unified interface for execution

[Diagram: executors sit between high-level control structures (parallel algorithms, future::then, async, invoke, post, dispatch, defer, define_task_block, strand<>, asynchronous operations) and low-level execution resources (SYCL / OpenCL / CUDA / HCC, OpenMP / MPI, C++ thread pools, Boost.Asio / Networking TS)]
Separation of concerns with executors

[Diagram: an execution context connects an execution resource, an executor, and lightweight execution agents running an execution function over an instruction stream]

- An execution context is responsible for managing an execution resource
- An execution context provides an executor for executing work on its managed execution resource
- An execution context manages a number of lightweight execution agents
OpenMP without Pragmas

A worksharing loop:

#pragma omp for [clause[[,] clause] ... ] new-line
  for-loop

is bound to a parallel region that looks as follows:

#pragma omp parallel [clause[[,] clause] ...] new-line
  structured-block

and both constructs can be combined into:

#pragma omp parallel for [clause[[,] clause] ...] new-line
  for-loop

With attributes, the loop could instead look like:

for [[omp::for(clause, clause), … ]] (loop-head)
  loop-body

The enclosing parallel region would look like this:

using [[omp::parallel(clause, clause), … ]]
{ }
Learning from each other

C++ from OpenMP:
- Affinity & NUMA model
- SIMD model
- Learning from the shared code drop to clang
- Leverage the same OpenMP clang fat binary
- Build uniform tools for debugging and performance analysis

OpenMP from C++:
- Separation of concerns
- Attributes over pragmas
- Futures
- Thread of execution model
- Error model, better C++ support
- Consumer/producer model
- Academic membership model
- Local national chapters

OpenCL from OpenMP
OpenMP from OpenCL
Another possible future

In time, we will move accelerator code and data segments into object file layouts, such that OSes can load them directly onto various devices.
Thank you for listening

@codeplaysoft  codeplay.com  info@codeplay.com