oneAPI, SYCL and standard C++ - where do we need to go ...
Nevin “:-)” Liber
• Argonne National Laboratory
• Argonne Leadership Computing Facility (ALCF)
• Continue to do C++ standardization
• Kokkos backend for Aurora
• SYCL
• oneAPI
• DPC++
C++ Standardization
• 2007
• First BoostCon
• Meet Beman Dawes
• Founder of Boost
• Strong advocate for putting Stepanov’s STL into C++98
• Tells me about an upcoming meeting close to me
• In three years…
C++ Standardization
• 2010
• Local meeting at Fermilab
• Joined the committee
• Learn more about C++
• Represent users
• Give back to the C++ community
C++11
March 2011, Madrid
C++20
February 2020, Prague
February 2020 - Prague
• Volunteered to be Vice Chair, Library Evolution Working Group Incubator (LEWGI) / Study Group 18 (SG18)
• A bit of prep work before and after meeting
• Focus on LEWGI proposals
• Slight change
• Pandemic
C++ Committee
• Every member wants to make C++ a better language
• Even if no two of us can agree that I am right on what that is
“You can’t always get what you want, but if you try sometimes, well, you might find, you get what you need.”
–The Rolling Stones
C++ Committee
• Consensus-by-Committee
• Not Design-by-Committee
• We work on proposals
• It is all about tradeoffs
• Consensus of participants -> Consensus of countries
• Getting what you can live with
C++ Committee
• Not an Ivory Tower
• We all have day jobs
• It is all tradeoffs
• Which you might or might not agree with
• Unlikely we haven’t considered other (major) positions
C++ Standardization Limitations
• We have surprisingly little authority
• No authority over hardware, OSes, systems, etc.
• Understanding with implementers
Example: memset_explicit
• A memset that is “guaranteed” not to be optimized away
• What happens if the OS pages out this memory?
• What about other threads or cores?
• How does a guaranteed write fit in with observable behavior?
• At best: undefined, unspecified, or implementation-defined
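The scare quotes around “guaranteed” can be made concrete with a minimal sketch of the common volatile-write idiom. `secure_zero` and `scrub_and_check` are hypothetical names used only for illustration, not a standard or proposed API. In practice compilers do not elide stores made through a volatile pointer, but nothing here addresses copies the OS has paged out or values still held in registers or caches on other cores.

```cpp
#include <cstddef>

// Hypothetical helper (not a standard API): zero a buffer through a volatile
// pointer so the compiler does not treat the stores as removable dead writes.
inline void secure_zero(void* p, std::size_t n) {
    volatile unsigned char* bytes = static_cast<volatile unsigned char*>(p);
    for (std::size_t i = 0; i < n; ++i) bytes[i] = 0;
}

// Example use: fill a key buffer, scrub it, and report whether every byte
// is zero afterwards.
inline bool scrub_and_check() {
    unsigned char key[16];
    for (std::size_t i = 0; i < sizeof key; ++i)
        key[i] = static_cast<unsigned char>(i + 1);
    secure_zero(key, sizeof key);
    for (unsigned char b : key)
        if (b != 0) return false;
    return true;
}
```

The sketch only shows what the language itself can express; the open questions in the bullets above are exactly the parts it cannot.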
SYCL
• Committee much smaller than C++
• Group effort of really smart people from many different companies
• Standardization effort much newer
• Flesh out ideas for C++ Standardization
• SYCL 2020
• Growing beyond its OpenCL and 3D graphics roots
SYCL Limitations
• The code must be valid C++ code
• Even if we interpret it in strange ways
Unnamed Lambdas
• Weird but valid C++ syntax
• Forward declaration of a function local class
• SYCL 1.2.1
• Name every kernel
• Unique global name for toolchains with separate device compiler
cgh.parallel_for<class kernel_name>(range<1>{1024}, [=](id<1> idx) { writeResult[idx] = idx[0]; });
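The `<class kernel_name>` syntax can be tried in plain C++ without a SYCL toolchain. The sketch below uses a hypothetical `launch_named` function as a stand-in for `cgh.parallel_for`; it runs the body sequentially on the host and only illustrates the naming syntax, not device execution.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for cgh.parallel_for: KernelName is only a tag type.
// It is never defined, so it stays an incomplete class.
template <typename KernelName, typename F>
void launch_named(std::size_t n, F f) {
    for (std::size_t i = 0; i < n; ++i) f(i);
}

inline std::array<std::size_t, 8> fill_indices() {
    std::array<std::size_t, 8> out{};
    // `class kernel_name` is an elaborated type specifier: it forward-declares
    // a function-local class right inside the template argument list.
    launch_named<class kernel_name>(out.size(),
                                    [&](std::size_t i) { out[i] = i; });
    return out;
}
```

Calling `fill_indices()` yields an array whose element at position `i` holds `i`, mirroring `writeResult[idx] = idx[0]` in the slide.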
Intel, oneAPI & DPC++
• Implementer (hardware & software), interface, & implementation
• Initially tools for Aurora
• Flesh out ideas for SYCL
• Flesh out ideas for C++ Standardization
Unnamed Lambdas
• Initially Intel, now SYCL 2020
• No need to specify it
• Compiler will internally generate a unique name
• May want to specify it to help with debugging
cgh.parallel_for<class kernel_name>(range<1>{1024}, [=](id<1> idx) { writeResult[idx] = idx[0]; });
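The call-site shape of an optional kernel name can be approximated in plain C++ with a defaulted template parameter. This is only a sketch: `launch_any` and `auto_name` are hypothetical names, and a real SYCL implementation needs compiler support to generate a distinct name per lambda, which a fixed default tag does not do.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

struct auto_name;  // hypothetical placeholder tag meaning "no name supplied"

// Hypothetical stand-in for parallel_for with an optional kernel name.
template <typename KernelName = auto_name, typename F>
void launch_any(std::size_t n, F f) {
    for (std::size_t i = 0; i < n; ++i) f(i);
}

inline std::size_t sum_of_indices(std::size_t n) {
    std::vector<std::size_t> a(n), b(n);
    // Unnamed: the name parameter falls back to its default.
    launch_any(n, [&](std::size_t i) { a[i] = i; });
    // Named: still allowed, for example to make debugging output readable.
    launch_any<class debug_fill>(n, [&](std::size_t i) { b[i] = i; });
    return std::accumulate(a.begin(), a.end(), std::size_t{0}) +
           std::accumulate(b.begin(), b.end(), std::size_t{0});
}
```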
Major oneAPI contributions to SYCL
• Unified Shared Memory (USM)
• Fundamentally simpler programming model for a lot of cases
• Tradeoff
• Dependency graph has to be done explicitly
• As opposed to accessors
Major oneAPI contributions to SYCL
• Parallel Reductions
• Class Template Argument Deduction (CTAD)
• Adopting C++17 feature
• Makes it easier to write SYCL code
Kokkos
• Performance Portability EcoSystem
• Flesh out ideas for C++ Standardization
• atomic_ref
• C++20
• Interface adopted by SYCL 2020
Kokkos
• C++23 (hopefully) -> SYCL Next (hopefully)
• P0009 mdspan
• P1673 Basic Linear Algebra (BLAS)
• oneMKL (hopefully)
• P0443 Executors
• P2128 Multidimensional subscript operator
• mdspn(x,y) -> mdspn[x,y]
Short term (SYCL - Next)
• Continue to grow beyond three dimensions
• Why not N dimensions?
• C++ has had variadic templates since C++11
• Requires interface and implementation work
Range Constructor
    template <int dimensions = 1>
    struct range {
        /* The following constructor is only available in the range class
           specialization where: dimensions==1 */
        range(size_t dim0);
        /* The following constructor is only available in the range class
           specialization where: dimensions==2 */
        range(size_t dim0, size_t dim1);
        /* The following constructor is only available in the range class
           specialization where: dimensions==3 */
        range(size_t dim0, size_t dim1, size_t dim2);
        //...
    };

    // Deduction guides
    range(size_t) -> range<1>;
    range(size_t, size_t) -> range<2>;
    range(size_t, size_t, size_t) -> range<3>;
• We can be clever and keep this pattern going for N dimensions
• But it is generic code hostile
    template <int dimensions = 1>
    struct range {
        static_assert(0 < dimensions);

        template <typename... Us,
                  typename = std::enable_if_t<
                      sizeof...(Us) == dimensions &&
                      (std::is_convertible_v<Us, size_t> && ...)>>
        range(Us&&... us)
            : dims{static_cast<size_t>(std::forward<Us>(us))...} {}
        // ...
    };

    // Deduction guides
    template <typename... Us,
              typename = std::enable_if_t<
                  sizeof...(Us) &&
                  (std::is_convertible_v<Us, size_t> && ...)>>
    range(Us&&...) -> range<sizeof...(Us)>;
• Is this really the interface we want?
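To experiment with the variadic form, the sketch below fills in the elided parts so it compiles on its own as C++17. The `sketch` namespace, the `std::array` storage for `dims`, the `size()` member, and the `four_d_size` helper are assumptions added for illustration; they are not taken from the SYCL specification.

```cpp
#include <array>
#include <cstddef>
#include <type_traits>
#include <utility>

namespace sketch {  // hypothetical namespace, to avoid confusion with sycl::range

template <int dimensions = 1>
struct range {
    static_assert(0 < dimensions);

    // Accept exactly `dimensions` arguments, each convertible to size_t.
    template <typename... Us,
              typename = std::enable_if_t<
                  sizeof...(Us) == dimensions &&
                  (std::is_convertible_v<Us, std::size_t> && ...)>>
    range(Us&&... us)
        : dims{{static_cast<std::size_t>(std::forward<Us>(us))...}} {}

    // Assumed helper: total element count, the product of all extents.
    std::size_t size() const {
        std::size_t n = 1;
        for (std::size_t d : dims) n *= d;
        return n;
    }

    std::array<std::size_t, dimensions> dims;  // assumed storage
};

// Deduction guide: the argument count selects the dimensionality.
template <typename... Us,
          typename = std::enable_if_t<
              sizeof...(Us) &&
              (std::is_convertible_v<Us, std::size_t> && ...)>>
range(Us&&...) -> range<sizeof...(Us)>;

}  // namespace sketch

// CTAD deduces range<4> from four arguments.
static_assert(std::is_same_v<decltype(sketch::range{2, 3, 4, 5}),
                             sketch::range<4>>);

inline std::size_t four_d_size() { return sketch::range{2, 3, 4, 5}.size(); }
```

The amount of template machinery needed for what used to be three plain constructors is one way to read the question the slide asks.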
Better C++ Support
• Virtual functions and function pointers
• Why not just use variant?
• Virtual functions model 1 of an indefinite number of types
• std::variant models 0 (valueless_by_exception) or 1 of N known types
• Visitor needs a lot of non-obvious machinery
Virtual
    struct Base {
        virtual void Call() = 0;
        virtual ~Base() = default;
    };
    struct D1 : Base { void Call() override { /* ... */ } };
    struct D2 : Base { void Call() override { /* ... */ } };
    inline void CallIt(Base& b) { b.Call(); }
• Fairly straightforward
• Collection: vector<unique_ptr<Base>>
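A minimal host-only sketch of the collection case follows. As an assumption made purely so the result is observable, `Call()` returns a `std::string` here instead of `void`, and `CallAll` is a hypothetical helper.

```cpp
#include <memory>
#include <string>
#include <vector>

struct Base {
    virtual std::string Call() = 0;  // returns a label (assumption for the demo)
    virtual ~Base() = default;
};
struct D1 : Base { std::string Call() override { return "D1"; } };
struct D2 : Base { std::string Call() override { return "D2"; } };

// Hypothetical helper: store different derived types in one collection and
// dispatch through the base class interface.
inline std::string CallAll() {
    std::vector<std::unique_ptr<Base>> items;
    items.push_back(std::make_unique<D1>());
    items.push_back(std::make_unique<D2>());
    std::string out;
    for (auto& p : items) out += p->Call();
    return out;
}
```

New derived types can be added later without touching `CallAll`, which is the "indefinite number of types" property the virtual approach models.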
Variant
• Classes are simpler
• hand-written machinery
• Collection: vector<VariantD>
    using VariantD = std::variant<D1, D2>;

    struct VariantDVisitor {
        template <typename D>
        void operator()(D&& d) const { d.Call(); }
    };

    inline void CallIt(VariantD& d) {
        static const VariantDVisitor vis;
        std::visit(vis, d);
    }

    // Implicit conversion from D1 or D2
    inline void CallIt(VariantD&& d) {
        static const VariantDVisitor vis;
        std::visit(vis, d);
    }
• Inversion of control
• Pattern matching (C++23?) may help alleviate this
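The variant approach can be sketched end to end as follows. The simplified `D1` and `D2` (no base class, no virtual functions), the `std::string` return type, and the `CallAll` helper are assumptions for illustration; a generic lambda stands in for the hand-written visitor struct.

```cpp
#include <string>
#include <variant>
#include <vector>

// Assumed simplified classes: no base class, no virtual functions.
struct D1 { std::string Call() const { return "D1"; } };
struct D2 { std::string Call() const { return "D2"; } };

using VariantD = std::variant<D1, D2>;

inline std::string CallIt(const VariantD& d) {
    // std::visit invokes the generic lambda with whichever alternative is
    // active; the caller hands control to the visitor (inversion of control).
    return std::visit([](const auto& alt) { return alt.Call(); }, d);
}

// Hypothetical helper: a collection of variants, stored by value.
inline std::string CallAll() {
    std::vector<VariantD> items{D1{}, D2{}, D1{}};
    std::string out;
    for (const auto& v : items) out += CallIt(v);
    return out;
}
```

Unlike the virtual version, adding a third type means changing the `VariantD` alias, because the set of alternatives is closed.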
Template-land
    template <typename D>
    void CallIt(D& d) { d.Call(); }
• Errors generated at the call
• No collections
Virtual functions
• Why are they hard (from a language perspective)
• Code generated for CPU is different than code generated for GPU
• At different addresses
• May not be addressable by other device
• Yet C++ says one function, one address
• Hint at a bigger issue
Exceptions
• For general support, we have to solve virtual functions first
• Throw derived, catch as base class reference
    try { throw D1(); } catch (Base& b) { b.Call(); }
• Exceptions derived from std::exception
• virtual const char* what() const noexcept
• virtual destructor
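The dependence of exceptions on virtual dispatch shows up even in a tiny host-only example. `DeviceError` and `ThrowAndCatch` are hypothetical names used only for this sketch.

```cpp
#include <exception>
#include <string>

// Hypothetical exception type for illustration.
struct DeviceError : std::exception {
    const char* what() const noexcept override { return "device error"; }
};

inline std::string ThrowAndCatch() {
    std::string message;
    try {
        throw DeviceError{};  // throw the derived type
    } catch (const std::exception& e) {
        // Caught as the base class; the virtual call to what() still reaches
        // DeviceError::what().
        message = e.what();
    }
    return message;
}
```

Both the catch-by-base match and the call to `what()` rely on the same run-time type machinery that virtual functions need.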
Better C++ Support
• Virtual Inheritance
• Run-Time Type Information (RTTI)
C++ Trivially Copyable
• For almost a decade as a C++ Committee member, I did not know why trivially copyable is important
• I generally supported it because it is more flexible
• But I never pushed for it
• I suspect many in LEWG also do not know why trivially copyable is important
Copying Objects
• How do we copy objects in C++?
• Copy constructor / copy assignment operator
• Running code
• Code may access both source and destination
Copying Objects
• Can we do the same for inter-device copying?
• Non-trivial copy constructor / copy assignment operator
• Where would the code run?
• May not be legal to access both source and destination
• About all we can do is copy the bytes (object representation) that make up the object
C++ Trivially Copyable
• C++ trivially copyable types
• Used as a proxy for types where we can copy the bytes
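Copying the bytes can be sketched on the host with `std::memcpy`, guarded by the trait. `Particle`, `copy_bytes`, and `copied_id` are hypothetical names; a real inter-device transfer would go through the runtime rather than `memcpy`, so this only illustrates the language rule.

```cpp
#include <cstring>
#include <type_traits>

struct Particle {  // hypothetical example type
    float x, y, z;
    int id;
};

// Copy an object by its object representation. The static_assert restricts
// this to types for which the language says a byte copy is a valid copy.
template <typename T>
void copy_bytes(T& dst, const T& src) {
    static_assert(std::is_trivially_copyable_v<T>,
                  "byte-wise copy requires a trivially copyable type");
    std::memcpy(&dst, &src, sizeof(T));
}

inline int copied_id() {
    Particle a{1.0f, 2.0f, 3.0f, 42};
    Particle b{};
    copy_bytes(b, a);  // no copy constructor or assignment operator runs
    return b.id;
}
```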
C++ Trivially Copyable
• All base classes and non-static members are trivially copyable
• Has at least one public non-deleted copy/move ctor/assign
• If it has a copy/move ctor/assign, it must be public and defaulted
• Has a public defaulted destructor
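These rules can be checked directly with `std::is_trivially_copyable_v`. The four types below are hypothetical examples chosen to hit one rule each.

```cpp
#include <string>
#include <type_traits>

// All special member functions are implicit, all members are trivially copyable.
struct Plain { int a; double b; };

// Copy constructor explicitly defaulted on its first declaration: still trivial.
struct Defaulted {
    Defaulted(const Defaulted&) = default;
    int a;
};

// A user-provided destructor, even an empty one, breaks trivial copyability.
struct UserDtor {
    ~UserDtor() {}
    int a;
};

// A non-trivially copyable member makes the enclosing type non-trivially copyable.
struct HasString { std::string s; };

static_assert(std::is_trivially_copyable_v<Plain>);
static_assert(std::is_trivially_copyable_v<Defaulted>);
static_assert(!std::is_trivially_copyable_v<UserDtor>);
static_assert(!std::is_trivially_copyable_v<HasString>);
```

`UserDtor` is the interesting case: copying its bytes would be harmless, yet the trait says no.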
C++ Trivially Copyable
• Conflated into trivially copyable
• Bitwise copyable
• Layout
• Trivially copyable is too restrictive (not necessary)
• Not sufficient either
• Member functions can throw exceptions
C++ Trivially Copyable
• There are standard library types which are not necessarily trivially copyable for historical reasons
• pair, tuple (even when the types it contains are trivially copyable)
• And because layout is conflated, changing would be ABI break
• And some which are not yet guaranteed to be trivially copyable
• span, basic_string_view
• These are well on their way to C++23 due to paper P2251
C++ Trivially Copyable
• If a lambda captures a non trivially copyable type
• The lambda (which is just a struct) is not trivially copyable
• The lambda cannot be implicitly copied to the kernel
• Led to some interesting workarounds in Kokkos and RAJA
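The effect of a capture on the closure type can be observed with the same trait. The two helper functions are hypothetical. The standard leaves the trivial copyability of closure types to the implementation, so the pointer case is described in the comments as what mainstream compilers typically do rather than as a guarantee.

```cpp
#include <cstddef>
#include <type_traits>
#include <vector>

// Capturing a std::vector by value gives the closure a non-trivially copyable
// member, so the closure type itself is not trivially copyable.
inline bool vector_capture_is_trivially_copyable() {
    std::vector<int> v{1, 2, 3};
    auto kernel = [v](std::size_t i) { return v[i]; };
    return std::is_trivially_copyable_v<decltype(kernel)>;
}

// Capturing a raw pointer instead keeps the closure's only member trivially
// copyable; mainstream compilers typically report such a closure as
// trivially copyable.
inline bool pointer_capture_is_trivially_copyable() {
    std::vector<int> v{1, 2, 3};
    const int* p = v.data();
    auto kernel = [p](std::size_t i) { return p[i]; };
    return std::is_trivially_copyable_v<decltype(kernel)>;
}
```

Replacing an owning capture with a non-owning view or pointer is one shape such workarounds take, at the cost of managing the lifetime by hand.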
C++ Trivially Copyable
• __SYCL_DEVICE_ONLY__ macro to make something trivially copyable on the device
• __SYCL_DEVICE_ONLY__ is defined to 1 if the source file is being compiled with a SYCL device compiler which does not produce host binary
• This can violate the C++ One Definition Rule (ODR) [basic.def.odr]
• No translation unit shall contain more than one definition of any variable, function, class type, enumeration type, template, default argument for a parameter (for a function in a given scope), or default template argument […]
C++ Trivially Copyable
    struct A {
    #ifndef __SYCL_DEVICE_ONLY__
        ~A() {}
    #endif
    };
• This is a static_assert that only fires on the host
    static_assert(std::is_trivially_copyable_v<A>);
• Worse, what if it is used as a template parameter?
    template <bool B> void C() { /* ... */ }
    C<std::is_trivially_copyable_v<A>>();
• What does it mean to run a destructor on the host but not on a device?
C++ Trivially Copyable
• Manually copy the bytes to the device
• Violates C++ object model (lifetime of objects)
• Copying the bytes does not magically bring non-trivially copyable or non-implicit lifetime types into existence
• Undefined behavior
• May work today, but can easily break tomorrow
SYCL 2020 - Device Copyable
• Types where bitwise copy for inter-device copying has correct semantics
• Unspecified whether or not copy/move ctor/assign is called to do the inter-device copying
• Unspecified whether or not the destructor is called on the device
• Since it must effectively have no effect on the device
• User specializable trait to indicate a type is device copyable
• Specialize at your own risk
SYCL 2020 - Device Copyable
• sycl::is_device_copyable
• Defaults to std::is_trivially_copyable
• Specialized for array, pair, tuple, optional, variant
• When they contain all device copyable types
• array, optional, variant already trivially copyable when they contain all trivially copyable types
• Recursive definition: need to extend it to all device copyable types
• Specialized for span, basic_string_view
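The shape of such a trait can be sketched in standard C++ without a SYCL toolchain. Everything in the `sketch` namespace is a hypothetical stand-in for `sycl::is_device_copyable`, and `Handle` is an invented user type; only the `std::pair` specialization is shown, as one example of the recursive definition.

```cpp
#include <string>
#include <type_traits>
#include <utility>

namespace sketch {  // hypothetical stand-in, not the real sycl:: trait

// Default: fall back to the C++ notion of trivially copyable.
template <typename T>
struct is_device_copyable : std::is_trivially_copyable<T> {};

// Recursive specialization: a pair is device copyable when both members are.
template <typename A, typename B>
struct is_device_copyable<std::pair<A, B>>
    : std::bool_constant<is_device_copyable<A>::value &&
                         is_device_copyable<B>::value> {};

template <typename T>
inline constexpr bool is_device_copyable_v = is_device_copyable<T>::value;

}  // namespace sketch

// Hypothetical user type: the empty user-provided destructor makes it
// non-trivially copyable even though copying its bytes is harmless.
struct Handle {
    int id;
    ~Handle() {}
};

// The user opts in, at their own risk.
namespace sketch {
template <>
struct is_device_copyable<Handle> : std::true_type {};
}  // namespace sketch

static_assert(!std::is_trivially_copyable_v<Handle>);
static_assert(sketch::is_device_copyable_v<Handle>);
static_assert(sketch::is_device_copyable_v<std::pair<int, Handle>>);
static_assert(!sketch::is_device_copyable_v<std::pair<int, std::string>>);
```

The `std::pair<int, Handle>` case shows why the recursion has to run over device copyable rather than trivially copyable: the opt-in for `Handle` must propagate through the wrapper.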
SYCL 2020 - Device Copyable
• Limitations
• Trivially copyable recursively works if all the types it aggregates are trivially copyable
• Device copyable manually specified
• C++ Reflection (C++26?)
• Require compiler support
• Another hint at a bigger issue
One Definition Rule
• Informally, there are two exceptions to the One Definition Rule
• NDEBUG and assert
• std::is_constant_evaluated()
• Looks like a runtime check, but is actually a compile time check
• Buggy when written as if constexpr (std::is_constant_evaluated()) { /* … */ }
• Always true
• Allows different definition in constexpr context
if target
• NVIDIA GTC21
• if target
• Similar to if constexpr, allows a different definition for devices
• is_device, is_host, specific device types, properties, etc.
• Language change
• Not applicable to SYCL
• Yet another hint at a bigger issue
What is the bigger issue?
• C++ has a model for multiple cores and threads on the same computing unit
What is the bigger issue?
• C++ doesn’t have a model for heterogeneous computing
• C++ doesn’t have a model for multiple processes on the same compute unit, and this is at least an order of magnitude harder
• Optimistically, this would take over a decade to add to C++
• Someone has to propose it and spend years guiding it
• IMO, this is what oneAPI and SYCL should flesh out in the long term
Heterogeneous Computing
• Lots of open technical questions
• Is USM part of this model?
• Can most vendors implement this efficiently?
• How does object transfer work?
• How do we allow multiple definitions?
Heterogeneous Computing
• Summary
• C++ has a long way to go
• Standardize things with wide applicability and longevity
• oneAPI, SYCL, Kokkos, RAJA help explore this design space
• In a practical way that we can use now and in the foreseeable future
Resources & References
• N4885 - Working Draft, Standard for Programming Language C++
• https://wg21.link/N4885
• SYCL 2020 Specification
• https://www.khronos.org/registry/SYCL/specs/sycl-2020/pdf/sycl-2020.pdf
• Data Parallel C++ (Reinders, Ashbaugh, Brodman, Kinsner, Pennycook, Tian)
• https://link.springer.com/book/10.1007/978-1-4842-5574-2
• Kokkos
• https://github.com/kokkos
• RAJA
• https://github.com/LLNL/RAJA
• Inside NVC++ and NVFORTRAN (Bryce Adelstein Lelbach) [if target]
• https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s31358
Q&A
This presentation was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering, and early testbed platforms, in support of the nation’s exascale computing imperative. Additionally, this presentation used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.