oneAPI, SYCL and standard C++ - where do we need to go ...

oneAPI, SYCL & Standard C++ Where do we go from here?

Nevin “:-)” Liber nliber@anl.gov

Nevin “:-)” Liber

• Argonne National Laboratory

• Argonne Leadership Computing Facility (ALCF)

• Continue to do C++ standardization

• Kokkos backend for Aurora

• SYCL

• oneAPI

• DPC++

C++ Standardization

• 2007

• First BoostCon

• Meet Beman Dawes

• Founder of Boost

• Strong advocate for putting Stepanov’s STL into C++98

• Tells me about an upcoming meeting close to me

• In three years…

C++ Standardization

• 2010

• Local meeting at Fermilab

• Joined the committee

• Learn more about C++

• Represent users

• Give back to the C++ community

C++11, March 2011, Madrid

C++20, February 2020, Prague

February 2020 - Prague

• Volunteered to be Vice Chair, Library Evolution Working Group Incubator (LEWGI) / Study Group 18 (SG18)

• A bit of prep work before and after meeting

• Focus on LEWGI proposals

• Slight change

• Pandemic

C++ Committee

• Every member wants to make C++ a better language

• Even if no two of us can agree that I am right on what that is

–The Rolling Stones

“You can’t always get what you want, but if you try sometimes, well, you might find, you get what you need.”

C++ Committee

• Consensus-by-Committee

• Not Design-by-Committee

• We work on proposals

• It is all about tradeoffs

• Consensus of participants -> Consensus of countries

• Getting what you can live with

C++ Committee

• Not an Ivory Tower

• We all have day jobs

• It is all tradeoffs

• Which you might or might not agree with

• Unlikely we haven’t considered other (major) positions

C++ Standardization Limitations

• We have surprisingly little authority

• No authority over hardware, OSes, systems, etc.

• Understanding with implementers

Example: memset_explicit

• A memset that is “guaranteed” not to be optimized away

• What happens if the OS pages out this memory?

• What about other threads or cores?

• How does a guaranteed write fit in with observable behavior?

• At best: undefined, unspecified, or implementation-defined
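To make the questions above concrete, here is a minimal sketch of the usual hand-rolled approach. It assumes a hypothetical helper named secure_zero (this is not the proposed memset_explicit) and relies only on the fact that accesses through volatile glvalues count as observable behavior. It says nothing about paging, other cores, or copies the hardware may have made, which is exactly the limitation the slide points at.

```cpp
#include <cstddef>

// Hypothetical helper, NOT the proposed memset_explicit.
// Writing through a volatile pointer makes each store part of the program's
// observable behavior, so the compiler may not drop the stores as dead.
// It gives no guarantee about the OS, caches, or other devices.
inline void secure_zero(void* p, std::size_t n) {
    volatile unsigned char* bytes = static_cast<volatile unsigned char*>(p);
    for (std::size_t i = 0; i < n; ++i) {
        bytes[i] = 0;
    }
}
```

Calling secure_zero(buffer, sizeof buffer) just before a buffer goes out of scope is the typical use.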

SYCL

• Committee much smaller than C++

• Group effort of really smart people from many different companies

• Standardization effort much newer

• Flesh out ideas for C++ Standardization

• SYCL 2020

• Growing beyond its OpenCL and 3D graphics roots

SYCL Limitations

• The code must be valid C++ code

• Even if we interpret it in strange ways

Unnamed Lambdas

• Weird but valid C++ syntax

• Forward declaration of a function local class

• SYCL 1.2.1

• Name every kernel

• Unique global name for toolchains with separate device compiler

    cgh.parallel_for<class kernel_name>(range<1>{1024}, [=](id<1> idx) {
      writeResult[idx] = idx[0];
    });
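The "weird but valid" syntax can be tried without a SYCL toolchain. The sketch below uses a hypothetical stand-in, parallel_for_sketch, that runs a plain loop on the host; only the naming trick is the point. The names parallel_for_sketch and fill_indices are assumptions made for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-only stand-in for a SYCL-style parallel_for.
// KernelName is used purely as a compile-time tag, so an incomplete
// (declared but never defined) class type is sufficient.
template <typename KernelName, typename F>
void parallel_for_sketch(std::size_t n, F f) {
    for (std::size_t i = 0; i < n; ++i) {
        f(i);
    }
}

inline std::vector<std::size_t> fill_indices(std::size_t n) {
    std::vector<std::size_t> out(n);
    // `class kernel_name` is an elaborated type specifier: it forward
    // declares a function-local class right inside the template argument.
    parallel_for_sketch<class kernel_name>(
        n, [&](std::size_t i) { out[i] = i; });
    return out;
}
```

A toolchain with a separate device compiler can use such a tag type to give host and device the same name for the kernel.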

Intel, oneAPI & DPC++

• Implementer (hardware & software), interface, & implementation

• Initially tools for Aurora

• Flesh out ideas for SYCL

Unnamed Lambdas

• Initially Intel, now SYCL 2020

• No need to specify it

• Compiler will internally generate a unique name

• May want to specify it to help with debugging

    cgh.parallel_for<class kernel_name>(range<1>{1024}, [=](id<1> idx) {
      writeResult[idx] = idx[0];
    });

Major oneAPI contributions to SYCL

• Unified Shared Memory (USM)

• Fundamentally simpler programming model for a lot of cases

• Tradeoff

• Dependency graph has to be done explicitly

• As opposed to accessors

Major oneAPI contributions to SYCL

• Parallel Reductions

• Class Template Argument Deduction (CTAD)

• Adopting C++17 feature

• Makes it easier to write SYCL code

Kokkos

• Performance Portability EcoSystem

• atomic_ref

• C++20

• Interface adopted by SYCL 2020

Kokkos

• C++23 (hopefully) -> SYCL Next (hopefully)

• P0009 mdspan

• P1673 Basic Linear Algebra (BLAS)

• oneMKL (hopefully)

• P0443 Executors

• P2128 Multidimensional subscript operator

• mdspn(x, y) -> mdspn[x, y]

Short term (SYCL - Next)

• Continue to grow beyond three dimensions

• Why not N dimensions?

• C++ has had variadic templates since C++11

• Requires interface and implementation work

Range Constructor

    template <int dimensions = 1>
    struct range {
      /* The following constructor is only available in the range class
         specialization where: dimensions==1 */
      range(size_t dim0);
      /* The following constructor is only available in the range class
         specialization where: dimensions==2 */
      range(size_t dim0, size_t dim1);
      /* The following constructor is only available in the range class
         specialization where: dimensions==3 */
      range(size_t dim0, size_t dim1, size_t dim2);

      //...
    };

    // Deduction guides
    range(size_t) -> range<1>;
    range(size_t, size_t) -> range<2>;
    range(size_t, size_t, size_t) -> range<3>;

• We can be clever and keep this pattern going for N dimensions

• But it is generic code hostile


    template <int dimensions = 1>
    struct range {
      static_assert(0 < dimensions);

      template <typename... Us,
                typename = std::enable_if_t<
                    sizeof...(Us) == dimensions &&
                    (std::is_convertible_v<Us, size_t> && ...)>>
      range(Us&&... us)
          : dims{static_cast<size_t>(std::forward<Us>(us))...} {}
      // ...
    };

    // Deduction guides
    template <typename... Us,
              typename = std::enable_if_t<
                  sizeof...(Us) &&
                  (std::is_convertible_v<Us, size_t> && ...)>>
    range(Us&&...) -> range<sizeof...(Us)>;


• Is this really the interface we want?
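To weigh that question, it helps to have something that compiles. The following is a sketch of the variadic version in plain C++17, placed in a hypothetical namespace sketch so it cannot be mistaken for sycl::range. The dims member, the get accessor, and the explicit casts to int are assumptions added here to make it self-contained.

```cpp
#include <cstddef>
#include <type_traits>
#include <utility>

namespace sketch {  // hypothetical namespace, not part of SYCL

template <int dimensions = 1>
struct range {
    static_assert(0 < dimensions);

    // Accepts exactly `dimensions` arguments, each convertible to size_t.
    template <typename... Us,
              typename = std::enable_if_t<
                  static_cast<int>(sizeof...(Us)) == dimensions &&
                  (std::is_convertible_v<Us, std::size_t> && ...)>>
    range(Us&&... us)
        : dims{static_cast<std::size_t>(std::forward<Us>(us))...} {}

    // Assumed accessor, added so the stored extents can be inspected.
    std::size_t get(int i) const { return dims[i]; }

private:
    std::size_t dims[dimensions];
};

// Deduction guide: the number of arguments becomes the dimension count.
template <typename... Us,
          typename = std::enable_if_t<
              (sizeof...(Us) > 0) &&
              (std::is_convertible_v<Us, std::size_t> && ...)>>
range(Us&&...) -> range<static_cast<int>(sizeof...(Us))>;

}  // namespace sketch
```

With this, sketch::range r(2, 3, 4) deduces sketch::range<3>, and a fourth argument would give range<4> with no further interface changes. The cost is a constrained forwarding constructor, which is harder to read and to document than three plain overloads.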

Better C++ Support

• Virtual functions and function pointers

• Why not just use variant?

• Virtual functions model 1 of an indefinite number of types

• std::variant models 0 (valueless_by_exception) or 1 of N known types

• Visitor needs a lot of non-obvious machinery

Virtual

    struct Base {
      virtual void Call() = 0;
      virtual ~Base() = default;
    };

    struct D1 : Base {
      void Call() override { /* ... */ }
    };

    inline void CallIt(Base& b) { b.Call(); }

• Fairly straightforward

• Collection: vector<unique_ptr<Base>>

Variant

    struct Base {
      virtual void Call() = 0;
      virtual ~Base() = default;
    };

• Classes are simpler

• hand-written machinery

• Collection: vector<VariantD>

    using VariantD = std::variant<D1, D2>;

    struct VariantDVisitor {
      template <typename D>
      void operator()(D&& d) const { d.Call(); }
    };

    inline void CallIt(VariantD& d) {
      static const VariantDVisitor vis;
      std::visit(vis, d);
    }

    // Implicit conversion from D1 or D2
    inline void CallIt(VariantD&& d) {
      static const VariantDVisitor vis;
      std::visit(vis, d);
    }

• Inversion of control

• Pattern matching (C++23?) may help alleviate this
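The two approaches can be put side by side in one small program. This is a sketch rather than the slide code: Call returns an int here (an assumption, so the result is observable), the virtual classes are named V1 and V2 to keep them apart from the variant alternatives D1 and D2, and a generic lambda stands in for the hand-written visitor.

```cpp
#include <memory>
#include <variant>
#include <vector>

// Virtual version: one of an open-ended set of types behind a base class.
struct Base {
    virtual int Call() const = 0;
    virtual ~Base() = default;
};
struct V1 : Base { int Call() const override { return 1; } };
struct V2 : Base { int Call() const override { return 2; } };

inline int SumVirtual(const std::vector<std::unique_ptr<Base>>& items) {
    int sum = 0;
    for (const auto& p : items) {
        sum += p->Call();  // dispatch through the vtable
    }
    return sum;
}

// Variant version: one of N known, unrelated types; no base class needed.
struct D1 { int Call() const { return 1; } };
struct D2 { int Call() const { return 2; } };
using VariantD = std::variant<D1, D2>;

inline int SumVariant(const std::vector<VariantD>& items) {
    int sum = 0;
    for (const auto& v : items) {
        // std::visit hands the active alternative to the callable.
        sum += std::visit([](const auto& d) { return d.Call(); }, v);
    }
    return sum;
}
```

Adding a third type means one new derived class in the virtual version, but a change to the VariantD alias (and a recompile of everything that uses it) in the variant version.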

Template-land

    struct Base {
      virtual void Call() = 0;
      virtual ~Base() = default;
    };

    template <typename D>
    void CallIt(D& d) { d.Call(); }

• Errors generated at the call


• No collections

Virtual functions

• Why are they hard (from a language perspective)?

• Code generated for CPU is different than code generated for GPU

• At different addresses

• May not be addressable by other device

• Yet C++ says one function, one address

• Hint at a bigger issue

Exceptions

• For general support, we have to solve virtual functions first

• Throw derived, catch as base class reference

    try { throw D1(); } catch (Base& b) { b.Call(); }

• Exceptions derived from std::exception

• virtual const char* what() const noexcept

• virtual destructor
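A small host-only sketch shows how much of the exception machinery leans on virtual dispatch. DeviceError and CatchAsBase are hypothetical names chosen for the illustration.

```cpp
#include <exception>
#include <string>

// Hypothetical exception type derived from std::exception.
struct DeviceError : std::exception {
    const char* what() const noexcept override { return "device error"; }
};

inline std::string CatchAsBase() {
    try {
        throw DeviceError();
    } catch (const std::exception& e) {
        // Caught by base class reference: what() is a virtual call that
        // resolves to DeviceError::what().
        return e.what();
    }
}
```

Both the catch-by-base match and the what() call depend on run-time type information or a vtable, which is why exceptions cannot be supported in general before virtual functions are.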

Better C++ Support

• Virtual Inheritance

• Run-Time Type Information (RTTI)

C++ Trivially Copyable

• For almost a decade as a C++ Committee member, I did not know why trivially copyable is important

• I generally supported it because it is more flexible

• But I never pushed for it

• I suspect many in LEWG also do not know why trivially copyable is important

Copying Objects

• How do we copy objects in C++?

• Copy constructor / copy assignment operator

• Running code

• Code may access both source and destination

Copying Objects

• Can we do the same for inter-device copying?

• Non-trivial copy constructor / copy assignment operator

• Where would the code run?

• May not be legal to access both source and destination

• About all we can do is copy the bytes (object representation) that make up the object

• C++ trivially copyable types

• Used as a proxy for types where we can copy the bytes

• All base classes and non-static members are trivially copyable

• Has at least one public non-deleted copy/move ctor/assign

• If it has a copy/move ctor/assign, it must be public and defaulted

• Has a public defaulted destructor
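The trait can be checked directly. In this sketch, Point, Named, and CopyBytes are hypothetical names; the byte copy is the only kind of copy that could plausibly cross a device boundary.

```cpp
#include <cstring>
#include <string>
#include <type_traits>

struct Point { int x; int y; };      // all members trivially copyable
struct Named { std::string name; };  // std::string has a non-trivial copy

static_assert(std::is_trivially_copyable_v<Point>);
static_assert(!std::is_trivially_copyable_v<Named>);

// Copies the object representation only; no constructor code runs.
// This is well-defined because Point is trivially copyable.
inline Point CopyBytes(const Point& src) {
    Point dst;
    std::memcpy(&dst, &src, sizeof(Point));
    return dst;
}
```

Doing the same memcpy on Named would copy a pointer to host memory rather than the characters, which is the failure mode the trait is meant to rule out.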

C++ Trivially Copyable

• Conflated into trivially copyable

• Bitwise copyable

• Layout

• Trivially copyable is too restrictive (not necessary)

• Not sufficient either

• Member functions can throw exceptions

C++ Trivially Copyable

• There are standard library types which are not necessarily trivially copyable for historical reasons

• pair, tuple (even when the types it contains are trivially copyable)

• And because layout is conflated, changing would be ABI break

• And some which are not yet guaranteed to be trivially copyable

• span, basic_string_view

• These are well on their way to C++23 due to paper P2251

• If a lambda captures a non-trivially copyable type

• The lambda (which is just a struct) is not trivially copyable

• The lambda cannot be implicitly copied to the kernel

• Led to some interesting workarounds in Kokkos and RAJA
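The lambda case is easy to observe on the host. The sketch below uses a hypothetical function name, StringCaptureIsTriviallyCopyable, and deliberately makes no claim about std::pair or std::tuple, since the answer for those depends on the standard library implementation.

```cpp
#include <array>
#include <string>
#include <type_traits>

// std::array of a trivially copyable type is itself trivially copyable.
static_assert(std::is_trivially_copyable_v<std::array<int, 3>>);

// Reports whether a closure that captures a std::string by value is
// trivially copyable. The closure holds a std::string member, so copying
// it has to run std::string's copy constructor.
inline bool StringCaptureIsTriviallyCopyable() {
    std::string label = "kernel";
    auto f = [label] { return label.size(); };
    (void)f;  // only the closure type is inspected
    return std::is_trivially_copyable_v<decltype(f)>;
}
```

Printing std::is_trivially_copyable_v<std::pair<int, int>> on your own toolchain is a quick way to see which side of "not necessarily" it falls on.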

C++ Trivially Copyable

• __SYCL_DEVICE_ONLY__ macro to make something trivially copyable on the device

• __SYCL_DEVICE_ONLY__ is defined to 1 if the source file is being compiled with a SYCL device compiler which does not produce host binary

• This can violate the C++ One Definition Rule (ODR) [basic.def.odr]

• No translation unit shall contain more than one definition of any variable, function, class type, enumeration type, template, default argument for a parameter (for a function in a given scope), or default template argument […]

C++ Trivially Copyable

    struct A {
    #ifndef __SYCL_DEVICE_ONLY__
      ~A() {}
    #endif
    };

• This is a static_assert that only fires on the host

    static_assert(std::is_trivially_copyable_v<A>);

• Worse, what if it is used as a template parameter?

    template <bool B> void C() { /* ... */ }

    C<std::is_trivially_copyable_v<A>>();

• What does it mean to run a destructor on the host but not on a device?
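A single translation unit cannot see both versions of A at once, so the sketch below fakes it with two hypothetical structs, AHostView and ADeviceView, standing in for what the host and device compilers would each see after preprocessing. C returns an int here (an assumption) so the difference is visible.

```cpp
#include <type_traits>

// Stand-ins for the two preprocessed forms of the same struct A.
struct AHostView { ~AHostView() {} };  // user-provided destructor remains
struct ADeviceView {};                 // destructor preprocessed away

template <bool B>
constexpr int C() { return B ? 1 : 0; }

// The "same" call instantiates different specializations of C depending
// on which compiler pass is looking at the type.
inline int HostInstantiation() {
    return C<std::is_trivially_copyable_v<AHostView>>();
}
inline int DeviceInstantiation() {
    return C<std::is_trivially_copyable_v<ADeviceView>>();
}
```

In real code both views carry the one name A, so the host would call C<false> and the device C<true> from what looks like a single line of source.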

C++ Trivially Copyable

• Manually copy the bytes to the device

• Violates C++ object model (lifetime of objects)

• Copying the bytes does not magically bring non-trivially copyable or non-implicit lifetime types into existence

• Undefined behavior

• May work today, but can easily break tomorrow

SYCL 2020 - Device Copyable

• Types where bitwise copy for inter-device copying has correct semantics

• Unspecified whether or not copy/move ctor/assign is called to do the inter-device copying

• Unspecified whether or not the destructor is called on the device

• Since it must effectively have no effect on the device

• User specializable trait to indicate a type is device copyable

• Specialize at your own risk

SYCL 2020 - Device Copyable

• sycl::is_device_copyable

• Defaults to std::is_trivially_copyable

• Specialized for array, pair, tuple, optional, variant

• When they contain all device copyable types

• array, optional, variant already trivially copyable when they contain all trivially copyable types

• Recursive definition: need to extend it to all device copyable types

• Specialized for span, basic_string_view
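The shape of such a trait can be sketched in plain C++17. What follows is a mock in a hypothetical namespace sketch, not the sycl::is_device_copyable from a SYCL implementation: it shows only the default, one recursive specialization for std::pair, and a user opt-in for a hypothetical type Handle.

```cpp
#include <string>
#include <type_traits>
#include <utility>

namespace sketch {  // hypothetical mock, not the real SYCL trait

// Default: device copyable exactly when trivially copyable.
template <typename T>
struct is_device_copyable : std::is_trivially_copyable<T> {};

// Recursive case: a pair is device copyable when both members are,
// even if std::pair itself is not trivially copyable.
template <typename T1, typename T2>
struct is_device_copyable<std::pair<T1, T2>>
    : std::bool_constant<is_device_copyable<T1>::value &&
                         is_device_copyable<T2>::value> {};

template <typename T>
inline constexpr bool is_device_copyable_v = is_device_copyable<T>::value;

}  // namespace sketch

// Hypothetical user type with a user-provided (empty) destructor, which
// makes it non-trivially copyable although a byte copy would be fine.
struct Handle {
    int id;
    ~Handle() {}
};

// User opt-in: specialize at your own risk.
namespace sketch {
template <>
struct is_device_copyable<Handle> : std::true_type {};
}  // namespace sketch
```

Because the pair specialization asks the trait about its members rather than asking std::is_trivially_copyable, the user opt-in for Handle carries through to std::pair<Handle, int>.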

SYCL 2020 - Device Copyable

• Limitations

• Trivially copyable recursively works if all the types it aggregates are trivially copyable

• Device copyable manually specified

• C++ Reflection (C++26?)

• Require compiler support

• Another hint at a bigger issue

One Definition Rule

• Informally, there are two exceptions to the One Definition Rule

• NDEBUG and assert

• std::is_constant_evaluated()

• Looks like a runtime check, but is actually a compile time check

• Buggy when called via if constexpr (std::is_constant_evaluated()) { /* … */ }

• Always true

• Allows different definition in constexpr context

if target

• NVIDIA GTC21

• if target

• Similar to if constexpr, allows a different definition for devices

• is_device, is_host, specific device types, properties, etc.

• Language change

• Not applicable to SYCL

• Yet another hint at a bigger issue

What is the bigger issue?

• C++ has a model for multiple cores and threads on the same computing unit

• C++ doesn’t have a model for heterogeneous computing

• C++ doesn’t have a model for multiple processes on the same compute unit, and this is at least an order of magnitude harder

• Optimistically, this would take over a decade to add to C++

• Someone has to propose it and spend years guiding it

• IMO, this is what oneAPI and SYCL should flesh out in the long term

Heterogeneous Computing

• Lots of open technical questions

• Is USM part of this model?

• Can most vendors implement this efficiently?

• How does object transfer work?

• How do we allow multiple definitions?

Heterogeneous Computing

• Summary

• C++ has a long way to go

• Standardize things with wide applicability and longevity

• oneAPI, SYCL, Kokkos, RAJA help explore this design space

• In a practical way that we can use now and in the foreseeable future

Resources & References

• N4885 - Working Draft, Standard for Programming Language C++

• https://wg21.link/N4885

• SYCL 2020 Specification

• https://www.khronos.org/registry/SYCL/specs/sycl-2020/pdf/sycl-2020.pdf

• Data Parallel C++ (Reinders, Ashbaugh, Brodman, Kinsner, Pennycook, Tian)

• https://link.springer.com/book/10.1007/978-1-4842-5574-2

• Kokkos

• https://github.com/kokkos

• RAJA

• https://github.com/LLNL/RAJA

• Inside NVC++ and NVFORTRAN (Bryce Adelstein Lelbach) [if target]

• https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s31358

This presentation was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering, and early testbed platforms, in support of the nation’s exascale computing imperative. Additionally, this presentation used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.
