oneAPI, SYCL and standard C++ - where do we need to go ...


Page 1: oneAPI, SYCL and standard C++ - where do we need to go ...

oneAPI, SYCL & Standard C++ Where do we go from here?

Nevin “:-)” Liber [email protected]

1

Page 2: oneAPI, SYCL and standard C++ - where do we need to go ...

2

Nevin “:-)” Liber

• Argonne National Laboratory

• Argonne Leadership Computing Facility (ALCF)

• Continue to do C++ standardization

• Kokkos backend for Aurora

• SYCL

• oneAPI

• DPC++

2

Page 3: oneAPI, SYCL and standard C++ - where do we need to go ...

C++ Standardization

• 2007

• First BoostCon

• Meet Beman Dawes

• Founder of Boost

• Strong advocate for putting Stepanov’s STL into C++98

• Tells me about an upcoming meeting close to me

• In three years…

3

Page 4: oneAPI, SYCL and standard C++ - where do we need to go ...

C++ Standardization

• 2010

• Local meeting at Fermilab

• Joined the committee

• Learn more about C++

• Represent users

• Give back to the C++ community

4

Page 5: oneAPI, SYCL and standard C++ - where do we need to go ...

C++11

March 2011

Madrid

5

Page 7: oneAPI, SYCL and standard C++ - where do we need to go ...

C++20

February 2020

Prague

7

Page 10: oneAPI, SYCL and standard C++ - where do we need to go ...

February 2020 - Prague

• Volunteered to be Vice Chair, Library Evolution Working Group Incubator (LEWGI) / Study Group 18 (SG18)

• A bit of prep work before and after meeting

• Focus on LEWGI proposals

• Slight change

• Pandemic

10

Page 11: oneAPI, SYCL and standard C++ - where do we need to go ...

C++ Committee

• Every member wants to make C++ a better language

• Even if no two of us can agree that I am right on what that is

11

Page 12: oneAPI, SYCL and standard C++ - where do we need to go ...

“You can’t always get what you want, but if you try sometimes, well, you might find, you get what you need.”

–The Rolling Stones

12

Page 13: oneAPI, SYCL and standard C++ - where do we need to go ...

C++ Committee

• Consensus-by-Committee

• Not Design-by-Committee

• We work on proposals

• It is all about tradeoffs

• Consensus of participants -> Consensus of countries

• Getting what you can live with

13

Page 14: oneAPI, SYCL and standard C++ - where do we need to go ...

C++ Committee

• Not an Ivory Tower

• We all have day jobs

• It is all tradeoffs

• Which you might or might not agree with

• Unlikely we haven’t considered other (major) positions

14

Page 15: oneAPI, SYCL and standard C++ - where do we need to go ...

C++ Standardization Limitations

• We have surprisingly little authority

• No authority over hardware, OSes, systems, etc.

• Understanding with implementers

15

Page 16: oneAPI, SYCL and standard C++ - where do we need to go ...

Example: memset_explicit

• A memset that is “guaranteed” not to be optimized away

• What happens if the OS pages out this memory?

• What about other threads or cores?

• How does a guaranteed write fit in with observable behavior?

• At best: undefined, unspecified, or implementation-defined
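The first bullet can be illustrated with a common workaround, sketched here under assumptions: `secure_zero` and `all_zero` are hypothetical helper names, not the standardized `memset_explicit`. Writing through a volatile pointer is treated by mainstream compilers as something they may not drop, but it does nothing about paging, other threads, or other cores, which is the point of the remaining bullets.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (NOT the standard memset_explicit): each store goes
// through a volatile lvalue, so compilers do not remove it as a dead store.
inline void secure_zero(void* p, std::size_t n) {
    volatile unsigned char* bytes = static_cast<volatile unsigned char*>(p);
    for (std::size_t i = 0; i < n; ++i) {
        bytes[i] = 0;
    }
}

// Hypothetical checker: returns true if every byte in [p, p + n) is zero.
inline bool all_zero(const unsigned char* p, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        if (p[i] != 0) return false;
    }
    return true;
}
```

A caller would fill a key buffer, use it, and call `secure_zero(key, sizeof key)` just before the buffer goes out of scope. Whether a copy of the key still lives in a swap file or a register is outside what the language can promise.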

16

Page 17: oneAPI, SYCL and standard C++ - where do we need to go ...

SYCL

• Committee much smaller than C++

• Group effort of really smart people from many different companies

• Standardization effort much newer

• Flesh out ideas for C++ Standardization

• SYCL 2020

• Growing beyond its OpenCL and 3D graphics roots

17

Page 18: oneAPI, SYCL and standard C++ - where do we need to go ...

SYCL Limitations

• The code must be valid C++ code

• Even if we interpret it in strange ways

18

Page 19: oneAPI, SYCL and standard C++ - where do we need to go ...

19

Unnamed Lambdas

    cgh.parallel_for<class kernel_name>(range<1>{1024}, [=](id<1> idx) {
        writeResult[idx] = idx[0];
    });

19

Page 20: oneAPI, SYCL and standard C++ - where do we need to go ...

20

Unnamed Lambdas

• Weird but valid C++ syntax

• Forward declaration of a function local class

• SYCL 1.2.1

• Name every kernel

• Unique global name for toolchains with separate device compiler

    cgh.parallel_for<class kernel_name>(range<1>{1024}, [=](id<1> idx) {
        writeResult[idx] = idx[0];
    });
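The "weird but valid" syntax can be tried without a SYCL toolchain. The sketch below is a host-only mock: this `parallel_for` and `fill_indices` are hypothetical stand-ins, not the SYCL API, and the lambda captures by reference only because the mock runs in place on the host.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-only stand-in for a SYCL handler's parallel_for.
// KernelName is only a compile-time tag, so it never needs a definition.
template <typename KernelName, typename F>
void parallel_for(std::size_t n, F f) {
    for (std::size_t i = 0; i < n; ++i) {
        f(i);
    }
}

inline std::vector<std::size_t> fill_indices(std::size_t n) {
    std::vector<std::size_t> result(n);
    // `class kernel_name` is an elaborated type specifier: it forward-declares
    // a function-local class right inside the template argument list.
    parallel_for<class kernel_name>(n, [&](std::size_t idx) { result[idx] = idx; });
    return result;
}
```

Calling `fill_indices(4)` yields a vector holding 0 through 3. The tag type is never completed anywhere, which is why a toolchain with a separate device compiler needed it to be a unique name.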

20

Page 21: oneAPI, SYCL and standard C++ - where do we need to go ...

Intel, oneAPI & DPC++

• Implementer (hardware & software), interface, & implementation

• Initially tools for Aurora

• Flesh out ideas for SYCL

• Flesh out ideas for C++ Standardization

21

Page 23: oneAPI, SYCL and standard C++ - where do we need to go ...

23

Unnamed Lambdas

• Initially Intel, now SYCL 2020

• No need to specify it

• Compiler will internally generate a unique name

• May want to specify it to help with debugging

    cgh.parallel_for<class kernel_name>(range<1>{1024}, [=](id<1> idx) {
        writeResult[idx] = idx[0];
    });

23

Page 24: oneAPI, SYCL and standard C++ - where do we need to go ...

Major oneAPI contributions to SYCL

• Unified Shared Memory (USM)

• Fundamentally simpler programming model for a lot of cases

• Tradeoff

• Dependency graph has to be done explicitly

• As opposed to accessors

24

Page 25: oneAPI, SYCL and standard C++ - where do we need to go ...

Major oneAPI contributions to SYCL

• Parallel Reductions

• Class Template Argument Deduction (CTAD)

• Adopting C++17 feature

• Makes it easier to write SYCL code

25

Page 26: oneAPI, SYCL and standard C++ - where do we need to go ...

Kokkos

• Performance Portability EcoSystem

• Flesh out ideas for C++ Standardization

• atomic_ref

• C++20

• Interface adopted by SYCL 2020

26

Page 27: oneAPI, SYCL and standard C++ - where do we need to go ...

Kokkos

• C++23 (hopefully) -> SYCL Next (hopefully)

• P0009 mdspan

• P1673 Basic Linear Algebra (BLAS)

• oneMKL (hopefully)

• P0443 Executors

• P2128 Multidimensional subscript operator

• mdspn(x,y) -> mdspn[x,y]

27

Page 28: oneAPI, SYCL and standard C++ - where do we need to go ...

Short term (SYCL - Next)

• Continue to grow beyond three dimensions

• Why not N dimensions?

• C++ has had variadic templates since C++11

• Requires interface and implementation work

28

Page 29: oneAPI, SYCL and standard C++ - where do we need to go ...

Range Constructor

    template <int dimensions = 1>
    struct range {
        /* The following constructor is only available in the range class
           specialization where: dimensions==1 */
        range(size_t dim0);
        /* The following constructor is only available in the range class
           specialization where: dimensions==2 */
        range(size_t dim0, size_t dim1);
        /* The following constructor is only available in the range class
           specialization where: dimensions==3 */
        range(size_t dim0, size_t dim1, size_t dim2);

        //...
    };

    // Deduction guides
    range(size_t) -> range<1>;
    range(size_t, size_t) -> range<2>;
    range(size_t, size_t, size_t) -> range<3>;

• We can be clever and keep this pattern going for N dimensions

• But it is generic code hostile

29

Page 30: oneAPI, SYCL and standard C++ - where do we need to go ...

Range Constructor

    template <int dimensions = 1>
    struct range {
        /* The following constructor is only available in the range class
           specialization where: dimensions==1 */
        range(size_t dim0);
        /* The following constructor is only available in the range class
           specialization where: dimensions==2 */
        range(size_t dim0, size_t dim1);
        /* The following constructor is only available in the range class
           specialization where: dimensions==3 */
        range(size_t dim0, size_t dim1, size_t dim2);

        //...
    };

    // Deduction guides
    range(size_t) -> range<1>;
    range(size_t, size_t) -> range<2>;
    range(size_t, size_t, size_t) -> range<3>;

    template <int dimensions = 1>
    struct range {
        static_assert(0 < dimensions);

        template <typename... Us,
                  typename = std::enable_if_t<
                      sizeof...(Us) == dimensions &&
                      (std::is_convertible_v<Us, size_t> && ...)>>
        range(Us&&... us) : dims{static_cast<size_t>(std::forward<Us>(us))...} {}
        // ...
    };

    // Deduction guides
    template <typename... Us,
              typename = std::enable_if_t<
                  sizeof...(Us) &&
                  (std::is_convertible_v<Us, size_t> && ...)>>
    range(Us&&...) -> range<sizeof...(Us)>;

30

Page 31: oneAPI, SYCL and standard C++ - where do we need to go ...

Range Constructor

    template <int dimensions = 1>
    struct range {
        static_assert(0 < dimensions);

        template <typename... Us,
                  typename = std::enable_if_t<
                      sizeof...(Us) == dimensions &&
                      (std::is_convertible_v<Us, size_t> && ...)>>
        range(Us&&... us) : dims{static_cast<size_t>(std::forward<Us>(us))...} {}
        // ...
    };

    // Deduction guides
    template <typename... Us,
              typename = std::enable_if_t<
                  sizeof...(Us) &&
                  (std::is_convertible_v<Us, size_t> && ...)>>
    range(Us&&...) -> range<sizeof...(Us)>;

• Is this really the interface we want?
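To weigh that question, it helps to see the variadic version filled in. The following is a self-contained sketch, assuming a hypothetical `demo` namespace, a `std::array` member, and `operator[]` and `size()` helpers that are not on the slide. It is not `sycl::range`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <utility>

namespace demo {  // hypothetical namespace; this is not sycl::range

template <int dimensions = 1>
struct range {
    static_assert(0 < dimensions);

    // Accepts exactly `dimensions` arguments, each convertible to size_t.
    template <typename... Us,
              typename = std::enable_if_t<
                  sizeof...(Us) == static_cast<std::size_t>(dimensions) &&
                  (std::is_convertible_v<Us, std::size_t> && ...)>>
    range(Us&&... us) : dims{static_cast<std::size_t>(std::forward<Us>(us))...} {}

    std::size_t operator[](std::size_t i) const { return dims[i]; }

    // Total number of elements: the product of all extents.
    std::size_t size() const {
        std::size_t total = 1;
        for (std::size_t d : dims) total *= d;
        return total;
    }

    std::array<std::size_t, dimensions> dims;
};

// Deduction guide: the number of arguments selects the dimensionality.
template <typename... Us,
          typename = std::enable_if_t<
              (sizeof...(Us) > 0) &&
              (std::is_convertible_v<Us, std::size_t> && ...)>>
range(Us&&...) -> range<sizeof...(Us)>;

}  // namespace demo
```

With this sketch, `demo::range r(4, 5, 6);` deduces `demo::range<3>`, and `demo::range<2>` cannot be constructed from a single `int`. The cost is that the constraints live in `enable_if_t` noise rather than in readable signatures, which is part of what the slide is asking about.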

31

Page 32: oneAPI, SYCL and standard C++ - where do we need to go ...

Better C++ Support

• Virtual functions and function pointers

• Why not just use variant?

• Virtual functions model 1 of an indefinite number of types

• std::variant models 0 (valueless_by_exception) or 1 of N known types

• Visitor needs a lot of non-obvious machinery

32

Page 33: oneAPI, SYCL and standard C++ - where do we need to go ...

Virtual

    struct Base {
        virtual void Call() = 0;
        virtual ~Base() = default;
    };

    struct D1 : Base {
        void Call() override { /* ... */ }
    };

    struct D2 : Base {
        void Call() override { /* ... */ }
    };

    inline void CallIt(Base& b) { b.Call(); }

• Fairly straightforward

• Collection: vector<unique_ptr<Base>>
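A minimal host-side sketch of the collection bullet follows. It assumes `Call()` returns a `std::string` so the dispatch is observable (the slide's version returns `void`), and the `virt_demo` namespace and `CallAll` helper are hypothetical.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

namespace virt_demo {  // hypothetical namespace for this sketch

struct Base {
    virtual std::string Call() = 0;
    virtual ~Base() = default;
};

struct D1 : Base {
    std::string Call() override { return "D1"; }
};

struct D2 : Base {
    std::string Call() override { return "D2"; }
};

// Calls every element through the base interface and concatenates the results.
inline std::string CallAll(const std::vector<std::unique_ptr<Base>>& items) {
    std::string out;
    for (const auto& p : items) {
        out += p->Call();  // resolved at run time through the vtable
    }
    return out;
}

}  // namespace virt_demo
```

The collection does not need to know the set of derived types in advance, which is the "1 of an indefinite number of types" property mentioned earlier.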

33

Page 34: oneAPI, SYCL and standard C++ - where do we need to go ...

Variant

    struct Base {
        virtual void Call() = 0;
        virtual ~Base() = default;
    };

    struct D1 : Base {
        void Call() override { /* ... */ }
    };

    struct D2 : Base {
        void Call() override { /* ... */ }
    };

    inline void CallIt(Base& b) { b.Call(); }

• Classes are simpler

• hand-written machinery

• Collection: vector<VariantD>

34

    using VariantD = std::variant<D1, D2>;

    struct VariantDVisitor {
        template <typename D>
        void operator()(D&& d) const { d.Call(); }
    };

    inline void CallIt(VariantD& d) {
        static const VariantDVisitor vis;
        std::visit(vis, d);
    }

    // Implicit conversion from D1 or D2
    inline void CallIt(VariantD&& d) {
        static const VariantDVisitor vis;
        std::visit(vis, d);
    }

• Inversion of control

• Pattern matching (C++23?) may help alleviate this
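For comparison, here is a compact sketch of the variant approach without a base class. It assumes `Call()` returns a `std::string` so the result is observable and uses a generic lambda in place of the hand-written visitor struct; the `variant_demo` namespace and `CallAll` helper are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <variant>
#include <vector>

namespace variant_demo {  // hypothetical namespace for this sketch

// No base class and no virtual functions: the set of types is closed.
struct D1 {
    std::string Call() const { return "D1"; }
};

struct D2 {
    std::string Call() const { return "D2"; }
};

using VariantD = std::variant<D1, D2>;

// std::visit picks the overload matching the alternative currently held.
inline std::string CallIt(const VariantD& d) {
    return std::visit([](const auto& alt) { return alt.Call(); }, d);
}

inline std::string CallAll(const std::vector<VariantD>& items) {
    std::string out;
    for (const auto& item : items) {
        out += CallIt(item);
    }
    return out;
}

}  // namespace variant_demo
```

`CallIt(variant_demo::D2{})` works through the implicit conversion to `VariantD`. Adding a third type means editing the `VariantD` alias, which is the "N known types" restriction.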

Page 35: oneAPI, SYCL and standard C++ - where do we need to go ...

Template-land

    struct Base {
        virtual void Call() = 0;
        virtual ~Base() = default;
    };

    struct D1 : Base {
        void Call() override { /* ... */ }
    };

    struct D2 : Base {
        void Call() override { /* ... */ }
    };

    inline void CallIt(Base& b) { b.Call(); }

    template <typename D>
    void CallIt(D& d) { d.Call(); }

• Errors generated at the call

35

    using VariantD = std::variant<D1, D2>;

    struct VariantDVisitor {
        template <typename D>
        void operator()(D&& d) const { d.Call(); }
    };

    inline void CallIt(VariantD& d) {
        static const VariantDVisitor vis;
        std::visit(vis, d);
    }

    // Implicit conversion
    inline void CallIt(VariantD&& d) {
        static const VariantDVisitor vis;
        std::visit(vis, d);
    }

• No collections

Page 36: oneAPI, SYCL and standard C++ - where do we need to go ...

Virtual functions

• Why are they hard (from a language perspective)?

• Code generated for CPU is different than code generated for GPU

• At different addresses

• May not be addressable by other device

• Yet C++ says one function, one address

• Hint at a bigger issue

36

Page 37: oneAPI, SYCL and standard C++ - where do we need to go ...

Exceptions

• For general support, we have to solve virtual functions first

• Throw derived, catch as base class reference

    try { throw D1(); } catch (Base& b) { b.Call(); }

• Exceptions derived from std::exception

• virtual const char* what() const noexcept

• virtual destructor

37
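The dependency on virtual functions shows up even in the smallest host-side example. In this sketch, `DeviceError` and `ThrowAndCatch` are hypothetical names; the catch clause only sees `std::exception`, so the message is reached through the virtual `what()`.

```cpp
#include <cassert>
#include <exception>
#include <string>

namespace exc_demo {  // hypothetical namespace for this sketch

struct DeviceError : std::exception {
    const char* what() const noexcept override { return "DeviceError"; }
};

// Throws the derived type and catches it by reference to the base class.
inline std::string ThrowAndCatch() {
    try {
        throw DeviceError();
    } catch (const std::exception& e) {
        return e.what();  // virtual dispatch to DeviceError::what()
    }
}

}  // namespace exc_demo
```

`ThrowAndCatch()` returns "DeviceError" rather than the base class's message, so any device that supports this pattern must already support virtual dispatch.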

Page 38: oneAPI, SYCL and standard C++ - where do we need to go ...

Better C++ Support

• Virtual Inheritance

• Run-Time Type Information (RTTI)

38

Page 39: oneAPI, SYCL and standard C++ - where do we need to go ...

C++ Trivially Copyable

• For almost a decade as a C++ Committee member, I did not know why trivially copyable is important

• I generally supported it because it is more flexible

• But I never pushed for it

• I suspect many in LEWG also do not know why trivially copyable is important

39

Page 40: oneAPI, SYCL and standard C++ - where do we need to go ...

40

Copying Objects

• How do we copy objects in C++?

• Copy constructor / copy assignment operator

• Running code

• Code may access both source and destination

40

Page 41: oneAPI, SYCL and standard C++ - where do we need to go ...

41

Copying Objects

• Can we do the same for inter-device copying?

• Non-trivial copy constructor / copy assignment operator

• Where would the code run?

• May not be legal to access both source and destination

• About all we can do is copy the bytes (object representation) that make up the object

41

Page 42: oneAPI, SYCL and standard C++ - where do we need to go ...

C++ Trivially Copyable

• C++ trivially copyable types

• Used as a proxy for types where we can copy the bytes

42

Page 43: oneAPI, SYCL and standard C++ - where do we need to go ...

43

C++ Trivially Copyable

• All base classes and non-static members are trivially copyable

• Has at least one public non-deleted copy/move ctor/assign

• If it has a copy/move ctor/assign, it must be public and defaulted

• Has a public defaulted destructor
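These rules can be checked directly with `std::is_trivially_copyable_v`. The types below are hypothetical examples chosen to trip one rule each; none of them come from the slides.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

namespace trait_demo {  // hypothetical namespace for this sketch

// Everything implicit: trivially copyable.
struct Plain {
    int a;
    double b;
};

// Explicitly defaulted copy constructor: still trivially copyable.
struct Defaulted {
    int a;
    Defaulted() = default;
    Defaulted(const Defaulted&) = default;
};

// User-provided destructor, even an empty one: not trivially copyable.
struct UserDtor {
    int a;
    ~UserDtor() {}
};

// User-provided copy constructor: not trivially copyable.
struct UserCopy {
    int a;
    UserCopy() = default;
    UserCopy(const UserCopy& other) : a(other.a) {}
};

// A member that is not trivially copyable makes the whole type fail.
struct HasString {
    std::string s;
};

static_assert(std::is_trivially_copyable_v<Plain>);
static_assert(std::is_trivially_copyable_v<Defaulted>);
static_assert(!std::is_trivially_copyable_v<UserDtor>);
static_assert(!std::is_trivially_copyable_v<UserCopy>);
static_assert(!std::is_trivially_copyable_v<HasString>);

}  // namespace trait_demo
```

`UserCopy` and `UserDtor` behave exactly like their defaulted counterparts, yet the trait rejects them. That is one sense in which the definition is stricter than bitwise copying needs.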

43

Page 44: oneAPI, SYCL and standard C++ - where do we need to go ...

44

C++ Trivially Copyable

• Conflated into trivially copyable

• Bitwise copyable

• Layout

• Trivially copyable is too restrictive (not necessary)

• Not sufficient either

• Member functions can throw exceptions

44

Page 45: oneAPI, SYCL and standard C++ - where do we need to go ...

45

C++ Trivially Copyable

• There are standard library types which are not necessarily trivially copyable for historical reasons

• pair, tuple (even when the types it contains are trivially copyable)

• And because layout is conflated, changing would be ABI break

• And some which are not yet guaranteed to be trivially copyable

• span, basic_string_view

• These are well on their way to C++23 due to paper P2251

45

Page 46: oneAPI, SYCL and standard C++ - where do we need to go ...

46

C++ Trivially Copyable

• If a lambda captures a non trivially copyable type

• The lambda (which is just a struct) is not trivially copyable

• The lambda cannot be implicitly copied to the kernel

• Led to some interesting workarounds in Kokkos and RAJA
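The effect of a capture on the closure type can be observed on the host. This sketch uses hypothetical helper names, and the results reflect how mainstream compilers lay out closures; it involves no SYCL.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

namespace lambda_demo {  // hypothetical namespace for this sketch

// Capturing an int by value: the closure behaves like struct { int scale; }.
inline bool IntCaptureIsTriviallyCopyable() {
    int scale = 3;
    auto kernel = [=](int i) { return i * scale; };
    return std::is_trivially_copyable_v<decltype(kernel)>;
}

// Capturing a std::vector by value: the closure holds a vector member, so its
// copy constructor has to run vector's copy constructor.
inline bool VectorCaptureIsTriviallyCopyable() {
    std::vector<int> data{1, 2, 3};
    auto kernel = [=](std::size_t i) { return data[i]; };
    return std::is_trivially_copyable_v<decltype(kernel)>;
}

}  // namespace lambda_demo
```

The first helper returns true and the second returns false, so only the first lambda could be handed to a kernel launch that requires trivially copyable arguments.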

46

Page 47: oneAPI, SYCL and standard C++ - where do we need to go ...

47

C++ Trivially Copyable

• __SYCL_DEVICE_ONLY__ macro to make something trivially copyable on the device

• __SYCL_DEVICE_ONLY__ is defined to 1 if the source file is being compiled with a SYCL device compiler which does not produce host binary

• This can violate the C++ One Definition Rule (ODR) [basic.def.odr]

• No translation unit shall contain more than one definition of any variable, function, class type, enumeration type, template, default argument for a parameter (for a function in a given scope), or default template argument […]

47

Page 48: oneAPI, SYCL and standard C++ - where do we need to go ...

C++ Trivially Copyable

    struct A {
    #ifndef __SYCL_DEVICE_ONLY__
        ~A() {}
    #endif
    };

• This is a static_assert that only fires on the host

    static_assert(std::is_trivially_copyable_v<A>);

• Worse, what if it is used as a template parameter?

    template <bool B> void C() { /* ... */ }

    C<std::is_trivially_copyable_v<A>>();

• What does it mean to run a destructor on the host but not on a device?

48

Page 49: oneAPI, SYCL and standard C++ - where do we need to go ...

49

C++ Trivially Copyable

• Manually copy the bytes to the device

• Violates C++ object model (lifetime of objects)

• Copying the bytes does not magically bring non-trivially copyable or non-implicit lifetime types into existence

• Undefined behavior

• May work today, but can easily break tomorrow

49

Page 50: oneAPI, SYCL and standard C++ - where do we need to go ...

50

SYCL 2020 - Device Copyable

• Types where bitwise copy for inter-device copying has correct semantics

• Unspecified whether or not copy/move ctor/assign is called to do the inter-device copying

• Unspecified whether or not the destructor is called on the device

• Since it must effectively have no effect on the device

• User specializable trait to indicate a type is device copyable

• Specialize at your own risk

50

Page 51: oneAPI, SYCL and standard C++ - where do we need to go ...

51

SYCL 2020 - Device Copyable

• sycl::is_device_copyable

• Defaults to std::is_trivially_copyable

• Specialized for array, pair, tuple, optional, variant

• When they contain all device copyable types

• array, optional, variant already trivially copyable when they contain all trivially copyable types

• Recursive definition: need to extend it to all device copyable types

• Specialized for span, basic_string_view
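The shape of such a trait can be sketched in standard C++. The `demo_sycl` namespace below is a hypothetical stand-in written for illustration, not the real `sycl::is_device_copyable`, and `Handle` is an invented user type.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <utility>

namespace demo_sycl {  // hypothetical stand-in; not the sycl namespace

// Default: a type is device copyable if it is trivially copyable.
template <typename T>
struct is_device_copyable : std::is_trivially_copyable<T> {};

template <typename T>
inline constexpr bool is_device_copyable_v = is_device_copyable<T>::value;

// Recursive rule for pair: device copyable when both element types are.
template <typename T, typename U>
struct is_device_copyable<std::pair<T, U>>
    : std::bool_constant<is_device_copyable_v<T> && is_device_copyable_v<U>> {};

// Invented user type: the empty user-provided destructor makes it fail
// std::is_trivially_copyable even though a bitwise copy is harmless.
struct Handle {
    int id;
    ~Handle() {}
};

// User opt-in, at the user's own risk.
template <>
struct is_device_copyable<Handle> : std::true_type {};

}  // namespace demo_sycl
```

With these definitions, `std::pair<int, demo_sycl::Handle>` is reported as device copyable while `std::pair<int, std::string>` is not. An aggregate struct containing a `Handle` would still need its own manual specialization.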

51

Page 52: oneAPI, SYCL and standard C++ - where do we need to go ...

52

SYCL 2020 - Device Copyable

• Limitations

• Trivially copyable recursively works if all the types it aggregates are trivially copyable

• Device copyable manually specified

• C++ Reflection (C++26?)

• Require compiler support

• Another hint at a bigger issue

52

Page 53: oneAPI, SYCL and standard C++ - where do we need to go ...

One Definition Rule

• Informally, there are two exceptions to the One Definition Rule

• NDEBUG and assert

• std::is_constant_evaluated()

• Looks like a runtime check, but is actually a compile time check

• Buggy when called via if constexpr (std::is_constant_evaluated()) { /* … */ }

• Always true there, because the condition of an if constexpr is itself constant-evaluated

• Allows different definition in constexpr context

53

Page 54: oneAPI, SYCL and standard C++ - where do we need to go ...

if target

• NVIDIA GTC21

• if target

• Similar to if constexpr, allows a different definition for devices

• is_device, is_host, specific device types, properties, etc.

• Language change

• Not applicable to SYCL

• Yet another hint at a bigger issue

54

Page 55: oneAPI, SYCL and standard C++ - where do we need to go ...

What is the bigger issue?

• C++ has a model for multiple cores and threads on the same computing unit

55

Page 56: oneAPI, SYCL and standard C++ - where do we need to go ...

What is the bigger issue?

• C++ doesn’t have a model for heterogeneous computing

• C++ doesn’t have a model for multiple processes on the same compute unit, and this is at least an order of magnitude harder

• Optimistically, this would take over a decade to add to C++

• Someone has to propose it and spend years guiding it

• IMO, this is what oneAPI and SYCL should flesh out in the long term

56

Page 57: oneAPI, SYCL and standard C++ - where do we need to go ...

Heterogeneous Computing

• Lots of open technical questions

• Is USM part of this model?

• Can most vendors implement this efficiently?

• How does object transfer work?

• How do we allow multiple definitions?

57

Page 58: oneAPI, SYCL and standard C++ - where do we need to go ...

Heterogeneous Computing

• Summary

• C++ has a long way to go

• Standardize things with wide applicability and longevity

• oneAPI, SYCL, Kokkos, RAJA help explore this design space

• In a practical way that we can use now and in the foreseeable future

58

Page 59: oneAPI, SYCL and standard C++ - where do we need to go ...

Resources & References

• N4885 - Working Draft, Standard for Programming Language C++

• https://wg21.link/N4885

• SYCL 2020 Specification

• https://www.khronos.org/registry/SYCL/specs/sycl-2020/pdf/sycl-2020.pdf

• Data Parallel C++ (Reinders, Ashbaugh, Brodman, Kinsner, Pennycook, Tian)

• https://link.springer.com/book/10.1007/978-1-4842-5574-2

• Kokkos

• https://github.com/kokkos

• RAJA

• https://github.com/LLNL/RAJA

• Inside NVC++ and NVFORTRAN (Bryce Adelstein Lelbach) [if target]

• https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s31358

59

Page 60: oneAPI, SYCL and standard C++ - where do we need to go ...

60

Q&A

Page 61: oneAPI, SYCL and standard C++ - where do we need to go ...

61

This presentation was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering, and early testbed platforms, in support of the nation’s exascale computing imperative. Additionally, this presentation used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.