oneAPI, SYCL and standard C++ - where do we need to go ...

Post on 19-May-2022


oneAPI, SYCL & Standard C++ Where do we go from here?

Nevin “:-)” Liber nliber@anl.gov

1

Nevin “:-)” Liber

• Argonne National Laboratory

• Argonne Leadership Computing Facility (ALCF)

• Continue to do C++ standardization

• Kokkos backend for Aurora

• SYCL

• oneAPI

• DPC++

2

C++ Standardization

• 2007

• First BoostCon

• Meet Beman Dawes

• Founder of Boost

• Strong advocate for putting Stepanov’s STL into C++98

• Tells me about an upcoming meeting close to me

• In three years…

3

C++ Standardization

• 2010

• Local meeting at Fermilab

• Joined the committee

• Learn more about C++

• Represent users

• Give back to the C++ community

4

C++11

March 2011, Madrid

5

C++20

February 2020, Prague

7

February 2020 - Prague

• Volunteered to be Vice Chair, Library Evolution Working Group Incubator (LEWGI) / Study Group 18 (SG18)

• A bit of prep work before and after meeting

• Focus on LEWGI proposals

• Slight change

• Pandemic

10

C++ Committee

• Every member wants to make C++ a better language

• Even if no two of us can agree that I am right on what that is

11

–The Rolling Stones

“You can’t always get what you want, but if you try sometimes, well, you might find, you get what you need.”

12

C++ Committee

• Consensus-by-Committee

• Not Design-by-Committee

• We work on proposals

• It is all about tradeoffs

• Consensus of participants -> Consensus of countries

• Getting what you can live with

13

C++ Committee

• Not an Ivory Tower

• We all have day jobs

• It is all tradeoffs

• Which you might or might not agree with

• Unlikely we haven’t considered other (major) positions

14

C++ Standardization Limitations

• We have surprisingly little authority

• No authority over hardware, OSes, systems, etc.

• Understanding with implementers

15

Example: memset_explicit

• A memset that is “guaranteed” not to be optimized away

• What happens if the OS pages out this memory?

• What about other threads or cores?

• How does a guaranteed write fit in with observable behavior?

• At best: undefined, unspecified, or implementation-defined

16

SYCL

• Committee much smaller than C++

• Group effort of really smart people from many different companies

• Standardization effort much newer

• Flesh out ideas for C++ Standardization

• SYCL 2020

• Growing beyond its OpenCL and 3D graphics roots

17

SYCL Limitations

• The code must be valid C++ code

• Even if we interpret it in strange ways

18


Unnamed Lambdas

• Weird but valid C++ syntax

• Forward declaration of a function local class

• SYCL 1.2.1

• Name every kernel

• Unique global name for toolchains with separate device compiler

cgh.parallel_for<class kernel_name>(range<1>{1024}, [=](id<1> idx) {
    writeResult[idx] = idx[0];
});
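The "weird but valid C++ syntax" can be tried without a SYCL toolchain. The following is a minimal sketch in standard C++: `mock_parallel_for` and `fill_with_indices` are hypothetical names invented here, not part of SYCL, and the loop runs serially on the host. It only shows that `class kernel_name` inside a template argument list forward-declares a function-local class that serves as a tag.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a SYCL-style parallel_for; not the real SYCL API.
// KernelName is used only as a compile-time tag and is never instantiated.
template <typename KernelName, typename Kernel>
void mock_parallel_for(std::size_t count, Kernel kernel) {
    for (std::size_t i = 0; i < count; ++i) {
        kernel(i);
    }
}

std::vector<std::size_t> fill_with_indices(std::size_t count) {
    std::vector<std::size_t> result(count);
    // `class kernel_name` is an elaborated type specifier: it declares the
    // incomplete local class kernel_name right inside the argument list.
    mock_parallel_for<class kernel_name>(
        count, [&](std::size_t idx) { result[idx] = idx; });
    return result;
}
```

Calling `fill_with_indices(4)` fills the vector with 0 through 3; the tag type is never defined, which is why an incomplete type is enough.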

20

Intel, oneAPI & DPC++

• Implementer (hardware & software), interface, & implementation

• Initially tools for Aurora

• Flesh out ideas for SYCL

• Flesh out ideas for C++ Standardization

21


Unnamed Lambdas

• Initially Intel, now SYCL 2020

• No need to specify it

• Compiler will internally generate a unique name

• May want to specify it to help with debugging

cgh.parallel_for<class kernel_name>(range<1>{1024}, [=](id<1> idx) {
    writeResult[idx] = idx[0];
});

23

Major oneAPI contributions to SYCL

• Unified Shared Memory (USM)

• Fundamentally simpler programming model for a lot of cases

• Tradeoff

• Dependency graph has to be done explicitly

• As opposed to accessors

24

Major oneAPI contributions to SYCL

• Parallel Reductions

• Class Template Argument Deduction (CTAD)

• Adopting C++17 feature

• Makes it easier to write SYCL code

25

Kokkos

• Performance Portability EcoSystem

• Flesh out ideas for C++ Standardization

• atomic_ref

• C++20

• Interface adopted by SYCL 2020

26

Kokkos

• C++23 (hopefully) -> SYCL Next (hopefully)

• P0009 mdspan

• P1673 Basic Linear Algebra (BLAS)

• oneMKL (hopefully)

• P0443 Executors

• P2128 Multidimensional subscript operator

• mdspn(x, y) becomes mdspn[x, y]

27

Short term (SYCL - Next)

• Continue to grow beyond three dimensions

• Why not N dimensions?

• C++ has had variadic templates since C++11

• Requires interface and implementation work

28

Range Constructor

template <int dimensions = 1>
struct range {
    /* The following constructor is only available in the range class
       specialization where: dimensions==1 */
    range(size_t dim0);

    /* The following constructor is only available in the range class
       specialization where: dimensions==2 */
    range(size_t dim0, size_t dim1);

    /* The following constructor is only available in the range class
       specialization where: dimensions==3 */
    range(size_t dim0, size_t dim1, size_t dim2);

    //...
};

// Deduction guides
range(size_t) -> range<1>;
range(size_t, size_t) -> range<2>;
range(size_t, size_t, size_t) -> range<3>;

• We can be clever and keep this pattern going for N dimensions

• But it is generic code hostile

29

Range Constructor

template <int dimensions = 1>
struct range {
    static_assert(0 < dimensions);

    template <typename... Us,
              typename = std::enable_if_t<
                  sizeof...(Us) == dimensions &&
                  (std::is_convertible_v<Us, size_t> && ...)>>
    range(Us&&... us)
        : dims{static_cast<size_t>(std::forward<Us>(us))...} {}
    // ...
};

// Deduction guides
template <typename... Us,
          typename = std::enable_if_t<
              sizeof...(Us) &&
              (std::is_convertible_v<Us, size_t> && ...)>>
range(Us&&...) -> range<sizeof...(Us)>;

30

Range Constructor

• Is this really the interface we want?
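To experiment with that question, here is a self-contained sketch of the variadic constructor in standard C++17. The name `range_sketch`, the `dims` member, and the `get()` accessor are assumptions made for illustration; this is not the SYCL `range` class.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <utility>

// Illustrative stand-in for sycl::range; names here are assumptions.
template <int dimensions = 1>
struct range_sketch {
    static_assert(0 < dimensions);

    // Accepts exactly `dimensions` arguments, each convertible to size_t.
    template <typename... Us,
              typename = std::enable_if_t<
                  sizeof...(Us) == dimensions &&
                  (std::is_convertible_v<Us, std::size_t> && ...)>>
    range_sketch(Us&&... us)
        : dims{static_cast<std::size_t>(std::forward<Us>(us))...} {}

    std::size_t get(int i) const { return dims[static_cast<std::size_t>(i)]; }

    std::array<std::size_t, dimensions> dims;
};

// Deduction guide: the number of arguments selects the dimensionality.
template <typename... Us,
          typename = std::enable_if_t<
              sizeof...(Us) != 0 &&
              (std::is_convertible_v<Us, std::size_t> && ...)>>
range_sketch(Us&&...) -> range_sketch<sizeof...(Us)>;
```

With this sketch, `range_sketch r{2, 3, 4};` deduces `range_sketch<3>`. The `enable_if` constraint does double duty: it rejects the wrong number of extents, and it keeps the forwarding constructor from hijacking copy construction. That subtlety is part of why the interface deserves scrutiny.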

31

Better C++ Support

• Virtual functions and function pointers

• Why not just use variant?

• Virtual functions model 1 of an indefinite number of types

• std::variant models 0 (valueless_by_exception) or 1 of N known types

• Visitor needs a lot of non-obvious machinery

32

Virtual

struct Base {
    virtual void Call() = 0;
    virtual ~Base() = default;
};

struct D1 : Base {
    void Call() override { /* ... */ }
};

struct D2 : Base {
    void Call() override { /* ... */ }
};

inline void CallIt(Base& b) { b.Call(); }

• Fairly straightforward

• Collection: vector<unique_ptr<Base>>

33

Variant

using VariantD = std::variant<D1, D2>;

struct VariantDVisitor {
    template <typename D>
    void operator()(D&& d) const { d.Call(); }
};

inline void CallIt(VariantD& d) {
    static const VariantDVisitor vis;
    std::visit(vis, d);
}

// Implicit conversion from D1 or D2
inline void CallIt(VariantD&& d) {
    static const VariantDVisitor vis;
    std::visit(vis, d);
}

• Classes are simpler

• Hand-written machinery

• Collection: vector<VariantD>

• Inversion of control

• Pattern matching (C++23?) may help alleviate this

34
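As a rough illustration of the variant approach with less hand-written machinery, the sketch below swaps the visitor struct for a generic lambda. It is an assumption-laden variation rather than the code from the slide: `D1` and `D2` here have no base class, `Call()` returns a string so the dispatch can be observed, and `CallAll` is a hypothetical helper.

```cpp
#include <cassert>
#include <string>
#include <variant>
#include <vector>

// Hypothetical alternatives: no common base class and no virtual functions.
struct D1 { std::string Call() const { return "D1"; } };
struct D2 { std::string Call() const { return "D2"; } };

using VariantD = std::variant<D1, D2>;

// A generic lambda replaces the hand-written visitor struct; std::visit
// instantiates it once per alternative and dispatches on the active one.
inline std::string CallIt(const VariantD& d) {
    return std::visit([](const auto& alt) { return alt.Call(); }, d);
}

// A collection of variants stores the objects by value.
inline std::string CallAll(const std::vector<VariantD>& items) {
    std::string out;
    for (const auto& item : items) {
        out += CallIt(item);
    }
    return out;
}
```

The set of alternatives is still closed: adding a `D3` means changing `VariantD`, whereas the virtual version accepts any class derived from `Base`.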

Template-land

template <typename D> void CallIt(D& d) { d.Call(); }

• Errors generated at the call

• No collections

35

Virtual functions

• Why are they hard (from a language perspective)?

• Code generated for CPU is different than code generated for GPU

• At different addresses

• May not be addressable by other device

• Yet C++ says one function, one address

• Hint at a bigger issue

36

Exceptions

• For general support, we have to solve virtual functions first

• Throw derived, catch as base class reference

try {
    throw D1();
} catch (Base& b) {
    b.Call();
}

• Exceptions derived from std::exception

• virtual const char* what() const noexcept

• virtual destructor

37
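The dependency on virtual dispatch is easy to see in ordinary host code. In this minimal sketch, `DeviceError` and `ThrowAndDescribe` are hypothetical names chosen for illustration; the handler only knows about `std::exception`, yet the derived `what()` is the one that runs.

```cpp
#include <cassert>
#include <exception>
#include <string>

// Hypothetical exception type for illustration.
struct DeviceError : std::exception {
    const char* what() const noexcept override { return "DeviceError"; }
};

// Throws the derived type and catches it through a base-class reference;
// the call to what() is dispatched virtually to DeviceError::what().
inline std::string ThrowAndDescribe() {
    std::string description;
    try {
        throw DeviceError();
    } catch (const std::exception& e) {
        description = e.what();
    }
    return description;
}
```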

Better C++ Support

• Virtual Inheritance

• Run-Time Type Information (RTTI)

38

C++ Trivially Copyable

• For almost a decade as a C++ Committee member, I did not know why trivially copyable is important

• I generally supported it because it is more flexible

• But I never pushed for it

• I suspect many in LEWG also do not know why trivially copyable is important

39


Copying Objects

• How do we copy objects in C++?

• Copy constructor / copy assignment operator

• Running code

• Code may access both source and destination

40

Copying Objects

• Can we do the same for inter-device copying?

• Non-trivial copy constructor / copy assignment operator

• Where would the code run?

• May not be legal to access both source and destination

• About all we can do is copy the bytes (object representation) that make up the object

41

C++ Trivially Copyable

• C++ trivially copyable types

• Used as a proxy for types where we can copy the bytes
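A host-only sketch of "copy the bytes" follows. `Particle` and `CopyBytes` are hypothetical names, and `std::memcpy` within one address space merely stands in for an inter-device transfer; the `static_assert` is the proxy check described above.

```cpp
#include <cassert>
#include <cstring>
#include <type_traits>

// Hypothetical payload type; any trivially copyable type would do.
struct Particle {
    float x;
    float y;
    int id;
};

// Copies the object representation, refusing types for which a byte
// copy is not known to be a valid copy.
template <typename T>
void CopyBytes(T& destination, const T& source) {
    static_assert(std::is_trivially_copyable_v<T>,
                  "byte-wise copy requires a trivially copyable type");
    std::memcpy(&destination, &source, sizeof(T));
}
```

Instantiating `CopyBytes` with a type such as `std::string` fails to compile, which is the behavior a runtime wants before shipping bytes to a device.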

42


C++ Trivially Copyable

• All base classes and non-static members are trivially copyable

• Has at least one public non-deleted copy/move ctor/assign

• If it has a copy/move ctor/assign, it must be public and defaulted

• Has a public defaulted destructor

43

C++ Trivially Copyable

• Conflated into trivially copyable

• Bitwise copyable

• Layout

• Trivially copyable is too restrictive (not necessary)

• Not sufficient either

• Member functions can throw exceptions

44

C++ Trivially Copyable

• There are standard library types which are not necessarily trivially copyable for historical reasons

• pair, tuple (even when the types it contains are trivially copyable)

• And because layout is conflated, changing would be ABI break

• And some which are not yet guaranteed to be trivially copyable

• span, basic_string_view

• These are well on their way to C++23 due to paper P2251

45

C++ Trivially Copyable

• If a lambda captures a non trivially copyable type

• The lambda (which is just a struct) is not trivially copyable

• The lambda cannot be implicitly copied to the kernel

• Led to some interesting workarounds in Kokkos and RAJA

46

C++ Trivially Copyable

• __SYCL_DEVICE_ONLY__ macro to make something trivially copyable on the device

• __SYCL_DEVICE_ONLY__ is defined to 1 if the source file is being compiled with a SYCL device compiler which does not produce host binary

• This can violate the C++ One Definition Rule (ODR) [basic.def.odr]

• No translation unit shall contain more than one definition of any variable, function, class type, enumeration type, template, default argument for a parameter (for a function in a given scope), or default template argument […]

47

C++ Trivially Copyable

struct A {
#ifndef __SYCL_DEVICE_ONLY__
    ~A() {}
#endif
};

• This is a static_assert that only fires on the host

static_assert(std::is_trivially_copyable_v<A>);

• Worse, what if it is used as a template parameter?

template <bool B> void C() { /* ... */ }

C<std::is_trivially_copyable_v<A>>();

• What does it mean to run a destructor on the host but not on a device?
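Because one translation unit cannot hold both definitions of `A`, the sketch below spells out the two views as separate types. `AOnHost` and `AOnDevice` are hypothetical names for what each compilation pass would see, and `C` is changed to return its template argument so the divergence is observable.

```cpp
#include <cassert>
#include <type_traits>

// What the host compilation pass sees: the destructor is user-provided.
struct AOnHost {
    ~AOnHost() {}
};

// What the device pass sees once the #ifndef removes the destructor.
struct AOnDevice {};

// Mirrors C<std::is_trivially_copyable_v<A>>() from the slide, returning
// the template argument so the selected instantiation can be observed.
template <bool B>
constexpr bool C() { return B; }

static_assert(!std::is_trivially_copyable_v<AOnHost>);
static_assert(std::is_trivially_copyable_v<AOnDevice>);
```

Host and device code that both believe they are calling "the same" `C<...>()` would in fact select different instantiations, which is the ODR hazard in miniature.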

48

C++ Trivially Copyable

• Manually copy the bytes to the device

• Violates C++ object model (lifetime of objects)

• Copying the bytes does not magically bring non-trivially copyable or non-implicit lifetime types into existence

• Undefined behavior

• May work today, but can easily break tomorrow

49

SYCL 2020 - Device Copyable

• Types where bitwise copy for inter-device copying has correct semantics

• Unspecified whether or not copy/move ctor/assign is called to do the inter-device copying

• Unspecified whether or not the destructor is called on the device

• Since it must effectively have no effect on the device

• User specializable trait to indicate a type is device copyable

• Specialize at your own risk

50

SYCL 2020 - Device Copyable

• sycl::is_device_copyable

• Defaults to std::is_trivially_copyable

• Specialized for array, pair, tuple, optional, variant

• When they contain all device copyable types

• array, optional, variant already trivially copyable when they contain all trivially copyable types

• Recursive definition: need to extend it to all device copyable types

• Specialized for span, basic_string_view
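The shape of such a trait can be sketched in standard C++ without a SYCL implementation. Everything in the `sketch` namespace below is a mock written for illustration, not the real `sycl::is_device_copyable`; `Handle` is a hypothetical user type, and only the `pair` specialization is shown.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <utility>

namespace sketch {

// Primary template: falls back to trivially copyable.
template <typename T>
struct is_device_copyable : std::is_trivially_copyable<T> {};

// pair is device copyable when both of its element types are.
template <typename T, typename U>
struct is_device_copyable<std::pair<T, U>>
    : std::conjunction<is_device_copyable<T>, is_device_copyable<U>> {};

template <typename T>
inline constexpr bool is_device_copyable_v = is_device_copyable<T>::value;

}  // namespace sketch

// Hypothetical user type with a user-provided destructor that has no
// effect, so a bitwise copy preserves its meaning.
struct Handle {
    int value;
    ~Handle() {}
};

namespace sketch {

// User opt-in: "specialize at your own risk".
template <>
struct is_device_copyable<Handle> : std::true_type {};

}  // namespace sketch
```

The recursion in the `pair` specialization is what lets `std::pair<int, Handle>` qualify even though `Handle` itself is not trivially copyable.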

51

SYCL 2020 - Device Copyable

• Limitations

• Trivially copyable recursively works if all the types it aggregates are trivially copyable

• Device copyable manually specified

• C++ Reflection (C++26?)

• Require compiler support

• Another hint at a bigger issue

52

One Definition Rule

• Informally, there are two exceptions to the One Definition Rule

• NDEBUG and assert

• std::is_constant_evaluated()

• Looks like a runtime check, but is actually a compile time check

• Buggy when called via if constexpr (std::is_constant_evaluated()) { /* … */ }

• The condition is always true there

• Allows different definition in constexpr context

53

if target

• NVIDIA GTC21

• if target

• Similar to if constexpr, allows a different definition for devices

• is_device, is_host, specific device types, properties, etc.

• Language change

• Not applicable to SYCL

• Yet another hint at a bigger issue

54

What is the bigger issue?

• C++ has a model for multiple cores and threads on the same computing unit

55

What is the bigger issue?

• C++ doesn’t have a model for heterogeneous computing

• C++ doesn’t have a model for multiple processes on the same compute unit, and this is at least an order of magnitude harder

• Optimistically, this would take over a decade to add to C++

• Someone has to propose it and spend years guiding it

• IMO, this is what oneAPI and SYCL should flesh out in the long term

56

Heterogeneous Computing

• Lots of open technical questions

• Is USM part of this model?

• Can most vendors implement this efficiently?

• How does object transfer work?

• How do we allow multiple definitions?

57

Heterogeneous Computing

• Summary

• C++ has a long way to go

• Standardize things with wide applicability and longevity

• oneAPI, SYCL, Kokkos, RAJA help explore this design space

• In a practical way that we can use now and in the foreseeable future

58

Resources & References

• N4885 - Working Draft, Standard for Programming Language C++

• https://wg21.link/N4885

• SYCL 2020 Specification

• https://www.khronos.org/registry/SYCL/specs/sycl-2020/pdf/sycl-2020.pdf

• Data Parallel C++ (Reinders, Ashbaugh, Brodman, Kinsner, Pennycook, Tian)

• https://link.springer.com/book/10.1007/978-1-4842-5574-2

• Kokkos

• https://github.com/kokkos

• RAJA

• https://github.com/LLNL/RAJA

• Inside NVC++ and NVFORTRAN (Bryce Adelstein Lelbach) [if target]

• https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s31358

59

60

Q&A

61

This presentation was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering, and early testbed platforms, in support of the nation’s exascale computing imperative. Additionally, this presentation used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.