PMAM2017.keyMichael Haidl · Michel Steuwer · Hendrik Dirks Tim
Humernbrum · Sergei Gorlatch
The State of GPU Programming
• Low-Level GPU programming with CUDA / OpenCL is widely considered
too difficult
• Higher level approaches improve programmability
• Thrust and others allow programmers to write programs by
customising and composing patterns
2
Dot Product Example in Thrust
Dot Product expressed as special case No composition of universal
patterns
Our LLVM-based, C++-to-GPU code generator uses multi-staging
optimizations to embed values only known at CPU runtime into the
kernel code which is then JIT com- piled and executed on the GPU.
This allows to specialize the kernel code for the particular input
size and particular number of threads executing the kernel.
Our experimental evaluation demonstrates that compos- able code
written using our approach achieves the same performance as
low-level monolithic implementations in CUDA and outperforms
composable code written using Thrust.
This paper makes the following main contributions:
• Enabling composable GPU programming by enriching the range-v3
library with GPU-enabled algorithms;
• Giving programmers full control and better predictability about
kernel fusion by choosing between eager actions and algorithms and
lazy views;
• Implementing JIT code generation which takes advan- tage of
multi-staging to optimize GPU code based on val- ues known at
runtime of the CPU program.
The remainder of the paper is organized as follows: In Section 2 we
recap related work on GPU programming and we motivate the necessity
for writing efficient composable GPU code. Section 3 introduces our
API with ranges and GPU-enabled algorithms. Section 4 introduces
actions and views and show how these give the programmer precise
con- trol over kernel fusion. Section 5 explains our C++-to-GPU
code generation process and our enhanced compiler opti- mizations
(in particular multi-staging). Section 6 evaluates our approach,
and Section 7 concludes the paper.
2. Background and Related Work GPU programming using the current
low-level programming models such as OpenCL and CUDA is
challenging. Special kernel functions are executed in parallel on
the GPU by ex- plicitly specifying the number of executing threads
running in parallel. OpenCL and CUDA clearly separate the host pro-
gram running on the CPU from the GPU program (kernel) which is
written in a subset of the C/C++ programming lan- guage with
GPU-specific extensions.
1 __global__ void partialDotProduct(float* a, 2 float* b, 3 float*
res){ 4 extern __shared__ float* tmp; 5 int lid = threadIdx.x; 6
int gid = lid + blockIdx.x * blockDim.x; 7 tmp[lid] = a[gid] *
b[gid]; 8 __syncthreads (); 9 for (int i = get_local_size (0)/2;
i>0; i*=2) {
10 if (lid < i) tmp[lid] += tmp[lid + i]; 11 __syncthreads (); }
12 if (lid == 0) res[blockIdx.x] = tmp[lid];}
Listing 1: GPU Kernel in CUDA computing a dot product.
Listing 1 shows a simple, unoptimized CUDA kernel, adopted from the
Nvidia toolkit [14] that computes a dot product of vectors a and b.
In line 7 each thread multiplies the corresponding elements of the
two vectors with each other. A tree-based reduction is performed by
a group of threads (called block in CUDA): in each iteration of the
loop in line 9 the number of active threads which add up two el-
ements (line 10) is halved. Finally, one thread of each block
writes the computed result back into memory (line 12). The barriers
in lines 8 and 11 ensure a consistent view of the memory across all
threads. As synchronization across blocks is not possible, the
summation of all results computed by the blocks cannot be performed
in this kernel and must either be done on the host or in another
CUDA kernel. The imple- mentation in Listing 1 is not very
efficient on modern GPUs and can be significantly optimized [9]
which makes the code significantly more complex. But even the
simple dot product code is not trivial: it works in a hierarchical
CUDA address space (global and shared memory) and requires barrier
syn- chronization, making it prone to subtle bugs like deadlocks
and race conditions.
One popular way to overcome this low-level program- ming style is
to provide reusable generic parallel patterns, also known as
algorithmic skeletons. Thrust [2], Bolt [1], Accelerate [11], and
SkelCL [19] follow this approach.
1 float dotProduct(const vector <float >& a, 2 const
vector <float >& b) { 3 thrust :: device_vector <float
> d_a = a; 4 thrust :: device_vector <float > d_b = b; 5
return thrust :: inner_product( 6 d_a.begin(), d_a.end(),
d_b.begin(), 0.0f); }
Listing 2: Optimal dot product implementation in Thrust using a
domain-specific library function.
Listing 2 shows the implementation of the dot product in Thrust
using a single library call. This implementation is straightforward
and relies on an optimized GPU pattern writ- ten by experts and
tuned for performance. Unfortunately, it uses a very
domain-specific library function (inner_product) which limits its
applicability for other applications.
1 float dotProduct(const vector <float >& a, 2 const
vector <float >& b) { 3 thrust :: device_vector <float
> d_a = a; 4 thrust :: device_vector <float > d_b = b; 5
thrust :: device_vector <float > tmp(a.size()); 6 thrust ::
transform(d_a.begin(), d_a.end(), 7 d_b.begin(), tmp.begin (), 8
thrust ::multiplies <floal >()); 9 return thurst ::
reduce(tmp.begin (), tmp.end());}
Listing 3: Generic and composable, but non-optimal Thrust
implemenation of dot product
In Listing 3, the dot product is implemented in Thrust us- ing two
smaller but more universal patterns: transform and reduce. However,
here we pay a performance penalty for
2 2016/12/16
In Thrust: Two Patterns !== Two Kernels !-> Bad
Performance
Our LLVM-based, C++-to-GPU code generator uses multi-staging
optimizations to embed values only known at CPU runtime into the
kernel code which is then JIT com- piled and executed on the GPU.
This allows to specialize the kernel code for the particular input
size and particular number of threads executing the kernel.
Our experimental evaluation demonstrates that compos- able code
written using our approach achieves the same performance as
low-level monolithic implementations in CUDA and outperforms
composable code written using Thrust.
This paper makes the following main contributions:
• Enabling composable GPU programming by enriching the range-v3
library with GPU-enabled algorithms;
• Giving programmers full control and better predictability about
kernel fusion by choosing between eager actions and algorithms and
lazy views;
• Implementing JIT code generation which takes advan- tage of
multi-staging to optimize GPU code based on val- ues known at
runtime of the CPU program.
The remainder of the paper is organized as follows: In Section 2 we
recap related work on GPU programming and we motivate the necessity
for writing efficient composable GPU code. Section 3 introduces our
API with ranges and GPU-enabled algorithms. Section 4 introduces
actions and views and show how these give the programmer precise
con- trol over kernel fusion. Section 5 explains our C++-to-GPU
code generation process and our enhanced compiler opti- mizations
(in particular multi-staging). Section 6 evaluates our approach,
and Section 7 concludes the paper.
2. Background and Related Work GPU programming using the current
low-level programming models such as OpenCL and CUDA is
challenging. Special kernel functions are executed in parallel on
the GPU by ex- plicitly specifying the number of executing threads
running in parallel. OpenCL and CUDA clearly separate the host pro-
gram running on the CPU from the GPU program (kernel) which is
written in a subset of the C/C++ programming lan- guage with
GPU-specific extensions.
1 __global__ void partialDotProduct(float* a, 2 float* b, 3 float*
res){ 4 extern __shared__ float* tmp; 5 int lid = threadIdx.x; 6
int gid = lid + blockIdx.x * blockDim.x; 7 tmp[lid] = a[gid] *
b[gid]; 8 __syncthreads (); 9 for (int i = get_local_size (0)/2;
i>0; i*=2) {
10 if (lid < i) tmp[lid] += tmp[lid + i]; 11 __syncthreads (); }
12 if (lid == 0) res[blockIdx.x] = tmp[lid ];}
Listing 1: GPU Kernel in CUDA computing a dot product.
Listing 1 shows a simple, unoptimized CUDA kernel, adopted from the
Nvidia toolkit [14] that computes a dot product of vectors a and b.
In line 7 each thread multiplies the corresponding elements of the
two vectors with each other. A tree-based reduction is performed by
a group of threads (called block in CUDA): in each iteration of the
loop in line 9 the number of active threads which add up two el-
ements (line 10) is halved. Finally, one thread of each block
writes the computed result back into memory (line 12). The barriers
in lines 8 and 11 ensure a consistent view of the memory across all
threads. As synchronization across blocks is not possible, the
summation of all results computed by the blocks cannot be performed
in this kernel and must either be done on the host or in another
CUDA kernel. The imple- mentation in Listing 1 is not very
efficient on modern GPUs and can be significantly optimized [9]
which makes the code significantly more complex. But even the
simple dot product code is not trivial: it works in a hierarchical
CUDA address space (global and shared memory) and requires barrier
syn- chronization, making it prone to subtle bugs like deadlocks
and race conditions.
One popular way to overcome this low-level program- ming style is
to provide reusable generic parallel patterns, also known as
algorithmic skeletons. Thrust [2], Bolt [1], Accelerate [11], and
SkelCL [19] follow this approach.
1 float dotProduct(const vector <float >& a, 2 const
vector <float >& b) { 3 thrust :: device_vector <float
> d_a = a; 4 thrust :: device_vector <float > d_b = b; 5
return thrust :: inner_product( 6 d_a.begin(), d_a.end(),
d_b.begin(), 0.0f); }
Listing 2: Optimal dot product implementation in Thrust using a
domain-specific library function.
Listing 2 shows the implementation of the dot product in Thrust
using a single library call. This implementation is straightforward
and relies on an optimized GPU pattern writ- ten by experts and
tuned for performance. Unfortunately, it uses a very
domain-specific library function (inner_product) which limits its
applicability for other applications.
1 float dotProduct(const vector <float >& a, 2 const
vector <float >& b) { 3 thrust :: device_vector <float
> d_a = a; 4 thrust :: device_vector <float > d_b = b; 5
thrust :: device_vector <float > tmp(a.size()); 6 thrust ::
transform(d_a.begin (), d_a.end(), 7 d_b.begin(), tmp.begin (), 8
thrust :: multiplies <floal >()); 9 return thurst ::
reduce(tmp.begin (), tmp.end());}
Listing 3: Generic and composable, but non-optimal Thrust
implemenation of dot product
In Listing 3, the dot product is implemented in Thrust us- ing two
smaller but more universal patterns: transform and reduce. However,
here we pay a performance penalty for
2 2016/12/164
Universal patterns
Iterators prevent composable programming style
Composability in the Range-based STL*
• Replacing pairs of Iterators with Ranges allows for a composable
style:
• We can even write:
5
pattern composition: two kernels are launched and a tem- porary
vector tmp is required to store the intermediate re- sult of
transform. This loss of efficiency is the reason why Thrust offers
the specific inner_product function: high per- formance is not
achievable by composing universal patterns.
In the next sections, we present our approach where higher-level
programs can be expressed as a composition of universal patterns
without a performance loss.
3. Programming with Ranges and GPU-Enabled Algorithms
Our suggested composable API for GPUs is inspired by the Standard
Template Library (STL) which is widely used in C++ programming. The
STL consists of three main components: 1) containers: collection
data types such as std::vector or std::set, 2) algorithms: reusable
operations on containers such as std::accumulate or std::sort, 3)
it- erators: glue containers with algorithms, such that the same
algorithm is reusable for different containers. This library design
has proven to be highly flexible and is one of the main reasons of
the STL’s success.
3.1 From Iterators to Ranges The existing STL with iterators does
not allow to compose algorithms easily. Listing 4 shows the dot
product example using two STL algorithms with iterators: transform
to ap- ply the binary mult function to the corresponding elements
of the input containers a and b, and accumulate to sum up all the
values of a sequence. These two algorithms do not compose nicely,
because transform returns only a single it- erator, while
accumulate expects a pair of iterators as its first two arguments
describing the start and end of the input con- tainer. Furthermore,
a temporary vector tmp is required, as the transform algorithm has
to write the computed interme- diate result into a container. The
algorithm cannot allocate this temporary container itself, because
the iterator abstrac- tion hides the type of container (in this
case a vector) from the algorithm.
1 float dotProduct(const vector <float >& a, 2 const
vector <float >& b) { 3 auto mult = []( auto x, auto
y){return x * y;}; 4 5 vector <float > tmp(a.size()); 6
transform(a.begin (), a.end(), b.begin (), 7 tmp.begin (), mult); 8
return accumulate(tmp.begin (), tmp.end() ,0.0f); 9 }
Listing 4: Dot product implementation using iterators.
Listing 5 shows how we can overcome these composability problems
using the recent range-v3 library [13] that replaces iterators with
ranges. The accumulate call now takes its input as the single first
argument compared to two separate itera- tors in Listing 4. Ranges
are composable, because they carry information about the start and
end points of a container, i.e.
1 float dotProduct(const vector <float >& a, 2 const
vector <float >& b) { 3 auto mult = []( auto p){ 4 return
get <0>(p) * get <1>(p); }; 5 6 return 7 accumulate( 8
view:: transform(view::zip(a,b),mult) ,0.0f); }
Listing 5: Dot product implementation using composable
ranges.
ranges combine the information scattered across the first two
arguments of accumulate into a single value. This change, together
with the introduction of views — lazily computed ranges, which we
will discuss in more detail in Section 4 — allows to write the dot
product example in a concise and composable style in Listing 5.
Here the zip function creates pairs of elements from vectors a and
b which are then mul- tiplied and summed up. Composability is a
central feature of the range-v3 library: it allows and encourages
developers to use the pipe symbol | to denote composition (similar
to the bash shell). Therefore, we can also rewrite lines 7–8 of
Listing 5 as explicit composition of patterns:
view::zip(a,b) | view::transform(mult) | accumulate(0.0f)
The idea of our approach is to bring this range-based ab- straction
to GPU programming.
3.2 GPU-enabled containers and algorithms We extend the range-v3
library with a GPU-enabled con- tainer and GPU-enabled algorithms
to allow programmers to write GPU applications in a composable
way.
For storing data on the GPU we introduce the gpu::vector
container which provides a range-based interface to access its
elements. We use type traits to ensure statically that the
GPU-enabled algorithms only operate on data stored in a
gpu::vector. Data is transferred to the GPU by copying data into a
gpu::vector, e.g., by using the gpu::copy function as in Listing 6,
and transferred back by copying such a con- tainer into a regular
STL vector.
Our algorithm implementations for the GPU using ranges currently
covers the three central algorithms – gpu::for_each,
gpu::transform, and gpu::reduce – and is currently being extended
to cover all algorithms of the recently standardized parallel STL
[10]. The for_each and transform algorithms apply a given function
to every element of the input range in parallel. The transform
writes the result of each function ap- plication into an output
range (this algorithm is also known as map in functional
programming). The for_each algorithm produces no result directly,
but is executed for its side ef- fects. This allows, for example,
to write computed values to memory in a less structured fashion
than using transform. The reduce algorithm performs a parallel
reduction using a given binary operator which has to be
associative. We will
3 2016/12/16
pattern composition: two kernels are launched and a tem- porary
vector tmp is required to store the intermediate re- sult of
transform. This loss of efficiency is the reason why Thrust offers
the specific inner_product function: high per- formance is not
achievable by composing universal patterns.
In the next sections, we present our approach where higher-level
programs can be expressed as a composition of universal patterns
without a performance loss.
3. Programming with Ranges and GPU-Enabled Algorithms
Our suggested composable API for GPUs is inspired by the Standard
Template Library (STL) which is widely used in C++ programming. The
STL consists of three main components: 1) containers: collection
data types such as std::vector or std::set, 2) algorithms: reusable
operations on containers such as std::accumulate or std::sort, 3)
it- erators: glue containers with algorithms, such that the same
algorithm is reusable for different containers. This library design
has proven to be highly flexible and is one of the main reasons of
the STL’s success.
3.1 From Iterators to Ranges The existing STL with iterators does
not allow to compose algorithms easily. Listing 4 shows the dot
product example using two STL algorithms with iterators: transform
to ap- ply the binary mult function to the corresponding elements
of the input containers a and b, and accumulate to sum up all the
values of a sequence. These two algorithms do not compose nicely,
because transform returns only a single it- erator, while
accumulate expects a pair of iterators as its first two arguments
describing the start and end of the input con- tainer. Furthermore,
a temporary vector tmp is required, as the transform algorithm has
to write the computed interme- diate result into a container. The
algorithm cannot allocate this temporary container itself, because
the iterator abstrac- tion hides the type of container (in this
case a vector) from the algorithm.
1 float dotProduct(const vector <float >& a, 2 const
vector <float >& b) { 3 auto mult = []( auto x, auto
y){return x * y;}; 4 5 vector <float > tmp(a.size()); 6
transform(a.begin (), a.end(), b.begin (), 7 tmp.begin(), mult); 8
return accumulate(tmp.begin (), tmp.end() ,0.0f); 9 }
Listing 4: Dot product implementation using iterators.
Listing 5 shows how we can overcome these composability problems
using the recent range-v3 library [13] that replaces iterators with
ranges. The accumulate call now takes its input as the single first
argument compared to two separate itera- tors in Listing 4. Ranges
are composable, because they carry information about the start and
end points of a container, i.e.
1 float dotProduct(const vector <float >& a, 2 const
vector <float >& b) { 3 auto mult = []( auto p){ 4 return
get <0>(p) * get <1>(p); }; 5 6 return 7 accumulate( 8
view:: transform(view::zip(a,b),mult) ,0.0f); }
Listing 5: Dot product implementation using composable
ranges.
ranges combine the information scattered across the first two
arguments of accumulate into a single value. This change, together
with the introduction of views — lazily computed ranges, which we
will discuss in more detail in Section 4 — allows to write the dot
product example in a concise and composable style in Listing 5.
Here the zip function creates pairs of elements from vectors a and
b which are then mul- tiplied and summed up. Composability is a
central feature of the range-v3 library: it allows and encourages
developers to use the pipe symbol | to denote composition (similar
to the bash shell). Therefore, we can also rewrite lines 7–8 of
Listing 5 as explicit composition of patterns:
view::zip(a,b) | view::transform(mult) | accumulate(0.0f)
The idea of our approach is to bring this range-based ab- straction
to GPU programming.
3.2 GPU-enabled containers and algorithms We extend the range-v3
library with a GPU-enabled con- tainer and GPU-enabled algorithms
to allow programmers to write GPU applications in a composable
way.
For storing data on the GPU we introduce the gpu::vector
container which provides a range-based interface to access its
elements. We use type traits to ensure statically that the
GPU-enabled algorithms only operate on data stored in a
gpu::vector. Data is transferred to the GPU by copying data into a
gpu::vector, e.g., by using the gpu::copy function as in Listing 6,
and transferred back by copying such a con- tainer into a regular
STL vector.
Our algorithm implementations for the GPU using ranges currently
covers the three central algorithms – gpu::for_each,
gpu::transform, and gpu::reduce – and is currently being extended
to cover all algorithms of the recently standardized parallel STL
[10]. The for_each and transform algorithms apply a given function
to every element of the input range in parallel. The transform
writes the result of each function ap- plication into an output
range (this algorithm is also known as map in functional
programming). The for_each algorithm produces no result directly,
but is executed for its side ef- fects. This allows, for example,
to write computed values to memory in a less structured fashion
than using transform. The reduce algorithm performs a parallel
reduction using a given binary operator which has to be
associative. We will
3 2016/12/16
• GPU-enabled container gpu!::vector<T>
7
see that already using these three algorithms as patterns al- lows
us to express many interesting applications, especially when
combined with views (lazily evaluated ranges) as dis- cussed in
Section 4.
3.3 First GPU example Listing 6 shows a code of the dot product
using our API with the gpu::reduce algorithm. Notice how close the
implemen- tation is compared to Listing 5. In line 6 the input
vectors a
and b are copied to the GPU, and then their pairwise multi-
plication results (line 7) are summed up in line 8. Our API
implementation (described in Section 5) guarantees to fuse the
operations expressed as views into a single efficient GPU kernel.
In the example, the zip and the vector multiplica- tion in lines 6
and 7 are fused together with the reduction in line 8. We will
discuss the implementation of views in the next section.
1 float dotProduct(const vector <float >& a, 2 const
vector <float >& b) { 3 auto mult = []( auto p){ 4 return
get <0>(p) * get <1>(p); }; 5 6 return
view::zip(gpu::copy(a), gpu::copy(b)) 7 | view:: transform(mult) 8
| gpu:: reduce (0.0f); }
Listing 6: GPU dot product using composable patterns.
3.4 Summary In our API, ranges combined with GPU algorithms and GPU
containers enable a natural way to program GPUs in C++ similar to
the programming approach widely known from the STL. To achieve
composability, programs are written by combining small and
simple-to-understand patterns which greatly simplify programming as
compared to traditional low-level programming approaches like CUDA.
Next we discuss the key idea for achieving high performance in our
composable approach: guaranteed kernel fusion using views.
4. Eager Actions and Lazy Views The range-v3 library introduces two
new consructs to the STL — actions and views — which enhance the
compos- ability. We exploit views and actions for GPU programming:
by using them programmers can control the fusion of com- putations
expressed by patterns into a single GPU kernel.
4.1 Actions Actions perform (potentially mutating) in-place
operations on containers, i.e., actions do not require an
externally pro- vided output container, such as the temporary
vector tmp
in Listing 4. Actions are implemented with the STL algo- rithms
and, therefore, we implemented corresponding ac- tions for our
GPU-enabled algorithms. Actions return a ref- erence to the
modified container and, therefore, compose nicely with other
actions.
4.2 Views Views are the counterpart to actions and describe non-
mutating operations on ranges. Views, like actions, compose nicely
and are designed to be used together with each other and with the
algorithms described earlier.
An example for a view is view::transform(mult) in List- ing 6. It
applies a given function (in this case mult) to its input range.
When executed on the CPU, this computation is not performed eagerly
by writing the computed result to some (temporary) memory location,
but rather an object (the view) is created which behaves like a
range and performs the computation lazily, i.e., on-demand once an
element in the range is requested. The view object holds references
to its input range and the function to be called. When the view is
iterated over, it evaluates the requested elements. Views are
implemented as first class objects which can be stored in variables
and passed to and returned from functions.
When views are composed with each other or with algo- rithms, they
are evaluated only when the finally computed range is accessed. For
example, in Listing 6 the zip and transform views are composed with
the gpu::reduce algo- rithm. The implementation of reduce iterates
over the input range to sum up all of its elements. Inside of this
iteration, the pairwise multiplication expressed by the two views
is performed, i.e., it is automatically fused by our API imple-
mentation into the implementation of the reduce algorithm. When we
compile this code for the GPU, as described in Section 5 this API
design ensures that only a single GPU ker- nel is emitted which
performs the zip and transform compu- tations inside of the
iteration code of the reduce algorithm.
Together with our GPU-enabled algorithms, this guaran- teed
behavior of the views let programmers reason precisely about the
cost of operations and the number of GPU kernels launched. Views
allow to write composable and elegant code without paying a
performance penalty, as we will see in our experimental
evaluation.
4.3 Provided Views The range-v3 library offers currently over 40
views (e.g., filter or generate) which can be used together with
our GPU algorithms. This greatly enhances the flexibility of our
GPU programming approach. Interestingly, some views such as repeat
represent infinite ranges; the take and take_while
views can be used to limit such infinite ranges. In our ap- proach
kernel fusion is directly tied to the available views. The
possibility to fuse two kernels corresponds to the ability to
express a computation as a view which can be performed lazily and,
therefore, folded into another computation.
The benefit of reusing the existing views from the range-v3
library is ensured by our LLVM-based GPU code generator which
compiles arbitrary C++ code for the GPU. There are only minor not
supported exemptions, such as virtual func- tions or exceptions
which are not used in the STL.
4 2016/12/16
2. Combine vectors
• Many Patterns !!= Many Kernels !-> Good Performance
Lazy Views !== Kernel Fusion
• The implementation of views guarantees fusion with the following
operation
• Fused with GPU-enabled pattern !=> Kernel Fusion
see that already using these three algorithms as patterns al- lows
us to express many interesting applications, especially when
combined with views (lazily evaluated ranges) as dis- cussed in
Section 4.
3.3 First GPU example Listing 6 shows a code of the dot product
using our API with the gpu::reduce algorithm. Notice how close the
implemen- tation is compared to Listing 5. In line 6 the input
vectors a
and b are copied to the GPU, and then their pairwise multi-
plication results (line 7) are summed up in line 8. Our API
implementation (described in Section 5) guarantees to fuse the
operations expressed as views into a single efficient GPU kernel.
In the example, the zip and the vector multiplica- tion in lines 6
and 7 are fused together with the reduction in line 8. We will
discuss the implementation of views in the next section.
1 float dotProduct(const vector <float >& a, 2 const
vector <float >& b) { 3 auto mult = []( auto p){ 4 return
get <0>(p) * get <1>(p); }; 5 6 return
view::zip(gpu::copy(a), gpu::copy(b)) 7 | view:: transform(mult) 8
| gpu:: reduce (0.0f); }
Listing 6: GPU dot product using composable patterns.
3.4 Summary In our API, ranges combined with GPU algorithms and GPU
containers enable a natural way to program GPUs in C++ similar to
the programming approach widely known from the STL. To achieve
composability, programs are written by combining small and
simple-to-understand patterns which greatly simplify programming as
compared to traditional low-level programming approaches like CUDA.
Next we discuss the key idea for achieving high performance in our
composable approach: guaranteed kernel fusion using views.
4. Eager Actions and Lazy Views The range-v3 library introduces two
new consructs to the STL — actions and views — which enhance the
compos- ability. We exploit views and actions for GPU programming:
by using them programmers can control the fusion of com- putations
expressed by patterns into a single GPU kernel.
4.1 Actions Actions perform (potentially mutating) in-place
operations on containers, i.e., actions do not require an
externally pro- vided output container, such as the temporary
vector tmp
in Listing 4. Actions are implemented with the STL algo- rithms
and, therefore, we implemented corresponding ac- tions for our
GPU-enabled algorithms. Actions return a ref- erence to the
modified container and, therefore, compose nicely with other
actions.
4.2 Views Views are the counterpart to actions and describe non-
mutating operations on ranges. Views, like actions, compose nicely
and are designed to be used together with each other and with the
algorithms described earlier.
An example for a view is view::transform(mult) in List- ing 6. It
applies a given function (in this case mult) to its input range.
When executed on the CPU, this computation is not performed eagerly
by writing the computed result to some (temporary) memory location,
but rather an object (the view) is created which behaves like a
range and performs the computation lazily, i.e., on-demand once an
element in the range is requested. The view object holds references
to its input range and the function to be called. When the view is
iterated over, it evaluates the requested elements. Views are
implemented as first class objects which can be stored in variables
and passed to and returned from functions.
When views are composed with each other or with algo- rithms, they
are evaluated only when the finally computed range is accessed. For
example, in Listing 6 the zip and transform views are composed with
the gpu::reduce algo- rithm. The implementation of reduce iterates
over the input range to sum up all of its elements. Inside of this
iteration, the pairwise multiplication expressed by the two views
is performed, i.e., it is automatically fused by our API imple-
mentation into the implementation of the reduce algorithm. When we
compile this code for the GPU, as described in Section 5 this API
design ensures that only a single GPU ker- nel is emitted which
performs the zip and transform compu- tations inside of the
iteration code of the reduce algorithm.
Together with our GPU-enabled algorithms, this guaran- teed
behavior of the views let programmers reason precisely about the
cost of operations and the number of GPU kernels launched. Views
allow to write composable and elegant code without paying a
performance penalty, as we will see in our experimental
evaluation.
4.3 Provided Views The range-v3 library offers currently over 40
views (e.g., filter or generate) which can be used together with
our GPU algorithms. This greatly enhances the flexibility of our
GPU programming approach. Interestingly, some views such as repeat
represent infinite ranges; the take and take_while
views can be used to limit such infinite ranges. In our ap- proach
kernel fusion is directly tied to the available views. The
possibility to fuse two kernels corresponds to the ability to
express a computation as a view which can be performed lazily and,
therefore, folded into another computation.
The benefit of reusing the existing views from the range-v3
library is ensured by our LLVM-based GPU code generator which
compiles arbitrary C++ code for the GPU. There are only minor not
supported exemptions, such as virtual func- tions or exceptions
which are not used in the STL.
4 2016/12/16
• Actions are (usually) mutating
9
float dotProduct(const vector<float>& a, const
vector<float>& b) { auto mult = [](auto p) { return
get<0>(p) * get<1>(p); };
gpu!::vector<pair<float, float!>> pairs(a.size()); for
(auto i = 0; i < a.size(); !++i) pairs[i] = make_pair(a[i],
b[i]);
return pairs | gpu!::action!::transform(mult) | gpu!::reduce(0.0f);
}
Choice of Kernel Fusion
10
• Choice between views and actions/algorithms is choice for or
against kernel fusion
• Simple cost model: Every action/algorithm results in a
Kernel
• Programmer is in control! Fusion is guaranteed.
Available for free: Views provided by range-v3
• group_by • indirect • intersperse • ints • iota • join • keys •
move • partial_sum • remove_if • repeat • repeat_n • replace •
replace_if • reverse
• adjacent_filter • adjacent_remove_if • all • bounded • chunk •
concat • const_ • counted • delimit • drop • drop_exactly •
drop_while • empty • generate • generate_n
https:!//ericniebler.github.io/range-v3/index.html#range-views
• single • slice • split • stride • tail • take • take_exactly •
take_while • tokenize • transform • unbounded • unique • values •
zip • zip_with
Code Generation via PACXX
12
4.4 GPU evaluation of views Some computations can be expressed in
our API only using views and without ever using an action or
algorithm. In such cases, the evaluation will happen implicitly
when the final view is iterated over, e.g., when printing the
result or copying it into a data container. By default, this
evaluation happens sequentially on the CPU. To allow programmers to
perform such computations on the GPU, we implemented a variation of
the C++ standard async function. Our gpu::async function takes an
arbitrary function whose return value is fixed to be a gpu::vector
and, therefore, if in the function implementa- tion a view is
returned then it is implicitly evaluated on the GPU in parallel and
the result is written into a gpu::vector.
1 auto saxpy(float A, const vector <float >& X, 2 const
vector <float >& Y) { 3 return gpu:: async ([=]( auto a,
const auto& x, 4 const auto& y){ 5 auto ax =
view::zip(view:: repeat(a), x) 6 | view:: transform(mult); 7 return
view::zip(ax,y) 8 | view:: transform(plus);}, A, X, Y) ;}
Listing 7: saxpy on GPU using views and gpu::async.
Listing 7 shows an implementation of the saxpy com- putation which
only uses views and no algorithms or ac- tions. Our gpu::async
function is used in line 3 to evalu- ate the views on the GPU.
First, a x is computed where a is a scalar and x is a vector,
before the result is added to the vector y. We use four views
(twice zip and twice transform) to describe the computation which
is nested in- side a gpu::async. All views are fused together
during code generation and evaluated in a single GPU kernel which
pro- duces a gpu::vector. The gpu::async function ensures that all
input containers are automatically copied to gpu::vector
objects if they are not already of that type. This ensures that all
data used in gpu::async is available on the GPU.
4.5 Summary With the support of actions and views, our GPU
extension of the range-v3 library encourages a compositional
program- ming style. We make use of this style for GPU programming
by combining actions and views with our GPU-enabled al- gorithms
and providing a natural way to evaluate lazy views in parallel on
the GPU via gpu::async. The views provide a simple mechanism for
the programmer to express programs compositionally while being
guaranteed that the views are fused together into a single GPU
kernel. We will now look at our GPU code generation and discuss
optimizations ap- plied when generating GPU kernel code.
5. Code Generation and Optimization Our API with GPU-enabled
algorithms and containers is implemented as a C++ library compiled
using PACXX [8] — our LLVM-based compiler generating GPU code from
C++ code. In this section, we briefly describe the general
design of PACXX and then we focus on a particular feature, called
multi-staging, which allows us to optimize the GPU code based on
values only known at host runtime.
5.1 Overview of PACXX PACXX transforms C++ code using a combination
of offline and online compilation to a representation executable on
different kinds of GPUs: PTX on Nvidia GPUs, and SPIR on AMD
GPUs.
Executable PACXX Runtime
CUDA Runtime
Nvidia GPU
template< class ForwardIt, class T > void fill(ForwardIt
first, ForwardIt last, const T& value) { for (; first != last;
++first) { *first = value; } }
C++
Figure 1: Key components of PACXX.
Figure 1 shows an overview of the PACXX implementa- tion which
comprises two main components:
1) The PACXX Offline Compiler based on the open-source Clang
compiler, and
2) The PACXX Runtime library consisting of a just-in-time compiler
and a GPU execution runtime.
Correspondingly, C++ code is compiled by PACXX in two stages: 1)
the offline compilation stage separates GPU and CPU code and
prepares the executable for the PACXX runtime, 2) the online
compilation stage just-in-time com- piles the GPU code during
program execution using our LLVM-based online compiler in the PACXX
runtime library.
Stage 1: Offline Compilation In PACXX, code to be ex- ecuted on a
GPU is written as a pacxx::kernel function supported by PACXX.
Based on this kernel function, the PACXX offline compiler marks the
code executed on the GPU by annotating every function called from
inside the kernel function with a PACXX-specific C++ 11 generalized
attribute. Using generalized attributes of C++ has the advan- tage
that the code remains valid C++ and other compilers have the
freedom to ignore PACXX custom annotations.
After the annotations are added, two passes are per- formed: the
first pass prepares the GPU kernel generation at runtime, and the
second pass compiles the CPU program.
In the kernel compilation pass, the program’s abstract syntax tree
(AST) is lowered to the LLVM intermediate representation (IR), and
functions with the PACXX-specific attribute are marked as kernel
code in the IR. Then the following steps transform and optimize the
IR:
1) dead code elimination removes all IR nodes beside the code
reachable from the kernel;
5 2016/12/16
• We use PACXX to compile the extended C++ range-v3 library
implementation to GPU code
• Similar implementation possible with SYCL
Evaluation Sum and Dot Product
13
Performance Results We evaluate each benchmark with 11 input sizes
of single precision floating point values. Fig- ures 2 and 3 show
the performance results compared to the optimized CUDA code. The
results for the vector addition and saxpy benchmark (Figure 2) show
that our API imple- mented using PACXX achieves equal performance
to the low-level CUDA kernels. The transform implementation in
Thrust uses 1024 threads per block and launches a fixed number of
blocks. Using this strategy, each thread computes multiple elements
and Thrust achieves significantly better performance for the
2
17 input size of the saxpy benchmark.
Figure 2: Speedup of our approach and Thrust compared to CUDA for
the vector addition and saxpy benchmark.
Figure 3 shows the performance of our approach for dot product and
sum in comparison to Thrust and CUDA. All three approaches follow
different implementation strategies for these benchmarks. The CUDA
version from Nvidia’s SDK [14] uses a tree-based reduction in local
memory and avoids synchronization inside a warp. The Thrust
reduction algorithm computes the result in two GPU kernels while
the PACXX and the CUDA versions execute only one GPU kernel and
finish the computation on the CPU. Using two kernels with Thrust
yields a significant overhead for smaller input sizes, which is
clearly visible in the graph. The multi- staging optimized
gpu::reduce implementation in PACXX clearly outperforms the Thrust
reduction and even beats the low-level CUDA code for larger input
sizes.
0
0.2
0.4
0.6
0.8
1
1.2
Sp ee
du p
Input Size
CUDA Dot/Sum Thrust Dot PACXX Dot Thrust Sum PACXX Sum
Figure 3: Speedup of our approach and Thrust compared to CUDA for
the dot product and sum benchmarks.
Summary This section shows that our composable pro- gramming style
using views does not introduce a perfor- mance overhead. In
contrary, our optimized GPU-enabled
algorithms outperform the corresponding low-level CUDA and Thrust
code. Even the specialized inner_product im- plementation of Thrust
used in the dot product benchmark is outperformed by our composable
implementation where views are fused with our generic reduction
algorithm.
6.3 Performance Impact of Multi-Staging We continue our evaluation
by studying the multi-stage op- timization applied in PACXX. As
described in Section 5, multi-staging is used in our gpu::reduce
algorithm to spe- cialize the generated GPU code at runtime for the
particular input size, which enables aggressive
loop-unrolling.
Figure 4: Performance improvement via multi-stage (MS) optimization
for the dot product and sum benchmarks.
Figure 4 shows the relative performance improvement when applying
multi-staging in the gpu::reduce algorithm, for the dot product
benchmark shown in Listing 6 and for the sum benchmark that adds up
all elements of an array. We observe that multi-staging improves
performance by up to 1.35 depending on the input size. For input
sizes larger than 2
24, the performance advantage declines, as the number of threads in
a block is increased and, therefore, the num- ber of loop
iterations executed by each individual thread de- creases: the
effect of unrolling the loop becomes less visible. The improvements
for the dot product are lower, because the kernel performs more
loads from the global memory which dominate the kernel runtime,
making the loop unrolling a less important performance
factor.
6.4 Just-in-time Compilation Overhead PACXX performs an offline and
an online compilation step. In this section, we study the overhead
introduced at runtime by the PACXX JIT compiler by comparing the
compilation times of PACXX with the JIT compiler library nvtrc
recently introduced for CUDA.
Figure 5 shows the compilation time for the sum and the dot product
benchmarks. We compare PACXX with nvrtc from CUDA 7.5 and 8.0 RC.
The compilation of PACXX is 15 times (for dot product) and almost
20 times (for sum) faster as compared to CUDA 7.5, and 9-12 times
faster com- pared to CUDA 8.0 RC. This is because the PACXX offline
compiler prepares the GPU code generation and, therefore, the PACXX
JIT compiler operates directly on the LLVM IR.
8 2016/12/16
Multi-Staging in PACXX
• Implementation of gpu!::reduce !=>
2) function calls are inlined into the kernel functions;
3) -O3 equivalent optimizations are performed;
4) the IR is embedded in the executable.
At runtime the prepared IR is loaded, then JIT-compiled and
optimized for the target GPU before execution.
In the host compilation pass, the PACXX offline com- piler lowers
the AST to LLVM IR a second time, but this time replacing calls to
kernel functions with PACXX run- time calls which manage data
transfers and launch the corre- sponding kernel. The compilation of
the host C++ program is performed as usual, by generating an
executable which is statically linked against the PACXX runtime
library.
Stage 2: Online Compilation During program execution, the PACXX
runtime loads the IR previously embedded in the executable by the
PACXX offline compiler, performs addi- tional, GPU-specific
optimizations, such as loop-unrolling and rearranging load
instructions. Finally, the IR is com- piled to GPU code using one
of the two LLVM back-ends: PTX [15] together with the CUDA runtime
library when targeting Nvidia GPUs, and SPIR [7] for GPUs with an
OpenCL implementation (e.g., from AMD or Intel).
5.2 GPU Algorithm Implementations We implemented three GPU-enabled
STL algorithms to be used together with the range-v3 library. We
are discussing their efficient implementations for Nvidia GPUs
here. This implementation is not portable, but we intend to
incorporate recent related work in the future which uses a
functional techniques to generate efficient GPU implementations
from portable pattern-based representations [20].
Transform and For_Each The gpu::transform algorithm applies a given
function in parallel to every element of its input range and stores
the produced results in the output range. The gpu::for_each
algorithm is a special case of transform, where its given function
does not produce a result directly but is rather executed for its
side effects. Therefore, we concentrate on the implementation of
gpu::transform.
1 auto kernel = pacxx :: kernel( 2 [func](auto in , auto out ,
size_t size) { 3 auto id = Thread ::get().global; 4 if (id.x >=
size) return; 5 *(out + id.x) = func (*(in + id.x)); 6 },{{(
distance + 127) / 128}, {128}}); 7 8 kernel(in.begin(), out.begin
(), size);
Listing 8: Implementation of gpu::transform.
Listing 8 shows the implementation of gpu::transform for an Nvidia
GPU. For each element of the input range, a GPU thread is launched
which: a) loads one element form the global memory, b) applies the
given function (func) to the element, and c) stores the result back
to the global memory of the GPU. Our implementation configures the
underlying
kernel to use 128 threads per block. This is a platform- specific
choice, and auto-tuning [4] or similar techniques can be used to
pick appropriate values.
Reduce To efficiently implement the gpu::reduce algo- rithm, we
make use of the multi-stage programming support in PACXX [8] for
embedding values known at the runtime of the host program into the
GPU code. This enables ad- ditional optimization opportunities like
aggressive loop un- rolling. PACXX provides a dedicated function
(stage) which evaluates expressions prior to the GPU kernel
execution and the computed values are then automatically embedded
into the GPU program.
1 template <class InRng , class T, class Fun > 2 auto
reduce(InRng&& in, T init , Fun&& fun) { 3 // 1.
preparation of kernel call 4 ... 5 // 2. create GPU kernel 6 auto
kernel = pacxx :: kernel( 7 [fun](auto&& in ,
auto&& out , 8 int size , auto init) { 9 // 2a. stage
elements per thread
10 int ept = stage(size / glbSize); 11 // 2b. start reduction
computation 12 auto sum = init; 13 for (int x = 0; x < ept; ++x)
{ 14 sum = fun(sum , *(in + gid)); 15 gid += glbSize; } 16 // 2c.
perform reduction in shared memory 17 ... 18 // 2d. write result
back 19 if (lid = 0) *(out + bid) = shared [0]; 20 }, glbSize ,
lclSize); 21 // 3. execute kernel 22 kernel(in, out , distance(in),
init); 23 // 4. finish reduction on the CPU 24 return std::
accumulate(out , init , fun); }
Listing 9: Implementation sketch of the gpu::reduce
algorithm making use of multi-staging.
Listing 9 shows a sketch of the implementation of the gpu::reduce
algorithm in PACXX. For brevity and clarity we concentrate on the
parts relevant for the multi-staging op- timization. After some
preparations, a GPU kernel is created in line 6 using the
PACXX-provided kernel function. The following lambda expression
contains the code executed on the GPU which calls the reduction
operation fun in line 14. In line 10 we make use of the stage
function: it indicates that the expression passed as its argument
is evaluated on the CPU prior to the kernel call and that the
evaluated value is embedded into the GPU code as a compile-time
constant. Here this value is the number of elements processed per
thread which is used as the upper bound in the following loop in
line 13. As a consequence of the staging, this loop can be
completely unrolled, since the upper bound is now known at the GPU
compilation time, which results in a sig- nificant performance gain
as we will see in the evaluation section. The kernel continues with
a reduction performed in shared memory, synchronized across all
threads in the
6 2016/12/16
Performance Impact of Multi-Staging
Performance Results We evaluate each benchmark with 11 input sizes
of single precision floating point values. Fig- ures 2 and 3 show
the performance results compared to the optimized CUDA code. The
results for the vector addition and saxpy benchmark (Figure 2) show
that our API imple- mented using PACXX achieves equal performance
to the low-level CUDA kernels. The transform implementation in
Thrust uses 1024 threads per block and launches a fixed number of
blocks. Using this strategy, each thread computes multiple elements
and Thrust achieves significantly better performance for the
2
17 input size of the saxpy benchmark.
Figure 2: Speedup of our approach and Thrust compared to CUDA for
the vector addition and saxpy benchmark.
Figure 3 shows the performance of our approach for dot product and
sum in comparison to Thrust and CUDA. All three approaches follow
different implementation strategies for these benchmarks. The CUDA
version from Nvidia’s SDK [14] uses a tree-based reduction in local
memory and avoids synchronization inside a warp. The Thrust
reduction algorithm computes the result in two GPU kernels while
the PACXX and the CUDA versions execute only one GPU kernel and
finish the computation on the CPU. Using two kernels with Thrust
yields a significant overhead for smaller input sizes, which is
clearly visible in the graph. The multi- staging optimized
gpu::reduce implementation in PACXX clearly outperforms the Thrust
reduction and even beats the low-level CUDA code for larger input
sizes.
Figure 3: Speedup of our approach and Thrust compared to CUDA for
the dot product and sum benchmarks.
Summary This section shows that our composable pro- gramming style
using views does not introduce a perfor- mance overhead. In
contrary, our optimized GPU-enabled
algorithms outperform the corresponding low-level CUDA and Thrust
code. Even the specialized inner_product im- plementation of Thrust
used in the dot product benchmark is outperformed by our composable
implementation where views are fused with our generic reduction
algorithm.
6.3 Performance Impact of Multi-Staging We continue our evaluation
by studying the multi-stage op- timization applied in PACXX. As
described in Section 5, multi-staging is used in our gpu::reduce
algorithm to spe- cialize the generated GPU code at runtime for the
particular input size, which enables aggressive
loop-unrolling.
0.9
0.95
1
1.05
1.1
1.15
1.2
1.25
1.3
1.35
1.4
Sp ee
du p
Input Size
Dot Sum Dot +MS Sum +MS
Figure 4: Performance improvement via multi-stage (MS) optimization
for the dot product and sum benchmarks.
Figure 4 shows the relative performance improvement when applying
multi-staging in the gpu::reduce algorithm, for the dot product
benchmark shown in Listing 6 and for the sum benchmark that adds up
all elements of an array. We observe that multi-staging improves
performance by up to 1.35 depending on the input size. For input
sizes larger than 2
24, the performance advantage declines, as the number of threads in
a block is increased and, therefore, the num- ber of loop
iterations executed by each individual thread de- creases: the
effect of unrolling the loop becomes less visible. The improvements
for the dot product are lower, because the kernel performs more
loads from the global memory which dominate the kernel runtime,
making the loop unrolling a less important performance
factor.
6.4 Just-in-time Compilation Overhead PACXX performs an offline and
an online compilation step. In this section, we study the overhead
introduced at runtime by the PACXX JIT compiler by comparing the
compilation times of PACXX with the JIT compiler library nvtrc
recently introduced for CUDA.
Figure 5 shows the compilation time for the sum and the dot product
benchmarks. We compare PACXX with nvrtc from CUDA 7.5 and 8.0 RC.
The compilation of PACXX is 15 times (for dot product) and almost
20 times (for sum) faster as compared to CUDA 7.5, and 9-12 times
faster com- pared to CUDA 8.0 RC. This is because the PACXX offline
compiler prepares the GPU code generation and, therefore, the PACXX
JIT compiler operates directly on the LLVM IR.
8 2016/12/16
• GPU Programming with universal composable patterns
• Views vs. Actions/Algorithms determine kernel fusion
• Kernel fusion for views guaranteed !=> Programmer in
control
• Competitive performance vs. CUDA and specialized Thrust
code
• Multi-Staging optimization gives up to 1.35 improvement
16
Towards Composable GPU Programming: Programming GPUs with Eager
Actions and Lazy Views
Michael Haidl · Michel Steuwer · Hendrik Dirks Tim Humernbrum ·
Sergei Gorlatch
Questions?