April 4-7, 2016 | Silicon Valley
Steven Dalton, April 6th
ADVANCED THRUST PROGRAMMING WITH EXECUTION POLICIES
PITCH
Execution policies are:
Extremely important and a core design feature of Thrust
Not well understood or widely used
An effective mechanism for providing library extensibility
Useful for small applications, necessary for libraries built around Thrust
FUSED OPERATIONS
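// Unfused: two passes over memory (transform, then reduce)
thrust::device_vector<int> vec(N, 1);
thrust::transform(vec.begin(), vec.end(),
                  vec.begin(),
                  thrust::negate<int>());
thrust::reduce(vec.begin(), vec.end(),
               int(0),  // initial value; required when a binary op is passed
               thrust::plus<int>());

// Fused: a single pass with transform_reduce
thrust::device_vector<int> vec(N, 1);
thrust::transform_reduce(vec.begin(), vec.end(),
                         thrust::negate<int>(),
                         int(0),
                         thrust::plus<int>());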
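The effect of fusion can be sketched on the host with standard-library algorithms. This is only an analogy, not Thrust code: the function names negate_then_sum, fused_negate_sum and check_equivalence are made up for illustration, and std::accumulate with a lambda stands in for thrust::transform_reduce.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// Two passes: negate every element in place, then sum the results.
// Takes the vector by value so the caller's data is left untouched.
int negate_then_sum(std::vector<int> v) {
    std::transform(v.begin(), v.end(), v.begin(), std::negate<int>());
    return std::accumulate(v.begin(), v.end(), 0, std::plus<int>());
}

// One pass: each element is negated as it is folded into the running sum,
// so no intermediate values are written back to memory.
int fused_negate_sum(const std::vector<int>& v) {
    return std::accumulate(v.begin(), v.end(), 0,
                           [](int acc, int x) { return acc + (-x); });
}

// Both formulations must agree; only the number of passes differs.
void check_equivalence() {
    std::vector<int> ones(8, 1);
    assert(negate_then_sum(ones) == fused_negate_sum(ones));
}
```

On a GPU the saving is larger than on the host, because the unfused version launches two kernels and writes the negated values back to device memory between them.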
FANCY ITERATORS
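// Fancy iterator: the N ones are generated on the fly, no storage is allocated
thrust::reduce(thrust::make_constant_iterator(1),
               thrust::make_constant_iterator(1) + N,
               int(0),
               thrust::plus<int>());

// Same reduction over an explicitly stored vector
thrust::device_vector<int> vec(N, 1);
thrust::reduce(vec.begin(), vec.end(),
               int(0),
               thrust::plus<int>());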
SORT
#include <vector>
#include <algorithm>  // sort header
int main()
{
    std::vector<int> vec(10, …);  // data
    std::sort(                    // sort
        vec.begin(),
        vec.end());
}
THRUST SORT
#include <thrust/device_vector.h>
#include <thrust/sort.h>  // sort header
int main()
{
    thrust::device_vector<int> vec(10, …);  // data
    thrust::sort(                           // sort
        vec.begin(),
        vec.end());
}
Backend systems: CPP, OMP, CUDA
MOTIVATION
void func1(…)
{
thrust::device_vector<int> vec(10, …);
thrust::sort(
vec.begin(),
vec.end());
}
void func2(…)
{
thrust::device_vector<int> vec(10, …);
thrust::sort(
vec.begin(),
vec.end());
}
void func3(…)
{
thrust::device_vector<int> vec(10, …);
thrust::sort(
vec.begin(),
vec.end());
}
void func4(…)
{
thrust::sort(
vec.begin(),
vec.end());
}
Profiling a Thrust-based library:
several sorting calls are spread across multiple functions and files
PROFILING
Possible Thrust profiling solutions
How would you profile STL routines?
REDESIGN INTERFACE: How? What would a new thrust::sort require?
DO-IT-YOURSELF: timer t; thrust::sort(begin, end); t.elapsed_milliseconds();
INTERCEPT CALLS: LD_PRELOAD=prof_thrust.so exec_file
EXECUTION POLICIES: thrust::sort(exec, vec.begin(), vec.end());
THRUST SORT
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/system/cuda/execution_policy.h>  // policy header
int main()
{
    cudaStream_t s; cudaStreamCreate(&s);
    thrust::device_vector<int> vec(10, …);
    thrust::sort(thrust::cuda::par.on(s),  // sort with policy
                 vec.begin(),
                 vec.end());
}
WHAT?
HOW?
EXECUTION-POLICY DESIGN PATTERN
template<typename Policy, typename Iterator>
void sort(Policy& exec, Iterator begin, Iterator end)
{
// add generic sort to local context
using generic::sort;
// use ADL lookup for dispatching sort
sort(derived_cast(exec), begin, end);
}
template<typename Iterator>
void sort(Iterator begin, Iterator end)
{
// no policy specified
// use generic sort
sort(exec, begin, end);
}
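The dispatch pattern above can be reproduced in plain C++ without Thrust. The following is a minimal sketch under stated assumptions, not Thrust's actual implementation: the namespaces sketch and user, and the policies plain_policy and counting_policy, are hypothetical names chosen for illustration. The fallback lives in the same namespace as the policy base class so that argument-dependent lookup can find it, while the entry point lives one level up so that it cannot find itself.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

namespace sketch {
namespace system {

// CRTP base: a concrete policy derives from execution_policy<itself>.
template <typename Derived>
struct execution_policy {};

// Recover the concrete policy type so overload resolution can see it.
template <typename Derived>
Derived& derived_cast(execution_policy<Derived>& exec) {
    return static_cast<Derived&>(exec);
}

// Generic fallback, chosen when the policy supplies no sort of its own.
template <typename Derived, typename Iterator>
void sort(execution_policy<Derived>&, Iterator begin, Iterator end) {
    std::sort(begin, end);
}

} // namespace system

// Entry point: forwards to whichever sort best matches the concrete policy.
template <typename Derived, typename Iterator>
void sort(system::execution_policy<Derived>& exec, Iterator begin, Iterator end) {
    using system::sort;                            // keep the fallback visible
    sort(system::derived_cast(exec), begin, end);  // unqualified, so ADL applies
}

} // namespace sketch

namespace user {

// No overload provided: calls fall through to the generic sort.
struct plain_policy : sketch::system::execution_policy<plain_policy> {};

// Overload provided: calls are intercepted and counted.
struct counting_policy : sketch::system::execution_policy<counting_policy> {
    int calls = 0;
};

template <typename Iterator>
void sort(counting_policy& exec, Iterator begin, Iterator end) {
    ++exec.calls;           // user-specific work goes here
    std::sort(begin, end);  // then do the actual sort
}

} // namespace user
```

Calling sketch::sort(exec, v.begin(), v.end()) with a user::counting_policy selects user::sort, because an exact match on the policy type beats the derived-to-base conversion needed by the fallback. With a user::plain_policy the same call reaches the generic version.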
CUSTOM POLICY
// device_execution_policy is a CRTP base and takes the derived type as its argument
struct my_policy : thrust::device_execution_policy<my_policy> {};
template<typename Iterator>
void sort(my_policy, Iterator begin, Iterator end)
{
PROFILE_START; // start profiler-specific code
thrust::sort(begin, end);
PROFILE_STOP; // end profiler-specific code
}
int main()
{
thrust::device_vector<int> vec(10);
my_policy exec;
thrust::sort(exec, vec.begin(), vec.end());
}
CG
template<typename Matrix, typename Vector>
void cg(Matrix& A, Vector& x, Vector& b)
{
size_t N = A.num_rows;
Vector y(N), z(N), r(N), p(N);
multiply(A, x, y);
axpby(b, y, r, 1, -1);
while(…) {
multiply(A, p, y);
double alpha = rz / dot(y, p);
axpy(y, r, -alpha);
double rz_old = rz;
rz = dot(r,z);
double beta = rz / rz_old;
axpby(z, p, p, 1, beta);
}
}
CG PROFILE
Thrust kernel launch
GRAPPLE
int main()
{
// construct grapple policy
grapple::grapple_system exec;
// call thrust sort with grapple profiling
thrust::sort(exec, vec.begin(), vec.end());
// automatically print summary before exiting
}
Profiler for Thrust applications
In reality : Just another execution policy
Automatically intercepts all Thrust calls
NO CHANGES TO THRUST REQUIRED!
CG + GRAPPLE
template<typename Policy, typename Matrix, typename Vector>
void cg(Policy& exec, Matrix& A, Vector& x, Vector& b)
{
size_t N = A.num_rows;
Vector y(exec, N), z(exec, N), r(exec, N), p(exec, N);
multiply(exec, A, x, y);
axpby(exec, b, y, r, 1, -1);
while(…) {
multiply(exec, A, p, y);
double alpha = rz / dot(exec, y, p);
axpy(exec, y, r, -alpha);
double rz_old = rz;
rz = dot(exec, r,z);
double beta = rz / rz_old;
axpby(exec, z, p, p, 1, beta);
}
}
Pass user policy into all inner routines
Implemented using thrust::inner_product
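Why the policy must be threaded through every inner routine can be illustrated with a small host-only sketch. Everything here is hypothetical and chosen for illustration: tracing_policy, scoped_call and scaled_update_norm2 are not part of Thrust or Grapple, and std::inner_product stands in for thrust::inner_product. Because the outer routine hands the caller's policy to axpy and dot, their calls are recorded one nesting level below it.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

// Hypothetical tracing policy: records each call with its nesting depth.
struct tracing_policy {
    int depth = 0;
    std::vector<std::pair<int, std::string>> log;
};

// RAII helper: logs the call at the current depth, restores depth on exit.
struct scoped_call {
    tracing_policy& exec;
    scoped_call(tracing_policy& e, const std::string& name) : exec(e) {
        exec.log.emplace_back(exec.depth, name);
        ++exec.depth;
    }
    ~scoped_call() { --exec.depth; }
};

// Dot product; std::inner_product stands in for thrust::inner_product.
double dot(tracing_policy& exec,
           const std::vector<double>& x, const std::vector<double>& y) {
    scoped_call call(exec, "dot");
    return std::inner_product(x.begin(), x.end(), y.begin(), 0.0);
}

// y <- y + alpha * x
void axpy(tracing_policy& exec,
          const std::vector<double>& x, std::vector<double>& y, double alpha) {
    scoped_call call(exec, "axpy");
    for (std::size_t i = 0; i < x.size(); ++i) y[i] += alpha * x[i];
}

// Outer routine: forwards the caller's policy to every inner routine.
double scaled_update_norm2(tracing_policy& exec,
                           const std::vector<double>& x,
                           std::vector<double>& y, double alpha) {
    scoped_call call(exec, "outer");
    axpy(exec, x, y, alpha);
    return dot(exec, y, y);
}
```

If an inner routine ignored the policy it was given and used a default one instead, its work would still run but would vanish from the log, which is why every call in the loop takes exec.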
CG + GRAPPLE PROFILE
GeForce GTX TITAN : 875.500 Mhz (Ordinal 0)
14 SMs enabled. Compute Capability sm_35
FreeMem: 5868MB TotalMem: 6143MB 64-bit pointers.
Mem Clock: 3004.000 Mhz x 384 bits (288.4 GB/s)
ECC Disabled
CUDA v7.0
PTX Version : sm_30
GCC v4.8.2
Thrust v1.8.2
[ 0][cuda] krylov::cg : 10.543 (ms), allocated : 1000000 bytes
[ 1][cuda] multiply : 3.57744 (ms), allocated : 1748000 bytes
[ 2][cuda] offsets_to_indices : 1.14803 (ms), allocated : 0 bytes
[ 3][cuda] fill : 0.050848 (ms), allocated : 0 bytes
[ 4][cuda] scatter_if : 0.056288 (ms), allocated : 0 bytes
[ 5][cuda] inclusive_scan : 1.00077 (ms),
GRAPPLE FEATURES
Interface level performance profiling
Annotation of memory usage
Stack frame reference for function calls
Execution-system-oriented annotation (e.g., cpp, omp, cuda, …)
Extensible registration system
Single stepping
Runtime data inspection, pre- and post-checking
(Some In Progress)
GRAPPLE (HIGH-LEVEL)
template<typename Iterator>
void sort(grapple_system& exec, Iterator begin, Iterator end)
{
// mark beginning of grapple sort call
exec.start(SORT);
// cast grapple to system specific policy
sort(exec.policy(begin), begin, end);
// mark ending of grapple sort call
exec.stop();
}
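The start/stop bracketing shown above can be sketched in standard C++. This is a hypothetical stand-in, not Grapple's real class: profiling_system, its record type and the "sort" label are assumptions made for illustration, and std::sort takes the place of forwarding to the underlying system's sort.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <string>
#include <utility>
#include <vector>

// Hypothetical profiling policy: times whatever runs between start() and stop().
class profiling_system {
    using clock_type = std::chrono::steady_clock;

public:
    struct record {
        std::string name;
        double milliseconds;
    };

    // Mark the beginning of a named call.
    void start(const std::string& name) {
        open_.emplace_back(name, clock_type::now());
    }

    // Mark the end of the most recently started call and store its duration.
    void stop() {
        assert(!open_.empty());
        const std::pair<std::string, clock_type::time_point> entry = open_.back();
        open_.pop_back();
        const std::chrono::duration<double, std::milli> elapsed =
            clock_type::now() - entry.second;
        records_.push_back(record{entry.first, elapsed.count()});
    }

    const std::vector<record>& records() const { return records_; }

private:
    std::vector<std::pair<std::string, clock_type::time_point>> open_;
    std::vector<record> records_;
};

// Overload selected when the policy is a profiling_system.
template <typename Iterator>
void sort(profiling_system& exec, Iterator begin, Iterator end) {
    exec.start("sort");
    std::sort(begin, end);  // stand-in for the system-specific sort
    exec.stop();
}
```

Keeping open calls on a stack lets nested routines be timed independently. A real profiler for GPU code would also need to synchronize with the device before reading the clock, which this host-only sketch leaves out.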
C++ STANDARD
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/system/cuda/execution_policy.h>
int main()
{
    cudaStream_t s; cudaStreamCreate(&s);
    thrust::device_vector<int> vec(10, …);
    thrust::sort(thrust::cuda::par.on(s),
                 vec.begin(),
                 vec.end());
}
Parallelism TS accepted as part of C++17
C++ STANDARD
#include <vector>
#include <algorithm>
#include <thrust/system/cuda/execution_policy.h>
int main()
{
    cudaStream_t s; cudaStreamCreate(&s);
    std::vector<int,uvm_allocator> vec(10, …);
    std::sort(thrust::cuda::par.on(s),
              vec.begin(),
              vec.end());
}
Parallelism TS accepted as part of C++17
Could look like…
THANK YOU
JOIN THE NVIDIA DEVELOPER PROGRAM AT developer.nvidia.com/join