Thrust-surface geometry implications for thrust-belt evolution and section-balancin
Thrust & Curand
description
Transcript of Thrust & Curand
Thrust & Curand
Dr. Bo YuanE-mail: [email protected]
2
Outline
• Thrust Basic
• Work with Thrust
• General Types of Thrust
• Fancy Iterators of Thrust
• Curand Library
3
Thrust Basic
• What is Thrust?
• Containers
• Iterators
• Interaction with raw pointers
• Namespace
4
What is Thrust?
• C++ template library for CUDA– Mimics Standard Template Library (STL)
• Containers– thrust::host_vector<T>– thrust::device_vector<T>
• Algorithms– thrust::sort()– thrust::reduce()– thrust::inclusive_scan()– …
5
Containers
• Make common operations concise and readable – Hides cudaMalloc, cudaMemcpy and cudaFree
// allocate host vector with two elements thrust::host_vector<int> h_vec(2);• // copy host vector to device thrust::device_vector<int> d_vec = h_vec;
// manipulate device values from the host d_vec[0] = 13; d_vec[1] = 27;• std::cout << "sum: " << d_vec[0] + d_vec[1] << std::endl;
// vector memory automatically released w/ free() or cudaFree()
6
Iterators• Sequences defined by pair of iterators
// allocate device vectorthrust::device_vector<int> d_vec(4);
// returns iterator at first element of d_vecd_vec.begin();
// returns iterator one past the last element of d_vecd_vec.end()
// [begin, end) pair defines a sequence of 4 elements
d_vec.begin() d_vec.end()
7
Iterators
// initialize random values on host thrust::host_vector<int> h_vec(1000);thrust::generate(h_vec.begin(), h_vec.end(), rand);
// copy values to devicethrust::device_vector<int> d_vec = h_vec;
// compute sum on hostint h_sum = thrust::reduce(h_vec.begin(),h_vec.end());
// compute sum on deviceint d_sum = thrust::reduce(d_vec.begin(),d_vec.end());
8
Convertible to Raw Pointers
// allocate device vectorthrust::device_vector<int> d_vec(4);
// obtain raw pointer to device vector’s memoryint * ptr = thrust::raw_pointer_cast(&d_vec[0]);
// use ptr in a CUDA C kernelmy_kernel<<<N/256, 256>>>(N, ptr);
// Note: ptr cannot be dereferenced on the host!
9
Wrap Raw Pointers with device_ptr
// raw pointer to device memoryint * raw_ptr;cudaMalloc((void **) &raw_ptr, N * sizeof(int));
// wrap raw pointer with a device_ptr thrust::device_ptr<int> dev_ptr(raw_ptr);
// use device_ptr in thrust algorithmsthrust::fill(dev_ptr, dev_ptr + N, (int) 0);
// access device memory through device_ptrdev_ptr[0] = 1;
10
Namespaces
• C++ supports namespaces– Thrust uses thrust namespace
• thrust::device_vector• thrust::copy
– STL uses std namespace• std::vector• std::list
• Avoids collisions– thrust::sort()– std::sort()
• For brevity– using namespace thrust;
11
Usage of Thrust:
• Thrust provides many standard algorithms– Transformations– Reductions– Prefix Sums– Sorting
• Generic definitions– General Types
• Built-in types (int, float, …)• User-defined structures
– General Operators• reduce with plus operator• scan with maximum operator
12
Transformations// allocate three device_vectors with 10 elementsthrust::device_vector<int> X(10);thrust::device_vector<int> Y(10);thrust::device_vector<int> Z(10);// initialize X to 0,1,2,3, ....thrust::sequence(X.begin(), X.end());// compute Y = -Xthrust::transform(X.begin(), X.end(), Y.begin(), thrust::negate<int>());// fill Z with twosthrust::fill(Z.begin(), Z.end(), 2);// compute Y = X mod 2thrust::transform(X.begin(), X.end(), Z.begin(), Y.begin(),thrust::modulus<int>());// replace all the ones in Y with tensthrust::replace(Y.begin(), Y.end(), 1, 10);
13
Reductionsint sum = thrust::reduce(D.begin(), D.end(), (int) 0,
thrust::plus<int>());
int sum = thrust::reduce(D.begin(), D.end(), (int) 0);
int sum = thrust::reduce(D.begin(), D.end())
thrust::count_ifthrust::min_elementthrust::max_elementthrust::is_sortedthrust::inner_product
14
Sorting
• thrust::sort
• thrust::stable_sort
• thrust::sort_by_key
• thrust::stable_sort_by_key
15
General Types and Operators
#include <thrust/reduce.h>// declare storagedevice_vector<int> i_vec = ... device_vector<float> f_vec = ... // sum of integers (equivalent calls)reduce(i_vec.begin(), i_vec.end());reduce(i_vec.begin(), i_vec.end(), 0, plus<int>());// sum of floats (equivalent calls)reduce(f_vec.begin(), f_vec.end());reduce(f_vec.begin(), f_vec.end(), 0.0f, plus<float>());// maximum of integersreduce(i_vec.begin(), i_vec.end(), 0,
maximum<int>());
16
General Types and Operatorsstruct negate_float2{ __host__ __device__ float2 operator()(float2 a) { return make_float2(-a.x, -a.y); }};// declare storagedevice_vector<float2> input = ... device_vector<float2> output = ...// create functornegate_float2 func;// negate vectorstransform(input.begin(), input.end(), output.begin(),
func);
17
General Types and Operators
// compare x component of two float2 structuresstruct compare_float2{ __host__ __device__ bool operator()(float2 a, float2 b) { return a.x < b.x; }};// declare storagedevice_vector<float2> vec = ...// create comparison functorcompare_float2 comp;// sort elements by x componentsort(vec.begin(), vec.end(), comp);
18
Advanced Thrust• Behave like “normal” iterators
– Algorithms don't know the difference
– constant_iterator– counting_iterator– transform_iterator– zip_iterator
19
Constant // create iteratorsconstant_iterator<int> begin(10);constant_iterator<int> end = begin + 3;
begin[0] // returns 10begin[1] // returns 10begin[100] // returns 10
// sum of [begin, end)reduce(begin, end); // returns 30 (i.e. 3 * 10)
20
Counting// create iteratorscounting_iterator<int> begin(10);counting_iterator<int> end = begin + 3;
begin[0] // returns 10begin[1] // returns 11begin[100] // returns 110
// sum of [begin, end)reduce(begin, end); // returns 33 (i.e. 10 + 11 + 12)
21
Transform– Yields a transformed sequence– Facilitates kernel fusion– Conserves memory capacity and bandwidth
X Y Z
F( x ) F( x ) F( y ) F( z )
22
Transform// initialize vectordevice_vector<int> vec(3);vec[0] = 10; vec[1] = 20; vec[2] = 30;// create iterator (type omitted)begin = make_transform_iterator(vec.begin(), negate<int>());end = make_transform_iterator(vec.end(), negate<int>());
begin[0] // returns -10begin[1] // returns -20begin[2] // returns -30
// sum of [begin, end)reduce(begin, end); // returns -60 (i.e. -10 + -20 + -30)
23
Zip– Looks like an array of structs (AoS)– Stored in structure of arrays (SoA)
24
Zip// initialize vectorsdevice_vector<int> A(3);device_vector<char> B(3);A[0] = 10; A[1] = 20; A[2] = 30;B[0] = ‘x’; B[1] = ‘y’; B[2] = ‘z’;// create iterator (type omitted)begin = make_zip_iterator(make_tuple(A.begin(),
B.begin()));end = make_zip_iterator(make_tuple(A.end(), B.end()));
begin[0] // returns tuple(10, ‘x’)begin[1] // returns tuple(20, ‘y’)begin[2] // returns tuple(30, ‘z’)// maximum of [begin, end)maximum< tuple<int,char> > binary_op;reduce(begin, end, begin[0], binary_op); // returns
tuple(30, ‘z’)
25
Curand Library• # include <curand.h>
• curandStatus_t curandGenerate(curandGenerator_t generator, unsigned int *outputPtr, size_t num)
• A given state may not be used by more than one block.
• A given block may generate random numbers using multiple states.
26
Distribution__device__ floatcurand_uniform (curandState_t *state)
__device__ floatcurand_normal (curandState_t *state)
__device__ unsigned int curand_poisson (curandState_t *state, double lambda)
__device__ double curand_uniform_double (curandState_t *state)
__device__ float2 curand_normal2 (curandState_t *state)
27
struct estimate_pi{ public thrust::unary_function<unsigned int, float>{ __device__ float operator()(unsigned int
thread_id) {float sum = 0;unsigned int N = 10000; // samples per threadunsigned int seed = thread_id;curandState s;// seed a random number generatorcurand_init(seed, 0, 0, &s);
// take N samples in a quarter circle for(unsigned int i = 0; i < N; ++i) {
Estimating π
28
// draw a sample from the unit squarefloat x = curand_uniform(&s);float y = curand_uniform(&s);float dist = sqrtf(x*x + y*y); // measure distance from the origin// add 1.0f if (u0,u1) is inside the quarter circleif(dist <= 1.0f)sum += 1.0f; }// multiply by 4 to get the area of the whole circlesum *= 4.0f;// divide by Nreturn sum / N; }};
Estimating π
29
int main(void) { // use 30K independent seeds int M = 30000; float estimate = thrust::transform_reduce
( thrust::counting_iterator<int>(0),thrust::counting_iterator<int>(M), estimate_pi(), 0.0f, thrust::plus<float>());
estimate /= M;std::cout << std::setprecision(3); std::cout << "pi is approximately "; std::cout << estimate << std::endl; return 0; }
Estimating π
30
Review
• What are containers and iterators?
• How to transfer information between Host and Device via Thurst?
• How to use thrust’s namespace?
• In which way can we make containers associated with raw pointers?
• How to use Thrust to perform transformation, reduce and sort?
• How to use general types in Thrust?
31
Review• Why constant iterators can save storage?
• When do we need to use counting iterators, and how much memory does it require?
• How to use transfor iterators?
• Why do we need zip iterators?
• How to generate different distribution by Curand?
• How to use Curand with Thrust?
• How to use Curand in a kernel function?
32
References
• An Introduction to Thrust
• GTC 2010 – (Part 1) - High Productivity Development– (Part 2) - Thrust By Example
• Thrust_Quick_Start_Guide
• Thrust - A Productivity-Oriented Library for CUDA
• http://code.google.com/p/thrust/
• CURAND_Library