Thrust & Curand

Thrust & Curand

Dr. Bo YuanE-mail: [email protected]

2

Outline

• Thrust Basic

• Work with Thrust

• General Types of Thrust

• Fancy Iterators of Thrust

• Curand Library

3

Thrust Basic

• What is Thrust?

• Containers

• Iterators

• Interaction with raw pointers

• Namespace

4

What is Thrust?

• C++ template library for CUDA– Mimics Standard Template Library (STL)

• Containers– thrust::host_vector<T>– thrust::device_vector<T>

• Algorithms– thrust::sort()– thrust::reduce()– thrust::inclusive_scan()– …

5

Containers

• Make common operations concise and readable – Hides cudaMalloc, cudaMemcpy and cudaFree

// allocate host vector with two elements thrust::host_vector<int> h_vec(2);• // copy host vector to device thrust::device_vector<int> d_vec = h_vec;

// manipulate device values from the host d_vec[0] = 13; d_vec[1] = 27;• std::cout << "sum: " << d_vec[0] + d_vec[1] << std::endl;

// vector memory automatically released w/ free() or cudaFree()

6

Iterators• Sequences defined by pair of iterators

// allocate device vectorthrust::device_vector<int> d_vec(4);

// returns iterator at first element of d_vecd_vec.begin();

// returns iterator one past the last element of d_vecd_vec.end()

// [begin, end) pair defines a sequence of 4 elements

d_vec.begin() d_vec.end()

7

Iterators

// initialize random values on host thrust::host_vector<int> h_vec(1000);thrust::generate(h_vec.begin(), h_vec.end(), rand);

// copy values to devicethrust::device_vector<int> d_vec = h_vec;

// compute sum on hostint h_sum = thrust::reduce(h_vec.begin(),h_vec.end());

// compute sum on deviceint d_sum = thrust::reduce(d_vec.begin(),d_vec.end());

8

Convertible to Raw Pointers

// allocate device vectorthrust::device_vector<int> d_vec(4);

// obtain raw pointer to device vector’s memoryint * ptr = thrust::raw_pointer_cast(&d_vec[0]);

// use ptr in a CUDA C kernelmy_kernel<<<N/256, 256>>>(N, ptr);

// Note: ptr cannot be dereferenced on the host!

9

Wrap Raw Pointers with device_ptr

// raw pointer to device memoryint * raw_ptr;cudaMalloc((void **) &raw_ptr, N * sizeof(int));

// wrap raw pointer with a device_ptr thrust::device_ptr<int> dev_ptr(raw_ptr);

// use device_ptr in thrust algorithmsthrust::fill(dev_ptr, dev_ptr + N, (int) 0);

// access device memory through device_ptrdev_ptr[0] = 1;

10

Namespaces

• C++ supports namespaces– Thrust uses thrust namespace

• thrust::device_vector• thrust::copy

– STL uses std namespace• std::vector• std::list

• Avoids collisions– thrust::sort()– std::sort()

• For brevity– using namespace thrust;

11

Usage of Thrust:

• Thrust provides many standard algorithms– Transformations– Reductions– Prefix Sums– Sorting

• Generic definitions– General Types

• Built-in types (int, float, …)• User-defined structures

– General Operators• reduce with plus operator• scan with maximum operator

12

Transformations// allocate three device_vectors with 10 elementsthrust::device_vector<int> X(10);thrust::device_vector<int> Y(10);thrust::device_vector<int> Z(10);// initialize X to 0,1,2,3, ....thrust::sequence(X.begin(), X.end());// compute Y = -Xthrust::transform(X.begin(), X.end(), Y.begin(), thrust::negate<int>());// fill Z with twosthrust::fill(Z.begin(), Z.end(), 2);// compute Y = X mod 2thrust::transform(X.begin(), X.end(), Z.begin(), Y.begin(),thrust::modulus<int>());// replace all the ones in Y with tensthrust::replace(Y.begin(), Y.end(), 1, 10);

13

Reductionsint sum = thrust::reduce(D.begin(), D.end(), (int) 0,

thrust::plus<int>());

int sum = thrust::reduce(D.begin(), D.end(), (int) 0);

int sum = thrust::reduce(D.begin(), D.end())

thrust::count_ifthrust::min_elementthrust::max_elementthrust::is_sortedthrust::inner_product

14

Sorting

• thrust::sort

• thrust::stable_sort

• thrust::sort_by_key

• thrust::stable_sort_by_key

15

General Types and Operators

#include <thrust/reduce.h>// declare storagedevice_vector<int> i_vec = ... device_vector<float> f_vec = ... // sum of integers (equivalent calls)reduce(i_vec.begin(), i_vec.end());reduce(i_vec.begin(), i_vec.end(), 0, plus<int>());// sum of floats (equivalent calls)reduce(f_vec.begin(), f_vec.end());reduce(f_vec.begin(), f_vec.end(), 0.0f, plus<float>());// maximum of integersreduce(i_vec.begin(), i_vec.end(), 0,

maximum<int>());

16

General Types and Operatorsstruct negate_float2{ __host__ __device__ float2 operator()(float2 a) { return make_float2(-a.x, -a.y); }};// declare storagedevice_vector<float2> input = ... device_vector<float2> output = ...// create functornegate_float2 func;// negate vectorstransform(input.begin(), input.end(), output.begin(),

func);

17

General Types and Operators

// compare x component of two float2 structuresstruct compare_float2{ __host__ __device__ bool operator()(float2 a, float2 b) { return a.x < b.x; }};// declare storagedevice_vector<float2> vec = ...// create comparison functorcompare_float2 comp;// sort elements by x componentsort(vec.begin(), vec.end(), comp);

18

Advanced Thrust• Behave like “normal” iterators

– Algorithms don't know the difference

– constant_iterator– counting_iterator– transform_iterator– zip_iterator

19

Constant // create iteratorsconstant_iterator<int> begin(10);constant_iterator<int> end = begin + 3;

begin[0] // returns 10begin[1] // returns 10begin[100] // returns 10

// sum of [begin, end)reduce(begin, end); // returns 30 (i.e. 3 * 10)

20

Counting// create iteratorscounting_iterator<int> begin(10);counting_iterator<int> end = begin + 3;

begin[0] // returns 10begin[1] // returns 11begin[100] // returns 110

// sum of [begin, end)reduce(begin, end); // returns 33 (i.e. 10 + 11 + 12)

21

Transform– Yields a transformed sequence– Facilitates kernel fusion– Conserves memory capacity and bandwidth

X Y Z

F( x ) F( x ) F( y ) F( z )

22

Transform// initialize vectordevice_vector<int> vec(3);vec[0] = 10; vec[1] = 20; vec[2] = 30;// create iterator (type omitted)begin = make_transform_iterator(vec.begin(), negate<int>());end = make_transform_iterator(vec.end(), negate<int>());

begin[0] // returns -10begin[1] // returns -20begin[2] // returns -30

// sum of [begin, end)reduce(begin, end); // returns -60 (i.e. -10 + -20 + -30)

23

Zip– Looks like an array of structs (AoS)– Stored in structure of arrays (SoA)

24

Zip// initialize vectorsdevice_vector<int> A(3);device_vector<char> B(3);A[0] = 10; A[1] = 20; A[2] = 30;B[0] = ‘x’; B[1] = ‘y’; B[2] = ‘z’;// create iterator (type omitted)begin = make_zip_iterator(make_tuple(A.begin(),

B.begin()));end = make_zip_iterator(make_tuple(A.end(), B.end()));

begin[0] // returns tuple(10, ‘x’)begin[1] // returns tuple(20, ‘y’)begin[2] // returns tuple(30, ‘z’)// maximum of [begin, end)maximum< tuple<int,char> > binary_op;reduce(begin, end, begin[0], binary_op); // returns

tuple(30, ‘z’)

25

Curand Library• # include <curand.h>

• curandStatus_t curandGenerate(curandGenerator_t generator, unsigned int *outputPtr, size_t num)

• A given state may not be used by more than one block.

• A given block may generate random numbers using multiple states.

26

Distribution__device__ floatcurand_uniform (curandState_t *state)

__device__ floatcurand_normal (curandState_t *state)

__device__ unsigned int curand_poisson (curandState_t *state, double lambda)

__device__ double curand_uniform_double (curandState_t *state)

__device__ float2 curand_normal2 (curandState_t *state)

27

struct estimate_pi{ public thrust::unary_function<unsigned int, float>{ __device__ float operator()(unsigned int

thread_id) {float sum = 0;unsigned int N = 10000; // samples per threadunsigned int seed = thread_id;curandState s;// seed a random number generatorcurand_init(seed, 0, 0, &s);

// take N samples in a quarter circle for(unsigned int i = 0; i < N; ++i) {

Estimating π

28

// draw a sample from the unit squarefloat x = curand_uniform(&s);float y = curand_uniform(&s);float dist = sqrtf(x*x + y*y); // measure distance from the origin// add 1.0f if (u0,u1) is inside the quarter circleif(dist <= 1.0f)sum += 1.0f; }// multiply by 4 to get the area of the whole circlesum *= 4.0f;// divide by Nreturn sum / N; }};

Estimating π

29

int main(void) { // use 30K independent seeds int M = 30000; float estimate = thrust::transform_reduce

( thrust::counting_iterator<int>(0),thrust::counting_iterator<int>(M), estimate_pi(), 0.0f, thrust::plus<float>());

estimate /= M;std::cout << std::setprecision(3); std::cout << "pi is approximately "; std::cout << estimate << std::endl; return 0; }

Estimating π

30

Review

• What are containers and iterators?

• How to transfer information between Host and Device via Thurst?

• How to use thrust’s namespace?

• In which way can we make containers associated with raw pointers?

• How to use Thrust to perform transformation, reduce and sort?

• How to use general types in Thrust?

31

Review• Why constant iterators can save storage?

• When do we need to use counting iterators, and how much memory does it require?

• How to use transfor iterators?

• Why do we need zip iterators?

• How to generate different distribution by Curand?

• How to use Curand with Thrust?

• How to use Curand in a kernel function?

32

References

• An Introduction to Thrust

• GTC 2010 – (Part 1) - High Productivity Development– (Part 2) - Thrust By Example

• Thrust_Quick_Start_Guide

• Thrust - A Productivity-Oriented Library for CUDA

• http://code.google.com/p/thrust/

• CURAND_Library

Thrust & Curand

Documents

Transcript of Thrust & Curand