Cuda 2


Transcript of Cuda 2

Page 1: Cuda 2

CUDA Programming continued

ITCS 4145/5145 Nov 24, 2010 © Barry Wilkinson

Revised

Page 2: Cuda 2


Timing GPU Execution

Can use CUDA “events” – create two events and compute the time between them:

cudaEvent_t start, stop;
float elapsedTime;

cudaEventCreate(&start);    // create event objects
cudaEventCreate(&stop);

cudaEventRecord(start, 0);  // record start event

.
.
.

cudaEventRecord(stop, 0);   // record stop event
cudaEventSynchronize(stop); // wait for preceding work to complete

cudaEventElapsedTime(&elapsedTime, start, stop);
                            // compute elapsed time between events

cudaEventDestroy(start);    // destroy start event
cudaEventDestroy(stop);     // destroy stop event
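Put together, a complete use might look like the sketch below (the kernel name, launch configuration, and arguments are placeholders, not from the slides); note that cudaEventElapsedTime reports milliseconds:

```cuda
// Timing a kernel launch with CUDA events (sketch; myKernel, grid, block,
// d_data, and n are assumed names).
cudaEvent_t start, stop;
float elapsedTime;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
myKernel<<<grid, block>>>(d_data, n);   // work being timed
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);             // wait for the kernel and stop event

cudaEventElapsedTime(&elapsedTime, start, stop);
printf("Kernel time: %f ms\n", elapsedTime);  // elapsed time is in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);
```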


Page 8: Cuda 2


Host Synchronization

Kernels

• Control returned to CPU immediately (asynchronous, non-blocking)
• Kernel starts after all previous CUDA calls have completed

cudaMemcpy

• Returns after the copy is complete (synchronous)
• Copy starts after all previous CUDA calls have completed
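As a sketch (the kernel name and sizes are placeholders, not from the slides), the launch below returns to the host immediately, while the cudaMemcpy both waits for it and blocks until the copy is done:

```cuda
__global__ void increment(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

void host_code(int *h_data, int *d_data, int n) {
    increment<<<(n + 255) / 256, 256>>>(d_data, n); // returns immediately (asynchronous)
    // ... host is free to do other work here ...
    cudaMemcpy(h_data, d_data, n * sizeof(int),
               cudaMemcpyDeviceToHost);             // starts after the kernel completes,
                                                    // returns after the copy completes
}
```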

Page 9: Cuda 2


CUDA Synchronization Routines

Host

cudaThreadSynchronize()

• Blocks until all previous CUDA calls complete

GPU

void __syncthreads()

• Synchronizes all threads in a block
• Barrier – no thread can pass until all threads in the block reach it
• All threads in the thread block must reach the __syncthreads()
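For instance (the kernel name is my own; this sketch assumes the array length is a multiple of the block size), reversing each block's elements through shared memory needs a barrier between the writes and the reads:

```cuda
#define BLOCK 256

__global__ void reverse_each_block(int *d) {
    __shared__ int s[BLOCK];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x;
    s[t] = d[i];
    __syncthreads();                 // every thread's write to s[] is done
    d[i] = s[blockDim.x - 1 - t];    // safe to read a neighbor's element
}
```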

Page 10: Cuda 2


GPU Atomic Operations

Performs a read-modify-write atomic operation on one word residing in global or shared memory.

Associative operations on signed/unsigned integers: add, sub, min, max, and, or, xor, increment, decrement, exchange, and compare-and-swap.

Requires a GPU with compute capability 1.1+ (shared memory operations and 64-bit words require higher capability)

coit-grid06 Tesla C2050 has compute capability 2.0

See http://www.nvidia.com/object/cuda_gpus.html for GPU compute capabilities

Page 11: Cuda 2


Atomic Operation Example

int atomicAdd(int* address, int val);

reads old located at address address in global or shared memory, computes (old + val), and stores the result back to memory at the same address.

These three operations (read, compute, and write) are performed in one atomic transaction.*

The function returns old.

* Once started, it continues to completion without being interrupted by other processors. Other processors cannot read or write the memory location once the atomic operation starts. The mechanism is implemented in hardware.
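As a small illustration (the kernel and its arguments are my own, not from the slides), atomicAdd can maintain a counter that many threads update concurrently:

```cuda
// Every thread whose element matches target atomically increments one
// global counter; atomicAdd's read-modify-write is a single transaction,
// so no updates are lost.
__global__ void count_matches(const int *data, int n, int target, int *count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] == target)
        atomicAdd(count, 1);   // returns the old value, unused here
}
```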

Page 12: Cuda 2


Other operations

int atomicSub(int* address, int val);

int atomicExch(int* address, int val);

int atomicMin(int* address, int val);

int atomicMax(int* address, int val);

unsigned int atomicInc(unsigned int* address, unsigned int val);

unsigned int atomicDec(unsigned int* address, unsigned int val);

int atomicCAS(int* address, int compare, int val); //compare and swap

int atomicAnd(int* address, int val);

int atomicOr(int* address, int val);

int atomicXor(int* address, int val);

Source: NVIDIA CUDA C Programming Guide, version 3.2, 11/9/2010

Page 13: Cuda 2


Compare and Swap (also called compare and exchange)

int atomicCAS(int* address, int compare, int val);

reads the word old located at address address in global or shared memory and compares old with compare. If they are equal, it sets old to val (stores val at address address), i.e.:

if (old == compare) old = val; // else old unchanged

The three operations (read, compare, and write) are performed in one atomic transaction.

The function returns the original value of old.

There are also unsigned int and unsigned long long int versions.
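atomicCAS is the building block for atomic updates that have no dedicated instruction. A common retry-loop pattern (a sketch; the function name and operation are my own, not from the slides) swaps in a new value only if the word has not changed since it was read:

```cuda
// Atomically set *address to max(*address, |val|) using a CAS retry loop.
// If another thread changes the word between our read and our CAS, the
// CAS fails (returns a value != assumed) and we try again.
__device__ int atomic_max_abs(int *address, int val) {
    int v = val < 0 ? -val : val;
    int old = *address, assumed;
    do {
        assumed = old;
        if (assumed >= v) break;              // already at least v: nothing to do
        old = atomicCAS(address, assumed, v); // swap only if word still == assumed
    } while (old != assumed);                 // another thread won: retry
    return old;
}
```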

Page 14: Cuda 2


Coding Critical Sections with Locks

__device__ int lock = 0; // unlocked

__global__ void kernel(...) {
    ...
    do {} while (atomicCAS(&lock, 0, 1)); // if lock == 0, set it to 1
                                          // and continue
    ...                                   // critical section
    lock = 0;                             // free lock
}

Page 15: Cuda 2


Memory Fences

Threads may see the effects of a series of writes to memory executed by another thread in different orders. To enforce ordering:

void __threadfence_block();

waits until all global and shared memory accesses made by the calling thread prior to __threadfence_block() are visible to all threads in the thread block.

Other routines:

void __threadfence();
void __threadfence_system();
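For example (the variable names and kernel are my own, not from the slides), a thread that writes a result and then sets a ready flag needs a fence between the two writes, or another thread may observe the flag before the result:

```cuda
// Without __threadfence(), a reader in another block could see flag == 1
// while still reading the old value of result.
__device__ int result;
__device__ volatile int flag = 0;

__global__ void producer(int value) {
    result = value;   // (1) write the data
    __threadfence();  // make (1) visible device-wide before (2)
    flag = 1;         // (2) publish
}
```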

Page 16: Cuda 2


Critical Sections with Memory Operations

Writes to device memory are not guaranteed to complete in any particular order, so global writes may not have completed by the time the lock is unlocked:

__global__ void kernel(...) {
    ...
    do {} while (atomicCAS(&lock, 0, 1));
    ...              // critical section
    __threadfence(); // wait for writes to finish
    lock = 0;        // free lock
}

Page 17: Cuda 2


Error Reporting

All CUDA calls (except kernel launches) return an error code of type cudaError_t.

cudaError_t cudaGetLastError(void)

Returns the code for the last error.
Can be used to get the error from a kernel execution.

const char* cudaGetErrorString(cudaError_t code)

Returns a null-terminated character string describing the error.

Example

printf("%s\n", cudaGetErrorString(cudaGetLastError()));
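In practice the check is often wrapped in a macro (the macro below is a common pattern, not from the slides); because kernel launches return no error code, cudaGetLastError() is queried after the launch:

```cuda
#include <stdio.h>

// Wrap each CUDA runtime call; report a readable message on failure.
#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t err = (call);                                      \
        if (err != cudaSuccess)                                        \
            printf("CUDA error: %s\n", cudaGetErrorString(err));       \
    } while (0)

// Usage:
//   CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToHost));
//   kernel<<<grid, block>>>(args);
//   CHECK(cudaGetLastError());   // catches launch errors
```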

Page 18: Cuda 2

Questions