CUDA Odds and Ends Joseph Kider University of Pennsylvania CIS 565 - Fall 2011.
Sources

- Patrick Cozzi, Spring 2011
- NVIDIA CUDA Programming Guide
- CUDA by Example
- Programming Massively Parallel Processors

Agenda

- Atomic Functions
- Page-Locked Host Memory
- Streams
- Graphics Interoperability

Atomic Functions

What is the value of count if 8 threads execute ++count?

    __device__ unsigned int count = 0;
    // ...
    ++count;

- Read-modify-write atomic operation
  - Guaranteed no interference from other threads
  - No guarantee on order
- Shared or global memory
- Requires compute capability 1.1 (> G80)

See G.1 in the NVIDIA CUDA C Programming Guide for full compute capability requirements

What is the value of count if 8 threads execute atomicInc below?

    __device__ unsigned int count = 0;
    // ...
    // atomic ++count; the second argument is the wrap-around limit
    // (atomicInc resets count to 0 once the old value reaches it)
    atomicInc(&count, 0xFFFFFFFF);

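The question above can be checked on a device with a small test harness. The sketch below is one possible way to do it, not part of the original slides: the kernel and helper names (`incrementWithAdd`, `incrementWithInc`, `countAfterAtomicAdd`, `countAfterAtomicInc`) are made up for illustration, and error checking is omitted for brevity. It also shows why the second argument of `atomicInc` matters: `atomicInc(addr, limit)` stores 0 instead of `old + 1` once `old >= limit`.

```cuda
#include <cuda_runtime.h>

__device__ unsigned int count = 0;

// Every thread adds 1 with an atomic read-modify-write.
__global__ void incrementWithAdd()
{
    atomicAdd(&count, 1u);
}

// Every thread increments with atomicInc; count wraps to 0 once it reaches limit.
__global__ void incrementWithInc(unsigned int limit)
{
    atomicInc(&count, limit);
}

// Hypothetical host helper: zero the counter before a launch.
static void resetCount()
{
    unsigned int zero = 0;
    cudaMemcpyToSymbol(count, &zero, sizeof(zero));
}

// Hypothetical host helper: read the counter back (waits for the kernel first).
static unsigned int readCount()
{
    unsigned int result = 0;
    cudaMemcpyFromSymbol(&result, count, sizeof(result));
    return result;
}

unsigned int countAfterAtomicAdd(int threads)
{
    resetCount();
    incrementWithAdd<<<1, threads>>>();
    return readCount();
}

unsigned int countAfterAtomicInc(int threads, unsigned int limit)
{
    resetCount();
    incrementWithInc<<<1, threads>>>(limit);
    return readCount();
}
```

With 8 threads, `countAfterAtomicAdd(8)` and `countAfterAtomicInc(8, 0xFFFFFFFFu)` both yield 8, while `countAfterAtomicInc(8, 1u)` yields 0 because the counter alternates between 1 and 0.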
Atomic Functions

How do you implement atomicAdd?

    __device__ int atomicAdd(int *address, int val);

With a lock:

    __device__ int atomicAdd(int *address, int val)
    {
        // Made up keyword:
        __lock (address)
        {
            int old = *address;
            *address += val;
            return old;
        }
    }

How do you implement atomicAdd without locking?

What if you were given an atomic compare and swap?

    int atomicCAS(int *address, int compare, int val);

Atomic Functions

atomicCAS pseudo implementation

    int atomicCAS(int *address, int compare, int val)
    {
        // Made up keyword
        __lock(address)
        {
            int old = *address;
            *address = (old == compare) ? val : old;
            return old;
        }
    }

Atomic Functions

Example:

    *addr = 1;

    atomicCAS(addr, 1, 2); // returns 1, *addr = 2
    atomicCAS(addr, 1, 3); // returns 2, *addr = 2
    atomicCAS(addr, 2, 3); // returns 2, *addr = 3

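The three-call trace can be replayed on a device to confirm the return values. This is only a sketch: `casTrace` and `runCasTrace` are made-up names, a single thread runs the calls so the order is fixed, and error checking is left out.

```cuda
#include <cuda_runtime.h>

// A single thread replays the three calls and records what each one returns.
__global__ void casTrace(int *addr, int *returned)
{
    *addr = 1;
    returned[0] = atomicCAS(addr, 1, 2); // matches, stores 2
    returned[1] = atomicCAS(addr, 1, 3); // no match, leaves 2
    returned[2] = atomicCAS(addr, 2, 3); // matches, stores 3
}

// Hypothetical host helper: out[0..2] receive the three return values,
// out[3] receives the final value of *addr.
void runCasTrace(int out[4])
{
    int *dev = nullptr;
    cudaMalloc(&dev, 4 * sizeof(int));
    casTrace<<<1, 1>>>(dev + 3, dev);
    cudaMemcpy(out, dev, 4 * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);
}
```

After `runCasTrace(out)`, the array holds 1, 2, 2 for the return values and 3 for the final contents of the address.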
Atomic Functions

Again, how do you implement atomicAdd given atomicCAS?

    __device__ int atomicAdd(int *address, int val)
    {
        // Read original value at *address.
        int old = *address, assumed;
        do
        {
            assumed = old;
            // If the value at *address didn't change, add val to it.
            old = atomicCAS(address, assumed, assumed + val);
        } while (assumed != old); // Otherwise, loop until atomicCAS succeeds.
        return old;
    }

The value of *address after this function returns is not necessarily the original value of *address + val. Why?

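To try the compare-and-swap loop under real contention, it can be wrapped in a small test. The sketch below assumes the function is renamed `casAdd` (a made-up name, chosen so it does not clash with the built-in `atomicAdd`); `addFromEveryThread` and `sumWithCasAdd` are likewise hypothetical, and error checking is omitted.

```cuda
#include <cuda_runtime.h>

// The CAS loop from the slides under a different name.
__device__ int casAdd(int *address, int val)
{
    int old = *address, assumed;
    do
    {
        assumed = old;
        // Succeeds only if nobody changed *address since we last read it.
        old = atomicCAS(address, assumed, assumed + val);
    } while (assumed != old);
    return old;
}

// Every thread in the grid adds val to the same integer.
__global__ void addFromEveryThread(int *sum, int val)
{
    casAdd(sum, val);
}

// Hypothetical host helper: returns the final sum after all threads have added val.
int sumWithCasAdd(int blocks, int threadsPerBlock, int val)
{
    int *devSum = nullptr;
    cudaMalloc(&devSum, sizeof(int));
    cudaMemset(devSum, 0, sizeof(int));
    addFromEveryThread<<<blocks, threadsPerBlock>>>(devSum, val);
    int result = 0;
    cudaMemcpy(&result, devSum, sizeof(int), cudaMemcpyDeviceToHost); // waits for the kernel
    cudaFree(devSum);
    return result;
}
```

For example, `sumWithCasAdd(4, 64, 3)` has 256 threads each adding 3, so no update is lost only if the result is 768.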
Atomic Functions

Lots of atomics:

    // Arithmetic     // Bitwise
    atomicAdd()       atomicAnd()
    atomicSub()       atomicOr()
    atomicExch()      atomicXor()
    atomicMin()
    atomicMax()
    atomicInc()
    atomicDec()
    atomicCAS()

See B.10 in the NVIDIA CUDA C Programming Guide

How can threads from different blocks work together?

Use atomics sparingly. Why?

Page-Locked Host Memory

- Page-locked memory
  - Host memory that the OS will not page out (essentially removed from virtual memory)
  - Also called pinned memory

- Benefits
  - Overlap kernel execution and data transfers

Timeline:
- Normally: data transfer, then kernel execution
- Page-locked: data transfer and kernel execution overlap in time

See G.1 in the NVIDIA CUDA C Programming Guide for full compute capability requirements

Page-Locked Host Memory

- Benefits
  - Increased memory bandwidth for systems with a front-side bus
    - Up to ~2x throughput

Image from http://arstechnica.com/hardware/news/2009/10/day-of-nvidia-chipset-reckoning-arrives.ars

- Benefits
  - Write-combining memory
    - Page-locked memory is cacheable by default
    - Allocate with cudaHostAllocWriteCombined to
      - Avoid polluting L1 and L2 caches
      - Avoid snooping transfers across PCIe
      - Improve transfer performance up to 40% (in theory)
    - Reading from write-combining memory is slow!
      - Only write to it from the host

Page-Locked Host Memory

- Benefits
  - Page-locked host memory can be mapped into the address space of the device on some systems
    - What systems allow this?
    - What does this eliminate?
    - What applications does this enable?
  - Call cudaGetDeviceProperties() and check canMapHostMemory

Usage:

    cudaHostAlloc() / cudaMallocHost()
    cudaFreeHost()
    cudaMemcpyAsync()

See 3.2.5 in the NVIDIA CUDA C Programming Guide

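A minimal sketch of these calls working together, assuming a round trip of `n` floats through the device on the default stream. The helper names `roundTripThroughPinnedMemory` and `deviceCanMapHostMemory` are made up for illustration; only the `cuda*` calls come from the runtime API.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: reports whether the given device can map
// page-locked host memory into its address space.
bool deviceCanMapHostMemory(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return false;
    return prop.canMapHostMemory != 0;
}

// Hypothetical helper: copies n floats to the device and back through a
// page-locked host buffer; returns true if the data survives the round trip.
bool roundTripThroughPinnedMemory(int n)
{
    const size_t bytes = n * sizeof(float);
    float *hostPtr = nullptr;
    float *devPtr = nullptr;

    if (cudaMallocHost(&hostPtr, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&devPtr, bytes) != cudaSuccess)
    {
        cudaFreeHost(hostPtr);
        return false;
    }

    for (int i = 0; i < n; ++i) hostPtr[i] = static_cast<float>(i);

    // Asynchronous copies return immediately, so synchronize before
    // touching the host buffer again.
    cudaMemcpyAsync(devPtr, hostPtr, bytes, cudaMemcpyHostToDevice, 0);
    cudaStreamSynchronize(0);

    for (int i = 0; i < n; ++i) hostPtr[i] = -1.0f; // scrub the host copy

    cudaMemcpyAsync(hostPtr, devPtr, bytes, cudaMemcpyDeviceToHost, 0);
    cudaStreamSynchronize(0);

    bool ok = true;
    for (int i = 0; i < n; ++i) ok = ok && (hostPtr[i] == static_cast<float>(i));

    cudaFree(devPtr);
    cudaFreeHost(hostPtr); // page-locked memory is released with cudaFreeHost
    return ok;
}
```

Note that pinned allocations are paired with `cudaFreeHost()`, not `free()` or `cudaFree()`.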
Page-Locked Host Memory

DEMO
CUDA SDK Example: bandwidthTest

- What's the catch?
  - Page-locked memory is scarce
    - Allocations will start failing before allocations in pageable memory do
  - Reduces amount of physical memory available to the OS for paging
    - Allocating too much will hurt overall system performance

Streams

- Stream: sequence of commands that execute in order
- Streams may execute their commands out of order or concurrently with respect to other streams

    Stream A      Stream B
    Command 0     Command 0
    Command 1     Command 1
    Command 2     Command 2

Streams

[Figures: five timelines, each interleaving Commands 0-2 of Stream A and Stream B along a time axis in a different arrangement; one of them lists Command 2 before Command 1.]

For each timeline: Is this a possible order?

Streams

- In CUDA, what commands go in a stream?
  - Kernel launches
  - Host <-> device memory transfers

Code Example
1. Create two streams
2. Each stream:
   1. Copy page-locked memory to device
   2. Launch kernel
   3. Copy memory back to host
3. Destroy streams

Stream Example (Step 1 of 3)

    // Create two streams
    cudaStream_t stream[2];
    for (int i = 0; i < 2; ++i)
    {
        cudaStreamCreate(&stream[i]);
    }

    // Allocate two buffers in page-locked memory
    float *hostPtr;
    cudaMallocHost(&hostPtr, 2 * size);

Stream Example (Step 2 of 3)

    // Commands are assigned to, and executed by, streams
    for (int i = 0; i < 2; ++i)
    {
        cudaMemcpyAsync(/* ... */, cudaMemcpyHostToDevice, stream[i]);
        kernel<<<100, 512, 0, stream[i]>>>(/* ... */);
        cudaMemcpyAsync(/* ... */, cudaMemcpyDeviceToHost, stream[i]);
    }

Stream Example (Step 3 of 3)

    for (int i = 0; i < 2; ++i)
    {
        // Blocks until commands complete (in CUDA releases of this era; newer
        // releases return immediately and free the stream once its work is done)
        cudaStreamDestroy(stream[i]);
    }

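Putting the three steps together, a complete version might look like the sketch below. The elided arguments are filled in with assumptions that are not from the slides: a made-up `scale` kernel that doubles each element, a made-up helper `runTwoStreams`, and a buffer layout where each stream owns one contiguous half. Error checking is omitted, and an explicit `cudaStreamSynchronize()` precedes each destroy so the result does not depend on how `cudaStreamDestroy()` behaves.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: doubles each of the n elements.
__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Hypothetical helper: processes two buffers of n floats, one per stream.
// Returns true if every element was doubled.
bool runTwoStreams(int n)
{
    const size_t size = n * sizeof(float);

    // Step 1: create two streams and allocate page-locked host memory.
    cudaStream_t stream[2];
    for (int i = 0; i < 2; ++i) cudaStreamCreate(&stream[i]);

    float *hostPtr = nullptr;
    float *devPtr = nullptr;
    cudaMallocHost(&hostPtr, 2 * size);
    cudaMalloc(&devPtr, 2 * size);
    for (int i = 0; i < 2 * n; ++i) hostPtr[i] = 1.0f;

    // Step 2: each stream copies its half in, runs the kernel, copies it back.
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    for (int i = 0; i < 2; ++i)
    {
        float *h = hostPtr + i * n;
        float *d = devPtr + i * n;
        cudaMemcpyAsync(d, h, size, cudaMemcpyHostToDevice, stream[i]);
        scale<<<blocks, threads, 0, stream[i]>>>(d, n);
        cudaMemcpyAsync(h, d, size, cudaMemcpyDeviceToHost, stream[i]);
    }

    // Step 3: wait for each stream, then destroy it.
    for (int i = 0; i < 2; ++i)
    {
        cudaStreamSynchronize(stream[i]);
        cudaStreamDestroy(stream[i]);
    }

    bool ok = true;
    for (int i = 0; i < 2 * n; ++i) ok = ok && (hostPtr[i] == 2.0f);

    cudaFree(devPtr);
    cudaFreeHost(hostPtr);
    return ok;
}
```

Whether the two streams actually overlap depends on the device's compute capability; the code is correct either way because each stream only touches its own half of the buffers.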
Streams

- Assume compute capabilities:
  - Overlap of data transfer and kernel execution
  - Concurrent kernel execution
  - Concurrent data transfer
- How can the streams overlap?

See G.1 in the NVIDIA CUDA C Programming Guide for more on compute capabilities

Streams

[Figure: timeline for Stream A and Stream B, each with a host-to-device memory copy, a kernel execution, and a device-to-host memory copy.]

Can we have more overlap than this?

[Figure: a second timeline arranging the same six commands.]

Can we have this?

Streams

- Implicit synchronization
  - An operation that requires a dependency check to see if a kernel finished executing:
    - Blocks all kernel launches from any stream until the checked kernel is finished

See 3.2.6.5.3 in the NVIDIA CUDA C Programming Guide for all limitations

[Figure: timeline for Stream A and Stream B, each with a host-to-device memory copy, a kernel execution, and a device-to-host memory copy. Annotations: "Dependent on kernel completion" and "Blocked until kernel from Stream A completes".]

Can we have this?

Streams

- Performance advice
  - Issue all independent commands before dependent ones
  - Delay synchronization (implicit or explicit) as long as possible

Rewrite this to allow concurrent kernel execution:

    for (int i = 0; i < 2; ++i)
    {
        cudaMemcpyAsync(/* ... */, stream[i]);
        kernel<<</* ... */, stream[i]>>>();
        cudaMemcpyAsync(/* ... */, stream[i]);
    }

Rewritten:

    for (int i = 0; i < 2; ++i) // to device
        cudaMemcpyAsync(/* ... */, stream[i]);

    for (int i = 0; i < 2; ++i)
        kernel<<</* ... */, stream[i]>>>();

    for (int i = 0; i < 2; ++i) // to host
        cudaMemcpyAsync(/* ... */, stream[i]);

Streams

- Explicit synchronization
  - cudaThreadSynchronize() (deprecated in later CUDA releases in favor of cudaDeviceSynchronize())
    - Blocks until commands in all streams finish
  - cudaStreamSynchronize()
    - Blocks until commands in a stream finish

See 3.2.6.5 in the NVIDIA CUDA C Programming Guide for more synchronization functions

Timing with Stream Events

- Events can be added to a stream to monitor the device's progress
- An event is completed when all commands in the stream preceding it complete.

Timing with Stream Events

    // Create two events. Each will record the time
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Record events before and after each stream is assigned its work
    cudaEventRecord(start, 0);
    for (int i = 0; i < 2; ++i)
    {
        // ...
    }
    cudaEventRecord(stop, 0);

    // Block the host until the stop event completes, delaying any
    // additional commands until after it
    cudaEventSynchronize(stop);

    // Compute elapsed time (in milliseconds)
    float elapsedTime;
    cudaEventElapsedTime(&elapsedTime, start, stop);
    // cudaEventDestroy(...)

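As a filled-in illustration of the timing pattern, the sketch below times a single kernel launch on the default stream. The `addOne` kernel and the `timeKernelMs` helper are made-up names, and most error checking is omitted; only the event calls mirror the slide.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel used only to give the events something to time.
__global__ void addOne(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Hypothetical helper: returns the elapsed time of one launch in
// milliseconds, or -1 if the elapsed time could not be computed.
float timeKernelMs(int n)
{
    float *devPtr = nullptr;
    cudaMalloc(&devPtr, n * sizeof(float));
    cudaMemset(devPtr, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Bracket the work with the two events on stream 0.
    cudaEventRecord(start, 0);
    addOne<<<(n + 255) / 256, 256>>>(devPtr, n);
    cudaEventRecord(stop, 0);

    // Wait until the stop event, and everything before it, has completed.
    cudaEventSynchronize(stop);

    float elapsedMs = 0.0f;
    cudaError_t err = cudaEventElapsedTime(&elapsedMs, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(devPtr);
    return (err == cudaSuccess) ? elapsedMs : -1.0f;
}
```

Because the events are recorded on the device, the measurement covers device execution between the two records rather than host-side launch overhead alone.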
Graphics Interoperability

- What applications use both CUDA and OpenGL/Direct3D?
  - CUDA → GL
  - GL → CUDA
- If CUDA and GL cannot share resources, what is the performance implication?

- Graphics interop: map GL resource into CUDA address space
  - Buffers
  - Textures
  - Renderbuffers

OpenGL Buffer Interop
1. Assign device with GL interop: cudaGLSetGLDevice()
2. Register GL resource with CUDA: cudaGraphicsGLRegisterBuffer()
3. Map it: cudaGraphicsMapResources()
4. Get mapped pointer: cudaGraphicsResourceGetMappedPointer()