![Page 1: CUDA Advanced Memory Usage and Optimization Yukai Hung a0934147@gmail.com Department of Mathematics National Taiwan University Yukai Hung a0934147@gmail.com.](https://reader035.fdocuments.us/reader035/viewer/2022062221/56649e8f5503460f94b93599/html5/thumbnails/1.jpg)
CUDA Advanced Memory Usage and Optimization
Yukai Hung
a0934147@gmail.com
Department of Mathematics
National Taiwan University

Register as Cache?

Volatile Qualifier

    //array is assumed to be an integer array in global or shared memory
    __global__ void kernelFunc(int* result)
    {
        int temp1;
        int temp2;

        if(threadIdx.x<warpSize)
        {
            temp1=array[threadIdx.x];
            array[threadIdx.x+1]=2;
            temp2=array[threadIdx.x];
            result[threadIdx.x]=temp1*temp2;
        }
    }

The two reads of array[threadIdx.x] are identical, so the compiler may optimize the second read away.

After optimization the kernel behaves as if the value were read once and reused from a register:

    __global__ void kernelFunc(int* result)
    {
        int temp1;
        int temp2;

        if(threadIdx.x<warpSize)
        {
            //single read, kept in a register
            int temp=array[threadIdx.x];

            temp1=temp;
            array[threadIdx.x+1]=2;
            temp2=temp;
            result[threadIdx.x]=temp1*temp2;
        }
    }

A barrier between the write and the second read forces the second read to be performed:

    __global__ void kernelFunc(int* result)
    {
        int temp1;
        int temp2;

        //the barrier sits inside the branch, so the block must not
        //contain more than warpSize threads
        if(threadIdx.x<warpSize)
        {
            temp1=array[threadIdx.x]*1;
            array[threadIdx.x+1]=2;
            __syncthreads();

            temp2=array[threadIdx.x]*2;
            result[threadIdx.x]=temp1*temp2;
        }
    }
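
As a rough illustration of the barrier variant, the sketch below keeps the array in shared memory and separates the first read, the write and the second read with barriers, so the result no longer depends on thread timing. The names barrierReadDemo and runBarrierReadDemo, the shared array and its initial value of 1 are assumptions made for this example and do not come from the slides.

```cuda
#include <cuda_runtime.h>
#include <vector>

//hypothetical kernel, meant to be launched with exactly 32 threads
__global__ void barrierReadDemo(int* result)
{
    __shared__ int array[33];      //one extra slot for the tid+1 write
    int tid=threadIdx.x;

    array[tid]=1;
    __syncthreads();

    int temp1=array[tid];          //first read, every slot still holds 1
    __syncthreads();               //all first reads finish before any write

    array[tid+1]=2;
    __syncthreads();               //all writes finish before the second read

    int temp2=array[tid];          //second read is performed after the barrier
    result[tid]=temp1*temp2;
}

//runs the kernel on one block and returns the 32 products
std::vector<int> runBarrierReadDemo()
{
    std::vector<int> host(32,-1);
    int* device=NULL;

    if(cudaMalloc((void**)&device,sizeof(int)*32)!=cudaSuccess)
        return host;

    barrierReadDemo<<<1,32>>>(device);
    cudaMemcpy(host.data(),device,sizeof(int)*32,cudaMemcpyDeviceToHost);
    cudaFree(device);

    return host;
}
```

Thread 0 never sees a neighbour write, so its product stays 1, while every other thread reads the 2 stored by the thread to its left. The barriers are placed outside any branch so that every thread of the block reaches them.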

Alternatively, use the volatile qualifier:

    //note: the CUDA programming guide applies volatile to the array in global
    //or shared memory itself, so every reference to it is a real memory access
    __global__ void kernelFunc(int* result)
    {
        volatile int temp1;
        volatile int temp2;

        if(threadIdx.x<warpSize)
        {
            temp1=array[threadIdx.x]*1;
            array[threadIdx.x+1]=2;
            temp2=array[threadIdx.x]*2;
            result[threadIdx.x]=temp1*temp2;
        }
    }

Data Prefetch

Hide memory latency by overlapping loading and computing
- double buffering is a traditional software pipelining technique

[Figure: matrices Md, Nd and Pd with the sub-block Pdsub]
- load the blue block to shared memory
- compute the blue block in shared memory and load the next block to shared memory

Without prefetching, loading and computing alternate in every iteration:

    for loop
    {
        load data from global to shared memory
        synchronize block

        compute data in the shared memory
        synchronize block
    }

With prefetching, the next block is loaded into registers while the current block is computed:

    load data from global memory to registers
    for loop
    {
        store data from registers to shared memory    //very small overhead: both memories are very fast
        synchronize block

        load data from global memory to registers     //computing and loading overlap:
        compute data in the shared memory             //registers and shared memory are independent
        synchronize block
    }

Matrix-matrix multiplication
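
The prefetch loop can be sketched for tiled matrix-matrix multiplication roughly as follows. This is only an illustration under stated assumptions: square matrices whose width is a multiple of the tile size, and the names TILE, matmulPrefetch and runMatmulPrefetch are invented for the example (Md, Nd and Pd follow the figure labels).

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE 16

//hypothetical kernel: Pd=Md*Nd, width must be a multiple of TILE
__global__ void matmulPrefetch(const float* Md,const float* Nd,float* Pd,int width)
{
    __shared__ float Ms[TILE][TILE];
    __shared__ float Ns[TILE][TILE];

    int tx=threadIdx.x;
    int ty=threadIdx.y;
    int row=blockIdx.y*TILE+ty;
    int col=blockIdx.x*TILE+tx;
    int tiles=width/TILE;

    //load the first tile from global memory to registers
    float mreg=Md[row*width+tx];
    float nreg=Nd[ty*width+col];
    float sum=0.0f;

    for(int tile=0;tile<tiles;tile++)
    {
        //store data from registers to shared memory
        Ms[ty][tx]=mreg;
        Ns[ty][tx]=nreg;
        __syncthreads();

        //issue the loads of the next tile into registers
        if(tile+1<tiles)
        {
            mreg=Md[row*width+(tile+1)*TILE+tx];
            nreg=Nd[((tile+1)*TILE+ty)*width+col];
        }

        //compute on the current tile held in shared memory
        for(int k=0;k<TILE;k++)
            sum+=Ms[ty][k]*Ns[k][tx];
        __syncthreads();
    }

    Pd[row*width+col]=sum;
}

//multiplies two width x width host matrices on the device
std::vector<float> runMatmulPrefetch(const std::vector<float>& m,
                                     const std::vector<float>& n,int width)
{
    size_t bytes=sizeof(float)*width*width;
    std::vector<float> p(width*width,0.0f);
    float* dm=NULL;
    float* dn=NULL;
    float* dp=NULL;

    cudaMalloc((void**)&dm,bytes);
    cudaMalloc((void**)&dn,bytes);
    cudaMalloc((void**)&dp,bytes);
    cudaMemcpy(dm,m.data(),bytes,cudaMemcpyHostToDevice);
    cudaMemcpy(dn,n.data(),bytes,cudaMemcpyHostToDevice);

    dim3 block(TILE,TILE);
    dim3 grid(width/TILE,width/TILE);
    matmulPrefetch<<<grid,block>>>(dm,dn,dp,width);

    cudaMemcpy(p.data(),dp,bytes,cudaMemcpyDeviceToHost);
    cudaFree(dm);
    cudaFree(dn);
    cudaFree(dp);

    return p;
}
```

The global loads of the next tile are issued before the inner product loop, so the hardware is free to overlap their latency with the arithmetic on the current tile. Whether this pays off on a given device is something to measure rather than assume.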

Constant Memory

Where is constant memory?
- data is stored in the device global memory
- data is read through the multiprocessor constant cache
- 64KB of constant memory and an 8KB cache for each multiprocessor

How about the performance?
- optimized when a warp of threads reads the same location
- 4 bytes per cycle through broadcasting to a warp of threads
- serialized when a warp of threads reads different locations
- very slow on a cache miss (data is read from global memory)
- access latency can range from one to hundreds of clock cycles

How to use constant memory?
- declare constant memory at file scope (global variable)
- copy data to constant memory from the host (because it is constant!!)

    //declare constant memory
    __constant__ float cst_ptr[size];

    //copy data from host to constant memory
    cudaMemcpyToSymbol(cst_ptr,host_ptr,data_size);
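
A small sketch of the broadcast-friendly access pattern: polynomial coefficients live in constant memory, and in every Horner step all threads of a warp read the same coefficient. The names coef, DEGREE, evalPolynomial and runEvalPolynomial are hypothetical and chosen only for this example.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define DEGREE 3

//hypothetical coefficient table: coef[0]+coef[1]*x+coef[2]*x^2+coef[3]*x^3
__constant__ float coef[DEGREE+1];

__global__ void evalPolynomial(const float* x,float* y,int size)
{
    int index=blockIdx.x*blockDim.x+threadIdx.x;
    if(index>=size) return;

    //Horner scheme: in each step the whole warp reads the same coef[k],
    //which is the access pattern the constant cache can broadcast
    float value=coef[DEGREE];
    for(int k=DEGREE-1;k>=0;k--)
        value=value*x[index]+coef[k];

    y[index]=value;
}

//evaluates the polynomial with coefficients hcoef[0..DEGREE] at every hx[i]
std::vector<float> runEvalPolynomial(const std::vector<float>& hx,const float* hcoef)
{
    int size=(int)hx.size();
    size_t bytes=sizeof(float)*size;
    std::vector<float> hy(size,0.0f);
    float* dx=NULL;
    float* dy=NULL;

    cudaMalloc((void**)&dx,bytes);
    cudaMalloc((void**)&dy,bytes);
    cudaMemcpy(dx,hx.data(),bytes,cudaMemcpyHostToDevice);

    //copy the coefficients from host to constant memory
    cudaMemcpyToSymbol(coef,hcoef,sizeof(float)*(DEGREE+1));

    evalPolynomial<<<(size+63)/64,64>>>(dx,dy,size);

    cudaMemcpy(hy.data(),dy,bytes,cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);

    return hy;
}
```

Indexing coef with a per-thread value instead (for example coef[threadIdx.x%4]) would make the threads of a warp read different locations, which is the serialized case listed above.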

    #include <math.h>
    #include <cuda_runtime.h>

    //declare constant memory
    __constant__ float cangle[360];

    //the kernel is defined before main so that the launch below compiles
    __global__ void test_kernel(float* darray)
    {
        int index;

        //calculate each thread global index
        index=blockIdx.x*blockDim.x+threadIdx.x;

        //every thread of a warp reads the same cangle[loop]
        #pragma unroll 10
        for(int loop=0;loop<360;loop++)
            darray[index]=darray[index]+cangle[loop];

        return;
    }

    int main(int argc,char** argv)
    {
        int size=3200;
        float* darray;
        float hangle[360];

        //allocate device memory
        cudaMalloc((void**)&darray,sizeof(float)*size);

        //initialize allocated memory
        cudaMemset(darray,0,sizeof(float)*size);

        //initialize angle array on host (degrees to radians)
        for(int loop=0;loop<360;loop++)
            hangle[loop]=acos(-1.0f)*loop/180.0f;

        //copy host angle data to constant memory
        cudaMemcpyToSymbol(cangle,hangle,sizeof(float)*360);

        //execute device kernel (size is a multiple of the block size 64)
        test_kernel<<<size/64,64>>>(darray);

        //free device memory
        cudaFree(darray);

        return 0;
    }

Texture Memory

Texture mapping

Texture filtering
- nearest-neighbor interpolation
- linear/bilinear/trilinear interpolation
- two times bilinear interpolation

[Figure: GPU block diagram showing the host, input assembler, vertex and pixel thread issue, setup/raster/ZCull, work distribution, the thread processor with SP pairs, L1 caches and TF units, and the L2 caches with FB partitions]
The highlighted units perform graphical texture operations.

- two SMs cooperate as a texture processing cluster (TPC), the scalable unit of the graphics processor
- the texture-specific units are only available for texture operations

Texture-specific units
- texture address units compute texture addresses
- texture filtering units compute data interpolation
- read-only texture L1 cache

[Figure: the same GPU block diagram with the texture caches marked]
- read-only texture L2 cache for all TPCs
- read-only texture L1 cache for each TPC

[Figure: texture-specific units]

Texture is an object for reading data
- data is stored in the device global memory
- global memory is bound to the texture cache

[Figure: GPU block diagram with global memory attached to the texture caches]

What are the advantages of texture?

Data caching
- helpful when global memory coalescing is the main bottleneck

Data filtering
- supports linear, bilinear and trilinear hardware interpolation
- intrinsic interpolation is performed by the texture-specific units
- filter modes: cudaFilterModePoint and cudaFilterModeLinear

Access modes
- clamp and wrap memory accessing for out-of-bound addresses
- handled by the texture-specific units
- cudaAddressModeClamp: clamp boundary
- cudaAddressModeWrap: wrap boundary

Bound to linear memory
- only supports 1-dimension problems
- only gets the benefit of the texture cache
- does not support addressing modes or filtering

Bound to cuda array
- supports float addressing
- supports addressing modes
- supports hardware interpolation
- supports 1/2/3-dimension problems

Host code
- allocate global linear memory or a cuda array
- create and set the texture reference at file scope
- bind the texture reference to the allocated memory
- unbind the texture reference to free the cache resource

Device code
- fetch data by indicating the texture reference
- fetch data by using a texture fetch function
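
The slides use texture references, which newer CUDA toolkits no longer provide. As a hedged sketch of the same host and device steps, the example below swaps in the texture object API (cudaCreateTextureObject) with a 2D cuda array, clamp addressing and hardware linear filtering. The names sampleRow and runSampleRow are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <cstring>
#include <vector>

//samples the first texel row at unnormalized coordinates (xs[i],0.5)
__global__ void sampleRow(cudaTextureObject_t tex,const float* xs,float* out,int count)
{
    int index=blockIdx.x*blockDim.x+threadIdx.x;
    if(index<count)
        out[index]=tex2D<float>(tex,xs[index],0.5f);
}

std::vector<float> runSampleRow(const std::vector<float>& texels,int width,int height,
                                const std::vector<float>& xs)
{
    int count=(int)xs.size();
    std::vector<float> result(count,0.0f);

    //allocate a cuda array and copy the host texels into it
    cudaChannelFormatDesc channel=cudaCreateChannelDesc<float>();
    cudaArray_t carray;
    cudaMallocArray(&carray,&channel,width,height);
    cudaMemcpy2DToArray(carray,0,0,texels.data(),sizeof(float)*width,
                        sizeof(float)*width,height,cudaMemcpyHostToDevice);

    //describe the resource behind the texture
    cudaResourceDesc resdesc;
    memset(&resdesc,0,sizeof(resdesc));
    resdesc.resType=cudaResourceTypeArray;
    resdesc.res.array.array=carray;

    //describe how the texture is addressed and filtered
    cudaTextureDesc texdesc;
    memset(&texdesc,0,sizeof(texdesc));
    texdesc.addressMode[0]=cudaAddressModeClamp;
    texdesc.addressMode[1]=cudaAddressModeClamp;
    texdesc.filterMode=cudaFilterModeLinear;
    texdesc.readMode=cudaReadModeElementType;
    texdesc.normalizedCoords=0;

    //create the texture object (plays the role of binding)
    cudaTextureObject_t tex=0;
    cudaCreateTextureObject(&tex,&resdesc,&texdesc,NULL);

    float* dxs=NULL;
    float* dout=NULL;
    cudaMalloc((void**)&dxs,sizeof(float)*count);
    cudaMalloc((void**)&dout,sizeof(float)*count);
    cudaMemcpy(dxs,xs.data(),sizeof(float)*count,cudaMemcpyHostToDevice);

    sampleRow<<<(count+63)/64,64>>>(tex,dxs,dout,count);
    cudaMemcpy(result.data(),dout,sizeof(float)*count,cudaMemcpyDeviceToHost);

    //destroy the texture object (plays the role of unbinding) and free memory
    cudaDestroyTextureObject(tex);
    cudaFreeArray(carray);
    cudaFree(dxs);
    cudaFree(dout);

    return result;
}
```

With unnormalized coordinates a texel centre sits at index+0.5, so sampling at x=1.0 lands halfway between texels 0 and 1 and returns their average when linear filtering is enabled.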

Texture memory constraints

                            Compute capability 1.3    Compute capability 2.0
1D texture linear memory    8192                      31768
1D texture cuda array       1024x128
2D texture cuda array       (65536,32768)             (65536,65536)
3D texture cuda array       (2048,2048,2048)          (4096,4096,4096)

Measuring the number of texture cache misses or hits
- the latest visual profiler can count cache misses and hits
- needs a device with compute capability higher than 1.2

Example: 1-dimension linear memory

    #include <stdlib.h>
    #include <math.h>
    #include <cuda_runtime.h>

    //declare texture reference (legacy texture reference API, removed in CUDA 12)
    texture<float,1,cudaReadModeElementType> texreference;

    //the kernel is defined before main so that the launch below compiles
    __global__ void kernel(float* doarray,int size)
    {
        int index;

        //calculate each thread global index
        index=blockIdx.x*blockDim.x+threadIdx.x;

        //fetch global memory through texture reference
        if(index<size)
            doarray[index]=tex1Dfetch(texreference,index);

        return;
    }

    int main(int argc,char** argv)
    {
        int size=3200;

        float* harray;
        float* diarray;
        float* doarray;

        //allocate host and device memory
        harray=(float*)malloc(sizeof(float)*size);
        cudaMalloc((void**)&diarray,sizeof(float)*size);
        cudaMalloc((void**)&doarray,sizeof(float)*size);

        //initialize host array before usage
        for(int loop=0;loop<size;loop++)
            harray[loop]=(float)rand()/(float)(RAND_MAX-1);

        //copy array from host to device memory
        cudaMemcpy(diarray,harray,sizeof(float)*size,cudaMemcpyHostToDevice);

        //bind texture reference with linear memory
        cudaBindTexture(0,texreference,diarray,sizeof(float)*size);

        //execute device kernel
        kernel<<<(int)ceil((float)size/64),64>>>(doarray,size);

        //unbind texture reference to free resource
        cudaUnbindTexture(texreference);

        //copy result array from device to host memory
        cudaMemcpy(harray,doarray,sizeof(float)*size,cudaMemcpyDeviceToHost);

        //free host and device memory
        free(harray);
        cudaFree(diarray);
        cudaFree(doarray);

        return 0;
    }

Copy with an offset, reading directly from global memory:

    __global__ void offsetCopy(float* idata,float* odata,int offset)
    {
        //compute each thread global index
        int index=blockIdx.x*blockDim.x+threadIdx.x;

        //copy data from global memory
        odata[index]=idata[index+offset];
    }

The same copy, reading the input through the texture reference bound to it:

    __global__ void offsetCopy(float* idata,float* odata,int offset)
    {
        //compute each thread global index
        int index=blockIdx.x*blockDim.x+threadIdx.x;

        //copy data through the texture reference
        odata[index]=tex1Dfetch(texreference,index+offset);
    }
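
For toolkits without texture references, the offset copy might be written with a texture object over linear memory, roughly as sketched below. The names offsetCopyTex and runOffsetCopy are hypothetical, and the bounds check is an addition that the slide kernels do not have.

```cuda
#include <cuda_runtime.h>
#include <cstring>
#include <vector>

//copies size elements, reading the input at index+offset through the texture cache
__global__ void offsetCopyTex(cudaTextureObject_t tex,float* odata,int offset,int size)
{
    int index=blockIdx.x*blockDim.x+threadIdx.x;
    if(index<size)
        odata[index]=tex1Dfetch<float>(tex,index+offset);
}

//returns in[offset], in[offset+1], ... copied on the device
std::vector<float> runOffsetCopy(const std::vector<float>& in,int offset)
{
    int size=(int)in.size()-offset;
    size_t inbytes=sizeof(float)*in.size();
    std::vector<float> out(size,0.0f);
    float* didata=NULL;
    float* dodata=NULL;

    cudaMalloc((void**)&didata,inbytes);
    cudaMalloc((void**)&dodata,sizeof(float)*size);
    cudaMemcpy(didata,in.data(),inbytes,cudaMemcpyHostToDevice);

    //describe the linear memory behind the texture
    cudaResourceDesc resdesc;
    memset(&resdesc,0,sizeof(resdesc));
    resdesc.resType=cudaResourceTypeLinear;
    resdesc.res.linear.devPtr=didata;
    resdesc.res.linear.desc=cudaCreateChannelDesc<float>();
    resdesc.res.linear.sizeInBytes=inbytes;

    //linear memory only supports plain element reads
    cudaTextureDesc texdesc;
    memset(&texdesc,0,sizeof(texdesc));
    texdesc.readMode=cudaReadModeElementType;

    cudaTextureObject_t tex=0;
    cudaCreateTextureObject(&tex,&resdesc,&texdesc,NULL);

    offsetCopyTex<<<(size+63)/64,64>>>(tex,dodata,offset,size);
    cudaMemcpy(out.data(),dodata,sizeof(float)*size,cudaMemcpyDeviceToHost);

    cudaDestroyTextureObject(tex);
    cudaFree(didata);
    cudaFree(dodata);

    return out;
}
```

The texture object is passed to the kernel as an ordinary argument, so no file-scope declaration is needed.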

Example: 2-dimension cuda array

    #include <stdlib.h>
    #include <math.h>
    #include <cuda_runtime.h>

    #define size 3200

    //declare texture reference
    texture<float,2,cudaReadModeElementType> texreference;

    //the parameter is named width because size is a macro
    __global__ void kernel(float* dmatrix,int width)
    {
        int xindex;
        int yindex;

        //calculate each thread global index
        xindex=blockIdx.x*blockDim.x+threadIdx.x;
        yindex=blockIdx.y*blockDim.y+threadIdx.y;

        //fetch cuda array through texture reference
        if(xindex<width&&yindex<width)
            dmatrix[yindex*width+xindex]=tex2D(texreference,xindex,yindex);

        return;
    }

    int main(int argc,char** argv)
    {
        dim3 blocknum;
        dim3 blocksize;

        size_t bytes;
        float* hmatrix;
        float* dmatrix;

        cudaArray* carray;
        cudaChannelFormatDesc channel;

        //allocate host and device memory
        bytes=sizeof(float)*size*size;
        hmatrix=(float*)malloc(bytes);
        cudaMalloc((void**)&dmatrix,bytes);

        //initialize host matrix before usage
        for(int loop=0;loop<size*size;loop++)
            hmatrix[loop]=(float)rand()/(float)(RAND_MAX-1);

        //create channel to describe data type
        channel=cudaCreateChannelDesc<float>();

        //allocate device memory for cuda array
        cudaMallocArray(&carray,&channel,size,size);

        //copy matrix from host to device memory
        cudaMemcpyToArray(carray,0,0,hmatrix,bytes,cudaMemcpyHostToDevice);

        //set texture filter mode property
        //use cudaFilterModePoint or cudaFilterModeLinear
        texreference.filterMode=cudaFilterModePoint;

        //set texture address mode property
        //use cudaAddressModeClamp or cudaAddressModeWrap
        //(wrap is only supported with normalized coordinates)
        texreference.addressMode[0]=cudaAddressModeWrap;
        texreference.addressMode[1]=cudaAddressModeClamp;

        //bind texture reference with cuda array
        cudaBindTextureToArray(texreference,carray);

        blocksize.x=16;
        blocksize.y=16;

        blocknum.x=(int)ceil((float)size/16);
        blocknum.y=(int)ceil((float)size/16);

        //execute device kernel
        kernel<<<blocknum,blocksize>>>(dmatrix,size);

        //unbind texture reference to free resource
        cudaUnbindTexture(texreference);

        //copy result matrix from device to host memory
        cudaMemcpy(hmatrix,dmatrix,bytes,cudaMemcpyDeviceToHost);

        //free host and device memory
        free(hmatrix);
        cudaFree(dmatrix);
        cudaFreeArray(carray);

        return 0;
    }

Example: 3-dimension cuda array

    #include <stdlib.h>
    #include <math.h>
    #include <cuda_runtime.h>

    #define size 256

    //declare texture reference
    texture<float,3,cudaReadModeElementType> texreference;

    //the parameter is named width because size is a macro
    __global__ void kernel(float* dmatrix,int width)
    {
        int loop;
        int xindex;
        int yindex;
        int zindex;

        //calculate each thread global index
        xindex=threadIdx.x+blockIdx.x*blockDim.x;
        yindex=threadIdx.y+blockIdx.y*blockDim.y;

        //each thread walks along the z direction
        for(loop=0;loop<width;loop++)
        {
            zindex=loop;

            //fetch cuda array via texture reference
            dmatrix[zindex*width*width+yindex*width+xindex]=
                tex3D(texreference,xindex,yindex,zindex);
        }

        return;
    }

    int main(int argc,char** argv)
    {
        dim3 blocknum;
        dim3 blocksize;

        size_t bytes;
        float* hmatrix;
        float* dmatrix;

        cudaArray* cudaarray;
        cudaExtent volumesize;
        cudaChannelFormatDesc channel;

        cudaMemcpy3DParms copyparms={0};

        //allocate host and device memory
        bytes=sizeof(float)*size*size*size;
        hmatrix=(float*)malloc(bytes);
        cudaMalloc((void**)&dmatrix,bytes);

        //initialize host matrix before usage
        for(int loop=0;loop<size*size*size;loop++)
            hmatrix[loop]=(float)rand()/(float)(RAND_MAX-1);

        //set cuda array volume size
        volumesize=make_cudaExtent(size,size,size);

        //create channel to describe data type
        channel=cudaCreateChannelDesc<float>();

        //allocate device memory for cuda array
        cudaMalloc3DArray(&cudaarray,&channel,volumesize);

        //set cuda array copy parameters
        copyparms.extent=volumesize;
        copyparms.dstArray=cudaarray;
        copyparms.kind=cudaMemcpyHostToDevice;
        copyparms.srcPtr=
            make_cudaPitchPtr((void*)hmatrix,sizeof(float)*size,size,size);

        cudaMemcpy3D(&copyparms);

        //set texture filter mode property
        //use cudaFilterModePoint or cudaFilterModeLinear
        texreference.filterMode=cudaFilterModePoint;

        //set texture address mode property
        //use cudaAddressModeClamp or cudaAddressModeWrap
        //(wrap is only supported with normalized coordinates)
        texreference.addressMode[0]=cudaAddressModeWrap;
        texreference.addressMode[1]=cudaAddressModeWrap;
        texreference.addressMode[2]=cudaAddressModeClamp;

        //bind texture reference with cuda array
        cudaBindTextureToArray(texreference,cudaarray,channel);

        //the kernel loops over z itself, so one thread layer in z is enough
        blocksize.x=8;
        blocksize.y=8;
        blocksize.z=1;

        blocknum.x=(int)ceil((float)size/8);
        blocknum.y=(int)ceil((float)size/8);

        //execute device kernel
        kernel<<<blocknum,blocksize>>>(dmatrix,size);

        //unbind texture reference to free resource
        cudaUnbindTexture(texreference);

        //copy result matrix from device to host memory
        cudaMemcpy(hmatrix,dmatrix,bytes,cudaMemcpyDeviceToHost);

        //free host and device memory
        free(hmatrix);
        cudaFree(dmatrix);
        cudaFreeArray(cudaarray);

        return 0;
    }

Performance comparison: image projection

Image projection or ray casting
- trilinear interpolation on the nearby 8 pixels
- the intrinsic interpolation units are very powerful
- global memory accessing is very close to random

Object size 512 x 512 x 512, ray number 512 x 512

Method                               Time     Speedup
global                               1.891    -
global/locality                      0.198    9.5
texture/point                        0.072    26.2
texture/linear                       0.037    51.1
texture/linear/locality              0.012    157.5
texture/linear/locality/fast math    0.011    171.9

Why is texture memory so powerful?
Texture Memory
CUDA Array is reordered to something like a space-filling Z-order
- the software driver supports reordering the data
- the hardware supports a spatial memory layout
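The exact layout of a CUDA Array is opaque to the programmer, so the following is only a sketch of what "something like Z-order" means, not the driver's actual scheme. It computes a 2D Morton index by interleaving the bits of x and y; `part1By1` and `mortonEncode2D` are hypothetical helper names, and coordinates are assumed to fit in 16 bits.

```cpp
#include <cstdint>

// Spread the low 16 bits of v so that a zero bit sits between each original bit.
static std::uint32_t part1By1(std::uint32_t v) {
    v &= 0x0000FFFFu;
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

// Z-order (Morton) index: x bits land in even positions, y bits in odd positions.
std::uint32_t mortonEncode2D(std::uint32_t x, std::uint32_t y) {
    return part1By1(x) | (part1By1(y) << 1);
}
```

Under this mapping the four texels of any aligned 2 x 2 tile get four consecutive indices, so neighbors in both x and y tend to share a cache line, unlike a row-major layout where vertical neighbors are a full row apart.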
Why is the texture cache read-only?
Texture Memory
Texture cache cannot detect dirty data
[Figure: a float array in host memory and its cache. Labels: load from memory to cache; perform some operations on cache; lazy update for write-back; reload from memory to cache; modified by other threads]
Texture Memory
Write data to global memory directly, bypassing the texture cache
- only suitable for global linear memory, not for a CUDA array
[Figure: a float array in device memory and its texture cache]
read data through the texture cache: tex1Dfetch(texreference,index)
write data to global memory directly: darray[index]=value;
the texture cache may not be updated
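The hazard on this slide can be mimicked with a toy host-side model. This is not how the hardware cache is implemented; `ReadOnlyCacheModel`, `fetch`, and `invalidate` are hypothetical names, with `fetch` standing in for a read through `tex1Dfetch` and a plain write to the backing vector standing in for `darray[index]=value;`.

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Toy model: a read-only cache in front of a linear array.
// Direct writes to the array bypass the cache and never update it.
class ReadOnlyCacheModel {
public:
    explicit ReadOnlyCacheModel(const std::vector<float>& backing)
        : backing_(backing) {}

    // Read through the cache; a hit returns whatever was cached earlier.
    float fetch(std::size_t index) {
        auto it = lines_.find(index);
        if (it != lines_.end()) return it->second;  // hit: possibly stale
        const float value = backing_[index];        // miss: load from memory
        lines_.emplace(index, value);
        return value;
    }

    // Drop all cached entries (in this model, the only way to see new data).
    void invalidate() { lines_.clear(); }

private:
    const std::vector<float>& backing_;
    std::unordered_map<std::size_t, float> lines_;
};
```

In the model, a `fetch` that follows a direct write to the same index still returns the old value until `invalidate` is called. On a real device the analogous safe pattern is to read through the texture only data written by an earlier kernel launch, not data written by the same launch.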
What about texture data locality?
Texture Memory
all blocks are scheduled round-robin, based on the number of shaders
Why does CUDA distribute the work blocks in the horizontal direction?
Texture Memory
- load balancing across all SMs, assuming consecutive blocks have very similar workloads
- texture cache data locality, assuming consecutive blocks use nearby data
Texture Memory
reorder the block indices into Z-order to take advantage of the texture L1 cache
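One way to sketch this reordering: treat the block's linear launch index as a Morton code and decode it into the 2D tile the block should work on. The helpers `compact1By1`, `BlockCoord`, and `zOrderBlock` are hypothetical, and the sketch assumes a square grid whose side is a power of two. Inside a kernel, the linear index would typically be `blockIdx.y * gridDim.x + blockIdx.x`.

```cpp
#include <cstdint>

// Gather the even-position bits of v into the low 16 bits.
static std::uint32_t compact1By1(std::uint32_t v) {
    v &= 0x55555555u;
    v = (v | (v >> 1)) & 0x33333333u;
    v = (v | (v >> 2)) & 0x0F0F0F0Fu;
    v = (v | (v >> 4)) & 0x00FF00FFu;
    v = (v | (v >> 8)) & 0x0000FFFFu;
    return v;
}

struct BlockCoord {
    std::uint32_t x, y;
};

// Map a linear block index to the tile it should process, following Z-order:
// even bits of the index form x, odd bits form y.
BlockCoord zOrderBlock(std::uint32_t linearId) {
    return {compact1By1(linearId), compact1By1(linearId >> 1)};
}
```

Blocks with consecutive linear indices then process tiles that are close in both dimensions, so blocks that run at about the same time are more likely to reuse each other's texture cache lines. Whether this helps in practice depends on how the hardware actually orders block launches, which should be measured rather than assumed.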
Texture Memory
streaming processors: temp1=a/b+sin(c)
special function units: temp2[loop]=__cosf(d)
texture operation units: temp3=tex2D(ref,x,y)
concurrent execution for independent units
Texture Memory
Memory       Location  Cache  Speed           Access
global       off-chip  no     hundreds        all threads
constant     off-chip  yes    one ~ hundreds  all threads
texture      off-chip  yes    one ~ hundreds  all threads
shared       on-chip   -      one             block threads
local        off-chip  no     very slow       single thread
register     on-chip   -      one             single thread
instruction  off-chip  yes    -               invisible
Texture Memory
Memory    Read/Write  Property
global    read/write  input or output
constant  read        no structure
texture   read        locality structure
shared    read/write  shared within block
local     read/write  -
register  read/write  local temp variable
Reference
- Mark Harris http://www.markmark.net/
- Wei-Chao Chen http://www.cs.unc.edu/~ciao/
- Wen-Mei Hwu http://impact.crhc.illinois.edu/people/current/hwu.php