Code GPU with CUDA - Optimizing memory and control flow
Transcript of Code GPU with CUDA - Optimizing memory and control flow
CODE GPU WITH CUDA: OPTIMIZING MEMORY & CONTROL FLOW
Created by Marina Kolpakova (cuda.geek) for Itseez

OUTLINE
- Memory types
- Memory caching
- Types of memory access patterns
- Textures
- Control flow performance limiters
- Common advice
MEMORY OPTIMIZATION

MEMORY TYPES

| Memory   | Scope       | Location | Cached   | Access | Lifetime |
|----------|-------------|----------|----------|--------|----------|
| Register | Thread      | On-chip  | N/A      | R/W    | Thread   |
| Local    | Thread      | Off-chip | L1/L2    | R/W    | Thread   |
| Shared   | Block       | On-chip  | N/A      | R/W    | Block    |
| Global   | Grid + Host | Off-chip | L2       | R/W    | App      |
| Constant | Grid + Host | Off-chip | L1,L2,L3 | R      | App      |
| Texture  | Grid + Host | Off-chip | L1,L2    | R      | App      |
GPU CACHES
GPU caches are not intended for the same use as a CPU's:
- Not aimed at temporal reuse. Much smaller than a CPU's, especially per thread (e.g. Fermi: 48 KB L1 shared by 1536 threads in flight, roughly 32 bytes per thread, a quarter of a 128-byte line)
- Aimed at spatial reuse: intended to smooth some access patterns and to help with spilled registers and stack
- Do not tile relying on cache and block size: lines will likely be evicted within the next few accesses
- Use smem for tiling instead: same latency, fully programmable
- L2 is aimed at speeding up atomics and gmem writes
GMEM
Learn your access pattern before thinking about latency hiding, and try not to thrash the memory bus.
Four general categories of inefficient memory access patterns:
- Misaligned (offset) warp addresses
- Strided access between threads within a warp
- Thread-affine (each thread in a warp accesses a large contiguous region)
- Irregular (scattered) addresses
Always be aware of the bytes you actually need versus the bytes you transfer over the bus.
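That last point can be made concrete with a small host-side sketch (plain C++, with assumed constants: one 4-byte float per thread, 128-byte cache lines, 32-thread warps): count the distinct lines a warp touches and compare the bytes moved against the bytes actually needed.

```cpp
#include <cstddef>
#include <set>

// Fraction of transferred bytes a warp actually uses when each of its 32
// threads loads one 4-byte float at (tid * stride_elems) elements from an
// aligned base. Assumes 128-byte memory transactions.
double warp_load_efficiency(std::size_t stride_elems) {
    const std::size_t kWarp = 32, kElem = 4, kLine = 128;
    std::set<std::size_t> lines;                  // distinct 128-byte lines touched
    for (std::size_t tid = 0; tid < kWarp; ++tid)
        lines.insert(tid * stride_elems * kElem / kLine);
    double useful = kWarp * kElem;                // bytes the warp really needs
    double moved  = lines.size() * kLine;         // bytes the bus transfers
    return useful / moved;
}
```

A unit stride gives 1.0 (one transaction, fully used); a stride of 2 already wastes half the bus, and a stride of 32 elements wastes 31/32 of it.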
GMEM: MISALIGNED
- Add extra padding to the data to force alignment
- Use the read-only texture L1
- A combination of the above
GMEM: STRIDED
If the pattern is regular, try to change the data layout: AoS -> SoA.
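A minimal host-side sketch of the AoS -> SoA transformation (hypothetical `Particle` type, not from the slides): in the AoS layout, consecutive threads reading the `x` field stride by `sizeof(ParticleAoS)`; in the SoA layout they read consecutive floats, which coalesces.

```cpp
#include <vector>

struct ParticleAoS { float x, y, z; };   // array-of-structures element

struct ParticlesSoA {                    // structure-of-arrays layout
    std::vector<float> x, y, z;
};

// Repack an AoS buffer into three contiguous per-field arrays.
ParticlesSoA to_soa(const std::vector<ParticleAoS>& aos) {
    ParticlesSoA soa;
    for (const auto& p : aos) {
        soa.x.push_back(p.x);
        soa.y.push_back(p.y);
        soa.z.push_back(p.z);
    }
    return soa;
}
```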
GMEM: STRIDED
Use smem to correct the access pattern:
1. Load gmem -> smem with the best coalescing
2. Synchronize
3. Use
GMEM: STRIDED
Use warp shuffle to permute elements within a warp:
1. Coalescedly load the elements needed by the warp
2. Permute
3. Use
GMEM: STRIDED
Use a proper caching strategy:
- cg – cache global
- ldg – cache in texture L1
- cs – cache streaming
GMEM: THREAD-AFFINE
Each thread accesses a relatively long contiguous memory region:
- Loading big structures using AoS
- A thread loads a contiguous region of data
- All threads load the same data
GMEM: THREAD-AFFINE
Work distribution. Per-thread contiguous chunks (thread-affine):

```cuda
int tid = blockIdx.x * blockDim.x + threadIdx.x;
int threadN = N / (blockDim.x * gridDim.x);
for (size_t i = tid * threadN; i < (tid + 1) * threadN; ++i) {
    sum += in[i];
}
```

Grid-stride loop (coalesced):

```cuda
for (size_t i = tid; i < N; i += blockDim.x * gridDim.x) {
    sum += in[i];
}
```
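The two distribution schemes can be simulated on the host (a sketch with assumed helper names; `N` divisible by the thread count) to see why the grid-stride loop coalesces: adjacent threads touch adjacent elements on every iteration, while in the chunked scheme each thread walks its own distant region.

```cpp
#include <cstddef>
#include <vector>

// Elements visited by thread `tid` of `nthreads` in the contiguous-chunk scheme.
std::vector<std::size_t> chunked(std::size_t tid, std::size_t nthreads, std::size_t N) {
    std::size_t threadN = N / nthreads;          // chunk size per thread
    std::vector<std::size_t> idx;
    for (std::size_t i = tid * threadN; i < (tid + 1) * threadN; ++i)
        idx.push_back(i);
    return idx;
}

// Elements visited by thread `tid` in the grid-stride scheme.
std::vector<std::size_t> grid_stride(std::size_t tid, std::size_t nthreads, std::size_t N) {
    std::vector<std::size_t> idx;
    for (std::size_t i = tid; i < N; i += nthreads)
        idx.push_back(i);
    return idx;
}
```

For 4 threads over 16 elements, thread 0 visits {0,1,2,3} in the chunked scheme but {0,4,8,12} in the grid-stride scheme, with thread 1 starting right next to it at element 1.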
UNIFORM LOAD
All threads in a block access the same address as read-only. The memory operation uses the 3-level constant cache.
- Generated by the compiler
- Available as a PTX asm insertion

```cuda
__device__ __forceinline__ float __ldu(const float* ptr)
{
    float val;
    asm("ldu.global.f32 %0, [%1];" : "=f"(val) : "l"(ptr));
    return val;
}
```
GMEM: IRREGULAR
Random memory access: threads in a warp access many lines, and the strides are irregular.
- Improve data locality
- Try 2D-local arrays (Morton-ordered)
- Use the read-only texture L1
- Kernel fission to localize the worst case
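A sketch of the Morton (Z-order) index the second bullet refers to, as plain host-side bit manipulation: interleaving the bits of `x` and `y` keeps 2D-neighbouring elements close in linear memory, which improves locality for irregular 2D access.

```cpp
#include <cstdint>

// Morton (Z-order) index for a 2D coordinate: bit-interleave x and y.
uint32_t morton2d(uint16_t x, uint16_t y) {
    auto spread = [](uint32_t v) {        // insert a 0 bit between each bit of v
        v = (v | (v << 8)) & 0x00FF00FFu;
        v = (v | (v << 4)) & 0x0F0F0F0Fu;
        v = (v | (v << 2)) & 0x33333333u;
        v = (v | (v << 1)) & 0x55555555u;
        return v;
    };
    return spread(x) | (spread(y) << 1);  // x bits in even, y bits in odd positions
}
```

The 2x2 block (0,0)..(1,1) maps to indices 0..3, so small spatial neighbourhoods stay within small index ranges.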
TEXTURE
- Smaller transactions and different caching (dedicated L1, 48 KB, ~104-clock latency)
- The cache is not polluted by other GMEM loads; a separate partition for each warp scheduler helps to prevent cache thrashing
- Possible hardware interpolation (note: 9-bit alpha)
- Hardware handling of out-of-bounds accesses

Kepler improvements:
- sm_30+: bindless textures. No global static variables; can be used in threaded code
- sm_32+: GMEM access through the texture cache, bypassing the interpolation units
SMEM: BANKING
Kepler: 32-bit and 64-bit modes.
Special case: 2D smem usage (Fermi example):

```cuda
__shared__ float smem_buffer[32][32 + 1];
```
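A host-side model of why that +1 padding helps, assuming 32 four-byte banks (Fermi-style): the bank of a float element is its linear word index modulo 32. Without padding (pitch 32) a column walk keeps hitting the same bank; with pitch 33 each row lands on a different bank.

```cpp
#include <cstddef>

// Bank serviced by element [row][col] of a 2D float array with the given
// row pitch (in elements), on a part with 32 four-byte banks.
std::size_t bank_of(std::size_t row, std::size_t col, std::size_t pitch) {
    return (row * pitch + col) % 32;
}
```

With pitch 32, all 32 threads reading column 5 hit bank 5 (a 32-way conflict); with pitch 33 they hit 32 distinct banks.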
SMEM
The common techniques are:
- use smem to improve the memory access pattern
- use smem for stencil processing

But the gap between smem and math throughput is increasing:
- Tesla: 16 (32-bit) banks vs 8 thread processors (2:1)
- GF100: 32 (32-bit) banks vs 32 thread processors (1:1)
- GF104: 32 (32-bit) banks vs 48 thread processors (2:3)
- Kepler: 32 (64-bit) banks vs 192 thread processors (1:3)

The maximum size is 48 KB (49152 B); assuming max occupancy of 64 warps × 32 threads, that is 24 bytes per thread. More intensive smem usage lowers occupancy.
SMEM (CONT.)
smem and L1 share the same 64 KB. Program-configurable split:
- Fermi: 48:16, 16:48
- Kepler: 48:16, 16:48, 32:32

Set via cudaDeviceSetCacheConfig() / cudaFuncSetCacheConfig():
- prefer L1 to improve lmem usage
- prefer smem for stencil kernels

smem is often used for:
- data sharing across the block
- inter-block communication
- block-level buffers (for scan or reduction)
- stencil code
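The "block-level buffers for reduction" use can be sketched on the host (an assumed shape, not the slides' code): a tree reduction that halves the number of active threads each step, where finishing one sweep before starting the next stands in for the __syncthreads() between steps.

```cpp
#include <cstddef>
#include <vector>

// Host-side model of a block-level smem tree reduction (power-of-two size).
// Each outer iteration models one step between __syncthreads() barriers;
// the inner loop models the still-active threads of the block.
float block_reduce(std::vector<float> smem) {
    for (std::size_t stride = smem.size() / 2; stride > 0; stride /= 2)
        for (std::size_t tid = 0; tid < stride; ++tid)   // active threads
            smem[tid] += smem[tid + stride];
    return smem[0];                                      // thread 0 holds the sum
}
```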
LMEM
Local memory is a stack memory analogue: call stack, register spilling. Note: local memory reads and writes are cached in L1.
Registers are for automatic variables:
- the volatile keyword enforces spilling
- registers do not support indexing, so local memory is used for local arrays

Register spilling leads to more instructions and more memory traffic.

```cuda
int a = 42;        // automatic variable: kept in a register
int b[SIZE] = {0}; // indexed local array: placed in local memory
```
SPILLING CONTROL
1. Use __launch_bounds__ to help the compiler select the maximum number of registers
2. Compile with -maxrregcount to enforce compiler optimization of register usage, with register spilling if needed
3. Otherwise, by default, you may run fewer concurrent warps per SM

```cuda
__global__ void
__launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)
kernel(...)
{
    // ...
}
```
CONTROL FLOW

CONTROL FLOW: PROBLEMS
- Warp divergence: branching, early loop exits, etc. Inspect the SASS to find divergent pieces of code
- Data-dependent workload: the code path depends on the input (as in classification tasks)
- Too much synchronization logic: intensive use of parallel data structures, lots of atomics, __syncthreads(), etc.
- Resident warps: occupy resources but do nothing
- Big blocks: tail effect
CONTROL FLOW: SOLUTIONS
- Understand your problem; select the best algorithm keeping the GPU architecture in mind
- Maximize independent parallelism
- The compiler generates branch predication with -O3 when optimizing if/switch, but only when the number of instructions is less than or equal to a threshold: 7 if there are lots of divergent warps, 4 otherwise
- Adjust the thread block size
- Try work queues
KERNEL FUSION AND FISSION
Fusion:
- Replace a chain of kernel calls with one fused kernel
- Helps to save memory reads/writes; intermediate results can be kept in registers
- Enables further ILP optimizations
- The kernels should have almost the same access pattern

Fission:
- Replace one kernel call with a chain
- Helps to localize ineffective memory access patterns
- Insert small kernels that repack data (e.g. integral image)
TUNING BLOCK CONFIGURATION
Finding the optimal launch configuration is crucial for best performance. The launch configuration affects occupancy:
- low occupancy prevents full hardware utilization and lowers the ability to hide latency
- high occupancy for kernels with large memory demands results in over-polluted read or write queues

Experiment to find the configuration (block and grid resolutions, amount of work per thread) that is optimal for your kernel.
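A rough occupancy sketch, using assumed Kepler-like per-SM limits (65536 registers, 49152 B smem, 64 resident warps; real limits vary by compute capability, and allocation granularity is ignored): occupancy is the number of resident warps the scarcest resource allows, over the hardware maximum.

```cpp
#include <algorithm>
#include <cstddef>

// Fraction of the SM's maximum resident warps a kernel can keep in flight,
// limited by whichever of registers or shared memory runs out first.
double occupancy(std::size_t regs_per_thread, std::size_t smem_per_block,
                 std::size_t threads_per_block) {
    const std::size_t kRegs = 65536, kSmem = 49152, kMaxWarps = 64;
    std::size_t warps_per_block = threads_per_block / 32;
    std::size_t by_regs = kRegs / (regs_per_thread * threads_per_block);
    std::size_t by_smem = smem_per_block ? kSmem / smem_per_block : by_regs;
    std::size_t blocks  = std::min(by_regs, by_smem);
    std::size_t warps   = std::min(blocks * warps_per_block, kMaxWarps);
    return double(warps) / double(kMaxWarps);
}
```

For example, 256-thread blocks at 32 registers/thread reach full occupancy; doubling registers to 64 halves it, and 24 KB of smem per block caps the SM at two blocks.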
FINAL WORDS
Basic CUDA code optimizations:
- use compiler flags
- do not trick the compiler
- use structure of arrays
- improve memory layout
- load by cache line
- process by row
- cache data in registers
- re-compute values instead of re-loading
- keep data on the GPU
FINAL WORDS
Conventional parallelization optimizations:
- use lightweight locking, atomics, and lock-free code
- minimize locking, memory fences, and volatile accesses
FINAL WORDS
Conventional architectural optimizations:
- utilize shared memory, constant memory, streams, thread voting, and rsqrtf
- detect the compute capability and number of SMs
- tune the thread count, blocks per SM, launch bounds, and the L1 cache/shared memory configuration
THE END
By cuda.geek, 2013–2015