ATI Stream Computing OpenCL™ Histogram Optimization Illustration Marc Romankewicz April 5, 2010.
| ATI Stream Computing Update | Confidential2 2 | ATI Stream Computing – OpenCL™ Histogram Optimization Illustration
The problem
for( many input values )
{
    histogram[ value ]++;
}
• Many scattered read-modify-write (r-m-w) accesses into a small data structure
• On CPU, scattered r-m-w goes to cache by default: fast
• On GPU, scattered r-m-w goes to __global by default: worst case
Solution: use __local memory and parallelize the histogram computation
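For reference, the loop above can be written out as a serial C function. This is a minimal sketch, assuming 8-bit input values and 256 bins; the name histogram_serial is hypothetical and does not appear in the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NBINS 256

/* Serial reference: one scattered read-modify-write per input byte. */
static void histogram_serial(const uint8_t *input, size_t n, uint32_t hist[NBINS])
{
    assert(input != NULL || n == 0);
    memset(hist, 0, NBINS * sizeof hist[0]);
    for (size_t i = 0; i < n; i++)
        hist[input[i]]++;
}
```

A serial version like this is also a convenient way to check the GPU result bin by bin.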
GPU Algorithm
1. Thread fetches input data from __global to __private (registers)
2. Scatter into __local sub-histograms in group (multiple LDS banks per bin)
3. Reduce __local bins into single histogram per group, flush to __global
4. Reduce __global histograms (2nd kernel for global sync point)
[Diagram: the input buffer in __global feeds the SIMDs; each SIMD holds __local histograms, which are flushed to __global]
Generic reduction performance
Input bytes processed, approximate numbers (ATI Radeon™ HD 5870, ATI Stream SDK v2.01):
Steps run    Rate
1            145 GB/s
1+2          109 GB/s
1+2+3        107 GB/s
1+2+3+4      103 GB/s
Data sizes across the reduction steps: 256 MB to 320 KB, 320 KB to 256 KB, 256 KB to 1 KB
Configuration: AMD Phenom™ 9950 X4 processor @ 2.60 GHz, MSI K9A2 Platinum, 8 GB RAM, Windows® 7 32-bit, ATI Radeon™ HD 5870 GPU, ATI Stream SDK v2.01, ATI Catalyst™ 10.2
Kernel launch setup
• Use at least as many threads as needed to fetch the input optimally
[Chart not reproduced; axis label: Group size]
Launch setup: assorted lore
• At least 1 group per SIMD
• 3-4 wavefronts per SIMD to keep SIMD stages busy (2 ALU, 1 fetch, 1 export)
• For memory bound kernels: >= 7 wavefronts per SIMD for __global latency hiding
(> 8k threads on AMD “Cypress” GPU)
• Per-thread and per-group costs become noticeable at high thread counts
(e.g. 1 thread per DWORD 4-vec)
• Good experimental starting point: 64 and/or 128 threads/group, >= 16k threads
(on AMD “Cypress” GPU)
• On CPU: as few threads as possible, e.g. 1x – 2x number of compute units
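The thread counts quoted above follow from simple occupancy arithmetic. The sketch below assumes that “Cypress” has 20 SIMDs and 64 threads per wavefront; these figures are not stated in the slides, and the function names are hypothetical.

```c
#include <assert.h>

/* Assumed "Cypress" figures (not from the slides): 20 SIMDs, 64-thread wavefronts. */
enum { CYPRESS_SIMDS = 20, WAVEFRONT_SIZE = 64 };

/* Threads needed to place `waves` wavefronts on every SIMD. */
static unsigned threads_for_occupancy(unsigned waves)
{
    assert(waves > 0);
    return waves * CYPRESS_SIMDS * WAVEFRONT_SIZE;
}

/* Wavefronts per SIMD (rounded down) provided by a given global thread count. */
static unsigned wavefronts_per_simd(unsigned n_threads)
{
    return n_threads / WAVEFRONT_SIZE / CYPRESS_SIMDS;
}
```

Under these assumptions, 7 wavefronts per SIMD corresponds to 8960 threads, which matches the "> 8k threads" figure, and 16k threads gives about 12 wavefronts per SIMD.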
Launch setup, histogram
• Larger group size: better __local sharing between threads
• Smaller group size: __local reduction gets more expensive
• Experimental peak at 256 threads/group, 64k threads
Launch setup, histogram, cont’d
#include <CL/cl.h>
#define NBINS 256
int main(void)
{
    size_t nThreads           = 64 * 1024;                         /* 64k threads */
    size_t nThreadsPerGroup   = 256;
    size_t nGroups            = nThreads / nThreadsPerGroup;       /* 256 groups */
    size_t n4Vectors          = 4096 * 4096;                       /* 16M 4-vectors */
    size_t n4VectorsPerThread = n4Vectors / nThreads;              /* 256 */
    size_t inputNBytes        = n4Vectors * sizeof(cl_uint4);      /* 256 MB */
    size_t outputNBytes       = nGroups * NBINS * sizeof(cl_uint); /* 256 KB */
    return 0;
}
(Static setup for benchmarking purposes only; a real application would take into account the image size and the GPU type, i.e. wavefront size and number of compute units.)
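The sizes implied by this setup can be checked on the host without an OpenCL installation. The following is a sketch only: host_uint4 is a hypothetical 16-byte stand-in for cl_uint4, and launch_setup and make_setup are illustrative names, not part of the original benchmark.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NBINS 256

/* Hypothetical stand-in for cl_uint4: four 32-bit components, 16 bytes. */
typedef struct { uint32_t s[4]; } host_uint4;

typedef struct {
    size_t nGroups;
    size_t n4VectorsPerThread;
    size_t inputNBytes;
    size_t outputNBytes;
} launch_setup;

/* Derive the launch and buffer sizes from the three free parameters. */
static launch_setup make_setup(size_t nThreads, size_t nThreadsPerGroup, size_t n4Vectors)
{
    assert(nThreadsPerGroup > 0 && nThreads % nThreadsPerGroup == 0);
    assert(nThreads > 0 && n4Vectors % nThreads == 0);
    launch_setup s;
    s.nGroups            = nThreads / nThreadsPerGroup;
    s.n4VectorsPerThread = n4Vectors / nThreads;
    s.inputNBytes        = n4Vectors * sizeof(host_uint4);
    s.outputNBytes       = s.nGroups * NBINS * sizeof(uint32_t);
    return s;
}
```

With the slide's values (64k threads, 256 threads per group, 4096 * 4096 4-vectors) this yields 256 groups, 256 4-vectors per thread, a 256 MB input buffer and a 256 KB output buffer.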
Kernel
__kernel void histogramKernel( __global uint4 *Image,
                               __global uint  *Histogram,
                               uint n4VectorsPerThread )
{
    __local uint subhists[NBANKS * NBINS];
    …
• Input buffer is processed as 4-vectors
• Output buffer holds sub-histograms and final histogram
(256 bins * 256 groups * sizeof(cl_uint) = 256 KB)
• __local buffer holds work-group sub-histograms
(256 bins * 16 banks * sizeof(cl_uint) = 16 KB per SIMD)
Kernel, parallel LDS clear
__local uint2 *p = (__local uint2 *) subhists;
if( ltid < lmem_max_threads )
{
    for( <coalesced access pattern> )
        p[idx] = 0;
}
barrier( CLK_LOCAL_MEM_FENCE );
• Clearing in parallel is 4.5x faster than clearing with a single thread
• Clearing as uint2 is slightly faster than as uint (2x more LDS requests per instruction)
Kernel, coalesced access
uint tid    = get_global_id(0);
uint Stride = get_global_size(0);
uint4 temp;
uint i, idx;
for( i=0, idx = tid; i < n4VectorsPerThread; i++, idx += Stride )
{
    temp = Image[idx];
    …
• Each thread starts at its global thread ID
• Stride is the total number of threads
• Resulting pattern over all threads is optimally coalesced
[Diagram: loop iterations 0, 1, 2 each cover one contiguous block of get_global_size(0) elements]
Kernel, serial access
uint tid = get_global_id(0);
uint4 temp;
uint i, idx;
for( i=0, idx = tid*n4VectorsPerThread; i < n4VectorsPerThread; i++, idx++ )
{
    temp = Image[idx];
    …
• Each thread reads its own block with stride 1
• Resulting pattern is bad for uncached __global
• OK on CPU and on cached GPU memory
[Diagram: loop iterations 0, 1, 2 step through a per-thread block of n4VectorsPerThread elements]
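The difference between the two access patterns comes down to the index each thread computes per loop iteration. The helpers below are a host-side sketch of that arithmetic; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Coalesced: in iteration i, neighbouring threads read neighbouring elements. */
static size_t coalesced_index(size_t tid, size_t i, size_t n_threads)
{
    assert(tid < n_threads);
    return tid + i * n_threads;
}

/* Serial: each thread walks its own contiguous block of n_per_thread elements. */
static size_t serial_index(size_t tid, size_t i, size_t n_per_thread)
{
    assert(i < n_per_thread);
    return tid * n_per_thread + i;
}
```

In any given iteration, threads 0 and 1 are one element apart in the coalesced pattern but n_per_thread elements apart in the serial one, which is why the serial pattern suits a per-core cache and not uncached __global fetches.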
Coalesced vs. serial access
[Chart not reproduced; measured at group size 64]
Kernel: 4-vector pixel mask & shift
1. fetch: XYZWXYZWXYZWXYZW
2. mask:  ___W___W___W___W
3. shift: _XYZ_XYZ_XYZ_XYZ
4. mask:  ___Z___Z___Z___Z
5. shift: __XY__XY__XY__XY
6. mask:  ___Y___Y___Y___Y
7. …
Performs better than generic uchar4/uchar16
#define NBANKS 16
uint offset = (uint) ltid % (uint) (NBANKS);
for( i=0, idx=tid; i<n4VectorsPerThread; i++, idx += Stride )
{
    temp = Image[idx];
    temp2 = (temp & msk) * (uint4) NBANKS + offset;
    …
    temp = temp >> shft;
    temp2 = (temp & msk) * (uint4) NBANKS + offset;
    …
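For a single 32-bit component, the mask-and-shift sequence can be sketched in plain C as follows. The slides do not define msk and shft; the values 0xFF and 8 used here are assumptions for 8-bit pixels, and unpack_bytes is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Split one 32-bit word into four 8-bit values, lowest byte first,
   by alternating mask and shift. msk and shft are assumed values. */
static void unpack_bytes(uint32_t word, uint32_t out[4])
{
    const uint32_t msk  = 0xFFu;
    const uint32_t shft = 8u;
    assert(out != 0);
    for (int k = 0; k < 4; k++) {
        out[k] = word & msk;   /* mask: keep the lowest byte */
        word >>= shft;         /* shift: move the next byte down */
    }
}
```

The kernel applies the same two operations to all four components of a uint4 at once, so each fetch yields 16 pixel values.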
Kernel, atomic scatter
#define NBANKS 16
uint offset = (uint) ltid % (uint) (NBANKS);
for( i=0, idx=tid; i<n4VectorsPerThread; i++, idx += Stride )
{
    temp = Image[idx];
    temp2 = (temp & msk) * (uint4) NBANKS + offset;
    (void) atom_inc( subhists + temp2.x );
    (void) atom_inc( subhists + temp2.y );
    (void) atom_inc( subhists + temp2.z );
    (void) atom_inc( subhists + temp2.w );
    …
Kernel, LDS banks
#define NBANKS 16
uint offset = (uint) ltid % (uint) (NBANKS);
for( i=0, idx=tid; i<n4VectorsPerThread; i++, idx += Stride )
{
    temp = Image[idx];
    temp2 = (temp & msk) * (uint4) NBANKS + offset;
    …
NBANKS = 1
LDS addr 0:    0 1 2 3 4 5 6 7 8 9 A B C D E F
NBANKS = 16
LDS addr 0:    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
LDS addr 0x10: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
LDS addr 0x20: 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
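The address computation behind this layout can be sketched on the host as below. The function name subhist_index is hypothetical; NBINS and NBANKS take the values used in the slides.

```c
#include <assert.h>
#include <stdint.h>

#define NBINS  256
#define NBANKS 16

/* Index into subhists[] for pixel value `bin`, updated by local thread `ltid`.
   Each bin owns NBANKS consecutive slots; the thread picks one by ltid % NBANKS. */
static uint32_t subhist_index(uint32_t bin, uint32_t ltid)
{
    assert(bin < NBINS);
    return bin * NBANKS + ltid % NBANKS;
}
```

With NBANKS = 16, sixteen consecutive threads that hit the same bin land on sixteen different slots, so their atomic increments do not collide on one address.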
LDS banking performance
Effective LDS rate: > 900 GB/sec
Kernel, LDS reduction
barrier( CLK_LOCAL_MEM_FENCE );
if( ltid < NBINS )
{
    uint bin = 0;
    for( i=0; i<NBANKS; i++ )
        bin += subhists[ (ltid * NBANKS) + i ];
    Histogram[ (get_group_id(0) * NBINS) + ltid ] = bin;
}
[Diagram: the 16 bank copies of each bin (LDS addr 0, 0x10, …) are summed into one bin of the group's histogram in __global]
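The per-group reduction can be emulated serially on the host to check the indexing. This is a sketch, not the kernel itself: the outer loop stands in for the threads with ltid < NBINS, and reduce_banks is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NBINS  256
#define NBANKS 16

/* Sum the NBANKS copies of every bin into one histogram of NBINS entries. */
static void reduce_banks(const uint32_t *subhists, uint32_t *hist)
{
    assert(subhists != NULL && hist != NULL);
    for (uint32_t bin = 0; bin < NBINS; bin++) {   /* one "thread" per bin */
        uint32_t sum = 0;
        for (uint32_t i = 0; i < NBANKS; i++)
            sum += subhists[bin * NBANKS + i];
        hist[bin] = sum;
    }
}
```

In the kernel, hist corresponds to the group's slice of the __global Histogram buffer, starting at get_group_id(0) * NBINS.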
Kernel 2, __global reduction
__kernel void reduceKernel( __global uint *Histogram, uint nSubHists )
{
    uint tid = get_global_id(0);
    uint bin = 0;
    for( uint i=0; i < nSubHists; i++ )
        bin += Histogram[ (i * NBINS) + tid ];
    Histogram[ tid ] = bin;
}
[Diagram: the per-group histograms in __global are summed bin by bin into a single histogram in __global]
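The second kernel can likewise be emulated on the host. In this sketch the outer loop plays the role of the NBINS work-items, and reduce_global is a hypothetical name; the in-place write is safe because each work-item only ever reads and writes its own bin column.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NBINS 256

/* Sum n_sub_hists histograms of NBINS entries each; the result overwrites
   the first histogram, as in reduceKernel. */
static void reduce_global(uint32_t *histogram, uint32_t n_sub_hists)
{
    assert(histogram != NULL);
    for (uint32_t tid = 0; tid < NBINS; tid++) {   /* one "work-item" per bin */
        uint32_t bin = 0;
        for (uint32_t i = 0; i < n_sub_hists; i++)
            bin += histogram[(size_t)i * NBINS + tid];
        histogram[tid] = bin;
    }
}
```

For the setup in these slides, n_sub_hists would be the number of work-groups of the first kernel (256), so the 256 KB of sub-histograms collapse into a single 1 KB histogram.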
Single component vs. 4-vector
• 4-vectors work best for many cases.
• Some corner cases can be faster using single-component access.
• For absolute peak performance, it’s worth trying both.
Single component vs. 4-vector, histogram
for( i=0, idx=tid; i<n4VectorsPerThread; i++, idx += Stride )
{
    temp = Image[idx];
    temp2 = (temp & msk) * (uint4) NBANKS + offset;
    (void) atom_inc( subhists + temp2.x );
    (void) atom_inc( subhists + temp2.y );
    (void) atom_inc( subhists + temp2.z );
    (void) atom_inc( subhists + temp2.w );
    temp = temp >> shft;
Single component vs. 4-vector, histogram, cont’d
for( i=0, idx=tid; i<n4VectorsPerThread; i++, idx += Stride )
{
    temp.x = Image[idx].x;
    temp.y = Image[idx].y;
    temp.z = Image[idx].z;
    temp.w = Image[idx].w;
    temp2.x = (temp.x & msk) * (uint) NBANKS + offset;
    temp2.y = (temp.y & msk) * (uint) NBANKS + offset;
    temp2.z = (temp.z & msk) * (uint) NBANKS + offset;
    temp2.w = (temp.w & msk) * (uint) NBANKS + offset;
    (void) atom_inc( subhists + temp2.x );
    (void) atom_inc( subhists + temp2.y );
    (void) atom_inc( subhists + temp2.z );
    (void) atom_inc( subhists + temp2.w );
    temp.x = temp.x >> shft;
    temp.y = temp.y >> shft;
    temp.z = temp.z >> shft;
    temp.w = temp.w >> shft;
10% faster!
Disclaimer & Attribution
DISCLAIMER
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2010 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, ATI, the ATI Logo, AMD Phenom, Catalyst, Radeon, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Microsoft, Windows, and Windows Vista are trademarks of Microsoft Corporation in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners. OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.