ATI Stream Computing OpenCL™ Histogram Optimization Illustration Marc Romankewicz April 5, 2010.



The problem

for( many input values )
{
    histogram[ value ]++;
}

• Many scattered read-modify-write accesses into small data structure

• On CPU, scattered r-m-w goes to cache by default: fast

• On GPU, scattered r-m-w goes to __global by default: worst case

Solution: use __local memory & parallelize histogram compute
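For reference, the serial loop above can be written out as a small C function. This is only a sketch, assuming 8-bit input values and 256 bins (the NBINS value used later in the deck); the name histogram_serial is hypothetical and does not come from the slides.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NBINS 256

/* Serial reference histogram: one scattered read-modify-write per input byte.
   Assumes 8-bit input values, so every value is a valid bin index. */
void histogram_serial(const uint8_t *input, size_t n, uint32_t hist[NBINS])
{
    memset(hist, 0, NBINS * sizeof hist[0]);
    for (size_t i = 0; i < n; i++)
        hist[input[i]]++;
}
```

A function like this is also a convenient way to check the output of the GPU kernels.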


GPU Algorithm

1. Thread fetches input data from __global to __private (registers)

2. Scatter into __local sub-histograms in group (multiple LDS banks per bin)

3. Reduce __local bins into single histogram per group, flush to __global

4. Reduce __global histograms (2nd kernel for global sync point)

[Diagram: the input buffer in __global feeds the SIMDs; each SIMD holds __local histograms, which are flushed to __global.]
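The four steps can be modelled on the CPU to check the data flow. The sketch below is an illustration under stated assumptions, not the deck's kernel: the input is treated as packed 32-bit words (four 8-bit values each) rather than uint4 vectors, work-items run one after another so no atomics are needed, and the name histogram_model and its parameters are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NBINS  256
#define NBANKS 16

/* CPU model of the four steps. n_words must be a multiple of
   n_groups * group_size. Returns 0 on success, -1 on bad input or
   allocation failure. */
int histogram_model(const uint32_t *input, size_t n_words,
                    size_t n_groups, size_t group_size,
                    uint32_t final_hist[NBINS])
{
    size_t n_threads = n_groups * group_size;
    if (n_threads == 0 || n_words % n_threads != 0)
        return -1;
    size_t per_thread = n_words / n_threads;

    uint32_t *group_hists = calloc(n_groups * NBINS, sizeof *group_hists);
    uint32_t *subhists = malloc(NBANKS * NBINS * sizeof *subhists);
    if (!group_hists || !subhists) {
        free(group_hists);
        free(subhists);
        return -1;
    }

    for (size_t g = 0; g < n_groups; g++) {
        /* stands in for the per-group __local buffer */
        memset(subhists, 0, NBANKS * NBINS * sizeof *subhists);

        for (size_t l = 0; l < group_size; l++) {
            size_t tid = g * group_size + l;
            uint32_t offset = (uint32_t)(l % NBANKS);
            for (size_t i = 0, idx = tid; i < per_thread; i++, idx += n_threads) {
                uint32_t w = input[idx];              /* step 1: fetch */
                for (int b = 0; b < 4; b++) {         /* step 2: scatter */
                    subhists[(w & 0xFFu) * NBANKS + offset]++;
                    w >>= 8;
                }
            }
        }

        /* step 3: reduce the banks of each bin, flush to the group's slot */
        for (size_t bin = 0; bin < NBINS; bin++) {
            uint32_t sum = 0;
            for (size_t k = 0; k < NBANKS; k++)
                sum += subhists[bin * NBANKS + k];
            group_hists[g * NBINS + bin] = sum;
        }
    }

    /* step 4: reduce the per-group histograms */
    for (size_t bin = 0; bin < NBINS; bin++) {
        uint32_t sum = 0;
        for (size_t g = 0; g < n_groups; g++)
            sum += group_hists[g * NBINS + bin];
        final_hist[bin] = sum;
    }

    free(group_hists);
    free(subhists);
    return 0;
}
```

On the GPU, the loops over g and l run in parallel, which is why step 2 needs atomic increments and step 4 needs a second kernel.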



Generic reduction performance

Input bytes processed, approximate numbers (ATI Radeon™ HD 5870, ATI Stream SDK v2.01):

Steps      Data reduced by last step   Rate
1          -                           145 GB/s
1+2        256 MB to 320 KB            109 GB/s
1+2+3      320 KB to 256 KB            107 GB/s
1+2+3+4    256 KB to 1 KB              103 GB/s

Configuration: AMD Phenom™ 9950 X4 processor @ 2.60 GHz, MSI K9A2 Platinum, 8 GB RAM, Windows® 7 32-bit, ATI Radeon™ HD 5870 GPU, ATI Stream SDK v2.01, ATI Catalyst™ 10.2


Kernel launch setup

• At least as many threads as needed to optimally fetch input

[Chart not reproduced; axis label: group size]



Launch setup: assorted lore

• At least 1 group per SIMD

• 3-4 wavefronts per SIMD to keep SIMD stages busy (2 ALU, 1 fetch, 1 export)

• For memory bound kernels: >= 7 wavefronts per SIMD for __global latency hiding

(> 8k threads on AMD “Cypress” GPU)

• Per-thread and per-group costs become noticeable at high thread counts

(i.e. 1 thread per DWORD 4-vec)

• Good experimental starting point: 64 and/or 128 threads/group, >= 16k threads

(on AMD “Cypress” GPU)

• On CPU: as few threads as possible, e.g. 1x – 2x number of compute units
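The thread counts quoted above follow from simple arithmetic. The sketch below assumes 20 SIMDs and a wavefront size of 64 for the "Cypress" GPU; neither figure appears on the slides, and the helper names are hypothetical.

```c
#include <stddef.h>

/* Assumed hardware figures for AMD "Cypress" (not stated on the slides). */
#define CYPRESS_SIMDS  20
#define WAVEFRONT_SIZE 64

/* Threads needed to keep the given number of wavefronts resident per SIMD. */
size_t min_threads(size_t wavefronts_per_simd)
{
    return wavefronts_per_simd * WAVEFRONT_SIZE * CYPRESS_SIMDS;
}

/* Number of work-groups for a given thread count, rounded up. */
size_t groups_needed(size_t n_threads, size_t group_size)
{
    return (n_threads + group_size - 1) / group_size;
}
```

Under these assumptions, 7 wavefronts per SIMD works out to 7 * 64 * 20 = 8960 threads, which matches the "> 8k threads" note, and 16k threads at 64 threads/group gives 256 groups.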


Launch setup, histogram

• Larger group size: better __local sharing between threads

• Smaller group size: __local reduction gets more expensive

• Experimental peak at 256 threads/group, 64k threads



Launch setup, histogram, cont’d

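#include <CL/cl.h>

#define NBINS 256

int main( void )
{
    size_t nThreads           = 64 * 1024;
    size_t nThreadsPerGroup   = 256;
    size_t nGroups            = nThreads / nThreadsPerGroup;        // 256 groups
    size_t n4Vectors          = 4096 * 4096;
    size_t n4VectorsPerThread = n4Vectors / nThreads;               // 256 4-vectors per thread
    size_t inputNBytes        = n4Vectors * sizeof(cl_uint4);       // 256 MB
    size_t outputNBytes       = nGroups * NBINS * sizeof(cl_uint);  // 256 KB
    …
}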

(static setup for benchmarking purpose only; a real app will take into account the image size and GPU type (wavefront size, # of compute units))
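The derived sizes in this setup can be checked with a small helper. This is a sketch only: the struct and function names are hypothetical, and sizeof(cl_uint4) is replaced by 4 * sizeof(uint32_t) so that the code compiles without the OpenCL headers.

```c
#include <stddef.h>
#include <stdint.h>

#define NBINS 256

typedef struct {
    size_t n_groups;
    size_t n4vectors_per_thread;
    size_t input_nbytes;
    size_t output_nbytes;
} launch_config;

/* Derive group count, per-thread work and buffer sizes from the
   three free parameters of the benchmark setup. */
launch_config make_launch_config(size_t n_threads,
                                 size_t threads_per_group,
                                 size_t n4vectors)
{
    launch_config c;
    c.n_groups             = n_threads / threads_per_group;
    c.n4vectors_per_thread = n4vectors / n_threads;
    c.input_nbytes         = n4vectors * 4 * sizeof(uint32_t); /* 4-vector of uint */
    c.output_nbytes        = c.n_groups * NBINS * sizeof(uint32_t);
    return c;
}
```

With the values from the slide (64k threads, 256 threads/group, 4096 * 4096 4-vectors) this gives 256 groups, 256 4-vectors per thread, a 256 MB input buffer and a 256 KB output buffer.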


Kernel

__kernel void histogramKernel( __global uint4 *Image,
                               __global uint  *Histogram,
                               uint n4VectorsPerThread )
{
    __local uint subhists[NBANKS * NBINS];
    …

• input buffer processed as 4-vectors

• output buffer holds sub- and final histograms

(256 bins * 256 groups * cl_uint = 256KB)

• __local buffer holds work-group sub-histograms

(256 bins * 16 banks * cl_uint = 16KB per SIMD)


Kernel, parallel LDS clear

__local uint2 *p = (__local uint2 *) subhists;

if( ltid < lmem_max_threads )
{
    for( <coalesced access pattern> )
        p[idx] = 0;
}

barrier( CLK_LOCAL_MEM_FENCE );

• Significant difference compared to single thread clear (4.5x)

• Slightly faster as uint2 vs. uint (2x more LDS requests per instruction)
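The slide leaves the loop's access pattern unspecified. One common choice, shown here purely as an assumption, is a strided loop in which work-item ltid clears words ltid, ltid + N, ltid + 2N, and so on, where N is the number of work-items taking part. The C model below shows the work done by a single work-item; the function name and parameters are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Work done by one work-item in a strided clear of a shared buffer.
   Assumed pattern (not from the slides): work-item ltid clears every
   n_clear_threads-th word, starting at ltid. Work-items with
   ltid >= n_clear_threads do nothing. */
void clear_strided(uint32_t *lds, size_t n_words,
                   size_t ltid, size_t n_clear_threads)
{
    if (ltid >= n_clear_threads)
        return;
    for (size_t idx = ltid; idx < n_words; idx += n_clear_threads)
        lds[idx] = 0;
}
```

In this model, neighbouring work-items touch neighbouring words in each iteration, and together the participating work-items cover the whole buffer exactly once.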



Kernel, coalesced access

uint tid    = get_global_id(0);
uint Stride = get_global_size(0);
uint4 temp;

for( i=0, idx = tid; i < n4VectorsPerThread; i++, idx += Stride )
{
    temp = Image[idx];

• Each thread starts at its global thread ID

• Stride is the number of threads

• Resulting pattern over all threads is optimally coalesced …

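[Diagram: in each loop iteration (Loop 0, Loop 1, Loop 2), the threads together read one contiguous span of get_global_size(0) elements.]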


Kernel, serial access

uint tid = get_global_id(0);
uint4 temp;

for( i=0, idx = tid*n4VectorsPerThread; i<n4VectorsPerThread; i++, idx++ )
{
    temp = Image[idx];

• Each thread reads a block with stride 1

• Resulting pattern is bad for uncached __global

• OK on CPU and on cached GPU memory

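[Diagram: each thread reads its own contiguous block of n4VectorsPerThread elements, one element per loop iteration (Loop 0, Loop 1, Loop 2).]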
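The difference between the coalesced and serial loops comes down to which element a given thread reads in a given iteration. The two helpers below restate the index arithmetic of those loops in plain C; the function names are hypothetical.

```c
#include <stddef.h>

/* Coalesced pattern: start at the thread ID, advance by the thread count. */
size_t coalesced_index(size_t tid, size_t i, size_t n_threads)
{
    return tid + i * n_threads;
}

/* Serial pattern: each thread walks its own contiguous block. */
size_t serial_index(size_t tid, size_t i, size_t per_thread)
{
    return tid * per_thread + i;
}
```

In the same iteration, neighbouring threads are 1 element apart in the coalesced pattern and per_thread elements apart in the serial pattern. With n4VectorsPerThread = 256 and 16-byte 4-vectors, that is a gap of 4 KB between neighbouring threads.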


Coalesced vs. serial access

[Chart not reproduced; group size 64]



Kernel: 4-vector pixel mask & shift

1. fetch: XYZWXYZWXYZWXYZW

2. mask:  ___W___W___W___W

3. shift: _XYZ_XYZ_XYZ_XYZ

4. mask:  ___Z___Z___Z___Z

5. shift: __XY__XY__XY__XY

6. mask:  ___Y___Y___Y___Y

7. …

Performs better than generic uchar4/uchar16

#define NBANKS 16

uint offset = (uint) ltid % (uint) (NBANKS);

for( i=0, idx=tid; i<n4VectorsPerThread; i++, idx += Stride )
{
    temp  = Image[idx];
    temp2 = (temp & msk) * (uint4) NBANKS + offset;
    …
    temp  = temp >> shft;
    temp2 = (temp & msk) * (uint4) NBANKS + offset;
    …
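The mask-and-shift sequence can be tried on a single 32-bit component in plain C. The slides do not give the values of msk and shft; the sketch assumes msk = 0xFF and shft = 8, which fits the four-values-per-word diagram above. The function name is hypothetical.

```c
#include <stdint.h>

#define NBANKS 16

/* Turn the four 8-bit values packed in one 32-bit word into
   sub-histogram indices, lowest byte first.
   Assumed constants (not given on the slides): msk = 0xFF, shft = 8. */
void word_to_indices(uint32_t word, uint32_t offset, uint32_t out[4])
{
    const uint32_t msk  = 0xFFu;
    const uint32_t shft = 8u;

    for (int k = 0; k < 4; k++) {
        out[k] = (word & msk) * NBANKS + offset;  /* mask, then index */
        word >>= shft;                            /* shift next value down */
    }
}
```

In the kernel the same operations run on all four components of a uint4 at once, so each mask or shift instruction handles four packed words.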


Kernel, atomic scatter


for( i=0, idx=tid; i<n4VectorsPerThread; i++, idx += Stride )
{
    temp  = Image[idx];
    temp2 = (temp & msk) * (uint4) NBANKS + offset;

    (void) atom_inc( subhists + temp2.x );
    (void) atom_inc( subhists + temp2.y );
    (void) atom_inc( subhists + temp2.z );
    (void) atom_inc( subhists + temp2.w );
    …


Kernel, LDS banks


for( i=0, idx=tid; i<n4VectorsPerThread; i++, idx += Stride )
{
    temp  = Image[idx];
    temp2 = (temp & msk) * (uint4) NBANKS + offset;
    …

NBANKS = 1

LDS addr 0:     0 1 2 3 4 5 6 7 8 9 A B C D E F

NBANKS = 16

LDS addr 0:     0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
LDS addr 0x10:  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
LDS addr 0x20:  2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
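To see what the per-bin copies buy, the sketch below counts how many distinct sub-histogram words are touched when several consecutive work-items all increment the same bin. It uses the indexing from the kernel (bin * NBANKS + ltid % NBANKS); the function name is hypothetical, and the counting is an illustration rather than a model of the hardware's bank arbitration.

```c
#include <stddef.h>
#include <stdint.h>

/* Count the distinct sub-histogram words hit when work-items
   0 .. n_threads-1 all increment the same bin. nbanks must be >= 1. */
size_t distinct_words_hit(uint32_t bin, size_t n_threads, uint32_t nbanks)
{
    size_t count = 0;
    for (size_t ltid = 0; ltid < n_threads; ltid++) {
        uint32_t addr = bin * nbanks + (uint32_t)(ltid % nbanks);
        int seen = 0;
        /* has an earlier work-item already hit this word? */
        for (size_t prev = 0; prev < ltid; prev++) {
            if (bin * nbanks + (uint32_t)(prev % nbanks) == addr) {
                seen = 1;
                break;
            }
        }
        if (!seen)
            count++;
    }
    return count;
}
```

With NBANKS = 1, sixteen work-items hitting one bin all land on the same word, so their atomic increments contend. With NBANKS = 16 they land on sixteen different words.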


LDS banking performance

Effective LDS rate: > 900 GB/sec



Kernel, LDS reduction

barrier( CLK_LOCAL_MEM_FENCE );

if( ltid < NBINS )
{
    uint bin = 0;

    for( i=0; i<NBANKS; i++ )
        bin += subhists[ (ltid * NBANKS) + i ];

    Histogram[ (get_group_id(0) * NBINS) + ltid ] = bin;
}

LDS addr 0:     0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
LDS addr 0x10:  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

__global:       0 1 2 3 4 5 6 7 8 9


Kernel 2, __global reduction

__kernel void reduceKernel( __global uint *Histogram, uint nSubHists )
{
    uint tid = get_global_id(0);
    uint bin = 0;

    for( int i=0; i < nSubHists; i++ )
        bin += Histogram[ (i * NBINS) + tid ];

    Histogram[ tid ] = bin;
}

[Diagram: the per-group histograms in __global (bins 0 1 2 3 4 5 6 7 8 9 each) are summed bin by bin into a single __global histogram.]
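The second kernel writes its result over the first sub-histogram in the same buffer it reads from. A CPU model makes it easy to see why this is safe: each work-item reads and writes only its own bin column. The sketch below assumes the kernel is launched with NBINS work-items, which the slides do not state; the function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define NBINS 256

/* CPU model of the second kernel: the outer loop stands in for the
   NBINS work-items (assumed launch size). Each iteration sums one bin
   across all sub-histograms and stores the total in the first one. */
void reduce_in_place(uint32_t *hist, size_t n_sub_hists)
{
    for (size_t tid = 0; tid < NBINS; tid++) {
        uint32_t bin = 0;
        for (size_t i = 0; i < n_sub_hists; i++)
            bin += hist[i * NBINS + tid];
        hist[tid] = bin;   /* only column tid is read or written here */
    }
}
```

After the call, the first NBINS entries hold the final histogram and the remaining sub-histograms are left unchanged.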


Single component vs. 4-vector

• 4-vectors work best for many cases.

• Some corner cases can be faster using single component access …

• For absolute peak performance, it’s worth trying both.


Single component vs. 4-vector, histogram

for( i=0, idx=tid; i<n4VectorsPerThread; i++, idx += Stride )
{
    temp  = Image[idx];
    temp2 = (temp & msk) * (uint4) NBANKS + offset;

    (void) atom_inc( subhists + temp2.x );
    (void) atom_inc( subhists + temp2.y );
    (void) atom_inc( subhists + temp2.z );
    (void) atom_inc( subhists + temp2.w );

    temp = temp >> shft;


Single component vs. 4-vector, histogram, cont’d

for( i=0, idx=tid; i<n4VectorsPerThread; i++, idx += Stride )
{
    temp.x = Image[idx].x;
    temp.y = Image[idx].y;
    temp.z = Image[idx].z;
    temp.w = Image[idx].w;

    temp2.x = (temp.x & msk) * (uint) NBANKS + offset;
    temp2.y = (temp.y & msk) * (uint) NBANKS + offset;
    temp2.z = (temp.z & msk) * (uint) NBANKS + offset;
    temp2.w = (temp.w & msk) * (uint) NBANKS + offset;

    (void) atom_inc( subhists + temp2.x );
    (void) atom_inc( subhists + temp2.y );
    (void) atom_inc( subhists + temp2.z );
    (void) atom_inc( subhists + temp2.w );

    temp.x = temp.x >> shft;
    temp.y = temp.y >> shft;
    temp.z = temp.z >> shft;
    temp.w = temp.w >> shft;

10% faster!



Disclaimer & Attribution

DISCLAIMER

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION

© 2010 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, ATI, the ATI Logo, AMD Phenom, Catalyst, Radeon, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Microsoft, Windows, and Windows Vista are trademarks of Microsoft Corporation in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners. OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.
