Transcript of OPENCL PROGRAMMING AND OPTIMIZATION – PART I (AMD slide deck, October 15, 2013)

Page 1

OPENCL PROGRAMMING AND OPTIMIZATION – PART I

HAIBO XIE, PH.D. – [email protected]

Page 2

GENERAL THINKING OF OPENCL COMPUTING ON GPUS

OpenCL kernel performance is the key to overall application performance

‒ OpenCL on the GPU is generally an offload model: find the hotspots of the application and port them to the GPU

‒ Kernel performance depends heavily on a good understanding of the GPU architecture

The following sections give the details of that GPU architecture knowledge:

‒ How it impacts OpenCL kernel performance

‒ How to optimize an OpenCL kernel for a given GPU architecture

Page 3

AGENDA

GPU architecture

GPU threads and scheduling

GPU memory hierarchy

GPU instruction throughput

GPU optimization – general techniques

Page 4

PRIOR TO GCN

Each thread processing unit consists of:

‒ 4 standard ALU cores, plus

‒ 1 special function ALU core (VLIW5)

‒ Branch unit

‒ Registers

Flexible and optimized for graphics workloads

‒ Ideal for 4-element vector and 4x4 matrix operations; vector/vector math in a single instruction

‒ Plus one transcendental-unit function per instruction

The GPU compiler tries to find parallelism to utilize all cores

Programmers can write code that fits well with the 4 standard ALU cores (say, using 128-bit data types such as float4)
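As a sketch of the kind of code that maps well onto a VLIW4 lane (a hypothetical kernel; the names are illustrative):

    // One float4 multiply-add per work-item packs work for all four
    // standard ALU cores of a VLIW lane into a single instruction.
    __kernel void saxpy4(__global const float4 *x,
                         __global float4 *y,
                         const float a)
    {
        size_t gid = get_global_id(0);
        y[gid] = a * x[gid] + y[gid];
    }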

VLIW5 OR VLIW4

Page 5

GPU EVOLUTION – VLIW vs. GCN

VLIW4 SIMD (LANE 0-15):

‒ 64 single-precision multiply-adds
‒ 1 VLIW instruction × 4 ALU ops – dependency limited
‒ The compiler manages register port conflicts
‒ Specialized, complex compiler scheduling
‒ Difficult assembly creation, analysis, and debug
‒ Complicated tool chain support
‒ Careful optimization required for peak performance

GCN Quad SIMD (SIMD 0-3, 16 lanes each):

‒ 64 single-precision multiply-adds
‒ 4 SIMDs × 1 ALU op – occupancy limited
‒ No register port conflicts
‒ Standard compiler scheduling & optimizations
‒ Simplified assembly creation, analysis, & debug
‒ Simplified tool chain development and support
‒ Stable and predictable performance

Page 6

GPU EVOLUTION – VLIW vs. GCN

VLIW4 SIMD (LANE 0-15):

‒ Dependency limited
‒ Instruction-level parallelism: need to fill the VLIW with four (or five) independent ops that can be run in parallel from the same program, each cycle!

GCN Quad SIMD (SIMD 0-3, 16 lanes each):

‒ Occupancy limited
‒ Data-level parallelism: need to be able to run the same single instruction on 64 items of data
‒ Thread-level parallelism: 4x as many wavefronts to occupy all SIMDs

Pages 7-11

AMD GPU ARCHITECTURE

Graphics Core Next Architecture

‒ Up to 32 Compute Units
‒ Dual Geometry Engines
‒ 8 Render Back-ends: 32 color ROPs per clock, 128 Z/stencil ROPs per clock
‒ Up to 768 KB read/write L2 cache
‒ Fast 384-bit GDDR5 memory interface: up to 264 GB/sec
‒ PCI Express 3.0 x16 bus interface
‒ 4.3 billion 28nm transistors
‒ 3.79 peak single-precision TFLOPS

AMD RADEON™ HD 7900 SERIES – CODENAME “TAHITI”

Page 12

AMD GPU ARCHITECTURE

High-performance double-precision floating-point processing
‒ Up to 947 DP GFLOPS

‒ Higher utilization = more usable FLOPS

‒ IEEE compliant

More efficient flow control & branching

Full ECC protection for DRAM & SRAM

First GPU to fully support OpenCL 1.2, Direct3D 11.1 + Compute, and C++ AMP

New compute instructions

AMD RADEON™ HD 7900 SERIES – CODENAME “TAHITI”

Page 13

AMD GPU ARCHITECTURE

Dual Asynchronous Compute Engines (ACEs)

‒ Operate in parallel with the graphics command processor

‒ Independent scheduling and work-item dispatch for efficient multi-tasking

‒ Fast context switching

‒ Exposed in OpenCL™

Dual DMA engines

‒ Can saturate PCIe 3.0 x16 bus bandwidth (16 GB/sec bidirectional)

AMD RADEON™ HD 7900 SERIES – CODENAME “TAHITI”

Page 14

GCN ARCHITECTURE – ACE INTIMATE DETAILS

ACEs are responsible for compute shader scheduling & resource allocation

‒ Each ACE fetches commands from cache or memory & forms task queues

‒ Tasks have a priority level for scheduling, from background to realtime

‒ ACEs dispatch tasks to the shader arrays as resources permit

‒ Tasks complete out-of-order; the ACE tracks completion for correctness

Every cycle, an ACE can create a workgroup and dispatch one wavefront from the workgroup to the CUs

Page 15

GCN ARCHITECTURE – ACE INTIMATE DETAILS

ACEs are independent

‒ But they can synchronize and communicate via cache/memory/GDS

Can form task graphs

‒ Individual tasks can have dependencies on one another

‒ Can depend on another ACE

‒ Can depend on part of the graphics pipeline

Can control task switching

‒ Stop and start tasks and dispatch work to the shader engines

Page 16

GCN ARCHITECTURE – ENABLING COMPUTE WORKLOADS

The focus in GPU hardware is shifting away from graphics-specific units, towards general-purpose compute units

7900 Series GCN-based ASICs already have a “3:1” ratio of ACE : Graphics CP

‒ The Graphics CP can dispatch compute

‒ ACEs cannot dispatch graphics

If you aren’t writing compute shaders, you’re probably not getting the absolute most out of modern GPUs

‒ Control of LDS, barriers, thread layout, etc.

Future trends:

‒ More Compute Units
‒ ALU outpaces bandwidth
‒ CPU + GPU flat memory
‒ APU + dGPU

Page 17

AMD GPU CU EVOLUTION – UTILIZATION AND EFFICIENCY

Higher utilization = higher performance per sq. mm

[Bar chart: relative performance (0x-5x) of the AMD Radeon HD 6970 vs. the AMD Radeon HD 7970 on Mandelbrot DP, AES256, SHA256, LuxMark, and SmallptGPU; the utilization improvement exceeds the 1.4x GFLOPS increase]

Page 18

AMD CU ARCHITECTURE – GRAPHICS CORE NEXT ARCHITECTURE

[Block diagram of a GCN compute unit. Input data: PC/state/vector register/scalar register. Four SIMDs (SIMD0-SIMD3), each with its own PC & instruction buffer, a multi-precision (MP) vector ALU, and 64 KB of registers; instruction fetch and issue arbitration; a scalar unit with scalar decode, an integer ALU, and 8 KB of registers; vector decode, vector memory decode, LDS decode, and export/GDS decode; a branch & message unit; 64 KB LDS memory; a 16 KB R/W data L1; a 32 KB instruction L1 and a 16 KB scalar read-only L1, each shared by 4 CUs and backed by the R/W L2; message bus and export bus]

Page 19

AMD CU ARCHITECTURE – GRAPHICS CORE NEXT ARCHITECTURE

Page 20

GCN COMPUTE UNIT

Basic GPU building block

‒ New instruction set architecture

‒ Non-VLIW

‒ Vector unit + scalar co-processor

‒ Distributed programmable scheduler

‒ Each compute unit can execute instructions from multiple kernels at once

‒ Increased instructions per clock per mm²

Designed for high utilization, high throughput, and multi-tasking

[CU diagram: Scheduler; Branch & Message Unit; Scalar Unit; Scalar Registers (8 KB); Vector Units (4x SIMD-16); Vector Registers (4x 64 KB); Local Data Share (64 KB); L1 Cache (16 KB); Texture Filter Units (4); Texture Fetch Load/Store Units (16)]

Page 21

GCN COMPUTE UNIT – SPECIFICS

1 fully programmable Scalar ALU – shared by all threads of a “wavefront”

‒ Used for flow control, pointer arithmetic, etc.

‒ Has its own GPRs, scalar data cache, etc.

1 Branch & Message Unit

‒ Executes branch instructions (as dispatched by the Scalar Unit)

4 [16-lane] Vector ALUs (SIMDs)

‒ CU total throughput: 64 SP ops/clock

‒ 1 SP (single-precision) op per 4 clocks

‒ 1 DP (double-precision) ADD in 8 clocks

‒ 1 DP MUL/FMA/transcendental per 16 clocks


Page 22

GCN COMPUTE UNIT – SPECIFICS

64 KB Local Data Share (LDS)

‒ 2x larger than the D3D11 TGSM limit (32 KB/thread group)

‒ 32 banks, with conflict resolution

‒ Bandwidth amplification

‒ Separate instruction decode

16 KB read/write L1 vector data cache

Texture units (utilize the L1)

‒ 4 filter, 16 load/store

Scheduler (2560 threads)

‒ Separate decode/issue for VALU, SALU/SMEM, VMEM, LDS, GDS/Export

‒ Plus special instructions (NOPs, barriers, etc.) and branch instructions


Page 23

GCN COMPUTE UNIT – SIMD SPECIFICS

Each SIMD unit is assigned its own 40-bit program counter and instruction buffer for 10 wavefronts

‒ The whole CU can have 40 wavefronts in flight

‒ Each potentially from a different work-group or kernel

Each SIMD is a 16-lane ALU

‒ IEEE-754 SP and DP

‒ Full-speed denormals + all rounding modes

‒ 32-bit FMA and 24-bit INT at full speed

‒ DP and 32-bit INT at reduced rate (1/2 to 1/16)

‒ 64 KB vector register file

‒ Issues 1 SP instruction per lane per clock

‒ Retires 64 lanes (1 wavefront) of SP ALU work in 4 clocks

A GCN GPU with 32 CUs, such as the AMD Radeon™ HD 7970, can be working on up to 32 × 2560 = 81,920 work-items at a time!


Page 24

GCN COMPUTE UNIT – SCHEDULER SPECIFICS

On GCN, each CU has its own dedicated Scheduler unit

‒ Supports up to 2560 threads per CU

‒ Schedules this work between the 4 SIMDs in groups called “wavefronts”

‒ Each wavefront is a grouping of 64 “threads” which live together on a single SIMD

One wavefront is executed on each SIMD every four cycles

‒ Total CU throughput: 4 wavefronts / 4 cycles

‒ That’s 256 threads executed every 4 cycles!

‒ Separate protected virtual address spaces

‒ Programmed in a purely scalar way


Page 25

GCN COMPUTE UNIT – SCHEDULER SPECIFICS CONT.

Work should be grouped to support collaborative tasks

‒ All threads within a workgroup are guaranteed to be scheduled at the same time

‒ A set of synchronization primitives and shared memory (LDS) allows data to be passed between threads in a workgroup

‒ 16 Work Group Barriers supported per CU

‒ Global and Shared memory atomics

Optimized for throughput – latency is hidden by overlapping execution of wavefronts

Scheduler Limits:

‒ 40 wavefronts (theoretical max) per CU or 10 wavefronts per SIMD

‒ These ideal limits may not be attained in practice

‒ Limited by number of available GPRs

‒ Limited by size of available LDS


Page 26

GCN SCHEDULER ARBITRATION AND DECODE

A CU is guaranteed to issue instructions for a wavefront sequentially

‒ Predication & control flow enable any single work-item to take a unique execution path

For a given CU, every clock cycle, waves on one SIMD are considered for instruction issue

‒ Round robin scheduling algorithm

At most, one instruction from each category may be issued

At most, one instruction per wave may be issued

Up to a maximum of 5 instructions can issue per cycle, not including “internal” instructions

‒ 1 Vector Arithmetic Logic Unit (ALU)

‒ 1 Scalar ALU or Scalar Memory Read

‒ 1 Vector memory access (Read/Write/Atomic)

‒ 1 Branch/Message - s_branch and s_cbranch_<cond>

‒ 1 Local Data Share (LDS)

‒ 1 Export or Global Data Share (GDS)

‒ 1 Special/Internal (s_nop, s_sleep, s_waitcnt, s_barrier, s_setprio) – [no functional unit]

Page 27

GCN BRANCH AND MESSAGE UNIT

Independent scalar assist unit to handle special classes of instructions concurrently

‒ Branch

‒ Unconditional Branch (s_branch)

‒ Conditional Branch (s_cbranch_<cond> )

‒ Condition: SCC==0, SCC==1, EXEC==0, EXEC!=0, VCC==0, VCC!=0

‒ 16-bit signed immediate dword offset from the PC provided

‒ Messages

‒ s_sendmsg: CPU interrupt with optional halt (with shader-supplied code and source)

‒ Debug messages (perf trace data, halt, etc.)

‒ Special graphics synchronization messages

Page 28

GCN VECTOR ALU CHARACTERISTICS

FMA (Fused Multiply Add)

‒ IEEE 754-2008 precise with all rounding modes, proper handling of NaN/Inf/zero, and full denormal support in hardware for SP and DP

MULADD

‒ A single-cycle-issue instruction without truncation, enabling a MUL_ieee followed by an ADD_ieee to be combined, with rounding and normalization after both the multiplication and the subsequent addition

VCMP

‒ A full set of operations designed to fully implement all the IEEE 754-2008 comparison predicates

IEEE Rounding Modes

‒ Round to nearest even, round toward +infinity, round toward −infinity, and round toward zero, supported under program control anywhere in the shader. Double- and single-precision modes are controlled separately.

De-normal Programmable Mode

‒ Controlled for SP and DP independently. Separate controls for input flush-to-zero and underflow flush-to-zero.

Page 29

GCN VECTOR ALU CHARACTERISTICS (CONT.)

DIVIDE ASSIST OPS

‒ IEEE 0.5 ULP division accomplished with a macro (SP/DP: ~15/41 instruction slots, respectively)

FP Conversion Ops

‒ Between 16-bit, 32-bit, and 64-bit floats with full IEEE 754 precision and rounding

Exception support in hardware for floating-point numbers, with a software recording and reporting mechanism: inexact, underflow, overflow, division by zero, denormal, invalid operation, and integer divide-by-zero

64-bit Transcendental Approximation

‒ Hardware based double precision approximation for reciprocal, reciprocal square root and square root

24-bit INT MUL/MULADD/LOGICAL/SPECIAL @ full SP rates

‒ Heavily used for integer thread-group address calculation

‒ 32-bit integer MUL/MULADD @ the DP FP MUL/FMA rate

Page 30

CACHE HIERARCHY

[Diagram: per-CU L1 read/write caches backed by L2 read/write cache partitions, each partition paired with a 64-bit dual-channel memory controller]

‒ 64 bytes per clock of L1 bandwidth per CU
‒ 64 bytes per clock of L2 bandwidth per partition
‒ Each CU has its own registers and local data share
‒ 32 KB instruction cache (I$) + 16 KB scalar data cache (K$), shared per 4 CUs with L2 backing
‒ Global data share (GDS, 64 KB) facilitates synchronization between CUs

Page 31

GCN LOCAL DATA SHARE (LDS)

64 KB, 32-bank (or 16-bank) shared memory, fully decoupled from ALU instructions

Direct mode

‒ Vector instruction operand: a 32/16/8-bit broadcast value

‒ Graphics interpolation at rate, no bank conflicts

Index mode – load/store/atomic operations

‒ Bandwidth amplification: up to 32 32-bit lanes serviced per clock, peak

‒ Direct decoupled return to VGPRs

‒ Hardware conflict detection with auto scheduling

Software consistency/coherency for thread groups via hardware barrier

Fast & low power vector load return from R/W L1

Page 32

GCN LOCAL DATA SHARE (LDS)

An LDS bank is 512 entries, each 32-bits wide

‒ A bank can read and write a 32-bit value across an all-to-all crossbar and swizzle unit that includes 32 atomic integer units

‒ This means that several threads can read the same LDS location at the same time for FREE

‒ Writing to the same address from multiple threads also occurs at full rate; the last thread to write wins

Typically, the LDS will coalesce 32 lanes from one SIMD each cycle

‒ One wavefront is serviced completely every 2 cycles

‒ Conflicts automatically detected across 32 lanes from a wavefront and resolved in hardware

‒ An instruction which accesses different elements in the same bank takes additional cycles

Page 33

GCN R/W CACHE

Reads and writes cached

‒ Bandwidth amplification

‒ Improved behavior on more memory access patterns

‒ Improved write to read reuse performance

GPU Coherent

‒ Acquire/Release semantics control data visibility across the machine

‒ L2 coherent = all CUs can have the same view of data

Global atomics

‒ Performed in L2 cache

Page 34

GCN L1 R/W CACHE ARCHITECTURE

Each CU has its own L1 Cache

‒ 16 KB L1, 64B lines, 4 sets x 64 way

‒ ~64B/CLK per compute unit bandwidth

‒ Write-through – alloc on write (no read) w/dirty byte mask

‒ Write-through at end of wavefront

‒ Decompression on cache read out

Page 35

GCN L2 R/W CACHE ARCHITECTURE

64-128KB L2 per Memory Controller Channel

‒ 64B lines, 16 way set associative

‒ ~64B/CLK per channel for L2/L1 bandwidth

‒ Write-back - alloc on write (no read) w/ dirty byte mask

‒ Acquire/Release semantics control data visibility across CUs

L2 coherent = all CUs can have the same view of data

‒ Remote Atomic Operations

Page 36

GCN LATENCY & BANDWIDTH

Each CU has 64 bytes per cycle of L1 bandwidth

‒ Shared with the GDS

Each L2 partition also delivers 64 bytes of data per cycle

Peak Scalar Data Cache Bandwidth per CU is 16 bytes/cycle

Peak I-Cache Bandwidth per CU is 32 bytes/cycle (Optimally 8 instructions)

LDS Peak Bandwidth is 128 bytes of data per cycle via bandwidth amplification

That’s nearly 4 TB/s of LDS BW, 2 TB/s of L1 BW, and 700 GB/s of L2 BW!

384-bit GDDR5 Main Memory has over 264 GB/sec bandwidth

PCI Express 3.0 x16 bus interface to the system (32 GB/s bidirectional)

Page 37

GCN VIRTUAL MEMORY AND X86

The GCN cache hierarchy was designed to integrate with x86 microprocessors

The GCN virtual memory system can support 4KB pages

‒ Natural mapping granularity for the x86 address space

‒ Paves the way for a shared address space in the future

‒ The IOMMU used for DMA transfers can already translate requests into the x86 address space

GCN caches use 64B lines, which is the same size as x86 processors use

The stage is set for heterogeneous systems to transparently share data between the GPU and CPU through the traditional caching system, without explicit programmer control!

Page 38

IMPORTANT GCN ARCHITECTURE IMPROVEMENTS

Increased flexibility and efficiency, with reduced complexity!

‒ Non-VLIW Architecture improves efficiency while reducing programmer burden

‒ Constants/resources are just address + offset now in the hardware

‒ GPU has virtual memory, forward looking towards x86 CPU + GPU flat memory

Strong forward-looking focus on Compute

‒ Scalar ALU for complex dynamic control flow + branch & message unit

‒ 64 KB LDS per CU, 64 KB GDS, atomics at every stage, coherent cache hierarchy

‒ Multiple Asynchronous Compute Engines (ACE) for multitasking compute

Page 39

MAIN GCN ARCHITECTURE TAKEAWAYS

GCN generally simplifies your life as a programmer

‒ Don’t: fret too much about instruction grouping or vectorization

‒ Do: Think about GPR utilization & LDS usage (impacts max # of wavefronts)

‒ Do: Think about thread/cache locality when you structure your algorithm

‒ Do: Pack shader inputs and outputs – aim to be as IO/bandwidth thin as possible!

Unlimited number of addressable constants/resources

‒ N constants aren’t free anymore – each consumes resources; use sparingly!

Compute is the future – exploit its power for GPGPU work & graphics!

Page 40

AGENDA

GPU architecture

GPU threads and scheduling

GPU memory hierarchy

GPU instruction throughput

GPU optimization – general techniques

Page 41

MAPPING OF OPENCL ONTO GPU DEVICES

[Diagram: the hardware perspective vs. the OpenCL programming perspective]

Page 42

MAPPING OF OPENCL ONTO GPU DEVICES

Each GPU device consists of an array of CUs

‒ For Tahiti, that’s 32 CUs

Each CU consists of an array of SIMDs

‒ In the GCN architecture, there are 4 SIMD engines containing a total of 4 × 16 = 64 ALUs to execute OpenCL instructions

Each SIMD array consists of ALUs

‒ Generally it is 16 lanes wide, with 16 ALUs

‒ The ALUs support integer, float, and other data types; you can treat each ALU as a simple scalar CPU core

Work-item maps to an ALU

Work-group maps to a CU

Kernel maps to a GPU compute device

Page 43

MAPPING OF OPENCL ONTO GPU DEVICES

Work-item

‒ Each instance of a kernel, running on an ALU

Work-group

‒ The GPU schedules the range of work-items onto a group of processing elements (PEs) until all work-items have been processed

NDRange

‒ OpenCL maps the total number of work-items to be launched onto an n-dimensional grid

‒ Developers can specify how to divide these work-items into work-groups
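A minimal sketch of how a work-item identifies itself within this model (hypothetical kernel; names are illustrative):

    // Each work-item records its global id, work-group id, local id,
    // and work-group size for a 1-dimensional NDRange.
    __kernel void ids_demo(__global uint4 *out)
    {
        size_t gid = get_global_id(0);
        out[gid] = (uint4)((uint)gid,
                           (uint)get_group_id(0),
                           (uint)get_local_id(0),
                           (uint)get_local_size(0));
    }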

Page 44

OPENCL PROGRAMMING FROM SCHEDULING PERSPECTIVE

The total number of work-items is N, across 3 dimensions

‒ global_work_size = {Gx, Gy, Gz}

‒ N = Gx * Gy * Gz

Work-items are grouped into work-groups

‒ As a 3-dimensional work-group {Sx, Sy, Sz}

‒ The number of work-items in a work-group is M = Sx * Sy * Sz

‒ Then the number of work-groups is N / M
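A minimal host-side sketch of these quantities (assumes a valid cl_command_queue queue and cl_kernel kernel already exist):

    // N = Gx * Gy * Gz = 1024 * 1024 * 1 = 1,048,576 work-items
    size_t global_work_size[3] = {1024, 1024, 1};
    // M = Sx * Sy * Sz = 16 * 16 * 1 = 256 work-items per work-group
    size_t local_work_size[3]  = {16, 16, 1};
    // Number of work-groups = N / M = 4096
    cl_int err = clEnqueueNDRangeKernel(queue, kernel, 3, NULL,
                                        global_work_size, local_work_size,
                                        0, NULL, NULL);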

Page 45

SIMT EXECUTION MODEL

SIMD execution can be combined with pipelining

‒ All ALUs in a SIMD engine execute the same instruction

‒ Pipelining is used to break each instruction into phases

‒ When the first instruction completes (4 cycles here), the next instruction is ready to execute

[Pipeline diagram: over cycles 1-9, a 16-lane SIMD executes Add for the 64 work-items of a wavefront across 4 cycles, then the dependent Mul follows; SIMD width 16, wavefront = 64 work-items]

Page 46

GCN COMPUTE UNIT – HARDWARE VIEW

[Diagram: the Scalar Unit beside SIMD 0-3, each 16 lanes wide (lanes 0-15)]

A GCN Compute Unit can retire 256 SP Vector ALU ops in 4 clks

Each lane can dispatch 1 SP ALU operation per clock

Each SP ALU operation takes 4 clocks to complete

The scheduler dispatches from a different wavefront each cycle


Page 47

GCN COMPUTE UNIT – PROGRAMMER VIEW

[Diagram: one 64-lane wavefront (lanes 0-63) beside the Scalar Unit, with wavefronts 0-9 in flight]

A GCN Compute Unit can perform 64 SP Vector ALU ops / clock

Each lane can dispatch 1 SP ALU operation per clock

Each SP ALU operation still takes 4 clocks to complete

But you can PRETEND your code runs 1 op on 64 threads at once


Page 48

WAVEFRONT SCHEDULING

Work-items are always scheduled at wavefront granularity

‒ Each cycle, a 16-lane-wide SIMD engine executes a single instruction

‒ Each instruction takes 4 cycles to finish; that’s 64 work-items executing a single instruction together

‒ The wavefront is a hardware concept, while the work-item is an OpenCL software concept

A work-group contains an integer number of wavefronts

‒ For optimum hardware usage, an integer multiple of 64 work-items is recommended

‒ However, the exact number is limited by other factors, like LDS and register usage

Work-groups are always scheduled within a single CU, but not always on a single SIMD

‒ Wavefronts may execute on different SIMD engines within a CU

Work-items are scheduled in linear order (see the sketch below)

‒ On a device with a wavefront size of 64, work-items 0-63 map to wavefront 0, work-items 64-127 map to wavefront 1, etc., along the X dimension first
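A sketch of that linear mapping (hypothetical kernel; assumes a wavefront size of 64):

    // Compute which wavefront a work-item lands in: flatten the local id
    // X-first, then divide by the wavefront size.
    __kernel void wave_id_demo(__global uint *wave_ids)
    {
        size_t flat = get_local_id(1) * get_local_size(0) + get_local_id(0);
        size_t gid  = get_global_id(1) * get_global_size(0) + get_global_id(0);
        wave_ids[gid] = (uint)(flat / 64);  // items 0-63 -> wave 0, 64-127 -> wave 1, ...
    }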

Page 49

WAVEFRONT SCHEDULING

A wavefront executes in lock step

‒ There is an implicit barrier among all work-items within a single wavefront

Wavefronts are scheduled onto different SIMD engines (PEs) within a CU

‒ An explicit barrier is needed to sync within a work-group (see the sketch below)

The same work-group is scheduled onto the same CU

‒ A fast barrier implementation is possible within a work-group

Different work-groups are scheduled onto different CUs

‒ There is no mechanism for synchronization between work-groups

‒ A global barrier is guaranteed only at kernel granularity (between kernel launches)

Each work-item has a unique ID in the threading model

‒ Used to map work-items onto the hardware and the data, e.g., which data to fetch

‒ Explained later
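A minimal sketch of work-group synchronization via LDS and a barrier (hypothetical kernel; assumes a work-group size of 64):

    __kernel void group_sum64(__global const float *in, __global float *out)
    {
        __local float buf[64];
        size_t lid = get_local_id(0);
        buf[lid] = in[get_global_id(0)];
        barrier(CLK_LOCAL_MEM_FENCE);   // syncs the whole work-group, not just one wavefront
        if (lid == 0) {
            float s = 0.0f;
            for (int i = 0; i < 64; ++i)
                s += buf[i];
            out[get_group_id(0)] = s;   // one partial sum per work-group
        }
    }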

Page 50

OPENCL PERFORMANCE FROM SCHEDULING PERSPECTIVE

In the case of Read-After-Write (RAW) hazard, one wavefront will stall for four extra cycles

‒ If another wavefront is available it can be scheduled to hide latency

‒ After eight total cycles have elapsed, the ALU result from the first wavefront is ready, so the first wavefront can continue execution

Two wavefronts (128 threads) completely hide a RAW latency

‒ The first wavefront executes for four cycles

‒ Another wavefront is scheduled for the next four cycles

‒ The first wavefront can then run again

Note that two wavefronts are needed just to hide RAW latency; the latency to global memory is much greater

‒ During this time, the compute unit can process other independent wavefronts, if they are available

‒ Memory access latency is hidden by overlapping the execution of wavefronts

‒ Having enough wavefronts in flight is the key factor in keeping the GPU busy

GPU IS OPTIMIZED FOR THROUGHPUT

Page 51

OPENCL PERFORMANCE FROM SCHEDULING PERSPECTIVE

Work-item creation

‒ The GPU spawns the required number of wavefronts on a single compute unit. If there are non-active work-items within a wavefront, the stream cores that would have been mapped to those work-items are idle

Wavefront divergence and branch granularity

‒ Flow control, such as branching, is achieved by combining all necessary paths as a wavefront; if work-items within a wavefront diverge, all paths are executed serially

‒ The total time to execute the branch is the sum of each path’s time

‒ The number of work-items that must be executed during a branch is called the branch granularity. On AMD hardware, the branch granularity is the same as the wavefront granularity.

Threads have their own register state and are free to execute different control paths in a wavefront

‒ If work-items within a wavefront go down divergent control-flow paths, the inactive paths of the work-items are masked by hardware

‒ Instructions with a true predicate are committed

‒ Instructions with a false predicate do not write results or read operands

‒ Branching should be limited to a wavefront granularity to prevent issuing of wasted instructions

DIVERGENT CONTROL FLOW

Page 52

OPENCL PERFORMANCE FROM SCHEDULING PERSPECTIVE

Predication is a method for mitigating the costs associated with conditional branches

‒ Predicate is a condition code that is set to true or false based on a conditional

‒ Beneficial in case of branches to short sections of code

‒ Based on the fact that executing an instruction and squashing its result may be as efficient as executing a conditional

‒ Compilers may replace “switch” or “if then else” statements by using branch predication

PREDICATION AND CONTROL FLOW

Case 1 – conditional with divergence:

    int tid = get_local_id(0);
    if (tid % 2 == 0)        // even work-items
        DoSomeWork();
    else
        DoSomeWork2();

Case 2 – conditional with no divergence:

    int tid = get_local_id(0);
    if (tid / 64 == 0)       // full first wavefront
        DoSomeWork();
    else if (tid / 64 == 1)  // full second wavefront
        DoSomeWork2();

Case 1: all even threads execute the if path while all odd threads execute the else path; both the if and else blocks must be issued for every wavefront.

Case 2: all threads of the first wavefront execute the if case while other wavefronts execute the else case; only one of if or else is issued for each wavefront.

Page 53

SUMMARY FOR GPU SCHEDULING

The GPU schedules work-items at wavefront granularity

‒ The size of a wavefront is 64 work-items

Maximize the number of wavefronts to keep the GPU busy with computing

‒ This hides memory access latency

‒ The total number of wavefronts is determined once the total work-item count is determined

However, the exact work-group size should take the following into account:

‒ The size of the problem

‒ LDS and register usage, explained in the next section

‒ The overhead of communication

‒ The overhead of work-group creation

Some tricks help avoid divergence within the wavefront

The next section moves on to the memory hierarchy

Page 54

AGENDA

GPU architecture

GPU threads and scheduling

GPU memory hierarchy

GPU instruction throughput

GPU optimization – general techniques

Page 55

GPU MEMORY HIERARCHY

[Diagram: threads with GPRs, grouped into workgroups with LDS, running on compute units that each have a R/W L1 cache and a R/O constant cache (both new); a shared L2 cache and GDS; multiple GDDR5 memory channels]

Page 56

GPU MEMORY HIERARCHY

GPRs
‒ Local to a thread
‒ Used for ultra-low-latency variable storage
‒ Up to 64 vector and 64 scalar GPRs available

Page 57

GPU MEMORY HIERARCHY

Local Data Share (LDS)
‒ Local to a workgroup
‒ Used for shared, local, coherent data storage within a workgroup
‒ Up to 32 KB available per workgroup (64 KB total per CU)
‒ All standard atomic operations are supported

Page 58

GPU MEMORY HIERARCHY

R/W L1 Cache
‒ Local to a compute unit
‒ Used for cached global data reads, and for local coherent data read/write within a workgroup
‒ The L1 can be bypassed for coherent global access
‒ All standard atomic operations are supported

Page 59

GPU MEMORY HIERARCHY

R/O Constant Cache
‒ Local to a compute unit
‒ Used to cache instructions, data constants, and texture resource and sampler structures

Page 60

GPU MEMORY HIERARCHY

Global Data Share (GDS)
‒ Provides low-latency coherent access to a global set of on-chip memory
‒ Typically used for inter-workgroup synchronization and data transfer
‒ All standard atomic operations are supported

Page 61

GPU MEMORY HIERARCHY

L2 Cache
‒ Provides a single centralized caching structure for global data
‒ All memory I/O from the shader goes through this cache
‒ All standard atomic operations are supported (where they cannot be done at a higher level)

Page 62

GPU MEMORY HIERARCHY

Main Graphics Memory (GDDR5)

Page 63

MAPPING OF OPENCL ONTO GPU DEVICES – HARDWARE PERFORMANCE PARAMETERS

Page 64

MAPPING OF OPENCL ONTO GPU DEVICES

OpenCL has four memory domains

Global memory

‒ Maps to GDDR5 video main memory

‒ Slowest, with a latency of 300-600 cycles; read/write support

‒ Shared by all work-items in the kernel

‒ No allocation

Local memory

‒ Fast to access, with high bandwidth

‒ 3.8 TB/s vs. 1.9 TB/s for the L1 on the HD 7970

‒ Read/write support; use the __local keyword

‒ Shared by the work-items in a work-group

‒ Static allocation

Private memory

‒ Basically registers; may spill to global memory when registers run out

‒ Owned by a single work-item; static allocation

Constant memory

‒ Maps to scalar unit reads; read-only

‒ Static allocation
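A sketch showing the four domains in one hypothetical kernel (illustrative names):

    __kernel void domains_demo(__global float *g,     // global: GDDR5 main memory
                               __constant float *c,   // constant: scalar-unit reads
                               __local float *l)      // local: LDS, per work-group
    {
        float p = c[0];                 // private: lives in registers (GPRs)
        size_t lid = get_local_id(0);
        l[lid] = g[get_global_id(0)] * p;
        barrier(CLK_LOCAL_MEM_FENCE);   // make LDS writes visible to the group
        g[get_global_id(0)] = l[(lid + 1) % get_local_size(0)];
    }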

Page 65

GLOBAL MEMORY READ ON AMD GPU

CUs access global memory via memory channels

A memory channel conflict occurs when two memory access requests are directed to the same controller

‒ The memory instructions are then serialized

‒ Usually a large power-of-two stride results in a channel/bank conflict

Global memory reads should be carefully designed

For GCN, global memory writes are handled by the write-combining (WC) buffer, so no explicit write coalescing is needed as on other vendors’ GPUs

‒ However, consecutive addresses within work-groups provide maximum performance

Page 66

WHY OPTIMIZE GLOBAL MEMORY ACCESS

Global memory access is key to optimizing OpenCL kernel performance

‒ The large global memory access latency must be hidden

Example:

‒ Say global memory access latency is 400 cycles.

‒ If each wavefront has 5 math instructions, executing them costs 5 × 4 = 20 cycles. To fully hide the 400 cycles of latency, we need 400 / 20 = 20 wavefronts.

‒ If a wavefront contains 10 instructions (10 × 4 = 40 cycles), we need only 400 / 40 = 10 wavefronts to hide the global memory access latency.
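As a rough general model (ignoring other stalls), this works out to:

    wavefronts needed ≈ memory latency (cycles) / (ALU instructions per wavefront × 4 cycles each)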

Page 67

OPENCL PERFORMANCE FROM MEMORY PERSPECTIVE

Different GPUs use different bit widths for the channel and bank within the memory address

‒ All AMD HD 7xxx GPUs have the same layout: the channel bits end at bit 8

‒ That means the channel switches every 256 bytes

The best way to maximize global memory reads:

‒ Same channel within a wavefront, with many wavefronts spanning all memory channels

‒ E.g., the address of work-item 0 aligned to 256 bytes, with each work-item fetching 32 bits

‒ For example, read float4 instead of float (see the sketch below)

An inefficient access pattern is each wavefront accessing all the channels

‒ This is likely to happen if consecutive work-items access data with a large power-of-two stride

GLOBAL MEMORY READ
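A sketch of the coalesced float4 pattern described above (hypothetical kernel):

    // Consecutive work-items read consecutive 16-byte float4 elements,
    // so a wavefront stays within a small set of contiguous addresses.
    __kernel void copy_float4(__global const float4 *src,
                              __global float4 *dst)
    {
        size_t gid = get_global_id(0);
        dst[gid] = src[gid];
    }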

Page 68

OPENCL PERFORMANCE FROM MEMORY PERSPECTIVE

Reading from the same address is a conflict

‒ From a hardware standpoint, reads from a fixed address have the same upper bits, so they collide and are serialized

‒ This does not happen on the read-only memories, such as constant buffers, textures, etc.

‒ It does happen on OpenCL global memory

Best way:

‒ To read a single value, read it in a single work-item, place it in local memory, and then use that location

‒ Avoid:

    temp = input[3];   // every work-item reads the same global address

‒ Use:

    __local float cached;
    if (get_local_id(0) == 0)
        cached = input[3];         // one work-item reads from global memory
    barrier(CLK_LOCAL_MEM_FENCE);
    temp = cached;                 // all work-items read the broadcast LDS copy

GLOBAL MEMORY READ

Page 69

OPENCL PERFORMANCE FROM MEMORY PERSPECTIVE

Benefits of LDS

‒ LDS provides high-bandwidth access (more than 10x higher than global memory), efficient data transfers between work-items in a work-group, and high-performance atomic support

‒ LDS is much faster than L1 cache access: it has twice the peak bandwidth and far lower latency

‒ Using LDS memory can reduce global memory bandwidth usage (locality)

There is 64 KB of LDS in each CU, with 32 KB usable per work-group

‒ LDS has 32 banks

‒ Each bank is 4 bytes wide and 256 bytes deep

‒ The bank address is determined by bits 6:2 of the memory address

LDS is accessed at half-wavefront granularity

‒ Every cycle, LDS can service a request for each bank (up to 32 accesses each cycle)

LOCAL MEMORY
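A sketch of the bank mapping stated above:

    // Bank index for a byte address in LDS (32 banks, each 4 bytes wide):
    uint bank = (addr >> 2) & 0x1F;   // bits 6:2 select the bank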

Page 70

OPENCL PERFORMANCE FROM MEMORY PERSPECTIVE

Bank conflict

‒ Same idea as a global memory channel/bank conflict

‒ A program with a large number of bank conflicts might benefit from constant or image memory rather than LDS

Reading the same LDS address

‒ Unlike a global memory read, reading the same LDS address is broadcast to all requestors without conflict

Conflict-free pattern

‒ On HD 7xxx, each work-item reads sequential float2 data (see the sketch below)

‒ Sequential float4 reads use only half of the banks
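
A minimal sketch (a hypothetical kernel, assuming a work-group size of 64) of the conflict-free float2 pattern:

__kernel void lds_float2(__global const float2 *in, __global float2 *out)
{
    __local float2 tile[64];                 // 64 × 8 bytes = 512 bytes of LDS
    int lid = get_local_id(0);
    tile[lid] = in[get_global_id(0)];        // sequential float2 per work-item: conflict-free
    barrier(CLK_LOCAL_MEM_FENCE);
    out[get_global_id(0)] = tile[lid ^ 1];   // neighbor swap; still one distinct float2 slot per work-item
}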



OPENCL PERFORMANCE FROM MEMORY PERSPECTIVE – LOCAL MEMORY

[Figure: LDS bank-conflict examples – four mappings of threads 0–7 onto memory banks 0–7, annotated with the number of conflicts each access pattern causes]


OPENCL PERFORMANCE FROM MEMORY PERSPECTIVE – CONSTANT MEMORY

Constant memory is a container for data that is

‒ Read-only and shared by a wavefront

Constant memory is useful for

‒ Function parameters

‒ Accelerating data fetches without requiring much memory bandwidth

GCN supports specific inline literal constants

‒ Some constants are “free” and don’t increase code size



OPENCL PERFORMANCE FROM MEMORY PERSPECTIVE – CONSTANT MEMORY

Three patterns of constant memory usage

‒ Simple direct-address patterns

‒ Very high bandwidth can be attained when the compiler knows the constant address at compile time and can embed it into the instruction

‒ Usually used for non-array constants and function parameters

‒ Same index

‒ Just like reading the same address in LDS

‒ Broadcast to all wavefronts

‒ Varying index

‒ Behaves like a global memory read with a cache hit

Globally scoped constant arrays

‒ Use the hardware constant buffer when 64KB or below; the typical case is a lookup table (sketched below)

‒ Or global memory when larger than 64KB
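
As a sketch (the table and kernel are hypothetical), a globally scoped constant lookup table that fits comfortably in the hardware constant buffer:

__constant float gains[4] = { 0.5f, 1.0f, 2.0f, 4.0f };   // well under the 64KB limit

__kernel void apply_gain(__global const float *in, __global float *out)
{
    size_t gid = get_global_id(0);
    out[gid] = in[gid] * gains[gid & 3];   // varying index: behaves like a cached global read
}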



OPENCL PERFORMANCE FROM MEMORY PERSPECTIVE – USE LDS OR L1 CACHE

LDS

‒ Supports R/W and atomics

‒ Typically larger than L1; suited for cases where a high L1 cache hit rate cannot be obtained

‒ The native data size is a 32-bit word. Peak throughput is achieved when each thread operates on a two-element vector of 32-bit words, which works well with coalesced 32-bit requests

‒ LDS can be used to explicitly convert a scattered access pattern into a coalesced pattern for reads and writes to global memory, or to improve performance when work-items need to read the same global memory address

‒ Cannot be shared among different work-groups

‒ The work-group size needs balancing: a larger size means more efficient data sharing but causes higher register usage

‒ Statically allocated, which limits the number of active wavefronts

L1

‒ Suited for cache-read-sensitive algorithms, such as matrix multiplication or convolution

‒ The native data type is a 32-bit word or a four-element vector; it is important that it is initially filled from global memory with a coalesced access pattern

‒ Can be shared among different work-groups (cache hits)



OPENCL PERFORMANCE FROM MEMORY PERSPECTIVE – REGISTER

Registers on GCN are scalar and 32 bits wide

‒ 256 VGPRs per wavefront

Active wavefronts

‒ To compute the number of wavefronts per CU, take (256 / #registers) × 4, since each of the four SIMDs holds 256 / #registers wavefronts (see the worked example below)

‒ Registers are statically allocated

‒ A larger work-group means more work-items working together, which needs more registers and limits the number of active wavefronts
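
For example, a kernel using 32 VGPRs per work-item allows (256 / 32) × 4 = 32 active wavefronts per CU, while 64 VGPRs per work-item allows only (256 / 64) × 4 = 16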

Register spilling

‒ Using too many registers causes register spilling, where data is placed in global memory instead



OPENCL PERFORMANCE FROM RESOURCES PERSPECTIVE

The number of wavefronts is the key to overall performance

Carefully choose the size of work-group

‒ A larger size means more efficient data sharing via LDS and synchronization within the work-group

‒ But it might cause larger LDS and register usage, which limits active wavefronts

Choose work-group size at compile-time

‒ Developers can pass NULL for the work-group size and let the runtime choose

‒ But a specific work-group size is recommended; one way to fix it at compile time is sketched below
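
One standard way to fix the size at compile time is the reqd_work_group_size kernel attribute; a minimal sketch (the 64×1×1 value is illustrative):

__attribute__((reqd_work_group_size(64, 1, 1)))
__kernel void scale(__global float *data, float k)
{
    // the compiler can now allocate registers and LDS for exactly 64 work-items
    data[get_global_id(0)] *= k;
}

The local_work_size passed to clEnqueueNDRangeKernel must then match the declared size.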



OPENCL IMAGE

C/C++ arrays and OpenCL buffers (cl_mem) objects provide 1D locality

Due to graphics workloads, GPUs contain hardware support for:

‒ Caching and reading multidimensional data (textures)

‒ Drawing interpolated texture vertices

Hardware support exposed to programmers via OpenCL images

‒ OpenCL images are memory objects optimized for 2D locality

Adjacent elements not guaranteed to be contiguous in memory

‒ Z curve layout of textures in memory provides 2D locality of data

[Figure: Z-curve layout providing 2D locality vs. C/C++ row-major layout providing 1D locality]


IMAGE ABSTRACTION

To allow for hardware abstraction of the physical memory layout, image elements cannot be accessed directly from within the kernel

In OpenCL kernels, the read_image{type} function call is used instead of simply indexing using ‘[ ]’

Usage of read_image and write_image discussed later
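
Ahead of that discussion, a minimal sketch (a hypothetical kernel) of going through a sampler rather than ‘[ ]’ indexing:

__constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP | CLK_FILTER_NEAREST;

__kernel void copy_pixel(__read_only image2d_t in, __write_only image2d_t out)
{
    int2 pos = (int2)(get_global_id(0), get_global_id(1));
    float4 px = read_imagef(in, smp, pos);   // image elements cannot be indexed directly
    write_imagef(out, pos, px);
}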


BENEFITS OF IMAGES IN OPENCL

Interpolation

‒ Images are accessed using floating point coordinates

‒ Either the closest pixel or a linear interpolation can be returned

‒ Specified in cl_sampler object

‒ CL_FILTER_NEAREST (no interpolation)

‒ CL_FILTER_LINEAR (linear interpolation)

Out-of-bounds handling

‒ The behavior of out-of-bounds accesses is handled in hardware

‒ Flags specified when creating cl_sampler

‒ Examples

‒ CLK_ADDRESS_CLAMP – return 0

‒ CLK_ADDRESS_CLAMP_TO_EDGE – return color of pixel closest to out-of-bounds location


BENEFITS OF IMAGES IN OPENCL

Normalized data types

‒ Reduces the amount of memory used, since these data types store floats as 16- or 8-bit integers in the texture

‒ Use floats in a normalized range: [0.0, 1.0] (unsigned types), [-1.0, 1.0] (signed types)

Channels in OpenCL images refer to the primary colors that make up an image

‒ Each pixel in a texture can contain 1 to 4 channels (R to RGBA)

‒ RGBA: red, green, blue, alpha

‒ The color information is stored as float / integer data

Packing several values (channels) into a pixel can improve memory bandwidth utilization

The number of channels is defined at the creation of the image


AGENDA

GPU architecture

GPU threads and scheduling

GPU memory hierarchy

GPU instruction throughput

GPU optimization – general techniques


INSTRUCTION THROUGHPUT ON AMD GPU

Instruction throughput differs across GPU devices

Natively supported data types

‒ 32/64-bit float, 24/32-bit integer

‒ Packed 8/16-bit types are not natively supported, but 4/2 of them can be packed into a native 32-bit operation, as long as no overflow occurs. An example is char4

On GCN

‒ Double-precision float is supported, usually at ¼ the speed of single-precision float

‒ 32-bit integer has the same latency and throughput as single-precision float

‒ 24-bit integer MULs and MADs have 4 times the throughput of 32-bit integer ones (see the sketch below)
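
A minimal sketch (a hypothetical kernel) using the standard mad24 built-in to hit the faster 24-bit integer path:

__kernel void index_mad(__global int *out, int stride, int offset)
{
    int gid = get_global_id(0);
    // 24-bit multiply-add: 4x the 32-bit integer throughput, valid while operands fit in 24 bits
    out[gid] = mad24(gid, stride, offset);
}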


AMD GPU MEDIA INSTRUCTION EXTENSION

AMD GPUs have built-in instructions with full hardware acceleration

‒ For example, amd_max3(intN src0, intN src1, intN src2) returns the maximum of src0, src1, and src2

Built-in functions include:

‒ amd_pack

‒ amd_sad

‒ amd_sadhi

‒ amd_msad

‒ amd_qsad

‒ amd_median3

‒ amd_min3

‒ amd_max2

‒ …


AGENDA

GPU architecture

GPU threads and scheduling

GPU memory hierarchy

GPU instruction throughput

GPU optimization – general techniques


GENERAL OPTIMIZATION TIPS

This section covers

‒ Thread mapping

‒ Occupancy

‒ Vectorization

‒ Loop unrolling

‒ Buffer or image


THREAD MAPPING

Thread mapping determines which threads will access which data

‒ Proper mappings can align with hardware and provide large performance benefits

‒ Improper mappings can be disastrous to performance

The paper Static Memory Access Pattern Analysis on a Massively Parallel GPU by Jang et al. focuses on the task of effectively mapping threads to the data access patterns of an algorithm


THREAD MAPPING

By using different mappings, the same thread can be assigned to access different data elements

‒ The examples below show three different possible mappings of threads to data (assuming the thread id is used to access an element)

Mapping 1 – row-major thread IDs:

 0  1  2  3
 4  5  6  7
 8  9 10 11
12 13 14 15

int tid = get_global_id(1) * get_global_size(0) + get_global_id(0);

Mapping 2 – column-major thread IDs:

 0  4  8 12
 1  5  9 13
 2  6 10 14
 3  7 11 15

int tid = get_global_id(0) * get_global_size(1) + get_global_id(1);

Mapping 3 – block mapping (assuming 2x2 groups):

 0  1  4  5
 2  3  6  7
 8  9 12 13
10 11 14 15

int group_size = get_local_size(0) * get_local_size(1);
int tid = get_group_id(1) * get_num_groups(0) * group_size +
          get_group_id(0) * group_size +
          get_local_id(1) * get_local_size(0) +
          get_local_id(0);


THREAD MAPPING

Consider a serial matrix multiplication algorithm
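
The serial code did not survive extraction; a minimal C sketch, using M, N, P and the loop iterators i1, i2, i3 referred to below, with row-major storage:

// C (MxN) = A (MxP) * B (PxN)
void matmul(int M, int N, int P,
            const float A[M][P], const float B[P][N], float C[M][N])
{
    for (int i1 = 0; i1 < M; i1++)           // outer loop: rows of C
        for (int i2 = 0; i2 < N; i2++) {     // outer loop: columns of C
            float sum = 0.0f;
            for (int i3 = 0; i3 < P; i3++)   // inner loop: stays in the kernel
                sum += A[i1][i3] * B[i3][i2];
            C[i1][i2] = sum;
        }
}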

This algorithm is suited for output data decomposition

‒ We will create N×M threads

‒ Effectively removing the outer two loops

‒ Each thread will perform P calculations

‒ The inner loop will remain as part of the kernel

Should the index space be MxN or NxM?


THREAD MAPPING

Thread mapping 1: with an MxN index space, the kernel would be:
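
The kernel body did not survive extraction; a plausible reconstruction (assuming row-major storage and the tx/ty naming used on the following slides):

__kernel void mm_mapping1(__global const float *A, __global const float *B,
                          __global float *C, int P, int N)
{
    int tx = get_global_id(0);   // 0..M-1: row of C
    int ty = get_global_id(1);   // 0..N-1: column of C
    float sum = 0.0f;
    for (int i3 = 0; i3 < P; i3++)
        sum += A[tx * P + i3] * B[i3 * N + ty];
    C[tx * N + ty] = sum;        // consecutive tx: stride-N accesses (uncoalesced)
}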

Thread mapping 2: with an NxM index space, the kernel would be:
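
Again a plausible reconstruction under the same assumptions, with the roles of the two dimensions swapped:

__kernel void mm_mapping2(__global const float *A, __global const float *B,
                          __global float *C, int P, int N)
{
    int tx = get_global_id(0);   // 0..N-1: column of C
    int ty = get_global_id(1);   // 0..M-1: row of C
    float sum = 0.0f;
    for (int i3 = 0; i3 < P; i3++)
        sum += A[ty * P + i3] * B[i3 * N + tx];
    C[ty * N + tx] = sum;        // consecutive tx: consecutive accesses (coalesced)
}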

Both mappings produce functionally equivalent versions of the program


THREAD MAPPING

This figure shows the execution of the two thread mappings on NVIDIA GeForce 285 and 8800 GPUs

Notice that mapping 2 is far superior in performance for both GPUs


THREAD MAPPING

The discrepancy in execution times between the mappings is due to data accesses on the global memory bus

‒ Assuming row-major data, data in a row (i.e., elements in adjacent columns) are stored sequentially in memory

‒ To ensure coalesced accesses, consecutive threads in the same wavefront should be mapped to columns (the second dimension) of the matrices

‒ This will give coalesced accesses in Matrices B and C

‒ For Matrix A, the iterator i3 determines the access pattern for row-major data, so thread mapping does not affect it


THREAD MAPPING

In mapping 1, consecutive threads (tx) are mapped to different rows of Matrix C, and non-consecutive threads (ty) are mapped to columns of Matrix B

‒ The mapping causes inefficient memory accesses


THREAD MAPPING

In mapping 2, consecutive threads (tx) are mapped to consecutive elements in Matrices B and C

‒ Accesses to both of these matrices will be coalesced

‒ Degree of coalescence depends on the workgroup and data sizes


THREAD MAPPING

In general, threads can be created and mapped to any data element by manipulating the values returned by the thread identifier functions

The following matrix transpose example will show how thread IDs can be modified to achieve efficient memory accesses


MATRIX TRANSPOSE

A matrix transpose is a straightforward technique

‒ Out(x,y) = In(y,x)

No matter which thread mapping is chosen, one operation (read/write) will produce coalesced accesses while the other (write/read) produces uncoalesced accesses

‒ Note that data must be read to a temporary location (such as a register) before being written to a new location

[Figure: threads 0–3 transposing between In and Out – one mapping gives coalesced reads and uncoalesced writes, the other uncoalesced reads and coalesced writes]


MATRIX TRANSPOSE

If local memory is used to buffer the data between reading and writing, we can rearrange the thread mapping to provide coalesced accesses in both directions

‒ Note that the work group must be square

[Figure: threads 0–3 read In coalesced into a 4×4 local memory tile, then use remapped global/local indices to write Out coalesced]
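
A minimal sketch (hypothetical kernel, 16×16 work-groups assumed; the +1 padding column is a common extra trick, not from the slide) of this remapping:

#define TILE 16

__kernel void transpose_lds(__global const float *in, __global float *out, int width)
{
    __local float tile[TILE][TILE + 1];      // +1 column avoids LDS bank conflicts
    int lx = get_local_id(0), ly = get_local_id(1);
    int gx = get_global_id(0), gy = get_global_id(1);
    tile[ly][lx] = in[gy * width + gx];      // coalesced read of In
    barrier(CLK_LOCAL_MEM_FENCE);
    int ox = get_group_id(1) * TILE + lx;    // swap the work-group coordinates
    int oy = get_group_id(0) * TILE + ly;
    out[oy * width + ox] = tile[lx][ly];     // coalesced write of Out
}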


MATRIX TRANSPOSE

The following figure shows a performance comparison of the two transpose kernels for matrices of size NxM on an AMD 5870 GPU

‒ “Optimized” uses local memory and thread remapping

[Figure: transpose kernel runtime in seconds vs. matrix order (1024–4096) on an AMD 5870 – Naive vs. Optimized]


OCCUPANCY

On current GPUs, work groups get mapped to compute units

‒ When a work group is mapped to a compute unit, it cannot be swapped off until all of its threads complete their execution

If there are enough resources available, multiple work groups can be mapped to the same compute unit at the same time

‒ Wavefronts from another work group can be swapped in to hide latency

Resources are fixed per compute unit (number of registers, local memory size, maximum number of threads)

‒ Any one of these resource constraints may limit the number of work groups on a compute unit

The term occupancy is used to describe how well the resources of the compute unit are being utilized


OCCUPANCY – REGISTERS

The availability of registers is one of the major limiting factors for larger kernels

The maximum number of registers required by a kernel must be available for all threads of a workgroup

‒ Example: Consider a GPU with 16384 registers per compute unit running a kernel that requires 35 registers per thread

‒ Each compute unit can execute at most 468 threads

‒ This affects the choice of workgroup size

‒ A workgroup of 512 is not possible

‒ Only 1 workgroup of 256 threads is allowed at a time, even though 212 more threads could be running

‒ 3 workgroups of 128 threads are allowed, providing 384 threads to be scheduled, etc.


OCCUPANCY – REGISTERS

Consider another example:

‒ A GPU has 16384 registers per compute unit

‒ The work group size of a kernel is fixed at 256 threads

‒ The kernel currently requires 17 registers per thread

Given the information, each work group requires 4352 registers

‒ This allows for 3 active work groups if registers are the only limiting factor

If the code can be restructured to only use 16 registers, then 4 active work groups would be possible
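
As a check: 256 threads × 17 registers = 4352 registers per work group, and ⌊16384 / 4352⌋ = 3; with 16 registers, 256 × 16 = 4096 and 16384 / 4096 = 4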


OCCUPANCY – LOCAL MEMORY

GPUs have a limited amount of local memory on each compute unit

‒ 32KB of local memory on AMD GPUs

‒ 32-48KB of local memory on NVIDIA GPUs

Local memory limits the number of active work groups per compute unit

Depending on the kernel, the data per workgroup may be fixed regardless of number of threads (e.g., histograms), or may vary based on the number of threads (e.g., matrix multiplication, convolution)


OCCUPANCY – THREADS

GPUs have hardware limitations on the maximum number of threads per work group

‒ 256 threads per WG on AMD GPUs

‒ 512 threads per WG on NVIDIA GPUs

NVIDIA GPUs have per-compute-unit limits on the number of active threads and work groups (depending on the GPU model)

‒ 768 or 1024 threads per compute unit

‒ 8 or 16 warps per compute unit

AMD GPUs have GPU-wide limits on the number of wavefronts

‒ 496 wavefronts on the 5870 GPU (~25 wavefronts or ~1600 threads per compute unit)


OCCUPANCY

The minimum of these three factors is what limits the active number of threads (or occupancy) of a compute unit

The interactions between the factors are complex

‒ The limiting factor may have either thread or wavefront granularity

‒ Changing work group size may affect register or shared memory usage

‒ Reducing any factor slightly (such as register usage) may allow another work group to be active


OCCUPANCY CALCULATOR


VECTORIZATION

On pre-GCN AMD GPUs, each processing element executes a 5-way VLIW instruction

‒ 5 scalar operations or

‒ 4 scalar operations + 1 transcendental operation

[Figure: a VLIW compute unit – processing elements PE0…PEn-1, each with registers, four ALUs plus an ALU/T-unit, and a branch unit receiving the incoming instruction]


VECTORIZATION

Vectorization allows a single thread to perform multiple operations at once

Explicit vectorization is achieved by using vector datatypes (such as float4) in the source program

‒ When a number is appended to a datatype, the datatype becomes a vector of that length

‒ Operations can be performed on vector datatypes just like regular datatypes

‒ Each ALU will operate on a different element of the float4 data (a sketch follows)
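
A minimal sketch (a hypothetical kernel) of explicit float4 vectorization:

__kernel void saxpy4(__global const float4 *x, __global float4 *y, float a)
{
    size_t gid = get_global_id(0);
    y[gid] = a * x[gid] + y[gid];   // one vector expression = four scalar multiply-adds
}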


VECTORIZATION

Vectorization improves memory performance on AMD GPUs

‒ The AMD Accelerated Parallel Processing OpenCL Programming Guide compares float to float4 memory bandwidth


OPENCL KERNEL OPTIMIZATION – LOOP UNROLLING

Loop unrolling

‒ Can eliminate some instructions for branch processing

‒ Might expose more independent instructions to keep the SIMD units busy

‒ Might increase the register usage

Especially useful for pre-GCN AMD GPU devices

‒ This can be attributed to better packing of instructions within the loop

A very simple optimization and easy to try

‒ Loop bounds must be known at compile time


No unrolling example:

#pragma unroll 1
for (int i = 0; i < n; i++) { … }

Partial unrolling example:

#pragma unroll 4
for (int i = 0; i < 128; i++) { … }


OPENCL KERNEL OPTIMIZATION – BUFFER OR IMAGE

Image memory objects are totally different from buffer memory objects

‒ Buffers have a linear memory layout

‒ Images have a tiled memory layout

‒ Images have better global memory read/write behavior than buffers in some cases

Image objects have hardware built-in functional units

‒ To perform some common but specific operations

Image objects might add some OpenCL runtime overhead

‒ To convert linear memory to tiled memory



GLOBAL ARRAY

Avoid declaring global arrays on the kernel’s stack frame, as these typically cannot be allocated in registers and require global memory operations

Use predication rather than control-flow

‒ Replace:

if (a > b)
    c += d;
else
    c -= d;

with:

int factor = (a > b) ? 1 : -1;
c += factor * d;


SUMMARY

GPU architecture

GPU thread scheduling

GPU memory hierarchy

GPU instruction throughput

GPU Kernel optimization tips


THANKS!


DISCLAIMER & ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION

© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.