HC-4017, HSA Compilers Technology, by Dibyendu Das

HSA COMPILER TECHNOLOGY

DIBYENDU DAS, PRAKASH RAGHAVENDRA, LEONID LOBACHEV

HSA COMPILER TECHNOLOGY | NOVEMBER 19, 2013 | PUBLIC

OUTLINE

H(eterogeneous) S(ystem) A(rchitecture) SW Stack

Architecture of HSA Compilers

Performance

HSA Compiler Deliverables

OpenCL™ 2.0 features

Conclusions and Future Direction


How do we deliver the HSA value proposition?

Make GPU easily accessible

‒ Support mainstream languages

‒ Expandable to domain specific languages

Make compute offload efficient

‒ Eliminate memory copying

‒ Low-latency dispatch

Make it ubiquitous

‒ Drive standards through the HSA Foundation

‒ Open-source key components

Optimized compiler technology

‒ Leverage the LLVM framework

‒ HSAIL as a new IR for heterogeneous computing

HSA SOFTWARE STACK

[Figure: HSA software stack. Labeled layers: Applications; application and system languages, domain-specific languages, etc. (e.g. OpenCL™, Java™, C++ AMP, Python, R, …); LLVM IR; HSAIL; HSA Runtime (HSA RT); HSA Hardware.]


HSAIL

HSAIL (HSA Intermediate Language) as the SW interface

‒ A virtual ISA for parallel programs

‒ Finalized to a native ISA by a finalizer/JIT

‒ Accommodates rapid innovation in native GPU architectures

‒ HSAIL is expected to be stable and backward compatible across implementations

‒ Enables multiple hardware vendors to support HSA

Key design points and benefits for HSA compilers

‒ Adopt a thin finalizer approach

‒ Enable fast translation time and robustness in the finalizer

‒ Drive performance optimizations through high-level compilers (HLC)

‒ Take advantage of the strength and compilation-time budget of HLCs for aggressive optimizations

[Figure: compilation flow for an OpenCL™ kernel. High-Level Compiler Flow (Developer): OpenCL™ Kernel → EDG or CLANG → SPIR → LLVM → HSAIL. Finalizer Flow (Runtime): HSAIL → Finalizer → Hardware ISA.]

EDG – Edison Design Group

CLANG – LLVM FE

SPIR – Standard Portable Intermediate Representation

Architecture of HSA Compilers


TECHNOLOGY BASE FOR COMPILER COMPONENTS

H(igh) L(evel) C(ompiler) front-end

‒ C++ FE from Edison Design Group (EDG) under a proprietary license

‒ May support CLANG FE in the future

‒ Generates LLVM IR

HLC back-end

‒ LLVM optimizer and code-gen

‒ Generates HSAIL from LLVM IR

Finalizer

‒ Converts HSAIL to GPU ISA

‒ SSA-based optimizer

HSAIL assembler/disassembler (libHSAIL)

‒ Assembling, disassembling, validating HSAIL and BRIG (binary format of HSAIL)

Libraries

‒ Optimized implementation of OpenCL™ builtins


OPENCL™ COMPILER ARCHITECTURE

The OpenCL™ compiler is expected to keep evolving as Khronos publishes new specs.

The HSA OpenCL™ compiler leverages the existing and evolving compiler architecture of LLVM.

Minimize architectural changes.

Shift aggressive optimizations toward the HLC

Thin Finalizer

[Figure: OpenCL™ compiler architecture. OpenCL™ Host Compiler: C/C++ Front End → Compiler Optimizations → x86 code generation → Host Linker → x86 Executable with OpenCL™ API Calls. OpenCL™ Device Compiler: EDG for OpenCL™ Kernels → LLVM Optimizer → LLVM HSAIL code generation → Finalizer → GPU ISA.]


DEVICE COMPILER

Based on LLVM optimizer

Custom HSAIL back-end

Parallel-aware compiler optimizations

SIMT-friendly code generation

GPU specific optimizations

DWARF generation

Direct binary object generation

[Figure: device compiler pipeline, built from LLVM and libHSAIL components. Device code in LLVM-IR → LLVM optimizer → optimized device code in LLVM-IR → LLVM HSAIL code generator → BRIGContainer (device code in binary BRIG form) → BRIGStreamer → ELF with BRIG sections (BRIG binary object).]


libHSAIL – assembler/disassembler/validator for HSA

HIDEL - High Level HSAIL Description Language

Automatically generated code to:

‒ Access BRIG fields in a safe and effective way

‒ Validate BRIG and HSAIL conformance to spec

‒ Encapsulate BRIG version differences

Brigantine API to ease creation of BRIG on the fly

HSAIL<->BRIG assembler and disassembler

HSAIL->BRIG debug information generator

BRIG streaming routines

HSAIL test generation framework

HSAIL instruction level simulation

[Figure: libHSAIL components and clients. Components: Scanner, Parser, Brigantine, BrigContainer, proxy classes, Validator, Disassembler, BrigStreamer, BifStreamer. Inputs and outputs: ASCII HSAIL; BRIG and BIF files. libHSAIL clients: HLC (LLVM) direct binary object generation, HSAILAsm, libBrigDwarf, Test Generation, Finalizer, Device linker, Loader.]


FINALIZER

Fast optimizations for translation efficiency

The HLC is expected to perform heavyweight optimizations

Supports unstructured control flow

Dynamic calling convention

Optimized ISA libraries

Indirect branches

Exception handling

Offline mode available for caching ISA translation

Debugging support: mapping between BRIG and GPU ISA

[Figure: finalizer pipeline. HSAIL → HSAIL-to-IR → IR → SSA → Optimizations on IR → Scheduler → Allocator → Assembler → GPU ISA.]


HSA RT COMPILER INTERACTION

[Figure: HSA RT and compiler interaction. Labels: High-level Models/Runtimes (OpenCL™, C++ AMP, Java™, …); Compiler Library; HSA RT API Categories (Topology, Images, Queues, Tools, Signals, Compilation, Dispatch, Memory, Debugger/Profiler, Interop with Direct3D and OpenGL™); User-Mode: KFD Thunk API; Syscall; Kernel Mode: KFD, KMD.]


HSA DEBUG INFORMATION

Two layers of debug information

Source to BRIG

BRIG to ISA

The source line number in the ISA DWARF line table is the BRIG code offset. Chaining the two line tables (source -> BRIG code offset, BRIG code offset -> ISA program counter) therefore maps kernel source to an ISA program counter value.

Relocation support for use with BRIG linking

HSAIL assembly source -> BRIG mapping in DWARF is supported in libHSAIL

HSA-specific attributes that identify the ISA memory region of variables (global, group, etc.)

Allows:

Setting breakpoints on kernel/HSAIL/ISA source lines

Inspecting and modifying kernel source variables

Stepping through kernel/HSAIL/ISA source
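The two-layer mapping described above can be sketched as a pair of table lookups. This is a minimal illustration and not the DWARF encoding: the table contents, the "row applies from its key up to the next row" rule, and the names `lookup`, `pc_to_source_line` and `source_line_to_pcs` are all assumptions made for the example.

```python
from bisect import bisect_right

# Hypothetical layer 2 (BRIG to ISA): rows of (ISA pc, BRIG code offset), sorted by pc.
ISA_TABLE = [(0x00, 0), (0x10, 8), (0x18, 24), (0x30, 40)]

# Hypothetical layer 1 (source to BRIG): rows of (BRIG code offset, source line),
# sorted by offset.
BRIG_TABLE = [(0, 3), (8, 4), (40, 6)]


def lookup(table, key):
    """Return the value of the last row whose key is <= the given key."""
    keys = [k for k, _ in table]
    index = bisect_right(keys, key) - 1
    if index < 0:
        raise KeyError(key)
    return table[index][1]


def pc_to_source_line(pc):
    """ISA pc -> BRIG code offset -> kernel source line."""
    brig_offset = lookup(ISA_TABLE, pc)
    return lookup(BRIG_TABLE, brig_offset)


def source_line_to_pcs(line):
    """All ISA row addresses whose BRIG offset belongs to the given source line."""
    return [pc for pc, offset in ISA_TABLE if lookup(BRIG_TABLE, offset) == line]
```

A debugger setting a breakpoint on a source line would use the second direction, while a stack trace or single-step display would use the first.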

Performance


PERFORMANCE

Avoid memory copying and use system buffers

Device memory can be used at the developer's choice

Flat pointer support allows advanced data structures, such as trees, to be used to optimize algorithms

Genuine 64-bit support provides access to more memory, so tasks need not be split and reduction code can be avoided

Reduced user-mode dispatch cost

The new HSAIL standard makes it possible to leverage modern HW features

Evolving compiler optimizations give better performance than previous SW, even when the application is unchanged

Platform atomics provide an improved way to exploit parallelism for lock-free programs


EVOLVING COMPILER

[Figure: bar chart. SHOC benchmark, level1 OpenCL™ set on “Kaveri” HW. Series: OpenCL™ with HSA vs. Previous OpenCL™. Vertical axis: 0% to 300%. Benchmarks: FFT, MD, SGEMM, Sort, Spmv, Stencil2D.]

HSA Compiler Release


HSA COMPILER DELIVERABLES

Q2 2014: OpenCL™/LLVM/HSAIL compiler with HSA support enabled

‒ OpenCL™ 1.2 + AMD extensions on Windows ® and Linux ®

‒ HSA RT API 1.0 with HSAIL and AQL inputs

‒ SVM and platform atomics (OpenCL™ 2.0 features)

Q1 2015: Second release of the OpenCL™/LLVM/HSAIL compiler, with higher performance and support for additional hardware

‒ OpenCL™ 2.0 on Windows and Linux

‒ One single compiler stack for OpenCL™ on AMD platforms

Compiler components to be delivered:

‒ High-level compilers (HLC)

‒ HSA Finalizer

‒ libHSAIL

‒ Libraries: language-specific & math

‒ DWARF generation for debugging

Open Source

OpenCL™ 2.0 features


OPENCL™ 2.0 SUPPORT FOR SVM (SHARED-VIRTUAL MEMORY)

Shared-Virtual Memory (SVM)

‒ Address-space exposed to both host and device

‒ Makes a ‘pointer’ meaningful to both host and device

‒ Logically extends a portion of the global memory into the host address space giving work-items access to the host address space

‒ Three types of SVM supported

‒ Coarse-Grained Buffer: can be used to share linked lists and similar data structures between CPU and GPU, but memory synchronization happens only at kernel entry/exit points and at the level of the entire buffer

‒ Map/unmap calls are used as synchronization points

‒ Requires allocation with clSVMAlloc()

‒ Fine-Grained Buffer: can be used to share individual bytes in a buffer. Memory synchronization happens at kernel entry/exit as well as at atomic call points

‒ Requires allocation with clSVMAlloc()

‒ Fine-Grained System: can be used to share individual bytes anywhere in system memory. Memory synchronization happens at kernel entry/exit as well as at atomic call points

‒ A normal malloc() is able to provide access to SVM


OPENCL™ 2.0 SUPPORT FOR ‘PLATFORM ATOMICS’

Follows the C11 and C++11 specifications on atomics, adding a memory_scope argument alongside memory_order

Load/store

‒ void atomic_store_explicit(volatile global A *object, C desired, memory_order order)

‒ C atomic_load_explicit(volatile A *object, memory_order order, memory_scope scope)

Exchange/Compare-Exchange

‒ C atomic_exchange_explicit(volatile global A *object, C desired, memory_order order)

Fetch-and-modify

‒ C atomic_fetch_add(sub)_explicit(volatile global A *object, M operand, memory_order order, memory_scope scope)

Fence

‒ void atomic_work_item_fence(cl_mem_fence_flags flags, memory_order order, memory_scope scope)

Flag

‒ bool atomic_flag_test_and_set_explicit(volatile atomic_flag *object, memory_order order, memory_scope scope)


DEVICE ENQUEUE

OpenCL™ 2.0 spec introduces the concept of enqueuing by the device (GPU). The idea is to launch a new kernel from the running (parent) kernel.

Helpful when there is “enough” data parallelism within the kernel that can be exploited by launching a new kernel. Launching it without going back to the host should lead to better performance.

The new kernel is launched by the device WITHOUT support from the HSA RT.

The compiler generates the BRIG code that enqueues the “child” kernel. This includes creating the AQL queue element, filling in the queue structure, and finally enqueuing the kernel (using AQL commands).

The challenges are

‒ To create new buffers for filling the data into the kernel (without RT support)

‒ To enqueue the new kernel in a thread safe manner (multiple GPU threads may be enqueueing concurrently). For this, we are using platform atomics.
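The thread-safety challenge above can be illustrated with a host-side model in which each enqueuer reserves a distinct queue slot through an atomic fetch-and-add before filling it. This is only a sketch under stated assumptions: Python has no platform atomics, so `AtomicCounter` emulates one with a lock, and `ChildQueue`, its packet dictionaries and `run_enqueuers` are hypothetical names that do not reflect the AQL packet format or the compiler's generated code.

```python
import threading


class AtomicCounter:
    """Stand-in for a platform atomic integer; a lock emulates atomicity."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def fetch_add(self, amount=1):
        """Add amount and return the value held before the addition."""
        with self._lock:
            old = self._value
            self._value += amount
            return old


class ChildQueue:
    """Fixed-capacity queue; slots are handed out by an atomic write index."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.write_index = AtomicCounter()

    def enqueue(self, packet):
        # Each caller gets a unique index, so no two writers share a slot.
        index = self.write_index.fetch_add(1)
        if index >= len(self.slots):
            return None  # queue full in this simplified model (no wraparound)
        self.slots[index] = packet
        return index


def run_enqueuers(num_threads, capacity):
    """Let num_threads concurrent 'work-items' each enqueue one child packet."""
    queue = ChildQueue(capacity)

    def worker(work_item_id):
        queue.enqueue({"kernel": "childKernel", "parent_work_item": work_item_id})

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return queue
```

Because the reservation is a single atomic step, the order in which slots are filled may vary between runs, but every packet lands in its own slot.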


DEVICE ENQUEUE – AN EXAMPLE

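kernel void childKernel(global int *a) {
    ……
}

kernel void parentKernel(global int *b) {
    /* Assumption: the child reuses the parent's 1-D global size; the slide leaves ndrange unset. */
    ndrange_t ndrange = ndrange_1D(get_global_size(0));

    /* Divide the work 'b' into many parts */
    if (more_work_available(b)) {   /* more_work_available() is not defined on the slide */
        /* Wrap the child kernel call in a block */
        void (^myblockChild)(void) = ^{ childKernel(b); };

        /* CLK_ENQUEUE_FLAGS_WAIT_KERNEL: the child starts after the parent kernel finishes */
        enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange, myblockChild);
    }
}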

OpenCL™ 2.0 supports many more sophisticated ways of enqueueing using various events (wait for child), various ndranges, etc.

The HSA compiler implements some of the OpenCL™ 2.0 enqueue_kernel features. The CLANG blocks shown above may not be implemented in the first version.

Conclusions and Future Direction


CONCLUSIONS AND FUTURE DIRECTIONS

Controlled Alpha Release of the First HSA compiler

‒ Supports OpenCL™ 1.2 and a few features from OpenCL™ 2.0

‒ Performance tuning

OpenCL™ 2.0 support

Open-Source

‒ May Contribute to LLVM

‒ May open source the backend


DISCLAIMER & ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION

© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. OpenCL™ is a registered trademark of the Khronos Group. Windows ® is a Trademark of Microsoft and Linux ® is Trademark of Linus Torvalds. Other names are for informational purposes only and may be trademarks of their respective owners.


BACKUP


ABI, LINKING, & LOADING

HSAIL spec enables traditional linking tasks (e.g. symbol resolution) to be spread across static and dynamic stages

‒ Static host and device linking

‒ Merge multiple object files (including host and device) into a single executable

‒ Device linker created to resolve symbols across multiple compilation units

‒ Host linker unmodified

‒ Pre-ISA loading

‒ Load statically allocated, globally scoped global memory data in HSAIL

‒ Track the addresses of globally scoped data symbols

‒ ISA linking and loading

‒ Finalizer resolves all local code and data symbols

‒ Finalizer and RT collectively resolve function symbols

‒ Resolve global-scoped data symbols by getting addresses from pre-ISA loader

‒ Allocate/resolve globally scoped group and private memory data per dispatch

‒ RT loads the ISA binary for execution after translation of the kernel closure is done
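Symbol resolution across multiple compilation units, the device linker's task in the list above, can be sketched as building one table of definitions and then checking every use against it. The unit names, symbol names and the function `resolve_symbols` are hypothetical, and this is not how the actual device linker or BRIG format is implemented.

```python
def resolve_symbols(units):
    """Resolve cross-unit references.

    units: list of (unit_name, defined_symbols, used_symbols).
    Returns (table, unresolved): table maps each defined symbol to the unit
    that defines it; unresolved is the sorted list of symbols that are used
    somewhere but defined nowhere (left for a later, dynamic stage).
    """
    table = {}
    for name, defines, _uses in units:
        for symbol in defines:
            if symbol in table:
                raise ValueError(
                    f"duplicate definition of {symbol!r} in {table[symbol]} and {name}"
                )
            table[symbol] = name
    unresolved = sorted(
        {symbol for _name, _defines, uses in units for symbol in uses if symbol not in table}
    )
    return table, unresolved


# Hypothetical example: two device compilation units that reference each other.
EXAMPLE_UNITS = [
    ("unit_a", {"kernel_main"}, {"helper_fn"}),
    ("unit_b", {"helper_fn"}, {"kernel_main", "runtime_data"}),
]
```

In this model a symbol such as `runtime_data` that no unit defines is reported as unresolved instead of failing the link, mirroring the idea that some resolution is deferred to later stages.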

Compiler lib drives the invocations of compiler components and functionality from OpenCL™ RT and HSA Core RT


KEY HSAIL FEATURES

Parallel

Shared virtual memory

Portable across vendors in HSA Foundation

Stable across multiple product generations

Consistent numerical results (IEEE-754 with defined min accuracy)

Fast, robust, simple finalization step (no monthly updates)

Good performance (little need to write in ISA)

Supports all of OpenCL™ and C++ AMP

Supports Java™, C++, and other languages as well


REGISTERS

Four classes of registers

‒ C: 1-bit, Control Registers

‒ S: 32-bit, Single-precision FP or Int

‒ D: 64-bit, Double-precision FP or Long Int

‒ Q: 128-bit, Packed data.

Fixed number of registers:

‒ 8 C

‒ S, D, Q share a single pool of resources

‒ S + 2*D + 4*Q <= 128

‒ Up to 128 S or 64 D or 32 Q (or a blend)

Register allocation done in high-level compiler

‒ Finalizer doesn’t have to perform expensive register allocation
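The shared-pool rule above (S + 2*D + 4*Q <= 128, plus at most 8 C registers) can be checked with a few lines of arithmetic. The function names below are illustrative only and do not come from any HSA tool.

```python
def register_units(s, d, q):
    """Pool units consumed: each S costs 1, each D costs 2, each Q costs 4."""
    return s + 2 * d + 4 * q


def fits_register_budget(s, d, q, c=0, pool=128, max_c=8):
    """True if the register counts satisfy S + 2*D + 4*Q <= pool and C <= max_c."""
    if min(s, d, q, c) < 0:
        raise ValueError("register counts must be non-negative")
    return c <= max_c and register_units(s, d, q) <= pool
```

For example, a blend of 64 S, 16 D and 8 Q registers uses 64 + 32 + 32 = 128 units and exactly fills the pool; adding one more S register would exceed it.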


HSAIL INSTRUCTION SET - OVERVIEW

Similar to assembly language for a RISC CPU

‒ Load-store architecture

‒ ld_global_u64 $d0, [$d6 + 120] ; $d0= load($d6+120)

‒ add_u64 $d1, $d2, 24 ; $d1= $d2+24

136 opcodes (Java™ bytecode has 200)

‒ Floating point (single, double, half (f16))

‒ Integer (32-bit, 64-bit)

‒ Some packed operations

‒ Branches

‒ Function calls

‒ Platform Atomic Operations: and, or, xor, exch, add, sub, inc, dec, max, min, cas

‒ Synchronize host CPU and HSA Component!

Text and Binary formats (“BRIG”)


SEGMENTS AND MEMORY

7 segments of memory

‒ global, readonly, group, spill, private, arg, kernarg

‒ Memory instructions can (optionally) specify a segment

Global Segment

‒ Visible to all HSA agents (including host CPU)

Group Segment

‒ Provides high-performance memory shared in the work-group by every work-item

Spill, Private, Arg Segments

‒ Represent different regions of a per-work-item stack typically generated by compiler

Kernarg Segment

‒ Programmer writes kernarg segment to pass arguments to a kernel

Read-Only Segment

‒ Remains constant during execution of kernel

Flat Addressing

‒ Each segment mapped into virtual address space

‒ Flat addresses can map to segments based on virtual address

‒ Instructions with no explicit segment use flat addressing

‒ Very useful for high-level language support (e.g. classes, libraries)

‒ Aligns well with OpenCL™ 2.0 “generic” addressing feature

ld_global_u64 $d0, [$d6]

ld_group_u64 $d0, [$d6+24]

st_spill_f32 $s1, [$d6+4]

ld_kernarg_u64 $d6, [%_arg0]

ld_u64 $d0, [$d6+24] ; flat
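The idea that a flat address maps to a segment based on its virtual address can be sketched as a range check against per-segment windows. The window bases and sizes below are invented for illustration (real values are implementation-defined), and `resolve_flat` is a hypothetical helper, not part of HSAIL or any runtime.

```python
# Hypothetical windows: (segment name, base virtual address, size in bytes).
APERTURES = [
    ("group", 0x1000_0000, 0x1_0000),
    ("private", 0x2000_0000, 0x1_0000),
]


def resolve_flat(address):
    """Map a flat virtual address to (segment, offset within that segment).

    Addresses outside every window are treated as global-segment addresses
    and returned unchanged, which is an assumption of this sketch.
    """
    for segment, base, size in APERTURES:
        if base <= address < base + size:
            return segment, address - base
    return "global", address
```

Under these made-up windows, a flat load such as the last instruction above would hit the group segment if its effective address were 0x10000018, and the global segment for an ordinary heap pointer.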


HSAIL AND SPIR

Intended Users

‒ HSAIL: Compiler developers who want to control their own code generation.

‒ SPIR: Compiler developers who want a fast path to acceleration across a wide variety of devices.

IR Level

‒ HSAIL: Low-level, just above the machine instruction set

‒ SPIR: High-level, just below LLVM-IR

Back-end code generation

‒ HSAIL: Thin, fast, robust.

‒ SPIR: Flexible. Can include many optimizations and compiler transformations, including register allocation.

Where are compiler optimizations performed?

‒ HSAIL: Most done in high-level compiler, before HSAIL generation.

‒ SPIR: Most done in back-end code generator, between SPIR and device machine instruction set

Registers

‒ HSAIL: Fixed-size register pool

‒ SPIR: Infinite

SSA Form

‒ HSAIL: No

‒ SPIR: Yes

Binary format

‒ HSAIL: Yes

‒ SPIR: Yes

Code generator for LLVM

‒ HSAIL: Yes

‒ SPIR: Yes

Back-end device targets

‒ HSAIL: Modern GPU architectures supported by members of the HSA Foundation

‒ SPIR: Any OpenCL™ device including GPUs, CPUs, FPGAs

Memory Model

‒ HSAIL: Relaxed consistency with acquire/release, barriers, and fine-grained barriers

‒ SPIR: Flexible. Can support the OpenCL™ 1.2 Memory Model
