HC-4017, HSA Compilers Technology, by Dibyendu Das
HSA COMPILER TECHNOLOGY DIBYENDU DAS, PRAKASH RAGHAVENDRA, LEONID LOBACHEV
| HSA COMPILER TECHNOLOGY | NOVEMBER 19, 2013 | PUBLIC 2
OUTLINE
H(eterogeneous) S(ystem) A(rchitecture) SW Stack
Architecture of HSA Compilers
Performance
HSA Compiler Deliverables
OpenCL™ 2.0 features
Conclusions and Future Direction
How do we deliver the HSA value proposition?
Make GPU easily accessible
‒ Support mainstream languages
‒ Expandable to domain specific languages
Make compute offload efficient
‒ Eliminate memory copying
‒ Low-latency dispatch
Make it ubiquitous
‒ Drive standard through HSA Foundation
‒ Open Source key components
Optimized Compiler Technology
‒ Leverage LLVM framework
‒ HSAIL as a new IR for heterogeneous computing
HSA SOFTWARE STACK
[Diagram: software stack. Layers shown: Applications; application and system languages, domain-specific languages, etc. (e.g. OpenCL™, Java™, C++ AMP, Python, R, …); LLVM IR; HSAIL; HSA Runtime (HSA RT); HSA Hardware]
HSAIL
HSAIL (HSA Intermediate Language) as the SW interface
‒ A virtual ISA for parallel programs
‒ Finalized to a native ISA by a finalizer/JIT
‒ Accommodates rapid innovation in native GPU architectures
‒ HSAIL expected to be stable and backward compatible across implementations
‒ Enables multiple hardware vendors to support HSA
Key design points and benefits for HSA compilers
‒ Adopt a thin finalizer approach
‒ Enable fast translation time and robustness in the finalizer
‒ Drive performance optimizations through high-level compilers (HLC)
‒ Take advantage of the strength and compilation time budget in HLCs for aggressive optimizations
[Diagram: compilation flow]
High-Level Compiler Flow (Developer): OpenCL™ Kernel → EDG or CLANG → SPIR → LLVM → HSAIL
Finalizer Flow (Runtime): HSAIL → Finalizer → Hardware ISA
EDG – Edison Design Group
CLANG – LLVM FE
SPIR – Standard Portable Intermediate Representation
Architecture of HSA Compilers
TECHNOLOGY BASE FOR COMPILER COMPONENTS
H(igh) L(evel) C(ompiler) front-end
‒ C++ FE from Edison Design Group (EDG) under a proprietary license
‒ May support CLANG FE in the future
‒ Generates LLVM IR
HLC back-end
‒ LLVM optimizer and code-gen
‒ Generates HSAIL from LLVM IR
Finalizer
‒ Converts HSAIL to GPU ISA
‒ SSA-based optimizer
HSAIL assembler/disassembler (libHSAIL)
‒ Assembling, disassembling, validating HSAIL and BRIG (binary format of HSAIL)
Libraries
‒ Optimized implementation of OpenCL™ builtins
OPENCL™ COMPILER ARCHITECTURE
OpenCL™ compiler is expected to continue evolving based on new specs from Khronos.
HSA OpenCL™ compiler leverages the existing and evolving compiler architecture of LLVM.
Minimize architectural changes.
Shifting aggressive optimizations toward HLC
Thin Finalizer
[Diagram: two compilation paths]
OpenCL™ Host Compiler: C/C++ Front End → Compiler Optimizations → X86 code generation → Host Linker → x86 Executable with OpenCL™ API Calls
OpenCL™ Device Compiler: EDG for OpenCL™ Kernels → LLVM Optimizer → LLVM HSAIL code generation → Finalizer → GPU ISA
DEVICE COMPILER
Based on LLVM optimizer
Custom HSAIL back-end
Parallel-aware compiler optimizations
SIMT-friendly code generation
GPU specific optimizations
DWARF generation
Direct binary object generation
[Diagram: device compiler pipeline]
LLVM IR (device code in LLVM-IR) → LLVM optimizer → optimized device code in LLVM-IR → LLVM HSAIL code generator → BRIGContainer (device code in binary BRIG form) → BRIGStreamer → BRIG Binary Object (ELF with BRIG sections)
BRIGContainer and BRIGStreamer are provided by libHSAIL
libHSAIL – assembler/disassembler/validator for HSA
HIDEL - High Level HSAIL Description Language
Automatically generated code to:
‒ Access BRIG fields in a safe and effective way
‒ Validate BRIG and HSAIL conformance to spec
‒ Encapsulate BRIG version differences
Brigantine API to ease creation of BRIG on the fly
HSAIL<->BRIG assembler and disassembler
HSAIL->BRIG debug information generator
BRIG streaming routines
HSAIL test generation framework
HSAIL instruction level simulation
[Diagram: libHSAIL components: ASCII HSAIL, Scanner, Parser, Validator, Disassembler, Brigantine, Proxy classes, BrigContainer, BrigStreamer, BifStreamer, BRIG and BIF files]
[Diagram: libHSAIL clients: HLC (LLVM) direct binary object generation, HSAILAsm, libBrigDwarf, Test Generation, Finalizer, Device linker, Loader]
FINALIZER
Fast optimizations for translation efficiency
HLC is expected to perform heavyweight optimizations
Supports Unstructured control flow
Dynamic calling convention
Optimized ISA Libraries
Indirect branches
Exception handling
Offline mode available for caching ISA translation
Debugging support: mapping between BRIG and GPU ISA
[Diagram: finalizer stages: HSAIL → HSAIL-to-IR → IR → SSA → Optimizations on IR → Scheduler → Allocator → Assembler → GPU ISA]
HSA RT COMPILER INTERACTION
[Diagram: labels shown: High-level Models/Runtimes (OpenCL™, C++ AMP, Java™, …); HSA RT API Categories; Topology; Images; Queues; Tools; Signals; Compilation; Dispatch; Compiler Library; Memory; Debugger/Profiler; Direct3D; OpenGL™; Interop; KFD Thunk API; Syscall; User-Mode; Kernel Mode; KFD KMD]
HSA DEBUG INFORMATION
Two layers of debug information
Source to BRIG
BRIG to ISA
The "source line number" in the ISA DWARF line table is the BRIG code offset. Together, the two line tables (source -> BRIG code offset, BRIG code offset -> ISA program counter) map kernel source to the ISA program counter value.
Relocations support to be used with BRIG linking
HSAIL assembly source -> BRIG mapping in DWARF is supported in libHSAIL
HSA-specific attributes that identify the ISA memory region of the variables (global, group, etc.)
Allows:
Setting breakpoints on kernel/HSAIL/ISA source lines
Inspecting and modifying kernel source variables
Stepping through kernel/HSAIL/ISA source
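The two-layer scheme above can be pictured as two chained lookups. The following is a minimal Python sketch, not the DWARF encoding itself: the table contents are invented, and the names `ISA_TO_BRIG`, `BRIG_TO_SOURCE`, `lookup` and `source_line_for_pc` are hypothetical. It assumes each table is sorted by address and that a row covers every address up to the next row.

```python
from bisect import bisect_right

# Invented example tables, sorted by address.
# Layer 2 (BRIG to ISA): ISA program counter -> BRIG code offset.
ISA_TO_BRIG = [(0x00, 0), (0x10, 8), (0x24, 20)]
# Layer 1 (source to BRIG): BRIG code offset -> kernel source line.
BRIG_TO_SOURCE = [(0, 3), (8, 4), (20, 7)]


def lookup(table, address):
    """Return the value of the last row whose address is <= the given address."""
    keys = [row[0] for row in table]
    index = bisect_right(keys, address) - 1
    if index < 0:
        raise KeyError(f"address {address:#x} precedes the table")
    return table[index][1]


def source_line_for_pc(pc):
    """Chain both layers: ISA PC -> BRIG code offset -> kernel source line."""
    brig_offset = lookup(ISA_TO_BRIG, pc)
    return lookup(BRIG_TO_SOURCE, brig_offset)
```

A debugger that stops at a given ISA program counter would, under these assumptions, call `source_line_for_pc(pc)` to find the kernel source line to highlight.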
Performance
PERFORMANCE
Avoid memory copying and use system buffers
Device memory can be used at the developer's choice
Flat pointer support allows advanced data structures, such as trees, to be used to optimize algorithms
Genuine 64-bit support provides access to more memory, so tasks need not be split and reduction code can be avoided
Reduced user mode dispatch cost
The new HSAIL standard makes it possible to leverage modern HW features
Evolving compiler optimizations give better performance than previous SW, even when the application is unchanged
Platform atomics provide an improved way to exploit parallelism for lock-free programs
EVOLVING COMPILER
SHOC benchmark, level1 OpenCL™ set on “Kaveri” HW
[Bar chart: relative performance, y-axis 0% to 300%; benchmarks: FFT, MD, SGEMM, Sort, Spmv, Stencil2D; series: OpenCL™ with HSA, Previous OpenCL™]
HSA Compiler Release
HSA COMPILER DELIVERABLES
Q2 2014: OpenCL™/LLVM/HSAIL compiler with HSA support enabled
‒ OpenCL™ 1.2 + AMD extensions on Windows ® and Linux ®
‒ HSA RT API 1.0 with HSAIL and AQL inputs
‒ SVM and Platform atomics (OCL 2.0 features)
Q1 2015: Second release of the OpenCL™/LLVM/HSAIL compiler, with higher performance and support for additional hardware
‒ OpenCL™ 2.0 on Windows and Linux
‒ One single compiler stack for OpenCL™ on AMD platforms
Compiler components to be delivered:
‒ High-level compilers (HLC)
‒ HSA Finalizer
‒ libHSAIL
‒ Libraries: language-specific & math
‒ DWARF generation for debugging
Open Source
OpenCL™ 2.0 features
OPENCL™ 2.0 SUPPORT FOR SVM (SHARED-VIRTUAL MEMORY)
Shared-Virtual Memory (SVM)
‒ Address-space exposed to both host and device
‒ Makes a ‘pointer’ meaningful to both host and device
‒ Logically extends a portion of the global memory into the host address space giving work-items access to the host address space
‒ Three types of SVM supported
‒ Coarse-Grained Buffer
‒ Can be used to share linked lists and similar data structures between CPU and GPU, but memory synchronization happens only at kernel entry/exit points and at the level of the entire buffer
‒ Map/unmap calls are used as synch points
‒ Need to use clSVMAlloc() call
‒ Fine-Grained Buffer
‒ Can be used to share individual bytes in a buffer. Memory synchronization happens at kernel entry/exit as well as at atomic call points
‒ Need to use clSVMAlloc() call
‒ Fine-Grained System
‒ Can be used to share individual bytes appearing anywhere in system memory. Memory synchronization happens at kernel entry/exit as well as at atomic call points
‒ A ‘normal malloc’ is able to provide access to SVM
OPENCL™ 2.0 SUPPORT FOR ‘PLATFORM ATOMICS’
Follows the C11 and C++11 specs on atomics, with a memory_scope argument in addition to memory_order
Load/store
‒ void atomic_store_explicit(volatile global A *object, C desired, memory_order order)
‒ C atomic_load_explicit(volatile A *object, memory_order order, memory_scope scope)
Exchange/Compare-Exchange
‒ C atomic_exchange_explicit(volatile global A *object, C desired, memory_order order)
Fetch-and-modify
‒ C atomic_fetch_add(sub)_explicit(volatile global A *object, M operand, memory_order order, memory_scope scope)
Fence
‒ void atomic_work_item_fence(cl_mem_fence_flags flags, memory_order order, memory_scope scope)
Flag
‒ bool atomic_flag_test_and_set_explicit(volatile atomic_flag *object, memory_order order, memory_scope scope)
DEVICE ENQUEUE
OpenCL™ 2.0 spec introduces the concept of enqueuing by the device (GPU). The idea is to launch a new kernel from the running (parent) kernel.
Helpful in cases where there is “enough” data parallelism within the kernel that can be exploited by launching a new kernel. Doing so without going back to the host can lead to better performance.
The new kernel is launched by the device WITHOUT the support from HSA RT.
The compiler generates the code in BRIG to enqueue the “child” kernel. This includes creating the AQL queue (Q) element, filling the Q structure, and finally enqueuing the kernel (using AQL commands).
The challenges are
‒ To create new buffers for filling the data into the kernel (without RT support)
‒ To enqueue the new kernel in a thread-safe manner (multiple GPU threads may be enqueueing concurrently). Platform atomics are used for this.
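The thread-safety challenge can be illustrated with a toy model in which each enqueuer first reserves a unique slot with an atomic fetch-and-add and only then fills it. This is a sketch under stated assumptions, not the HSA queue protocol: Python has no hardware atomics, so `AtomicCounter` emulates one with a lock, and `DispatchQueue`, `enqueue` and `run_demo` are hypothetical names. Queue wrap-around and packet formats are left out.

```python
import threading


class AtomicCounter:
    """Stand-in for a platform atomic: fetch_add returns the old value."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def fetch_add(self, amount):
        with self._lock:
            old = self._value
            self._value += amount
            return old


class DispatchQueue:
    """Toy fixed-size queue; each slot holds one dispatch packet."""

    def __init__(self, size):
        self.slots = [None] * size
        self.write_index = AtomicCounter()

    def enqueue(self, packet):
        # Reserve a unique slot first, then fill it, so no two
        # concurrent callers ever write to the same slot.
        slot = self.write_index.fetch_add(1)
        if slot >= len(self.slots):
            raise OverflowError("queue full")
        self.slots[slot] = packet
        return slot


def run_demo(workers=8, per_worker=100):
    """Have several threads enqueue concurrently and return the queue."""
    queue = DispatchQueue(workers * per_worker)

    def work(worker_id):
        for i in range(per_worker):
            queue.enqueue((worker_id, i))

    threads = [threading.Thread(target=work, args=(w,)) for w in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return queue
```

After `run_demo()` returns, every slot holds exactly one packet and no packet has been overwritten, which is the property the atomic reservation is meant to provide.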
DEVICE ENQUEUE – AN EXAMPLE
kernel void childKernel(global int *a) {
    ……
}

kernel void parentKernel(global int *b) {
    ndrange_t ndrange;
    /* Divide the work ‘b’ into many parts */
    if (more_work_available(b)) {
        /* Wrap the child kernel call in a block so it can be enqueued */
        void (^myblockChild)(void) = ^{ childKernel(b); };
        enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange, myblockChild);
    }
}
OpenCL™ 2.0 supports many more sophisticated ways of enqueueing, using various events (wait for child), various ndranges, etc.
The HSA compiler implements some of the OpenCL™ 2.0 enqueue_kernel features. The CLANG blocks shown above may not be implemented in the first version.
Conclusions and Future Direction
CONCLUSIONS AND FUTURE DIRECTIONS
Controlled Alpha Release of the First HSA compiler
‒ Supports OpenCL™ 1.2 and a few features from OpenCL™ 2.0
‒ Performance tuning
OpenCL™ 2.0 support
Open-Source
‒ May Contribute to LLVM
‒ May open source the backend
DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. OpenCL™ is a registered trademark of the Khronos Group. Windows ® is a Trademark of Microsoft and Linux ® is Trademark of Linus Torvalds. Other names are for informational purposes only and may be trademarks of their respective owners.
BACKUP
ABI, LINKING, & LOADING
HSAIL spec enables traditional linking tasks (e.g. symbol resolution) to be spread across static and dynamic stages
‒ Static host and device linking
‒ Merge multiple object files (including host and device) into a single executable
‒ Device linker created to resolve symbols across multiple compilation units
‒ Host linker unmodified
‒ Pre-ISA loading
‒ Load statically allocated, globally scoped global memory data in HSAIL
‒ Track the addresses of globally scoped data symbols
‒ ISA linking and loading
‒ Finalizer resolves all local code and data symbols
‒ Finalizer and RT collectively resolve function symbols
‒ Resolve global-scoped data symbols by getting addresses from pre-ISA loader
‒ Allocate/resolve globally scoped group and private memory data per dispatch
‒ RT loads the ISA binary for execution once translation of the kernel closure is done
Compiler lib drives the invocations of compiler components and functionality from OpenCL™ RT and HSA Core RT
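Symbol resolution across compilation units, as attributed to the device linker above, can be sketched generically. The following Python fragment is an illustration only and does not reflect the actual device linker or the BRIG format: the unit layout (`defines` and `uses` lists) and the function name `link` are hypothetical.

```python
def link(units):
    """Resolve symbols across units; return {symbol: name of defining unit}."""
    definitions = {}
    for name, unit in units.items():
        for symbol in unit["defines"]:
            if symbol in definitions:
                raise ValueError(f"duplicate definition of {symbol}")
            definitions[symbol] = name

    # Every used symbol must be defined by some unit in the link.
    unresolved = sorted(
        {
            symbol
            for unit in units.values()
            for symbol in unit["uses"]
            if symbol not in definitions
        }
    )
    if unresolved:
        raise ValueError("unresolved symbols: " + ", ".join(unresolved))
    return definitions
```

For example, a unit that uses `helper` links cleanly only if another unit in the same call defines it; otherwise the sketch reports the symbol as unresolved.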
KEY HSAIL FEATURES
Parallel
Shared virtual memory
Portable across vendors in HSA Foundation
Stable across multiple product generations
Consistent numerical results (IEEE-754 with defined min accuracy)
Fast, robust, simple finalization step (no monthly updates)
Good performance (little need to write in ISA)
Supports all of OpenCL™ and C++ AMP
Support Java ™, C++, and other languages as well
REGISTERS
Four classes of registers
‒ C: 1-bit, Control Registers
‒ S: 32-bit, Single-precision FP or Int
‒ D: 64-bit, Double-precision FP or Long Int
‒ Q: 128-bit, Packed data.
Fixed number of registers:
‒ 8 C
‒ S, D, Q share a single pool of resources
‒ S + 2*D + 4*Q <= 128
‒ Up to 128 S or 64 D or 32 Q (or a blend)
Register allocation done in high-level compiler
‒ Finalizer doesn’t have to perform expensive register allocation
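The shared-pool rule above is easy to check mechanically. A minimal Python sketch, assuming the pool is counted in 32-bit (S-register) units as the inequality suggests; `pool_units` and `fits_budget` are hypothetical names, not part of any HSA tool.

```python
POOL_SIZE = 128  # shared S/D/Q pool, measured in 32-bit (S-register) units


def pool_units(s, d, q):
    """Return how many 32-bit units a mix of S, D and Q registers consumes."""
    if min(s, d, q) < 0:
        raise ValueError("register counts must be non-negative")
    return s + 2 * d + 4 * q


def fits_budget(s, d, q):
    """True when the mix satisfies S + 2*D + 4*Q <= 128."""
    return pool_units(s, d, q) <= POOL_SIZE
```

Under this rule a blend of 64 S, 16 D and 8 Q registers uses 64 + 32 + 32 = 128 units and just fits, while adding one more Q register (132 units) does not.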
HSAIL INSTRUCTION SET - OVERVIEW
Similar to assembly language for a RISC CPU
‒ Load-store architecture
‒ ld_global_u64 $d0, [$d6 + 120] ; $d0= load($d6+120)
‒ add_u64 $d1, $d2, 24 ; $d1= $d2+24
136 opcodes (Java™ bytecode has 200)
‒ Floating point (single, double, half (f16))
‒ Integer (32-bit, 64-bit)
‒ Some packed operations
‒ Branches
‒ Function calls
‒ Platform Atomic Operations: and, or, xor, exch, add, sub, inc, dec, max, min, cas
‒ Synchronize host CPU and HSA Component!
Text and Binary formats (“BRIG”)
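To make the load-store style concrete, here is a toy Python interpreter for just the two instructions shown on this slide. It is a sketch for illustration, not libHSAIL and not a conforming HSAIL parser: it follows the slide's notation (including `;` as a comment marker), models memory as a dictionary from address to 64-bit value, and the function names `run`, `value` and `address` are hypothetical.

```python
import re

MASK64 = (1 << 64) - 1  # u64 arithmetic wraps at 64 bits


def value(operand, regs):
    """A source operand is either a register ($d2) or an integer immediate."""
    return regs[operand] if operand.startswith("$") else int(operand, 0)


def address(operand, regs):
    """Parse '[$reg]' or '[$reg + offset]' into an integer address."""
    match = re.fullmatch(r"\[\s*(\$\w+)\s*(?:\+\s*(\d+))?\s*\]", operand)
    if match is None:
        raise ValueError(f"unsupported address: {operand}")
    base, offset = match.group(1), match.group(2)
    return regs[base] + int(offset or 0)


def run(program, regs, memory):
    """Execute HSAIL-like text lines against register and memory dicts."""
    for line in program:
        text = line.split(";", 1)[0].strip()  # drop the trailing comment
        if not text:
            continue
        opcode, operands = text.split(None, 1)
        args = [arg.strip() for arg in operands.split(",")]
        if opcode == "add_u64":
            dest, a, b = args
            regs[dest] = (value(a, regs) + value(b, regs)) & MASK64
        elif opcode == "ld_global_u64":
            dest, addr = args
            regs[dest] = memory[address(addr, regs)]
        else:
            raise ValueError(f"unsupported opcode: {opcode}")
    return regs
```

Feeding it the two slide lines with `$d6` pointing 120 bytes below a stored value loads that value into `$d0` and leaves `$d2 + 24` in `$d1`; only the load touches memory, which is the point of a load-store architecture.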
SEGMENTS AND MEMORY
7 segments of memory
‒ global, readonly, group, spill, private, arg, kernarg
‒ Memory instructions can (optionally) specify a segment
Global Segment
‒ Visible to all HSA agents (including host CPU)
Group Segment
‒ Provides high-performance memory shared in the work-group by every work-item
Spill, Private, Arg Segments
‒ Represent different regions of a per-work-item stack typically generated by compiler
Kernarg Segment
‒ Programmer writes kernarg segment to pass arguments to a kernel
Read-Only Segment
‒ Remains constant during execution of kernel
Flat Addressing
‒ Each segment mapped into virtual address space
‒ Flat addresses can map to segments based on virtual address
‒ Instructions with no explicit segment use flat addressing
‒ Very useful for high-level language support (e.g. classes, libraries)
‒ Aligns well with OpenCL™ 2.0 “generic” addressing feature
ld_global_u64 $d0, [$d6]
ld_group_u64 $d0,[$d6+24]
st_spill_f32 $s1,[$d6+4]
ld_kernarg_u64 $d6, [%_arg0]
ld_u64 $d0,[$d6+24] ; flat
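The idea that a flat address selects a segment by its virtual address can be sketched as a range lookup. The Python fragment below is an illustration under loud assumptions: the aperture bases and sizes are invented (real layouts are implementation-defined), only two segments get a window, and `APERTURES` and `resolve_flat` are hypothetical names. Addresses outside every window are treated as global memory in this sketch.

```python
from typing import Tuple

# Invented apertures: (segment, base, size). Not taken from any real device.
APERTURES = [
    ("group", 0x1000_0000, 0x0001_0000),
    ("private", 0x2000_0000, 0x0010_0000),
]


def resolve_flat(address: int) -> Tuple[str, int]:
    """Map a flat address to (segment, offset within that segment)."""
    for segment, base, size in APERTURES:
        if base <= address < base + size:
            return segment, address - base
    # Outside every reserved window: treat as ordinary global memory.
    return "global", address
```

With these made-up numbers, a flat load from `0x1000_0018` would be routed to offset `0x18` of the group segment, while the same instruction applied to an address outside both windows would access global memory.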
HSAIL AND SPIR
Feature | HSAIL | SPIR
Intended Users | Compiler developers who want to control their own code generation. | Compiler developers who want a fast path to acceleration across a wide variety of devices.
IR Level | Low-level, just above the machine instruction set | High-level, just below LLVM-IR
Back-end code generation | Thin, fast, robust. | Flexible. Can include many optimizations and compiler transformations including register allocation.
Where are compiler optimizations performed? | Most done in high-level compiler, before HSAIL generation. | Most done in back-end code generator, between SPIR and device machine instruction set
Registers | Fixed-size register pool | Infinite
SSA Form | No | Yes
Binary format | Yes | Yes
Code generator for LLVM | Yes | Yes
Back-end device targets | Modern GPU architectures supported by members of the HSA Foundation | Any OpenCL™ device including GPUs, CPUs, FPGAs
Memory Model | Relaxed consistency with acquire/release, barriers, and fine-grained barriers | Flexible. Can support the OpenCL™ 1.2 Memory Model