HETEROGENEOUS SYSTEM ARCHITECTURE OVERVIEW
Ofer Rosenberg
DISCLAIMER:
This presentation is not an Official HSA Foundation presentation.
Most of the Material is taken from HSA HotChips 2013
Some slides contains my insights / Opinions
CONTENT
Introduction
hUMA
hQ
HSAIL
HSA Software
HSA Challenges
HSA Availability
INTRODUCTION
5
HISTORIC PERSPECTIVE
Accelerated System
Program runs on CPU
API to access Accelerators ASIC or Firmware Configurable, but operation is
fixed
Heterogeneous System
Program runs on CPU
Offloads work on Accelerators GPU, DSP, etc. Offloaded work is JITed (compiled at
runtime)
Distributed SoC based
6
HSA FOUNDATION Originated from AMD’s FSA – Fusion System Architecture HSA Foundation Founded in June 2012
HSA FOUNDATION MEMBERS
7
Founders
Promoters
Supporters
Contributors
Academic
Associates
Slide Taken from Phil Rogers HSA Overview, HotChips 2013
8
WHAT IS HSA ALL ABOUT ?(MY TAKE)
“Bring Accelerators forward as a first class processor” Unified address space, pageable memory, coherency Eliminate drivers from dispatch path (user mode queues)
Standardized SW stack built on top of a set of HW requirements Improve interoperability between IP vendors
Unified Architecture for Accelerators Start from GPU, extend to DSP /
FPGA / Fixed-Function Acc , etc.
SoC Centric Major features are optimal for SoC
environment (same memory/die) Support of distributed system is
possible, yet inefficient (PCI atomics, others)
Slide Taken from Phil Rogers HSA Overview, HotChips 2013
9
HSA WORKING GROUPS
HSA Systems Architecture hUMA – Unified Memory Model hQ – HSA Queuing Model
HSA Programmer Reference Specification HSAIL – HSA Intermediate Language
HSA System Runtime
HSA Compliance
HSA Tools
http://hsafoundation.com/standards/
10
OPENCL™ AND HSA
HSA is an optimized platform architecture for OpenCL™
Not an alternative to OpenCL™
OpenCL™ on HSA will benefit from Avoidance of wasteful copies Low latency dispatch Improved memory model Pointers shared between CPU and GPU
OpenCL™ 2.0 shows considerable alignment with HSA
Many HSA member companies are also active with Khronos in the OpenCL™ working group
Slide Taken from Phil Rogers HSA Overview, HotChips 2013
hUMA
© Copyright 2012 HSA Foundation. All Rights Reserved. 11
12
hUMAHSA Unified Memory Architecture
Evolution of CPU / GPU memory systems:
1. CPU uses Virtual Addresses, GPU uses Physical Addresses Memory had to be pinned GPU can access a limited area in the CPU memory (Aperture) Requires copy from system memory to GPU-visible memory Pointer-based data structures can’t be shared
2. CPU uses Virtual Addresses, GPU uses Virtual Addresses (but not the same) Memory still had to be pinned
GPU can access the entire system memory
Copy is not required Pointer-based data structures still
can’t be shared
3. hUMA
13
hUMAHSA Unified Memory Architecture
Shared Virtual Memory CPU & GPU see the
same addresses
Pageable Memory GPU can (somehow)
initiate a page fault
Cache coherency
14
SHARED VIRTUAL MEMORY
Advantages No mapping tricks, no copying back-and-forth between different PA
addresses Send pointers (not data) back and forth between HSA agents.
Note the Hardware Implications … Common Page Tables (and common interpretation of architectural
semantics such as shareability, protection, etc). Common mechanisms for address translation (and servicing
address translation faults) Concept of a process address space (PASID) to allow multiple, per
process virtual address spaces within the system.
Slide Taken from Ian bratt HSA QUEUEING, HotChips 2013
15
CACHE COHERENCY DOMAINS
Advantages Composability Reduced SW complexity when communicating between agents Lower barrier to entry when porting software
Note the Hardware Implications … Hardware coherency support between all HSA agents Can take many forms
Stand alone Snoop Filters / Directories Combined L3/Filters Snoop-based systems (no filter) Etc …
Slide Taken from Ian bratt HSA QUEUEING, HotChips 2013
hQ
© Copyright 2012 HSA Foundation. All Rights Reserved. 16
17
hQ Motivation
1. GPU Dispatch has a lot of overhead SW/Driver stack overhead User mode to Kernel mode switch
18
hQ Motivation
2. Master/Slave pattern is limiting (and has a lot of overhead) CPU schedules work to the GPU Communication overhead (report results next kernel grid size)
Slide from “Introduction to Dynamic Parallelism”, Stephen Jones, NVIDIA Corporation
19
hQHSA QUEUING MODEL
User mode queuing for low latency dispatch Application dispatches directly No OS or driver in the dispatch path
Architected Queuing Layer Single compute dispatch path for all hardware No driver translation, direct to hardware
Allows for dispatch to queue from any agent CPU or GPU GPU can spawn its
own work
Picture from AMD Blog:hQ: From Master/Slave to Masterpiece
20
ARCHITECTED QUEUEING LANGUAGE
HSA Queues look just like standard shared memory queues, supporting multi-producer, single-consumer
Support is allowed for single-producer, single-consumer
Queues consist of storage, read/write indices, ID, etc.
Queues are created/destroyed via calls to the HSA runtime
“Packets” are placed in queues directly from user mode, via an architected protocol
Packet format is architected
Producer Producer
Consumer
Read Index
Write Index
Storage in coherent, shared memory
Packets
Slide Taken from Ian bratt HSA QUEUEING, HotChips 2013
HSAIL
© Copyright 2012 HSA Foundation. All Rights Reserved. 21
22
WHAT IS HSAIL?
HSAIL is the intermediate language for parallel compute in HSA Generated by a high level compiler (LLVM, gcc, Java VM, etc) Low-level IR, close to machine ISA level Compiled down to target ISA by an IHV “Finalizer” Finalizer may execute at run time, install time, or build time
Example: OpenCL™ Compilation Stack using HSAIL
OpenCL™ Kernel
High-Level Compiler Flow (Developer)
Finalizer Flow (Runtime)
EDG or CLANGSPIRLLVMHSAIL HSAIL
FinalizerHardware ISA
Binary
Slide Taken from Ben Sander’s HSAIL: Portable Compiler IR FOR HSA, HotChips 2013
23
HSAIL INSTRUCTION SET HIGHLIGHTS
“SIMT” – Single Instruction Multiple Data ISA is Scalar, describes one serial thread – Parallelism is done by HW
RISC-Like Load-store architecture 136 opcodes
Fixed number of Registers 1 Control Pool of 512 bytes
Single Double Quad
7 segments of memory global, read only, group, spill, private, arg, kernarg
01: version 0:95: $full : $large;02: // static method HotSpotMethod<Main.lambda$2(Player)>03: kernel &run (04: kernarg_u64 %_arg0 // Kernel signature for lambda method05: ) {06: ld_kernarg_u64 $d6, [%_arg0]; // Move arg to an HSAIL register07: workitemabsid_u32 $s2, 0; // Read the work-item global “X” coord08:09: cvt_u64_s32 $d2, $s2; // Convert X gid to long10: mul_u64 $d2, $d2, 8; // Adjust index for sizeof ref11: add_u64 $d2, $d2, 24; // Adjust for actual elements start12: add_u64 $d2, $d2, $d6; // Add to array ref ptr13: ld_global_u64 $d6, [$d2]; // Load from array element into reg14: @L0:15: ld_global_u64 $d0, [$d6 + 120]; // p.getTeam()16: mov_b64 $d3, $d0;17: ld_global_s32 $s3, [$d6 + 40]; // p.getScores ()18: cvt_f32_s32 $s16, $s3;19: ld_global_s32 $s0, [$d0 + 24]; // Team getScores()20: cvt_f32_s32 $s17, $s0;21: div_f32 $s16, $s16, $s17; // p.getScores()/teamScores22: st_global_f32 $s16, [$d6 + 100]; // p.setPctOfTeamScores()23: ret;24: };
HSA SOFTWARE
© Copyright 2012 HSA Foundation. All Rights Reserved. 24
25
HIGH-LEVEL SOFTWARE STACK
Programming Languages OpenCL 2.0 C++ AMP Java (Aparapi/Sumatra)
HSA Runtime (User Mode Driver) System Query Access to JIT Compilers Access to Queues
JIT Compilers Offline or online (JIT) LLVM Compiler (LLVM HSAIL) HSAIL Finalizer (HSAIL BIN)
Kernel Mode Driverhttp://www.hsafoundation.com/hsa-developer-tools/
26
HSA OPEN SOURCE SOFTWARE
HSA will feature an open source linux execution and compilation stack Allows a single shared implementation for many components Enables university research and collaboration in all areas Because it’s the right thing to do
Component Name IHV or Common Rationale
HSA Bolt Library Common Enable understanding and debug
HSAIL Code Generator Common Enable research
LLVM Contributions Common Industry and academic collaboration
HSAIL Assembler Common Enable understanding and debug
HSA Runtime Common Standardize on a single runtime
HSA Finalizer IHV Enable research and debug
HSA Kernel Driver IHV For inclusion in linux distros
Slide Taken from Phil Rogers “Heterogeneous System Architecture Overview”, HotChips 2013
27
JAVA HETEROGENEOUS ENABLEMENT ROADMAP
CPU ISA GPU ISA
JVM
Application
APARAPI
GPUCPU
OpenCL™
CPU ISA GPU ISA
JVM
Application
APARAPI
HSA CPUHSA CPU
HSA Finalizer
HSAIL
CPU ISA GPU ISA
JVM
Application
APARAPI
HSA CPUHSA CPU
HSA Finalizer
HSAIL
HSA RuntimeLLVM Optimizer
IR
CPU ISA GPU ISA
Sumatra Enabled JVM
Application
HSA CPUHSA CPU
HSA Finalizer
HSAIL
Slide Taken from Phil Rogers “Heterogeneous System Architecture Overview”, HotChips 2013
HSA Challenges(My Take)
© Copyright 2012 HSA Foundation. All Rights Reserved. 28
HSA CHALLENGES – VENDOR SUPPORT
29
Founders
Promoters
Supporters
Contributors
Academic
Slide Taken from Phil Rogers HSA Overview, HotChips 2013
Missing some key players:Intel, NVIDIA, Apple, Microsoft, Google, …
30
HSA CHALLENGES – LANGUAGES SUPPORT
HSAIL (or LLVM) is not an attractive level to code at…
Leverage existing parallel languages/paradigms to exploit HSA features: C++ AMP OpenCL 2.0 (done!) OpenMP Add your favorite …
Extend popular languages to exploit HSA: Scripting languages: Python Web languages : HTML5, RoR, Javascript, …
DSL languages
31
HSA CHALLENGES – SECURITY
HSA design had some security measures in mind: Accelerator supports privilege level, with user and privileged memory Execute, Read and Write are protected by page table entries Support in fixed time context sceduling (DoS protection)
But: Advanced features such as hUMA & uQ are potential back door OS & Security Apps currently do not monitor the accelerators
Monitoring may require OS changes Detailed specification can be used to find attack vectors Some accelerators architecture may introduce a security flaw
Example: local memory on GPU
HSA Availability
© Copyright 2012 HSA Foundation. All Rights Reserved. 32
© Copyright 2012 HSA Foundation. All Rights Reserved. 33
HSA AVAILABILITY
AMD released “Kaveri”, the first SoC which is HSA-able HW supports HUMA, hQ, etc. HSA software stack is not publicly available yet (expected this year)
http://www.tomshardware.com/reviews/a88x-socket-fm2-motherboard,3764.html
34
HSA AVAILABILITY
Simulators:
HSAEMU – A full system emulator for HSA platforms Work done by System SW Lab at NTHU (National Tsing Hua University) http://hsaemu.org/ Code available on GitHub - https://github.com/SSLAB-HSA/HSAemu
HSAIL Simulator Code available on GitHub - https://
github.com/HSAFoundation/HSAIL-Instruction-Set-Simulator
THANK YOU
35
BACKUPS
36
© Copyright 2012 HSA Foundation. All Rights Reserved. 37
REFERENCES
• HSA Foundation:
• http://hsafoundation.com/
• HSA whitepaper
• http://developer.amd.com/wordpress/media/2012/10/hsa10.pdf
• hUMA
• http://www.slideshare.net/AMD/amd-heterogeneous-uniform-memory-access
• http://www.pcper.com/reviews/Processors/AMD-Details-hUMA-HSA-Action
• http://www.bit-tech.net/news/hardware/2013/04/30/amd-huma-heterogeneous-unified-memory-acces/
• http://www.amd.com/us/products/technologies/hsa/Pages/hsa.aspx#3
• ANANDTECH Hawaii architecture
• http://www.anandtech.com/show/7457/the-radeon-r9-290x-review/3
• hQ
• http://community.amd.com/community/amd-blogs/amd-business/blog/2013/10/21/hq-from-masterslave-to-masterpiece
• http://on-demand.gputechconf.com/gtc/2012/presentations/S0338-GTC2012-CUDA-Programming-Model.pdf
• HSA purpose analysis by Moor
• http://developer.amd.com/apu/wordpress/wp-content/uploads/2012/01/HSAF-Purpose-and-Outlook-by-Moor-Insights-Strategy.pdf
• IOMMUv2 spec
• http://developer.amd.com/wordpress/media/2012/10/48882.pdf
38
hUMA & Discrete GPUs
hUMA can be extended beyond SoC, if the proper HW exists (such as Hawaii GPU…)
Slide from “IOMMUv2: the Ins and Outs of Heterogeneous GPU use”, AFDS 2012
39
HSAIL AND SPIRFeature HSAIL SPIR
Intended UsersCompiler developers who want to control their own code generation.
Compiler developers who want a fast path to acceleration across a wide variety of devices.
IR LevelLow-level, just above the machine instruction set High-level, just below LLVM-IR
Back-end code generation Thin, fast, robust.
Flexible. Can include many optimizations and compiler transformation including register allocation.
Where are compiler optimizations performed?
Most done in high-level compiler, before HSAIL generation.
Most done in back-end code generator, between SPIR and device machine instruction set
Registers Fixed-size register pool InfiniteSSA Form No YesBinary format Yes YesCode generator for LLVM Yes Yes
Back-end device targets
Modern GPU architectures supported by members of the HSA Foundation
Any OpenCL device including GPUs, CPUs, FPGAs
Memory Model
Relaxed consistency with acquire/release, barriers, and fine-grained barriers
Flexible. Can support the OpenCL 1.2 Memory Model
Slide Taken from Ben Sander’s HSAIL: Portable Compiler IR FOR HSA, HotChips 2013
40
Hardware - APUs, CPUs, GPUs
Driver Stack
Domain Libraries
OpenCL™, DX Runtimes, User Mode Drivers
Graphics Kernel Mode Driver
AppsApps
AppsApps
AppsApps
HSA Software Stack
Task Queuing Libraries
HSA Domain Libraries,OpenCL ™ 2.x Runtime
HSA Kernel Mode Driver
HSA Runtime
HSA JIT
AppsApps
AppsApps
AppsApps
User mode component Kernel mode component Components contributed by third parties
HSA SOFTWARE STACK
41
OPENCL™ AND HSA
HSA is an optimized platform architecture for OpenCL™
Not an alternative to OpenCL™
OpenCL™ on HSA will benefit from Avoidance of wasteful copies Low latency dispatch Improved memory model Pointers shared between CPU and GPU
OpenCL™ 2.0 shows considerable alignment with HSA
Many HSA member companies are also active with Khronos in the OpenCL™ working group
Slide Taken from Phil Rogers HSA Overview, HotChips 2013
42
BOLT — PARALLEL PRIMITIVES LIBRARY FOR HSA
Easily leverage the inherent power efficiency of GPU computing Common routines such as scan, sort, reduce, transform More advanced routines like heterogeneous pipelines Bolt library works with OpenCL and C++ AMP
Enjoy the unique advantages of the HSA platform Move the computation not the data
Finally a single source code base for the CPU and GPU! Developers can focus on core algorithms
Bolt version 1.0 for OpenCL and C++ AMP is available now at https://github.com/HSA-Libraries/Bolt
Slide Taken from Phil Rogers HSA Overview, HotChips 2013
43
HSA OPEN SOURCE SOFTWARE
HSA will feature an open source linux execution and compilation stack Allows a single shared implementation for many components Enables university research and collaboration in all areas Because it’s the right thing to do
Component Name IHV or Common Rationale
HSA Bolt Library Common Enable understanding and debug
HSAIL Code Generator Common Enable research
LLVM Contributions Common Industry and academic collaboration
HSAIL Assembler Common Enable understanding and debug
HSA Runtime Common Standardize on a single runtime
HSA Finalizer IHV Enable research and debug
HSA Kernel Driver IHV For inclusion in linux distros
44
LINES-OF-CODE AND PERFORMANCE FOR DIFFERENT PROGRAMMING MODELS
AMD A10-5800K APU with Radeon™ HD Graphics – CPU: 4 cores, 3800MHz (4200MHz Turbo); GPU: AMD Radeon HD 7660D, 6 compute units, 800MHz; 4GB RAM.Software – Windows 7 Professional SP1 (64-bit OS); AMD OpenCL™ 1.2 AMD-APP (937.2); Microsoft Visual Studio 11 Beta
0
50
100
150
200
250
300
350
LO
C
Copy-back Algorithm Launch Copy Compile Init Performance
Serial CPU TBB Intrinsics+TBB OpenCL™-C OpenCL™ -C++ C++ AMP HSA Bolt
Perfo
rman
ce
35.00
30.00
25.00
20.00
15.00
10.00
5.00
0Copy-back
Algorithm
Launch
Copy
Compile
Init.
Copy-back
Algorithm
Launch
Copy
Compile
Copy-back
Algorithm
Launch
Algorithm
Launch
Algorithm
Launch
Algorithm
Launch
Algorithm
Launch
(Exemplary ISV “Hessian” Kernel)
© Copyright 2012 HSA Foundation. All Rights Reserved. 45
AMD’S FIRST HSA SOC
Top Related