Download - HSA Introduction

HETEROGENEOUS SYSTEM ARCHITECTURE OVERVIEW

Ofer Rosenberg

DISCLAIMER:

This presentation is not an Official HSA Foundation presentation.

Most of the Material is taken from HSA HotChips 2013

Some slides contains my insights / Opinions

CONTENT

Introduction

hUMA

hQ

HSAIL

HSA Software

HSA Challenges

HSA Availability

INTRODUCTION

5

HISTORIC PERSPECTIVE

Accelerated System

Program runs on CPU

API to access Accelerators ASIC or Firmware Configurable, but operation is

fixed

Heterogeneous System

Program runs on CPU

Offloads work on Accelerators GPU, DSP, etc. Offloaded work is JITed (compiled at

runtime)

Distributed SoC based

https://www.youtube.com/watch?v=N9Y_naTHPsY

6

HSA FOUNDATION Originated from AMD’s FSA – Fusion System Architecture HSA Foundation Founded in June 2012

HSA FOUNDATION MEMBERS

7

Founders

Promoters

Supporters

Contributors

Academic

Associates

Slide Taken from Phil Rogers HSA Overview, HotChips 2013

http://www.apical.co.uk/

http://www.multicorewareinc.com/index.php

8

WHAT IS HSA ALL ABOUT ?(MY TAKE)

“Bring Accelerators forward as a first class processor” Unified address space, pageable memory, coherency Eliminate drivers from dispatch path (user mode queues)

Standardized SW stack built on top of a set of HW requirements Improve interoperability between IP vendors

Unified Architecture for Accelerators Start from GPU, extend to DSP /

FPGA / Fixed-Function Acc , etc.

SoC Centric Major features are optimal for SoC

environment (same memory/die) Support of distributed system is

possible, yet inefficient (PCI atomics, others)


9

HSA WORKING GROUPS

HSA Systems Architecture hUMA – Unified Memory Model hQ – HSA Queuing Model

HSA Programmer Reference Specification HSAIL – HSA Intermediate Language

HSA System Runtime

HSA Compliance

HSA Tools

http://hsafoundation.com/standards/



10

OPENCL™ AND HSA

HSA is an optimized platform architecture for OpenCL™

Not an alternative to OpenCL™

OpenCL™ on HSA will benefit from Avoidance of wasteful copies Low latency dispatch Improved memory model Pointers shared between CPU and GPU

OpenCL™ 2.0 shows considerable alignment with HSA

Many HSA member companies are also active with Khronos in the OpenCL™ working group


hUMA

© Copyright 2012 HSA Foundation. All Rights Reserved. 11

12

hUMAHSA Unified Memory Architecture

Evolution of CPU / GPU memory systems:

1. CPU uses Virtual Addresses, GPU uses Physical Addresses Memory had to be pinned GPU can access a limited area in the CPU memory (Aperture) Requires copy from system memory to GPU-visible memory Pointer-based data structures can’t be shared

2. CPU uses Virtual Addresses, GPU uses Virtual Addresses (but not the same) Memory still had to be pinned

GPU can access the entire system memory

Copy is not required Pointer-based data structures still

can’t be shared

3. hUMA

13

hUMAHSA Unified Memory Architecture

Shared Virtual Memory CPU & GPU see the

same addresses

Pageable Memory GPU can (somehow)

initiate a page fault

Cache coherency

14

SHARED VIRTUAL MEMORY

Advantages No mapping tricks, no copying back-and-forth between different PA

addresses Send pointers (not data) back and forth between HSA agents.

Note the Hardware Implications … Common Page Tables (and common interpretation of architectural

semantics such as shareability, protection, etc). Common mechanisms for address translation (and servicing

address translation faults) Concept of a process address space (PASID) to allow multiple, per

process virtual address spaces within the system.

Slide Taken from Ian bratt HSA QUEUEING, HotChips 2013

15

CACHE COHERENCY DOMAINS

Advantages Composability Reduced SW complexity when communicating between agents Lower barrier to entry when porting software

Note the Hardware Implications … Hardware coherency support between all HSA agents Can take many forms

Stand alone Snoop Filters / Directories Combined L3/Filters Snoop-based systems (no filter) Etc …


hQ


17

hQ Motivation

1. GPU Dispatch has a lot of overhead SW/Driver stack overhead User mode to Kernel mode switch

18

hQ Motivation

2. Master/Slave pattern is limiting (and has a lot of overhead) CPU schedules work to the GPU Communication overhead (report results next kernel grid size)

Slide from “Introduction to Dynamic Parallelism”, Stephen Jones, NVIDIA Corporation

19

hQHSA QUEUING MODEL

User mode queuing for low latency dispatch Application dispatches directly No OS or driver in the dispatch path

Architected Queuing Layer Single compute dispatch path for all hardware No driver translation, direct to hardware

Allows for dispatch to queue from any agent CPU or GPU GPU can spawn its

own work

Picture from AMD Blog:hQ: From Master/Slave to Masterpiece

http://community.amd.com/community/amd-blogs/amd-business/blog/2013/10/21/hq-from-masterslave-to-masterpiece


20

ARCHITECTED QUEUEING LANGUAGE

HSA Queues look just like standard shared memory queues, supporting multi-producer, single-consumer

Support is allowed for single-producer, single-consumer

Queues consist of storage, read/write indices, ID, etc.

Queues are created/destroyed via calls to the HSA runtime

“Packets” are placed in queues directly from user mode, via an architected protocol

Packet format is architected

Producer Producer

Consumer

Read Index

Write Index

Storage in coherent, shared memory

Packets


HSAIL


22

WHAT IS HSAIL?

HSAIL is the intermediate language for parallel compute in HSA Generated by a high level compiler (LLVM, gcc, Java VM, etc) Low-level IR, close to machine ISA level Compiled down to target ISA by an IHV “Finalizer” Finalizer may execute at run time, install time, or build time

Example: OpenCL™ Compilation Stack using HSAIL

OpenCL™ Kernel

High-Level Compiler Flow (Developer)

Finalizer Flow (Runtime)

EDG or CLANGSPIRLLVMHSAIL HSAIL

FinalizerHardware ISA

Binary

Slide Taken from Ben Sander’s HSAIL: Portable Compiler IR FOR HSA, HotChips 2013

23

HSAIL INSTRUCTION SET HIGHLIGHTS

“SIMT” – Single Instruction Multiple Data ISA is Scalar, describes one serial thread – Parallelism is done by HW

RISC-Like Load-store architecture 136 opcodes

Fixed number of Registers 1 Control Pool of 512 bytes

Single Double Quad

7 segments of memory global, read only, group, spill, private, arg, kernarg

01: version 0:95: $full : $large;02: // static method HotSpotMethod<Main.lambda$2(Player)>03: kernel &run (04: kernarg_u64 %_arg0 // Kernel signature for lambda method05: ) {06: ld_kernarg_u64 $d6, [%_arg0]; // Move arg to an HSAIL register07: workitemabsid_u32 $s2, 0; // Read the work-item global “X” coord08:09: cvt_u64_s32 $d2, $s2; // Convert X gid to long10: mul_u64 $d2, $d2, 8; // Adjust index for sizeof ref11: add_u64 $d2, $d2, 24; // Adjust for actual elements start12: add_u64 $d2, $d2, $d6; // Add to array ref ptr13: ld_global_u64 $d6, [$d2]; // Load from array element into reg14: @L0:15: ld_global_u64 $d0, [$d6 + 120]; // p.getTeam()16: mov_b64 $d3, $d0;17: ld_global_s32 $s3, [$d6 + 40]; // p.getScores ()18: cvt_f32_s32 $s16, $s3;19: ld_global_s32 $s0, [$d0 + 24]; // Team getScores()20: cvt_f32_s32 $s17, $s0;21: div_f32 $s16, $s16, $s17; // p.getScores()/teamScores22: st_global_f32 $s16, [$d6 + 100]; // p.setPctOfTeamScores()23: ret;24: };

HSA SOFTWARE


25

HIGH-LEVEL SOFTWARE STACK

Programming Languages OpenCL 2.0 C++ AMP Java (Aparapi/Sumatra)

HSA Runtime (User Mode Driver) System Query Access to JIT Compilers Access to Queues

JIT Compilers Offline or online (JIT) LLVM Compiler (LLVM HSAIL) HSAIL Finalizer (HSAIL BIN)

Kernel Mode Driverhttp://www.hsafoundation.com/hsa-developer-tools/

http://www.hsafoundation.com/hsa-developer-tools/

http://www.hsafoundation.com/hsa-developer-tools/

26

HSA OPEN SOURCE SOFTWARE

HSA will feature an open source linux execution and compilation stack Allows a single shared implementation for many components Enables university research and collaboration in all areas Because it’s the right thing to do

Component Name IHV or Common Rationale

HSA Bolt Library Common Enable understanding and debug

HSAIL Code Generator Common Enable research

LLVM Contributions Common Industry and academic collaboration

HSAIL Assembler Common Enable understanding and debug

HSA Runtime Common Standardize on a single runtime

HSA Finalizer IHV Enable research and debug

HSA Kernel Driver IHV For inclusion in linux distros

Slide Taken from Phil Rogers “Heterogeneous System Architecture Overview”, HotChips 2013

27

JAVA HETEROGENEOUS ENABLEMENT ROADMAP

CPU ISA GPU ISA

JVM

Application

APARAPI

GPUCPU

OpenCL™

CPU ISA GPU ISA

JVM

Application

APARAPI

HSA CPUHSA CPU

HSA Finalizer

HSAIL

CPU ISA GPU ISA

JVM

Application

APARAPI

HSA CPUHSA CPU

HSA Finalizer

HSAIL

HSA RuntimeLLVM Optimizer

IR

CPU ISA GPU ISA

Sumatra Enabled JVM

Application

HSA CPUHSA CPU

HSA Finalizer

HSAIL

Slide Taken from Phil Rogers “Heterogeneous System Architecture Overview”, HotChips 2013

HSA Challenges(My Take)


HSA CHALLENGES – VENDOR SUPPORT

29

Founders

Promoters

Supporters

Contributors

Academic


Missing some key players:Intel, NVIDIA, Apple, Microsoft, Google, …

http://www.apical.co.uk/

http://www.multicorewareinc.com/index.php

30

HSA CHALLENGES – LANGUAGES SUPPORT

HSAIL (or LLVM) is not an attractive level to code at…

Leverage existing parallel languages/paradigms to exploit HSA features: C++ AMP OpenCL 2.0 (done!) OpenMP Add your favorite …

Extend popular languages to exploit HSA: Scripting languages: Python Web languages : HTML5, RoR, Javascript, …

DSL languages

31

HSA CHALLENGES – SECURITY

HSA design had some security measures in mind: Accelerator supports privilege level, with user and privileged memory Execute, Read and Write are protected by page table entries Support in fixed time context sceduling (DoS protection)

But: Advanced features such as hUMA & uQ are potential back door OS & Security Apps currently do not monitor the accelerators

Monitoring may require OS changes Detailed specification can be used to find attack vectors Some accelerators architecture may introduce a security flaw

Example: local memory on GPU

HSA Availability



HSA AVAILABILITY

AMD released “Kaveri”, the first SoC which is HSA-able HW supports HUMA, hQ, etc. HSA software stack is not publicly available yet (expected this year)

http://www.tomshardware.com/reviews/a88x-socket-fm2-motherboard,3764.html




34

HSA AVAILABILITY

Simulators:

HSAEMU – A full system emulator for HSA platforms Work done by System SW Lab at NTHU (National Tsing Hua University) http://hsaemu.org/ Code available on GitHub - https://github.com/SSLAB-HSA/HSAemu

HSAIL Simulator Code available on GitHub - https://

github.com/HSAFoundation/HSAIL-Instruction-Set-Simulator

http://hsaemu.org/

http://hsaemu.org/

https://github.com/SSLAB-HSA/HSAemu

https://github.com/SSLAB-HSA/HSAemu

https://github.com/HSAFoundation/HSAIL-Instruction-Set-Simulator

https://github.com/HSAFoundation/HSAIL-Instruction-Set-Simulator

THANK YOU

35

BACKUPS

36


REFERENCES

• HSA Foundation:

• http://hsafoundation.com/

• HSA whitepaper

• http://developer.amd.com/wordpress/media/2012/10/hsa10.pdf

• hUMA

• http://www.slideshare.net/AMD/amd-heterogeneous-uniform-memory-access

• http://www.pcper.com/reviews/Processors/AMD-Details-hUMA-HSA-Action

• http://www.bit-tech.net/news/hardware/2013/04/30/amd-huma-heterogeneous-unified-memory-acces/

• http://www.amd.com/us/products/technologies/hsa/Pages/hsa.aspx#3

• ANANDTECH Hawaii architecture

• http://www.anandtech.com/show/7457/the-radeon-r9-290x-review/3

• hQ

• http://community.amd.com/community/amd-blogs/amd-business/blog/2013/10/21/hq-from-masterslave-to-masterpiece

• http://on-demand.gputechconf.com/gtc/2012/presentations/S0338-GTC2012-CUDA-Programming-Model.pdf

• HSA purpose analysis by Moor

• http://developer.amd.com/apu/wordpress/wp-content/uploads/2012/01/HSAF-Purpose-and-Outlook-by-Moor-Insights-Strategy.pdf

• IOMMUv2 spec

• http://developer.amd.com/wordpress/media/2012/10/48882.pdf

http://hsafoundation.com/

http://hsafoundation.com/

http://developer.amd.com/wordpress/media/2012/10/hsa10.pdf

http://www.slideshare.net/AMD/amd-heterogeneous-uniform-memory-access

http://www.pcper.com/reviews/Processors/AMD-Details-hUMA-HSA-Action



http://www.bit-tech.net/news/hardware/2013/04/30/amd-huma-heterogeneous-unified-memory-acces/

http://www.bit-tech.net/news/hardware/2013/04/30/amd-huma-heterogeneous-unified-memory-acces/

http://www.amd.com/us/products/technologies/hsa/Pages/hsa.aspx#3

http://www.amd.com/us/products/technologies/hsa/Pages/hsa.aspx#3

http://www.anandtech.com/show/7457/the-radeon-r9-290x-review/3



http://on-demand.gputechconf.com/gtc/2012/presentations/S0338-GTC2012-CUDA-Programming-Model.pdf

http://developer.amd.com/apu/wordpress/wp-content/uploads/2012/01/HSAF-Purpose-and-Outlook-by-Moor-Insights-Strategy.pdf

http://developer.amd.com/apu/wordpress/wp-content/uploads/2012/01/HSAF-Purpose-and-Outlook-by-Moor-Insights-Strategy.pdf

http://developer.amd.com/wordpress/media/2012/10/48882.pdf

38

hUMA & Discrete GPUs

hUMA can be extended beyond SoC, if the proper HW exists (such as Hawaii GPU…)

Slide from “IOMMUv2: the Ins and Outs of Heterogeneous GPU use”, AFDS 2012

39

HSAIL AND SPIRFeature HSAIL SPIR

Intended UsersCompiler developers who want to control their own code generation.

Compiler developers who want a fast path to acceleration across a wide variety of devices.

IR LevelLow-level, just above the machine instruction set High-level, just below LLVM-IR

Back-end code generation Thin, fast, robust.

Flexible. Can include many optimizations and compiler transformation including register allocation.

Where are compiler optimizations performed?

Most done in high-level compiler, before HSAIL generation.

Most done in back-end code generator, between SPIR and device machine instruction set

Registers Fixed-size register pool InfiniteSSA Form No YesBinary format Yes YesCode generator for LLVM Yes Yes

Back-end device targets

Modern GPU architectures supported by members of the HSA Foundation

Any OpenCL device including GPUs, CPUs, FPGAs

Memory Model

Relaxed consistency with acquire/release, barriers, and fine-grained barriers

Flexible. Can support the OpenCL 1.2 Memory Model

Slide Taken from Ben Sander’s HSAIL: Portable Compiler IR FOR HSA, HotChips 2013

40

Hardware - APUs, CPUs, GPUs

Driver Stack

Domain Libraries

OpenCL™, DX Runtimes, User Mode Drivers

Graphics Kernel Mode Driver

AppsApps

AppsApps

AppsApps

HSA Software Stack

Task Queuing Libraries

HSA Domain Libraries,OpenCL ™ 2.x Runtime

HSA Kernel Mode Driver

HSA Runtime

HSA JIT

AppsApps

AppsApps

AppsApps

User mode component Kernel mode component Components contributed by third parties

HSA SOFTWARE STACK

41

OPENCL™ AND HSA

HSA is an optimized platform architecture for OpenCL™

Not an alternative to OpenCL™

OpenCL™ on HSA will benefit from Avoidance of wasteful copies Low latency dispatch Improved memory model Pointers shared between CPU and GPU

OpenCL™ 2.0 shows considerable alignment with HSA

Many HSA member companies are also active with Khronos in the OpenCL™ working group


42

BOLT — PARALLEL PRIMITIVES LIBRARY FOR HSA

Easily leverage the inherent power efficiency of GPU computing Common routines such as scan, sort, reduce, transform More advanced routines like heterogeneous pipelines Bolt library works with OpenCL and C++ AMP

Enjoy the unique advantages of the HSA platform Move the computation not the data

Finally a single source code base for the CPU and GPU! Developers can focus on core algorithms

Bolt version 1.0 for OpenCL and C++ AMP is available now at https://github.com/HSA-Libraries/Bolt


https://github.com/HSA-Libraries/Bolt

43

HSA OPEN SOURCE SOFTWARE

HSA will feature an open source linux execution and compilation stack Allows a single shared implementation for many components Enables university research and collaboration in all areas Because it’s the right thing to do

Component Name IHV or Common Rationale

HSA Bolt Library Common Enable understanding and debug

HSAIL Code Generator Common Enable research

LLVM Contributions Common Industry and academic collaboration

HSAIL Assembler Common Enable understanding and debug

HSA Runtime Common Standardize on a single runtime

HSA Finalizer IHV Enable research and debug

HSA Kernel Driver IHV For inclusion in linux distros

44

LINES-OF-CODE AND PERFORMANCE FOR DIFFERENT PROGRAMMING MODELS

AMD A10-5800K APU with Radeon™ HD Graphics – CPU: 4 cores, 3800MHz (4200MHz Turbo); GPU: AMD Radeon HD 7660D, 6 compute units, 800MHz; 4GB RAM.Software – Windows 7 Professional SP1 (64-bit OS); AMD OpenCL™ 1.2 AMD-APP (937.2); Microsoft Visual Studio 11 Beta

0

50

100

150

200

250

300

350

LO

C

Copy-back Algorithm Launch Copy Compile Init Performance

Serial CPU TBB Intrinsics+TBB OpenCL™-C OpenCL™ -C++ C++ AMP HSA Bolt

Perfo

rman

ce

35.00

30.00

25.00

20.00

15.00

10.00

5.00

0Copy-back

Algorithm

Launch

Copy

Compile

Init.

Copy-back

Algorithm

Launch

Copy

Compile

Copy-back

Algorithm

Launch

Algorithm

Launch

Algorithm

Launch

Algorithm

Launch

Algorithm

Launch

(Exemplary ISV “Hessian” Kernel)


AMD’S FIRST HSA SOC