
An Introduction to Goblin-Core64

A Massively Parallel Processor Architecture Designed for Complex Data Analytics

John D. Leidel
jleidel<at>ttu<dot>edu

Overview

• Data Intensive Computing Architectural Challenges
  – The destruction of cache efficiency using irregular algorithms
• Goblin-Core64 Architecture Infrastructure Design
  – Sustainable Exascale performance with data intensive applications
• Progress and Roadmap
  – The path forward

DATA INTENSIVE COMPUTING ARCHITECTURAL CHALLENGES

The destruction of cache efficiency using irregular algorithms

What is Big Data? …and how does it relate to HPC?

• Problem spaces outside of traditional HPC are now encountering the same problems that we find in HPC
  – Complexity
  – Time to Solution
  – Scale
• These problems are generally not
  – Simulating the physical world
  – Bound by simple floating point performance
  – Fixed in result set size as the problem scales
• These problems are generally
  – Sparse in nature
  – Containing complex [sometimes unconstrained] data types
  – Scaling in result set size as the problem scales
• The other side of the HPC coin

Three Drivers to HPC Solutions

• Time To Solution
• Algorithmic Complexity
• Scale of Working Set

[Venn diagram: HPC sits at the pairwise and three-way intersections of the three drivers]

Convergence Criteria for HPC Adoption

• Time + Complexity
  – Fraud Detection
  – High Performance Trading Analytics
• Time + Scale
  – Power Grid Analytics
  – Graph500 Benchmark
• Complexity + Scale
  – Epidemiology
  – Agent-Based Network Analytics
• Time + Complexity + Scale
  – Grand Challenge Problems
  – Cyber Analytics


Dense Solver Efficiency

[Chart: Top500 dense-solver efficiency for successive #1 systems: CM-5/1024, XP/S140, Numerical Wind Tunnel, SR2201/1024, ASCI Red, ASCI White (SP Power3), Earth-Simulator, BlueGene/L, Roadrunner, Jaguar (Cray XT5-HE), K computer, Sequoia (BlueGene/Q), Tianhe-2 (MilkyWay-2); y-axis: Top500 efficiency, 0.00% to 100.00%]

Sparse Solver Efficiency

[Chart: Graph500 GTEPS/core for DOE/NNSA/LLNL Sequoia (BGQ), UV 2000, GraphCREST-MC48, Dingus, Matterhorn, and Gordon; y-axis: 0 to 1.8 GTEPS/core]

Cache-less Architectures

GOBLIN-CORE64 ARCHITECTURE INFRASTRUCTURE DESIGN

Sustainable Exascale performance with data intensive applications

Goal

Build an architecture that efficiently maps programming model concepts to hardware in order to improve data intensive [sparse] application throughput

The Result: Goblin-Core64
• Hierarchical set of architectural modules that provide:
  – Native PGAS memory addressing
  – High efficiency RISC ISA
  – SIMD capabilities
  – Architectural notion of “tasks”
  – Latency hiding techniques
    • Single cycle context/task switching
  – Advanced synchronization techniques
    • Ease the burden of barriers and sync points by eliminating spin waits
  – Memory coalescing/aggregation
    • Local requests
    • Global requests
    • AMOs
  – Makes use of latest high bandwidth memory technologies
    • Hybrid Memory Cube
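The coalescing/aggregation bullet above can be made concrete with a small sketch. This is an illustration of the idea only, not the GC64 implementation; the type and function names are hypothetical, and it assumes requests arrive sorted by address:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical request descriptor: a physical address plus a byte count. */
typedef struct { uint64_t addr; uint32_t size; } gc64_req;

/* Merge adjacent requests in a sorted run into single larger payloads.
 * Two requests coalesce when one ends exactly where the next begins.
 * Returns the number of payloads remaining after merging. */
static size_t gc64_coalesce(gc64_req *reqs, size_t n) {
    if (n == 0) return 0;
    size_t out = 0;
    for (size_t i = 1; i < n; i++) {
        if (reqs[i].addr == reqs[out].addr + reqs[out].size)
            reqs[out].size += reqs[i].size;  /* extend current payload */
        else
            reqs[++out] = reqs[i];           /* start a new payload    */
    }
    return out + 1;
}
```

In this sketch the same merge rule would apply to local, global, and AMO request streams; only the downstream packet format would differ.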

Goblin-Core64 Modules

[Block diagram: (1) Task Unit with task register file; (2) Task Proc with ALU/SIMD; (3) Task Group with MMU; (4) GC64 Socket containing the Software Managed Scratchpad, HMC Memory Interface, AMO Unit, Coalesce Unit, SoC network, Peripherals, and Packet Engine]

GC64 Module Hierarchy
• Task Unit
  – Smallest divisible unit; register file + control logic
• Task Proc
  – Multiple Task Units + context switch control logic
• Task Group
  – Multiple Task Procs + local MMU
• GC64 Socket
  – Coalesce Unit: coalesces adjacent memory requests into a single payload
  – AMO Unit: intelligently handles local AMO requests
  – HMC Unit: HMC packet request engine + SERDES
  – Software Managed Scratchpad: on-chip memory
  – Packet Engine: off-chip memory interface

GC64 Scalable Units


GC64 Execution Model
• The GC64 execution model provides “pressure-driven”, single cycle context switching between threads/tasks
• The pressure state machine provides fair-sharing of the ALU based upon:
  – Number of outstanding requests
  – Statistical probability of a register stall
  – Number of cycles in the current execution context
• Minimum execution is two instructions
  – Based upon the instruction format

[Diagram: Task Units feed a Task Unit Mux into the ALU/SIMD pipeline, driven by the Context Switch State Machine]
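A toy version of the pressure decision can make the bullet list above concrete. The three inputs are the ones the slide names; the weights, combining function, and identifiers are invented for illustration, since the actual state machine encoding is not given here:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-task pressure state, mirroring the three slide inputs. */
typedef struct {
    uint32_t outstanding;  /* in-flight memory requests                 */
    double   stall_prob;   /* statistical probability of a register stall */
    uint32_t cycles;       /* cycles spent in the current context       */
} gc64_task;

/* Fold the three inputs into one score; the weights are illustrative. */
static double gc64_pressure(const gc64_task *t) {
    return 4.0 * t->outstanding + 10.0 * t->stall_prob + 0.1 * t->cycles;
}

/* Each cycle the mux would select the runnable task with least pressure,
 * giving fair sharing of the ALU without spin waits. */
static size_t gc64_select(const gc64_task *tasks, size_t n) {
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (gc64_pressure(&tasks[i]) < gc64_pressure(&tasks[best]))
            best = i;
    return best;
}
```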

GC64 Unified [PGAS] Addressing
• GC64 physical addressing provides block access to:
  – Local HMC [main] memory
  – Local software-managed scratchpad
  – Globally mapped [remote] memory
• Pointer arithmetic between memory spaces
• Obeys all the constraints of paged, virtual memory

Physical Address Specification:
  Unused [63:50] | Socket [49:42] | Reserved [41:38] | CUB [37:33] | Base Physical Address [33:0]

Physical Address Destinations:
  Local HMC Memory: one of 8 local HMC devices
  Scratchpad:       CUB = 0xF
  Remote Memory:    Remote Socket ID
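Under this layout, field extraction is plain masking and shifting. The sketch below reproduces the bit ranges exactly as printed on the slide (note that bit 33 appears in both the base and CUB fields as listed); the macro names are hypothetical:

```c
#include <stdint.h>

/* Field extraction per the slide's printed bit ranges. */
#define GC64_BASE(a)   ((a) & ((1ULL << 34) - 1))  /* Base Physical Address [33:0] */
#define GC64_CUB(a)    (((a) >> 33) & 0x1FULL)     /* CUB [37:33]                  */
#define GC64_SOCKET(a) (((a) >> 42) & 0xFFULL)     /* Socket [49:42]               */

/* CUB = 0xF selects the software-managed scratchpad; other CUB values
 * select one of the 8 local HMC devices; a nonzero Socket field routes
 * the access to a remote socket. */
static int gc64_is_scratchpad(uint64_t a) { return GC64_CUB(a) == 0xF; }
```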

GC64 ISA
• Simple 64-bit instruction format
  – Two instructions per payload
  – Optional immediate payload
• Instruction control block
  – Specifies immediate values, breakpoints and vector register aliasing
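As a rough illustration of packing two instructions into one 64-bit payload, assuming (without the spec in hand) two 32-bit slots; the real GC64 field widths and control block encoding may differ:

```c
#include <stdint.h>

/* Pack two instruction slots into one 64-bit payload. The 32-bit slot
 * width is an assumption for illustration only. */
static uint64_t gc64_pack(uint32_t i0, uint32_t i1) {
    return ((uint64_t)i1 << 32) | (uint64_t)i0;
}

/* Recover slot 0 or slot 1 from a packed payload. */
static uint32_t gc64_slot(uint64_t payload, int idx) {
    return (uint32_t)(payload >> (idx ? 32 : 0));
}
```

An optional immediate payload would then ride in a following 64-bit word, flagged by the instruction control block.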

GC64 Vector Register Aliasing
• Vector register aliasing provides access to the scalar register file from the SIMD unit
• No need for an additional vector register file
  – Increasing the data path, not the physical storage
• Compiler optimizations can be used to perform complex, irregular operations
  – Vector-Scalar-Vector arithmetic
  – Vector Fill
  – Scatter/Gather

GC64 Potential Performance

Task Groups  Task Procs  Task Units  Tasks/Socket  Peak Giga-ops  Bandwidth/Socket
          2           2           2             4              4        2.384 GB/s
          4           4           4            64             32        9.536 GB/s
          8           8           4           256            256       38.147 GB/s
         16          16           8          2048           1024      152.588 GB/s
         32          16          16          8192           4096      305.176 GB/s
         32          32          16         16384           8192      610.352 GB/s
         64          32          16         32768          16384     1220.703 GB/s
        128          32          16         65536          32768     2441.406 GB/s
       *256         128          32       1048576         262144     19531.25 GB/s
       *256        *256        *256      16777216         524288     39062.50 GB/s

*Max config. Assumes SIMD Width = 4, Task Issue Rate = 2, Cycle = 1 GHz.
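As a sanity check on the table, tasks per socket is simply the product of the three replication factors; the small sketch below (function name hypothetical) reproduces the Tasks/Socket column for every row except the first as printed:

```c
#include <stdint.h>

/* Tasks per socket follow from pure replication:
 * Task Groups x Task Procs per group x Task Units per proc. */
static uint64_t gc64_tasks_per_socket(uint64_t groups, uint64_t procs,
                                      uint64_t units) {
    return groups * procs * units;
}
```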

PROGRESS AND ROADMAP

The path forward

GC64 Progress Report
• Complete
  – GC64 ISA definition
  – Physical Address Format
  – Execution Model
  – HMC Simulator [to be used in the GC64 sim]
    • 2 x papers submitted, third paper in progress
    • First academic publications on Hybrid Memory Cube technology
• In Progress
  – Architecture specification document
  – GC64 Simulator
  – ABI definition
  – Virtual addressing model
  – Compiler & Binutils [LLVM]
• Active Research Topics
  – Memory coalescing & AMO techniques [Spring 2014]
  – Context switch pressure model
  – Software managed scratchpad organization
  – Off-chip network protocol
  – Thread/task runtime optimization

HMC-Sim Stream Triad Results

Goblin-Core64

• Source code and specification are licensed under a BSD license

• www.gc64.org
  – Source code
  – Architecture documentation
  – Developer documentation


BACKUP

What did we learn from CHOMP?
• Pros
  – We can design tightly coupled ISAs and runtime models that are extremely efficient
    • Each instruction becomes precious & necessary
  – Code generation is quite natural
    • Allows the compiler to find the best opportunities for optimization
  – Latency hiding characteristics function as designed
  – Single-cycle context switch mechanisms function as designed
  – AMOs are increasingly useful
    • For more than just memory protection
  – RISC ISAs are still providing high performance architectures
    • VLIW and JIT’ing are unnecessary overhead/area

What did we learn from CHOMP?
• Cons
  – NEED MORE BANDWIDTH!
    • Tests with dense per-thread memory operations could utilize ~4X more bandwidth with no other changes
  – Designing for an FPGA has its constraints
  – The lack of ILP hinders arithmetic throughput in some applications
    • Even SpMV kernels can utilize ILP
  – Paged virtual memory is always expensive
    • Not all applications/programmers exploit large pages
  – Use the native runtime!
    • Don’t simply rely on the compiler to generate efficient code for higher-level parallel languages [OpenMP, et al.]; use the machine-level runtime
  – Cache is bad, but coalescing is good
    • Cache is expensive to implement and often impedes performance
    • We can occasionally take advantage of spatial locality at access time

GC64 Research
• PGAS Addressing
  – How do we develop a segmented address directory w/o the use of a TLB?
  – How do we distribute and maintain this directory while providing applications “virtual memory” security?
• Efficient Synchronization
  – Building efficient synchronization algorithms using the GC64 ISA mechanisms and available bandwidth is orthogonal to traditional implementations
• Memory Coalescing
  – Development of memory request coalescing algorithms for local, remote and AMO-type requests
  – How do our synchronization techniques play into this?
• Software-Managed Scratchpad
  – Is there room/desire to add features such as this?
  – What additional pressure does this put on the programming model/compiler?
• Compilation Techniques
  – Providing MTA-C style loop transformations
  – https://sites.google.com/site/parallelizationforllvm/loop-transforms
• Runtime Models
  – Building a machine-level optimized runtime that is programming model agnostic
• HMC Integration
  – HMC 1.0 specification is available; very different from traditional DRAM technology