
DARPA’s Ubiquitous High-Performance Computing (UHPC)/Exascale Projects

Presented by Raj Parihar

Advanced Computer Architecture Lab

University of Rochester, Rochester


References

Runnemede: An Architecture for Ubiquitous High-Performance Computing. Intel Labs, UIUC, Reservoir Labs (HPCA’13)

The MIT Angstrom Project. MIT, UMCP (HotPar’11)

GPUs and the Future of Parallel Computing. NVIDIA (IEEE Micro’11)

Sandia’s X-Caliber Project. Sandia Lab, Micron, LexisNexis, 8 academic partners


DARPA’s Exascale Challenge

Build an exascale machine by 2020 using today’s technology

Current best (Top500.org, June’13): Tianhe-2

Speed: 33.86 petaflops on the Linpack benchmark

Power: 17.6 MW (24 MW with cooling)

DARPA’s challenge and design goals:

10^18 operations per second within a 20 MW power budget

Achieve energy efficiency of 50 GigaOps/Watt

Energy efficiency of 100-1000x compared to current systems

No constraints of backward compatibility

Assume the technology/packaging of 2018-2020 (10nm node)
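The 50 GigaOps/Watt goal follows directly from the first two figures. The short sketch below redoes that arithmetic and, for reference, computes the efficiency implied by the Tianhe-2 numbers quoted above. The function name is illustrative, and Linpack flops are treated as "operations" only for the sake of a rough comparison.

```python
# Figures taken from the slides above.
EXA_OPS_PER_SEC = 1e18      # exascale target: 10^18 operations per second
POWER_BUDGET_W = 20e6       # 20 MW power budget

TIANHE2_FLOPS = 33.86e15    # 33.86 petaflops on Linpack
TIANHE2_POWER_W = 17.6e6    # 17.6 MW, excluding cooling


def giga_ops_per_watt(ops_per_sec, watts):
    """Energy efficiency in giga-operations per second per watt."""
    return ops_per_sec / watts / 1e9


target_eff = giga_ops_per_watt(EXA_OPS_PER_SEC, POWER_BUDGET_W)
tianhe2_eff = giga_ops_per_watt(TIANHE2_FLOPS, TIANHE2_POWER_W)

print(f"Exascale target: {target_eff:.1f} GigaOps/W")
print(f"Tianhe-2 (Linpack): {tianhe2_eff:.2f} GFLOPS/W")
```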

Design the whole stack to be energy efficient, from software and programming model to low-power circuits and transistors


Runnemede: High Performance System (HPCA’13)

Hierarchical, heterogeneous, near-threshold computing

Overprovisioned, support for selective execution and power down

No hardware cache-coherence, managed in software

Dataflow-style execution: tasks are known as codelets
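As a rough illustration of dataflow-style execution, the sketch below fires each task once all of its producers have finished. It is a toy scheduler, not Runnemede's runtime: the `Codelet` class, `run_codelets`, and the example graph are all hypothetical names introduced here.

```python
from collections import deque


class Codelet:
    """A small task that may run once all of its inputs are available."""

    def __init__(self, name, fn, deps=()):
        self.name = name
        self.fn = fn
        self.deps = list(deps)  # names of codelets whose results feed this one


def run_codelets(codelets):
    """Fire each codelet as soon as all of its producers have finished."""
    by_name = {c.name: c for c in codelets}
    pending = {c.name: len(c.deps) for c in codelets}  # unmet dependency counts
    consumers = {c.name: [] for c in codelets}
    for c in codelets:
        for d in c.deps:
            consumers[d].append(c.name)

    ready = deque(name for name, count in pending.items() if count == 0)
    results, order = {}, []
    while ready:
        name = ready.popleft()
        codelet = by_name[name]
        results[name] = codelet.fn(*(results[d] for d in codelet.deps))
        order.append(name)
        for nxt in consumers[name]:
            pending[nxt] -= 1
            if pending[nxt] == 0:  # last input arrived: the codelet can fire
                ready.append(nxt)
    return results, order


graph = [
    Codelet("load_a", lambda: 3),
    Codelet("load_b", lambda: 4),
    Codelet("square_a", lambda a: a * a, deps=["load_a"]),
    Codelet("square_b", lambda b: b * b, deps=["load_b"]),
    Codelet("sum", lambda x, y: x + y, deps=["square_a", "square_b"]),
]
results, order = run_codelets(graph)
print(order, results["sum"])
```

No codelet waits on a global barrier here; each one is enabled purely by the arrival of its inputs, which is the property the slide refers to.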


Runnemede: Block Architecture

Control Engine (CE): execute OS/runtime code, perform I/O

Execution Engine (XE): simple in-order cores, execute codelets


Runnemede: Chip Architecture

The chip consists of 576 cores in total, organized in a three-level hierarchy

Implements a physical address space, with no virtual memory

Fine-grain DVFS, power and clock gating to save energy


Runnemede: Network Topology

Contains two independent hierarchical networks:

A data network and a barrier/reduction network

Hierarchical network allows Runnemede to provide tapered BW

Efficient short-distance communication

Also leverages the insight that relatively high-radix switches reduce the overall network energy

Three options: Fat-tree, Hybrid-tree, Pruned-tree
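To make "tapered bandwidth" concrete, the sketch below uses a deliberately simple model that is an assumption of this example, not a description of Runnemede's network: aggregate bandwidth is multiplied by a fixed `taper` factor at each level up the tree, with `taper=1.0` standing in for a full fat-tree. The leaf count, link bandwidth, and taper value are made-up inputs.

```python
def aggregate_bandwidth(num_leaves, leaf_bw, levels, taper):
    """Aggregate bandwidth crossing each tree level, from the leaves upward.

    taper=1.0 keeps full bandwidth at every level (fat-tree-like);
    taper<1.0 shrinks the bandwidth by that factor per level.
    """
    return [num_leaves * leaf_bw * taper ** k for k in range(levels)]


# Hypothetical numbers chosen only for illustration.
fat = aggregate_bandwidth(num_leaves=64, leaf_bw=1.0, levels=3, taper=1.0)
tapered = aggregate_bandwidth(num_leaves=64, leaf_bw=1.0, levels=3, taper=0.5)

print("fat-tree:", fat)
print("tapered: ", tapered)
```

Under this model, nearby communication sees the full leaf bandwidth while the upper levels carry less, which is why tapering favors short-distance traffic.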


HW-SW Co-design: Optimization for SAR

Benchmark: a streaming sensor application based on SAR (synthetic aperture radar)

Input: a set of vectors; output: an image of the reflected energy of points

ISAopt: added a sin-cos instruction to the ISA

TrigOpt: the single-precision computation for each pixel is replaced by a double-precision computation for a subset of pixels, with interpolation for the remaining pixels

Blocking: each codelet copies its input array into the L1 scratchpad rather than fetching values from DRAM

CompilerOpt: skips a few address calculations and applies strength reduction
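The TrigOpt idea of computing exactly at a subset of points and interpolating the rest can be sketched as follows. This is only an illustration: it assumes the per-pixel work is a sine evaluation and uses linear interpolation, neither of which the slide specifies, and `sin_interpolated` and `stride` are hypothetical names.

```python
import math


def sin_interpolated(xs, stride):
    """Evaluate sin exactly at every `stride`-th sample; interpolate the rest.

    `xs` must be strictly increasing. Python floats are double precision,
    so the exact evaluations play the role of the double-precision subset.
    """
    n = len(xs)
    if n == 0:
        return []
    out = [0.0] * n
    anchors = list(range(0, n, stride))
    if anchors[-1] != n - 1:
        anchors.append(n - 1)  # always evaluate the last sample exactly
    for i in anchors:
        out[i] = math.sin(xs[i])
    for lo, hi in zip(anchors, anchors[1:]):
        for i in range(lo + 1, hi):
            w = (xs[i] - xs[lo]) / (xs[hi] - xs[lo])
            out[i] = (1 - w) * out[lo] + w * out[hi]  # linear interpolation
    return out


xs = [i * 0.01 for i in range(101)]  # samples on [0, 1]
approx = sin_interpolated(xs, stride=10)
max_err = max(abs(a - math.sin(x)) for a, x in zip(approx, xs))
print(f"exact evaluations: 11 of {len(xs)}, max error: {max_err:.2e}")
```

The trade is fewer expensive evaluations in exchange for a bounded interpolation error, which shrinks as the stride gets smaller.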


Effect of Technology Scaling

Computation energy scales well: 77% reduction (45nm to 10nm)

Network energy only decreases by 51%

Memory energy also decreases drastically, primarily due to use of stacked DRAM
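The two percentages above imply that network energy grows in relative importance as technology scales. The arithmetic below works only from the quoted reductions; the variable names are illustrative.

```python
# Reductions quoted on the slide (45nm to 10nm).
compute_reduction = 0.77
network_reduction = 0.51

compute_remaining = 1 - compute_reduction  # fraction of compute energy left
network_remaining = 1 - network_reduction  # fraction of network energy left

# How much the network-to-compute energy ratio grows after scaling.
shift = network_remaining / compute_remaining

print(f"compute energy remaining: {compute_remaining:.0%}")
print(f"network energy remaining: {network_remaining:.0%}")
print(f"network/compute ratio grows by about {shift:.2f}x")
```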


Network Analysis

A hybrid-tree with tapering of BW is a better choice compared to fat-tree (energy inefficient) and pruned-tree (low bisection BW)


Evaluation of Scratchpad Memories

Matrix multiplication: memory energy breakdown


MIT Angstrom Project

Led by Anant Agarwal; team includes MIT, MTL, RLE, MPhC labs

Freescale Semiconductor, Mercury Systems, Lockheed ATL

University of Maryland

Major challenges and research topics under exploration:

Ultra-low-voltage SRAM design

A hierarchical cache-coherency protocol with distributed discretionary directories and data

The Zettabricks system: a language, compiler, and runtime system for automatic parallel code generation

Self-Aware Factored Operating System (SEFOS): a self-aware OS targeted at 1000+ core systems

Helper threads: exascale computers will have 1000s of cores; unused cores can be used for prefetching and early branch resolution

The SEEC framework and decision engine


Helper Thread in Exascale Machine

Some applications may lack the parallelism to keep all the cores busy

Some applications may also incur parallelization overheads (communication and synchronization) that outweigh the benefits of exploiting large-scale parallelism

One solution: use a few cores for load and branch “pre-execution”

Key challenges and topics to explore:

In a 1000-core machine, helper threads are physically distributed. How does this affect the generation of effective helper-thread code?

What is the right proportion of helper threads to compute threads to achieve the best performance and power efficiency?

How should the operating system schedule helper versus compute threads to maximize benefit while minimizing resource contention?

Can helper threads run on extremely low-power cores to achieve very high power efficiency yet still provide effective memory and branch latency tolerance?




Low Power Partner Cores in Multicore (HotPar’11)

Main core generates events and places them in the event queue

Partner core serves these events based on their priorities
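The main-core/partner-core interaction described above can be sketched as a priority queue. This is a software analogy only: the `EventQueue` class, its method names, and the convention that a lower number means higher priority are assumptions of this example, not details from the HotPar’11 paper.

```python
import heapq
import itertools


class EventQueue:
    """Priority queue shared by a main core (producer) and a partner core."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker: FIFO within a priority

    def post(self, priority, event):
        """Main core side: enqueue an event (lower number = served first)."""
        heapq.heappush(self._heap, (priority, next(self._seq), event))

    def serve(self):
        """Partner core side: take the most urgent event, or None if empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


queue = EventQueue()
queue.post(2, "prefetch block B")
queue.post(0, "prefetch block A (needed soon)")
queue.post(2, "prefetch block C")

served = []
while (event := queue.serve()) is not None:
    served.append(event)
print(served)
```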


Case Study: Memory Prefetching (EM3D)

Each core issues 1 instruction per cycle; the main core runs at 1 GHz

Speedup: up to 2.7x; power efficiency (perf/watt): 2.2x
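Because power efficiency is performance divided by power, the two figures together imply a power increase. The arithmetic below assumes both peak numbers come from the same configuration, which the slide does not state.

```python
speedup = 2.7          # performance relative to the baseline
perf_per_watt = 2.2    # power efficiency relative to the baseline

# perf_per_watt = speedup / power_ratio, so solve for power_ratio.
power_ratio = speedup / perf_per_watt

print(f"implied power relative to baseline: {power_ratio:.2f}x")
```

Under that assumption, the partner core adds roughly 23% to total power while delivering the 2.7x speedup.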


Echelon: A research GPU architecture

NVIDIA-led group: Stephen W. Keckler, William J. Dally, Brucek Khailany, Michael Garland, David Glasco

GPUs and the Future of Parallel Computing (IEEE Micro’11)

The state-of-the-art GPU-based high-throughput computing system

How to scale GPU-based architectures to meet exascale demands

At 10nm in 2017: GPUs will no longer be an external accelerator to a CPU; instead, CPUs and GPUs will be integrated on the same die with a unified memory architecture.

The throughput-optimized core architecture’s goals:

Extreme energy efficiency by eliminating as many instruction overheads as possible

Memory locality at multiple levels

Efficient execution for instruction-level parallelism (ILP), data-level parallelism (DLP), and fine-grained task-level parallelism (TLP)



Sandia’s UHPC X-Caliber Project

Sandia-led team with Micron, LexisNexis, and 8 academic partners

Simple pipeline of some sort: Wide access(?), Multithreaded

Scratchpad vs cache: Shared w/ registers? globally addressable?

Instruction encoding: Compressed? Contains dataflow state?

Composition of stack (optics? memory? logic?)

Thermal migration: move computation around to keep the chip within thermal bounds

Codelet/static dataflow model

Aggressive architecture focusing on the data movement problem

Vast design space

Iterative application-driven co-design process