DARPA’s Ubiquitous High-Performance Computing (UHPC)/ Exascale Projects
Presented by Raj Parihar
Advanced Computer Architecture Lab
University of Rochester, Rochester
Presented by Raj Parihar DARPA’s Ubiquitous High-Performance Computing (UHPC)/ Exascale Projects
References
Runnemede: An Architecture for Ubiquitous High-Performance Computing. Intel Labs, UIUC, Reservoir Labs (HPCA’13)
The MIT Angstrom Project. MIT, UMCP (HotPar’11)
GPUs and the Future of Parallel Computing. NVIDIA (IEEE Micro’11)
Sandia’s X-Caliber Project. Sandia Lab, Micron, LexisNexis, 8 academic partners
DARPA’s Exascale Challenge
Build an exascale machine by 2020 using today’s technology
Current best (Top500.org, June’13): Tianhe-2
Speed: 33.86 petaflops on the Linpack benchmark
Power: 17.6 MW (24 MW with cooling)
DARPA’s challenge and design goals:
10^18 operations per second within a 20 MW power budget
Achieve energy efficiency of 50 GigaOps/Watt
Energy efficiency of 100-1000x compared to current systems
No constraints of backward compatibility
Assume the technology/packaging of 2018-2020 (10nm node)
Design the whole stack to be energy efficient – from software and
programming model to low power circuits and transistors
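The 50 GigaOps/Watt goal follows directly from the two headline numbers. A minimal sketch of the arithmetic, using only the figures quoted on this slide (the function name is illustrative):

```python
def giga_ops_per_watt(ops_per_second, watts):
    """Energy efficiency in giga-operations per second per watt."""
    return ops_per_second / watts / 1e9

# Exascale target: 10^18 ops/s within a 20 MW budget.
target = giga_ops_per_watt(1e18, 20e6)

# Tianhe-2 (June'13): 33.86 petaflops at 17.6 MW, excluding cooling.
tianhe2 = giga_ops_per_watt(33.86e15, 17.6e6)

print(f"target:   {target:.1f} GOps/W")
print(f"Tianhe-2: {tianhe2:.2f} GFLOPS/W")
print(f"ratio:    {target / tianhe2:.0f}x")
```

The ratio is only indicative: Linpack FLOPS and generic "operations" are not the same unit, and the 100-1000x figure on the slide may be measured against a different baseline.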
Runnemede: High Performance System (HPCA’13)
Hierarchical, heterogeneous, near-threshold computing
Overprovisioned, support for selective execution and power down
No hardware cache-coherence, managed in software
Dataflow-style execution: tasks are known as codelets
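To make the codelet idea concrete, here is a minimal sketch of dataflow-style scheduling: a codelet becomes ready only when every codelet it depends on has produced a result. The `Codelet` class and `run_codelets` function are hypothetical names for illustration, not the Runnemede runtime interface.

```python
from collections import deque

class Codelet:
    """A small task that runs once all of its inputs are available."""
    def __init__(self, name, fn, deps=()):
        self.name = name
        self.fn = fn
        self.deps = tuple(deps)

def run_codelets(codelets):
    """Fire codelets in dataflow order; return (results, firing order)."""
    by_name = {c.name: c for c in codelets}
    pending = {c.name: len(c.deps) for c in codelets}   # unmet inputs
    consumers = {c.name: [] for c in codelets}
    for c in codelets:
        for d in c.deps:
            consumers[d].append(c.name)

    ready = deque(name for name, n in pending.items() if n == 0)
    results, order = {}, []
    while ready:
        name = ready.popleft()
        c = by_name[name]
        results[name] = c.fn(*(results[d] for d in c.deps))
        order.append(name)
        # Producing a result may make downstream codelets ready.
        for consumer in consumers[name]:
            pending[consumer] -= 1
            if pending[consumer] == 0:
                ready.append(consumer)
    return results, order

graph = [
    Codelet("a", lambda: 2),
    Codelet("b", lambda: 3),
    Codelet("c", lambda x, y: x * y, deps=("a", "b")),
    Codelet("d", lambda z: z + 1, deps=("c",)),
]
results, order = run_codelets(graph)
print(order, results["d"])
```

In a real system the ready queue would be drained by many execution engines in parallel; the sketch runs everything on one thread to keep the firing rule visible.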
Runnemede: Block Architecture
Control Engine (CE): execute OS/runtime code, perform I/O
Execution Engine (XE): simple in-order cores, execute codelets
Runnemede: Chip Architecture
The chip consists of 576 cores in total, arranged in a three-level hierarchy
Implements a physical address space, with no virtual memory
Fine-grain DVFS, power and clock gating to save energy
Runnemede: Network Topology
Contains two independent hierarchical networks:
A data network and a barrier/reduction network
Hierarchical network allows Runnemede to provide tapered BW
Efficient short-distance communication
Also leverages the insight that relatively high-radix switches reduce the overall network energy
Three options: Fat-tree, Hybrid-tree, Pruned-tree
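Tapered bandwidth means each step up the hierarchy offers less bandwidth than the sum of the links below it. The sketch below illustrates this with assumed values: the fanouts (9, 8, 8) are chosen only so that they multiply to 576 cores, and the taper factor of 0.5 is hypothetical, not a figure from the paper.

```python
def uplink_bandwidth(leaf_bw, fanouts, taper):
    """Uplink bandwidth of one group at each level of the hierarchy.

    taper = 1.0 keeps full bandwidth at every level (fat-tree style);
    taper < 1.0 shrinks the uplink relative to the links below it.
    """
    levels = []
    bw = leaf_bw
    for fanout in fanouts:
        bw = bw * fanout * taper
        levels.append(bw)
    return levels

FANOUTS = (9, 8, 8)   # assumed three-level grouping, 9 * 8 * 8 = 576

full = uplink_bandwidth(1.0, FANOUTS, taper=1.0)
tapered = uplink_bandwidth(1.0, FANOUTS, taper=0.5)
print("full:   ", full)
print("tapered:", tapered)
```

With these assumed numbers the top-level links carry one eighth of the full fat-tree bandwidth, which is where the energy saving comes from, at the cost of slower long-distance traffic.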
HW-SW Co-design: Optimization for SAR
Benchmark: streaming sensor application based on synthetic aperture radar (SAR)
Input (set of vectors), output (image of reflected energy of points)
ISAopt: added sin-cos instruction to the ISA
TrigOpt: single-precision computation for each pixel is replaced by double-precision computation for a subset of pixels, with interpolation for the remaining pixels
Blocking: each codelet copies the input array into the L1 scratchpad rather than fetching values from DRAM
CompilerOpt: skips a few address calculations; applies strength reduction
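The TrigOpt idea can be sketched as follows: evaluate the trigonometric function exactly at a subset of points and interpolate in between. This is a minimal illustration assuming linear interpolation and strictly increasing phase values; the function name, stride, and interpolation scheme are assumptions, not details taken from the paper.

```python
import math

def interpolated_sin(phases, stride):
    """sin() evaluated exactly at every stride-th phase, linearly
    interpolated elsewhere. Assumes phases are strictly increasing."""
    n = len(phases)
    if n == 1:
        return [math.sin(phases[0])]
    out = [0.0] * n
    anchors = list(range(0, n, stride))
    if anchors[-1] != n - 1:
        anchors.append(n - 1)          # always anchor the last sample
    for a, b in zip(anchors, anchors[1:]):
        sa, sb = math.sin(phases[a]), math.sin(phases[b])
        span = phases[b] - phases[a]
        for i in range(a, b + 1):
            t = (phases[i] - phases[a]) / span
            out[i] = sa + t * (sb - sa)
    return out

phases = [i * 0.001 for i in range(1000)]
approx = interpolated_sin(phases, stride=8)
max_err = max(abs(v - math.sin(p)) for v, p in zip(approx, phases))
print(f"exact sin() calls: ~{len(phases) // 8}, max error: {max_err:.2e}")
```

The trade-off is accuracy for fewer expensive evaluations: a larger stride saves more work but increases the interpolation error.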
Effect of Technology Scaling
Computation energy scales well: 77% reduction (45nm to 10nm)
Network energy only decreases by 51%
Memory energy also decreases drastically, primarily due to use of
stacked DRAM
Network Analysis
A hybrid-tree with tapering of BW is a better choice compared to
fat-tree (energy inefficient) and pruned-tree (low bisection BW)
Evaluation of Scratchpad Memories
Matrix multiplication: memory energy breakdown
MIT Angstrom Project
Led by Anant Agarwal; team includes MIT, MTL, RLE, MPhC labs
Freescale Semiconductor, Mercury Systems, Lockheed ATL
University of Maryland
Major challenges and research topics under exploration:
Ultra low voltage SRAM design
A hierarchical cache-coherency protocol with distributed discretionary directories and data
The Zettabricks System: a language, compiler and runtime system for automatic parallel code generation
Self-Aware Factored Operating System (SEFOS): a self-aware OS targeted for 1000+ core systems
Helper threads: exascale computers will have 1000s of cores. Unused cores can be used for prefetching and early branch resolution
The SEEC Framework and Decision Engine
Helper Thread in Exascale Machine
Some apps may lack enough parallelism to keep all the cores busy
Some applications may also incur parallelization overheads (communication and synchronization) that outweigh the benefits of exploiting large-scale parallelism
One solution: use a few cores for load and branch “pre-execution”
Key challenges and topics to explore:
In a 1000-core machine helper threads are physically distributed. How does this affect generation of effective helper thread code?
What is the right proportion of helper threads to compute threads to achieve the best performance and power efficiency?
How should the operating system schedule helper versus compute threads to maximize benefit while minimizing resource contention?
Can helper threads run on extremely low-power cores to achieve very high power efficiency yet still provide effective memory and branch latency tolerance?
Low Power Partner Cores in Multicore (HotPar’11)
Main core generates events and places them in the event queue
Partner core serves these events based on their priorities
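The main-core/partner-core interaction described above can be sketched as a priority queue. This is a software illustration only: the `EventQueue` class, the event names, and the convention that a lower number means higher priority are all assumptions, not the mechanism described in the paper.

```python
import heapq
import itertools

class EventQueue:
    """Events posted by the main core, served by the partner core."""
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()   # tie-breaker: keeps FIFO order

    def post(self, priority, event):
        """Main core side: enqueue an event (lower number = more urgent)."""
        heapq.heappush(self._heap, (priority, next(self._seq), event))

    def serve(self):
        """Partner core side: take the most urgent event, or None."""
        if not self._heap:
            return None
        _, _, event = heapq.heappop(self._heap)
        return event

q = EventQueue()
q.post(2, "prefetch block A")
q.post(0, "bounds check")
q.post(2, "prefetch block B")
q.post(1, "branch hint")

served = []
while (event := q.serve()) is not None:
    served.append(event)
print(served)
```

The sequence counter ensures that events with equal priority are served in the order they were posted.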
Case Study: Memory Prefetching (EM3D)
Each core issues 1 instruction per cycle; main core: 1 GHz
Speedup: up to 2.7x; power efficiency (perf/watt): 2.2x
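If the two figures describe the same configuration, and perf/watt is taken as performance divided by average power (both assumptions, since the slide does not say), the implied increase in total power follows from a one-line calculation:

```python
def implied_power_ratio(speedup, perf_per_watt_gain):
    """Power ratio implied by speedup = perf/watt gain * power ratio."""
    return speedup / perf_per_watt_gain

ratio = implied_power_ratio(2.7, 2.2)
print(f"implied total power: {ratio:.2f}x the baseline")
```

Under these assumptions, adding the partner core raises total power by roughly a quarter while performance rises much faster.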
Echelon: A research GPU architecture
NVIDIA-led group: Stephen W. Keckler, William J. Dally, Brucek Khailany, Michael Garland, David Glasco
GPUs and the Future of Parallel Computing (IEEE Micro’11)
State-of-the-art GPU-based high-throughput computing systems
How to scale GPU-based architectures to meet exascale demand
At 10nm in 2017: GPUs will no longer be an external accelerator to a CPU; instead, CPUs and GPUs will be integrated on the same die with a unified memory architecture.
Goals of the throughput-optimized core architecture:
Extreme energy efficiency by eliminating as many instruction overheads as possible
Memory locality at multiple levels
Efficient execution for instruction-level parallelism (ILP), data-level parallelism (DLP), and fine-grained task-level parallelism (TLP)
Sandia’s UHPC X-Caliber Project
Sandia-led team with Micron, LexisNexis, 8 academic partners
Simple pipeline of some sort: Wide access(?), Multithreaded
Scratchpad vs cache: Shared w/ registers? globally addressable?
Instruction encoding: Compressed? Contains dataflow state?
Composition of stack (optics? memory? logic?)
Thermal Migration: Move computation around to keep chip within
thermal bounds
Codelet/static dataflow model
Aggressive architecture focusing on the data movement problem
Vast design space
Iterative application-driven co-design process