-
Presented at the 36th Annual International Symposium on Computer Architecture June 22nd, 2009
Rigel: An Architecture and Scalable Programming
Interface for a 1000-core Accelerator
John H. Kelm, Daniel R. Johnson, Matthew R. Johnson, Neal C. Crago, William Tuohy,
Aqeel Mahesri*, Steven S. Lumetta, Matthew I. Frank†, Sanjay J. Patel
*The author is now with NVIDIA. †The author is now with Intel.
-
Accelerated Computing: Today
• Contemporary accelerators: GPUs, Cell, Larrabee
• Challenges:
  1. Inflexible programming models
  2. Lack of a conventional memory model
  3. Hard to scale irregular parallel apps
• Effect on development: unattractive time to solution
Programmable accelerator: HW entity designed to provide advantages for a class of apps including: higher performance, lower power, or lower unit cost relative to a general-purpose CPU.
-
Accelerated Computing: Tomorrow
• Why research accelerators?
  – Insight into future general-purpose CMPs
  – Challenges: performance vs. programmer effort
• Accelerator trend: integration over time
[Figure: accelerator integration over time — past: a CPU with its own memory and a discrete GPU with its own memory; present: CPU and GPU each integrated with memory; future: CPU, accelerator, and GPU combined on one chip?]
-
Accelerated Computing: Metrics
• Enable new platforms
• Open new markets
• Enable new apps

Challenges lead to:
• FLOPS/$ (area)
• FLOPS/Watt (power)
• FLOPS/programmer effort
-
Context: Project Orion
Applications, Programming Environments, and Architecture for 1000-core Parallelism
[Figure: Project Orion components — applications (Illinois image formation and processing), programming environments, software tools, and a 1000-core architecture.]
-
Rigel Design Goals
• What: Future programming models
  – Apps and models may not exist yet
  – We have ideas (visual computing), but who knows?
  – Flexible design → easier to retarget
• How: Focus on scalability and programmer effort
  – Room to play: raised HW/SW interface
  – Focusing design effort: five Elements
-
Outline
• Motivation
• Rigel architecture
• Elements in the context of the Rigel architecture
• Evaluation:
  – Area and power
  – Scalability
  – SW task management
• Future work and conclusions
-
Rigel Architecture: Cluster View
• Basic building block
• Eight 32-bit RISC cores
• Per-core SP FPUs
• 64 kB shared cache
• Cache line buffer

[Figure: one cluster — eight cores, each with a private instruction cache, sharing a cluster cache and connected through a tile interface to the tile interconnect.]
-
Rigel Architecture: Full Chip View
• Cluster caches not HW-coherent (8 MB total)
• G$ fronts the memory controllers (4 MB total)
• Uniform cache access

[Figure: full chip — 8 tiles per chip, 16 clusters per tile, an 8x8 multistage crossbar global interconnect, 8 global cache (G$) banks, and 8 GDDR DRAM channels.]
-
Design Elements
• Challenges in accelerator computing
• FLOPS/dev. effort → difficult to quantify
• Guiding our 1000-core architecture
• Room to play: raising the HW/SW interface
-
Design Elements
1. Execution Model: ISA, SIMD vs. MIMD, VLIW vs. OoOE, MT
2. Memory Model: Caches vs. scratchpad, ordering, coherence
3. Work Distribution: Scheduling, spectrum of SW/HW choices
4. Synchronization: Scalability, influence on prog. model
5. Locality Management:
  – Moving data costs performance and power
  – Balance: dev. effort, compiler, runtime, HW
-
Element 1: Execution Model
• Tradeoff 1: MIMD vs. SIMD [Mahesri MICRO'08]
  – Irregular data parallelism
  – Task parallelism
• Tradeoff 2: Latency vs. throughput [Azizi DasCMP'08]
  – Simple in-order cores
• Tradeoff 3: Full RISC ISA vs. specialized cores
  – Complete ISA → conventional code generation
  – Wide range of apps
-
Element 2: Memory Model
• Tradeoff 1: Single vs. multiple address spaces
• Tradeoff 2: Hardware caches vs. scratchpads
  – Hardware exploits locality
  – Software manages global sharing
• Tradeoff 3: Hierarchical vs. distributed (NUCA)
  – Cluster cache/global cache hierarchy
  – ISA provides local/global memory operations
  – Non-uniformity → programmer effort
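The local/global split can be sketched in plain C (a host-side model with hypothetical names, not the Rigel ISA): local stores complete in an incoherent cluster-cache copy, global stores complete at the shared backing store, and software coherence means the programmer explicitly flushes dirty lines before another cluster reads them.

```c
#define WORDS 16

/* One cluster's incoherent cache: hardware never writes it back
   on another cluster's behalf. */
typedef struct {
    int cache[WORDS];
    int dirty[WORDS];
} cluster_t;

static int global_mem[WORDS];   /* stands in for the global cache */

/* local store: completes in the cluster cache only */
void local_store(cluster_t *c, int addr, int val) {
    c->cache[addr] = val;
    c->dirty[addr] = 1;
}

/* global store/load: bypass the cluster cache, visible chip-wide */
void global_store(int addr, int val) { global_mem[addr] = val; }
int  global_load(int addr)           { return global_mem[addr]; }

/* software coherence: explicitly publish dirty lines before
   another cluster may read this data */
void flush_cluster(cluster_t *c) {
    for (int i = 0; i < WORDS; i++)
        if (c->dirty[i]) { global_mem[i] = c->cache[i]; c->dirty[i] = 0; }
}
```

A local store stays invisible to other clusters until the flush — exactly the non-uniformity the slide charges to programmer effort.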
-
Some Results: Scalability
• Based on cycle-accurate, execution-driven simulation
• Library and run-time system code simulated
• Regular C code + parallel library, standard C compiler
[Figure: speedup vs. one cluster for dmm, heat, kmeans, mri, gjk, cg, and sobel at 16, 32, 64, and 128 clusters (128, 256, 512, and 1024 cores); y-axis 0x–120x.]
-
Element 3: Work Distribution
• Tradeoff (spectrum): HW vs. SW implementation
• SW task management: hierarchical queues
• Flexible policies + little specialized HW
[Figure: hierarchical task queues — a global task queue at the global cache feeds local task queues at the cluster caches; cores in a Rigel cluster enqueue and dequeue through the local queues.]
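A minimal sketch of the hierarchical scheme (hypothetical structures and names, not the actual Rigel runtime): cores dequeue from a per-cluster local queue and touch the global queue only to refill in bulk, which keeps most queue traffic inside the cluster cache.

```c
#define QCAP  1024
#define CHUNK 4            /* tasks moved per global-to-local refill */

/* Simple ring buffer; on Rigel the global instance would live at
   the global cache and the local instances at the cluster caches. */
typedef struct { int tasks[QCAP]; unsigned head, tail; } queue_t;

void q_push(queue_t *q, int t) { q->tasks[q->tail++ % QCAP] = t; }

int q_pop(queue_t *q, int *out) {
    if (q->head == q->tail) return 0;        /* empty */
    *out = q->tasks[q->head++ % QCAP];
    return 1;
}

/* Core-side dequeue: try the local queue first, then do a bulk
   refill from the global queue so cores rarely contend chip-wide. */
int task_dequeue(queue_t *global, queue_t *local, int *task) {
    if (q_pop(local, task)) return 1;
    int t;
    for (int i = 0; i < CHUNK && q_pop(global, &t); i++)
        q_push(local, t);
    return q_pop(local, task);
}
```

Enqueue can mirror this policy — push locally and spill to the global queue — giving the flexible policies the slide mentions with almost no dedicated hardware.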
-
Work Distribution: Rigel Task Model
• < 5% overhead for most data-parallel workloads
• < 15% for most irregular data-parallel workloads
• Task lengths: 100s–100k instructions
[Figure: per-task overhead breakdown — task execution time vs. enqueue and dequeue costs — for dmm, heat, kmeans, mri, gjk, and cg at 128, 256, 512, and 1024 cores; y-axis 60%–100%.]
-
Element 4: Synchronization
• Uses of coherence mechanisms:
  1. Control synchronization
  2. Data sharing
• Broadcast update
  – Use cases: flags and barriers
  – Reduces contention from polling
  – Case study: 2x speedup for conjugate gradient (CG)
• Atomic primitives (example)
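To make the flag-and-barrier use case concrete, here is a sense-reversing barrier in portable C11 atomics (an illustration of the pattern, not Rigel code): every core polls one flag and a single writer flips it — precisely the one-writer/many-reader traffic a broadcast update turns into a single pushed write instead of a storm of re-fetches.

```c
#include <stdatomic.h>

typedef struct {
    atomic_int count;   /* arrivals so far */
    atomic_int sense;   /* the flag every waiter polls */
    int n;              /* number of participants */
} barrier_t;

void barrier_init(barrier_t *b, int n) {
    atomic_init(&b->count, 0);
    atomic_init(&b->sense, 0);
    b->n = n;
}

void barrier_wait(barrier_t *b, int *local_sense) {
    *local_sense = !*local_sense;
    if (atomic_fetch_add(&b->count, 1) + 1 == b->n) {
        atomic_store(&b->count, 0);
        /* one write releases all waiters; with broadcast update this
           store is pushed to every polling core */
        atomic_store(&b->sense, *local_sense);
    } else {
        while (atomic_load(&b->sense) != *local_sense)
            ;  /* this polling is the contention broadcast update removes */
    }
}
```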
-
Element 4: Atomic Primitives

[Figure: timelines for three cores. Conventional multiprocessor: each atomic triggers a cache-to-cache transfer, serializing network latency and operation execution. Rigel: atomics (e.g., atom.inc) are performed at the global cache — the point of coherence — so network latency is overlapped and only part of it is exposed.]
-
Evaluation: Atomic Operations
K-means clustering:
• Needs global histogramming
• With G$ atomics → updates pipeline in the network
• Without atomics → exposed transfer latency

[Figure: speedup at 128, 256, 512, and 1024 cores — atomics at the G$ (baseline, 1.0) vs. atomics performed at the core.]
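The histogramming step reduces to fire-and-forget atomic increments. Sketched in C11 atomics (an analogy for Rigel's atom.inc at the G$, not its actual code), the point is that no core ever needs the old bin value, so successive updates can stream through the network:

```c
#include <stdatomic.h>

#define BINS 8

/* Count how many points fall in each cluster. Each increment is an
   atom.inc analogue: performed at the shared location, with no value
   returned, so updates need not wait on one another. */
void histogram(const int *labels, int n, atomic_int bins[BINS]) {
    for (int i = 0; i < n; i++)
        atomic_fetch_add(&bins[labels[i] % BINS], 1);
}
```

Performing the same increments at each core instead would drag each bin's cache line from core to core — the exposed transfer latency the slide contrasts against.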
-
So, Can We Build It?
• RTL synthesis results + memory compiler + datasheets
• Targeting a 45 nm process @ 1.2 GHz
• 320 mm² total die area

Die area breakdown:
  Clusters (total)                 207 mm² (67%)
    Logic: cores + cluster caches  112 mm² (35%)
    Cluster cache SRAM              75 mm² (23%)
    Register files                  20 mm² (6%)
  Global cache                      30 mm² (10%)
  Other logic                       30 mm² (9%)
  Overhead                          53 mm² (17%)
-
Current and Future Work
• RTL implementation
• Coherence and memory model [Kelm et al. PACT'09]
• Other programming models
• Multi-threading (1-4 threads)
• Element Five: Locality Management
-
Conclusions
• FLOPS/dev. effort: the Elements can drive design
• Software coherence is a viable approach
• Task management requires little HW
• A 1000-core accelerator is feasible
  – Area/performance: 8 GFLOPS/mm² @ ~100 W
  – Programmability: Task API + MIMD execution