
Transcript of "Rigel: An Architecture and Scalable Programming Interface for a 1000-core Accelerator" (isca09.cs.columbia.edu/pres/13.pdf)

  • Presented at the 36th Annual International Symposium on Computer Architecture June 22nd, 2009

    Rigel: An Architecture and Scalable Programming Interface for a 1000-core Accelerator

    John H. Kelm, Daniel R. Johnson, Matthew R. Johnson, Neal C. Crago, William Tuohy, Aqeel Mahesri*, Steven S. Lumetta, Matthew I. Frank†, Sanjay J. Patel

    *The author is now with NVIDIA. †The author is now with Intel.

  • Accelerated Computing: Today

    • Contemporary Accelerators: GPUs, Cell, Larrabee
    • Challenges:
      1. Inflexible programming models
      2. Lack of conventional memory model
      3. Hard to scale irregular parallel apps

    Effect on Development: Unattractive time to solution


    Programmable accelerator: an HW entity designed to provide advantages for a class of apps, including higher performance, lower power, or lower unit cost relative to a general-purpose CPU.

  • Accelerated Computing: Tomorrow

    • Why research accelerators?
      – Insight into future general-purpose CMPs
      – Challenges: performance vs. programmer effort

    • Accelerator Trend: Integration over time

    [Figure: integration over time. Past: separate CPU + memory and GPU + memory. Present: CPU and GPU each integrated with their memory. Future: CPU? Accelerator? GPU? sharing one chip?]

  • Accelerated Computing: Metrics

    • Enable new platforms
    • Open new markets
    • Enable new apps


    Challenges lead to:
    • FLOPS/$ (area)
    • FLOPS/Watt (power)
    • FLOPS/Programmer Effort


  • Context: Project Orion


    Applications, Programming Environments, and Architecture for 1000-core Parallelism

    [Diagram: Project Orion combines applications (Illinois Image Formation and Processing), programming environments, software tools, and a 1000-core architecture.]

  • Rigel Design Goals

    • What: future programming models
      – Apps and models may not exist yet
      – We have ideas (visual computing), but who knows?
      – Flexible design → easier to retarget

    • How: focus on scalability and programmer effort
      – Room to play: raised HW/SW interface
      – Focusing design effort: Five Elements



  • Outline

    • Motivation
    • Rigel architecture
    • Elements in context of Rigel architecture
    • Evaluation:
      – Area and power
      – Scalability
      – SW task management

    • Future work and conclusions


  • Rigel Architecture: Cluster View

    • Basic building block
    • Eight 32b RISC cores
    • Per-core SP FPUs
    • 64 kB shared cache
    • Cache line buffer

    [Figure: cluster block diagram: eight cores, each with an I-cache, sharing a cluster cache and connected to the tile interconnect through a tile interface.]

  • Rigel Architecture: Full Chip View

    • Cluster caches not HW coherent (8 MB total)
    • Global cache (G$) fronts the memory controllers (4 MB total)
    • Uniform cache access
    (A quick arithmetic check on these totals follows the figure below.)

    [Figure: full-chip organization: 8 tiles per chip, 16 clusters per tile, an 8x8 multistage crossbar global interconnect, 8 global cache (G$) banks, and 8 GDDR DRAM channels.]
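
    A quick arithmetic check on the totals above, using only numbers from this slide and the cluster slide: 8 tiles x 16 clusters per tile = 128 clusters, and 128 clusters x 8 cores per cluster = 1,024 cores; likewise 128 clusters x 64 kB of cluster cache = 8 MB, matching the 8 MB total quoted above, and 4 MB of global cache across 8 banks is 512 kB per bank.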

  • Design Elements

    • Challenges in accelerator computing
    • FLOPS/dev. effort → difficult to quantify
    • Guiding our 1000-core architecture
    • Room to play: raising the HW/SW interface


  • Design Elements

    1. Execution Model: ISA, SIMD vs. MIMD, VLIW vs. OoOE, MT

    2. Memory Model: Caches vs. scratchpad, ordering, coherence

    3. Work Distribution: Scheduling, spectrum of SW/HW choices

    4. Synchronization: Scalability, influence on prog. model

    5. Locality Management

      – Moving data costs perf. and power
      – Balance: dev. effort, compiler, runtime, HW


  • Element 1: Execution Model

    • Tradeoff 1: MIMD vs. SIMD [Mahesri MICRO’08]
      – Irregular data parallelism (see the sketch below)
      – Task parallelism

    • Tradeoff 2: Latency vs. Throughput [Azizi DasCMP’08]
      – Simple in-order cores

    • Tradeoff 3: Full RISC ISA vs. Specialized Cores
      – Complete ISA → conventional code generation
      – Wide range of apps

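    To make "irregular data parallelism" concrete, a minimal C sketch (illustrative only, not code from the Rigel toolchain; refine_step is a hypothetical application routine): each element needs a data-dependent number of passes, so independent MIMD cores each make steady progress, whereas SIMD lanes executing in lockstep all wait for the slowest element in their group.

```c
/* Minimal sketch (not Rigel code): each element needs a data-dependent number
 * of refinement passes. MIMD cores can run this loop over different element
 * ranges and make independent progress; SIMD lanes in lockstep would all wait
 * for the element with the most passes. */
#include <stddef.h>

/* Hypothetical application routine: refines one element, returns nonzero
 * while more passes are needed. The pass count depends on the data. */
int refine_step(float *elem);

void process_range(float *elems, size_t begin, size_t end)
{
    for (size_t i = begin; i < end; ++i) {
        while (refine_step(&elems[i]))   /* data-dependent trip count */
            ;
    }
}
```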

  • Element 2: Memory Model

    • Tradeoff 1: Single vs. multiple address space
    • Tradeoff 2: Hardware caches vs. scratchpads

      – Hardware exploits locality
      – Software manages global sharing

    • Tradeoff 3: Hierarchical vs. distributed (NUCA)
      – Cluster cache / global cache hierarchy
      – ISA provides local/global memory operations (see the sketch below)
      – Non-uniformity → programmer effort

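    A hedged C sketch of how the local/global split can look to the programmer. The intrinsic names (rigel_global_load, rigel_global_store) are placeholders, not the documented API; the idea is that ordinary accesses may live in the non-coherent cluster cache, while global operations complete at the global cache so every cluster observes the same value.

```c
/* Sketch only: intrinsic names are hypothetical placeholders.
 * Ordinary loads/stores may hit in the cluster cache, which is not kept
 * coherent across clusters; "global" operations bypass it and complete
 * at the shared global cache, so every cluster observes the same value. */
#include <stdint.h>

uint32_t rigel_global_load(volatile uint32_t *addr);              /* placeholder */
void     rigel_global_store(volatile uint32_t *addr, uint32_t v); /* placeholder */

volatile uint32_t work_available;   /* shared flag, logically in global memory */

void producer(uint32_t *buf, uint32_t n)
{
    for (uint32_t i = 0; i < n; ++i)
        buf[i] = i;                          /* private data: cluster-cached */
    rigel_global_store(&work_available, 1);  /* shared flag: made globally visible */
}

int consumer_poll(void)
{
    /* Read through the global cache so a stale cluster-cached copy is not used. */
    return rigel_global_load(&work_available) != 0;
}
```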

  • Some Results: Scalability


    • Based on cycle-accurate, execution-driven simulation
    • Library and run-time system code simulated
    • Regular C code + parallel library, standard C compiler

    [Chart: speedup vs. a single cluster (0x to 120x) for dmm, heat, kmeans, mri, gjk, cg, and sobel at 16, 32, 64, and 128 clusters (128, 256, 512, and 1024 cores).]

  • Element 3: Work Distribution

    • Tradeoff (spectrum): HW vs. SW implementation
    • SW task management: hierarchical queues (see the sketch below)
    • Flexible policies + little specialized HW


    [Figure: hierarchical task queues: a global task queue at the global cache and local task queues at the cluster caches; cores in a cluster enqueue and dequeue tasks through their local queue.]
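
    A sketch of what SW task management looks like from application code. The type and function names below (task_desc_t, task_enqueue, task_dequeue) are placeholders rather than the published Rigel Task Model API; the point is the shape of the interface: one core seeds work descriptors, every core dequeues and runs tasks until the queues drain, and the runtime is free to implement the queues hierarchically (per-cluster queues backed by a global queue) without changing this code.

```c
/* Sketch of software task management from C. All names below are assumed,
 * not the documented Rigel Task Model interface. */
#include <stdint.h>

typedef struct {
    void   (*func)(uint32_t begin, uint32_t end);  /* kernel to run */
    uint32_t begin, end;                           /* half-open iteration range */
} task_desc_t;

/* Placeholder runtime calls (assumed, not the documented interface). */
void task_enqueue(const task_desc_t *t);
int  task_dequeue(task_desc_t *t);      /* 0 on success, nonzero when drained */

void do_chunk(uint32_t begin, uint32_t end);   /* application kernel */

/* One core seeds the queue with fixed-size chunks of a parallel loop. */
void seed_tasks(uint32_t total, uint32_t chunk)
{
    for (uint32_t i = 0; i < total; i += chunk) {
        task_desc_t t = { do_chunk, i, (i + chunk < total) ? i + chunk : total };
        task_enqueue(&t);
    }
}

/* Every core runs this loop until no tasks remain. */
void worker(void)
{
    task_desc_t t;
    while (task_dequeue(&t) == 0)
        t.func(t.begin, t.end);
}
```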

  • Work Distribution: Rigel Task Model


    • < 5% overhead for most data-parallel workloads
    • < 15% for most irregular data-parallel workloads
    • Task lengths: 100s to 100k instructions

    [Chart: per-task overhead breakdown (task execution vs. enqueue and dequeue costs, 60% to 100% of task time) for dmm, heat, kmeans, mri, gjk, and cg at 128, 256, 512, and 1024 cores.]

  • Element 4: Synchronization

    • Uses of coherence mechanisms:
      1. Control synchronization
      2. Data sharing
    • Broadcast update (see the sketch below)
      – Use cases: flags and barriers
      – Reduces contention from polling
      – Case study: 2x speedup for conjugate gradient (CG)
    • Atomic primitives (example on the next slide)

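    A minimal sketch of the broadcast-update idea for a completion flag, assuming an intrinsic (here named rigel_broadcast_store, a placeholder) that pushes the new value out to the cluster caches: waiters then spin on ordinary cached reads that keep hitting locally, instead of all cores polling the global cache.

```c
#include <stdint.h>

/* Placeholder intrinsic: store whose new value is broadcast to the cluster
 * caches so cached copies of the flag are updated rather than left stale. */
void rigel_broadcast_store(volatile uint32_t *addr, uint32_t value);

volatile uint32_t phase_done;        /* shared completion flag */

/* One core announces that the phase is finished. */
void signal_phase_done(void)
{
    rigel_broadcast_store(&phase_done, 1);
}

/* Waiting cores spin on plain cached reads; hits stay in the local cluster
 * cache until the broadcast delivers the new value, so polling does not
 * pile contention onto the global cache. */
void wait_for_phase(void)
{
    while (phase_done == 0)
        ;
}
```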

  • Element 4: Atomic Primitives

    [Figure: timeline for an atomic increment (atom.inc) issued by three cores. Conventional multiprocessor: each operation waits on a cache-to-cache transfer. Rigel: operations are performed at the global cache (the point of coherence), with latency broken into network latency, overlapped latency, exposed latency, and operation execution.]

  • Evaluation: Atomic Operations

    K-means clustering:
    • Needs global histogramming (see the sketch below the chart)
    • With G$ atomics → updates pipeline through the network
    • Without atomics → exposed transfer latency

    [Chart: k-means speedup at 128, 256, 512, and 1024 cores, comparing atomics performed at the G$ (baseline) with atomics performed at the core.]
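
    A hedged sketch of the global histogramming step, assuming a fetch-and-increment intrinsic (rigel_atomic_inc, a placeholder name) that is executed at the global cache: each core issues one small request per point, and requests from many cores pipeline through the interconnect instead of each core first acquiring the counter's cache line.

```c
#include <stddef.h>
#include <stdint.h>

/* Placeholder intrinsic: atomic increment performed at the global cache
 * (the point of coherence), returning the previous value. */
uint32_t rigel_atomic_inc(volatile uint32_t *addr);

/* Application-specific: index of the centroid closest to a point. */
int nearest_centroid(const float *point, size_t dim);

void histogram_points(const float *points, size_t npoints, size_t dim,
                      volatile uint32_t *counts)   /* one shared counter per centroid */
{
    for (size_t i = 0; i < npoints; ++i) {
        int c = nearest_centroid(&points[i * dim], dim);
        rigel_atomic_inc(&counts[c]);   /* one in-network request per update */
    }
}
```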

  • So, Can We Build It?

    Die area breakdown:
    • Clusters: 207 mm2 (67%)
      – Logic (cores + cluster cache): 112 mm2 (35%)
      – Cluster cache SRAM: 75 mm2 (23%)
      – Register files: 20 mm2 (6%)
    • Global cache (Gcache): 30 mm2 (10%)
    • Other logic: 30 mm2 (9%)
    • Overhead: 53 mm2 (17%)

    • RTL synthesis results + memory compiler + datasheets
    • Targeting a 45nm process @ 1.2 GHz
    • 320 mm2 total die area (see the quick check below)
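
    Two quick checks on these numbers. The area components above (207 + 30 + 30 + 53 mm2) sum to the quoted 320 mm2. And assuming each core's SP FPU retires a fused multiply-add (2 flops) per cycle, which is an assumption not stated on the slide, 1,024 cores x 1.2 GHz x 2 flops/cycle ≈ 2.5 TFLOPS peak, or about 2,500 / 320 ≈ 7.7 GFLOPS/mm2, consistent with the ~8 GFLOPS/mm2 quoted in the conclusions.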

  • Current and Future Work

    • RTL implementation
    • Coherence and memory model [Kelm et al. PACT’09]
    • Other programming models
    • Multi-threading (1-4 threads)
    • Element Five: Locality Management


  • Conclusions

    • FLOPS/dev. effort → Elements can drive design
    • Software coherence is a viable approach
    • Task management requires little HW
    • A 1000-core accelerator is feasible
      – Area/performance: 8 GFLOPS/mm2 @ ~100W
      – Programmability: task API + MIMD execution

