
Transcript of "Rigel: An Architecture and Scalable Programming Interface for a 1000-core Accelerator" (isca09.cs.columbia.edu/pres/13.pdf)

  • Presented at the 36th Annual International Symposium on Computer Architecture June 22nd, 2009

    Rigel: An Architecture and Scalable Programming Interface for a 1000-core Accelerator

    John H. Kelm, Daniel R. Johnson, Matthew R. Johnson, Neal C. Crago, William Tuohy, Aqeel Mahesri*, Steven S. Lumetta, Matthew I. Frank†, Sanjay J. Patel

    *The author is now with NVIDIA. †The author is now with Intel.

  • Accelerated Computing: Today

    • Contemporary Accelerators: GPUs, Cell, Larrabee
    • Challenges:
      1. Inflexible programming models
      2. Lack of conventional memory model
      3. Hard to scale irregular parallel apps

    Effect on Development: Unattractive time to solution


    Programmable accelerator: an HW entity designed to provide advantages for a class of apps, including higher performance, lower power, or lower unit cost relative to a general-purpose CPU.

  • Accelerated Computing: Tomorrow

    • Why research accelerators?
      – Insight into future general-purpose CMPs
      – Challenges: performance vs. programmer effort

    • Accelerator Trend: Integration over time

    [Figure: integration over time. Past: separate CPU + memory and GPU + memory. Present: CPU and GPU each integrated with their memory. Future: CPU? Accelerator? GPU? sharing one chip?]

  • Accelerated Computing: Metrics

    • Enable new platforms
    • Open new markets
    • Enable new apps


    Challenges lead to:
    • FLOPS/$ (area)
    • FLOPS/Watt (power)
    • FLOPS/Programmer Effort


  • Context: Project Orion


    Applications, Programming Environments, and Architecture for 1000-core Parallelism

    [Diagram: Project Orion combines applications (Illinois Image Formation and Processing), programming environments, software tools, and a 1000-core architecture.]

  • Rigel Design Goals

    • What: future programming models
      – Apps and models may not exist yet
      – We have ideas (visual computing), but who knows?
      – Flexible design → easier to retarget

    • How: focus on scalability and programmer effort
      – Room to play: raised HW/SW interface
      – Focusing design effort: Five Elements



  • Outline

    • Motivation
    • Rigel architecture
    • Elements in context of Rigel architecture
    • Evaluation:
      – Area and power
      – Scalability
      – SW task management

    • Future work and conclusions


  • Rigel Architecture: Cluster View

    • Basic building block
    • Eight 32b RISC cores
    • Per-core SP FPUs
    • 64 kB shared cache
    • Cache line buffer

    [Figure: cluster block diagram: eight cores, each with an I-cache, sharing a cluster cache and connected to the tile interconnect through a tile interface.]

  • Rigel Architecture: Full Chip View

    • Cluster caches not HW coherent (8 MB total)
    • Global cache (G$) fronts the memory controllers (4 MB total)
    • Uniform cache access
    (A quick arithmetic check on these totals follows the figure below.)

    [Figure: full-chip organization: 8 tiles per chip, 16 clusters per tile, an 8x8 multistage crossbar global interconnect, 8 global cache (G$) banks, and 8 GDDR DRAM channels.]
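
    A quick arithmetic check on the totals above, using only numbers from this slide and the cluster slide: 8 tiles x 16 clusters per tile = 128 clusters, and 128 clusters x 8 cores per cluster = 1,024 cores; likewise 128 clusters x 64 kB of cluster cache = 8 MB, matching the 8 MB total quoted above, and 4 MB of global cache across 8 banks is 512 kB per bank.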

  • Design Elements

    • Challenges in accelerator computing
    • FLOPS/dev. effort → difficult to quantify
    • Guiding our 1000-core architecture
    • Room to play: raising the HW/SW interface


  • Design Elements

    1. Execution Model: ISA, SIMD vs. MIMD, VLIW vs. OoOE, MT

    2. Memory Model: Caches vs. scratchpad, ordering, coherence

    3. Work Distribution: Scheduling, spectrum of SW/HW choices

    4. Synchronization: Scalability, influence on prog. model

    5. Locality Management

      – Moving data costs perf. and power
      – Balance: dev. effort, compiler, runtime, HW


  • Element 1: Execution Model

    • Tradeoff 1: MIMD vs. SIMD [Mahesri MICRO’08]
      – Irregular data parallelism (see the sketch below)
      – Task parallelism

    • Tradeoff 2: Latency vs. Throughput [Azizi DasCMP’08]
      – Simple in-order cores

    • Tradeoff 3: Full RISC ISA vs. Specialized Cores
      – Complete ISA → conventional code generation
      – Wide range of apps

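    To make "irregular data parallelism" concrete, a minimal C sketch (illustrative only, not code from the Rigel toolchain; refine_step is a hypothetical application routine): each element needs a data-dependent number of passes, so independent MIMD cores each make steady progress, whereas SIMD lanes executing in lockstep all wait for the slowest element in their group.

```c
/* Minimal sketch (not Rigel code): each element needs a data-dependent number
 * of refinement passes. MIMD cores can run this loop over different element
 * ranges and make independent progress; SIMD lanes in lockstep would all wait
 * for the element with the most passes. */
#include <stddef.h>

/* Hypothetical application routine: refines one element, returns nonzero
 * while more passes are needed. The pass count depends on the data. */
int refine_step(float *elem);

void process_range(float *elems, size_t begin, size_t end)
{
    for (size_t i = begin; i < end; ++i) {
        while (refine_step(&elems[i]))   /* data-dependent trip count */
            ;
    }
}
```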

  • Element 2: Memory Model

    • Tradeoff 1: Single vs. multiple address space
    • Tradeoff 2: Hardware caches vs. scratchpads

      – Hardware exploits locality
      – Software manages global sharing

    • Tradeoff 3: Hierarchical vs. distributed (NUCA)
      – Cluster cache / global cache hierarchy
      – ISA provides local/global memory operations (see the sketch below)
      – Non-uniformity → programmer effort

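    A hedged C sketch of how the local/global split can look to the programmer. The intrinsic names (rigel_global_load, rigel_global_store) are placeholders, not the documented API; the idea is that ordinary accesses may live in the non-coherent cluster cache, while global operations complete at the global cache so every cluster observes the same value.

```c
/* Sketch only: intrinsic names are hypothetical placeholders.
 * Ordinary loads/stores may hit in the cluster cache, which is not kept
 * coherent across clusters; "global" operations bypass it and complete
 * at the shared global cache, so every cluster observes the same value. */
#include <stdint.h>

uint32_t rigel_global_load(volatile uint32_t *addr);              /* placeholder */
void     rigel_global_store(volatile uint32_t *addr, uint32_t v); /* placeholder */

volatile uint32_t work_available;   /* shared flag, logically in global memory */

void producer(uint32_t *buf, uint32_t n)
{
    for (uint32_t i = 0; i < n; ++i)
        buf[i] = i;                          /* private data: cluster-cached */
    rigel_global_store(&work_available, 1);  /* shared flag: made globally visible */
}

int consumer_poll(void)
{
    /* Read through the global cache so a stale cluster-cached copy is not used. */
    return rigel_global_load(&work_available) != 0;
}
```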

  • Some Results: Scalability


    • Based on cycle-accurate, execution-driven simulation
    • Library and run-time system code simulated
    • Regular C code + parallel library, standard C compiler

    [Chart: speedup vs. a single cluster (0x to 120x) for dmm, heat, kmeans, mri, gjk, cg, and sobel at 16, 32, 64, and 128 clusters (128, 256, 512, and 1024 cores).]

  • Element 3: Work Distribution

    • Tradeoff (spectrum): HW vs. SW implementation
    • SW task management: hierarchical queues (see the sketch below)
    • Flexible policies + little specialized HW


    [Figure: hierarchical task queues: a global task queue at the global cache and local task queues at the cluster caches; cores in a cluster enqueue and dequeue tasks through their local queue.]
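
    A sketch of what SW task management looks like from application code. The type and function names below (task_desc_t, task_enqueue, task_dequeue) are placeholders rather than the published Rigel Task Model API; the point is the shape of the interface: one core seeds work descriptors, every core dequeues and runs tasks until the queues drain, and the runtime is free to implement the queues hierarchically (per-cluster queues backed by a global queue) without changing this code.

```c
/* Sketch of software task management from C. All names below are assumed,
 * not the documented Rigel Task Model interface. */
#include <stdint.h>

typedef struct {
    void   (*func)(uint32_t begin, uint32_t end);  /* kernel to run */
    uint32_t begin, end;                           /* half-open iteration range */
} task_desc_t;

/* Placeholder runtime calls (assumed, not the documented interface). */
void task_enqueue(const task_desc_t *t);
int  task_dequeue(task_desc_t *t);      /* 0 on success, nonzero when drained */

void do_chunk(uint32_t begin, uint32_t end);   /* application kernel */

/* One core seeds the queue with fixed-size chunks of a parallel loop. */
void seed_tasks(uint32_t total, uint32_t chunk)
{
    for (uint32_t i = 0; i < total; i += chunk) {
        task_desc_t t = { do_chunk, i, (i + chunk < total) ? i + chunk : total };
        task_enqueue(&t);
    }
}

/* Every core runs this loop until no tasks remain. */
void worker(void)
{
    task_desc_t t;
    while (task_dequeue(&t) == 0)
        t.func(t.begin, t.end);
}
```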

  • Work Distribution: Rigel Task Model


    • < 5% overhead for most data-parallel workloads
    • < 15% for most irregular data-parallel workloads
    • Task lengths: 100s to 100k instructions

    [Chart: per-task overhead breakdown (task execution vs. enqueue and dequeue costs, 60% to 100% of task time) for dmm, heat, kmeans, mri, gjk, and cg at 128, 256, 512, and 1024 cores.]

  • Element 4: Synchronization

    • Uses of coherence mechanisms:
      1. Control synchronization
      2. Data sharing
    • Broadcast update (see the sketch below)
      – Use cases: flags and barriers
      – Reduces contention from polling
      – Case study: 2x speedup for conjugate gradient (CG)
    • Atomic primitives (example on the next slide)

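    A minimal sketch of the broadcast-update idea for a completion flag, assuming an intrinsic (here named rigel_broadcast_store, a placeholder) that pushes the new value out to the cluster caches: waiters then spin on ordinary cached reads that keep hitting locally, instead of all cores polling the global cache.

```c
#include <stdint.h>

/* Placeholder intrinsic: store whose new value is broadcast to the cluster
 * caches so cached copies of the flag are updated rather than left stale. */
void rigel_broadcast_store(volatile uint32_t *addr, uint32_t value);

volatile uint32_t phase_done;        /* shared completion flag */

/* One core announces that the phase is finished. */
void signal_phase_done(void)
{
    rigel_broadcast_store(&phase_done, 1);
}

/* Waiting cores spin on plain cached reads; hits stay in the local cluster
 * cache until the broadcast delivers the new value, so polling does not
 * pile contention onto the global cache. */
void wait_for_phase(void)
{
    while (phase_done == 0)
        ;
}
```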

  • Element 4: Atomic Primitives

    [Figure: timeline for an atomic increment (atom.inc) issued by three cores. Conventional multiprocessor: each operation waits on a cache-to-cache transfer. Rigel: operations are performed at the global cache (the point of coherence), with latency broken into network latency, overlapped latency, exposed latency, and operation execution.]

  • Evaluation: Atomic Operations

    K-means clustering:
    • Needs global histogramming (see the sketch below the chart)
    • With G$ atomics → updates pipeline through the network
    • Without atomics → exposed transfer latency

    [Chart: k-means speedup at 128, 256, 512, and 1024 cores, comparing atomics performed at the G$ (baseline) with atomics performed at the core.]
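
    A hedged sketch of the global histogramming step, assuming a fetch-and-increment intrinsic (rigel_atomic_inc, a placeholder name) that is executed at the global cache: each core issues one small request per point, and requests from many cores pipeline through the interconnect instead of each core first acquiring the counter's cache line.

```c
#include <stddef.h>
#include <stdint.h>

/* Placeholder intrinsic: atomic increment performed at the global cache
 * (the point of coherence), returning the previous value. */
uint32_t rigel_atomic_inc(volatile uint32_t *addr);

/* Application-specific: index of the centroid closest to a point. */
int nearest_centroid(const float *point, size_t dim);

void histogram_points(const float *points, size_t npoints, size_t dim,
                      volatile uint32_t *counts)   /* one shared counter per centroid */
{
    for (size_t i = 0; i < npoints; ++i) {
        int c = nearest_centroid(&points[i * dim], dim);
        rigel_atomic_inc(&counts[c]);   /* one in-network request per update */
    }
}
```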

  • So, Can We Build It?

    Die area breakdown:
    • Clusters: 207 mm2 (67%)
      – Logic (cores + cluster cache): 112 mm2 (35%)
      – Cluster cache SRAM: 75 mm2 (23%)
      – Register files: 20 mm2 (6%)
    • Global cache (Gcache): 30 mm2 (10%)
    • Other logic: 30 mm2 (9%)
    • Overhead: 53 mm2 (17%)

    • RTL synthesis results + memory compiler + datasheets
    • Targeting a 45nm process @ 1.2 GHz
    • 320 mm2 total die area (see the quick check below)
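
    Two quick checks on these numbers. The area components above (207 + 30 + 30 + 53 mm2) sum to the quoted 320 mm2. And assuming each core's SP FPU retires a fused multiply-add (2 flops) per cycle, which is an assumption not stated on the slide, 1,024 cores x 1.2 GHz x 2 flops/cycle ≈ 2.5 TFLOPS peak, or about 2,500 / 320 ≈ 7.7 GFLOPS/mm2, consistent with the ~8 GFLOPS/mm2 quoted in the conclusions.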

  • Current and Future Work

    • RTL implementation
    • Coherence and memory model [Kelm et al. PACT’09]
    • Other programming models
    • Multi-threading (1-4 threads)
    • Element Five: Locality Management


  • Conclusions

    • FLOPS/dev. effort → Elements can drive design
    • Software coherence is a viable approach
    • Task management requires little HW
    • A 1000-core accelerator is feasible
      – Area/performance: 8 GFLOPS/mm2 @ ~100W
      – Programmability: task API + MIMD execution

