George Michelogiannakis, Prof. William J. Dally. Concurrent Architecture & VLSI Group, Stanford University. Elastic Buffer Flow Control for On-chip Networks

George Michelogiannakis, Prof. William J. Dally

Concurrent architecture & VLSI group

Stanford University

Elastic Buffer Flow Control for On-chip Networks

The PPL Vision

[Figure: the PPL stack. Applications: Virtual Worlds, Personal Robotics, Data Informatics, Scientific Engineering. Domain Specific Languages: Physics (Liszt), Scripting, Probabilistic (RandomT), Machine Learning (OptiML), Rendering. DSL Infrastructure: Domain Embedding Language (Scala), Polymorphic Embedding, Staging, Static Domain Specific Opt. Parallel Runtime (Delite, Sequoia, GRAMPS): Dynamic Domain Spec. Opt., Locality Aware Scheduling, Task & Data Parallelism. Heterogeneous Hardware / Hardware Architecture: OOO Cores, SIMD Cores, Threaded Cores, Specialized Cores, Programmable Hierarchies, Scalable Coherence, Isolation & Atomicity, On-chip Networks, Pervasive Monitoring.]

In a Nutshell

Elastic-buffer (EB) flow control uses the channels as distributed FIFOs
• Input buffers at routers are not needed

Compared to VC routers:
• Reduces cycle time by up to 67%
• Provides 43% more throughput per unit power and 22% more throughput per unit area
• Makes for a simpler network

EB uses duplicate subnetworks for traffic isolation
• For many classes, a hybrid EB-VC router is used instead
• The hybrid uses buffers only to alleviate severe contention and deadlocks, which increases power efficiency

Outline

Building EB channels
• The basic building blocks of EB networks

EB router design

Deadlock avoidance & congestion sensing

Evaluation results

The Idea

Use the network channels as distributed FIFOs

Use that storage instead of input buffers at routers
• To remove input buffer area and power costs

[Figure: a pipelined channel, and the same channel used as a FIFO]

Building an Elastic Buffer

To build an EB in a pipelined channel with master-slave flip-flops (FFs), use the latches for storage by driving their enables independently.

[Figure: a master-slave FF, and the elastic buffer built from it]
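
As a rough behavioral sketch of the idea above, the two latches of one flip-flop can be modeled as two storage slots with separate enables. The names used here (ElasticBuffer, master, slave, cycle) are assumptions made for illustration and do not come from the slides. The model works at cycle level, not at clock-phase level.

```python
class ElasticBuffer:
    """Cycle-level model of one EB: two latches used as two storage slots."""

    def __init__(self):
        self.master = None  # latch nearer the input (hypothetical name)
        self.slave = None   # latch nearer the output (hypothetical name)

    @property
    def ready(self):
        # Data always moves on to the slave when it can, so an empty
        # master means at least one slot is free.
        return self.master is None

    @property
    def valid(self):
        # The slave latch drives the output, so it decides validity.
        return self.slave is not None

    def cycle(self, data_in, valid_in, ready_out):
        """Advance one cycle and return (data_out, accepted)."""
        ready_now = self.ready  # sample the handshake before any update
        data_out = None
        if self.slave is not None and ready_out:
            data_out, self.slave = self.slave, None
        accepted = valid_in and ready_now
        if accepted:
            self.master = data_in  # master enable
        if self.slave is None and self.master is not None:
            self.slave, self.master = self.master, None  # slave enable
        return data_out, accepted
```

In this model the buffer forwards one item per cycle while the downstream side is ready, and holds up to two items when it is not.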

How Elastic Buffer Channels Work

Ready/valid handshake between elastic buffers
• Ready: at least one free storage slot
• Valid: non-empty (driving valid data)

[Figure: handshake example shown over Cycle 1 through Cycle 6]
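
The handshake can be illustrated with a small simulation of a channel made of several EBs. This is a sketch under assumptions: each EB is taken to hold two flits, ready and valid are sampled at the start of every cycle, and the function names (step, run_demo) are hypothetical.

```python
from collections import deque

CAPACITY = 2  # slots per EB; assumed, one per latch of a flip-flop


def step(channel, inject, sink_ready):
    """Advance a chain of EBs by one cycle; return (delivered, accepted)."""
    # Sample both handshake signals from the state at the start of the cycle.
    valid = [len(eb) > 0 for eb in channel]
    ready = [len(eb) < CAPACITY for eb in channel]
    delivered = None
    if valid[-1] and sink_ready:
        delivered = channel[-1].popleft()
    # A flit advances when its EB is valid and the next EB is ready.
    for i in range(len(channel) - 2, -1, -1):
        if valid[i] and ready[i + 1]:
            channel[i + 1].append(channel[i].popleft())
    accepted = inject is not None and ready[0]
    if accepted:
        channel[0].append(inject)
    return delivered, accepted


def run_demo(num_ebs=3, blocked_cycles=10, drain_cycles=10):
    channel = [deque() for _ in range(num_ebs)]
    next_flit, delivered = 0, []
    for _ in range(blocked_cycles):  # the sink refuses every flit
        _, accepted = step(channel, next_flit, sink_ready=False)
        if accepted:
            next_flit += 1
    for _ in range(drain_cycles):  # the sink accepts, the source is idle
        out, _ = step(channel, None, sink_ready=True)
        if out is not None:
            delivered.append(out)
    return next_flit, delivered


print(run_demo())
```

With three EBs and a blocked sink, the channel absorbs six flits before backpressure reaches the source, and the flits later drain in the order they entered. That is the distributed-FIFO behavior the slides describe.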

Outline

Building EB channels

EB router design
• The implications in router design

Deadlock avoidance & congestion sensing

Evaluation results

Use EB Flow-Control Through the Router

[Figure: VC input-buffered router compared with the EB router]

Input buffer replaced by input EB.

VC & SW allocators removed. Per-output arbiters instead.

Three-slot output EB to cover for arbitration done one cycle in advance.

Look-ahead (LA) routing also applicable to EB networks.
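
A per-output arbiter could look like the sketch below. The slides do not say which arbitration policy is used, so the round-robin policy and every name here are assumptions. The sketch also ignores holding an output for the duration of a multi-flit packet.

```python
class RoundRobinArbiter:
    """One arbiter per output port; round-robin is an assumed policy."""

    def __init__(self, num_inputs):
        self.num_inputs = num_inputs
        self.last = num_inputs - 1  # so that input 0 is served first

    def grant(self, requests):
        """requests[i] is True if input i wants this output."""
        for offset in range(1, self.num_inputs + 1):
            i = (self.last + offset) % self.num_inputs
            if requests[i]:
                self.last = i
                return i
        return None


def arbitrate(arbiters, desired_output):
    """desired_output[i] is the output wanted by input i, or None."""
    grants = {}
    for out, arbiter in enumerate(arbiters):
        winner = arbiter.grant([d == out for d in desired_output])
        if winner is not None:
            grants[out] = winner
    return grants


arbiters = [RoundRobinArbiter(3) for _ in range(2)]
print(arbitrate(arbiters, [0, 0, 1]))
print(arbitrate(arbiters, [0, 0, 1]))
```

Each output decides on its own, so no allocator has to match inputs to outputs globally. Inputs 0 and 1 take turns on output 0 across the two calls.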

Two Improved Router Designs

Enhanced two-stage
• Fixes baseline design’s main inefficiencies
• Prioritizes cycle time

Single-stage
• Removes pipelining overhead
• Prioritizes latency

Outline

Building EB channels

EB router design

Deadlock avoidance & congestion sensing
• How to provide traffic classes

Evaluation results

Deadlock Avoidance

No input buffers, so no virtual channels

Can provide traffic isolation with duplicate physical channels
• Duplicating subnetworks is most efficient due to the crossbar's quadratic cost
• That is only true for up to a certain number of classes
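
The quadratic-cost argument can be made concrete with a toy model. It assumes crossbar cost grows with the square of the port count and compares one router whose crossbar serves every duplicated channel against one small crossbar per subnetwork. The function names and the five-port router are assumptions. The model ignores every other cost, so it does not reproduce the caveat that the advantage holds only up to a certain number of classes.

```python
def crossbar_cost(ports):
    # Assumed cost model: cost proportional to ports squared.
    return ports ** 2


def widened_router_cost(ports, classes):
    # One crossbar that serves all duplicated physical channels.
    return crossbar_cost(ports * classes)


def subnetworks_cost(ports, classes):
    # One independent crossbar per traffic class.
    return classes * crossbar_cost(ports)


for classes in (1, 2, 4):
    print(classes, widened_router_cost(5, classes), subnetworks_cost(5, classes))
```

Under this model the widened crossbar grows with the square of the class count, while duplicated subnetworks grow linearly.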

Hybrid EB-VC Router

For many classes, have an input buffer to drain flits after a predefined number of blocking cycles

Thus, the buffer is used only to alleviate heavy contention and resolve deadlocks
• In the common case, as energy efficient as EB networks
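
The drain rule can be sketched as a counter of consecutive blocked cycles at each input. The threshold value, the buffer size, the choice to keep draining until a flit advances normally, and all names below are assumptions for illustration. The slides only state that flits are drained after a predefined number of blocking cycles.

```python
from collections import deque


class HybridInput:
    """Input port of a hybrid EB-VC router (behavioral sketch)."""

    def __init__(self, threshold, buffer_slots):
        self.threshold = threshold        # predefined number of blocked cycles
        self.buffer_slots = buffer_slots  # hypothetical input buffer size
        self.buffer = deque()
        self.blocked_cycles = 0

    def cycle(self, head_flit, can_advance):
        """Return what happens to the flit at the head of the input EB."""
        if head_flit is None:
            self.blocked_cycles = 0
            return "idle"
        if can_advance:
            self.blocked_cycles = 0
            return "forward"  # common case: the buffer is never touched
        self.blocked_cycles += 1
        if (self.blocked_cycles >= self.threshold
                and len(self.buffer) < self.buffer_slots):
            self.buffer.append(head_flit)  # drain into the input buffer
            return "drain"
        return "wait"


port = HybridInput(threshold=3, buffer_slots=2)
print([port.cycle("a", False) for _ in range(3)])
```

In the common case the flit advances before the threshold is reached, so the buffer stays idle and costs no access energy.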

Output Channel Occupancy Load Metric

Flit-buffered networks use credit count

EB networks measure output channel occupancy
• Measured at a certain segment of the output channel (shown in red)
• Occupancy is decremented when flits leave that segment
• Occupancy is incremented by a packet’s length when the routing decision is made, so packets see other decisions made in the same cycle
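
The bookkeeping described above might be sketched as follows. Choosing the least-occupied candidate output is an assumption about how the metric is consumed, and the class and method names are hypothetical.

```python
class OutputOccupancy:
    """Per-output occupancy counters used as a load metric (sketch)."""

    def __init__(self, num_outputs):
        self.occupancy = [0] * num_outputs

    def route(self, candidate_outputs, packet_length):
        # Assumed policy: pick the least-occupied candidate output.
        best = min(candidate_outputs, key=lambda out: self.occupancy[out])
        # Count the whole packet at once so that later routing decisions
        # in the same cycle already see this one.
        self.occupancy[best] += packet_length
        return best

    def flit_left_segment(self, output):
        # Called when a flit leaves the measured channel segment.
        self.occupancy[output] -= 1


load = OutputOccupancy(num_outputs=2)
first = load.route([0, 1], packet_length=4)
second = load.route([0, 1], packet_length=2)
print(first, second, load.occupancy)
```

The second packet avoids the output chosen by the first even though none of the first packet's flits has entered the channel yet.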

Outline

Building EB channels

EB router design

Deadlock avoidance & congestion sensing

Evaluation results
• Let’s talk numbers

Throughput-Power Mesh (Baseline Router)

EB network improvement:
• Same power: 10% increased throughput
• Same throughput: 12% reduced power

[Figure: throughput-power comparison, with the throughput gain marked]

The EB router's 18% lower cycle time is not taken into account here.

Router RTL Implementation

EB router: no buffers, VCs, allocators, or credits
• VC router had look-ahead routing

VC router buffers: FF arrays. 2 VCs, 8 slots each

Aspect        VC router   EB router   Savings
Area (μm²)    63,515      14,730      77%
Clock (ns)    3.3         2.7         18%
Power (mW)    2.59        0.12        95%

45nm, LP-CMOS, worst-case
Mesh 5x5 routers. DOR. 64-bit datapath
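
The savings column follows from the two router columns as (VC - EB) / VC. A short check, using only the figures reported in the table (the function name is an assumption):

```python
def savings_percent(vc, eb):
    """Relative reduction of the EB router with respect to the VC router."""
    return 100.0 * (vc - eb) / vc


results = {
    "Area (um^2)": (63515, 14730),
    "Clock (ns)": (3.3, 2.7),
    "Power (mW)": (2.59, 0.12),
}
for aspect, (vc, eb) in results.items():
    print(f"{aspect}: {savings_percent(vc, eb):.0f}%")
```

Rounded to whole percentages, the computed values match the 77%, 18% and 95% listed in the table.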

Router Comparison

Baseline: 9% less energy than single-stage, 35% less than enhanced

Enhanced: 26% lower cycle time than single-stage, 42% lower than baseline

Hybrid EB-VC Comparison

Cycle time comparable to VC, not EB routers

Hybrid offers 21% more throughput per unit power than VC. 12% than EB

The VC network offers 41% more throughput per unit area. The EB 49%

Conclusions

EB flow control uses channels as distributed FIFOs
• Uses the pipeline flip-flops that are required anyway
• Removes input buffers from routers

Provides 43% more throughput per unit power and 22% more throughput per unit area
• The gain depends on what fraction of the cost the input buffers account for

Reduces cycle time by up to 67%

Hybrid EB-VC router provides a large number of classes. The input buffer is used only when it has to be
• 21% more throughput per unit power than VC

Remove buffers, keep buffering. Elastic buffers!

Questions?
