George Michelogiannakis, Prof. William J. Dally
Concurrent architecture & VLSI group
Stanford University
Elastic Buffer Flow Control for On-chip Networks
The PPL Vision
Applications
• Virtual Worlds, Personal Robotics, Data Informatics, Scientific Engineering
Domain Specific Languages
• Physics (Liszt), Scripting, Probabilistic (RandomT), Machine Learning (OptiML), Rendering
DSL Infrastructure
• Domain Embedding Language (Scala)
• Polymorphic Embedding, Staging, Static Domain Specific Opt.
Parallel Runtime (Delite, Sequoia, GRAMPS)
• Dynamic Domain Spec. Opt., Locality Aware Scheduling, Task & Data Parallelism
Heterogeneous Hardware
• Hardware Architecture: OOO Cores, SIMD Cores, Threaded Cores, Specialized Cores
• Programmable Hierarchies, Scalable Coherence, Isolation & Atomicity, On-chip Networks, Pervasive Monitoring
In a Nutshell
Elastic-buffer (EB) flow-control uses the channels as distributed FIFOs
• Input buffers at routers are not needed
Compared to VC routers:
• Reduces cycle time up to 67%
• Provides 43% more throughput per unit power, and 22% more throughput per unit area
• Makes for a simpler network
EB uses duplicate subnetworks for traffic isolation
• For many classes, a hybrid EB-VC router is used instead
• Uses buffers only to alleviate severe contention and deadlocks. Increases power efficiency
Outline
Building EB channels
• The basic building blocks of EB networks
EB router design
Deadlock avoidance & congestion sensing
Evaluation results
The Idea
Use the network channels as distributed FIFOs
Use that storage instead of input buffers at routers
• To remove input buffer area and power costs
Pipelined channel
Channel as FIFO
Building an Elastic Buffer
To build an EB in a pipelined channel with master-slave flip-flops (FFs):
Use latches for storage by driving their enables independently
Master-slave FF
Elastic buffer
How Elastic Buffer Channels Work
Ready/valid handshake between elastic buffers
• Ready: At least one free storage slot
• Valid: Non-empty (driving valid data)
Figure: handshake example stepped through Cycle 1 to Cycle 6
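The ready/valid handshake can be sketched as a small cycle-level model. This is an illustrative sketch, not the authors' implementation: the names `ElasticBuffer` and `step` are hypothetical, each EB is assumed to hold two flits (one per latch), and ready and valid are assumed to be sampled at the start of every cycle.

```python
from collections import deque


class ElasticBuffer:
    """Hypothetical model of one EB: a two-slot FIFO, one slot per latch."""

    def __init__(self, slots=2):
        self.slots = slots
        self.fifo = deque()

    @property
    def ready(self):
        # Ready: at least one free storage slot.
        return len(self.fifo) < self.slots

    @property
    def valid(self):
        # Valid: non-empty, so the EB is driving valid data.
        return len(self.fifo) > 0


def step(channel, source, sink_ready):
    """Advance the whole channel by one cycle.

    channel: list of ElasticBuffer, ordered from sender to receiver.
    source: deque of flits waiting to enter the channel.
    Returns the flit handed to the sink this cycle, or None.
    """
    # Assumption: every stage decides from the state at the start of the cycle.
    valid = [eb.valid for eb in channel]
    ready = [eb.ready for eb in channel] + [sink_ready]
    delivered = None
    for i in reversed(range(len(channel))):
        if valid[i] and ready[i + 1]:
            flit = channel[i].fifo.popleft()
            if i + 1 == len(channel):
                delivered = flit
            else:
                channel[i + 1].fifo.append(flit)
    if source and ready[0]:
        channel[0].fifo.append(source.popleft())
    return delivered


if __name__ == "__main__":
    channel = [ElasticBuffer() for _ in range(3)]
    source = deque(range(5))
    # prints [None, None, None, 0, 1, 2, 3, 4]
    print([step(channel, source, sink_ready=True) for _ in range(8)])
```

With the sink held not ready, the three EBs in this sketch absorb six flits before the source stalls, which is the sense in which the channel acts as a distributed FIFO.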
Outline
Building EB channels
EB router design
• The implications in router design
Deadlock avoidance & congestion sensing
Evaluation results
Use EB Flow-Control Through the Router
VC input-buffered router
EB router
Input buffer replaced by input EB.
VC & SW allocators removed. Per-output arbiters instead.
Three-slot output EB to cover for arbitration done one cycle in advance.
LA routing also applicable to EB networks.
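A per-output arbiter can be sketched as follows. The slides do not state the arbitration policy, so round-robin is an assumption here, and `RoundRobinArbiter` and `arbitrate` are hypothetical names. The point of the sketch is that, without VCs, each input EB presents a single head flit requesting a single output, so every output can pick a winner independently.

```python
class RoundRobinArbiter:
    """Hypothetical round-robin arbiter for a single output port."""

    def __init__(self, num_inputs):
        self.num_inputs = num_inputs
        # Start so that input 0 has the highest priority on the first grant.
        self.last = num_inputs - 1

    def grant(self, requests):
        """requests[i] is True if input i wants this output. Returns an index or None."""
        for offset in range(1, self.num_inputs + 1):
            i = (self.last + offset) % self.num_inputs
            if requests[i]:
                self.last = i
                return i
        return None


def arbitrate(requested_output, arbiters):
    """requested_output[i] is the output index input i requests, or None.

    Returns a dict mapping each granted output to the winning input.
    """
    grants = {}
    for out, arbiter in enumerate(arbiters):
        winner = arbiter.grant([r == out for r in requested_output])
        if winner is not None:
            grants[out] = winner
    return grants
```

For example, with three inputs requesting outputs `[0, 0, 1]`, repeated calls alternate the grant for output 0 between inputs 0 and 1 while input 2 keeps winning output 1.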
Two Improved Router Designs
Enhanced two-stage
• Fixes baseline design’s main inefficiencies
• Prioritizes cycle time
Single-stage
• Removes pipelining overhead
• Prioritizes latency
Outline
Building EB channels
EB router design
Deadlock avoidance & congestion sensing
• How to provide traffic classes
Evaluation results
Deadlock Avoidance
No input buffers, so no virtual channels
Can provide traffic isolation with duplicate physical channels
• Duplicating subnetworks is most efficient due to the crossbar's quadratic cost
• That is only true for up to a certain number of classes
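The quadratic-cost argument can be illustrated with a rough calculation. This is a sketch under a stated assumption, not a figure from the talk: crossbar cost is taken to be proportional to inputs times outputs, and all other costs (channels, arbiters, the limit on the number of classes mentioned above) are ignored. The function names are hypothetical.

```python
def crossbar_cost(ports):
    # Assumption: cost grows with inputs x outputs, i.e. ports squared.
    return ports * ports


def duplicated_channels_cost(ports, classes):
    # One router whose crossbar serves every duplicated channel.
    return crossbar_cost(ports * classes)


def duplicated_subnetworks_cost(ports, classes):
    # One independent, original-size crossbar per traffic class.
    return classes * crossbar_cost(ports)


if __name__ == "__main__":
    for classes in (1, 2, 4):
        print(classes,
              duplicated_channels_cost(5, classes),
              duplicated_subnetworks_cost(5, classes))
```

Under this model the single large crossbar costs `classes` times more than the duplicated subnetworks, for example 100 versus 50 units for a 5-port router with two classes.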
Hybrid EB-VC Router
For many classes, have an input buffer to drain flits after a predefined number of blocking cycles
Thus, buffer is used only to alleviate heavy contention and resolve deadlocks
• In the common case, as energy efficient as EB networks
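The drain-after-blocking rule can be sketched as a per-input counter. This is a simplified illustration, not the hybrid router's actual logic: `HybridInput`, `threshold`, and `buffer_slots` are hypothetical names, and the sketch assumes the counter resets whenever the head flit advances or the input is empty.

```python
from collections import deque


class HybridInput:
    """Hypothetical model of one hybrid EB-VC input port."""

    def __init__(self, threshold, buffer_slots):
        self.threshold = threshold        # predefined number of blocking cycles
        self.buffer_slots = buffer_slots  # capacity of the input buffer
        self.blocked_cycles = 0
        self.buffer = deque()

    def cycle(self, head_flit, can_advance):
        """Return what happened to the head flit this cycle."""
        if head_flit is None:
            self.blocked_cycles = 0
            return "idle"
        if can_advance:
            # Common case: the flit moves on as in a plain EB router.
            self.blocked_cycles = 0
            return "forwarded"
        self.blocked_cycles += 1
        if (self.blocked_cycles >= self.threshold
                and len(self.buffer) < self.buffer_slots):
            # Heavy contention or possible deadlock: drain into the buffer.
            self.buffer.append(head_flit)
            return "buffered"
        return "blocked"
```

In this sketch the buffer is never written while flits keep advancing, which mirrors the claim that the common case costs no more energy than a plain EB network.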
Output Channel Occupancy Load Metric
Flit-buffered networks use credit count
EB networks measure output channel occupancy
• At a certain segment of the output channel (shown in red)
• Occupancy decremented when flits leave that segment
• Incremented by a packet’s length when routing decision is made. Packets see other decisions in same cycle
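The occupancy counter described above can be sketched as follows. This is a minimal illustration with hypothetical names (`OutputLoad`, `route_cycle`), and it assumes an adaptive choice of the least-occupied candidate output, which the slide implies but does not spell out. Decisions are applied immediately, so later packets in the same cycle see earlier ones.

```python
class OutputLoad:
    """Hypothetical occupancy estimate for one output channel segment."""

    def __init__(self):
        self.occupancy = 0

    def packet_routed(self, length_in_flits):
        # Incremented by the packet's length when the routing decision is made.
        self.occupancy += length_in_flits

    def flit_left_segment(self):
        # Decremented when a flit leaves the monitored segment.
        self.occupancy = max(0, self.occupancy - 1)


def route_cycle(packets, loads):
    """Route all packets that arrive in one cycle.

    packets: list of (length_in_flits, candidate_output_names).
    loads: dict mapping output name to OutputLoad.
    Returns the chosen output name for each packet, in order.
    """
    chosen = []
    for length, candidates in packets:
        best = min(candidates, key=lambda name: loads[name].occupancy)
        loads[best].packet_routed(length)  # visible to the next packet at once
        chosen.append(best)
    return chosen
```

For example, two packets of 4 and 2 flits that can both use "east" or "north" end up on different outputs, because the second packet already sees the first one's 4 flits counted against "east".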
Outline
Building EB channels
EB router design
Deadlock avoidance & congestion sensing
Evaluation results
• Let’s talk numbers
Throughput-Power Mesh (Baseline Router)
EB network improvement:
Same power: 10% increased throughput
Same throughput: 12% reduced power
Throughput gain
EB: 18% lower cycle time. Not taken into account.
Router RTL Implementation
No buffers, VCs, allocators, credits
• VC router had look-ahead routing
Buffers: FF arrays. 2 VCs, 8 slots each
Aspect       VC router   EB router   Savings
Area (μm²)   63,515      14,730      77%
Clock (ns)   3.3         2.7         18%
Power (mW)   2.59        0.12        95%
45nm, LP-CMOS, worst-case
Mesh 5x5 routers. DOR. 64-bit datapath
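The Savings column follows directly from the two router columns. As a quick check, the sketch below recomputes it from the table values; the variable and function names are illustrative only.

```python
# (VC router, EB router) values copied from the table above.
rows = {
    "Area (um^2)": (63515, 14730),
    "Clock (ns)": (3.3, 2.7),
    "Power (mW)": (2.59, 0.12),
}


def savings_percent(vc, eb):
    # Savings of the EB router relative to the VC router, rounded to a whole percent.
    return round(100 * (vc - eb) / vc)


if __name__ == "__main__":
    for aspect, (vc, eb) in rows.items():
        print(f"{aspect}: {savings_percent(vc, eb)}%")
```

This reproduces the 77%, 18%, and 95% figures listed in the table.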
Router Comparison
Baseline: 9% less energy than single-stage, 35% less than enhanced
Enhanced: 26% lower cycle time than single-stage, 42% lower than baseline
Hybrid EB-VC Comparison
Cycle time comparable to VC, not EB routers
Hybrid offers 21% more throughput per unit power than VC, 12% more than EB
The VC network offers 41% more throughput per unit area. The EB network, 49%
Conclusions
EB flow-control uses channels as distributed FIFOs
• Uses the pipeline flip-flops that are required anyway
• Removes input buffers from routers
Provides 43% more throughput per unit power, and 22% more throughput per unit area
• Depends on what fraction of the cost input buffers are
Reduces cycle time up to 67%
Hybrid EB-VC router provides a large number of classes. Input buffer is used only when it has to be
• 21% more throughput per unit power than VC
Remove buffers, keep buffering. Elastic buffers!
Questions?