The Design And Implementation Of A Low Latency On Chip Network
-
Upload
nirmala-last -
Category
Technology
-
view
774 -
download
0
Transcript of The Design And Implementation Of A Low Latency On Chip Network
![Page 1: The Design And Implementation Of A Low Latency On Chip Network](https://reader037.fdocuments.us/reader037/viewer/2022110121/5591b30f1a28abe71f8b4597/html5/thumbnails/1.jpg)
The Design and Implementation of a Low-Latency On-Chip Network
Robert Mullins
11th Asia and South Pacific Design Automation Conference (ASP-DAC), Jan 24-27th, 2006, Yokohama, Japan.
![Page 2: The Design And Implementation Of A Low Latency On Chip Network](https://reader037.fdocuments.us/reader037/viewer/2022110121/5591b30f1a28abe71f8b4597/html5/thumbnails/2.jpg)
2
Introduction
• Current economic and technology scaling trends will force a step change in computing architectures and approaches to VLSI design
• Design methodologies will shift from computation-centric to communication-centric ones.
• This talk will examine a major component of such approaches: the on-chip network
![Page 3: The Design And Implementation Of A Low Latency On Chip Network](https://reader037.fdocuments.us/reader037/viewer/2022110121/5591b30f1a28abe71f8b4597/html5/thumbnails/3.jpg)
3
Economic Trends
• Falling chip design budgets– Hardware budgets squeezed as software
complexity grows – Rising Non-Recoverable Engineering (NRE)
costs as fabrication technologies scale
• Continued time to market pressures
• Need to reduce complexity and risk
![Page 4: The Design And Implementation Of A Low Latency On Chip Network](https://reader037.fdocuments.us/reader037/viewer/2022110121/5591b30f1a28abe71f8b4597/html5/thumbnails/4.jpg)
4
Technology Scaling Trends
• Interconnect scales poorly – Begins to dominate delay, power budgets and area
• Benefits of regular interconnects increase– Ability to better optimise power and delay– Reduced verification effort– Simple to analyse, low risk
• Yield and reliability issues– Fault tolerant design, remapping and reconfiguration
• Power limited designs– Optimizing power boosts performance
![Page 5: The Design And Implementation Of A Low Latency On Chip Network](https://reader037.fdocuments.us/reader037/viewer/2022110121/5591b30f1a28abe71f8b4597/html5/thumbnails/5.jpg)
5
Design Trends
• Systems will be continue to be composed from larger numbers of IP blocks
• Increasing use of coarse-grain parallelism– The last remaining tool to maintain historical
performance gains in a power constrained environment
• Economic and risk pressures are forcing designs to become increasingly programmable and general purpose– Ability to map many applications to a single
chip
![Page 6: The Design And Implementation Of A Low Latency On Chip Network](https://reader037.fdocuments.us/reader037/viewer/2022110121/5591b30f1a28abe71f8b4597/html5/thumbnails/6.jpg)
6
Communication-Centric SoC Design• Scalable communication
infrastructure– Regular and optimised
• Network eases application mapping, reuse and integration issues– General purpose interconnect
• Network schedules compute resources:– Optimises/manages power
• Has global view and influence– Manages local thermal budgets– Central to fault tolerant abilities
• Much more than simply a move from buses to networks
eDRAM,SRAM and
CacheBlocks
On-Chip Network
ROM andFlash
Memories
Processors/DSPs
ConfigurableI/Os
Custom IP
FPGAsEmbedded
![Page 7: The Design And Implementation Of A Low Latency On Chip Network](https://reader037.fdocuments.us/reader037/viewer/2022110121/5591b30f1a28abe71f8b4597/html5/thumbnails/7.jpg)
7
Many Challenges• Application mapping• Network topologies• Fault-tolerant techniques• System-level communication-centric power
management• Guaranteeing correctness in these increasingly
distributed systems• Low-power techniques for on-chip networks• …..• This talk will look at:
– Building low-latency on-chip routers– How to clock on-chip networks
![Page 8: The Design And Implementation Of A Low Latency On Chip Network](https://reader037.fdocuments.us/reader037/viewer/2022110121/5591b30f1a28abe71f8b4597/html5/thumbnails/8.jpg)
8
Introduction to On-Chip Networks• All chip-wide
communications are handled by an on-chip network
• Packet-switched network
• Each router contains – Input buffers– Routing logic– Scheduling hardware
• Arbitration
– Crossbar
![Page 9: The Design And Implementation Of A Low Latency On Chip Network](https://reader037.fdocuments.us/reader037/viewer/2022110121/5591b30f1a28abe71f8b4597/html5/thumbnails/9.jpg)
9
Virtual-Channel Flow Control
![Page 10: The Design And Implementation Of A Low Latency On Chip Network](https://reader037.fdocuments.us/reader037/viewer/2022110121/5591b30f1a28abe71f8b4597/html5/thumbnails/10.jpg)
10
Synchronous Router Pipeline
• Router Pipeline may be many stages– Increases communication latency– Can make packet buffers less effective– Incurs pipelining overheads
![Page 11: The Design And Implementation Of A Low Latency On Chip Network](https://reader037.fdocuments.us/reader037/viewer/2022110121/5591b30f1a28abe71f8b4597/html5/thumbnails/11.jpg)
11
Speculative Router Architecture
• VC and switch allocation may be performed concurrently:– Speculate that waiting packets will be successful in acquiring a VC– Prioritize non-speculative requests over speculative ones
Li-Shiuan Peh and William J. Dally, “A Delay Model and Speculative Architecture for Pipelined Routers”, In Proceedings HPCA’01, 2001.
![Page 12: The Design And Implementation Of A Low Latency On Chip Network](https://reader037.fdocuments.us/reader037/viewer/2022110121/5591b30f1a28abe71f8b4597/html5/thumbnails/12.jpg)
12
Single Cycle Speculative Router
R. D. Mullins, A. West and S. W. Moore, “Low-Latency Virtual-Channel Routers for On-Chip Networks”, In Proceedings ISCA’04.
![Page 13: The Design And Implementation Of A Low Latency On Chip Network](https://reader037.fdocuments.us/reader037/viewer/2022110121/5591b30f1a28abe71f8b4597/html5/thumbnails/13.jpg)
13
Basic Concept
• Consider two extremes of operation:• Multiple flits are queued waiting for access to the
same output port– We have all the information we need to schedule the
output port accurately ahead of time
• No requests are outstanding for a particular output port– In this case we speculate that arbitration will be
unnecessary and permit any new flit to be routed to its required output immediately
– Easy to abort if things go wrong. Just look at newly arriving flits and the output ports they require
![Page 14: The Design And Implementation Of A Low Latency On Chip Network](https://reader037.fdocuments.us/reader037/viewer/2022110121/5591b30f1a28abe71f8b4597/html5/thumbnails/14.jpg)
14
Optimisations
• To produce control signals for the next clock cycle we compute the requests (VC or switch allocation) that we know will remain
• In the case of the VC allocator it is important for performance that this is accurate
• For the switch allocator logic a better trade-off is to minimise this logic and obtain gains through reduction in cycle-time
![Page 15: The Design And Implementation Of A Low Latency On Chip Network](https://reader037.fdocuments.us/reader037/viewer/2022110121/5591b30f1a28abe71f8b4597/html5/thumbnails/15.jpg)
15
Results
Comparison to single-cycle router without speculative optimisations
4x4 mesh network, random traffic, 4 flit (256-bit) packets
![Page 16: The Design And Implementation Of A Low Latency On Chip Network](https://reader037.fdocuments.us/reader037/viewer/2022110121/5591b30f1a28abe71f8b4597/html5/thumbnails/16.jpg)
16
The LOCHSIDE Testchip
• UMC 0.18um Process• 4x4 mesh network, 25mm2
• Single Cycle Routers (router + link = 1 clock)
• May be clocked by both traditional H-tree and DCG
• 4 virtual-channels/input• 80-bit links
– 64-bit data + 16-bit control
• 250MHz (worst-case PVT) 16Gb/s/channel (~35 FO4)
• Approx 5M transistors
TILE
TrafficGenerator, Debug &
Test
R
![Page 17: The Design And Implementation Of A Low Latency On Chip Network](https://reader037.fdocuments.us/reader037/viewer/2022110121/5591b30f1a28abe71f8b4597/html5/thumbnails/17.jpg)
17
Clocking On-Chip Networks
• Challenges:– Clock Distribution Issues
• Challenging due to networks physically distributed implementation
• Potentially a high-frequency clock – Power and skew concerns
– Synchronization• IP Blocks will run at many different or even adaptive clock
frequencies
– What frequency does network run at?• Interesting problem! • Would like to avoid running at max. freq all the time - may not
want to increase latency?
![Page 18: The Design And Implementation Of A Low Latency On Chip Network](https://reader037.fdocuments.us/reader037/viewer/2022110121/5591b30f1a28abe71f8b4597/html5/thumbnails/18.jpg)
18
Data-Driven Clocking
• Idea:– Generate the clock locally at each router– Generate clock pulses only when required!
• Existence of data on router’s input triggers new clock pulses• Local calibrated delay line ensures clock frequency never
exceeds router’s maximum• Clock is aperiodic
![Page 19: The Design And Implementation Of A Low Latency On Chip Network](https://reader037.fdocuments.us/reader037/viewer/2022110121/5591b30f1a28abe71f8b4597/html5/thumbnails/19.jpg)
19
Benefits of Data-Driven Clocking
• Robust value safe synchronization– No synchronization delay if router is quiescent
• Event-driven synchronous system!
• Benefits of asynchronous implementation but router remains fast and simple– Can still exploit synchronous single-cycle router
design – No one single network operating frequency– No global clock!
– Network links can be fully-asynchronous if beneficial
![Page 20: The Design And Implementation Of A Low Latency On Chip Network](https://reader037.fdocuments.us/reader037/viewer/2022110121/5591b30f1a28abe71f8b4597/html5/thumbnails/20.jpg)
20
Data-Driven Clocking• Arbitration is
necessary to determine whether input data is admitted on the subsequent clock cycle or not.
• If there are always input requests waiting the clock will be periodic and operating at its maximum frequency
(See ASYNC’07 paper)
![Page 21: The Design And Implementation Of A Low Latency On Chip Network](https://reader037.fdocuments.us/reader037/viewer/2022110121/5591b30f1a28abe71f8b4597/html5/thumbnails/21.jpg)
21
Summary
• Single cycle speculative routers– Reduce router pipeline to single stage – This provides a significant reduction in
network latency
• Data-driven clocking for on-chip networks– Removes need for global clock– Network router are clocked at rate determined
by traffic
![Page 22: The Design And Implementation Of A Low Latency On Chip Network](https://reader037.fdocuments.us/reader037/viewer/2022110121/5591b30f1a28abe71f8b4597/html5/thumbnails/22.jpg)
22
Conclusion
• Current trends suggest a major shift to a communication-centric approach will be inevitable
• On-chip networks are one important piece of the puzzle!
• Continued performance gains depend on shift in design practices – End of the road for evolutionary advances– Cannot rely on technology alone for gains
![Page 23: The Design And Implementation Of A Low Latency On Chip Network](https://reader037.fdocuments.us/reader037/viewer/2022110121/5591b30f1a28abe71f8b4597/html5/thumbnails/23.jpg)
23
Thank You
Comments/Questions? Email: [email protected]
Papers, slides and tutorial at http://www.cl.cam.ac.uk/users/rdm34
![Page 24: The Design And Implementation Of A Low Latency On Chip Network](https://reader037.fdocuments.us/reader037/viewer/2022110121/5591b30f1a28abe71f8b4597/html5/thumbnails/24.jpg)
24
Other Slides….
![Page 25: The Design And Implementation Of A Low Latency On Chip Network](https://reader037.fdocuments.us/reader037/viewer/2022110121/5591b30f1a28abe71f8b4597/html5/thumbnails/25.jpg)
25
Distributed Clock Generator (DCG)• Exploits self-timed
circuitry to generate and distribute a clock in a distributed fashion
• Low-skew and low-power solution to providing global synchrony
• Topology matches that of a mesh network
• Single Frequency
• clock gating? S. Fairbanks and S. Moore “Self-timed circuitry for global clocking”, ASYNC’05