High-Level Interconnect Architectures for FPGAs
Nick Barrow-Williams
Introduction
Continued shrinking of device dimensions introduces new design challenges
Moving data around a chip can now be the limiting factor of performance
Existing interconnection solutions do not scale well
Why do existing solutions not scale?
Global connections are longer
Wire depth increased to counter width decrease
Parasitic capacitive effects increase and cause slow signal propagation
Why do existing solutions not scale?
Existing system-level connection uses buses
Buses increase resource efficiency and decrease wiring congestion
Not suitable for a large number of modules
A network based alternative would offer higher aggregate bandwidth
Why design for FPGA systems?
FPGA silicon area already dominated by wiring
Global wires are limited in number
Increasing gate count only increases wiring congestion
The Solution: Network-on-Chip
Use technologies from network systems
Replace inefficient global wiring with high-level interconnection network
Create scalable systems to handle large numbers of modules
Existing Solutions
Most existing systems are for ASIC designs: Stanford Interconnect, RAW, SCALE, SPIN
PNoC: a solution for FPGAs, but complex and with a high hardware cost
Other simulated solutions exist but few are implemented
Proposal: Two network systems
Existing solutions use either packet switching or circuit switching techniques
Design, implement, test and synthesise one of each to compare performance and hardware cost
Map solutions to an FPGA platform to evaluate hardware cost in current generation systems
Network Architecture Design
Topology requirements: simple, scalable, two-dimensional
Solution: 2D mesh Topology
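As an illustration of why a 2D mesh is simple and scalable, the sketch below lists the routers that a node at a given coordinate connects to. The function names and the 4x4 size are assumptions made for the example; the slides do not state the mesh dimensions, although the coordinates in the testbench output run from 0 to 3.

```python
def mesh_neighbours(x, y, width, height):
    """Return the coordinates of the routers directly linked to node (x, y)."""
    candidates = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    # Nodes on an edge or corner simply have fewer neighbours.
    return [(nx, ny) for nx, ny in candidates
            if 0 <= nx < width and 0 <= ny < height]


def mesh_link_count(width, height):
    """Count the node-to-node links in a width x height mesh."""
    vertical_links = width * (height - 1)
    horizontal_links = height * (width - 1)
    return vertical_links + horizontal_links


# Assumed 4x4 mesh: a corner node has 2 neighbours, an interior node has 4.
corner = mesh_neighbours(0, 0, 4, 4)
interior = mesh_neighbours(1, 1, 4, 4)
links = mesh_link_count(4, 4)
```

Every router has at most four neighbours regardless of mesh size, so the same router design can be reused as the network grows.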
Network Architecture Design
Routing Algorithm
Deterministic: data always follows the same path through the network; simple hardware; sensitive to congestion
Adaptive: paths through the network can change according to load; complex hardware; avoids congestion
Network Architecture Design
When choosing a routing algorithm, two hazards must be avoided:
Deadlock. Solution: use unidirectional wiring and allow each node to make two connections
Livelock. Solution: use deterministic routing
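The slides do not name the deterministic algorithm used. A common choice on a 2D mesh is XY (dimension-order) routing, so the sketch below assumes it: a message first travels along X until the column matches, then along Y. The function names are hypothetical.

```python
def xy_route(src, dst):
    """Return every node visited from src to dst, moving in X first, then Y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


def hop_count(src, dst):
    """Number of links crossed on the XY route."""
    return len(xy_route(src, dst)) - 1


route = xy_route((0, 0), (1, 1))
hops = hop_count((0, 3), (1, 0))
```

Because the route depends only on source and destination, a message always makes progress towards its destination, which is why deterministic routing rules out livelock. The hop count equals the Manhattan distance between the two nodes, which agrees with the Hops values reported in the testbench output.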
Network Architecture Design
Flow control methods
Circuit switched: a circuit request propagates through the network; a path is reserved to the destination; a grant signal propagates back; data is sent, then the circuit is deallocated
Packet switched: packets use a header, body and tail
Wormhole routing: forward the header and body without waiting for the tail; buffers are needed to store stalled packets
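The wormhole scheme can be sketched in software as follows. This is a minimal model under stated assumptions, not the VHDL design: the `Flit` and `OutputPort` names, the idea that the header carries the destination, and the tail carrying the last data word are all hypothetical choices for the example.

```python
from dataclasses import dataclass


@dataclass
class Flit:
    kind: str        # "header", "body" or "tail"
    payload: object  # destination for the header, data otherwise (assumed)


def packetise(dest, words):
    """Split a message into header, body and tail flits."""
    if not words:
        raise ValueError("a packet needs at least one data word")
    flits = [Flit("header", dest)]
    flits += [Flit("body", w) for w in words[:-1]]
    flits.append(Flit("tail", words[-1]))
    return flits


class OutputPort:
    """An output that is held by one input from header to tail."""

    def __init__(self):
        self.owner = None  # input port currently holding this output

    def accept(self, input_port, flit):
        """Return True if the flit is forwarded, False if it must be buffered."""
        if flit.kind == "header":
            if self.owner is not None:
                return False           # output busy: packet stalls in its FIFO
            self.owner = input_port    # header claims the output
            return True
        if self.owner != input_port:
            return False
        if flit.kind == "tail":
            self.owner = None          # tail releases the output
        return True


flits = packetise((1, 0), [10, 20, 30])
port = OutputPort()
forwarded = [port.accept("north", f) for f in flits[:2]]
blocked = port.accept("east", Flit("header", (2, 2)))
```

The header is forwarded as soon as it arrives, and a competing header is refused until the tail has passed, which is the situation the FIFOs exist to absorb.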
Router Design
Each router contains a number of modules
FIFOs (only present in packet switched router)
Address to port-request decoder
Arbiter
Control finite state machines
Crossbar
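To make the arbiter's role concrete, the sketch below models one possible policy in software. The slides do not state which arbitration scheme the routers use; round-robin is assumed here purely for illustration, and the class name is hypothetical.

```python
class RoundRobinArbiter:
    """Grant one of several competing port requests, rotating priority."""

    def __init__(self, n_ports):
        self.n_ports = n_ports
        self.last = n_ports - 1  # so that port 0 has priority on the first grant

    def grant(self, requests):
        """requests is a list of booleans, one per port.

        Returns the index of the granted port, or None if nobody is requesting.
        """
        for offset in range(1, self.n_ports + 1):
            candidate = (self.last + offset) % self.n_ports
            if requests[candidate]:
                self.last = candidate  # the winner gets lowest priority next time
                return candidate
        return None


arbiter = RoundRobinArbiter(4)
requests = [True, False, True, False]
grants = [arbiter.grant(requests) for _ in range(3)]
```

With ports 0 and 2 requesting continuously, the grants alternate between them, so neither input is starved.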
Circuit Switched Router Structure
[Figure: block diagram of the circuit switched router. Signals: Request In, Request Out, Grant In, Grant Out, Data In, Data Out. Blocks: In & Out Ports, Crossbar, FSM, Arbiter, Address to Port Decoder.]
Packet Switched Router Structure
[Figure: block diagram of the packet switched router. Signals: Request From FIFOs, Request In, Write Out, Full In, Grant Out, Data From FIFOs, Data Out, Data In, Full, Write, Grant, Req, Data. Blocks: In & Out Ports, Crossbar, Control, Arbiter, Address to Port Decoder, FIFO, FSM.]
Router Implementation and Testing
Both routers were coded using VHDL
Simulation and testing used a combination of ModelSim and Xilinx ISE 9.1
Ad-hoc tests used for individual modules
VHDL testbench used for system verification
Testbench Structure
[Figure: testbench structure. Blocks: Command File, ReadInput, Input Tables, TestTable, Source, Mesh Network, Sink, OutputTable, Compare, Output File, Clock Gen, Reset Gen, Cycle Count.]
Example output file:
Success: ID: 1 Source : (0,3) Dest : (1,0) Hops : 4 Latency: 34
Success: ID: 2 Source : (0,2) Dest : (1,0) Hops : 3 Latency: 27
Success: ID: 3 Source : (3,2) Dest : (1,1) Hops : 3 Latency: 22
Success: ID: 4 Source : (1,3) Dest : (0,1) Hops : 3 Latency: 22
Success: ID: 5 Source : (3,0) Dest : (3,1) Hops : 1 Latency: 12
Example command file:
#START  SOURCE  DEST  SIZE  ID
# ------------------------------------------------------
2       3 0     0 1   8     1
3       2 0     0 1   2     2
3       2 3     1 1   2     3
4       3 1     1 0   8     4
5       0 3     1 3   7     5
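Output in this format is straightforward to post-process. The sketch below parses the "Success" lines and totals the latency; it is a hypothetical helper written for this example and not part of the VHDL testbench.

```python
import re

# Matches one result line of the testbench output file.
LOG_LINE = re.compile(
    r"Success: ID: (\d+) Source : \((\d+),(\d+)\) Dest : \((\d+),(\d+)\) "
    r"Hops : (\d+) Latency: (\d+)"
)


def parse_log(text):
    """Return one dictionary per successful transfer found in text."""
    records = []
    for match in LOG_LINE.finditer(text):
        ident, sx, sy, dx, dy, hops, latency = map(int, match.groups())
        records.append({
            "id": ident,
            "source": (sx, sy),
            "dest": (dx, dy),
            "hops": hops,
            "latency": latency,
        })
    return records


def manhattan(a, b):
    """Minimal hop count between two mesh nodes."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


sample = (
    "Success: ID: 1 Source : (0,3) Dest : (1,0) Hops : 4 Latency: 34\n"
    "Success: ID: 5 Source : (3,0) Dest : (3,1) Hops : 1 Latency: 12\n"
)
records = parse_log(sample)
total_latency = sum(r["latency"] for r in records)
minimal = all(r["hops"] == manhattan(r["source"], r["dest"]) for r in records)
```

Checking the reported hop count against the Manhattan distance is a quick sanity test that every message took a minimal path.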
Synthesis
Each router was synthesised for a Virtex-4 LX platform
Post-synthesis verification
Resource usage
Timing
Circuit Switched Resource Usage
[Figure: LUT and flip-flop usage charts]
Total of 586 4-input LUTs
~0.1% of a Virtex 5
Total of 202 flip-flops
Packet Switched Resource Usage
[Figure: LUT and flip-flop usage charts]
Total of 786 4-input LUTs
+34% compared to circuit switched
Total of 237 flip-flops
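The overhead figure follows directly from the two LUT totals. The short calculation below reproduces it and applies the same formula to the flip-flop totals; the flip-flop percentage is derived here from the reported counts and does not appear on the slides.

```python
def relative_increase(baseline, value):
    """Percentage increase of value over baseline."""
    return 100.0 * (value - baseline) / baseline


# Circuit switched totals are the baseline, packet switched the comparison.
lut_overhead = relative_increase(586, 786)   # about 34%
ff_overhead = relative_increase(202, 237)    # about 17%
```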
Timing Results
             Circuit Switched   Packet Switched
Max Freq     126.330 MHz        144.533 MHz
Setup time   5.308 ns           6.125 ns
Hold time    0.272 ns           0.272 ns
Critical path is through Arbiter in both designs
Project Appraisal
Maintaining an accurate software simulation proved difficult
A great deal was learnt during the implementation of the circuit switched network
HDL implementations are only prototypes
Testbench provides a good framework but more time is needed to gather performance data
Conclusions
Possible to make low complexity network-on-chip systems suitable for FPGAs
Latency has to be traded for throughput
Hard to collect performance data without application driven benchmarks
Both networks are viable so why not use both?
Future Work
Cycle accurate software simulations
Application driven benchmarking
Serial transmission
Power efficiency
Industry standard solution