
High-Level Interconnect Architectures for FPGAs

Nick Barrow-Williams

Introduction

Continued shrinking of device dimensions introduces new design challenges

Moving data around a chip can now be the limiting factor of performance

Existing interconnection solutions do not scale well


Why do existing solutions not scale?

Global connections are longer

Wire depth increased to counter width decrease

Parasitic capacitive effects increase and cause slow signal propagation


Why do existing solutions not scale?

Existing system-level connection uses buses

Buses increase resource efficiency and decrease wiring congestion

Not suitable for a large number of modules

A network based alternative would offer higher aggregate bandwidth


Why design for FPGA systems?

FPGA silicon area already dominated by wiring

Global wires are limited in number

Increasing gate count only increases wiring congestion


The Solution: Network-on-Chip

Use technologies from network systems

Replace inefficient global wiring with high-level interconnection network

Create scalable systems to handle large numbers of modules


Existing Solutions

Most existing systems are for ASIC designs: Stanford Interconnect, RAW, SCALE, SPIN

PNoC: a solution for FPGAs, but complex and with high hardware cost

Other simulated solutions exist but few are implemented


Proposal: Two network systems

Existing solutions use either packet switching or circuit switching techniques

Design, implement, test and synthesise one of each to compare performance and hardware cost

Map solutions to an FPGA platform to evaluate hardware cost in current generation systems


Network Architecture Design

Topology requirements: simple, scalable, two-dimensional

Solution: 2D mesh topology


Network Architecture Design

Routing Algorithm

Deterministic: data always follows the same path through the network; simple hardware; sensitive to congestion

Adaptive: paths through the network can change according to load; complex hardware; avoids congestion
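To illustrate deterministic routing on a 2D mesh, the sketch below uses dimension-order (XY) routing: a packet travels along X until its column matches the destination, then along Y. The slides do not name the deterministic algorithm used by the routers, so XY routing, the port names and the `(x, y)` coordinate convention are assumptions made purely for illustration.

```python
# Hypothetical sketch: dimension-order (XY) routing on a 2D mesh.
# Port names and the (x, y) convention are assumptions, not taken from the slides.

STEP = {"EAST": (1, 0), "WEST": (-1, 0), "NORTH": (0, 1), "SOUTH": (0, -1)}


def next_port(current, dest):
    """Return the output port a router at `current` picks for `dest`."""
    cx, cy = current
    dx, dy = dest
    if dx > cx:
        return "EAST"
    if dx < cx:
        return "WEST"
    if dy > cy:
        return "NORTH"
    if dy < cy:
        return "SOUTH"
    return "LOCAL"  # the packet has reached its destination


def route(source, dest):
    """List every node visited from `source` to `dest`, inclusive."""
    path = [source]
    node = source
    while True:
        port = next_port(node, dest)
        if port == "LOCAL":
            return path
        sx, sy = STEP[port]
        node = (node[0] + sx, node[1] + sy)
        path.append(node)
```

Because the choice depends only on the current and destination coordinates, the same source and destination always produce the same path, which is the property that keeps the hardware simple and also makes it sensitive to congestion.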


Network Architecture Design

When choosing a routing algorithm, two hazards must be avoided:

Deadlock. Solution: use unidirectional wiring and allow each node to make two connections

Livelock. Solution: use deterministic routing


Network Architecture Design

Flow control methods

Circuit switched: a circuit request propagates through the network; a path is reserved to the destination; a grant signal propagates back; data is sent, then the circuit is deallocated

Packet switched: packets use a header, body and tail; wormhole routing forwards the header and body without waiting for the tail; buffers are needed to store stalled packets
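The packet switched scheme can be sketched as follows. This is a minimal model, not the router's actual behaviour: it assumes one flit moves per cycle, a single FIFO stage, and a downstream link that is blocked for a fixed number of cycles. The flit tags and the function names `make_packet` and `forward` are hypothetical.

```python
# Hypothetical sketch of header/body/tail framing and wormhole-style forwarding.
# One FIFO stage, one flit per cycle; all names and timing rules are assumptions.
from collections import deque


def make_packet(dest, payload):
    """Frame a non-empty payload as header, body and tail flits."""
    if not payload:
        raise ValueError("payload must contain at least one word")
    flits = [("HEAD", dest)]
    flits += [("BODY", word) for word in payload[:-1]]
    flits.append(("TAIL", payload[-1]))
    return flits


def forward(flits, fifo_depth, stalled_cycles):
    """Pass flits through one FIFO whose output is blocked at first.

    Returns (list of (cycle, flit) in departure order, peak FIFO occupancy).
    """
    if fifo_depth < 1:
        raise ValueError("fifo_depth must be at least 1")
    fifo, incoming = deque(), deque(flits)
    sent, peak, cycle = [], 0, 0
    while incoming or fifo:
        # Forward the oldest flit as soon as the output is free,
        # without waiting for the tail to arrive.
        if cycle >= stalled_cycles and fifo:
            sent.append((cycle, fifo.popleft()))
        # Accept a new flit only if the buffer has room.
        if incoming and len(fifo) < fifo_depth:
            fifo.append(incoming.popleft())
        peak = max(peak, len(fifo))
        cycle += 1
    return sent, peak
```

With no stall the header leaves the FIFO before the tail has entered it; with a stall the FIFO fills up to its depth, which is why the packet switched router needs buffers.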


Router Design

Each router contains a number of modules:

FIFOs (only present in packet switched router)

Address to port-request decoder

Arbiter

Control finite state machines

Crossbar
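As an illustration of the arbiter's role, the sketch below grants one of several competing input-port requests per call using a round-robin policy. The slides do not state which arbitration policy the routers use, so round-robin, the class name `RoundRobinArbiter` and the port numbering are assumptions.

```python
# Hypothetical sketch: a round-robin arbiter choosing among requesting ports.
# The arbitration policy is an assumption; the slides do not specify one.


class RoundRobinArbiter:
    def __init__(self, n_ports):
        self.n_ports = n_ports
        # Start so that port 0 has the highest priority on the first grant.
        self.last = n_ports - 1

    def grant(self, requests):
        """Return the index of the granted port, or None if nobody requests.

        `requests` is a sequence of booleans, one per input port.
        """
        for offset in range(1, self.n_ports + 1):
            port = (self.last + offset) % self.n_ports
            if requests[port]:
                self.last = port  # the winner gets lowest priority next time
                return port
        return None
```

Rotating the priority after each grant prevents one busy port from starving the others, at the cost of a priority chain through every port.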


Circuit Switched Router Structure

Figure: circuit switched router block diagram. Blocks: Address to Port Decoder, Arbiter, FSM, Crossbar. In and out ports: Request In, Request Out, Grant In, Grant Out, Data In, Data Out.


Packet Switched Router Structure

Figure: packet switched router block diagram. Blocks: FIFO, FSM, Address to Port Decoder, Arbiter, Control, Crossbar. In and out ports: Request In, Request From FIFOs, Write Out, Full In, Grant Out, Data In, Data From FIFOs, Data Out. FIFO signals: Full, Write, Grant, Req, Data.


Router Implementation and Testing

Both routers were coded using VHDL

Simulation and testing used a combination of ModelSim and Xilinx ISE 9.1

Ad-hoc tests used for individual modules

VHDL testbench used for system verification


Testbench Structure

Figure: testbench block diagram. Blocks: Command File, ReadInput, Input Tables (Test Table, Output Table), Source, Mesh Network, Sink, Compare, Output File, Clock Gen, Reset Gen, Cycle Count.

Example command file:

#START SOURCE DEST SIZE ID
# ------------------------------------------------------
2 3 0 0 1 8 1
3 2 0 0 1 2 2
3 2 3 1 1 2 3
4 3 1 1 0 8 4
5 0 3 1 3 7 5

Example output file:

Success: ID: 1 Source : (0,3) Dest : (1,0) Hops : 4 Latency: 34
Success: ID: 2 Source : (0,2) Dest : (1,0) Hops : 3 Latency: 27
Success: ID: 3 Source : (3,2) Dest : (1,1) Hops : 3 Latency: 22
Success: ID: 4 Source : (1,3) Dest : (0,1) Hops : 3 Latency: 22
Success: ID: 5 Source : (3,0) Dest : (3,1) Hops : 1 Latency: 12
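The relationship between the command file and the reported hop counts can be checked with a short script. The field layout used here (start cycle, two source coordinates, two destination coordinates, size, ID) and the coordinate order are inferred by matching the sample command file against the sample output, not stated on the slide, so treat them as assumptions. Under that reading, every reported hop count equals the Manhattan distance between source and destination, as expected for minimal routing on a 2D mesh.

```python
# Hypothetical sketch: parse the sample command file and recompute hop counts.
# The seven-field layout and coordinate order are inferred, not documented.

COMMANDS = """\
#START SOURCE DEST SIZE ID
# ------------------------------------------------------
2 3 0 0 1 8 1
3 2 0 0 1 2 2
3 2 3 1 1 2 3
4 3 1 1 0 8 4
5 0 3 1 3 7 5
"""

# Hop counts copied from the sample output file, keyed by transfer ID.
REPORTED_HOPS = {1: 4, 2: 3, 3: 3, 4: 3, 5: 1}


def parse_commands(text):
    """Turn each non-comment line into a dict describing one transfer."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        start, s0, s1, d0, d1, size, ident = map(int, line.split())
        rows.append({
            "start": start,
            # Swapped so the tuples match the order printed in the output file.
            "source": (s1, s0),
            "dest": (d1, d0),
            "size": size,
            "id": ident,
        })
    return rows


def hops(source, dest):
    """Manhattan distance between two mesh nodes."""
    return abs(source[0] - dest[0]) + abs(source[1] - dest[1])


for row in parse_commands(COMMANDS):
    status = "ok" if hops(row["source"], row["dest"]) == REPORTED_HOPS[row["id"]] else "MISMATCH"
    print(row["id"], row["source"], row["dest"], status)
```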


Synthesis

Each router was synthesised for a Virtex-4 LX platform

Post-synthesis verification

Resource usage

Timing


Circuit Switched Resource Usage

Figure: charts of LUT and flip-flop usage.

Total of 586 4-input LUTs

~0.1% of a Virtex 5

Total of 202 flip-flops


Packet Switched Resource Usage

Figure: charts of LUT and flip-flop usage.

Total of 786 4-input LUTs

+34% compared to circuit switched

Total of 237 flip-flops
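The quoted overhead follows directly from the two LUT totals; the same arithmetic applied to the flip-flop totals gives a figure the slides do not state. The helper name `percent_increase` is hypothetical.

```python
# Relative resource overhead of the packet switched router, computed from
# the totals quoted on the resource usage slides.


def percent_increase(baseline, value):
    """Percentage by which `value` exceeds `baseline`."""
    return 100.0 * (value - baseline) / baseline


lut_overhead = percent_increase(586, 786)  # LUTs: circuit vs packet switched
ff_overhead = percent_increase(202, 237)   # flip-flops: circuit vs packet switched

print(f"LUT overhead: {lut_overhead:.1f}%")
print(f"Flip-flop overhead: {ff_overhead:.1f}%")
```

The LUT figure rounds to the +34% quoted above, while the flip-flop increase is roughly half that.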


Timing Results

              Circuit Switched    Packet Switched
Max Freq      126.330 MHz         144.533 MHz
Setup time    5.308 ns            6.125 ns
Hold time     0.272 ns            0.272 ns

Critical path is through Arbiter in both designs


Project Appraisal

Maintaining an accurate software simulation proved difficult

A great deal was learnt during the implementation of the circuit switched network

HDL implementations are only prototypes

Testbench provides a good framework but more time is needed to gather performance data


Conclusions

Possible to make low complexity network-on-chip systems suitable for FPGAs

Latency has to be traded for throughput

Hard to collect performance data without application driven benchmarks

Both networks are viable so why not use both?


Future Work

Cycle accurate software simulations

Application driven benchmarking

Serial transmission

Power efficiency

Industry standard solution
