Marvin Tom University of British Columbia Department of Electrical and Computer Engineering

Channel Width Reduction Techniques for System-on-Chip Circuits in Field-Programmable Gate Arrays

Marvin Tom

University of British Columbia
Department of Electrical and Computer Engineering

Vancouver, BC, Canada

Contributions
• Two new FPGA benchmark circuit “suites”
  – Meta Circuit: mimic “System-on-Chip” design by randomly “stitching” real designs
  – Stdev: synthetic clones of Meta Circuit, used to vary interconnect demand
• Two new FPGA CAD flows
  – DHPack: Design Hierarchy Packing
    • Identify congested IP blocks → depopulate → reduced interconnect demand
    • Conference paper: “Logic Block Clustering…”, published at DAC 2005
  – Un/DoPack: UnPack and DoPack
    • Find “local” interconnect congestion → depopulate → reduced interconnect demand
    • Conference paper, submitted to DAC 2006
• Discoveries…
  – “Non-uniform” depopulation limits area inflation
  – “BLE limiting” gives better interconnect controllability than “Input limiting”
  – “Interconnect variation” important for area inflation and FPGA architecture design
  – “Routing closure” achieved by re-clustering and incremental place & route
• UNROUTABLE circuits made ROUTABLE: buy an FPGA with MORE LOGIC!!!

Mesh-Based FPGA Architecture
• 9 logic blocks
• 4 wires per channel
• 3*4=12 total horizontal tracks
[Figure: 3 x 3 array of logic blocks (L)]
• Larger FPGAs have more “aggregate” interconnect
• 16 logic blocks
• 4 wires per channel
• 4*4=16 total horizontal tracks
[Figure: 4 x 4 array of logic blocks (L)]
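The track arithmetic above can be sketched in a few lines. This is only an illustration that follows the slide's own counting (one horizontal channel per row of logic blocks); the function name is hypothetical.

```python
def total_horizontal_tracks(rows: int, wires_per_channel: int) -> int:
    """Aggregate horizontal tracks, counted the way the slide does:
    one channel per row of logic blocks (an assumption taken from 3*4=12)."""
    return rows * wires_per_channel


# A 3 x 3 array and a 4 x 4 array, both with 4 wires per channel.
small = total_horizontal_tracks(3, 4)
large = total_horizontal_tracks(4, 4)
print(small, large)
```

The channel width stays fixed at 4, yet the larger array offers more tracks in total, which is the "aggregate" interconnect the slide refers to.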

Logic Utilization vs. Channel Width
• Trade-off logic utilization for channel width
  – User can always buy more logic… (not more wires)
[Figure: scatter plot of Routed Channel Width (y-axis, 10 to 90) vs. CLB Count (x-axis, 0 to 700) for the circuits alu4, apex2, apex4, bigkey, clma, des, diffeq, dsip, elliptic, ex1010, ex5p, frisc, misex3, pdc, s298, s38417, s38584, seq, spla, tseng; labeled points: pdc, ex1010, frisc, spla, apex4, elliptic]
[Figure: FPGA 1 vs. FPGA 2, arrays of logic blocks (L). Trade-off: CLB count for channel width]
But… can we achieve lower Total Area? ( = SIZE * CLB Count)
( No! But we can break even! )
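A minimal sketch of the break-even idea, using Total Area = SIZE * CLB Count from the slide. The tile-size model (a fixed logic area plus a per-track routing area) and both constants are assumptions made for illustration, not values from the talk.

```python
def tile_size(channel_width, logic_area=100.0, area_per_track=2.0):
    """Hypothetical tile model: fixed logic area plus routing area per track."""
    return logic_area + area_per_track * channel_width


def total_area(clb_count, channel_width):
    """Total Area = SIZE * CLB Count, as on the slide."""
    return tile_size(channel_width) * clb_count


def break_even_clb_count(clb_count, old_width, new_width):
    """CLB count at the narrower channel width that gives the same total area."""
    return clb_count * tile_size(old_width) / tile_size(new_width)


# With these made-up constants, dropping from 95 to 75 tracks lets the
# design grow from 1000 CLBs to about 1160 CLBs at equal total area.
budget = break_even_clb_count(1000, 95, 75)
print(budget)
```

Under this model, depopulation breaks even as long as the extra CLBs it needs stay within that budget.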

Logic Element: BLE and CLB
• Basic Logic Element (BLE)
  – ‘k’-input LUT + FF
• Configurable Logic Block (CLB)
  – ‘N’ BLEs, ‘N’ outputs
  – ‘I’ shared inputs
[Figure: CLB containing BLE #1 to BLE #5, with ‘I’ inputs and ‘N’ outputs; 4 x 4 array of logic blocks (L)]
Note: I < k*N
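Because I < k*N, a cluster only fits if its BLEs share enough inputs. The check below is a sketch of that constraint. The `BLE` type, the net names, and the assumption that a BLE output used inside the same CLB does not consume a CLB input are all illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BLE:
    output: str          # net driven by this BLE
    inputs: frozenset    # nets feeding the k-input LUT


def external_inputs(bles):
    """Distinct nets that must enter the CLB from outside.
    Assumes locally generated outputs are fed back inside the CLB."""
    produced = {b.output for b in bles}
    used = set().union(*(b.inputs for b in bles))
    return used - produced


def fits_in_clb(bles, n, i, k):
    """True if the BLEs respect the N (BLE), I (input) and k (LUT size) limits."""
    return (len(bles) <= n
            and all(len(b.inputs) <= k for b in bles)
            and len(external_inputs(bles)) <= i)


a = BLE("x", frozenset({"p", "q"}))
b = BLE("y", frozenset({"x", "q", "r"}))   # shares q, uses x locally
print(sorted(external_inputs([a, b])))
```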

CLB Depopulation
• General Approach
  – Use existing clustering tools
  – Do not fill CLB while clustering
1. Input-Limited
  • E.g. maximum 67% input utilization per CLB
  • Might use all BLEs
2. BLE-Limited
  • E.g. maximum 60% BLE utilization per CLB
  • Might use all inputs
[Figure: CLB containing BLE #1 to BLE #5, with ‘I’ inputs and ‘N’ outputs]
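The two limits can be turned into concrete caps for the clustering tool. This sketch assumes the percentage is rounded down to a whole BLE or input; the talk does not say how rounding is handled, and the function name is hypothetical.

```python
def depopulation_limits(n_bles, n_inputs, ble_pct=100, input_pct=100):
    """Return (max BLEs, max inputs) a depopulated CLB may use.
    Integer arithmetic, rounding down (an assumption)."""
    return (n_bles * ble_pct) // 100, (n_inputs * input_pct) // 100


# Example CLB with N = 16 BLEs and I = 51 inputs.
input_limited = depopulation_limits(16, 51, input_pct=67)   # all BLEs still allowed
ble_limited = depopulation_limits(16, 51, ble_pct=60)       # all inputs still allowed
print(input_limited, ble_limited)
```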

Reducing Channel Width Results
(max cluster size 16, max num inputs 51)
• Input-Limited
  – No channel width control
• BLE-Limited
  – (almost) monotonically increasing → good channel width control
[Figure: Routed Channel Width (y-axis, 30 to 90) for clma vs. Cluster Size (BLE-Limit, 1 to 17) and Number of Inputs (Input-Limit, 6 to 54); series: Input-Limited clma, BLE-Limited clma]

Meta Benchmark Circuit Creation
• Mimic process of creating large designs
  – “IP Blocks” <==> MCNC Circuits
  – SoC <==> Randomly integrate/stitch together “IP Blocks”
  – IP Blocks have varied interconnect needs
• Considered 3 stitching schemes…
  – Independent
    • IP Blocks are not connected to each other
  – Pipeline
    • Outputs of one IP block connected to inputs of next IP block
  – Clique
    • Outputs of each IP block are uniformly distributed to inputs of all other IP blocks
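The three stitching schemes can be sketched as connection generators. This is an illustration only: blocks are reduced to (name, input count, output count), a deterministic round-robin stands in for the random stitching, and each block input is driven by at most one output. None of these details come from the talk.

```python
def stitch(blocks, scheme):
    """Return connections as (src_block, src_output, dst_block, dst_input)."""
    names = [name for name, _, _ in blocks]
    n_in = {name: i for name, i, _ in blocks}
    n_out = {name: o for name, _, o in blocks}
    next_free = {name: 0 for name in names}   # next undriven input per block
    edges = []

    def connect(src, out_idx, dst):
        # Skip the connection when the destination has no free input left.
        if next_free[dst] < n_in[dst]:
            edges.append((src, out_idx, dst, next_free[dst]))
            next_free[dst] += 1

    if scheme == "independent":
        return edges                           # blocks stay unconnected
    if scheme == "pipeline":
        for src, dst in zip(names, names[1:]):
            for o in range(n_out[src]):
                connect(src, o, dst)
        return edges
    if scheme == "clique":
        for src in names:
            others = [n for n in names if n != src]
            if not others:
                continue
            for o in range(n_out[src]):
                connect(src, o, others[o % len(others)])
        return edges
    raise ValueError(f"unknown scheme: {scheme}")


demo = [("a", 2, 2), ("b", 2, 2), ("c", 2, 2)]
print(stitch(demo, "pipeline"))
```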

DHPack: Meta Circuit P&R
• Use VPR FPGA tools from University of Toronto
• Observation 1
  – VPR placer successfully groups IP blocks from random initial placement
• Observation 2
  – VPR router confirms channel width of MetaCircuit is dominated by a few IP blocks { pdc, clma, ex1010 }

DHPack: Meta Circuit P&R Results
• Clique MetaCircuit
  – P&R channel width results closely match “constraints”
• Shrink Channel Width
  – by ~20% (from 95 to 75): NO AREA INCREASE
  – by ~50% (from 95 to 50): 1.7x area increase
[Figure: Normalized Area (y-axis, 0.8 to 2) vs. Channel Width Constraint (x-axis, 40 to 100)]
[Figure: Channel Width (y-axis, 40 to 100) vs. Channel Width Constraint (x-axis, 40 to 100); series: Constraint, Routed]
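The quoted reductions follow directly from the channel widths on the slide. The helper below is a hypothetical name used only to make the arithmetic explicit.

```python
def percent_reduction(before, after):
    """Relative reduction in channel width, in percent."""
    return 100.0 * (before - after) / before


mild = percent_reduction(95, 75)    # the "~20%" case, no area increase
deep = percent_reduction(95, 50)    # the "~50%" case, 1.7x area increase
print(round(mild, 1), round(deep, 1))
```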

Meta Circuits vs. Stdev Circuits
• Meta Circuit Drawbacks
  – Design hierarchy boundaries not well-defined
  – Coarse-grained IP block boundary
  – Stitching unrealistic
    • Flip-flop placed at every output
    • Connections only have a fanout of 1 (FO1)
• Stdev Circuits (created using GNL)
  – Synthetic clone of Meta circuits
  – Hierarchically specify the Rent parameter of each partition
    • Root: # I/Os, # IP blocks
    • Second level: 20 IP blocks, # LEs, Rent parameter

Stdev Circuits: Rent Parameters
• 7 benchmark circuits
• 240/120 primary inputs/outputs, approx 52,000 CLBs
• Rent parameter: Average 0.62, vary Stdev 0.0 to 0.12
[Figure: Rent Parameter (y-axis, 0.35 to 0.80) per MCNC Circuit IP Block (x-axis: bigkey, s38584.1, elliptic, diffeq, s298, alu4, misex3, pdc, ex5p, ex1010); series: Stdev000, Stdev002, Stdev004, Stdev006, Stdev008 / meta clone, Stdev010, Stdev012]
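One way to picture a suite with a fixed average Rent parameter and a chosen standard deviation is sketched below. It is not how GNL or the author produced the circuits: the sampling scheme, seed, and function names are assumptions. Rent's rule itself, T = t * B^p, is the standard relation between terminals T and blocks B.

```python
import random
import statistics


def rent_parameters(count, mean, stdev, seed=0):
    """Draw `count` values whose mean and population stdev match the targets.
    Random normal draws are standardized, then rescaled (illustrative only)."""
    rng = random.Random(seed)
    raw = [rng.gauss(0.0, 1.0) for _ in range(count)]
    mu = statistics.fmean(raw)
    sigma = statistics.pstdev(raw)
    return [mean + stdev * (r - mu) / sigma for r in raw]


def rent_terminals(t, blocks, p):
    """Rent's rule: terminals T = t * B**p for a partition of B blocks."""
    return t * blocks ** p


# 20 second-level IP blocks, average 0.62, stdev 0.08 (the "meta clone" setting).
params = rent_parameters(20, 0.62, 0.08)
print(round(statistics.fmean(params), 3), round(statistics.pstdev(params), 3))
```

A higher Rent parameter means a partition needs more external terminals for the same number of blocks, so a wider spread of parameters gives a wider spread of interconnect demand across IP blocks.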

Un/DoPack Flow
• Iterative non-uniform cluster depopulation tool
• Step 1: Traditional SIS/VPR
• Step 2: UnPack:
  – Congestion Calculator
• Step 3: DoPack:
  – Incremental Re-Cluster
• Step 4,5: Fast Place/Route
[Flowchart. Inputs: Circuit Description, Architecture Description, Channel Width Constraint, Array Size Constraint. Stages: Synthesize and Technology Map (SIS/Flowmap), Cluster (iRAC Replica), Placement (VPR), Routing (VPR), Congestion Calculator (UnPack), Incremental Cluster (DoPack), Fast Placement (Incremental or VPR), Fast Routing (VPR). Decisions: Channel Width Constraint Met? (Yes: Success!), Array Size Limits Reached? (Yes: Failure)]
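The iteration can be sketched as a loop over pluggable stages. Every callable here is a hypothetical stand-in for a real tool (clustering, VPR placement and routing, UnPack, DoPack), and the point at which the array size limit is checked is an assumption about the flowchart.

```python
def undopack(netlist, cw_constraint, max_array_size,
             cluster, place, route, unpack, dopack, array_size):
    """Iterate until the routed channel width meets the constraint.
    Returns the final clustering, or None if the array size limit is hit."""
    clustering = cluster(netlist)              # Step 1: traditional flow
    routed_cw = route(place(clustering))
    while routed_cw > cw_constraint:
        if array_size(clustering) >= max_array_size:
            return None                        # Failure
        congestion = unpack(routed_cw)         # Step 2: congestion calculator
        clustering = dopack(clustering, congestion)   # Step 3: re-cluster
        routed_cw = route(place(clustering))   # Steps 4,5: fast place/route
    return clustering                          # Success


# Toy stand-ins: a "clustering" is just a CLB count, and the routed channel
# width shrinks as CLBs are added. These models are made up for the demo.
toy = dict(
    cluster=lambda netlist: 10,
    place=lambda clbs: clbs,
    route=lambda clbs: 1200 // clbs,
    unpack=lambda cw: cw,
    dopack=lambda clbs, congestion: clbs + 1,
    array_size=lambda clbs: clbs,
)
ok = undopack("demo", cw_constraint=100, max_array_size=20, **toy)
failed = undopack("demo", cw_constraint=100, max_array_size=11, **toy)
print(ok, failed)
```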

Un/DoPack Flow: SIS/VPR
• Step 1: Traditional SIS/VPR
  – Inputs: Circuit Description, Architecture Description, Channel Width Constraint, Array Size Constraint


Un/DoPack Flow: UnPack
• Step 2: UnPack
  – Generate Congestion Map
  – CLB Label = Largest channel width occupancy (CW occ) in 4 adjacent channels
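A sketch of the labeling step for an M x M array. The data layout is an assumption: `h_occ` holds M+1 rows of horizontal channel segments and `v_occ` holds M rows with M+1 vertical channel segments each, so every CLB touches one segment on each of its four sides.

```python
def clb_congestion_labels(h_occ, v_occ):
    """Label each CLB with the largest occupancy among its 4 adjacent channels.
    h_occ[y][x]: horizontal segment below CLB (x, y); h_occ[y + 1][x]: above.
    v_occ[y][x]: vertical segment left of CLB (x, y); v_occ[y][x + 1]: right."""
    m = len(v_occ)
    labels = []
    for y in range(m):
        row = []
        for x in range(m):
            row.append(max(h_occ[y][x], h_occ[y + 1][x],
                           v_occ[y][x], v_occ[y][x + 1]))
        labels.append(row)
    return labels


# 2 x 2 array with made-up occupancies.
h = [[1, 2], [3, 4], [5, 6]]
v = [[7, 0, 1], [2, 9, 3]]
print(clb_congestion_labels(h, v))
```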

• Step 2: UnPack: Depop Center = Largest CLB label
[Figure: M x M array]
• Step 2: UnPack: Depop Radius = M/4
  – Depop Amt: 1 new row/col in array
[Figure: M x M array]
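The center and radius rules can be sketched as a region selection over the CLB labels. Several details are assumptions: Manhattan distance, integer division for M/4, first-found tie-breaking, and reading "1 new row/col" as growing the array from M x M to (M+1) x (M+1).

```python
def depop_region(labels):
    """Return (center, radius, CLBs within the radius) for an M x M label grid."""
    m = len(labels)
    cells = [(y, x) for y in range(m) for x in range(m)]
    center = max(cells, key=lambda c: labels[c[0]][c[1]])   # largest CLB label
    radius = m // 4
    cy, cx = center
    region = [(y, x) for y, x in cells if abs(y - cy) + abs(x - cx) <= radius]
    return center, radius, region


def clbs_gained_by_one_row_col(m):
    """Extra CLB sites when an M x M array grows by one row and one column."""
    return (m + 1) ** 2 - m ** 2


grid = [[0] * 4 for _ in range(4)]
grid[1][2] = 9                      # most congested CLB in a 4 x 4 array
print(depop_region(grid), clbs_gained_by_one_row_col(4))
```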

Un/DoPack Flow: DoPack
• Step 3: DoPack:
  – Incremental Re-Cluster

Un/DoPack Flow: Fast P&R
• Step 4,5: Fast Place/Route
• Fast Placement
  – UBC Incremental Placer (under development)
  – VPR “-fast” option
• Router
  – Use full routed solution
    • Slow but reliable

[Figure: before and after]
Before: 120/79/27 (Peak / Avg / Stddev)
After: 100/79/20 (Peak / Avg / Stddev)
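A Peak / Avg / Stddev triple like the ones above can be computed from a list of per-channel occupancies. The talk does not say which quantity was measured or whether the standard deviation is the population or the sample form; this sketch assumes channel occupancy and the population form.

```python
import statistics


def congestion_summary(occupancies):
    """Return (peak, average, population standard deviation)."""
    return (max(occupancies),
            statistics.fmean(occupancies),
            statistics.pstdev(occupancies))


# Made-up occupancies, not data from the talk.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
print(congestion_summary(sample))
```

In the before/after pair the average is unchanged while the peak and spread drop, which is what evening out local congestion looks like in these three numbers.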

Normalized Area of GNL Benchmarks
[Figure: Normalized Area (y-axis, 0.90 to 2.00) vs. % of Maximum Channel Width (x-axis, 0.5 to 1.05); series: stdev000, stdev002, stdev004, stdev006, stdev008/clone, stdev010, stdev012]

Absolute Area of GNL Benchmarks
[Figure: Normalized Area (y-axis, 0.90 to 2.00) vs. Absolute Channel Width (x-axis, 60 to 170); series: stdev000, stdev002, stdev004, stdev006, stdev008/clone, stdev010, stdev012]

Interconnect Variation: Impact on FPGA Architecture Design
[Figure: Minimum Routed Channel Width (y-axis, 70 to 140); series: Baseline, 10% Area Increase, 20% Area Increase, 25% Area Increase]
High Variation Circuits Require Wide Channel Width


End of Talk
