A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation

Roman Lysecky (a), Frank Vahid (a,*), Sheldon X.-D. Tan (b)

(a) Department of Computer Science and Engineering
(b) Department of Electrical Engineering
University of California, Riverside
(*) Also with the Center for Embedded Computer Systems at UC Irvine

This work was supported in part by the National Science Foundation, the Semiconductor Research Corporation, and Xilinx

Introduction: Standard Binary - Separating Function and Architecture

[Figure: SW compiled by a standard compiler into an x86 binary]

- Software binaries of the past
  - Binary reflected specific language of underlying architecture; limited portability
- Current “standard binary”
  - Concept: separate function from detailed architecture
  - Develop new architectures for existing applications
  - Trend towards dynamic translation and optimization

Introduction: But Today’s Binaries are More than just Software

[Figure: SW compiled by a standard compiler into a SW binary for Processor 1, Processor 2, and Processor 3; SW and HW processed by compiler/synthesis into a binary for platforms that pair a processor with an FPGA]

Introduction: Just-in-Time FPGA Compilation?

- JIT FPGA compilation
  - Idea: standard binary for FPGA
  - Similar benefits as standard binary for microprocessor
    - Portability, transparency, standard tools
- Embedded JIT compilation tools optimized for each FPGA

[Figure: VHDL/Verilog processed by standard CAD tools into a standard HW binary, which JIT FPGA compilation maps onto each FPGA]

Introduction: One Use of JIT FPGA Compilation

[Figure, three panels. (1) A cable TV company ships a feature upgrade as one SW binary to devices with ARM7, ARM9, ARM10, and ARM11 processors. (2) When the devices also contain FPGA 1 through FPGA 4, the upgrade needs HW1 through HW4, so each device gets its own SW binary plus HW netlist 1 through 4. (3) With JIT FPGA compilation, a single SW binary and a single HW binary serve all devices.]

Introduction: Another Use - Warp Processors (Dynamic HW/SW Partitioning)

1. Initially execute application in software only
2. Profile application to determine critical regions
3. Partition critical regions to hardware
4. Program configurable logic and update software binary
5. Partitioned application executes faster with lower energy consumption

[Figure: processor with profiler and dynamic partitioning module (DPM); time and energy bars for SW only vs. HW/SW]

Lysecky/Vahid, DATE’04; Stitt/Lysecky/Vahid, DAC’03; Stitt/Vahid, ICCAD’02
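The slides do not say how the profiler picks critical regions. As an illustration only, the sketch below assumes a simple heuristic: count taken backward branches, since each one marks a loop iteration, and report the most frequent loops as candidates for hardware. The function name, the trace format, and the addresses are all hypothetical.

```python
from collections import Counter

def find_critical_regions(branch_trace, top_n=1):
    """Return the top_n most frequently taken backward branches.

    branch_trace is a list of (source_address, target_address) pairs.
    A branch whose target is at or before its source is treated as a
    loop back-edge (an assumption, not the method from the slides).
    """
    counts = Counter((src, dst) for src, dst in branch_trace if dst <= src)
    return [region for region, _ in counts.most_common(top_n)]

# Hypothetical trace: one hot loop, one warm loop, a few forward branches.
trace = [(0x40, 0x20)] * 1000 + [(0x90, 0x80)] * 50 + [(0x10, 0x30)] * 5
hot_regions = find_critical_regions(trace)
print(hot_regions)
```

In this toy trace the loop spanning addresses 0x20 to 0x40 dominates execution, so it would be the first region handed to the partitioning step.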

Introduction: Another Use - Warp Processors (Dynamic HW/SW Partitioning)

[Figure: DPM (CAD) tool flow. Recoverable labels: Binary, Decompilation, Partitioning, RT Synthesis, Std. HW Binary, JIT FPGA Compilation (Logic Synthesis, Tech. Mapping/Packing, Placement, Routing), HW Bitstream, Binary Updater, Updated Binary]

Lysecky/Vahid, DATE’04; Stitt/Lysecky/Vahid, DAC’03; Stitt/Vahid, ICCAD’02

Introduction: Existing FPGAs Not Suitable for JIT FPGA Compilation

- Existing FPGAs require extremely complex CAD tools
  - Designed to handle large arbitrary circuits, ASIC prototyping, etc.
  - Require long execution times and very large memory usage
  - Not suitable for dynamic on-chip execution

[Figure: memory usage and execution time of existing CAD tools used for JIT FPGA compilation; recoverable labels: 50 MB, 60 MB, 10 MB; 1-2 mins; 2-30 mins]

JIT FPGA Compilation: CAD-Oriented FPGA

- Solution: Develop a custom CAD-oriented FPGA
  - Careful simultaneous design of FPGA and CAD
  - FPGA features evaluated for impact on CAD
- Enables development of fast, lean JIT FPGA compilation tools

[Figure: JIT FPGA compilation stages (Logic Synthesis, Tech. Mapping/Packing, Placement, Routing); recoverable labels: 1s, <1s, 3.6 MB]

Lysecky/Vahid, DATE’04

Simple Configurable Logic Fabric: CAD-Oriented FPGA

- Simple Configurable Logic Fabric (CLF)
  - Hundreds of existing commercial and research FPGA fabrics
    - Most designed to balance circuit density and speed
  - Analyzed FPGA features to determine their impact on CAD
  - Designed our CLF in conjunction with JIT FPGA compilation tools
  - Array of configurable logic blocks (CLBs) surrounded by switch matrices (SMs)
  - Each CLB is directly connected to an SM
    - Along with SM design, allows for design of lean JIT routing

Simple Configurable Logic Fabric: Combinational Logic Block

- Combinational Logic Block
  - Incorporates two 3-input 2-output LUTs
    - Equivalent to four 3-input LUTs with fixed internal routing
    - Allows for good quality circuit while reducing JIT technology mapping complexity
  - Provides routing resources between adjacent CLBs to support carry chains
    - Reduces number of nets we need to route

- FPGAs: Flexibility/Density: Large CLBs, various internal routing resources
- SCLF: Simplicity: Limited internal routing, reduce on-chip CAD complexity

[Figure: CLB with two LUTs, inputs a through f, outputs o1 through o4, and a connection to the adjacent CLB]
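To make the "3-input 2-output LUT" idea concrete, here is a minimal sketch that models one such LUT as two 8-entry truth tables sharing the same three inputs, which is why two of them are equivalent to four 3-input LUTs. The class name, the bit ordering, and the choice of a full adder as the example are assumptions made for illustration; the slides do not describe the CLB at this level.

```python
class Lut3x2:
    """3-input, 2-output LUT: one 8-entry truth table per output."""

    def __init__(self, table0, table1):
        assert len(table0) == 8 and len(table1) == 8
        self.tables = (table0, table1)

    def eval(self, a, b, c):
        # Inputs index both tables; a is the most significant bit (assumed).
        idx = (a << 2) | (b << 1) | c
        return self.tables[0][idx], self.tables[1][idx]


def full_adder_lut():
    """Configure one LUT so output 0 is the sum bit and output 1 the carry."""
    sum_table = [((i >> 2) & 1) ^ ((i >> 1) & 1) ^ (i & 1) for i in range(8)]
    carry_table = [int(bin(i).count("1") >= 2) for i in range(8)]
    return Lut3x2(sum_table, carry_table)


fa = full_adder_lut()
for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            s, cout = fa.eval(a, b, cin)
            print(a, b, cin, "->", s, cout)
```

A full adder fits in a single 3-input 2-output LUT, so a carry chain between adjacent CLBs can pass the carry output along without consuming general routing, which matches the slide's point about reducing the number of nets to route.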

Simple Configurable Logic Fabric: Switch Matrix

[Figure: switch matrix with short channels 0 through 3 and long channels 0L through 3L on each side]

- All nets are routed using only a single pair of channels throughout the configurable logic fabric
  - Each short channel is associated with a single long channel
- Designed for fast, lean JIT FPGA routing

- FPGAs: Flexibility/Speed: Large routing resources, various routing options
- SCLF: Simplicity: Allow for design of fast, lean routing algorithm

JIT FPGA Compilation: Routing

- FPGA Routing
  - Find a path within the FPGA to connect the source and sinks of each net within our hardware circuit
- Pathfinder [Ebeling, et al., 1995]
  - Introduced negotiated congestion
  - During each routing iteration, route nets using shortest path
    - Allows overuse (congestion) of resources
  - If congestion exists (illegal routing)
    - Update cost of congested resources
    - Rip-up all routes and reroute all nets
- VPR [Betz, et al., 1997]
  - Provides various improvements over Pathfinder
  - Routability-driven: Use fewest tracks possible
  - Timing-driven: Optimize circuit speed
  - Many techniques are used in commercial FPGA CAD tools
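The negotiated congestion loop described for Pathfinder can be sketched on a toy graph. This is a simplified illustration, not the published algorithm or VPR's implementation: the cost formula `(1 + history) * (1 + overuse)`, the two-terminal nets, the node names, and both function names are assumptions chosen to keep the example short.

```python
import heapq

def shortest_path(graph, cost, src, dst):
    """Dijkstra over nodes; stepping onto node n costs cost(n)."""
    dist, prev, heap = {src: 0.0}, {}, [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue
        for v in graph[u]:
            nd = d + cost(v)
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]

def negotiated_congestion_route(graph, capacity, nets, max_iters=50):
    """Route every net each iteration; raise the cost of overused nodes."""
    history = {n: 0.0 for n in graph}
    for iteration in range(1, max_iters + 1):
        usage = {n: 0 for n in graph}          # rip up all routes
        routes = {}
        for name, (src, dst) in nets.items():  # reroute all nets
            def cost(n):
                overuse = max(0, usage[n] + 1 - capacity[n])
                return (1.0 + history[n]) * (1.0 + overuse)
            routes[name] = shortest_path(graph, cost, src, dst)
            for n in routes[name]:
                usage[n] += 1
        congested = [n for n in graph if usage[n] > capacity[n]]
        if not congested:                      # legal routing found
            return routes, iteration
        for n in congested:                    # update cost of congested resources
            history[n] += 1.0
    return None, max_iters

# Toy graph: both nets prefer node M (capacity 1); n2 has a longer detour.
graph = {
    "A": ["M"], "B": ["M"], "C": ["M", "X"], "D": ["M", "Z"],
    "M": ["A", "B", "C", "D"], "X": ["C", "Y"], "Y": ["X", "Z"], "Z": ["Y", "D"],
}
capacity = {n: 1 for n in graph}
nets = {"n1": ("A", "B"), "n2": ("C", "D")}
routes, iterations = negotiated_congestion_route(graph, capacity, nets)
print(iterations, routes)
```

In the first iteration both nets pass through M, which is illegal. After M's history cost is raised, n2 takes the detour through X, Y, and Z in the second iteration while n1, which has no alternative, keeps M.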

[Figure: routing resource graph for CLBs and SMs with a congested resource highlighted; SM-level resource graph with edge labels 0/4]

- ROCR - Riverside On-Chip Router
  - Resource Graph
    - Nodes correspond to SMs
    - Edges correspond to channels between SMs
    - Capacity of edge equal to the number of wires within the channel
  - Requires much less memory as resource graph is smaller
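A resource graph of this shape can be sketched as a dictionary keyed by SM pairs, with a used/capacity counter on each edge. The grid layout, the function names, and the data structure are assumptions for illustration; the slides only state what nodes, edges, and capacities represent.

```python
def build_sm_graph(rows, cols, channel_width):
    """Map each channel between neighboring SMs to [wires_used, capacity].

    SMs are assumed to sit on a rows x cols grid, one edge per channel.
    """
    edges = {}
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:
                edges[frozenset({(r, c), (r, c + 1)})] = [0, channel_width]
            if r + 1 < rows:
                edges[frozenset({(r, c), (r + 1, c)})] = [0, channel_width]
    return edges

def occupy(edges, path):
    """Record one net using one wire on every channel along an SM path."""
    for a, b in zip(path, path[1:]):
        edges[frozenset({a, b})][0] += 1

def overused(edges):
    """Channels where more wires are requested than exist (illegal routing)."""
    return [e for e, (used, cap) in edges.items() if used > cap]

# 3x3 grid of SMs with 4 wires per channel, as in the 0/4 edge labels.
edges = build_sm_graph(3, 3, 4)
for _ in range(5):                       # five nets share one channel
    occupy(edges, [(0, 0), (0, 1)])
print(len(edges), overused(edges))
```

If one assumes one SM per position of a 100x100 array with channel width 34, this graph has 2 x 100 x 99 = 19,800 edges, whereas giving every wire its own entry would need 19,800 x 34 = 673,200. That difference is one way to read the slide's claim that the smaller graph needs much less memory.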

JIT FPGA Compilation: ROCR - Riverside On-Chip Router

[Figure: ROCR routing flowchart; recoverable labels: Rip-up, illegal?]

Lysecky/Vahid/Tan, DAC’04; Lysecky/Vahid, DATE’04

Scalability of On-chip Routing: Experimental Setup

- Experimental Setup
  - 100x100 configurable logic fabric array
  - Routing channel width of 34
    - Large enough to support all HW circuits
  - 123 MCNC benchmark circuits
    - Circuit complexity ranges from a few LUTs to tens of thousands of LUTs
  - Performed technology mapping, packing, and placement using FlowMap, T-VPack, and VPR’s bounding box placement
  - Routed each HW benchmark circuit using:
    - VPR’s timing-driven router
    - VPR’s fast timing-driven router (-fast option)
    - Riverside On-Chip Router (ROCR)

Scalability of On-chip Routing: Memory Usage

[Figure: minimum, average, and maximum memory usage for VPR, VPR (Fast), and ROCR; recoverable data labels: 126602, 113235]

- VPR requires over 100 MB of memory on average
- ROCR requires at most 8.3 MB
- VPR requires 18X more memory than ROCR on average

Algorithm Performance

[Figure: routing execution time vs. circuit size (CLBs) for VPR, VPR (Fast), and ROCR]

- ROCR is over 40X faster than VPR for small HW circuits
- ROCR is 2X-3X faster than VPR for large HW circuits

Critical Path

[Figure: critical path vs. circuit size (CLBs) for VPR, VPR (Fast), and ROCR]

- Largest HW circuit: 19% longer critical path than VPR, 2.6% shorter than VPR (Fast)
- On average: 30%/27% longer critical path than VPR/VPR (Fast)

Wire Segments

[Figure: wire segments vs. circuit size (nets) for VPR, VPR (Fast), and ROCR]

- ROCR requires 2%/8% fewer wire segments than VPR/VPR (Fast) for larger HW circuits

Conclusions and Future Work

- Conclusions
  - Demonstrated ROCR scales well as circuit size increases
    - On average 2X faster than VPR’s fast timing-driven router
    - Requires 18X less memory than VPR
  - Produces good circuit quality
    - Critical path 27% longer than VPR (Fast) on average
    - 2.6% shorter critical path for largest HW circuit
    - Requires on average 5% fewer wire segments
- Future Work
  - Current project: a major microprocessor vendor is fabricating our custom FPGA
  - Improvements to Riverside On-Chip Router (ROCR)
    - Improve ROCR’s performance for large HW circuits
    - Incorporating timing information to achieve
    - Analyze the scalability of ROCR as circuit size approaches FPGA capacity
  - JIT FPGA Compilation
    - Development of standard HW binary
    - Support more complex FPGA architectures in JIT FPGA compilation
