WARP PROCESSORS
Roman Lysecky, Greg Stitt, Frank Vahid. Warp Processors. ACM Transactions on Design Automation of Electronic Systems (TODAES), v.11 n.3, p.659-681, July 2006
MOTIVATION
Wish to overcome barriers for FPGA acceleration:
  - Integrating tools into SW flows
  - Non-conformance to standard binary concept
Aim to make FPGAs invisible to SW developer
  - Dynamically determine critical regions
  - Re-implement as custom HW
  - Communicate between HW/SW
SYSTEM OVERVIEW
Initially execute application in SW only
Profile application to determine critical regions
Partition critical regions to HW
Program configurable logic and update SW binary
Partitioned application’s speed “warps” as accelerator takes over critical region
PROFILING
Typical profilers instrument code
  - Change behaviour
  - Require extra tools
Warp profiler monitors instruction addresses seen on instruction memory bus
Maintains cache of 16 8-bit entries to store backward branch frequencies
  - Maintains relative frequencies
  - Accurately selects kernels within 10 saturations
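The profiler cache described above can be sketched as a toy model. The 16 entries and 8-bit counters come from the slide; the eviction policy (drop the coldest entry) and the halving of all counters on saturation are assumptions made here to show one way relative frequencies could be preserved. All names are hypothetical.

```python
CACHE_ENTRIES = 16      # from the slide: 16-entry cache
COUNTER_MAX = 255       # from the slide: 8-bit counters


class BranchProfiler:
    """Toy model of a non-intrusive backward-branch frequency cache."""

    def __init__(self):
        self.counts = {}        # backward-branch target address -> count
        self.saturations = 0

    def observe(self, pc, target):
        """Record one taken branch seen on the instruction memory bus."""
        if target >= pc:
            return              # forward branch: not a loop back-edge
        if target not in self.counts:
            if len(self.counts) >= CACHE_ENTRIES:
                # Assumed policy: evict the coldest entry.
                coldest = min(self.counts, key=self.counts.get)
                del self.counts[coldest]
            self.counts[target] = 0
        self.counts[target] += 1
        if self.counts[target] == COUNTER_MAX:
            # Assumed policy: halve every counter on saturation so the
            # relative ordering of loops survives in 8 bits.
            self.saturations += 1
            for addr in self.counts:
                self.counts[addr] >>= 1

    def hottest(self, n=1):
        """Return the n most frequent loop targets, hottest first."""
        return sorted(self.counts, key=self.counts.get, reverse=True)[:n]
```

Feeding the model a stream of (pc, target) pairs and calling `hottest()` after a number of saturations yields the candidate critical regions.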
ON-CHIP CAD
On-chip CAD module implemented on separate ARM7 processor
In multi-processor environments, only one CAD module is necessary
Stages:
  - Decompilation
  - Partitioning
  - Behavioral & RT Synthesis
  - JIT FPGA Compilation
DECOMPILATION
Used to recover high-level constructs, e.g. loops, if statements, arrays
Decompiles critical region into CDFG
1) Intermediate code creation
2) High-level construct recovery
3) Map into statements/expressions
Uses techniques to undo compiler optimizations
  - Loop re-rolling
  - Strength promotion
  - Compare-with-zero optimization
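As an illustration of strength promotion: compilers often replace a multiplication by a constant with shifts and adds, and the decompiler can fold such a sequence back into a single multiply. The sketch below assumes the shift/add terms on one operand have already been collected; the function name and the (sign, shift) representation are hypothetical, not taken from the paper.

```python
def promote_shift_add(terms):
    """Fold a sum of shifted copies of one operand into a multiplier.

    terms: list of (sign, shift) pairs, each meaning sign * (x << shift).
    Returns the constant c such that the whole sum equals x * c.
    """
    return sum(sign * (1 << shift) for sign, shift in terms)


# (x << 3) + (x << 1) is recovered as x * multiplier
multiplier = promote_shift_add([(+1, 3), (+1, 1)])
```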
PARTITIONING
Determines which software kernels are most suitable for implementation in HW
Uses a heuristic (assumed here to be the 0-1 knapsack heuristic) to choose kernels that maximize speedup while reducing energy
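A minimal sketch of a greedy 0-1 knapsack heuristic for kernel selection, under the slide's own assumption that a knapsack heuristic is used. The gain and area figures, the tuple layout, and the ratio-based ordering are illustrative assumptions, not the authors' cost model.

```python
def partition(kernels, capacity):
    """Greedily pick kernels for HW by gain-per-area until the fabric is full.

    kernels:  list of (name, gain, area) tuples; area must be positive.
    capacity: available configurable-logic area, in the same units as area.
    Returns the names of the kernels chosen for HW, in selection order.
    """
    chosen = []
    remaining = capacity
    ranked = sorted(kernels, key=lambda k: k[1] / k[2], reverse=True)
    for name, gain, area in ranked:
        if area <= remaining:       # 0-1 choice: take a kernel whole or not at all
            chosen.append(name)
            remaining -= area
    return chosen


# Hypothetical numbers, for illustration only.
selected = partition([("fir", 60, 30), ("idct", 50, 50), ("brev", 10, 40)], 80)
```

Kernels that do not fit remain in software, so the heuristic trades optimality for the low run time and memory use an on-chip tool needs.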
BEHAVIORAL & RT SYNTHESIS
Converts CDFG for each critical kernel to HW circuit description
Then converts into netlist format
JIT COMPILATION: LOGIC SYNTHESIS
Optimizes hardware circuit
Creates acyclic graph of Boolean logic network
Nodes correspond to any simple 2-input logic gate
Uses Riverside on-chip minimizer (ROCM), a simple two-level logic minimizer
  - Traverses in breadth-first manner, applies logic minimization at each node
  - 15x faster & 3x less memory than Espresso-II
  - 2% increase in circuit size
JIT COMPILATION: TECHNOLOGY MAPPING
Maps hardware circuit onto CLBs and LUTs of the routing-oriented configurable logic fabric (RCLF)
3-phase greedy hierarchical graph-clustering algorithm
1) Breadth-first traversal of input acyclic graph creates 3-input 1-output LUT nodes
2) Breadth-first traversal combines nodes where possible to form final 3-input 2-output LUTs
3) Traverses graph final time, packs LUTs into CLBs
25X faster than commercial algorithms
Only minimally impacts circuit delay
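The first phase (clustering 2-input gates into 3-input, 1-output LUT nodes) can be sketched as below. This is a simplified greedy clustering under stated assumptions: the traversal starts from the outputs, and a fan-in gate is absorbed only when it has a single consumer and the merged input set still fits in the LUT. The data layout and function name are hypothetical, and phases 2 and 3 are not modelled.

```python
from collections import deque


def map_to_luts(gates, outputs, k=3):
    """Cluster 2-input gates into k-input, 1-output LUT nodes.

    gates:   dict gate_name -> (input_a, input_b); any name that is not a
             key of `gates` is treated as a primary input.
    outputs: list of gate names that drive primary outputs.
    Returns  dict lut_root -> tuple of that LUT's input signals.
    """
    fanout = {}
    for a, b in gates.values():
        for sig in (a, b):
            fanout[sig] = fanout.get(sig, 0) + 1
    for out in outputs:
        fanout[out] = fanout.get(out, 0) + 1    # a primary output is a use

    luts, seen = {}, set()
    queue = deque(outputs)
    while queue:                                # breadth-first over LUT roots
        root = queue.popleft()
        if root in seen or root not in gates:
            continue
        seen.add(root)
        inputs = list(dict.fromkeys(gates[root]))
        absorbed = True
        while absorbed:
            absorbed = False
            for sig in inputs:
                if sig in gates and fanout[sig] == 1:
                    merged = [s for s in inputs if s != sig]
                    merged += [s for s in gates[sig] if s not in merged]
                    if len(merged) <= k:        # still fits in one LUT
                        inputs, absorbed = merged, True
                        break
        luts[root] = tuple(inputs)
        queue.extend(inputs)                    # remaining gate inputs become roots
    return luts


# Chain of three gates: g2 folds into g3's LUT, g1 needs its own LUT.
example = map_to_luts({"g1": ("a", "b"), "g2": ("g1", "c"), "g3": ("g2", "d")},
                      ["g3"])
```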
JIT COMPILATION: PLACEMENT
Places network of CLBs onto configurable logic
Greedy dependency-based positional algorithm
  - Places critical path nodes on single horizontal row of RCLF
  - Analyzes dependencies between placed/unplaced nodes
  - Based on dependencies, places node above (input to placed node) or below (uses output from placed node)
Attempts to utilize routing resources between adjacent CLBs
Superimposes and aligns relative placement onto RCLF
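A rough sketch of the relative placement step, assuming a grid in which row 0 holds the critical path, negative rows lie above it and positive rows below it. The choice of the same column as the already-placed neighbour and the "next free row" rule are assumptions for illustration; the final alignment onto the RCLF is not modelled.

```python
def place(critical_path, edges):
    """Compute relative (row, col) positions for CLB nodes.

    critical_path: ordered list of nodes placed left to right on row 0.
    edges:         list of (src, dst) pairs, meaning src drives dst.
    Returns        dict node -> (row, col).
    """
    pos = {node: (0, col) for col, node in enumerate(critical_path)}
    used = set(pos.values())
    progress = True
    while progress:                     # repeat until no node can be placed
        progress = False
        for src, dst in edges:
            if dst in pos and src not in pos:
                node, (row, col), step = src, pos[dst], -1   # input: go above
            elif src in pos and dst not in pos:
                node, (row, col), step = dst, pos[src], +1   # consumer: go below
            else:
                continue
            row += step
            while (row, col) in used:   # skip cells that are already taken
                row += step
            pos[node] = (row, col)
            used.add((row, col))
            progress = True
    return pos


layout = place(["a", "b", "c"],
               [("a", "b"), ("b", "c"), ("x", "b"), ("b", "y"), ("z", "x")])
```

Keeping dependent nodes in neighbouring rows of the same column is one way to favour the routing resources between adjacent CLBs.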
JIT COMPILATION: ROUTING
Rips up illegal routes, adjusts routing costs of entire routing resource graph
Uses general approach of VPR’s routability-driven router
  - Allows both overuse of routing resources and illegal routes
Constructs routing conflict graph
  - Two routes conflict when both pass through a switch matrix and assigning them the same channel would result in illegal routing
  - Uses vertex coloring algorithm to assign routing channels
If any routes cannot be assigned a legal channel, rips up, readjusts costs, and reroutes
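The channel assignment step can be sketched as greedy vertex coloring of the conflict graph. Here colors stand for channels, the default of four channels is an assumption, and the highest-degree-first ordering is one common coloring heuristic rather than necessarily the one used by the authors. Routes that cannot be colored are returned so they can be ripped up and rerouted.

```python
def assign_channels(routes, conflicts, num_channels=4):
    """Greedy vertex coloring of a routing conflict graph.

    routes:    list of route names (graph vertices).
    conflicts: list of (route_a, route_b) pairs that may not share a channel.
    Returns    (channel, ripped): a dict route -> channel index, and the list
               of routes for which no legal channel was left.
    """
    neighbors = {r: set() for r in routes}
    for a, b in conflicts:
        neighbors[a].add(b)
        neighbors[b].add(a)

    channel, ripped = {}, []
    # Color the most constrained routes first.
    for route in sorted(routes, key=lambda r: len(neighbors[r]), reverse=True):
        taken = {channel[n] for n in neighbors[route] if n in channel}
        free = [c for c in range(num_channels) if c not in taken]
        if free:
            channel[route] = free[0]
        else:
            ripped.append(route)        # must be ripped up and rerouted
    return channel, ripped


# Three mutually conflicting routes but only two channels: one is ripped up.
assigned, to_reroute = assign_channels(
    ["n0", "n1", "n2"], [("n0", "n1"), ("n1", "n2"), ("n0", "n2")],
    num_channels=2)
```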
JIT COMPILATION: BINARY UPDATER
Used to allow SW to communicate with accelerated HW kernel
Replaces original SW instructions for loop with a jump to HW init. code
  - Enables HW with memory-mapped register
  - Puts microprocessor into power-down sleep mode
  - HW asserts completion signal to cause SW interrupt to wake up microprocessor
  - Jumps back to end of SW loop
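The patching sequence above can be mimicked on a toy program image. The instruction mnemonics, the `HW_ENABLE` register name, the 4-byte instruction size, and the stub address are all hypothetical; the sketch only shows the order of the steps, not a real instruction encoding.

```python
def patch_loop(program, loop_start, loop_end, stub_addr):
    """Return a patched copy of `program` (dict: address -> instruction text).

    loop_start: address of the first instruction of the SW loop.
    loop_end:   address of the first instruction after the loop.
    stub_addr:  free address where the HW init code is written.
    """
    patched = dict(program)
    patched[loop_start] = f"jump {stub_addr:#x}"      # divert into HW init code
    patched[stub_addr] = "store HW_ENABLE, 1"         # memory-mapped enable
    patched[stub_addr + 4] = "sleep"                  # woken by HW interrupt
    patched[stub_addr + 8] = f"jump {loop_end:#x}"    # resume after the loop
    return patched


original = {0x100: "loop body", 0x104: "branch 0x100", 0x108: "next"}
updated = patch_loop(original, 0x100, 0x108, 0x200)
```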
W-FPGAS
Data Address Generator (DADG)
Loop Control Hardware (LCH)
Multiplier-Accumulator (MAC)
All memory accesses handled through DADG
LCH for zero loop overhead
[Figure: W-FPGA architecture showing DADG & LCH, registers Reg0, Reg1, Reg2, 32-bit MAC, and the Routing-Oriented Configurable Logic Fabric]
W-FPGAS: ROUTING-ORIENTED CONFIGURABLE LOGIC FABRIC
[Figure: configurable logic fabric as a grid of CLBs and switch matrices (SMs), alongside DADG, LCH and 32-bit MAC]
RCLF consists of an array of CLBs surrounded by switch matrices (SMs) for routing between CLBs
Handles routing between CLBs using switch matrices
  - SMs can route signals to one of 4 neighbour SMs or to SMs two rows/cols apart
W-FPGAS: CONFIGURABLE LOGIC BLOCKS
Incorporates two 3-input 2-output LUTs
  - Equivalent to four 3-input 1-output LUTs with fixed internal routing
  - Reduces mapping complexity to increase speed
[Figure: CLB with two LUTs, inputs a-f, outputs o1-o4, and connections to adjacent CLBs and the switch matrix]
W-FPGAS: SWITCH MATRICES
All nets are routed using only a single pair of channels throughout the CLF
  - Each short channel is associated with a single long channel
Designed for fast, lean JIT FPGA routing
[Figure: switch matrix with short channels 0-3 and long channels 0L-3L on each side]
W-FPGAS
Lean place & route tools on RCLF can execute 10X faster using 18X less memory than existing tools
  - Results in lower clock frequencies for large circuits
  - Inclusion of DADG and MAC helps offset low freq.
RESULTS: BENCHMARKS
Benchmark   Benchmark Suite   Description
brev        Powerstone        Bit reversal
g3fax       Powerstone        Group three fax decode
matmul      Powerstone        Matrix multiplication
mpeg2       MediaBench        MPEG-2 decoder
pktflow     EEMBC             IP header validation
bitmnp      EEMBC             Bit manipulation
canrdr      EEMBC             Controller area network (CAN)
tblook      EEMBC             Table lookup and interpolation
ttsprk      EEMBC             Engine spark controller
matrix      EEMBC             Matrix operations
idct        EEMBC             Inverse discrete cosine transform
fir         EEMBC             Finite impulse response filter
rocm        Warp              RDCAD logic minimizer
prewitt     Warp (MT)         Prewitt edge detection
search      Warp (MT)         Parallel search
moravec     Warp (MT)         Moravec image processing
wavelet     Warp (MT)         Wavelet transform
maxfilter   Warp (MT)         Maximum window image filter
N-body      Warp (MT)         Barnes-Hut N-body simulation
RESULTS: SINGLE CRITICAL REGION
RESULTS: OVERALL SPEEDUP (MAX 4 CRITICAL REGIONS)
RESULTS
IMPLEMENTATION WITH MICROBLAZE
[Figure: block diagram with regions "Base MicroBlaze system" and "Dynamic Partitioning"; labelled blocks include MicroBlaze, Instr. (BRAM), Data (BRAM), Instr/Data (BRAM), lmb_cntrl, i_lmb, d_lmb, opb, opb_ddr, uartlite, profiler, prof_intf, W-FPGA, W-FPGA Interface, MicroBlaze (ROCCAD)]