1280 IEEE TRANSACTIONS ON VERY LARGE SCALE...

13
1280 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 24, NO. 4, APRIL 2016 Hybrid LUT/Multiplexer FPGA Logic Architectures Stephen Alexander Chin, Student Member, IEEE, Jason Luu, Safeen Huda, and Jason H. Anderson, Member, IEEE Abstract— Hybrid configurable logic block architectures for field-programmable gate arrays that contain a mixture of lookup tables and hardened multiplexers are evaluated toward the goal of higher logic density and area reduction. Multiple hybrid configurable logic block architectures, both nonfracturable and fracturable with varying MUX:LUT logic element ratios are evaluated across two benchmark suites (VTR and CHStone) using a custom tool flow consisting of LegUp-HLS, Odin-II front-end synthesis, ABC logic synthesis and technology mapping, and VPR for packing, placement, routing, and architecture exploration. Technology mapping optimizations that target the proposed architectures are also implemented within ABC. Experimentally, we show that for nonfracturable architectures, without any mapper optimizations, we naturally save up to 8% area postplace and route; both accounting for complex logic block and routing area while maintaining mapping depth. With architecture-aware technology mapper optimizations in ABC, additional area is saved, post-place-and-route. For fracturable architectures, experiments show that only marginal gains are seen after place-and-route up to 2%. For both nonfracturable and fracturable architectures, we see minimal impact on timing performance for the architectures with best area-efficiency. Index Terms— Field-programmable gate array (FPGA), hybrid complex logic block, multiplexer (MUX). I. I NTRODUCTION T HROUGHOUT the history of field-programmable gate arrays (FPGAs), lookup tables (LUTs) have been the primary logic element (LE) used to realize combinational logic. A K -input LUT is generic and very flexible—able to implement any K -input Boolean function. The use of LUTs simplifies technology mapping as the problem is reduced to a graph covering problem. However, an exponential area price is paid as larger LUTs are considered. The value of K between 4 and 6 is typically seen in industry and academia, and this range has been demonstrated to offer a good area/performance compromise [4], [5]. Recently, a number of other works have explored alternative FPGA LE architectures for performance improvement [6]–[10] to close the large gap between FPGAs and application-specific integrated circuits (ASICs) [11]. In this paper, we propose incorporating (some) hardened Manuscript received January 19, 2015; revised May 8, 2015; accepted June 9, 2015. Date of publication September 17, 2015; date of current version March 18, 2016. S. A. Chin, S. Huda, and J. H. Anderson are with the Department of Electrical and Computer Engineering, University of Toronto, Toronto, ON M5S 3G8, Canada (e-mail: [email protected]; [email protected]; [email protected]). J. Luu is with Altera Inc., Toronto, ON M5S 2X9, Canada (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TVLSI.2015.2451658 multiplexers (MUXs) in the FPGA logic blocks as a means of increasing silicon area efficiency and logic density. The MUX-based logic blocks for the FPGAs have seen success in early commercial architectures, such as the Actel ACT-1/2/3 architectures, and efficient mapping to these structures has been studied [12] in the early 1990s. However, their use in commercial chips has waned, perhaps partly due to the ease with which logic functions can be mapped into LUTs, simplifying the entire computer aided design (CAD) flow. Nevertheless, it is widely understood that the LUTs are inefficient at implementing MUXs, and that MUXs are frequently used in logic circuits. To underscore the ineffi- ciency of LUTs implementing MUXs, consider that a six- input LUT (6-LUT) is essentially a 64-to-1 MUX (to select 1 of 64 truth-table rows) and 64-SRAM configuration cells, yet it can only realize a 4-to-1 MUX (4 data + 2 select = 6 inputs). In this paper, we present a six-input LE based on a 4-to-1 MUX, MUX4, that can realize a subset of six-input Boolean logic functions, and a new hybrid complex logic block (CLB) that contains a mixture of MUX4s and 6-LUTs. The proposed MUX4s are small compared with a 6-LUT (15% of 6-LUT area), and can efficiently map all {2, 3}-input functions and some {4, 5, 6}-input functions. In addition, we explore fracturability of LEs—the ability to split the LEs into multiple smaller elements—in both LUTs and MUX4s to increase logic density. The ratio of LEs that should be LUTs versus MUX4s is also explored toward optimizing logic density for both nonfracturable and fracturable FPGA architectures. To facilitate the architecture exploration, we developed a CAD flow for mapping into the proposed hybrid CLBs, created using ABC [13] and VPR [14], and describe technology mapping techniques that encourage the selection of logic functions that can be embedded into the MUX4 elements. The main contributions in this paper are as follows. 1) Two hybrid CLB architectures (nonfracturable and fracturable) that contain a mixture of MUX4 LEs and the traditional LUTs yielding up to 8% area savings. 2) Mapping techniques called NaturalMux and MuxMap targeted toward the hybrid CLB architecture that optimize for area, while preserving the original mapping depth. 3) A full post-place-and-route architecture evaluation with VTR7 [1], and CHStone [2] benchmarks facilitated by LegUp-HLS [3], the Verilog-to-Routing project [1] showing impact on both area and delay. Compared with the preliminary publication [15], we have performed transistor level modelling of the MUX4 LE, further studied the fracturable architectures, and unified the open 1063-8210 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Transcript of 1280 IEEE TRANSACTIONS ON VERY LARGE SCALE...

Page 1: 1280 IEEE TRANSACTIONS ON VERY LARGE SCALE …kresttechnology.com/krest-academic-projects/krest-mtech... · 2016-07-07 · 1280 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI)

1280 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 24, NO. 4, APRIL 2016

Hybrid LUT/Multiplexer FPGA Logic ArchitecturesStephen Alexander Chin, Student Member, IEEE, Jason Luu, Safeen Huda, and Jason H. Anderson, Member, IEEE

Abstract— Hybrid configurable logic block architectures forfield-programmable gate arrays that contain a mixture of lookuptables and hardened multiplexers are evaluated toward the goalof higher logic density and area reduction. Multiple hybridconfigurable logic block architectures, both nonfracturable andfracturable with varying MUX:LUT logic element ratios areevaluated across two benchmark suites (VTR and CHStone)using a custom tool flow consisting of LegUp-HLS, Odin-IIfront-end synthesis, ABC logic synthesis and technology mapping,and VPR for packing, placement, routing, and architectureexploration. Technology mapping optimizations that targetthe proposed architectures are also implemented within ABC.Experimentally, we show that for nonfracturable architectures,without any mapper optimizations, we naturally save up to ∼8%area postplace and route; both accounting for complex logicblock and routing area while maintaining mapping depth. Witharchitecture-aware technology mapper optimizations in ABC,additional area is saved, post-place-and-route. For fracturablearchitectures, experiments show that only marginal gains areseen after place-and-route up to ∼2%. For both nonfracturableand fracturable architectures, we see minimal impact on timingperformance for the architectures with best area-efficiency.

Index Terms— Field-programmable gate array (FPGA), hybridcomplex logic block, multiplexer (MUX).

I. INTRODUCTION

THROUGHOUT the history of field-programmable gatearrays (FPGAs), lookup tables (LUTs) have been the

primary logic element (LE) used to realize combinationallogic. A K-input LUT is generic and very flexible—able toimplement any K -input Boolean function. The use of LUTssimplifies technology mapping as the problem is reduced to agraph covering problem. However, an exponential area price ispaid as larger LUTs are considered. The value of K between4 and 6 is typically seen in industry and academia, and thisrange has been demonstrated to offer a good area/performancecompromise [4], [5]. Recently, a number of other works haveexplored alternative FPGA LE architectures for performanceimprovement [6]–[10] to close the large gap between FPGAsand application-specific integrated circuits (ASICs) [11].In this paper, we propose incorporating (some) hardened

Manuscript received January 19, 2015; revised May 8, 2015; acceptedJune 9, 2015. Date of publication September 17, 2015; date of current versionMarch 18, 2016.

S. A. Chin, S. Huda, and J. H. Anderson are with the Departmentof Electrical and Computer Engineering, University of Toronto,Toronto, ON M5S 3G8, Canada (e-mail: [email protected];[email protected]; [email protected]).

J. Luu is with Altera Inc., Toronto, ON M5S 2X9, Canada (e-mail:[email protected]).

Color versions of one or more of the figures in this paper are availableonline at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TVLSI.2015.2451658

multiplexers (MUXs) in the FPGA logic blocks as a meansof increasing silicon area efficiency and logic density.

The MUX-based logic blocks for the FPGAs have seensuccess in early commercial architectures, such as theActel ACT-1/2/3 architectures, and efficient mapping to thesestructures has been studied [12] in the early 1990s. However,their use in commercial chips has waned, perhaps partly dueto the ease with which logic functions can be mapped intoLUTs, simplifying the entire computer aided design (CAD)flow. Nevertheless, it is widely understood that the LUTs areinefficient at implementing MUXs, and that MUXs arefrequently used in logic circuits. To underscore the ineffi-ciency of LUTs implementing MUXs, consider that a six-input LUT (6-LUT) is essentially a 64-to-1 MUX (to select 1of 64 truth-table rows) and 64-SRAM configuration cells, yet itcan only realize a 4-to-1 MUX (4 data + 2 select = 6 inputs).

In this paper, we present a six-input LE based on a4-to-1 MUX, MUX4, that can realize a subset of six-inputBoolean logic functions, and a new hybrid complex logicblock (CLB) that contains a mixture of MUX4s and 6-LUTs.The proposed MUX4s are small compared with a 6-LUT(15% of 6-LUT area), and can efficiently map all{2, 3}-input functions and some {4, 5, 6}-input functions.In addition, we explore fracturability of LEs—the ability tosplit the LEs into multiple smaller elements—in both LUTsand MUX4s to increase logic density. The ratio of LEs thatshould be LUTs versus MUX4s is also explored towardoptimizing logic density for both nonfracturable andfracturable FPGA architectures.

To facilitate the architecture exploration, we developed aCAD flow for mapping into the proposed hybrid CLBs, createdusing ABC [13] and VPR [14], and describe technologymapping techniques that encourage the selection of logicfunctions that can be embedded into the MUX4 elements.

The main contributions in this paper are as follows.1) Two hybrid CLB architectures (nonfracturable and

fracturable) that contain a mixture of MUX4 LEs andthe traditional LUTs yielding up to 8% area savings.

2) Mapping techniques called NaturalMux and MuxMaptargeted toward the hybrid CLB architecture thatoptimize for area, while preserving the original mappingdepth.

3) A full post-place-and-route architecture evaluation withVTR7 [1], and CHStone [2] benchmarks facilitatedby LegUp-HLS [3], the Verilog-to-Routing project [1]showing impact on both area and delay.

Compared with the preliminary publication [15], we haveperformed transistor level modelling of the MUX4 LE, furtherstudied the fracturable architectures, and unified the open

1063-8210 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Page 2: 1280 IEEE TRANSACTIONS ON VERY LARGE SCALE …kresttechnology.com/krest-academic-projects/krest-mtech... · 2016-07-07 · 1280 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI)

CHIN et al.: HYBRID LUT/MUX FPGA LOGIC ARCHITECTURES 1281

source tool-flow from C through LegUp-HLS to the VTRflow. Sparse crossbars (versus full crossbars in the previouswork) have also been included in our CLBs, increasingmodelling accuracy. The new transistor-level modelling of theMUX4 also provides more accurate results as compared withthe previous work. Results have also been expanded withthe inclusion of timing results as well as larger architecturalratio sweeps.

The remainder of this paper is organized as follows.Section II outlines related work. Section III discusses theproposed MUX4 LE, the variant used in the fracturablearchitecture and the design of the hybrid complex logic block.Section IV presents the technology mapping approaches totarget the proposed hybrid architecture. Section V shows howwe modeled the hybrid complex logic blocks for both the non-fracturable and fracturable architectures in VPR. Section VIdiscusses our evaluation methodology and provides theevaluation results. Finally, we conclude with final remarksin Section VII.

II. RELATED WORK

Recent works have shown that the heterogeneousarchitectures and synthesis methods can have a significantimpact on improving logic density and delay, narrowingthe ASIC–FPGA gap. Works by Anderson and Wang with“gated” LUTs [7], then with asymmetric LUT LEs [8], showthat the LUT elements present in commercial FPGAs provideunnecessary flexibility.

Toward improved delay and area, the macrocell-basedFPGA architectures have been proposed [9], [10]. Thesestudies describe significant changes to the traditionalFPGA architectures, whereas the changes proposed here buildon architectures used in industry and academia [4]. Similarly,and-inverter cones have been proposed as replacements for theLUTs, inspired by and-inverter graphs (AIGs) [6].

Purnaprajna and Ienne [16] explored the possibilityof repurposing the existing MUXs contained within theXilinx Logic Slices [17]. Similar to this work, they use theABC priority cut mapper as well as VPR for packing, place,and route. However, their work is primarily delay-basedshowing an average speedup of 16% using only tenof 19 VTR7 benchmarks.

III. PROPOSED ARCHITECTURES

A number of FPGA architecture variants were evaluatedand all are based on the basic MUX4 element describedin Section III-A. The design of LEs and considerationsfor fracturability are addressed in Section III-B followed byhybrid-CLB design in Section III-C. The areas of all proposedarchitectures are discussed in Section III-D.

A. MUX4: 4-to-1 Multiplexer Logic Element

The MUX4 LE shown in Fig. 1 consists of a 4-to-1 MUXwith optional inversion on its inputs that allow the realizationof any {2, 3}-input function, some {4, 5}-input functions, andone 6-input function—a 4-to-1 MUX itself with optionalinversion on the data inputs. A 4-to-1 MUX matches the input

Fig. 1. MUX4 LE depicting optional data input inversions.

pin count of a 6-LUT, allowing for fair comparisons withrespect to the connectivity and intracluster routing.

Naturally, any two-input Boolean function can be easilyimplemented in the MUX4: the two function inputs can betied to the select lines and the truth table values (logic-0 orlogic-1) can be routed to the data inputs accordingly.Or alternately, a Shannon decomposition can be performedabout one of the two variables—the variable can then feeda select input. The Shannon cofactors will contain at mostone variable and can, therefore, be fed to the data inputs(the optional inversion may be needed).

For three-input functions, consider that a Shannon decom-position about one variable produces cofactors with at mosttwo variables. A second decomposition of the cofactors aboutone of their two remaining variables produces cofactors withat most one variable. Such single-variable cofactors can befed to the data inputs (the optional inversion may be needed),with the decomposition variables feeding the select inputs.Likewise, functions of more than four inputs can beimplemented in the MUX4 as long as Shannon decompositionwith respect to any two inputs produce cofactors with at mostone input.

Observe that input inversion on each select input is omittedas this would only serve to permute the four MUX data inputs.While this could help routability within the CLB’s internalcrossbar, additional inversions on the select inputs would notincrease the number of Boolean functions that are able to mapto the MUX4 LE.

B. Logic Elements, Fracturability, andMUX4-Based Variants

Two families of architectures were created: 1) withoutfracturable LEs and 2) with fracturable LEs. In this paper,the fracturable LEs refer to an architectural element onwhich one or more logic functions can be optionally mapped.Nonfracturable LEs refer to an architectural element on whichonly one logic function is mapped. In the nonfracturablearchitectures, the MUX4 element shown in Fig. 1 is usedtogether with nonfracturable 6-LUTs. This element shares thesame number of inputs as a 6-LUT lending for fair comparisonwith respect to the input connectivity.

For the fracturable architecture, we consider aneight-input LE, closely matched with the adaptive logic

Page 3: 1280 IEEE TRANSACTIONS ON VERY LARGE SCALE …kresttechnology.com/krest-academic-projects/krest-mtech... · 2016-07-07 · 1280 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI)

1282 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 24, NO. 4, APRIL 2016

Fig. 2. Fracturable 6-LUT that can be fractured into two 5-LUTs withtwo shared inputs.

Fig. 3. Dual MUX4 LE that utilizes dedicated select inputs and shared datainputs.

module in recent Altera Stratix FPGA families. A 6-LUT thatcan be fractured into two 5-LUTs using eight inputs is shownin Fig. 2. Two five-input functions can be mapped into thisLE if two inputs are shared between the two functions. If noinputs are shared, two four-input functions can be mappedto each 5-LUT. For the MUX4 variant, Dual MUX4, we usetwo MUX4s within a single eight-input LE. In theconfiguration, shown in Fig. 3, the two MUX4s arewired to have dedicated select inputs and shared datainputs. This configuration allows this structure to map twoindependent (no shared inputs) three-input functions, whilelarger functions may be mapped dependent on the sharedinputs between both functions.

An architecture in which a 4-to-1 MUX (MUX4) isfractured into two smaller 2-to-1 MUXs was first considered.

Fig. 4. Hybrid CLB with a 50% depopulated intra-CLB crossbar depictingBLE internals for a nonfracturable (one optional register and one output)architecture.

Fig. 5. Hybrid CLB with a 50% depopulated intra-CLB crossbar depictingBLE internals for a fracturable (two optional registers and two outputs)architecture.

However, since a 2-to-1 MUX’s mapping flexibility is quitelimited (can only map two-input functions and the three-input2-to-1 MUX itself), little benefit was added compared withthe overheads of making the MUX4 fracturable and poor arearesults were observed.

C. Hybrid Complex Logic Block

A variety of different architectures were considered—thefirst being a nonfracturable architecture. In the nonfracturablearchitecture, the CLB has 40 inputs and ten basic LEs (BLEs),with each BLE having six inputs and one output followingempirical data in prior work [4]. Fig. 4 shows this nonfrac-turable CLB architecture with BLEs that contain an optionalregister. We vary the ratio of MUX4s to LUTs within theten element CLB from 1:9 to 5:5 MUX4s:6-LUTs. TheMUX4 element is proposed to work in conjunction with6-LUTs, creating a hybrid CLB with a mixture of 6-LUTs andMUX4s (or MUX4 variants). Fig. 4 shows the organization ofour CLB and internal BLEs.

For fracturable architectures, the CLB has 80 inputsand ten BLEs, with each BLE having eight inputs andtwo outputs emulating an Altera Stratix Adaptive-LUT [18].The same sweep of MUX4 to LUT ratios was also performed.Fig. 5 shows the fracturable architecture with eight inputs toeach BLE that contains two optional registers. We evaluatefracturability of LEs versus nonfracturable LEs in the contextof MUX4 elements since fracturable LUTs are common

Page 4: 1280 IEEE TRANSACTIONS ON VERY LARGE SCALE …kresttechnology.com/krest-academic-projects/krest-mtech... · 2016-07-07 · 1280 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI)

CHIN et al.: HYBRID LUT/MUX FPGA LOGIC ARCHITECTURES 1283

TABLE I

LE TRANSISTOR MODELS WITH AREA GIVEN IN MINIMUM-WIDTH TRANSISTOR AREA AND DELAYS SCALED FOR A 40-nm PROCESS

in commercial architectures. For example, Altera Adaptive6-LUTs in Stratix IV and Xilinx Virtex 5 6-LUTs can befractured into two smaller LUTs with some limitations oninputs.

The crossbar for fracturable architectures are larger than thenonfracturable architectures for two reasons. Due to the virtualincrease of LEs, a larger number of CLB inputs are required,which increases crossbar size. Since there are now twice asmany outputs from the LEs, these additional outputs need toalso be fed back into the crossbar, also increasing its size.Due to this disparity in crossbar size, fair comparisons cannotbe made between fracturable and nonfracturable architectures.Therefore, in this paper, we compare nonfracturable hybridCLB architectures to a baseline LUT only nonfracturablearchitecture and we compare fracturable hybrid CLB archi-tectures to a baseline LUT-only fracturable architecture.

Sparse crossbars have been previously studied [19] andin this paper, we model a 50% depopulated crossbar withinthe CLB for intracluster routing for both nonfracturable andfracturable architectures as compared with the preliminarypublication [15] that only modeled a full input crossbar.Extended discussion on architecture modelling followsin Section V.

D. Area Modelling

1) MUX4 Logic Element: Initial estimates of the MUX4 ele-ment showed that the MUX4 is ∼10% the area of a 6-LUToverall. A 4-to-1 MUX can be realized withthree 2-to-1 MUXs. Hence, the MUX4 element containsseven 2-to-1 MUXs, four SRAM cells, and four invertersin total (see Fig. 1). The optional inversion uses the fourSRAM cells, whereas the rest of the LE configuration isperformed through routing. In addition, the depth of theMUX tree is halved compared with the 6-LUT, which hassix 2-to-1 MUXs on its longest paths. Conservatively,assuming constant pass transistor sizing and that the areaof a 2-to-1 MUX and six transistor SRAM cell are roughlyequivalent, the MUX4 element has (1/16)th the SRAM areaand (1/8)th the MUX area of a 6-LUT.

These estimates were revised using transistor level mod-elling of the circuit blocks. Transistor-level optimization of theconstituent circuit blocks of an FPGA requires anunderstanding of the optimal area-delay tradeoffs foreach individual circuit block. This requires extracting arepresentative critical path, which is a path whose compositionof blocks and topology will be similar to the critical path ofa specific design. Extracting the representative critical pathallows us to judge to what extent each individual block istiming critical, which thus establishes an area-delay tradeoff

goals for each block. This is in line with the transistor-leveloptimization tool developed previously [20]. We use the resultsof prior work [20] to establish the optimal area-delay tradeofffor 6-LUTs in a conventional island-style FPGA architecturewith typical architectural parameters. The resulting 6-LUTdelay serves as a point of reference for optimization for thecircuits considered in this paper: in the interest of maximizingarea reduction while allowing performance to be maintained(ignoring the differences in cell counts between mapping toa conventional LUT and the LEs proposed in this paper),we attempt to match the delay of a 6-LUT while minimizingthe area of each of the variants of the MUX4 circuits.Transistor level modelling and optimizations were based ona predictive 22-nm high performance process [21], while thearea model presented in prior work [20] was used to estimatethe area of various circuit structures. With this methodology,we determined an area-delay optimal 6-LUT has an areaof 930 minimum-width transistors, and a worst-case delayof 261 ps. For the MUX4 cell and Dual MUX4 cell, aminimum area and minimum delay cell was created. Theminimum area MUX4 cell has an area of 95 minimum-width transistors and a delay of 204 ps; all transistors wereminimum-width in this case, and as the minimum area solutionfor this circuit was able to meet (and improve upon) theworst-case delay target of a 6-LUT. Similarly, the Dual MUX4cell has an area of 249 minimum-width transistors whilemeeting the worst-case delay requirement. However, we choseto use the minimum delay design for both the MUX4 andDual MUX4 elements for the rest of the study as there isnot a significant increase in area over the minimum areadesign.

Although the modelling was performed in the 22-nmprocess, the standard VPR architecture we use has all para-meters (routing delays, crossbar delays, and so on) scaled to a40-nm process. In this standard VPR architecture, parametersare compounded from a multitude of sources, some also inother lithographic processes, and subsequently scaled to 40-nm. Likewise, we linearly scale our delays by comparing thedelays of our 22-nm 6-LUT (261 ps) and the 6-LUT in thestandard architecture (398 ps). The delays for each design afterscaling to 40-nm are shown in Table I.

2) FPGA Area Model: Although determining the area ofa MUX4 element relative to a 6-LUT is important, we needto also examine global FPGA area considering the numberof CLB tiles, area overheads within the CLB and routingarea per CLB. Throughout this paper, global FPGA area wasestimated assuming that, per tile, 50% of the area is interclusterand intracluster routing, 30% of the area is used for LUTs,and 20% for registers and other miscellaneous logic, following

Page 5: 1280 IEEE TRANSACTIONS ON VERY LARGE SCALE …kresttechnology.com/krest-academic-projects/krest-mtech... · 2016-07-07 · 1280 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI)

1284 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 24, NO. 4, APRIL 2016

TABLE II

PER ARCHITECTURE, MINIMUM RELATIVE ARCHITECTURAL PERCENTAGE AREA, AND TOLERABLE PERCENTAGE

CLB INCREASE ASSUMING CONSTANT ROUTING DEMAND

Anderson and Wang [7] and a private communication [22].It is important to note that this 50%–30%–20% model is anestimate based on a traditional full FPGA design where-bythe routing and internal CLB crossbars are optimizedtoward 6-LUTs. Production of an optimized FPGA utilizingour new MUX4 elements would surely change said model.However, optimizing the entire routing architecture towardour MUX4 variants, measuring the routing architecture, andclosing the loop by creating a more accurate model is out ofthe scope of this work.

Using this model, we can make some observations aboutthe hybrid CLB architecture. The 30% that normally wouldaccount for ten 6-LUT LEs within the tile is now split betweenthe smaller MUX4 elements and 6-LUTs. For example,in a 3 MUX4:7 6-LUT architecture, the area relative tothe reference area model can be estimated by deducing theLogicChange% = (3 × 0.116 + 7)/10 (3 MUX4s eachat 0.116 the area of a 6-LUT and 7 6-LUTs), and multiplyingLogicChange% × 30% = 22% of total FPGA area.If routing and miscellaneous area were held constant, our over-all architecture area is Area3:7 = 50% + 20% + 22% = 92%of the reference area—8% area savings. However, this is themaximum area savings and it can only be realized by circuitsthat have a natural (i.e., inherent) MUX4:LUT ratio greaterthan or equal to the architecture ratio. In addition, sinceany function that can be mapped to a MUX4 element canalso be mapped into a 6-LUT, all excess MUX4 functionscan be mapped to 6-LUTs. If the natural MUX4:LUT ratioof the circuit is less than the architecture ratio, additionalCLBs will be required to supply more LUTs. In addition,the number of CLBs may also increase during CLB packing(C L BChange%) and routing demand may increase postplace-ment and routing (RoutingChange%). In general, the modelused to estimate area relative to the baseline 6-LUTonly architecture (nonfracturable or fracturable) is asfollows:

Area% = CLBChange% × (50% × RoutingChange%

+ 30% × LogicChange% + 20%). (1)

Using this model, it is useful to calculate how manyadditional CLBs can be tolerated for our new architectures.Again, consider a 3:7 MUX4:LUT architecture. Disregardingpacking, placement, and routing effects

NumCLB3:7 ≤ 1/Area3:7 × NumCLBLUT

≤ 1.08 × NumCLBLUT . (2)

This means that an area win can only be achieved if thenumber of CLBs needed to implement circuits in a hybrid

3:7 architecture is less than 1.08× the number needed for atraditional LUT-only architecture.

Similarly, the calculation is performed for the fracturablearchitecture with the larger Dual MUX4 element. A full tablefor all architectures showing the architectural minimum areaand tolerable CLB increase is shown in Table II.

IV. TECHNOLOGY MAPPING USING ABC

ABC [13] was used for technology mapping, with modifica-tions that allow for MUX4-embeddable function identificationand MUX2-embeddable function indentification in the caseof fracturable MUX4s and custom mapping. The internaldata structure used within the ABC is an AIG, where thelogic circuit is represented using 2-input AND gates withinverters. Priority Cuts mapping in ABC (invoked with theif command) [23] was modified to perform our customtechnology mapping. This mapper traverses the AIG fromprimary inputs to primary outputs finding intermediatemappings for internal nodes and finally the primary outputs,using a dynamic programming approach. The priority cutsmapper performs multiple passes on the AIG to find the bestcut per node. For depth-oriented mapping, the mapper firstprioritizes mapping depth then optimizes for area discardingcuts whose selection would increase the overall depth of themapped network.

Based on this standard mapper, two mapper variants wereproduced and evaluated. The first variant, NaturalMux,evaluates and identifies internal functions that areMUX4-embeddable, agnostic of the target architecture;i.e., this flow uses the default priority cuts mapping andperforms a postprocessing step to identify MUX4-embeddablefunctions. From this mapping, we can evaluate what areasavings are possible without any mapper changes. The secondvariant MuxMap, area-weights the MUX4-embeddable cutsrelative to 6-LUT cuts, thereby establishing a preference forselection/creation of MUX4-embeddable solutions.

In all mapper variants, cuts that are MUX4-embeddableneed to be identified, meaning that we must determinewhether the logic function implied by such cuts can beimplemented in a MUX4 LE. For LUTs that are identified asMUX4-embeddable, we tag them in the mapped networkwritten out by ABC so that the VPR is able to pack theseelements into the hybrid-CLB architectures. The identificationfunction essentially performs a 2-level Shannon decompositionfor all combinations of select inputs—a maximum of C6

2 timesfor a cut of size 6. For a logic function f , let Inputs( f )represent its variable set, {x0, . . . , xi }, and let fxi x j representthe Shannon cofactor of f with respect to its variablesxi and x j . Logic function f can be implemented in a MUX4

Page 6: 1280 IEEE TRANSACTIONS ON VERY LARGE SCALE …kresttechnology.com/krest-academic-projects/krest-mtech... · 2016-07-07 · 1280 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI)

CHIN et al.: HYBRID LUT/MUX FPGA LOGIC ARCHITECTURES 1285

if and only if

|Inputs( f )| ≤ 3 (3)

or

∃xi , x j ∈ Inputs( f ) such that

|Inputs( fxi x j )| ≤ 1

|Inputs( fxi x j )| ≤ 1

|Inputs( fxi x j )| ≤ 1

|Inputs( fxi x j )| ≤ 1. (4)

That is, any function up to three inputs can be implementedin a MUX4, and for functions with four or more inputs, theremust exist two variables such that the Shannon cofactors withrespect to such variables have one or fewer inputs.

Likewise, conditions must hold true forMUX2-embeddability using a 1-level Shannon decomposition.Note that a 1-level Shannon decomposition technique haspreviously been leveraged for mapping into asymmetric-LUTarchitectures [8]. Any function up to two inputs can beimplemented in a MUX2. The only three-input functionthat can be implemented in a MUX2 is a 2-to-1 MUX withoptional data input inversion. This limited flexibility wasquickly seen with the implementation of a MUX4 that wasfracturable into two MUX2s. These MUX2s do not provideample function diversity to justify the overheads of gettingthe signals into and out of the hybrid CLB and led to thecreation of the Dual MUX4.

Compared with the preliminary publication [15], we havealso fully modeled all MUX inputs for wholly accurateMUX4 input mapping. Previously, all the MUX data-inputswere optimistically modeled to be equivalent andinterchangeable. Likewise for MUX select-inputs.In this paper, each of the select input(s) and data inputs to theMUX4 element is classified by the mapper on a pin-by-pinbasis, so that much more accurate packing can be performedin the VPR.

A. NaturalMux

NaturalMux mapping invokes the standard priority cutsmapper. Following mapping, we use the preceding approachto determine if the LUT logic functions in the mapping areMUX4-embeddable. This is needed so we can identify whichLUTs are MUX4-embeddable in the subsequent packingstage.

B. MuxMap

In default ABC technology mapping, each LUT has a unitarea of 1. In our MuxMap approach, we use a lower weightfor the cases where logic functions are MUX4-embeddable.Following the area model where 50% of an FPGA tile area isrouting, 30% is 6-LUTs and 20% is miscellaneous circuitry(FFs + other), we can derive the weight of a MUX4 elementversus a 6-LUT. Dividing an FPGA tile into ten subtiles thatcontain a single 6-LUT plus the 6-LUT’s associated routingand miscellaneous circuitry, the 6-LUT or logic portion ofa subtile is 3% and the miscellaneous circuitry and routing

Fig. 6. MUX4 LE model.

Fig. 7. MUX4 mode in the 6-LUT element model.

is 7% of a complete tile. Recall from Section III-D that aMUX4 element consumes 11.6% of the area of a 6-LUT.Therefore, the area of a subtile with a MUX4 is 7.45% of theentire tile, i.e., 7% routing and miscellaneous circuitry areaplus 11.6% × 3% logic area. The area ratio of a subtile witha MUX4 versus a subtile with a 6-LUT would be roughly7.34%/10% = 0.734% (assuming the routing and othercircuitry is held constant). Following this reasoning, we weightMUX4s conservatively at 80% of a 6-LUT during technologymapping. Experimental results shown in Section VI-A showthat this is a reasonable choice.

C. Select Mapping

Depending on the circuit, NaturalMux or MuxMap may bepreferred. In select mapping, the circuit is first mapped usingNaturalMux. Following from the discussion in Section III-D,we know that if a circuit’s MUX4:LUT ratio is higher thanthe architectural ratio, maximum area reductions are realized.Therefore, if the natural ratio of the circuit is higher thanour target architectural ratio, we use this mapping. Otherwise,if the natural ratio is lower than the architectural ratio, we rerunthe mapping with the MuxMap mapper to encourage theselection of more MUX4-embeddable LEs. Note that the tech-nology mapping run-time is a small fraction of that requiredfor placement and routing.

V. MODELLING USING VPR

VPR was used to perform architectural evaluation. Thestandard ten 6-LUT CLB architecture in 40-nm included withthe VPR distribution was used for baseline modelling [1]. Thehybrid CLBs shown in Figs. 4 and 5 were modeled using theXML-based VPR architectural language [1]. The snippet fromthe architecture file for the physical block hardenedMUX4 element is shown in Fig. 6—this code specifiesa MUX4 as a six-input one-output blackbox to the VPR.In addition, since all MUX4s can also be mapped tothe 6-LUTs, an additional mode was added to the 6-LUTphysical block, shown in Fig. 7. The mode concept allows

Page 7: 1280 IEEE TRANSACTIONS ON VERY LARGE SCALE …kresttechnology.com/krest-academic-projects/krest-mtech... · 2016-07-07 · 1280 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI)

1286 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 24, NO. 4, APRIL 2016

the VPR packer to pack LUTs into LUTs (as usual), but alsoenables MUX4s to be packed into the LUTs. The architectureswith CLBs having MUX4:LUT ratios from 1:9 to 5:5 werecreated from the baseline 40-nm architectures with delaysobtained through circuit simulations of the MUX4 variants.

Importantly, we made minor modifications to theVPR packing algorithm [1] itself, so that the MUX4 netlistelements are preferred to be packed into the MUX4 LEsin the architecture (while limiting packing MUX4 netlistelements into LUTs). The modifications involved changingthe attraction function during the CLB packing. One changewas to ensure that the logic functions that were MUX4embeddable were preferentially packed into a physicalMUX4 element and not into an LUT. Another was toapply a negative weight on MUX4-embeddable functionswhen the current CLB’s physical MUX4 elements are alloccupied—also preventing MUX4-embeddable functions frombeing placed into the LUTs. Without this, the MUX4 netlistelements might needlessly consume LUTs, which shouldbe reserved, where possible, for those netlist elements thatdemand their flexibility. This becomes doubly important forfracturable architectures, since their packing problem is morecomplex. Without this modification, a significant CLB usageincrease was observed across all benchmark sets.

VI. EXPERIMENTAL EVALUATION

To determine the benefits of these new architectures,evaluation was performed for each architecture using multiplebenchmark suites and mapping schemes. Two benchmarkssuites were used to evaluate our hybrid architectures:1) VTR7 [1] and 2) CHStone [2]. Over the nonfracturableand fracturable architecture families, two sets of experimentswere performed using the NaturalMux and MuxMap mappingschemes. After both mappings were performed, the selectmapping was performed as described in Section IV-C.

While the VTR7 benchmarks are input to our flowas Verilog, the CHStone benchmarks are input as C.LegUp 3.0 [3] was used for C-to-Verilog HLS for theseCHStone benchmarks. Differing from our preliminarywork [15] (that had two separate tool flows), modificationswere made to both the LegUp and ODIN-II tools so thatCHStone benchmarks could be carried through the typical fullVTR flow (ODIN-II, ABC, VPR) using ODIN-II forfront-end synthesis. In the preliminary work [15],Altera Quartus II is used for the front-end synthesisthus producing an Altera-specific mapping for DSP blocks,Memory blocks, and Carry Chains—all of which are notsupported by the standard VPR architectures. This restrictedus only to be able to make post-mapping estimates. In thisnew tool-flow, ODIN-II is able to synthesize the Verilogoutput from LegUp-HLS. ODIN-II, thereby produces thecompatible DSP and Memory blocks for our modified standardFPGA architectures, allowing for post-place-and-route results(previously nonexisting).

Lastly, the mcml and LU32PEEng benchmarks were omittedfrom the VTR7 benchmark suite due to excessive runtime.

1) Mapping: Using ABC, we performed technology-independent optimization using the standard resyn2 script

included with the ABC distribution [13]. Then, we per-formed technology mapping with the priority cuts mapper(if command), targeting an LUT size of 6. Our modifiedpriority cuts mapper was invoked using the if command with acut set size of 32 also limiting the maximum cut size to 6. Theinitial unaltered baseline mapper was run without any mappingto MUX4s, and then NaturalMux mapping was performed forthe identification of MUX4-embeddable functions. Our secondmapper, MuxMap, seeks to improve the mapping by reducingthe area cost of a MUX4-embeddable function within prioritycuts mapping.

After mapping with both mappers, we computed theratio of the number of MUX4-embeddable LEs tothe total number of LEs, i.e., the MUX4 percentage.Based on this ratio for each circuit, we projectedthe area benefits of hybrid architectures 1:9 to 5:5MUX4:LUT ratios in the nonfracturable architectures.The area projections were made assuming complete (full)CLB packing, and the tile area breakdown describedin Section III-D (assuming no routing area change). Theresults are shown in Table III and demonstrate the maximumarea savings to be had for each benchmark, since theseprojections ignore the effects of packing, placement, androuting constraints. For the fracturable architectures, thisprediction is made difficult, since pairing of functions to bepacked into fractured LEs is largely determined by pin-sharingon the LE inputs—a packing constraint. No predictions aremade for the fracturable architectures, however, packed,placed, and routed results for the fracturable architectures arediscussed in Section VI-B.

The MuxMap improves mapping by increasing the numberof MUX4-embeddable logic functions by reducing the areacost of a MUX4-embeddable function, thereby encouragingthe mapper to create more MUX4-embeddable functions,increasing the MUX4-embeddable ratio of each circuit overall.However, the improvements in the ratio may come at the costof a higher total number of LEs. If there is a significantincrease in LEs (greater than the limits shown in Table II),no area savings may be seen. Fig. 8 shows a sweep of theweighting of MUX4-embeddable LEs from 0% cost to 100%cost of a 6-LUT over the VTR7 suite; and the projected area isshown on the vertical axis (normalized to a traditionalLUT-based architecture). Area is minimized for moreaggressive architectures (5:5 and 4:6) around a weight-ing of 50%–60% whereas less aggressive architectures(3:7, 2:8, 1:9) do not benefit at all from a reduced weighting.This can be seen as the minimum area for a weightingof 100%.

As the number of LEs grow, packing, placement, androuting effects play a greater role in the final circuit area.In the remainder of this paper, a weighting of 80% was chosenfor the MuxMap as this gave a good balance of additionalMUX4-embeddable LEs. Lower weightings result in manyadditional LEs, exacerbating the losses due to packing,placement, and routing.

The left-hand side of Table III shows the projected arearesults for NaturalMux mapping as well as the baselinestatistics of each benchmark in the two benchmark suites.

Page 8: 1280 IEEE TRANSACTIONS ON VERY LARGE SCALE …kresttechnology.com/krest-academic-projects/krest-mtech... · 2016-07-07 · 1280 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI)

CHIN et al.: HYBRID LUT/MUX FPGA LOGIC ARCHITECTURES 1287

TABLE III

POSTMAPPING AREA ESTIMATE FOR VTR7 AND CHSTONE BENCHMARKS ASSUMING COMPLETE CLB PACKING

AND NO INCREASE IN ROUTING DEMAND

Fig. 8. MuxMap varying the weight from zero area to equal 6-LUT areafor different MUX4:LUT architectures. A weight equal to the LUT area isequivalent to the NaturalMux mapping. As the MUX4 area weighting isincreased for some architecture ratios (e.g., 5:5 and 4:6), an increase in relativearea is seen due to unbalanced circuit and architecture MUX4:LUT ratios.

The right-hand side shows projected area results for theMuxMap mapping with a weighting of 0.8. For bothof the mappers, we show projected results for the architectureswith ratios ranging from 1:9 to 5:5. Again, Table II showsthe lower bound architectural areas relative to a traditional6-LUT-based architecture for the configurations tested and

this lower bound can only be achieved when the naturalMUX4-embeddable ratio exceeds the architecture ratio.When the natural ratio is lower than the architecture ratio,more CLBs are required, increasing area. Table III shows thelower bound on area for each benchmark and nonfracturablearchitecture, since this projection assumes perfect packing,placement, and routing of each benchmark on eacharchitecture. Looking at the left half of Table III, the baselinenumber of LEs along with the natural MUX4-embeddable ratiois given along with projected areas for each architecture forNaturalMux.

The VTR7 benchmarks are projected to perform well, witha high natural MUX4-embeddable ratio of 40% over allbenchmarks. For the VTR7 benchmarks, a 3:7 architectureseems to be the best architecture for NaturalMux mapping,yielding ∼7% projected area reduction postmapping. MuxMapshows an increase in MUX4-embeddable ratio to 46% overall benchmarks; however, the increase in LE count associ-ated with MuxMap leads to worse projected area results perarchitecture versus NaturalMux, except for the 4:6 and 5:5architectures. Employing the Select mapping scheme yieldsmarginally better area reduction for the best architectureratio, 3:7 (∼1% above the architectural minimum).

The CHStone benchmarks all show large percentagesof MUX4-embeddable functions using the NaturalMuxmapper, 55%. In fact, the percentage of MUX4-embeddablefunctions is so large that in nearly all cases, the architecturalminimum area is reached. In this case, the MuxMap mapper

Page 9: 1280 IEEE TRANSACTIONS ON VERY LARGE SCALE …kresttechnology.com/krest-academic-projects/krest-mtech... · 2016-07-07 · 1280 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI)

1288 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 24, NO. 4, APRIL 2016

Fig. 9. Input-size distribution of VTR7 benchmarks with theMUX4-embeddable LEs in blue (light gray) and LUTs in purple (dark gray).

Fig. 10. Input-size distribution for CHStone benchmarks with theMUX4-embeddable cuts in blue (light gray) and LUTs in purple (dark gray).

and Select schemes have no further improvements. Fromthese postmapping estimates, it seems as though even a veryaggressive 5:5 architectural ratio for the CHStone suite maybe the best as we reach the architectural minimum area forthe nonfracturable 5:5 architecture. This variation betweenbenchmark suites is interesting and the CHStone benchmarkssuiting more aggressive architectures could be a function ofhow LegUp-HLS transforms the CHStone benchmarks, writtenin C, to hardware. Again, we see that different architecturalconclusions can be made based on the benchmark circuitsemployed in an architectural study [24].

We can also examine how the weighting within theMuxMap mapper affects the used input-size distribution forthe LUTs in the circuits versus the NaturalMux mapper. Theinput-size distributions for the NaturalMux and MuxMapmappings are shown in Figs. 9 and 10 for VTR7 and CHStonebenchmarks, respectively. Each bar in the distribution showsthe portion of LEs (with a given number of used inputs) thatare MUX4-embeddable (in light-blue). The VTR suite hasonly a small percentage of functions with more than threeinputs that are naturally MUX4-embeddable (∼10%). WithMuxMap, the percentage of larger five- and six-input functionsis somewhat reduced with these functions being implementedmostly as three- or four-input MUX4-embeddable functions.Looking at the six-input functions specifically, we also observethat very few are MUX4-embeddable. Recall that the onlysix-input logic function that is MUX4-embeddable is a4-to-1 MUX (with possible data input inversions).Two- and three-input MUX4-embeddable functions are alsoquite abundant in both the VTR7 and CHStone benchmarks,comprising the majority of all the MUX4-embeddablefunctions. Concerning these small functions, note thatover 35% of functions in the VTR7 circuits and 40% inCHStone have three or fewer inputs (Fig. 9), which bodes

well for the proposed hybrid architectures. In addition,in both benchmark suites, MuxMap encourages a decreasein five-input elements and an increase in MUX4-embeddablethree-input elements.

A. Nonfracturable Architectures: Packing, Place, and Route

Of course, the previous postmapping results do not considerthe possible impact of the addition of MUX4 elements tothe FPGA architecture on packing, placement, and routabil-ity. Therefore, the mapped netlists were packed, placed,and routed, using VPR, into the architectures discussed inSection V. The final area estimates in Tables IV and Vshow the area reduction relative to the baseline mapper and atraditional nonfracturable 6-LUT architecture. We use the areaequation shown in Section III for these estimates. VPR wasconfigured to find minimum channel width with timing drivenrouting. For all benchmarks, five different placement seedswere used to account for the random effects of placement onfinal routing. For C L BChange%, we use the change in packedCLBs over the baseline. For RoutingChange%, we use thechange in routing area per logic tile (measured in minimum-width transistors) reported by VPR. Finally, we computeLogicChange% according to the chosen hybrid CLB archi-tecture ratio—this is constant for a given architecture andrepresents the physical reduction in LE area.

Tables IV and V show the results for each benchmark usingthe NaturalMux and MuxMap mapping schemes respectively,over all nonfracturable architectures. In Tables IV and V,the baseline CLB count is shown with the naturalMUX4-embeddable LE ratio. For each architecture, thepercentage change in CLBs and routing with respect to thebaseline 6-LUT-only architecture is shown with area based onthe model in Section III-D. For each benchmark suite, we takethe geomean area across all benchmarks in the suite. Eachof these final mean relative area numbers is shown in bold.Table V has additionally a row showing the select geomean,which is the result of applying the select scheme between bothmappers for each architecture.

Overall, we see increases in packed CLBs for bothbenchmark suites, but for most architectures, this gain is offsetby the smaller hybrid CLB area. Looking at the minimumgeomean areas in Table IV, we can see that the best archi-tecture suited toward NaturalMux mapping is the 2:8 forVTR7 and 4:6 for CHStone. But for the MuxMap mappingscheme in Table V, the VTR7 performs best on a3:7 architecture and CHStone performs the best on a 5:5architecture. However, the best results overall are shown withSelect. For Select mapping, the VTR7 on a 3:7 architectureis the best with ∼4% area reduction, but it is down four per-centage points from the postmap estimate. CHStone performsbest in a 4:6 architecture at ∼9% area reduction. The postmap-ping projections again suggested 3% points better area for the5:5 architecture.

For the nonfracturable architectures, a ratio of 3:7 seemsfitting as we see gains for both benchmark suites—though notfull gains for the CHStone.

Page 10: 1280 IEEE TRANSACTIONS ON VERY LARGE SCALE …kresttechnology.com/krest-academic-projects/krest-mtech... · 2016-07-07 · 1280 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI)

CHIN et al.: HYBRID LUT/MUX FPGA LOGIC ARCHITECTURES 1289

TABLE IV

NONFRACTURABLE ARCHITECTURE: POSTPLACE AND ROUTE RESULTS SHOWING ARCHITECTURAL SWEEPS BETWEEN ZERO (BASELINE) TO

FIVE MUX4 LEs OUT OF TEN, FOR FRACTURABLE HYBRID-CLB ARCHITECTURES, USING THE NATURALMUX MAPPING SCHEME.

ALL BENCHMARKS WERE AVERAGED OVER FIVE PLACEMENT SEEDS

TABLE V

NONFRACTURABLE ARCHITECTURE: POSTPLACE AND ROUTE RESULTS SHOWING ARCHITECTURAL SWEEPS BETWEEN ZERO (BASELINE) TO

FIVE MUX4 LEs OUT OF TEN, FOR FRACTURABLE HYBRID-CLB ARCHITECTURES USING THE MUXMAP MAPPING SCHEME.

ALL BENCHMARKS WERE AVERAGED OVER FIVE PLACEMENT SEEDS

Page 11: 1280 IEEE TRANSACTIONS ON VERY LARGE SCALE …kresttechnology.com/krest-academic-projects/krest-mtech... · 2016-07-07 · 1280 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI)

1290 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 24, NO. 4, APRIL 2016

TABLE VI

FRACTURABLE ARCHITECTURE: POSTPLACE AND ROUTE RESULTS SHOWING ARCHITECTURAL SWEEPS BETWEEN ZERO (BASELINE) TO

FIVE MUX4 LEs OUT OF TEN, FOR FRACTURABLE HYBRID-CLB ARCHITECTURES, USING THE NATURALMUX MAPPING SCHEME.

ALL THE BENCHMARKS WERE AVERAGED OVER FIVE PLACEMENT SEEDS

TABLE VII

FRACTURABLE ARCHITECTURE: POSTPLACE AND ROUTE RESULTS SHOWING ARCHITECTURAL SWEEPS BETWEEN ZERO (BASELINE) TO

FIVE MUX4 LEs OUT OF TEN, FOR FRACTURABLE HYBRID-CLB ARCHITECTURES, USING MUXMAP MAPPING SCHEME.

ALL THE BENCHMARKS WERE AVERAGED OVER FIVE PLACEMENT SEEDS

Page 12: 1280 IEEE TRANSACTIONS ON VERY LARGE SCALE …kresttechnology.com/krest-academic-projects/krest-mtech... · 2016-07-07 · 1280 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI)

CHIN et al.: HYBRID LUT/MUX FPGA LOGIC ARCHITECTURES 1291

TABLE VIII

GEOMEAN FMAX CHANGE FOR EACH ARCHITECTURE FOR SELECT

MAPPING RELATIVE TO THE STANDARD TEN 6-LUT

ARCHITECTURE FOR EACH BENCHMARK SUITE

B. Fracturable Architectures: Packing, Place, and Route

The final area estimates in Tables VI and VII show thearea reduction relative to the baseline mapper and a traditionalfracturable 6-LUT architecture. The same experimentalmethodology used for nonfracturable architectures is used forthe fracturable architectures.

The fracturable architectures, overall, show vastly differingperformance compared with the nonfracturable architectures.Using NaturalMux mapping shown in Table VI, the bestarchitectures are the 1:9 architecture for VTR7 and the2:8 architecture for CHStone. Only up to ∼2% area savingsare seen for NaturalMux mapping for fracturable architectures.Using MuxMap mapping (Table VII), the best architecturesare the 1:9 architecture for VTR7 but now the 3:7 architecturefor CHStone. Again, only up to ∼2% area savings are seenfor MuxMap mapping for fracturable architectures. For Selectmapping, we see marginal gains and show the best overallresult for both the VTR7 and CHStone benchmark suites.A 1:9 architecture looks best for VTR7 at an overall 2.1%area reduction and a 2:8 or 3:7 looks best for the CHStonesuite at 2.6% and 2.7% reduction, respectively.

Recall that the fracturable 6-LUT architecture presented isvery flexible—it can map two functions with up to four disjointinputs each. This flexibility is quite powerful as seen with thereduced performance of the fracturable architectures versusthe nonfracturable architectures. The Dual MUX4 element canonly map some combinations of three and four input functions,since four inputs to the LE are shared (see Fig. 3). Theconstraints (pin sharing) on the input pins of the Dual MUX4restrict many pairs of functions, usually two to four inputs, tobe mapped, whereas the fracturable LUT is able to map anypair of functions up to four-inputs each.

Looking again at the function input distributionsin Figs. 9 and 10, we can see that over 50% of the functionsin each benchmark suite is four-input or smaller. Thus, thesefour-input or smaller functions are ideal candidates formapping into the fracturable LUTs, allowing the fracturableLUT to perform very well in terms of logic density. Alsofrom the distributions, there are not many five- or six-inputfunction that can be mapped to MUX4 elements. Thus,most of these elements also require LUTs leaving not manyfunctions left to be mapped to Dual MUX4 elements.

C. Performance

The worst-case delay for both the 6-LUT and MUX4variants was used to determine the longest delay and FMax.Table VIII shows the geomean FMax change for Select map-ping for nonfracturable and fracturable architectures for eachbenchmark suite. Using the Select mapping results, we are

taking the best mapping scheme based on area and then mea-suring performance. Over all tested architectures, we observedbetween a 2% improvement to a 7% slowdown. The generaltrend seems to be that the more aggressive architectures seemore slowdown especially for fracturable architectures.

For the nonfracturable architecture, the best area savingswere seen in 3:7 and 4:6 ratios. The performance is relativelyconstant for the 3:7 architecture, while the 4:6 architecturethe VTR7 benchmarks have a 2.5% hit and the CHStonebenchmarks have a 2.5% gain.

For fracturable architectures, we saw the best area savings ina 1:9 architecture for VTR7 and for performance, we now seea 1% slowdown. For CHStone, based on area, the preferablearchitecture was a toss-up between a 2:8 and 3:7 architecturebut, here, we see that a 2:8 architecture is preferable, since wetake a smaller performance hit of 1.5%.

Overall, the performance impact on the architectures thatyielded the best area savings were minimal for fracturablearchitectures, while some gains were seen for nonfracturablearchitectures.

D. Discussion

The results show that relative to baseline nonfracturableLUT-based LE architectures, the inclusion of hardenedMUX4s to form a hybrid LUT/MUX4-based CLB providesan area benefit of up to 8.9% and a 2.5% increase in FMax(for the case of the CHStone circuits, 4:6 architecture andSelect mapping). In essence, this result reaffirms conventionalwisdom that LUTs are in a sense over-engineered for manylogic functions that regularly occur in application circuits. Forsuch functions, one can get away with using a LE with reducedflexibility.

With respect to a baseline of fracturable LUT-basedelements, the incorporation of MUX4s offers at most a modestarea benefit of 2.6% with a small performance hit of 1.5%(CHStone, with a 2:8 ratio). The notion of making LUTs frac-turable is, in fact, a feature whose purpose is rooted in theoverengineered nature of nonfracturable LUTs: fracturableLUTs permit area recovery for implementing small logicfunctions. Thus, the underlying purpose of making LUTs frac-turable is not orthogonal to the purpose of incorporatingMUX4s, and hence, it is perhaps unsurprising that MUX4sprovide a small benefit in this case. Nevertheless, we considerthe results encouraging in that they point out that the LUT,while being the long-standing lynchpin of FPGA logic, maynot be the only circuit structure capable of delivering highlogic density. Other structures, such as MUXs, are worthy ofconsideration.

VII. CONCLUSION

We have proposed a new hybrid CLB architecture con-taining MUX4 hard MUX elements and shown techniquesfor efficiently mapping to these architectures. Weighting ofMUX4-embeddable functions with our MuxMap techniquecombined with a select mapping strategy provided aid tocircuits with low natural MUX4-embeddable ratios. We alsoprovided analysis of the benchmark suites postmapping, dis-cussing the distribution of functions within each benchmark

Page 13: 1280 IEEE TRANSACTIONS ON VERY LARGE SCALE …kresttechnology.com/krest-academic-projects/krest-mtech... · 2016-07-07 · 1280 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI)

1292 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 24, NO. 4, APRIL 2016

suite. From our first set of experiments with nonfracturablearchitectures, area reductions of up to 8% were seen fora 4:6 MUX4:LUT architecture in the CHStone suite witha 2:8 architecture most viable for the VTR suiteswith ∼5% area savings. Our second set of experimentswith fracturable architectures showed that the flexibility of afracturable LUT is very powerful, reducing the impact ofthe MUX4 LEs, yielding smaller ∼2%–3% area savingsover the VTR7 and CHStone benchmark suites with lessaggressive 2:8 and 1:9 architectures, respectively. Interestingly,we again found that different architectural conclusions canbe made based on the benchmark circuits employed in anarchitecture study [24], since CHStone benchmarks generallypreferred more aggressive MUX4:LUT architecture ratios.The CHStone benchmarks being high-level synthesized withLegUp-HLS also showed marginally better performance andthis could be due to the way LegUp performs HLS onthe CHStone benchmarks themselves. Overall, the additionof MUX4s to FPGA architectures minimally impact FMaxand show potential for improving logic-density in nonfrac-turable architectures and modest potential for improving logic-density in fracturable architectures.

REFERENCES

[1] J. Rose et al., “The VTR project: Architecture and CAD for FPGAsfrom verilog to routing,” in Proc. ACM/SIGDA FPGA, 2012, pp. 77–86.

[2] Y. Hara, H. Tomiyama, S. Honda, and H. Takada, “Proposal andquantitative analysis of the CHStone benchmark program suite forpractical C-based high-level synthesis,” J. Inf. Process., vol. 17,pp. 242–254, Oct. 2009.

[3] A. Canis et al., “LegUp: High-level synthesis for FPGA-basedprocessor/accelerator systems,” in Proc. ACM/SIGDA FPGA, 2011,pp. 33–36.

[4] E. Ahmed and J. Rose, “The effect of LUT and cluster size on deep-submicron FPGA performance and density,” IEEE Trans. Very LargeScale Integr. (VLSI), vol. 12, no. 3, pp. 288–298, Mar. 2004.

[5] J. Rose, R. Francis, D. Lewis, and P. Chow, “Architecture of field-programmable gate arrays: The effect of logic block functionalityon area efficiency,” IEEE J. Solid-State Circuits, vol. 25, no. 5,pp. 1217–1225, Oct. 1990.

[6] H. Parandeh-Afshar, H. Benbihi, D. Novo, and P. Ienne, “RethinkingFPGAs: Elude the flexibility excess of LUTs with and-inverter cones,”in Proc. ACM/SIGDA FPGA, 2012, pp. 119–128.

[7] J. Anderson and Q. Wang, “Improving logic density through synthesis-inspired architecture,” in Proc. IEEE FPL, Aug./Sep. 2009, pp. 105–111.

[8] J. Anderson and Q. Wang, “Area-efficient FPGA logic elements:Architecture and synthesis,” in Proc. ASP DAC, 2011, pp. 369–375.

[9] J. Cong, H. Huang, and X. Yuan, “Technology mapping and architectureevalution for k/m-macrocell-based FPGAs,” ACM Trans. Design Autom.Electron. Syst., vol. 10, no. 1, pp. 3–23, Jan. 2005.

[10] Y. Hu, S. Das, S. Trimberger, and L. He, “Design, synthesis andevaluation of heterogeneous FPGA with mixed LUTs and macro-gates,”in Proc. IEEE ICCAD, Nov. 2007, pp. 188–193.

[11] I. Kuon and J. Rose, “Measuring the gap between FPGAs and ASICs,”IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 26, no. 2,pp. 203–215, Feb. 2007.

[12] K. Karplus, “Amap: A technology mapper for selector-based field-programmable gate arrays,” in Proc. 28th ACM/IEE DAC, Jun. 1991,pp. 244–247.

[13] A. Mishchenko, S. Chatterjee, and R. Brayton, “DAG-aware AIGrewriting a fresh look at combinational logic synthesis,” in Proc. 43rdAnnu. DAC, 2006, pp. 532–535.

[14] V. Betz and J. Rose, “VPR: A new packing, placement and routing toolfor FPGA research,” in Proc. 7th Int. Workshop FPL, 1997, pp. 213–222.

[15] S. A. Chin and J. H. Anderson, “A case for hardened multiplexers inFPGAs,” in Proc. FPT, Dec. 2013, pp. 42–49.

[16] M. Purnaprajna and P. Ienne, “A case for heterogeneous technology-mapping: Soft versus hard multiplexers,” in Proc. IEEE 21st Annu. Int.Symp. FCCM, Apr. 2013, pp. 53–56.

[17] (2011). Virtex-6 FPGA User Guide. [Online]. Available:http://www.xilinx.com

[18] (2011). Stratix IV Device Handbook. [Online]. Available:http://www.altera.com

[19] G. Lemieux and D. Lewis, “Using sparse crossbars within LUT,”in Proc. 9th Int. Symp. ACM/SIGDA FPGA, 2001, pp. 59–68.

[20] C. Chiasson and V. Betz, “COFFE: Fully-automated transistor sizing forFPGAs,” in Proc. Int. Conf. FPT, Dec. 2013, pp. 34–41.

[21] Predictive Technology Model. [Online]. Available: http://ptm.asu.edu/,accessed 2015.

[22] Altera, private communication, Mar. 2014.[23] A. Mishchenko, S. Cho, S. Chatterjee, and R. Brayton, “Combinational

and sequential mapping with priority cuts,” in Proc. IEEE/ACM Int.Conf. ICCAD, Nov. 2007, pp. 354–361.

[24] A. Yan, R. Cheng, and S. J. E. Wilton, “On the sensitivity ofFPGA architectural conclusions to experimental assumptions, tools,and techniques,” in Proc. 10th Int. Symp. ACM/SIGDA FPGA, 2002,pp. 147–156.

Stephen Alexander Chin received theB.A.Sc. degree in electrical engineering andthe M.A.Sc. degree in computer engineering fromthe University of Toronto, Toronto, ON, Canada,in 2009 and 2013, respectively, where he iscurrently pursuing the Ph.D. degree in computerengineering.

His current research interests includereconfigurable architectures and related computer-aided design for silicon efficiency.

Jason Luu received the B.A.Sc. degree in computerengineering from the University of Waterloo,Waterloo, ON, Canada, in 2007, and theM.A.Sc. and Ph.D. degrees from the Universityof Toronto, Toronto, ON, Canada, in 2010 and 2014,respectively.

He has contributed eight years to the open sourceVTR project for field-programmable gate arraycomputer-aided design and architecture research.He is currently a Senior Software Engineerwith Altera Inc., Toronto.

Safeen Huda received the B.A.Sc. andM.A.Sc. degrees in electrical engineering fromthe University of Toronto, Toronto, ON, Canada,in 2009 and 2012, respectively, where he iscurrently pursuing the Ph.D. degree in computerengineering.

He has been involved in the development ofSTT-MRAM at the device and circuit level. Hiscurrent research interests include the developmentof circuits, architectures, and computer-aided designflows for field-programmable gate arrays, with an

emphasis on increasing energy efficiency.Mr. Huda has held the Natural Sciences and Engineering Research Council

of Canada Canada Graduate Scholarship and the University of TorontoFellowship.

Jason H. Anderson (S’96–M’05) received theB.Sc. degree in computer engineering from theUniversity of Manitoba, Winnipeg, MB, Canada,and the M.A.Sc. and Ph.D. degrees in electricaland computer engineering from the Universityof Toronto (U of T), Toronto, ON, Canada.

He joined the Field-Programmable GateArray (FPGA) Implementation Tools Group, Xilinx,Inc., San Jose, CA, USA, in 1997, where he wasinvolved in placement, routing, and synthesis.He is currently an Associate Professor with the

Department of Electrical and Computer Engineering, U of T, and holds theJeffrey Skoll Chair in Software Engineering. He has authored over 60 papersin refereed conference proceedings and journals, and holds 26 issuedU.S. patents. His current research interests include computer-aided design,architecture, and circuits for FPGAs.