Optimal utilization of available reconfigurable hardware resources


Computers and Electrical Engineering 37 (2011) 1043–1057


Kashif Latif ⇑, Arshad Aziz, Athar Mahboob

National University of Sciences and Technology, H-12 Islamabad, Pakistan

Article info

Article history: Received 4 June 2010. Received in revised form 25 July 2011. Accepted 25 July 2011. Available online 27 August 2011.

0045-7906/$ - see front matter © 2011 Elsevier Ltd. All rights reserved.
doi:10.1016/j.compeleceng.2011.07.010

Reviews processed and approved for publication to Al-Rabadi.
⇑ Corresponding author.
E-mail addresses: [email protected] (K. Latif), [email protected] (A. Aziz), [email protected] (A. Mahboob).

Abstract

Field programmable gate arrays (FPGAs) are continuously gaining momentum and becoming an essential part of today’s digital systems and applications. The growing use of these devices, coupled with increasingly complex and integrated designs, necessitates a search for techniques that utilize their internal resources efficiently. Standard HDL coding techniques and synthesis tools map logic onto a look-up table (LUT) based architecture. The resulting design occupies more area on the chip, while some fast, dedicated resources of the chip remain unutilized. This in turn results in slower clock rates and longer critical paths, so the design remains inefficient in terms of both speed and area. In this paper we present and discuss techniques to effectively utilize the dedicated FPGA resources in order to increase achievable clock rates and reduce FPGA area utilization. Various useful HDL constructs are presented that utilize the dedicated hardware resources of modern Xilinx FPGAs. Optimization techniques are presented with implementation examples and corresponding quantitative performance evaluation. In most cases we have achieved a 50% reduction in chip area utilization and simultaneously improved timing results significantly.

© 2011 Elsevier Ltd. All rights reserved.

1. Introduction

Field programmable gate array (FPGA) technology is continuously gaining market share and becoming an essential part of today’s embedded systems. Since their invention by Xilinx in 1984, FPGAs have moved on from being simple digital logic chips to actually replacing custom application-specific integrated circuits (ASICs) and processors for signal processing and control applications [1,2]. FPGAs represent the largest segment of the programmable logic device (PLD) market, as reported by the Semiconductor Industry Association (SIA, USA) [3]. FPGA technology is a cost- and time-effective alternative to heavily invested, fully customized designs, i.e. ASICs [4]. The main advantages of the FPGA are its reconfigurability and its low development cost and time. The FPGA market was forecast to grow from $1895.0 million in 2005 to more than $2756.7 million by 2010 [5].

The growing use of reconfigurable devices makes it more important to develop techniques that utilize the internal resources of these devices effectively and efficiently. Conventional HDL coding techniques and synthesis tools map every type of logic onto a look-up table (LUT) based architecture. This consumes more area on the chip, while some fast, dedicated areas of the chip remain unutilized. This in turn results in slower clock rates and longer critical paths. Hence the design remains inefficient in terms of both speed and area. Normally, the utilized chip area of an FPGA is calculated in terms of the configurable logic block (CLB) count. The CLBs of a modern FPGA contain not only LUTs but also other dedicated hardware resources. For example, modern Xilinx FPGAs contain the dedicated carry logic gates MUXCY and ORCY, and other dedicated functional gates such as the MUXFX multiplexers and MULT_AND. Conventional HDL and synthesis techniques map all of the logic onto a LUT based architecture, and the dedicated area of a CLB remains unutilized. In this scenario, counting chip area in terms of CLBs does not represent the actual area utilization of the chip, simply because the hardware within a CLB is not fully utilized.

This paper is an extended version of our previously published work [6]. In this paper we present and discuss various techniques to effectively utilize FPGA resources in order to speed up the clock rate and reduce area utilization. Chip area and clock speed are the core constraints in all major application areas of FPGAs. Some common application areas are cryptography, digital signal and image processing, software-defined radio, aerospace and defense systems, ASIC prototyping, medical imaging, speech recognition, bioinformatics, computer hardware emulation, metal detection and a host of emerging areas. In cryptographic applications in particular, the area utilization and speed constraints of the design are of utmost importance. Most cryptographic applications require processing of operands stored in registers with a large number of bits. For example, in the RSA public key encryption algorithm, the keys used for encryption/decryption must be at least 1024 bits long in order to achieve an acceptable security level. When implementing arithmetic on such large numbers on FPGAs, optimized techniques that fit the design within area and timing constraints are indispensable. Most researchers in this area have either optimized the implemented algorithm or presented different architectural techniques to implement the design. We did not find much material on efficient utilization of the dedicated resources available in reconfigurable platforms. This is why most of the references used in this paper are vendor-specific documents and Internet resources. In this work we explore the dedicated hardware resources available in reconfigurable platforms and present them in one place. This work should be helpful for researchers working on FPGA based implementations in various application areas.

The remainder of this paper is organized as follows. We briefly review the internal architecture of a modern Xilinx FPGA in Section 2. Section 3 describes the conventional HDL coding approach with the help of an implementation example. In Section 4 we present our optimization techniques for the example of Section 3 and compare the performance results of both approaches. In Section 5 we discuss some additional optimized design techniques for wide input Boolean operations. Section 6 describes some useful techniques to route signals efficiently within an FPGA. Section 7 provides the implementation results for different Boolean function architectures and some selected ISCAS’85 and ISCAS’89 benchmark circuits, using conventional and optimized techniques, together with a thorough comparison of the results. Finally, we provide some conclusions in Section 8.

2. Architecture review of Xilinx FPGAs

First, we present a brief review of the internal architecture of some modern Xilinx FPGAs. Fig. 1 gives a general overview of the internal architecture of a modern Xilinx FPGA.

A Xilinx FPGA’s internal architecture consists of configurable logic blocks (CLBs), input/output blocks (IOBs), switch matrices (SMs) and wire segments [7,8]. A CLB is the basic building block where the logic resides. SMs are programmable interconnects of wire segments. The CLBs are organized in a grid array to implement different types of logic designs, i.e. combinational and synchronous. Each CLB comprises 4 slices, each of which contains 2 LUTs. A LUT combined with a register element is called a logic cell (LC). Each CLB also contains fast local internal routing resources. Fig. 2 illustrates a simplified view of the Xilinx CLB.

In an array of CLBs, each CLB element is coupled with a switch matrix to access the general routing matrix, as shown in Fig. 3. Within a CLB, the four slices are arranged in two columns of two slices each, with two independent carry logic routing chains and one common shift chain. Fig. 4 illustrates a simplified internal view of a slice.

In some modern FPGAs Xilinx has categorized the slices into two types: logic-only slices called SliceL and memory slices called SliceM. As shown in Fig. 3, Virtex-4 and Spartan-3 consist of these types of slices. The first slice column (down) consists of SliceM type slices and the second column (up) consists of SliceL type slices. SliceL can only implement combinational logic functions. SliceM, however, can also be configured to implement distributed memory or shift registers, in addition to combinational logic functions. Fig. 5 shows the block diagram of the internal structure of a Xilinx Virtex-II FPGA’s SliceM.

Fig. 1. Generic FPGA architecture.

Fig. 2. Xilinx FPGA’s CLB.

Fig. 3. Slice structure of Virtex-4 and Spartan-3 CLBs [9,10].

Each slice includes two 4-input LUTs (also called function generators), carry logic, arithmetic logic gates, wide function multiplexers and two storage elements. Each 4-input LUT is programmable as a 4-input LUT, as 16 bits of distributed memory, or as a 16-bit variable-tap shift register. The output of the function generator in each slice drives both the slice output and the D input of the storage element.
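To make the notion of a function generator concrete, the following behavioural sketch models a 4-input LUT as a 16-entry truth table addressed by its four inputs. The module names (lut4_model, and4_lut) are stand-ins chosen for this illustration; this is a simulation model under that assumption, not the vendor primitive.

```verilog
// Behavioural stand-in for a 4-input LUT (hypothetical name lut4_model).
// INIT holds the 16-entry truth table; the four inputs form the address.
module lut4_model #(
    parameter [15:0] INIT = 16'h0000
) (
    input  wire i0,
    input  wire i1,
    input  wire i2,
    input  wire i3,
    output wire o
);
    // Copy the parameter to a wire so it can be indexed by a signal.
    wire [15:0] truth = INIT;
    assign o = truth[{i3, i2, i1, i0}];
endmodule

// Example configuration: only entry 15 is set, so y = a[0] & a[1] & a[2] & a[3].
module and4_lut (
    input  wire [3:0] a,
    output wire       y
);
    lut4_model #(.INIT(16'h8000)) u_lut (
        .i0(a[0]), .i1(a[1]), .i2(a[2]), .i3(a[3]), .o(y)
    );
endmodule
```

Any other 4-input function is obtained by changing only the 16-bit INIT value, which is why a single LUT costs the same regardless of the function it holds.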

The new generation FPGA architecture includes dedicated two-input multiplexers for combining LUTs, allowing devices to support functions of eight or more inputs. These specialized multiplexers improve the performance, density, and size of wide logic that can be implemented in each CLB. In addition, the slices contain a dedicated two-input OR gate (ORCY) and a two-input AND gate (MULT_AND) to perform operations involving wide input AND and OR gates. These gates combine the four-input LUT outputs and can be cascaded in a chain to provide wide AND functionality across slices. The output of the cascaded AND gates can then be combined with the dedicated ORCY to produce a sum of products (SOP) function. Fig. 6 shows the detailed internal structure of a single slice of the Xilinx Virtex-II FPGA.

3. The conventional design approach

Fig. 4. Simplified slice structure.

Fig. 5. Structural diagram of Virtex-II SliceM [11].

The conventional design approach is to code the design logic in a hardware description language (HDL) and then let the synthesis tool generate the FPGA level design. The drawback of this approach is that synthesis tools tend to map all of the logic onto a LUT based architecture, which results in consumption of a larger chip area and longer input-to-output path delays. Hence, the design becomes bigger and runs at slower clock rates. We explain the conventional approach with the help of a simple design example. Suppose we have to design an 8-input AND gate. We can simply code it using a single HDL statement. The following is an implementation in Verilog HDL.

    assign out = a[0] & a[1] & a[2] & a[3] & a[4] & a[5] & a[6] & a[7];

where a is the 8-bit input and out is the output of the AND gate. This statement performs the logical AND of the 8 bits of a and drives the result onto out. The synthesis tool maps this statement to an 8-bit AND function using three 4-input LUTs. The first two LUTs each AND one 4-bit group of the input, and the two resulting bits are then ANDed by the third 4-input LUT. Fig. 7 illustrates the resulting hardware.

In a Xilinx Spartan-3 FPGA a LUT4 has a gate delay of 0.479 ns and a net delay of 0.976 ns; the overall critical path delay of this circuit is 9.215 ns.

4. The optimized approach

Fig. 6. Detailed structure of Virtex-II slice [12].

Fig. 7. 8-Input AND function – LUT based architecture.

In the example given in the previous section there are two stages of LUTs, so the critical path to the output includes a two-stage delay. We can avoid the second LUT stage by using dedicated hardware within a slice. By utilizing a dedicated AND gate (MULT_AND) or a dedicated multiplexer (MUXCY) we can achieve the same functionality with a shorter path delay. The following code fragment uses a MULT_AND gate in place of the third LUT.

    assign temp1 = a[0] & a[1] & a[2] & a[3];
    assign temp2 = a[4] & a[5] & a[6] & a[7];
    MULT_AND MULT_AND_inst (.LO(out), .I0(temp1), .I1(temp2));

Note that MULT_AND is not a standard Verilog HDL construct. It is a library primitive that instantiates a MULT_AND gate in hardware. Fig. 8 shows the resulting hardware.

Fig. 8. 8-Input AND function – using MULT_AND.

The MULT_AND gate in a Spartan-3 FPGA has a gate delay of only 0.001 ns and no net delay. The overall critical path delay of the circuit is now one LUT4 delay plus the MULT_AND delay, which equals 2.171 ns. This is much smaller than the delay of the standard implementation, and there is the additional benefit of saving one LUT4, which matters for area. This circuit is advantageous when the output of MULT_AND is used within the chip. A drawback, however, is that the output of MULT_AND cannot be connected directly to the IO buffers of the chip. To route its output to an IO buffer, the carry chain multiplexer MUXCY must be used, which adds a delay of 1.664 ns to the critical path, followed by the output buffer delay of 4.909 ns. The final critical path length is 8.744 ns, which is still smaller than that of the LUT-only architecture.
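As a quick check of the figures quoted above, the path from the inputs through the MULT_AND circuit to an output pin can be added up stage by stage. The subscripted symbols are labels introduced here for the quoted delays, not notation from the paper.

```latex
\begin{align*}
T_{\text{pin}} &= T_{\text{LUT4+MULT\_AND}} + T_{\text{MUXCY}} + T_{\text{OBUF}} \\
               &= 2.171 + 1.664 + 4.909 = 8.744\ \text{ns}, \\
T_{\text{conventional}} - T_{\text{pin}} &= 9.215 - 8.744 = 0.471\ \text{ns}.
\end{align*}
```

The saving at the pin is small because the output buffer dominates the path; the benefit is much larger when the result stays on-chip.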

MUXCY can also be used directly for the same circuit functionality. The following HDL code uses MUXCY in place of MULT_AND.

    assign temp1 = a[0] & a[1] & a[2] & a[3];
    assign temp2 = a[4] & a[5] & a[6] & a[7];
    MUXCY MUXCY_inst (
        .O(out),
        .CI(1'b0),
        .DI(temp2),
        .S(!temp1)
    );

Again, note that MUXCY is not a standard Verilog construct. It is a library primitive that instantiates the MUXCY gate in hardware. Fig. 9 shows the resulting hardware. The MUXCY in a Spartan-3 FPGA has a gate delay of 0.983 ns and a net delay of 0.681 ns. The overall critical path delay of the circuit is therefore 8.743 ns.

Table 1 compares the timing results of the 8-input AND gate implemented with the conventional approach and with the optimized approach using the MULT_AND gate. Results are shown for commonly used Xilinx FPGAs. For simplicity, and to show the timing effect of the two approaches more clearly, input and output buffer delays have been omitted. The last column of the table shows the percent improvement in critical path delay for each device.
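The select polarity in the MUXCY fragment is easy to get wrong, so a small simulation check can be useful. The sketch below replaces the library primitive with a behavioural stand-in (muxcy_model, a name chosen here) that assumes the usual carry-multiplexer behaviour: the output follows CI when S is 1 and DI when S is 0. It then compares the result with a plain reduction AND for all 256 input values. The module and testbench names are illustrative.

```verilog
// Behavioural stand-in for the carry multiplexer (hypothetical name).
module muxcy_model (
    input  wire ci,
    input  wire di,
    input  wire s,
    output wire o
);
    assign o = s ? ci : di;
endmodule

// 8-input AND built the same way as the MUXCY fragment in the text.
module and8_muxcy (
    input  wire [7:0] a,
    output wire       y
);
    wire temp1 = a[0] & a[1] & a[2] & a[3];
    wire temp2 = a[4] & a[5] & a[6] & a[7];
    // temp1 = 1 selects DI (temp2); temp1 = 0 selects CI (constant 0).
    muxcy_model u_mux (.ci(1'b0), .di(temp2), .s(~temp1), .o(y));
endmodule

// Self-checking testbench: exhaustive comparison against the reduction AND.
module tb_and8;
    reg  [7:0] a;
    wire       y;
    integer    i;
    integer    errors;

    and8_muxcy dut (.a(a), .y(y));

    initial begin
        errors = 0;
        for (i = 0; i < 256; i = i + 1) begin
            a = i[7:0];
            #1;
            if (y !== (&a)) errors = errors + 1;
        end
        if (errors == 0) $display("PASS: and8_muxcy matches the reduction AND");
        else             $display("FAIL: %0d mismatches", errors);
        $finish;
    end
endmodule
```

When targeting a device, the stand-in would be swapped for the library MUXCY instance shown above; the stand-in only serves to confirm the logic in any Verilog simulator.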

Fig. 9. 8-Input AND function – using MUXCY.

Table 1. Timing results for different Xilinx devices.

Device        Conventional approach (ns)   Optimized approach (ns)   Percent improvement (%)
Spartan-3     2.615                        0.480                     81.64
Virtex-2      1.685                        0.417                     75.25
Virtex-2Pro   1.349                        0.277                     79.47
Virtex-5      0.743                        0.160                     78.46
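The last column of Table 1 follows from the two delay columns. As a worked instance for the Spartan-3 row (the symbols are labels introduced here):

```latex
\[
\text{Improvement}
  = \frac{T_{\text{conv}} - T_{\text{opt}}}{T_{\text{conv}}} \times 100\%
  = \frac{2.615 - 0.480}{2.615} \times 100\% \approx 81.64\%.
\]
```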


5. Efficient techniques for wide input gates

In this section we present some optimized design techniques for implementing wide input Boolean operations using dedicated hardware resources.

5.1. Wide input AND operation

MUXCY gates can combine the outputs of 4-input LUTs across slices and cascade them into a chain to provide wide AND functionality [13]. Fig. 10 shows a 16-input AND gate implementation. Each 4-input LUT provides the SELECT signal for one MUXCY; the SELECT signal is simply the AND of 4 inputs. The VCC at the bottom of the chain reaches the output only when all of the input signals are at logic high. This use of carry logic performs the AND function at high speed and saves hardware resources. A standard LUT based implementation of a 16-bit AND gate requires 5 LUTs, whereas this technique uses only 4 LUTs, saving 1 LUT.
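A behavioural sketch of the chain described above is given below. It is a model written for this illustration rather than the authors' code: each conditional assignment stands for one MUXCY (output = CI when SELECT is 1, DI otherwise, with DI assumed tied to GND), and the module and signal names are chosen here. A synthesis tool is free to map plain RTL like this onto LUTs, so forcing the carry chain on a real device would require instantiating the library MUXCY primitive as in Section 4.

```verilog
// 16-input AND arranged as a four-stage carry chain (behavioural sketch).
module and16_carry_chain (
    input  wire [15:0] a,
    output wire        y
);
    wire [3:0] sel;    // LUT outputs: AND of each 4-bit group
    wire [4:0] carry;  // carry[0] is the VCC fed in at the bottom

    assign carry[0] = 1'b1;

    genvar k;
    generate
        for (k = 0; k < 4; k = k + 1) begin : stage
            // LUT4 configured as a 4-input AND drives the MUXCY select.
            assign sel[k] = &a[4*k +: 4];
            // MUXCY: pass the incoming carry when sel is 1, else force 0.
            assign carry[k+1] = sel[k] ? carry[k] : 1'b0;
        end
    endgenerate

    // The VCC reaches the top only if every stage selected its carry input.
    assign y = carry[4];
endmodule
```

The four `sel` terms correspond to the 4 LUTs counted in the text; the combining stage that would cost a fifth LUT is absorbed by the multiplexers.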

5.2. Wide input OR operation

Fig. 10. 16-Bit AND gate implementation.

Fig. 11. 16-Bit OR gate implementation.

Fig. 12. Sum of product (SOP) function using cascaded AND gates.

Fig. 13. MUXF5 and MUXFX multiplexers [14].

As discussed in the previous section, the same technique may be used to provide wide OR functionality. Fig. 11 shows a 16-input OR gate implementation. Each 4-input LUT provides the SELECT signal for one MUXCY; here the SELECT signal is the NOR of 4 inputs. The GND at the bottom of the chain reaches the output only when all of the input signals are at logic low. This use of carry logic performs the OR function at high speed and saves hardware resources. A standard LUT based implementation of a 16-bit OR gate requires 5 LUTs, whereas this technique uses only 4 LUTs, saving 1 LUT. The reason for using a NOR rather than an OR operation in each LUT is to propagate the signal through the Ci input of MUXCY instead of the Di input, which reduces the propagation delay of the signal. We discuss this aspect in detail later in the paper.
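The OR chain of Section 5.2 can be modelled in the same behavioural style. Again this is only a sketch under stated assumptions: the conditional assignment stands for a MUXCY with its DI input assumed tied to VCC, and the module and signal names are chosen here.

```verilog
// 16-input OR arranged as a four-stage carry chain (behavioural sketch).
module or16_carry_chain (
    input  wire [15:0] a,
    output wire        y
);
    wire [3:0] sel;    // LUT outputs: NOR of each 4-bit group
    wire [4:0] carry;  // carry[0] is the GND fed in at the bottom

    assign carry[0] = 1'b0;

    genvar k;
    generate
        for (k = 0; k < 4; k = k + 1) begin : stage
            // LUT4 configured as a 4-input NOR drives the MUXCY select.
            assign sel[k] = ~|a[4*k +: 4];
            // MUXCY: pass the incoming carry when the group is all zero,
            // otherwise inject a 1 from the DI side.
            assign carry[k+1] = sel[k] ? carry[k] : 1'b1;
        end
    endgenerate

    // The GND reaches the top only if every 4-bit group was all zero.
    assign y = carry[4];
endmodule
```

With the NOR select, the all-zero case travels along the carry inputs from stage to stage, which is the fast path the text refers to.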

5.3. Sum of product function (SOP)

The output of cascaded AND gates (Fig. 10) can be combined with the dedicated ORCY gate to produce a sum of products (SOP) function [13]. Several slices can be used to form the sum of products, depending on the width of the data. Fig. 12 shows the SOP of 64-bit wide inputs using 4 cascaded 16-bit AND operations. A standard LUT based implementation of the 64-bit SOP function requires 23 LUTs, whereas this technique uses only 16 LUTs, saving 7 LUTs. Timing results are also improved, as shown in Section 7.
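The logic function of such a 64-input SOP can be sketched behaviourally as four 16-bit product terms combined by a chain of two-input ORs. The exact wiring of Fig. 12 is not reproduced here; the sketch assumes that each ORCY computes the OR of its chain input and one product term, and the module and signal names are chosen for illustration.

```verilog
// 64-input sum of products: OR of four 16-bit AND terms (behavioural sketch).
module sop64_sketch (
    input  wire [63:0] a,
    output wire        y
);
    wire [3:0] prod;      // one product term per 16-bit group
    wire [4:0] or_chain;  // or_chain[0] is the constant 0 entering the chain

    assign or_chain[0] = 1'b0;

    genvar g;
    generate
        for (g = 0; g < 4; g = g + 1) begin : term
            // Stands for one 16-input AND carry chain (4 LUTs + 4 MUXCYs).
            assign prod[g] = &a[16*g +: 16];
            // Stands for one ORCY: output = chain input OR product term.
            assign or_chain[g+1] = or_chain[g] | prod[g];
        end
    endgenerate

    assign y = or_chain[4];
endmodule
```

Four product terms of 4 LUTs each give the 16 LUTs quoted in the text; the OR stage adds no LUTs because it uses the dedicated gates.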

5.4. Wide input MUX operation

In addition to the MUXCY and ORCY gates, modern Xilinx FPGAs also contain MUXFX multiplexers dedicated to the design of wide input multiplexers [11]. The Virtex-II architecture contains two dedicated multiplexers per slice, MUXF5 and MUXFX. The MUXFX multiplexer implements the MUXF6, MUXF7, or MUXF8, as shown in Fig. 13. Each CLB element has two MUXF6 multiplexers, one MUXF7 multiplexer and one MUXF8 multiplexer.

Using these multiplexers, each slice can implement a 4:1 multiplexer, each CLB can implement a 16:1 multiplexer and two CLBs can implement a 32:1 multiplexer. A 4-input LUT alone, however, supports at most a 2:1 MUX, as shown in Fig. 14.

Fig. 15 shows how 8:1 and 16:1 multiplexers can be implemented using these dedicated MUXFXs. Table 2 summarizes the hardware required to implement a particular multiplexer.

Fig. 14. 4-Input LUT as a 2:1 MUX.

Fig. 15. Multiplexer Implementation using MUXFX [13].

Table 2. Hardware requirements for different multiplexers.

Hardware resources   MUX
2 LUTs + MUXF5       4:1
2 Slices + MUXF6     8:1
4 Slices + MUXF7     16:1
2 CLBs + MUXF8       32:1
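The first two rows of Table 2 can be illustrated with a behavioural sketch. Each LUT is modelled as a 2:1 multiplexer, and the dedicated multiplexers are modelled as simple two-way selects (output = I1 when the select is 1, I0 otherwise, which is an assumption of this sketch). The module and signal names are chosen here, and the assignment of select bits to levels is one possible arrangement.

```verilog
// 4:1 multiplexer: two LUTs used as 2:1 muxes, combined by a MUXF5-style select.
module mux4_f5_sketch (
    input  wire [3:0] d,
    input  wire [1:0] s,
    output wire       y
);
    wire lut_lo = s[0] ? d[1] : d[0];   // first LUT as 2:1 mux
    wire lut_hi = s[0] ? d[3] : d[2];   // second LUT as 2:1 mux
    assign y = s[1] ? lut_hi : lut_lo;  // stands for MUXF5
endmodule

// 8:1 multiplexer: two 4:1 slices combined by a MUXF6-style select.
module mux8_f6_sketch (
    input  wire [7:0] d,
    input  wire [2:0] s,
    output wire       y
);
    wire y_lo;
    wire y_hi;
    mux4_f5_sketch u_lo (.d(d[3:0]), .s(s[1:0]), .y(y_lo));
    mux4_f5_sketch u_hi (.d(d[7:4]), .s(s[1:0]), .y(y_hi));
    assign y = s[2] ? y_hi : y_lo;      // stands for MUXF6
endmodule
```

Each additional level doubles the number of data inputs while adding only one dedicated multiplexer and one select line, which is the pattern Table 2 continues up to 32:1.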


5.5. General wide gate input functions

MUXFX logic can also be used to implement other wide input functions. These dedicated multiplexers are named so that the name describes the functionality: MUXF6 can implement any function of 6 inputs, MUXF7 any 7-input function and MUXF8 any 8-input function. Using MUXFX logic we can implement a custom Boolean function of up to 39 inputs within a single CLB, or a 79-input function using two CLBs. Fig. 16 shows an example of a 39-input custom Boolean function implemented in a single CLB.
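One way to account for the 39 and 79 input figures is to count every LUT input and every multiplexer select as a distinct signal. This accounting is an interpretation based on the per-CLB resource counts given in Section 5.4, not a derivation taken from the paper.

```latex
\begin{align*}
N_{\text{1 CLB}} &= \underbrace{8 \times 4}_{\text{LUT inputs}}
                  + \underbrace{4}_{\text{MUXF5 selects}}
                  + \underbrace{2}_{\text{MUXF6 selects}}
                  + \underbrace{1}_{\text{MUXF7 select}} = 39, \\
N_{\text{2 CLBs}} &= 2 \times 39 + \underbrace{1}_{\text{MUXF8 select}} = 79.
\end{align*}
```

These are upper bounds on input count for one particular tree-shaped structure; they do not mean that every Boolean function of 39 variables fits in a single CLB.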

6. Some useful findings

When using MUXCY in a cascade, it is better to propagate signals through the Ci input rather than the Di input. The delay from the Di input to the MUXCY output is considerably larger than the delay from the Ci input to the output. The Ci input of MUXCY and its routing nets are designed so that carry signals propagate faster than on the normal FPGA routing resources.

Dedicated carry logic resources are routed such that, in cascade operations, the carry logic of a SliceL is always routed to the carry logic of the SliceL in the next CLB. That is, a MUXCY output from a SliceL is connected to the MUXCY of the SliceL in the next CLB. Consequently, the carry logic of a CLB cannot be routed between SliceL and SliceM. Fig. 3 shows the slice structure of the Virtex-4 CLB; its routing details show how slices are connected within and outside the CLB.

Inputs of MULT_AND gates are routed through the LUT inputs, as shown earlier in Fig. 6. Hence, a slice’s LUT output cannot be connected to the MULT_AND gate of the same slice.

Fig. 16. 39-Input wide custom Boolean function in a CLB [13].


7. Results

As our emphasis is on utilizing the dedicated hardware resources to improve the area consumption of FPGA designs, we now present hardware utilization results for the two approaches. The first designs an architecture using LUTs only, which we refer to as the conventional approach; the other utilizes dedicated hardware and is referred to as the optimized approach. We compare the slice and LUT utilization for different Boolean function architectures and selected ISCAS’85 and ISCAS’89 benchmark circuits using these two approaches. The critical path length, or time delay, of each design is also presented as T in nanoseconds (ns).

Table 3 shows the results for the implementations discussed in previous sections. The alphabetic part of each design name denotes the Boolean operation performed by the design and the numeric part denotes its number of input bits. For example, MUX16 is a 16-bit multiplexer and SOP64 is a 64-bit sum of products function.

It can be observed from Table 3 that the number of slices and LUTs can be cut to roughly half for the multiplexer designs by using dedicated resources, with smaller reductions for SOP64, AND40 and OR40. The timing results are also improved in each case. Fig. 17 summarizes these results graphically.

Table 4 shows the results for implementations of some ISCAS’85 and ISCAS’89 benchmark circuits [15,16]. The leading letter of the benchmark name indicates whether the circuit is combinational (C) or sequential (S). The number that follows represents the number of nets in the original benchmark design. The functional descriptions and Verilog sources of the benchmark circuits are taken from [16].

Table 3. Implementation results for different Boolean function architectures.

                Conventional design          Optimized design
Architecture    Slices   LUTs   T (ns)       Slices   LUTs   T (ns)
MUX8            4        7      6.029        2        4      5.135
MUX16           9        15     6.800        4        8      5.451
MUX32           18       31     7.694        8        16     5.891
MUX64           36       63     8.478        17       33     6.675
SOP64           13       23     6.772        8        16     5.370
AND40           7        13     6.029        5        10     5.585
OR40            7        13     6.029        5        10     5.585

Fig. 17. Results comparison for different Boolean function architectures.

Table 4. Implementation results for selected ISCAS’85 and ISCAS’89 benchmark circuits.

                Conventional design          Optimized design
Benchmark       Slices   LUTs   T (ns)       Slices   LUTs   T (ns)
S27             3        6      5.314        1        1      4.213
C17             1        2      4.591        1        2      4.591
C499            49       86     8.711        61       48     7.213
S298            22       42     3.112        17       33     2.680
C1355           56       98     10.043       61       48     7.213

Fig. 18. Results comparison for selected ISCAS’85 and ISCAS’89 benchmarks.

Again, it can be observed from Table 4 and Fig. 18 that using dedicated resources reduces slice and LUT utilization and improves timing in most cases. The benchmark circuit C17 is an exception: the circuit is so small (consisting of only a few gates) that the synthesis tool itself maps it optimally. Hence, utilization of dedicated resources gives the same performance values as the conventional design approach. For the benchmark circuits C499 and C1355, the optimized designs utilize more slices than the conventional ones. This is due to the limited amount of dedicated resources available within a single slice. These are exceptional cases where the circuits require more dedicated hardware resources than are available within the slices. To meet the requirement for dedicated resources, the optimized designs span more slices, but the additional slices are used only for their dedicated hardware and their LUT resources are not used. This is confirmed by the reduced number of LUTs in each case. Fig. 18 summarizes these results graphically.

Tables 5 and 6 show the percentage reduction in resources, or improvement in timing, obtained by migrating from the conventional approach to the proposed optimized approach. The slice and LUT utilization reduction is almost 50% in most cases. The timing results are more promising at higher bit lengths.

Table 5. Performance improvement comparison of Boolean function architectures.

Architecture   Parameter   Conventional design   Optimized design   Reduction/Improvement (%)
MUX8           LUT         7                     4                  42.85
               Slice       4                     2                  50.00
               T (ns)      6.029                 5.135              14.83
MUX16          LUT         15                    8                  46.67
               Slice       9                     4                  55.55
               T (ns)      6.800                 5.451              19.84
MUX32          LUT         31                    16                 48.39
               Slice       18                    8                  55.55
               T (ns)      7.694                 5.891              23.43
MUX64          LUT         63                    33                 47.62
               Slice       36                    17                 52.78
               T (ns)      8.478                 6.675              21.27
SOP64          LUT         23                    16                 30.43
               Slice       13                    8                  38.46
               T (ns)      6.772                 5.370              20.70
AND40          LUT         13                    10                 23.07
               Slice       7                     5                  28.57
               T (ns)      6.029                 5.585              7.36
OR40           LUT         13                    10                 23.07
               Slice       7                     5                  28.57
               T (ns)      6.029                 5.585              7.36

Table 6. Performance improvement comparison of benchmark circuits.

Benchmark      Parameter   Conventional design   Optimized design   Reduction/Improvement (%)
S27            LUT         6                     1                  83.33
               Slice       3                     1                  66.67
               T (ns)      5.314                 4.213              20.72
C17            LUT         2                     2                  –
               Slice       1                     1                  –
               T (ns)      4.591                 4.591              –
C499           LUT         86                    48                 44.19
               Slice       49                    61                 –
               T (ns)      8.771                 7.213              17.76
S298           LUT         42                    33                 21.43
               Slice       22                    17                 22.73
               T (ns)      3.112                 2.680              13.88
C1355          LUT         98                    48                 51.02
               Slice       56                    61                 –
               T (ns)      10.043                7.213              28.19

Fig. 19. Performance factor comparison in terms of LUTs.

Fig. 20. Performance factor comparison in terms of slices.


Performance factor of the hardware designs is usually calculated as chip area consumed and critical path delay product.

$$\text{Performance factor (PF)} = \text{Chip Area} \times T \qquad (1)$$

where T is the critical path delay in nanoseconds (ns). In the case of FPGAs, chip area may be measured in several units, such as the number of LUTs, slices or CLBs consumed. Another way to represent area consumption is the equivalent gate count of the design. For an efficient design the value of the performance factor must be as low as possible: the smaller the PF value, the more efficient the design is considered to be.
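As a worked instance of Eq. (1), take the area in LUTs and the MUX8 figures from Table 5 (7 LUTs at 6.029 ns for the conventional design, 4 LUTs at 5.135 ns for the optimized design). The products below were computed here for illustration and are not read off the paper's figures.

```latex
\begin{align*}
\mathrm{PF}_{\text{conventional}} &= 7 \times 6.029 = 42.203 \\
\mathrm{PF}_{\text{optimized}}    &= 4 \times 5.135 = 20.540
\end{align*}
```

In this example the optimized design has roughly half the performance factor of the conventional one, since both the LUT count and the delay are lower.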


Fig. 21. Performance factor comparison in terms of equivalent gates.


In our case we have calculated the performance factor of the designs in three ways: as the product of LUT count and path delay (T), of slice count and T, and of equivalent gate count and T. The equivalent gate count is taken from the Xilinx mapping report generated after running the place and route process within the Xilinx ISE software.
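The LUT-based and slice-based variants of the performance factor can be sketched as follows, using the S298 figures from Table 6. This is an illustrative sketch only: the `Design` class and its method names are assumptions made for this example, and the equivalent-gate variant is omitted because the gate counts come from the Xilinx mapping report and are not tabulated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Design:
    """Area and timing figures of one implementation (hypothetical container)."""
    luts: int
    slices: int
    delay_ns: float  # critical path delay T in nanoseconds

    def pf_lut(self) -> float:
        """Performance factor with area measured in LUTs."""
        return self.luts * self.delay_ns

    def pf_slice(self) -> float:
        """Performance factor with area measured in slices."""
        return self.slices * self.delay_ns


# S298 benchmark figures copied from Table 6.
s298_conventional = Design(luts=42, slices=22, delay_ns=3.112)
s298_optimized = Design(luts=33, slices=17, delay_ns=2.680)

for label, design in (("conventional", s298_conventional),
                      ("optimized", s298_optimized)):
    print(f"S298 {label}: PF(LUT) = {design.pf_lut():.3f}, "
          f"PF(slice) = {design.pf_slice():.3f}")
```

A lower value indicates a more efficient design under the corresponding area unit.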

Figs. 19–21 show the performance factor comparison of the designs in terms of LUTs, slices and equivalent gate count, respectively. It can be observed that the performance factors of our optimized designs are far better than those of the conventional designs.

8. Conclusions

In this paper we have presented some useful techniques to utilize FPGA hardware resources effectively and efficiently. By applying the proposed techniques, not only can the utilized area of the FPGA be minimized, but the critical path lengths of designs can also be reduced. Consequently, the designs can run at higher clock rates and more logic may be added to the same FPGA chip. The dedicated hardware resources of Xilinx FPGAs are discussed in order to minimize the reliance of designs on LUT based architectures, which helps in reducing area consumption and realizing more timing-efficient architectures. Normally, the utilized chip area of an FPGA is calculated in terms of CLB count. However, conventional mapping techniques do not utilize the internal hardware resources of a CLB optimally. Therefore, counting area in terms of CLBs does not represent the actual utilized area of the chip. Using the optimized techniques, a CLB's hardware resources may be utilized optimally and more logic may be packed into a single CLB. Hence, the overall CLB count is reduced and, when area is counted in terms of CLBs, the count represents the actual utilized chip area more realistically.

This work should be helpful to researchers implementing specific applications on reconfigurable platforms. The techniques presented here not only help in fitting bigger designs into smaller chips but also help in achieving higher clock rates.

References

[1] National Instruments, Introduction to FPGA Technology: Top Five Benefits, December 2010. p. 1–4. http://zone.ni.com/devzone/cda/tut/p/id/6984.
[2] BDTI Focus Report: FPGAs for DSP. 2nd ed. BDTI Benchmarking; 2006.
[3] Worchel Jerry. The field programmable gate array (FPGA): expanding its boundaries. In-Stat Market Research; 2006. p. 1–42. http://www.instat.com/abstract.asp?id=68&SKU=IN0603187-SI.
[4] Thompson Mike. FPGAs accelerate time to market for industrial designs, EE Times, July 2004. p. 1–1. http://www.eetimes.com/showArticle.jhtml?articleID=221-02798.
[5] McGrath D. FPGA market to pass $2.7 billion by ’10. In-Stat Says, EE Times, 2006. p. 1–1.
[6] Latif Kashif, Aziz Arshad, Mahboob Athar. Efficient resource utilization of FPGAs. In: FIT’09 Proceedings of sixth international conference on Frontiers of Information Technology, ISBN: 978-1-60558-642-7. New York, USA: ACM; 2009. p. 1–5.
[7] Ciletti MD. Advanced digital design with the Verilog HDL. PEARSON, Prentice Hall; 2007.
[8] National Instruments, FPGAs – Under the Hood, April 2008. p. 1–16. http://zone.ni.com/devzone/cda/tut/p/id/6983.
[9] Xilinx, Virtex-4 User Guide UG070 V2.6, December 2008. http://www.xilinx.com/.
[10] Xilinx, Spartan-3 FPGA Family: Complete Data Sheet, April 2008. http://www.xilinx.com/.
[11] Xilinx, Virtex-II Platform FPGAs: Complete Data Sheet DS083 V4.3, November 2007. http://www.xilinx.com/.
[12] Xilinx, Project Navigator, View/Edit Routed Design (FPGA Editor), Xilinx ISE Design Suite 10.1.
[13] Krueger R, Przybus B. Virtex variable-input LUT architecture. Xilinx White Paper: Virtex and Virtex-II Series FPGAs, January 2004. p. 1–6.
[14] Xilinx, Virtex-II Platform FPGAs: User Guide UG012 V4.2, November 2007. http://www.xilinx.com/.
[15] Brglez F, Fujiwara H. A neutral netlist of 10 combinational circuits, IEEE international symposium circuits and systems. Piscataway, NJ: IEEE Press; 1985. p. 695–8.
[16] Hansen MC, Yalcin H, Hayes JP. Unveiling the ISCAS-85 benchmarks: a case study in reverse engineering, design and test of computers. IEEE 1999;16(3):72–80.



Kashif Latif obtained a B.E. in Industrial Electronics from N.E.D. University of Engineering and Technology, Karachi, Pakistan and an M.S. degree in Electrical Engineering from National University of Sciences and Technology, Pakistan in 2002 and 2008, respectively. He has been involved in various R&D assignments since 2002. Presently he is a Ph.D. candidate at National University of Sciences and Technology, Pakistan. His research interests include Information Security and Cryptography, FPGA based Systems Designs, Hardware Solutions of Cryptographic Applications and Digital Systems Design.

Arshad Aziz obtained B.E. and M.E. degrees in Computer Engineering from Sir Syed University of Engineering and Technology, Karachi, Pakistan in 1998 and 2002, respectively. He obtained his Ph.D. in Electrical Engineering from National University of Sciences and Technology, Pakistan in 2007. He is currently an Associate Professor in Electrical Engineering at the National University of Sciences and Technology, Pakistan. His research interests include Computer and Network Security, Cryptography, Computer Networks and Internetworking, TCP/IP Protocol suite, FPGA Based Systems Design, Computer Architectures and Operating Systems.

Athar Mahboob obtained B.S. and M.S. degrees in Electrical Engineering from Florida State University at Tallahassee, Florida, USA in 1992 and 1995, respectively. He obtained his Ph.D. in Electrical Engineering from National University of Sciences and Technology, Pakistan in 2005. He is currently an Associate Professor in Electrical Engineering at the National University of Sciences and Technology, Pakistan. His research interests include implementing Enterprise Information Services using Linux, Information Security and Cryptology, Computer Networks and Internetworking using TCP/IP Protocols, Digital Systems Design and Computer Architectures.