CHAPTER-II
LITERATURE REVIEW
Over the last decade, FPGAs have become one of the key digital circuit implementation
media. FPGAs are pre-fabricated silicon devices that can be electrically programmed to
become almost any kind of digital circuit or system. They provide a number of
advantages over fixed-function ASIC technologies. ASICs typically take months to
fabricate and cost thousands to millions of dollars to obtain the first device; FPGAs are
configured in less than a second, can often be reconfigured if required, and cost
anywhere from a few dollars to a few thousand dollars. However, the flexible nature
of an FPGA comes at a significant cost in area, delay, and power consumption.
This chapter presents a detailed survey and review of research papers, technical
papers and FPGA vendor manuals concerning FPGA-based designs in general, and
improvements to their overall performance in particular. The observations and
proposals made by contributors toward improving FPGA performance are summarized,
along with concluding remarks.
2.1 Introduction
FPGA devices are programmable devices capable of implementing any digital logic
circuit. They offer a designer the flexibility of creating a wide array of logic circuits at a
low cost, because it is not necessary to manufacture a new custom made integrated circuit
each time. However, the FPGA devices are bigger and consume more power than their
ASIC counterparts [Kuon and Rose (2007)]. The main drawback of FPGAs is that they
are less efficient than ASICs due to the added circuitry needed to make them
reconfigurable. As a result FPGAs have been found to be a practical platform for medium
and low volume applications only. The area overhead, combined with research and
development costs, increases the per-unit cost of FPGAs, which makes them less suited
for high-volume applications. Moreover, the speed and power overhead precludes the use
of FPGAs for high-speed or low-power applications. In the more than twenty years since
the introduction of the FPGA, research and development has produced dramatic improvements
in FPGA speed and area efficiency, narrowing the gap between FPGAs and ASICs and
making FPGAs the platform of choice for implementing digital circuits.
A significant number of studies have focused on faster and more area-efficient
programmable routing resources. Some important advancements have also been made in
respect of CAD tools that are used to map applications onto the programmable fabric of
FPGA. The Versatile Place and Route (VPR) tool described by Betz, Rose and Marquardt
(1999), yields significant improvements in performance by improving on the existing
clustering, placement, and routing algorithms. Logic-to-memory mapping tools,
described by Cong and Xu (1998), Wilton (1998) and in the International Technology
Roadmap for Semiconductors (ITRS), show improvement in the area efficiency of
FPGAs with embedded memories wherein parts of the application are packed into unused
memories before mapping the rest of the application into logic elements. In recent years,
the main focus of research has been shifting toward lowering power consumption. Power
consumption is an important part of the equation determining product size, weight and
efficiency. Unfortunately, the advantages of FPGAs are offset in many cases by their
high power consumption and area. The improved reliability, lower operating and cooling
costs, and the ever-growing demand for low-power portable communications and
computer systems, are motivating new low-power techniques, especially for FPGAs, which
dissipate significantly more power than fixed-logic implementations. Indeed, the ITRS
has identified low-power design techniques as a critical technology need.
2.2 Literature Survey
The first modern-era FPGA was introduced by Xilinx in 1984, as stated by Carter et al.
(1986). It contained the classic array of Configurable Logic Blocks. From that first FPGA
which contained 64 logic blocks and 58 inputs and outputs, FPGAs have grown
enormously in complexity. Modern FPGAs can now contain approximately 330,000
equivalent logic blocks and around 1100 inputs and outputs [Altera Corporation
handbook (2006), Xilinx- Virtex-5 user guide (2006)] in addition to a large number of
more specialized blocks that have greatly expanded the capabilities of FPGAs. These
massive increases in capabilities have been accompanied by significant architectural
changes.
This section covers the detailed survey of work done on FPGA architecture in the area of
its programming technology, architecture of logic blocks, routing architecture and
input/output architecture.
2.2.1 Programming Technologies
Every FPGA relies on an underlying programming technology which is used to control
the programmable switches that give programmability to FPGAs. There are a number of
programming technologies and their differences have a significant effect on
programmable logic architecture. The approaches that have been used historically include
EPROM [Frohman-Bentchkowsky (1971)], EEPROM [Cuppens et al. (1985)], flash [Guterman
et al. (1979)], static memory [Carter et al. (1986)], and anti-fuses [Birkner et al. (1992)].
Of these approaches, only the flash, static memory and anti-fuse approaches are widely
used in modern FPGAs. All these programming technologies have been reviewed below:
i) Static Memory Programming Technology
Static memory cells are the basis for SRAM programming technology that is widely used
and can be found in devices from Xilinx [Xilinx Virtex-4 family overview (2005)],
Lattice [Lattice SC family data sheet (2007)], and Altera [Altera Corporation handbook
(2006)]. This technology has become the dominant approach for FPGAs because of its
two primary advantages: re-programmability and the use of standard CMOS process
technology. From a practical point of view, an SRAM cell can be programmed an
indefinite number of times. The dedicated circuitry on the FPGA itself initializes all the
SRAM bits on power up and configures the bits with a user-supplied
configuration. Unlike other programming technologies, the use of SRAM cells requires no
special integrated circuit processing steps beyond standard CMOS. As a result, SRAM-
based FPGAs can use the latest CMOS technology available and, therefore, benefit from
the increased integration, the higher speeds and the lower dynamic power consumption of
new processes with smaller minimum geometries. However, there are a number of
drawbacks associated with SRAM-based programming technologies in respect of size,
volatility, security and electrical properties of the transistors.
ii) Flash/EEPROM Programming Technology
Some of the shortcomings of SRAM based technology have been addressed by the use of
floating gate programming technologies that inject charge onto a gate that “floats” above
the transistor. This approach is used in flash or EEPROM memory cells. These cells are
non-volatile; they do not lose information when the device is powered off. Flash-based
programming technology offers several unique advantages, the most important of which
is its non-volatility. This feature eliminates the need for the external resources required to
store and load configuration data when SRAM-based programming technology is used. A
flash-based device can also function immediately upon power-up without waiting for the
loading of configuration data. This approach is also more area efficient than SRAM-
based technology. One disadvantage of flash-based devices, however, is that they cannot
be reprogrammed an unlimited number of times.
One trend that has emerged is the use of flash storage in combination with SRAM
programming technology [Leventis et al. (2004) and Lattice XP family data sheet (2005)].
The devices from Altera, Xilinx and Lattice, use on-chip flash memory to provide non-
volatile storage while SRAM cells are still used to control the programmable elements in
the design. In this way, the problems associated with the volatility of pure-SRAM
approaches, such as the cost of additional storage devices or the possibility of
configuration data interception, can be addressed while the unlimited
reconfigurability of SRAM-based devices is maintained.
iii) Anti-fuse Programming Technology
Anti-fuse programming technology [Birkner et al. (1992)] is an alternative to SRAM and
floating gate-based technologies. This technology is based on structures which exhibit
very high resistance under normal circumstances but can be programmed, or "blown",
to create a low-resistance link. Unlike SRAM or floating gate
programming technologies, this link is permanent. The programmable element, an anti-
fuse, is directly used for transmitting FPGA signals. The primary advantage of anti-fuse
programming technology is its low area. Metal-to-metal anti-fuses require no silicon
area to make connections, which decreases the area overhead of programmability.
However, this decrease is slightly offset by the need for large
programming transistors that supply the large currents needed to program the anti-fuse.
Anti-fuses have the additional advantage of lower on-resistances and parasitic
capacitances than other programming technologies.
With the low area, resistance and capacitance of the fuses, it is possible to include more
switches per device as compared to other technologies. Non-volatility also means that the
device works instantly once programmed, and therefore the FPGA can be
used in situations that require operation immediately upon power up. This lowers system
costs since additional memory for storing the programming information is not required.
There are also some significant disadvantages to this programming technology. In
particular, since anti-fuse-based FPGAs require a non-standard CMOS process, they
typically lag well behind SRAM-based FPGAs in the manufacturing processes they can
adopt. Furthermore, the fundamental mechanism of programming
involves significant changes to the properties of the materials in the fuse, which leads to
scaling challenges when new IC fabrication processes are considered.
Out of the three programming technologies reviewed in this section that are used in
modern devices, SRAM-based programming technology has become the most widely
used. An ideal technology would be non-volatile and reprogrammable using a standard
CMOS process and offer low on-resistances and low parasitic capacitances. It is also
clear that none of the technologies satisfies all these requirements. Use of the standard
CMOS manufacturing processes is one of the primary reasons that SRAM technology has
dominated and its dominance can be expected to continue for the foreseeable future of
CMOS technology.
2.2.2 Architecture of Configurable Logic Blocks
FPGAs consist of CLBs that implement logic functions, programmable routing to
interconnect these functions, and I/O blocks for making the chip's connections. Although
many of the fundamental challenges and issues in FPGAs involve programmable routing
circuit design and architecture, the logic block architecture of an FPGA is also
extremely important because it has a dramatic effect on how much programmable routing
is required. A logic block in an FPGA provides the basic computation and storage
elements used in digital logic systems. A fine-grained logic block requires the use of
large amounts of programmable interconnect to create any typical logic function; as a
result, such an FPGA suffers from area inefficiency, low performance and high
power consumption. At the other extreme, a logic block could be an entire processor.
This approach exists in the commercial space, although processors are mixed with some
more fine grained logic blocks in a device as described by Triscend Corporation (2001)
and Xilinx Data Sheet (2007). Such a logic block on its own would not have the
performance gains that come from customizable hardware. In between these extremes is a
spectrum of logic block choices ranging from fine to coarse-grain logic blocks. FPGA
architects over the last two decades have selected basic logic blocks made of
transistors [Marple and Cooke (1992)], NAND gates [Plessey Semiconductors data sheet
(1989)], an interconnection of multiplexers [Gamal et al. (1989)], lookup tables [Carter
et al. (1986)], and PAL-style wide-input gates [Tsu et al. (1999)].
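As a concrete illustration of the lookup-table approach: a K-input LUT is a 2^K-entry truth table held in configuration memory, with the K input bits selecting one entry. The sketch below is a hypothetical software model (the function names and bit ordering are illustrative assumptions, not drawn from any cited work):

```python
def make_lut(truth_table):
    """Return a function that behaves like a K-input LUT.

    truth_table: a list of 2**K output bits, playing the role of the
    SRAM configuration bits in an FPGA logic element. The K input
    bits are packed into an index that selects one entry.
    """
    def lut(*inputs):
        # Input 0 is treated as the least significant select bit.
        index = sum(bit << i for i, bit in enumerate(inputs))
        return truth_table[index]
    return lut

# A 2-input XOR needs 2**2 = 4 configuration bits.
xor2 = make_lut([0, 1, 1, 0])
```

Any 2-input Boolean function can be realized by choosing a different 4-bit table, which is exactly what makes the LUT a universal fine-grained logic block.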
Foundational research has focused on the effect of logic block functionality on the
three key metrics: area, speed, and power. Wong et al. (1989) give a more detailed
survey of logic block specifications. Many modern FPGAs contain a
heterogeneous mixture of different blocks, some of which can only be used for very
specific functions, such as dedicated memory blocks or multipliers. These structures are
very efficient at implementing specific functions, but the blocks are
wasted if they go unused.
FPGA area-efficiency is one of the key metrics because the size of the FPGA die controls
a significant portion of its cost, particularly for devices with a large logic capacity. The
works of Rose, et.al, (1993) and Ahmad (2001) first explore the effect of lookup table
(LUT) size on area and speed performance. Figure 2.1 illustrates the basic trade-off for
area. Its X-axis represents the size of the lookup table (or K, the number of inputs to the
lookup table). For this architecture, a cluster size of 1 was used, which means that each
logic block contained exactly one LUT and flip-flop. The left-hand Y-axis (dashed line)
represents the area of the logic block and its surrounding routing, while the right-hand
Y-axis (solid line) represents the geometric average of the number of K-input LUT/flip-flop
blocks needed to implement the 28 circuits used in the experiment. This experiment
illustrates that as the LUT size (K) increases, the number of LUTs required to implement
the circuits significantly decreases. However, the area cost of implementing the logic and
routing for each block increases significantly with K due to the following reasons:
(1) The number of programming bits in a K-input lookup table is 2^K, indicating an
exponential area increase with K, and
(2) The number of routing tracks surrounding the logic required for successful routing
increases as the number of pins connecting into the logic block increases as determined
by K.
Figure 2.2 shows the total area obtained when the two
curves of Figure 2.1 are multiplied. This curve shows that, at first, the reduction in block
count reduces the total area, but then the growth in per-block area dominates and the
total area increases with LUT size. This curve is typical of any area versus granularity
experiment in FPGA architecture.
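The U-shaped behaviour of the total-area curve can be reproduced with a toy model: per-block area grows with the 2^K configuration bits plus per-pin routing, while the number of blocks needed for a fixed circuit falls as K grows. All constants below are invented for illustration and are not the data of Rose et al. (1993):

```python
def total_area(K, circuit_size=10_000):
    """Toy model of total FPGA area versus LUT input count K."""
    # Per-block area: 2**K configuration bits plus routing that
    # grows with the number of pins (assumed proportional to K).
    area_per_block = 2 ** K + 40 * K
    # Blocks needed: larger LUTs absorb more logic per block
    # (a crude assumed trend, not measured data).
    blocks_needed = circuit_size / (K - 1)
    return blocks_needed * area_per_block

# The product of the two opposing trends is U-shaped in K.
areas = {K: total_area(K) for K in range(2, 8)}
```

Under these assumed constants the minimum falls at a moderate K, mirroring the qualitative conclusion of Figure 2.2 that neither very small nor very large LUTs are area-optimal.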
Figure 2.1: Number of logic blocks and area/block vs. logic block functionality [Rose
et al. (1993)].
Figure 2.2: Total area of FPGA vs. LUT size [Rose et al. (1993)].
Rose et al. (1993) further state that fewer logic blocks are used on the critical
path of a given circuit as the functionality of the logic block increases. This results in
the need for fewer logic levels and higher overall speed performance. A reduction in logic
levels reduces the required amount of inter-logic block routing that contributes a
substantial portion of the overall delay. Moreover, as the functionality of the logic block
increases, its internal delay also increases.
Total FPGA delay as a function of LUT size includes the routing delay for each level of
logic. Recent trends in commercial architectures have indeed moved toward larger LUT
sizes to capture these gains [Xilinx- Virtex-5 user guide (2006)] and the study by Ahmed
and Rose (2004) reveals that increasing both LUT and cluster size decreases the critical
path delay monotonically, with diminishing returns. There are significant gains from
increasing LUT size up to six and cluster size up to three or four.
Like all integrated circuits, power consumption in FPGAs is generally divided into two
categories: dynamic power and static power. Dynamic power is the power consumed by
the transitioning of signals on the device. Even in the absence of signal transitions, power
continues to be consumed and that power consumption is known as static or leakage
power. Results of the study by Lewis, et.al, (2005) suggest that the best logic block
architectures for area are also the best logic block architectures for power consumption.
Li et al. (2005) conclude that the best LUT and cluster sizes in terms of area-efficiency
described by Rose et al. (1993) are also the best sizes for minimizing dynamic power
consumption. Cheng et al. (2007) showed how to optimize logic block architecture
together with dynamic and static power reduction techniques, demonstrating how
sleep transistors and threshold voltage settings can be used to achieve significant power
consumption reductions for a fixed, standard 4-LUT architecture.
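The dynamic and static components discussed above follow the standard CMOS relations P_dyn = alpha * C * V^2 * f and P_static = V * I_leak. A minimal sketch, with numeric values invented purely for illustration and not taken from any cited study:

```python
def dynamic_power(alpha, C, V, f):
    """Dynamic power of toggling nodes: P = alpha * C * V^2 * f,
    where alpha is the switching activity factor, C the switched
    capacitance, V the supply voltage and f the clock frequency."""
    return alpha * C * V ** 2 * f

def static_power(V, I_leak):
    """Static (leakage) power drawn even with no switching."""
    return V * I_leak

# Illustrative values: 15% activity, 10 pF switched capacitance,
# 1.2 V supply, 200 MHz clock, 5 mA total leakage current.
p_dyn = dynamic_power(0.15, 10e-12, 1.2, 200e6)   # watts
p_stat = static_power(1.2, 5e-3)                  # watts
```

The quadratic dependence on V is why supply-voltage scaling is such a powerful dynamic-power lever, while leakage persists regardless of activity, which is what the sleep-transistor techniques above target.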
2.2.3 Routing Architecture
To complete a user-designed circuit, the programmable routing in an FPGA, which consists
of wires and programmable switches, provides connections among logic blocks and I/O
blocks. Certain common characteristics of these designs exert a strong influence on the
architecture of FPGA routing, despite the fact that the routing demand of logic circuits
varies from design to design. Moreover, a number of signals such as clocks and resets
available in the circuits need to be widely distributed across the FPGA. All modern
FPGAs contain dedicated interconnect networks to handle the distribution of these
signals.
FPGA global routing architectures can be characterized as either hierarchical [Cheng et
al. (2007)] or island-style [Aggarwal and Lewis (1994)]. Hierarchical routing
architectures separate FPGA logic blocks into distinct groups [Cheng et al. (2007), Betz,
Rose and Marquardt (1999)]. The connections between logic blocks within a group can
be made using wire segments at the lowest level of the routing hierarchy and connections
between logic blocks in distant groups require the traversal of one or more levels of the
hierarchy of routing segments. In island-style FPGAs, by contrast, logic blocks are
arranged in a two-dimensional mesh with routing resources evenly distributed throughout
the mesh. An island-style global routing architecture typically has routing channels on all
four sides of the logic blocks. Most commercially available SRAM-based FPGA
architectures as per Altera Corporation handbook (2006) and Xilinx- Virtex-5 user guide
(2006), use island-style architectures.
Several researchers have attempted to determine FPGA segmentation by routing a series
of designs and examining wire lengths. Brown et al. (1996) used global routing followed
by detailed routing to complete the FPGA design. Although this study questioned the
need for segment lengths of greater than length 2 or 3, the two-step router increased the
difficulty of wire sharing and limited the use of longer segments.
Betz et al. (1999) used a contemporary FPGA router which combines global and detailed
routing into one step to evaluate segmentation [Betz and Rose (1999A)]. This study
verified the importance of including significant medium length segments which span
between 4 and 6 logic blocks in an island-style routing architecture. As described by
Lewis, et.al (2003), this finding was validated during the development of the Stratix
architecture, which contains significant length 4 and length 8 segments.
As stated by Brown, Khellah and Vranesic (1996), Lemieux and Lewis (2002) and Sheng
and Rose (2001), many FPGA architectures have been developed that use pass transistors
and tri-state buffers as routing switches and numerous commercial FPGAs allow for
direct connections between logic blocks to avoid the need to drive the interconnect fabric.
The work by Roopchansingh and Rose (2002) shows that these connections, which avoid
delays in traversing connection blocks and switch blocks for very near neighbor
connections, can improve speed by 6.4% at a small area cost of 3.8%.
In modern IC fabrication technologies, the proximity of two routing tracks gives rise to a
capacitive effect known as crosstalk. Several researchers have attempted to improve the
performance of interconnect wires through increased wire spacing: placing wires farther
apart reduces this effect, lowering the capacitance on the wire and increasing its
speed. The work by Betz and Rose (1999B) determines that a 13%
circuit speedup could be achieved by using 5 times minimum wire spacing on 20% of the
routing tracks in each island-style channel. Increased track spacing was implemented in a
commercial architecture by Hutton, et.al, (2002) which assigns 20% of routing wires to
these fast routing resources.
Another circuit-level technique to improve performance involves the use of routing
multiplexers that contain fast paths. The number of pass transistors required to traverse
different paths in the multiplexer is imbalanced, leading to fast paths for critical inputs
and slower paths for regular inputs. This technique has been integrated into the routing
architecture of Altera Stratix II devices, as reported in the work by Lewis et al. (2005).
Like the spacing approach, critical paths are assigned to fast routing resources by the
FPGA router. It was found that the availability of imbalanced multiplexers improved
design performance by 3% without impacting device area.
Although recent FPGA system clock speeds approach 200–400 MHz, they still lag far
behind counterparts such as microprocessors. Moreover, a specific microprocessor
operates at the same frequency for each application whereas FPGA operating frequencies
vary from application to application. In general, the long and variable interconnect delays
associated with FPGA routing are responsible for both of these issues. The research work
by Singh and Brown (2001) and by Weaver, Hauser and Wawrzynek (2004) has
examined adding pipeline registers to FPGA interconnect to address these concerns. On
the one hand, these registers allow for enhanced raw clock rates; on the other, they
complicate the FPGA routing problem, since the number of flip-flops on paths which
converge on a logic block must be matched to allow for causal behavior.
Brown, Khellah and Vranesic (1996) added flip–flops to all interconnect switches and
logic block inputs and outputs for a routing network organized in the hierarchical
topology. This approach of pipelining segment-to-segment connections and logic block
I/O allows all designs mapped to the FPGA to run at the same system clock frequency.
To account for routing paths which traverse different counts of interconnect flip–flops, an
adjustable value of up to seven flip–flops is allocated per logic block input and the
inclusion of the routing flip–flops leads to a 50% increase in overall routing area.
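The matching constraint on converging paths can be stated as a small check. Representing each routing path by its flip-flop count, and the function names themselves, are illustrative assumptions, not part of the cited tool flows:

```python
def paths_balanced(ff_counts):
    """Check the retiming constraint for pipelined interconnect.

    ff_counts: one flip-flop count per routing path converging on
    the same logic block input cone. For the block to see causally
    aligned data, all counts must be equal.
    """
    return len(set(ff_counts)) <= 1

def padding_needed(ff_counts):
    """Flip-flops to insert on each path to equalize latency,
    matching every path to the deepest one."""
    deepest = max(ff_counts)
    return [deepest - count for count in ff_counts]
```

This is the cost the text describes: each imbalance must be repaired with extra registers (up to seven per logic block input in the scheme above), which is where the 50% routing-area increase comes from.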
2.2.4 Input/output Architecture
The I/O pad and its surrounding supporting logic and circuitry are referred to as an
input/output cell. These cells are also important components of an FPGA for two reasons:
a) this interface sets the rate for external communication; b) these cells along with their
supporting peripherals consume a significant portion of an FPGA’s area. For example, in
the Altera Stratix 1S20 and the Altera Cyclone 1C20, I/O’s and peripheral circuitry
occupy 43% and 30% of the total silicon area, respectively [Leventis, et.al, (2003)]. The
major challenge in input/output architecture design is the great diversity in input/output
standards; for example, different standards may require different input voltage thresholds
and output voltage levels. To support these differences, different I/O supply voltages are
often needed for each standard. They may also require a reference voltage to compare
against the input voltages.
Most modern FPGAs have adopted an I/O banking scheme in which input/output cells
are grouped into predefined banks [Lattice ECP/EC family data sheet (2001)]. Each bank
shares common supply and reference voltages. A single bank therefore cannot support all
the standards simultaneously, but different banks can have different supplies to support
otherwise incompatible standards. In some FPGA families, the number of I/Os per bank
is relatively constant for all device sizes, at 64 pins per bank [Xilinx Virtex-4 family
overview (2005)] or 40 pins per bank [Xilinx Virtex-5 user guide (2006)]. Some FPGA
families, at the other extreme, adopt a fixed number of banks across all the devices of the
FPGA family [Lattice SC family data sheet (2007)]. This latter approach means that the
number of pins per bank will be significantly larger for the largest members of a device
family. This can be very restrictive when using these large devices. In Altera Corporation
handbook (2006), a hybrid approach of having a variable number of banks with a variable
number of pins per bank has been proposed. Devices with more I/O pins have more
banks, but the number of pins per bank is allowed to increase as well. Besides bank
sizing, it is necessary to determine whether independent banks will be functionally
equivalent. Each bank could independently support every I/O standard supported by the
device.
Kuon and Rose (2007) stated that FPGAs are approximately 3 times slower, 20 times
larger, and 12 times less power efficient than ASICs, because their programmable
switches, controlled by configuration memory, occupy a large area and add a significant
amount of parasitic capacitance and resistance to the logic and routing resources. Since
the introduction of FPGA, research and development has produced dramatic
improvements in FPGA speed and area efficiency, narrowing the gap between FPGAs
and ASICs and making FPGAs the platform of choice for implementing digital circuits.
At present, FPGAs hold significant promise as a fast-to-market replacement for
ASICs in many applications. As shown in Figure 2.3, there are three performance
parameters for a circuit designed on a given FPGA architecture following a particular
FPGA CAD flow: area, speed and power.
Figure 2.3: Performance Parameters for FPGA Design
The following sections summarize prior work by some of the contributors toward
reducing these three main performance indicators: area, delay and power.
2.3 Prior work related to reduction of Area and Delay in FPGAs
Researchers have focused on faster and more area-efficient programmable routing
resources. As already mentioned above, the VPR tool described by Betz, Rose
and Marquardt (1999) gives significant improvements in performance by improving on
the existing clustering, placement and routing algorithms. Logic-to-memory mapping
tools, described by Cong and Xu (1998) and Wilton (1998), show improvement in the area
efficiency of FPGAs with embedded memories wherein parts of the application are
packed into unused memories before mapping the rest of the application into logic
elements. The contributions of some of the research scholars in this area are
summarized below:
The study by Lemieux and Lewis (2001) determined that at least half of the connections
between cluster inputs and logic element inputs can be removed and between 50% and
75% of the feedback connections from logic element outputs to logic element inputs can
be removed with no impact on delay or the number of logic clusters required. This switch
depopulation results in about a 10% area reduction for FPGAs with cluster sizes similar
to commercial offerings.
As described in the Xilinx Inc. Synthesis and Simulation Design User Guide (2008),
resource sharing is an optimization technique that uses a single functional block to
implement several operators in the HDL code; it is also known as Time Division
Multiplexing (TDM). Resource sharing reduces the device area of the design, but it adds
additional logic levels to multiplex the inputs so that one block implements more than
one function, and is therefore not recommended for arithmetic functions on a design's
time-critical path.
Garrault and Philofsky (2006) also suggested describing designs behaviorally as
much as possible. The use of a reset, and the type of reset, can have serious implications
for FPGA design performance. An improper reset strategy can create an unnecessarily
large design. A sub-optimal reset strategy can prevent:
- the use of a device library component, such as shift register look-up tables (SRLs);
- the use of the synchronous elements of dedicated hardware blocks;
- optimization of logic inside the fabric.
If a reset is used, the function is implemented with generic logic resources and occupies
more area. Similarly, asynchronous resets should be avoided to allow packing of
additional registers into dedicated resources. For an area-optimal design it is
recommended to avoid sets and resets whenever possible.
Hu, et al, (2008) proposed a new resynthesis algorithm for FPGA area reduction. In
contrast to existing resynthesis techniques, which consider only single-output Boolean
functions and the combinational portion of a circuit, they considered multioutput
functions and retiming, and developed effective algorithms that incorporate recent
improvements to SAT-based Boolean matching. They showed that, at the optimal logic
depth, resynthesis considering multioutput functions reduces area by up to 0.4%
compared to resynthesis considering single-output functions, and that sequential resynthesis
reduces area by up to 10% compared to combinational resynthesis when both consider
multi-output functions.
Kobata, et al,(2007) proposed a clustering technique for a cluster-based FPGA to
optimize routability of outer cluster nets. In order to reduce the routing resources used in
FPGA, this technique uses two evaluation functions. One evaluation function reduces the
routing resources in the outer cluster. The second evaluation function utilizes various
characteristics of the local routing resources in the inner cluster. The clustering technique
proposed by them has the unique ability to optimize routing resources concurrently.
Amit Singh, et al, (2002) utilized Rent’s rule as an empirical measure for efficient
clustering and placement of circuits in clustered FPGAs. They have shown that careful
matching of resource availability and design complexity during the clustering and
placement processes can contribute to spatial uniformity in the placed design, leading to
overall device decongestion after routing. They presented experimental results to show
that appropriate logic depopulation during clustering can have a positive impact on the
overall FPGA device area. They claim that the clustering and placement techniques
proposed by them can improve the overall device routing area by as much as 62%, 35%
on average, for the same array size, when compared to state-of-the-art FPGA clustering,
placement, and routing tools.
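Rent's rule, the empirical measure used above, relates a cluster's external terminal count T to the number of blocks G it contains as T = t * G^p, where t is the average terminals per block and p the Rent exponent. The parameter values in this sketch are illustrative assumptions; real designs are characterized empirically:

```python
def rent_terminals(G, t=4.0, p=0.6):
    """Rent's rule: external terminals T = t * G**p for a cluster
    of G logic blocks. t and p here are illustrative placeholders,
    not values measured for any particular benchmark set."""
    return t * G ** p
```

Because p < 1, the pin demand of a cluster grows sublinearly with its size, which is why matching cluster capacity to this predicted pin demand during clustering and placement helps spread routing congestion uniformly across the device.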
Muhammad Khellah et al. (1994) stated that speed can be improved by enhancing the
interconnect in FPGAs. Both the routing architecture of the chip and the CAD tools
used to route circuits were studied. They concluded that interconnect length dramatically
affects speed performance, that it is crucial to limit the number of programmable
switches a signal passes through in series, and that the impact of decisions made by CAD
routing tools is very significant: the CAD tool should consider both speed performance
and area utilization, not just one goal.
Yuzo et al. (2010) stated that in new process technologies interconnections dominate the
delays in FPGAs because of increased RC delay, and proposed a novel routing structure
for FPGA interconnections based on a small-world network. This network leads to short
distances between nodes and high connectivity between neighbors. The proposed routing
structure has a few random wires that connect distant blocks and act as shortcuts. The
results of an evaluation indicate that the proposed routing structure improves the critical
path delay.
The research projects carried out by Singh and Brown (2001) and by Weaver, Hauser and
Wawrzynek (2004) examined the effect of adding pipeline registers to FPGA
interconnect to address the problem of increasing the maximum clock operating
frequency in FPGAs. On one hand these registers allow for enhanced raw clock rates, but
on the other hand they complicate the FPGA routing problem, since the number of flip-flops
on paths which converge on a logic block must be matched to preserve causal
behavior.
The work by Roopchansingh and Rose (2002) shows that direct connections between
logic blocks, which avoid the delays of traversing connection blocks and switch blocks
for very near neighbor connections, can improve speed by 6.4% at a small area cost of 3.8%.
2.4 Prior Work Related to Reduction in Power Consumption
The ever-growing demand for low-power portable communications and computer
systems is motivating new low power techniques, especially for FPGAs, which dissipate
significantly more power than fixed-logic implementations. Indeed, the ITRS has
identified low-power design techniques as a critical technology need.
2.4.1 Types of Power Consumptions
Like all integrated circuits, FPGAs dissipate two types of power: static and dynamic.
Static power is consumed due to transistor leakage and is dissipated when current leaks
from the power supply to ground through transistors that are in the off state, via three
types of leakage: sub-threshold leakage (from source to drain), gate-induced drain
leakage, and gate direct-tunneling leakage. Dynamic power is consumed mainly by
toggling nodes, as a function of voltage, frequency, and capacitance; it is dissipated when
capacitances are charged and discharged during the operation of the circuit, and is
consumed during switching events in the core or I/O of the FPGA. As
described by Shang, Kaviani and Bathala (2002), the dynamic power consumption is
generally modeled as below:
P = Σᵢ Cᵢ · Vᵢ² · fᵢ

where Cᵢ, Vᵢ and fᵢ represent the capacitance, the voltage swing, and the clock frequency
of resource i, respectively. The total dynamic power consumed by a device is the summation
of the dynamic power of each resource. Because of the programmability of the FPGA, the
dynamic power is design-dependent, and the factors that contribute to the dynamic power
are: the effective capacitance of resources, the resource utilization, and the switching
activity of resources [Shang, Kaviani and Bathala (2002); Degalahal and Tuan (2005)].
The effective capacitance corresponds to the sum of parasitic effects due to
interconnection wires and transistors. Since FPGA architecture usually provides more
resources than required to implement a particular design, some resources are not used
after chip configuration and they do not consume the dynamic power (this is referred to
as resource utilization). Switching activity represents the average number of signal
transitions in a clock cycle. Though generally it depends on the clock itself, it may also
depend on other factors (e.g. temporal patterns of input signals). Hence, the above
equation, as stated by Shang, Kaviani and Bathala (2002), can be rewritten as:

P = V² · f · Σᵢ Cᵢ · Uᵢ · Sᵢ

where V is the supply voltage, f is the clock frequency, and Cᵢ, Uᵢ, and Sᵢ are the
effective capacitance, the utilization, and the switching activity of each resource,
respectively.
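To make the model concrete, the following sketch evaluates this equation for a handful of per-resource values. The capacitance, utilization and activity numbers are illustrative only, not taken from the cited works:

```python
# Estimate total dynamic power P = V^2 * f * sum(C_i * U_i * S_i)
# over a list of resources, following the model above.

def dynamic_power(v_supply, f_clock, resources):
    """Sum V^2 * f * C * U * S over all resources.

    Each resource is (capacitance_F, utilization_0to1, switching_activity).
    """
    return v_supply ** 2 * f_clock * sum(c * u * s for c, u, s in resources)

# Hypothetical FPGA resources: (effective capacitance in farads,
# utilization fraction, average transitions per clock cycle).
resources = [
    (2e-12, 0.6, 0.25),    # interconnect segment
    (0.5e-12, 0.8, 0.15),  # LUT
    (1e-12, 1.0, 1.0),     # clock buffer (toggles every cycle)
]

p = dynamic_power(1.2, 100e6, resources)
print(f"Estimated dynamic power: {p * 1e6:.2f} uW")
```

Unused resources contribute nothing (U = 0), which is exactly the resource-utilization effect described above.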
FPGAs consume much more power than their ASIC counterparts because they have a large
number of transistors per logic function in order to make the device programmable. An FPGA contains a
large number of configuration bits, both within each logic element and in the
programmable routing used to connect logic elements. This extra circuitry provides
flexibility but it affects both the static and dynamic power dissipated by the FPGA.
Tuan and Lai (2002) examined leakage in the Xilinx Spartan-3 FPGA, a 90 nm
commercial FPGA. Figure 2.4 (a) shows the breakdown of leakage in a Spartan-3 CLB,
which is similar to the Virtex-4 CLB. Leakage is dominated by that consumed in the
interconnect, configuration SRAM cells, and, to a lesser extent, LUTs. These three
structures combined account for 88% of total leakage.
A number of recent papers have considered the breakdown of dynamic power
consumption in FPGAs. Shang, Kaviani and Bathala (2002) studied the breakdown of
power consumption in the Xilinx Virtex-II commercial FPGA. The results are
summarized in Figure 2.4 (b). Interconnect, logic, clocking, and the I/Os were found to
account for 60%, 16%, 14%, and 10% of Virtex-II dynamic power, respectively. A
similar breakdown was observed by Poon, Yan and Wilton (2002). The FPGA power
breakdown differs from that of custom ASICs, in which the clock network is often a
major source of power dissipation.
Figure 2.4 (a) Leakage Power Breakdown; (b) Dynamic Power Breakdown
Some of the contributions of researchers in the area of reducing power consumption
in FPGA-based designs are summarized below:
2.4.2 Leakage and Static Power Reduction
Vendors such as Altera and Xilinx in their latest FPGA devices, incorporate various low-
power device-level technologies. Traditional FPGAs and ASICs used only two oxide
thicknesses (dual oxide): a thin oxide for core transistors and a thick oxide for I/O
transistors. Moving toward high-performance 90 nm FPGAs, Xilinx integrated circuit
(IC) designers started to adopt a third gate-oxide thickness (triple oxide), the midox, in
the transistors of the 90 nm Virtex™-4 FPGAs, which allows a substantial reduction in
overall leakage and static power compared to competing FPGAs. Subsequent Virtex-5
and later FPGAs continue to deploy triple-oxide technology at the 65 nm process node,
enabling leakage current about 38% lower than that of a comparable 65 nm device. At
the device level, Altera and Xilinx both
utilize triple gate oxide technology, which provides a choice of three different gate
thicknesses, to trade-off between performance and static power [Altera Handbook (2007),
Xilinx Handbook (2007)].
Calhoun, et al. proposed the creation of fine-grained “sleep regions”, making it possible
for unused LUTs and flip-flops of a logic block to be put to sleep independently.
Gayasen, et al. (2004) proposed a more coarse-grained sleep strategy which partitions the
FPGA into entire regions of logic blocks, such that each region can be put to sleep
independently. The authors restricted the placement of the implemented design to fall
within a minimal number of the pre-specified regions and presented the effect of the
placement restrictions on design performance.
Rahman, et al. (2004) addressed leakage in FPGA interconnects, applying well-known
leakage reduction techniques to interconnect multiplexers and proposing four
different techniques. In the first technique, extra configuration SRAM cells were introduced
to allow for multiple OFF transistors on unselected multiplexer paths. The intent was to
take advantage of the “stack effect". A second technique described the laying out of the
multiplexer in separate wells, allowing body-bias techniques to be used to raise the VTH
of multiplexer transistors that are not part of the selected signal path. As a third
technique, they proposed negatively biasing the gate terminals of OFF multiplexer
transistors. The negative gate bias leads to a significant drop in sub threshold leakage.
Finally, the authors proposed using dual-VTH techniques, wherein a subset of multiplexer
transistors are assigned high-VTH (slow/low leakage) and the remainder of transistors are
assigned low-VTH (fast/leaky). The dual-VTH idea impacts FPGA router complexity, as
the router must assign delay-critical signals to low-VTH multiplexer paths. Ciccarelli,
Lodi and Canegallo (2004) applied dual-VTH techniques to the routing switch buffers in
addition to the multiplexers.
Meng, Sherwood and Kastner (2006) proposed a CAD technique to reduce leakage power
dissipation in FPGA embedded memory bits by adding path traversal and location
assignment techniques in the embedded memory mapping. The authors assumed that all
the embedded memory cells can support the drowsy mode by having the ability to
connect to two supply voltages, VDDH and VDDL (a high and a low supply voltage,
respectively). The cell still retains the stored data while the memory bit is operating at
the low supply voltage, but the bit consumes less leakage power, as leakage power is
proportional to the supply voltage. This scheme is referred to as drowsy memory for
memory bits.
They also proposed three different modes: sleep mode, drowsy mode, and live mode. The
sleep mode is used for unused memory entries by shutting down the supply voltage from
the unused memory bits. In the study the authors showed that just by putting the unused
memory entries in the sleep mode (used-active), one can save an average of 36% of the
memory leakage power without utilizing any scheme for dynamically waking up (or
putting to sleep) the used memory entries. Moreover, in the embedded memories, on an
average about 75% of leakage power savings can be achieved just by using the minimum
number of memory entries and turning off the unused entries (min-entry). It is noticed
that the drowsy-long scheme offers an additional 10% leakage power savings over the
min-entry scheme. Moreover, the path-place algorithm on an average achieves about 95%
leakage power savings. It has been concluded that the two best memory layout techniques
are the min-entry and path place techniques. The min-entry scheme offers very good
leakage power savings in terms of both computational time and extra circuitry needed by
the FPGA since it only supports active and off modes. On the other hand, the path-place
scheme supports three memory modes: active, low leakage with data retention, and off
modes.
Kumar and Anis (2007) proposed two architectures, homogeneous and heterogeneous.
The homogeneous architecture uses sub-blocks of different VTH inside a cluster, while
the heterogeneous architecture interleaves two types of clusters, where one type is
composed of low-VTH logic cells and the other consists of low-
and high-VTH logic cells. The authors proposed a CAD framework that starts by assigning
the whole design to high VTH logic cells. Then the algorithm starts assigning the logic
cells into low VTH cells as long as the cell has positive slack and the new path slack does
not become negative. The algorithm clusters the logic cells into the clusters that
correspond to the architecture being used in the next stage. Finally, constrained
placement is used to place the clustered designs into the FPGA architecture. It was
noticed that both the homogeneous and heterogeneous architectures result in very close
leakage power savings with almost equal delay penalties. Lewis, et al. (2009) proposed the use of body biasing in FPGAs to slow down the cells on
non critical paths to achieve a reduction in the sub threshold leakage power. The authors
concluded that using a granularity equal to two clusters yields substantial leakage
power savings without incurring large penalties in either the delay or the area of the
FPGA.
2.4.3 Dynamic Power Reduction
As stated by Kusse and Rabaey (1999), George, Zhang and Rabaey (1999) and
George and Rabaey (2001), the first comprehensive effort to develop a low-energy FPGA
was by a group of researchers at UC Berkeley, where power reductions were achieved
through the following significant changes in the logic and routing fabrics:
- Larger, 5-input LUTs were used rather than 4-LUTs, allowing more connections
to be captured within LUTs instead of being routed through the power-dominant
interconnect.
- A new routing architecture was deployed, combining ideas from a 2-dimensional
mesh, nearest-neighbor interconnects, and an inverse clustering scheme.
- Specialized transmitter and receiver circuitry were incorporated into each logic
block, allowing low-swing signaling to be used.
- Double-edge-triggered flip-flops were used in the logic blocks, allowing the clock
frequency to be halved, and reducing clock power.
The main limitations of the work were:
- The proposed architecture represents a “point solution”, in that the effect of
the architectural changes on the area-efficiency, performance, and routability of
real circuits was not considered.
- The basis of the architecture is the Xilinx XC4000, which was introduced in the
late 1980s and differs considerably from current FPGAs.
- The focus was primarily on dynamic power; leakage was not a major
consideration.
Li, et al. (2003) considered power trade-offs at the architectural level, examining the
effect of routing architecture, LUT size, and cluster size (i.e. the number of LUTs in a
logic block) on FPGA power-efficiency. Using the metric of power-delay product, the
authors suggested that 4-input LUTs are the most power-efficient, and that logic blocks
should contain twelve 4-LUTs. In these studies, despite their focus on power, power-aware
CAD tools were not used in the architectural evaluation experiments. The
architectures evaluated in the UC Berkeley work are somewhat out of step with current
commercial FPGAs. Li, et al. (2003) suggested that a mix of buffered and un-buffered
bidirectional routing switches should be used but the modern commercial FPGAs no
longer use un-buffered routing switches; rather, they employ unidirectional buffered
switches.
Li, et al. (2004A) applied the dual-VDD concept to FPGAs and proposed a heterogeneous
architecture in which some logic blocks are fixed to operate at high VDD (high speed) and
some are fixed to operate at low-VDD (low-power, but slower). The power benefits of the
heterogeneous fabric were found to be minimal mainly due to the rigidity of the fixed
fabric and the performance penalty associated with mandatory use of low-VDD in certain
cases. Subsequently, the authors [Li, et al. (2004B)] extended their dual-VDD FPGA work
to allow logic blocks to operate at either high or low VDD; by using such
“configurable” dual-VDD schemes, power reductions of 9-14% (versus single-VDD
FPGAs) were reported. A limitation of the work by Li, et al. (2004A) and Li, et al. (2004B) is
that the dual-VDD concepts were applied only to logic and not to interconnect, where most
power is consumed and which was assumed to always operate at high VDD.
Gayasen, et al. (2004) overcame this limitation by applying dual-VDD to both logic and
interconnect. A dual-VDD FPGA presents a more complex problem to FPGA CAD tools.
CAD tools need to select specific LUTs to operate at each supply voltage, and then assign
these LUTs to logic blocks with the appropriate supply. Chen, et al. (2004) developed
algorithms for dual-VDD mapping and clustering to address these issues in conjunction
with the architecture work mentioned above.
According to Lee, et al. (2003), the following are three major strategies for reducing
FPGA power consumption:
- First, changes can be done at the system level (e.g. simplification of the
algorithms used).
- Secondly, if the architecture of the FPGA is already fixed, a designer may change the
logic partitioning, mapping, placement and routing.
- Finally, if no changes at all are possible, enhancing the operating conditions of the
device may still be promising (this includes changes in the capacitance, the supply
voltage, and the clock frequency).
The following basic techniques have been explored so far at the system level:
Kuon and Rose (2007) suggested using coarse-grained embedded blocks rather than
fine-grained configurable logic blocks in an FPGA, since the former are more power-efficient
than the latter for the same function. However, it must be ensured that the power
consumed in routing does not increase significantly when using coarse-grained
FPGAs.
Osborne, et al. (2008) used clock gating as a simple and effective method for reducing
dynamic power consumption. It reduces dynamic power by eliminating unnecessary
toggling on the outputs of a circuit’s flip-flops, on gates in the fan-out of those flip-flops,
and on clock signals. Clock gating prevents signal transitions by disabling the clock to
inactive regions. The circuitry in an operator is gated when not in use, and this can be
combined with word-length optimization.
Wilton, Ang and Luk (2004) found that, at a given clock speed, pipelining, a
simple and effective way of reducing glitching, can reduce the energy per
operation by between 40% and 90% for applications such as integer multiplication,
CORDIC, triple DES, and FIR filters.
Chow, et al. (2005) observed that power reductions between 4% and 54% can be achieved
for various arithmetic circuits by using dynamic voltage scaling to adapt the FPGA’s
supply voltage as the temperature changes.
Tessier, et al. (2007) described how power can also be minimized by optimizing the
mapping to the embedded memories and to the embedded DSP blocks, and proposed a
power-efficient RAM mapping algorithm for embedded memory blocks. In ISE, power is
minimized during placement and routing by minimizing the capacitance of high-activity
signals. Dynamic power dissipation is further minimized by strategically setting the
configuration bits within partially used LUTs to minimize switching activity.
A number of studies have investigated low-power FPGA architecture design:
George, Zhang and Rabaey (1999) described energy-efficient FPGA routing architectures
and low-swing signaling techniques to reduce power.
Sivaswamy, et al.(2005) proposed a new FPGA routing architecture that utilizes a
mixture of hardwired and traditional programmable switches. This reduces static and
dynamic power by reducing the number of configurable routing elements. As the
architecture and the circuit-level implementation of the FPGA directly affect the
efficiency of mapping applications to FPGA resources and the amount of circuitry needed
to implement those resources, these implementations are key to reducing power.
Kusse and Rabaey (1999) introduced energy-efficient modules for embedded
components in FPGAs to reduce power by optimizing the number of connections
between the module and the routing resources, and by using reduced-supply-voltage
circuit techniques. They presented a novel FPGA routing switch with high-speed, low-power,
and sleep modes; the switch reduces dynamic power for non-timing-critical logic
and standby power for logic when it is not being used. Anderson and Najm (2004)
reported up to 3.6 times lower energy than an ARM7 device, and up to 6 times lower
energy than a C55X DSP, by using several power reduction techniques, such as register
file elimination and efficient instruction fetch, proposed for a coarse-grain
reconfigurable cell-based architecture.
Lin, Li and He (2005) applied power gating to the switches in the routing resources to
reduce static power, and duplicated routing resources operating at either high or low
Vdd.
A recent study Lamoureux, Lemieux and Wilton (2008) suggests that glitching accounts
for 31% of dynamic power dissipation in FPGAs. Glitching occurs when values at the
inputs of a LUT toggle at different times due to uneven propagation delays of those
signals. Lamoureux and others proposed a method for minimizing glitching by adding
configurable delay elements to the inputs of each logic element in the FPGA.
Dynamic power is a result of signal transitions between logic-0 and logic-1. These
transitions can be split into two types: functional transitions and glitches. Functional
transitions are those which are necessary for the correct operation of the circuit. Glitches,
on the other hand, are transitions that arise from unbalanced delays to the inputs of a
logic gate, causing the gate’s output to transition briefly to an intermediate state.
Although glitches do not adversely affect the functionality of a synchronous circuit,
since they settle before the next clock edge, they have a significant effect on power
consumption.
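This distinction can be illustrated with a toy event-driven model of a single LUT (a simplified sketch, not any cited author’s tool): the output is re-evaluated at each input arrival time, so unbalanced arrivals produce spurious transitions that balanced arrivals avoid.

```python
# Toy glitch model for a single 2-input LUT implementing XOR. The output
# is re-evaluated whenever an input arrives; transitions beyond the final
# settled value are glitches caused by unbalanced arrival times.

def output_transitions(func, arrivals, old_vals, new_vals):
    """Count output transitions as inputs switch at their arrival times."""
    vals = list(old_vals)
    out_prev = func(*vals)
    transitions = 0
    for t in sorted(set(arrivals)):
        for i, at in enumerate(arrivals):
            if at == t:
                vals[i] = new_vals[i]  # input i switches at time t
        out = func(*vals)
        if out != out_prev:
            transitions += 1
        out_prev = out
    return transitions

xor = lambda a, b: a ^ b
# Balanced arrivals: both inputs flip together, so the XOR output never changes.
print(output_transitions(xor, [5, 5], [0, 0], [1, 1]))  # -> 0
# Unbalanced arrivals: a glitch pulse appears (two spurious transitions).
print(output_transitions(xor, [2, 7], [0, 0], [1, 1]))  # -> 2
```

The two spurious transitions in the unbalanced case charge and discharge the LUT’s output capacitance without doing useful work, which is exactly the glitch power the delay-alignment techniques below aim to eliminate.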
Lamoureux, Lemieux and Wilton (2008) described that spurious transitions can be
produced at the LUT output if the input arrival times are far enough apart, as shown in
Figure 2.5 (a). Detailed timing information is used to configure the delay elements after
place and route, so as to align the arrival times at the inputs of each logic element; this
eliminates glitches as long as the arrival times can be aligned closely enough, as shown in
Figure 2.5 (b).
The authors’ technique involves adding configurable delay elements to the inputs of each
logic element in the FPGA. The amount of glitch elimination depends on several factors,
such as the resolution, maximum delay, location, and number of the programmable delay
elements. On average, the proposed technique eliminates 87% of glitching, reducing
overall FPGA power by 17%, at the cost of a 6% increase in overall FPGA area and a
critical-path-delay increase of less than 1% due to the added circuitry.
Figure 2.5 (a) Circuit with Glitch; (b) Glitch Removed by Delaying the Input
Glitch reduction techniques can be applied at various stages in the CAD flow. Since
glitches are caused by unbalanced path delays to LUT inputs, it is natural to design
algorithms that attempt to balance the delays. Cheng, Chen and Wong (2007) chose the
mapping based on glitch-aware switching activities at the technology mapping stage,
whereas Dinh, Chen and Wong (2009) operated at the routing stage, in which the
faster-arriving inputs to a LUT are delayed by extending their paths through the
routing network. Delay balancing can also be done at the architectural level. However,
these approaches all incur an area or performance cost.
Some works use flip-flop insertion or pipelining to break up deep combinational logic
paths, which are the root of high glitch power. Wilton, Ang and Luk (2004) described that
circuits with higher degrees of pipelining tend to have lower glitch power because they
have fewer logic levels, thus reducing the opportunity for delay imbalance. Lim, et al.
(2005) proposed inserting flip-flops with shifted-phase clocks to block the propagation of
glitches. Tomasz, et al. (2007) used negative edge-triggered flip-flops in a similar
fashion, but without the extra cost of generating additional clock signals. Fischer, et al.
(2005) explored applying retiming to the circuit, moving flip-flops so as to block
glitches.
Shum and Anderson (2011) presented a glitch reduction optimization algorithm based on
don’t-cares, which sets the output values for the don’t-cares of logic functions in such a
way as to reduce the amount of glitching. The authors performed the process after
placement and routing, using timing simulation data to guide the algorithm. The algorithm achieved
an average total dynamic power reduction of 4.0%, with a peak reduction of 12.5%;
glitch power was reduced by up to 49.0%, and 13.7% on average.
Gupta, Anderson and Wang (2009) observed that dynamic power consumption is
expected to increase linearly with the clock frequency and the size of a design. It was
also observed that, as the clock frequency decreases, the effect of design size on power
consumption diminishes. They noted that as long as the device operates at low
frequencies, FPGA designs can be enlarged with a disproportionally small increase in
dynamic power; only at the highest frequencies does dynamic power change
proportionally to the design area.
The research work by Lamoureux, Lemieux and Wilton (2008) examines the trade-off
between the flexibility of FPGA clock networks and overall power consumption in the
following three parts:
- A parameterized framework for describing a wide range of FPGA clock
networks.
- A comparison of clock aware placement techniques to determine their
effectiveness: since clock networks impose hard constraints on the placement
of logic blocks within the FPGA, a good clock-aware placement algorithm
must obey these constraints and also optimize for speed, routability, and
power consumption.
- Several techniques for combining these objectives are evaluated, in terms of
their ability to find a placement that is fast, energy efficient, and legal.
Design metrics such as FPGA area utilization and power consumption are strongly
affected by HDL coding style. Dollas, et al. (2004), in a case study of rapid
prototyping of a hardware system, presented the effects of CAD tool capabilities, design
flows and design styles, and reported very interesting results by demonstrating how an
HDL behavioral approach can lead to more efficient implementations compared to
structural descriptions.
2.5 Conclusions
From the above literature survey it can be concluded that, in view of the importance
of the three parameters of area, delay and power in FPGA-based system designs, a lot
of work has been carried out by researchers, who have presented different design
techniques for optimizing these parameters. Some of the design techniques
proposed for optimizing area, delay/speed and power can be summarized as below:
The area can be optimized by using:
Versatile Place and Route Tool (VPR)
Switch Depopulation
Clustering Techniques based on careful matching of resource availability and design
complexity
Resource Sharing i.e. Time Division Multiplexing (TDM)
Proper Reset Strategy
The delay can be optimized by using:
Pipelining
Parallel Processing
Register Balancing
Improving Routing Architecture
Improving CAD tools
The power can be optimized by using:
Clock Gating
Asynchronous Design
Reducing Clock Speed
Finite State Machine Proper Encoding
Dynamic Voltage Scaling
Power Gating
Dual Vdd
Reducing Glitches
Area Minimization
The literature survey reveals that a lot of work has been done on various techniques
and methods to reduce one of the three parameters of area, delay and power, but
hardly any literature is available on an approach or methodology, applicable to any
FPGA-based designed system, that can reduce all three parameters to give the best
trade-off for a particular FPGA platform. The next chapter is devoted to the design of
an FPGA-based digital system for a comprehensive 32-bit Floating Point Arithmetic
Unit (FPAU) using VHDL. This is used as the base digital system design for further
developing a systematic approach, to be applied to this designed system, that takes
care of reducing all three parameters to give the best trade-off among these
parameters.