CHAPTER-II

LITERATURE REVIEW

Over the last decade, FPGAs have become one of the key digital circuit implementation

media. FPGAs are pre-fabricated silicon devices that can be electrically programmed to

become almost any kind of digital circuit or system. They provide a number of

advantages over fixed-function ASIC technologies. ASICs typically take months to

fabricate and cost thousands to millions of dollars to obtain the first device; FPGAs are

configured in less than a second, can often be reconfigured if required, and cost

anywhere from a few dollars to a few thousand dollars. However, the flexible nature

of an FPGA comes at a significant cost in area, delay, and power consumption.

This chapter presents a detailed literature survey and review of research papers,

technical papers and FPGA vendor manuals concerning FPGA-based designs in general and

the improvement of their overall performance in particular. The observations and

proposals of the contributors in this area for improving FPGA performance are

summarized along with concluding remarks.

2.1 Introduction

FPGA devices are programmable devices capable of implementing any digital logic

circuit. They offer a designer the flexibility of creating a wide array of logic circuits at a

low cost, because it is not necessary to manufacture a new custom made integrated circuit

each time. However, the FPGA devices are bigger and consume more power than their

ASIC counterparts [Kuan and Rose (2007)]. The main drawback of FPGAs is that they

are less efficient than ASICs due to the added circuitry needed to make them

reconfigurable. As a result FPGAs have been found to be a practical platform for medium

and low volume applications only. The area overhead, combined with research and

development costs, increases the per-unit cost of FPGAs, which makes them less suited

for high-volume applications. Moreover, the speed and power overhead precludes the use

of FPGAs for high-speed or low-power applications. In the more than 20 years since the

introduction of the FPGA, research and development have produced dramatic improvements

in FPGA speed and area efficiency, narrowing the gap between FPGAs and ASICs and

making FPGAs the platform of choice for implementing digital circuits.

A significant number of studies focus on faster and more area-efficient

programmable routing resources. Important advances have also been made in the

CAD tools that are used to map applications onto the programmable fabric of an

FPGA. The Versatile Place and Route (VPR) tool described by Betz, Rose and Marquardt

(1999) yields significant improvements in performance by improving on the existing

clustering, placement, and routing algorithms. Logic-to-memory mapping tools,

described by Cong and Xu (1998), Wilton (1998) and in the International Technology

Roadmap for Semiconductors (ITRS), show improvement in the area efficiency of

FPGAs with embedded memories wherein parts of the application are packed into unused

memories before mapping the rest of the application into logic elements. In recent years,

the main focus of research has been shifting towards lowering power consumption. Power

consumption is an important part of the equation that determines product size, weight and

efficiency. Unfortunately, the advantages of FPGAs are offset in many cases by their

high power consumption and area. Improved reliability, lower operating and cooling

costs, and the ever-growing demand for low-power portable communications and

computer systems are motivating new low-power techniques, especially for FPGAs, which

dissipate significantly more power than fixed-logic implementations. Indeed, the ITRS

has identified low-power design techniques as a critical technology need.

2.2 Literature Survey

The first modern-era FPGA was introduced by Xilinx in 1984 as stated by Carter, et.al,

(1986). It contained the classic array of Configurable Logic Blocks. From that first FPGA

which contained 64 logic blocks and 58 inputs and outputs, FPGAs have grown

enormously in complexity. Modern FPGAs can now contain approximately 330,000

equivalent logic blocks and around 1100 inputs and outputs [Altera Corporation

handbook (2006), Xilinx- Virtex-5 user guide (2006)] in addition to a large number of

more specialized blocks that have greatly expanded the capabilities of FPGAs. These

massive increases in capabilities have been accompanied by significant architectural

changes.

This section surveys the work done on FPGA architecture in the areas of

programming technology, logic block architecture, routing architecture and

input/output architecture.

2.2.1 Programming Technologies

Every FPGA relies on an underlying programming technology which is used to control

the programmable switches that give programmability to FPGAs. There are a number of

programming technologies and their differences have a significant effect on

programmable logic architecture. The approaches that have been used historically include

EPROM [Dentchkowsky (1971)], EEPROM [Cuppens, et.al, (1985)], flash [Guterman,

et.al, (1979)], static memory [Carter, et.al, (1986)], and anti-fuses [Birkner, et.al (1992)].

Of these approaches, only the flash, static memory and anti-fuse approaches are widely

used in modern FPGAs. These three programming technologies are reviewed below:

i) Static Memory Programming Technology

Static memory cells are the basis for SRAM programming technology that is widely used

and can be found in devices from Xilinx [Xilinx Virtex-4 family overview (2005)],

Lattice [Lattice SC family data sheet (2007)], and Altera [Altera Corporation handbook

(2006)]. This technology has become the dominant approach for FPGAs because of its

two primary advantages: re-programmability and the use of standard CMOS process

technology. From a practical point of view, an SRAM cell can be programmed an

indefinite number of times. The dedicated circuitry on the FPGA itself initializes all the

SRAM bits on power up and configures the bits with a user-supplied

configuration. Unlike other programming technologies, the use of SRAM cells requires no

special integrated circuit processing steps beyond standard CMOS. As a result, SRAM-

based FPGAs can use the latest CMOS technology available and, therefore, benefit from

the increased integration, the higher speeds and the lower dynamic power consumption of

new processes with smaller minimum geometries. However, there are a number of

drawbacks associated with SRAM-based programming technologies in respect of size,

volatility, security and electrical properties of the transistors.

ii) Flash/EEPROM Programming Technology

Some of the shortcomings of SRAM based technology have been addressed by the use of

floating gate programming technologies that inject charge onto a gate that “floats” above

the transistor. This approach is used in flash or EEPROM memory cells. These cells are

non-volatile; they do not lose information when the device is powered off. Flash-based

programming technology offers several unique advantages, the most important of which is

its non-volatility. This feature eliminates the need for the external resources required to

store and load configuration data when SRAM-based programming technology is used. A

flash-based device can also function immediately upon power-up without waiting for the

loading of configuration data. This approach is also more area efficient than SRAM-

based technology. But one disadvantage with flash-based devices is that they cannot be

reprogrammed an infinite number of times.

One trend that has emerged is the use of flash storage in combination with SRAM

programming technology [Leventis, et.al, (2004) and Lattice XP family data sheet (2005)].

Devices from Altera, Xilinx and Lattice use on-chip flash memory to provide non-volatile

storage, while SRAM cells are still used to control the programmable elements in

the design. This retains the unlimited reconfigurability of SRAM-based devices

while addressing the problems associated with the volatility of pure-SRAM approaches, such as

the cost of additional storage devices or the possibility of configuration data

interception.

iii) Anti-fuse Programming Technology

Anti-fuse programming technology [Birkner, et.al (1992)] is an alternative to SRAM and

floating gate-based technologies. This technology is based on structures which exhibit

very high resistance under normal circumstances but can be programmed, or “blown”

(in effect, connected), to create a low-resistance link. Unlike SRAM or floating gate

programming technologies, this link is permanent. The programmable element, an anti-

fuse, is directly used for transmitting FPGA signals. The primary advantage of anti-fuse

programming technology is its low area. With metal-to-metal anti-fuses, no silicon area is

required to make connections, which decreases the area overhead of

programmability. However, this decrease is slightly offset by the need for large

programming transistors that supply the large currents needed to program the anti-fuse.

Anti-fuses have the additional advantage of lower on-resistances and parasitic

capacitances than other programming technologies.

With the low area, resistance and capacitance of the fuses, it is possible to include more

switches per device as compared to other technologies. Non-volatility also means that the

device works instantly once it is programmed, which allows the FPGA to be

used in situations that require operation immediately upon power up. This lowers system

costs since additional memory for storing the programming information is not required.

There are also some significant disadvantages to this programming technology. In

particular, since anti-fuse-based FPGAs require a nonstandard CMOS process, they

typically lag well behind SRAM-based FPGAs in the manufacturing processes that they

can adopt. Furthermore, the fundamental mechanism of programming

involves significant changes to the properties of the materials in the fuse, which leads to

scaling challenges when new IC fabrication processes are considered.

Out of the three programming technologies reviewed in this section that are used in

modern devices, SRAM-based programming technology has become the most widely

used. An ideal technology would be non-volatile and reprogrammable using a standard

CMOS process and offer low on-resistances and low parasitic capacitances. It is also

clear that none of the technologies satisfies all these requirements. Use of the standard

CMOS manufacturing processes is one of the primary reasons that SRAM technology has

dominated and its dominance can be expected to continue for the foreseeable future of

CMOS technology.

2.2.2 Architecture of Configurable Logic Blocks

FPGAs consist of CLBs that implement logic functions, programmable routing to

interconnect these functions and I/O blocks for making off-chip connections. Although

many of the fundamental challenges and issues in FPGAs involve programmable routing

circuit design and architecture, the logic block architecture of an FPGA is also

extremely important because it has a dramatic effect on how much programmable routing

is required. A logic block in an FPGA provides the basic computation and storage

elements used in digital logic systems. A fine-grained logic block requires the use of

large amounts of programmable interconnect to create any typical logic function. As a

result, an FPGA built from such blocks suffers from area inefficiency, low performance and high

power consumption. At the other extreme, a logic block could be an entire processor.

This approach exists in the commercial space, although processors are mixed with some

more fine grained logic blocks in a device as described by Triscend Corporation (2001)

and Xilinx Data Sheet (2007). Such a logic block on its own would not have the

performance gains that come from customizable hardware. In between these extremes is a

spectrum of logic block choices ranging from fine to coarse-grain logic blocks. FPGA

architects over the last two decades have selected basic logic blocks made of

transistors [Marple and Cooke (1992)], NAND gates [Plassey Semiconductors data Sheet

(1989)], an interconnection of multiplexers [Gamal, et.al, (1989)], lookup tables [Carter,

et.al, (1986)], and PAL-style wide-input gates [Tsu, et.al, (1999)].

Research in this area has focused on the effect of logic block functionality on the

three key metrics: area, speed, and power. Wong, et.al, (1989) have given a more detailed

survey on the specifications of the logic blocks. Many modern FPGAs contain a

heterogeneous mixture of different blocks, some of which can only be used for very

specific functions, such as dedicated memory blocks or multipliers. These structures are

very efficient at implementing specific functions, but these blocks are

wasted if unused.

FPGA area-efficiency is one of the key metrics because the size of the FPGA die controls

a significant portion of its cost, particularly for devices with a large logic capacity. The

works of Rose, et.al, (1993) and Ahmad (2001) first explore the effect of lookup table

(LUT) size on area and speed performance. Figure 2.1 illustrates the basic trade-off for

area. Its X-axis represents the size of the lookup table (or K, the number of inputs to the

lookup table). For this architecture, a “cluster” size of 1 was used, which means that each

logic block contained exactly one LUT and flip–flop. Its left-hand Y-axis (dashed line)

represents the area of the logic block and its surrounding routing, while the right-hand

Y-axis (solid line) represents the geometric average of the number of K-input LUT/flip–flop

blocks needed to implement the 28 circuits used in the experiment. This experiment

illustrates that as the LUT size (K) increases, the number of LUTs required to implement

the circuits significantly decreases. However, the area cost of implementing the logic and

routing for each block increases significantly with K due to the following reasons:

(1) The number of programming bits in a K-input lookup table is 2^K, indicating an

exponential area increase with K, and

(2) The number of routing tracks surrounding the logic required for successful routing

increases as the number of pins connecting into the logic block increases as determined

by K.
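
Reason (1) can be made concrete with a small Python sketch. It models a K-input LUT as a table of 2^K configuration bits addressed by the input values. The class name `LUT` and its methods are illustrative assumptions and are not taken from any of the cited works.

```python
from itertools import product


class LUT:
    """Hypothetical model of a K-input lookup table (illustration only)."""

    def __init__(self, k, func):
        self.k = k
        # One configuration bit per input combination: 2**k bits in total.
        self.bits = [int(bool(func(*inputs))) for inputs in product((0, 1), repeat=k)]

    def evaluate(self, *inputs):
        # The inputs form the address of the configuration bit to read,
        # with the first input as the most significant address bit.
        address = 0
        for bit in inputs:
            address = (address << 1) | int(bool(bit))
        return self.bits[address]


# A 4-input LUT programmed as a 4-input AND gate.
and4 = LUT(4, lambda a, b, c, d: a and b and c and d)

# Number of configuration bits for LUT sizes K = 2..7.
config_bits_per_k = {k: 2 ** k for k in range(2, 8)}
```

Each additional input doubles the number of stored bits, which is the exponential growth referred to in reason (1).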

The curve in Figure 2.2 shows the total area that is obtained when the two

curves of Figure 2.1 are multiplied. This curve shows that, at first, a reduction in block

count reduces the total area, but then an increase in block size leads to an area increase as

the LUT size increases. This curve is typical of any area versus granularity experiment in

FPGA architecture.
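
The shape of this curve follows from a simple product. As a sketch in notation introduced here for illustration (the symbols are not taken from Rose, et.al, (1993)), let N(K) be the number of K-input logic blocks a circuit needs and A_block(K) the area of one block including its surrounding routing:

```latex
A_{\mathrm{total}}(K) = N(K) \cdot A_{\mathrm{block}}(K)
```

N(K) falls as K grows, while A_block(K) rises because it contains a term proportional to the 2^K configuration bits of the LUT. The total area is therefore smallest at the intermediate K where the fall in block count no longer compensates for the growth in area per block.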

Figure 2.1: Number of logic blocks and area/block vs. logic block functionality [Rose,

et.al, (1993)].

Figure 2.2: Total area of FPGA vs. LUT size [Rose, et.al, (1993)]

Rose, et.al, (1993) further state that fewer logic blocks are used on the critical

path of a given circuit as the functionality of the logic block increases. This results in

fewer logic levels and higher overall speed performance. A reduction in logic

levels reduces the required amount of inter-logic block routing, which contributes a

substantial portion of the overall delay. However, as the functionality of the logic block

increases, its internal delay also increases.

Total FPGA delay as a function of LUT size includes the routing delay for each level of

logic. Recent trends in commercial architectures have indeed moved toward larger LUT

sizes to capture these gains [Xilinx- Virtex-5 user guide (2006)] and the study by Ahmed

and Rose (2004) reveals that an increase in both LUT and cluster size decreases the critical

path delay monotonically with diminishing returns. There are significant returns from

increasing LUT size up to six and cluster size up to three or four.

As in all integrated circuits, power consumption in FPGAs is generally divided into two

categories: dynamic power and static power. Dynamic power is the power consumed by

the transitioning of signals on the device. Even in the absence of signal transitions, power

continues to be consumed and that power consumption is known as static or leakage

power. Results of the study by Lewis, et.al, (2005) suggest that the best logic block

architectures for area are also the best logic block architectures for power consumption.

Li et al. (2005) conclude that the best LUT and cluster sizes in terms of area efficiency

described by Rose, et.al, (1993) are also the best sizes for minimizing dynamic power

consumption. Cheng et al. (2007) showed how to optimize logic block architecture in

conjunction with dynamic and static power reduction techniques. They showed how

sleep transistors and threshold voltage settings can be used to achieve significant power

consumption reductions for a fixed, standard 4-LUT architecture.
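
The two categories of power defined above are commonly modeled with the standard first-order CMOS expressions below. They are given as a general reminder and are not taken from the cited studies. Here \alpha_i is the average number of power-consuming (0 to 1) transitions per clock cycle on node i, C_i is the capacitance of that node, V_DD is the supply voltage, f is the clock frequency and I_leak is the total leakage current:

```latex
P_{\mathrm{dynamic}} = \sum_{i} \alpha_i \, C_i \, V_{DD}^{2} \, f,
\qquad
P_{\mathrm{static}} = I_{\mathrm{leak}} \, V_{DD}
```

In this simplified view, programmable switches raise dynamic power by adding capacitance to every routed node, and the configuration memory and unused resources add to the leakage term.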

2.2.3 Routing Architecture

The programmable routing in an FPGA, which consists of wires and programmable switches,

provides connections among logic blocks and I/O blocks to complete a user-designed

circuit. Although the routing demand of logic circuits varies from design to design,

certain common characteristics of these designs exert a strong influence on the

architecture of FPGA routing. Moreover, a number of signals in these circuits, such as clocks

and resets, need to be widely distributed across the FPGA. All modern

FPGAs contain dedicated interconnect networks to handle the distribution of these

signals.

FPGA global routing architectures can be characterized as either hierarchical [Cheng et

al. (2007)] or island-style [Aggarwal and Lewis (1994)]. Hierarchical routing

architectures separate FPGA logic blocks into distinct groups [Cheng et al. (2007), Betz,

Rose and Marquardt (1999)]. The connections between logic blocks within a group can

be made using wire segments at the lowest level of the routing hierarchy and connections

between logic blocks in distant groups require the traversal of one or more levels of the

hierarchy of routing segments. In island-style FPGAs, by contrast, logic blocks are arranged

in a two-dimensional mesh with routing resources evenly distributed throughout the

mesh. An island-style global routing architecture typically has routing channels on all

four sides of the logic blocks. Most commercially available SRAM-based FPGA

architectures, as per the Altera Corporation handbook (2006) and Xilinx- Virtex-5 user guide

(2006), use island-style architectures.

Several researchers have attempted to determine FPGA segmentation by routing a series

of designs and examining wire lengths. Brown et al. [1996] used global routing followed

by detailed routing to complete the FPGA design. Although this study questioned the

need for segment lengths of greater than length 2 or 3, the two-step router increased the

difficulty of wire sharing and limited the use of longer segments.

Betz et al. [1999] used a contemporary FPGA router which combines global and detailed

routing into one step to evaluate segmentation [Betz and Rose (1999A)]. This study

verified the importance of including significant medium length segments which span

between 4 and 6 logic blocks in an island-style routing architecture. As described by

Lewis, et.al (2003), this finding was validated during the development of the Stratix

architecture, which contains significant length 4 and length 8 segments.

As stated by Brown, Khellah and Vranesic (1996), Lemieux and Lewis (2002) and Sheng

and Rose (2001), many FPGA architectures have been developed that use pass transistors

and tri-state buffers as routing switches and numerous commercial FPGAs allow for

direct connections between logic blocks to avoid the need to drive the interconnect fabric.

The work by Roopchansingh and Rose (2002) shows that these connections, which avoid

delays in traversing connection blocks and switch blocks for very near neighbor

connections, can improve speed by 6.4% at a small area cost of 3.8%.

In modern IC fabrication technologies, the proximity of two routing tracks gives rise to a

capacitive effect known as crosstalk. This effect can be reduced by spacing wires

farther apart, which reduces the capacitance on the wire and increases speed. Several

researchers have attempted to improve the performance of interconnect wires through

increased wire spacing. The work by Betz and Rose (1999B) determines that a 13%

circuit speedup could be achieved by using 5 times minimum wire spacing on 20% of the

routing tracks in each island-style channel. Increased track spacing was implemented in a

commercial architecture by Hutton, et.al, (2002) which assigns 20% of routing wires to

these fast routing resources.

Another circuit-level technique to improve performance involves the use of routing

multiplexers that contain fast paths. The number of pass transistors that must be traversed on

different paths through the multiplexer is imbalanced, leading to fast paths for critical inputs

and slower paths for regular inputs. This technique has been integrated into the routing

architecture for Altera Stratix II devices, as reported in the work by Lewis et al. (2005).

Like the spacing approach, critical paths are assigned to fast routing resources by the

FPGA router. It was found that the availability of imbalanced multiplexers improved

design performance by 3% without impacting device area.

Although recent FPGA system clock speeds approach 200–400 MHz, they still lag far

behind those of counterparts such as microprocessors. Moreover, a specific microprocessor

operates at the same frequency for each application whereas FPGA operating frequencies

vary from application to application. In general, the long and variable interconnect delays

associated with FPGA routing are responsible for both of these issues. The research work

by Singh and Brown (2001) and by Weaver, Hauser and Wawrzynek (2004) has

examined adding pipeline registers to FPGA interconnect to address these concerns. On

the one hand, these registers allow for enhanced raw clock rates; on the other hand, they

complicate the FPGA routing problem since the number of flip–flops on paths which

converge on a logic block must be matched to allow for causal behavior.
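
The matching requirement can be illustrated with a short Python sketch. It assumes that each signal arriving at a logic block has already passed through a known number of interconnect flip–flops, and it computes how many extra delay flip–flops each input needs so that all inputs arrive in the same clock cycle. The function name and the input format are hypothetical and do not reproduce the algorithm of either cited work.

```python
def balance_input_registers(path_register_counts):
    """Return the extra flip-flops each input needs to match the deepest path.

    path_register_counts maps an input name to the number of interconnect
    flip-flops already on the routed path that drives it.
    """
    if not path_register_counts:
        return {}
    deepest = max(path_register_counts.values())
    return {name: deepest - count for name, count in path_register_counts.items()}


# Three signals converge on one logic block after 1, 3 and 2 pipeline stages.
padding = balance_input_registers({"a": 1, "b": 3, "c": 2})
# padding == {"a": 2, "b": 0, "c": 1}
```

The router must either find paths with equal register counts or rely on padding of this kind at the logic block inputs, which is what makes pipelined routing harder than conventional routing.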

Brown, Khellah and Vranesic (1996) added flip–flops to all interconnect switches and

logic block inputs and outputs for a routing network organized in the hierarchical

topology. This approach of pipelining segment-to-segment connections and logic block

I/O allows all designs mapped to the FPGA to run at the same system clock frequency.

To account for routing paths which traverse different counts of interconnect flip–flops, an

adjustable value of up to seven flip–flops is allocated per logic block input and the

inclusion of the routing flip–flops leads to a 50% increase in overall routing area.

2.2.4 Input/output Architecture

The I/O pad and surrounding supporting logic and circuitry are referred to as an

input/output cell. These cells are also important components of an FPGA for two reasons:

a) this interface sets the rate for external communication; b) these cells along with their

supporting peripherals consume a significant portion of an FPGA’s area. For example, in

the Altera Stratix 1S20 and the Altera Cyclone 1C20, I/O’s and peripheral circuitry

occupy 43% and 30% of the total silicon area, respectively [Leventis, et.al, (2003)]. The

major challenge in input/output architecture design is the great diversity in input/output

standards; for example, different standards may require different input voltage thresholds

and output voltage levels. To support these differences, different I/O supply voltages are

often needed for each standard. They may also require a reference voltage to compare

against the input voltages.

Most modern FPGAs have adopted an I/O banking scheme in which input/output cells

are grouped into predefined banks [Lattice ECP/EC family data sheet (2001)]. Each bank

shares common supply and reference voltages. A single bank therefore cannot support all

the standards simultaneously, but different banks can have different supplies to support

otherwise incompatible standards. In some FPGA families, the number of I/Os per bank

is relatively constant for all device sizes, at 64 pins per bank [Xilinx Virtex-4 family

overview (2005)] or 40 pins per bank [Xilinx Virtex-5 user guide (2006)]. Some FPGA

families, at the other extreme, adopt a fixed number of banks across all the devices of the

FPGA family [Lattice SC family data sheet (2007)]. This latter approach means that the

number of pins per bank will be significantly larger for the largest members of a device

family. This can be very restrictive when using these large devices. In Altera Corporation

handbook (2006), a hybrid approach of having a variable number of banks with a variable

number of pins per bank has been proposed. Devices with more I/O pins have more

banks, but the number of pins per bank is allowed to increase as well. Besides bank

sizing, it is necessary to determine whether independent banks will be functionally

equivalent. Each bank could independently support every I/O standard supported by the

device.
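
The banking constraint can be sketched as a simple consistency check in Python. The sketch assumes a simplified rule in which every pin in a bank must request the same supply voltage. The table of standards, the bank names and the function `conflicting_banks` are illustrative assumptions and are not drawn from any vendor tool.

```python
# Supply voltage (in volts) required by a few single-ended standards.
# The LVCMOS names encode their nominal supply; this table is illustrative.
REQUIRED_SUPPLY = {"LVCMOS33": 3.3, "LVCMOS25": 2.5, "LVCMOS18": 1.8}


def conflicting_banks(bank_assignments):
    """Return the banks whose pins request more than one supply voltage."""
    conflicts = {}
    for bank, standards in bank_assignments.items():
        supplies = {REQUIRED_SUPPLY[s] for s in standards}
        if len(supplies) > 1:
            conflicts[bank] = sorted(supplies)
    return conflicts


# Hypothetical pin assignment: bank1 mixes 2.5 V and 1.8 V standards.
assignments = {
    "bank0": ["LVCMOS33", "LVCMOS33"],
    "bank1": ["LVCMOS25", "LVCMOS18"],
}
result = conflicting_banks(assignments)
# result == {"bank1": [1.8, 2.5]}
```

With few, large banks such conflicts become more likely, which is why a fixed number of banks can be restrictive on the largest devices of a family.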

Kuan and Rose (2007) stated that FPGAs are approximately 3 times slower, 20 times

larger, and 12 times less power efficient than ASICs because their programmable

switches controlled by configuration memory occupy a large area and add a significant

amount of parasitic capacitance and resistance to the logic and routing resources. Since

the introduction of FPGA, research and development has produced dramatic

improvements in FPGA speed and area efficiency, narrowing the gap between FPGAs

and ASICs and making FPGAs the platform of choice for implementing digital circuits.

At present, FPGAs hold significant promise as a fast-to-market replacement for

ASICs in many applications. As shown in Figure 2.3, there are three performance

parameters for a given circuit designed using an FPGA architecture and following a particular

FPGA CAD flow: area, speed and power.

Figure 2.3: Performance Parameters for FPGA Design

The following sections summarize prior work by some of the contributors on reducing these

three main performance indicators: area, delay and power.

2.3 Prior work related to reduction of Area and Delay in FPGAs

Researchers have focused on faster and more area-efficient programmable routing resources.

As already mentioned above, the VPR tool described by Betz, Rose

and Marquardt (1999) gives significant improvements in performance by improving on

the existing clustering, placement and routing algorithms. Logic-to-memory mapping

tools, described by Cong and Xu (1998) and Wilton (1998), show improvement in the area

efficiency of FPGAs with embedded memories wherein parts of the application are

packed into unused memories before mapping the rest of the application into logic

elements. The contributions of some of the researchers in this area are

summarized below:

The study by Lemieux and Lewis (2001) determined that at least half of the connections

between cluster inputs and logic element inputs can be removed and between 50% and

75% of the feedback connections from logic element outputs to logic element inputs can

be removed with no impact on delay or the number of logic clusters required. This switch

depopulation results in about a 10% area reduction for FPGAs with cluster sizes similar

to commercial offerings.

As described in Xilinx Inc Synthesis and Simulation Design User Guide (2008), resource

sharing is an optimization technique that uses a single functional block to implement

several operators in the HDL code; it is also known as Time Division Multiplexing

(TDM). Resource sharing reduces the device area of the design. However, it adds

logic levels to multiplex the inputs so that more than one function can be implemented, and

it is therefore not recommended for arithmetic functions that are part of the design’s

time-critical path.
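
The trade-off can be sketched with a behavioral model in Python. It is only an illustration of the idea, not HDL and not the Xilinx implementation: the two functions compute the same result, but the first corresponds to two adders followed by one multiplexer, while the second corresponds to two multiplexers followed by one shared adder. The function names are hypothetical.

```python
def unshared(sel, a, b, c, d):
    # Two adders compute both sums; a multiplexer then picks one result.
    sum_ab = a + b
    sum_cd = c + d
    return sum_ab if sel else sum_cd


def shared(sel, a, b, c, d):
    # Two multiplexers select the operands first, so one adder is enough.
    # The extra multiplexer level sits in front of the adder, which is why
    # sharing can lengthen a time-critical path.
    left = a if sel else c
    right = b if sel else d
    return left + right


# Both descriptions are functionally equivalent for every select value.
equivalent = all(
    unshared(sel, a, b, c, d) == shared(sel, a, b, c, d)
    for sel in (0, 1)
    for a in range(4)
    for b in range(4)
    for c in range(4)
    for d in range(4)
)
```

The shared form saves one adder at the cost of additional multiplexing on the adder inputs, which matches the area and delay trade-off described above.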

Garrault and Phiofsky (2006) also suggested describing designs behaviorally as

much as possible. The use of a reset and the type of reset can have serious implications for the

design performance of an FPGA. An improper reset strategy can create an unnecessarily

large design. A sub-optimal reset strategy can prevent:

(1) The use of a device library component, such as shift register look-up tables (SRL)

(2) The use of synchronous elements of dedicated hardware blocks

(3) Optimization of logic inside the fabric

If a reset is used, the function is implemented with generic logic resources and occupies

more area. Similarly, an asynchronous reset should be avoided because it prevents the packing of

additional registers into dedicated resources. For an area-optimal design, it is recommended

to avoid set and reset whenever possible.

Hu, et al, (2008) proposed a new resynthesis algorithm for FPGA area reduction. In

contrast to existing resynthesis techniques, which consider only single-output Boolean

functions and the combinational portion of a circuit, they considered multioutput

functions and retiming, and developed effective algorithms that incorporate recent

improvements to SAT-based Boolean matching. They showed that, with the optimal logic

depth, the resynthesis considering multioutput functions reduces area by up to 0.4 %

compared to the one considering single-output functions, and the sequential resynthesis

reduces area by up to 10 % compared to combinational resynthesis when both consider

multi-output functions.

Kobata, et al, (2007) proposed a clustering technique for a cluster-based FPGA to

optimize routability of outer cluster nets. In order to reduce the routing resources used in

FPGA, this technique uses two evaluation functions. One evaluation function reduces the

routing resources in the outer cluster. The second evaluation function utilizes various

characteristics of the local routing resources in the inner cluster. The clustering technique

proposed by them has the unique ability to optimize routing resources concurrently.

Amit Singh, et al, (2002) utilized Rent’s rule as an empirical measure for efficient

clustering and placement of circuits in clustered FPGAs. They have shown that careful

matching of resource availability and design complexity during the clustering and

placement processes can contribute to spatial uniformity in the placed design, leading to

overall device decongestion after routing. They presented experimental results to show

that appropriate logic depopulation during clustering can have a positive impact on the

overall FPGA device area. They claim that the clustering and placement techniques

proposed by them can improve the overall device routing area by as much as 62%, 35%

on average, for the same array size, when compared to state-of-the-art FPGA clustering,

placement, and routing tools.
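
For reference, Rent's rule in its usual form relates the number of external terminals T of a group of logic to the number of blocks B that it contains. The symbols below follow the common textbook statement and are not taken from Singh, et al, (2002):

```latex
T = t \cdot B^{p}
```

Here t is the average number of terminals per block and p is the Rent exponent. A larger p indicates a design with more demanding interconnect, so comparing the exponent of a design with the routing capacity of the device gives an empirical measure of how densely logic can be packed.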

Muhammad Khellah, et al. (1994) stated that speed can be improved by enhancing

the interconnects in FPGAs. They studied both the routing architecture of the

chip and the CAD tools used to route circuits, and concluded

that the length of interconnect dramatically affects speed performance, that it is crucial to limit

the number of programmable switches that a signal passes through in series, that the impact of

decisions made by CAD routing tools is very significant, and that the CAD tool should consider

both speed performance and area utilization rather than focus on a single goal.

Yuzo, et al. (2010) stated that in new process technologies interconnections dominate the

delays in FPGAs because of increased RC delay, and proposed a novel routing structure based on a

small-world network for FPGA interconnections. This network leads to short

distances between nodes and high connectivity between neighbors. The proposed routing

structure has a few random wires that connect distant blocks and act as shortcuts. The

results of an evaluation indicate that the proposed routing structure optimizes the critical

path delay.
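The effect of a few shortcut wires can be illustrated on a toy graph. The sketch below is a simplification and not the routing structure of the cited work: it assumes a one-dimensional ring of blocks with nearest-neighbor wires only, adds a handful of random long wires, and measures the average hop count by breadth-first search. All names are hypothetical.

```python
from collections import deque
import random


def ring_graph(n):
    """Adjacency sets for n nodes, each wired only to its two nearest neighbours."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


def add_shortcuts(graph, count, seed=0):
    """Return a copy of the graph with `count` random wires between non-adjacent nodes."""
    rng = random.Random(seed)
    g = {node: set(neigh) for node, neigh in graph.items()}
    nodes = list(g)
    added = 0
    while added < count:
        a, b = rng.sample(nodes, 2)
        if b not in g[a]:
            g[a].add(b)
            g[b].add(a)
            added += 1
    return g


def average_hops(graph):
    """Average shortest-path length over all ordered node pairs (BFS from every node)."""
    total = pairs = 0
    for src in graph:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs
```

Comparing `average_hops(ring_graph(64))` with `average_hops(add_shortcuts(ring_graph(64), 4))` shows the shortcut wires lowering the average distance while the nearest-neighbor connectivity is left intact.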

The research projects carried out by Singh and Brown (2001) and by Weaver, Hauser and

Wawrzynek (2004) examined the effect of adding pipeline registers to FPGA

interconnect in order to increase the maximum clock operating

frequency of FPGAs. On the one hand these registers allow higher raw clock rates, but

on the other hand they complicate the FPGA routing problem, since the number of flip-flops

on paths which converge on a logic block must be matched to allow for causal

behavior.

The work by Roopchansingh and Rose (2002) shows that direct connections between

logic blocks, which avoid the delay of traversing connection blocks and switch blocks for very near

neighbor connections, can improve speed by 6.4% at a small area cost of 3.8%.

2.4 Prior Work Related to Reduction in Power Consumption

The ever-growing demand for low-power portable communications and computer

systems is motivating new low power techniques, especially for FPGAs, which dissipate

significantly more power than fixed-logic implementations. Indeed, the ITRS has

identified low-power design techniques as a critical technology need.

2.4.1 Types of Power Consumption

Like all integrated circuits, FPGAs dissipate two types of power: static and

dynamic. Static power is consumed due to transistor leakage and is dissipated

when current leaks from the power supply to ground through transistors that are in the

“off-state”. Three types of leakage contribute: sub-threshold leakage (from source to drain),

gate-induced drain leakage, and gate direct-tunneling leakage. Dynamic power is

consumed mainly by toggling nodes as a function of voltage, frequency, and capacitance;

it is dissipated when capacitances are charged and discharged during switching events

in the core or I/O of the FPGA. As

described by Shang, Kaviani and Bathala (2002), the dynamic power consumption is

generally modeled as below:

$$P = \sum_{i} C_i \cdot V_i^2 \cdot f_i$$

where C, V and f represent the capacitance, the voltage swing, and the clock frequency of the

resource i, respectively. The total dynamic power consumed by a device is the summation

of the dynamic power of each resource. Because of programmability of FPGA the

dynamic power is design-dependent and the factors that contribute to the dynamic power

are: the effective capacitance of resources, the resources utilization, and the switching

activity of resources [Shang, Kaviani and Bathala (2002), Degalahal and Taun (2005)].

The effective capacitance corresponds to the sum of parasitic effects due to

interconnection wires and transistors. Since FPGA architecture usually provides more

resources than required to implement a particular design, some resources are not used

after chip configuration and they do not consume dynamic power (this is referred to

as resource utilization). Switching activity represents the average number of signal

transitions in a clock cycle. Though generally it depends on the clock itself, it may also

depend on other factors (e.g. temporal patterns of input signals). Hence, the above

equation as stated by Shang, Kaviani and Bathala (2002) can be rewritten as:

$$P = V^2 \cdot f \cdot \sum_{i} C_i \cdot U_i \cdot S_i$$

where V is the supply voltage, f is the clock frequency, and C, U, and S are the

effective capacitance, the utilization, and the switching activity of each resource,

respectively.
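The rewritten model can be evaluated directly once per-resource figures are available. The following is a minimal sketch of that calculation; the `Resource` record, its field names and any numbers passed to it are hypothetical and serve only to show how the terms combine.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Resource:
    capacitance: float  # effective capacitance C_i in farads
    utilization: float  # U_i, fraction of this resource that is in use (0..1)
    activity: float     # S_i, average signal transitions per clock cycle


def dynamic_power(vdd: float, freq: float, resources: Iterable[Resource]) -> float:
    """Dynamic power P = V^2 * f * sum(C_i * U_i * S_i), in watts."""
    switched = sum(r.capacitance * r.utilization * r.activity for r in resources)
    return vdd ** 2 * freq * switched
```

Because the supply voltage enters quadratically, halving V in this model cuts the estimate to one quarter, whereas an unused resource (U = 0) contributes nothing regardless of its capacitance.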

FPGAs consume much more power than their ASIC counterparts because they require a large

number of transistors per logic function in order to make the device programmable. An FPGA contains a

large number of configuration bits, both within each logic element and in the

programmable routing used to connect logic elements. This extra circuitry provides

flexibility but it affects both the static and dynamic power dissipated by the FPGA.

Tuan and Lai (2002) examined leakage in the Xilinx Spartan-3 FPGA, a 90 nm

commercial FPGA. Figure 2.4 (a) shows the breakdown of leakage in a Spartan-3 CLB,

which is similar to the Virtex-4 CLB. Leakage is dominated by that consumed in the

interconnect, the configuration SRAM cells and, to a lesser extent, the LUTs. Together, these

three structures account for 88% of total leakage.

A number of recent papers have considered the breakdown of dynamic power

consumption in FPGAs. Shang, Kaviani and Bathala (2002) studied the breakdown of

power consumption in the Xilinx Virtex-II commercial FPGA. The results are

summarized in Figure 2.4 (b). Interconnect, logic, clocking, and the I/Os were found to

account for 60%, 16%, 14%, and 10% of Virtex-II dynamic power, respectively. A

similar breakdown was observed by Poon, Yan and Wilton (2002). The FPGA power

breakdown differs from that of custom ASICs, in which the clock network is often a

major source of power dissipation.

Figure 2.4 (a) Leakage Power Breakdown

Figure 2.4 (b) Dynamic Power Breakdown

Some of the contributions of researchers to reducing power

consumption in FPGA-based designs are summarized below:

2.4.2 Leakage and Static Power Reduction

Vendors such as Altera and Xilinx incorporate various low-power device-level technologies in

their latest FPGA devices. Traditional FPGAs and ASICs used only two oxide

thicknesses (dual oxide): a thin oxide for core transistors and a thick oxide for I/O

transistors. Moving toward high-performance 90 nm FPGAs, Xilinx integrated circuit

(IC) designers adopted a third gate-oxide thickness (triple oxide), a medium-thickness

oxide (midox), in the transistors of the 90 nm Virtex™-4 FPGAs, which allows a substantial

reduction in overall leakage and static power compared to competing FPGAs.

Virtex-5 and later FPGAs continue to deploy the triple-oxide

technology at the 65 nm process node to achieve a significantly lower leakage current,

about 38% lower than that for a 65 nm device. At the device level, Altera and Xilinx both

utilize triple gate oxide technology, which provides a choice of three different gate oxide

thicknesses, to trade off between performance and static power [Altera Handbook (2007),

Xilinx Handbook (2007)].

Calhoun, et al. proposed the creation of fine-grained “sleep regions”, making it possible

for unused LUTs and flip-flops of a logic block to be put to sleep independently.

Gayasen, et al. (2004) proposed a more coarse-grained sleep strategy which partitions

the FPGA into entire regions of logic blocks, such that each region can be put to sleep

independently. The authors restricted the placement of the implemented design to fall

within a minimal number of the pre-specified regions and presented the effect of the

placement restrictions on design performance.

Rahman, et al. (2004) addressed leakage in FPGA interconnects by applying well-known

leakage reduction techniques to interconnect multiplexers, and proposed four

different techniques. In the first technique, extra configuration SRAM cells were introduced

to allow for multiple OFF transistors on unselected multiplexer paths. The intent was to

take advantage of the “stack effect”. The second technique laid out the

multiplexer in separate wells, allowing body-bias techniques to be used to raise the VTH

of multiplexer transistors that are not part of the selected signal path. As a third

technique, they proposed negatively biasing the gate terminals of OFF multiplexer

transistors. The negative gate bias leads to a significant drop in subthreshold leakage.

Finally, the authors proposed using dual-VTH techniques, wherein a subset of multiplexer

transistors are assigned high-VTH (slow/low leakage) and the remainder of the transistors are

assigned low-VTH (fast/leaky). The dual-VTH idea impacts FPGA router complexity, as

the router must assign delay-critical signals to low-VTH multiplexer paths. Ciccarelli,

Lodi and Canegallo (2004) applied dual-VTH techniques to the routing switch buffers in

addition to the multiplexers.

Meng, Sherwood and Kastner (2006) proposed a CAD technique to reduce leakage power

dissipation in FPGA embedded memory bits by adding path traversal and location

assignment techniques in the embedded memory mapping. The authors assumed that all

the embedded memory cells can support the drowsy mode by having the ability to

connect to two supply voltages VDDH and VDDL, a high and low supply voltage

respectively. The cell still retains the stored data even while the memory bit is operating at

the low supply voltage, but the bit consumes less leakage power, as leakage power is

proportional to the supply voltage. This scheme is referred to as drowsy memory.

They also proposed three different modes: sleep mode, drowsy mode, and live mode. The

sleep mode is used for unused memory entries by shutting down the supply voltage from

the unused memory bits. In the study the authors showed that just by putting the unused

memory entries in the sleep mode (used-active), one can save an average of 36% of the

memory leakage power without utilizing any scheme for dynamically waking up (or

putting to sleep) the used memory entries. Moreover, in the embedded memories, on

average about 75% of leakage power savings can be achieved just by using the minimum

number of memory entries and turning off the unused entries (min-entry). It is noticed

that the drowsy-long scheme offers an additional 10% leakage power savings over the

min-entry scheme. Moreover, the path-place algorithm on an average achieves about 95%

leakage power savings. It has been concluded that the two best memory layout techniques

are the min-entry and path-place techniques. The min-entry scheme offers very good

leakage power savings in terms of both computational time and extra circuitry needed by

the FPGA since it only supports active and off modes. On the other hand, the path-place

scheme supports three memory modes: active, low leakage with data retention, and off

modes.

Kumar and Anis (2007) proposed two architectures: homogeneous and heterogeneous.

The homogeneous architecture uses sub-blocks of different VTH inside each cluster,

while the heterogeneous architecture interleaves two types of clusters,

where one type of cluster is composed of low-VTH logic cells and the other consists of both

low-VTH and high-VTH logic cells. The authors proposed a CAD framework that starts by assigning

the whole design to high-VTH logic cells. The algorithm then reassigns logic

cells to low-VTH cells as long as the cell has positive slack and the new path slack does

not become negative. In the next stage, the algorithm clusters the logic cells into the clusters that

correspond to the architecture being used. Finally, constrained

placement is used to place the clustered designs into the FPGA architecture. It was

noticed that the homogeneous and heterogeneous architectures result in very similar

leakage power savings with almost equal delay penalties.

Lewis, et al. (2009) proposed the use of body biasing in FPGAs to slow down the cells on

non-critical paths to achieve a reduction in subthreshold leakage power. The authors

concluded that using a granularity equal to two clusters yields a

sufficient amount of leakage power savings without incurring large penalties in either the

delay or the area of the FPGA.

2.4.3 Dynamic Power Reduction

As stated by Kusse and Rabaey (1999), George, Zhang and Rabaey (1999) and

George and Rabaey (2001), the first comprehensive effort to develop a low-energy FPGA

was made by a group of researchers at UC Berkeley, and power reductions were achieved

through the following significant changes in the logic and routing fabrics:

- Larger, 5-input LUTs were used rather than 4-LUTs, allowing more connections

to be captured within LUTs instead of being routed through the power-dominant

interconnect.

- A new routing architecture was deployed, combining ideas from a 2-dimensional

mesh, nearest-neighbor interconnects, and an inverse clustering scheme.

- Specialized transmitter and receiver circuitry were incorporated into each logic

block, allowing low-swing signaling to be used.

- Double-edge-triggered flip-flops were used in the logic blocks, allowing the clock

frequency to be halved, and reducing clock power.

The main limitations of the work were:

- The proposed architecture represents a “point solution”, in that the effect of

the architectural changes on the area-efficiency, performance, and routability of

real circuits was not considered.

- The basis of the architecture is the Xilinx XC4000, which was introduced in the

late 1980s and differs considerably from current FPGAs.

- The focus was primarily on dynamic power, and leakage was not a major

consideration.

Li, et al. (2003) considered power trade-offs at the architectural level, examining the

effect of routing architecture, LUT size, and cluster size (i.e. the number of LUTs in a

logic block) on FPGA power-efficiency. Using the metric of power-delay product, the

authors suggested that 4-input LUTs are the most power-efficient, and that logic blocks

should contain twelve 4-LUTs. In these studies, despite their focus on power, power-

aware CAD tools were not used in the architectural evaluation experiments. The

architectures evaluated in the UC Berkeley work are somewhat out-of-step with current

commercial FPGAs. Li, et al. (2003) suggested that a mix of buffered and un-buffered

bidirectional routing switches should be used but the modern commercial FPGAs no

longer use un-buffered routing switches; rather, they employ unidirectional buffered

switches.

Li, et al. (2004A) applied the dual-VDD concept to FPGAs and proposed a heterogeneous

architecture in which some logic blocks are fixed to operate at high-VDD (high speed) and

some are fixed to operate at low-VDD (low-power, but slower). The power benefits of the

heterogeneous fabric were found to be minimal, mainly due to the rigidity of the fixed

fabric and the performance penalty associated with mandatory use of low-VDD in certain

cases. Subsequently, Li, et al. (2004B) extended their dual-VDD FPGA work to

allow logic blocks to operate at either high-VDD or low-VDD; by using such

“configurable” dual-VDD schemes, power reductions of 9-14% (versus single-VDD

FPGAs) were reported. A limitation of the work by Li, et al. (2004A) and Li, et al. (2004B) is

that the dual-VDD concepts were applied only to logic and not to the interconnect, where most

power is consumed; the interconnect was assumed to always operate at high-VDD.

Gayasen, et al. (2004) overcame this limitation by applying dual-VDD to both logic and

interconnect. A dual-VDD FPGA presents a more complex problem to FPGA CAD tools.

CAD tools need to select specific LUTs to operate at each supply voltage, and then assign

these LUTs to logic blocks with the appropriate supply. Chen, et al. (2004) developed

algorithms for dual-VDD mapping and clustering to address these issues in conjunction

with the architecture work mentioned above.

According to Lee, et al. (2003), the following are three major strategies for reducing FPGA power

consumption:

- First, changes can be made at the system level (e.g. simplification of the

algorithms used).

- Secondly, if the architecture of the FPGA is already fixed, a designer may change the

logic partitioning, mapping, placement and routing.

- Finally, if no changes at all are possible, enhancing the operating conditions of the

device may still be promising (this includes changes in the capacitance, the supply

voltage, and the clock frequency).

The following basic techniques have been explored so far at the system design level:

Kuon and Rose (2007) suggested using coarse-grained embedded blocks rather than the

fine-grained configurable logic blocks in an FPGA, since the former are more power

efficient than the latter for the same function. However, it must be ensured that the power

consumed by routing does not increase significantly when coarse-grained blocks are

used.

Osborne, et al. (2008) used clock gating as a simple and effective method for reducing

dynamic power consumption. It reduces the dynamic power by eliminating unnecessary

toggling on the outputs of the flip-flops of a circuit, on gates in the fan-out of the flip-flops, and on

clock signals. Clock gating prevents

signal transitions by disabling the clock for the inactive regions. It can be combined with

word-length optimization so that the circuitry in an operator is gated when not in use.
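The saving from clock gating can be approximated by counting how many clock pulses actually reach a block. The sketch below makes a first-order assumption: that clock-network dynamic energy is proportional to the number of delivered pulses, ignoring the overhead of the gating logic itself. The function names and the enable trace are hypothetical.

```python
def clock_pulses_delivered(enable_trace, gated):
    """Count clock pulses reaching a register bank over a trace of cycles.

    enable_trace: iterable of booleans, True when the block does useful work.
    gated: if True, the clock is suppressed in cycles where enable is False.
    """
    return sum(1 for enabled in enable_trace if enabled or not gated)


def clock_energy_saving(enable_trace):
    """Fraction of clock pulses (a proxy for clock energy) removed by gating."""
    trace = list(enable_trace)
    ungated = clock_pulses_delivered(trace, gated=False)
    gated = clock_pulses_delivered(trace, gated=True)
    return 1.0 - gated / ungated if ungated else 0.0
```

For a block that is active in only two of every four cycles, this proxy reports that half of the clock pulses, and therefore roughly half of the clock-related switching, are avoided.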

Wilton, Ang and Luk (2004) found that, at a given clock speed, pipelining, which is a

simple and effective way of reducing glitching, can reduce the amount of energy per

operation by between 40% and 90% for applications such as integer multiplication,

CORDIC, triple DES, and FIR filters.

Chow, et al. (2005) observed that power reduction between 4% and 54% can be achieved

for various arithmetic circuits by using dynamic voltage scaling to adapt the dynamic

supply voltage to the FPGA as the temperature changes.

Tessier, et al. (2007) described how power can also be minimized by optimizing the mapping to

the embedded memories and to the embedded DSP blocks. They proposed a power-

efficient RAM mapping algorithm for embedded memory blocks. In ISE, power is

minimized during placement and routing by minimizing the capacitance of high-activity

signals. Dynamic power dissipation is further minimized by strategically setting the

configuration bits within partially used LUTs to minimize switching activity.

A number of studies have investigated low-power FPGA architecture design:

George, Zhang and Rabaey (1999) described energy-efficient FPGA routing architectures

and low-swing signaling techniques to reduce power and proposed a new FPGA routing

architecture that utilizes a mixture of hardwired and traditional programmable switches.

Sivaswamy, et al. (2005) proposed a new FPGA routing architecture that utilizes a

mixture of hardwired and traditional programmable switches. This reduces static and

dynamic power by reducing the number of configurable routing elements. As the

architecture and the circuit-level implementation of the FPGA directly affect the

efficiency of mapping applications to FPGA resources and the amount of circuitry to

implement these resources, these implementations are the main keys in reducing power.

Kusse and Rabaey (1999) introduced energy-efficient modules for embedded

components in FPGAs to reduce power by optimizing the number of connections

between the module and the routing resources, and by using reduced supply voltage

circuit techniques. They presented a novel FPGA routing switch with high-speed, low-power,

or sleep modes. The switch reduces dynamic power for non-timing-critical logic

and standby power for logic that is not being used. Anderson and Najm (2004)

reported up to 3.6 times lower energy than an ARM7 device, and up to 6 times lower

energy than a C55X DSP, by using several power reduction techniques, such as register

file elimination and efficient instruction fetch, proposed for a coarse-grain

reconfigurable cell-based architecture.

Lin, Li and He (2005) applied power gating to

the switches in the routing resources to reduce static power, and used duplicate routing

resources that operate at either high or low Vdd to reduce dynamic power.

A recent study by Lamoureux, Lemieux and Wilton (2008) suggests that glitching accounts

for 31% of dynamic power dissipation in FPGAs. Glitching occurs when values at the

inputs of a LUT toggle at different times due to uneven propagation delays of those

signals. Lamoureux and others propose a method for minimizing glitching by adding

configurable delay elements to the inputs of each logic element in the FPGA. On

average, the proposed technique eliminates 87% of the glitching, which reduces overall

FPGA power by 17%; the added circuitry increases the overall FPGA area by 6% and the critical-path delay

by less than 1%.

Dynamic power is a result of signal transitions between logic-0 and logic-1. These

transitions can be split into two types: functional transitions and glitches. Functional

transitions are those which are necessary for the correct operation of the circuit. Glitches,

on the other hand, are transitions that arise from unbalanced delays to the inputs of a

logic gate, causing the gate’s output to transition briefly to an intermediate state.

Although glitches do not adversely affect the functionality of a synchronous circuit, as

they settle before the next clock edge, they have a significant effect on power

consumption.

Lamoureux, Lemieux and Wilton (2008) described that spurious transitions can be

produced at the LUT output if the input arrival times are far enough apart, as shown in

Figure 2.5 (a). Detailed timing information is used to configure the delay elements after

place and route, so as to align the arrival times at the inputs of each logic element; this

eliminates glitches as long as the arrival times can be aligned closely enough, as shown in

Figure 2.5 (b).

The amount of glitching eliminated by this method depends on several factors, such as the resolution, maximum delay,

location and number of the programmable delay elements.
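The mechanism can be reproduced with a small idealized model. The sketch below counts the output transitions of a LUT when its inputs change at different arrival times, and then delays the early inputs in steps of a programmable resolution. It is only an illustration under simplifying assumptions (integer time units, zero inertial delay, exact alignment required), not the circuit of the cited work; all names are hypothetical.

```python
def xor2(bits):
    """Example 2-input LUT function: exclusive OR."""
    return bits[0] ^ bits[1]


def lut_output_transitions(lut, old_inputs, new_inputs, arrival_times):
    """Count output transitions as inputs switch from old to new values.

    Each input i changes at arrival_times[i]; inputs arriving at the same
    time are applied together before the output is re-evaluated.
    """
    state = list(old_inputs)
    output = lut(tuple(state))
    transitions = 0
    for t in sorted(set(arrival_times)):
        for i, arrive in enumerate(arrival_times):
            if arrive == t:
                state[i] = new_inputs[i]
        new_output = lut(tuple(state))
        if new_output != output:
            transitions += 1
            output = new_output
    return transitions


def align_arrivals(arrival_times, resolution):
    """Delay early inputs toward the latest arrival in whole steps of `resolution`."""
    latest = max(arrival_times)
    return [t + ((latest - t) // resolution) * resolution for t in arrival_times]
```

With both XOR inputs rising at times 100 and 350, the output makes two spurious transitions although its final value equals its initial value. A delay element with a resolution of 50 aligns both arrivals at 350 and removes the glitch, whereas a coarser resolution of 100 leaves a residual skew, which is one way to see why the resolution of the delay elements matters.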

Figure 2.5 (a) Circuit with Glitch

Figure 2.5 (b) Glitch removed by delay input

Glitch reduction techniques can be applied at various stages in the CAD flow. Since

glitches are caused by unbalanced path delays to LUT inputs, it is natural to design

algorithms that attempt to balance the delays. Cheng, Chen and Wong (2007) proposed

choosing the mapping based on glitch-aware switching activities at the technology

mapping stage, whereas Dinh, Chen and Wong (2009) operated at the routing stage, in

which the faster arriving inputs to a LUT are delayed by extending their path through the

routing network. Delay balancing can also be done at the architectural level. However,

these approaches all incur an area or performance cost.

Some works use flip-flop insertion or pipelining to break up deep combinational logic

paths, which are the root of high glitch power. Wilton, Ang and Luk (2004) described that

circuits with higher degrees of pipelining tend to have lower glitch power because they

have fewer logic levels, thus reducing the opportunity for delay imbalance. Lim, et al.

(2005) proposed to insert flip flops with shifted-phase clocks to block the propagation of

glitches. Tomasz, et al. (2007) used negative edge-triggered flip-flops in a similar

fashion, but without the extra cost of generating additional clock signals. Fischer, et al.

(2005) explored the possibility of applying retiming to the circuit by moving flip-flops to

block glitches.

Shum and Anderson (2011) presented a glitch reduction optimization algorithm based on

don’t-cares that sets the output values for the don’t-cares of logic functions in such a way

that reduces the amount of glitching. The authors performed the process after placement

and routing, using timing simulation data to guide the algorithm. The algorithm achieved

an average total dynamic power reduction of 4.0%, with a peak reduction of 12.5%;

glitch power was reduced by up to 49.0%, and 13.7% on average.

Gupta, Anderson and Wang (2009) observed that the dynamic power consumption is

expected to increase linearly with the clock frequency and the size of a design. It was

also observed that as the clock frequency decreases, the effect of the design size on

power consumption decreases. They mentioned that as long as the device operates at

low frequencies, FPGA designs can be enlarged with a disproportionately low dynamic

power increase. Only at the highest frequencies does the dynamic power change

proportionally to the design area.

The following three contributions of the research by Lamoureux, Lemieux and

Wilton (2008) examine the trade-off between the flexibility of FPGA clock networks and

overall power consumption:

- A parameterized framework for describing a wide range of FPGA clock

networks.

- A comparison of clock aware placement techniques to determine their

effectiveness: since clock networks impose hard constraints on the placement

of logic blocks within the FPGA, a good clock-aware placement algorithm

must obey these constraints and also optimize for speed, routability, and

power consumption.

- Several techniques for combining these objectives are evaluated, in terms of

their ability to find a placement that is fast, energy efficient, and legal.

Many aspects of design performance, such as FPGA area utilization and power consumption, are

affected by the HDL coding style. Dollas et al. (2004), in a case study of rapid

prototyping of a hardware system, presented the effect of CAD tool capabilities, design

flows and design styles, and reported interesting results demonstrating how an

HDL behavioral approach leads to more efficient implementations compared to

structural descriptions.

2.5 Conclusions

From the above literature survey it can be concluded that, in view of the importance

of the three parameters of area, delay and power in FPGA-based system designs, a lot

of work has been carried out by researchers, who have presented different design

techniques for optimizing these parameters. Some of the design techniques

proposed for optimizing area, delay/speed and power are summarized below:

The area can be optimized by using:

- Versatile Place and Route Tool (VPR)

- Switch Depopulation

- Clustering Technique by careful matching of resource availability and design complexity

- Resource Sharing i.e. Time Division Multiplexing (TDM)

- Proper Reset Strategy

The delay can be optimized by using:

- Pipelining

- Parallel Processing

- Register Balancing

- Improving Routing Architecture

- Improving CAD tools

The power can be optimized by using:

- Clock Gating

- Asynchronous Design

- Reducing Clock Speed

- Finite State Machine Proper Encoding

- Dynamic Voltage Scaling

- Power Gating

- Dual Vdd

- Reducing Glitches

- Area Minimization

The literature survey reveals that a lot of work has been done on various

techniques and methods to reduce any one of the three parameters of area, delay and power, but

hardly any literature is available on an approach or methodology which can be

applied to any FPGA-based system to reduce all three parameters and

give the best trade-off for a particular FPGA platform. The next chapter is devoted to the design of

an FPGA-based digital system, a comprehensive 32-bit Floating Point Arithmetic

Unit (FPAU), using VHDL. This is used as the base digital system design for further

developing a systematic approach, to be applied to this system, which

reduces all three parameters to give the best trade-off among

them.
