vdtt.iitd.ac.in research projects thesis jvl072175
8/10/2019 Vdtt.iitd.Ac.in Research Projects Thesis Jvl072175
Power Optimization for Datapaths in FPGAs
A thesis submitted in partial fulfillment of the requirements for the degree of
MASTER OF TECHNOLOGY
in
VLSI DESIGN TOOLS & TECHNOLOGY
by
Haldule Prasad Charudatta
Entry No. 2007JVL2175
under the guidance of
Prof. M Balakrishnan
Mr. Madhav Chikodikar (Synplicity India)
Dr. G. Chandramouli (Synplicity India)
VLSI Design Tools And Technology,
Indian Institute of Technology Delhi.
May 2009
Certificate
This is to certify that the thesis titled Power Optimization for Datapaths
in FPGAs, being submitted by Haldule Prasad Charudatta for the award of
Master of Technology in VLSI Design Tools & Technology, is a record of
bona fide work carried out by him under my guidance and supervision at the
Department of Computer Science & Engineering.
The work presented in this thesis has not been submitted elsewhere either in
part or full, for the award of any other degree or diploma.
Prof. M Balakrishnan
Department of Computer Science and Engineering
Indian Institute of Technology Delhi
Acknowledgments
I would like to sincerely thank my supervisors Prof. M Balakrishnan, Mr.
Madhav Chikodikar and Dr. G. Chandramouli for their constant guidance
and invaluable suggestions throughout the project.
I am also indebted to research scholar Neeraj Goel from IIT Delhi and to
Tarun Kumar and Sriram C. from Synplicity, India for their support and help.
Finally, I would like to thank my parents and all my friends for their
co-operation and encouragement.
Prasad C. Haldule
Abstract
Hardware designs targeting communication and DSP applications consist
of a large number of datapath elements such as adders, multipliers,
comparators and shifters. These elements are the main contributors to the
power consumption of digital circuits. Power reduction is now becoming very
important for the implementation of designs on FPGAs as well. At present,
most of the popular FPGA hardware synthesis tools give higher priority to
delay, so they tend to generate datapath architectures for the fastest
implementation. With the increasing importance of power reduction on FPGAs,
it is becoming necessary to evaluate different datapath architectures from
the point of view of both delay and power.
This work is aimed at characterizing various architectural implementations
of common operators for power, delay and area on a target FPGA, and at
selecting a particular low-power architecture where delay is not critical.
Contents
1 Introduction and Motivation
  1.1 Introduction
  1.2 Related Work
  1.3 Objective
  1.4 Approach
  1.5 Organization
2 FPGA And Operator Architectures
  2.1 FPGA Architecture
  2.2 Operator Architecture
    2.2.1 Adders
      Ripple Carry Adder
      Carry Select Adder
      Carry Look-Ahead Adder 1 - CLA1
      Carry Look-Ahead Adder 2 - CLA2
      Carry Look-Ahead Adder 3 - CLA3
      Brent and Kung Adder
    2.2.2 Multipliers
      Non-Booth Encoding
      Booth Encoding
      Wallace Tree reduction
    2.2.3 Equality Comparators
3 Experiments Setup and Power Plots
  3.1 Power Analysis Setup
  3.2 Power, delay and area plots of operators
4 Power Analysis
  4.1 Dynamic Power Dissipation
  4.2 Power analysis of operators
    4.2.1 Adders
    4.2.2 Multipliers
    4.2.3 Comparators
  4.3 General Observations
5 Power-Aware Architecture Selection and Results
  5.1 Power-Delay-Area database
  5.2 Power-Aware Arithmetic Block Selection
    5.2.1 Minimum power
    5.2.2 Minimum power with speed constraint
  5.3 Results
6 Conclusion
  6.1 Conclusion
References
© 2009, Indian Institute of Technology Delhi
List of Figures
2.1 FPGA Architecture
2.2 Flash-based switch
2.3 Ripple Carry Adder
2.4 Carry Select Adder
2.5 Carry Look-Ahead Adder-1
2.6 Carry Look-Ahead Adder-2
2.7 8-bit Brent and Kung Adder Tree Diagram
2.8 A 4 × 4 multiplier
2.9 An equality comparator
3.1 Power Analysis Flow
3.2 Power, delay and area plots for Adder architectures
3.3 Power, delay and area plots for Multiplier architectures
3.4 Power, delay and area plots for Comparator architectures
4.1 Ripple carry Adder Placement
5.1 Algorithm for selection of low power architecture
5.2 Default and Power-Aware comparison of Adders
5.3 Default and Power-Aware comparison of Multiplier
5.4 Power and delay area plots for Comparator architectures
5.5 Comparison of power for 16-bit adder with given delay constraints
List of Tables
2.1 Booth recoding
4.1 Relative Comparison of Adder architectures
4.2 Relative Comparison of Multiplier architectures
Chapter 1
Introduction and Motivation
1.1 Introduction
The use of FPGAs is increasing rapidly owing to their low cost and shorter
time to market. FPGAs are also well suited to building prototypes and to
reconfigurable applications. However, FPGAs are not power efficient. Two
reasons why FPGAs consume more power are:
• FPGAs contain a large number of interconnects and programmable
switches;
• the generic logic structures in FPGAs consume more power than the
dedicated circuitry in ASICs.
A number of applications where FPGAs are used, such as space and military
systems, demand low power. A major component of FPGA power is the
datapath, so there is scope for reducing power consumption there.
Many applications using FPGAs involve computations that use arithmetic
operators such as adders and multipliers. Present FPGA synthesis tools
such as Synplify Pro target the performance of these operators and give no
consideration to power.
These arithmetic operators can be implemented in different architectures.
The power consumed by an operator can be decreased by selecting a suitable
architecture over the others. The power varies not only because of the
different operator architectures but also because of their
interconnects, which depend on the FPGA architecture and technology. To
make this selection, the synthesis tool must have detailed knowledge of
parameters such as power, delay and area for the different implementations
of each operator at different bit widths.
1.2 Related Work
The idea of characterizing arithmetic operators and selecting an
implementation for minimum power has been discussed recently. A methodology
for building a characteristic lookup table, along with a couple of
approaches to achieve a low-power implementation, is discussed in [1]. [2]
extends the idea by selecting a particular architecture through simulated
annealing, considering the power, delay and area of the operators. [3]
analyzes the power-delay-area trade-off among different architectures of
adders and multipliers for Actel ProASIC FPGAs, while a similar study for
adders on Altera FPGAs is done in [4].
1.3 Objective
The project is aimed at developing a power-aware utility that generates an
FPGA implementation for a datapath operator given its word length and
expected frequency of operation. It can be divided into two parts:
• Analyzing the power of different operators and studying the
power-time-area relation among them.
• Developing a utility that goes through all the implementations and comes
up with the one having minimum power for an operator at the desired
frequency.
1.4 Approach
For analyzing the power of different architectures, the arithmetic
operators of interest, their types and the word lengths for which the
analysis is to be done are first identified. Then we develop a complete
power flow that takes these parameters, invokes the different tools and
directly gives the corresponding power.
As power analysis needs to be done repeatedly, this step is crucial. The
architectures for the operators are then studied from the power point of
view. Finally, we develop a script that uses the Synopsys internal module
generator to generate the right architecture for given delay and power
requirements, and compare the resulting power values with those obtained
when the default architectures are used.
1.5 Organization
The following gives a general overview of this report.
Chapter 2 describes a generic FPGA architecture and the different
operator architectures that are part of our analysis. Chapter 3 explains
the power flow set up for the analysis and gives the power, delay and
area graphs for the operator implementations. In Chapter 4, a detailed
analysis of the architectures is presented from the power point of view and
the trade-offs are discussed. Chapter 5 gives a basic way of selecting a
low-power implementation and compares the results of default and
power-aware implementations, while Chapter 6 presents the conclusions.
Chapter 2
FPGA And Operator
Architectures
2.1 FPGA Architecture
A generic FPGA architecture is shown in the figure below.
Figure 2.1: FPGA Architecture
An FPGA consists of a two-dimensional array of logic blocks connected
by general interconnection resources [5]. FPGA includes three main com-
2.2 Operator Architecture
delay and area for different bit widths. For this, the architectures are
selected from those available in Synopsys's internal module generator. The
operators targeted are adders, multipliers and comparators.
2.2.1 Adders
The module generator offers a wide variety of adder architectures that can
be selected according to the power and performance needs of the design. The
implementations include ripple carry, carry select, carry look-ahead (CLA)
and Brent and Kung adders.
A carry can either be generated in an adder stage or propagated from the
previous one. CLAs use the generate (G) and propagate (P) values of all
previous stages to determine the carry of a particular stage. The GP values
of the previous stages can be combined in many ways, depending on the
architecture used. Accordingly, there are three variants of carry
look-ahead adder with different power-delay-area trade-offs.
Ripple Carry Adder
This is the simplest adder implementation, formed by cascading full adders
in series as shown in figure 2.3. A full adder computes the sum and carry
for each stage. The CarryOut of a full adder stage is applied as the
CarryIn to the next stage [7].
The main advantage of this architecture is its small area; however, it has
a large delay that increases linearly with the bit width. Both delay and
area are O(n).
Figure 2.3: Ripple Carry Adder
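The cascading described above can be illustrated with a small bit-level model. This is only a behavioural sketch in Python, not the netlist produced by the module generator; the function names are chosen here for illustration.

```python
def full_adder(a, b, carry_in):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ carry_in
    carry_out = (a & b) | (carry_in & (a ^ b))
    return s, carry_out


def ripple_carry_add(x, y, n, carry_in=0):
    """Add two n-bit unsigned integers by chaining n full adders."""
    carry = carry_in
    total = 0
    for i in range(n):  # the carry ripples from bit 0 up to bit n-1
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        total |= s << i
    return total, carry
```

The loop visits the n stages one after another, which mirrors the linear growth of delay with bit width.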
Carry Select Adder
Figure 2.4: Carry Select Adder
A carry select adder for n bits is divided into n/s stages of s bits each.
The sum and carry of each stage are first computed separately for an input
carry of 0 as well as 1, and one of the two results is then selected
depending on the actual CarryIn from the previous stage [7].
This circuit has more area (almost twice) than the ripple carry adder, as
each stage contains an extra s-bit adder and a multiplexer. Carry select
has a lower delay than ripple carry because the carry and sum are computed
for each block in advance.
A block diagram of an n-bit carry select adder with s-bit blocks is shown
in Figure 2.4. Here, 0-Carry and 1-Carry are s-bit adders with input carry
0 and 1 respectively.
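A minimal behavioural sketch of the select step follows. It assumes that n is a multiple of s and models each s-bit block adder with integer arithmetic; the names `block_add` and `carry_select_add` are illustrative and not taken from any tool.

```python
def block_add(a, b, carry_in, s):
    """s-bit adder block (modelled arithmetically): returns (sum, carry_out)."""
    total = a + b + carry_in
    return total & ((1 << s) - 1), total >> s


def carry_select_add(x, y, n, s, carry_in=0):
    """Add two n-bit unsigned integers using n/s carry-select stages."""
    assert n % s == 0, "this sketch assumes n is a multiple of s"
    mask = (1 << s) - 1
    carry = carry_in
    result = 0
    for lo in range(0, n, s):
        a = (x >> lo) & mask
        b = (y >> lo) & mask
        sum0, c0 = block_add(a, b, 0, s)  # "0-Carry" adder
        sum1, c1 = block_add(a, b, 1, s)  # "1-Carry" adder
        # the multiplexer picks one result using the actual carry-in
        block_sum, carry = (sum1, c1) if carry else (sum0, c0)
        result |= block_sum << lo
    return result, carry
```

In hardware the two block additions of every stage run in parallel, so only the multiplexer selection remains on the carry path between stages.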
Carry Look-Ahead Adder 1 - CLA1
This is the conventional form of the CLA architecture. The entire word
length is divided into blocks of, say, 4 bits each. The generate-propagate
(GP) values for individual bits (denoted by $G_0$, $P_0$, and so on), then
for pairs of bits and then for groups of 4 bits are generated in the first
three stages of the circuit as,
Stage 1: $G_0 = X_0 \cdot Y_0$, $G_1 = X_1 \cdot Y_1$, $P_1 = X_1 \oplus Y_1$, ...
Stage 2: $G_{1:0} = G_1 + G_0 P_1$, $G_{3:2} = G_3 + G_2 P_3$, $G_{5:4} = G_5 + G_4 P_5$, ...
$P_{3:2} = P_2 P_3$, $P_{5:4} = P_4 P_5$, ...
Stage 3: $G_{3:0} = G_{3:2} + G_{1:0} P_{3:2} = G_3 + G_2 P_3 + G_1 P_2 P_3 + G_0 P_1 P_2 P_3$,
$G_{7:4} = G_7 + G_6 P_7 + G_5 P_6 P_7 + G_4 P_5 P_6 P_7$, ...
$P_{7:4} = P_4 P_5 P_6 P_7$, ...
In each later stage, a 4-bit GP block is combined with the complete
lower GP block to finally realize the complete carry equation. This is
achieved by forming a skewed tree-like structure as shown in figure 2.5
given below.
Figure 2.5: Carry Look-Ahead Adder-1 (stages 1 to 3 produce GP 3:0, GP 7:4, GP 11:8 and GP 13:12; successive GP combination blocks then form GP 7:0, GP 11:0 and GP 13:0)
The GP combination blocks can be realized with simple AND-OR logic, so
this architecture has a small area. In this implementation the GP values of
the lower bits ripple across the design, but in batches. So, this circuit
has a greater delay than the other CLA implementations.
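The batch-wise rippling can be sketched in Python as follows. This is a behavioural illustration only: it assumes a carry-in of 0 and an XOR-based propagate signal, and the function names are hypothetical.

```python
def combine(high, low):
    """GP combination block: merge an upper (G, P) pair with the adjacent lower one."""
    g_hi, p_hi = high
    g_lo, p_lo = low
    return g_hi | (p_hi & g_lo), p_hi & p_lo


def cla1_block_carries(x, y, n, block=4):
    """Carry out of each block, merging block GP values lowest-first (skewed chain)."""
    # stage 1: bitwise generate and propagate
    gp = [(((x >> i) & (y >> i)) & 1, ((x >> i) ^ (y >> i)) & 1) for i in range(n)]
    carries = []
    running = None  # GP of all bits below the current block
    for lo in range(0, n, block):
        group = gp[lo]
        for i in range(lo + 1, min(lo + block, n)):
            group = combine(gp[i], group)  # stages 2 and 3: GP of one block
        running = group if running is None else combine(group, running)
        carries.append(running[0])  # with carry-in 0, carry out equals G
    return carries
```

Each pass of the outer loop depends on the `running` value of the previous pass, which is the software analogue of the GP values rippling through the design block by block.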
Carry Look-Ahead Adder 2 - CLA2
This is a faster version of the CLA architecture. The first three stages,
which generate the GP values, are similar to those of CLA1. Beyond them,
the GP blocks are combined more aggressively, in fewer logic levels,
thereby decreasing the delay. The architecture has a balanced tree
structure.
Figure 2.6: Carry Look-Ahead Adder-2
As seen from figure 2.6, the GP values are combined simultaneously at the
lower bits and the higher bits, unlike in the previous CLA where the
combination proceeded from the lower bits toward the higher bits. Hence, it
requires fewer logic levels. More gates and logic are required to generate
GP blocks of higher bit width, so the area of this fast CLA (FCLA) circuit
is larger.
Carry Look-Ahead Adder 3 - CLA3
One of the carry look-ahead implementations seen before is small (but
slow), while the other is fast (but large). This third variant of the CLA
architecture is a compromise between the two. It has only one more logic
level than CLA2 but about 30% less area.
Brent and Kung Adder
The Brent and Kung architecture is a parallel-prefix, tree-based algorithm
for building a binary adder circuit optimized for time and area [8]. A tree
diagram for an 8-bit Brent and Kung adder is shown in figure 2.7. It uses a
black operator to compute the GP values of higher order from lower order
values. For a black operator,
$G_{i:j} = G_{i:k} + P_{i:k} \cdot G_{k-1:j}$
$P_{i:j} = P_{i:k} \cdot P_{k-1:j}$
Figure 2.7: 8-bit Brent and Kung Adder Tree Diagram
The addition of n-bit numbers can be performed in time $O(\log n)$, while
the area complexity is $O(n \log n)$. The Brent and Kung architecture is
considered the most area-efficient parallel prefix adder, as it needs very
few logic elements.
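The prefix tree can be modelled with an up-sweep followed by a down-sweep of black operators. The following Python sketch assumes n is a power of two, a carry-in of 0 and an XOR-based propagate; it is an illustration of the standard Brent and Kung network rather than the module generator's implementation.

```python
def black(high, low):
    """Black operator: G(i:j), P(i:j) from (i:k) and (k-1:j)."""
    g_hi, p_hi = high
    g_lo, p_lo = low
    return g_hi | (p_hi & g_lo), p_hi & p_lo


def brent_kung_add(x, y, n):
    """Add two n-bit unsigned integers with a Brent and Kung prefix network."""
    assert n >= 1 and (n & (n - 1)) == 0, "this sketch assumes n is a power of two"
    p = [((x >> i) ^ (y >> i)) & 1 for i in range(n)]
    gp = [(((x >> i) & (y >> i)) & 1, p[i]) for i in range(n)]
    # up-sweep: build GP over spans of 2, 4, 8, ... bits
    d = 1
    while d < n:
        for i in range(2 * d - 1, n, 2 * d):
            gp[i] = black(gp[i], gp[i - d])
        d *= 2
    # down-sweep: complete the remaining prefixes gp[i] = GP(i:0)
    d = n // 4
    while d >= 1:
        for i in range(3 * d - 1, n, 2 * d):
            gp[i] = black(gp[i], gp[i - d])
        d //= 2
    # sum bit i is p[i] XOR the carry into bit i, i.e. G(i-1:0)
    total = p[0]
    for i in range(1, n):
        total |= (p[i] ^ gp[i - 1][0]) << i
    return total, gp[n - 1][0]
```

Both sweeps take a number of levels proportional to log n, and every level applies its black operators independently of one another.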
2.2.2 Multipliers
Multiplication is an important arithmetic operation. Multipliers are much
larger than adders and also consume more power. Multiplication is also a
slow operation. It is essentially the addition of multiple partial
products, which are formed by operating the multiplicand with each bit (or
group of bits) of the multiplier. Different multiplier architectures are
obtained depending on how the partial products are generated and how they
are added to give the complete product.
The partial products can be computed using the normal ANDing operation or
by using some encoding method. Encoding reduces the number of partial
products; Booth encoding is used for this purpose. Partial product
generation is followed by their addition, which is done in a tree-like
format using Wallace tree optimization. These techniques can be used in
different ways with different options to give various architectures.
Non-Booth Encoding
This is the simplest way of generating partial products. The multiplicand
is ANDed with each bit of the multiplier to form a partial product [9].
The total number of partial products that need to be added to get the final
product is not reduced.
Non-Booth encoding is preferred for multipliers of lower bit width, as
there is no overhead of encoding the multiplier.
Booth Encoding
Booth encoding is used to decrease the number of partial products. The
smaller the number of partial products, the better the performance. This is
achieved by recoding the multiplier bits so that a group of bits decides
each partial product.
The basic Booth-encoding technique recodes two multiplier bits to +1, -1 or
0 depending on the value of the pair of bits [9]. The multiplicand is
assigned accordingly to the partial product as given in table 2.1.
Multiplier bits   Recoded bits   Corresponding Partial Product
00                 0             0
01                +1             +Multiplicand
10                -1             -Multiplicand
11                 0             0
Table 2.1: Booth recoding
A modified Booth encoding is also used, in which three multiplier bits are
recoded. Booth encoding is very useful, especially for higher bit widths.
It helps to decrease the area as well as the delay of the next stage, i.e.
the addition of partial products.
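The recoding of table 2.1 can be sketched as below. The sketch assumes an n-bit two's-complement multiplier and an implicit 0 to the right of the least significant bit; the function names are illustrative.

```python
def booth_recode(multiplier, n):
    """Basic Booth digits (LSB first) of an n-bit two's-complement multiplier."""
    digits = []
    prev = 0  # implicit bit to the right of the LSB
    for i in range(n):
        cur = (multiplier >> i) & 1
        # bit pair (cur, prev): 00 -> 0, 01 -> +1, 10 -> -1, 11 -> 0
        digits.append(prev - cur)
        prev = cur
    return digits


def booth_multiply(multiplicand, multiplier, n):
    """Sum the partial products selected by the recoded digits."""
    product = 0
    for i, digit in enumerate(booth_recode(multiplier, n)):
        if digit:  # a zero digit contributes no partial product
            product += digit * (multiplicand << i)
    return product
```

For example, the 4-bit multiplier 0110 recodes to the digits +1, 0, -1, 0 (most significant first), i.e. 8 - 2 = 6, so only two non-zero partial products remain.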
Wallace Tree reduction
Wallace tree reduction is a technique of arranging partial products in a
tree-like manner so that they are added in parallel. The number of adders
required decreases with this technique [10, 11]. It uses two operators: a
full adder, which has three inputs and two outputs, and a half adder, which
has two inputs and two outputs. Using these 3:2 and 2:2 operators, the
required reduction is achieved. The routing in Wallace tree multipliers is
highly irregular.
An n-bit multiplier produces n n-bit partial products, which need to be
summed to get the final product. This addition of n n-bit operands is
converted into an addition of two 2n-bit operands.
Consider a 4 × 4 multiplier. Fig. 2.8 shows the Wallace tree architecture
for a 4 × 4 multiplier. Assuming non-Booth encoding, the partial products
are computed as shown in Fig. 2.8(a), where $P_{ij} = X_i \cdot Y_j$.
(a) Partial Products Generation
Addition of these partial products gives the final 8-bit product. This
addition is done in two stages.
1. The first stage consists of fast carry-save addition of the partial
products using full adders and half adders (Fig. 2.8(b)). The carries
generated at one layer are passed on to the next layer for addition. An
advantage of this is that only two layers are sufficient for the addition
of 16 bits (not including the last stage). Thus, both the number of adders
and the critical path are reduced. This reduction is significant for
multipliers with higher word lengths. Because of the use of full adders
(also referred to as 3:2 compressors), the delay of the tree is
$O(\log_{3/2} N)$.
2. The last stage consists of a fast 2n-bit adder (8-bit, in this case).
Usually a carry look-ahead adder is used here.
(b) Wallace Tree Architecture (partial products P00 to P33 are reduced by full adders (FA) and half adders (HA); a last-stage adder produces the outputs Z7 to Z0)
Figure 2.8: A 4 × 4 multiplier
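The two stages can be illustrated with a column-wise model in Python. This is a sketch under simplifying assumptions: non-Booth partial products, a full adder for every three bits in a column and a half adder for a leftover pair, and a plain integer sum standing in for the last-stage adder. The grouping of bits may differ from that in Fig. 2.8(b).

```python
def wallace_multiply(x, y, n):
    """Multiply two n-bit unsigned integers; returns (product, reduction layers)."""
    width = 2 * n
    # columns[k] holds the partial-product bits of weight 2**k
    columns = [[] for _ in range(width)]
    for i in range(n):
        for j in range(n):
            columns[i + j].append(((x >> i) & 1) & ((y >> j) & 1))
    layers = 0
    while max(len(col) for col in columns) > 2:
        nxt = [[] for _ in range(width + 1)]
        for k, bits in enumerate(columns):
            while len(bits) >= 3:  # full adder (3:2)
                a, b, c = bits[:3]
                bits = bits[3:]
                nxt[k].append(a ^ b ^ c)
                nxt[k + 1].append((a & b) | (b & c) | (a & c))
            if len(bits) == 2:  # half adder (2:2)
                a, b = bits
                nxt[k].append(a ^ b)
                nxt[k + 1].append(a & b)
            else:  # zero or one bit passes through unchanged
                nxt[k].extend(bits)
        columns = nxt[:width]  # a carry beyond bit 2n-1 is always 0
        layers += 1
    # last stage: add the remaining (at most two) rows
    product = sum(bit << k for k, col in enumerate(columns) for bit in col)
    return product, layers
```

For n = 4 this model needs two reduction layers before the last-stage addition, in line with the count quoted in item 1.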
The Wallace tree reduction can be implemented in different ways, and
Synopsys's internal module generator provides various options for it. The
order of the compressor can be set to 3 or 4. A few other options allow the
Wallace tree to be optimized for either area or timing.
All these options are used to generate different implementations of
multipliers. The names of architectures with Booth and non-Booth encoding
start with bwt and nbwt respectively. The latter part of the name denotes
the type of Wallace tree optimization used. These are,
• array: This architecture gives the traditional array form of multiplier.
It has a smaller area but, as the partial products are added in array
form, a larger delay. It produces a dense layout.
• tree: This is a regular tree-like implementation of the multiplier. It
has a lower delay. 4:2 compressors are used in this implementation.
• bitopt: The bit-optimized solution is the circuit with minimum gate count
and minimum delay.
2.2.3 Equality Comparators
Equality comparators check the equality of two numbers and output 1 if
they are equal, else 0. The simplest way of doing so is to bitwise XNOR
the two inputs and then AND the outputs of the XNORs to get the final
result. A generic equality comparator is shown in Fig. 2.9.
Figure 2.9: An equality comparator (inputs A0 to An-1 and B0 to Bn-1, output Y)
The polarity of the logic and XOR gates can be set to either true or
inverting in Synopsys's module generator. However, the basic architecture
remains the same.
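The XNOR-then-AND structure can be written as a short behavioural model. The function name is illustrative, and the sketch corresponds to the true-polarity form only.

```python
def equality_compare(a, b, n):
    """Return 1 if the n-bit inputs a and b are equal, else 0."""
    # bitwise XNOR: 1 where the corresponding bits agree
    xnor_bits = [1 - (((a >> i) ^ (b >> i)) & 1) for i in range(n)]
    result = 1
    for bit in xnor_bits:  # AND of all the XNOR outputs
        result &= bit
    return result
```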
Chapter 3
Experiments Setup and Power
Plots
3.1 Power Analysis Setup
For power analysis of the architectures of different operators at various
bit widths, a power flow is set up using tools provided by different
vendors.
The power analysis flow is as follows:
1. The Synopsys internal module generator is used to generate a Verilog
netlist for a particular architecture of an operator. It takes two input
files: one specifies the functionality of the design/operator and the other
selects, among the options provided, those that determine a particular
architecture. The .lib library file of the target FPGA is also loaded.
The module generator selects appropriate library components to export
the netlist.
Registers are placed at the inputs and outputs of the operators. The first
reason is that timing analysis tools provide only clock-to-clock delay;
the delay of a purely combinational circuit is not calculated. The second
reason is that I/O pads have high delay values, which would add to the
combinational circuit delay. Placing registers breaks this direct link.
2. The Verilog netlist from the previous stage is then synthesized in
Synplify Pro, a logic synthesis tool for FPGAs. The details of the target
FPGA are specified and constraints (if any) are given. For the power
analysis experiments, the frequency is kept in auto-constrain mode.
Automatic constraints generate the fastest design implementation.
If the frequency constraint is set to a value greater than that derived in
auto-constrain mode, Synplify Pro tries to optimize the circuit for the
higher frequency. In the process it adds more gates, making the circuit
larger and distorting the original architecture. This causes the circuit to
consume more power. So, the frequency is set to auto-constrain to preserve
the architecture and to meet the required timing with minimum power.
Synplify Pro generates post-synthesis .edn (EDIF) and Verilog netlists.
Figure 3.1: Power Analysis Flow
3. The post-synthesis Verilog netlist is simulated in VCS, a Verilog
simulator, using a testbench. First, a pseudo-random number generator
dumps a large number of binary numbers of the specified width into a file.
The testbench then reads these values in pairs and applies them to the
operator module.
A .vcd (Value Change Dump) file is generated during simulation using the
$dumpvars system task in the testbench. It contains the switching
activity of all the signals in the module and is used for power
estimation. The arithmetic operation (addition, multiplication, etc.) is
performed on about 10,000 pairs of random numbers. The greater the
number of input vectors, the more accurate the results can be expected to
be.
4. The post-synthesis EDIF netlist is passed through the Place and Route
(layout) tool of the target FPGA vendor. Placement constraints, such as
defining a region and placing all or a group of the instances in the design
within that region, can be specified here using a .pdc (Physical Design
Constraint) file.
A status report is generated after compilation of the EDIF netlist. It
contains information about the number of combinational and sequential
instances used in the design and the number of high-fanout nets with their
respective fanouts.
5. A vendor-specific power estimation/analysis tool is used to calculate
the operator power. This step immediately follows PnR. It uses the VCD
file for power estimation. This is the recommended method for power
estimation, as the switching activity information from the VCD file is
more accurate.
A VCD file dumped by VCS may not be directly usable for power estimation,
as signals are specified in vector form. So, these signals should be
scalarized. This is done using a vcdpost utility that produces a VCD
file in which the value changes of each bit of a vector signal are recorded
as a separate signal.
The power report gives a breakdown of power by type, viz. Net, Gate,
I/O, Core Static and Bank Static. It also contains a breakdown by
instance, which gives the power of every instance/gate and net in the
design. The number of transitions at each net can also be studied.
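The stimulus generation mentioned in step 3 can be sketched as follows. The thesis does not describe the generator itself, so this is only an assumed form: the function names, the seed handling and the one-operand-per-line file layout are hypothetical choices made for illustration.

```python
import random


def generate_operand_pairs(pairs, width, seed=0):
    """Return `pairs` tuples of random `width`-bit operands as binary strings."""
    rng = random.Random(seed)  # fixed seed so that a run can be repeated
    fmt = "{:0" + str(width) + "b}"
    return [
        (fmt.format(rng.getrandbits(width)), fmt.format(rng.getrandbits(width)))
        for _ in range(pairs)
    ]


def write_stimulus(path, operand_pairs):
    """Write one operand per line; consecutive lines form an input pair."""
    with open(path, "w") as handle:
        for a, b in operand_pairs:
            handle.write(a + "\n" + b + "\n")
```

A call such as `write_stimulus("operands.txt", generate_operand_pairs(10000, 16))` would produce a file of the size used in the experiments, which a testbench could then read in pairs; the file name here is arbitrary.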
3.2 Power, delay and area plots of operators
The power analysis flow was carried out on architectures of adders,
multipliers and comparators for different bit widths. The total dynamic
power (net + gate), delay and number of cores utilized are plotted
accordingly. The respective graphs follow; they are analyzed in the next
chapter.
(a) Power: Power (Net + Gate) in mW versus bit width, for bnk, carrySelect, cla1, cla2, cla3 and ripple
(b) Delay: Delay (ns) versus bit width, for bnk, carrySelect, cla1, cla2 and cla3
(c) Area: Area (number of cores) versus bit width, for bnk, carrySelect, cla1, cla2, cla3 and ripple
Figure 3.2: Power, delay and area plots for Adder architectures
(a) Power: Power (Net + Gate) in mW versus bit width, for bwt-area, bwt-bitopt, bwt-timing, nbwt-area, nbwt-bitopt and nbwt-timing
(b) Delay: Delay (ns) versus bit width, for the same six architectures
(c) Area: Area (number of cores) versus bit width, for the same six architectures
Figure 3.3: Power, delay and area plots for Multiplier architectures
(a) Power: Power (Net + Gate) in mW versus bit width, for invertedLogic and trueLogic
(b) Delay: Delay (ns) versus bit width, for invertedLogic and trueLogic
(c) Area: Area (number of cores) versus bit width, for invertedLogic and trueLogic
Figure 3.4: Power, delay and area plots for Comparator architectures
Chapter 4
Power Analysis
4.1 Dynamic Power Dissipation
Dynamic power dissipation is caused by signal transitions in the circuit
[12]. A higher operating frequency leads to more frequent signal
transitions and results in increased power dissipation. The most
significant source of dynamic power consumption in CMOS circuits is the
charging and discharging of capacitance. This power is given as,
$$P = \sum_i C_i V_i^2 f_i,$$
where $C_i$, $V_i$ and $f_i$ are the capacitance, voltage and frequency
respectively for any instance $i$.
Another source of dynamic power consumption is short-circuit power, but
it makes a smaller contribution in FPGAs.
Thus the total dynamic power consumption depends on the following factors:
• the length, fanout and effective capacitance of the interconnection wires
and switches,
• the number of resources utilized, since more resources consume more
power,
• and the switching activity of the different nets and instances in the
design.
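As a worked illustration of the formula, the sum can be evaluated over a list of nodes. The capacitance, voltage and frequency values below are made-up numbers chosen only to show the arithmetic; they are not measurements from the experiments.

```python
def dynamic_power(nodes):
    """Sum C * V^2 * f over (capacitance in F, voltage in V, frequency in Hz) tuples."""
    return sum(c * v * v * f for c, v, f in nodes)


# hypothetical example: two nets switching at different rates
example_nodes = [
    (1e-12, 1.5, 100e6),  # 1 pF at 1.5 V toggling at 100 MHz
    (2e-12, 1.5, 25e6),   # 2 pF at 1.5 V toggling at 25 MHz
]
total_watts = dynamic_power(example_nodes)
```

The first net contributes 1e-12 × 2.25 × 1e8 = 0.225 mW and the second 2e-12 × 2.25 × 2.5e7 = 0.1125 mW, so a larger but less active net can still dissipate less than a smaller, busier one.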
4.2 Power analysis of operators
4.2.1 Adders
Ripple Carry Adder
The ripple carry adder has the smallest circuit in terms of the number of nets
and gates, and its nets have a fanout of no more than 2. So, it is expected to
consume less power.
The gate power of the ripple carry adder is indeed the least. However, the PnR
tool does not identify the architecture, so its placement is spread out, which
increases the net length. Also, the number of logic levels is much higher than
in any other architecture. As a result of these two effects, the net power is
not as low as expected. The increase in power caused by the increase in logic
levels is more dominant at higher word lengths.
The major drawback of this architecture is that its delay is too high and
increases linearly with word length (hence, it is not shown in Fig. 3.2(a)).
Carry Select Adder
The carry select architecture is formed by duplicating blocks of the ripple
carry adder. So, its area is around twice that of the ripple carry adder
circuit, but it is smaller than the CLAs, which employ more complex circuitry.
It has very high fanout nets, even more than CLA2. This causes carry select to
consume more power.
Its delay is comparable to that of the Brent and Kung adder, and is much
less than that of the ripple carry adder, as the sum and carry for each block
are computed beforehand.
Carry Look-Ahead Adder 1 - CLA1
CLA1 is the slowest variant of the CLAs. Its circuit size is 50% of that of
CLA2, and its number of logic levels is smaller than that of the ripple carry
and Brent and Kung adders. As a result of the small size and fewer levels,
CLA1 circuits consume less power. One more advantage is that it has low fanout
nets, so the net power is low. However, there are more spurious transitions in
this circuit since it is a skewed-tree-like architecture.
The circuit is slower than the other CLAs but still faster than the ripple
carry and Brent and Kung (BnK) architectures. So, this architecture can be a
good choice when all three parameters, viz. delay, power and area, are equally
critical.
During placement and routing, the PnR tool performs some logic combining
optimization on the post-synthesis netlist. The number of logic elements
combined is given as part of the Netlist Optimization Report in the Status
report. It is observed that there is more scope for such post-synthesis logic
combining for the CLA1 architecture than for the others. There is about 10%
logic combining for CLA1 at this stage, further reducing its area.
Carry Look-Ahead Adder 2 - CLA2
The CLA2 architecture is the most power-consuming of all, as it is more
complex and has more gates and nets than the other architectures.
This architecture has many interconnections and dependencies in the circuit.
Hence, it has a well-packed layout, so although the number of nets is higher,
they are shorter. CLA2 has high fanout nets.
The logic depth of the circuit is very small: it does more computation in
fewer levels. This makes CLA2 the fastest adder architecture. The number of
logic levels remains constant over a range of bit widths, so the delay also
tends to be flat.
Carry Look-Ahead Adder 3 - CLA3
The CLA3 circuit has only one logic level more than CLA2. At the same time, it
is about 70% smaller than CLA2. As a result, the power consumption of this
architecture is also about 75-80% less than that of CLA2.
So, this circuit can be a better alternative to CLA2 in terms of area
and power. However, in cases where timing is critical, CLA2 should be used.
For higher bit widths, though CLA3 has one level more than CLA2, it uses
gates with lower delay, so its delay becomes almost comparable to that of CLA2.
Brent And Kung Adder
The Brent and Kung architecture is slightly better in power than the other
architectures, mainly because its number of gates and nets is much lower than
in the others. Only the ripple carry adder has fewer gates and nets than Brent
and Kung. However, Brent and Kung has the advantage that its number of logic
levels is of the order log n, whereas for the ripple carry adder it is of
order n. Also, its layout is scattered to some extent, as for the ripple carry
adder.
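The contrast in logic depth noted above (order n for ripple carry, order log n for Brent and Kung) can be made concrete with a small sketch. The text gives only the orders of growth; the exact count 2*ceil(log2 n) - 1 used here is the usual textbook figure for the stages of a Brent and Kung prefix network and is an assumption, as are the function names. Neither count is the number of mapped logic levels on the FPGA.

```python
def ripple_carry_stages(n: int) -> int:
    """Carry passes through every bit position, so depth grows linearly."""
    if n < 1:
        raise ValueError("bit width must be positive")
    return n


def brent_kung_stages(n: int) -> int:
    """Textbook prefix-network depth, 2*ceil(log2(n)) - 1 (assumed formula)."""
    if n < 1:
        raise ValueError("bit width must be positive")
    if n == 1:
        return 1
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1
    return 2 * (n - 1).bit_length() - 1


if __name__ == "__main__":
    for width in (8, 16, 32, 64):
        print(width, ripple_carry_stages(width), brent_kung_stages(width))
```

Doubling the bit width doubles the ripple carry count but adds only two stages to the Brent and Kung count, which is why the gap widens at higher word lengths.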
Performance-wise, Brent and Kung is slower than the CLAs as it has
deeper logic. It is more suitable from a power point of view at higher bit
widths, but its delay is then 1.5 times that of CLA2 for 32 and 64 bits.
The relative comparison of architectures of adders is given in Table 4.1
and the graphs are given in Fig. 3.2.
Architecture     No. of nets/gates   No. of logic levels   Fanout
ripple           Low                 Very High             Low
Carry Select     High                High                  High
CLA1             Moderate            Moderate              Moderate
CLA2             Very High           Low                   High
CLA3             High                Low                   High
Brent & Kung     Low                 High                  Moderate
Table 4.1: Relative Comparison of Adder architectures
4.2.2 Multipliers
The power, delay and area graphs for the different multiplier architectures
are given in Fig. 3.3. The trade-offs among these three parameters are
discussed here.
Area
The architectures employing Booth encoding have more area because of the
additional circuitry for encoding. This Booth circuit varies with the encoding
method. So, for lower bit-width multipliers, non-Booth encoding architectures
have smaller area.
However, as seen in Fig. 3.3(c), the Booth variant of any implementation
has smaller area than its non-Booth counterpart at higher bit widths. This is
because at higher bit widths, the hardware overhead of the Booth encoding
part is much smaller than the amount of circuitry saved in the later stages
by using Booth encoding. The bwt area architecture has considerably
smaller area than the other architectures for multipliers more than 24 bits
wide (even for 48- and 68-bit multipliers, not shown in the graph).
It is observed that logic combining during the PnR stage is greater for Booth
encoded circuits. This is because the PnR tool can identify the Booth encoding
circuits and optimize them accordingly.
Delay
The delay of array multipliers is, as expected, much greater than that of the
other architectures over the whole range of word lengths. This is because the
carries generated by the addition of partial products ripple along the rows
and columns of the array. So the number of logic levels increases, in the
worst case, linearly with bit width for the non-Booth implementation. For
circuits with Booth encoding, the number of partial products to be summed is
reduced, so the delay is lower.
The tree and bitopt architectures have similar delay characteristics, with
bitopt faring slightly better than the corresponding tree implementations.
Below 28 bits, the non-Booth implementation is faster, while for wider
multipliers the bit optimized Booth implementation gives the minimum delay.
Power
The power of the architectures depends on three major factors: the number of
nets and gates in the design, the number of logic levels (which contribute to
the spurious transitions that increase the glitch power), and the number of
high fanout nets and their fanout.
A relative analysis of these three factors is given in Table 4.2.
Architecture     No. of nets/gates   No. of logic levels   Fanout
bwt array        Low                 High                  High
bwt bitopt       Low                 Low                   High
bwt tree         High                Low                   High
nbwt array       High                High                  Low
nbwt bitopt      Low/Moderate        Low                   Low
nbwt tree        High                Low                   Low
Table 4.2: Relative Comparison of Multiplier architectures
The fanout for Booth encoding circuits is very high. Mostly, the input
nets are high fanout nets and hence consume more power.
There are two types of transitions in a circuit: functional (intended) and
spurious (unintended). The more logic levels there are, the more spurious
transitions (or glitches) occur. These transitions increase the dynamic power
consumption. The number of logic levels, as seen before, is higher for array
implementations. The tree and bitopt implementations have smaller logic
depth. Thus the contribution of glitch power due to spurious transitions is
lower in these circuits.
Finally, power consumption increases with the number of nets and gates.
Gates consume switching as well as short-circuit power. Nets, as seen earlier,
are also expensive in terms of power due to their length and fanout. The bit
optimized architectures are smaller, so they have fewer power-consuming
resources.
As seen from Table 4.2, the nbwt bitopt architecture does well on all
three fronts. Hence, it gives minimum power, while the array based circuits
consume more power (Fig. 3.3(a)).
4.2.3 Comparators
The two comparator architectures implemented are invertedLogic and
trueLogic, which, as the names suggest, differ in the type of logic
implemented.
As seen from the power, delay and area plots of comparators given in Fig.
3.4, the values of these parameters are almost the same for the two
implementations. This is because the comparator architecture is fairly simple
and leaves no scope for further optimization, so there is no structural
difference between the two implementations. Only the library elements used are
different.
This comparator architecture can be optimized if the arrival times of the
inputs are known. However, this is not the general case and may vary with the
design.
The comparator circuits are scalable. Hence, the area and power of the
circuits increase linearly with bit width (Fig. 3.4(a) & (c)).
But the delay, as shown in Fig. 3.4(b), increases with bit width in steps.
This is because the number of logic levels needed to compute the comparator
output is the same for all bit widths in the range
$2^{n-1} < \text{bitwidth} \le 2^n$.
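The stepwise behaviour can be illustrated by computing, for a given bit width, the index n of the range it falls in. This is only a sketch: the function name is hypothetical, and the text does not state how many logic levels correspond to each n, since that depends on how the comparator is mapped to library elements.

```python
def comparator_range_index(bit_width: int) -> int:
    """Return n such that 2**(n-1) < bit_width <= 2**n."""
    if bit_width < 1:
        raise ValueError("bit width must be positive")
    # (bit_width - 1).bit_length() equals ceil(log2(bit_width))
    return (bit_width - 1).bit_length()


if __name__ == "__main__":
    for width in (15, 16, 17, 32, 33, 64):
        print(width, comparator_range_index(width))
```

Widths 17 to 32 share one index and width 33 moves to the next, so under this grouping the delay would be expected to step up just after each power of 2.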
4.3 General Observations
The power due to nets is about 3 to 4 times greater than the gate power
for all the architectures. This is primarily because nets in an FPGA are not
contiguous: FPGA nets consist of net segments and switching elements, so net
power includes the switching power of these interconnecting switches as well.
So, total power mostly depends on the number of nets, their length, and their
fanout.
If the PnR tool is made to do placement and routing without any
constraints, then power indirectly depends on the placement too. Unconstrained
placement may cause a circuit to spread over the available region, which
increases the net length. With this, power also varies, contradicting the
expected power values. One important observation made in this regard was
that when the power of the ripple carry adder is calculated after
unconstrained PnR, it is almost comparable to the power of the CLA2 circuit.
This contradicts expectation, as CLA2 is more complex and bigger than the
ripple carry adder circuit and hence is expected to consume more power. It is
only after some constraints are imposed that the power behaves in accordance
with the architecture.
The constraints can be added indirectly by assigning pins to the ports or
directly by giving some area restrictions. Applying pin constraints does not
serve the purpose, for the following reasons:
- When a particular adder, or any operator, is part of a bigger circuit, it
  will have to be placed together like a single module. On the other hand,
  if we impose pin constraints, the input-output registers are placed near
  the pins and the core circuit away from them. This decreases the clock net
  length but may increase the lengths of other nets, which could introduce
  errors in the power values.
- Also, providing pin constraints is difficult for circuits of varying
  bit widths, mainly because the pin assignments must pass the Design
  Rule Check (DRC).
Figure 4.1: Ripple carry Adder Placement, (a) Before Constraints and (b) After Constraints
So, for the power analysis experiments, area constraints are given to the PnR
tool. A region is defined such that it is close to the CLK pad. Then, using
the Physical Design Constraints (PDC) command assign_net_macros, all the
instances connected to the clock net (the input-output registers) are confined
within that region. This causes the other circuitry to gather around it. If
assign_net_macros were applied to all nets, so as to gather the entire circuit
within the region, a larger region would have to be defined to accommodate
operators of higher bit widths. This might again lead the smaller circuits to
disperse. Hence, the constraint is applied only to instances connected to the
clock net.
Fig. 4.1 shows the placement of a 16-bit ripple carry adder before and after
applying area constraints.
Even after such area constraints are applied, the circuits of the ripple carry
and Brent and Kung adders are scattered to some extent. This increases the
net length and thus contributes to power.
Chapter 5
Power-Aware Architecture
Selection and Results
5.1 Power-Delay-Area database
Present synthesis tools select the best-performance (minimum delay)
implementation of any arithmetic operator. To select an architecture for low
power, detailed knowledge of the power, delay and area values at different bit
widths is needed, as it indicates which architecture is optimum at a given bit
width. A database can serve this purpose. It can be created by running the
power analysis flow (Section 3.1) on the different architectures.
This is a one-time characterization, and lookup tables can be created from it
that can be used to select a low power implementation.
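One possible shape for such a lookup table is sketched below, together with a minimum power query over it. The layout, the names Metrics, PDA_DB, candidates and min_power_architecture, and every number in the table are hypothetical placeholders; the text does not specify how the database is stored, and the values are not characterization results.

```python
from typing import Dict, NamedTuple, Tuple


class Metrics(NamedTuple):
    power_mw: float
    delay_ns: float
    area_cores: int


# Key: (operator, architecture, bit width).
# All entries are made-up placeholders, not measured values.
PDA_DB: Dict[Tuple[str, str, int], Metrics] = {
    ("adder", "ripple", 16): Metrics(1.0, 18.0, 40),
    ("adder", "cla1", 16): Metrics(1.6, 9.0, 70),
    ("adder", "cla2", 16): Metrics(3.2, 6.0, 140),
    ("adder", "ripple", 32): Metrics(2.1, 35.0, 80),
}


def candidates(db: Dict[Tuple[str, str, int], Metrics],
               operator: str, bit_width: int) -> Dict[str, Metrics]:
    """Collect every characterized architecture for one operator and width."""
    return {arch: m for (op, arch, bw), m in db.items()
            if op == operator and bw == bit_width}


def min_power_architecture(db: Dict[Tuple[str, str, int], Metrics],
                           operator: str, bit_width: int) -> str:
    """Pick the architecture with the lowest power, ignoring delay and area."""
    found = candidates(db, operator, bit_width)
    if not found:
        raise KeyError(f"no entry for {operator} at {bit_width} bits")
    return min(found, key=lambda arch: found[arch].power_mw)


if __name__ == "__main__":
    print(min_power_architecture(PDA_DB, "adder", 16))
```

Keying on operator, architecture and bit width keeps the characterization a pure lookup at synthesis time; a width that was never characterized raises an error here rather than being interpolated.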
5.2 Power-Aware Arithmetic Block Selection
An optimum power architecture can be selected from among the available
architectures in the following ways.
5.2.1 Minimum power
Select a minimum power implementation, irrespective of delay and area.
Generally, a low power implementation also has a smaller area. Such a
selection is useful when the operator is not on the critical path.
This gives the maximum amount of power that can be saved if a low
power architecture of an operator is implemented rather than a default one.
5.2.2 Minimum power with speed constraint
The algorithm for selecting a minimum power implementation whose delay is
within a given constraint is shown in Fig. 5.1.
According to it, given a speed requirement, the minimum power implementation
is selected first. If it meets the speed requirement, it is chosen. Otherwise,
the next lowest power implementation is selected. This is repeated, so that
the implementation selected has the minimum possible power among those whose
delay is within the given constraint. If none of the architectures gives a
delay less than the constrained value, then the architecture with minimum
delay is finally selected, irrespective of power.
This algorithm can be used if an operator is on the critical path or if,
even though it is not on the critical path, it may violate timing requirements
when a low power but high delay architecture is selected. For this, the
synthesis tool may need to extract timing details of the paths on which the
operators are present and deduce the maximum delay requirement for the
operators.
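The selection procedure described above can be sketched as follows. This is a minimal illustration rather than the utility used in this work: the function name, the data layout and the example architectures with their power and delay values are all hypothetical, and a delay equal to the constraint is treated here as meeting it.

```python
from typing import Dict, Tuple

# architecture name -> (power in mW, delay in ns)
ArchTable = Dict[str, Tuple[float, float]]


def select_architecture(archs: ArchTable, max_delay_ns: float) -> str:
    """Lowest-power architecture meeting the delay constraint,
    falling back to the minimum-delay one if none meets it."""
    if not archs:
        raise ValueError("no architectures available")
    remaining = dict(archs)  # work on a copy so the caller's table is kept
    while remaining:
        # Select the architecture with minimum power among those left.
        best = min(remaining, key=lambda name: remaining[name][0])
        if remaining[best][1] <= max_delay_ns:
            return best
        # Constraint not met: remove it from the list and try the next one.
        del remaining[best]
    # List is empty: choose minimum delay irrespective of power.
    return min(archs, key=lambda name: archs[name][1])


# Made-up values for illustration only (not results from this work).
EXAMPLE_ARCHS: ArchTable = {
    "slow_low_power": (1.0, 18.0),
    "balanced": (1.6, 9.0),
    "fast_high_power": (3.2, 6.0),
}

if __name__ == "__main__":
    for limit in (20.0, 10.0, 7.0, 5.0):
        print(limit, select_architecture(EXAMPLE_ARCHS, limit))
```

Sorting the candidates by power once and scanning them in order would give the same result; the loop is written with explicit removal only to mirror the steps of the description.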
5.3 Results
A utility is developed to select the minimum power architecture, irrespective
of delay and area. It has two inputs: the operator type and its bit width. It
looks up all the available implementations of that operator and selects the
one with minimum power.
This poweraware architecture is compared to the default operator
architecture generated by Synplify Pro by synthesizing HDL statements like
c = a + b for adders, c = a * b for multipliers, etc.
Power in poweraware mode for adders is 50% less than that in default
mode. The area up to 32 bits is about 30% of the area of the default
architecture, while for wider adders it is 50%. However, the delay for lower
bit-width adders is too high in the poweraware architecture. This is because,
as timing is not considered, the ripple carry adder is selected here, which
has a higher delay.
[Flowchart for Figure 5.1]
1. Get a list of architectures.
2. Select the architecture with minimum power.
3. Check if delay constraints are met.
   Yes: choose that architecture and end.
   No: remove that architecture from the list.
4. Is the list empty?
   No: return to step 2.
   Yes: select the architecture with minimum delay and end.
Figure 5.1: Algorithm for selection of low power architecture
Even for multipliers (Fig. 5.3), poweraware architectures consume about
60-70% of the power of the default architectures. The delay at lower bit
widths is smaller for poweraware, while default architectures fare better in
delay for wider multipliers (even for 48- and 64-bit multipliers, not shown in
the graph).
As expected, power (Fig. 5.4(a)) and area for comparators are almost the
same for the poweraware and default implementations, as the architectures are
not much different. But, as seen from Fig. 5.4(b), the delays form a
typical pattern, with poweraware having lower delay for bit widths that are
powers of 2. This is because beyond a power of 2 the number of logic levels
increases by 1 and the delay rises above that of the default implementation.
This comparison tells us that power consumption can be reduced if a low
power implementation of an operator is selected where delay is not critical.
Fig. 5.5 gives a comparison of the power of the default and poweraware
architectures for a 16-bit adder with given delay constraints. The poweraware
architecture is selected using the algorithm given in Fig. 5.1. The ripple
carry adder, CLA2 and CLA1 architectures are selected by the algorithm for
delay constraints of 20 ns, 15 ns and 10 ns respectively. Fig. 5.5 shows that
if the delay constraints are not stringent, our algorithm is able to select a
low power implementation that meets the given speed requirement.
[Figure 5.2, panels (a) Power (Net+Gate, mW) and (b) Delay (ns), each plotted against bit width for the default and poweraware architectures]
[Figure 5.2, panel (c) Area (# of cores) plotted against bit width for the default and poweraware architectures]
Figure 5.2: Default and Power-Aware comparison of Adders
[Figure 5.3, panels (a) Power (Net+Gate, mW), (b) Delay (ns) and (c) Area (# of cores), each plotted against bit width for the default and poweraware architectures]
Figure 5.3: Default and Power-Aware comparison of Multiplier
[Figure 5.4, panels (a) Power (Net+Gate, mW) and (b) Delay (ns), each plotted against bit width for the default and poweraware architectures]
Figure 5.4: Power and delay plots for Comparator architectures
[Figure 5.5 plot: Power (mW) of the default and poweraware architectures at delay constraints of 20, 15 and 10 ns]
Figure 5.5: Comparison of power for 16-bit adder with given delay constraints
Chapter 6
Conclusion
6.1 Conclusion
Synthesis tools, which by default choose the minimum delay implementation of
an arithmetic operator, can be made to select a low power alternative where
timing is not critical. This can save a considerable amount of power.
The net power is many times greater than the gate power, as interconnection
wires and switches consume more power. The power consumed
by a net depends largely on its length. Thus, power depends not only on
the number of instances but also on their placement, as scattered placement
increases the lengths of the nets.
So, for operator architectures whose cells are not placed together,
blocks must be created through some manual placement that places the complete
module together so as to minimize net length. Such blocks can be used during
actual physical synthesis.
Also, for power-aware synthesis, decisions need to be taken at the logic
synthesis as well as the physical synthesis stage. Some guidelines need to
be set for the PnR tool so that it identifies that all instances in a
particular operator module should be placed together.
REFERENCES
[9] Hesham Al-Twaijry, Michael Flynn; Performance/Area Tradeoffs in
Booth Multipliers; Technical Report CSL TR-95-684, pp. 1-18, Nov. 1995.
[10] C. S. Wallace; A Suggestion for a Fast Multiplier; IEEE Trans. Electronic
Computers, vol. 13, pp. 14-17, 1964.
[11] J. M. Rabaey, A. Chandrakasan, and B. Nikolic; Digital Integrated
Circuits, A Design Perspective, Second Edition; Prentice-Hall, 2003.
[12] L. Shang, A. Kaviani, and K. Bhadala; Dynamic Power Consumption of
the Virtex-2 FPGA family; ACM International Symposium on
Field-Programmable Gate Arrays, pp. 157-164, Monterey, CA, 2002.