vdtt.iitd.ac.in research projects thesis jvl072175


Power Optimization for Datapaths in FPGAs

A thesis submitted in partial fulfillment of the requirements for the degree of

MASTER OF TECHNOLOGY

in

VLSI DESIGN TOOLS & TECHNOLOGY

by

Haldule Prasad Charudatta

Entry No. 2007JVL2175

under the guidance of

Prof. M Balakrishnan

Mr. Madhav Chikodikar (Synplicity India)

Dr. G. Chandramouli (Synplicity India)

VLSI Design Tools And Technology,

Indian Institute of Technology Delhi.

May 2009



Certificate

This is to certify that the thesis titled Power Optimization for Datapaths in FPGAs being submitted by Haldule Prasad Charudatta for the award of Master of Technology in VLSI Design Tools & Technology is a record of bonafide work carried out by him under my guidance and supervision at the Department of Computer Science & Engineering.

The work presented in this thesis has not been submitted elsewhere either in

part or full, for the award of any other degree or diploma.

Prof. M Balakrishnan

Department of Computer Science and Engineering

Indian Institute of Technology Delhi



Acknowledgments

I would like to sincerely thank my supervisors Prof. M Balakrishnan, Mr.

Madhav Chikodikar and Dr. G. Chandramouli for their constant guidance

and invaluable suggestions throughout the project.

I am also indebted to research scholar Neeraj Goel from IIT Delhi and

Tarun Kumar, Sriram C. from Synplicity, India for their support and help.

Finally, I would like to thank my parents and all my friends for their co-operation

and encouragement.

Prasad C. Haldule



Abstract

Hardware designs targeting communication and DSP applications consist of a large number of datapath elements such as adders, multipliers, comparators and shifters. These elements are the main contributors to the power consumption of digital circuits. Power reduction is now becoming very important for designs implemented on FPGAs as well. At present, most popular FPGA synthesis tools give higher priority to delay, so they tend to generate the datapath architecture that yields the fastest implementation. With the increasing importance of power reduction on FPGAs, it is becoming necessary to evaluate different datapath architectures from the point of view of both delay and power.

This work is aimed at characterizing various architectural implementations of common operators for power, delay and area on a target FPGA, and at selecting a low-power architecture where delay is not critical.



Contents

1 Introduction and Motivation
    1.1 Introduction
    1.2 Related Work
    1.3 Objective
    1.4 Approach
    1.5 Organization

2 FPGA And Operator Architectures
    2.1 FPGA Architecture
    2.2 Operator Architecture
        2.2.1 Adders
            Ripple Carry Adder
            Carry Select Adder
            Carry Look-Ahead Adder 1 - CLA1
            Carry Look-Ahead Adder 2 - CLA2
            Carry Look-Ahead Adder 3 - CLA3
            Brent and Kung Adder
        2.2.2 Multipliers
            Non-Booth Encoding
            Booth Encoding
            Wallace Tree reduction
        2.2.3 Equality Comparators

3 Experiments Setup and Power Plots
    3.1 Power Analysis Setup
    3.2 Power, delay and area plots of operators

© 2009, Indian Institute of Technology Delhi



4 Power Analysis
    4.1 Dynamic Power Dissipation
    4.2 Power analysis of operators
        4.2.1 Adders
        4.2.2 Multipliers
        4.2.3 Comparators
    4.3 General Observations

5 Power-Aware Architecture Selection and Results
    5.1 Power-Delay-Area database
    5.2 Power-Aware Arithmetic Block Selection
        5.2.1 Minimum power
        5.2.2 Minimum power with speed constraint
    5.3 Results

6 Conclusion
    6.1 Conclusion

References




List of Figures

2.1 FPGA Architecture
2.2 Flash-based switch
2.3 Ripple Carry Adder
2.4 Carry Select Adder
2.5 Carry Look-Ahead Adder-1
2.6 Carry Look-Ahead Adder-2
2.7 8-bit Brent and Kung Adder Tree Diagram
2.8 A 4 × 4 multiplier
2.9 An equality comparator
3.1 Power Analysis Flow
3.2 Power, delay and area plots for Adder architectures
3.3 Power, delay and area plots for Multiplier architectures
3.4 Power, delay and area plots for Comparator architectures
4.1 Ripple carry Adder Placement
5.1 Algorithm for selection of low power architecture
5.2 Default and Power-Aware comparison of Adders
5.3 Default and Power-Aware comparison of Multiplier
5.4 Power and delay area plots for Comparator architectures
5.5 Comparison of power for 16-bit adder with given delay constraints




List of Tables

2.1 Booth recoding
4.1 Relative Comparison of Adder architectures
4.2 Relative Comparison of Multiplier architectures



Chapter 1

Introduction and Motivation

1.1 Introduction

The use of FPGAs is rapidly increasing owing to their low cost and shorter time to market. FPGAs are also well suited to building prototypes and to reconfigurable applications. However, FPGAs are not power efficient. Two reasons why FPGAs consume more power are:

- FPGAs consist of a large number of interconnects and programmable switches;
- generic logic structures in FPGAs consume more power than the dedicated circuitry in ASICs.

A number of applications where FPGAs are used, such as space and military, demand low-power features. A major component of FPGA power is the datapath, so there is scope for reducing power consumption there. Many applications using FPGAs involve computations that use arithmetic operators like adders, multipliers, etc. Present FPGA synthesis tools like Synplify Pro target the performance of these operators and give no consideration to power.

These arithmetic operators can be implemented in different architectures. The power consumed by an operator can be decreased by selecting a suitable architecture for it. The power varies not only with the operator architecture but also with its interconnects, which depend on the FPGA architecture and technology. To make this selection, the synthesis tool must have detailed knowledge of parameters such as the power, delay and area of the different operator implementations at different bit widths.

1.2 Related Work

The idea of characterizing arithmetic operators and selecting an implementation for minimum power has been discussed recently. A methodology to form a characteristic lookup table and a couple of approaches to achieve a low-power implementation are discussed in [1]. [2] extends the idea to select a particular architecture by simulated annealing, considering power, delay as well as area of the operators. [3] analyzes the power-delay-area tradeoff among different architectures of adders and multipliers for Actel ProASIC FPGAs, while a similar study for adders is done for Altera FPGAs in [4].

1.3 Objective

The project is aimed at developing a power-aware utility that generates an FPGA implementation for a datapath operator given its word length and expected frequency of operation. It can be divided into two parts:

- Analyzing the power of different operators and studying the power-time-area relation among them.
- Developing a utility that goes through all the implementations and comes up with the one having minimum power for an operator at the desired frequency.

1.4 Approach

For analyzing the power of different architectures, we first identify the arithmetic operators of interest, their types and the word lengths for which the analysis is to be done. Then we develop a complete power flow that takes these parameters, invokes the different tools and directly gives the corresponding power. As power analysis needs to be done repeatedly, this step is crucial. The architectures for the operators are studied from the power point of view. Finally, we develop a script that uses the Synopsys internal module generator to generate the right architecture for given delay and power requirements, and compare the resulting power values with those obtained when the default architectures are used.

1.5 Organization

The following gives a general overview of this report.

Chapter 2 describes a generic FPGA architecture and the different operator architectures that are part of our analysis. Chapter 3 explains the power flow set up for the analysis and gives the power, delay and area graphs for the operator implementations. In Chapter 4, a detailed analysis of the architectures is presented from the power point of view and the trade-offs are discussed. Chapter 5 gives a basic way of selecting a low-power implementation and compares the results of default and power-aware implementations, while Chapter 6 presents the conclusions.




Chapter 2

FPGA And Operator Architectures

2.1 FPGA Architecture

A generic FPGA architecture is shown in the figure below.

Figure 2.1: FPGA Architecture

An FPGA consists of a two-dimensional array of logic blocks connected

by general interconnection resources [5]. FPGA includes three main com-




2.2 Operator Architecture

delay and area for different bit widths. For this, the architectures are selected from those available in Synopsys's internal module generator. The operators targeted are adders, multipliers and comparators.

2.2.1 Adders

The module generator offers a wide variety of adder architectures that can be selected according to the power and performance needs of the design. The implementations include the ripple carry adder, the carry select adder, Carry Look-Ahead adders and the Brent and Kung adder.

A carry can either be generated in an adder stage or propagated from the

previous one. CLAs use these generate (G) and propagate (P) values of all

previous stages to determine the carry of a particular stage. This combination

of previous stage GP values can be done in many ways depending on the

architecture used. Accordingly there are three variants of Carry Look-Ahead

adders having power-delay-area trade-offs.

Ripple Carry Adder

This is the simplest implementation of an adder, formed by cascading full adders in series as shown in figure 2.3. A full adder computes the sum and carry for each stage. The CarryOut of a full adder stage is applied as the CarryIn to the next stage [7].

The main advantage of this architecture is its small area; however, it has a large delay that increases linearly with the bit width. Both delay and area are of order O(n).

Figure 2.3: Ripple Carry Adder
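The cascade described above can be sketched behaviourally in Python. This is only an illustrative bit-level model; the names `full_adder` and `ripple_carry_add` are assumptions made for this sketch and are not taken from the module generator.

```python
def full_adder(a, b, cin):
    """One full-adder stage: returns (sum_bit, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_carry_add(x, y, n, cin=0):
    """Add two n-bit unsigned integers by chaining n full adders.

    Returns (n-bit sum, carry out of the last stage).
    """
    carry = cin
    result = 0
    for i in range(n):
        # The CarryOut of stage i becomes the CarryIn of stage i + 1,
        # so the carry may have to travel through all n stages.
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry
```

The single loop mirrors both O(n) figures: one full adder per bit (area) and one carry hop per bit (delay).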





Carry Select Adder

Figure 2.4: Carry Select Adder

A carry select adder for n bits is divided into n/s stages of s bits each. The sum and carry for each stage are first computed separately for an input carry of 0 as well as 1, and then one of these results is selected depending on the actual CarryIn from the previous stage [7].

This circuit has more area (almost twice) than ripple carry adder as each

stage consists of an extra s-bit adder and a multiplexer. Carry Select has a

lower delay than ripple carry as the carry and sum are computed for each

block in advance.

A block diagram for n-bit carry select adder with s-bit blocks is shown in

Figure 2.4. Here, 0-Carry and 1-Carry are s-bit adders with input carry as 0

and 1 respectively.
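A minimal behavioural sketch of this scheme follows. The helper names are hypothetical, and each s-bit block adder is modelled with ordinary integer addition rather than with gates, so the sketch shows only the select structure, not the timing.

```python
def add_block(x, y, width, cin):
    """Behavioural adder for one block: returns (width-bit sum, carry out)."""
    total = x + y + cin
    return total & ((1 << width) - 1), total >> width


def carry_select_add(x, y, n, s):
    """Add two n-bit unsigned integers using blocks of s bits."""
    result = 0
    carry = 0
    for lo in range(0, n, s):
        width = min(s, n - lo)          # the last block may be narrower
        mask = (1 << width) - 1
        bx = (x >> lo) & mask
        by = (y >> lo) & mask
        # Both candidates are computed without waiting for the real carry.
        sum0, c0 = add_block(bx, by, width, 0)   # "0-Carry" adder
        sum1, c1 = add_block(bx, by, width, 1)   # "1-Carry" adder
        # The multiplexer picks one candidate once the actual CarryIn is known.
        blk_sum, carry = (sum1, c1) if carry else (sum0, c0)
        result |= blk_sum << lo
    return result, carry
```

Each block holds two adders and a selection step, which is where the roughly doubled area comes from.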

Carry Look-Ahead Adder 1 - CLA1

This is a conventional form of CLA architecture. In this, the entire word length is divided into blocks of, say, 4 bits each. The generate-propagate (GP) values for individual bits (denoted by $G_0$, $P_0$, and so on), then for pairs of bits, and then for groups of 4 bits are generated in the first three stages of the circuit as,

Stage 1: $G_0 = X_0 \cdot Y_0$, $G_1 = X_1 \cdot Y_1$, $P_1 = X_1 \oplus Y_1$, ...

Stage 2: $G_{1:0} = G_1 + G_0 P_1$, $G_{3:2} = G_3 + G_2 P_3$, $G_{5:4} = G_5 + G_4 P_5$, ...
$P_{3:2} = P_2 P_3$, $P_{5:4} = P_4 P_5$, ...

Stage 3: $G_{3:0} = G_{3:2} + G_{1:0} P_{3:2} = G_3 + G_2 P_3 + G_1 P_2 P_3 + G_0 P_1 P_2 P_3$,
$G_{7:4} = G_7 + G_6 P_7 + G_5 P_6 P_7 + G_4 P_5 P_6 P_7$, ...
$P_{7:4} = P_4 P_5 P_6 P_7$, ...

In each later stage, a 4-bit GP block is combined with the complete lower GP block to finally realize the complete carry equation. This is achieved by forming a skewed tree-like structure as shown in figure 2.5.

[Figure: the 4-bit GP values from stages 1 to 3 (GP 3:0, GP 7:4, GP 11:8 and GP 13:12) pass through a chain of three GP combination blocks that produce GP 7:0, GP 11:0 and GP 13:0 in turn.]

Figure 2.5: Carry Look-Ahead Adder-1
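The skewed chain of GP combination blocks can be illustrated with a small Python model. It is a sketch under stated assumptions: propagate is taken as XOR, the block GP values are folded bit by bit rather than in the pairwise stages 2 and 3 (the resulting logic function is the same), the carry-in of the adder is 0, and all function names are hypothetical.

```python
def combine(hi, lo):
    """GP combination block: merge a higher GP pair with the adjacent lower one."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return g_hi | (p_hi & g_lo), p_hi & p_lo


def bit_gp(x, y, n):
    """Stage 1: per-bit generate (AND) and propagate (XOR, assumed here)."""
    return [(((x >> i) & (y >> i)) & 1, ((x >> i) ^ (y >> i)) & 1)
            for i in range(n)]


def block_gp(gp, lo, hi):
    """GP pair for bits hi..lo, built by folding the per-bit pairs upward."""
    acc = gp[lo]
    for i in range(lo + 1, hi + 1):
        acc = combine(gp[i], acc)
    return acc


def cla1_block_carries(x, y, n, block=4):
    """Carry out of each block, using the chain GP[k:0] = GP[block] o GP[lower]."""
    gp = bit_gp(x, y, n)
    carries = []
    acc = None
    for lo in range(0, n, block):
        hi = min(lo + block, n) - 1
        blk = block_gp(gp, lo, hi)
        # Each new block waits for the complete lower GP value: the skewed tree.
        acc = blk if acc is None else combine(blk, acc)
        carries.append(acc[0])   # with carry-in 0, the carry out of bit hi is G[hi:0]
    return carries
```

For a 14-bit operand the list holds the carries out of bits 3, 7, 11 and 13, which correspond to the G parts of GP 3:0, GP 7:0, GP 11:0 and GP 13:0 in the figure.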

The GP combination blocks are simple and can be realized with AND-OR logic, so this architecture has a small area. In this implementation the GP values of the lower bits ripple across the design, though in batches. This circuit therefore has a greater delay than the other CLA implementations.

Carry Look-Ahead Adder 2 - CLA2

This is a faster version of the CLA architecture. The first three stages, which generate the GP values, are similar to those of CLA1. Further on, the GP blocks are combined more aggressively, in fewer logic levels, thereby decreasing the delay. The architecture has a balanced tree structure.

Figure 2.6: Carry Look-Ahead Adder-2

As seen from figure 2.6, the GP values are combined simultaneously at the lower and the higher bits, unlike in CLA1 where combination proceeds from the lower bits toward the higher bits. Hence, it requires fewer logic levels. More gates and logic are required to generate GP blocks of higher bit width, so the area of the CLA2 circuit is larger.

Carry Look-Ahead Adder 3 - CLA3

Of the two Carry Look-Ahead implementations seen so far, one is small (but slow) while the other is fast (but large). This third variant of the CLA architecture is a compromise between the two. It has only one logic level more than CLA2 but about 30% less area.

Brent and Kung Adder

Brent and Kung architecture is a parallel-prefix, tree-based algorithm to develop a binary adder circuit optimized for time and area [8]. A tree diagram for an 8-bit Brent and Kung adder is shown in figure 2.7. It uses a black operator to compute the GP values of higher order from lower-order values. For a black operator,

$$G_{i:j} = G_{i:k} + P_{i:k} \, G_{k-1:j}$$
$$P_{i:j} = P_{i:k} \, P_{k-1:j}$$

Figure 2.7: 8-bit Brent and Kung Adder Tree Diagram

The addition of n-bit numbers can be performed in time O(log n), while the area complexity is O(n log n). The Brent and Kung architecture is considered the most area-efficient parallel prefix adder as it needs very few logic elements.
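The black operator and the prefix tree built from it can be sketched as follows. This is a behavioural illustration, assuming n is a power of two, a carry-in of 0 and XOR as the propagate function; the function names and the up-sweep/down-sweep loop structure are choices made for this sketch, not the module generator's implementation.

```python
def black(hi, lo):
    """Black operator: (G, P) of a merged range from its upper and lower halves."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return g_hi | (p_hi & g_lo), p_hi & p_lo


def brent_kung_add(x, y, n):
    """Add two n-bit unsigned integers with a Brent-Kung style prefix tree.

    n is assumed to be a power of two. Returns (n-bit sum, carry out).
    """
    p_bits = [((x >> i) ^ (y >> i)) & 1 for i in range(n)]
    gp = [(((x >> i) & (y >> i)) & 1, p_bits[i]) for i in range(n)]

    # Up-sweep: build GP values for ranges of width 2, 4, 8, ...
    d = 1
    while d < n:
        for i in range(2 * d - 1, n, 2 * d):
            gp[i] = black(gp[i], gp[i - d])
        d *= 2

    # Down-sweep: fill in the remaining prefixes from the ones already known.
    d = n // 4
    while d >= 1:
        for i in range(3 * d - 1, n, 2 * d):
            gp[i] = black(gp[i], gp[i - d])
        d //= 2

    # gp[i][0] is now G[i:0], the carry out of bit i.
    result = p_bits[0]
    for i in range(1, n):
        result |= (p_bits[i] ^ gp[i - 1][0]) << i
    return result, gp[n - 1][0]
```

The two sweeps together use about 2 log n levels of black operators, which is the logarithmic depth mentioned above.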

2.2.2 Multipliers

Multiplication is an important arithmetic operation. Multipliers are much larger than adders and consume more power. Multiplication is also a slow operation. It is actually a process of adding multiple partial products. These partial products are formed by operating on the multiplicand with each bit (or group of bits) of the multiplier. Different multiplier architectures arise depending on how the partial products are generated and how they are added to give the complete product.

The partial products can be computed using a plain AND operation or by using an encoding method. Encoding reduces the number of partial products; Booth encoding is used for this purpose. Partial product generation is followed by their addition, which is done in a tree-like format using Wallace tree optimization. These techniques can be combined with different options to give various architectures.

Non-Booth Encoding

This is the simplest way of generating partial products. In this, the multiplicand is ANDed with each bit of the multiplier to form a partial product [9]. The total number of partial products that need to be added to get the final product is not reduced.

Non-Booth encoding is preferred for multipliers of lower bit width as there is no overhead of encoding the multiplier.

Booth Encoding

Booth encoding is used to decrease the number of partial products. The smaller the number of partial products, the better the performance. This is achieved by recoding the multiplier bits so that a group of bits now decides each partial product.

The basic Booth-encoding technique recodes two multiplier bits to +1, -1 or 0 depending on the value of the pair of bits [9]. The multiplicand is assigned accordingly to the partial product as given in table 2.1.

Multiplier bits   Recoded bits   Corresponding partial product
00                 0              0
01                +1             +Multiplicand
10                -1             -Multiplicand
11                 0              0

Table 2.1: Booth recoding

A modified Booth encoding is also used, in which three multiplier bits are recoded. Booth encoding is very useful, especially for higher bit widths. It helps to decrease the area as well as the delay of the next stage, i.e. the addition of partial products.
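The basic recoding of table 2.1 can be illustrated with a short Python sketch. It assumes an n-bit two's-complement multiplier and an implicit 0 to the right of the least significant bit; the function names are hypothetical and the modified (three-bit) encoding is not covered.

```python
def booth_digits(multiplier, n):
    """Recode an n-bit two's-complement multiplier into digits in {-1, 0, +1}."""
    digits = []
    prev = 0  # implicit bit to the right of the LSB
    for i in range(n):
        cur = (multiplier >> i) & 1
        # Bit pair (cur, prev): 01 -> +1, 10 -> -1, 00 and 11 -> 0.
        digits.append(prev - cur)
        prev = cur
    return digits


def booth_multiply(multiplicand, multiplier, n):
    """Sum the partial products selected by the recoded digits."""
    return sum(d * (multiplicand << i)
               for i, d in enumerate(booth_digits(multiplier, n)))
```

Digits equal to 0 contribute no partial product, so a run of identical multiplier bits costs at most two additions or subtractions.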





Wallace Tree reduction

Wallace tree reduction is a technique of arranging partial products in a tree-like manner so that they are added in parallel. The number of adders required decreases with this technique [10, 11]. It uses two operators: a full adder, which has three inputs and two outputs, and a half adder, which has two inputs and two outputs. Using these 3:2 and 2:2 operators, the required reduction is achieved. The routing in Wallace tree multipliers is very irregular.

An n-bit multiplier produces n n-bit partial products, which need to be summed to get the final product. The Wallace tree converts this addition of n n-bit operands into an addition of two 2n-bit operands.

Consider a 4 × 4 multiplier. Fig. 2.8 shows the Wallace tree architecture for a 4 × 4 multiplier. Assuming non-Booth encoding, the partial products are computed as shown in Fig. 2.8(a), where $P_{ij} = X_i \cdot Y_j$.

(a) Partial Products Generation

Addition of these partial products gives the final 8-bit product. This addition is done in two stages.

1. The first stage consists of fast carry-save addition of the partial products using full adders and half adders (Fig. 2.8(b)). The carries generated at one layer are passed on to the next layer for addition. An advantage of this is that only two layers are sufficient for the addition of 16 bits (not including the last stage). Thus, both the number of adders and the critical path are reduced. This reduction is significant for multipliers with higher word lengths. Because of the use of full adders (also referred to as 3:2 compressors), the delay of the tree is $O(\log_{3/2} N)$.

2. The last stage consists of a fast 2n-bit adder (8-bit, in this case). Usually a carry look-ahead adder is used here.

[Figure: the sixteen partial products P00 to P33 feed layers of full adders (FA) and half adders (HA), followed by a last-stage adder that produces the product bits Z7 to Z0.]

(b) Wallace Tree Architecture

Figure 2.8: A 4 × 4 multiplier
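The two stages can be sketched in Python as a column-wise reduction. This is one possible arrangement of full and half adders, assumed for illustration; it need not place the adders exactly as in Fig. 2.8(b), and the last-stage adder is modelled by ordinary integer addition. The function names are hypothetical.

```python
def full_adder(a, b, c):
    """3:2 compressor: returns (sum, carry)."""
    return a ^ b ^ c, (a & b) | (b & c) | (a & c)


def half_adder(a, b):
    """2:2 operator: returns (sum, carry)."""
    return a ^ b, a & b


def wallace_multiply(x, y, n):
    """Unsigned n x n multiply. Returns (product, number of reduction layers)."""
    # cols[w] holds all bits of weight 2**w; start with P_ij = X_i AND Y_j.
    cols = [[] for _ in range(2 * n)]
    for i in range(n):
        for j in range(n):
            cols[i + j].append(((x >> i) & 1) & ((y >> j) & 1))

    layers = 0
    while any(len(bits) > 2 for bits in cols):
        nxt = [[] for _ in range(len(cols) + 1)]
        for w, bits in enumerate(cols):
            k = 0
            while len(bits) - k >= 3:            # groups of three: full adder
                s, c = full_adder(bits[k], bits[k + 1], bits[k + 2])
                nxt[w].append(s)
                nxt[w + 1].append(c)             # carry moves to the next column
                k += 3
            if len(bits) - k == 2:               # leftover pair: half adder
                s, c = half_adder(bits[k], bits[k + 1])
                nxt[w].append(s)
                nxt[w + 1].append(c)
            elif len(bits) - k == 1:             # single bit passes through
                nxt[w].append(bits[k])
        cols = nxt
        layers += 1

    # Last stage: at most two bits per column remain, i.e. two operands.
    row0 = sum(bits[0] << w for w, bits in enumerate(cols) if len(bits) > 0)
    row1 = sum(bits[1] << w for w, bits in enumerate(cols) if len(bits) > 1)
    return row0 + row1, layers
```

For n = 4 this arrangement needs two reduction layers before the final adder, in line with the two layers mentioned in item 1.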

The Wallace tree reduction can be implemented in different ways, and Synopsys's internal module generator provides various options for it. The order of the compressor can be set to 3 or 4. A few other options allow either the area or the timing of the Wallace tree to be optimized.

All these options are used to generate different implementations of multipliers. The names of architectures with Booth and non-Booth encoding start with bwt and nbwt respectively. The latter part of the name denotes the type of Wallace tree optimization used. These are:

- array: This architecture gives the traditional array form of multiplier. It has a smaller area but, as the partial products are added in array form, a larger delay. It produces a dense layout.
- tree: This is a regular tree-like implementation of multipliers. It has a lower delay. 4:2 compressors are used in this implementation.
- bitopt: The bit-optimized solution is the circuit with minimum gate count and minimum delay.

2.2.3 Equality Comparators

Equality comparators check the equality of two numbers and output 1 if they are equal, else 0. The simplest way of doing so is to XNOR the two inputs bitwise and then AND the outputs of the XNORs to get the final result. A generic equality comparator is shown in Fig. 2.9.

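[Figure: the input bit pairs (A0, B0), (A1, B1), ..., (An-2, Bn-2), (An-1, Bn-1) are compared bitwise and the results are combined into a single output Y.]

Figure 2.9: An equality comparator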

The polarity of the logic and XOR gates can be set to either true or inverting in Synopsys's module generator. However, the basic architecture remains the same.
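As a small illustration, the two polarities can be modelled in Python. Reading "true" as XNOR followed by AND and "inverting" as XOR followed by NOR is an assumption made for this sketch, as are the function names; both forms compute the same output.

```python
def equal_true_logic(a, b, n):
    """XNOR each bit pair, then AND all the XNOR outputs."""
    result = 1
    for i in range(n):
        xnor = 1 ^ (((a >> i) ^ (b >> i)) & 1)
        result &= xnor
    return result


def equal_inverting_logic(a, b, n):
    """XOR each bit pair, OR the differences, then invert (a NOR of the XORs)."""
    any_diff = 0
    for i in range(n):
        any_diff |= ((a >> i) ^ (b >> i)) & 1
    return 1 ^ any_diff
```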




Chapter 3

Experiments Setup and Power Plots

3.1 Power Analysis Setup

For power analysis of architectures of different operators at various bit widths,

a power flow is set up using tools provided by different vendors.

The power analysis flow is as follows:

1. The Synopsys internal module generator is used to generate a Verilog netlist for a particular architecture of an operator. It takes two input files: one specifies the functionality of the design/operator and the other selects among the options provided, which determines the particular architecture. The .lib library file of the target FPGA is also loaded. The module generator selects appropriate library components to export the netlist.

Registers are placed at the inputs and outputs of the operators. The first reason is that timing analysis tools provide only clock-to-clock delay; the delay of a purely combinational circuit is not calculated. The second reason is that I/O pads have high delay values, which add to the combinational circuit delay. Placing registers breaks this direct link.

2. The Verilog netlist from the previous stage is then synthesized in Synplify Pro, a logic synthesis tool for FPGAs. The details of the target FPGA are specified and constraints (if any) are given. For the power analysis experiments, the frequency is kept in auto-constrain mode. Automatic constraints generate the fastest design implementation.

Figure 3.1: Power Analysis Flow

If frequency constraint is set to a value greater than that derived when

set to auto-constrain, then Synplify Pro tries to optimize the circuit for

higher frequency. In the process, it adds more gates, makes it larger

and thus distorts the original architecture. This causes the circuit to

consume more power. So, frequency is set to auto-constrain to preserve

the architecture and to meet required timing with minimum power.

Synplify Pro generates post-synthesis .edn (EDIF) and Verilog netlists.




3. The post-synthesis Verilog netlist is simulated in VCS, a Verilog simulator, using a testbench. First, a pseudo-random number generator dumps a large number of binary numbers of the specified width into a file. The testbench then reads these values in pairs and applies them to the operator module.

A .vcd (Value Change Dump) file is generated during simulation using the $dumpvars system task in the testbench. It contains the switching activity for all the signals in the module and is used for power estimation. The arithmetic operation (addition, multiplication, etc.) is performed on about 10,000 pairs of random numbers. The greater the number of input vectors, the more accurate the results can be expected to be.

4. The post-synthesis EDIF netlist is passed through the Place and Route (layout) tool of the target FPGA vendor. Any placement constraints, such as defining a region and placing all or a group of instances of the design within that region, can be specified here using a .pdc (Physical Design Constraint) file.

A status report is generated after compilation of EDIF netlist. It con-

tains information about number of combinational and sequential in-

stances used in the design and number of high fanout nets and their

respective fanouts.

5. A vendor-specific power estimation/analysis tool is used to calculate the operator power. This step immediately follows PnR. It uses the VCD file for power estimation. This is the recommended method for power estimation as the switching activity information from the VCD file is more accurate.

A VCD file dumped by VCS may not be directly usable for power estimation as signals are specified in vector form, so these signals should be scalarized. This is done using the vcdpost utility, which produces a VCD file that contains the value changes for each bit of a vector signal, recorded as a separate signal.

The power report gives a breakdown of power by type, viz. Net, Gate, I/O, Core Static and Bank Static. It also contains a breakdown by instance, which gives the power for every instance/gate and net in the design. Details of the number of transitions at each net can also be studied.
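The stimulus generation mentioned in step 3 could look like the following Python sketch. The file layout (two binary strings per line), the fixed seed and the function names are assumptions made for illustration; the format actually read by the testbench is not given in the text.

```python
import random


def write_stimulus(path, width, pairs, seed=0):
    """Write `pairs` lines, each holding two random `width`-bit binary numbers."""
    rng = random.Random(seed)  # fixed seed so a run can be reproduced
    with open(path, "w") as f:
        for _ in range(pairs):
            a = rng.getrandbits(width)
            b = rng.getrandbits(width)
            f.write(f"{a:0{width}b} {b:0{width}b}\n")


def read_stimulus(path):
    """Read the file back as a list of (a, b) integer pairs."""
    with open(path) as f:
        return [tuple(int(tok, 2) for tok in line.split()) for line in f]
```

For example, `write_stimulus("vectors.txt", 16, 10000)` would produce about as many operand pairs as the experiments describe for a 16-bit operator.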

3.2 Power, delay and area plots of operators

The power analysis flow was carried out on architectures of adders, multipliers and comparators for different bit widths. The total dynamic power (net + gate), delay and number of cores utilized are plotted accordingly. The respective graphs follow; their analysis is given in the next chapter.

[Plots: (a) Power (Net + Gate, mW), (b) Delay (ns) and (c) Area (number of cores), each against bit width. Series: bnk, carrySelect, cla1, cla2, cla3 and ripple; the delay panel shows no ripple series.]

Figure 3.2: Power, delay and area plots for Adder architectures



[Plots: (a) Power (Net + Gate, mW), (b) Delay (ns) and (c) Area (number of cores), each against bit width. Series: bwt-area, bwt-bitopt, bwt-timing, nbwt-area, nbwt-bitopt and nbwt-timing.]

Figure 3.3: Power, delay and area plots for Multiplier architectures

[Plots: (a) Power (Net + Gate, mW), (b) Delay (ns) and (c) Area (number of cores), each against bit width. Series: invertedLogic and trueLogic.]

Figure 3.4: Power, delay and area plots for Comparator architectures




Chapter 4

Power Analysis

4.1 Dynamic Power Dissipation

Dynamic power dissipation is caused by signal transitions in the circuit [12]. A higher operating frequency leads to more frequent signal transitions and results in increased power dissipation. The most significant source of dynamic power consumption in CMOS circuits is the charging and discharging of capacitance. This power is given as,

$$P = \sum_i C_i V_i^2 f_i,$$

where $C_i$, $V_i$ and $f_i$ are the capacitance, voltage and frequency respectively for any instance $i$.

Another source of dynamic power consumption is short-circuit power, but it makes a smaller contribution in FPGAs.

Thus the total dynamic power consumption depends on the following factors:

- the length, fanout and effective capacitance of the interconnection wires and switches,
- the number of resources utilized, as the more resources are used, the more power is consumed,
- and the switching activity of the different nets and instances in the design.
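The capacitive term can be evaluated numerically with a few lines of Python. The node values in the usage note are made-up numbers chosen only to show the units; they are not measurements from this work, and the function name is hypothetical.

```python
def dynamic_power(nodes):
    """Sum C_i * V_i**2 * f_i over all nodes.

    Each node is a (capacitance in farads, voltage in volts,
    transition frequency in hertz) tuple; the result is in watts.
    """
    return sum(c * v * v * f for c, v, f in nodes)
```

For instance, `dynamic_power([(1e-12, 1.5, 50e6), (2e-12, 1.5, 10e6)])` adds the contributions of two hypothetical nets. Doubling either the capacitance (longer or higher-fanout nets) or the transition frequency (more switching activity) of a node doubles its share, which is the dependence listed in the factors above.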




4.2 Power analysis of operators

4.2.1 Adders

Ripple Carry Adder

The ripple carry adder has the smallest circuit in terms of number of nets and gates. Its nets have a fanout of no more than 2. So, it is also expected to consume less power.

The gate power of the ripple carry adder is the lowest. However, the PnR tool does not recognize the architecture, so the placement is spread out, which increases the net length. Also, the number of logic levels is much higher than in any other architecture. The effect of these two factors is that the net power is not as low as expected. The increase in power due to the increase in logic levels is more dominant for higher word lengths.

The major drawback of this architecture is that its delay is too high and increases linearly with word length (hence, it is not shown in Fig. 3.2(b)).

Carry Select Adder

The carry select architecture is formed by duplicating blocks of ripple carry adders. So, its area is around twice that of the ripple carry adder circuit, but it is smaller than the CLAs, which employ more complex circuitry. It has very high fanout nets, even more than CLA2. This causes carry select to consume more power.

Its delay is comparable to that of the Brent and Kung adder, and is much less than that of the ripple carry adder as the sum and carry for each block are computed beforehand.


Carry Look-Ahead Adder 1 - CLA1

CLA1 is the slowest variant of the CLAs. Its circuit size is 50% of that of CLA2 and its number of logic levels is smaller than that of the ripple carry and Brent and Kung adders. As a result of the small size and fewer levels, CLA1 circuits consume less power. One more advantage is that it has low fanout nets, so the net power is lower. But there are more spurious transitions in this circuit since it is a skewed-tree-like architecture.

The circuit is slower than the other CLAs but still better than the ripple carry and BnK architectures. So, this architecture can be a good choice when all three parameters, viz. delay, power and area, are equally critical.

During placement and routing, the PnR tool performs some logic-combining optimization on the post-synthesis netlist. The number of logic elements combined is given as a part of the Netlist Optimization Report in the status report. It is observed that there is more scope for such post-synthesis logic combining in the CLA1 architecture than in the others. There is about 10% logic combining for CLA1 at this stage, further reducing its area.


Carry Look-Ahead Adder 2 - CLA2

The CLA2 architecture is the most power-consuming of all as it is more complex and has more gates and nets than the other architectures. This architecture has many interconnections and dependencies in the circuit. Hence, it has a well-packed layout and so, though the number of nets is larger, they are shorter. CLA2 has high fanout nets.

The logic depth of the circuit is very small: it does more computation in fewer levels. This makes CLA2 the fastest adder architecture. The number of logic levels remains constant over a range of bit widths, so the delay also tends to be flat.


The CLA3 circuit is only one level slower than CLA2. At the same time, it
is about 70% smaller than CLA2. As a result, the power consumption of this
architecture is also about 75-80% less than that of CLA2.

So, this circuit can be a better alternative to CLA2 in terms of area and
power. However, where timing is critical, CLA2 should be used. For higher bit
widths, though CLA3 has one more level than CLA2, it uses gates with lower
delay, so its delay becomes almost comparable to that of CLA2.

Brent And Kung Adder

The Brent and Kung architecture is slightly better in power than the other
architectures, mainly because it has far fewer gates and nets than the
others. Only the ripple carry adder has fewer gates and nets than Brent and
Kung. However, Brent and Kung has the advantage that its number of logic
levels is of the order log n, whereas for the ripple carry adder it is of the
order n. Also, its layout is scattered to some extent, as is that of the
ripple carry adder.

Performance-wise, Brent and Kung is slower than the CLAs as it has deeper
logic. It is more suitable from a power point of view at higher bit widths,
but its delay is then 1.5 times that of CLA2 for 32 and 64 bits.
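The log n depth of the Brent and Kung adder can be illustrated with a small behavioural model. The sketch below is not the thesis implementation: it assumes a bit width that is a power of two and a carry-in of 0, the function name `brent_kung_add` is hypothetical, and it counts levels of the prefix operator rather than FPGA logic levels after technology mapping.

```python
def brent_kung_add(a: int, b: int, n: int) -> tuple[int, int, int]:
    """Add two n-bit unsigned integers with a Brent-Kung prefix network.

    Returns (sum mod 2**n, carry_out, number_of_prefix_levels).
    Assumption of this sketch: n is a power of two and carry-in is 0.
    """
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")

    # Bitwise generate and propagate signals.
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(n)]
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(n)]
    G, P = g[:], p[:]  # group signals, updated level by level
    levels = 0

    def combine(hi: int, lo: int) -> None:
        # Prefix operator: merge the group ending at `hi` with the
        # adjacent lower group ending at `lo`.
        G[hi] = G[hi] | (P[hi] & G[lo])
        P[hi] = P[hi] & P[lo]

    # Up-sweep: log2(n) levels building power-of-two sized groups.
    d = 1
    while d < n:
        for i in range(2 * d - 1, n, 2 * d):
            combine(i, i - d)
        levels += 1
        d *= 2

    # Down-sweep: log2(n) - 1 levels filling in the remaining prefixes.
    d = n // 4
    while d >= 1:
        for i in range(3 * d - 1, n, 2 * d):
            combine(i, i - d)
        levels += 1
        d //= 2

    # G[i] is now the carry out of bit i; the carry into bit 0 is 0.
    carries = [0] + G[:-1]
    total = 0
    for i in range(n):
        total |= (p[i] ^ carries[i]) << i
    return total, G[n - 1], levels
```

In this model a 32-bit addition uses 2*log2(32) - 1 = 9 prefix levels, whereas a ripple carry chain passes the carry through all 32 bit positions in the worst case.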

The relative comparison of architectures of adders is given in Table 4.1

and the graphs are given in Fig. 3.2.

Architecture    No. of nets/gates   No. of logic levels   Fanout
Ripple carry    Low                 Very High             Low
Carry Select    High                High                  High
CLA1            Moderate            Moderate              Moderate
CLA2            Very High           Low                   High
CLA3            High                Low                   High
Brent & Kung    Low                 High                  Moderate

Table 4.1: Relative Comparison of Adder architectures

4.2.2 Multipliers

The power, delay and area graphs for different multiplier architectures are

given in Fig. 3.3. The trade offs in these three parameters are discussed here.

Area

The architectures employing Booth encoding have more area because of the
additional circuitry for encoding. This Booth circuit varies with the
encoding method. So, for lower bit widths, the non-Booth architectures have
smaller area.

However, as seen in Fig. 3.3(c), the Booth variant of any implementation
has smaller area than its non-Booth counterpart at higher bit widths. This is
because, at higher bit widths, the hardware overhead of the Booth encoding
part is much less than the amount of circuitry saved in the later stages by
using Booth encoding. The bwt area architecture has considerably smaller area
than the other architectures for multipliers more than 24 bits wide (even for
the 48 and 68-bit multipliers not shown in the graph).

It is observed that more logic combining takes place during the PnR stage
for Booth encoded circuits. This is because the PnR tool can identify the
Booth encoding circuits and optimize them accordingly.

Delay

The delay of array multipliers is, as expected, much greater than that of
the other architectures over the whole range of word lengths. This is because
the carries generated by the addition of partial products ripple along the
rows and columns of the array. So the number of logic levels increases, in
the worst case, linearly with bit width for the non-Booth implementation. For
circuits with Booth encoding, the delay is lower because the number of
partial products to be summed is reduced.
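The reduction in partial products can be made concrete with a short sketch. The thesis does not state which Booth encoding its multipliers use, so radix-4 (modified Booth) recoding is assumed here purely for illustration; the function names are hypothetical and the operand is treated as an n-bit two's-complement value with n even.

```python
def booth_radix4_digits(x: int, n: int) -> list[int]:
    """Recode an n-bit two's-complement value into n/2 digits in {-2, ..., 2}."""
    if n % 2:
        raise ValueError("n must be even in this sketch")
    bits = [(x >> i) & 1 for i in range(n)]
    digits = []
    prev = 0  # implicit bit to the right of bit 0
    for i in range(0, n, 2):
        # Each digit looks at bits (i+1, i, i-1) of the multiplier.
        digits.append(-2 * bits[i + 1] + bits[i] + prev)
        prev = bits[i + 1]
    return digits


def booth_multiply(x: int, y: int, n: int) -> int:
    """Multiply signed n-bit x by y using one partial product per Booth digit."""
    product = 0
    for k, d in enumerate(booth_radix4_digits(x, n)):
        product += (d * y) << (2 * k)  # partial product d * y, weighted by 4**k
    return product
```

With this recoding an n-bit multiplier yields n/2 partial products instead of n, at the cost of the extra encoding logic discussed under Area.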

The tree and bitopt architectures have similar delay characteristics, with
bitopt faring slightly better than the corresponding tree implementations.
Below 28 bits, the non-Booth implementation is faster, while for wider
multipliers the bit-optimized Booth implementation gives the minimum delay.

Power

The power of the architectures depends on three major factors: the number
of nets and gates in the design, the number of logic levels (which contribute
to the spurious transitions that increase glitch power), and the number of
high fanout nets and their fanout.

A relative analysis of these three factors is given in Table 4.2.

Architecture    No. of nets/gates   No. of logic levels   Fanout
bwt array       Low                 High                  High
bwt bitopt      Low                 Low                   High
bwt tree        High                Low                   High
nbwt array      High                High                  Low
nbwt bitopt     Low/Moderate        Low                   Low
nbwt tree       High                Low                   Low

Table 4.2: Relative Comparison of Multiplier architectures





The fanout of Booth encoding circuits is very high. Mostly, it is the input
nets that have high fanout and hence consume more power.

There are two types of transitions in a circuit: functional (intended) and
spurious (unintended). The greater the number of logic levels, the more
spurious transitions (or glitches) occur. These transitions increase the
dynamic power consumption. The number of logic levels, as seen before, is
higher for array implementations. The tree and bitopt implementations have
smaller logic depth, so the contribution of glitch power due to spurious
transitions is lower in these circuits.
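For reference, the usual first-order textbook model of dynamic switching power shows where glitches enter. It is not written out in the thesis, and the symbols below are introduced here only for illustration:

```latex
$$P_{\mathrm{dyn}} = \left(\alpha_{\mathrm{func}} + \alpha_{\mathrm{glitch}}\right) C_L \, V_{DD}^{2} \, f$$
```

Here $\alpha_{\mathrm{func}}$ and $\alpha_{\mathrm{glitch}}$ are the average numbers of power-consuming functional and spurious transitions per clock cycle, $C_L$ is the switched capacitance, $V_{DD}$ is the supply voltage and $f$ is the clock frequency. A deeper circuit tends to have a larger $\alpha_{\mathrm{glitch}}$, which is the effect described above.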

Finally, power consumption increases with the number of nets and gates.
Gates consume switching as well as short-circuit power. Nets, as seen
earlier, are also expensive in terms of power due to their length and fanout.
The bit-optimized architectures are smaller, so they have fewer
power-consuming resources.

As seen from Table 4.2, the nbwt bitopt architecture does well on all three
fronts. Hence, it gives minimum power, while the array-based circuits consume
more power (Fig. 3.3(a)).

4.2.3 Comparators

The two comparator architectures implemented are invertedLogic and
trueLogic, which, as the names suggest, differ in the type of logic
implemented.

As seen from the power, delay and area plots of comparators given in Fig.
3.4, the values of these parameters are almost the same for the two
implementations. This is because the comparator architecture is fairly simple
and leaves no scope for further optimization, so there is no structural
difference between the two implementations. Only the library elements used
are different.

This comparator architecture can be optimized if the arrival times of the
inputs are known. However, this is not the general case and may vary with the
design.

The comparator circuits are scalable. Hence, the area and power of the
circuits increase linearly with bit width (Fig. 3.4(a) and (c)).

But the delay, as shown in Fig. 3.4(b), increases with bit width in steps.
This is because the number of logic levels needed to compute the comparator
output is the same for all bit widths in the range
$2^{n-1} < \text{bitwidth} \le 2^{n}$.
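The stepped behaviour follows directly from this range. The sketch below groups bit widths by the index n of the range they fall in; the function names are hypothetical, and n is only the range index from the inequality above, not necessarily the exact number of logic levels in the mapped circuit.

```python
from itertools import groupby


def level_index(bit_width: int) -> int:
    """Return n such that 2**(n-1) < bit_width <= 2**n."""
    if bit_width < 1:
        raise ValueError("bit_width must be positive")
    return (bit_width - 1).bit_length()


def delay_steps(max_width: int) -> dict[int, tuple[int, int]]:
    """Map each n to the (smallest, largest) bit width that shares it."""
    steps = {}
    for n, widths in groupby(range(2, max_width + 1), key=level_index):
        widths = list(widths)
        steps[n] = (widths[0], widths[-1])
    return steps
```

For example, `delay_steps(64)` places 9 to 16 bits on one step and 17 to 32 bits on the next, so a 17-bit comparator is expected to be one step slower than a 16-bit one.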

4.3 General Observations

The power due to nets is about 3 to 4 times greater than the gate power
for all the architectures. This is primarily because nets in an FPGA are not
contiguous: they consist of net segments and switching elements, so net power
also includes the switching power of these interconnecting switches. The
total power therefore depends mostly on the number of nets, their length, and
their fanout.
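As a quick worked check, reading "3 to 4 times" as a ratio $k$ between net power and gate power (an interpretation made here, with symbols that do not appear in the thesis), the share of net power in the total is:

```latex
$$P_{\mathrm{total}} = P_{\mathrm{net}} + P_{\mathrm{gate}}, \qquad
P_{\mathrm{net}} \approx k \, P_{\mathrm{gate}}
\;\Rightarrow\;
\frac{P_{\mathrm{net}}}{P_{\mathrm{total}}} \approx \frac{k}{k+1}$$
```

For $k = 3$ this gives 75% and for $k = 4$ it gives 80%, so under this reading roughly three quarters or more of the measured power is spent in the interconnect.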

If the PnR tool is made to do placement and routing without any
constraints, then power also depends indirectly on the placement.
Unconstrained placement may cause a circuit to spread over the available
region, which increases the net length. Power then varies in a way that
contradicts the expected values. One important observation in this regard was
that when the power of the ripple carry adder is calculated after
unconstrained PnR, it is almost comparable to the power of the CLA2 circuit.
This contradicts expectation, as CLA2 is more complex and bigger than the
ripple carry adder circuit and hence is expected to consume more power. Only
after some constraints are imposed does the power behave in accordance with
the architecture.

The constraints can be added indirectly, by assigning pins to the ports, or
directly, by giving some area restrictions. Applying pin constraints does not
serve the purpose, for two reasons.

First, when a particular adder, or any operator, is part of a bigger
circuit, it will have to be placed together like a single module. If we
impose pin constraints, on the other hand, the input-output registers are
placed near the pins and the core circuit away from them. This decreases the
clock net length but may increase the lengths of other nets, which could
introduce errors in the power values.

Second, providing pin constraints is difficult for circuits of varying bit
widths, mainly because the pin assignments must pass the Design Rule Check
(DRC).

Figure 4.1: Ripple carry Adder Placement, (a) Before Constraints, (b) After Constraints

So, for the power analysis experiments, area constraints are given to the
PnR tool. A region is defined close to the CLK pad. Then, using the Physical
Design Constraints (PDC) command assign_net_macros, all the instances
connected to the clock net (the input-output registers) are confined within
that region. This causes the other circuitry to gather around it. If
assign_net_macros were applied to all nets, so as to gather the entire
circuit within the region, a larger region would have to be defined to
accommodate operators of higher bit widths. This may again allow the smaller
circuits to disperse. Hence, the constraint is applied only to instances
connected to the clock net.

Fig. 4.1 shows the placement of a 16-bit ripple carry adder before and
after applying area constraints.

Even after such area constraints are applied, the circuits of the ripple
carry and Brent and Kung adders are scattered to some extent. This increases
the net length and thus contributes to power.




Chapter 5

Power-Aware Architecture Selection and Results

5.1 Power-Delay-Area database

Present synthesis tools select the best-performance (minimum delay)
implementation of any arithmetic operator. To select an architecture for low
power instead, detailed knowledge of the power, delay and area values at
different bit widths is needed, as it indicates which architecture is optimum
at a given bit width. A database can serve this purpose. It can be created by
running the power analysis flow (Section 3.1) on the different architectures.

This is a one-time characterization, and lookup tables can be created from
it that can be used to select a low power implementation.
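One possible shape for such a lookup table is sketched below. The layout, the names `Characterization`, `PDA_DATABASE`, `lookup` and `minimum_power`, and every number in the table are assumptions made for illustration; the values are placeholders, not measurements from this work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Characterization:
    architecture: str
    power_mw: float
    delay_ns: float
    area_cores: int


# Keyed by (operator, bit width). All numbers below are made-up
# placeholders, not measurements from the thesis.
PDA_DATABASE = {
    ("adder", 16): [
        Characterization("ripple", 1.0, 18.0, 40),
        Characterization("cla1", 1.5, 12.0, 70),
        Characterization("cla2", 3.0, 8.0, 140),
    ],
}


def lookup(operator: str, bit_width: int) -> list[Characterization]:
    """Return all characterized implementations of an operator at a bit width."""
    try:
        return PDA_DATABASE[(operator, bit_width)]
    except KeyError:
        raise KeyError(f"no characterization for {bit_width}-bit {operator}") from None


def minimum_power(operator: str, bit_width: int) -> Characterization:
    """Pick the lowest-power entry, ignoring delay and area."""
    return min(lookup(operator, bit_width), key=lambda c: c.power_mw)
```

Keying the table by operator and bit width keeps the characterization a pure lookup at synthesis time, since the expensive power analysis flow is run only once per entry.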

5.2 Power-Aware Arithmetic Block Selection

An optimum power architecture can be selected from among the available
architectures in the following ways:

5.2.1 Minimum power

Select a minimum power implementation, irrespective of delay and area.
Generally, a low power implementation also has a smaller area. Such a
selection is useful when the operator is not on the critical path.

This gives the maximum amount of power that can be saved if a low power
architecture of an operator is implemented rather than the default one.

5.2.2 Minimum power with speed constraint

The algorithm for selection of a minimum power implementation with delay
within a given constraint is shown in Fig. 5.1.

According to it, given a speed requirement, the minimum power
implementation is selected first. If it meets the speed requirement, it is
chosen. Otherwise, the implementation with the next lowest power is selected.
This is repeated so that the implementation finally chosen has the minimum
possible power among those whose delay is within the given constraint. If
none of the architectures gives a delay less than the constrained value, then
the architecture with minimum delay is selected, irrespective of power.
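The selection loop described above can be sketched as follows. This is a minimal illustration and not the utility developed in this work: the names `Arch` and `select_architecture` are hypothetical, and a delay equal to the constraint is assumed to satisfy it.

```python
from typing import NamedTuple


class Arch(NamedTuple):
    name: str
    power_mw: float
    delay_ns: float


def select_architecture(candidates: list[Arch], max_delay_ns: float) -> Arch:
    """Lowest power first; fall back to the fastest if nothing meets timing."""
    if not candidates:
        raise ValueError("no architectures to choose from")
    remaining = list(candidates)  # working copy of the architecture list
    while remaining:  # loop until the list is empty
        best = min(remaining, key=lambda a: a.power_mw)
        # Assumption: meeting the constraint exactly counts as a pass.
        if best.delay_ns <= max_delay_ns:
            return best
        remaining.remove(best)  # drop it and try the next lowest power
    # Nothing meets the constraint: take the fastest, irrespective of power.
    return min(candidates, key=lambda a: a.delay_ns)
```

Because the lowest power candidate is tried first, the first one that passes the timing check is by construction the lowest power option that meets the constraint.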

This algorithm can be used if an operator is on the critical path or if,
even though it is not on the critical path, it may violate timing
requirements when a low power but high delay architecture gets selected. For
this, the synthesis tool may need to extract timing details of the paths on
which the operators are present and deduce the maximum delay requirement for
each operator.

5.3 Results

A utility is developed to select the minimum power architecture,
irrespective of delay and area. It has two inputs, the operator type and its
bit width. It looks for all the available implementations of that operator
and selects the one with minimum power.

This poweraware architecture is compared to the default operator
architecture generated by Synplify Pro by synthesizing HDL statements like
c = a + b for adders, c = a * b for multipliers, etc.

Power in poweraware mode for adders is 50% less than that in default mode.
Area up to 32 bits is about 30%, while for wider adders it is 50%, of the
area of the default architecture. However, the delay for lower bit width
adders is too high in the poweraware architecture. This is because timing is
not considered, so the ripple carry adder, which has a higher delay, is
selected here.




[Flowchart]
1. Get a list of architectures.
2. Select the architecture with minimum power.
3. Check if the delay constraints are met. If yes, choose that architecture
   and end.
4. If no, remove that architecture from the list.
5. Is the list empty? If no, go back to step 2. If yes, select the
   architecture with minimum delay and end.

Figure 5.1: Algorithm for selection of low power architecture

Even for multipliers (Fig. 5.3), poweraware architectures have about
60-70% of the power of the default architectures. The delay at lower bit
widths is smaller for poweraware, while the default architectures fare better
in delay for wider multipliers (even for the 48 and 64-bit multipliers not
shown in the graph).

As expected, power (Fig. 5.4(a)) and area for comparators are almost the
same for the poweraware and default implementations, as the architectures are
not much different. But, as seen from Fig. 5.4(b), their delays form a
typical pattern, with poweraware having lower delay for bit widths that are
powers of 2. This is because beyond each power of 2 the number of logic
levels increases by 1 and the delay becomes greater than that of the default
implementation.

This comparison tells us that power consumption can be reduced if a low
power implementation of an operator is selected where delay is not critical.

Fig. 5.5 gives a comparison of the power of the default and poweraware
architectures for a 16-bit adder with given delay constraints. The poweraware
architecture is selected using the algorithm given in Fig. 5.1. The ripple
carry adder, CLA2 and CLA1 architectures are selected by the algorithm for
delay constraints of 20 ns, 15 ns and 10 ns respectively. Fig. 5.5 shows that
if the delay constraints are not stringent, our algorithm is able to select a
low power implementation that meets the given speed requirement.

(a) Power [plot: Power (Net+Gate) in mW against bit width; series: default, poweraware]

(b) Delay [plot: delay (ns) against bit width; series: default, poweraware]



(c) Area [plot: Area (# of cores) against bit width; series: default, poweraware]

Figure 5.2: Default and Power-Aware comparison of Adders

(a) Power [plot: Power (Net+Gate) in mW against bit width; series: default, poweraware]

(b) Delay [plot: delay (ns) against bit width; series: default, poweraware]

(c) Area [plot: Area (# of cores) against bit width; series: default, poweraware]

Figure 5.3: Default and Power-Aware comparison of Multiplier



(a) Power [plot: Power (Net+Gate) in mW against bit width; series: default, poweraware]

(b) Delay [plot: delay (ns) against bit width; series: default, poweraware]

Figure 5.4: Power and delay area plots for Comparator architectures



[plot: Power (mW) against delay constraint (ns) of 20, 15 and 10; series: default, poweraware]

Figure 5.5: Comparison of power for 16-bit adder with given delay constraints




Chapter 6

Conclusion

6.1 Conclusion

Synthesis tools that by default go for the minimum delay implementation of
an arithmetic operator can be made to select a low power alternative where
timing is not critical. This can save a considerable amount of power.

The net power is many times greater than the gate power, as interconnection
wires and switches consume more power. The power consumed by a net depends
much on its length. Thus, power depends not only on the number of instances
but also on their placement, as scattered placement increases the lengths of
the nets.

So, for operator architectures whose cells are not placed together, blocks
must be created by manual placement, keeping the complete module together so
as to minimize net length. Such blocks can be used during actual physical
synthesis.

Also, for power-aware synthesis, decisions need to be taken at the logic
synthesis as well as the physical synthesis stage. Some guidelines need to be
set for the PnR tool so that it can identify that all instances in a
particular operator module should be placed together.




