vdtt.iitd.ac.in research projects thesis jvl072175

Upload: chandramanisingh

Post on 02-Jun-2018


    Power Optimization for Datapaths in FPGAs

    A thesis submitted in partial fulfillment of the requirements for the degree of

    MASTER OF TECHNOLOGY

    in

    VLSI DESIGN TOOLS & TECHNOLOGY

    by

    Haldule Prasad Charudatta

    Entry No. 2007JVL2175

    under the guidance of

    Prof. M Balakrishnan

    Mr. Madhav Chikodikar (Synplicity India)

    Dr. G. Chandramouli (Synplicity India)

    VLSI Design Tools And Technology,

    Indian Institute of Technology Delhi.

    May 2009


    Certificate

    This is to certify that the thesis titled Power Optimization for Datapaths in FPGAs being submitted by Haldule Prasad Charudatta for the award of Master of Technology in VLSI Design Tools & Technology is a record of bona fide work carried out by him under my guidance and supervision at the Department of Computer Science & Engineering.

    The work presented in this thesis has not been submitted elsewhere either in

    part or full, for the award of any other degree or diploma.

    Prof. M Balakrishnan

    Department of Computer Science and Engineering

    Indian Institute of Technology Delhi


    Acknowledgments

    I would like to sincerely thank my supervisors Prof. M Balakrishnan, Mr.

    Madhav Chikodikar and Dr. G. Chandramouli for their constant guidance

    and invaluable suggestions throughout the project.

    I am also indebted to research scholar Neeraj Goel from IIT Delhi and

    Tarun Kumar, Sriram C. from Synplicity, India for their support and help.

    Finally, I would like to thank my parents and all my friends for their co-operation

    and encouragement.

    Prasad C. Haldule


    Abstract

    Hardware designs targeting communication and DSP applications consist of a large number of data path elements such as adders, multipliers, comparators and shifters. These elements are the main contributors to the power consumption of digital circuits. Power reduction is now becoming very important for designs implemented on FPGAs as well. At present, most popular FPGA synthesis tools give higher priority to delay, so they tend to generate the data path architecture that gives the fastest implementation. With the increasing importance of power reduction on FPGAs, it is becoming necessary to evaluate different data path architectures from the point of view of both delay and power.

    This work is aimed at characterizing various architectural implementations of common operators for power, delay and area on a target FPGA, and at selecting a low power architecture where delay is not critical.


    Contents

    1 Introduction and Motivation
      1.1 Introduction
      1.2 Related Work
      1.3 Objective
      1.4 Approach
      1.5 Organization

    2 FPGA And Operator Architectures
      2.1 FPGA Architecture
      2.2 Operator Architecture
        2.2.1 Adders
          Ripple Carry Adder
          Carry Select Adder
          Carry Look-Ahead Adder 1 - CLA1
          Carry Look-Ahead Adder 2 - CLA2
          Carry Look-Ahead Adder 3 - CLA3
          Brent and Kung Adder
        2.2.2 Multipliers
          Non-Booth Encoding
          Booth Encoding
          Wallace Tree reduction
        2.2.3 Equality Comparators

    3 Experiments Setup and Power Plots
      3.1 Power Analysis Setup
      3.2 Power, delay and area plots of operators

    c 2009, Indian Institute of Technology Delhi


    4 Power Analysis
      4.1 Dynamic Power Dissipation
      4.2 Power analysis of operators
        4.2.1 Adders
        4.2.2 Multipliers
        4.2.3 Comparators
      4.3 General Observations

    5 Power-Aware Architecture Selection and Results
      5.1 Power-Delay-Area database
      5.2 Power-Aware Arithmetic Block Selection
        5.2.1 Minimum power
        5.2.2 Minimum power with speed constraint
      5.3 Results

    6 Conclusion
      6.1 Conclusion

    References


    List of Figures

    2.1 FPGA Architecture
    2.2 Flash-based switch
    2.3 Ripple Carry Adder
    2.4 Carry Select Adder
    2.5 Carry Look-Ahead Adder-1
    2.6 Carry Look-Ahead Adder-2
    2.7 8-bit Brent and Kung Adder Tree Diagram
    2.8 A 4 × 4 multiplier
    2.9 An equality comparator
    3.1 Power Analysis Flow
    3.2 Power, delay and area plots for Adder architectures
    3.3 Power, delay and area plots for Multiplier architectures
    3.4 Power, delay and area plots for Comparator architectures
    4.1 Ripple carry Adder Placement
    5.1 Algorithm for selection of low power architecture
    5.2 Default and Power-Aware comparison of Adders
    5.3 Default and Power-Aware comparison of Multiplier
    5.4 Power and delay area plots for Comparator architectures
    5.5 Comparison of power for 16-bit adder with given delay constraints


    List of Tables

    2.1 Booth recoding
    4.1 Relative Comparison of Adder architectures
    4.2 Relative Comparison of Multiplier architectures


    Chapter 1

    Introduction and Motivation

    1.1 Introduction

    The use of FPGAs is rapidly increasing owing to their low cost and shorter time to market. FPGAs are also well suited to building prototypes and to reconfigurable applications. However, FPGAs are not power efficient. Two reasons why FPGAs consume more power are:

      • FPGAs contain a large number of interconnects and programmable switches
      • the generic logic structures in FPGAs consume more power than the dedicated circuitry in ASICs

    A number of applications where FPGAs are used, such as space and military, demand low power. A major component of FPGA power is the datapath, so there is scope for reducing power consumption there. Many applications using FPGAs involve computations that use arithmetic operators like adders, multipliers, etc. Present FPGA synthesis tools like Synplify Pro target the performance of these operators and give no consideration to power.

    These arithmetic operators can be implemented in different architectures. The power consumed by an operator can be decreased by selecting one operator architecture over another. The power varies not only with the operator architecture but also with its interconnects, which depend on the FPGA architecture and technology. To make this selection, a synthesis tool must have detailed knowledge of parameters like the power, delay and area of the different operator implementations at different bit widths.

    1.2 Related Work

    The idea of characterizing arithmetic operators and selecting an implementation for minimum power has been discussed lately. A methodology to form a characteristic lookup table and a couple of approaches to achieve a low power implementation are discussed in [1]. [2] extends the idea to select a particular architecture by simulated annealing, considering power, delay as well as area of the operators. [3] analyzes the power-delay-area tradeoff among different architectures of adders and multipliers for Actel ProASIC FPGAs, while a similar study for adders is done for an Altera FPGA in [4].

    1.3 Objective

    The project is aimed at developing a power aware utility to generate an FPGA implementation for a data path operator given its word length and expected frequency of operation. It can be divided into two parts:

      • Analyzing the power of different operators and studying the power-time-area relation among them.
      • Developing a utility that goes through all the implementations and comes up with the one having minimum power for an operator at the desired frequency.

    1.4 Approach

    For analyzing the power of different architectures, we first identify the arithmetic operators of interest, their types and the word lengths for which the analysis is to be done. Then we develop a complete power flow that takes these parameters, invokes the different tools and directly gives the corresponding power. As power analysis needs to be done repeatedly, this step is crucial. The architectures for the operators are studied from the power point of view. Finally, we develop a script that uses the Synopsys internal module generator to generate the right architecture for the given delay and power requirements, and we compare these power values with those obtained when the default architectures are used.

    1.5 Organization

    The following gives a general overview of this report.

    Chapter 2 describes a generic FPGA architecture and the different operator architectures that are part of our analysis. Chapter 3 explains the power flow set up for the analysis and gives the power, delay and area graphs for the operator implementations. In Chapter 4 a detailed analysis of the architectures is presented from the power point of view and the trade-offs are discussed. Chapter 5 gives a basic way of selecting a low power implementation and compares the results of default and power-aware implementations, while Chapter 6 presents the conclusions.


    Chapter 2

    FPGA And Operator Architectures

    2.1 FPGA Architecture

    A generic FPGA architecture is shown in the figure below.

    Figure 2.1: FPGA Architecture

    An FPGA consists of a two-dimensional array of logic blocks connected

    by general interconnection resources [5]. FPGA includes three main com-


    2.2 Operator Architecture

    delay and area for different bit widths. For this, the architectures are selected

    from those available in Synopsys's internal module generator. The operators

    targeted are adders, multipliers and comparators.

    2.2.1 Adders

    The module generator offers a wide variety of adder architectures that can be selected according to the power and performance needs of the design. The implementations include ripple carry adder, carry select adder, Carry Look-Ahead adders and Brent and Kung adder.

    A carry can either be generated in an adder stage or propagated from the

    previous one. CLAs use these generate (G) and propagate (P) values of all

    previous stages to determine the carry of a particular stage. This combination

    of previous stage GP values can be done in many ways depending on the

    architecture used. Accordingly there are three variants of Carry Look-Ahead

    adders having power-delay-area trade-offs.

    Ripple Carry Adder

    This is the simplest implementation of an adder, formed by cascading full adders in series as shown in figure 2.3. A full adder computes the sum and carry for each stage. The CarryOut of a full adder stage is applied as the CarryIn to the next stage [7].

    The main advantage of this architecture is its small area; however, it has a large delay that increases linearly with the bit width. Both delay and area are of order O(n).

    Figure 2.3: Ripple Carry Adder
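
    The cascading of full adders described above can be sketched behaviourally in Python. This is only an illustrative bit-level model, not the netlist produced by the module generator; the function names full_adder and ripple_carry_add are hypothetical.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_carry_add(x, y, n, cin=0):
    """Add two n-bit unsigned integers by chaining n full adders.

    The carry of stage i feeds stage i + 1, so the carry chain
    (and hence the delay) grows linearly with n.
    """
    carry = cin
    total = 0
    for i in range(n):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        total |= s << i
    return total, carry


# 4-bit example: 11 + 6 = 17, i.e. sum bits 0001 with carry out 1
print(ripple_carry_add(0b1011, 0b0110, 4))  # (1, 1)
```

    The loop body runs once per bit and each iteration depends on the carry of the previous one, which is the behavioural counterpart of the O(n) delay stated above.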


    Carry Select Adder

    Figure 2.4: Carry Select Adder

    A carry select adder for n bits is divided into n/s stages of s bits each. The sum and carry for each stage are first computed separately for an input carry of 0 as well as 1, and then one of the two results is selected depending on the actual CarryIn from the previous stage [7].

    This circuit has more area (almost twice) than ripple carry adder as each

    stage consists of an extra s-bit adder and a multiplexer. Carry Select has a

    lower delay than ripple carry as the carry and sum are computed for each

    block in advance.

    A block diagram for n-bit carry select adder with s-bit blocks is shown in

    Figure 2.4. Here, 0-Carry and 1-Carry are s-bit adders with input carry as 0

    and 1 respectively.
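
    A minimal behavioural sketch of this select scheme is given below, assuming that n is a multiple of s. The helper names add_block and carry_select_add are hypothetical, and the s-bit block adders are modelled with ordinary integer addition rather than any particular adder architecture.

```python
def add_block(xb, yb, s, cin):
    """s-bit block adder: returns (s-bit sum, carry_out)."""
    total = xb + yb + cin
    return total & ((1 << s) - 1), total >> s


def carry_select_add(x, y, n, s):
    """Add two n-bit unsigned integers using s-bit carry-select blocks.

    Assumes n is a multiple of s. Each block computes both candidate
    results (carry-in 0 and carry-in 1); the real carry-in then only
    drives the selection, like the multiplexer in the block diagram.
    """
    mask = (1 << s) - 1
    carry = 0
    result = 0
    for lo in range(0, n, s):
        xb = (x >> lo) & mask
        yb = (y >> lo) & mask
        sum0, c0 = add_block(xb, yb, s, 0)  # the "0-Carry" adder
        sum1, c1 = add_block(xb, yb, s, 1)  # the "1-Carry" adder
        block_sum, carry = (sum1, c1) if carry else (sum0, c0)
        result |= block_sum << lo
    return result, carry


total, carry_out = carry_select_add(200, 100, 8, 4)
print(total, carry_out)  # 44 1, since 200 + 100 = 300 = 256 + 44
```

    Both candidate sums exist for every block, which mirrors the roughly doubled area noted above.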

    Carry Look-Ahead Adder 1 - CLA1

    This is a conventional form of CLA architecture. In this, the entire word length is divided into blocks of, say, 4 bits each. The generate-propagate (GP) values for individual bits (denoted by $G_0$, $P_0$, and so on), then for pairs of bits, and then for groups of 4 bits are generated in the first three stages of the circuit as,

    Stage 1: $G_0 = X_0 \cdot Y_0$, $G_1 = X_1 \cdot Y_1$, $P_1 = X_1 \oplus Y_1$, ...

    Stage 2: $G_{1:0} = G_1 + G_0 P_1$, $G_{3:2} = G_3 + G_2 P_3$, $G_{5:4} = G_5 + G_4 P_5$, ...
             $P_{3:2} = P_2 P_3$, $P_{5:4} = P_4 P_5$, ...

    Stage 3: $G_{3:0} = G_{3:2} + G_{1:0} P_{3:2} = G_3 + G_2 P_3 + G_1 P_2 P_3 + G_0 P_1 P_2 P_3$,
             $G_{7:4} = G_7 + G_6 P_7 + G_5 P_6 P_7 + G_4 P_5 P_6 P_7$, ...
             $P_{7:4} = P_4 P_5 P_6 P_7$, ...

    In each later stage, a 4-bit GP block is combined with the complete lower GP block to finally realize the complete carry equation. This is achieved by forming a skewed tree-like structure as shown in figure 2.5.

    Figure 2.5: Carry Look-Ahead Adder-1 (stages 1 to 3 produce GP 3:0, GP 7:4, GP 11:8 and GP 13:12; a chain of GP combination blocks then forms GP 7:0, GP 11:0 and GP 13:0)

    The GP combination blocks can be realized by simple AND-OR logic, so this architecture has a small area. In this implementation the GP values of the lower bits ripple across the design, but in batches. So, this circuit has a greater delay than the other CLA implementations.
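
    The skewed chain of GP combination blocks can be modelled in a few lines of Python. This is a sketch under stated assumptions: propagate is taken as XOR of the operand bits, the carry into bit 0 is 0, and the names combine, block_gp and cla1_carries are hypothetical rather than taken from the module generator.

```python
def combine(hi, lo):
    """GP combination block: merge a higher (G, P) pair with the lower one."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return g_hi | (p_hi & g_lo), p_hi & p_lo


def block_gp(x, y, lo, width):
    """Group (G, P) for bits lo .. lo + width - 1 (stages 1 to 3)."""
    gp = None
    for i in range(lo, lo + width):
        xi, yi = (x >> i) & 1, (y >> i) & 1
        bit_gp = (xi & yi, xi ^ yi)  # assumption: propagate = XOR
        gp = bit_gp if gp is None else combine(bit_gp, gp)
    return gp


def cla1_carries(x, y, n, block=4):
    """Carry out of every block boundary, assuming carry-in 0.

    Each block GP is merged with the complete lower GP, so the
    combination blocks form a chain (the skewed tree of CLA1).
    """
    carries = []
    acc = None
    for lo in range(0, n, block):
        width = min(block, n - lo)
        gp = block_gp(x, y, lo, width)
        acc = gp if acc is None else combine(gp, acc)
        carries.append(acc[0])
    return carries


# 14-bit operands give the groups GP 3:0, GP 7:0, GP 11:0 and GP 13:0
print(cla1_carries(0x3FFF, 0x0001, 14))  # [1, 1, 1, 1]
```

    The number of chained combine calls grows with the number of blocks, which is the batch-wise rippling described above.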

    Carry Look-Ahead Adder 2 - CLA2

    This is a faster version of CLA architecture. The first three stages of generation of GP values are similar to those of CLA1. Further, the GP blocks are combined in a more aggressive manner in fewer logic levels, thereby decreasing the delay. The architecture has a balanced tree structure.

    Figure 2.6: Carry Look-Ahead Adder-2

    As seen from figure 2.6, the combining of GP values is simultaneous at lower bits and higher bits, unlike in the previous CLA where it proceeded from lower bits toward higher bits. Hence, it requires fewer logic levels. More gates and logic are required to generate GP blocks of higher bit width, so the area of this FCLA circuit (CLA2) is larger.

    Carry Look-Ahead Adder 3 - CLA3

    One of the Carry Look-Ahead implementations seen before is small (but slow) while the other is fast (but large). This third variant of CLA architecture is a compromise between the earlier two. It has only one logic level more than CLA2 but about 30% less area.

    Brent and Kung Adder

    Brent and Kung architecture is a parallel prefix tree based algorithm to develop a binary adder circuit optimized for time and area [8]. A tree diagram for an 8-bit Brent and Kung adder is shown in figure 2.7. It has a black operator to compute the GP values of higher order from lower order values. For a black operator ($\bullet$),

    $G_{i:j} = G_{i:k} + P_{i:k} \cdot G_{k-1:j}$

    $P_{i:j} = P_{i:k} \cdot P_{k-1:j}$

    Figure 2.7: 8-bit Brent and Kung Adder Tree Diagram

    The addition of n-bit numbers can be performed in time O(log n), while the area complexity is O(n log n). Brent and Kung architecture is considered the most area-efficient parallel prefix adder as it needs very few logic elements.
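
    The prefix tree can be illustrated with a small Python model. This is a sketch of the usual two-phase Brent and Kung network (an up-sweep followed by a down-sweep), assuming n is a power of two, a carry-in of 0 and propagate taken as XOR; the names black and brent_kung_add are hypothetical.

```python
def black(hi, lo):
    """Black operator: (G, P) of a range from its upper and lower halves."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return g_hi | (p_hi & g_lo), p_hi & p_lo


def brent_kung_add(x, y, n):
    """Add two n-bit unsigned integers with a Brent-Kung prefix network.

    Assumes n is a power of two and the carry into bit 0 is 0.
    """
    assert n > 0 and (n & (n - 1)) == 0, "n must be a power of two"
    p = [((x >> i) & 1) ^ ((y >> i) & 1) for i in range(n)]
    gp = [(((x >> i) & 1) & ((y >> i) & 1), p[i]) for i in range(n)]
    levels = n.bit_length() - 1

    # Up-sweep: build GP values over ranges of 2, 4, 8, ... bits.
    for d in range(levels):
        half, step = 1 << d, 1 << (d + 1)
        for i in range(step - 1, n, step):
            gp[i] = black(gp[i], gp[i - half])

    # Down-sweep: fill in the remaining prefixes GP i:0.
    for d in range(levels - 2, -1, -1):
        half, step = 1 << d, 1 << (d + 1)
        for i in range(step + half - 1, n, step):
            gp[i] = black(gp[i], gp[i - half])

    # Carry into bit i is G (i-1):0; the sum bit is P_i XOR carry.
    carry_in = [0] + [gp[i][0] for i in range(n - 1)]
    total = sum((p[i] ^ carry_in[i]) << i for i in range(n))
    return total, gp[n - 1][0]


print(brent_kung_add(0b10110101, 0b01101110, 8))  # (35, 1): 181 + 110 = 291
```

    The two sweeps use log2(n) and log2(n) - 1 levels of black operators respectively, which is consistent with the logarithmic delay stated above.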

    2.2.2 Multipliers

    Multiplication is an important arithmetic operation. Multipliers are much larger than adders and also consume more power. Also, multiplication is a slow operation. Multiplication is actually a process of adding multiple partial products. These partial products are formed by operating on the multiplicand with each bit (or group of bits) of the multiplier. Different multiplier architectures result depending on how the partial products are generated and how they are added to give the complete product.

    The partial products can be computed using the normal ANDing operation or by using some encoding method. Encoding reduces the number of partial products; Booth encoding is used for this purpose. Partial product generation is followed by their addition, which is done in a tree-like format using Wallace tree optimization. These techniques can be used in different ways with different options to give various architectures.

    Non-Booth Encoding

    This is the simplest way of generating partial products. In this, the multiplicand is ANDed with each bit of the multiplier to form a partial product [9]. The total number of partial products that must be added to get the final product is not reduced.

    Non-Booth encoding is preferred for multipliers of lower bit width as there is no overhead of encoding the multiplier.

    Booth Encoding

    Booth Encoding is used to decrease the number of partial products. The smaller the number of partial products, the better the performance. This is achieved by recoding the multiplier bits so that a group of bits now decides the partial product.

    The basic Booth-encoding technique recodes two multiplier bits to +1, -1 or 0 depending on the value of the pair of bits [9]. The multiplicand is assigned accordingly to the partial product as given in table 2.1.

    Multiplier bits   Recoded bits   Corresponding Partial Product
    00                0              0
    01                +1             +Multiplicand
    10                -1             -Multiplicand
    11                0              0

    Table 2.1: Booth recoding

    A modified Booth encoding is also used in which three multiplier bits are recoded. Booth encoding is very useful, especially for higher bit widths. It helps to decrease the area as well as the delay of the next stage, i.e. the addition of partial products.
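
    The recoding rule of table 2.1 can be checked with a short Python sketch. It assumes the multiplier is an n-bit two's complement number and that the bit to the right of the least significant bit is 0; the names booth_recode and booth_multiply are hypothetical.

```python
def booth_recode(multiplier, n):
    """Recode an n-bit two's complement multiplier into digits -1, 0, +1.

    Each digit is decided by the pair (current bit, bit to its right):
    00 -> 0, 01 -> +1, 10 -> -1, 11 -> 0, as in table 2.1.
    """
    digits = []
    prev = 0  # assumed bit to the right of the least significant bit
    for i in range(n):
        cur = (multiplier >> i) & 1
        digits.append(prev - cur)
        prev = cur
    return digits


def booth_multiply(multiplicand, multiplier, n):
    """Sum the partial products selected by the recoded digits."""
    digits = booth_recode(multiplier, n)
    return sum((d * multiplicand) << i for i, d in enumerate(digits))


print(booth_recode(0b0110, 4))     # [0, -1, 0, 1], i.e. 8 - 2 = 6
print(booth_multiply(5, 0b0110, 4))  # 30
```

    For the multiplier 0110 only two of the four recoded digits are non-zero, so only two partial products actually need to be added.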


    Wallace Tree reduction

    Wallace tree reduction is a technique of arranging partial products in a tree-like manner so that they are added in parallel. The number of adders required decreases with this technique [10, 11]. It uses two operators: a full adder, which has three inputs and two outputs, and a half adder, which has two inputs and two outputs. Using these 3:2 and 2:2 operators, the required reduction is achieved. The routing in Wallace tree multipliers is highly irregular.

    An n-bit multiplier consists of n n-bit partial products. These partial products need to be summed to get the final product. This addition of n n-bit operands is converted into an addition of two 2n-bit operands.

    Consider a 4 × 4 multiplier. Fig. 2.8 shows the Wallace tree architecture for a 4 × 4 multiplier. Assuming non-Booth encoding, the partial products are computed as shown in Fig. 2.8(a), where $P_{ij} = X_i \cdot Y_j$.

    (a) Partial Products Generation

    The addition of these partial products gives the final 8-bit product. This addition is done in two stages.

    1. The first stage consists of fast carry save addition of the partial products using full adders and half adders (Fig. 2.8(b)). The carries generated at one layer are passed on to the next layer for addition. An advantage of this is that only two layers are sufficient for the addition of the 16 partial product bits (not including the last stage). Thus, both the number of adders and the critical path are reduced. This reduction is significant for multipliers with higher word length. Because of the use of full adders (also referred to as 3:2 compressors), the delay of the tree is $O(\log_{3/2} N)$.

    2. The last stage consists of a fast 2n-bit adder (8-bit, in this case). Usually a carry look-ahead adder is used here.

    Figure 2.8: A 4 × 4 multiplier. (b) Wallace Tree Architecture: full adders (FA) and half adders (HA) reduce the partial products P00 to P33, and the last stage adder produces the outputs Z7 to Z0.
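
    The two stages can be imitated with a column-wise reduction in Python. This is a simplified sketch in the spirit of Wallace tree reduction, not the exact wiring of Fig. 2.8(b): in every layer each column groups its bits into full adders (3:2) and sends a leftover pair through a half adder (2:2). The function name wallace_multiply and the column bookkeeping are hypothetical.

```python
def full_adder(a, b, c):
    """3:2 operator: three bits in, (sum, carry) out."""
    return a ^ b ^ c, (a & b) | (b & c) | (a & c)


def half_adder(a, b):
    """2:2 operator: two bits in, (sum, carry) out."""
    return a ^ b, a & b


def wallace_multiply(x, y, n=4):
    """Multiply two n-bit unsigned integers by carry save reduction.

    Returns (product, number of reduction layers used).
    """
    # columns[k] holds all partial product bits of weight 2**k
    columns = [[] for _ in range(2 * n)]
    for i in range(n):
        for j in range(n):
            columns[i + j].append(((x >> i) & 1) & ((y >> j) & 1))

    layers = 0
    while any(len(col) > 2 for col in columns):
        nxt = [[] for _ in range(len(columns) + 1)]
        for k, col in enumerate(columns):
            pos = 0
            while len(col) - pos >= 3:  # full adders on groups of three
                s, c = full_adder(*col[pos:pos + 3])
                nxt[k].append(s)
                nxt[k + 1].append(c)
                pos += 3
            rest = col[pos:]
            if len(rest) == 2:  # half adder on a leftover pair
                s, c = half_adder(*rest)
                nxt[k].append(s)
                nxt[k + 1].append(c)
            else:
                nxt[k].extend(rest)
        columns = nxt
        layers += 1

    # Last stage: add the two remaining rows with an ordinary adder.
    row0 = sum(col[0] << k for k, col in enumerate(columns) if len(col) > 0)
    row1 = sum(col[1] << k for k, col in enumerate(columns) if len(col) > 1)
    return row0 + row1, layers


product, layers = wallace_multiply(13, 11)
print(product)  # 143
```

    Every full adder removes one bit from the array while preserving the weighted sum, so the loop always ends with at most two bits per column, which are then handed to the last stage adder.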

    The Wallace tree reduction can be implemented in different ways. Synopsys's internal module generator provides various options for this. The order of the compressor can be set to 3 or 4. A few other options allow optimizing either the area or the timing of the Wallace tree.

    All these options are used to generate different implementations of multipliers. The names of architectures with Booth and non-Booth encoding start with bwt and nbwt respectively. The latter part of the name denotes the type of Wallace tree optimization used. These are,

      • array: This architecture gives the traditional array form of multiplier. It has a smaller area. But, as the partial products are added in array form, this circuit has a larger delay. It produces a dense layout.
      • tree: This is a regular tree-like implementation of multipliers. It has a smaller delay. 4:2 compressors are used in this implementation.
      • bitopt: The bit-optimized solution is the minimum gate count and minimum delay circuit.

    2.2.3 Equality Comparators

    Equality comparators check the equality of two numbers and output 1 if they are equal, else 0. The simplest way of doing so is to bitwise XNOR the two inputs and then to AND the outputs of the XNORs to get the final result. A generic equality comparator is shown in Fig. 2.9.

    Figure 2.9: An equality comparator (inputs A0, B0, A1, B1, ..., An-2, Bn-2, An-1, Bn-1; output Y)

    The polarity of the logic and XOR gates can be set to either true or inverting in Synopsys's module generator. However, the basic architecture remains the same.
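
    The XNOR-AND structure, and one logically equivalent form with inverted polarity, can be sketched as follows. The second function is only an assumption about what an inverting variant could look like (XOR gates followed by a NOR); it is not taken from the module generator, and both function names are hypothetical.

```python
def equal_true(a, b, n):
    """XNOR each bit pair, then AND all XNOR outputs."""
    y = 1
    for i in range(n):
        xnor = 1 ^ (((a >> i) & 1) ^ ((b >> i) & 1))
        y &= xnor
    return y


def equal_inverting(a, b, n):
    """Equivalent form: XOR each bit pair, OR the results, invert."""
    any_diff = 0
    for i in range(n):
        any_diff |= ((a >> i) & 1) ^ ((b >> i) & 1)
    return 1 ^ any_diff


print(equal_true(0b1010, 0b1010, 4), equal_true(0b1010, 0b1000, 4))  # 1 0
```

    Both functions compute the same output for every input pair; only the gate polarities differ, in line with the remark that the basic architecture stays the same.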


    Chapter 3

    Experiments Setup and Power Plots

    3.1 Power Analysis Setup

    For power analysis of architectures of different operators at various bit widths,

    a power flow is set up using tools provided by different vendors.

    The power analysis flow is as follows:

    1. The Synopsys internal module generator is used to generate a Verilog netlist for a particular architecture of an operator. It has two input files, one to specify the functionality of the design/operator and the other to select among the provided options and thereby choose a particular architecture. The .lib library file of the target FPGA is also loaded. The module generator selects appropriate library components to export the netlist.

    Registers are placed at the inputs and outputs of the operators. The first reason is that timing analysis tools provide only clock-to-clock delay; the delay of a purely combinational circuit is not calculated. The second reason is that I/O pads have high delay values, which add to the combinational circuit delay. Placing registers breaks this direct link.

    2. The Verilog netlist from the previous stage is then synthesized in Synplify Pro, a logic synthesis tool for FPGAs. The details of the target FPGA are specified and constraints (if any) are given. For the power analysis experiments the frequency is kept in auto-constrain mode. Automatic constraints generate the fastest design implementation.

    If the frequency constraint is set to a value greater than that derived in auto-constrain mode, then Synplify Pro tries to optimize the circuit for the higher frequency. In the process, it adds more gates, makes the circuit larger and thus distorts the original architecture. This causes the circuit to consume more power. So, the frequency is set to auto-constrain to preserve the architecture and to meet the required timing with minimum power.

    Synplify Pro generates post-synthesis .edn (EDIF) and Verilog netlists.

    Figure 3.1: Power Analysis Flow


    3. The post-synthesis Verilog netlist is simulated in VCS, a Verilog simulator, using a testbench. First, a pseudo-random number generator dumps a large number of binary numbers of the specified width into a file. The testbench then reads these values in pairs and applies them to the operator module.

    A .vcd (Value Change Dump) file is generated during simulation using the $dumpvars system task in the testbench. It contains the switching activity of all the signals in the module and is used for power estimation. The arithmetic operation (addition, multiplication, etc.) is performed on about 10,000 pairs of random numbers. The greater the number of input vectors, the more accurate the results can be expected to be.

    4. The post-synthesis EDIF netlist is passed through the Place and Route (layout) tool of the target FPGA vendor. Any placement constraints, such as defining a region and placing all or a group of the instances in the design within that region, can be specified here using a .pdc (Physical Design Constraint) file.

    A status report is generated after compilation of the EDIF netlist. It contains information about the number of combinational and sequential instances used in the design and the number of high fanout nets with their respective fanouts.

    5. A vendor-specific power estimation/analysis tool is used to calculate the operator power. This step immediately follows PnR. It uses the VCD file for power estimation. This is the recommended method for power estimation as the switching activity information from the VCD file is more accurate.

    A VCD file dumped by VCS may not be directly usable for power estimation as signals are specified in vector form. So, these signals should be scalarized. This is done using a vcdpost utility that produces a VCD file in which the value changes of each bit of a vector signal are recorded as a separate signal.

    The power report gives the breakdown of power by type, viz. Net, Gate, I/O, Core Static and Bank Static. It also contains a breakdown by instance, which gives the power for every instance/gate and net in the design. The number of transitions at each net can also be studied.
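
    Two small pieces of the flow above lend themselves to a Python sketch: generating the pseudo-random operand pairs of step 3, and counting per-bit transitions, which is the kind of switching activity a scalarized VCD records. Everything here is illustrative: the text format of the vector file, the seed and the function names make_vectors, format_vectors and toggle_counts are assumptions, and the code neither reads nor writes real VCD files.

```python
import random


def make_vectors(width, count, seed=0):
    """Return `count` pseudo-random pairs of `width`-bit operands."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    return [(rng.getrandbits(width), rng.getrandbits(width))
            for _ in range(count)]


def format_vectors(pairs, width):
    """One operand pair per line, as zero-padded binary strings."""
    return "\n".join(f"{a:0{width}b} {b:0{width}b}" for a, b in pairs)


def toggle_counts(values, width):
    """Number of transitions seen on each bit of a vector signal.

    Index i of the result is the toggle count of bit i, i.e. each bit
    is treated as a separate scalar signal.
    """
    counts = [0] * width
    for prev, cur in zip(values, values[1:]):
        diff = prev ^ cur
        for bit in range(width):
            counts[bit] += (diff >> bit) & 1
    return counts


pairs = make_vectors(16, 10000)
stimulus = format_vectors(pairs, 16)  # text that could be written to a file
print(toggle_counts([0b00, 0b01, 0b11, 0b10], 2))  # [2, 1]
```

    With uniformly random operands each input bit toggles on roughly half of the consecutive vectors, which is why a larger vector count gives a steadier activity estimate.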

    3.2 Power, delay and area plots of operators

    The power analysis flow was carried out on architectures of adders, multipliers and comparators for different bit widths. The total dynamic power (net + gate), delay and number of cores utilized are plotted accordingly. The respective graphs follow; their analysis is done in the next chapter.

    Figure 3.2: Power, delay and area plots for Adder architectures. Panels: (a) Power (Net+Gate, mW), (b) Delay (ns) and (c) Area (number of cores), each plotted against bit width for the bnk, carrySelect, cla1, cla2, cla3 and ripple architectures (the delay panel's legend lists no ripple curve).

    c 2009, Indian Institute of Technology Delhi

  • 8/10/2019 Vdtt.iitd.Ac.in Research Projects Thesis Jvl072175

    28/50

    3.2 Power, delay and area plots of operators 20

    [Figure 3.3: Power, delay and area plots for Multiplier architectures.
    Panels: (a) Power (Net+Gate), mW; (b) Delay, ns; (c) Area, # of cores;
    each plotted against bit width. Legend: bwt-area, bwt-bitopt,
    bwt-timing, nbwt-area, nbwt-bitopt, nbwt-timing.]

    [Figure 3.4: Power, delay and area plots for Comparator architectures.
    Panels: (a) Power (Net+Gate), mW; (b) Delay, ns; (c) Area, # of cores;
    each plotted against bit width. Legend: invertedLogic, trueLogic.]

    Chapter 4

    Power Analysis

    4.1 Dynamic Power Dissipation

    Dynamic power dissipation is caused by signal transitions in the circuit
    [12]. A higher operating frequency leads to more frequent signal
    transitions and results in increased power dissipation. The most
    significant source of dynamic power consumption in CMOS circuits is the
    charging and discharging of capacitance. This power is given as,

    $$P = \sum_i C_i V_i^2 f_i,$$

    where $C_i$, $V_i$ and $f_i$ are the capacitance, voltage and frequency
    respectively for any instance $i$.
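    As a rough numerical illustration of the expression above, the sketch
    below sums the capacitance, squared voltage and frequency product over a
    list of instances. The function name and the capacitance, voltage and
    frequency values are hypothetical and are not measurements from this
    work.

```python
def dynamic_power(instances):
    """Return sum of C_i * V_i**2 * f_i in watts.

    `instances` is an iterable of (capacitance_F, voltage_V, frequency_Hz)
    tuples, one per instance i.
    """
    return sum(c * v ** 2 * f for c, v, f in instances)


# Illustrative values only: two nets driven at 1.5 V.
example = [
    (10e-12, 1.5, 50e6),   # 10 pF switching at 50 MHz
    (5e-12, 1.5, 100e6),   # 5 pF switching at 100 MHz
]
power_mw = dynamic_power(example) * 1e3  # convert W to mW
```

    Halving the capacitance of a net (for example by shortening it) has the
    same effect on its term as halving its switching frequency, which is why
    net length and switching activity both appear in the factors listed in
    this section.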

    Another source of dynamic power consumption is short-circuit power, but
    its contribution is smaller in FPGAs.

    Thus the total dynamic power consumption depends on the following
    factors:

    • the length, fanout and effective capacitance of the interconnection
      wires and switches,

    • the number of resources utilized, since more resources consume more
      power,

    • and the switching activity of the different nets and instances in the
      design.


    4.2 Power analysis of operators

    4.2.1 Adders

    Ripple Carry Adder

    The ripple carry adder has the smallest circuit in terms of number of
    nets and gates. No net has a fanout of more than 2. So, it is also
    expected to consume less power.

    The gate power of the ripple carry adder is the lowest. However, the PnR
    tool does not identify the architecture, so the placement is spread
    out, which increases the net length. Also, the number of logic levels is
    much higher than in any other architecture. Because of these two
    effects, the net power is not as low as expected. The increase in power
    due to the larger number of logic levels is more dominant at higher word
    lengths.

    The major drawback of this architecture is that its delay is too high
    and increases linearly with word length (hence, it is not shown in Fig.
    3.2(b)).

    Carry Select Adder

    The carry select architecture is formed by duplicating blocks of ripple
    carry adders. So, its area is around twice that of the ripple carry
    adder circuit, but it is smaller than the CLAs, which employ more
    complex circuitry. It has very high fanout nets, even higher than CLA2.
    This causes the carry select adder to consume more power.

    Its delay is comparable to that of the Brent and Kung adder, and is much
    less than that of the ripple carry adder, as the sum and carry for each
    block are computed beforehand.
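    The idea of computing each block beforehand can be shown with a
    behavioural sketch of carry select addition. It is a bit-level software
    model, not the synthesized netlist analysed here; the function names and
    the block size of 4 are assumptions made for illustration.

```python
def ripple_add(a, b, cin, width):
    """Ripple-carry addition of two width-bit values; returns (sum, carry_out)."""
    total, carry = 0, cin
    for i in range(width):
        x, y = (a >> i) & 1, (b >> i) & 1
        total |= (x ^ y ^ carry) << i
        carry = (x & y) | (carry & (x ^ y))
    return total, carry


def carry_select_add(a, b, width, block=4):
    """Carry-select addition built from duplicated ripple-carry blocks."""
    result, carry = 0, 0
    for lo in range(0, width, block):
        w = min(block, width - lo)
        mask = (1 << w) - 1
        xa, xb = (a >> lo) & mask, (b >> lo) & mask
        # Both candidate results are computed without waiting for the
        # incoming carry: one block assumes carry-in 0, its duplicate 1.
        s0, c0 = ripple_add(xa, xb, 0, w)
        s1, c1 = ripple_add(xa, xb, 1, w)
        # The real carry only selects between the two precomputed results.
        s, carry = (s1, c1) if carry else (s0, c0)
        result |= s << lo
    return result, carry
```

    The duplicated block per segment corresponds to the roughly doubled area
    noted above, and the select signal that feeds every output bit of a
    block is one plausible source of the high fanout nets.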

    Carry Look-Ahead Adder 1 - CLA1

    CLA1 is the slowest variant of the CLAs. Its circuit size is 50% of that
    of CLA2, and its number of logic levels is smaller than that of the
    ripple carry and Brent and Kung adders. As a result of its small size
    and fewer levels, CLA1 circuits consume less power. One more advantage
    is that it has low fanout nets, so the net power is lower. But there are
    more spurious transitions in this circuit, since it has a skewed-tree
    like architecture.


    The circuit is slower than the other CLAs but still faster than the
    ripple carry and BnK architectures. So, this architecture can be a good
    choice when all three parameters, viz. delay, power and area, are
    equally critical.

    During placement and routing, the PnR tool performs some logic combining
    optimization on the post-synthesis netlist. The number of logic elements
    combined is given in the Netlist Optimization Report, which is part of
    the Status report. It is observed that there is more scope for such
    post-synthesis logic combining in the CLA1 architecture than in the
    others. About 10% of the logic in CLA1 is combined at this stage,
    further reducing its area.

    Carry Look-Ahead Adder 2 - CLA2

    The CLA2 architecture is the most power-consuming of all, as it is more
    complex and has more gates and nets than the other architectures. This
    architecture has many interconnections and dependencies in the circuit.
    Hence, it has a well-packed layout, so although there are more nets,
    they are shorter. CLA2 has high fanout nets.

    The logic depth of the circuit is very small: it does more computation
    in fewer levels. This makes CLA2 the fastest adder architecture. The
    number of logic levels remains constant over a range of bit widths, so
    the delay also tends to be flat.

    Carry Look-Ahead Adder 3 - CLA3

    The CLA3 circuit has only one more logic level than CLA2. At the same
    time, it is about 70% smaller than CLA2. As a result, the power
    consumption of this architecture is also about 75-80% less than that of
    CLA2.

    So, this circuit can be a better alternative to CLA2 in terms of area
    and power. However, in cases where timing is critical, CLA2 should be
    used. For higher bit widths, though CLA3 has one level more than CLA2,
    it uses gates with lower delay, so its delay becomes almost comparable
    to that of CLA2.

    Brent And Kung Adder

    The Brent and Kung architecture is slightly better in power than the
    other architectures. The main reason is that the number of gates and
    nets in this architecture is much smaller than in the others; only the
    ripple carry adder has fewer gates and nets. However, Brent and Kung has
    the advantage that its number of logic levels is of the order log n,
    whereas for the ripple carry adder it is of order n. Also, its layout is
    scattered to some extent, as for the ripple carry adder.

    Performance-wise, Brent and Kung is slower than the CLAs, as it has
    deeper logic. It is more suitable from the power point of view at higher
    bit widths, but its delay is then 1.5 times that of CLA2 for 32 and 64
    bits.

    The relative comparison of the adder architectures is given in Table 4.1
    and the graphs are given in Fig. 3.2.

    Architecture    No. of nets/gates    No. of logic levels    Fanout
    ripple          Low                  Very High              Low
    Carry Select    High                 High                   High
    CLA1            Moderate             Moderate               Moderate
    CLA2            Very High            Low                    High
    CLA3            High                 Low                    High
    Brent & Kung    Low                  High                   Moderate

    Table 4.1: Relative Comparison of Adder architectures

    4.2.2 Multipliers

    The power, delay and area graphs for the different multiplier
    architectures are given in Fig. 3.3. The trade-offs among these three
    parameters are discussed here.

    Area

    Architectures employing Booth encoding have more area because of the
    additional circuitry for encoding. This Booth circuit varies with the
    encoding method. So, for lower bit-width multipliers, non-Booth
    architectures have smaller area.

    However, as seen in Fig. 3.3(c), the Booth variant of any implementation
    has a smaller area than its non-Booth counterpart at higher bit widths.
    This is because at higher bit widths, the hardware overhead of the Booth
    encoding part is much smaller than the amount of circuitry it saves in
    the later stage. The bwt area architecture has considerably smaller area
    than the other architectures for multipliers more than 24 bits wide
    (this also holds for 48 and 64-bit multipliers, not shown in the graph).

    It is observed that more logic combining takes place during the PnR
    stage for Booth encoded circuits. This is because the PnR tool can
    identify the Booth encoding circuits and optimize them accordingly.

    Delay

    The delay of array multipliers is, as expected, much greater than that
    of the other architectures over the whole range of word lengths. This is
    because the carries generated by the addition of partial products are
    rippled along the rows and columns of the array. So the number of logic
    levels increases, in the worst case, linearly with bit width for the
    non-Booth implementation. For circuits with Booth encoding, the delay is
    smaller, as the number of partial products to be summed is reduced.
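    The reduction in partial products can be illustrated with a recoding
    sketch. The text does not state which Booth variant the synthesized
    architectures use, so radix-4 (modified Booth) recoding is assumed here
    purely for illustration, and the function name is hypothetical.

```python
def booth_radix4_digits(y, width):
    """Recode a width-bit two's-complement multiplier into radix-4 Booth digits.

    `y` is the raw bit pattern (0 <= y < 2**width). Digits are returned
    least significant first, each in {-2, -1, 0, 1, 2}.
    """
    if width % 2:
        raise ValueError("width must be even in this sketch")

    def bit(k):
        # Bit k of y, with an implicit 0 to the right of the LSB.
        return (y >> k) & 1 if k >= 0 else 0

    return [bit(2 * i) + bit(2 * i - 1) - 2 * bit(2 * i + 1)
            for i in range(width // 2)]


# An 8-bit multiplier yields 4 digits, i.e. 4 partial products instead of 8.
digits = booth_radix4_digits(0b01100110, 8)
```

    Each digit selects a partial product from 0, the multiplicand, twice the
    multiplicand, or their negatives, so under this assumption the number of
    rows to be summed is halved at the cost of the encoder circuitry.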

    The tree and bitopt architectures have similar delay characteristics,
    with bitopt faring slightly better than the corresponding tree
    implementations. Below 28 bits, the non-Booth implementation is faster,
    while for wider multipliers the bit-optimized Booth implementation gives
    the minimum delay.

    Power

    The power of the architectures depends on three major factors: the
    number of nets and gates in the design, the number of logic levels
    (which contribute spurious transitions, increasing the glitch power),
    and the number of high fanout nets and their fanout.

    A relative analysis of these three factors is given in Table 4.2.

    Architecture    No. of nets/gates    No. of logic levels    Fanout
    bwt array       Low                  High                   High
    bwt bitopt      Low                  Low                    High
    bwt tree        High                 Low                    High
    nbwt array      High                 High                   Low
    nbwt bitopt     Low/Moderate         Low                    Low
    nbwt tree       High                 Low                    Low

    Table 4.2: Relative Comparison of Multiplier architectures


    The fanout for Booth encoding circuits is very high. Mostly, the input
    nets are the high fanout nets, and hence they consume more power.

    There are two types of transitions in a circuit: functional (intended)
    and spurious (unintended). The more logic levels there are, the more
    spurious transitions (or glitches) occur. These transitions increase the
    dynamic power consumption. The number of logic levels, as seen before,
    is higher for array implementations. The tree and bitopt
    implementations have smaller logic depth. Thus the contribution of
    glitch power due to spurious transitions is smaller in these circuits.

    Finally, power consumption increases with the number of nets and gates.
    Gates consume switching as well as short-circuit power. Nets, as seen
    earlier, are also expensive in terms of power due to their length and
    fanout. The bit-optimized architectures are smaller, so they have fewer
    power-consuming resources.

    As seen from Table 4.2, the nbwt bitopt architecture does well on all
    three fronts. Hence, it gives minimum power, while the array based
    circuits consume more power (Fig. 3.3(a)).

    4.2.3 Comparators

    The two comparator architectures implemented are invertedLogic and
    trueLogic, which, as the names suggest, differ in the type of logic
    implemented.

    As seen from the power, delay and area plots of the comparators in Fig.
    3.4, the values of these parameters are almost the same for the two
    implementations. This is because the comparator architecture is fairly
    simple and leaves no scope for further optimization, so there is no
    structural difference between the two implementations. Only the library
    elements used are different.

    The comparator architecture can be optimized if the arrival times of the
    inputs are known. However, this is not the general case and may vary
    with the design.

    The comparator circuits are scalable. Hence, the area and power of the
    circuits increase linearly with bit width (Fig. 3.4(a) & (c)).

    But the delay, as shown in Fig. 3.4(b), increases with bit width in
    steps. This is because the number of logic levels needed to compute the
    comparator output is the same for all bit widths in the range
    $2^{n-1} < \text{bit width} \le 2^n$.
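    The stepped behaviour follows from grouping bit widths by the exponent n
    of the enclosing power of two. The sketch below computes that exponent;
    the function name is hypothetical, and the result is only the index of
    the step, since the actual number of logic levels per step depends on
    the target device and is not stated here.

```python
def delay_step_index(bit_width):
    """Return n such that 2**(n-1) < bit_width <= 2**n."""
    if bit_width < 1:
        raise ValueError("bit_width must be at least 1")
    # (w - 1).bit_length() equals ceil(log2(w)) for integers w >= 1,
    # without floating-point rounding issues.
    return (bit_width - 1).bit_length()


# Widths 17 to 32 share one step; width 33 starts the next one.
steps = {w: delay_step_index(w) for w in (16, 17, 32, 33, 64)}
```

    Under this grouping a 17-bit comparator sits on the same step as a
    32-bit one, which matches the flat segments followed by jumps described
    for the delay plot.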

    4.3 General Observations

    The power due to nets is about 3 to 4 times greater than the gate power
    for all the architectures. This is primarily because nets in an FPGA are
    not contiguous: they consist of net segments and switching elements, so
    net power also includes the switching power of these interconnecting
    switches. So, total power mostly depends on the number of nets, their
    length and their fanout.

    If the PnR tool performs placement and routing without any constraints,
    power indirectly depends on the placement too. Unconstrained placement
    may cause a circuit to spread over the available region, which increases
    the net length. Power then varies in ways that contradict the expected
    values. One important observation in this regard was that when the power
    of the ripple carry adder is calculated after unconstrained PnR, it is
    almost comparable to the power of the CLA2 circuit. This contradicts
    expectation, as CLA2 is more complex and bigger than the ripple carry
    adder circuit and hence is expected to consume more power. Only after
    some constraints are imposed does the power behave in accordance with
    the architecture.

    [Figure 4.1: Ripple carry Adder Placement. Panels: (a) Before
    Constraints; (b) After Constraints.]

    The constraints can be added indirectly, by assigning pins to the ports,
    or directly, by giving some area restrictions. Applying pin constraints
    does not serve the purpose, for the following reasons:

    • When a particular adder, or any operator, is part of a bigger circuit,
      it will have to be placed together like a single module. If we impose
      pin constraints, on the other hand, the input-output registers are
      placed near the pins and the core circuit away from them. This
      decreases the clock net length but may increase the lengths of other
      nets, which could introduce errors in the power values.

    • Also, providing pin constraints is difficult for circuits of varying
      bit widths, mainly because the pin assignments must pass the Design
      Rule Check (DRC).

    So, for the power analysis experiments, area constraints are given to
    the PnR tool. A region is defined close to the CLK pad. Then, using the
    Physical Design Constraints (PDC) command assign_net_macros, all the
    instances connected to the clock net (the input-output registers) are
    confined within that region. This causes the other circuitry to gather
    around it. If assign_net_macros were applied to all nets, so as to
    gather the entire circuit within the region, a larger region would have
    to be defined to accommodate operators of higher bit widths. This could
    again let the smaller circuits disperse. Hence, the constraint is
    applied only to the instances connected to the clock net.

    Fig. 4.1 shows the placement of a 16-bit ripple carry adder before and
    after applying area constraints.

    Even after such area constraints are applied, the circuits of the ripple
    carry and Brent and Kung adders are scattered to some extent. This
    increases the net length and thus contributes to power.


    Chapter 5

    Power-Aware Architecture

    Selection and Results

    5.1 Power-Delay-Area database

    Present synthesis tools select the best performance (minimum delay)
    implementation of any arithmetic operator. To select an architecture for
    low power, detailed knowledge of the power, delay and area values at
    different bit widths is needed, which indicates which architecture is
    optimum at a given bit width. A database can serve this purpose. It can
    be created by running the power analysis flow (Section 3.1) on the
    different architectures. This is a one-time characterization, from which
    lookup tables can be created and used to select a low power
    implementation.
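    One possible shape for such a lookup table is sketched below. The layout
    (a dictionary keyed by operator type and bit width), the names Metrics,
    PDA_DB and min_power_architecture, and every number in the table are
    hypothetical placeholders chosen for illustration; they are not the
    characterization data measured in this work.

```python
from typing import Dict, NamedTuple, Tuple


class Metrics(NamedTuple):
    power_mw: float   # dynamic power (net + gate)
    delay_ns: float
    area_cores: int


# (operator, bit width) -> {architecture name: Metrics}
Database = Dict[Tuple[str, int], Dict[str, Metrics]]

# Placeholder entries only, NOT measured values.
PDA_DB: Database = {
    ("adder", 16): {
        "ripple": Metrics(power_mw=1.0, delay_ns=3.0, area_cores=10),
        "cla1": Metrics(power_mw=2.0, delay_ns=2.0, area_cores=20),
        "cla2": Metrics(power_mw=3.0, delay_ns=1.0, area_cores=40),
    },
}


def min_power_architecture(db: Database, operator: str, bit_width: int) -> str:
    """Return the architecture with the lowest power for this operator/width."""
    candidates = db[(operator, bit_width)]
    return min(candidates, key=lambda name: candidates[name].power_mw)
```

    Keeping delay and area next to power in each entry lets the same table
    serve both the unconstrained minimum power selection and a selection
    that has to respect a delay limit.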

    5.2 Power-Aware Arithmetic Block Selection

    An optimum power architecture can be selected from among the available
    architectures in the following ways:

    5.2.1 Minimum power

    Select a minimum power implementation, irrespective of delay and area.
    Generally, a low power implementation also has a smaller area. Such a
    selection is useful when the operator is not on the critical path.


    This gives the maximum amount of power that can be saved if a low

    power architecture of an operator is implemented rather than a default one.

    5.2.2 Minimum power with speed constraint

    The algorithm for selecting a minimum power implementation whose delay
    is within a given constraint is shown in Fig. 5.1.

    According to it, given a speed requirement, the minimum power
    implementation is selected first. If it meets the speed requirement, it
    is chosen. Otherwise, the next lowest power implementation is selected.
    This is repeated, so that the implementation finally chosen has the
    minimum possible power among those whose delay is within the given
    constraint. If none of the architectures has a delay less than the
    constrained value, the architecture with minimum delay is selected,
    irrespective of power.

    This algorithm can be used if an operator is on the critical path, or
    if, though not on the critical path, it may violate timing requirements
    when a low power but high delay architecture is selected. For this, the
    synthesis tool may need to extract timing details of the paths on which
    the operators are present and deduce the maximum delay requirement for
    each operator.
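    The selection procedure of Section 5.2.2 can be written as a short
    sketch. The function name, the input layout (architecture name mapped to
    a power and delay pair) and the example numbers are assumptions for
    illustration; the generic names arch_a, arch_b and arch_c are used so
    that the placeholder values are not mistaken for measured results.

```python
def select_architecture(candidates, max_delay_ns):
    """Pick the lowest-power architecture whose delay meets the constraint.

    `candidates` maps architecture name -> (power_mw, delay_ns) and must be
    non-empty. If no architecture meets the constraint, the minimum-delay
    one is returned irrespective of power.
    """
    remaining = dict(candidates)
    while remaining:
        # Select the architecture with minimum power among those left.
        name = min(remaining, key=lambda n: remaining[n][0])
        if remaining[name][1] <= max_delay_ns:
            return name           # delay constraint met: choose it
        del remaining[name]       # otherwise remove it and try the next one
    # List exhausted: fall back to the fastest architecture.
    return min(candidates, key=lambda n: candidates[n][1])


# Placeholder values only, NOT measured data.
example = {
    "arch_a": (1.0, 18.0),
    "arch_b": (2.0, 12.0),
    "arch_c": (4.0, 8.0),
}
choice = select_architecture(example, max_delay_ns=15.0)
```

    Removing candidates one at a time mirrors the flowchart directly.
    Filtering by delay first and then taking the minimum power entry would
    give the same answer in a single pass.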

    5.3 Results

    A utility was developed to select the minimum power architecture,
    irrespective of delay and area. It has two inputs, the operator type and
    its bit width. It looks up all the available implementations of that
    operator and selects the one with minimum power.

    This poweraware architecture is compared to the default operator
    architecture generated by Synplify Pro when synthesizing HDL statements
    like c = a + b for adders, c = a * b for multipliers, etc.

    Power in poweraware mode for adders is 50% less than that in default
    mode. Area up to 32 bits is about 30%, and for wider adders about 50%,
    of the area of the default architecture. However, delay for lower
    bit-width adders is too high in the poweraware architecture. This is
    because timing is not considered, so the ripple carry adder, which has
    a higher delay, is selected here.


    [Figure 5.1: Algorithm for selection of low power architecture.
    Flowchart: get a list of architectures; select the architecture with
    minimum power; check if delay constraints are met. If yes, choose that
    architecture and end. If no, remove that architecture from the list. If
    the list is not empty, repeat the selection; if it is empty, select the
    architecture with minimum delay and end.]

    For multipliers too (Fig. 5.3), poweraware architectures consume about
    60-70% of the power of the default architectures. The delay at lower bit
    widths is smaller for poweraware, while default architectures fare
    better in delay for wider multipliers (including the 48 and 64-bit
    multipliers not shown in the graph).

    As expected, power (Fig. 5.4(a)) and area for comparators are almost the
    same for the poweraware and default implementations, as the
    architectures are not much different. But, as seen from Fig. 5.4(b), the
    delays form a typical pattern, with poweraware having lower delay at bit
    widths that are powers of 2. This is because beyond each such width the
    number of logic levels increases by 1 and its delay rises above that of
    the default implementation.

    This comparison tells us that power consumption can be reduced if a low
    power implementation of an operator is selected where delay is not
    critical.

    Fig. 5.5 gives a comparison of the power of the default and poweraware
    architectures for a 16-bit adder with given delay constraints. The
    poweraware architecture is selected using the algorithm given in Fig.
    5.1. The ripple carry adder, CLA2 and CLA1 architectures are selected by
    the algorithm for delay constraints of 20 ns, 15 ns and 10 ns
    respectively. Fig. 5.5 shows that if the delay constraints are not
    stringent, the algorithm is able to select a low power implementation
    that meets the given speed requirement.

    [Figure 5.2: Default and Power-Aware comparison of Adders. Panels: (a)
    Power (Net+Gate), mW; (b) Delay, ns; (c) Area, # of cores; each plotted
    against bit width. Legend: default, poweraware.]

    [Figure 5.3: Default and Power-Aware comparison of Multiplier. Panels:
    (a) Power (Net+Gate), mW; (b) Delay, ns; (c) Area, # of cores; each
    plotted against bit width. Legend: default, poweraware.]

    [Figure 5.4: Power and delay plots for Comparator architectures.
    Panels: (a) Power (Net+Gate), mW; (b) Delay, ns; each plotted against
    bit width. Legend: default, poweraware.]

    [Figure 5.5: Comparison of power for 16-bit adder with given delay
    constraints. Power (mW) plotted for delay constraints of 20, 15 and
    10 ns. Legend: default, poweraware.]

    Chapter 6

    Conclusion

    6.1 Conclusion

    Synthesis tools, which by default choose the minimum delay
    implementation of an arithmetic operator, can be made to select a low
    power alternative where timing is not critical. This can save a
    considerable amount of power.

    The net power is many times greater than the gate power, as the
    interconnection wires and switches consume more power. The power
    consumed by a net depends strongly on its length. Thus, power depends
    not only on the number of instances but also on their placement, as
    scattered placement increases the lengths of the nets.

    So, for operator architectures whose cells are not placed together,
    blocks must be created by manually placing the complete module together
    so as to minimize net length. Such blocks can then be used during actual
    physical synthesis.

    Also, for power-aware synthesis, decisions need to be taken at the logic
    synthesis as well as the physical synthesis stage. Some guidelines need
    to be set for the PnR tool so that it identifies that all instances in a
    particular operator module should be placed together.

    References

    [9] Hesham Al-Twaijry, Michael Flynn; Performance/Area Tradeoffs in
    Booth Multipliers; Technical Report CSL TR-95-684, pp. 1-18, Nov. 1995.

    [10] C. S. Wallace; A Suggestion for a Fast Multiplier; IEEE Trans.
    Electronic Computers, vol. 13, pp. 14-17, 1964.

    [11] J. M. Rabaey, A. Chandrakasan, and B. Nikolic; Digital Integrated
    Circuits, A Design Perspective, Second Edition; Prentice-Hall, 2003.

    [12] L. Shang, A. Kaviani, and K. Bhadala; Dynamic Power Consumption of
    the Virtex-2 FPGA family; ACM International Symposium on
    Field-Programmable Gate Arrays, pp. 157-164, Monterey, CA, 2002.