
Design of Low Power ALU using Area Efficient Carry Select Adder

CHAPTER 1

OVERVIEW OF THE PROJECT

1.1 Introduction

Designing low power VLSI circuits with small area and high speed has become a main concern for digital designers. Low power VLSI systems are in high demand because of the fast growing technology in mobile communications and computation. Battery technology does not advance at the same rate as microelectronics technology, so only a limited amount of power is available to mobile systems. Designers therefore face several constraints at once: high speed, high throughput, small silicon area, and low power consumption. Building low power, high performance adder cells is consequently of great interest [1]-[5].

Over the past few decades, the electronics industry has been experiencing an

unprecedented spurt in growth, thanks to the use of integrated circuits in computing,

telecommunications and consumer electronics. We have come a long way from the

single transistor era in 1958 to the present day ULSI (Ultra Large Scale Integration)

systems with more than 50 million transistors in a single chip [6].

As the performance of processors has increased, the demand for high

speed arithmetic blocks has also increased. With clock frequencies approaching 1

GHz, arithmetic blocks must keep pace with the continued demand for more

computational power. The purpose of this thesis is to present methods of
implementing an area and power efficient carry select adder.

To reduce the power and area requirements of complex computational blocks,
transistor sizes have been shrunk into the deep sub-micron region [7], a
reduction that is handled predominantly by process engineering.

Several adder designs have been proposed to reduce the power

consumption. Logic minimization not only results in better system throughput but also

results in low power consumption designs. For low power results it is always

Department of ECE, MRITS 1


advisable to use CMOS technology in which the power dissipation is a complex

function of the gate delays, clock frequency, process parameters, circuit topology and

structure, and the input vectors applied. Once the processing and structural parameters

have been fixed, the measure of power dissipation is dominated by the switching

activity (toggle count) of the circuit. The dynamic power is given by

$P = \frac{1}{2} \cdot C_{load} \cdot \frac{V_{dd}^{2}}{T_{cycle}} \cdot E(\text{switching})$

where C_load is the load capacitance of the gate, T_cycle is the clock cycle time, E(switching) is the expected number of signal transitions per cycle, and V_dd is the supply voltage [8].
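As a rough illustration of how the terms of this formula interact, the sketch below evaluates it in Python. The function name `dynamic_power` and the operating point (10 fF, 1.2 V, 1 ns, 0.5 transitions per cycle) are assumptions chosen for illustration, not values taken from the design.

```python
def dynamic_power(c_load, v_dd, t_cycle, e_switching):
    """Dynamic power in watts: P = 1/2 * C_load * (V_dd^2 / T_cycle) * E(switching)."""
    return 0.5 * c_load * (v_dd ** 2 / t_cycle) * e_switching


# Hypothetical operating point (illustrative values, not measured data):
# 10 fF load, 1.2 V supply, 1 ns clock period, 0.5 transitions per cycle.
p_base = dynamic_power(c_load=10e-15, v_dd=1.2, t_cycle=1e-9, e_switching=0.5)

# Halving the switching activity halves the power,
# while halving the supply voltage reduces it to one quarter.
p_low_activity = dynamic_power(10e-15, 1.2, 1e-9, 0.25)
p_low_voltage = dynamic_power(10e-15, 0.6, 1e-9, 0.5)

print(f"base power: {p_base * 1e6:.2f} uW")
```

The quadratic dependence on V_dd is the reason supply voltage scaling is usually the first lever considered, while the linear dependence on E(switching) motivates logic structures with fewer toggling nodes.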

1.2 Objective

To design a high speed Arithmetic Logic Unit (ALU) by using an efficient carry select adder.

The adder is the most important block in an ALU, and the speed of the ALU is limited by the adder because the carry has to propagate through a large number of bit positions. In digital adders, the Ripple Carry Adder (RCA) is modified into the Carry Select Adder (CSLA) to speed up the operation, and for still higher speed the CSLA is replaced by the SQRT CSLA. The CSLA is used in many computational systems to alleviate the problem of carry propagation delay by independently generating multiple carries and then selecting a carry to generate the sum [9]-[10]. However, the CSLA is not area efficient because it uses multiple pairs of RCAs to generate the partial sum and carry for carry inputs Cin=0 and Cin=1; the final sum and carry are then selected by multiplexers (mux) [11]-[15].

1.2.1 Existing SQRT Carry Select Adder

In general the complete SQRT CSLA is divided into different blocks. Block size and the number of blocks depend upon the size of the SQRT CSLA according to the SQRT technique. From the second block onwards, each block contains three levels: the first level is a ripple carry adder with input carry zero, the second level is a ripple carry adder with input carry one, and the third level is a multiplexer that selects one of the two ripple carry adder outputs according to the carry of the previous block. The disadvantage of the SQRT CSLA is its larger area requirement, since it uses two levels of RCAs.

For better area efficiency [13]-[15], a Binary to Excess-1 Converter (BEC) is used in place of the RCA with Cin=1 of the regular CSLA. To replace an n-bit RCA, an (n+1)-bit BEC is required.
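The idea of replacing the Cin=1 adder with a BEC can be sketched behaviourally as follows. This is a minimal Python model, not the transistor-level design: the helper names (`to_bits`, `to_int`, `bec`, `block_results`) are hypothetical, and the Cin=0 RCA is modelled with integer addition.

```python
def to_bits(value, width):
    """Integer -> list of bits, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def to_int(bits):
    """List of bits (LSB first) -> integer."""
    return sum(bit << i for i, bit in enumerate(bits))


def bec(bits):
    """Binary to Excess-1 Converter: adds 1 to the input word (same width).

    Output bit i is the input bit XORed with the AND of all lower input bits,
    so bit 0 is simply inverted.
    """
    out = []
    lower_all_ones = 1  # AND of all lower-order input bits (empty AND = 1)
    for bit in bits:
        out.append(bit ^ lower_all_ones)
        lower_all_ones &= bit
    return out


def block_results(a, b, n):
    """Results of one n-bit block for Cin=0 and Cin=1, each n+1 bits wide."""
    rca0 = a + b  # behavioural stand-in for the n-bit RCA with Cin=0
    rca1 = to_int(bec(to_bits(rca0, n + 1)))  # (n+1)-bit BEC replaces the second RCA
    return rca0, rca1


print(block_results(0b101, 0b011, 3))
```

The BEC has to be n+1 bits wide because it converts the n sum bits together with the carry-out of the Cin=0 adder.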

Although the BEC technique reduces area and power [16], the reduction is not considerable, and the design is not suitable for sub-threshold level modifications.

The drawback of this logic structure is that it does not reduce the area and power to a satisfactory level, and there is still scope to reduce the delay. In order to reduce the power and area, a new logic structure for a BEC is proposed.

1.2.2 Proposed SQRT Carry Select Adder

The 16-bit SQRT CSLA using a BEC in its second level requires 792 transistors. The proposed logic offers scope to reduce the number of transistors, and with it the area and the power dissipation: implementing a 16-bit SQRT CSLA with the proposed logic requires 736 transistors.

The proposed logic, used in place of the second-level RCA, is Special Hardware using Multiplexers (SHM). The inputs are applied to the first-level RCA, and the output of the RCA is applied to the second-level SHM and then to the third-level multiplexer. The third-level multiplexer selects either the RCA output or the SHM output according to the previous carry.

Using the proposed logic, an 8-bit Arithmetic Logic Unit (ALU) is designed that performs arithmetic operations such as addition, subtraction, increment and decrement, and logical operations such as AND, OR, XOR and XNOR.
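The operation set listed above can be summarised with a small behavioural model. The sketch below is only a functional reference in Python: the operation names used as selectors and the function `alu` are assumptions, and the adder is modelled with integer arithmetic rather than with the proposed carry select structure.

```python
MASK = 0xFF  # 8-bit data path


def alu(op, a, b=0):
    """Behavioural 8-bit ALU; 'op' is a hypothetical operation selector."""
    a &= MASK
    b &= MASK
    if op == "ADD":
        result = a + b
    elif op == "SUB":
        result = a + ((~b) & MASK) + 1  # two's complement subtraction through the adder
    elif op == "INC":
        result = a + 1
    elif op == "DEC":
        result = a + MASK  # adding 0xFF equals subtracting 1 modulo 256
    elif op == "AND":
        result = a & b
    elif op == "OR":
        result = a | b
    elif op == "XOR":
        result = a ^ b
    elif op == "XNOR":
        result = ~(a ^ b)
    else:
        raise ValueError(f"unknown operation: {op}")
    return result & MASK  # keep the 8 result bits, discard the carry


print(alu("ADD", 200, 100), alu("SUB", 5, 7), alu("XNOR", 0xF0, 0xFF))
```

All four arithmetic operations are expressed as additions, which is why the adder dominates the speed of the ALU.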

1.3 Tools used

SOFTWARE:

Logic Editor: DSCH2.6c

Layout Editor: Microwind 2.6a

The performance of the proposed design is analyzed by simulation. The simulations are performed in 120 nm (0.12 µm) technology using the simulation tool Microwind2, with a power supply of 1.2 V and a nominal temperature of 27°C, to extract the critical path delay and power consumption.

1.4 Thesis outline

The next chapter presents the literature survey, covering different types of adders, low power design techniques used in the design of a low power ALU, and different logic styles.

Chapter 3 describes the existing designs, an 8-bit ALU using ripple carry adders and the implementation of the SQRT CSLA using the BEC technique.

Chapter 4 describes the implementation of the proposed SQRT CSLA and of the proposed ALU using the efficient carry select adder.

Comparative analysis and results are shown in chapter 5.

Conclusion and future scope are discussed in chapter 6.



CHAPTER 2

LITERATURE SURVEY

2.1 Introduction

In nearly all digital IC designs today, the addition operation is one of the most essential and frequent operations. Instruction sets for DSPs and general purpose processors include at least one type of addition. Other instructions such as subtraction and multiplication employ addition in their operations, and their underlying hardware is similar if not identical to addition hardware. Often, an adder or multiple adders will be in the critical path of the design, hence the performance of a design will often be limited by the performance of its adders. When looking at other attributes of a chip, such as area or power, the designer will find that the hardware for addition is a large contributor to these as well. It is therefore beneficial to choose the correct adder to implement in a design because of the many factors it affects in the overall chip. In this chapter we begin with the basic building blocks used for addition, then go through different algorithms and name their advantages and disadvantages.

2.2 Basic Adder Blocks

2.2.1 Half Adder

The half adder is an example of a simple, functional digital circuit built from two logic gates. The half adder adds two one-bit binary numbers (A and B). The outputs are the sum of the two bits (S) and the carry (C). Note how the same two inputs are directed to two different gates: the inputs to the XOR gate are also the inputs to the AND gate. The input wires of the XOR gate are tied to the input wires of the AND gate; thus, when a voltage is applied to the A input of the XOR gate, the A input of the AND gate receives the same voltage.

$S = A \oplus B$    (2.1)

$C = A \cdot B$    (2.2)



Fig.2.1 Half adder

2.2.2 Full Adder

In electronics, an adder is a digital circuit that performs addition of numbers. Full adders are fundamental units in various circuits, especially in circuits used for performing arithmetic operations, such as compressors, comparators, parity checkers, and arithmetic logic units. The full adder takes into account a carry input so that multiple adders can be used to add larger numbers. To remove ambiguity between the input and output carry lines, the carry in is labeled Cin while the carry out is labeled Cout. The full-adder circuit adds three one-bit binary numbers (Cin, A, B) and outputs two one-bit binary numbers, a sum (S) and a carry (Cout). The full adder is usually a component in a cascade of adders, which add 8-bit, 16-bit, 32-bit, etc. binary numbers. The carry input of the full-adder circuit comes from the carry output of the circuit "above" itself in the cascade, and its carry output is fed to the full adder "below" itself in the cascade. Hence, a full adder is a digital circuit that performs an addition operation on three binary digits and produces a sum and a carry value, which are both binary digits. It can be combined with other full adders or work on its own.

Fig.2.2 Schematic symbol of 1-bit full-adder cell (inputs A, B and CIN; outputs S and CO)


The final OR gate before the carry-out output may be replaced by an XOR gate

without altering the resulting logic. This is because the only discrepancy between OR

and XOR gates occurs when both inputs are 1; for the adder shown here, one can

check this is never possible. Using only two types of gates is convenient if one desires

to implement the adder directly using common IC chips.

A full adder can be constructed from two half adders by connecting A and B to the inputs of one half adder, connecting the sum from that half adder to one input of the second half adder, connecting Ci to its other input, and ORing the two carry outputs. Equivalently, S can be made the three-input XOR of A, B, and Ci, and Co can be made the three-input majority function of A, B, and Ci. The output of the full adder is the two-bit arithmetic sum of three one-bit numbers.

Figure 2.3 Circuit diagram of 1-bit full-adder cell

$S = A \oplus B \oplus C_{in}$    (2.3)

$C_{out} = A \cdot B + C_{in} \cdot (A \oplus B)$    (2.4)
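The construction of a full adder from two half adders, as described above, can be checked with a short Python sketch. The function names `half_adder` and `full_adder` are illustrative only; bits are modelled as the integers 0 and 1.

```python
def half_adder(a, b):
    """Return (sum, carry) of two one-bit inputs: XOR gate and AND gate."""
    return a ^ b, a & b


def full_adder(a, b, cin):
    """Full adder built from two half adders and one OR gate."""
    s1, c1 = half_adder(a, b)     # first half adder adds A and B
    s, c2 = half_adder(s1, cin)   # second half adder adds the partial sum and Cin
    return s, c1 | c2             # the two carries are ORed to form Cout


# The two half-adder carries are never 1 at the same time,
# which is why the final OR gate may be replaced by an XOR gate.
for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            s, cout = full_adder(a, b, cin)
            print(a, b, cin, "->", cout, s)
```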

2.2.3 Partial Full Adder

The Partial Full Adder (PFA) is a structure that implements intermediate signals that can be used in the calculation of the carry bit. It is an extension of the FA that includes the signals generate (g), kill (k), and propagate (p). When g=1, the carryout will be 1 (generated) regardless of the carry-in. When k=1, the carryout will be 0 (killed) regardless of the carry-in. When p=1, the carryout will equal the carry-in (the carry-in will be propagated). Table 2.1 reflects these three additional signals, with a comment on the carryout bit in an additional column. Equations 2.5 − 2.7 are the Boolean equations for generate, kill, and propagate, respectively. It should be noted that for the propagate signal, the XOR function can also be used, since in the case a = b = 1 the generate signal will already assert that the carryout is 1. The Boolean equations for the sum and carryout can now be written as functions of g, p, or k, as shown by Equations 2.8 and 2.9. Figure 2.4 shows a circuit for creating the generate, propagate, and sum signals. It is a partial full adder because it does not calculate the carryout signal directly; rather, it creates the signals needed to calculate the carryout signal.

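$Generate_i \; (g_i) = a_i \cdot b_i$    (2.5)

$Kill_i \; (k_i) = \overline{a_i} \cdot \overline{b_i}$    (2.6)

$Propagate_i \; (p_i) = a_i + b_i$    (2.7)

$Sum_i = p_i \oplus carry\text{-}in_i$, with $p_i$ taken in its XOR form $a_i \oplus b_i$    (2.8)

$carry\text{-}out_{i+1} = a_i \cdot b_i + b_i \cdot carry\text{-}in_i + a_i \cdot carry\text{-}in_i = g_i + p_i \cdot carry\text{-}in_i$    (2.9)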

Figure 2.4 Generation of GENERATE, PROPAGATE and SUM



        Inputs    |            Outputs
Carry-in   a   b  |  Carry-out  Sum  g  k  p  |  Carry status
   0       0   0  |      0       0   0  1  0  |  delete
   0       0   1  |      0       1   0  0  1  |  propagate
   0       1   0  |      0       1   0  0  1  |  propagate
   0       1   1  |      1       0   1  0  1  |  generate/propagate
   1       0   0  |      0       1   0  1  0  |  delete
   1       0   1  |      1       0   0  0  1  |  propagate
   1       1   0  |      1       0   0  0  1  |  propagate
   1       1   1  |      1       1   1  0  1  |  generate/propagate

Table 2.1 Truth table of partial full adder
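The entries of Table 2.1 can be regenerated from the signal definitions. The Python sketch below is illustrative (the names `pfa_signals` and `truth_table` are assumptions); it uses the OR form of propagate, as in the table, and the XOR of the inputs for the sum.

```python
def pfa_signals(a, b):
    """Generate, kill and propagate signals of one bit position."""
    g = a & b              # carry is generated
    k = (a ^ 1) & (b ^ 1)  # carry is killed (both inputs are 0)
    p = a | b              # carry-in is propagated (OR form)
    return g, k, p


def truth_table():
    """Rows of (carry_in, a, b, carry_out, sum, g, k, p) in table order."""
    rows = []
    for cin in (0, 1):
        for a in (0, 1):
            for b in (0, 1):
                g, k, p = pfa_signals(a, b)
                carry_out = g | (p & cin)  # carryout from g, p and carry-in
                total = (a ^ b) ^ cin      # sum uses the XOR of the inputs
                rows.append((cin, a, b, carry_out, total, g, k, p))
    return rows


for row in truth_table():
    print(*row)
```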

2.3 Adder Algorithms

2.3.1 Ripple Carry Adder

The Ripple Carry Adder (RCA) is one of the simplest adders to implement.

This adder takes in two N-bit inputs (where N is a positive integer) and produces (N +

1) output bits (an N-bit sum and a 1-bit carryout). The RCA is built from N full adders

cascaded together, with the carryout bit of one FA tied to the carry-in bit of the next

FA. Figure 2.5 shows the schematic for an N-bit RCA. The input operands are labeled 'a' and 'b', the carryout of each FA is labeled Cout (which is equivalent to the carry-in (Cin) of the subsequent FA), and the sum bits are labeled sum. Each sum bit requires both input operands and Cin before it can be calculated. To estimate the propagation delay of this adder, we look at the worst case delay over every possible combination of inputs; the path that produces this delay is known as the critical path. The most significant sum bit can only be calculated when the carryout of the previous FA is known. In the worst case (when every carryout is 1), this carry bit needs to ripple across the structure from the least significant position to the most significant position. Figure 2.6 has a darkened line indicating the critical path.



Hence, the delay of this implementation of the adder is expressed in Equation 2.10, where t_RCAcarry is the delay for the carryout of a FA and t_RCAsum is the delay for the sum of a FA.

Propagation delay: $t_{RCAgroup} = (N-1) \cdot t_{RCAcarry} + t_{RCAsum}$    (2.10)

From Equation 2.10, we can see that the delay is proportional to the length of

the adder. An example of a worst case propagation delay input pattern for a 4 bit

ripple carry adder is where the input operands change from 1111 and 0000 to 1111

and 0001, resulting in a sum changing from 01111 to 10000.
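The rippling behaviour and the delay expression can be illustrated with a bit-level Python sketch. The function names and the unit delays used in the example are assumptions for illustration; bit lists are ordered least significant bit first.

```python
def ripple_carry_add(a_bits, b_bits, cin=0):
    """Add two equal-length bit lists (LSB first) with a chain of full adders."""
    sum_bits = []
    carry = cin
    for a, b in zip(a_bits, b_bits):
        sum_bits.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))  # carryout feeds the next full adder
    return sum_bits, carry


def rca_delay(n, t_carry, t_sum):
    """Worst case delay of an N-bit RCA: (N-1) carry delays plus one sum delay."""
    return (n - 1) * t_carry + t_sum


# Worst case pattern from the text: 1111 + 0001 (bits listed LSB first).
sum_bits, carry_out = ripple_carry_add([1, 1, 1, 1], [1, 0, 0, 0])
print(carry_out, sum_bits[::-1])          # carryout followed by the sum, MSB first
print(rca_delay(4, t_carry=1, t_sum=1))   # delay in hypothetical unit gate delays
```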

From a VLSI design perspective, this is the easiest adder to implement. One just needs to design and lay out one FA cell, and then array N of these cells to create an N-bit RCA. The performance of the one FA cell will largely determine the speed of the whole RCA. From the critical path in Equation 2.10, minimizing the carryout delay (t_RCAcarry) of the FA will minimize the overall propagation delay of the adder. Various implementations of the FA cell therefore aim to minimize the carryout delay.

Figure 2.5 Schematic for an N-bit Ripple Carry Adder

Figure 2.6 Critical paths for an N-bit Ripple Carry Adder



2.3.2 Carry Skip Adder

From examination of the RCA, the limiting factor for speed in that adder is the propagation of the Cout bit. The Carry Skip Adder (CSKA, also known as the Carry Bypass Adder) addresses this issue by looking at groups of bits and determining whether each group has a carryout or not. This is accomplished by creating a group propagate signal (P_CSKAgroup) to determine whether the group carry-in (carry-in_CSKAgroup) will propagate across the group to the carryout (carry-out_CSKAgroup). To explore the operation of the whole CSKA, take an N-bit adder and divide it into N/M groups, where M is the number of bits per group. Each group contains a 2-to-1 multiplexer, logic to calculate M sum bits, and logic to calculate P_CSKAgroup. The select line for the mux is simply the P_CSKAgroup signal, and it chooses between carry-in_CSKAgroup and cout4, the carryout of the most significant full adder in the group.

To aid the explanation, we refer the reader to Figure 2.7, which shows the hardware for a group of 4 bits (M=4) in the CSKA. There are four full adders cascaded together and each FA creates a carryout (cout), a propagate (p) signal, and a sum (sum not shown). The propagate signal from each FA comes at no extra hardware cost since it is calculated in the sum logic (the hardware is identical to the sum hardware for the PFA shown in Figure 2.4). For carry-out_CSKAgroup to equal carry-in_CSKAgroup, all of the individual propagates must be asserted (Equations 2.11 and 2.12). If this is true, then carry-in_CSKAgroup "skips" past the group of full adders and equals carry-out_CSKAgroup. For the case where P_CSKAgroup is 0, at least one of the propagate signals is 0. This implies that a delete and/or a generate occurred in the group. A delete signal simply means that the carryout for the group is 0 regardless of the carry-in, and a generate signal means that the carryout is 1 regardless of the carry-in. This is advantageous because it implies that the carryout for the group is not dependent on the carry-in. No hardware is needed to implement these two signals because the group carryout signal will reflect whichever of the three cases (a delete, a generate, or a group propagate) occurred. The additional hardware to realize the group carryout in Figure 2.7 is a 4-input AND gate and a 2-to-1 multiplexer (mux). In general, an M-input AND gate and a 2-to-1 mux are required for a group of M bits, along with the logic to calculate the sum bits.



$P_{CSKAgroup} = p_0 \cdot p_1 \cdot p_2 \cdot p_3$    (2.11)

$carry\text{-}out_{CSKAgroup} = carry\text{-}in_{CSKAgroup} \cdot P_{CSKAgroup}$    (2.12)
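The group behaviour described above (ripple chain, group propagate from an AND of the bit propagates, and a 2-to-1 mux on the carryout) can be sketched in Python. This is a behavioural model with assumed names (`cska_group`, `carry_skip_add`); the mux explicitly covers the case where the group propagate is 0, in which the carryout of the most significant full adder is selected.

```python
def cska_group(a_bits, b_bits, group_cin):
    """One carry skip group; bit lists are LSB first."""
    sums, props = [], []
    carry = group_cin
    for a, b in zip(a_bits, b_bits):
        p = a ^ b                      # propagate comes for free from the sum logic
        props.append(p)
        sums.append(p ^ carry)
        carry = (a & b) | (p & carry)  # ripple carry inside the group
    group_p = int(all(props))          # M-input AND of the bit propagates
    # 2-to-1 mux: skip the carry-in past the group, or take the rippled carryout.
    group_cout = group_cin if group_p else carry
    return sums, group_cout, group_p


def carry_skip_add(a, b, n=16, m=4):
    """N-bit carry skip adder made of N/M groups; returns (sum, group propagates)."""
    result, carry, group_props = 0, 0, []
    for start in range(0, n, m):
        a_bits = [(a >> (start + i)) & 1 for i in range(m)]
        b_bits = [(b >> (start + i)) & 1 for i in range(m)]
        sums, carry, group_p = cska_group(a_bits, b_bits, carry)
        group_props.append(group_p)
        for i, s in enumerate(sums):
            result |= s << (start + i)
    return result | (carry << n), group_props


total, props = carry_skip_add(0b1111111111111000, 0b0000000000001000)
print(bin(total), props)
```

With these operands the first group generates a carry while the three upper groups all have their group propagate asserted, so the carry generated in the first group is passed on by each of their muxes.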

In examining the critical path for the CSKA, we are primarily concerned with whether the carry-in can be propagated ("skipped") across a group or not. Assuming all input bits come into the adder at the same time, each group can calculate the group propagate signal (mux select line) simultaneously. Every mux then knows which signal to pass as the carryout of the group. There are two cases to consider after the mux select line has been determined. In the first case, carry-in_CSKAgroup will propagate to the carryout. This means P_CSKAgroup=1 and the carryout is dependent on the carry-in. In the second case, the carryout signal of the most significant adder will become the group carryout. This means P_CSKAgroup=0 and the carryout is independent of the carry-in. If we isolate the particular group (as in Figure 2.7), the second case (signal cout4) always takes longer because the carryout signal must be calculated through logic, whereas the first case (carry-in_CSKAgroup) requires only a wire to propagate the signal. Looking at the whole architecture, however, this second case is part of the critical path for only the first CSKA group. Since the second case is not dependent on the group carry-in, all the groups in the CSKA can compute the carryout in parallel. If a group needs its carry-in (P_CSKAgroup=1), then it must wait until it arrives after being calculated from a previous group. In the worst case, a carryout must be calculated in the first group, and every group afterwards needs to propagate this carryout. When the final group receives this propagated signal, then it can calculate its sum bits. Figure 2.8 shows a 16-bit CSKA with 4-bit groups and Figure 2.9 shows a darkened line indicating the critical path of the signals in the 16-bit CSKA.

If we assume a 16-bit CSKA with 4-bit groups, with each group containing a 4-bit RCA for the sum logic, then the worst case propagation delay through this adder is expressed in Equation 2.13. In this equation, t_RCAcarry and t_RCAsum are the delays to calculate the carryout and sum signals of an RCA, respectively. Each group has 4 bits, so the delay through the first group has 4 RCA carryout delays. This carryout of the first group potentially propagates through 3 muxes, where one mux delay is expressed as t_muxdelay. Finally, when the carryout signal reaches the final group, the sum for this group can be calculated. This is represented by the final two components of Equation 2.13.

Figure 2.7 One group in a Carry Skip Adder, in this case M=4

Figure 2.8 A 16-bit Carry Skip Adder N=16, M=4

Figure 2.9 Critical path through 16-bit CSKA



$t_{CSKA16} = 4 \cdot t_{RCAcarry} + 3 \cdot t_{muxdelay} + 3 \cdot t_{RCAcarry} + t_{RCAsum}$    (2.13)

For Equation 2.13, there are some assumptions about the delay through the circuit. First, we assume in the first CSKA group that the group propagate signal is calculated before the carryout of the most significant adder. Thus, the mux for this first group is waiting for the carryout. For the final CSKA group, we assume that it takes longer for sum15 to be calculated than for sum16. Once the carry-in for this last group is known, the delay for sum16 is the delay of the mux; for sum15 it is a delay of 3 · t_RCAcarry + t_RCAsum (3 ripples through the adder before the last sum bit can be calculated).

For an N-bit CSKA, the critical path equation is expressed in Equation 2.14. M represents the number of bits in each group. There are N/M groups in the adder, and every mux in the adder except for the last one is in the critical path. As in Equation 2.13, Equation 2.14 assumes that each group contains a ripple carry adder.

$t_{CSKAN} = M \cdot t_{RCAcarry} + \left(\frac{N}{M} - 1\right) \cdot t_{muxdelay} + (M-1) \cdot t_{RCAcarry} + t_{RCAsum}$    (2.14)
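To get a feel for the critical path expression, the sketch below evaluates it next to the ripple carry delay of Equation 2.10. The function names are hypothetical and all delays are set to one arbitrary unit, which is an assumption made purely for comparison.

```python
def rca_delay(n, t_carry, t_sum):
    """Ripple carry adder: (N-1) carry delays plus one sum delay."""
    return (n - 1) * t_carry + t_sum


def cska_delay(n, m, t_carry, t_mux, t_sum):
    """Carry skip adder with N/M groups of M bits, each built from an RCA."""
    groups = n // m
    first_group = m * t_carry           # carryout ripples through the first group
    skips = (groups - 1) * t_mux        # every mux except the last one
    last_group = (m - 1) * t_carry + t_sum
    return first_group + skips + last_group


# Hypothetical unit delays for carry, mux and sum.
for group_size in (2, 4, 8):
    print(group_size, cska_delay(16, group_size, 1, 1, 1))
print("RCA:", rca_delay(16, 1, 1))
```

Under these unit delays the 16-bit adder with 4-bit groups needs 11 units against 16 for the plain RCA; with real gate delays the best group size depends on the ratio between the mux delay and the carry delay.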

From a VLSI design perspective, this adder shows improved speed over an RCA without much area increase. The additional hardware comes from the 2-to-1 mux and group propagate logic in each group, which is about 15% more area. One drawback to this structure is that its delay is still linearly dependent on the width of the adder; therefore, for large adders where speed is important, the delay may be unacceptable. Also, there is a long wire in between the groups that carry-out_CSKAgroup needs to travel on. This path begins at the carryout of the first CSKA group and ends at the carry-in to the final CSKA group. This signal also needs to travel through ((N/M)-1) muxes, and these will introduce long delays and signal degradation if pass gate muxes are used. If buffers are required in between these groups to restore the signal, then the critical path is lengthened. An example of a worst case delay input pattern for a 16-bit CSKA with 4-bit groups is where the input operands are 1111111111111000 and 0000000000001000. This forces a carryout in the first group that skips through the middle two groups and enters the final group. This carry-in to the final group ripples through to the final sum bit (sum15). To determine the optimal speed for this adder, one needs to find the delay through a mux and the carryout delay of a FA. It is one of these two delays that will dominate the delay of the whole CSKA. For short adders (≤ 16 bits), the carryout delay of a FA will probably dominate, and for long adders the long wire that skips through stages and muxes will probably dominate the delay.

2.3.3 Carry Look Ahead Adder

From the critical path equations in Sections 2.3.1 and 2.3.2, the delay is linearly dependent on N, the length of the adder. It is also shown in Equations 2.10 and 2.14 that the carryout delay (t_carryout) contributes largely to the total delay. An algorithm that reduces the time to calculate the carryout and removes the linear dependency on N can greatly speed up the addition operation. Equation 2.9 shows that the carryout can be calculated with g, p, and carry-in. The signals g and p are not dependent on carry-in, and can be calculated as soon as the two input operands arrive. Weinberger and Smith invented the Carry Look Ahead (CLA) Adder [19]. Using Equation 2.9, we can write the carryout equations for a 4-bit adder. These equations are shown in Equations 2.15−2.18, where c_i represents the carryout of the ith position (0 ≤ i ≤ (N − 1)), and g_i and p_i are the generate and propagate signals of that position. Expanding each equation allows every carry to be calculated with just the input operands and the initial carry-in (c0). This process of calculating c_i by using only the p_i, g_i and c0 signals can be continued indefinitely; however, each subsequent carryout generated in this manner becomes increasingly difficult to implement because of the large number of high fan-in gates [20].

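$c_1 = g_0 + p_0 \cdot c_0$    (2.15)

$c_2 = g_1 + p_1 \cdot c_1 = g_1 + p_1 \cdot g_0 + p_1 \cdot p_0 \cdot c_0$    (2.16)

$c_3 = g_2 + p_2 \cdot c_2 = g_2 + p_2 \cdot g_1 + p_2 \cdot p_1 \cdot g_0 + p_2 \cdot p_1 \cdot p_0 \cdot c_0$    (2.17)

$c_4 = g_3 + p_3 \cdot c_3 = g_3 + p_3 \cdot g_2 + p_3 \cdot p_2 \cdot g_1 + p_3 \cdot p_2 \cdot p_1 \cdot g_0 + p_3 \cdot p_2 \cdot p_1 \cdot p_0 \cdot c_0$    (2.18)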
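The expanded carry equations can be exercised with a small Python model of a 4-bit CLA adder. The function names are illustrative; each carry is written as the generate term of its own position plus the propagated lower-order terms, and the propagate signal is taken in its XOR form so that it can also be reused for the sum.

```python
def cla_carries(g, p, c0):
    """Carries c1..c4 of a 4-bit block from g, p (lists, LSB first) and c0 only."""
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))
    return c1, c2, c3, c4


def cla_add4(a, b, c0=0):
    """4-bit carry look-ahead addition; returns the 5-bit result (c4 is the 5th sum bit)."""
    a_bits = [(a >> i) & 1 for i in range(4)]
    b_bits = [(b >> i) & 1 for i in range(4)]
    g = [x & y for x, y in zip(a_bits, b_bits)]
    p = [x ^ y for x, y in zip(a_bits, b_bits)]
    carries = (c0,) + cla_carries(g, p, c0)
    total = sum((p[i] ^ carries[i]) << i for i in range(4))
    return total | (carries[4] << 4)


print(cla_add4(0b1011, 0b0110))
```

No carry in `cla_carries` depends on another computed carry, which is the property that removes the ripple from the critical path.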

The CLA adder uses partial full adders as described in Section 2.2.3 to

calculate the Generate and propagate signals needed for the carryout equations. Figure

2.10 shows the schematic for a 4-bit CLA Adder. The CLA logic block implements

the logic in Equations 2.15−2.18, and the gate schematic for this block is in Figure



2.11. For a 4-bit CLA adder the 4th carryout signal can also be considered as the 5th

sum bit.

Although it is impractical to have a single level of carry look-ahead logic for long adders, this can be solved by adding another level of carry look-ahead logic. To achieve this, each adder block requires two additional signals: a group generate and a group propagate. The equations for these two signals, assuming adder block sizes of 4 bits, are shown in Equations 2.19 and 2.20. A group generate occurs if a carry is generated within the adder block, and a group propagate occurs if the carry-in to the adder block will be propagated to the carryout. Figure 2.11 shows the gate schematic of the two additional signals.

Group Generate $= g_3 + p_3 \cdot g_2 + p_3 \cdot p_2 \cdot g_1 + p_3 \cdot p_2 \cdot p_1 \cdot g_0$    (2.19)

Group Propagate $= p_3 \cdot p_2 \cdot p_1 \cdot p_0$    (2.20)
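Following the verbal definitions above (a group generate when the block produces a carry by itself, a group propagate when the block passes its carry-in through), the two signals can be modelled and checked as below. This is a sketch with assumed function names; the check confirms that the block carryout equals group generate OR (group propagate AND carry-in).

```python
def group_generate_propagate(g, p):
    """Group signals of a 4-bit block from bit-level g and p (lists, LSB first)."""
    group_g = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
    group_p = p[3] & p[2] & p[1] & p[0]
    return group_g, group_p


def block_carry_out(a, b, c0):
    """Carryout of a 4-bit block computed only from the group signals and c0."""
    a_bits = [(a >> i) & 1 for i in range(4)]
    b_bits = [(b >> i) & 1 for i in range(4)]
    g = [x & y for x, y in zip(a_bits, b_bits)]
    p = [x ^ y for x, y in zip(a_bits, b_bits)]
    group_g, group_p = group_generate_propagate(g, p)
    return group_g | (group_p & c0)


print(block_carry_out(0b1111, 0b0000, 1), block_carry_out(0b1000, 0b1000, 0))
```

Neither group signal depends on the carry-in, so a second level of look-ahead logic can treat each 4-bit block exactly like a single bit position.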

With multiple levels of CLA logic, carry look-ahead adders of any length can be built. To illustrate the use of another level of CLA logic, Figure 2.11 shows the schematic for a 16-bit CLA adder. There is a second level of CLA logic which takes the group generate and group propagate signals from each 4-bit adder subcell and calculates the carryout signals for each adder block. If an adder has multiple levels of CLA logic, only the final level needs to generate the c4 signal. All other levels replace this c4 signal with the group generate and group propagate. The CLA logic for this 16-bit adder is identical to the CLA logic for the 4-bit adder in Figure 2.11; therefore the equations for the carryout signals are in Equations 2.15−2.18.

Figure 2.10 4-bit carry look-ahead adder

Figure 2.11 Schematic for a 16-bit CLA adder

A third level of CLA logic and four 16-bit adder blocks can be used to build a

64-bit adder. The CLA logic would create the c16, c32, and c48 signals to be used as
carry-ins to the 16-bit adder blocks and the c64 signal as sum64. If a design calls
for an adder of length 32, a designer can simply use two 16-bit adder blocks and the
first two carryout signals (c16, c32) from the third level of CLA logic. The identical

hardware in the CLA logic, coupled with the fact that the adder blocks can be

instantiated as sub cells, makes building long adders with this architecture simple.

Determining the critical path for a CLA adder is difficult because the gates in

the carry path have different fan-in. To get a general idea, we first assume that all gate

delays are the same. The delay for a 4-bit CLA adder then requires one gate delay to

calculate the propagate and generate signals, two gate delays to calculate carry

signals, and one gate delay to calculate the sum signals; this equates to four gate

delays. For a 16-bit CLA adder there is one gate delay to calculate the propagate and

generate signal (from the PFA), two gate delays to calculate the group propagate and

generate in the first level of carry logic, two gate delays for the carryout signals in the

second level of carry logic, and one gate delay for the sum signals. The second level



of carry logic for the 16-bit CLA adder contributes an additional two gate delays over

the 4-bit CLA adder, thus increasing the total to six gate delays. Continuing in this

manner (a 64-bit add takes eight gate delays, a 256-bit add takes ten gate delays), we

see that the delay for a CLA adder is dependent on the number of levels of carry logic,

and not on the length of the adder. If a group size of four is chosen, then the number

of levels in an N-bit CLA is expressed in Equation 2.21 and in general the number of

levels in a CLA for a group size of k is expressed in Equation 2.22. For an N-bit CLA

adder, each level of carry logic introduces two gate delays in addition to a gate delay

for the generate and propagate signals and a gate delay for the sum. The total gate

delay is expressed in Equation 2.23, which shows that the delay of a CLA adder is

logarithmically dependent on the size of the adder. This theoretically results in one of

the fastest adder architectures.

CLA levels (with group size of 4) $= \lceil \log_{4} N \rceil$    (2.21)

CLA levels (with group size of k) $= \lceil \log_{k} N \rceil$    (2.22)

CLA gate delay $= 2 + 2 \cdot \lceil \log_{k} N \rceil$    (2.23)
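The gate delay counts quoted above can be reproduced with a few lines of Python. The function names are illustrative, and the level count is computed with integer arithmetic (repeated multiplication by the group size) as a stand-in for the rounded-up logarithm, assuming equal gate delays as in the text.

```python
def cla_levels(n, k=4):
    """Number of look-ahead levels for an N-bit adder with group size k."""
    levels, span = 0, 1
    while span < n:      # each level widens the covered span by a factor of k
        span *= k
        levels += 1
    return levels


def cla_gate_delay(n, k=4):
    """One delay for g/p, two per carry logic level, one for the sum."""
    return 2 + 2 * cla_levels(n, k)


for width in (4, 16, 64, 256):
    print(width, cla_levels(width), cla_gate_delay(width))
```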

From a VLSI design perspective, this adder may take more time to implement,

but there still exists regularity with the architecture that allows building long adders

fairly easily. The reuse of the CLA logic definitely contributes to the feasibility of

building a long adder without additional design time. Also, after an adder is built, it

can be used as a subcell, as is done with the 4-bit adders as blocks in the 16-bit CLA

adder. A drawback of CLA adders is their larger area. There is a large amount of

hardware dedicated to calculating the carry bits from cell to cell. However, if the

application calls for high performance, then the benefits of decreased delay can

outweigh the larger area.

2.3.4 Carry Select Adder

Adding two numbers by using redundancy can speed addition even further.

That is, for any number of sum bits we can perform two additions, one assuming the

carry-in is 1 and one assuming the carry-in is 0, and then choose between the two

results once the actual carry-in is known. This scheme, proposed by Sklansky in 1960, is called conditional-sum addition [21]. An implementation of this scheme was first realized by Bedrij and is called the Carry Select Adder (CSLA) [22].

The CSLA divides the adder into blocks that have the same input operands except for the carry-in. Figure 2.12 shows a possible implementation for a 16-bit CSLA using ripple carry adder blocks. The carryout of the first block is used as the select line for the 9-bit 2-to-1 mux. The second and third blocks calculate the signals sum16 − sum8 in parallel, with one block having its carry-in hardwired to 0 and the other having it hardwired to 1. After one 8-bit ripple adder delay there is only the delay of the mux to choose between the results of block 2 and block 3. Equation 2.24 shows the delay for this adder. The 16-bit CSLA can also be built by dividing it into more blocks. Figure 2.13 shows the block diagram for the adder if it were divided into 4-bit RCA blocks. Equation 2.25 expresses the delay for this structure.

$t_{CSLA16a} = t_{8bitRCA} + t_{9bitmux}$    (2.24)

$t_{CSLA16b} = t_{4bitRCA} + 3 \cdot t_{5bitmux}$    (2.25)
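The select mechanism and the two delay expressions can be illustrated with a behavioural Python sketch. The names are hypothetical, each RCA block is modelled with integer addition, and the delay helper assumes that an RCA block of a given width costs one unit per bit and that every mux costs the same regardless of its width, which is a simplification for illustration only.

```python
def rca_block(a, b, cin, width):
    """Behavioural RCA block: returns (sum bits, carryout) for the given width."""
    total = a + b + cin
    return total & ((1 << width) - 1), total >> width


def carry_select_add(a, b, n=16, block=4):
    """Linear carry select adder with equal-size blocks."""
    mask = (1 << block) - 1
    result, carry = 0, 0
    for start in range(0, n, block):
        xa, xb = (a >> start) & mask, (b >> start) & mask
        if start == 0:
            s, carry = rca_block(xa, xb, 0, block)   # first block: a single RCA
        else:
            s0, c0 = rca_block(xa, xb, 0, block)     # precomputed for carry-in = 0
            s1, c1 = rca_block(xa, xb, 1, block)     # precomputed for carry-in = 1
            s, carry = (s1, c1) if carry else (s0, c0)  # mux selects on the previous carry
        result |= s << start
    return result | (carry << n)


def csla_delay(n, block, t_bit=1, t_mux=1):
    """One RCA block delay plus one mux delay per remaining block (unit delays assumed)."""
    return block * t_bit + (n // block - 1) * t_mux


print(hex(carry_select_add(0xFFFF, 0x0001)), csla_delay(16, 8), csla_delay(16, 4))
```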

The CSLA described so far is called the Linear Carry Select Adder, because its delay is linearly dependent on the length of the adder. In the worst case, the carry signal must ripple through each mux in the adder. Also, notice that the subcells all finish their additions at the same time, yet the more significant bits are left waiting at the input of the mux to be selected. From a VLSI design perspective, the CSLA uses a large amount of area compared to the other adders. There is hardware in this architecture which computes results that are thrown away on every addition, but the fact that the delay for an addition can be replaced by the delay of a mux makes this architecture very fast. Also, the Linear CSLA has regularity that makes it easier to lay out.

Figure 2.12 Schematic for a 16-bit CSLA with 8-bit RCA blocks

Figure 2.13 Schematic for a 16-bit CSLA with 4-bit RCA blocks

2.3.5 SQRT Carry Select Adder

To increase the speed further, the SQRT technique is developed. In this design the number of bits per block is not fixed but depends upon the total number of bits; the corresponding delay equation is shown in Equation 2.26. Using this technique, the bits per block for a 16-bit SQRT CSLA are 2-2-3-4-5. For 8-bit the sequence is 1-3-4. The 16-bit SQRT CSLA is shown in Figure 2.14.

$t_{add} = t_{setup} + m \cdot t_{carry} + \sqrt{2n} \cdot t_{mux} + t_{sum}$    (2.26)
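A behavioural sketch of a 16-bit SQRT CSLA with blocks of increasing width is given below. It assumes the block sizes 2-2-3-4-5 (which add up to 16 bits), models each block with integer addition, and uses hypothetical names; it illustrates the structure only and is not the transistor-level design.

```python
BLOCKS = (2, 2, 3, 4, 5)  # assumed block widths, least significant block first


def sqrt_csla_add(a, b, blocks=BLOCKS):
    """Carry select addition with variable block widths."""
    result, carry, pos = 0, 0, 0
    for width in blocks:
        mask = (1 << width) - 1
        xa, xb = (a >> pos) & mask, (b >> pos) & mask
        with_cin0 = xa + xb        # level 1: result assuming carry-in = 0
        with_cin1 = xa + xb + 1    # level 2: result assuming carry-in = 1
        total = with_cin1 if carry else with_cin0  # level 3: mux on the previous carry
        result |= (total & mask) << pos
        carry = total >> width
        pos += width
    return result | (carry << pos)


print(sum(BLOCKS), hex(sqrt_csla_add(0xFFFF, 0x0001)))
```

Growing the block width towards the most significant end gives the wider blocks more time to finish before their select signal arrives, so fewer muxes sit on the carry path than in a linear CSLA with small equal blocks.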



Figure 2.14 Schematic for a 16-bit SQRT CSLA

2.4 Low power design techniques

Designing systems for low power is not a straightforward task, as it involves all the IC design stages, beginning with the system behavioral description and ending with the fabrication and packaging processes. In some of these stages there are clear guidelines and steps to follow that reduce power consumption, such as decreasing the power-supply voltage. In other stages there are no clear steps to follow, so statistical or probabilistic heuristic methods are used to estimate the power consumption of a given design.

There are three major components of power dissipation in complementary metal–oxide–semiconductor (CMOS) circuits.

1) Switching Power: Power consumed by the circuit node capacitances during transistor switching.

2) Short Circuit Power: Power consumed because of the current flowing from power supply to ground during transistor switching.

3) Static Power: Due to leakage and static currents.

The first two components are referred to as dynamic power, as given in equation 2.1. Dynamic power constitutes the majority of the power dissipated in CMOS VLSI circuits. It is the power dissipated during charging or discharging the load capacitances of a given circuit. It depends on the input pattern, which will either cause the transistors to switch (consuming dynamic power) or not to switch (no dynamic power consumed) at every clock cycle.

The summation is over all the nodes of the circuit. Reducing any of these

components will end up with lower-power consumption, although, it is of equal

importance to increase the system-clock frequency for faster operation. Estimating the

power of a large circuit is a complex task. Heuristic algorithms, statistical, and

probabilistic methods are used to generate random-input patterns to test the switching

activity of the circuit. These methods become less accurate when the size of the

circuit increases. It is better to decompose the large circuit into smaller modules and

then use these methods to estimate the power consumption of each module. When the

decomposed modules are small enough, exact methods can be used to optimize their

performance.
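The summation over circuit nodes can be sketched numerically. The example below assumes the commonly used form of switching power, the sum over nodes of activity factor times node capacitance, multiplied by the square of the supply voltage and the clock frequency; the function name, node values, and frequency are hypothetical and chosen only for illustration.

```python
def switching_power(nodes, vdd, f_clk):
    """Sum alpha_i * C_i over all nodes, then scale by Vdd^2 * f_clk.

    nodes is an iterable of (activity_factor, capacitance_in_farads) pairs.
    """
    return sum(alpha * cap for alpha, cap in nodes) * vdd ** 2 * f_clk


# Hypothetical three-node circuit (activity factor, capacitance).
nodes = [(0.5, 10e-15), (0.1, 20e-15), (0.25, 40e-15)]

p_full = switching_power(nodes, vdd=2.5, f_clk=100e6)
p_half = switching_power(nodes, vdd=1.25, f_clk=100e6)
```

In this model, halving the supply voltage at a fixed clock frequency reduces the switching power by a factor of four, while lowering the activity factor or capacitance of any node reduces it linearly.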

2.4.1 Transistor sizing optimization

The transistor sizing for optimal performance is technology dependent. As the demand for high speed, low power consumption and high packing density continues to grow each year, there is a need to scale devices to smaller dimensions. As the market trend moves towards a greater scale of integration, the move towards a reduced supply voltage also has the advantage of improving the reliability of IC components of ever-reducing dimensions. This can be easily understood if one recalls that IC components with smaller dimensions are more prone to breakdown at high voltages. It has already been accepted that scaled-down CMOS devices, even at 2.5V, do not sacrifice device performance as they maintain device reliability.

Scaling brings about the following benefits:

Improved device characteristics for low voltage operation due to the

improvement in the current driving capabilities, reduced capacitance through small

geometries and junction capacitances, improved interconnect technology, higher

density of integration.



The major device problem associated with simple scaling lies in the increase of

the threshold voltage and the decrease of the carrier surface mobility, when the

substrate doping concentration is increased to prevent punch-through.

2.4.2 Low-power clock distribution

The clock network constitutes one of the most important parts of a

synchronous very large scale integration (VLSI) chip as it can significantly influence

the speed, area, and power dissipation of the system. Recent research on clock

network construction has developed procedures for building zero or near-zero skew

clock networks with sharp clock edge rates at the clock utilization points. However,

one major drawback associated with clock networks is their power dissipation.

Studies have shown that the clock network can dissipate 20–50% of the total power

on a chip. In the context of the growing importance of low-power designs for portable

electronics, it is necessary to develop strategies to significantly reduce the power

dissipation of the clock network, since this will lead to a major reduction in the

overall power dissipation of the chip. Using a lower Vdd to distribute the clock signal over the

chip, the clock network can be made to dissipate less power. However, for reasons

related to performance requirements, the rest of the circuitry on the chip may use a

higher Vdd and this implies that the clock levels would have to be converted to this

higher value at the utilization points.

2.4.3 Low power design through voltage scaling

Equation (2.1) shows that the average switching power dissipation is proportional to the square of the power supply voltage; hence, a reduction of VDD will significantly reduce the power consumption.

If the power supply voltage is scaled down while all other parameters are kept constant, the propagation delay time increases. The above equation suggests that a quadratic reduction of power consumption is possible as the power supply voltage is reduced, but the circuit speed also depends on the power supply voltage. If the circuit is always operated at the maximum frequency allowed by its propagation delay, the operating frequency, or the number of switching events per unit time, will drop as the propagation delay becomes larger with the reduction of the power supply voltage. The net result is that the dependence of switching power dissipation on the power supply voltage becomes stronger than a simple quadratic relationship. The propagation delay expressions show that the negative effect of reducing the power supply voltage upon delay can be compensated for if the threshold voltage of the transistors (VT) is scaled down accordingly. However, this approach is limited because the threshold voltage may not be scaled to the same extent as the supply voltage. When scaled linearly, reduced threshold voltages allow the circuit to produce the same speed performance at a lower VDD.

2.4.4 Reduction of switching activity

Switching activity can be reduced by algorithmic optimization, proper choice

of logic topology, glitch reduction, and gated clock signals.

Algorithmic optimization

This depends heavily on the application and the characteristics of data such

as dynamic range, correlation, and statistics of data transmission. The representation

of data can have a significant impact on switching activity at the system level. In

applications where data bits change sequentially and are highly correlated, the use of

Gray Coding leads to a reduced number of transitions compared to binary coding.

Another example is the use of sign-magnitude representation instead of conventional

two’s complement representation for signed data. A change in sign will cause

transitions of the higher order bits in the two’s complement representation, whereas

only the sign bit will change in sign-magnitude representation. Hence, switching

activity can be reduced by using the sign-magnitude representation in applications

where the data sign changes are frequent.
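The effect of Gray coding on switching activity can be checked by counting bit transitions over a sequential count. The sketch below is illustrative only; the helper names are hypothetical, and a 4-bit up-counter is used as the example of highly correlated sequential data.

```python
def to_gray(n):
    """Convert a binary number to its reflected Gray code."""
    return n ^ (n >> 1)


def count_transitions(words):
    """Total number of bit toggles between consecutive words."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))


counter = list(range(16))  # 4-bit sequential count 0..15
binary_toggles = count_transitions(counter)
gray_toggles = count_transitions([to_gray(n) for n in counter])
```

For this sequence the Gray-coded count toggles exactly one bit per step, whereas the binary count toggles several bits whenever a carry ripples through the lower-order positions.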

Glitch reduction

An important architecture level measure to reduce switching activity is

based on delay balancing and reduction of glitches. In multi-level logic circuits, the

propagation delay from one logic block to the next can cause spurious signal

transitions, or glitches. Glitches occur primarily due to a mismatch or imbalance in

the path lengths in the logic network. Such a mismatch in path lengths results in a



mismatch of signal timing with respect to the primary inputs. Redesigning the logic

network in order to balance the delay paths can significantly reduce glitches, and

consequently, the dynamic power dissipation in complex multi-level networks.

Gated Clock Signals

Another effective design technique for reducing the switching activity in CMOS logic circuits is the use of conditional or gated clock signals. If certain logic blocks in a system are not used during the current clock cycle, temporarily disabling the clock signals of these blocks saves switching power that is otherwise wasted. As an example, consider an N-bit number comparator, which compares the magnitudes of two unsigned N-bit binary numbers and produces an output to indicate which one is larger. In the conventional approach, all input bits are first latched into two N-bit registers and subsequently applied to the comparator circuit. In this case, two N-bit register arrays dissipate power in every cycle. Yet, if the most significant bits (MSBs) of the two binary numbers differ from each other, the decision can be made by comparing the MSBs only. The two MSBs are latched in a two-bit register which is driven by the original system clock. At the same time, these two bits are applied to an XNOR gate, and its output is used to generate the gated clock signal with an AND gate. If the two MSBs are different, the XNOR produces logic 0 at the output, disabling the clock signal of the lower-order registers. If the two MSBs are the same, the gated clock signal is applied to the lower-order registers and the decision is made by the (N-1)-bit comparator. The gated clock strategy effectively reduces the overall switching power dissipation of the system by about 50%, since a large portion of the system is disabled for half of all input combinations.
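The gated-clock comparator can be modelled behaviourally to see how often the lower-order registers are actually clocked. This is a simplified sketch under stated assumptions: the function name is hypothetical, registers are not modelled, and all input combinations are assumed equally likely.

```python
def gated_compare(a, b, n):
    """Compare two unsigned n-bit numbers with MSB-based clock gating.

    Returns (a_is_larger, lower_registers_clocked).
    """
    msb_a = (a >> (n - 1)) & 1
    msb_b = (b >> (n - 1)) & 1
    enable = 1 - (msb_a ^ msb_b)  # XNOR of the two MSBs
    if not enable:
        # MSBs differ: the decision is made from the MSBs alone.
        return msb_a == 1, False
    # MSBs equal: the (n-1)-bit comparator decides.
    mask = (1 << (n - 1)) - 1
    return (a & mask) > (b & mask), True


N = 4
pairs = [(a, b) for a in range(2 ** N) for b in range(2 ** N)]
clocked = sum(gated_compare(a, b, N)[1] for a, b in pairs)
fraction_clocked = clocked / len(pairs)
```

Over all input pairs the lower-order registers are clocked in half of the cases, which is consistent with the roughly 50% saving stated above for uniformly distributed inputs.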

2.4.5 Reduction of switching capacitance

The amount of switched capacitance plays a significant role in the dynamic

power dissipation of the circuit. Hence, reduction of this parasitic capacitance is a

major goal for low-power design of digital integrated circuits.

System-Level Measures

At the system level, one approach to reduce the switched capacitance is to

limit the use of shared resources. If a single shared bus is connected to all modules,



for example, a large bus capacitance comes into play due to the large number of

drivers and receivers sharing the same transmission medium, and the parasitic

capacitance of the long bus line. Obviously, driving the large capacitance will require

a significant amount of power consumption during each bus access. Alternatively, the

global bus structure can be partitioned into a number of smaller dedicated local buses

to handle the data transmission between the neighboring modules. As a result, the

switched capacitance during each bus access is significantly reduced, although

multiple buses may increase the overall routing area on the chip.

Circuit-Level Measures

The type of logic style used to implement a digital circuit also affects the

output load capacitance of the circuit. The capacitance is a function of the number of

transistors that are required to implement a given function. Pass-gate logic design is

attractive since fewer transistors are required for certain functions such as XOR and

XNOR. Pass-transistor structures typically require complementary control signals;

dual-rail logic is used to provide all signals in complementary form. This diminishes

the inherent advantages of pass-transistor logic gates over conventional CMOS logic.

Thus, the use of pass-transistor logic gates to achieve low-power dissipation must be

carefully considered, and the choice of logic design style must ultimately be based on

a detailed comparison of all design aspects such as silicon area, overall delay as well

as switching power dissipation.

Mask-Level Measures

The amount of parasitic capacitance that is switched (i.e., charged or

discharged) during operation can also be reduced at the physical design level, or

mask level. A simple mask-level measure to reduce power dissipation is keeping the

transistors at minimum dimensions whenever possible and feasible, thereby

minimizing the parasitic capacitances. Designing a logic gate with minimum-size

transistors certainly affects the dynamic performance of the circuit, and this trade-off

between dynamic performance and power dissipation should be carefully considered

in critical circuits.



2.5 Different logic styles

Several variants of static CMOS logic styles have been used to implement low-power 1-bit full adder cells. Each design style has its own merits and demerits.

In general, they can be broadly divided into two major categories:

1) Static logic style and

2) Dynamic logic style

A major distinction, also with respect to power dissipation, must be made

between static and dynamic logic styles. As opposed to static gates, dynamic gates are

clocked and work in two phases, a precharge and an evaluation phase. The logic

function is realized in a single NMOS pull-down or PMOS pull-up network, resulting

in small input capacitances and fast evaluation times. This makes dynamic logic

attractive for high speed applications. However, the large clock loads and the high

signal transition activities due to the precharging mechanism result in excessively high

power dissipation. Also, the usage of dynamic gates is not as straightforward and

universal as it is for static gates, and robustness is considerably degraded. With the

exception of some very special circuit applications, dynamic logic is not a viable

candidate for low-power circuit design.

Although all full adder cells perform the same function, their styles of generating the intermediate nodes and the outputs are different, the loads on the inputs and intermediate nodes are different, and the transistor count varies significantly. The standard implementations of the full adder cell considered here are the following:

1) Double pass transistor logic uses both N and P channel transistors, with dual logic paths for every function. It uses 28 transistors.

2) The complementary pass-transistor logic (SR-CPL) full adder has 26 transistors and uses the CPL logic family.

3) The multiplexer-based low power full adder uses 34 transistors and makes use of only multiplexer operations.



All these adder cells are compared based on power consumption, speed, power delay

product, area, and driving capability.

Classical designs of full adders normally use only one logic style for the whole

full-adder design, while hybrid designs exploit the features of different logic

styles to improve upon the performance of designs using a single logic style. All

hybrid designs use the best available modules implemented using different logic

styles or enhance the available modules in an attempt to build a low power full-adder

cell. Generally, the main focus in such attempts is to reduce the number of transistors

in the adder cell and, consequently, reduce the number of power dissipating nodes. In

doing so, the designers often trade off other vital requirements such as driving

capability, noise immunity, and layout complexity. Most of these adders lack driving

capabilities as the inputs are coupled to the outputs. Their performance as a single unit

or in small chains is good but when large adders are built by cascading these 1-b full-

adder cells, the performance degrades drastically. The performance degradation can

be handled by inserting buffers in between stages to enhance the delay characteristics.

However, this leads to an extra overhead and the initial advantage of having a lesser

number of transistors is lost.



CHAPTER 3

DESIGN OF ALU AND SQRT CSLA

3.1 Introduction to ALU and SQRT CSLA

The arithmetic logic unit (ALU) is one of the main components inside a microprocessor. It is a digital circuit responsible for performing arithmetic and logic operations such as addition, subtraction, increment, decrement, logical AND, logical OR, logical XOR and logical XNOR. Generally, the performance of the ALU is limited by the adder because of carry propagation. Many adder architectures have been proposed to reduce the carry propagation delay.

In digital adders, to speed up the operation, the Ripple Carry Adder (RCA) is modified into the CSLA. To achieve more speed, the CSLA is replaced by the SQRT CSLA. The CSLA is used in many computational systems to alleviate the problem of carry propagation delay by independently generating multiple carries and then selecting a carry to generate the sum [8]-[9]. However, the CSLA is not area efficient because it uses multiple pairs of Ripple Carry Adders (RCA) to generate the partial sum and carry for carry inputs Cin=0 and Cin=1; the final sum and carry are then selected by the multiplexers (mux). To achieve better area efficiency [10]-[14], a Binary to Excess-1 Converter (BEC) is used in place of the RCA with Cin=1 in the regular CSLA.

The 16-bit SQRT CSLA is divided into different blocks. The block sizes and the number of blocks depend upon the size of the SQRT CSLA according to the SQRT technique. From the second block onwards, each block contains three levels: the first level is a ripple carry adder with input carry zero, the second level is a ripple carry adder with input carry one, and the third level is a multiplexer which selects one of the ripple carry adder outputs according to the previous block's carry. The disadvantage of the SQRT CSLA is its larger area requirement, as it uses two levels of RCAs. To reduce the area, a BEC is used in place of the second-level RCA. In place of a 2-bit RCA, a 3-bit BEC is used.



3.1.1 Delay and Area evaluation methodology of the basic adder blocks

The AND, OR, and Inverter (AOI) implementation of an XOR gate is shown in Figure 3.1. For the delay evaluation, we add up the number of gates in the longest path of a logic block; for the area evaluation, we count the total number of gates in the block. Based on this approach, the CSLA adder blocks of 2:1 mux, Half Adder (HA), and FA are evaluated and listed in Table 3.1.

Table 3.1 Delay and area for basic gates

Figure 3.1 AOI implementation of XOR gate
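The gate-counting methodology can be sketched as a small netlist evaluation. The example below assumes that every AND, OR, and inverter contributes one unit of delay and one unit of area; the netlist encoding and the names used are hypothetical and serve only to illustrate the counting.

```python
def evaluate(netlist):
    """Return (delay, area) of a netlist of unit-delay, unit-area gates.

    netlist maps each gate output name to the list of signal names it reads.
    Names that are not keys of the netlist are primary inputs (arrival 0).
    """
    depth = {}

    def arrival(node):
        if node not in netlist:
            return 0
        if node not in depth:
            depth[node] = 1 + max(arrival(src) for src in netlist[node])
        return depth[node]

    return max(arrival(gate) for gate in netlist), len(netlist)


# XOR as AOI logic: y = (a AND NOT b) OR (NOT a AND b)
xor_aoi = {"na": ["a"], "nb": ["b"],
           "t1": ["a", "nb"], "t2": ["na", "b"],
           "y": ["t1", "t2"]}

# 2:1 mux: y = (d0 AND NOT s) OR (d1 AND s)
mux21 = {"ns": ["s"], "t1": ["d0", "ns"], "t2": ["d1", "s"], "y": ["t1", "t2"]}

# Half adder: the XOR above for the sum plus one AND gate for the carry.
half_adder = dict(xor_aoi, c=["a", "b"])

xor_delay, xor_area = evaluate(xor_aoi)
```

Under these unit assumptions the XOR has a longest path of three gates (inverter, AND, OR) and an area of five gates; larger blocks are evaluated in the same way by composing their gate-level netlists.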

3.1.2 Binary to Excess one Converter (BEC)

As stated above, the main idea of this work is to use a BEC instead of the RCA with Cin=1 in order to reduce the area and power consumption of the regular CSLA. To replace the n-bit RCA, an (n+1)-bit BEC is required. A structure and the function table of a 4-b BEC are shown in Fig. 3.2 and Table 3.1.2, respectively.

Fig. 3.3 illustrates how the basic function of the CSLA is obtained by using the 4-bit BEC together with the mux. One input of the 8:4 mux gets as its input (B3, B2, B1, and B0) and the other input of the mux is the BEC output. This produces the two possible partial results in parallel, and the mux is used to select either the BEC output or the direct inputs according to the control signal Cin. The importance of the BEC logic stems from the large silicon area reduction when CSLAs with a large number of bits are designed. The Boolean expressions of the 4-bit BEC are listed as (note the functional symbols ~ NOT, & AND, ^ XOR):

X0 = ~B0
X1 = B0 ^ B1
X2 = B2 ^ (B0 & B1)
X3 = B3 ^ (B0 & B1 & B2)
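The add-one behaviour of the BEC and the mux selection can be verified with a short model. The sketch below uses the standard add-one logic for a 4-bit BEC (inverting B0 and XOR-ing each higher bit with the AND of all lower bits); the function names are hypothetical.

```python
def bec4(b):
    """4-bit Binary to Excess-1 Converter built from NOT, AND and XOR."""
    b0, b1, b2, b3 = [(b >> i) & 1 for i in range(4)]
    x0 = b0 ^ 1                # NOT B0
    x1 = b0 ^ b1
    x2 = b2 ^ (b0 & b1)
    x3 = b3 ^ (b0 & b1 & b2)
    return x0 | (x1 << 1) | (x2 << 2) | (x3 << 3)


def bec_mux(b, cin):
    """Mux that passes B when Cin = 0 and the BEC output when Cin = 1."""
    return bec4(b) if cin else b


# The BEC output equals B + 1 modulo 16 for every 4-bit input.
bec_table = [(b, bec4(b)) for b in range(16)]
```

Because the BEC only increments the Cin=0 result, it produces the same value that a second adder with Cin=1 would deliver to the mux, using fewer gates.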

Fig 3.2 A 4- bit BEC



Fig 3.3 Functional block of CSLA

Figure 3.4 Block diagram for a 16-bit SQRT CSLA

3.1.3 Delay and area evaluation methodology of regular 16-bit SQRT CSLA

The structure of the 16-b regular SQRT CSLA is shown in Fig. 3.4. It has five groups of different size RCA. The delay and area evaluation of each group is shown in the figure, in which the numerals within [] specify the delay values, e.g., sum2 requires 10 gate delays. The steps leading to the evaluation are as follows.

1) The group2 has two sets of 2-b RCA. Based on the delay values of Table 3.1, the arrival time of selection input c1 [time (t) = 7] of the 6:3 mux is earlier than s3 [t = 8] and later than s2 [t = 6]. Thus, sum3 [t = 11] is the summation of s3 and mux [t = 3], and sum2 [t = 10] is the summation of c1 and mux.

2) Except for group2, the arrival time of the mux selection input is always greater than the arrival time of the data outputs from the RCAs. Thus, the delay of group3 to group5 is determined by the arrival time of the selection input plus the mux delay.

3) One set of 2-b RCA in group2 has 2 FA for Cin = 1 and the other set has 1 FA and 1 HA for Cin = 0. Based on the area count of Table 3.1, the total number of gate counts in group2 is determined by summing the gate counts of its FAs, HA, and mux.

4) Similarly, the estimated maximum delay and area of the other groups in the regular SQRT CSLA are evaluated and listed in Table 3.2.
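The steps above can be sketched as simple arithmetic. The example below takes the mux delay of 3 and the c1 arrival time of 7 from the text; the per-block gate counts (FA = 13, HA = 6, 2:1 mux = 4) are assumptions derived from unit-area AOI counting, and the function names are hypothetical. The delay helper applies only where the select input arrives last; where a data input arrives later, as for sum3 in group2, the data arrival time is used instead.

```python
MUX_DELAY = 3                            # from the text: mux [t = 3]
FA_AREA, HA_AREA, MUX_AREA = 13, 6, 4    # assumed AOI gate counts


def select_limited_arrivals(first_select, groups):
    """Mux output times when each select input is the last signal to arrive."""
    arrivals, t = [], first_select
    for _ in range(groups):
        t += MUX_DELAY
        arrivals.append(t)
    return arrivals


def regular_group_area(bits):
    """Gate count of one regular group built from two RCAs and a mux."""
    rca_cin0 = (bits - 1) * FA_AREA + HA_AREA   # HA in the least significant position
    rca_cin1 = bits * FA_AREA
    mux = (bits + 1) * MUX_AREA                 # sum bits plus carry out
    return rca_cin0 + rca_cin1 + mux


carry_times = select_limited_arrivals(first_select=7, groups=4)
group2_area = regular_group_area(2)
```

The first entry of `carry_times` reproduces the value sum2 [t = 10] quoted in step 1, and each later group adds one further mux delay.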

Table 3.2 Delay and area for SQRT CSLA

3.1.4 Delay and area evaluation methodology of modified 16-bit SQRT CSLA

The structure of the proposed 16-b SQRT CSLA, using a BEC for the RCA with Cin = 1 to optimize the area and power, is shown in Fig. 3.5. We again split the structure into five groups. The delay and area estimation of each group are shown in Fig. 3.5.

Figure 3.5 A 16-bit SQRT CSLA using BEC

1) The group2 has one 2-b RCA, which has 1 FA and 1 HA for carry input zero. Instead of another 2-b RCA with carry input one, a 3-bit BEC is used, which adds one to the output from the 2-b RCA. The sum3 and final c3 (output from mux) depend on s3 and mux, and on partial c3 (input to mux) and mux, respectively. The sum2 depends on c1 and mux.

2) For the remaining groups, the arrival time of the mux selection input is always greater than the arrival time of data inputs from the BECs. Thus, the delay of the remaining groups depends on the arrival time of the mux selection input and the mux delay.

3) The area count of group2 is determined by summing the gate counts of its FA, HA, mux, and BEC.
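The area count of group2 can be worked out by adding the gate counts of its blocks. The sketch below assumes unit-area AOI gate counts (FA = 13, HA = 6, 2:1 mux = 4, XOR = 5, AND = 1, NOT = 1) and a 3-bit BEC made of two XOR gates, one AND gate, and one NOT gate; these figures are assumptions used for illustration and the variable names are hypothetical.

```python
FA_AREA, HA_AREA, MUX_AREA = 13, 6, 4     # assumed AOI gate counts
XOR_AREA, AND_AREA, NOT_AREA = 5, 1, 1

# 3-bit BEC: two XOR gates, one AND gate and one NOT gate.
bec3_area = 2 * XOR_AREA + AND_AREA + NOT_AREA

# Modified group2: one 2-b RCA (1 FA + 1 HA), a 6:3 mux (three 2:1 muxes), one 3-bit BEC.
modified_group2_area = FA_AREA + HA_AREA + 3 * MUX_AREA + bec3_area

# Regular group2 for comparison: two 2-b RCAs (3 FA + 1 HA) and the same mux.
regular_group2_area = 3 * FA_AREA + HA_AREA + 3 * MUX_AREA

saving = regular_group2_area - modified_group2_area
```

Under these assumptions, replacing the two full adders of the Cin = 1 RCA with the 3-bit BEC is what produces the gate saving in group2.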

Table 3.3 Delay and area for modified SQRT CSLA

3.1.5 Transistor Level design of existing technique

1) Conventional full adder



A conventional full adder takes 28 transistors to implement sum and carry

functions. The conventional full adder is shown in figure 3.6

2) A 2-bit RCA

A two bit Ripple Carry Adder (RCA) is formed by connecting the two full

adders. It takes total 56 transistors to implement. It is shown in figure 3.7.

Figure 3.6 A conventional full adder

3) A 3-bit BEC

A 3-bit BEC uses two XOR gates, one AND gate, and one NOT gate, which takes 32 transistors overall, whereas the 2-bit RCA that it replaces takes 56 transistors. A 3-bit BEC is shown in Figure 3.8. A comparison between the 2-bit RCA and the 3-bit BEC is shown in Table 3.4.



Figure 3.7 A 2-bit RCA using conventional full adder

Figure 3.8 Transistor level 3-bit BEC

Table 3.4 Comparison between 2-bit RCA and BEC



Logic for          Number of     Critical path   Area     Power dissipation (µW)
second level       transistors   delay (ns)      (µm²)    Static   Dynamic   Total

RCA using CMOS     56            1.900           1342     6.706    42.565    49.271

BEC using CMOS     32            1.200           781      3.269    25.746    29.015

Although the BEC technique reduces area and power [16], the reduction is not considerable, and the design is not suitable for sub-threshold level modifications. There is also still scope to reduce the delay. In order to improve the delay, a new logic structure for a full-adder cell is proposed.

3.2 ALU



The arithmetic logic unit (ALU) is responsible for performing arithmetic and logic operations such as addition, subtraction, increment, decrement, logical AND, logical OR, logical XOR and logical XNOR. The ALU is a fundamental building block of the Central Processing Unit (CPU) of a computer, and even the simplest microprocessors contain one. The processors found inside modern CPUs and Graphics Processing Units (GPUs) contain very powerful ALUs. We have designed the ALU using multiplexers and a full adder circuit. The input and output sections consist of 4x1 and 2x1 multiplexers, and the logic is implemented using the full adder.

The full adder performs the computing function of the ALU. A full adder

could be defined as a combinational circuit that forms the arithmetic sum of three

input bits. It consists of three inputs and two outputs.





We have designed the ALU using a 4x1 mux, a 2x1 mux and an 8T full adder. Here all the blocks in the ALU are designed using Gate Diffusion Input (GDI).

3.2.1 GDI Technique

There is scope to reduce power, area and delay using the GDI cell technique. A simple GDI cell is shown in Fig. 3.9. Any Boolean function can be implemented using GDI cells. Low-swing problems arise because the inputs are applied directly to the sources of the P and N transistors: the N transistor is weak at passing logic high and the P transistor is weak at passing logic low. When a transition occurs from high to low at the P transistor source, or from low to high at the N transistor source, the low-swing problem arises. A point that demands special emphasis is that in 50% of the cases the GDI cell operates as a regular CMOS inverter, which is widely used as a digital buffer for logic-level restoration. In some of these cases, when Vdd=1 without a swing drop from the previous stages, a GDI cell functions as an inverter buffer and recovers the voltage swing. Basic logic gates are shown in Figure 3.10.

Figure 3.9 Simple GDI cell



Figure 3.10 Basic logic gates GDI cell

3.2.2 A 10-transistor full adder

A full adder using the GDI technique takes 10 transistors, whereas the conventional

full adder takes 28 transistors. It is shown in Figure 3.11.

3.2.3 An 8-transistor full adder

A full adder can be implemented with 8 transistors using the GDI technique. The 10-

transistor full adder differs from the 8-transistor full adder by two pull-up

transistors. It is shown in Figure 3.12.



Figure 3.11 A 10- transistor full adder

Figure 3.12 A 8- transistor full adder

3.2.4 A 1-bit ALU

ALU is designed using multiplexers and full adder circuit. The input and

output sections consist of 4x1 and 2x1 multiplexers and logic is implemented by using

full adder. A set of three select signals have been incorporated in the design to

determine the operation being performed and the inputs and outputs being selected.

Figure 3.13 shows the block diagram of 1-bit ALU using two 4x1 multiplexers and

one 2x1 multiplexer. The complement of B is used for SUBTRACTION operation.

The full adder performs the SUBTRACT operation by two’s complement method.



Table 3.5 shows the truth table for the operations performed by the ALU based on the

status of the select signals.

Table 3.5 Truth table of one bit ALU

Figure 3.13 A 1-bit ALU

3.2.5 8-bit ALU using ripple carry adders

An 8-bit ALU is formed by connecting eight 1-bit ALUs in series. The 8-bit ALUs

using 10-transistor and 8-transistor full adders are shown in Figure 3.14.


s2  s1  s0  Operation

0   0   0   AND

0   0   1   XOR

0   1   0   XNOR

0   1   1   OR

1   0   0   DECREMENT

1   0   1   ADDITION

1   1   0   SUBTRACTION

1   1   1   INCREMENT


Figure 3.14 Eight bit ALU using 10 and 8 transistor full adders

An eight-bit ALU using ripple carry adders has a large propagation delay, since the speed of the ALU is limited by the propagation of the carry. To reduce the carry propagation delay, the proposed design is implemented using a carry select adder.
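The operation of the ripple-carry ALU can be summarized in a behavioural model that follows the select-signal encoding of the truth table. This is only a sketch: the function names are hypothetical, and the way DECREMENT and INCREMENT are formed (adding all ones, or adding zero with a carry-in of one) is an assumption about how the multiplexers feed the full adder, not a description of the actual circuit.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def alu8(a, b, sel):
    """8-bit ALU model; sel is the 3-bit code (s2 s1 s0) as an integer 0..7."""
    width, mask = 8, 0xFF
    logic = {0: a & b, 1: a ^ b, 2: ~(a ^ b) & mask, 3: a | b}
    if sel in logic:
        return logic[sel]
    # Arithmetic operations: pick the second adder operand and the carry-in.
    operand, carry = {
        4: (mask, 0),        # DECREMENT: A + all ones (assumed)
        5: (b, 0),           # ADDITION: A + B
        6: (~b & mask, 1),   # SUBTRACTION: A + NOT B + 1 (two's complement)
        7: (0, 1),           # INCREMENT: A + 0 + 1 (assumed)
    }[sel]
    result = 0
    for i in range(width):   # the carry ripples through eight 1-bit slices
        s, carry = full_adder((a >> i) & 1, (operand >> i) & 1, carry)
        result |= s << i
    return result
```

For example, `alu8(12, 10, 5)` performs the addition and `alu8(12, 10, 6)` the subtraction; in both cases the loop makes the serial carry dependence between the eight slices explicit, which is the delay the carry select adder is meant to shorten.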



CHAPTER 4

DESIGN OF ALU USING MODIFIED SQRT CSLA

4.1 Introduction to different transistor types

Combinational logic forms the core of most digital integrated circuits such as

fast arithmetic units and controllers. The design requirements imposed on the logic

circuitry can vary widely. Area is often the prime concern, as it has direct impact on

cost. In many state-of-the-art designs, speed tends to be the dominating

requirement. Contemporary microprocessors are excellent examples of designs in this

class. For other applications, minimizing the power consumption is crucial, as in the

design of portable applications such as mobile telephones. These different design

requirements generally translate into the use of different circuit styles, or even

different manufacturing technologies.

Static CMOS has excellent properties in many areas: low sensitivity to noise and process variations, excellent speed, and low power consumption. Most of those properties carry over to more complex static CMOS gates; however, gates such as NAND gates with three or more inputs become large and slow. Other design styles, like the complementary, ratioed and pass transistor logic styles, have been devised to address this issue, all of which belong to the class of static circuits.

4.1.1 Complementary CMOS

A static CMOS gate is a combination of two networks, called the pull-up

network (PUN) and the pull-down network (PDN). The PUN consists solely of PMOS

transistors and provides a conditional connection to Vdd. The PDN potentially

connects the output to Vss and contains only NMOS devices. The PUN and PDN

networks should be designed so that, whatever the value of the inputs, one and only

one of the networks is conducting in steady state. In this way, a path always exists

between Vdd and the output, realizing a high output (one) or alternatively, between Vss

and output for a low output (zero).
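The requirement that exactly one of the two networks conducts can be checked for a simple gate. The sketch below models a two-input NAND, with series NMOS devices in the PDN and parallel PMOS devices in the PUN; transistors are treated as ideal switches and the function names are hypothetical.

```python
from itertools import product


def pdn_conducts(a, b):
    """Series NMOS pull-down: conducts only when both inputs are high."""
    return bool(a and b)


def pun_conducts(a, b):
    """Parallel PMOS pull-up: conducts when at least one input is low."""
    return bool((not a) or (not b))


def nand_output(a, b):
    """Steady-state output of the complementary gate."""
    up, down = pun_conducts(a, b), pdn_conducts(a, b)
    if up == down:
        raise ValueError("PUN and PDN must be complementary")
    return 1 if up else 0


truth_table = {(a, b): nand_output(a, b) for a, b in product((0, 1), repeat=2)}
```

For every input combination exactly one network conducts, so the output is always driven to Vdd or Vss and no static path exists between the supplies.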



Properties of complementary CMOS

Complementary CMOS gates inherit all the nice properties like high noise

margin, no static power consumption, as there is never a direct path between Vdd and

Vss in steady state mode and comparable rise and fall times.

The complementary gate is inverting (implementing functions such as NAND,

NOR & XNOR). Implementing a non inverting Boolean function (such as AND, OR,

XOR) in one stage is not possible and requires the addition of an extra inverter stage.

4.1.2 Pseudo NMOS

A grounded PMOS device presents an even better load. This configuration

which is called pseudo-NMOS because it resembles the depletion NMOS load, is

superior to the other approach. First of all, the PMOS transistor does not experience

any body effect, as its Vsb is constant and equal to 0. Secondly, the PMOS device is

driven by a Vgs equal to –Vdd, resulting in a higher load-current level for similarly

sized devices.

Figure 4.1 Pseudo NMOS

An important disadvantage is that it consumes static power when the output is

low, because a direct path exists between Vdd and ground through the load and device

drivers.

The grounded PMOS load is a good imitation of an ideal current-source load.

For certain circuit configurations, some simple modifications can further improve

either the speed or the power consumption. The following approach allows the static

current to be eliminated completely.

4.1.3 Differential cascade voltage switch logic (DCVSL)

Let us consider that the complement of each signal is always available. This

requires each gate to generate both polarities of the output signal. Such a gate, called

Differential Cascade Voltage Switch Logic (DCVSL) is presented. The PDN1 &

PDN2 are complementary, and implement the required logic function and its inverse.

Assume now that, for a given set of inputs, PDN1 conducts while PDN2 does not.

Node out is pulled down. This turns on the load transistor M2, pulling up out’. This in

turn cuts off load transistor M1. The gate is clearly free of static current paths as only

PDN1 & M2 are conducting.

Figure 4.2 DCVSL logic gate Basic Principle

Figure 4.3 XOR-XNOR gates



The availability of complementary signals eliminates extra inverter stages. An

example in the circuit implements a two input XOR and XNOR gate. The transistors

connected to the A-inputs are shared between the two PDNs. DCVSL has, for

instance, been used for the implementation of fast error-correcting logic in memories.

The DCVSL gate has the speed advantage; the reduction of the parasitic

capacitances at the output nodes produces a faster response. At the same time the

static power consumption is eliminated. This comes at the expense of extra area, as

each gate requires two pull-down networks.

4.1.4 Pass transistor logic

This is another promising approach to implementing complex logic, by realizing it as a logical network of switches or pass transistors. The pass transistor approach has the advantage of being simple and fast. Complex CMOS combinational logic is implemented with a minimal number of transistors. This reduces the parasitic capacitances and results in fast circuits. The static and transient performance of such a structure strongly depends upon the availability of a high-quality switch with low parasitic capacitance and resistance. Although the MOS transistor in itself is a switch of reasonable performance, it has some deficiencies. Pass transistor logic networks are therefore often constructed from bidirectional transmission gates (pass gates). These gates are composed of an NMOS transistor and a PMOS device in a parallel arrangement. The transmission gate acts as a bidirectional switch controlled by the gate signal C. When C=1, both MOSFETs are on, allowing the signal to pass through the gate, i.e., A=B if C=1. On the other hand, C=0 places both transistors in cutoff, creating an open circuit between nodes A and B.
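The switch behaviour just described (A=B if C=1, open circuit if C=0) can be sketched as an ideal model. The function names are hypothetical, a floating node is represented by None, and the 2:1 multiplexer built from two gates is added here only as an illustration of how such switches are combined.

```python
def transmission_gate(a, c):
    """Ideal transmission gate: B follows A when C = 1, B floats (None) when C = 0."""
    nmos_on = c == 1           # NMOS gate is driven by C
    pmos_on = (1 - c) == 0     # PMOS gate is driven by the complement of C
    if nmos_on and pmos_on:
        return a
    return None                # open circuit between nodes A and B


def mux2(in0, in1, sel):
    """Illustrative 2:1 multiplexer: two gates with complementary controls share one output."""
    via_in1 = transmission_gate(in1, sel)
    via_in0 = transmission_gate(in0, 1 - sel)
    return via_in1 if via_in1 is not None else via_in0


print(transmission_gate(1, 1), transmission_gate(1, 0))
print(mux2(0, 1, sel=0), mux2(0, 1, sel=1))
```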

Figure 4.4 Pass transistor logic



Although the transmission gate possesses some excellent properties, such as an almost constant resistance and no threshold loss, it has the disadvantage that it requires both an NMOS and a PMOS transistor, which have to be located in different wells. This reduces the layout efficiency of the design. Also, the control signal has to be available in both polarities, which once again has a negative influence on the layout density. Furthermore, the parallel connection of PMOS and NMOS results in increased node capacitances and reduced performance. It would therefore be advantageous if the switch could be implemented using NMOS transistors only. Unfortunately, NMOS-only pass transistors are subject to a threshold voltage loss. This is not a problem if the voltage levels are subsequently restored by a complementary CMOS inverter. Such a circuit, however, suffers from two major drawbacks: reduced noise margin, due to the threshold voltage drop, and static power consumption. Several techniques have been proposed to get around this problem.
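The voltage loss of an NMOS-only pass transistor can be illustrated with a first-order model. This is a sketch only: the threshold value of 0.4 V is an assumed figure, not one taken from this work, and the body effect is ignored.

```python
VDD = 1.2   # supply voltage used in this work (V)
VTN = 0.4   # assumed NMOS threshold voltage (V), illustrative only


def nmos_pass(vin, vgate=VDD, vtn=VTN):
    """Output level of an NMOS-only pass transistor.

    First-order model: the output cannot rise above vgate - vtn.
    """
    return min(vin, max(vgate - vtn, 0.0))


def transmission_gate_pass(vin):
    """An NMOS/PMOS transmission gate passes both levels without threshold loss."""
    return vin


for vin in (0.0, VDD):
    print(vin, nmos_pass(vin), transmission_gate_pass(vin))
```

Under these assumptions a logic one arrives at the next stage degraded by one threshold voltage, which reduces the noise margin and can leave the PMOS of the following inverter partly on.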

4.1.5 Transmission Gate logic

Transmission gate logic includes at least two field-effect transistor elements used as pass transistors, each having a channel of conductivity type opposite to that of the other (i.e., complementary FETs).

A transmission gate is a switching element that connects the input to the output according to the gate (control) input. It is a parallel connection of an n-transistor, which is good at passing logic zero, and a p-transistor, which is good at passing logic one. The basic arrangement of a transmission gate is shown in Figure 4.5.

Figure 4.5 A simple Transmission gate



4.2 Special Hardware using Multiplexers (SHM)

Although the BEC technique reduces area and power, the reduction is not considerable, and the design is not suitable for sub-threshold level modifications.

The 16-bit SQRT CSLA using BEC in its second level requires 792 transistors. The proposed logic offers scope to reduce the number of transistors, and with it the area and the power dissipation: a 16-bit SQRT CSLA using the proposed logic requires 736 transistors.

The proposed logic implementation for the second level RCA is Special Hardware using Multiplexers (SHM), shown in Figure 4.6. The inputs are applied to the first level RCA. The output of the RCA is applied to the second level SHM and then to the third level multiplexer, which selects either the RCA output or the SHM output according to the previous carry. b0, b1 and b2 are the inputs to the 3-bit SHM and x0, x1 and x2 are the corresponding outputs. The SHM takes the first level RCA output as input and increments its value by one. A 3-bit SHM uses three multiplexers and three inverters. The first inverter gives the first output bit x0 from input bit b0, and that output is used as the select line of the first multiplexer. The first multiplexer passes either the second bit b1 or its inversion to the output, because the first inverter output acts like a carry to the second bit. The first multiplexer thus gives the second output bit x1, which is used as the select line of the second multiplexer. Based on output bit x1 and input bit b1, the second multiplexer generates the carry for input bit b2: one of its inputs is b1 and the other is grounded, and its output is connected to the select line of the third multiplexer. The third multiplexer passes the third bit b2 or its inversion to the output according to this carry bit. This logic can be extended to any number of bits. Here it is implemented for the second block, with two inputs under consideration. As the number of inputs increases, the proposed technique produces more efficient results. Note that, despite these advantages, the delay is increased, since the carry has to pass through 2(n-1) levels in an n-bit SHM before it appears at the output. The number of transistors is compared in Table 4.1.



Figure 4.6 A 3-bit SHM

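$x_0 = \overline{b_0}$

$x_1 = x_0 \cdot b_1 + \overline{x_0} \cdot \overline{b_1}$

$x_2 = (x_1 + \overline{b_1}) \cdot b_2 + \overline{x_1} \cdot b_1 \cdot \overline{b_2}$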
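The multiplexer structure of the 3-bit SHM can be checked with a short behavioural sketch. The helper names are hypothetical, and the wiring follows the textual description of Figure 4.6 (an inverter for x0, then three multiplexers), assuming that the SHM must add one to its 3-bit input.

```python
def mux(sel, when0, when1):
    """2:1 multiplexer: returns when1 if sel is 1, otherwise when0."""
    return when1 if sel else when0


def shm3(b0, b1, b2):
    """3-bit SHM sketch: returns (x0, x1, x2) = (b2 b1 b0) + 1, modulo 8."""
    x0 = 1 - b0                    # first inverter
    x1 = mux(x0, 1 - b1, b1)       # x0 = 1 means no carry: pass b1, else pass NOT b1
    c2 = mux(x1, b1, 0)            # second mux: b1 when x1 = 0, grounded input when x1 = 1
    x2 = mux(c2, b2, 1 - b2)       # third mux: invert b2 only when the carry is set
    return x0, x1, x2


for value in range(8):
    b0, b1, b2 = value & 1, (value >> 1) & 1, (value >> 2) & 1
    x0, x1, x2 = shm3(b0, b1, b2)
    print(value, "->", x0 | (x1 << 1) | (x2 << 2))
```

Running the loop over all eight input values is a quick way to confirm that the multiplexer chain behaves as an incrementer.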

Table 4.1 Area comparison between 3-bit BEC and 3-bit SHM

Type of logic   Gates   Number of transistors   Total number of transistors
3-bit BEC       2 XOR   24                      32
                1 AND   6
                1 NOT   2
3-bit SHM       3 MUX   18                      24
                3 NOT   6


4.2.1 Transistor level design of SHM

A 3-bit SHM takes 24 transistors, as shown in Figure 4.7. The critical path details are shown in Figure 4.8, and the waveforms and power dissipation window are shown in Figure 4.9.

Figure 4.7 Transistor level 3-bit SHM

Figure 4.8 Critical path details of a 3-bit SHM



Figure 4.9 Power dissipation of a 3-bit SHM

The power and area of the existing technique (BEC) and the proposed technique (SHM) are compared in Table 4.2.

Table 4.2 Power, area and delay comparison between 3-bit BEC and 3-bit SHM

Logic for second level   Number of transistors   Critical path delay (ns)   Area (µm²)   Static power (µW)   Dynamic power (µW)   Total power (µW)
BEC using CMOS           32                      1.200                      781          3.269               25.746               29.015
SHM using CMOS           24                      2.350                      486          3.100               22.843               25.943



4.3 An 8-bit ALU using proposed carry select adder

The proposed technique is applied to an 8-bit ALU built with the 10-transistor full adder and with the 8-transistor full adder. The circuit diagram for the 10-transistor version is shown in the figure, and that for the 8-transistor version in Figure 4.10.

Figure 4.10 Eight bit ALU using modified SQRT CSLA
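The three-level structure of the SQRT CSLA used inside the ALU (first level RCA with carry-in 0, second level add-one logic, third level multiplexers) can be modelled functionally as below. This is a sketch under stated assumptions: the function names are hypothetical, and the group widths (2, 2, 3, 4, 5) are the usual 16-bit SQRT CSLA grouping, assumed here rather than taken from Figure 4.10.

```python
def ripple_add(a, b, width, cin=0):
    """Functional ripple carry adder: returns (sum bits, carry out) for `width` bits."""
    total = a + b + cin
    return total & ((1 << width) - 1), total >> width


def csla_add(a, b, cin=0, groups=(2, 2, 3, 4, 5)):
    """SQRT carry select adder sketch; `groups` lists block widths from the LSB up."""
    result, carry, shift = 0, cin, 0
    for index, width in enumerate(groups):
        mask = (1 << width) - 1
        a_g, b_g = (a >> shift) & mask, (b >> shift) & mask
        if index == 0:
            s, carry = ripple_add(a_g, b_g, width, carry)   # block 1 is a plain RCA
        else:
            s0, c0 = ripple_add(a_g, b_g, width, 0)         # first level: RCA with cin = 0
            inc = ((c0 << width) | s0) + 1                  # second level: add one (BEC or SHM)
            s1, c1 = inc & mask, (inc >> width) & 1
            s, carry = (s1, c1) if carry else (s0, c0)      # third level: multiplexers
        result |= s << shift
        shift += width
    return result, carry


print(csla_add(0xFFFF, 0x0001))
print(csla_add(1234, 4321))
```

The model only captures the function of each level; delay, area and power depend on the transistor-level implementation discussed in this chapter.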

4.4 Wave forms

A 20 ns clock is applied to every input to obtain the output waveforms. For the 8-bit ALU using the proposed technique with the 10-transistor full adder, the output waveforms and power dissipation are shown in Figure 4.11; for the 8-transistor full adder, the waveforms are shown in Figure 4.12.



Figure 4.11 Wave forms of 8- bit ALU for 10- transistor full adder

Figure 4.12 Wave forms of 8- bit ALU for 8- transistor full adder



CHAPTER 5

RESULTS

5.1 Comparative analysis of existing CSLA and modified CSLA

In the design of the 8-bit ALU using the efficient carry select adder, all the blocks of the 16-bit SQRT CSLA, including the second level of the second block (3-bit BEC and 3-bit SHM), are implemented in Dsch2.6c (Logic Editor) and synthesized in Microwind 2.6a (Layout Editor) under 0.12 µm technology with 1.2 V as the logic high voltage.

The first level of the second block in the 16-bit SQRT CSLA is a 2-bit RCA, which requires 56 transistors when implemented in CMOS logic. The second level of the second block is the 3-bit SHM in the proposed design; it uses 24 transistors. The third level of the second block is a multiplexer stage. A simple 2x1 multiplexer uses six transistors in CMOS technology. Block 2 needs three 2x1 multiplexers, hence eighteen transistors are required for this stage. The total number of transistors required for the complete block 2 is 98 when SHM is used, against 106 with the BEC technique. With SHM, block 3 requires 146 transistors, block 4 requires 194 and block 5 requires 242; with BEC, block 3 requires 158, block 4 requires 210 and block 5 requires 262. A 16-bit SQRT CSLA therefore requires 736 transistors with SHM, where it requires 792 transistors with the BEC technique.
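The transistor budget quoted above can be cross-checked with a few lines of arithmetic. This is only a bookkeeping sketch with hypothetical names; the count for block 1 is not stated in the text and is derived here as the remainder of the quoted totals.

```python
RCA_2BIT = 56                        # first level of block 2
MUX_2TO1 = 6                         # one 2x1 multiplexer in CMOS
SECOND_LEVEL = {"BEC": 32, "SHM": 24}

BLOCKS_2_TO_5 = {"SHM": [98, 146, 194, 242], "BEC": [106, 158, 210, 262]}
ADDER_TOTAL = {"SHM": 736, "BEC": 792}


def block2_transistors(kind):
    """Block 2 = 2-bit RCA + second level logic + three 2x1 multiplexers."""
    return RCA_2BIT + SECOND_LEVEL[kind] + 3 * MUX_2TO1


def block1_transistors(kind):
    """Implied remainder: adder total minus blocks 2 to 5 (not quoted in the text)."""
    return ADDER_TOTAL[kind] - sum(BLOCKS_2_TO_5[kind])


for kind in ("BEC", "SHM"):
    print(kind, block2_transistors(kind), block1_transistors(kind))
print("saving:", ADDER_TOTAL["BEC"] - ADDER_TOTAL["SHM"])
```

The quoted figures are mutually consistent: both variants leave the same remainder for block 1, and the per-block savings add up to the difference between the two adder totals.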

Finally, the complete second block of the 16-bit SQRT CSLA is implemented with BEC and with SHM using CMOS technology, and the results are shown in Tables 5.1 and 5.2.

5.2 Comparative analysis of existing ALU and modified ALU

All the basic gates in the ALU, such as the AND, XOR, multiplexer and full adder, are designed using the GDI technique. The full adder is designed using 10 transistors as well as 8 transistors. The final comparison of the 8-bit ALU considers the ripple carry adder and the carry select adder.

The 8-bit ALU using the efficient carry select adder is faster than the 8-bit ALU using ripple carry adders: the critical path delay is reduced by 41.6% with the 10-transistor adder and by 44.7% with the 8-transistor adder. The corresponding results are shown in Tables 5.3 and 5.4.

Table 5.1 Comparison of second level 2-bit RCA, 3-bit BEC and 3-bit SHM implemented using CMOS technology

Logic for second level   Number of transistors   Critical path delay (ns)   Area (µm²)   Static power (µW)   Dynamic power (µW)   Total power (µW)
RCA using CMOS           56                      1.900                      1342         6.706               42.565               49.271
BEC using CMOS           32                      1.200                      781          3.269               25.746               29.015
SHM using CMOS           24                      2.350                      486          3.100               22.843               25.943

Table 5.2 Comparison between second block with BEC and second block with SHM using CMOS

Design type     Number of transistors   Critical path delay (ns)   Area (µm²)   Static power (µW)   Dynamic power (µW)   Total power (µW)
RCA-BEC-MUX     106                     3.240                      3465         21.005              106                  127.005
RCA-SHM-MUX     98                      3.770                      2996         20.138              98.624               118.762



Table 5.3 Comparison of 8-bit ALU using 10-transistor adder

Model (ALU)                           Number of transistors   Critical path delay (ns)   Area (µm²)   Power (mW)
8-bit ALU using 10-transistor RCA     448                     3.195                      12384        0.204
8-bit ALU using 10-transistor CSLA    508                     1.865                      24682        0.205

Table 5.4 Comparison of 8-bit ALU using 8-transistor adder

Model (ALU)                          Number of transistors   Critical path delay (ns)   Area (µm²)   Power (mW)
8-bit ALU using 8-transistor RCA     432                     3.745                      11832        0.221
8-bit ALU using 8-transistor CSLA    494                     2.070                      20988        0.262

CHAPTER 6



CONCLUSIONS AND FUTURE SCOPE

6.1 Conclusions

In the process of designing a low power ALU, various tradeoffs between area, delay and power dissipation occurred. As the adder is the main block in the ALU, an efficient adder is always preferred. For that reason, the SQRT carry select adder is modified to be more power and area efficient.

In this process, all second level RCA blocks of the 16-bit SQRT CSLA are replaced by SHM, and the results are compared with the existing BEC technique. From the comparisons in Table 5.1, replacing the 2-bit RCA by the proposed 3-bit SHM reduces the number of transistors by 57.1% and the area by 63.7%, with a power dissipation reduction of 47.3%. In contrast, replacing the 2-bit RCA by the existing 3-bit BEC reduces the number of transistors by only 42.8%, the area by 41.8% and the power dissipation by 41.1%. Finally, the second block of the 16-bit SQRT CSLA is designed using a logic level modification, namely SHM in place of BEC. From Table 5.2, it is observed that the number of transistors is reduced by 7.5%, the area by 13.5% and the power by 6.4%, but the critical path delay is increased by 16.3%. This again illustrates the tradeoff between area, power and delay: the design is optimized for power and area at the cost of a delay overhead. This delay overhead can in turn be overcome by using various existing low power circuit level modifications.

Using the proposed efficient carry select adder and the GDI technique, an 8-bit ALU is designed for both the 10-transistor and the 8-transistor full adder and compared with the existing 8-bit ALU using ripple carry adders in Tables 5.3 and 5.4. It is observed that the speed is increased by 41.6% in the case of the 10-transistor full adder and by 44.7% in the case of the 8-transistor full adder.
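The percentages quoted in these conclusions follow from the values in Tables 5.1 to 5.4. The sketch below recomputes them; the helper names are hypothetical and the data tuples simply restate the table entries.

```python
def reduction_percent(reference, value):
    """Percentage by which `value` is lower than `reference` (negative if it is higher)."""
    return (reference - value) / reference * 100.0


# (transistors, critical path delay in ns, area in um^2, total power in uW), Table 5.1
RCA_2BIT = (56, 1.900, 1342, 49.271)
BEC_3BIT = (32, 1.200, 781, 29.015)
SHM_3BIT = (24, 2.350, 486, 25.943)
# Same columns for the complete second block, Table 5.2
BLOCK2_BEC = (106, 3.240, 3465, 127.005)
BLOCK2_SHM = (98, 3.770, 2996, 118.762)
# Critical path delay (ns) of the 8-bit ALU with RCA and with CSLA, Tables 5.3 and 5.4
ALU_DELAY = {"10-transistor": (3.195, 1.865), "8-transistor": (3.745, 2.070)}


def compare(reference, design):
    """Reductions (%) in transistors, delay, area and power, rounded to one decimal."""
    return tuple(round(reduction_percent(r, d), 1) for r, d in zip(reference, design))


print("SHM vs RCA:", compare(RCA_2BIT, SHM_3BIT))
print("BEC vs RCA:", compare(RCA_2BIT, BEC_3BIT))
print("block 2, SHM vs BEC:", compare(BLOCK2_BEC, BLOCK2_SHM))
for adder, (rca, csla) in ALU_DELAY.items():
    print(adder, "ALU delay reduction:", round(reduction_percent(rca, csla), 1))
```

A negative entry in the delay column marks the delay overhead of SHM. Small differences in the last digit with respect to the quoted figures can arise from rounding versus truncation.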

The proposed design has been shown to outperform the existing technique in area and power. A satisfactory level of power consumption and propagation delay can be achieved using the proposed technique without the need to purchase new technology libraries, which may reduce the design cost. Consequently, the proposed design is suitable for application in high-performance arithmetic and VLSI circuits in the future.

6.2 Future Scope

The proposed work can be extended by increasing the number of bits and by moving to newer technologies such as 0.08 µm and 0.06 µm. The resulting design, with fewer transistors, will in turn reduce the total area and the power consumption.

REFERENCES



[1] Arun Prakash Singh, Rohit Kumar, “Implementation of 1-bit Full Adder Using

Gate Diffusion Input (GDI) cell”, International Journal of Electronics and

Computer Science Engineering.

[2] N. M. Chore, R. N. Mandavgane , “ A survay of low power high speed one bit

full adder”,recent advances in networking, VLSI and signal processing, ISSN:

1790-5117. ISBN: 978-960-474-162-5.

[3] N. Weste and K. Eshraghian, Principles of CMOS VLSI Design, A System

Perspective . Reading, MA: Addison- Wesley, 1993.

[4] Pardeep Kumar / International Journal of Engineering Research and

Applications(IJERA) ISSN: 2248-9622 Vol. 2, Issue 6, November- December

2012, pp.599-606

[5] M.sreedevi and p.jeno.paul “ Design and Optimization of a High Performance

Low-Power CMOS Flex Cell “, International Journal of Signal System Control

and Engineering Application, 2010, vol.3, no.4, pp.65-69. DOI:

10.3923/ijssceapp.2010.65.69.

[6] A good over view of leakage and reduction methods are explained in the book

Leakage and reduction in Nanometer CMOS Technologies ISBN 0-387-25737-3.

[7] M.Parvathi, N.Vasantha, K. Satya Prasad “Design of High Speed -Low Power-

High Accurate (HS-LP-HA) Adder “, ICECT, Internation conference on

Electronics Computer Technology Proceedings, 2012, pp: 523-527, 978-1-4673-

1850-1/12@2012, IEEE.

[8] K Allipeera, S Ahmed Basha, “An Efficient 64-Bit Carry Select Adder With Less

Delay And Reduced Area Application“, International Journal of Engineering

Research and Applications( IJERA) .ISSN: 2248-9622 www.ijera.com Vol. 2,

Issue 5, September- October 2012, pp.550-554

[9] O.J.Bedrij, “Carry Select Adder”, IRE Trans. Electron. Comput.pp. 340-344,1962.

[10] U.Sreenivasulu, T.Venkata Sridhar, “Implementation of An 4 Bit - ALU Using

Low-Power And Area-Efficient Carry Select Adder”, International Conference on

Electronics and Communication Engineering, 20th, May 2012, Bangalore, ISBN:

978-93-81693-29-2.



[11] A.Andamuthu, S.Rithanyaa, ”Design Of 128 Bit Low Power and Area

Efficient Carry Select Adder”, International Journal of Advanced Research in

Engineering (IJARE) Vol 1, Issue 1,2012 Page 31-34.

[12] B.Ramkumar, H.M.Kittur, and P .M.Kannan, “ASIC implementation of

modified faster carry save adder”, EUR .J. Sci .Res. vol.42, no.1, pp.53-58, 2010.

[13] T.Y.Ceaing and M.J.Hsaio, “carry –select adder using single ripple carry

adder”, Electron. Lett. Vol.34,no.22,pp.2101-2103, oct.1998

[14] Y.Kim and L.S.Kim, “64-bit carry select adder with reduced area”, Electron.

Lett. Vol.37,no.10,pp.614-615, May.2001.

[15] B RamKumar and Harish M Kittur, “Low –Power And Area -Efficient Carry

Select Adder”, IEEE Transactions on Very Large Scale Integration(VLSI)Systems

APPENDIX



About Microwind2

The MICROWIND2 program allows the student to design and simulate an

integrated circuit at physical description level. The package contains a library of

common logic and analog ICs to view and simulate. MICROWIND2 includes all the

commands for a mask editor as well as original tools never gathered before in a single

module (2D and 3D process view, Verilog compiler, tutorial on MOS devices). You

can gain access to Circuit Simulation by pressing one single key. The electric

extraction of your circuit is automatically performed and the analog simulator

produces voltage and current curves immediately. This includes details on the device

modeling, simulation at logic and layout levels.

Figure A: MICROWIND window as it appears at the initialization stage.

We use MICROWIND2 to draw the MOS layout and simulate its behavior.

Go to the directory in which the software has been copied (By default microwind2).

Double-click on the MicroWind3 icon. The MICROWIND2 display window includes

four main windows: the main menu, the layout display window, the icon menu and



the layer palette. The layout window features a grid; scaled in lambda (λ) units. The

lambda unit is fixed to half of the minimum available lithography of the technology.

The default technology is a CMOS 6-metal layers 0.12μm technology, consequently

lambda is 0.06μm (60nm).

Simulation of a layout

MICROWIND3 includes a 3D process viewer for that purpose. Click

Simulate → Process steps in 3D. The simulation of the CMOS fabrication process is

performed, step-by-step by a click on Next Step.

The picture on the left represents the nMOS device, pMOS device, common

polysilicon gate and contacts. The picture on the right represents the same portion of

layout with the metal layers stacked on top of the active device. The inverter

simulation is conducted as follows. Firstly, a VDD supply source (1.2V) is fixed to

the upper metal2 supply line, and a VSS supply source (0.0V) is fixed to the lower

metal2 supply line. The properties are located in the palette menu. Simply click the

desired property, and click on the desired location in the layout. Add a clock on the

inverter input node (The default node name clock1 has been changed into Vin) and a

visible property on the output node Vout.

The command Simulate → Run Simulation gives access to the analog

simulation. Select the simulation mode Voltage vs. Time. The analog simulation of

the circuit is performed. The time domain waveform, proposed by default, details the

evolution of the voltages in1 and out1 versus time. This mode is also called transient

simulation.



The command Simulate → Run Simulation gives access to four simulation modes: voltage vs. time, voltage and current vs. time, static voltage vs. voltage, and frequency vs. time. All these simulation modes are applicable to the inverter simulation. Because the layout Invsteps.MSK includes not only the correctly polarized inverter but also several other MOS devices without any simulation properties, a warning window appears prior to the analog simulation. In this case you may click Simulate as it is. In normal cases, all n-well regions should be tied to VDD.

The inverter consumes power during transitions, due to two separate effects. The first is short circuit power, arising from the momentary short circuit current that flows from VDD to VSS while the transistors are in an incomplete on/off state. The second is charging/discharging power, which depends on the output wire capacitance. With small loading, the short circuit power loss is dominant. With huge loading, that is a large output node capacitance, the load power is dominant.

The power consumption occurs briefly during transitions of the output, either from 0 to 1 or from 1 to 0. The simulation shows the supply currents in the upper window and all voltage waveforms in the lower window. The current consumption is significant only during a very short period corresponding to the charge or discharge of the output node. Without any switching activity, the current is almost zero.

Delay

As the number of gates connected to the inverter output node increases, the

load capacitance increases. The fan-out corresponds to the number of gates connected

to the cell output. Physically a large fan-out means a large number of connections that

is a large load capacitance.

An inverter circuit is simulated using different clock, fan-out and supply conditions. The initial configuration is based on one inverter controlled by a 2 GHz clock, with its output connected either to a single inverter or to four inverters. The supply voltage is 1.2 V, with a 0.12 μm CMOS technology.

Now we connect four inverter circuits to the output node, thus increasing the load capacitance. In the simulation chronograms the inverter delay is significantly increased. When we investigate the delay variation with the output capacitance load, the curve shows that the gate delay varies almost linearly with the loading capacitance. A 100 fF load leads to around 300 ps delay in CMOS 0.12 μm technology.

In Microwind this type of screen is obtained with the command Parametric Analysis. Load the file Invcapa.MSK and invoke the command Parametric Analysis. By default, the capacitance of the output node is increased step by step from its default value Cdef to Cdef + 100 fF. For each value of the output capacitance, the analog simulation is performed and the last computed rise time is plotted, appearing as one single red dot in the graph. The complete graph is built once all analog simulations have been completed. The memory button enables us to store one curve prior to a new parametric simulation, for comparison purposes. Three main parameters may vary in the parametric analysis: the capacitance, the voltage and the temperature. Several analog parameters may be monitored: rise and fall delay, oscillating frequency, power consumption, final voltage of a node, crosstalk, etc.

Power consumption

The power consumption P is computed by Microwind as the average product of the supply voltage VDD and the supply current IDD, computed at each iteration step. In other words,

$$P = \frac{1}{\text{steps}} \sum I_{DD} \cdot V_{DD}$$

Three main factors contribute to the power consumption P: the load capacitance C, the supply voltage VDD and the clock frequency f. For a CMOS inverter, this relation is usually represented by the first-order approximation below. It shows a linear dependence of the power consumption P on the total capacitance C and on the operating frequency f. The power consumption is also proportional to the square of the supply voltage VDD.

$$P = \frac{1}{2} \, \eta \, C \, V_{DD}^{2} \, f$$

η = switching activity factor

C = output load capacitance

VDD = supply voltage

f = clock frequency
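The two power expressions above can be evaluated numerically with a short sketch. The function names are hypothetical; the example uses the 100 fF load, 1.2 V supply and 2 GHz clock mentioned in this appendix, and a switching activity factor of 1 is assumed purely for illustration.

```python
def average_power(idd_samples, vdd):
    """Average of IDD * VDD over all simulation steps (W)."""
    return sum(i * vdd for i in idd_samples) / len(idd_samples)


def dynamic_power(c_load, vdd, freq, activity=1.0):
    """First-order estimate P = 0.5 * activity * C * VDD^2 * f (W)."""
    return 0.5 * activity * c_load * vdd ** 2 * freq


# Assumed operating point: 100 fF load, 1.2 V supply, 2 GHz clock, activity factor 1
p = dynamic_power(100e-15, 1.2, 2e9)
print("estimated dynamic power: %.1f uW" % (p * 1e6))

# Linear in frequency, quadratic in supply voltage
print(dynamic_power(100e-15, 1.2, 4e9) / p)
print(dynamic_power(100e-15, 2.4, 2e9) / p)
```

Halving the supply voltage therefore saves more power than halving either the capacitance or the frequency, which is why supply scaling is emphasised in the sections that follow.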

Frequency dependence

We can verify the linear dependence of the power consumption on the operating frequency by simulating a CMOS inverter circuit. At each time domain analog simulation, we get a value of the power consumption, which is computed by Microwind as the average product of the supply voltage VDD and the supply current IDD. As the power consumption is linearly proportional to the clock frequency, a usual metric found in most cell libraries is the µW/GHz.

Supply voltage dependence

As a first-order approximation, the average power consumption is proportional to VDD². We use the parametric analysis tool in Microwind to control the incremental change of the supply voltage from 0.5 to 2.0 V. The supply voltage step is 0.1 V. In the measurement window, the item dissipation is selected. The result shows a nonlinear dependence of the power dissipation on VDD. The square law fits the simulated data from 0.8 to 1.5 V. We notice a very sharp rise of the power consumption above 1.5 V, due to avalanche effects in n-channel MOS devices. The simulation demonstrates the interest of operating at a minimum supply voltage to achieve low power operation.

Minimum supply voltage



We must know the supply voltage below which the inverter no longer works. The answer is given by the parametric analysis, focusing this time on the dependence of the inverter delay on the supply voltage. Load the file cmosload.msk for this study. Invoke the command Parametric Analysis of the Analysis menu. Click the layout region corresponding to the node VDD. Verify that the voltage menu is selected in the parametric analysis window. Verify that the node VDD is selected. Modify the VDD voltage range from 0.5 to 1.5 V, step 0.1. Finally, in the measurement menu, select the item rise delay and click Start Analysis.

We observe that the delay is significantly increased as we decrease VDD from its nominal value 1.2 V down to 0.6 V. Below 0.7 V the inverter delay is higher than the default transient simulation time, so that the delay evaluator does not work anymore.

Static characteristics

The static characteristics of the inverter correspond to the plot of the output voltage versus the input voltage. The simulation involves a step by step increase of Vin and the monitoring of Vout. In the simulation window, the static characteristics are obtained by a click on the item voltage versus voltage situated in the selection menu, at the bottom of the chronograms.

When Vin is low, Vout is high, which corresponds to one logic state of the inverter. When Vin increases, Vout starts to decrease slowly and then suddenly crosses the VDD/2 boundary. At that point the value of Vin is the commutation point of the inverter, called Vc. Then, when Vin rises to VDD, Vout reaches 0, which corresponds to the other logic state of the inverter.

About DSCH3

The DSCH3 program is a logic editor and simulator. DSCH3 is used to

validate the architecture of the logic circuit before the microelectronics design is

started. DSCH3 provides a user-friendly environment for hierarchical logic design,

and fast simulation with delay analysis, which allows the design and validation of

complex logic structures. Some techniques for low power design are described in the



manual. DSCH3 also features the symbols, models and assembly support for 8051.

DSCH3 also includes an interface to SPICE.

Features

Figure B: DSCH schematic editor

User-friendly environment for rapid design of logic circuits.

Supports hierarchical logic design.

Handles both conventional pattern-based logic simulation and intuitive on-screen mouse-driven simulation.

Built-in extractor, which generates a SPICE netlist from the schematic diagram (compatible with PSpice™ and WinSpice™).

Current and power consumption analysis.

Generates a VERILOG description of the schematic for the layout editor.

Immediate access to symbol properties (delay, fan-out).

Models and supports the 8051 microcontroller.



An example of the design of a schematic diagram in DSCH and the generation of its layout in MICROWIND is shown here. The CMOS inverter design is detailed in Figure C below. First click New on the main menu, then draw the circuit diagram in the DSCH window by dragging the components from the symbol library, as shown below.

Figure C: Inverter circuit

Save the file and click Simulate → Start simulation in the main menu. Then click inside the buttons situated on the left part of the diagram. The result is displayed on the LED. Here the p-channel MOS and the n-channel MOS transistors function as switches, as shown in Figure D. When the input signal is logic 0, as shown in figure 5.4, the NMOS is switched off while the PMOS passes VDD to the output. When the input signal is logic 1, as shown in figure 6.12, the PMOS is switched off while the NMOS passes VSS to the output.



Figure D: Circuit diagram of CMOS inverter, and CMOS inverter during simulation

The fan-out corresponds to the number of gates connected to the inverter

output. Physically, a large fan-out means a large number of connections that is a large

load capacitance. If we simulate an inverter loaded with one single output, the

switching delay is small. Now, if we load the inverter by several outputs, the delay

and the power consumption are increased. The power consumption linearly increases

with the load capacitance.

This is mainly due to the current needed to charge and discharge that

capacitance. Click the button Stop simulation shown in the figure below. You are

back to the editor.



Figure E: Timing diagram of inverter

Click the chronogram icon to get access to the chronograms of the previous

simulation. As seen in the waveform, the value of the output is the logic opposite of

that of the input.

Generation of layout of the schematic diagram

Next, open the Microwind window and click Open in the main menu. Then open the CMOS inverter circuit diagram. Then use the Compile Verilog File command on the Verilog file of the corresponding circuit diagram. It generates the corresponding stick diagram of the inverter circuit, as shown in the figure. Then click the Simulate icon in the main menu to generate the waveforms.

Verilog program

// DSCH Ver 3.0
// G:\project\dsch microwind\self\example.sch
module example (in1, out1);
input in1;
output out1;
supply1 vdd; // power rail, declared explicitly so the listing is self-contained
supply0 vss; // ground rail
pmos #(17) pmos_1(out1,vdd,in1); // 2.0u 0.12u
nmos #(17) nmos_2(out1,vss,in1); // 1.0u 0.12u
endmodule
// Simulation parameters in Verilog Format
// always #1000 in1=~in1;
// in1 CLK 10

Layout

In this paragraph, the procedure to manually create the layout of a CMOS inverter is described. Click the icon MOS generator on the palette. The following window appears. By default, the proposed length is the minimum length available in the technology (2 lambda), and the width is 10 lambda. In 0.12 μm technology, where lambda is 0.06 μm, the corresponding size is 0.12 μm for the length and 0.6 μm for the width. Click on the top of the nMOS to fix the pMOS device. The result is displayed in Figure F.
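The lambda-based sizes quoted above follow directly from the rule that lambda is half of the minimum lithography of the technology. A minimal sketch of the conversion, with hypothetical function names:

```python
def lambda_um(min_feature_um):
    """Lambda is half of the minimum available lithography of the technology."""
    return min_feature_um / 2


def to_um(n_lambda, min_feature_um=0.12):
    """Convert a dimension expressed in lambda units to micrometres."""
    return n_lambda * lambda_um(min_feature_um)


print("lambda:", lambda_um(0.12), "um")
print("default MOS length (2 lambda):", to_um(2), "um")
print("default MOS width (10 lambda):", to_um(10), "um")
```

The same helper can be reused for other technologies by changing the minimum feature size passed to it.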



Figure F: Layout of inverter in MICROWIND

Figure G: Selecting the NMOS device



Connection between devices

Within CMOS cells, metal and polysilicon are used as interconnects

for signals. Metal is a much better conductor than polysilicon. Consequently,

polysilicon is only used to interconnect gates, such as the bridge (1) between pMOS

and nMOS gates, as described in the schematic diagram of figure G. Polysilicon is

rarely used for long interconnects, except if a huge resistance value is expected. In the

layout shown in figure G, the polysilicon bridge links the gate of the n-channel MOS

with the gate of the p-channel MOS device. The polysilicon serves as the gate control

and the bridge between MOS gates.

Figure H: Connections required to build the inverter (CmosInv.SCH)

