32 x 32 multiplier

A REPORT

ON

DESIGN, ANALYSIS AND SIMULATION OF

VLSI SYSTEMS AND MODULES (Design of Low Power High Speed 32 × 32 Multiplier)

Prepared in partial fulfillment of the course BITS C314

(Lab Oriented Project)

By

SANDEEP GUPTA 2002B5A3503

BIRLA INSTITUTE OF TECHNOLOGY AND SCIENCE, Pilani-333031 II semester 2005-2006

Table of Contents

Acknowledgements
Abstract
1. Introduction
2. The ESA in 2’C representation
3. Reducing the switching activity
4. The algorithm and architecture
4.1 Conversion from 2’C to SM notation
4.2 Speeding up the PP accumulation
4.3 Converting the RB number into 2’C number
4.4 The algorithm and its VLSI architecture
5. VERILOG CODE
6. RESULTS
6.1 Multiplier Interface
6.2 Partial Product generator
6.3 First stage of the adder
6.4 Schematic of final stage of adder
6.5 Schematic of one bit adder (sum and carry)
6.6 Layout (in tsmc018)
7. CONCLUSION
8. FUTURE SCOPE
References

Acknowledgements

I would like to express my sincere gratitude to Dr. D. Sriram, Instructor In-charge, Lab Oriented Project BITS C314, for giving me the opportunity to learn research methodology, for cultivating logical and creative thinking, and for having me present my findings in the form of a scientific report.

I would also like to express my gratitude to Dr. (Mrs.) Anu Gupta, Assistant Professor, EEE Group, for giving me an opportunity to work under her guidance. Working under her supervision gave me an opportunity to consolidate my subject knowledge and apply it to the given problem.

Last but not least, I would like to thank Mr. Pawan Sharma for allowing me to use the various tools in OYSTER LAB.

Abstract

A low-power multiplication algorithm and its VLSI architecture using a mixed number representation are proposed. Reduced switching activity and low power dissipation are achieved through the Sign-Magnitude (SM) notation for the multiplicand and through a novel design of the Redundant Binary (RB) adder and Booth decoder. High-speed operation is achieved through Carry-Propagation-Free (CPF) accumulation of the Partial Products (PPs) using the RB notation. Analysis shows that the switching activity in the PP generation process can be reduced by 90% on average. Compared to multipliers of the same type, the proposed design dissipates much less power and is 18% faster on average.

1: Introduction

It has been shown that using the SM notation for the multiplicand, the Two’s Complement (2’C) representation for the multiplier, and the RB representation for the PP accumulation can significantly reduce the Expected Switching Activity (ESA), and therefore the power dissipation. The ESA reduction occurs whenever the radix-4 Booth’s algorithm needs the negation of the multiplicand to generate a PP. High-speed operation is sustained by accumulating the PPs in RB notation, since RB numbers allow CPF addition. The inputs and outputs of the multiplication unit are assumed to be in 2’C notation. Although the proposed algorithm and its VLSI architecture are complex in terms of number conversions, the design is more energy efficient, has an operating speed close to that of the Wallace tree architecture, and is faster than the other proposed multipliers.

2: The ESA in 2’C representation

2’C numbers and the radix-4 Booth’s algorithm are predominantly used for multiplier design, since arithmetic operations are easily carried out with 2’C numbers and Booth’s algorithm greatly reduces the number of PPs. However, Booth’s algorithm often requires the negation of the multiplicand, and negating a 2’C number requires many bits to be switched, which results in high switching activity. Without loss of generality, the radix-4 Booth’s algorithm can be used to determine the probability that a negation of the multiplicand is required and how many bits, on average, have to be switched. This gives the ESA during the PP generation.

As shown in Table I, the radix-4 Booth’s algorithm requires $-Y$ and $-2Y$, where $Y$ is the multiplicand. In 2’C representation $-Y = \bar{Y} + 1$, so to generate $-Y$ from $Y$ all the bits of $Y$ have to be complemented and then ‘1’ added to obtain the correct 2’C result. The same operations are needed to generate $-2Y$, except that a left shift is needed before the bit complementation takes place. The negation process is highly energy consuming, as it requires the charging and discharging of all the nodes associated with the PP.

Let an n-bit multiplier be $X = x_{n-1} x_{n-2} \ldots x_1 x_0$. According to Table I, the algorithm examines the triplets $x_{2k+1}, x_{2k}, x_{2k-1}$, where $k = 0, 1, \ldots, \lfloor (n-1)/2 \rfloor$ and $x_{-1} = 0$. It thus scans 3 bits for one PP, with a one-bit overlap between two adjacent triplets. If n is odd, the largest index is $2\lfloor (n-1)/2 \rfloor + 1 = n$. Therefore, an extra bit $x_n = x_{n-1}$ (sign extension) must be appended to the left of $x_{n-1}$ to form the triplet $x_n, x_{n-1}, x_{n-2}$. If n is even, the largest index is $2\lfloor (n-1)/2 \rfloor + 1 = n-1$. Therefore, the multiplier X can be grouped exactly into $n/2$ triplets and no sign extension is needed. For parallel multiplication, all triplets can be scanned at the same time.
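The triplet scan described above can be sketched in Python. Table I is not reproduced in this text, so the sketch assumes the standard radix-4 Booth encoding, in which the triplet $x_{2k+1}, x_{2k}, x_{2k-1}$ maps to the digit $-2x_{2k+1} + x_{2k} + x_{2k-1}$; the function name `booth_radix4_digits` is illustrative only.

```python
def booth_radix4_digits(x: int, n: int) -> list[int]:
    """Recode an n-bit 2'C multiplier into radix-4 Booth digits (LSB first).

    Assumes the standard encoding d_k = -2*x_{2k+1} + x_{2k} + x_{2k-1},
    so each digit lies in {-2, -1, 0, 1, 2}.
    """
    bits = [(x >> i) & 1 for i in range(n)]   # x_0 .. x_{n-1}
    if n % 2 == 1:
        bits.append(bits[-1])                 # sign extension: x_n = x_{n-1}
    digits = []
    prev = 0                                  # x_{-1} = 0
    for k in range(len(bits) // 2):
        lo, hi = bits[2 * k], bits[2 * k + 1]
        digits.append(-2 * hi + lo + prev)    # one triplet per partial product
        prev = hi                             # one-bit overlap with next triplet
    return digits


print(booth_radix4_digits(6, 4))  # -> [-2, 2], since -2 + 2*4 = 6
```

Each digit selects one PP (0, ±Y or ±2Y), and the digits weighted by powers of 4 reproduce the multiplier value, which is a convenient check on the recoding.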

From Table I, when the radix-4 Booth’s algorithm encounters the multiplier patterns ‘110’, ‘101’ and ‘100’, it has to generate $-Y$ or $-2Y$. These patterns, referred to hereafter as NEG (negation) patterns, are directly related to the ESA in the Booth PP generator. The average probability of a NEG pattern occurring in any given triplet $x_{2k+1}, x_{2k}, x_{2k-1}$ of the multiplier can be analyzed as follows.

Assume an n-bit 2’C number $X = x_{n-1} x_{n-2} \ldots x_1 x_0$ in which each bit of the multiplier is ‘1’ with probability 0.5.

Case 1: n is even, $\lfloor (n-1)/2 \rfloor = (n-2)/2$. Therefore, n/2 triplets are needed to cover all the bits of the multiplier and sign extension is not needed. For $x_1 x_0$, since Booth’s algorithm assumes bit $x_{-1}$ to be always zero, there are only four choices for the triplet $x_1 x_0 x_{-1}$: 000, 010, 100 and 110. Two of them are NEGs. Hence, the probability of a NEG appearing in the $x_1, x_0, x_{-1}$ positions is 1/2. For the remaining (n-2) bits, each triplet $(x_{2k+1}, x_{2k}, x_{2k-1})$ has 8 possible patterns and 3 of them are NEGs, so the probability of a NEG appearing there is 3/8. Therefore, averaging over the n/2 triplets (one with probability 1/2 and the remaining n/2 - 1 with probability 3/8), the average probability of a NEG appearing in a triplet $x_{2k+1}, x_{2k}, x_{2k-1}$ is

$$P_{NEG}^{even} = \frac{\frac{1}{2} + \frac{3}{8}\left(\frac{n}{2} - 1\right)}{n/2} = \frac{3}{8} + \frac{1}{4n}$$


Case 2: n is odd, $\lfloor (n-1)/2 \rfloor = (n-1)/2$. Therefore, the number must be sign extended and (n+1)/2 triplets are needed to cover all the bits of the multiplier. Based on the sign extension rule $x_n = x_{n-1}$, the triplet $x_n x_{n-1} x_{n-2}$ has four possible patterns: 000, 001, 110, 111. Among them there is just one NEG. So, the probability of a NEG occurring in the triplet $x_n, x_{n-1}, x_{n-2}$ is 1/4. For $x_1 x_0$, as in the case when n is even, the probability of a NEG occurring in the triplet $x_1 x_0 x_{-1}$ is 1/2. For the remaining (n-3) bits, the probability of a NEG occurring in a triplet $x_{2k+1} x_{2k} x_{2k-1}$ is 3/8. Therefore, averaging over the (n+1)/2 triplets, the average probability of a NEG appearing in a triplet $x_{2k+1} x_{2k} x_{2k-1}$ is:

$$P_{NEG}^{odd} = \frac{\frac{1}{4} + \frac{1}{2} + \frac{3}{8} \cdot \frac{n-3}{2}}{(n+1)/2} = \frac{3}{8}$$

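Combining cases 1 and 2, the average probability for a NEG to appear in a triplet $x_{2k+1} x_{2k} x_{2k-1}$ is

$$P_{NEG} = \begin{cases} \dfrac{3}{8} + \dfrac{1}{4n}, & n \text{ even} \\ \dfrac{3}{8}, & n \text{ odd} \end{cases} \qquad (4)$$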

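Since, for 2’C numbers, $-Y = \bar{Y} + 1$ and the generation of $\bar{Y}$ requires the complementation of every bit of Y, the ESA in the PP generation process equals the average probability of a NEG pattern:

$$ESA_{2'C} = \begin{cases} \dfrac{3}{8} + \dfrac{1}{4n}, & n \text{ even} \\ \dfrac{3}{8}, & n \text{ odd} \end{cases}$$

On average, the ESA in the partial product generation process is about 0.40. This results in a large power dissipation!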
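The case analysis above can be cross-checked by brute force. The following sketch enumerates every n-bit multiplier, counts the NEG triplets (‘100’, ‘101’, ‘110’), and compares the resulting average with the closed form 3/8 + 1/(4n) for even n and 3/8 for odd n. The function names are illustrative, and the check is meaningful for n ≥ 2, where the lowest and highest triplets are distinct.

```python
from fractions import Fraction
from itertools import product

# Triplets (x_{2k+1}, x_{2k}, x_{2k-1}) that require -Y or -2Y.
NEG_PATTERNS = {(1, 0, 0), (1, 0, 1), (1, 1, 0)}


def average_neg_probability(n: int) -> Fraction:
    """Fraction of Booth triplets that are NEG, over all n-bit multipliers."""
    neg = 0
    triplets = 0
    for bits in product((0, 1), repeat=n):    # bits[i] = x_i
        ext = list(bits)
        if n % 2 == 1:
            ext.append(ext[-1])               # sign extension for odd n
        prev = 0                              # x_{-1} = 0
        for k in range(len(ext) // 2):
            lo, hi = ext[2 * k], ext[2 * k + 1]
            neg += (hi, lo, prev) in NEG_PATTERNS
            triplets += 1
            prev = hi
    return Fraction(neg, triplets)


def closed_form(n: int) -> Fraction:
    """3/8 + 1/(4n) for even n, 3/8 for odd n."""
    return Fraction(3, 8) + (Fraction(1, 4 * n) if n % 2 == 0 else 0)


for n in range(2, 9):
    print(n, average_neg_probability(n), closed_form(n))
```

Exact fractions are used so that the comparison is not blurred by floating-point rounding.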

3: Reducing the switching activity

Clearly, the high switching activity in the Booth PP generator is caused by the generation of $-Y$ and $-2Y$ and by the fact that the 2’C representation is chosen for the multiplicand Y. The latter holds because the negation of a 2’C number is equivalent to complementing all its bits and then adding ‘1’. On the other hand, the negation of an SM number is simple: just complement the sign bit. Hence, if one uses the SM representation instead of 2’C for the multiplicand Y, a significant reduction of the ESA during the Booth PP generation process should be expected. Consequently, the SM representation is used for the multiplicand Y, while the multiplier X is kept in 2’C form. The correctness of the radix-4 Booth’s algorithm applied to this mixed number representation can be argued as follows: the radix-4 Booth’s algorithm gives correct results when applied to 2’C numbers, and the validity of the Booth coding depends exclusively on the bit pattern of the multiplier. Since the multiplier is kept in 2’C notation, the radix-4 Booth’s algorithm remains valid for the mixed number representation.

Now, let us evaluate the ESA of SM numbers. Since the multiplier is in its 2’C form, the average probability of a NEG pattern appearing in any triplet $x_{2k+1} x_{2k} x_{2k-1}$ of an n-bit multiplier is the same as in (4). Also, negating an SM number only complements the sign bit, that is, one bit out of n. Therefore, the ESA for SM numbers in the Booth PP generation process is:

$$ESA_{SM} = \frac{1}{n} \cdot \begin{cases} \dfrac{3}{8} + \dfrac{1}{4n}, & n \text{ even} \\ \dfrac{3}{8}, & n \text{ odd} \end{cases}$$

A comparison of the ESA for SM and 2’C numbers is reported in Table III. The reduction of the ESA is significant, ranging from 87.5% for 8-bit operands to 98.4% for 64-bit operands. As the operand length increases, the ESA for even-length 2’C numbers decreases towards the asymptotic value of 3/8, and the ESA for odd-length 2’C numbers is a constant 3/8. For SM numbers, the ESA decreases at the rate of O(1/n) and asymptotically reaches zero. Thus, for longer operands the ESA reduction, and therefore the power saving, is more pronounced.
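The quoted reductions can be reproduced with a short calculation. Table III is not reproduced in this text, so the sketch below rests on two assumptions: the ESA of a 2’C multiplicand equals the average NEG probability (every bit toggles on negation), and the ESA of an SM multiplicand is that probability divided by n (only the sign bit toggles). The function names are illustrative.

```python
from fractions import Fraction


def p_neg(n: int) -> Fraction:
    """Average NEG probability per Booth triplet for an n-bit multiplier."""
    return Fraction(3, 8) + (Fraction(1, 4 * n) if n % 2 == 0 else 0)


def esa_2c(n: int) -> Fraction:
    return p_neg(n)        # assumption: all n bits switch on negation


def esa_sm(n: int) -> Fraction:
    return p_neg(n) / n    # assumption: only the sign bit switches


def reduction(n: int) -> Fraction:
    """Relative ESA reduction of SM over 2'C; simplifies to 1 - 1/n."""
    return 1 - esa_sm(n) / esa_2c(n)


for n in (8, 16, 32, 64):
    print(n, float(esa_2c(n)), float(esa_sm(n)), f"{float(reduction(n)):.1%}")
```

Under these assumptions the reduction is 1 - 1/n, which gives 87.5% for 8-bit operands and 98.4% for 64-bit operands, matching the figures quoted above.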

4: The algorithm and architecture

4.1: Conversion from 2’C to SM notation

An SM number can be expressed as

$$Y = (-1)^{y_{n-1}} \sum_{i=0}^{n-2} y_i 2^i$$

and a 2’C number can be expressed as

$$Y = -y_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} y_i 2^i$$

For positive numbers, the 2’C and SM notations are identical and no conversion is needed. For negative numbers, the conversion from 2’C to SM can be implemented by complementing all the bits except the sign bit $y_{n-1}$ and adding ‘1’ to the result. If one assumes a uniform distribution of positive and negative numbers, the probability that a number has to be converted is 0.5. Although the conversion adds some delay, it does not offset the power dissipation “gain” due to the SM representation of the multiplicand. Indeed, if the multiplicand is in 2’C notation, one has to execute the negation process for about 40% of all the PPs, and the number of negation processes increases as the operand length increases, while the conversion from 2’C to SM takes place only once for any operand length. For the “add 1” operation, instead of using an n-bit adder, which introduces delay and power overhead, we generate a correction term associated with each PP and then add this correction term to all PPs through the binary addition tree, as shown in Figure 3. In this manner, only one more input is added to the addition tree, while the whole n-bit addition is avoided. The correction term can be generated according to Table IV.
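As a behavioural illustration of the conversion rule (complement every bit except the sign bit, then add 1), the following Python sketch converts an n-bit 2’C bit pattern into a (sign, magnitude) pair. The function name is hypothetical, and the sketch performs the “add 1” immediately instead of deferring it to a correction term in the addition tree as the architecture does.

```python
def twos_complement_to_sm(y: int, n: int) -> tuple[int, int]:
    """Convert an n-bit 2'C bit pattern into (sign, magnitude).

    Behavioural sketch only: the '+1' is applied here directly rather
    than being folded into the adder tree as a correction term.
    """
    mask = (1 << (n - 1)) - 1          # selects the n-1 non-sign bits
    sign = (y >> (n - 1)) & 1          # sign bit y_{n-1}
    low = y & mask
    if sign == 0:
        return 0, low                  # positive: notations coincide
    # Negative: complement the non-sign bits and add 1.
    # For the most negative value the magnitude is 2**(n-1),
    # which needs one more bit than the n-1 magnitude bits.
    return 1, ((~low) & mask) + 1


print(twos_complement_to_sm(0b11111101, 8))  # -> (1, 3), i.e. -3
```

The most negative operand is the one corner case where the magnitude does not fit in n-1 bits, so a hardware implementation has to account for it separately.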

The logic for C1 and C2 is trivial: $C_1 = y_{n-1} \cdot 1Y$ and $C_2 = y_{n-1} \cdot 2Y$. The block diagram, shown in Figure 1, indicates that the 2’C-to-SM conversion adds only one inverter delay, or about 0.5 gate delay, which comes from the complementation of the 2’C number. The correction term does not introduce extra power overhead compared to the traditional 2’C implementation, since the traditional 2’C implementation also needs a similar correction term generator (“adding 1”) to generate the negation of the multiplicand.

4.2: Speeding up the PP accumulation

We have substantially reduced the ESA in the PP generation, but SM numbers are hard to manipulate in arithmetic operations, since the signs of the operands have to be identified separately through a sequence of decisions, costing extra control logic, execution time and power dissipation. On the other hand, RB numbers, represented in the form

$$X = \sum_{i=0}^{n-1} r_i 2^i$$

with digit $r_i \in \{1, 0, -1\}$, are more suitable for high-speed parallel arithmetic computations [1, 6]. Due to the redundancy in RB numbers, one can perform CPF addition by selecting among different digit patterns for the same value. Hence, we further convert the PPs into the RB representation. We adopt the selection rule proposed by Takagi in [1] to perform CPF addition for the PP accumulation. The rule is shown in Table VI. As an example, the CPF addition of two RB numbers is shown in Figure 2. One can see that the carry is limited to adjacent digits and there is no global carry propagation.
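A carry-propagation-free RB addition can be sketched as follows. Table VI is not reproduced in this text, so the selection rule below is an assumption based on the commonly cited form of Takagi's rule: the split of each digit sum into an intermediate carry and an intermediate sum is chosen by looking at the signs of the next lower digit pair, so that adding the incoming carry can never overflow the digit set {-1, 0, 1}. The names `rb_add` and `rb_value` are illustrative.

```python
from itertools import product


def rb_add(x, y):
    """Add two equal-length RB digit vectors (LSB first).

    Returns n+1 digits in {-1, 0, 1}. Each output digit depends only
    on positions i and i-1, so no carry ripples through the word.
    """
    n = len(x)
    carries, sums = [], []
    for i in range(n):
        # Both lower digits non-negative means the incoming carry is 0 or 1.
        lower_nonneg = i == 0 or (x[i - 1] >= 0 and y[i - 1] >= 0)
        t = x[i] + y[i]
        if t == 2:
            c, s = 1, 0
        elif t == 1:
            c, s = (1, -1) if lower_nonneg else (0, 1)
        elif t == 0:
            c, s = 0, 0
        elif t == -1:
            c, s = (0, -1) if lower_nonneg else (-1, 1)
        else:  # t == -2
            c, s = -1, 0
        carries.append(c)
        sums.append(s)
    # Final digit = intermediate sum + carry from the next lower position.
    return [sums[0]] + [sums[i] + carries[i - 1] for i in range(1, n)] + [carries[-1]]


def rb_value(digits):
    """Integer value of an RB digit vector (LSB first)."""
    return sum(d * (1 << i) for i, d in enumerate(digits))


# Exhaustive check on all 4-digit operand pairs.
for a, b in product(product((-1, 0, 1), repeat=4), repeat=2):
    z = rb_add(a, b)
    assert all(d in (-1, 0, 1) for d in z) and rb_value(z) == rb_value(a) + rb_value(b)
```

Because every output digit is a function of two adjacent positions only, the addition delay is independent of the word length, which is what makes an RB adder tree attractive for PP accumulation.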

The SM-to-RB conversion can be carried out as follows. As the RB representation uses the digit set $\{-1, 0, 1\}$, one needs two bits $r_i^l r_i^r$ to represent one digit $r_i$. If we use an SM coding to represent an RB digit, that is, $r_i^l$ to represent the sign and $r_i^r$ to represent the magnitude, we can easily convert an SM number into an RB number. For an SM number $X = x_{n-1} x_{n-2} \ldots x_i \ldots x_1 x_0$, the sign of the number is decided by the sign bit $x_{n-1}$. Therefore, we can group the sign bit $x_{n-1}$ with each of the remaining bits pair by pair, $(x_{n-1}, x_{n-2}), (x_{n-1}, x_{n-3}), \ldots, (x_{n-1}, x_1), (x_{n-1}, x_0)$, and interpret the pairs according to the SM coding rule shown in Table V. Clearly, no operations are needed apart from some wiring.
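The pairing can be illustrated with a few lines of Python. Table V is not reproduced in this text, so the decoding used here is an assumption: a pair (sign, magnitude) stands for the digit 0 when the magnitude bit is 0, for +1 when the pair is (0, 1), and for -1 when the pair is (1, 1). All function names are hypothetical.

```python
def sm_to_rb_pairs(x: int, n: int) -> list[tuple[int, int]]:
    """Pair the sign bit x_{n-1} of an n-bit SM pattern with each magnitude bit.

    Returns (sign, magnitude) pairs, LSB first. This is pure wiring:
    no arithmetic is performed on the bits.
    """
    sign = (x >> (n - 1)) & 1
    return [(sign, (x >> i) & 1) for i in range(n - 1)]


def rb_digit(pair: tuple[int, int]) -> int:
    """Decode one pair (assumed coding: magnitude bit, negated if sign is 1)."""
    sign, magnitude = pair
    return -magnitude if sign else magnitude


def rb_pairs_value(pairs: list[tuple[int, int]]) -> int:
    """Integer value of an RB number given as (sign, magnitude) pairs."""
    return sum(rb_digit(p) * (1 << i) for i, p in enumerate(pairs))


pairs = sm_to_rb_pairs(0b1101, 4)   # SM pattern: sign 1, magnitude 101 = 5
print(pairs, rb_pairs_value(pairs)) # value is -5
```

Every digit of the resulting RB number carries the same sign, which is why the conversion costs no logic.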

4.3: Converting the RB number into 2’C number

The sum of the PPs is in RB form and has to be converted back into 2’C form. This conversion is carried out as follows: from Table V, every digit $x_i^{RB} = (r_i^l r_i^r)$ of the RB number $X_{RB}$ is composed of two bits. The left bit $r_i^l$ represents the sign and the right bit $r_i^r$ represents the magnitude. One can form a number $X_{RB}^+$ from the positive digits of $X_{RB}$ and another number $X_{RB}^-$ from the negative digits of $X_{RB}$. Then, subtracting $X_{RB}^-$ from $X_{RB}^+$, one gets the result in 2’C form. The process can be implemented using a fast adder. Since a fast adder is essential in all multiplication algorithms to produce the final result, the RB-to-2’C conversion does not introduce any extra overhead.
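The split into a positive and a negative part can be sketched as below. The function name and the `width` parameter are illustrative, and the final subtraction uses Python integers in place of the fast adder.

```python
def rb_to_twos_complement(digits: list[int], width: int) -> int:
    """Convert RB digits (LSB first, each in {-1, 0, 1}) to a 2'C bit pattern.

    The positive digits form one binary number, the negative digits
    another; their difference, taken modulo 2**width, is the 2'C result.
    """
    positive = sum(1 << i for i, d in enumerate(digits) if d == 1)
    negative = sum(1 << i for i, d in enumerate(digits) if d == -1)
    return (positive - negative) & ((1 << width) - 1)


# Digits [1, 0, -1] (LSB first) encode 1 - 4 = -3.
print(hex(rb_to_twos_complement([1, 0, -1], 8)))  # -> 0xfd
```

Only one carry-propagating operation is needed for the whole multiplication, and it sits at the very end of the data path.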

4.4: The algorithm and its VLSI architecture

THE ALGORITHM:

Step 1: Convert the multiplicand from 2’C into the SM representation and keep the

multiplier in 2’C form.

Step 2: Apply the radix-4 Booth’s algorithm to generate all the PPs represented in SM

notation.

Step 3: Convert all the partial products from SM into RB representation.

Step 4: Sum up all the PPs through an RB adder tree.

Step 5: Convert the final result from RB into 2’C notation.

The corresponding VLSI architecture for the algorithm is shown in Figure 3. It is

composed of two major parts: the PP generator and the redundant binary addition tree.

The key components in this architecture are: the RB adder in the

addition tree and the Booth decoder in the PP generator.
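The five steps can be tied together in a small behavioural model. This is a sketch under stated assumptions, not the report's Verilog implementation: the Booth encoding is the standard radix-4 one, the RB adder tree of Steps 3 and 4 is abstracted by accumulating positive and negative PPs separately, and all names (`multiply`, `booth_digits`, `to_signed`) are hypothetical.

```python
def to_signed(v: int, n: int) -> int:
    """Interpret an n-bit pattern as a 2'C integer."""
    v &= (1 << n) - 1
    return v - (1 << n) if v >> (n - 1) else v


def booth_digits(x: int, n: int) -> list[int]:
    """Standard radix-4 Booth digits of an n-bit 2'C pattern, LSB first."""
    bits = [(x >> i) & 1 for i in range(n)]
    if n % 2:
        bits.append(bits[-1])                      # sign extension
    digits, prev = [], 0
    for k in range(len(bits) // 2):
        lo, hi = bits[2 * k], bits[2 * k + 1]
        digits.append(-2 * hi + lo + prev)
        prev = hi
    return digits


def multiply(x: int, y: int, n: int = 32) -> int:
    """Multiply n-bit 2'C patterns x (multiplier) and y (multiplicand).

    Returns the 2n-bit 2'C product pattern.
    """
    # Step 1: multiplicand to sign-magnitude; multiplier stays in 2'C.
    ys = to_signed(y, n)
    y_sign, y_mag = int(ys < 0), abs(ys)
    # Step 2: Booth PPs in SM form; a negation only flips the sign bit.
    pps = []
    for k, d in enumerate(booth_digits(x, n)):
        pp_sign = y_sign ^ int(d < 0)
        pp_mag = (y_mag * abs(d)) << (2 * k)
        pps.append((pp_sign, pp_mag))
    # Steps 3-4: RB accumulation, abstracted as two running sums.
    positive = sum(m for s, m in pps if s == 0)
    negative = sum(m for s, m in pps if s == 1)
    # Step 5: back to 2'C with a single subtraction.
    return (positive - negative) & ((1 << (2 * n)) - 1)


print(hex(multiply(0xFFFFFFFF, 7)))  # (-1) * 7 as a 64-bit pattern
```

A model of this kind is mainly useful as a reference against which the outputs of an HDL simulation can be compared.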

6: RESULTS (Snapshots of the RTL schematic)

6.1: Multiplier Interface

6.2: Partial Product generator

6.3: First stage of the adder

6.4: Schematic of final stage of adder

6.5: Schematic of one bit adder (sum and carry)

6.6: Layout (in tsmc018)

7: CONCLUSION

This architecture was chosen with low power as the main objective. All the stages of the architecture have been coded in Verilog HDL, and special care has been taken in the implementation to meet this objective. All the modules involved have been verified functionally. After testing, logic synthesis was carried out, and the delay of each stage was calculated from the synthesis results. The whole design has been synthesized in tsmc018 technology and the delay obtained is around 16 ns. A semi-custom layout of the design has then been done in Autocell.

8: FUTURE SCOPE

Power-versus-delay optimization is the main aim of every design, and various techniques can be applied to achieve it. Since the whole design is modular, with a single one-bit adder repeated throughout the adder tree, optimizing this adder can increase the speed. To this end, transmission-gate designs can be exploited further to design a fast RB adder, which could not be done here because of technology library constraints. Various circuit-level power reduction techniques can also be applied to further reduce the power consumption.

REFERENCES

[1] N. Takagi, et al., “High-Speed VLSI Multiplication Algorithm with a Redundant Binary Addition Tree,” IEEE Trans. on Computers, Vol. C-34, No. 9, pp. 789-796, September 1985.

[2] H. Makino, et al., “A 8.8-ns 54x54-bit Multiplier Using New Redundant Binary Architecture,” Proceedings of 1993 International Conference on Computer Design, Cambridge, MA, USA, pp. 202-205, October 3-6, 1993.

[3] X. Huang, et al., “A High-Performance CMOS Redundant Binary Multiplication-and-Accumulation (MAC) Unit,” IEEE Trans. on Circuits and Systems-I: Fundamental Theory and Applications, Vol. 41, No. 1, pp. 33-39, January 1994.

[4] C. Wallace, “A Suggestion for a Fast Multiplier,” IEEE Trans. on Electronic Computers, Vol. EC-13, pp. 14-17, February 1964.

[5] L. P. Rubinfield, “A Proof of the Modified Booth’s Algorithm for Multiplication,” IEEE Trans. on Computers, Vol. C-24, No. 10, pp. 1014-1015, October 1975.

[6] A. Avizienis, “Signed-Digit Number Representations for Fast Parallel Arithmetic,” IRE Trans. on Electronic Computers, Vol. EC-10, pp. 389-400, September 1961.

[7] N. Weste and K. Eshraghian, Principles of CMOS VLSI Design: A System Perspective, 2nd Edition, p. 555, Addison-Wesley Publishing Company, 1993.
