chapter 3 spst based power optimized...

26
24 CHAPTER 3 SPST BASED POWER OPTIMIZED MULTIPLIER 3.1 INTRODUCTION Multiplication can be considered as a series of repeated additions. The number to be added is the multiplicand; the number of times that it is added is the multiplier; and the result is the product. Each step of addition generates a partial product. In most computers, the operand usually contains the same number of bits. When the operands are interpreted as integers, the product is generally twice the length of operands in order to preserve the information content. This repeated addition method that is suggested by the arithmetic definition is slow that it is almost always replaced by an algorithm that makes use of positional representation. It is possible to decompose multipliers into two parts. The first part is dedicated to the generation of partial products, and the second one collects and adds them. As for adders, it is possible to enhance the intrinsic performance of multipliers. Acting in the generation part, the booth algorithm is often used because it reduces the number of partial products. Multiplication occurs frequently in finite impulse response filters, Fast Fourier transforms, discrete cosine transforms, convolution, and other important DSP and multimedia kernels. Owing to the importance of multiplication in DSP and multimedia applications, several designs have been developed for sub word-parallel multipliers and Multiply-And-Accumulate (MAC) units. The objective of a

Upload: others

Post on 07-May-2020

8 views

Category:

Documents


0 download

TRANSCRIPT

24

CHAPTER 3

SPST BASED POWER OPTIMIZED MULTIPLIER

3.1 INTRODUCTION

Multiplication can be considered as a series of repeated additions.

The number to be added is the multiplicand; the number of times that it is

added is the multiplier; and the result is the product. Each step of addition

generates a partial product. In most computers, the operand usually contains

the same number of bits. When the operands are interpreted as integers, the

product is generally twice the length of operands in order to preserve the

information content. This repeated addition method that is suggested by the

arithmetic definition is slow that it is almost always replaced by an algorithm

that makes use of positional representation. It is possible to decompose

multipliers into two parts. The first part is dedicated to the generation of

partial products, and the second one collects and adds them.

As for adders, it is possible to enhance the intrinsic performance of

multipliers. Acting in the generation part, the booth algorithm is often used

because it reduces the number of partial products. Multiplication occurs

frequently in finite impulse response filters, Fast Fourier transforms, discrete

cosine transforms, convolution, and other important DSP and multimedia

kernels. Owing to the importance of multiplication in DSP and multimedia

applications, several designs have been developed for sub word-parallel

multipliers and Multiply-And-Accumulate (MAC) units. The objective of a

25

good multiplier is to provide a physically compact, good speed and low power

consuming chip.

The basic multiplication principle is two fold:

evaluation of partial products and

accumulation of the shifted partial products

It is performed by the successive additions of the columns of the

shifted partial product matrix. The ‘multiplier’ is successfully shifted and

gates the appropriate bit of the ‘multiplicand’. The delayed, gated instance of

the multiplicand must all be in the same column of the shifted partial product

matrix. They are then added to form the product bit for the particular form.

Multiplication is therefore a multi operand operation. To extend the

multiplication to both signed and unsigned numbers, a convenient number

system would be the representation of numbers in two’s complement format.

Some of the advantages of using two’s complement number system include:

representation of negative numbers in two’s complement

allowing for an easy subtraction function, since – M = M’ + 1

sufficient redundancy allowing for the annihilation of carry or

borrows chains and also, in fact, propagation free addition and

subtraction.

Booth’s recoding technique, used for signed binary numbers

enabling faster multiplications.

3.2 SPURIOUS POWER SUPPRESSION TECHNIQUE (SPST)

The SPST uses a detection logic circuit to detect the effective data

range of arithmetic units, namely adders or multipliers. When a portion of

26

data does not affect the final computing results, the data controlling circuits of

the SPST latch this portion to avoid useless data transitions occurring inside

the arithmetic units. Besides, there is a data asserting control realized by using

registers to further filter out the useless spurious signals of arithmetic unit

every time when the latched portion is being turned on. This asserting control

brings evident power reduction. Figure 3.1 shows the five cases of a 16-bit

addition in which the spurious switching activities occur (Chen et al 2006).

The data are separated into the Most Significant Part (MSP) and the Least

Significant Part (LSP). To know whether the MSP affects the computation

results or not, the detection logic unit is used to detect the effective ranges of

the inputs. The Boolean logical equations from 3.1 to 3.6 given below express

the behavioral principles of the detection logic unit in the MSP circuits of the

SPST-based adder/subtractor:

AMSP = A [15:8]; B MSP = B [15:8]; (3.1)

Aand = A [15].A [14]……A [8]; (3.2)

Band = B [15].B [14]……B [8]; (3.3)

Anor = A [15] + A [14] + …+A[8] ; (3.4)

_____________________

Bnor = B [15] + B [14] + …+B[8] ; (3.5)

____________________

Close = (Aand + Anor). (Band + Bnor) (3.6)

Where ‘A[m]’ and ‘B[n]’ respectively denote the mth

bit of the

operands, A and the nth

bit of the operand ‘B’ and ‘AMSP’ and ‘BMSP’

respectively denote the MSP parts, i.e. the 9th

bit to the 16th bit, of the

operands ‘A’ and ‘B’. When the bits in ‘AMSP, and/or those in ‘BMSP’ are all

one, the value of Aand and/or that of Band respectively become one, while the

bits in AMSP and/or those in BMSP are all zeros, the value of ‘Anor’ and/or that

27

of ‘Bnor’ respectively turn into one. Being one of the three outputs of the

detection logic unit, close

Figure 3.1 Five Cases of 16-bit Addition

denotes whether the MSP circuits can be neglected or not. When the two input

operand can be classified into one of the five classes as shown in Figure 3.1,

the value of close becomes zero which indicates that the MSP circuits can be

closed (Chen et al 2005). Figure 3.1 also denotes that it is necessary to

compensate the sign bit of computing results. Accordingly, the Karnaugh

maps are derived leading to the Boolean equations (3.7) and (3.8) for the

Carr_ ctrl and the sign signals, respectively. In equations (3.7) and (3.8), CLSP

denotes the carry propagated from the LSP circuits:

Carr_ctrl = (CLSP .Aand. Band).(Aand+Anor)(Band +Bnor) (3.7)

Sign = CLSP. (Aand+ Band) + CLSP .Aand. Band (3.8)

28

3.2.1 Proposed Spurious Power Suppression Technique

Figure 3.2 showcases a 16-bit adder/subtractor design based on the

proposed SPST. This example, the 16-bit adder/subtractor, is divided into

MSP and LSP at the place between the 8th

bit and the 9th

bit. Latches

implemented by simple AND gates are used to control the input data of the

MSP. When the MSP is required to produce the result, the input data of MSP

remain the same as usual. When the MSP is not required to produce the result,

the input data of the MSP become zeros to avoid switching power

consumption. From the derived Boolean equations (3.1) to (3.8), the detection

logic unit of the SPST is designed as shown in Figure.3.2 which can

determine whether the input data of MSP should be latched or not. Moreover,

the three 1-bit data is added to control the assertion of the close, sign, and

Carr-ctrl signals in order to further decrease the glitch signals occurring in the

cascaded circuits. These circuits are usually adopted in VLSI architectures

designed for video coding.

Figure 3.2 16-Bit Adder/Subtractor Design Example

29

The SPST uses a detection logic circuit to detect the effective data

range of arithmetic units, e.g., adders or multipliers. When a portion of data

does not affect the final computing results, the data controlling circuits of the

SPST latch this portion to avoid useless data transitions occurring inside the

arithmetic units. Besides, there is a data asserting control realized by using

registers to further filter out the useless spurious signals of arithmetic unit

every time when the latched portion is being turned on. This asserting control

brings appreciable power reduction. In Figure 3.2, SPST is illustrated through

a low power adder/subtractor design example. The adder subtractor is divided

into two parts, MSP and the LSP. The MSP of the original adder/subtractor is

modified to include

1. detection logic circuits

2. data controlling circuits

3. sign extension circuits and

4. logics for calculating carry in and carry out signals

The most important part of this thesis is the design of the control

signal asserting circuits, denoted as asserting circuits in Figure 3.2. Although

this asserting circuit brings sufficient power reduction, it may induce

additional delay. There are two main approaches for the control signal

assertion circuits. They are (1) using registers and (2) using AND gate

3.2.2 Control Signal Assertion Circuit using Registers

The first implementing approach of control signal assertion circuit

using registers is exhibited in Figure 3.3. The three output signals of the

detection logic are

30

1. Close

2. Carr-ctrl and

3. Sign

The three output signals of the detection logic unit are given a

certain amount of delay before they assert. This is demonstrated in the timing

diagram shown in Figure 3.4.

Figure 3.3 Control Signal Assertion Circuit using Registers

Figure 3.4 Timing Diagram for Control Signal Assertion Circuit using

Registers

31

The delay , used to assert the three output signals must be set in

the range of < < , where denotes the data transient period and

denotes the earliest required time of all the inputs. This will filter out the

glitch signals as well as keep the computation results correct. The restriction

that must be greater than to guarantee the registers from latching the

wrong values of control usually decreases the overall speed of the applied

designs. This issue is noticed in high-end applications which demand both

high speed and low power requirements. To solve this problem the other

implementing approach of control signal assertion circuit using AND gate is

adopted.

3.2.3 Control Signal Assertion Circuit using an AND Gate

The detection logic circuit using an AND gate to assert control

signal is exemplified in Figure 3.5. The timing control of delay in this

implementation is slightly different from the one in the earlier

implementation. That is, the range of can be set as 0 < < to filter out

the glitch signals and to keep the computation results correct.

Figure 3.5 Control Signal Assertion Circuit using AND Gate

32

This feature allows the upper level system to assert the close signal

with an arbitrarily short delay closing to the positive edge of the clock signal.

This will provide a more flexible controlling space for the delay . When

speed is seriously concerned, this approach has an extremely high flexibility

by adjusting the data asserting time of the SPST equipped multipliers.

Therefore the proposed advanced SPST can benefit multipliers on both high

speed and low power features.

3.2.4 Comparison

The asserting circuit using AND gate achieves a significant

improvement in speed. It has been found that when implemented with AND

gate, the asserting circuits take the less timing delay and require less power

than when implemented using registers. For the latest Xilinx, Spartan 2

family, has been found that the gate delay for the asserting circuit using AND

gate is 13.507ns, while it is 24.107ns r using registers. The power dissipation

for asserting circuit using AND gate is 3mw while for using register it is

4mw. So the asserting circuit using AND gate is chosen for SPST

implementation. The results are compared in Table 3.1 and based on the

comparison in asserting circuit using AND gate is selected for low power

design.

Table 3.1 Comparison of Asserting Circuits

S.NOAsserting

circuit usingVendor Device

Estimated

delay (ns)

Power

dissipation

(mW)

1 Registers Xilinx Sparton 2 24.107 4

2 AND gate Xilinx Sparton 2 13.507 3

33

3.3 MODIFIED BOOTH MULTIPLIER

In order to achieve high-speed multiplication, multiplication

algorithms using parallel counters, such as the modified Booth algorithm has

been proposed (Booth 1951) and some multipliers based on the algorithms

have been implemented for practical use. This type of multiplier operates

much faster than an array multiplier for longer operands because its

computation time is proportional to the logarithm of the word length of

operands. The modified Booth multiplier on the processor is a major source of

energy consumption for Digital Signal Processing programs. Given a pair of

values to be multiplied, power should be reduced if the value is put in with the

lower recoding weight into the second input of the modified Booth multiplier.

Booth multiplication is a technique that allows for smaller, faster

multiplication circuits, by recoding the numbers that are multiplied. It is the

standard technique used in chip design, and it provides significant

improvements over the "long multiplication" technique. First, the "shift and

add", or normal "long multiplication” multiplier design is briefly reviewed.

3.3.1 Reduction of Partial Products: The Modified Booth Coding

It is possible to reduce the number of partial products by half, by

using the technique of radix 4 Booth recoding. The basic idea is that, instead

of shifting and adding for every column of the multiplier term and multiplying

by 1 or 0 for second column, multiply by ±1, ±2 or 0 to obtain the same

results. So, to multiply by 7, (-1), multiply the partial product aligned

against the least significant bit by -1, and multiply the partial product aligned

with the third column by 2:

34

Partial Product 0 = Multiplicand × -1, shifted left 0 bits (x -1)

Partial Product 1 = Multiplicand × 2, shifted left 2 bits (x 8)

This yields the same result as the equivalent shift and add method:

Partial Product 0 = Multiplicand × 1, shifted left 0 bits (x 1)

Partial Product 1 = Multiplicand × 1, shifted left 1 bits (x 2)

Partial Product 2 = Multiplicand × 1, shifted left 2 bits (x 4)

Partial Product 3 = Multiplicand × 0, shifted left 3 bits (x 0)

The advantage of this method is the halving of the number of

partial products. This is important in circuit design as it relates to the

propagation delay in the running of the circuit, the complexity and power

consumption of its implementation.

It is also important to note that there is comparatively little

complexity in multiplying by 0, 1 or 2. All that is needed is a multiplexer or

equivalent, which has a delay time that is independent of the size of the

inputs. Negating 2's complement numbers has the added complication that

needs to add a "1" to the LSB, but this can be overcome by adding a single

correction term with the necessary "1"s in the correct positions.

3.3.2 Radix – 4 Modified Booth Recoding

In partial product reduction step of multiplier, the Booth recoding is

technique is used. In this technique, recoding is applied to multiplier term, in

which three bits in blocks are considered, such that each block overlaps the

previous block by one bit. Grouping starts from the LSB, and the first block

only uses two bits of the multiplier (since there is no previous block to

overlap). Figure 3.6 depicts the grouping of bits from the multiplier term for

35

use in modified booth recoding. The least significant block uses only two bits

of the multiplier, and assumes a zero for the third bit:

Figure 3.6 Grouping of bits from the Multiplier term

Table 3.2 Partial Product Selection and Generation Process

Block Re - coded digit Operation on X

000 0 0X

001 +1 +1X

010 +1 +1X

011 +2 +2X

100 -2 -2 X

101 -1 -1 X

110 -1 -1 X

111 0 0 X

This overlap is necessary so that it is known what happened in the

last block, as the MSB of the block acts like a sign bit. Each block is decoded

to generate the correct partial product. The encoding of the multiplier Y, using

the modified booth algorithm, generates the following five signed digits, -2, -

1, 0, +1, +2. Each encoded digit in the multiplier performs certain operations

on the multiplicand, X, as illustrated in Table 3.2. The five possible

operations are given in Table 3.3.

36

Table 3.3 Five Possible Operations

Re-coded

digit

Operation on X

0 Add 0 to the partial product

+1 Add X to the partial product

+2 Shift left X one position and add it to the partial

product

-2 Add two’s complement of X to the partial product

-1 Take two’s complement of X and shift left one

position

One possible implementation is in the form of a Booth re-coder

entity, such as the one in Figure 3.7 with its outputs being used to form the

partial product:

Figure 3.7 Booth Re-Coder and its Associated Inputs and Outputs.

In Booth Re-Coder

The zero signal indicates whether the multiplicand is made

zero before being used as a partial product.

The shift signal is used as the control to a 2:1 multiplexer, to

select whether or not the partial product bits are shifted to left

one position.

37

Finally, the negative signal indicates whether or not to invert

all of the bits to create a negative product (which must be

corrected by adding "1" at some later stage).

The described operations for booth recoding and partial product

generation can be expressed in terms of logical operations if desired. But, for

synthesis, it is found to be better to implement the tables in terms of VHDL

case and if/then/else statements.

3.4 PROPOSED HIGH SPEED LOW POWER MULTIPLIER

DESIGN

The proposed high speed low power multiplier is designed by

equipping the SPST on a tree multiplier. There are two distinguished design

considerations in designing the proposed multiplier. They are

1. Applying the SPST on the modified Booth encoder.

2. Applying the SPST on the compression tree.

3.4.1 Applying the SPST on the Modified Booth Encoder

Figure 3.8 shows a computing example of Booth multiplying two

numbers “2AC9” and “006A”. The shadow denotes that the numbers in this

part of Booth multiplication are all zero so that this part of the computations

can be neglected. Saving those computations can significantly reduce the

power consumption caused by the transient signals. Based on the analysis on

multiplication, the SPST-equipped modified-Booth encoder, which is

controlled by a detection unit, is proposed.

38

Figure 3.8 An Example of Booth Multiplying Two Numbers

The detection unit has one of the two operands as its input to decide

whether the Booth encoder calculates redundant computations. As shown in

Figure 3.9, the latches can, respectively, freeze the inputs of MUX-4 to MUX-

7 or only those of MUX-6 to MUX-7 when the PP4 to PP7 or the PP6 to PP7

are zero, to reduce the transition power dissipation.

Figure 3.9 SPST Equipped Modified Booth Encoder

39

3.4.1.1 Partial Product Generation

In order to reduce the number of partial products by half, the

technique of radix 4 Booth recoding is used. The basic idea is that, instead of

shifting and adding for every column of the multiplier term and multiplying

by 1 or 0, the second column is multiplied by ±1, ±2, or 0, to obtain the same

results. To Booth recode the multiplier term, consider the bits in blocks of

three, such that each block overlaps the previous block by one bit. Grouping

starts from the LSB, and the first block only uses two bits of the multiplier

(since there is no previous block to overlap). The least significant block uses

only two bits of the multiplier, and assumes a zero for the third bit. The

overlap is necessary so that its shows what happened in the last block, as the

MSB of the block acts like a sign bit. Each block is decoded to generate the

correct partial product. Figure 3.10 presents the booth partial product

generation circuit. It includes AND/OR/EX-OR logic.

Figure 3.10 Booth Partial Product Selector Logic

40

3.4.1.2 Addition of Partial Products

In order to add the partial products that are generated by the

modified booth encoder, the SPST adder is proposed. The operands are

divided into two parts as the Most Significant Part (MSP) and the Least

Significant Part (LSP). The SPST uses a detection logic circuit to detect the

effective data range of arithmetic units, e.g., adders or multipliers. When a

portion of data does not affect the final computing results, the data controlling

circuits of the SPST latch this portion to avoid useless data transitions

occurring inside the arithmetic units. Besides, there is a data asserting control

realized by using registers to further filter out the useless spurious signals of

arithmetic unit every time when the latched portion is being turned on. This

asserting control brings sufficient power reduction. In SPST adder in order to

perform the normal addition the full adder circuit is used.

3.4.1.3 Full Adder Circuit

Full adder is a one-bit adder which represents a basic building

block for all the more elaborate addition schemes. A full adder has three input

signals and two output signals. The input signals consist of the two operands

to be added and the carry input signal from the previous stage. The output

signals are the sum and carry out signals. The truth table for a full adder

circuit is given in Table 3.4. Operations of a full adder are defined by the

Boolean equations for the sum and carry signals.

The Boolean equations for the sum and carry out signals are given

in equation (3.9) and (3.10) respectively:

Si = Ai Bi Cin (3.9)

Ci+1=Ai . Bi + Cin. Ai + Cin. Bi (3.10)

41

Table 3.4 Truth Table for a Full Adder Circuit

Ai Bi Carry in

(Cin)

Sum (Si) Carry out

(Ci+1)

0 0 0 0 0

0 0 1 1 0

0 1 0 1 0

0 1 1 0 1

1 0 0 1 0

1 0 1 0 1

1 1 0 0 1

1 1 1 1 1

Where Ai, Bi are the input operand to be added in the ith

stage and Cin

is the carry from the previous stage. Si is the sum of ith

stage and Ci+1 is the

carry out to the next stage. From the above equation it is possible to

understand that the realization of the sum functions requires two XOR logic

gates. The Carry function is further rewritten defining the Carry-Propagate Pi

and Carry-Generate Gi terms. The Boolean expression for Carry-Propagate Pi

and Carry-Generate Gi terms are given in equations (3.11 and (3.12)

respectively:

Pi = Ai Bi (3.11)

Gi = Ai . Bi (3.12)

At a given stage i, a carry is generated if Gi is true (i.e. both Ai and

Bi are ONEs), and if Pi is true, the stage propagates an input carry to its output

(i.e. either Ai or Bi is ONE). The logical implementation of the full adder is

shown in Figure 3.11.

42

Figure 3.11 Full Adder Implementation

3.4.2 Applying the SPST on the Compression Tree

The proposed SPST-equipped multiplier is illustrated in Figure

3.12. The PP generator generates five candidates of the partial products, i.e.

{-2A, -A, 0, A, 2A}. These are then selected according to the Booth

encoding results of the operand B. When the operand besides the Booth

encoded one has a small absolute value, there are opportunities to reduce the

spurious power dissipated in the compression tree. According to the

redundancy analysis of the additions, the adders are replaced in compression

tree of the multiplier with the SPST-equipped adders, which are marked with

oblique lines.

43

Figure 3.12 SPST- Equipped Multiplier

3.4.2 Flow Chart

The flow chart of power optimization algorithm for the proposed

SPST equipped multiplier is shown in Figure 3.13. In the first step, the HDL

code for conventional array multiplier and the proposed multiplier is written.

Then the functional verification of both multipliers is ensured with the help of

Model-Sim software tool. After checking the functional verification, the

synthesis process is started by using Xilinx tool. During the synthesis, process

power, delay and area reports are generated. The synthesis reports conclude

that the proposed algorithm consumes less power, lower delay and area

reduction because the modified booth encoding is used to reduce the partial

product. In this regard, SPST is used in adder section for minimizing

unwanted addition.

44

Figure 3.13 Power Optimization Algorithm (Continued)

45

Figure 3.13 Power Optimization Algorithm

46

3.5 SIMULATION RESULTS

In the power optimization method the performance of the proposed

high speed low power multiplier is evaluated by comparing this design with a

conventional array multiplier. Both multipliers can be implemented using

VHDL coding. In order to get the power report and delay report, these

multiplier circuits are synthesized using Xilinx software.

3.5.1 Array Multiplier

The main module of the array multiplier is the array 4 entity. Here

“x” and “y” are the four bit input vectors and “p” is the eight bit output

vector.

Figure 3.14 Simulation Result for Array Multiplier

The two 4 bit input values are forced and their s corresponding

results of 8 bits are obtained in Figure 3.14.

47

3.5.2 Proposed High Speed Low Power Multiplier

The main module of the proposed high speed low power multiplier

is designed by using modified booth algorithm. Here the main module uses

three packages, namely the partial product generation, SPST adder, and Full

adder. The simulation result for the partial product generation using modified

booth encoder is represented in Figure 3.15. The simulation results for 4 x 4

high speed low power multiplier using SPST is arrayed in Figure 3.16.

Figure 3.15 Simulation Results for Partial Product Generator

Figure 3.16 Simulation Results for 4x4 Low Power Multiplier

48

Figure 3.17 shows the simulation results for the 16 x 16 high speed

low power multiplier using SPST. In this input A is the 16 bit multiplicand

vector and input B is the 16 bit multiplier vector, while the output is the

product of 16 x 16 Multiplier.

Figure 3.17 Simulation Result for the 16 X 16 Low Power Multiplier

3.5.3 Synthesis Report

The proposed high speed low power multiplier achieves a

significant improvement in speed and power dissipation. It has been found

that the high speed low power multiplier using SPST takes less time delay and

requires less power than array multiplier. For the latest Xilinx, Spartan 2

family, it has been found that the time delay for the high speed low power

multiplier using SPST is 14.886 ns, while it is 21.068 ns for array multipliers.

The power dissipation for high speed low power multiplier using SPST is

35mW while it is 44 mW for array multipliers. The results are summarized in

Table 3.5.

49

Table 3.5 Comparison of Multipliers

S.NO Multiplier type Vendor DeviceEstimated

delay (ns)

Power

dissipation(mW)

1. Array multiplier XilinxSparton

221.068 44

2.

Proposed high

speed low power

Multiplier

XilinxSparton

214.886 35

In this approach, the reduced power and delay of the low power

Multiplier using SPST on modified booth encoders are evaluated. This design

is verified using VHDL behavioral simulation. The SPST equipped

multipliers using AND gates save about 20.45 % power dissipation and they

having 29.3% more speed than the array multiplier.

3.6 SUMMARY

In the power optimization technique delineated so far, the low

power multiplier is designed by equipping the SPST on a modified Booth

encoder which is controlled by a detection unit using an AND gate. The

modified booth encoder will reduce the number of partial products generated

by a factor of 2. The SPST adder avoids the unwanted addition and thus

minimizes the switching power dissipation. The SPST multiplier

implementation with AND gates shows an extremely high flexibility on

adjusting the data asserting time. This facilitates the robustness of SPST to

attain 29.3% speed improvement and 20.45% power reduction when

compared with the conventional multipliers. This design can be verified using

Modelsim 6.5 using VHDL coding and power delay estimated using Xilinix

software.