
Prof. V.G. Oklobdzija VLSI Arithmetic 1

VLSI Arithmetic: Adders & Multipliers

Prof. Vojin G. Oklobdzija

University of California

http://www.ece.ucdavis.edu/acsel


Introduction

• Digital computer arithmetic belongs to computer architecture; however, it is also an aspect of logic design.

• The objective of computer arithmetic is to develop algorithms that use the available hardware as efficiently as possible.

• Speed, power and chip area are the most commonly used measures of efficiency, which ties the algorithms closely to the implementation technology.


Basic Operations

• Addition

• Multiplication

• Multiply-Add

• Division

• Evaluation of Functions


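Addition of Binary Numbers

Full Adder. The full adder is the fundamental building block of most arithmetic circuits.

The sum and carry outputs are described as:

$c_{i+1} = \bar{a}_i b_i c_i + a_i \bar{b}_i c_i + a_i b_i \bar{c}_i + a_i b_i c_i = a_i b_i + a_i c_i + b_i c_i$

$s_i = \bar{a}_i \bar{b}_i c_i + \bar{a}_i b_i \bar{c}_i + a_i \bar{b}_i \bar{c}_i + a_i b_i c_i$

[Figure: full adder block with inputs a_i, b_i and C_in, and outputs s_i and C_out]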


Addition of Binary Numbers

Inputs          Outputs
ci  ai  bi      si  ci+1    Carry case
0   0   0       0   0
0   0   1       1   0       Propagate
0   1   0       1   0       Propagate
0   1   1       0   1       Generate
1   0   0       1   0
1   0   1       0   1       Propagate
1   1   0       0   1       Propagate
1   1   1       1   1       Generate
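The truth table can be regenerated from the sum and carry equations. The following is a minimal sketch in Python; the function names `full_adder` and `truth_table` are illustrative and do not come from the slides.

```python
from itertools import product


def full_adder(a: int, b: int, c: int) -> tuple[int, int]:
    """Return (s, c_out) for one-bit inputs a, b and carry-in c."""
    s = a ^ b ^ c                          # sum is the XOR of all three inputs
    c_out = (a & b) | (a & c) | (b & c)    # carry is the majority function
    return s, c_out


def truth_table() -> list[tuple[int, int, int, int, int]]:
    """Rows (c_i, a_i, b_i, s_i, c_{i+1}) in the column order used on the slide."""
    rows = []
    for c, a, b in product((0, 1), repeat=3):
        s, c_out = full_adder(a, b, c)
        rows.append((c, a, b, s, c_out))
    return rows


if __name__ == "__main__":
    for row in truth_table():
        print(*row)
```

Each row satisfies a + b + c = s + 2·c_out, which is a quick sanity check on both equations.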


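Full-Adder Implementation

The full-adder operation is defined by the equations:

$s_i = \bar{a}_i \bar{b}_i c_i + \bar{a}_i b_i \bar{c}_i + a_i \bar{b}_i \bar{c}_i + a_i b_i c_i = a_i \oplus b_i \oplus c_i = p_i \oplus c_i$

$c_{i+1} = a_i b_i + a_i \bar{b}_i c_i + \bar{a}_i b_i c_i = g_i + p_i c_i$

Carry-propagate: $p_i = a_i \oplus b_i$ and carry-generate: $g_i = a_i b_i$

A one-bit adder could be implemented as shown.

[Figure: one-bit adder with inputs a_i, b_i and c_in, and outputs s_i and c_out]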


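High-Speed Addition

$s_i = p_i \oplus c_i$

$c_{i+1} = g_i + p_i c_i$

$p_i = a_i \oplus b_i$, $g_i = a_i b_i$

A one-bit adder could be implemented more efficiently, because a MUX is faster.

[Figure: multiplexer-based one-bit adder with inputs a_i, b_i and c_in, and outputs s_i and c_out]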
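Because p_i and g_i can never both be 1, the carry equation c_{i+1} = g_i + p_i c_i is equivalent to a 2:1 multiplexer selected by p_i. The sketch below checks this equivalence exhaustively; the helper names `mux`, `adder_bit_mux` and `adder_bit_gates` are assumptions made for illustration, not the circuit drawn on the slide.

```python
from itertools import product


def mux(sel: int, d0: int, d1: int) -> int:
    """2:1 multiplexer: returns d1 when sel is 1, otherwise d0."""
    return d1 if sel else d0


def adder_bit_mux(a: int, b: int, c_in: int) -> tuple[int, int]:
    """One-bit adder whose carry output is produced by a multiplexer."""
    p = a ^ b
    g = a & b
    s = p ^ c_in
    c_out = mux(p, g, c_in)   # p = 1: pass the carry-in; p = 0: output g
    return s, c_out


def adder_bit_gates(a: int, b: int, c_in: int) -> tuple[int, int]:
    """Reference version using c_out = g + p * c_in directly."""
    p = a ^ b
    g = a & b
    return p ^ c_in, g | (p & c_in)


if __name__ == "__main__":
    for a, b, c in product((0, 1), repeat=3):
        print(a, b, c, adder_bit_mux(a, b, c))
```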


The Ripple-Carry Adder

[Figure: four full adders (FA) in a chain; bit k has inputs A_k, B_k and carry-in C_{i,k}, and outputs S_k and C_{o,k}, with C_{o,0} = C_{i,1}]

Worst-case delay is linear in the number of bits:

$t_{adder} = (N - 1)\,t_{carry} + t_{sum}$

$t_d = O(N)$

Goal: make the fastest possible carry path circuit

From Rabaey
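A behavioural sketch of the ripple-carry adder and of the delay expression above is given below. Bits are stored least-significant first; the helper names (`to_bits`, `from_bits`, `adder_delay`) are illustrative assumptions.

```python
def to_bits(x: int, n: int) -> list[int]:
    """n-bit representation of x, least-significant bit first."""
    return [(x >> i) & 1 for i in range(n)]


def from_bits(bits: list[int]) -> int:
    """Inverse of to_bits."""
    return sum(bit << i for i, bit in enumerate(bits))


def ripple_carry_add(a_bits, b_bits, c_in=0):
    """Add two equal-length bit vectors; the carry ripples from bit 0 upwards."""
    s_bits = []
    carry = c_in
    for a, b in zip(a_bits, b_bits):
        s_bits.append(a ^ b ^ carry)
        carry = (a & b) | ((a ^ b) & carry)
    return s_bits, carry


def adder_delay(n: int, t_carry: float, t_sum: float) -> float:
    """Worst-case delay t_adder = (N - 1) * t_carry + t_sum."""
    return (n - 1) * t_carry + t_sum


if __name__ == "__main__":
    s, c = ripple_carry_add(to_bits(11, 4), to_bits(6, 4))
    print(from_bits(s), c)        # 11 + 6 = 17 -> sum bits 1, carry out 1
    print(adder_delay(32, 1.0, 1.0))
```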


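Inversion Property

[Figure: a full adder (FA) with all inputs and outputs inverted is equivalent to the same full adder without inverters]

$\bar{S}(A, B, C_i) = S(\bar{A}, \bar{B}, \bar{C}_i)$

$\bar{C}_o(A, B, C_i) = C_o(\bar{A}, \bar{B}, \bar{C}_i)$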

From Rabaey
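The inversion property can be confirmed by enumerating all eight input combinations. This is a small sketch; `inversion_property_holds` is a hypothetical helper name.

```python
from itertools import product


def full_adder(a: int, b: int, c: int) -> tuple[int, int]:
    """Return (S, C_o) of a full adder."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def inversion_property_holds() -> bool:
    """Check that inverting all inputs inverts both outputs."""
    for a, b, c in product((0, 1), repeat=3):
        s, c_o = full_adder(a, b, c)
        s_inv, c_o_inv = full_adder(1 - a, 1 - b, 1 - c)
        if (s_inv, c_o_inv) != (1 - s, 1 - c_o):
            return False
    return True


if __name__ == "__main__":
    print(inversion_property_holds())
```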


Minimize Critical Path by Reducing Inverting Stages

[Figure: 4-bit ripple-carry chain of FA' cells alternating between even and odd cells, with inputs A_k, B_k, outputs S_k and carries C_{i,0}, C_{o,0} to C_{o,3}]

Exploit the inversion property.

Note: this needs 2 different types of cells.

From Rabaey


Manchester Carry-Chain Realization of the Carry Path

• Simple and very popular scheme for implementing the carry signal path

[Figure: Manchester carry-chain stage between Carry in and Carry out, with a propagate device, a generate device and a predischarge & kill device, supplied from V_dd]


Manchester Carry Chain

[Figure: carry chain starting at C_{i,0} with propagate devices P0 to P4, generate devices G0 to G4 and supply V_DD]

Kilburn, et al., IEE Proc., 1959.

• Implement P with pass-transistors
• Implement G with pull-up, kill (delete) with pull-down
• Use dynamic logic to reduce the complexity and speed up


Ripple Carry Adder

Carry chain of an RCA implemented using a multiplexer from the standard cell library:

[Figure: three adder bits i, i+1 and i+2 with inputs a, b and outputs s; the critical path runs along the carry chain from c_in through c_i and c_{i+1} to c_out]

Oklobdzija, ISCAS’88


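Pass-Transistor Realization in DPL

[Figure: DPL full adder. The sum path consists of an XOR/XNOR stage on A and B, a multiplexer controlled by C and a buffer producing S; the carry path consists of AND/NAND and OR/NOR stages on A and B, a multiplexer controlled by C and a buffer producing C_O]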


Carry-Skip Adder

MacSorley, Proc. IRE, 1/61

Lehman, Burla, IRE Trans. on Comp., 12/61


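Carry-Skip Adder

[Figure: 4-bit ripple-carry block of full adders (FA) with per-bit propagate and generate signals and carries C_{i,0}, C_{o,0} to C_{o,3}, shown first without and then with a bypass multiplexer controlled by BP]

$BP = P_0 P_1 P_2 P_3$

Idea: if $P_0 P_1 P_2 P_3 = 1$ then $C_{o,3} = C_{i,0}$ (bypass); otherwise the block either kills or generates the carry.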

From Rabaey
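The bypass idea can be sketched for one block as follows. The function name `carry_skip_block` is an assumption for illustration; the sketch models logic values only, not timing. When every bit propagates, the multiplexer forwards the block carry-in directly, which is always the same value the ripple chain would eventually produce.

```python
def to_bits(x: int, n: int) -> list[int]:
    """n-bit representation of x, least-significant bit first."""
    return [(x >> i) & 1 for i in range(n)]


def carry_skip_block(a_bits, b_bits, c_in):
    """One carry-skip block: ripple chain plus a bypass multiplexer.

    Returns (sum_bits, c_out, bypass), where bypass is the block propagate BP.
    """
    p = [a ^ b for a, b in zip(a_bits, b_bits)]
    g = [a & b for a, b in zip(a_bits, b_bits)]
    s_bits = []
    carry = c_in
    for p_i, g_i in zip(p, g):
        s_bits.append(p_i ^ carry)
        carry = g_i | (p_i & carry)      # ripple path
    bypass = int(all(p))                 # BP = P0 P1 P2 P3
    c_out = c_in if bypass else carry    # multiplexer selects the short path
    return s_bits, c_out, bypass


if __name__ == "__main__":
    # 0101 + 1010: every bit propagates, so the carry-in skips the block.
    print(carry_skip_block(to_bits(5, 4), to_bits(10, 4), 1))
```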


Carry-Skip Adder: N bits, k bits/group, r = N/k groups

[Figure: N-bit carry-skip adder built from r groups of k bits, with inputs a_0, b_0 to a_{N-1}, b_{N-1} and sums S_0 to S_{N-1}; each group ripples its carry internally, and the group propagate signals P_0 to P_{r-1} gate an AND-OR skip path from C_in to C_out]

Critical path delay = 2(k-1) + (N/k - 2)


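Carry-Skip Adder

$t_d = 2(k-1)\,t_{RCA} + \left(\frac{N}{k} - 2\right) t_{SKIP}$

[Figure: propagation delay t_p versus number of bits N for the ripple adder and the bypass adder; the crossover region is marked 4..8]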


Variable Block Adder (Oklobdzija, Barnes: IBM 1985)

Carry-chain of a 32-bit Variable Block Adder (Oklobdzija, Barnes: IBM 1985)

[Figure: carry chain of an adder with inputs a_0, b_0 to a_{N-1}, b_{N-1}, divided into groups G_0 to G_m of varying size with group propagate signals P_0 to P_m; the carry signal path from C_in to C_out alternates between rippling inside a group and skipping over groups; the marked worst-case path adds up to 9]

Any-point-to-any-point delay = 9 as compared to 12 for CSKA
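The effect of block sizing can be illustrated with a simple unit-delay model. The sketch below assumes one delay unit per rippled bit and one per skipped block, so that a carry generated in block i and consumed in block j costs (size_i - 1) + (j - i - 1) + (size_j - 1). For r equal blocks of k bits this reduces to 2(k-1) + (r-2). Both the model and the tapered block sizes are assumptions chosen for illustration; they are not taken from the Oklobdzija-Barnes paper.

```python
def carry_path_delay(block_sizes: list[int]) -> int:
    """Worst-case carry delay under the unit-delay model described above."""
    # A carry that starts and ends inside a single block only ripples.
    worst = max(size - 1 for size in block_sizes)
    m = len(block_sizes)
    for i in range(m):
        for j in range(i + 1, m):
            ripple_out = block_sizes[i] - 1   # leave the generating block
            skips = j - i - 1                 # bypass the blocks in between
            ripple_in = block_sizes[j] - 1    # ripple inside the final block
            worst = max(worst, ripple_out + skips + ripple_in)
    return worst


# Hypothetical 32-bit partitions: equal 4-bit blocks versus tapered blocks.
UNIFORM = [4] * 8
TAPERED = [1, 2, 3, 4, 5, 6, 5, 3, 2, 1]

if __name__ == "__main__":
    print(carry_path_delay(UNIFORM))   # 12
    print(carry_path_delay(TAPERED))   # 9
```

Small blocks at both ends keep the ripple-out and ripple-in terms short for the paths that skip the most blocks, which is why the tapered partition does better in this model.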


Carry-chain block size determination for a 32-bit Variable Block Adder (Oklobdzija, Barnes: IBM 1985)


Delay Calculation for Variable Block Adder (Oklobdzija, Barnes: IBM 1985)

[Figure: 4-bit carry chain from C_{i,0} with stages P0/G0 to P3/G3 and a bypass controlled by BP, producing C_{o,3}]

Delay model:


Variable Block Adder (Oklobdzija, Barnes: IBM 1985)

Variable Group Length

Oklobdzija, Barnes, Arith’85

[Equation: delay t_d expressed as a function of N with constants c_1, c_2 and c_3]



Variable Block Lengths

• No closed-form solution for delay
• It is a dynamic programming problem


Delay Comparison: Variable Block Adder (Oklobdzija, Barnes: IBM 1985)


Delay Comparison: Variable Block Adder

[Figure: delay (0 to 16) versus adder size N (4 to 60) for VBA, multi-level VBA and CLA]


Fan-Out Dependency


Fan-In Dependency


Delay Comparison: Variable Block Adder (Oklobdzija, Barnes: IBM 1985)


Carry-Lookahead Adder (Weinberger and Smith)

Weinberger and J. L. Smith, “A Logic for High-Speed Addition”,

National Bureau of Standards, Circ. 591, p.3-12, 1958.



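$c_{i+1} = a_i b_i + a_i \bar{b}_i c_i + \bar{a}_i b_i c_i = g_i + p_i c_i$

$c_{i+2} = g_{i+1} + p_{i+1} c_{i+1}$
$\quad = g_{i+1} + p_{i+1}(g_i + p_i c_i)$
$\quad = g_{i+1} + p_{i+1} g_i + p_{i+1} p_i c_i$

$c_{i+3} = g_{i+2} + p_{i+2} c_{i+2}$
$\quad = g_{i+2} + p_{i+2}(g_{i+1} + p_{i+1} g_i + p_{i+1} p_i c_i)$
$\quad = g_{i+2} + p_{i+2} g_{i+1} + p_{i+2} p_{i+1} g_i + p_{i+2} p_{i+1} p_i c_i$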
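The expanded lookahead equations compute every carry directly from the g, p signals and the incoming carry, with no carry passing through a previous stage. The sketch below evaluates them for three bit positions and compares the result with a ripple computation; the function names are illustrative.

```python
from itertools import product


def lookahead_carries(a_bits, b_bits, c0):
    """Return (c1, c2, c3) from the expanded lookahead equations (LSB first)."""
    g = [a & b for a, b in zip(a_bits, b_bits)]
    p = [a ^ b for a, b in zip(a_bits, b_bits)]
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = (g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
          | (p[2] & p[1] & p[0] & c0))
    return c1, c2, c3


def ripple_carries(a_bits, b_bits, c0):
    """Reference: the same carries computed one stage after another."""
    carries = []
    c = c0
    for a, b in zip(a_bits, b_bits):
        c = (a & b) | ((a ^ b) & c)
        carries.append(c)
    return tuple(carries)


if __name__ == "__main__":
    print(lookahead_carries([1, 1, 0], [1, 0, 1], 0))
```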


Carry-Lookahead Adder

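$G_j = g_{i+3} + p_{i+3} g_{i+2} + p_{i+3} p_{i+2} g_{i+1} + p_{i+3} p_{i+2} p_{i+1} g_i$

$P_j = p_{i+3} p_{i+2} p_{i+1} p_i$

$c_{4(j+1)} = G_j + P_j c_{4j}$

One gate delay to calculate p, g.

One to calculate P and two for G.

Three gate delays to calculate C_{4(j+1)}.

Compare that to 8 in RCA!

[Figure: 4-bit P, G group. Four bit cells with inputs a_i, b_i to a_{i+3}, b_{i+3} produce g and p; the group block takes the carry-in C_{4j} and produces G_j, P_j, the internal carries C_{4j+1}, C_{4j+2}, C_{4j+3} and C_{4(j+1)}]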



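$G^*_j = G_{i+3} + P_{i+3} G_{i+2} + P_{i+3} P_{i+2} G_{i+1} + P_{i+3} P_{i+2} P_{i+1} G_i$

$P^*_j = P_{i+3} P_{i+2} P_{i+1} P_i$

$c_{4(j+1)} = G^*_j + P^*_j c_{4j}$

[Figure: second-level lookahead block combining the group signals G_j, P_j to G_{j+3}, P_{j+3} into G*, P*, with carry-in C_{4j} and carries C_{4j+1}, C_{4j+2}, C_{4j+3} and C_{4(j+1)}]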

Additional two gate delays

C16 will take a total of 5 vs. 32 for RCA !
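The two levels of lookahead can be sketched by applying the same group function twice: first to the bit-level (p, g) signals of each 4-bit group, then to the resulting group signals. The code below computes C16 of a 16-bit addition this way. It is a behavioural sketch with illustrative names (`group_pg`, `carry16`) and says nothing about gate delays.

```python
def group_pg(p: list[int], g: list[int]) -> tuple[int, int]:
    """Group propagate and generate of four (p, g) pairs, index 0 lowest."""
    P = p[3] & p[2] & p[1] & p[0]
    G = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
         | (p[3] & p[2] & p[1] & g[0]))
    return P, G


def carry16(a: int, b: int, c_in: int = 0) -> int:
    """C16 of a 16-bit addition using two levels of 4-bit lookahead."""
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(16)]
    g = [((a >> i) & (b >> i)) & 1 for i in range(16)]
    # First level: one (P, G) pair per 4-bit group.
    groups = [group_pg(p[4 * j:4 * j + 4], g[4 * j:4 * j + 4]) for j in range(4)]
    # Second level: the same function applied to the group signals.
    P_star, G_star = group_pg([P for P, _ in groups], [G for _, G in groups])
    return G_star | (P_star & c_in)


if __name__ == "__main__":
    print(carry16(0xFFFF, 0x0001))   # 1
    print(carry16(0x7FFF, 0x0001))   # 0
```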


32-bit Carry Lookahead Adder

[Figure: 32-bit carry-lookahead adder with inputs a_i, b_i, carry-in C_in and carries C_4, C_8, C_12, C_16, C_20, C_24, C_28 and C_out, built from the following levels:]

• Individual adders generating g_i, p_i and sum S_i
• Carry-lookahead blocks of 4 bits generating G_i, P_i and C_in for the adders
• Carry-lookahead super-blocks of 4-bit blocks generating G*_i, P*_i and C_in for the 4-bit blocks
• Group producing the final carry C_out and C_16

Critical path delay = 1Δ (for g_i, p_i) + 2×2Δ (for G, P) + 3×2Δ (for C_in) + 1 XOR-Δ (for Sum) = approx. 12Δ of delay


Carry-Lookahead Adder (Weinberger and Smith: original derivation)


Carry-Lookahead Adder (Weinberger and Smith)

Please notice the similarity with Parallel-Prefix Adders!


Delay Optimized CLA

B. Lee, V. G. Oklobdzija

Journal of VLSI Signal Processing, Vol.3, No.4, October 1991


Delay Optimized CLA: Lee-Oklobdzija ‘91

(a.) Fixed groups and levels

(b.) variable-sized groups, fixed levels

(c.) variable-sized groups and fixed levels

(d.) variable-sized groups and levels


Two-Levels of Logic Implementation of the Carry Block


Two-Levels of Logic Implementation of the Carry-Lookahead Block


Three-Levels of Logic Implementation of the Carry Block (restricted fan-in)


Three-Levels of Logic Implementation of the Carry Lookahead (restricted fan-in)


Delay Optimized CLA: Lee-Oklobdzija ‘91

Delay: Two-level BCLA Delay: Three-level BCLA


Delay Optimized CLA: Lee-Oklobdzija ‘91

(a.) 2-level BCLA = 8.5 ns (b.) 3-level BCLA = 8.9 ns


Motorola: CLA Implementation Example

A. Naini, D. Bearden and W. Anderson, “A 4.5nS 96b CMOS Adder Design”,

Proceedings of the IEEE Custom Integrated Circuits Conference, May 3-6, 1992.


Critical path in Motorola's 64-bit CLA

Critical path: A, B - G0 - G3:0 - G15:0 - G47:0 - C48 - C60 - C63 - S63

[Figure: 64-bit CLA built from PG blocks and a carry block. The PG blocks combine the bit signals G_i, P_i into group signals such as G3:0, G7:4, G11:8 and G15:12, then into wider signals such as G15:0, G31:16, G47:32 and G63:48 and the cumulative G31:0, G47:0 and G63:0; the carry block produces the carries C0, C4, C8, C12, C16, C32, C48, C52, C56, C60 and C64, and a final PG block produces C61, C62 and C63]


Motorola's 64-bit CLA

conventional PG Block


Motorola's 64-bit CLA

Modified PG Block

Intermediate propagate signals P_{i:0} are generated to speed up C3


Ling’s Adder

Huey Ling, “High-Speed Binary Adder”

IBM Journal of Research and Development, Vol. 25, No. 3, 1981.


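Ling Adder

Variation of CLA:

Ling, IBM J. Res. Dev., 5/81

Conventional CLA:

$G_i = g_i + p_i G_{i-1}$

$S_i = p_i \oplus G_{i-1}$

$p_i = a_i \oplus b_i$, $g_i = a_i b_i$

Ling’s equations:

$H_i = g_i + t_{i-1} H_{i-1}$

$S_i = (t_i \oplus H_i) + g_i t_{i-1} H_{i-1}$

$t_i = a_i + b_i$, $g_i = a_i b_i$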


Ling Adder

$G_i = g_i + p_i G_{i-1} = g_i + t_i G_{i-1}$

$H_i = G_i + G_{i-1} = g_i + G_{i-1}$

H_i propagates information on two bits.

Ling’s equation, using $G_{i-1} = t_{i-1} H_{i-1}$:

$H_i = g_i + t_{i-1} H_{i-1}$

Doran, Trans. on Comp., 9/88


Ling Adder

Conventional:

$G_3 = g_3 + t_3 g_2 + t_3 t_2 g_1 + t_3 t_2 t_1 g_0$

Ling:

$H_3 = g_3 + t_2 g_2 + t_2 t_1 g_1 + t_2 t_1 t_0 g_0 = g_3 + g_2 + t_2 g_1 + t_2 t_1 g_0$
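Ling's recurrence can be checked against ordinary addition with a short behavioural sketch. It assumes no carry into bit 0, so the initial values t_{-1} = H_{-1} = 0 are an assumption of this example, and it recovers the true carry out as G_{n-1} = t_{n-1} H_{n-1}. The name `ling_add` is illustrative.

```python
def to_bits(x: int, n: int) -> list[int]:
    """n-bit representation of x, least-significant bit first."""
    return [(x >> i) & 1 for i in range(n)]


def ling_add(a_bits, b_bits):
    """Add two bit vectors using H_i = g_i + t_{i-1} H_{i-1} (no carry-in)."""
    s_bits = []
    t_prev, h_prev = 0, 0                  # assumed boundary values
    for a, b in zip(a_bits, b_bits):
        g = a & b                          # g_i = a_i b_i
        t = a | b                          # t_i = a_i + b_i
        h = g | (t_prev & h_prev)          # Ling pseudo-carry H_i
        s_bits.append((t ^ h) | (g & t_prev & h_prev))
        t_prev, h_prev = t, h
    c_out = t_prev & h_prev                # conventional carry G = t H
    return s_bits, c_out


if __name__ == "__main__":
    print(ling_add(to_bits(13, 4), to_bits(7, 4)))
```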


S. Naffziger, ISSCC’96


Results: S. Naffziger, “A Subnanosecond 64-b Adder”, ISSCC ‘96

• 0.5 µm technology

• Speed: 0.930 ns

• Nominal process, 80 °C, V = 3.3 V


Conditional-Sum Adder

J. Sklansky, “Conditional-Sum Addition Logic”, IRE Transactions on Electronic

Computers, EC-9, p.226-231, 1960.


Carry-Select Adder

O. J. Bedrij, “Carry-Select Adder”, IRE Transactions on Electronic Computers, June

1962, p.340-34


Carry-Select Adder

Addition under the assumption of Cin = 0 and Cin = 1.
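The carry-select principle can be sketched as follows: every block computes both candidate results, and the incoming carry only drives the final selection. The block width, the function names and the use of integer addition inside each block are assumptions made to keep the example short.

```python
def add_block(a: int, b: int, c_in: int, width: int) -> tuple[int, int]:
    """Add two width-bit values; return (sum modulo 2**width, carry out)."""
    total = a + b + c_in
    return total & ((1 << width) - 1), total >> width


def carry_select_add(a: int, b: int, width: int = 8, block: int = 4):
    """Carry-select addition; width must be a multiple of block."""
    mask = (1 << block) - 1
    result, carry = 0, 0
    for shift in range(0, width, block):
        a_blk = (a >> shift) & mask
        b_blk = (b >> shift) & mask
        sum0, c0 = add_block(a_blk, b_blk, 0, block)   # assumes Cin = 0
        sum1, c1 = add_block(a_blk, b_blk, 1, block)   # assumes Cin = 1
        s, carry = (sum1, c1) if carry else (sum0, c0)  # multiplexer
        result |= s << shift
    return result, carry


if __name__ == "__main__":
    print(carry_select_add(200, 100))   # (44, 1) because 300 = 256 + 44
```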


Carry-Select Adder: combining two 32-b VBAs in select mode

Delay = VBA32 + MUX


Addition Under Non-equal Signal Arrival Profile

Assumption

P. Stelling , V. G. Oklobdzija, "Design Strategies for Optimal Hybrid Final Adders in a Parallel Multiplier", special issue on VLSI Arithmetic, Journal of VLSI Signal Processing, Kluwer

Academic Publishers, Vol.14, No.3, December 1996


Signal Arrival Profile from the Parallel Multiplier Partial-Product Reduction Tree

Oklobdzija, Villeger, IEEE Transactions on VLSI Systems, June, 1995


Performing Multiply-Add Operation in the Multiply Time

P. Stelling, V. G. Oklobdzija, " Achieving Multiply-Accumulate Operation in the

Multiply Time", Thirteenth International Symposium on Computer Arithmetic, Pacific

Grove, California, July 5 - 9, 1997.


Final Adder: Implementation


Recurrence Solver Based Adders

Kogge and Stone, IEEE Trans on Computers, August 1973

Bilgory and Gajski, 18th DAC, 1981

Brent and Kung, IEEE Trans on Computers, March 1982


Recurrence Solver Based Adders

• 1973, Kogge and Stone published a general recurrence scheme for parallel computation
• 1979, Brent and Kung published a Tech. Report on regular layout for parallel adders
• 1980, Guibas and Vuillemin developed a layout scheme based on the recurrence equation for addition
• 1980, Ladner and Fischer published “parallel prefix computation”, Journal of the ACM
• 1981, Bilgory and Gajski published a paper on recurrence structures for automatic cell generation


Recurrence Solver Based Adders

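They are based on the recurrence equations for P, G (what is new there since Weinberger?!):

$G_j = g_{i+3} + p_{i+3} g_{i+2} + p_{i+3} p_{i+2} g_{i+1} + p_{i+3} p_{i+2} p_{i+1} g_i$

$P_j = p_{i+3} p_{i+2} p_{i+1} p_i$

Or:

$G_i = g_i + p_i G_{i-1}$ and $P_i = p_i P_{i-1}$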
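The recurrence can be solved in logarithmic depth by combining (G, P) pairs over spans that double at every step. The sketch below uses a Kogge-Stone style schedule as one possible arrangement; the slides do not prescribe this particular schedule, and the function name `prefix_carries` is illustrative. The carry-in is assumed to be 0.

```python
def prefix_carries(a_bits, b_bits):
    """Return G[i], the carry out of bit i, for all i (LSB first, Cin = 0)."""
    n = len(a_bits)
    G = [a & b for a, b in zip(a_bits, b_bits)]
    P = [a ^ b for a, b in zip(a_bits, b_bits)]
    d = 1
    while d < n:
        G_new, P_new = G[:], P[:]
        for i in range(d, n):
            # Merge the span ending at i with the span ending at i - d.
            G_new[i] = G[i] | (P[i] & G[i - d])
            P_new[i] = P[i] & P[i - d]
        G, P = G_new, P_new
        d *= 2                      # spans double at each level
    return G


if __name__ == "__main__":
    a, b, n = 0b101101, 0b011011, 6
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> i) & 1 for i in range(n)]
    print(prefix_carries(a_bits, b_bits))
```

For n bits the loop runs about log2(n) times, which corresponds to the number of logic levels in the carry network.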


Recurrence Solver Based Adders

[Figure: 16-bit recurrence-solver adder. A row of cells generates (g_i, p_i) for i = 1 to 16, followed by the carry-generation network that produces C_1 to C_16]


Carry-Lookahead Adder (Weinberger and Smith)

Just to remind you! Please notice the similarity with Parallel-Prefix Adders!


Multiplexer Based Adder

Farooqui and Oklobdzija, 1999 Int’l Sym. on VLSI Technology, Taipei, Taiwan, June 8-10, 1999


Multiplexer Based Adder

• Based on the realization that a MUX circuit is faster than a logic gate due to its transmission-gate implementation

• Based on the Carry-Lookahead method (W-S), or recurrence solver.


Multiplexer Based Adder

A. A. Farooqui, V. G. Oklobdzija, F. Chechrazi, 1999 Int’l Sym. on VLSI Technology, Taipei, Taiwan, June 8-10, 1999.

[Figure: 4-bit MUX-based group carry generator. Multiplexers driven by a_0, b_0 to a_3, b_3 form g_0, p_0 to g_3, p_3, then g01, g23, p23 and finally g03, p03]

[Figure: 16-bit MUX-based adder. Four 4-bit MUX-based group carry generators on A0-3, B0-3 to A12-15, B12-15 produce G0-3, P0-3 to G12-15, P12-15; alternating "MUX and NOR" and "MUX and NAND" levels form G0-7, P0-7, G8-15, P8-15, G0-11, P0-11, G0-15, P0-15 and the carries C3, C7, C11 and C15; for each of the upper groups two 4-bit sums are computed for Cin = 0 and Cin = 1, and a multiplexer selects Sum 4-7, Sum 8-11 and Sum 12-15, while Sum 0-3 is computed directly]

[Figure: 4-bit sum generation. Multiplexers driven by a_0, b_0 to a_2, b_2 and Cin use the signals p_0 to p_3 and g_0 to g_2 to produce Sum0 to Sum3]

• Results in a very fast structure
• 7 MUX delays for a 64-b adder
• Delay using standard cell 0.25 µm, 2.5 V, 25 °C:

Adder Size (bits)    Delay (ps)
8                    625
16                   665
32                   710
64                   903


DEC "Alpha" 21064 Adder

• Combination:
  – 8-bit tapered pre-discharged Manchester Carry Chains, with Cin = 0 and Cin = 1
  – 32-bit LSB Carry Lookahead Adder
  – 32-bit MSB Conditional-Sum Adder
  – Carry-Select on most significant 32 bits
  – Latches in the middle: pipelined addition


DEC "Alpha" 21064 Adder

[Figure: block diagram of the 64-bit adder organized in eight byte slices (input operands, byte 7 down to byte 0). Each slice contains a PGK cell, carry chains, latches, switches and dual switches, and a latch & XOR stage producing the result; a lookahead block and a multiplexer with inputs 1 and 0 select the carries, starting from C_in]


DEC "Alpha" 21064 Adder: Results

• The first 200MHz processor

• Built using 0.75 µm technology

• V = 3.3 V, 30 W

• Pipelined (two latches), allowing 5 ns throughput and 10 ns latency


Conclusion: VLSI Implementation of Addition

• Currently, implementation parameters are not reflected in the algorithms used for development

• Layout and wire-delay effects are largely neglected, which is becoming intolerable in the next generation of technology

• Transistor sizing has a large effect, which can outweigh the algorithm

• There is a great disconnect between algorithm and implementation

• New rules and measures of goodness are needed


Multiplication

Parallel Multiplier Implementation


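Multiplication Algorithm:

$P = XY = X \times \sum_{i=0}^{n-1} y_i r^i = \sum_{i=0}^{n-1} X \times y_i r^i$

$p^{(0)} = 0$ initially

$p^{(j+1)} = \frac{1}{r}\left(p^{(j)} + r^n X y_j\right)$ for $j = 0, \ldots, n-1$

$p^{(n)} = XY$ after n steps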
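The recurrence can be traced with a few lines of Python. The sketch assumes a non-negative multiplicand X and a non-negative multiplier Y with n radix-r digits; under those assumptions every division by r is exact, so integer arithmetic suffices. The function name `multiply_recurrence` is illustrative.

```python
def multiply_recurrence(x: int, y: int, n: int, r: int = 2) -> int:
    """Evaluate p(j+1) = (p(j) + r**n * X * y_j) / r for j = 0 .. n-1."""
    assert 0 <= y < r ** n, "Y must fit in n radix-r digits"
    digits = [(y // r ** j) % r for j in range(n)]   # y_0 is the lowest digit
    p = 0                                            # p(0) = 0
    for y_j in digits:
        p = (p + r ** n * x * y_j) // r              # add, then shift right
    return p                                         # p(n) = X * Y


if __name__ == "__main__":
    print(multiply_recurrence(13, 11, n=4))          # radix 2, four steps
    print(multiply_recurrence(13, 11, n=2, r=4))     # radix 4, two steps
```

A higher radix retires more multiplier bits per step, so the same product is reached in fewer iterations.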


Parallel Multipliers

[Figure: partial-product reduction in a parallel multiplier, shown in stages from Step 0 to Step 4]


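4:2 Compressor

[Figure: 4-2 compressor block with inputs I1, I2, I3, I4 and C_i, and outputs S, C and C_o]


Re-designed 4:2 Compressor with 3 XOR Delay

[Figure: 4:2 compressor with inputs I1, I2, I3, I4 and C_in, a multiplexer with inputs 0 and 1, and outputs S, C and C_out]


Three-Dimensional optimization Method: TDM

(Oklobdzija, Villeger, Liu, 1996)

[Figure: 4:2 compressor built from two full adders, each with inputs A, B, Cin and outputs Sum, Carry; the compressor has inputs I1, I2, I3, I4 and C_in and output C_out, with a delay of 3 XORs]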
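A 4:2 compressor built from two full adders can be modelled as below. The particular wiring (I1, I2, I3 into the first adder; its sum, I4 and C_in into the second) is one common arrangement and is an assumption of this sketch. It checks the arithmetic identity and that C_out does not depend on C_in, which is what keeps the carry from rippling along a row of compressors.

```python
from itertools import product


def full_adder(a: int, b: int, c: int) -> tuple[int, int]:
    """Return (sum, carry) of a full adder."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def compressor_4_2(i1: int, i2: int, i3: int, i4: int, c_in: int):
    """Return (S, C, C_out) with i1+i2+i3+i4+c_in = S + 2*(C + C_out)."""
    s1, c_out = full_adder(i1, i2, i3)   # first adder: C_out ignores c_in
    s, c = full_adder(s1, i4, c_in)      # second adder absorbs c_in
    return s, c, c_out


if __name__ == "__main__":
    for bits in product((0, 1), repeat=5):
        print(bits, compressor_4_2(*bits))
```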


Generation of the Partial Product Reduction Tree in TDM multiplier

[Figure: example of a 12 x 12 multiplication (partial products for X*Y = B54 * B1B); each partial-product row is either 101101010100 or all zeros, and one Vertical Compressor Slice (VCS) of full adders (FA) is shown reducing a bit column over time before the final adder]


Speed of Partial Product Reduction for Various Schemes


Booth Recoding Algorithm

x(i+2) x(i+1) x(i)    Add to partial product
0      0      0       +0Y
0      0      1       +1Y
0      1      0       +1Y
0      1      1       +2Y
1      0      0       -2Y
1      0      1       -1Y
1      1      0       -1Y
1      1      1       -0Y
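The recoding table can be exercised with a short sketch that scans the multiplier in overlapping groups of three bits, two new bits per step, and sums the selected multiples of Y. It assumes an n-bit two's complement multiplier with n even and an implicit bit x_{-1} = 0 below the least-significant bit; the names `BOOTH_TABLE`, `booth_digits` and `booth_multiply` are illustrative.

```python
# (x_{i+2}, x_{i+1}, x_i) -> multiple of Y added to the partial product
BOOTH_TABLE = {
    (0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 2,
    (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0,
}


def booth_digits(x: int, n: int) -> list[int]:
    """Recode an n-bit two's complement multiplier (n even) into n/2 digits."""
    # bits[m] holds x_{m-1}; bits[0] is the implicit x_{-1} = 0.
    bits = [0] + [(x >> i) & 1 for i in range(n)]
    return [BOOTH_TABLE[(bits[k + 2], bits[k + 1], bits[k])]
            for k in range(0, n, 2)]


def booth_multiply(x: int, y: int, n: int) -> int:
    """Multiply by summing the recoded digits, each weighted by 4**k."""
    return sum(d * y * 4 ** k for k, d in enumerate(booth_digits(x, n)))


if __name__ == "__main__":
    print(booth_digits(7, 4))         # [-1, 2] because 7 = 2*4 - 1
    print(booth_multiply(-8, 5, 4))   # -40
```

Recoding halves the number of partial products, at the cost of forming the multiples ±Y and ±2Y.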


Organization of Hitachi's DPL multiplier

[Figure: two 54-bit operands feed a Booth's encoder, followed by a Wallace tree of 4-2 compressors and a 108-b CLA adder with Conditional Carry Selection (CCS) producing the 108-bit result]


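Hitachi's 4:2 compressor structure

[Figure: 4:2 compressor built from multiplexers (MUX) with inputs I1, I2, I3, I4 and C_i and outputs S, C and C_o; the critical path is 3 gates]


DPL multiplexer circuit

[Figure: DPL multiplexer with data inputs D0 and D1, select input S, their complements, and true and complementary outputs OUT]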


Conclusion

References:

1. E. Swartzlander, "Computer Arithmetic". Vol. 1&2, IEEE Computer Society Press, 1990.

2. K. Hwang, "Computer Arithmetic : Principles, Architecture and Design", John Wiley and Sons, 1979.

3. M. Ercegovac, “Digital Systems and Hardware/Firmware Algorithms”, Chapter 12: Arithmetic Algorithms and Processors, John Wiley & Sons, 1985.

4. A. Chandrakasan, W. Bowhill, F Fox, Editors, "Design of High Performance Microprocessors Circuits", IEEE Press, July 2000.

5. V. G. Oklobdzija, “High-Performance System Design: Circuits and Logic”, IEEE Press, July 1999.

Also: http://www.ece.ucdavis.edu/acsel/Publications.html
