
A 13.3ns Double-precision Floating-point ALU and Multiplier

H. Yamada, T. Hotta†, T. Nishiyama, F. Murabayashi, T. Yamauchi, and H. Sawamoto

General Purpose Computer Division, Hitachi Ltd. †Hitachi Research Laboratory, Hitachi Ltd.

1 Horiyamashita, Hadano City, Kanagawa Prefecture, 259-13 Japan

Abstract

One-bit pre-shifting before alignment shift, normalization with anticipated leading '1' bit, and pre-rounding techniques have been developed for a floating-point arithmetic logic unit (ALU). In addition, carry select addition and pre-rounding techniques have been developed for a floating-point multiplier. A noise tolerant precharge (NTP) circuit was designed and applied to the ALU and multiplier. These techniques reduced the delay time of the critical path by 24%. Each unit was fabricated in 0.3 µm 2.5V four-layer-metal CMOS technology and achieved a two-cycle latency at 150 MHz.

1. Introduction

Scientific and engineering applications demand exceptionally high floating-point performance, which in turn requires high speed floating-point ALUs and multipliers to reduce execution time. In recent years a number of high speed floating-point execution units have been presented [1]-[6].

A floating-point ALU and multiplier were designed which are each capable of 13.3ns execution. The ALU and multiplier can each individually produce a result in a one-cycle pipelined pitch, achieving a peak execution rate of 300MFLOPS at 150MHz. The units are in full compliance with the IEEE Standard for Binary Floating-point Arithmetic (Std. 754-1985) [7].

The ALU performs add, subtract, compare, convert to smaller/larger floating-point precision value, and convert floating to/from integer instructions for both double and single precision operands. The ALU can produce a denormalized number without requiring an additional cycle.

The multiplier performs floating-point multiplication for both double and single precision operands and integer multiplication for single precision operands. The multiplier is unable to produce a denormalized number, but it can optionally generate a correctly signed zero instead of a denormalized number to avoid a decrease of performance due to a trap.

To accomplish the 13.3ns execution time, these execution units were designed with several new arithmetic and circuit techniques and fabricated with the most advanced silicon technology. This paper describes the arithmetic and circuit techniques developed for the ALU and multiplier.

2. ALU

A block diagram of the floating-point ALU is shown in Figure 1. It is a two stage pipelined machine. In the first stage, the exponent of the larger operand is selected as the common exponent and the fraction of the operand with the smaller exponent is shifted to the right by the alignment shifter. In the second stage, addition/subtraction of the fraction of the larger exponent operand and the right shifted fraction, as well as normalization, IEEE rounding, and correction of the common exponent are performed.
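The first pipeline stage described above can be sketched in a few lines of Python. This is only an illustrative model, not the authors' circuit: the function name `align_stage`, the use of plain integers for exponents and fractions, and the `sticky` flag for the bits shifted out are assumptions made for the example.

```python
def align_stage(e1, f1, e2, f2):
    """Sketch of stage 1: choose the common exponent and align the fractions.

    e1, e2: exponents as integers; f1, f2: fractions as integers
    (hidden bit included). Returns (common_exp, f_large, f_small_shifted, sticky).
    """
    if e1 < e2:                       # SWAP so that operand 1 has the larger exponent
        e1, f1, e2, f2 = e2, f2, e1, f1
    ediff = e1 - e2                   # shift number (Ediff)
    shifted = f2 >> ediff             # alignment right shift
    # Assumption: remember whether any '1' bits were shifted out.
    sticky = (f2 & ((1 << ediff) - 1)) != 0
    return e1, f1, shifted, sticky
```

For instance, `align_stage(3, 0b1000, 5, 0b1101)` swaps the operands and shifts the smaller-exponent fraction right by two positions.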

Three arithmetic techniques are used in the ALU. The first is one-bit pre-shifting of both fractions in effective addition cases. This technique is useful for making the rounding process easier. The second is normalization with the anticipated leading '1' bit of addition/subtraction results. This normalization process is fast even if the anticipated bit is wrong, because the incorrectly shifted fraction can be adjusted by a simple one-bit right shift. The third technique utilized is pre-rounding, which prepares all possible rounded results in parallel with addition/subtraction of aligned fractions and selects the correct one with the leading '1' bit of the add/subtract result. By using this technique, the rounding process is accelerated by 51%.

1063-6404/95 $4.00 © 1995 IEEE

Figure 1. ALU block diagram (highlighted blocks: one-bit pre-shifting, normalization with anticipated leading '1' bit, pre-rounding)

2.1 One-bit pre-shifting

When effective addition is performed, both fractions of the operands are shifted right by one bit first, and then the shifted fraction of the smaller exponent operand is right shifted by the amount of the operand exponent difference (Ediff). The addition result of the aligned fractions lies between 0.1 and 1.111... (represented in binary) and may exceed the IEEE format bit length. A normalization left shift by one or zero bit positions and rounding are performed if necessary.
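The range stated above can be checked with exact rational arithmetic. The sketch below is an illustration under assumptions, not the hardware datapath: the function name `preshifted_sum` is hypothetical, fractions are modeled as values in [1, 2) with the hidden bit included, and no bits are discarded by the shifts.

```python
from fractions import Fraction


def preshifted_sum(f_large: Fraction, f_small: Fraction, ediff: int) -> Fraction:
    """Aligned sum for effective addition with one-bit pre-shifting.

    f_large, f_small: fractions in [1, 2); ediff: exponent difference (>= 0).
    """
    a = f_large / 2                     # pre-shift the larger-exponent operand
    b = (f_small / 2) / (2 ** ediff)    # pre-shift, then align by Ediff
    return a + b


ulp = Fraction(1, 2 ** 52)
smallest = preshifted_sum(Fraction(1), Fraction(1), 60)   # just above 0.1 (binary)
largest = preshifted_sum(2 - ulp, 2 - ulp, 0)             # 1.111...1 (binary)
```

Because each pre-shifted operand is below 1 and the larger one is at least 0.1 in binary (one half), the sum always has its leading '1' either one bit left or one bit right of the binary point.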

When effective subtraction is performed, the fraction of the smaller exponent operand is right shifted by the amount of Ediff. If Ediff=0 or 1, the subtraction result of the aligned fractions is less than or equal to 1, so a large normalization shift is necessary. However, the normalized result already complies with the IEEE format bit length, so rounding is not performed. If Ediff>1, the subtraction result lies between 0.1 and 1.111... and may exceed the IEEE format bit length. In such cases, a normalization left shift by zero or one bit position and rounding are performed if necessary.

2.2 Normalization with anticipated leading '1' bit

The normalization process consists of the following four steps:

1) leading '1' bit anticipation of the add/subtract result;

2) shift control signal generation with priority encoding of the leading '1' bit anticipation result;

3) left shift of the add/subtract result by the shift control signal;

4) one-bit right shift adjustment if the anticipated leading '1' bit is incorrect.

The algorithm used for the leading '1' bit anticipation is as follows. The leading '1' bit anticipation signal Z is:

$$Z = z_0\, z_1\, z_2 \cdots z_i \cdots z_{52} \qquad (1)$$

where the i-th bit of signal Z is defined as

$$z_i = (a_i \oplus b_i) \oplus (a_{i+1} \mid b_{i+1}) \qquad (2)$$

and a_i and b_i are the i-th bits of the fractions to be added (0 ≤ i ≤ 52). In equation (2), ⊕ represents an EXCLUSIVE-OR and | represents an OR. Producing signal Z requires a maximum of 2 gate delays (2 EXCLUSIVE-ORs), which is far smaller than the 7-8 gate delays necessary for a carry lookahead adder.

The leading '1' bit position of signal Z is equal to or only one bit lower than that of the add/subtract result. If the anticipated bit is wrong, the normalization shift is incorrect by one bit position and can be adjusted by a simple one-bit right shift. If the anticipated bit is correct, no further shifting is required. Table 1 shows examples of the leading '1' bit anticipation.

Table 1. Examples of leading '1' bit anticipation

(a) Correct anticipation

  A      0 1 . 0 1 0 0 0 1 1 0 0 0 1 1 1
  B      1 1 . 0 0 0 1 1 0 1 0 1 0 0 0 1
  Z        0 . 0 1 1 1 0 0 0 0 1 1 1 0 0
  (sum     0 . 0 1 1 0 0 0 0 0 1 1 0 0 0)

  shift number=2 (adjustment shift=0)

(b) Incorrect anticipation

  A      0 1 . 0 1 1 0 0 1 1 0 0 0 1 1 1
  B      1 1 . 0 1 0 1 1 0 1 0 1 0 0 0 1
  Z        0 . 0 1 1 0 0 0 0 0 1 1 1 0 0
  (sum     0 . 1 1 0 0 0 0 0 0 1 1 0 0 0)

  shift number=2 (adjustment shift=1)
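The two examples in Table 1 can be reproduced with a short Python sketch. It assumes the anticipation rule z_i = (a_i XOR b_i) XOR (a_{i+1} OR b_{i+1}), with index 0 at the most significant bit, so that Z is one bit shorter than the operands. That indexing is a reconstruction checked only against the table rows, and the function names are hypothetical.

```python
def anticipate(a_bits: str, b_bits: str) -> str:
    """Leading '1' anticipation signal Z, one bit shorter than the inputs.

    Assumed rule: z_i = (a_i XOR b_i) XOR (a_{i+1} OR b_{i+1}).
    """
    a = [int(ch) for ch in a_bits]
    b = [int(ch) for ch in b_bits]
    return "".join(str((a[i] ^ b[i]) ^ (a[i + 1] | b[i + 1]))
                   for i in range(len(a) - 1))


def wrapped_sum(a_bits: str, b_bits: str) -> str:
    """Two's-complement sum of the operands, carry-out discarded."""
    n = len(a_bits)
    total = (int(a_bits, 2) + int(b_bits, 2)) % (1 << n)
    return format(total, f"0{n}b")


def shift_and_adjust(a_bits: str, b_bits: str):
    """Return (shift number taken from Z, one-bit right adjustment)."""
    z = anticipate(a_bits, b_bits)
    total = wrapped_sum(a_bits, b_bits)[1:]   # drop the top bit to line up with Z
    shift = z.index("1")                      # priority encode Z
    adjust = 1 if total.index("1") < shift else 0
    return shift, adjust


# Operands of Table 1 with the binary point removed.
case_a = shift_and_adjust("010100011000111", "110001101010001")
case_b = shift_and_adjust("010110011000111", "110101101010001")
```

In case (b) the leading '1' of Z sits one position below that of the sum, so shifting by the anticipated amount overshoots by one bit and the one-bit right adjustment is needed.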

2.3 Pre-rounding

Figure 2 shows the pre-rounding scheme. The pre-rounding process of the ALU calculations consists of four steps.


The first step involves incrementing the add/subtract result at the 52nd fraction bit position by one. This incrementation is performed in parallel with the addition/subtraction, and the result is ignored if no carry arises from rounding. In the second step, three independent pre-roundings are performed for the three possible positions of the leading '1' bit (type 1, type 2, type 3). Types 1, 2, and 3 are the cases when the leading '1' bit is located one bit left, one bit right, and two or more bits right of the binary point, respectively. Bits 52 to 55 of the add/subtract result, the sign bit, and the rounding mode signals are used to calculate the three rounding carries and the three least significant bits of the rounded results in pre-rounders R0, R1, and R2. In the third step, the correct pre-rounded result is selected according to the most significant two bits of the add/subtract result. If the two bits are '10' or '11', the results of R0 are used. If the bits are '01', the results of R1 are used. Otherwise, the results of R2 are used. In the fourth step, the selected carry is used to select either the incremented result calculated in the first step or the add/subtract result.

Calculation of the most significant two bits of the add/subtract result followed by the selection of the rounding carry signal is one of the most critical paths, so the normalization shifters were intentionally removed from the critical path. In this way they can execute in parallel with the rounding carry calculation.
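The select-after-compute idea behind pre-rounding can be modeled in Python. This is a simplified sketch under stated assumptions rather than the circuit described above: it handles round-to-nearest-even only, treats the result as a `width`-bit integer whose top bit is the one left of the binary point, assumes type 1 drops three low bits and type 2 drops two, and assumes type 3 results are exact. The names `rne_carry` and `pre_round` are hypothetical.

```python
from fractions import Fraction


def rne_carry(lsb: int, rnd: int, sticky: int) -> int:
    """Round-to-nearest-even increment decision from the L, R, S bits."""
    return rnd & (sticky | lsb)


def pre_round(s: int, width: int = 56):
    """Return (rounded value, number of low bits dropped) for result s."""
    candidates = {}
    for typ, drop in ((1, 3), (2, 2)):            # both types prepared up front
        kept = s >> drop                          # truncated result
        rnd = (s >> (drop - 1)) & 1               # round bit
        sticky = int((s & ((1 << (drop - 1)) - 1)) != 0)
        carry = rne_carry(kept & 1, rnd, sticky)
        candidates[typ] = (kept, kept + 1, carry, drop)   # plain and incremented
    top2 = s >> (width - 2)                       # most significant two bits
    if top2 == 0:
        return s, 0                               # type 3: assumed exact
    plain, incremented, carry, drop = candidates[1 if top2 >= 2 else 2]
    return (incremented if carry else plain), drop
```

With a small width such as 8, the model can be compared exhaustively against `round(Fraction(s, 2 ** drop))`, which also rounds halves to even.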

Figure 2. Pre-rounding scheme (type 1: R0, type 2: R1, type 3: R2; L: least, R: round, S: sticky)

3. Multiplier

A block diagram of the floating-point multiplier is shown in Figure 3. Like the ALU, it is also a two stage pipelined machine. In the first stage, one of the fractions is encoded using a Radix 4 Booth algorithm. The generated twenty seven 54-bit partial products are summed by the Wallace tree [8]. The partial product array utilizes a 4-2 compressor tree rather than a 3-2 full adder in order to reduce tree depth and to simplify layout. Exponent addition and rebias are also performed in the first stage. In the second stage, carry propagate addition of the partial product sum (carry save form), as well as normalization, IEEE rounding, and exponent correction are performed.
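As an illustration of why a 53-bit fraction yields twenty seven partial products, the following Python sketch performs textbook radix-4 Booth recoding. It is a generic arithmetic model, not the encoder circuit of this design; the function names and the default `nbits=53` are assumptions for the example, and the partial-product bit widths are not modeled.

```python
def booth_radix4_digits(y: int, nbits: int = 53):
    """Recode an unsigned nbits-wide multiplier into digits in {-2,...,2}."""
    digits = []
    prev = 0                                  # implicit 0 to the right of bit 0
    for i in range(0, nbits + 1, 2):          # overlapping 3-bit groups
        b0 = (y >> i) & 1
        b1 = (y >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits


def booth_multiply(x: int, y: int, nbits: int = 53) -> int:
    """Sum the partial products d_k * x * 4**k selected by the Booth digits."""
    return sum((d * x) << (2 * k)
               for k, d in enumerate(booth_radix4_digits(y, nbits)))


x = (1 << 52) | 0x123456789
y = (1 << 53) - 1
digits = booth_radix4_digits(y)
```

Each digit selects 0, ±x, or ±2x, so every partial product is a shift and optional negation of the multiplicand.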

Two arithmetic techniques are used in the multiplier. The first involves splitting the Wallace tree sum and performing the upper 52-bit and lower 54-bit addition calculations in parallel. The second technique is pre-rounding, which is similar to that of the floating-point ALU.

Figure 3. Multiplier block diagram (highlighted blocks: carry select addition, pre-rounding)

3.1 Carry select addition

The partial product sum (carry save form) is divided into two pairs (one is a pairing of the upper 52 bits and the other is a pairing of the lower 54 bits). With-carry and without-carry cases are calculated for the upper 52 bits, and the correct sum is selected by the carry from the lower 54-bit sum. Addition of the lower pair is also performed in parallel with the upper pair calculation, and the signals P (propagate carry from the most significant bit), L (least bit), G (guard bit), R (round bit), and S (sticky bit) are output.
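The split described above can be sketched in Python as follows. The function name `carry_select_add` and the integer representation of the carry-save pair are assumptions made for illustration; the P, L, G, R, and S outputs are not modeled.

```python
def carry_select_add(sum_vec: int, carry_vec: int, low_bits: int = 54) -> int:
    """Add a carry-save pair by splitting it into upper and lower parts."""
    mask = (1 << low_bits) - 1
    # Lower pair: ordinary addition, producing a carry out of bit low_bits-1.
    low = (sum_vec & mask) + (carry_vec & mask)
    carry_out = low >> low_bits
    # Upper pair: both cases are prepared without waiting for the lower carry.
    upper_without_carry = (sum_vec >> low_bits) + (carry_vec >> low_bits)
    upper_with_carry = upper_without_carry + 1
    upper = upper_with_carry if carry_out else upper_without_carry
    return (upper << low_bits) | (low & mask)
```

The result equals the plain sum of the two vectors; the benefit in hardware is that the upper addition no longer waits for the carry to ripple out of the lower 54 bits.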

3.2 Pre-rounding

The pre-rounding of multiplication results consists of three steps. In the first step, the rounding carries C0 and C1 and the rounded results L0, G0, and L1 are calculated. C0, L0, and G0 are the results when G is the least significant bit, and C1 and L1 are the results when L is the least significant bit. In the second step, the correct rounding carry signal and rounded results are selected from these candidates; the selected G has no meaning when L is the least significant bit. Finally, either the added result or the incremented result is selected by the selected carry signal as the upper portion of the rounded result.


4. Design methodology


4.1 Circuit

The noise tolerant precharge (NTP) circuit, a high speed and high noise tolerance CMOS circuit, was developed and adopted for critical paths of the ALU and multiplier [9]. Figure 4 shows a block diagram of the NTP circuit. The NTP circuit has a noise tolerant PMOS logic which provides high noise immunity. The NTP circuit is precharged when the clock is low, and the circuit is evaluated when the clock is high. The delay time of the circuit is determined by the NMOS logic. The NTP circuit has a 30-36% delay time advantage over a conventional CMOS circuit. Three types of NTP circuits were designed in order to accelerate the time critical paths in the carry lookahead adders and the leading '1' bit anticipator.

Figure 4. NTP circuit block diagram (labels: noise-tolerant logic, discharge NMOS, IN2, IN3, CK, OUT)

4.2 Performance

Figure 5 shows the delay time of the floating-point ALU. Each delay time was calculated by a circuit simulation. By using the above arithmetic techniques, the delay time of the maximum critical path is reduced by 15.4%. Moreover, by using the NTP circuit, the delay time of carry propagation in addition/subtraction and leading '1' bit anticipation in normalization is reduced as well, reducing the total delay time by 24%.
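As a rough consistency check of these percentages, assume that the 17.5 ns value still legible in the Figure 5 residue is the ALU delay without any of the techniques (an assumption, since the figure itself is lost). The two reductions then work out as:

```latex
\begin{align*}
17.5\,\text{ns} \times (1 - 0.154) &\approx 14.8\,\text{ns} && \text{(arithmetic techniques only)} \\
17.5\,\text{ns} \times (1 - 0.24)  &= 13.3\,\text{ns}        && \text{(arithmetic and circuit techniques)} \\
2 \times \frac{1}{150\,\text{MHz}} &\approx 13.3\,\text{ns}  && \text{(two cycles at 150 MHz)}
\end{align*}
```

Under that assumption the 24% reduction is what brings the ALU within the two-cycle budget at 150 MHz.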

Figure 5. Delay time (ns) of the floating-point ALU: (1) without the techniques (17.5ns), (2) with the arithmetic techniques, (3) with the circuit technique

4.3 Floating-point unit

A floating-point unit utilizing the ALU and the multiplier was fabricated in 0.3 µm four-layer-metal CMOS technology. A block diagram of the floating-point unit is shown in Figure 6. The floating-point unit contains four major sub-units: a 128x64-bit register file, an ALU, a multiplier, and a divide/square root unit (Div/Sqrt). The register file has four write ports and four read ports, which allows parallel execution of a load, an ALU, and a multiply operation.

A microphotograph of the floating-point unit is shown in Figure 7. All of the cells were placed manually to shorten the wire length, and the routing of the macro was made automatically except for the critical parts. Table 2 summarizes the floating-point latency and throughput.

5. Conclusion

One-bit pre-shifting before alignment shift, normalization with the anticipated leading '1' bit, and pre-rounding techniques have been developed for a floating-point ALU. Carry select addition and pre-rounding techniques have been developed for a floating-point multiplier. A high speed and high noise tolerant precharge (NTP) CMOS circuit was developed in order to accelerate critical paths of the ALU and multiplier. These techniques reduced the delay time of the critical path by 24%. Each unit was fabricated in 0.3 µm four-layer-metal CMOS technology and achieved a two-cycle latency at 150 MHz.

Acknowledgements

The authors would like to thank A. Anzai, M. Hashimoto, R. Yamagata, T. Kumagai, E. Kamada, T. Nakano, K. Kaneko, N. Ido, Y. Kiyoshige, S. Muto, S. Tanaka, K. Shimamura, K. Matsuo, T. Shimizu, and S. Nakahara of Hitachi Ltd. for their technical support, discussions, and guidance.

References

[1] R. K. Montoye et al., "Design of the IBM RISC System/6000 Floating-Point Execution Unit," IBM J. Res. Develop., Vol. 34, No. 1, pp. 59-70, January 1990.

[2] J. Yetter, "A 100-MHz Superscalar PA-RISC CPU/Coprocessor Chip," Digest of Technical Papers, Symp. VLSI Circuits, pp. 12-13, 1992.

[3] D. W. Dobberpuhl et al., "A 200-MHz 64-b Dual-Issue CMOS Microprocessor," IEEE J. Solid-State Circuits, Vol. 27, No. 11, pp. 1555-1557, November 1992.

[4] L. Gwennap, "Digital Leads the Pack with 21164," Microprocessor Report, Vol. 8, No. 12, pp. 6-10, September 1994.

[5] L. Gwennap, "MIPS R10000 Uses Decoupled Architecture," Microprocessor Report, Vol. 8, No. 14, pp. 18-22, October 1994.

[6] L. Gwennap, "PA-8000 Combines Complexity and Speed," Microprocessor Report, Vol. 8, No. 15, pp. 6-9, November 1994.

[7] IEEE Standard for Binary Floating-point Arithmetic, ANSI/IEEE Standard No. 754, 1988.

[8] C. S. Wallace, "A Suggestion for a Fast Multiplier," Trans. IEEE Electronic Computers, Vol. EC-13, pp. 14-17, February 1964.

[9] F. Murabayashi et al., "2.5V Novel CMOS Circuit Techniques for a 150MHz Superscalar RISC Processor," to be published in ESSCIRC'95, September 1995.

Figure 6. Floating-point unit block diagram

Figure 7. Floating-point unit microphotograph (labeled blocks: Register File, ALU, Multiplier, Div/Sqrt)

Table 2. Floating-point latency and throughput (double-precision)

  Operation        Latency (cycle/ns)    Throughput (cycle/ns)
  ALU operation    2/13.3                1/6.7
  Multiply         2/13.3                1/6.7
  Divide           18/120.0              17/113.3
  Square root      31/206.7              30/200.0
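The nanosecond figures in Table 2 follow directly from the cycle counts at the 150 MHz clock. The small Python sketch below redoes that conversion; the names `CLOCK_MHZ` and `cycles_to_ns` are hypothetical, and the divide latency is read here as 18 cycles, an interpretation of a garbled table entry that the 120.0 ns value supports.

```python
CLOCK_MHZ = 150                      # clock frequency stated in the paper


def cycles_to_ns(cycles: int) -> float:
    """Convert a cycle count to nanoseconds, rounded to one decimal place."""
    return round(cycles * 1000 / CLOCK_MHZ, 1)


# Two units (ALU and multiplier), each delivering one result per cycle.
peak_mflops = 2 * CLOCK_MHZ

latencies_ns = [cycles_to_ns(c) for c in (2, 18, 31)]
throughputs_ns = [cycles_to_ns(c) for c in (1, 17, 30)]
```

The same conversion gives the 13.3 ns in the title: two cycles of about 6.7 ns each.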
