A 13.3ns Double-precision Floating-point ALU and Multiplier


  • 8/16/2019 A 13.3ns Double-precision Floating-point ALU and Multiplier

    1/5

    A 13.3ns Double-precision Floating-point ALU and Multiplier

    H. Yamada, T. Hotta†, T. Nishiyama, F. Murabayashi, T. Yamauchi, and H. Sawamoto

    General Purpose Computer Division, Hitachi Ltd. †Hitachi Research Laboratory, Hitachi Ltd.

    1 Horiyamashita, Hadano City, Kanagawa Prefecture, 259-13 Japan

    Abstract

    One-bit pre-shifting before alignment shift, normalization with
    anticipated leading '1' bit, and pre-rounding techniques have been
    developed for a floating-point arithmetic logic unit (ALU). In
    addition, carry select addition and pre-rounding techniques have been
    developed for a floating-point multiplier. A noise tolerant precharge
    (NTP) circuit was designed and applied to the ALU and multiplier.
    These techniques reduced the delay time of the critical path by 24%.
    Each unit was fabricated in 0.3 µm 2.5V four-layer-metal CMOS
    technology and achieved a two-cycle latency at 150 MHz.

    1. Introduction

    Scientific and engineering applications demand exceptionally high
    floating-point performance, which in turn requires high speed
    floating-point ALUs and multipliers to reduce execution time. In
    recent years a number of high speed floating-point execution units
    have been presented [1]-[6].

    A floating-point ALU and a multiplier were designed, each capable of
    13.3ns execution. The ALU and multiplier can each individually
    produce a result in a one-cycle pipelined pitch, achieving a peak
    execution rate of 300MFLOPS at 150MHz. The units are in full
    compliance with the IEEE Standard for Binary Floating-point
    Arithmetic (Std. 754-1985) [7].
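
    The headline figures follow directly from the clock rate. The short
    check below is only an illustration: the variable names are not from
    the paper, and it assumes the 150MHz clock, the two-cycle latency, and
    two units that each deliver one result per cycle.

```python
CLOCK_HZ = 150e6        # clock frequency stated in the paper
LATENCY_CYCLES = 2      # two-stage pipeline, so two cycles of latency
UNITS = 2               # ALU and multiplier, one result per cycle each

cycle_ns = 1e9 / CLOCK_HZ                 # duration of one cycle in ns
latency_ns = LATENCY_CYCLES * cycle_ns    # latency of one operation in ns
peak_mflops = UNITS * CLOCK_HZ / 1e6      # peak results per second, in millions

print(f"cycle {cycle_ns:.2f} ns, latency {latency_ns:.1f} ns, "
      f"peak {peak_mflops:.0f} MFLOPS")
```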

    The ALU performs add, subtract, compare, convert to smaller/larger
    floating-point precision value, and convert floating to/from integer
    instructions for both double and single precision operands. The ALU
    can produce a denormalized number without requiring an additional
    cycle.

    The multiplier performs floating-point multiplication for both
    double and single precision operands and integer multiplication for
    single precision operands. The multiplier is unable to produce a
    denormalized number, but it can optionally generate a correctly
    signed zero instead of a denormalized number to avoid the performance
    loss caused by a trap.

    To accomplish the 13.3ns execution time, these execution units were
    designed with several new arithmetic and circuit techniques and
    fabricated with the most advanced silicon technology. This paper
    describes the arithmetic and circuit techniques developed for the ALU
    and multiplier.

    2. ALU

    A block diagram of the floating-point ALU is shown in Figure 1. It is
    a two stage pipelined machine. In the first stage, the exponent of
    the larger operand is selected as the common exponent and the
    fraction of the operand with the smaller exponent is shifted to the
    right by the alignment shifter. In the second stage,
    addition/subtraction of the fraction of the larger exponent operand
    and the right shifted fraction, as well as normalization, IEEE
    rounding, and correction of the common exponent, are performed.

    Three arithmetic techniques are used in the ALU. The first is one-bit
    pre-shifting of both fractions in effective addition cases. This
    technique is useful for making the rounding process easier. The
    second is normalization with the anticipated leading '1' bit of
    addition/subtraction results. This normalization process is fast even
    if the anticipated bit is wrong, because the incorrectly shifted
    fraction can be adjusted by a simple one-bit right shift. The third
    technique is pre-rounding, which prepares all possible rounded
    results in parallel with addition/subtraction of the aligned
    fractions and selects the correct one with the leading '1' bit of the
    add/subtract result. By using this technique, the rounding process is
    accelerated by 51%.


    1063-6404/95 $4.00 © 1995 IEEE


    [Figure 1. ALU block diagram. Inputs: S1, E1, S2, E2, F1, F2. Blocks:
    SWAP, alignment shift by the shift number (Ediff), SELECT. Outputs:
    ST, ET, FT(1:51), FT(52). Marked techniques: one-bit pre-shifting,
    normalization with anticipated leading '1' bit, pre-rounding.]

    2.1 One-bit pre-shifting

    When effective addition is performed, both fractions of the operands
    are shifted right by one bit first, and then the shifted fraction of
    the smaller exponent operand is right shifted by the amount of the
    operand exponent difference (Ediff). The addition result of the
    aligned fractions lies between 0.1 and 1.111... (represented in
    binary) and may exceed the IEEE format bit length. A normalization
    shift left by one/zero bit position and rounding are performed if
    necessary.

    When effective subtraction is performed, the fraction of the smaller
    exponent operand is right shifted by the amount of Ediff. If
    Ediff = 0 or 1, the subtraction result of the aligned fractions is
    less than or equal to 1, so a large normalization shift can be
    necessary. However, the normalized result already complies with the
    IEEE format bit length, so rounding is not performed. If Ediff > 1,
    the subtraction result lies between 0.1 and 1.111... and may exceed
    the IEEE format bit length. In such cases, a normalization shift left
    by zero/one bit position and rounding are performed if necessary.
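
    The effective addition case can be illustrated with exact rational
    arithmetic. This is a minimal sketch, not the hardware datapath: the
    function name is hypothetical, significands are modelled as exact
    fractions in [1, 2) so no bits are lost in the shifts, and bumping the
    common exponent by one to compensate for the pre-shift is an
    assumption made here so that the value is preserved.

```python
from fractions import Fraction

def align_effective_add(e1, f1, e2, f2):
    """Align two significands in [1, 2) for an effective addition.

    Returns (common_exponent, larger_fraction, smaller_fraction) after the
    one-bit pre-shift and the Ediff alignment shift.
    """
    if e1 < e2:                        # SWAP: operand 1 gets the larger exponent
        e1, f1, e2, f2 = e2, f2, e1, f1
    ediff = e1 - e2
    big = f1 / 2                       # one-bit pre-shift of both fractions
    small = (f2 / 2) / (2 ** ediff)    # then right shift by Ediff
    # Assumption: the common exponent grows by one to offset the pre-shift.
    return e1 + 1, big, small

# 1.75 * 2^1 + 1.5 * 2^3
exp, big, small = align_effective_add(1, Fraction(7, 4), 3, Fraction(3, 2))
total = big + small                    # lies between 0.1 and 1.111... in binary
```

    Because both pre-shifted fractions are below 1 and the larger one is at
    least 0.1 in binary, the sum has its leading '1' either one bit left or
    one bit right of the binary point, which is why at most a one-bit
    normalization shift is needed.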

    2.2 Normalization with anticipated leading '1' bit

    The normalization process consists of the following four steps:

    1) leading '1' bit anticipation of the add/subtract result;
    2) shift control signal generation with priority encoding of the
       leading '1' bit anticipation result;
    3) left shift of the add/subtract result by the shift control signal;
    4) one-bit right shift adjustment if the anticipated leading '1' bit
       is incorrect.

    The algorithm used for the leading '1' bit anticipation is as
    follows. The leading '1' bit anticipation signal Z is:

        Z = z_0 z_1 z_2 \cdots z_i \cdots z_{52}                  (1)

    where the i-th bit of signal Z is defined as

        z_i = (a_{i-1} \oplus b_{i-1}) \oplus (a_i \mid b_i)      (2)

    and a_i and b_i are the i-th bits of the fractions to be added
    (0 \le i \le 52). In equation (2), \oplus represents an EXCLUSIVE-OR
    and \mid represents an OR. Producing signal Z requires a maximum of 2
    gate delays (2 EXCLUSIVE-ORs), which is far smaller than the 7-8 gate
    delays necessary for a carry lookahead adder. The leading '1' bit
    position of signal Z is equal to or only one bit lower than that of
    the add/subtract result. If the anticipated bit is wrong, the
    normalization shift is incorrect by one bit position and can be
    adjusted by a simple one-bit right shift. If the anticipated bit is
    correct, no further shifting is required. Table 1 shows examples of
    the leading '1' bit anticipation.

    Table 1. Examples of leading '1' bit anticipation

    (a) Correct anticipation

        A       01.0100011000111
        B       11.0001101010001
        Z        0.0111000011100
        (sum     0.0110000011000)
        shift number=2 (adjustment shift=0)

    (b) Incorrect anticipation

        A       01.0110011000111
        B       11.0101101010001
        Z        0.0110000011100
        (sum     0.1100000011000)
        shift number=2 (adjustment shift=1)
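
    The two rows of Table 1 can be reproduced with a few lines of integer
    arithmetic. The sketch below is illustrative only: the function names
    are hypothetical, and it assumes that bit i of Z (counted from the
    most significant end, aligned on the binary point) is
    (a[i-1] XOR b[i-1]) XOR (a[i] OR b[i]), which is the relation that
    matches both examples in the table. Operands are the 15-bit words A
    and B of the table; Z and the sum are kept to 14 bits, and the sum is
    assumed to be non-negative.

```python
def anticipate(a, b, width):
    """Leading-'1' anticipation word Z for (width+1)-bit operands a and b.

    Bit i of Z (MSB first) is (a[i-1] XOR b[i-1]) XOR (a[i] OR b[i]).
    """
    mask = (1 << width) - 1
    return (((a ^ b) >> 1) ^ (a | b)) & mask

def normalize(a, b, width):
    """Return (shift, adjust, normalized) for the sum a + b, kept to width bits."""
    mask = (1 << width) - 1
    total = (a + b) & mask                  # carry out of the top is discarded
    z = anticipate(a, b, width)
    shift = width - z.bit_length()          # priority-encode the leading '1' of Z
    shifted = total << shift                # left shift by the anticipated amount
    adjust = 1 if shifted.bit_length() > width else 0   # over-shifted by one bit
    return shift, adjust, shifted >> adjust

WIDTH = 14
A1, B1 = int("010100011000111", 2), int("110001101010001", 2)   # Table 1(a)
A2, B2 = int("010110011000111", 2), int("110101101010001", 2)   # Table 1(b)

for a, b in ((A1, B1), (A2, B2)):
    shift, adjust, result = normalize(a, b, WIDTH)
    print(f"Z={anticipate(a, b, WIDTH):014b} shift={shift} adjust={adjust} "
          f"normalized={result:014b}")
```

    Z depends only on neighbouring operand bits, so it is available without
    waiting for carry propagation; the price is the possible one-bit error
    that the final adjustment shift removes.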

    2.3 Pre-rounding

    Figure 2 shows the pre-rounding scheme. The pre-rounding process of
    the ALU calculations consists of four steps.

    The first step involves incrementing the add/subtract result at the
    52nd binary place by one. This incrementation is performed in
    parallel with the addition/subtraction, and the result is ignored if
    no carry arises from rounding. In the second step, three independent
    pre-roundings are performed for the three possible positions of the
    leading '1' bit (type 1, type 2, type 3). Types 1, 2, and 3 are the
    cases when the leading '1' bit is located one bit left, one bit
    right, and two or more bits right of the binary point. Bits 52 to 55
    of the add/subtract result, the sign bit, and the rounding mode
    signals are used to calculate the three rounding carries and the
    three least significant bits of the rounded results in pre-rounders
    R0, R1, and R2. In the third step, the correct pre-rounded result is
    selected according to the most significant two bits of the
    add/subtract result. If the two bits are '10' or '11', the results of
    R0 are used. If the bits are '01', the results of R1 are used.
    Otherwise, the results of R2 are used. In the fourth step, the
    selected carry is used to select either the incremented result
    calculated in the first step or the add/subtract result.

    Calculation of the most significant two bits of the add/subtract
    result followed by the selection of the rounding carry signal is one
    of the most critical paths, so the normalization shifters were
    intentionally removed from the critical path. In this way they can
    execute in parallel with the rounding carry calculation.
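
    The four pre-rounding steps can be modelled in software as follows.
    This is a simplified sketch under stated assumptions, not the circuit
    itself: the function names are hypothetical, only round-to-nearest-even
    is modelled (the hardware also takes the sign bit and the rounding
    mode), the four lowest bits of the result play the role of bits 52 to
    55, and a type 3 result is assumed to be exact with its two lowest bits
    equal to zero.

```python
def round_up(l, r, s):
    """Round-to-nearest-even decision from the least, round and sticky bits."""
    return r & (l | s)

def pre_round(s, width):
    """Round a width-bit add/subtract result whose four low bits are the
    rounding positions, following the four pre-rounding steps."""
    upper = s >> 4
    upper_inc = upper + 1                    # step 1: speculative increment
    b3, b2, b1, b0 = (s >> 3) & 1, (s >> 2) & 1, (s >> 1) & 1, s & 1

    # Step 2: three pre-rounders, each giving (carry, low bits, low bit count).
    up0 = round_up(b3, b2, b1 | b0)          # R0: leading '1' in the top bit
    r0 = (b3 & up0, b3 ^ up0, 1)
    low2 = ((b3 << 1) | b2) + round_up(b2, b1, b0)   # R1: leading '1' one bit lower
    r1 = (low2 >> 2, low2 & 3, 2)
    r2 = (0, (b3 << 1) | b2, 2)              # R2: exact result, nothing to round

    # Step 3: select a pre-rounder by the two most significant bits.
    top2 = s >> (width - 2)
    carry, low, n = r0 if top2 >= 2 else r1 if top2 == 1 else r2

    # Step 4: the selected carry picks the incremented or the plain upper part.
    return ((upper_inc if carry else upper) << n) | low

def reference_round(s, drop):
    """Plain round-to-nearest-even that drops the low `drop` bits (for checking)."""
    kept, rem, half = s >> drop, s & ((1 << drop) - 1), 1 << (drop - 1)
    if rem > half or (rem == half and (kept & 1) == 1):
        kept += 1
    return kept
```

    The point of the arrangement is that neither the increment nor the
    three rounding decisions wait for the position of the leading '1';
    once the top two bits are known, only two selections remain.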

    [Figure 2. Pre-rounding scheme. Add/subtract result bits: s0 s1 s2 ...
    s52 s53 s54 s55. Type 1 (R0): s0 = 1. Type 2 (R1): s0 s1 = 01. Type 3
    (R2): s0 s1 = 00. L: least significant bit, R: round bit, S: sticky
    bit. The selected pre-rounder supplies the rounding carry and r52.]

    3. Multiplier

    A block diagram of the floating-point multiplier is shown in Figure
    3. Like the ALU, it is also a two stage pipelined machine. In the
    first stage, one of the fractions is encoded using a radix-4 Booth
    algorithm. The generated twenty-seven 54-bit partial products are
    summed by the Wallace tree [8]. The partial product array utilizes a
    4-2 compressor tree rather than 3-2 full adders in order to reduce
    tree depth and to simplify layout. Exponent addition and rebias are
    also performed in the first stage. In the second stage, carry
    propagate addition of the partial product sum (carry save form), as
    well as normalization, IEEE rounding, and exponent correction, are
    performed.
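
    The count of twenty-seven partial products is what radix-4 Booth
    recoding gives for a 53-bit significand (52 stored bits plus the hidden
    bit). The sketch below shows the textbook recoding, not the paper's
    encoder circuit: the function names are hypothetical, and the partial
    products are simply summed with Python integers instead of a Wallace
    tree.

```python
def booth_radix4(multiplier, bits):
    """Recode an unsigned `bits`-bit multiplier into radix-4 Booth digits
    in {-2, -1, 0, 1, 2}, least significant digit first."""
    digits = []
    prev = 0                                 # implicit 0 to the right of bit 0
    for k in range(bits // 2 + 1):           # extra top digit keeps the value unsigned
        b0 = (multiplier >> (2 * k)) & 1
        b1 = (multiplier >> (2 * k + 1)) & 1
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits

def multiply(multiplicand, multiplier, bits=53):
    """Sum the partial products selected by the Booth digits."""
    return sum((d * multiplicand) << (2 * k)
               for k, d in enumerate(booth_radix4(multiplier, bits)))

x = (1 << 52) | 0xABCDEF12345               # two sample 53-bit significands
y = (1 << 53) - 1
print(len(booth_radix4(y, 53)), multiply(x, y) == x * y)
```

    Each digit selects 0, ±1 or ±2 times the multiplicand, so a partial
    product needs one bit more than the 53-bit significand, in line with
    the 54-bit width quoted above.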

    Two arithmetic techniques are used in the multiplier. The first
    involves splitting the Wallace tree sum and performing the upper
    52-bit and lower 54-bit addition calculations in parallel. The second
    technique is pre-rounding, which is similar to that of the
    floating-point ALU.

    [Figure 3. Multiplier block diagram. Inputs: E1, E2, F1, F2. Blocks:
    radix-4 Booth encoder, partial products. Outputs: ET, FT(1:51),
    FT(52). Marked techniques: carry select addition, pre-rounding.]

    3.1 Carry select addition

    The partial product sum (carry save form) is divided into two pairs
    (one is a pairing of the upper 52 bits and the other is a pairing of
    the lower 54 bits). With-carry and without-carry cases are calculated
    for the upper 52 bits, and the correct sum is selected by the carry
    from the lower 54-bit sum. Addition of the lower pair is also
    performed in parallel with the upper pair calculation, and the
    signals P (propagate carry from the most significant bit), L (least
    bit), G (guard bit), R (round bit), and S (sticky bit) are output.
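
    The selection can be sketched with plain integers. This is an
    illustration under simplifying assumptions: the function and parameter
    names are hypothetical, the two carry-save vectors are passed as
    Python integers, and the P, L, G, R and S outputs of the lower adder
    are not modelled.

```python
LOW_BITS = 54   # width of the lower pair, as in the text

def carry_select_add(sum_vec, carry_vec, low_bits=LOW_BITS):
    """Add the two carry-save vectors with a carry-select split."""
    low_mask = (1 << low_bits) - 1

    # Lower pair: ordinary addition; its carry-out is the select signal.
    lo = (sum_vec & low_mask) + (carry_vec & low_mask)
    carry_out = lo >> low_bits

    # Upper pair: both candidates are formed without waiting for the lower
    # sum. Hardware would use two adders; here the second is hi0 + 1.
    hi0 = (sum_vec >> low_bits) + (carry_vec >> low_bits)   # without carry
    hi1 = hi0 + 1                                           # with carry

    hi = hi1 if carry_out else hi0
    return (hi << low_bits) | (lo & low_mask)

s_vec = (5 << LOW_BITS) | ((1 << LOW_BITS) - 1)   # lower bits all ones
c_vec = (3 << LOW_BITS) | 1                       # forces a carry into the upper pair
print(carry_select_add(s_vec, c_vec) == s_vec + c_vec)
```

    The carry from the lower 54 bits only drives a selection, so the upper
    addition does not have to wait for carry propagation through the lower
    bits.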

    3.2 Pre-rounding

    The pre-rounding of multiplication results consists of three steps.
    In the first step, the rounding carries C_0 and C_1 and the rounded
    results L_0, G_0, and L_1 are calculated. C_0, L_0, and G_0 are the
    results when G is the least significant bit, and C_1 and L_1 are the
    results when L is the least significant bit. In the second step, the
    correct rounding carry and the correct rounded L and G bits are
    selected. The selected G bit has no meaning when L is the least
    significant bit. Finally, either the added result or the incremented
    result is selected by the selected carry signal as the upper portion
    of the rounded result.

    4. Design methodology

    4.1 Circuit

    The noise tolerant precharge (NTP) circuit, a high speed and high
    noise tolerance CMOS circuit, was developed and adopted for critical
    paths of the ALU and multiplier [9]. Figure 4 shows a block diagram
    of the NTP circuit. The NTP circuit has a noise tolerant PMOS logic
    which provides high noise immunity. The NTP circuit is precharged
    when the clock is low, and the circuit is evaluated when the clock is
    high. The delay time of the circuit is determined by the NMOS logic.
    The NTP circuit has a 30-36% delay time advantage over a conventional
    CMOS circuit. Three types of NTP circuits were designed in order to
    accelerate the time critical paths in the carry lookahead adders and
    the leading '1' bit anticipator.

    [Figure 4. NTP circuit block diagram. Labels: noise-tolerant logic,
    discharge NMOS, inputs IN2 and IN3, clock CK, output OUT.]

    4.2 Performance

    Figure 5 shows the delay time of the floating-point ALU. Each delay
    time was calculated by circuit simulation. By using the above
    arithmetic techniques, the delay time of the maximum critical path is
    reduced by 15.4%. Moreover, by using the NTP circuit, the delay time
    of carry propagation in addition/subtraction and leading '1' bit
    anticipation in normalization is reduced as well, reducing the total
    delay time by 24%.

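    [Figure 5. Delay time of the floating-point ALU. Horizontal axis:
    delay time (ns), 0 to 15. Bars: (1) without the techniques (17.5ns),
    (2) with arithmetic techniques, (3) with circuit technique. Segments:
    format/align etc., alignment shift, add/sub, normalize/round.]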

    4.3 Floating-point unit

    A floating-point unit utilizing the ALU and the multiplier was
    fabricated in 0.3 µm four-layer-metal CMOS technology. A block
    diagram of the floating-point unit is shown in Figure 6. The
    floating-point unit contains four major sub-units: a 128x64-bit
    register file, an ALU, a multiplier, and a divide/square root unit
    (Div/Sqrt). The register file has four write ports and four read
    ports, which allows parallel execution of a load, an ALU, and a
    multiply operation.

    A microphotograph of the floating-point unit is shown in Figure 7.
    All of the cells were placed manually to shorten the wire length, and
    the routing of the macro was done automatically except for the
    critical parts. Table 2 summarizes the floating-point latency and
    throughput.

    5. Conclusion

    One-bit pre-shifting before alignment shift, normalization with the
    anticipated leading '1' bit, and pre-rounding techniques have been
    developed for a floating-point ALU. Carry select addition and
    pre-rounding techniques have been developed for a floating-point
    multiplier. A high speed and high noise tolerance precharge (NTP)
    CMOS circuit was developed in order to accelerate critical paths of
    the ALU and multiplier. These techniques reduced the delay time of
    the critical path by 24%. Each unit was fabricated in 0.3 µm
    four-layer-metal CMOS technology and achieved a two-cycle latency at
    150 MHz.

    Acknowledgements

    The authors would like to thank A. Anzai, M. Hashimoto, R. Yamagata,
    T. Kumagai, E. Kamada, T. Nakano, K. Kaneko, N. Ido, Y. Kiyoshige,
    S. Muto, S. Tanaka, K. Shimamura, K. Matsuo, T. Shimizu, and
    S. Nakahara of Hitachi Ltd. for their technical support, discussions,
    and guidance.

    References

    [1] R. K. Montoye et al., "Design of the IBM RISC System/6000
        Floating-Point Execution Unit," IBM J. Res. Develop., Vol. 34,
        No. 1, pp. 59-70, January 1990.
    [2] J. Yetter, "A 100-MHz Superscalar PA-RISC CPU/Coprocessor Chip,"
        Digest of Technical Papers, Symp. VLSI Circuits, pp. 12-13, 1992.
    [3] D. W. Dobberpuhl et al., "A 200-MHz 64-b Dual-Issue CMOS
        Microprocessor," IEEE J. Solid-State Circuits, Vol. 27, No. 11,
        pp. 1555-1557, November 1992.
    [4] L. Gwennap, "Digital Leads the Pack with 21164," Microprocessor
        Report, Vol. 8, No. 12, pp. 6-10, September 1994.
    [5] L. Gwennap, "MIPS R10000 Uses Decoupled Architecture,"
        Microprocessor Report, Vol. 8, No. 14, pp. 18-22, October 1994.
    [6] L. Gwennap, "PA-8000 Combines Complexity and Speed,"
        Microprocessor Report, Vol. 8, No. 15, pp. 6-9, November 1994.
    [7] IEEE Standard for Binary Floating-point Arithmetic, ANSI/IEEE
        Standard No. 754, 1988.
    [8] C. S. Wallace, "A Suggestion for a Fast Multiplier," Trans. IEEE
        Electronic Computers, Vol. EC-13, pp. 14-17, February 1964.
    [9] F. Murabayashi et al., "2.5V NOVEL CMOS CIRCUIT TECHNIQUES FOR A
        150MHz SUPERSCALAR RISC PROCESSOR," to be published in
        ESSCIRC'95, September 1995.

    Figure 6. Floating-point unit block diagram

    [Diagram labels: Register File, ALU, Multiplier, Div/Sqrt]

    Figure 7. Floating-point unit microphotograph

    Table 2. Floating-point latency and throughput (double precision)

    Operation      Latency (cycle/ns)    Throughput (cycle/ns)
    ALU            2/13.3                1/6.7
    Multiply       2/13.3                1/6.7
    Divide         18/120.0              17/113.3
    Square root    31/206.7              30/200.0
