

GRIFFITH UNIVERSITY

Design of Double Precision IEEE-754 Floating-Point Units

Extending the Open Source LTH FPU for the Gaisler Leon2 SPARC8 Microprocessor

DISSERTATION

SUBMITTED BY

MICHAEL KENNEDY B.I.T.

SUPERVISOR

PROFESSOR PETER LISNER

SCHOOL OF MICROELECTRONICS

SUBMITTED IN FULFILLMENT OF THE REQUIREMENTS OF THE DEGREE OF MASTER OF ENGINEERING

IN VLSI

MARCH 2007


©2007 Michael Kennedy

All Rights Reserved


This work has not previously been submitted for a degree or diploma in any university. To the best of my

knowledge and belief, the thesis contains no material previously published or written by another person

except where due reference is made in the thesis itself.

Michael Kennedy


Table of Contents

LIST OF FIGURES
ABBREVIATIONS AND SYMBOLS
ACKNOWLEDGEMENTS
ABSTRACT
1 INTRODUCTION
2 LITERATURE REVIEW
2.1 INTRODUCTION
2.2 FLOATING POINT NUMBERS
2.3 THE LTH FPU
2.4 ARITHMETIC OPERATIONS
2.5 DESIGN
2.6 VERIFICATION
2.7 PERFORMANCE
2.8 COMPARISON
2.9 CONCLUSION
3 DESIGN
3.1 INTRODUCTION
3.2 LTH FLOATING POINT UNIT
3.3 MULTIPLICATION
3.4 SRT DIVISION
3.5 SRT SQUARE ROOT
3.6 CONCLUSION
4 IMPLEMENTATION
4.1 INTRODUCTION
4.2 HDL MODELING
4.3 SYNTHESIS
4.4 CONCLUSION
5 RESULTS
5.1 INTRODUCTION
5.2 TIMING AND AREA ANALYSIS
5.3 IC LAYOUT
5.4 PERFORMANCE
5.5 CONCLUSION
6 CONCLUSION
6.1 INTRODUCTION
6.2 ACHIEVEMENTS
6.3 FUTURE DIRECTIONS
6.4 CONCLUSION
BIBLIOGRAPHY
APPENDIX A: ARITHMETIC BY EXAMPLE
INTRODUCTION
RADIX-2 UNSIGNED MULTIPLICATION
RADIX-4 MULTIPLICATION USING MODIFIED BOOTH ENCODING
SRT RADIX-2 DIVISION
SRT RADIX-2 SQUARE ROOT
RADIX-2 ON-THE-FLY CONVERSION
APPENDIX B: COUNTER STUDY
INTRODUCTION
TIMING ANALYSIS
AREA ANALYSIS
CONCLUSION
APPENDIX C: HDL VERIFICATION
INTRODUCTION
MULTIPLICATION
DIVISION
SQUARE ROOT
APPENDIX D: CODE LISTING
INTRODUCTION
FILE LOCATIONS AND NAMING CONVENTIONS


List of Figures

Figure 3.1 LTH Overview
Figure 3.2 Extending the LTH
Figure 3.3 Floating Point Multiplication Operation Properties
Figure 3.4 Modified Booth-2 Encoding
Figure 3.5 Partial Product Sign Extension
Figure 3.6 Partial Product Counter Over Sizing
Figure 3.7 CSA Counter Configuration
Figure 3.8 106-Bit CSEA using 16-bit CLA Blocks
Figure 3.9 Floating Point Division Operation Properties
Figure 3.10 SRT Radix-2 Division Equations
Figure 3.11 SRT Radix-2 Division Overview
Figure 3.12 SRT Radix-2 Division Circuit
Figure 3.13 SRT Radix-2 Quotient Selection
Figure 3.14 Radix-2 On-The-Fly Conversion Equations
Figure 3.15 Radix-2 On-The-Fly Conversion for Q
Figure 3.16 Radix-2 On-The-Fly Conversion for QM
Figure 3.17 Sticky Bit Formula for a Redundant Residual
Figure 3.18 SRT Radix-4 Division using Layered Radix-2 Stages
Figure 3.19 Floating Point Square Root Operation Properties
Figure 3.20 SRT Radix-2 Square Root Equations
Figure 3.21 SRT Radix-2 Square Root Overview
Figure 3.22 SRT Radix-2 Square Root Circuit
Figure 3.23 SRT Radix-2 Square Root Functional Divisor Equations
Figure 3.24 SRT Radix-2 Division and Square Root Equations
Figure 3.25 SRT Radix-2 Combined Division and Square Root Circuit
Figure 4.1 Multiplier Pipeline
Figure 4.2 SRT Radix-4 Division and Square Root Pipeline
Figure 4.3 Leonardo Spectrum Settings
Figure 4.4 Preparing a Netlist for IC Layout
Figure 5.1 Original LTH Timing Analysis
Figure 5.2 Original LTH Area Analysis
Figure 5.3 Multiplier Timing Analysis
Figure 5.4 Multiplier Area Analysis
Figure 5.5 Combined Division and Square Root Timing Analysis
Figure 5.6 Combined Division and Square Root Area Analysis
Figure 5.7 Exponent Component Timing Analysis
Figure 5.8 Exponent Component Area Analysis
Figure 5.9 Multiplier IC Layout
Figure 5.10 Combined Division and Square Root IC Layout
Figure 5.11 Pre-Layout Clock Frequency
Figure 5.12 Processor Latency and Frequency Comparison
Figure A.1 Unsigned Radix-2 Multiplication Example
Figure A.2 Booth-2 Multiplication Example
Figure A.3 Booth-2 Multiplication Sign Extension Dot Diagram
Figure A.4 SRT Radix-2 Division Example
Figure A.5 SRT Radix-2 Square Root Example
Figure A.6 On-The-Fly Conversion Example
Figure B.1 Std. Cell v. Synthesized Counter Timing Analysis
Figure B.2 Std. Cell v. Synthesized Counter Area Analysis
Figure C.1 FPU Multiplication Test Bench
Figure C.2 FPU Division Test Bench
Figure C.3 FPU SQRT Test Bench
Figure C.4 FPU Square Root Result Verification
Figure D.1 File Location Map
Figure D.2 File Naming Conventions
Figure D.3 HDL Synthesis and Simulation Highlights


Abbreviations and Symbols

ADK Academic Design Kit

CLA Carry Look-Ahead Adder

CPA Carry Propagate Adder

CSA Carry Save Adder or Full Adder

CSEA Carry Select Adder

D Divisor

DIV Division (may refer to combined division and square root unit)

FA Full Adder

F(i) Square root functional divisor (F(i) and –F(i), which are for positive and negative products)

FPU Floating Point Unit

G Guard Bit

I Iteration

J Bit Position

LTH Open Source FPU distributed by Gaisler Research (www.gaisler.com)

MUX Multiplexer

OTFC On-The-Fly Conversion, used for quotient generation

PLA Programmable Logic Array

PPi Partial Product

PR(s,c) Partial Remainder in carry save form

Q Quotient

QM Negative quotient factor (Qi − 2^(−i))

qi+1 QSLC result for current iteration

QSLC Quotient Selection Logic Circuit

r Radix

R Round bit

SQRT Square Root

Syn Synthesis or synthesized

T Sticky bit

ULP Unit in Least Significant Position


Acknowledgements

I would like to thank my supervisor, Professor Peter Lisner, for his patience and support

My parents, Jennifer and Michael Kennedy, for making this all possible

Professor Nicholas Bellamy, for editorial advice

David Clarke, for research assistance

Mal Gillespie, for software support


Abstract

This thesis explores the design and implementation of a synchronous pipelined Floating Point Unit (FPU) for the AMI 0.5 micron process. Specifically, it focuses on designing a multiplier, a divider, and a square root unit for the LTH. The LTH is an open source FPU provided by Gaisler Research for their Leon2 SPARC8-compliant processor; it is a partial implementation of an arithmetic unit for working with IEEE-754 single and double precision floating point numbers. The new functions are modeled in the Verilog hardware description language (HDL) and implemented using the Mentor Graphics ADK design flow. The goal is to produce a design that implements the essential operations of the floating point standard and is capable of running at or above 100 MHz.


1 Introduction

This thesis presents an overview of designing a pipelined synchronous IEEE-754 double precision

floating point unit for the AMI 0.5 micron process, using the Mentor Graphics ADK 2.5 design flow. The

first section provides the relevant background information for working with floating point numbers and

the base implementation. The second section illustrates the choice of arithmetic design to provide the

missing functionality. The final two sections provide details of the implementation and the results

achieved.

Floating point numbers are used to represent very small to very large values with varying levels of precision. The normal set of mathematical operations, such as addition, subtraction, multiplication, and division, is defined for them, but each is slightly more complex than its integer counterpart. Floating point operations performed in software, where only an integer adder is available, may take several hundred or even thousands of cycles. Speeding up these operations is important because floating point numbers are used in a wide range of applications, including CAD, games, graphics, multimedia, and scientific computing. Hence, the main objective in designing an FPU is to reduce the number of cycles needed for a floating point operation from the large count taken in software down to the much smaller number achievable in hardware.

The first section, literature review, provides the background for this thesis. It discusses the IEEE-754

floating point number standard and the existing implementation provided with the Gaisler Leon2 SPARC8

compatible processor. It also provides some details of how the various implementations were verified and of the performance they achieved.

The second section, design, provides an overview of the arithmetic algorithms and architecture employed

to extend the LTH FPU. It illustrates, from a high-level perspective, how the arithmetic algorithms are implemented as logic blocks that fit together to provide the required functionality.

In the third section, implementation, the design process is discussed with reference to the Mentor

Graphics ADK 2.5 design flow. It covers the basics, from modeling the FPU in Verilog HDL with standard cells through to layout with IC Station. While not all elements used in the FPU are fully detailed, a general outline of their construction is provided, with a selection of examples from each stage of the development cycle. Also covered are issues with the tool flow discovered during the implementation process.

The fourth section, results, discusses the developed FPU implementation. It offers comments on testing and verifying the design, together with performance measurements, and covers verification of both the HDL design and the completed IC layouts, for logic as well as electrical rule checks.


The final section, conclusion, outlines the success of the developed implementation. It details known limitations of the design and future improvements to it. Performance measurements and comparisons with industry implementations are also discussed. In brief, this section summarizes the design and results, and indicates where future modifications should be aimed to increase performance.


2 Literature Review

2.1 Introduction

This section provides an overview of the background literature required for understanding the floating point standard, its operations, and the design of floating point units. In particular, it focuses on multiplication, division, and square root operations as they relate to extending the LTH into a more complete implementation for processing IEEE-754 floating point numbers on the Leon2 microprocessor.

2.2 Floating Point Numbers

Floating point numbers [8] [11] are essentially used to describe very small to very large values with a varying level of precision. They comprise three fields: a sign, a fraction, and an exponent. Floating point numbers are defined for several levels of precision, which include single, double, quad, and several extended forms; the IEEE-754 standard defines these formats and the size of the fields that comprise them. Adding to their complexity, several specific bit patterns represent special values, which include zero, infinity, and Not-a-Number (NaN). Additionally, floating point numbers may be in a normalized or denormalized state. The IEEE-754 standard further defines several rounding schemes: round toward zero, round toward positive infinity, round toward negative infinity, and round to nearest [8] [9] [11] [14].

For the design of this FPU, only double precision normalized floating point numbers are of interest. This is because denormalized numbers are not required for an FPU to be considered compatible with the IEEE-754 standard, and the LTH converts all floating point numbers to double precision for internal use. Also, all special bit patterns result in special conditions, such as divide by zero, which immediately raises an exception and requires no computation. This essentially means the floating point numbers used will have a 1-bit sign field, a 53-bit fraction field, and an 11-bit exponent field. The format of the fraction and exponent fields is also constrained, with the MSB of the fraction always being one and the exponent never being all ones or all zeros.
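As a concrete illustration of the field layout described above, the following sketch splits a double into its three fields. The bit positions come from the IEEE-754 standard; the function name and return convention are just for this illustration:

```python
import struct

def decode_double(x):
    """Split an IEEE-754 double into (sign, biased exponent, fraction).

    The hidden leading 1 is prepended to the 52 stored fraction bits,
    giving the 53-bit fraction field the text above works with.
    """
    bits = struct.unpack('>Q', struct.pack('>d', x))[0]
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF          # 11-bit biased exponent
    fraction = bits & ((1 << 52) - 1)        # 52 stored fraction bits
    if 0 < exponent < 0x7FF:                 # normalized: prepend hidden 1
        fraction |= 1 << 52
    return sign, exponent, fraction

# 1.0 has sign 0, biased exponent 1023, and fraction 1.000...0 (hidden bit set)
print(decode_double(1.0))   # (0, 1023, 4503599627370496)
```

Note that the hidden MSB is restored only for normalized values, matching the constraint that the exponent is never all ones or all zeros for the numbers of interest here.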

2.3 The LTH FPU

2.3.1 The Leon2

The Leon2 [1] is a pipelined, software-configurable, synthesizable SPARC8 [4] processor designed by Gaisler Research for the European Space Agency and distributed under the GPL open source license [5]. It does not include a floating point unit internally, but it defines an interface for using one as a co-processor [2]. Two methods of interfacing with a floating point unit are provided: a direct and an indirect approach. The main difference is that when a floating point operation is issued using the direct approach, the CPU stalls until the FPU completes that operation. The advantage is that an operation queue is not needed, reducing the complexity of any implementation. While the Leon2 does not explicitly contain an FPU, Gaisler does distribute it with several options [3] [6], including a partial implementation called the LTH [3], made available under the same GPL license.


2.3.2 The LTH

The LTH [3] is a partial implementation of a floating point unit. In its original form it supports addition, subtraction, and comparison operations on both single and double precision floating point numbers. While the IEEE-754 standard [11] does not require all floating point operations to be implemented in hardware, it does require several rounding schemes, and most implementations normally include multiplier, divider, and square root units [7] [14]. Further, as the LTH targets the SPARC8 ISA, it should be able to handle single and double precision floating point operations. Quad precision instructions, although listed in the SPARC8 standard, are not often implemented in industry designs [29] [30]. Incorporating the new instructions into the LTH also requires limited changes in the Leon2, to identify the floating point operations implemented in hardware.

2.4 Arithmetic Operations

2.4.1 Floating Point Arithmetic

Floating point units usually consist of two main data paths, computing the fraction and exponent fields of a result [8]. A floating point operation can broadly be decomposed into decoding, normalization, exception checking, fraction and exponent operations, and rounding, where the operations on the fractions of the operands take the most computation [8] [11]. It should be noted that the floating point standard requires rounded results as if they had been computed to infinite precision [11]. To accomplish this, all operations are generally computed to several extra bits of precision (guard, round, and sticky bits). Some designs can produce this extra information for some operations in parallel with their normal execution [16] [25].
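The guard/round/sticky scheme can be sketched as follows. This is a hypothetical illustration of round-to-nearest-even applied to an integer result, not code from the LTH, and it assumes at least two bits are discarded:

```python
def round_nearest_even(full, full_width, keep):
    """Round a full_width-bit integer result down to `keep` bits using the
    guard (G), round (R), and sticky (T) bits, round-to-nearest-even.
    Assumes full_width - keep >= 2.
    """
    drop = full_width - keep                 # number of discarded bits
    kept = full >> drop
    guard = (full >> (drop - 1)) & 1         # first discarded bit (G)
    rnd = (full >> (drop - 2)) & 1           # second discarded bit (R)
    sticky = 1 if full & ((1 << (drop - 2)) - 1) else 0  # OR of the rest (T)
    # Round up when G = 1 and any of R, T, or the kept LSB is set;
    # the kept-LSB term is what breaks exact ties toward an even result.
    if guard and (rnd or sticky or (kept & 1)):
        kept += 1
    return kept
```

For example, truncating the 8-bit value 0b10011000 to 4 bits sits exactly halfway between 9 and 10, and the tie resolves to the even value 10.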

2.4.2 Multiplication

Multiplication of two normalized floating point numbers is performed by multiplying the fractional components, adding the exponents, and taking the exclusive OR of the sign fields of the two operands [9] [11] [14]. The most complicated part is the integer-like multiplication of the fraction fields (Figure A.2). This can be done by using a full tree [15], Wallace [20] or Dadda tree [12], staged [17], or serial array multiplier architecture [8] to sum the partial products.
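The decomposition just described can be written out in a few lines. This hedged sketch operates on (sign, biased exponent, 53-bit fraction) triples and truncates rather than rounds the 106-bit product; the function name is illustrative only:

```python
def fp_mul_fields(s_a, e_a, f_a, s_b, e_b, f_b):
    """Multiply two normalized doubles given as (sign, biased exponent,
    53-bit fraction with hidden bit): XOR the signs, add the exponents
    (subtracting the bias once), and integer-multiply the fractions into
    a 106-bit product. Rounding of the result is omitted here.
    """
    BIAS = 1023
    sign = s_a ^ s_b
    exponent = e_a + e_b - BIAS
    product = f_a * f_b               # 106-bit result, MSB at bit 104 or 105
    if product >> 105:                # MSB landed at bit 105:
        product >>= 1                 # normalize right by one place
        exponent += 1
    return sign, exponent, product >> 52   # keep 53 result bits (truncated)
```

For instance, multiplying the field triples of 1.5 and 2.0 yields the triple of 3.0, with no normalization shift needed; multiplying 1.5 by itself exercises the shift.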

Essentially, the multiplication is done in two steps: partial product generation and partial product addition. For double precision operands (53-bit fraction fields), a total of 53 53-bit partial products is generated [16]. To speed up this process, the two obvious approaches are to generate fewer partial products and to sum them faster [8].

In its simplest form, producing the radix-2 unsigned partial products is accomplished by a logical AND operation [8] on the operands. To produce fewer partial products, the usual method is some form of modified Booth encoding [15] [36]. This allows computation at a higher radix, with partial products produced in various forms. The choice of radix [14] determines how many partial products are generated, and the style of encoding determines their form. Modified Booth-2 radix-4 encoding for double precision operands produces 27 partial products, and Booth-3 radix-8 produces 18. With an increased radix, however, comes increased complexity and normally increased delay [8] [9]. The form of the produced partial products can be as simple as two's complement, or a more redundant number system [18] that allows for a faster summation of the products.
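The radix-4 recoding mentioned above can be demonstrated directly. This sketch (function name hypothetical) recodes an unsigned multiplier into signed digits, and for a 53-bit fraction it yields exactly the 27 digits, hence 27 partial products, quoted above:

```python
def booth2_digits(y, n):
    """Recode an n-bit unsigned multiplier y into radix-4 modified Booth-2
    digits in {-2, -1, 0, 1, 2}, least significant digit first. Digit i is
    y[2i] + y[2i-1] - 2*y[2i+1], taken from overlapping 3-bit groups with
    an implicit y[-1] = 0.
    """
    y2 = y << 1                        # make room for the implicit y[-1] = 0
    digits = []
    i = 0
    while 2 * i < n + 1:               # enough digits to cover all n bits
        g = (y2 >> (2 * i)) & 0b111    # overlapping group y[2i+1..2i-1]
        digits.append((g & 1) + ((g >> 1) & 1) - 2 * ((g >> 2) & 1))
        i += 1
    return digits
```

Each digit selects 0, ±y, or ±2y as a partial product, and the digits reconstruct the multiplier as sum(d * 4**i), which is easy to check for small values.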

After the partial products have been produced, they must all be added together. The fastest method is a carry-free addition that sums the products in parallel using a CSA tree [8] [15]. The CSA tree can be implemented in many forms, the main variations being wiring regularity [8] [14] [20], order of addition [8], and counter or compressor sizing [15] [16]. In the case of Booth-2 radix-4 encoding with a full binary CSA tree, the benefit of producing fewer partial products is not a major speed increase, because only a one-full-adder decrease in the critical path is realized; it does, however, provide a significant area saving. Further, while it is possible to generate still fewer partial products with higher levels of Booth encoding, the time and area penalties increase with higher radix encodings and can outweigh any improvements offered by a smaller CSA tree [8]. Care must also be taken with the layout of the counter tree [16] [20], as its natural shape is irregular and area intensive [19]. The CSA tree from Samsung [15] attempts to eliminate much of the wiring irregularity by defining the counter tree in slices ranging from 28:2 down to 4:2 counters built from seven levels of full adder cells. Alternately, the Parallel Structured Full Array Multiplier [17], built as a Wallace tree from four levels of 4:2 counters, offers increased regularity. Assuming that the 4:2 counters are internally two full adder cells, the Samsung design has one less level in its CSA tree, while offering a less regular wiring structure than the Wallace tree design.
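The carry-free reduction idea is easy to demonstrate on integers. This sketch uses 3:2 counters (rows of full adders) and a single final carry-propagate addition; it makes no attempt to model the wiring or level count of any particular tree discussed above:

```python
def csa(a, b, c):
    """A 3:2 counter (row of full adders): three operands in, a sum word
    and a carry word out, with no carry propagation between bit positions.
    """
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1

def csa_reduce(operands):
    """Compress a list of partial products with 3:2 counters until only
    two words remain, then perform the single carry-propagate addition.
    """
    ops = list(operands)
    while len(ops) > 2:
        s, c = csa(ops.pop(), ops.pop(), ops.pop())
        ops += [s, c]
    return sum(ops)   # the final CPA
```

The key property is that every 3:2 step takes constant time regardless of word width; only the last two-operand addition pays the carry-propagation delay.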

The last stage in the multiplication is a carry propagate addition using a wide 106-bit adder for double precision numbers. Adders, while performing a simple operation, remain an area of active interest, and many implementations exist for the final carry propagate addition (CPA) [8]. Among these are Carry-Look-Ahead (CLA) [14], Carry-Select (CSEA) [11], Ling [14], and hybrid adder architectures comprised of elements from several different design approaches [17]. Additionally, there are adders that operate on number systems with different levels of redundancy. Some partial products may even be produced in a highly redundant form, making a full CPA unnecessary [18]; a slightly faster and less area intensive conversion circuit instead converts the redundant result into standard binary notation. However, many of the designs in the literature become less attractive when implemented using standard cells, such as the 4ns 32-bit CMOS Ling Adder [14]. This presents its own problem: the selected adder must be feasible using standard cells and able to perform the full addition in under 10ns, which makes the XOR-based design offered by Samsung [15] more attractive.

Optimizing a multiplier usually means constructing the CSA tree into a regular layout instead of the irregular trapezoid that results from the shifted partial products [20]. The goal is to decrease the total area used by the multiplier and to make the wiring delay between the adders more uniform. Other work has used more redundant number systems, which allow higher levels of modified Booth encoding without incurring the time penalties that normally result from forming the hard multiples (such as 3x) [18]. Generally, however, the most common implementations revolve around radix-4 Booth-2 encoding, some form of CSA tree, and a wide hybrid adder (at least 106 bits).


2.4.3 Division

Division of two normalized floating point numbers is performed by subtracting the exponents, performing an integer-style division on the fractions (Figure A.4), and taking the exclusive OR of the sign fields of the two operands [11] [14]. As with multiplication, the operation on the fraction fields is the most time consuming part. There are two main ways of performing the integer division of the fraction fields: a subtractive recursive approach [11] and a multiplicative approximation [8]. Each presents its own benefits [21] [22] in terms of area, performance, and complexity. The subtractive approach provides a simpler implementation but produces a constant number of bits per iteration. Alternately, the multiplicative approach produces an increasing number of bits per iteration and as a result requires fewer iterations, though each takes several cycles and needs a modified multiplier. The subtractive approach is the more popular architecture because it can be implemented independently and does not need complicated modifications to the multiplier [28].

The subtractive approach, performed by repeated addition, subtraction, or shifting operations, produces a constant number of result bits per iteration (1 bit per iteration for radix-2, 2 bits for radix-4). This implies that for a 53-bit result, at least 53 iterations are needed. Two ways to increase performance are to increase the number of result bits generated during an iteration, or to perform more iterations per cycle [8]. This can be done either through a higher radix, more complex iteration, or by layering several simple lower radix iterations together [23] [26]. Another way to reduce the delay is to keep the partial remainder in a redundant form, such as carry-save [25], which removes the need for a CPA in each iteration. The critical path then becomes the quotient selection logic (QSL), which determines whether a subtraction, addition, or shift operation takes place during an iteration.
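A minimal radix-2 SRT iteration can be sketched as follows. This is an illustrative model only: it keeps the partial remainder exactly in floats rather than in carry-save form, and uses the classic ±1/2 selection thresholds, so the digit choice here stands in for the QSL:

```python
def srt2_divide(x, d, n):
    """Radix-2 SRT division for fractions x, d in [1, 2), producing n
    signed quotient digits in {-1, 0, 1}. The digit is selected only from
    the shifted partial remainder w, which stays bounded by d throughout.
    """
    w = x / 2                 # scale the dividend so that |w| <= d
    q, weight = 0.0, 1.0
    for _ in range(n):
        w *= 2                # shift the partial remainder left
        if w >= 0.5:
            digit = 1         # subtract the divisor
        elif w < -0.5:
            digit = -1        # add the divisor back
        else:
            digit = 0         # shift only
        w -= digit * d
        q += digit * weight
        weight /= 2
    return q                  # approximates x / d to about n bits
```

The redundancy of the digit set {-1, 0, 1} is what allows the digit to be chosen from a rough estimate of w; in hardware only a few most significant bits of the carry-save remainder feed the selection logic.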

The quotient selection logic can be implemented as purely combinational logic or as a look-up table (LUT), referred to in SRT as a Quotient Selection Table (QST). For the latter approach, several tools are available to generate the QSTs automatically [8] [12] [13]. With a higher radix comes a more complex QSL. Layering and overlapping lower radix iterations allows computation at a higher radix without the complexity [23]. A design by Sun Microsystems [25] demonstrates the layering of low radix iterations and also provides a more optimized SRT table than some other designs [24]. This optimization has the potential to make unnecessary the final restoring iteration for the remainder normally required by many SRT implementations. When used in conjunction with high radix cycles, this single cycle saving provides a relatively large performance increase without any extra delay in the critical path.

2.4.4 Square Root

The square root operation is very similar to the division operation, where both multiplicative

approximation and subtractive recursive approaches are available [22] (Figure A.4). The reasons for

choosing an appropriate architecture are very similar to division [21], where either the multiplier can be

modified or an extra subtractive unit can be implemented. In the case of the latter choice, the difference

between the subtractive division and square root units is essentially very small, and an SRT divider can

be modified to encompass square root operations [25] with only a few changes.

Page 18: Design of Double Precision IEEE-754 Floating-Point Units

18

Essentially the biggest difference between the SRT division and square root operations is that in the SRT square root operation the divisor is adjusted during every iteration [11] [13], whereas in SRT division it remains unchanged. Another difference between the two operations is in the handling of exponents and base exceptions. For the most part, the SRT division algorithm can be extended to implement the square root operation [37], allowing the advantages of the more complex division unit designs to be carried over to the square root operation in a combined division square root unit.

2.5 Design

The design of the original LTH [3] is a single pipeline that has several pipeline stages operating on double

precision floating point numbers. Data are inserted into the pipeline in several stages including decode,

operands and rounding. The missing functionality is normally implemented in parallel independent

pipelines because of the marked difference in the requirements of their respective algorithms. They still

normally share some common stages, which may include decoding, rounding, and normalization. This

implies that the new functionality can be incorporated by adding the new functional blocks in parallel to

the existing addition/subtraction unit of the LTH. The new blocks will also be required to operate on a

single pair of double precision numbers. However, other industry designs offer a wider variety of word

sizes for computation [19] [30] [31].

The original design uses a pipelined architecture, but is not truly pipelined. While it is divided into stages,

it cannot accept new operations every cycle. This is an intended feature of the LTH because of its

interface with the Leon2 [2], which means with the inclusion of the new functional units, at most only one

stage in a single pipeline will be in operation at any time. While it may be possible to extend the modified

LTH to a more modern out-of-order superscalar design [7] [32], neither the SPARC8 standard [4] nor the

co-processor interface of the Leon2 requires such complexities, and therefore the need for executing

floating point instructions in parallel and out-of-order is beyond the scope of this project.

2.6 Verification

A floating point unit is a set of complex algorithms implemented in an integrated architecture [33] [34].

Demonstrating a satisfactory level of compliance with the IEEE-754 standard and implementation

reliability can be grouped into two separate concepts, design and implementation testing [36]. The first is a logic proof of the well-defined algorithm translated into Verilog HDL. These are tested by running a limited number of test vectors against the developed HDL modules. The second, implementation testing, is usually accomplished by generating test vectors to cover a specific set of test cases, and possibly a percentage of total input coverage, against a physical model. This testing can be carried out in a mechanized form on a taped-out implementation. The importance of testing both the algorithm and the implementation can be seen in the problems encountered with the Intel Pentium division SRT table [35], where the algorithm was sound but the implementation contained an incomplete data set.


2.7 Performance

Performance of a floating point unit is commonly expressed in terms of its latency and throughput.

However the real world performance of an FPU may be more application specific, and is normally

measured using benchmarks that comprise tests from common applications [29]. While this does allow

for a comparison of different architectures based on execution cycles alone, it does not permit a

meaningful overall comparison of designs since area, power, and technology are not considered. FUPA, a Floating-point Unit Performance Analysis metric discussed by Flynn and Oberman [14], is a more complete metric for comparing different implementations as it incorporates latency, area, scaling,

power, minimum feature size, and application profiling.

2.8 Comparison

During this literature review several different approaches and designs were researched to determine the

optimal architecture for extending the LTH. The extended overall design is well defined by the original

LTH, and the marked differences between the functional requirements of the different operational units.

The largest issues concern the additional units, which can be implemented using a variety of

different methods, but are limited to those designs that use standard cells [12] [13] and can achieve a

10ns cycle time.

The design of a multiplier is generally accepted to be a three-stage process: modified Booth encoding, a CSA tree, and a final CPA. The approach by Cho et al. [15] offers both a complete modular description of Booth-2 encoders and a CSA tree constructed from full adders. The other main advantage is that this

design avoids some of the layout issues with counter trees [8] as the blocks are clearly defined. Other

implementations utilizing a more redundant number system [18], while interesting, offer a perhaps faster solution but are vastly more complex and at times incomplete. Flynn [14] also suggests that, for both area and latency, a modified Booth-2 encoder with a full binary adder tree is an optimal solution. The final

stage of multiplication, the 106-bit CPA, presents the only real choice. While many designs exist, only fast adders that can be implemented using standard cells are relevant. The hybrid CSEA using 16-bit CLAs offered by Mori [17] provides a suitable delay and a modular, well documented design optimal for layout, especially in comparison to a fully synthesized random-logic wide adder.

There is a larger difference in the design of division and square root units. Essentially the subtractive SRT

approaches are favored because they do not require any modification to the multiplier and can be

implemented independently. The next choice then becomes that of radix, where more bits per iteration

incurs more complexity, or more iterations per cycle increases the area. The implementation offered by

SUN Microsystems and others utilizing several lower radix iterations to form a higher radix iteration, is

clearly the best approach [25]. While the SUN design is not noticeably different from other layered

overlapped radix-2 designs, it does provide a significantly improved QSLC, utilizing a fifth bit.

Essentially the methods used for testing the final product are not available for the extended LTH. This is

because the design approach of testing on an FPGA is not possible given the direct instantiation of ADK

standard cells in the HDL. Secondly, since there is no planned fabrication, the process of mechanical


testing of a physical processor is not possible, and if it were the approximate two month testing period

would be restrictive [36]. Testing is therefore limited to HDL testing and IC layout verification.

Finally, in terms of performance, the least favored methods of comparison, latency and throughput, will be the only metrics available for comparing the extended LTH against other floating point units. This is because without a fabricated design it is impossible to verify the operational frequency or

power requirements. This makes running any application specific benchmarks [29] and generating any

FUPA metric [11] impossible. The other reason that running benchmarks is not possible is that while the

extended LTH may conform to the original LTH-Leon2 interface [2], the Leon2 needs some modification

to operate correctly with the new design, which is beyond the scope of this work.

2.9 Conclusion

Extending the LTH to perform division, multiplication and square root operations for single and double

precision floating point numbers can be accomplished by adding parallel pipelines to the

addition/subtraction pipeline. The best approach for designing a multiplier relies on the Samsung design

for the modified Booth encoding and CSA tree. However there are many alternate approaches to the final

wide fast adder, which is confined to those that are implementable using standard cells.

The optimal architecture of a joint division and square root unit is a layered subtractive radix-2 SRT style

approach, benefiting from quotient selection logic outlined by Sun. This design is appropriate because of

its independence from the multiplier, and its ability to be layered to perform higher radix computations while meeting timing requirements.

In terms of implementation, the design decisions are fairly well dictated. To achieve a cycle time

comparable to that of industry designs using a similar process technology, optimizing for delay at the

expense of area is necessary, as a result of being confined to a standard cell library. The concept of using

standard cells for implementation also means a host of other designs are not feasible, including custom

transistor 4:2 counters and pseudo CMOS Ling adders. Further, since the design is intended to be

converted into an IC layout, a modular design is also beneficial. This can be seen in the selection of a CPA

for the wide adder, where a 106-bit CSEA made of 16-bit CLA blocks offers a more regular layout than the

random logic offered by a completely synthesized approach.


3 Design

3.1 Introduction

This section covers the basic details of the computation of the fractional component of double precision

floating point numbers. It illustrates the underlying elements for the multiplication, division and square

root designs selected. It also provides some details of how this new functionality can be incorporated into

the existing LTH architecture.

3.2 LTH Floating Point Unit

3.2.1 Introduction

The LTH is an open source FPU distributed with the Leon2 [3]. It is a partial implementation of a floating

point unit that uses the double precision format for all calculations. This section presents the original

datapath organization and the required changes for integrating missing functionality.

3.2.2 Original Design

The original LTH [3] is divided into five stages (Figure 3.1), “Decode”, “Prenorm”, “Add/Sub”, “Postnorm”,

and “Roundnorm”. The Decode stage unpacks the operands and instructions and checks for special

conditions. Prenorm, the second stage, formats the operands for addition or subtraction, by placing them

in the correct order and performing any necessary shift operations. The third stage Add/Sub, performs

the addition or subtraction operation. The Postnorm stage normalizes the result for the final stage, which

rounds and packages the result ready for output.

Figure 3.1 LTH Overview

This presents several problems, mostly related to the time taken to execute different operations. In the

case when an invalid operation is sent to the LTH, an exception should be generated by the floating point

unit indicating a problem with the instruction or operand [11]. In this design, when the error is known at



the end of the first stage, it takes a further four stages for the exception to be written to the output. The same is true for the comparison operation, which requires fewer than the five cycles. While this may not appear to be a problem, since the Leon2 is stalled while any FPU operation is underway, for an invalid operation the CPU may have to wait four cycles for the FPU to do nothing.

3.2.3 Integrating Additional Functions

Figure 3.2 Extending the LTH

Integrating the additional features, such as division, square root, and multiplication, requires extending

the LTH’s decoder, modifying its rounding stage to accept inputs from multiple datapaths, and

redesigning its busy signal unit. Further, an extra path from the decoding stage to the last stage should be

added, which for a division operation with invalid operands would avoid over 57 stalled CPU cycles. The key here is that the decoder recognizes all valid operations and correct operands for the new features, and can dispatch the operation down the correct datapath. At the end of each datapath, where

a result is available, the already defined rounding module can be utilized to round and package it for

output. This leaves only the design of a multiplier, divider, square root unit, and associated normalization

units required to construct a full FPU (Figure 3.2).



3.3 Multiplication

3.3.1 Introduction

Multiplication Operation Properties

Property | Value
Operation | R = A × B
Operand A Range (Multiplicand) | 1 ≤ A < 2
Operand B Range (Multiplier) | 1 ≤ B < 2
Result R Range | 1 ≤ R < 4 (may require left shift and exponent increment)
Result Sign | Sign A XOR Sign B
Result Exponent | Exponent A + Exponent B − Bias

Figure 3.3 Floating Point Multiplication Operation Properties
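The sign and exponent rules in the table above can be sketched as follows (illustrative Python; BIAS is the IEEE-754 double precision bias of 1023, and the fraction datapath is omitted):

```python
BIAS = 1023  # IEEE-754 double precision exponent bias

def mul_sign_exponent(sign_a, exp_a, sign_b, exp_b):
    """Sign and biased-exponent handling for floating point multiplication,
    per the properties table above (fraction datapath not modelled)."""
    sign = sign_a ^ sign_b           # Sign A XOR Sign B
    exp = exp_a + exp_b - BIAS       # both inputs carry the bias once
    return sign, exp
```

For example, multiplying 2.0 (biased exponent 1024) by −0.5 (biased exponent 1022) gives sign 1 and biased exponent 1023, i.e. −1.0 before any normalization adjustment.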

Multiplication includes four logically separate stages: partial product generation, partial product addition, a final carry propagate addition, and result normalization. To accomplish these tasks, Booth encoding, a full CSA tree, a fast CSEA CPA and a left shifter were designed. This section outlines the fundamentals behind the algorithms used to implement each of the four stages.

3.3.2 Partial Product Generation

The first part of designing a multiplier circuit is to produce the partial products that have to be summed.

Modified Booth-2 radix-4 encoding was selected, as it produces half the number of partial products as an

un-encoded method, and does not require the calculation of any hard multiples as would be needed by a

higher level of encoding.

Booth-2 Products

Xj+1 Xj Xj-1 | Booth Multiple | Dir | Shift | Add
0 0 0 | 0Y | 0 | 0 | 0
0 0 1 | 1Y | 0 | - | 1
0 1 0 | 1Y | 0 | - | 1
0 1 1 | 2Y | 0 | 1 | 0
1 0 0 | -2Y | 1 | 1 | 0
1 0 1 | -1Y | 1 | - | 1
1 1 0 | -1Y | 1 | - | 1
1 1 1 | 0Y | 1 | 0 | 0

Figure 3.4 Modified Booth-2 Encoding

Instead of a simple AND operation on a bit j from operand A with all bits of operand B to produce a single partial product (Figure A.1), Booth encoding uses three bits from operand A to generate a multiple of operand B (Figure A.2). Essentially this requires constructing an encoder for the truth table presented above (Figure 3.4), and a set of connected MUXes to select the correct multiples of operand B. The design from Samsung [15] offers a full description of both the Booth encoders and the MUXes required for implementation. Their encoders produce three select signals, “add”, “dir”, and “shift”, to generate the required Booth multiple.
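The encoding can be sketched behaviourally as follows (illustrative Python; the function and the zero-MSB restriction on the multiplier are assumptions of this sketch, mirroring the always-zero Booth MSB at bit 53 noted in the next section):

```python
def booth2_multiples(a, b, n):
    """Modified Booth-2 (radix-4) partial product selection sketch.

    Scans overlapping 3-bit groups (x_{j+1}, x_j, x_{j-1}) of the n-bit
    multiplier `a` and picks a multiple of `b` from {0, +-B, +-2B},
    following the truth table of Figure 3.4. Returns the selected
    multiples already weighted by their radix-4 position."""
    table = {(0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 2,
             (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0}
    groups = (n + 1) // 2
    assert a < 2 ** (2 * groups - 1)   # top examined bit must be zero
    products = []
    for g in range(groups):
        j = 2 * g
        x_prev = ((a >> (j - 1)) & 1) if j else 0   # x_{j-1}, 0 for group 0
        bits = ((a >> (j + 1)) & 1, (a >> j) & 1, x_prev)
        products.append(table[bits] * b * 4 ** g)
    return products
```

For a 54-bit field with bit 53 forced to zero this yields 27 partial products, matching the count given in Section 3.3.3, and their sum equals a × b.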

3.3.3 Multi-Operand Addition

The modified Booth encoding produces 27 54-bit partial products in 1’s complement form. To convert these to the required 2’s complement, the sign bit can be extracted and added to the shifted product’s LSB. For the multiplication of radix-2 unsigned numbers, the shifted results can simply be added together. However, when signed numbers are concerned, a special sign extension method (Figure 3.5) is generally used to keep the operands short [14], which significantly reduces area. The alternative would be to sign extend each partial product to the maximum width of the operation, which is obviously undesirable.

Figure 3.5 Partial Product Sign Extension

Adding these products together using a carry save adder (CSA) tree can be done through at least seven levels [8] of 3:2 counters (also known as full adders). The design presented by Samsung offers an optimized form of carry propagation (Figure 3.6), where each counter has bit zero of its partial product input port optimized for late arrival (Figure 3.7). This allows counters to receive an input on their partial product input port from a neighbouring counter’s carry out with no added delay.

Figure 3.6 Partial Product Counter Over Sizing



Figure 3.7 CSA Counter Configuration

While the Samsung design is fairly complete, it does present some concerns. It proposes a maximum counter size of 28:2, which suggests that in a column there will be 28 bits to sum. However, the last product generated is never negative, because the MSB of the other operand is always zero (operand B[52:0]; the MSB for Booth encoding is bit 53). As a result, the final partial product will always be a positive multiple, which makes its sign bit zero as well. This limits the maximum height of a column to 27, and the required counter size to 27:2. In terms of implementation, this requires a range of counters to be constructed, from 27:2 down to 4:2, each with port zero optimized for late arrival to handle carry propagation. The other consequence of carry propagation is that some counters need to be oversized to handle the extra carry signals generated by previous counters (Figure 3.6).
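The reduction step performed by each 3:2 counter can be sketched as follows (behavioural Python; hardware applies this bitwise across a column with no carry propagation between positions):

```python
def csa_3to2(a, b, c):
    """One 3:2 counter (full adder) level applied bitwise: three operand
    words are reduced to a sum word and a carry word. No carries ripple;
    the true total is only recovered by the final CPA."""
    sum_word = a ^ b ^ c                             # per-bit sum
    carry_word = ((a & b) | (a & c) | (b & c)) << 1  # per-bit majority, weighted
    return sum_word, carry_word
```

Repeated application of this level is what reduces the 27 partial products down to the two words fed to the 106-bit CPA.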

3.3.4 Fast Carry Propagation Addition

For the final stage, the wide adder, two candidates exist for performing the CPA. The fastest adder, but also the most area intensive, is a carry look-ahead adder (CLA) [14], which is only suitable for at most 16 bits. However, a hybrid approach using a carry select adder (CSEA) comprised of 16-bit CLA blocks provides most of the speed advantage of the CLA without the vastly increased area requirements that would be expected from such a wide adder [17]. The second possibility for performing the wide CPA is a fully synthesized adder [8].

Using a 16-bit CLA with the output signal for bit 16 as a block carry output, it is possible to construct a larger adder. This effectively turns each 16-bit CLA with no carry-out signal into a 15-bit adder cell with both carry-in and carry-out signals defined. A 15-bit CSEA block can be made by using two of the CLA blocks to compute the addition of the same input bits but with different carry-in signals. A 106-bit CPA can then be constructed by chaining several of these 15-bit CLA CSEA blocks together; for the most significant block, where no carry-out signal is required, the 15-bit CSEA block can be used as a full 16-bit adder cell.



Thus, a 106-bit CSEA can be built from 13 16-bit CLAs, with a delay of slightly more than one 16-bit CLA and six 16:8 multiplexers (Figure 3.8).

Figure 3.8 106-Bit CSEA using 16-bit CLA Blocks
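The carry-select principle behind this structure can be sketched behaviourally as follows (illustrative Python; a plain addition stands in for each 16-bit CLA block, and the block width is a parameter of the sketch):

```python
def csea_add(a, b, width=106, block=15):
    """Behavioural sketch of the hybrid carry-select adder: each block
    computes its sum for carry-in 0 and carry-in 1 in parallel, then the
    real incoming carry selects between the two precomputed results."""
    result, carry = 0, 0
    mask = (1 << block) - 1
    for lo in range(0, width, block):
        x, y = (a >> lo) & mask, (b >> lo) & mask
        s0 = x + y                       # candidate result, carry-in = 0
        s1 = x + y + 1                   # candidate result, carry-in = 1
        s = s1 if carry else s0          # MUX on the incoming block carry
        result |= (s & mask) << lo
        carry = s >> block               # block carry-out to the next MUX
    return result & ((1 << width) - 1)
```

The delay advantage comes from the fact that both candidates per block are ready before the carry arrives, so only the MUX chain lies on the critical path.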

3.3.5 Normalization and Sticky Bit Generation

The result of a double precision operand multiplication is 106 bits long, in the range 1 ≤ R < 4. This implies that if the leading bit is a one, the result is above the required range, and to normalize it a left shift is needed. If a left shift is required, the exponent must also be incremented. At this stage, where there are more than the required 53 bits of precision available, a sticky bit “T” must be generated before the extra 50 or 51 bits are discarded. This sticky bit indicates whether the truncated bits are all zero and is required for the next stage, rounding.
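A value-level sketch of this normalization and sticky-bit step follows (illustrative Python; alignment conventions differ between descriptions, so the shift direction here reflects the value interpretation of the 106-bit product rather than the datapath's fixed output field):

```python
def normalize_and_sticky(p, exp):
    """Normalization and sticky-bit sketch for the 106-bit product of two
    53-bit significands (value in [1, 4) after scaling). Keeps 53 result
    bits and ORs all discarded bits into the sticky bit T."""
    if p >> 105:                             # leading bit set: product in [2, 4)
        keep, exp = p >> 53, exp + 1         # take bits [105:53], bump exponent
        dropped = p & ((1 << 53) - 1)
    else:                                    # product in [1, 2): one bit lower
        keep = (p >> 52) & ((1 << 53) - 1)
        dropped = p & ((1 << 52) - 1)
    sticky = int(dropped != 0)               # T = 1 iff any truncated bit is 1
    return keep, exp, sticky
```

The sticky bit then lets the shared rounding stage decide between round-up and round-down without ever seeing the discarded bits themselves.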



3.4 SRT Division

3.4.1 Introduction

SRT Division Operation Properties

Property | Value
Operation | Q = X / D
Operand X Range (Dividend) | 1 ≤ X < 2 and X ≤ D (right shift X, 0.5 ≤ X < 1)
Operand D Range (Divisor) | 1 ≤ D < 2
Result Q Range | 0.5 ≤ Q < 2 (with right shift of X, 0.25 ≤ Q < 1)
Result Sign | Sign X XOR Sign D
Result Exponent | (Exponent X − Exponent D) + Bias

Figure 3.9 Floating Point Division Operation Properties

SRT is a non-restoring residual recurrence approach to division that estimates a constant number of quotient bits per iteration [8] [11]. The estimation is performed by a quotient selection function (QSLC), which determines the type of operation that occurs during an iteration. Further, as each quotient bit is produced, it must be incorporated into the existing quotient. To accommodate the possible negative values, a scheme called “on-the-fly” conversion is presented. While the designs presented are predominantly for a radix-2 SRT division unit, extending the base architecture to perform calculation at a higher radix is also possible.

3.4.2 Radix-2

PR(i+1) = 2·PR(i) − q(i+1)·D, with q(i+1) ∈ {−1, 0, +1}
Q(i+1) = Q(i) + q(i+1)·2^−(i+1)

Where
i = 0, 1, 2, 3, ..., m
Q0 = 0, PR0 = X/2

Figure 3.10 SRT Radix-2 Division Equations

For a radix-2, bit-at-a-time implementation (Figure A.4), the iteration formula can be summarized as in the figure above. In its simplest form, it involves examining the most significant bits of the partial remainder and performing an operation based on the result. Depending on the implementation of the QSLC, the time to determine the correct operation can be quite long. By calculating all three possible results in parallel while the QSLC determines which operation to perform (Figure 3.11), it is possible to reduce the critical path significantly [21].


Figure 3.11 SRT Radix-2 Division Overview

Generating the three different candidates for the new partial remainder when a carry-save form residual is used (Figure 3.13) can be accomplished with one level of full adders. When “qi+1” is zero, a left shift of the partial remainder is sufficient. When “qi+1” is negative one or positive one, the new partial remainder is the result of a left shifted remainder and an addition or subtraction of the divisor respectively. For the subtraction of the divisor, a 2’s complement form can be used, which involves only inverting the divisor for the positive product and adding a one to the carry-in at the LSB.

Figure 3.12 SRT Radix-2 Division Circuit

To obtain the correct 53-bit result, at least 56 iterations are required. These iterations produce 53 bits for the result, two guard bits G, and a round bit R. The first guard bit is used for the case in which 0.25 ≤ Q < 0.5, and the second guard bit corrects for any over-estimation in the last produced bit of the quotient. The round bit R is required for all floating point operations; it



effectively calculates the result to an extra level of precision which can be used in subsequent rounding

stages [8].
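The complete radix-2 recurrence, digit selection and quotient accumulation can be sketched as follows (illustrative Python with an exact residual; the hardware keeps the residual in carry-save form and uses on-the-fly conversion instead of the signed accumulation shown here). Operands are scaled integers with values x/scale and d/scale in [1/2, 1), a normalization chosen for this sketch:

```python
def srt_divide(x, d, n, scale):
    """Radix-2 SRT division sketch with quotient digits in {-1, 0, +1}.

    Returns q_acc such that the quotient value is 2*q_acc / 2**n ~= x/d.
    Digit selection only compares the residual against +-1/2, mirroring
    the QSL's inspection of a few top residual bits."""
    w = x                          # residual scaled by 2*scale, so PR0 = x/2
    q_acc = 0
    for _ in range(n):
        if w >= scale // 2:        # 2*PR >= 1/2  -> digit +1
            q = 1
        elif w <= -scale // 2:     # 2*PR <= -1/2 -> digit -1
            q = -1
        else:                      # shift only   -> digit 0
            q = 0
        w = 2 * w - q * 2 * d      # PR(i+1) = 2*PR(i) - q*D, in scaled form
        q_acc = 2 * q_acc + q      # hardware would use on-the-fly Q/QM instead
    return q_acc
```

For example, srt_divide(192, 192, 8, 256) returns 128, i.e. a quotient value of 2·128/256 = 1, with the digit 0 selected on every iteration after the first.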

3.4.3 Quotient Selection

Quotient Selection

CPA Result | QSL Product
000.0 and 5th MSB of PRs and PRc = 0 | 0
0XX.X | +1
111.1 | 0
1XX.X | -1

Figure 3.13 SRT Radix-2 Quotient Selection

Quotient selection is the process of estimating the next bit of the final quotient. It does this by sampling several of the most significant bits of the partial remainder and estimating the next quotient bit “qi+1” (Figure 3.13). Each estimate is either correct or an over-estimation of the next quotient bit, which may require correction in a later iteration; this is why the negative quotient bit is required. Further, since only several of the most significant bits are sampled, the partial remainder does not need to be fully calculated, and a redundant form such as carry-save is suitable [11] [24]. However, the traditional QSL does not function as expected for an exact result, such as 25/5. An improved version proposed by SUN Microsystems [25] corrects this behaviour by modifying the case where the first four most significant bits of the partial remainder sum to zero. In terms of implementation, this requires a fast adder to sum the four most significant bits of the residual and some logic to determine which case applies. Alternatively, by examining each case, a ROM containing slightly more than 256 entries can be constructed [35].
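A behavioural sketch of this selection function follows (illustrative Python; the width parameter and the bit-slicing helper are assumptions of the sketch):

```python
def qsl(prs, prc, width):
    """Quotient selection sketch following Figure 3.13.

    A short CPA adds the four most significant bits of the carry-save
    residual pair (prs, prc); the fifth-bit check implements the
    exact-result correction attributed to SUN [25]. Returns -1, 0 or +1."""
    top = lambda v, k: (v >> (width - k)) & ((1 << k) - 1)
    s4 = (top(prs, 4) + top(prc, 4)) & 0b1111    # 4-bit add, carry-out dropped
    if s4 == 0b0000:
        # 000.0: digit 0 only if the fifth MSBs of PRs and PRc are also zero
        if (top(prs, 5) & 1) == 0 and (top(prc, 5) & 1) == 0:
            return 0
        return 1                                 # otherwise treated as 0XX.X
    if s4 == 0b1111:                             # 111.1: small negative -> 0
        return 0
    return 1 if (s4 >> 3) == 0 else -1           # 0XX.X -> +1, 1XX.X -> -1
```

The ROM alternative mentioned above would simply tabulate this function over the sampled bits.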

3.4.4 Partial Quotient Generation

Every iteration of a radix-2 operation generates a new quotient bit “qi+1”. Since this bit can be positive or negative, it would at first appear that a CPA might be required for negative results. By using the “on-the-fly” conversion method [11], two registers Q and QM are maintained, which removes the need for a full CPA of the quotient (Figure A.6). Q is the current quotient at iteration i, and QM is its decremented counterpart (Qi − 2^−i). Calculating Q and QM involves selecting one of the two registers and concatenating a value to its end at position 2^−i (Figure 3.14).

In terms of implementation, this requires no more than two 2:1 MUXes and the distribution of the new quotient bit. And since the parallel product generation already includes a 3:1 MUX for selecting the correct product, the critical path of a single iteration is increased by less than a 4:1 MUX.

Page 30: Design of Double Precision IEEE-754 Floating-Point Units

30

Figure 3.14 Radix-2 On-The-Fly Conversion Equations


The radix-2 form of these equations is quite simple in comparison to the generic formula. This leads to a basic design for a single bit of the Q and QM logic, which is depicted in the next two figures. The only complication then becomes knowing which bit is at position 2^−(i+1) during an iteration. By keeping a one-hot register “I”, in which a single high signal is propagated from left to right one bit per iteration, it becomes trivial to select the bit position that will be changed.
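The conversion rules can be sketched as follows (illustrative Python over signed integers; the one-hot position register is implicit in the loop, and the digit list stands in for the per-iteration QSL output):

```python
def on_the_fly(digits):
    """On-the-fly conversion sketch: maintain Q and QM = Q - ulp so that a
    negative digit never requires a full-width subtraction.
    `digits` are the SRT quotient digits, most significant first."""
    q, qm = 0, -1                            # invariant: qm = q - 1 (one ulp)
    for d in digits:
        if d >= 0:
            q_new = (q << 1) | d             # d in {0, 1}: append d to Q
        else:
            q_new = (qm << 1) | (2 + d)      # d = -1: append 1 to QM
        if d > 0:
            qm_new = (q << 1) | (d - 1)      # d = 1: append 0 to Q
        else:
            qm_new = (qm << 1) | (1 + d)     # d = 0 -> append 1; d = -1 -> 0
        q, qm = q_new, qm_new
    return q
```

Each step only shifts and appends one bit to one of the two registers, which is exactly the pair-of-MUXes structure the figures describe.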

Figure 3.15 Radix-2 On-The-Fly Conversion for Q

Figure 3.16 Radix-2 On-The-Fly Conversion for QM

(In Figures 3.15 and 3.16, each bit cell consists of two MUXes driven by the QSL output and the one-hot position register I. For Q, the source register is Q when QSL ≥ 0 (QSL = 01 or 00) and QM when QSL < 0 (QSL = 11), and the bit at position 2^−(i+1) is set to QSL_LSB. For QM, the source register is Q when QSL > 0 (QSL = 01) and QM when QSL ≤ 0 (QSL = 00 or 11), and the bit at position 2^−(i+1) is set to ~QSL_LSB. All other bits are copied unchanged from the selected register.)


3.4.5 Normalization and Sticky Bit Generation

Like multiplication, the division operation may produce a result outside the required range, in this case 0.5 ≤ Q < 2. By detecting whether the leading quotient bit is a zero (accounting for the initial required right shift of the dividend), the quotient can be seen to lie either in the range 0.5 ≤ Q < 1 or 1 ≤ Q < 2. In the first case, the quotient must be left shifted and the exponent decremented. However, unlike multiplication, the sticky bit T is more complicated to calculate: for division and square root it is produced from a residual in carry-save form [25] (Figure 3.17). Since it is not based on the quotient, its calculation is normally performed in the same stage as the recurrence iteration.

T = 0 if and only if (PRs + PRc) mod 2^M = 0, where M = residual length

Figure 3.17 Sticky Bit Formula for a Redundant Residual
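Behaviourally this amounts to a zero test on the redundant residual (illustrative Python; a plain addition stands in here for the carry-free zero-detection used in hardware):

```python
def sticky_from_carry_save(prs, prc, m):
    """Sticky bit from a redundant residual: T = 0 only when the true
    residual PRs + PRc is zero modulo 2**m, m being the residual length.
    An explicit add is used for clarity; [25] describes how the hardware
    performs the detection without a carry propagate addition."""
    return int(((prs + prc) & ((1 << m) - 1)) != 0)
```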

3.4.6 Radix-4

Figure 3.18 SRT Radix-4 Division using Layer Radix-2 Stages

The largest delay in an SRT division iteration comes from the quotient selection logic circuit (QSLC) and the fan-out of its output. While it is possible to compute at a higher radix directly, the QSLC becomes more complicated, and its delay and area increase [25]. Overlapping two lower radix stages allows the second QSLC to begin before the first has completed, essentially leaving only one radix-2 QSLC in the critical path (Figure 3.18). This is done by calculating the second-stage quotient products for each of the three possible products from the first stage before the correct first-stage product has been selected. Once the



first stage QSLC has completed, and the correct product has been selected, the next parallel product

generation can commence, and by its completion, the second stage QSLC should have also completed.

In terms of the circuits described for the parallel product generation for radix-2, at radix-4 they remain

unchanged. While no extra logic may be required, the five most significant bits of each parallel product

need to be available for output to the second level QSLC before the correct product has been selected.

3.4.7 Higher Radix

It is possible to compute at radix-8 using three overlapped radix-2 stages. Essentially this would have a critical path of one QSLC, three 3:1 MUXes and two parallel product generation stages. While this may be possible, as the SUN design shows [25], using standard cells and an AMI 0.5 µm process, the increased delay and area compared to a radix-4 design may be restrictive. It is also important to remember that the above diagram for radix-4 division does not include the extra delay of the first-stage new input selection, the sticky bit generation, or the partial quotient calculation. These are important because, while it may be possible to combine three stages in under 8 ns, the extra delay may be problematic. Further, at any radix higher than 8 the additional parallel product generation delay becomes a limiting factor [23].

3.5 SRT Square Root

3.5.1 Introduction

SRT Square Root Operation Properties

Property | Value
Operation | Q = √X
Operand X Range | 1 ≤ X < 2
Operand D Range | N/A
Result Q Range | 1 ≤ Q < 2
Result Sign | Sign X (should be positive from decoding)
Result Exponent | (see the even/odd exponent handling in Section 3.5.2)

Figure 3.19 Floating Point Square Root Operation Properties

The SRT approach can be adapted to handle the square root operation [11]. However, unlike division, where the integer SRT algorithm could handle the floating point numbers directly, the square root iteration has some minor differences [37]. These differences can be seen by comparing the formulas presented for division and square root, and noting the value of PRs0 in each. For the most part, the division and square root algorithms can share an underlying design, and most approaches to speeding up SRT division are applicable to the square root operation. This section covers a base SRT square root implementation and the elements that make it different from division, such as the creation of a functional divisor.


3.5.2 Radix-2

PR(i+1) = 2·PR(i) − q(i+1)·(2·Q(i) + q(i+1)·2^−(i+1)), with q(i+1) ∈ {−1, 0, +1}
Q(i+1) = Q(i) + q(i+1)·2^−(i+1)

Where
i = 0, 1, 2, 3, ..., m
Q0 = 1, QM0 = 0, PR0 = (X − 1) for an even exponent
Q0 = 0, QM0 = 0, PR0 = (X/2) for an odd exponent

Figure 3.20 SRT Radix-2 Square Root Equations

The formula for SRT square root (Figure 3.20), unlike division, has four base factors, PRc, PRs, 2Q, and 2^-(i+1), which can be reduced to three: PRc, PRs, and F(i) (Figure 3.23). This makes the approaches and concepts for reducing the delay of SRT square root architectures (Figure 3.21) similar to those for division (Figure 3.11). The only added complexity is the computation of F(i), which must occur before an iteration can start.

Figure 3.21 SRT Radix-2 Square Root Overview

While for the positive product F(i) could be produced as -(F(i)), as is shown later, it is simpler to keep it in the form F(i) (Figure 3.24). However, unlike division, where there is a single divisor D, the square root algorithm uses two: one each for the positive and negative products.


Figure 3.22 SRT Radix-2 Square Root Circuit

For floating point numbers in the range 1 ≤ X < 2, the square root operation produces a result guaranteed to be in the range 1 ≤ Q < 2, the first bit of which is produced at the initial stage. This means only 54 iterations are required: 52 for the remaining 52 bits, one for the round bit R, and one for a guard bit G to correct the last quotient bit produced.

3.5.3 Square Root Functional Divisor

The original square root formula (Figure 3.20) can be simplified to involve only three base factors instead of four (Figure 3.23), with the new factor being a functional divisor F(i). Two different divisors, F(i) and -F(i), are required for the positive and negative products. Both versions of F(i) can be generated from the two different versions of the current quotient, Qi and QMi. From these elements, the four base factors of a square root operation, PRsi, PRci, 2Qi, and 2^-(i+1), can be treated as three factors: PRsi, PRci, and F(i) or -F(i). The advantage of using three factors instead of four is that the underlying division architecture can be reused and a single level of full adders is sufficient (Figure 3.12). In terms of implementation, this only requires a left shift of Q or QM and two OR operations with the iteration marker.
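The shift-and-OR construction can be illustrated with a small fixed-point model. This is an assumption-laden sketch: Q_i is held as an integer with i fractional bits, and the results are scaled by 2^(i+1) so the iteration marker becomes a literal OR of '01' or '11'.

```python
def functional_divisors(q_int, i):
    """Form F(i) = 2*Q_i + 2**-(i+1) and 2*Q_i - 2**-(i+1) by shift-and-OR.

    q_int holds Q_i as a fixed-point integer with i fractional bits, so
    QM_i = Q_i - 2**-i is simply q_int - 1.  Both results are scaled by
    2**(i+1): a 2-bit left shift plus an OR with the iteration marker.
    (Illustrative model; the thesis builds the same values in hardware.)
    """
    qm_int = q_int - 1
    f_pos = (q_int << 2) | 0b01   # 4*Q  + 1 == (2*Q + 2**-(i+1)) * 2**(i+1)
    f_neg = (qm_int << 2) | 0b11  # 4*QM + 3 == (2*Q - 2**-(i+1)) * 2**(i+1)
    return f_pos, f_neg
```

For example, with Q = 1.5 after i = 1 iterations (q_int = 3), the call returns 13 and 11, i.e. 3.25 and 2.75 once the 2^-(i+1) = 0.25 scaling is restored.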


Figure 3.23 SRT Radix-2 Square Root Functional Divisor Equations

3.5.4 Integration with Division

Figure 3.24 SRT Radix-2 Division and Square Root Equations

The division and square root operations are very similar (Figure 3.24). Integrating the two functions into a single unit requires a mechanism for selecting the correct divisor (Figure 3.25). In division the divisor is a single constant factor, while in a square root operation the divisor is the result of multiple changing factors (Figure 3.23). By constructing the square root functional divisors before they enter the division iteration, the four base factors are reduced to three, and the circuit only requires an extra MUX to choose the correct divisor for the positive and negative products. The integrated square root and division architecture then has the critical path of the original division unit with the addition of a MUX and F(i) generation (which at most is a three-input OR gate).
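A behavioral sketch may make the divisor MUX concrete. The step function below takes the already-selected positive and negative divisors as arguments: for division both are D, for square root they are the two functional divisors. The selection thresholds and the PR0 = X/2 pre-scaling in the division wrapper are illustrative assumptions, not the thesis's circuit.

```python
from fractions import Fraction

def srt2_step(pr, d_pos, d_neg):
    """One radix-2 iteration shared by division and square root.

    d_pos / d_neg play the role of the divisor MUX output: for division
    both are the constant D, for square root they are F(i) and the
    negative-product divisor.  Returns the digit q and the next PR.
    """
    two_pr = 2 * pr
    if two_pr >= Fraction(1, 2):
        return 1, two_pr - d_pos
    if two_pr <= Fraction(-1, 2):
        return -1, two_pr + d_neg
    return 0, two_pr

def srt2_divide(x, d, iterations=30):
    """Division via the shared step: approximates x/d for 1 <= x, d < 2."""
    pr, q = Fraction(x) / 2, Fraction(0)     # PR0 = X/2 keeps |PR| <= D
    for j in range(1, iterations + 1):
        digit, pr = srt2_step(pr, Fraction(d), Fraction(d))
        q += Fraction(digit, 2 ** j)
    return 2 * q                             # undo the X/2 pre-scaling
```

The square root path would call `srt2_step` with the two functional divisors from the previous insert in place of D; only the operand of the MUX changes, exactly as in Figure 3.25.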


Figure 3.25 SRT Radix-2 Combined Division and Square Root Circuit

3.6 Conclusion

Extending the LTH to provide most of the capabilities of a modern FPU requires the development of a double precision divider, multiplier, and square root unit. To meet the clock frequency of 100MHz using standard cells on a 0.5µ process, a focus needs to be placed on reducing delay. This can be done by using a radix-4 multiplier that consists of modified Booth-2 partial product generation, a full CSA tree, and a hybrid 106-bit CSEA. To accommodate the division and square root operations, a combined SRT radix-4 division and square root unit is proposed. While the design should meet the target frequency, it may come at a large area cost.


4 Implementation

4.1 Introduction

This section presents an overview of the HDL modeling, testing, and synthesis stages used to develop the multiplier and the combined division and square root unit into an IC-layout-ready form. Due to the complexity and large volume of code and scripts produced, only a high-level overview of the important aspects of each design is presented. For a full code listing, please see “Code Listing”.

4.2 HDL Modeling

4.2.1 Introduction

This section discusses the modeling of the designs presented in the last chapter using the Verilog-97 HDL.

It elaborates on the choice of coding styles, ranging from direct ADK standard cell instantiation to

synthesized alternatives. Further, the processes of macro and micro cell optimization are explored, and

the reasons for performing these levels of modeling are investigated.

4.2.2 Multiplication

Figure 4.1 Multiplier Pipeline

4.2.2.1 Partial Product Generation

The partial product generation step for multiplication is by all accounts one of the fastest stages [8] [11] [14]. A design from the literature was directly implemented [15], which consisted of two cells. The first cell was a Booth encoder that read three bits from one of the operands to determine the correct Booth multiple. The second cell, the Booth MUX, read in a single bit from the second operand and, based on the outputs from the Booth encoder, performed a shift or invert operation with its neighboring cells. As the design was already defined, the only addition to the schematics presented by Samsung was an inverter tree, used to buffer the output of a Booth encoder and provide the signals to all associated Booth MUX for that partial product. Therefore, each partial product contained a Booth encoder, three inverter trees (one for each bit of the Booth encoder’s output), and 54 Booth MUX (one for each


possible bit of the partial product). These partial products were then grouped into a single module called

“Booth_Partial_Products” that produced all the required partial products.
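The encoder/MUX behavior can be sketched as follows. This Python model is illustrative only (the thesis implements the Samsung schematics in Verilog); the window-to-digit table is the standard modified Booth-2 recoding.

```python
def booth2_multiply(a, b, n):
    """Multiply a*b by recoding b into n/2 radix-4 Booth digits.

    Each 3-bit overlapping window of b (with a 0 appended below the LSB)
    selects a digit in {-2, -1, 0, +1, +2}; partial product j is that
    multiple of a shifted left by 2*j.  n must be even and b < 2**(n-1)
    so the topmost window sees a 0 sign bit.
    """
    assert n % 2 == 0 and 0 <= b < 1 << (n - 1)
    digit = {0: 0, 1: 1, 2: 1, 3: 2, 4: -2, 5: -1, 6: -1, 7: 0}
    b_ext = b << 1                                # implicit 0 below the LSB
    total = 0
    for j in range(n // 2):
        window = (b_ext >> (2 * j)) & 0b111       # three overlapping bits
        total += (digit[window] * a) << (2 * j)   # one partial product
    return total
```

With n = 56 (a 54-bit fraction plus padding) this produces the signed multiples that the Booth MUX rows implement in hardware.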

4.2.2.2 CSA Tree

The CSA tree was the most complex part of this thesis. It involved designing a range of counters from 27:2 down to 4:2 using full adders, connecting them so that the critical path was only seven adders, and routing columns of bits from the rows of partial products produced by the previous stage to each counter’s input.

The first step was to build the required counters. The Samsung design [15] specifies carry inputs and outputs for each counter on all seven levels. For each carry signal on a level, there must also be an associated counter. Using this as a guide, all 23 counters could be constructed, with the only optimization being that port zero, the initial partial product column input for a particular counter, would be used as late as possible. It was then observed, upon synthesizing the HDL, that the produced netlist did not contain any full adder cells (ADK “Fadd1”). Since the design implied these cells were to be used, a coding style was adopted that directly instantiated the ADK standard cells. As a result, the Verilog code is no longer portable outside of an ADK environment.

Once all counters were created, the CSA tree needed to be constructed. This entailed placing 106 counters, ranging from 27:2 down to half adder cells, in such a way that the carry signals propagated correctly and the partial product rows were correctly aligned into columns for counter input. Doing this by hand seemed an impossible task due to the vast number of ports and nets involved. Opting for a more automated approach, a PERL script was used, which produced a computer-generated module of approximately 1,800 lines that required only minor changes.
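The reduction idea can be modeled compactly. The sketch below is a toy version: it applies 3:2 compressions until two rows remain and ignores the carefully fixed wiring that gives the real tree its seven-adder critical path.

```python
def csa_3to2(a, b, c):
    """One full-adder level: compress three operands to sum + carry."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s, carry

def csa_tree(operands):
    """Reduce a list of partial products to two rows, then add once.

    Illustrative model of the 27:2 reduction: 3:2 counters are applied
    until two rows remain; the final add stands in for the 106-bit CPA.
    """
    rows = list(operands)
    while len(rows) > 2:
        s, c = csa_3to2(rows[0], rows[1], rows[2])
        rows = rows[3:] + [s, c]        # feed results back into the tree
    return rows[0] + rows[1]            # the job of the final wide adder
```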

4.2.2.3 106-bit Wide Adder

Some texts suggest that a fully synthesized design [8] provides sufficient results for the final multiplication CPA. This proposal resulted in a module with a single line of HDL in its body. While initial timing and area reports were promising, the goal of 100MHz could not be met with this design.

A better result was obtained using a hybrid CSEA design [17] made from 16-bit CLA blocks [12]. However, the delay achieved was very close to the acceptable cut-off limit of 8ns. Since this design was explicitly described in structural Verilog, a version that directly instantiated ADK standard cells was also created. This proved to be the better candidate.
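The block-level carry-select principle behind the hybrid CSEA can be sketched as follows. This is a behavioral model, not the gate-level design: each block speculatively computes its sum for both carry-in values, and the incoming carry selects between them.

```python
def carry_select_add(a, b, width=106, block=16):
    """Add two wide operands from fixed-size blocks, carry-select style.

    Each block computes its sum for both possible carry-ins (the +0 and
    +1 forms); the true carry then picks one.  A behavioral sketch of a
    CSEA built from 16-bit blocks, not the thesis's CLA gate netlist.
    """
    mask = (1 << block) - 1
    result, carry = 0, 0
    for pos in range(0, width + block, block):
        x = (a >> pos) & mask
        y = (b >> pos) & mask
        s0 = x + y                      # speculative: carry-in = 0
        s1 = x + y + 1                  # speculative: carry-in = 1
        s = s1 if carry else s0         # MUX on the incoming carry
        result |= (s & mask) << pos
        carry = s >> block              # block carry-out
    return result
```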

4.2.2.4 Normalization

The normalization section for the multiplier was relatively trivial. It consists of a MUX and several OR gates. The MUX is used to left shift the required bits if needed, and the OR gates are used to detect whether any of the discarded bits are ones.
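The left shift and sticky OR can be sketched for the 106-bit product of two 53-bit significands. Bit positions below assume 1.52-format inputs and are illustrative.

```python
def mul_normalize(prod, exp):
    """Normalize a 106-bit significand product (two 53-bit inputs).

    The product of significands in [1, 2) lies in [1, 4); if bit 105 is
    set the value is in [2, 4), so the exponent is bumped and one more
    low bit is discarded.  Discarded bits below the round bit are ORed
    into a sticky flag, mirroring the OR gates in the design.
    """
    if prod >> 105:                     # product already in [2, 4)
        shift, exp = 53, exp + 1
    else:                               # product in [1, 2)
        shift = 52
    sig = prod >> shift                 # 53-bit result significand
    round_bit = (prod >> (shift - 1)) & 1
    sticky = int(prod & ((1 << (shift - 1)) - 1) != 0)
    return sig, round_bit, sticky, exp
```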

4.2.3 Division and Square Root


Figure 4.2 SRT Radix-4 Division and Square Root Pipeline

4.2.3.1 QSLC

The quotient selection logic is one of the most critical sections of the SRT design. SUN [25] and Parhami [8] suggested either using a 4-bit fast adder with some combinational logic, or implementing it as a ROM. Both avenues were explored, but there was some hesitation with the second [35].

The first approach called for a fast 4-bit adder. The fastest adder suitable for this word size was taken to be the CLA [12], and since it is only four bits wide, the increased area of this style of adder was assumed not to be a problem.

The second approach was a ROM version. When testing the original CLA in ModelSIM, a test bench was created that contained a simple behavioral QSLC. By modifying this to output the resulting quotient bit for all variations of the most significant four bits, and adding an extra internal condition where the result was all zero, a Verilog module was created. This module has only a single CASE statement with 256 entries, which Leonardo Spectrum recognized as a ROM.
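The testbench-to-ROM flow can be imitated with a short generator script. The digit-selection function inside is a placeholder (a two's-complement estimate against ±1/2-style thresholds with an assumed fixed-point format); the real table's constants and the all-zero special case depend on the partial remainder encoding used in the thesis.

```python
def qslc_rom_verilog():
    """Emit a 256-entry Verilog CASE body for a ROM-style QSLC.

    The 8-bit index packs the four MSBs of the sum word and the four
    MSBs of the carry word of the redundant partial remainder.  The
    digit rule below is illustrative, not the thesis's exact table.
    """
    def signed4(v):                     # 4-bit two's-complement value
        return v - 16 if v >= 8 else v

    lines = []
    for idx in range(256):
        s, c = idx >> 4, idx & 0xF
        est = (signed4(s) + signed4(c)) / 4.0   # assume 2 fractional bits
        q = 1 if est >= 0.5 else (-1 if est <= -0.5 else 0)
        code = {1: "2'b01", 0: "2'b00", -1: "2'b11"}[q]
        lines.append(f"    8'h{idx:02X}: q = {code};")
    return "\n".join(lines)
```

A table generated this way is exactly the kind of single CASE statement that a synthesis tool maps onto a ROM.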

4.2.3.2 Parallel Product Generation

The parallel product generation essentially performs all three variations of the formula described in the

last chapter for division and square root at the same time (Figure 3.24). Transferring the designs


presented in the integrated division and square root section directly into an initial design was the first

target. This was done by defining a basic cell for a single bit that produced a single bit result for all three

variations of “qi+1” and had a final MUX to select the correct one. This resulted in a cell with a large number of ports that needed to be connected to at least 58 other cells. Moving directly to an automated approach, a PERL script was created that would instantiate all the required bit cells and provide two inverter trees to buffer the input bits from the QSLC.

In the first version, ADK cells were used directly. However, the parallel product cell’s internal logic presented itself as slightly more random than that of the CSA tree. By creating a second version of the basic fraction bit cell, which used Verilog primitives and synthesizable full adder cells as in the last section, an alternative was produced. Initial timing reports were promising. However, it is important to remember that the timing report tool assumed all inputs would be available at time zero. This is not the case, as the new quotient bits arrive late: they must first be generated and propagated. The solution was to synthesize the full adders and provide the produced netlist as an input to a parallel product bit cell synthesis run with the hierarchy set to “PRESERVE”.

The final step was to create a netlist for all bits using the synthesized bit cell. This could be described as macro optimization of micro-optimized cells. Noticing that the synthesis tool did not modify the types of gates used when ADK cells were directly called in a module, a twofold approach was developed. The first step was to synthesize the bit cells with the behavioral full adders. The second step was to run the synthesis tool again, providing it with the full 59-bit parallel product generation module and the synthesized netlist as inputs. The result is that synthesized logic is confined to a single micro cell, and the added delay induced by random logic wiring irregularity is prevented from propagating to the macro cell.

4.2.3.3 On-The-Fly Conversion

On-the-fly conversion is used to update the quotient at the end of each iteration. Its complexity derives from the fact that it must handle negative quotient bits. In the last chapter, a high level design was shown for updating both Q and QM (Figure 3.14). This led to a basic module for a single bit consisting of several MUX levels, which was expanded and replaced by two levels of NAND gates. These cells were optimized in the same way as the parallel product generator: they were synthesized at the cell level, linked together using a PERL script, and synthesized again using the macro cell and the synthesized micro cell netlists.
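The radix-2 update rules can be captured in a few lines. This is an illustrative model of on-the-fly conversion with digits in {-1, 0, 1}, keeping Q and QM as plain integers:

```python
def otfc_update(q, qm, digit):
    """One on-the-fly conversion step for a quotient digit in {-1, 0, 1}.

    Q and QM = Q - 1ulp are both maintained so that appending a digit,
    even a negative one, never needs a carry-propagating subtraction;
    each new pair is just a shift of Q or QM plus a constant low bit.
    """
    if digit == 1:
        return (q << 1) | 1, (q << 1) | 0
    if digit == 0:
        return (q << 1) | 0, (qm << 1) | 1
    return (qm << 1) | 1, (qm << 1) | 0    # digit == -1
```

In hardware each case reduces to a MUX between the shifted Q and QM registers, which is what the two levels of NAND gates implement.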

4.2.3.4 Radix-2 Iteration

This was one of the easiest modules to design. Its purpose was to perform a single radix-2 iteration of a division or square root operation on the fractional component of the operands. Essentially it used only four sub-modules: a QSLC, a parallel fraction product generator, on-the-fly conversion, and a square root functional divisor module. The inputs and outputs were all clearly defined by the internals. While not used in the final design, it was an invaluable tool for testing, since with a higher radix the outputs are the result of multiple iterations, which may make it unclear where a problem has occurred.

4.2.3.5 Radix-4 Iteration


This module serves the same purpose as the radix-2 version. The difference is that it implements two layered radix-2 iterations with overlapped QSLC. It uses slightly more than double the gates, as it requires a total of four QSLC blocks instead of one.

4.2.3.6 Normalization

The normalization stage for division is much simpler than that of multiplication. It takes in a 57-bit quotient and returns a 54-bit result: 53 bits for the quotient and an extra round bit R. For a square root operation the required result is the first 54 bits, while for division it is 54 bits starting from the second or third MSB. Essentially this can be done using three levels of MUX.

4.3 Synthesis

4.3.1 Configuration

Mentor Graphics Leonardo Spectrum Synthesis Level 3 Configuration

Property Settings

Temperature 27

Voltage 5

Max Fan-out Load 5

Register to Register 10ns (Data arrival time 8.0ns)

Input to Register 10ns

Register to Output 10ns

Optimization Timing with Standard Effort (maximum possible)

Figure 4.3 Leonardo Spectrum Settings

Leonardo Spectrum was used to generate netlists for behavioral HDL modules and to flatten structural hierarchies in preparation for layout. It was also utilized to provide timing and area reports, and to verify a fan-out of no more than five for all key modules. While for the most part the HDL was designed in a structural form, the buffering and stage connections were done using a high level form of Verilog-97.

The most important elements of the above configuration are the timing settings. Any data arrival times produced by Leonardo are only a sum of the delay through each component; they do not include wiring delays. As a result, any pre-layout timing will increase post-layout. To account for this, approximately 2ns of a 10ns cycle time must be set aside. For any design unit, a synthesized register-to-register data arrival time of 8ns is assumed to indicate that it can meet a cycle time of 10ns post-layout.

4.3.2 Automation and Reporting

The Mentor Graphics synthesis tool, Leonardo Spectrum, was found to be highly unresponsive when run in GUI mode. In comparison, the same tool used in a terminal environment was more reliable. This led to a series of scripts for generating the netlists and producing timing and area reports.

4.3.3 IC Preparation

When creating an ADK netlist by flattening or synthesizing a design, the output from Leonardo may be incompatible with the IC layout tool. This occurs in two areas: the first is signal assignment, and the second is power supply components. It is unclear why both of these issues occur, but the workaround is to manually alter the produced netlist (Figure 4.4).

Preparing a Netlist for IC Layout

Synthesis Result                     IC Layout Compatible Verilog

Fake_gnd gFgXX(.Y(wGnd));            supply0 wGnd;                  // replace fake ground

Fake_vcc gFvccXX(.Y(wVdd));          supply1 wVcc;                  // replace fake power

assign j = k;                        Buf02 gJKXX( .Y(j) , .A(k) );  // replace assign statements with the fastest ADK buffer

Figure 4.4 Preparing a Netlist for IC Layout

After synthesis and some fine tuning of the produced netlist, it was time to proceed to creating a layout. This is done in two stages: the first is to import the netlist into the schematic editor to generate a technology viewport; the second uses this viewport to generate the layout. By synthesizing each pipeline stage and feeding the produced netlists into the synthesis run of the macro cell, where essentially all pipeline stages are instantiated with the addition of registers, a generic floor plan can be created. For example, synthesizing the multiplier requires the netlists for the Booth partial products, the CSA tree, and the CPA. For this to produce the correct result, the synthesis run should have its hierarchy set to preserve. In this way either a semi-automated or a fully automated IC layout is possible.

4.4 Conclusion

The implementation of division, multiplication, and square root units is a complicated task. By designing the units in a hierarchical manner, where each micro or leaf cell is modeled using both ADK standard cells and Verilog primitives, a variety of options becomes available for the final implementations. To handle the large number of interconnects required, especially for the CSA tree, an automated approach using PERL was adopted. Scripting was further embraced to design the QSLC, where a Verilog test bench for the CLA QSLC was modified to generate a full ROM QSLC.


5 Results

5.1 Introduction

This section presents the results of the HDL design in terms of area, timing, and IC layouts, then discusses the reasons for the achieved area and timing for both the multiplier and the combined division and square root unit, and elaborates on why integrating them into the original LTH was not possible. Finally, a comparison between the projected performance of the designed architecture and industry implementations is presented.

5.2 Timing and Area Analysis

5.2.1 LTH

LTH Pre-Layout Data Arrival Times (ns)

Module AMI 0.5µ Slow AMI 0.5µ Typical AMI 0.5µ Fast

Decode 8.65 4.64 3.13

Pre-normalization 23.72 12.59 9.94

Add/Sub 26.66 17.68 11.66

Post-normalization 30.37 15.98 11.23

Round-normalization 23.01 10.76 9.87

LTH (all stages and buffers) 36.87 24.73 16.78

Modules do not include buffering unless specified. *Unused Cells or Provided for Information Purposes only.

Figure 5.1 Original LTH Timing Analysis

LTH Area by Gate Count

Module AMI 0.5µ Slow AMI 0.5µ Typical AMI 0.5µ Fast

Decode 631 615 582

Pre-Normalization 1947 1771 1720

Add/Sub 5407 3975 4213

Post-normalization 2582 2384 2654

Round-normalization 1147 1049 1018

LTH (all stages and buffering) 11547 10200 9859

Modules do not include buffering unless specified. *Unused Cells or Provided for Information Purposes only.

Figure 5.2 Original LTH Area Analysis

The above is a timing and area analysis of the original LTH FPU HDL, separated into pipeline stages. By the author’s own admission [3], it is not optimized. The HDL is coded in a very high level of behavioral VHDL, and as a result the timing is much higher than desired. Even using the maximum possible computational effort of the synthesis tool, the original LTH fails to meet the required cycle time of 10ns, which becomes worse considering this is pre-layout and does not include any allowance for wiring delay incurred during layout. While it was possible to generate the above tables with apparently borderline results, the time taken often exceeded a week for synthesizing a single stage for a single variation of the


AMI 0.5 micron process. The situation is worsened by most stages requiring additional logic, which would add extra delay and more than likely increase synthesis time. This raises the question of whether the task is one of integrating or of completely redesigning the LTH.

One possible solution to the large delay would involve splitting each of the pipeline stages that exceed the 8.5ns for data arrival. For example, the three internal logic operations of the “Add/Sub” stage could be split into three stages: a pre-shift, an addition using the multiplier’s CSEA, and a result conversion. Since several stages require a similar form of splitting, this method would move the number of stages for addition or subtraction operations closer to 10. This latency does not compare well against other FPU implementations [14], which normally perform these operations in fewer than five cycles. The problem is magnified by the fact that the Leon2 stalls until a single FPU operation has completed, so extra pipeline stages have significant effects on overall system performance.

5.2.2 Multiplier

Multiplier Pipeline Pre-Layout Data Arrival Times (ns)

Module AMI 0.5µ Slow AMI 0.5µ Typical AMI 0.5µ Fast

Partial Product Generation

Booth-2 Product Generation 6.26 3.59 2.31

Carry Save Adder Tree

CSA Tree (ADK Std. Cells) 21.06 11.68 7.34

*Synthesized CSA (Area) 13.74 7.88 5.03

*Synthesized CSA (Delay) 15.64 7.70 5.01

106-Bit Final CPA

106-Bit CSEA (ADK Std. Cells) 18.75 10.17 8.02

*Synthesized CSEA (Cells Preserved) 28.22 12.60 10.04

*Synthesized 106-Bit Adder 48.67 20.77 11.84

Normalization (Fraction Adjustment)

Normalization 4.56 2.66 1.73

Multiplier (Includes: Buffers and Exponent)

Radix-4 Multiplier 24.04 13.52 8.91

Modules do not include buffering unless specified. *Unused Cells or listed for Information Purposes only.

“Area” and “Delay” refer to synthesis optimization targets that may deviate from normal Leonardo Spectrum settings.

Figure 5.3 Multiplier Timing Analysis

Multiplier Pipeline Area by Gate Count

Module AMI 0.5µ Slow AMI 0.5µ Typical AMI 0.5µ Fast

Partial Product Generation

Booth-2 Product Generation 22946 22946 22946

Carry Save Adder Tree

CSA Tree (ADK Std. Cells) 7719 6415 6415

*Synthesized CSA (Area) 8202 7998 7997


*Synthesized CSA (Delay) 15303 13875 14397

106-Bit Final CPA

106-Bit CSEA (ADK Std. Cells) 3128 2963 2931

*Synthesized CSEA (Cells Preserved) 3074 3260 3126

*Synthesized 106-Bit Adder 2445 1588 1930

Normalization (Fraction Adjustment)

Normalization 188 185 185

Multiplier (Includes: Buffers and Exponent)

Radix-4 Multiplier 48908 47131 47132

Modules do not include buffering unless specified. *Unused Cells or listed for Information Purposes only.

“Area” and “Delay” refer to synthesis optimization targets that may deviate from normal Leonardo Spectrum settings.

Figure 5.4 Multiplier Area Analysis

Generating the partial products is one of the fastest operations in the multiplication process, as shown by the timing analysis for “Booth-2 Product Generation” (Figure 5.3). It is also one of the largest in terms of gate count. This can be explained by the large number of Booth MUX required (1458), with one needed for each bit of each partial product (27 products, each 54 bits long). Further, connecting a Booth encoder to each Booth MUX in a Booth product required buffering to solve the fan-out problem. The solution was to use inverter trees, with three required for each partial product and a total of 81 for all the partial products. At first it may appear that this stage could include more logic levels. However, the extra slack is beneficial for placing the input registers for the next stage in a more optimal position for the CSA tree.

Adding the generated partial products using a full binary tree of full adder cells achieved a balanced result for both area and timing. This stage, whether implemented using full adders or 4:2 counters, requires a significant number of adder cells with long carry propagation paths. As can be seen from the timing analysis, the CSEA is actually slower than the CSA, meaning that a satisfactory delay was achieved for the adder tree using ADK Fadd1 cells. While it is clear that a 4:2 counter approach (Figure B.1) would offer a significantly shorter delay, it would also come at a vastly increased gate count (Figure B.2) with no overall increase in performance. Further, the synthesized options, while apparently faster, may have a significantly worse critical path post-layout than the one using Fadd1 cells. Thus, while 4:2 counter and fully synthesized CSA trees offer increased pre-layout performance, the gain is either irrelevant or unlikely to be realized post-layout, and it comes at the cost of significantly more gates.

For the final stage it was initially hoped that a synthesized 106-bit CPA might provide satisfactory performance in terms of both gate count and delay. As the above timing and area analysis shows, the custom ADK standard cell based CSEA is around 33% faster than the fully synthesized 106-bit adder option on a fast process, but at a cost of 50% more gates. The synthesized version of the CSEA provides no advantage in either delay or area. To meet the target of a 10ns cycle time, or approximately 8ns pre-layout, the only viable option is to use the ADK standard cell version of the CSEA on a fast process.


5.2.3 Combined Division and Square Root

Combined Divide Square Root Pre-Layout Data Arrival Times (ns)

Module AMI 0.5µ Slow AMI 0.5µ Typical AMI 0.5µ Fast

Quotient Generation (Q and QM)

*OTFC (ADK Std. Cells) 4.84 2.79 1.79

OTFC (Syn.) 4.15 2.49 1.61

Quotient Selection Logic QSLC

*QSLC (CLA) 8.26 4.61 2.91

QSLC (ROM) 3.26 1.83 1.15

Square Root Functional Divisor (F(i) and –F(i))

Square Root Functional Divisor 0.70 0.42 0.28

Parallel Product Generation (Includes: Adders, qi+1 Inverter Trees, and Selection MUX)

*Bit Slice (ADK Std. Cells) 4.97 2.81 1.77

*Bit Slice (Syn.) 3.77 2.01 1.34

Bit Slice (Micro Syn.) 4.30 2.43 1.54

Sticky Bit Generation 4.19 2.35 1.55

Iterations (Includes: QSLC, Functional Divisor Calculation, Parallel Product Generation, and OTFC)

*Radix-2 Iteration (ADK Std. Cells) 10.68 6.06 3.92

*Radix-2 Iteration (Syn.) 9.30 6.19 4.02

*Radix-2 Iteration (Micro Syn.) 9.64 5.89 3.83

Radix-4 Input Selection (Syn.) 4.39 3.09 2.09

*Radix-4 Iteration (ADK Std. Cells) 17.51 9.95 6.57

*Radix-4 Iteration (Syn. Product) 14.01 9.90 6.41

Radix-4 Iteration (Micro Syn.) 16.03 9.39 6.06

Combined Division and Square Root Unit (Including Buffers and Exponent)

SRT Divider Radix-4 22.08 12.92 8.93

Modules do not include buffering unless specified. *Unused Cells or Provided for Information Purposes only. Radix-4 iteration also

includes sticky bit generation.

Figure 5.5 Combined Division and Square Root Timing Analysis

Combined Divide Square Root Area by Gate Count

Module AMI 0.5µ Slow AMI 0.5µ Typical AMI 0.5µ Fast

Quotient Generation (Q and QM)

*OTFC (ADK Std. Cells) 856 856 856

OTFC (Syn.) 779 652 652

Quotient Selection Logic QSLC

*QSLC (CLA) 46 46 46

QSLC (ROM) 33 41 42

Square Root Functional Divisor (F(i) and –F(i))

Square Root Functional Divisor 251 251 250


Parallel Product Generation (Includes: Adders, qi+1 Inverter Trees, and Selection MUX)

*Bit Slice (ADK Std. Cells) 34 34 34

*Bit Slice (Syn.) 38 33 37

Bit Slice (Micro Syn.) 29 29 29

Sticky Bit Generation 423 420 417

Iterations (Includes: QSLC, Functional Divisor Calculation, Parallel Product Generation, and OTFC)

*Radix-2 Iteration (ADK Std. Cells) 3156 3155 3155

*Radix-2 Iteration (Syn.) 3125 3124 3124

Radix-2 Iteration (Micro Syn.) 2872 2875 2875

Radix-4 Input Selection (Syn.) 1057 955 955

*Radix-4 Iteration (ADK Std. Cells) 7046 6987 6898

*Radix-4 Iteration (Syn.) 7075 6969 6969

Radix-4 Iteration (Micro Syn.) 6714 6530 6532

Combined Division and Square Root Unit (Including Buffers and Exponent)

SRT Divider Radix-4 12418 12524 12419

Modules do not include buffering unless specified. *Unused Cells or Provided for Information Purposes only. Radix-4 iteration also

includes sticky bit generation.

Figure 5.6 Combined Division and Square Root Area Analysis

The two key areas of the combined division and square root unit are the QSLC and the layering of iterations. The fast CLA based QSLC was both larger (Figure 5.6) and slower (Figure 5.5) than the synthesized ROM QSLC. The timing analysis also shows the benefit of overlapping QSLC, where two iterations are performed with a data arrival time significantly less than twice what would be expected from layering two non-overlapped radix-2 iterations. The second iteration was further helped by micro synthesis, where the parallel product used bit slices containing several smaller synthesized cells (“micro syn.”) instead of a completely synthesized bit cell (“syn.”). Surprisingly, each of these bit cells used two synthesized full adders (Figure B.2), which would be expected to make the cell larger than the ADK version, but instead they were both faster and smaller.

5.2.4 Exponent

Important Exponent Component Pre-Layout Data Arrival Times (ns)

Module AMI 0.5µ Slow AMI 0.5µ Typical AMI 0.5µ Fast

Addition or Subtraction of Exponents

*CLA 12-Bit (ADK Std. Cells) 11.17 6.52 4.12

CLA 12-Bit (Syn.) 9.67 5.73 3.63

Decrementing Exponent

*Dec 12-Bit (ADK Std. Cells) 6.55 3.26 2.06

Dec 12-Bit (Syn.) 4.79 2.74 1.99

Incrementing Exponent

*Inc 12-Bit (ADK Std. Cells) 5.14 2.96 1.92

Inc 12-Bit (Syn.) 3.73 2.25 1.45


*Unused Cells or Provided for Information Purposes only.

Figure 5.7 Exponent Component Timing Analysis

Important Exponent Component Area by Gate Count

Module AMI 0.5µ Slow AMI 0.5µ Typical AMI 0.5µ Fast

Addition or Subtraction of Exponents

*CLA 12-Bit (ADK Std. Cells) 157 150 50

CLA 12-Bit (Syn.) 104 96 95

Decrementing Exponent

*Dec 12-Bit (ADK Std. Cells) 46 42 42

Dec 12-Bit (Syn.) 49 49 47

Incrementing Exponent

*Inc 12-Bit (ADK Std. Cells) 47 46 45

Inc 12-Bit (Syn.) 46 45 45

*Unused Cells or Provided for Information Purposes only.

Figure 5.8 Exponent Component Area Analysis

The exponent datapaths for both the multiplier and the combined division and square root units have to this point not been discussed in any detail. While their design is trivial in comparison to the fraction datapaths for multiplication, division, and square root operations, it is important to understand the timing (Figure 5.7) and area (Figure 5.8) that they add to an FPU. Essentially, three different operations are performed on the operands' exponents: addition, increment, and decrement. While the exponent field of IEEE-754 double precision operands is only 11 bits, an extra 12th bit of precision is required for overflow and underflow detection.
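The need for the extra bit can be checked with a small Python sketch (the helper names and the choice to work directly on biased exponents are mine, not taken from the thesis's Verilog). For double precision, the intermediate biased result exponent for multiplication or division can fall well outside the 11-bit range 0..2047, so a wider datapath is needed to detect overflow and underflow before rounding.

```python
BIAS = 1023          # IEEE-754 double precision exponent bias
EMAX = 2046          # largest biased exponent of a normal number
EMIN = 1             # smallest biased exponent of a normal number

def mul_exponent(ea: int, eb: int) -> int:
    """Biased result exponent for multiplication: ea + eb - BIAS."""
    return ea + eb - BIAS

def div_exponent(ea: int, eb: int) -> int:
    """Biased result exponent for division: ea - eb + BIAS."""
    return ea - eb + BIAS

# The extremes fall outside the 11-bit range [0, 2047], so an 11-bit
# datapath cannot by itself flag overflow or underflow.
print(mul_exponent(EMAX, EMAX))   # 3069  -> overflow, above 2047
print(mul_exponent(EMIN, EMIN))   # -1021 -> underflow, below 0
print(div_exponent(EMIN, EMAX))   # -1022 -> underflow, below 0
```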


5.3 IC Layout

5.3.1 Multiplier

Figure 5.9 Multiplier IC Layout

When transferring the multiplier's netlist into an IC layout (Figure 5.9), more standard cell limitations were discovered. For example, a four port OR gate was missing from the Design Architect installation. It was not possible to avoid using this cell in a similar way to the "AND04", where the synthesized CPA was used instead of the faster ADK CSEA; the solution was to build it from two NOR gates and a NAND gate. Similar occurrences happened for other ADK standard cells. At this stage it was realized that any IC layout would not accurately reflect the merits of the HDL model, and with limited time, a fully automated place and route was generated. That being said, placing the Booth MUXes closer to their respective connections in the CSA tree may have been very difficult with a full SDL design using a manual layout process.


5.3.2 Combined Division and Square Root

Figure 5.10 Combined Division and Square Root IC Layout

Unlike the multiplier, the layout for the division and square root unit is more regular, which should make a non-automated IC layout much simpler. This can be seen in the regularity of the integrated division and square root cell (Figure 3.25) in comparison to that of the CSA tree counters [15]. While the netlist produced from the HDL for the division unit contained no missing cells, time constraints limited any IC layout to a fully automated place and route (Figure 5.10). The internal wiring and cell placement is noticeably more uniform than that of the multiplier.


5.4 Performance

5.4.1 Pre-Layout

Pre-Layout Clock Frequency for FPU Components on an AMI 0.5µ Fast Process

Component Pre-Layout Clock Frequency (MHz)

LTH FPU(Original) 58.8

Multiplier 109.3

Multiplier (Using Synthesized CPA for IC Layout) 102.7

Combined Division and Square Root Unit 108.9

Figure 5.11 Pre-Layout Clock Frequency

The maximum possible clock frequency of an FPU containing the new components is 102.7MHz (Figure 5.11). Post-layout this figure is likely to decrease as wiring delays are incorporated. It does, however, illustrate that the newly designed components have achieved a pre-layout frequency almost double that of the original LTH. This figure is more interesting when it is considered that the speed of the multiplier was sacrificed to software compatibility issues, as a slower synthesized CPA was used for IC layouts. Alternately, assuming there were no standard cell compatibility issues, the maximum clock rate would be 108.9MHz. Allowing roughly 10% added wiring delay results in a projected post-layout clock frequency of 98.01MHz, or approximately 100MHz, for the optimal design.

5.4.2 Industry Comparison

1994-95 0.5µ Processor Latency and Clock Frequency Comparison

Operation Latency

Processor Multiplication Division Square Root Clock Rate (MHz)

Intel P6 5 38 38 150

MIPS R10000 2 19 19 200

Sun UltraSPARC 3 22 22 167

DEC21164 4 23 23 200

AMDK5 7 39 39 75

PA8000 3 31 31 200

LTH Extended 6(4+2LTHL) 30(28+2LTHL) 31(29+2LTHL) 100

All operation latencies are assumed to include FPU overhead (e.g. decoding, rounding, and operand packaging). LTHL = LTH FPU

operation overhead latency (this includes decoding and rounding stages). Statistics provided by Flynn and Obermann [14].

Figure 5.12 Processor Latency and Frequency Comparison

In a comparison of 0.5µ microprocessors released during 1994-95 (Figure 5.12), the new components for the LTH performed quite well. While all statistics for the new design are projected, as the LTH was not updated, and the clock frequency is based on pre-layout data arrival times, the extended LTH would not be the slowest architecture even if its post-layout clock frequency were 25% lower. However, the quoted clock frequencies of the industry implementations may not reflect the operating clock of the respective FPU, but rather the operating frequency of its CPU.


Operation latency and clock frequency alone are not, however, adequate metrics for an architecture design. The area, feature size, and power consumed are also important [14]. This places a restriction on measuring the benefits of the new modules. As the existing LTH was not updated due to its own limitations, the FPU area could not be approximated, and a clock frequency could only be related to the critical path of the new units (Figure 5.11).

5.5 Conclusion

For the multiplier and combined division and square root units, both a structural form of Verilog with direct ADK instantiation and a synthesizable version were implemented. This led to a variety of options for creating the final layouts. The ADK based models provided a satisfactory delay with a low gate count, while the synthesized modules provided, in general, a vastly decreased pre-layout delay and an increased gate count.

When selecting the optimal solution for the multiplier, the decisions were determined by the delay of the final CPA. For the other stages, where various solutions were available, the ones with the lowest gate counts were selected. This is because no matter how fast the other stages may be, the cycle time is set only by the slowest stage in the pipeline, which is the CPA, and there is no need to add extra gates for no improvement in performance.

In the division and square root unit, the use of synthesized components was increased. The original CLA

based QSLC provided an expected delay. However, the synthesized ROM version was significantly faster

and with fewer gates. For the parallel product generation, which included buffering of the QSLC output, a

synthesized cell was also appropriate. By confining the results of the synthesis tool to a small area, the

random logic was not expected to produce a large increase to post-layout data arrival times.

While a variety of options were investigated for both the multiplier and division units, the LTH proved elusive. Working with the original LTH was problematic as a result of unsatisfactory synthesis results, where the components produced could not meet the required register to register delays. Nearly all stages required some optimization, which would extend their critical paths, providing worse data arrival times. This leaves two solutions: the first is to completely redesign the LTH, and the second is to use a different open source FPU, such as that provided by SUN Microsystems [7].

For the implemented units, an IC layout was constructed using a fully automated process. The results are for illustration only, as many sacrifices had to be made to the original design due to software compatibility issues. Thus, any post-layout timings would not accurately reflect the true merits of the architecture, and performance evaluation must be based on pre-layout data arrival times.


6 Conclusion

6.1 Introduction

This section provides an overview of what was achieved, what remains to be done, and the future

direction of extending the FPU. It discusses each of the functional units implemented in terms of design,

area and timing, and layout. The directions for which the original LTH could be extended and the new

units improved to further increase performance are also elaborated. Finally, there is some discussion of

the overall achievements of this project, and directions for future implementations.

6.2 Achievements

6.2.1 Design

A double precision floating point multiplier was designed with four pipelined stages. The first was a Booth-2 radix-4 partial product generator that produced 27 54-bit partial products. The second stage was a full CSA tree, where each column of the partial products produced was an input into a counter. The third stage was a CPA, which was constructed from 15-bit CLA blocks to form a 106-bit CSEA. The fourth and final stage was normalization, where the result produced was prepared for rounding.

The division approach used was a radix-2 SRT subtractive algorithm using a modified QSL from SUN Microsystems. It consisted of two pipeline stages: a recursively used division stage and a normalization stage. The first stage used two layered radix-2 iterations with overlapped parallel quotient selection, which permitted the calculation of a double precision result in 27 cycles. The square root operation was also incorporated into the division unit. Like the CSA from the multiplier, the design was realized through a bit slice implementation. However, unlike the multiplier, the delay of the SRT division and square root unit was never a concern, as the number of SRT iterations could be scaled.

Both the multiplication and division units were verified against a set of over 1000 random test vectors. The square root unit was also subjected to a similar number of vectors; however, its results had to be checked by hand.

6.2.2 Implementation

Over 100 modules were created to model the architecture of the multiplier and division unit for both direct ADK and synthesized implementations. This led to a low area ADK full adder based CSA tree, a quick ADK CSEA, and a fast synthesized QSLC ROM. Without this depth and variety of HDL modeling, it is doubtful that a sub 10ns pre-layout cycle time would have been possible under any 0.5µ process.

Transferring the HDL to an IC layout was more complicated than expected. This resulted from software

compatibility issues, where the synthesized netlists would consist of unknown or unavailable cells. The

solution was to model the missing cells using available ones, creating a slower and larger design. This

meant no produced IC layout could come close in terms of clock frequency or area to the pre-layout

calculations. Considering this, and time limitations, a fully auto-place and route IC layout was produced, if

for no other reason than to demonstrate size and complexity.


6.2.3 Timing and Area

Using the Mentor Graphics ADK2.5 standard cell design flow for an aggressive AMI 0.5 micron static

CMOS process, a target pre-layout speed of 100MHz was possible. However, using a more typical AMI 0.5

micron process, a maximum data arrival time of 13.52ns was obtained for the multiplier, in contrast to a

more aggressive fast process, where the cycle time was reduced to 8.91ns. Thus for a 10ns cycle time or

an FPU clocked at 100MHz, an aggressive process is required.

This does not include the timing of the original LTH. Due to the high level of its HDL, extensive synthesis was required to generate a netlist. This presented several problems. In particular, some of the original code was not synthesizable, and some modules produced a critical path of at least 50ns for a typical AMI 0.5 micron process. This caused significant problems with integrating the datapaths, as only a complete redesign of the original LTH code would permit it to achieve a cycle time of 10ns, similar to that of the new pipeline stages.

Regarding the new datapaths, the pipeline stage with the largest area was the modified Booth-2 radix-4 partial product generator. Using a fast AMI process, it was estimated to have an area of at least 22K gates, but surprisingly it had one of the fastest pre-layout critical paths, at 3.5ns. Overall the multiplier required 47K gates and the combined division and square root unit 12K gates, and since multiplication is the more frequently used operation [29], the ratio of required area to utilization is appropriate.

6.3 Future Directions

6.3.1 Design

There are several possibilities for improving this design that may lead to achieving a sub 10ns cycle time

under a typical AMI 0.5µ process. The first is modifying the multiplier to perform more computations in

the Booth encoding stage, perhaps extending the Booth encoding to radix-8. This would lead to an

increased delay during this stage, along with increased area. A higher radix multiplication would reduce

the delay and area of the CSA tree, as a result of requiring fewer counters.

The CSA tree could be modified to have its first few levels of counters placed in the Booth encoding stage. The result would be a slightly longer Booth encoding stage, a slightly shorter CSA stage, and a significant saving of area, as fewer partial products would need to be buffered. This would come at the expense of increased design time, as many counters would need to be created with pipeline registers inserted directly after the first few levels.

The third possibility is to perform the final CPA over several stages, where each CLA block could produce

an output for the CSEA, and the MUX selection could occur in the next stage. This reduces the critical path

of the first stage of the CPA to one 16-bit CLA and the second stage to six multiplexers. The reason this

was not done initially was that it adds an extra pipeline stage, and more gates, due to the extra buffering

required.

For the division and square root unit, the goal of implementing the Sun design [25], which used three

layered overlapped stages seemed out of reach for standard cells on the AMI 0.5µ process. Without


considering the input selection required for the SRT iterations, three levels of parallel product generation and overlapped QSLC come close to 8.5ns. This limits the number of iterations per cycle for this design to two. Future implementations may wish to investigate alternate architectures, such as a radix-512 approach [37] or a multiplicative approach [19]. However, for an SRT approach, there is probably not much more performance to be gained without significant increases in area [23].

6.3.2 Implementation

Clearly the different versions of the ADK offered by the synthesis tool, and the missing cells of the

schematic entry and layout software, are a cause for concern. With these issues resolved, and a

significant amount of time, the HDL implementation could be realized in a hand placed IC layout.

6.3.3 Architecture Integration

The LTH FPU offered most of the functionality required for integrating the new units. It did not, however, provide a satisfactory level of performance. While the missing units (multiplier, divider, and square root) were fully realized in HDL, integrating them with the LTH would result in halving their projected clock frequencies (Figure 5.11). For the designed units to be used as part of a larger FPU capable of a 100MHz clock frequency, a more optimal solution than the current LTH needs to be found, which may include a complete redesign or the use of another open source FPU [7].

6.4 Conclusion

The LTH FPU provides a partial implementation of a floating point unit for Gaisler Research's Leon2, including instruction and operand decoding, special number detection, addition and subtraction, and rounding. However, it does not have a multiplier, divider, or square root unit. These components have been designed using Verilog HDL with the ADK 2.5 design flow, and verified to operate at 100MHz pre-layout using an aggressive AMI 0.5 micron process. While the components have been built and tested, they have not been integrated into the original LTH, nor realized in an IC layout.

While the designs of the two new units are complete, the implementation could benefit from a hand

placed IC layout, specifically the CSA tree. This would drastically reduce area consumed and improve the

post-layout critical path of the multiplier compared to the auto-place and route layout produced.

Since the start of development, an open source CPU including a fully pipelined FPU with many improvements over the Leon2 and the original LTH has become available from SUN Microsystems, with the release of the OpenSPARC T1. The FPU provided with the T1 offers a better choice for future innovations, as it is not confined by the limitations placed on it by the Leon2 and the original LTH. Further, the original LTH would need to be completely redesigned to achieve a performance level similar to that of the new units, or to be a comparable option to the T1's FPU for future projects.


Bibliography

1 Gaisler Research. “Leon 2 Processor” [Online]. Available

http://www.gaisler.com/bin/leon2-1.0.32-xst.tar.gz

2 Gaisler Research. “Leon2 Processor Users Manual” [Online]. Available

http://www.gaisler.com/doc/leon2-1.0.30-xst.pdf

3 M. Kasprzyk. “Floating Point Unit Design IC Project 2001” [Online]. Available

http://www.gaisler.com/bin/leon2-1.0.32-xst.tar.gz

4 SPARC International. “The SPARC Architecture Manual V8” [Online]. Available

http://www.sparc.com/standards/V8.pdf.

5 GNU. “General Public License GPL” [Online]. Available

http://www.gnu.org/copyleft/gpl.html

6 Gaisler Research. “GRFPU High-Performance Floating-Point Unit”. Available

http://www.gaisler.com/cms4_5_3/index.php?option=com_content&task=view&id=138&

Itemid=54

7 SUN Microsystems. “OpenSPARC-T1” [Online]. Available http://opensparc-t1.sunsource.net/index.html

8 B. Parhami, Computer Arithmetic: Algorithms and Hardware Designs, Oxford University

Press, 2000.

9 I. Koren, Computer Arithmetic Algorithms 2nd Ed., Prentice-Hall, 2002.

10 I. Koren. “Computer Arithmetic Algorithms Simulator” [Online]. Available:

http://www.ecs.umass.edu/ece/koren/arith/simulator/

11 M. Ercegovac and T. Lang, Digital Arithmetic, Morgan Kaufmann Publishers, 2004.

12 J. E. Stine, Digital Computer Arithmetic Datapath Design using Verilog HDL, Kluwer

Academic Publishers, 2004.

13 J. Deschamps, G. J. Antoine Bioul, and G. D. Sutter, Synthesis of Arithmetic Circuits, John

Wiley & Sons, 2006.


14 M. J. Flynn and S. F. Oberman, Advanced Computer Arithmetic Design, John Wiley & Sons,

2001.

15 K. Cho, J. Hong, and G. Choi, “54x54-bit Radix-4 Multiplier based on Modified Booth Algorithm,” in Proc. 13th ACM Symp. VLSI, pp 233-236, Apr. 2003.

16 R. K. Yu, G. B. Zyner, “167MHz Radix-4 Floating-Point Multiplier,” in Proc. 12th IEEE Symp.

On Computer Arithmetic, pp 149-154, Jul. 1995.

17 J. Mori, M. Nagamatsu, M. Hirano, ”A 10-ns 54*54-bit Parallel Structured Full Array

Multiplier with 0.5 micron CMOS technology,” IEEE Journal of Solid-State Circuits, vol. 26,

no. 4, pp 600-606, Apr. 1991.

18 N. Besli, R. G. DeshMukh, “A 54*54-bit Multiplier with a new Redundant Booth’s

Encoding,” IEEE Conf. Electrical and Computer Engineering, vol. 2, pp 597-602, 12-15 May

2002.

19 E. M. Schwarz, R. M. Averill, L. J. Sigal, “A Radix-8 S/390 Multiplier,” in Proc. 13th IEEE

Symp. Computer Arithmetic, pp 2-9, Jul. 1999.

20 N. Itoh, Y. Naemura, H. Makino, Y. Nakase, “A Compact 54*54-bit Multiplier With Improved Wallace-Tree Structure,” in Dig. Technical Papers of Symp. VLSI Circuits, pp 15-16, Jun. 1999.

21 M. J. Flynn and S. F. Oberman, “Division Algorithms and Implementations,” IEEE Trans.

Computers, vol. 46, no. 8, Aug. 1997.

22 P. Soderquist and M. Leeser, “Area and Performance Tradeoffs in Floating-Point Divide

and Square-Root Implementations,” ACM Computing Surveys, vol. 28, no. 3, Sep. 1996.

23 D. L. Harris, S. F. Oberman, and M. A. Horowitz, “SRT Division Architectures and

Implementations,” in Proc 13th IEEE Symp. Computer Arithmetic, pp 18-25, Jul. 1997.

24 P. Kornerup, “Quotient Digit Selection for SRT Division and Square Root,” IEEE Trans.

Computers, vol. 54, no. 3, Mar. 2005.

25 J. A. Prabhu and G. B. Zyner, “167MHz Radix-8 Divide and Square Root Using Overlapped

Radix-2 Stages,” in Proc. 12th IEEE Symp. Computer Arithmetic, pp. 155-162, Jul. 1995.

26 J. Chiang, H. Chung, and M. Tsai, “Carry-Free Radix-2 Subtractive Division Algorithm and Implementation of the Divider,” Tamkang Journal of Science and Engineering, vol. 3, no. 4, pp 249-255, 2000.

27 M. Kuhlmann and K. K. Parhi, “Fast Low-Power Shared Division and Square Root

Architecture,” in Proc. IEEE Int. Conf. on Computer Design, (ICCD), pp. 128-135, Oct. 1998.

28 G. Gerwig, H. Wetter, E. M. Schwarz, J. Haess, “Higher Performance Floating-Point Unit with 116-bit wide Divider,” in Proc. 16th IEEE Symp. on Computer Arithmetic, pp 87-94, June 2003.

29 SPEC. “SPEC CFP2006 Benchmark Description” [Online]. Available: http://www.spec.org/cpu2006/CFP2006

30 M. Darley, B. Kronlage, D. Bural, B. Churchill, D. Pulling, P. Wang, ”The TMS390602A

Floating-Point Coprocessor for SPARC Systems,” IEEE Micro, vol. 10, no. 3, pp. 36-47,

May 1990.

31 B. J. Benchschneider, W. J. Bowhill, E. M. Cooper et al, ”A Pipeline 50-MHz CMOS 64-bit

Floating-Point Arithmetic Processor,” IEEE Journal of Solid-State Circuits, vol. 24, no. 5, Oct.

1989.

32 L. Min, B. Yong-Qiang, S. Xu-Bang, G. De-Yuan, “The implementation of an out-of-order

execution floating point unit,” Proc. IEEE Int. Conf. Solid State and Integrated Circuits, pp.

1384-1387, vol. 2, Oct. 2004.

33 M. Aagaard and C. H. Seger, “The Formal Verification of a Pipelined Double Precision IEEE

Floating Point Multiplier,” Proc. IEEE/ACM Int. Conf. Computer Aided Design, pp. 7-10,

1995.

34 M. Leeser and J. O’Leary, “The Verification of a Subtractive Radix-2 Square Root

Algorithm and Implementation,” IEEE Int. Conf. on Computer Design, (ICCD), pp.

526-531, Oct. 1995.

35 S. Siewert, “Big Iron Lessons, Part 1: FPU Architecture Now and Then” [Online]. Available: http://www-128.ibm.com/developerworks/library/pa-bigiron1/?ca=dnt-65

36 N. Weste, D. Harris, CMOS VLSI Design, 3rd Ed. Int. Ed, Addison Wesley, 2004.


37 A. Nannarelli, “Low Power Division and Square Root”, PhD. thesis, Univ. of

California, Irvine, 1999.


Appendix A: Arithmetic by Example

Introduction

The following sections provide a demonstration of some of the arithmetic methods used to compute the

multiplication, division, and square root operations in this design. They illustrate the broad concepts of

each function, and provide an insight into the correct operation of the most important sections of each

unit.

Radix-2 Unsigned Multiplication

Unsigned multiplication is one of the simplest operations, where a set of partial products is produced and summed (Figure A.1). Each partial product is the result of an AND operation on a bit from operand B and the entire operand A. The partial product is then shifted to the left a number of places, depending on the bit's position in operand B.

A × B = R
21₁₀ × 26₁₀ = 546₁₀
10101₂ × 11010₂ = 1000100010₂

A[4:0] 1 0 1 0 1 (21₁₀)
B[4:0] × 1 1 0 1 0 (26₁₀)
PP1 0 0 0 0 0 A and B0
PP2 1 0 1 0 1 A and B1
PP3 0 0 0 0 0 A and B2
PP4 1 0 1 0 1 A and B3
PP5 1 0 1 0 1 A and B4
Result 1 0 0 0 1 0 0 0 1 0 PP Sum = 546₁₀

Figure A.1 Unsigned Radix-2 Multiplication Example
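The shift-and-add scheme of Figure A.1 can be sketched in a few lines of Python (the function name and default width are illustrative, not from the thesis's Verilog):

```python
def radix2_mul(a: int, b: int, width: int = 5) -> int:
    """Shift-and-add multiplication: one AND-masked partial product per
    bit of b, each shifted left by that bit's position, then summed."""
    result = 0
    for i in range(width):
        bit = (b >> i) & 1
        partial = a if bit else 0      # PP_i = A AND B_i
        result += partial << i         # weight by the bit position
    return result

print(radix2_mul(0b10101, 0b11010))    # 546, i.e. 0b1000100010
```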


Radix-4 Multiplication using Modified Booth Encoding

The following example is an unsigned multiplication of two numbers using the Modified Booth-2 algorithm. After the partial products have been generated, they are sign extended and then summed. Sign extension is required because the Booth encoding can generate negative partial products; using a special form of sign encoding, all negative partial products require only flipping the bits and shifting.

A × B = R
21₁₀ × 26₁₀ = 546₁₀
10101₂ × 11010₂ = 1000100010₂

A[4:0] 1 0 1 0 1 (21₁₀)
B[4:0] × 1 1 0 1 0 (26₁₀)

~S1 S1 S1 PP15 PP14 PP13 PP12 PP11 PP10

1 ~S2 PP25 PP24 PP23 PP22 PP21 PP20 S1

PP15 PP14 PP13 PP12 PP11 PP10 S2

PP1 0 1 1 0 1 0 1 0 1 (B1,B0,0) = -2A

PP2 1 0 1 0 1 0 1 0 1 (B3,B2,B1)= -1A

PP3 1 0 1 0 1 0 1 (0,B4,B3) = +2A

Result 1 0 0 0 1 0 0 0 1 0 PP Sum = 546₁₀

Figure A.2 Booth-2 Multiplication Example

The sign extension can be extended to a larger number of partial products by repeating the form of encoding on the second line for the rest of the products (Figure A.3). In the following example of a generic 8-bit multiplication, it may appear as if an error has occurred, because the last partial product is shorter than the rest. This is due to the modified Booth encoding, which for this row can only receive “00X”, and can therefore only produce a zero or positive partial product. Further, the sign extension of the fourth row was not continued to include a one, because an 8-bit by 8-bit multiplication can only produce a 16-bit number; as a result, the extra sign extension on row four and the carry out from the most significant column are ignored.

• • • • • • • •

× • • • • • • • •

~S S S • • • • • • • • •

1 ~S • • • • • • • • • S

1 ~S • • • • • • • • • S

~S • • • • • • • • • S

• • • • • • • • S

• • • • • • • • • • • • • • • •

Figure A.3 Booth-2 Multiplication Sign Extension Dot Diagram
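The recoding behind Figures A.2 and A.3 can be checked with a small Python sketch (function names are mine, and the hardware generates all partial products in parallel rather than in a loop): each overlapping triplet of multiplier bits maps to a digit in {-2, -1, 0, +1, +2}, and summing digit-weighted copies of A reproduces the product.

```python
def booth2_digits(b: int, width: int) -> list[int]:
    """Booth-2 (radix-4) recoding of an unsigned multiplier: scan
    overlapping triplets (b[2i+1], b[2i], b[2i-1]) with b[-1] = 0.
    Each digit d = -2*b[2i+1] + b[2i] + b[2i-1] lies in {-2..+2}."""
    digits = []
    padded = b << 1                    # append the implicit 0 at b[-1]
    n_groups = width // 2 + 1          # enough groups to cover unsigned b
    for i in range(n_groups):
        group = (padded >> (2 * i)) & 0b111
        hi, mid, lo = (group >> 2) & 1, (group >> 1) & 1, group & 1
        digits.append(-2 * hi + mid + lo)
    return digits

def booth2_mul(a: int, b: int, width: int = 5) -> int:
    """Sum the recoded partial products: B = sum(d_i * 4^i)."""
    return sum(d * a * 4**i for i, d in enumerate(booth2_digits(b, width)))

print(booth2_digits(0b11010, 5))   # [-2, -1, 2]
print(booth2_mul(21, 26))          # 546
```

For B = 11010₂ the recoded digits come out as (-2, -1, +2), matching the -2A, -1A, and +2A rows of Figure A.2.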


SRT Radix-2 Division

X / D = Q
25₁₀ / 5₁₀ = 5₁₀
11001₂ / 00101₂ = 00101₂
In floating-point terms: 1.1001₂ / 1.0100₂ = 1.0100₂

Let PRs0 = 000.11001₂ and Result Exponent = Result Exponent + 1

Let PRc0 = 0

All QSLC calculations are based on the SUN Design [25]

Iteration 0

PRs0 0 0 0 . 1 1 0 0 1

PRc0 0 0 0 . 0 0 0 0 0

QSLC 4-bit Sum 0 0 0 . 1 q1 = 1, Q1 = 1.0

2PRs0 0 0 1 . 1 0 0 1 0

2PRc0 0 0 0 . 0 0 0 0 0

-D 1 1 0 . 1 1 0 0 0 PR1 = 2PR0 - D

PRs1 1 1 1 . 0 1 0 1 0

PRc1 0 0 0 1 . 0 0 0 0 0

Iteration 1

PRs1 1 1 1 . 0 1 0 1 0

PRc1 0 0 1 . 0 0 0 0 0

QSLC 4-bit Sum 0 0 0 . 0 1 q2 = 1 as MSB5 = 1, Q2 = 1.1

2PRs1 1 1 0 . 1 0 1 0 0

2PRc1 0 1 0 . 0 0 0 0 0

-D + 1 1 0 . 1 1 0 0 0 PR2 = 2PR1 - D

PRs2 0 1 0 . 0 1 1 0 0

PRc2 1 1 0 1 . 0 0 0 0 0

Iteration 2

PRs2 0 1 0 . 0 1 1 0 0

PRc2 1 0 1 . 0 0 0 0 0

QSLC 4-bit Sum 1 1 1 . 0 q3 = -1, Q3 = 1.01

2PRs2 1 0 0 . 1 1 0 0 0

2PRc2 0 1 0 . 0 0 0 0 0

+D + 0 0 1 . 0 1 0 0 0 PR3 = 2PR2 + D


PRs3 1 1 1 . 1 0 0 0 0

PRc3 0 0 0 0 . 1 0 0 0 0

Iteration 3

PRs3 1 1 1 . 1 0 0 0 0

PRc3 0 0 0 . 1 0 0 0 0

QSLC 4-bit Sum 0 0 0 . 0 0 q4 = 0 as MSB5 = 0, Q4 = 1.010

2PRs3 1 1 1 . 0 0 0 0 0

2PRc3 0 0 1 . 0 0 0 0 0

+D 0 0 0 . 0 0 0 0 0 PR4 = 2PR3

PRs4 1 1 0 . 0 0 0 0 0

PRc4 0 0 1 0 . 0 0 0 0 0

Iteration 4

PRs4 1 1 0 . 0 0 0 0 0

PRc4 0 1 0 . 0 0 0 0 0

QSLC 4-bit Sum 0 0 0 . 0 0 q5 = 0, Q5 = 1.0100

2PRs4 1 0 0 . 0 0 0 0 0

2PRc4 1 0 0 . 0 0 0 0 0

+D 0 0 0 . 0 0 0 0 0 PR5 = 2PR4

PRs5 0 0 0 . 0 0 0 0 0

PRc5 1 0 0 0 . 0 0 0 0 0

Figure A.4 SRT Radix-2 Division Example

No more iteration’s are required, as in the next iteration, all of the bits for the QSLC input will be zero.

Thus, further iterations will only generate extra zero quotient bits. This method has produced a result of

“1.0100…..0”, which matches what was expected. In further iterations, the partial remainder will be

completely zero, thus the sticky bit generated would also be zero.
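The iteration above can be mimicked in Python using exact rational arithmetic. This is a simplified sketch, not the thesis's Verilog: the partial remainder is kept as a single Fraction rather than in carry-save (PRs/PRc) form, and exact comparisons against ±1/2 stand in for the 4-bit QSLC estimate.

```python
from fractions import Fraction

def srt_divide(x: Fraction, d: Fraction, iters: int) -> Fraction:
    """Radix-2 SRT division for normalized x, d in [1, 2).
    The partial remainder starts at x/2 (exponent + 1) so |PR| <= d
    holds, and each iteration picks a quotient digit in {-1, 0, +1}."""
    pr = x / 2
    q = Fraction(0)
    weight = Fraction(1)               # weight of the first digit
    for _ in range(iters):
        pr *= 2                        # shift the partial remainder left
        if pr >= Fraction(1, 2):       # exact thresholds stand in for
            digit = 1                  # the QSLC's truncated estimate
        elif pr <= Fraction(-1, 2):
            digit = -1
        else:
            digit = 0
        pr -= digit * d                # PR_{i+1} = 2 PR_i - q_{i+1} D
        q += digit * weight
        weight /= 2
    return q

# 1.1001_2 / 1.0100_2 : (25/16) / (5/4) = 5/4 = 1.0100_2
print(srt_divide(Fraction(25, 16), Fraction(5, 4), 5))   # 5/4
```

The digit sequence produced for this input is +1, +1, -1, 0, 0, the same signed digits accumulated as Q1..Q5 in Figure A.4.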


SRT Radix-2 Square Root

The following is a demonstration of a radix-2 SRT square root operation.

√X = Q
√25₁₀ = 5₁₀
√11001₂ = 00101₂
In floating-point terms: √1.1001₂ = 1.0100₂

Let PRs0 = 000.10010₂

Let PRc0 = 0

All QSLC calculations are based on the SUN Design [25]

Iteration 0

X 0 0 1 . 1 0 0 1 0 0 0 0

PRsi 0 0 0 . 1 0 0 1 0 0 0 0

PRCi 0 0 0 . 0 0 0 0 0 0 0 0

QSLC 0 0 0 . 1 qi+1 = 1, PRi+1 = 2PRi - (2Qi + 2^-(i+1))

Qi 0 0 1 . 0 0 0 0 0 0 0 0

2^-(i+1) 0 0 0 . 1 0 0 0 0 0 0 0

2PRsi 0 0 1 . 0 0 1 0 0 0 0 0

2PRci 0 0 0 . 0 0 0 0 0 0 0 0

F[i] 1 0 1 . 0 1 1 1 1 1 1 1 +

PRsi+1 1 0 0 . 0 1 0 1 1 1 1 1

PRci+1 0 1 0 . 0 1 0 0 0 0 0 1 Carry in

Iteration 1

PRsi 1 0 0 . 0 1 0 1 1 1 1 1

PRCi 0 1 0 . 0 1 0 0 0 0 0 1

QSLC 1 1 0 . 0 qi+1 = -1, PRi+1 = 2PRi + (2Qi - 2^-(i+1))

Qi 0 0 1 . 1 0 0 0 0 0 0 0

2^-(i+1) 0 0 0 . 0 1 0 0 0 0 0 0

2PRsi 0 0 0 . 1 0 1 1 1 1 1 0

2PRci 1 0 0 . 1 0 0 0 0 0 1 0

F(i) 0 1 0 . 1 1 0 0 0 0 0 0 +

PRsi+1 1 1 0 . 1 1 1 1 1 1 0 0

PRci+1 0 0 1 . 0 0 0 0 0 1 0 0

Iteration 2


PRsi 1 1 0 . 1 1 1 1 1 1 0 0

PRCi 0 0 1 . 0 0 0 0 0 1 0 0

QSLC 1 1 1 . 1 qi+1=0 PRi+1=2PRi

Qi 0 0 1 . 0 1 0 0 0 0 0 0

2^-(i+1) 0 0 0 . 0 0 1 0 0 0 0 0

2PRsi 1 0 1 . 1 1 1 1 1 0 0 0

2PRci 0 1 0 . 0 0 0 0 1 0 0 0

0 0 0 0 . 0 0 0 0 0 0 0 0

PRsi+1 1 0 1 . 1 1 1 1 1 0 0 0

PRci+1 0 1 0 . 0 0 0 0 1 0 0 0

Iteration 3

PRsi 1 0 1 . 1 1 1 1 1 0 0 0

PRCi 0 1 0 . 0 0 0 0 1 0 0 0

QSLC 1 1 1 . 1 qi+1=0 PRi+1=2PRi

Qi 0 0 1 . 0 1 0 0 0 0 0 0

2^-(i+1) 0 0 0 . 0 0 0 1 0 0 0 0

2PRsi 0 1 1 . 1 1 1 1 0 0 0 0

2PRci 1 0 0 . 0 0 0 1 0 0 0 0

0 0 0 0 . 0 0 0 0 0 0 0 0

PRsi+1 0 1 1 . 1 1 1 1 0 0 0 0

PRci+1 1 0 0 . 0 0 0 1 0 0 0 0

Figure A.5 SRT Radix-2 Square Root Example

Q3 = 1.0100, which means that after several iterations the desired result has been obtained. No further iterations were completed because all further “qi+1” will be zero. The main difference between an integer version and the demonstrated floating point version of SRT square root is that PRs0 and Q0 are initialized before the first iteration to (X-1) and 1 respectively.
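The same style of sketch works for the square root recurrence. Again this is a simplification of the hardware: no carry-save form, exact ±1/2 thresholds in place of the QSLC, and the variable “divisor” F(i) computed directly as 2Q ± 2^-(i+1).

```python
from fractions import Fraction

def srt_sqrt(x: Fraction, iters: int) -> Fraction:
    """Radix-2 SRT square root for x in [1, 4), digits in {-1, 0, +1}.
    Maintains the invariant PR_i = 2^i * (x - Q_i^2), with the
    initialization PR0 = x - 1 and Q0 = 1 noted in the text."""
    q = Fraction(1)
    pr = x - 1
    half = Fraction(1, 2)              # 2^-(i+1), starting at i = 0
    for _ in range(iters):
        pr *= 2
        if pr >= Fraction(1, 2):
            digit = 1
        elif pr <= Fraction(-1, 2):
            digit = -1
        else:
            digit = 0
        # PR_{i+1} = 2 PR_i - digit * (2 Q_i + digit * 2^-(i+1))
        pr -= digit * (2 * q + digit * half)
        q += digit * half
        half /= 2
    return q

print(srt_sqrt(Fraction(25, 16), 10))   # 5/4, i.e. sqrt(1.1001_2)
```

For X = 1.1001₂ the digits come out as +1, -1, 0, 0, ..., matching the qi+1 column of Figure A.5.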


Radix-2 On-The-Fly Conversion

The following is an example of how the quotient is formed for division and square root operations.

Specifically it uses the SRT radix-2 division example from above. The key here is to notice that only the

bit position specified in the third column is new for both Q and QM. All other bits are simply the result of

a select statement from the old values. This is illustrated in iteration three, where the QSLC output is

negative one, and Q selects the previous value from QM.

The square root operation is more complicated then division, as its divisor “F(I)” is variable. There are

two separate versions of F(i), one for both positive and negative products for each iteration. They can,

however, simply be formed by using the final three columns.

Iteration qi 2^-i Qi QMi

0 N/A 1.00000 0.XXXXX 0.XXXXX

1 1 0.10000 0.1XXXX 0.0XXXX

2 1 0.01000 0.11XXX 0.10XXX

3 -1 0.00100 0.101XX 0.100XX

4 0 0.00010 0.1010X 0.1001X

5 0 0.00001 0.10100 0.10011

Figure A.6 On-The-Fly Conversion Example
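The select-and-append behaviour of Figure A.6 can be sketched as follows (a Python model of the Q/QM register pair; the hardware realizes the three cases as multiplexers, so no carry ever propagates):

```python
def on_the_fly(digits):
    """On-the-fly conversion of signed digits {-1, 0, +1} into two
    conventional bit lists: Q (the quotient so far) and QM = Q - ulp.
    Each step only appends one bit and selects between the registers."""
    q, qm = [], []
    for d in digits:
        if d == 1:
            q, qm = q + [1], q + [0]    # both built from the old Q
        elif d == 0:
            q, qm = q + [0], qm + [1]   # Q keeps Q, QM takes old QM
        else:                           # d == -1: Q selects from QM
            q, qm = qm + [1], qm + [0]
    return q, qm

# Digits from the SRT division example: q1..q5 = +1, +1, -1, 0, 0
q, qm = on_the_fly([1, 1, -1, 0, 0])
print(q)    # [1, 0, 1, 0, 0]  -> 0.10100
print(qm)   # [1, 0, 0, 1, 1]  -> 0.10011
```

The final Q and QM match the last row of Figure A.6, including the iteration-three switch where Q is rebuilt from QM.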


Appendix B: Counter Study

Introduction

A counter is a functional unit that produces a value equal to the sum of its input bits, where each input bit carries the same weight. For example, a standard binary counter given the input vector "10011010" may produce the output "0100", indicating that four bits are high. While many counters produce the same output given identical inputs, their implementations vary greatly. This section delves into the various styles of counters and examines what advantages they offer, in terms of both performance and area, for the multiplication and division units.
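As a toy behavioural model of the definition above (Python, my own illustration, not any of the hardware designs studied below):

```python
def counter(bits: str) -> str:
    """Behavioural model of an n-input counter: the output is the number
    of high input bits, expressed in just enough binary digits to hold n."""
    ones = bits.count("1")
    width = max(1, len(bits).bit_length())  # e.g. 8 inputs -> 4 output bits
    return format(ones, f"0{width}b")
```

For the example in the text, counter("10011010") returns "0100", i.e. four bits high.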

Timing Analysis

Counter Pre-Layout Data Arrival Times (ns)

Module AMI 0.5µ Slow AMI 0.5µ Typical AMI 0.5µ Fast

3 to 2 Counters (Full Adders)

ADK Fadd1 Std. Cell 2.17 1.23 0.78

Synthesized 1.25 0.71 0.45

4 to 2 Counters

Two ADK Fadd1 Std. Cells 5.26 2.95 1.86

Synthesized (Adders Preserved) 2.96 1.69 1.10

Custom ADK Design [11] 4.95 2.84 1.80

Synthesized (Adders Dissolved) 2.97 1.70 1.04

27 to 2 Counters

Samsung ADK Fadd1 Design [15] 21.06 11.68 7.34

Synthesized (Adders Preserved) 11.55 6.63 4.34

Synthesized (Adders Dissolved) 11.31 6.64 4.29

Figure B.1 Std. Cell v. Synthesized Counter Timing Analysis

Several different types of counters were studied (Figure B.1) to determine the best architecture for a multiplier design. In particular, this relates to the second stage of multiplication: fast multi-operand addition. The two common methods for reducing the partial products into carry-save form rely on either seven levels of 3:2 counters or four levels of 4:2 counters. For both types of counter, the synthesized versions offer a greatly reduced delay. While the 3:2 counter may seem to have a data arrival time of less than 50% of the larger counter's, it is important to remember that these figures do not account for carry propagation. As a result, the synthesized adders may suffer, since the synthesis tool assumes all inputs are available at time zero, whereas the ADK std. cell versions may have a port optimized for late arrival.
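The 3:2 and 4:2 counters compared above can be modelled behaviourally in Python (an arithmetic sketch only; the function names are mine). A 3:2 counter applied bitwise across whole words is a carry-save adder, and a 4:2 counter can be built from two chained 3:2 counters, as with the two ADK Fadd1 cells in Figure B.1:

```python
def csa(a, b, c):
    """3:2 counter (full adder) applied bitwise: three operands in,
    one sum word and one carry word out, with no carry propagation."""
    s = a ^ b ^ c                             # per-bit sum
    cy = ((a & b) | (a & c) | (b & c)) << 1   # per-bit majority, weight x2
    return s, cy

def counter_4_2(a, b, c, d):
    """4:2 counter formed from two chained 3:2 counters."""
    s1, c1 = csa(a, b, c)
    return csa(s1, c1, d)

# The carry-save pair always sums to the true total:
s, cy = csa(5, 3, 6)              # s + cy == 14
s2, c2 = counter_4_2(5, 3, 6, 9)  # s2 + c2 == 23
```

The invariant that sum word plus carry word equals the true total is what lets a multiplier defer the single slow carry-propagate addition to the very end of the partial-product reduction.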

Also of particular interest to the multiplier is a comparison of the delays of 4:2 and 27:2 counters. The Samsung design [15] relies on seven levels of full adder cells to produce a 27:2 counter. An alternate design [17] utilizes four levels of 4:2 counters, which would produce a delay of approximately 4.16ns


(four levels of synthesized 4:2 counters with the cells dissolved). This second approach offers a

substantial decrease in delay in comparison to the required 7.34ns of the 27:2 ADK std. cell counter.

However, it is important to remember that the decreased delay may come at the cost of a significant area

increase. Also, providing a very fast CSA for the multiplier may not yield any performance increase, as

there may be other slower pipeline stages.

Area Analysis

Counter Area by Gate Count

Module AMI 0.5µ Slow AMI 0.5µ Typical AMI 0.5µ Fast

3 to 2 Counters (Full Adders)

ADK Fadd1 Std. Cell 5 5 5

Synthesized 7 7 8

4 to 2 Counters

Two ADK Fadd1 Std. Cells 10 9 9

Synthesized (Adders Preserved) 14 14 15

Custom ADK Design [11] 16 16 16

Synthesized (Adders Dissolved) 14 14 18

27 to 2 Counters

Samsung ADK Fadd1 Design [15] 133 114 114

Synthesized (Adders Preserved) 170 170 190

Synthesized (Adders Dissolved) 199 197 246

Figure B.2 Std. Cell v. Synthesized Counter Area Analysis

In contrast to the timing analysis (Figure B.1), the gate counts for the synthesized counters (Figure B.2) are vastly increased. The 4:2 counter with all of its internal full adder cells dissolved uses 3.6 times the number of gates of the ADK std. cell full adder. While an extreme example, it does illustrate the trend of the synthesized adders requiring a larger area. It is important to remember that, to achieve a satisfactory delay on the AMI 0.5 micron process, all adders were optimized for delay, and synthesizing with optimization for delay increases the required area.

Conclusion

The case for an HDL counter design fully instantiated from ADK standard cells is a low gate count and probably similar pre-layout and post-layout timing results. In comparison, the synthesized options produce a vastly faster pre-layout result but with an increased gate count. The selection of which counter is appropriate should then be based on how many counters are used and on the importance of the data arrival times offered by each. For the multiplier, which uses a vast number of counters, area is the overriding factor. In contrast, the combined division square root unit uses significantly fewer counters, and delay is more important.


Appendix C: HDL Verification

Introduction

This section presents the outputs from the HDL simulator for the three main operations. For multiplication and division, it was possible to model their complete functionality in a Verilog test bench. In the case of square root, no simple arithmetic HDL constructs were available, and the produced results were verified by hand. For all three operations, a separate test bench was constructed, and 1000 random normalized operands were generated. While the majority of the HDL written has an accompanying test bench, the units tested here utilize all of the other sub-components, and as a result provide a means of testing all designed modules.

Multiplication

# Griffith University Floating-Point Multiplier Test Bench.

# Simulation File: [./hdl/testbench/tb_fpu_multply_vec.v]

#

# Simulated and circuit responses being logged

# Log File: [./hdl/testbench/output/tb_fpu_multply_vec.out]

#

# Initializing test vectors...

#

# Performing 1000 verification tests using random normalized vectors.

#

# Test bench complete

#

# Details:

# Tests completed: 1000

# Calculation errors: 0

# Fraction error: 0

# Exponent error: 0

# Sign error: 0

#

# Processing Details:.

# Time required: 6004000

# Cycles required: 6004

Figure C.1 FPU Multiplication Test Bench

When testing the functionality of the designed multiplier, it was possible to model the calculations for all three fields (exponent, fraction, and sign) in a test bench (Figure C.1). Essentially, one operation was run at a time (similar conditions to the LTH FPU), and when that operation completed, the produced result was checked against a pre-calculated one. The inputs were then reset to random values and remained constant until the next test had completed.
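The field-by-field check can be illustrated in Python (an illustrative sketch that uses the host machine's IEEE-754 doubles as the reference model; this is not the Verilog test bench itself, and the function name is mine):

```python
import struct

def fp754_fields(x: float):
    """Split a double into its IEEE-754 sign, biased exponent, and
    52-bit fraction fields, so each can be compared independently."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    return sign, exponent, fraction

# A reference check for one multiplication, one field at a time:
expected = fp754_fields(1.5 * -2.0)
produced = fp754_fields(-3.0)      # expected == produced
```

Comparing the three fields separately, as the test bench log does, makes it immediately clear whether a mismatch lies in the fraction datapath, the exponent logic, or the sign logic.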

Division

# Griffith University Floating-Point Combined DIVSRT Test Bench.

# Performs only division tests.

# Simulation File: [./hdl/testbench/tb_fpu_divsqrt_vec.v]

#

# Simulated and circuit responses being logged

# Log File: [./hdl/testbench/output/tb_fpu_divsqrt_vec.out]

#


# Initializing test vectors...

#

# Performing 1000 verification tests using random normalized vectors.

#

# Test bench complete

#

# Details:

# Tests completed: 1000

# Calculation errors (after corrections): 0

# Fraction errors before correction: 119

# ULP Corrections required: 119

# Exponent error: 0

# Sign error: 0

#

# Processing Details:.

# Time required: 31004000

# Cycles required: 31004

Figure C.2 FPU Division Test Bench

Similar to multiplication, it was possible to model all aspects of a division operation in a Verilog test bench (Figure C.2). In SRT calculations, however, the quotient can be overestimated and then corrected at a later stage by a negative QSLC result. This adjustment is not guaranteed to happen directly after an overestimation, so the quotient can be left one ULP too high. Accounting for this, the division test bench first compared the original result produced by the circuit with a pre-calculated one. When the original result did not match, the pre-calculated result was then compared to the original result minus one ULP. From 1000 normalized operands, allowing for quotient correction, no problems were observed.
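The ULP-tolerant comparison can be sketched as follows (hypothetical Python mirroring the test bench's check; the fraction fields are treated as plain integers and the function name is mine):

```python
def quotient_matches(circuit_frac: int, expected_frac: int) -> bool:
    """Accept the circuit's fraction if it equals the reference, or if it
    sits exactly one ULP above it (an uncorrected SRT overestimate)."""
    return circuit_frac == expected_frac or circuit_frac - 1 == expected_frac
```

A result one ULP below the reference, or more than one ULP away in either direction, is still flagged as a genuine calculation error.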

Square Root

# Griffith University Floating-Point Combined DIVSRT Test Bench.

# Performs only square root tests.

# Simulation File: [./hdl/testbench/tb_fpu_sqrt_simple_vec.v]

#

# Simulated and circuit responses being logged

# Log File: [./hdl/testbench/output/tb_fpu_sqrt_simple_vec.out]

#

# Initializing test vectors...

#

# Performing 1000 verification tests using random normalized vectors.

#

# Test bench complete

#

# Processing Details:.

# Time required: 30004000

# Cycles required: 30004

Figure C.3 FPU SQRT Test Bench

Unlike the division and multiplication operations, the square root function could not be modeled in Verilog. By modifying the division test bench to simply generate operands and record results (Figure C.3), it was possible to create a list of operations whose results could be verified at a later stage (Figure C.4).


When verifying the recorded results, the logged roots generated by VSIM were compared to those produced using the Microsoft Windows Power Toy Calculator. Given that the square root operation also suffers from an error similar to division's, where the result may be overestimated by one ULP, the results were considered to match if either the simulated root or the simulated root minus one ULP equaled the value produced by the calculator. Further, as the division and square root functions share many common blocks, by verifying the division operation with a large set of test vectors, the square root function is also verified.

Square Root Test: 0 (First test with odd exponent)

Operand A 1.1001010111101000000100010010000101010011010100100100 Odd Exponent

Calculator 1.11000111111000001001100001000001001110010101111111001

Circuit 1.11000111111000001001100001000001001110010101111111010 Q = Q - ULP

Correction 1.11000111111000001001100001000001001110010101111111001

Square Root Test: 3 (First test with even exponent)

Operand A 1.0000011010011111001001000111111011001101101110001111 Even Exponent

Calculator 1.00000011010010100010100100000110011101000101100001000

Circuit 1.00000011010010100010100100000110011101000101100001000

Square Root Test: 612 (Selected at random)

Operand A 1.0101011111101110010100010111010110011011001100101110 Even Exponent

Calculator 1.00101000101110011101101100001010000100101011000100100

Circuit 1.00101000101110011101101100001010000100101011000100100

Square Root Test: 726 (Selected at random)

Operand A 1.0000010100001000010011100000010101001101000011000000 Even Exponent

Calculator 1.00000010100000010000010001110110111110001001011110010

Circuit 1.00000010100000010000010001110110111110001001011110010

Square Root Test: 996 (Last test with odd exponents)

Operand A 1.0101101110001010110000000001000011010101001100000010 Odd Exponent

Calculator 1.10100101110101001100010011001101010000110011101001011

Circuit 1.10100101110101001100010011001101010000110011101001011

Test cases are from the log file produced and referenced in the square root test bench (Figure C.3)

Figure C.4 FPU Square Root Result Verification


Appendix D: Code Listing

Introduction

A CD/DVD accompanies this dissertation. This section outlines all the contents of that disc. In particular

this includes HDL, area and timing reports, and produced netlists. Also included is some of the early work

performed on the LTH FPU, which includes the separation into pipeline stages and synthesis reports.

File Locations and Naming Conventions

File Locations

Location Description

./hdl Base directory for created modules

./hdl/scripts Contains all Perl and Verilog scripts used to generate other modules (e.g. the CSA tree and QSLC ROM)

./hdl/testbench Contains test benches for created modules

./hdl/testbench/output Stores the log files produced by a limited number of the main test benches.

./lth Has a similar directory structure to that of “./hdl”. It also contains a

“synth” directory, similar to that described below.

./synth Base directory for synthesis results

./synth/area Area reports (e.g. gate count)

./synth/delay Timing Reports

./synth/netlist Produced netlist, either from synthesized or flattened modules.

./synth/script Contains all synthesis settings and Linux Bash shell scripts to run them in

terminal mode.

Figure D.1 File Location Map

File Naming Conventions

File Pattern Description

*.ami05[fast|slow|typ].*

Result or output of synthesis, where the name of the module that produced

this file precedes the process variation.

*.area.rpt Area report

*.delay.rpt Pre-layout delay report

*.out Log file produced from VSIM simulation

*_syn.* A module designed with, or produced from a design utilizing, Verilog primitives. There should also be a similarly named file without "_syn" that is based on ADK cells.

tb_*.v Verilog test bench

Figure D.2 File Naming Conventions


Highlights

File Description

./hdl/fpu_divsqrt.v Radix-4 combined division and square root unit

./hdl/fpu_multply.v Multiplication unit

./synth/delay/fpu_divsqrt.ami05fast.rpt Delay report for the division square root unit on an AMI

0.5 fast process.

./synth/delay/fpu_multply.ami05fast.rpt Delay report for the multiplication unit on an AMI 0.5 fast

process.

./hdl/scripts/csatree.pl Generates connectivity for the CSA tree, instantiates 27:2

to 4:2 counters.

./hdl/testbench/tb_fpu_multply_vec.v FPU Multiplication test bench

./hdl/testbench/tb_fpu_divsqrt_vec.v FPU Division test bench

Figure D.3 HDL Synthesis and Simulation Highlights