92050235 reducing computation time for short bit width twos compliment multiplier

8/11/2019 92050235 Reducing Computation Time for Short Bit Width Twos Compliment Multiplier


REDUCING THE COMPUTATION TIME IN (SHORT BIT-

WIDTH) TWO'S COMPLEMENT MULTIPLIERS

ABSTRACT:

Two's complement multipliers are important for a wide range of applications. In this paper, we present a technique to reduce by one row the maximum height of the partial product array generated by a radix-4 Modified Booth Encoded multiplier, without any increase in the delay of the partial product generation stage. This reduction may allow for a faster compression of the partial product array and regular layouts. This technique is of particular interest in all multiplier designs, but especially in short bit-width two's complement multipliers for high-performance embedded cores. The proposed method is general and can be extended to higher radix encodings, as well as to any size square and m times n rectangular multipliers. We evaluated the proposed approach by comparison with some other possible solutions; the results based on a rough theoretical analysis and on logic synthesis showed its efficiency in terms of both area and delay.



Introduction About Verilog

Overview:

Hardware description languages such as Verilog differ from software programming languages because they include ways of describing the propagation of time and signal dependencies (sensitivity). There are two assignment operators, a blocking assignment (=) and a non-blocking assignment (<=). The non-blocking assignment allows designers to describe a state-machine update without needing to declare and use temporary storage variables (in any general programming language we need to define some temporary storage spaces for the operands to be operated on subsequently; those are temporary storage variables). Since these concepts are part of Verilog's language semantics, designers could quickly write descriptions of large circuits in a relatively compact and concise form. At the time of Verilog's introduction (1984), Verilog represented a tremendous productivity improvement for circuit designers who were already using graphical schematic capture software and specially-written software programs to document and simulate electronic circuits.

The designers of Verilog wanted a language with syntax similar to the C programming language, which was already widely used in engineering software development. Verilog is case-sensitive, has a basic preprocessor (though less sophisticated than that of ANSI C/C++), equivalent control flow keywords (if/else, for, while, case, etc.), and compatible operator precedence. Syntactic differences include variable declaration (Verilog requires bit-widths on net/reg types), demarcation of procedural blocks (begin/end instead of curly braces {}), and many other minor differences.

A Verilog design consists of a hierarchy of modules. Modules encapsulate design hierarchy, and communicate with other modules through a set of declared input, output, and bidirectional ports. Internally, a module can contain any combination of the following: net/variable declarations (wire, reg, integer, etc.), concurrent and sequential statement blocks, and instances of other modules (sub-hierarchies). Sequential statements are placed inside a begin/end block and executed in sequential order within the block. But the blocks themselves are executed concurrently, qualifying Verilog as a dataflow language.

Verilog's concept of 'wire' consists of both signal values (4-state: "1, 0, floating, undefined") and strengths (strong, weak, etc.). This system allows abstract modeling of shared signal-lines, where multiple sources drive a common net. When a wire has multiple drivers, the wire's (readable) value is resolved by a function of the source drivers and their strengths.

A subset of statements in the Verilog language is synthesizable. Verilog modules that conform to a synthesizable coding-style, known as RTL (register transfer level), can be physically realized by synthesis software. Synthesis software algorithmically transforms the (abstract) Verilog source into a net-list, a logically-equivalent description consisting only of elementary logic primitives (AND, OR, NOT, flip-flops, etc.) that are available in a specific FPGA or



operations using awkward bit-level manipulations (for example, the carry-out bit of a simple 8-bit addition required an explicit description of the Boolean algebra to determine its correct value). The same function under Verilog-2001 can be more succinctly described by one of the built-in operators: +, -, /, *, >>>. A generate/end-generate construct (similar to VHDL



Multiplication algorithm:

A multiplication algorithm is an algorithm (or method) to multiply two numbers. Depending on the size of the numbers, different algorithms are in use. Efficient multiplication algorithms have existed since the advent of the decimal system.

Types of Multiplication Algorithms

1. Booth's Algorithm
2. Modified Booth's Algorithm
3. Wallace Tree Algorithm

Booth's Algorithm:

Booth's algorithm is a multiplication algorithm which works for two's complement numbers. It is similar to our paper-pencil method, except that it looks at the current as well as the previous bit in order to decide what to do. Here are the steps:

If the current multiplier digit is 1 and the earlier digit is 0 (i.e. a 10 pair), shift and sign extend the multiplicand, and subtract it from the previous result.

If it is a 01 pair, add to the previous result.

If it is a 00 pair, or 11 pair, do nothing.


In Booth's algorithm, if the multiplicand and multiplier are n-bit two's complement numbers, the result is considered as a 2n-bit two's complement value. The overflow bit (outside 2n bits) is ignored.

The reason that the above computation works is because

0110 x 0010 = 0110 x (-0010 + 0100) = -01100 + 011000 = 1100.

Example 2:

        0010
  x     0110
  ----------
    00000000
  -    0010
  ----------
    11111100
  +  0010
  ----------
(1) 00001100

In this we have computed

0010 x 0110 = 0010 x (-0010 + 1000) = -00100 + 0010000 = 1100

Example 3, (-5) x (-3):

        1011    -> -5 (4-bit two's complement)
  x     1101    -> -3
  ----------
    00000000
  - 11111011    (notice the sign extension of multiplicand)
  ----------
    00000101
  + 1111011
  ----------
    11111011
  - 111011
  ----------
    00001111    -> +15

A long example:

             10011100    -> -100
  x          01100011    -> 99
  -------------------
    00000000 00000000
  - 11111111 10011100
  -------------------
    00000000 01100100
  + 11111110 011100
  -------------------
    11111110 11010100
  - 11110011 100
  -------------------
    00001011 01010100
  + 11001110 0
  -------------------
    11011001 01010100    -> -9900

Note that the multiplicand and multiplier are 8-bit two's complement numbers, but the result is understood as a 16-bit two's complement number. Be careful about the proper alignment of the columns. A 10 pair causes a subtraction, aligned with the 1; a 01 pair causes an addition, aligned with the 0. In both cases, it aligns with the one on the left. The algorithm starts with the 0-th bit. We should assume that there is a (-1)-th bit, having value 0.
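The pair rules and the worked examples above can be checked with a short sketch. It is only an illustration under stated assumptions: the function name `booth_multiply` and its parameters are hypothetical, operands are ordinary Python integers assumed to fit in n-bit two's complement, and the 2n-bit wrap-around is modeled with a mask rather than with hardware registers.

```python
def booth_multiply(multiplicand, multiplier, n):
    """Multiply two n-bit two's complement values using Booth's pair rules.

    Hypothetical helper for illustration; returns the signed 2n-bit result.
    """
    result = 0
    previous_bit = 0  # the assumed (-1)-th bit, which has value 0
    for i in range(n):
        current_bit = (multiplier >> i) & 1
        if (current_bit, previous_bit) == (1, 0):
            # 10 pair: subtract the multiplicand aligned with bit i
            result -= multiplicand << i
        elif (current_bit, previous_bit) == (0, 1):
            # 01 pair: add the multiplicand aligned with bit i
            result += multiplicand << i
        # 00 and 11 pairs: do nothing
        previous_bit = current_bit

    # Keep only 2n bits (the overflow bit is ignored) and read them as signed.
    result &= (1 << (2 * n)) - 1
    if result >= 1 << (2 * n - 1):
        result -= 1 << (2 * n)
    return result


print(booth_multiply(-5, -3, 4))     # Example 3
print(booth_multiply(-100, 99, 8))   # the long example
```

Python's arithmetic right shift on negative integers supplies the sign extension that the hand-worked examples write out explicitly.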

Booth Algorithm Advantages and Disadvantages

Depends on the architecture

Potential advantage: might reduce the # of 1's in multiplier

In the multipliers that we have seen so far:

Doesn't save in speed (still have to wait for the critical path, e.g., the shift-add delay in sequential multiplier)

Increases area: recoding circuitry AND subtraction

Modified Booth:

Booth 2 modified to produce at most n/2+1 partial products. Algorithm: (for unsigned numbers)

1. Pad the



6. Sum Partial Products

Interpretation of the Booth recoding table:

i+1   i   i-1   add     Explanation
 0    0    0     0*M    No string of 1's in sight
 0    0    1     1*M    End of a string of 1's
 0    1    0     1*M    Isolated 1
 0    1    1     2*M    End of a string of 1's
 1    0    0    -2*M    Beginning of a string of 1's
 1    0    1    -1*M    End one string, begin new one
 1    1    0    -1*M    Beginning of a string of 1's
 1    1    1     0*M    Continuation of string of 1's

Grouping multiplier bits into pairs

Orthogonal idea to the Booth recoding

Reduces the number of partial products to half

If Booth recoding is not used, have to be able to multiply by 3 (hard: shift+add)

Applying the grouping idea to Booth: Modified Booth Recoding (Encoding)

We already got rid of sequences of 1's

no mult by 3

Just negate, shift once or twice

Uses high-radix to reduce number of intermediate addition operands

Can go higher: radix-8, radix-16

Radix-8 should implement *3, *-3, *4, *-4

Recoding and partial product generation becomes more complex

Can automatically take care of signed multiplication
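As a rough check of the recoding table above, the following sketch recodes an unsigned multiplier into radix-4 digits and confirms that the digits reconstruct the original value. The names `BOOTH2_TABLE` and `booth2_recode_unsigned` are hypothetical. The padding (one zero below the LSB, two zeros above the MSB, n even) is an assumption made here so that the number of digits comes out as n/2+1; the source's own padding steps are truncated.

```python
# Keys are (bit i+1, bit i, bit i-1); values are the multiple of M to add,
# copied from the recoding table.
BOOTH2_TABLE = {
    (0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 2,
    (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0,
}


def booth2_recode_unsigned(y, n):
    """Recode an unsigned n-bit multiplier (n even) into n/2 + 1 radix-4 digits.

    Hypothetical helper; the padding scheme is an assumption (see lead-in).
    """
    padded = y << 1  # bit 0 of `padded` is the zero placed below the LSB
    digits = []
    for i in range(n // 2 + 1):
        window = (padded >> (2 * i)) & 0b111  # overlapping group of 3 bits
        triplet = ((window >> 2) & 1, (window >> 1) & 1, window & 1)
        digits.append(BOOTH2_TABLE[triplet])
    return digits


digits = booth2_recode_unsigned(0b0110, 4)
print(digits)                                        # least significant digit first
print(sum(d * 4 ** i for i, d in enumerate(digits)))  # reconstructs 6
```

Each digit selects one partial product, so an n-bit unsigned multiplier needs at most n/2+1 of them, and no digit requires a multiplication by 3.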

Wallace tree:

A Wallace tree is an efficient hardware implementation of a digital circuit that multiplies two integers, devised by the Australian computer scientist Chris Wallace in 1964.[1]

The Wallace tree has three steps:

1. Multiply (that is, AND) each bit of one of the arguments by each bit of the other, yielding n^2 results. Depending on the position of the multiplied bits, the wires carry different weights; for example, the wire carrying the result of a2b3 has weight 32 (see explanation of weights below).

2. Reduce the number of partial products to two by layers of full and half adders.

3. Group the wires in two numbers, and add them with a conventional adder.[2]

The second phase works as follows. As long as there are three or more wires with the same weight, add a following layer:

Take any three wires with the same weights and input them into a full adder. The result will be an output wire of the same weight and an output wire with a higher weight for each three input wires.

If there are two wires of the same weight left, input them into a half adder.

If there is just one wire left, connect it to the next layer.

The benefit of the Wallace tree is that there are only O(log n) reduction layers, and each layer has O(1) propagation delay. As making the partial products is O(1) and the final addition is O(log n), the multiplication is only O(log n), not much slower than addition (however, much more expensive in the gate count). Naively adding partial products with regular adders would require O(log^2 n) time. From a complexity theoretic perspective, the Wallace tree algorithm puts multiplication in the class NC^1.

These computations only consider gate delays and don't deal with wire delays, which can also be very substantial. The Wallace tree can also be represented by a tree of 3/2 or 4/2 adders. It is sometimes combined with Booth encoding.

Weights explained

The weight of a wire is the radix (to base 2) of the digit that the wire carries. In general, $a_n b_m$ have indexes of n and m; and since $2^n \cdot 2^m = 2^{n+m}$, the weight of $a_n b_m$ is $2^{n+m}$.

Example

n = 4, multiplying a3a2a1a0 by b3b2b1b0:

1. First we multiply every bit by every bit:

o weight 1 - a0b0

o weight 2 - a0b1, a1b0

o weight 4 - a0b2, a1b1, a2b0

o weight 8 - a0b3, a1b2, a2b1, a3b0

o weight 16 - a1b3, a2b2, a3b1

o weight 32 - a2b3, a3b2

o weight 64 - a3b3

2. Reduction layer 1:

o Pass the only weight-1 wire through, output: 1 weight-1 wire

o Add a half adder for weight 2, outputs: 1 weight-2 wire, 1 weight-4 wire

o Add a full adder for weight 4, outputs: 1 weight-4 wire, 1 weight-8 wire

o Add a full adder for weight 8, and pass the remaining wire through, outputs: 2 weight-8 wires, 1 weight-16 wire

o Add a full adder for weight 16, outputs: 1 weight-16 wire, 1 weight-32 wire

o Add a half adder for weight 32, outputs: 1 weight-32 wire, 1 weight-64 wire

o Pass the only weight-64 wire through, output: 1 weight-64 wire

3. Wires at the output of reduction layer 1:

o weight 1 - 1

o weight 2 - 1

o weight 4 - 2

o weight 8 - 3

o weight 16 - 2

o weight 32 - 2

o weight 64 - 2

4. Reduction layer 2:

o Add a full adder for weight 8, and half adders for weights 4, 16, 32, 64

5. Outputs:

o weight 1 - 1

o weight 2 - 1

o weight 4 - 1

o weight 8 - 2

o weight 16 - 2

o weight 32 - 2

o weight 64 - 2

o weight 128 - 1

6. Group the wires into a pair of integers and use an adder to add them.
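The wire counts in this example can be reproduced by tracking only the number of wires per weight. The sketch below is a simplified model under stated assumptions: it counts wires rather than simulating logic values, and the names `initial_heights`, `wallace_layer`, and `reduce_to_two_rows` are hypothetical.

```python
def initial_heights(n):
    """Number of AND-term wires per weight 2^k for an n x n unsigned multiply."""
    return [min(k, 2 * n - 2 - k) + 1 for k in range(2 * n - 1)]


def wallace_layer(heights):
    """Apply one reduction layer; heights[k] is the wire count at weight 2^k."""
    out = [0] * (len(heights) + 1)
    for col, h in enumerate(heights):
        full, rest = divmod(h, 3)           # full adders, leftover wires
        # Each full adder keeps one wire here; a half adder (rest == 2) or a
        # passed-through wire (rest == 1) also leaves exactly one wire here.
        out[col] += full + (1 if rest else 0)
        # Carries go to the next weight: one per full adder, one per half adder.
        out[col + 1] += full + (1 if rest == 2 else 0)
    while out and out[-1] == 0:             # drop unused top column
        out.pop()
    return out


def reduce_to_two_rows(heights):
    """Repeat layers while any weight still has three or more wires."""
    layers = []
    while max(heights) > 2:
        heights = wallace_layer(heights)
        layers.append(heights)
    return layers


for layer in reduce_to_two_rows(initial_heights(4)):
    print(layer)
```

For n = 4 this prints two layers whose per-weight counts match steps 3 and 5 of the example, after which every weight holds at most two wires and a conventional adder finishes the job.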

Two's complement:

The two's complement of a binary number is defined as the value obtained by subtracting the number from a large power of two (specifically, from $2^N$ for an N-bit two's complement). The two's complement of the number then behaves like the negative of the original number in most arithmetic, and it can coexist with positive numbers in a natural way.

A two's-complement system, or two's-complement arithmetic, is a system in which negative numbers are represented by the two's complement of the absolute value;[1] this system is the most common method of representing signed integers on computers.[2] In such a system, a number is negated (converted from positive to negative or vice versa) by computing its ones' complement and adding one. An N-bit two's-complement numeral system can represent every integer in the range $-2^{N-1}$ to $2^{N-1}-1$, while ones' complement can only represent integers in the range $-(2^{N-1}-1)$ to $2^{N-1}-1$.

The two's-complement system has the advantage of not requiring that the addition and subtraction circuitry examine the signs of the operands to determine whether to add or subtract. This property makes the system both simpler to implement and capable of easily handling higher precision arithmetic. Also, zero has only a single representation, obviating the subtleties associated with negative zero, which exists in ones'-complement systems.
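A small sketch may make the definition concrete. It assumes bit patterns are held as non-negative Python integers; the helper names `to_twos_complement`, `from_twos_complement`, and `negate` are hypothetical.

```python
def to_twos_complement(value, n):
    """Encode a signed integer as an n-bit two's complement bit pattern."""
    if not -(1 << (n - 1)) <= value <= (1 << (n - 1)) - 1:
        raise ValueError("value does not fit in n bits")
    return value & ((1 << n) - 1)


def from_twos_complement(pattern, n):
    """Decode an n-bit pattern; a set top bit means the value is negative."""
    return pattern - (1 << n) if pattern & (1 << (n - 1)) else pattern


def negate(pattern, n):
    """Negate by taking the ones' complement and adding one (modulo 2^n)."""
    mask = (1 << n) - 1
    return ((pattern ^ mask) + 1) & mask


print(format(to_twos_complement(-5, 4), "04b"))  # pattern for -5 in 4 bits
print(format(negate(0b0101, 4), "04b"))          # invert-and-add-one on +5
print(format((1 << 4) - 5, "04b"))               # subtracting 5 from 2^4
```

All three lines print the same pattern, which is the equivalence between "subtract from $2^N$" and "ones' complement plus one" that the text describes.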

REDUCING THE COMPUTATION TIME IN (SHORT BIT-

WIDTH) TWO'S COMPLEMENT MULTIPLIERS

1. INTRODUCTION:

In multimedia, 3D graphics and signal processing applications, performance, in most cases, strongly depends on the effectiveness of the hardware used for computing multiplications, since multiplication is, besides addition, massively used in these environments. The high interest in this application field is witnessed by the large number of algorithms and implementations of the multiplication operation which have been proposed in the literature (for a representative set of references, see [1]). More specifically, short bit-width (8-16 bits) two's complement multipliers with single-cycle throughput and latency have emerged and become very important building blocks for high-performance embedded processors and DSP execution cores [2], [3]. In this case, the multiplier must be highly optimized to fit within the required cycle time and power budgets. Another relevant application for short bit-width multipliers is the design of SIMD units supporting different data formats [3], [4]. In this case, short bit-width multipliers often play the role of basic building blocks. Two's complement multipliers of moderate bit-width (less than 32 bits) are also being used massively in FPGAs. All of the above translates into a high interest and motivation on the part of the industry for the design of high-performance short or moderate bit-width two's complement multipliers.

The basic algorithm for multiplication is based on the well-known paper and pencil approach [1] and passes through three main phases: 1) partial product (PP) generation, 2) PP reduction, and 3) final (carry-propagated) addition. During PP generation, a set of rows is generated where each one is the result of the product of one bit of the multiplier by the multiplicand. For example, if we consider the multiplication $X \times Y$ with both X and Y on n bits and of the form $x_{n-1} \ldots x_0$ and $y_{n-1} \ldots y_0$, then the ith row is, in general, a proper left shifting of $y_i \times X$, i.e., either a string of all zeros when $y_i = 0$, or the multiplicand X itself when $y_i = 1$. In this case, the number of PP rows generated during the first phase is clearly n.

Modified Booth Encoding (MBE) is a technique that has been introduced to reduce the number of PP rows, still keeping the generation process of each row both simple and fast enough. One of the most commonly used schemes is radix-4 MBE, for a number of reasons, the most important being that it allows for the reduction of the size of the partial product array by almost half, and it is very simple to generate the multiples of the multiplicand. More specifically, the classic two's complement $n \times n$ bit multiplier using the radix-4 MBE scheme generates a PP array with a maximum height of $\lceil n/2 \rceil + 1$ rows, each row before the last one being one of the following possible values: all zeros, $\pm X$, $\pm 2X$. The last row, which is due to the negative encoding, can be kept very simple by using specific techniques integrating two's complement and sign extension prevention [1].

The PP reduction is the process of adding all PP rows by using a compression tree [6], [7]. Since the knowledge of intermediate addition values is not important, the outcome of this phase is a result represented in redundant carry-save form, i.e., as two rows, which allows for much faster implementations. The final (carry-propagated) addition has the task of adding these two rows and of presenting the final result in a non-redundant form, i.e., as a single row.

In this work, we introduce an idea to overlap, to some extent, the PP generation and the PP reduction phases. Our aim is to produce a PP array with a maximum height of $\lceil n/2 \rceil$ rows that is then reduced by the compressor tree stage.

As we will see for the common case of values n which are a power of two, the above reduction can lead to an implementation where the delay of the compressor tree is reduced by one XOR2 gate while keeping a regular layout. Since we are focusing on small values of n and fast single-cycle units, this reduction might be important in cases where, for example, a high computation performance through the assembly of a large number of small processing units with limited computation capabilities is required, such as 8 x 8 or 16 x 16 multipliers [8].

A similar study aimed at the reduction of the maximum height to $\lceil n/2 \rceil$ but using a different approach has recently presented interesting results in [9] and previously, by the same authors, in [10]. Thus, in the following, we will evaluate and compare the proposed approach with the technique in [9]. Additional details of our approach, besides the main results presented here, can be found in [11].

The paper is organized as follows: in Section 2, the multiplication algorithm based on MBE is briefly reviewed and analyzed. In Section 3, we describe related works. In Section 4, we present our scheme to reduce the maximum height of the partial product array by one unit during the generation of the PP rows. Finally, in Section 5, we provide evaluations and comparisons.

2. MODIFIED BOOTH RECODED MULTIPLIERS:

In general, a radix-$B = 2^b$ MBE leads to a reduction of the number of rows to about $\lceil n/b \rceil$ while, on the other hand, it introduces the need to generate all the multiples of the multiplicand X, at least from $-B/2 \times X$ to $B/2 \times X$. As mentioned above, radix-4 MBE is of particular interest since, for radix-4, it is easy to create the multiples of the multiplicand $0$, $\pm X$, $\pm 2X$. In particular, $\pm 2X$ can be simply obtained by single left shifting of the corresponding terms $\pm X$. It is clear that the MBE can be extended to higher radices (see [12] among others), but the advantage of getting a higher reduction in the number of rows is paid for by the need to generate more multiples of X. In this paper, we focus our attention on radix-4 MBE, although the proposed method can be easily extended to any radix-B MBE [11].


From an operational point of view, it is well known that the radix-4 MBE scheme consists of scanning the multiplier operand with a three-bit window and a stride of two bits (radix-4). For each group of three bits ($y_{2i+1}$, $y_{2i}$, $y_{2i-1}$), only one partial product row is generated according to the encoding in Table 1. A possible implementation of the radix-4 MBE and of the corresponding partial product generation is shown in Fig. 1, which comes from a small adaptation of [10, Fig. 12b]. For each partial product row, Fig. 1a produces the one, two, and neg signals. These signals are then exploited by the logic in Fig. 1b, along with the appropriate bits of the multiplicand, in order to generate the whole partial product array. Other alternatives for the implementation of the recoding and partial product generation can be found in [13], [14], [15], among others.
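The window scan described above can be modeled in software as follows. This is a behavioral sketch, not the gate-level circuit of Fig. 1: the function names `mbe_signals` and `mbe_multiply` are hypothetical, n is assumed even, operands are signed Python integers assumed to fit in n bits, and the one, two, and neg values are derived from the standard radix-4 digit $-2y_{2i+1} + y_{2i} + y_{2i-1}$ (Table 1 itself is not reproduced here, and its neg value for the all-ones group may differ from the choice made below).

```python
def mbe_signals(y, n):
    """Return a (one, two, neg) tuple for each radix-4 group of multiplier y."""
    bits = [(y >> k) & 1 for k in range(n)]   # two's complement bits of y
    signals = []
    for i in range(n // 2):
        low = bits[2 * i - 1] if i > 0 else 0  # y_{-1} is always zero
        mid = bits[2 * i]
        high = bits[2 * i + 1]
        digit = -2 * high + mid + low          # value in {-2, -1, 0, 1, 2}
        # Assumption: neg is asserted only for strictly negative digits.
        signals.append((abs(digit) == 1, abs(digit) == 2, digit < 0))
    return signals


def mbe_multiply(x, y, n):
    """Sum the n/2 partial product rows selected by the MBE signals."""
    total = 0
    for i, (one, two, neg) in enumerate(mbe_signals(y, n)):
        magnitude = x if one else (2 * x if two else 0)  # 0, X, or 2X
        row = -magnitude if neg else magnitude
        total += row << (2 * i)                # row i has weight 4^i
    return total


print(mbe_signals(99, 8))
print(mbe_multiply(-100, 99, 8))
```

An 8-bit multiplier yields four signal groups, one per row. The model negates rows arithmetically; in hardware the negation is done by complementing the row and adding the neg bit, which is where the extra row discussed in the text comes from.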

As introduced previously, the use of radix-4 MBE allows for the (theoretical) reduction of the PP rows to $\lceil n/2 \rceil$, with the possibility for each row to host a multiple of $y_i \times X$, with $y_i \in \{0, \pm 1, \pm 2\}$. While it is straightforward to generate the positive terms 0, X, and 2X at least through a left shift of X, some attention is required to generate the terms -X and -2X which, as observed in Table 1, can arise from three configurations of the $y_{2i+1}$, $y_{2i}$, and $y_{2i-1}$ bits. To avoid computing negative encodings, i.e., -X and -2X, the two's complement of the multiplicand is generally used. From a mathematical point of view, the use of two's complement requires extension of the sign to the leftmost part of each partial product row, with the consequence of an extra area overhead. Thus, a number of strategies for preventing sign extension have been developed. For instance, the scheme in [1] relies on the observation that

[...] 1 - 2 + 4. The array resulting from the application of the sign extension prevention technique in [1] to the partial product array of an 8 x 8 MBE multiplier [5] is shown in Fig. 2.

The use of two's complement requires a neg signal (e.g., neg0, neg1, neg2, and neg3 in Fig. 2) to be added in the


When 4-to-2 compressors are used, which is a widely used option because of the high regularity of the resultant circuit layout for n power of two, the reduction of the extra row may require an additional delay of two XOR2 gates. By properly connecting partial product rows and using a Wallace reduction tree [7], the extra delay can be further reduced to one XOR2 [16], [17]. However, the reduction still requires additional hardware, roughly a row of n half adders. This issue is of special interest when n is a power of two, which is by far a very common case, and the multiplier's critical path has to fit within the clock period of a high performance processor. For instance, in the design presented in [2], for n = 16, the maximum column height of the partial product array is nine, with an equivalent delay for the reduction of six XOR2 gates [16], [17]. For a maximum height of the partial product array of 8, the delay of the reduction tree would be reduced by one XOR2 gate [16], [17]. Alternatively, with a maximum height of eight, it would be possible to use 4-to-2 adders, with a delay of the reduction tree of six XOR2 gates, but with a very regular layout.

3. RELATED WORK:

Some approaches have been proposed aiming to add the $\lceil n/2 \rceil + 1$ rows, possibly in the same time as the $\lceil n/2 \rceil$ rows. The solution presented in [14] is based on the use of different types of counters, that is, it operates at the level of the PP reduction phase. Kang and Gaudiot propose a different approach in [9] that manages to achieve the goal of eliminating the extra row before the PP reduction phase. This approach is based on computing the two's complement of the last partial product, thus eliminating the need for the last neg signal, in a logarithmic time complexity. A special tree structure (basically an incrementer implemented as a prefix tree [18]) is used in order to produce the two's complement (Fig. 3), by decoding the MBE signals through a 3-5 decoder (Fig. 4a). Finally, a row of 4-1 multiplexers with implicit zero output is used (Fig. 4b) to produce the last partial product row directly in two's complement, without the need for the neg signal. The goal is to produce the two's complement in parallel with the computation of the partial products of the other rows with maximum overlap. In such a case, it is expected to have no or a small time penalization in the critical path. The architecture in [9], [18] is a logarithmic version of the linear method presented in [19] and [20]. With respect to [19], [20], the approach in [9] is more general, and shows better adaptability to any word size. An example of the partial product array produced using the above method is depicted in Fig. 5.

In this work, we present a technique that also aims at producing only $\lceil n/2 \rceil$ rows, but by relying on a different approach than [9].

4. BASIC IDEA:

The case of n x n square multipliers is quite common, as is the case of n that is a power of two. Thus, we start by focusing our attention on square multipliers, and then present the extension to the general case of m x n rectangular multipliers.

4.1 Square Multipliers:

The proposed approach is general and, for the sake of clarity, will be explained through the practical case of 8 x 8 multiplications (as in the previous figures). As briefly outlined in the previous sections, the main goal of our approach is to produce a partial product array with a maximum height of $\lceil n/2 \rceil$ rows, without introducing any additional delay.


temporarily considered as being split into two sub rows, the first one containing the partial product bits (from right to left) from $pp_{00}$ to $\overline{pp_{80}}$ and the second one with two bits set at "one" in positions 9 and 8. Then, the bit neg3, related to the fourth partial product row, is moved to become a part of the second sub row. The key point of this "graphical" transformation is that the second sub row, containing also the bit neg3, can now be easily added to the first sub row, with a constant short carry propagation of three positions (further denoted as "3-bits addition"), a value which is easily shown to be general, i.e., independent of the length of the operands, for square multipliers. In fact, with reference to the notation of Fig. 6, we have that [...]

As introduced above, due to the particular value of the second operand, i.e., 0 1 1 0 neg3, in [11] we have observed that it requires a carry propagation only across the least-significant three positions, a fact that can also be seen by the implementation shown in Fig. 7.

It is worth observing that, in order not to have delay penalizations, it is necessary that the generation of the other rows is done in parallel with the generation of the first row cascaded by the computation of the qq bits in Fig. 6b. In order to achieve this, we must simplify and differentiate the generation of the first row with respect to the other rows. We observe that the Booth recoding for the first row is computed more easily than for the other rows, because the $y_{-1}$ bit used by the MBE is always equal to zero. In order to have a preliminary analysis which is possibly independent of technological details, we refer to the circuits in the following figures:

Fig. 1, slightly adapted from [10, Fig. 12], for the partial product generation using MBE;

Fig. 7, obtained through manual synthesis (aimed at modularity and area reduction without compromising the delay), for the addition of the last neg bit to the three most significant bits of the first row;

Fig. 8, obtained by simplifying Fig. 1 (since, in the first row, it is $y_{2i-1} = 0$), for the partial product generation of the first row only using MBE; and

Fig. 9, obtained through manual synthesis of a combination of the two parts of Fig. 8 and aimed at decreasing the delay of Fig. 8 with no or very small area increase, for the partial product generation of the first row only using MBE.

In particular, we observe that, by direct comparison of Figs. 1 and 8, the generation of the MBE signals for the first row is simpler, and theoretically allows for the saving of the delay of one NAND3 gate. In addition, the implementation in Fig. 9 has a delay that is smaller than the two parts of Fig. 8, although it could require a small amount of additional area.

As we see in the following, this issue hardly has any significant impact on the overall design, since this extra hardware is used only for the three most significant bits of the first row, and not for all the other bits of the array.

The high-level description of our idea is as follows:

1. Generation of the three most significant bit weights of the first row, plus addition of the last neg bit: possible implementations can use a replication of three times the circuit of Fig. 9 (each for the three most significant bits of the first row), cascaded by the circuit of Fig. 7 to add the neg signal;

2. Parallel generation of the other bits of the first row: possible implementations can use instances of the circuitry depicted in Fig. 8, for each bit of the first row, except for the three most significant;

3. Parallel generation of the bits of the other rows: possible implementations can use the circuitry of Fig. 1, replicated for each bit of the other rows.

All items 1 to 3 are independent, and therefore can be executed in parallel. Clearly if, as assumed and expected, item 1 is not the bottleneck (i.e., the critical path), then the implementation of the proposed idea has reached the goal of not introducing time penalties.
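The arithmetic behind this idea can be sketched as follows. The model below is a simplified integer-level illustration, assuming each partial product row is kept as a one's complement value plus a separate neg bit; the helper names (`booth_rows`, `fold_last_neg`, `array_value`, `max_height`) are hypothetical. It does not model the sign-extension encoding that lets the hardware touch only the three most significant bits of the first row.

```python
def booth_rows(x: int, y: int, n: int) -> tuple[list[int], list[int]]:
    """Model each partial product row as (row, neg) with row + neg == d_i * x.

    For a negative Booth digit the row holds the bitwise complement of
    |d_i| * x, and neg = 1 supplies the +1 completing the two's complement.
    """
    rows, negs = [], []
    prev = 0  # y_{-1} = 0
    for i in range(n // 2):
        low, high = (y >> (2 * i)) & 1, (y >> (2 * i + 1)) & 1
        digit = -2 * high + low + prev
        prev = high
        if digit < 0:
            rows.append(~(-digit * x))  # one's complement of |d_i| * x
            negs.append(1)
        else:
            rows.append(digit * x)
            negs.append(0)
    return rows, negs


def fold_last_neg(rows: list[int], negs: list[int]) -> tuple[list[int], list[int]]:
    """Add the last neg bit (weight 4**last) into the first row."""
    last = len(rows) - 1
    folded = list(rows)
    folded[0] = rows[0] + negs[last] * 4 ** last
    return folded, negs[:last]


def array_value(rows: list[int], negs: list[int]) -> int:
    """Sum all rows and neg bits with their radix-4 weights."""
    total = sum(r * 4 ** i for i, r in enumerate(rows))
    return total + sum(g * 4 ** i for i, g in enumerate(negs))


def max_height(rows: list[int], negs: list[int]) -> int:
    """neg_i shares a line with row i+1; only a neg bit belonging to the
    last row needs an extra line of its own."""
    return len(rows) + (1 if len(negs) == len(rows) else 0)
```

For any pair of n-bit operands, `array_value` gives the same product before and after `fold_last_neg`, while `max_height` drops from n/2 + 1 to n/2.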

Extension to Rectangular Multipliers:

A number of potential extensions to the proposed method exist, including rectangular multipliers, higher radix MBE, and multipliers with fused accumulation [11]. Here, we quickly focus on m × n rectangular multipliers. With no loss of generality, we assume m ≥ n, i.e., m = n + m' with m' ≥ 0, since it leads to a smaller number of rows; for simplicity, and also with no loss of generality, in the following, we assume that both m and n are even. Now, we have seen in Fig. 6a that, for m' = 0, the last neg bit, i.e., neg_{⌈n/2⌉-1}, belongs to the same column as the first row partial product . We observe that the first partial product row has bits up to ; therefore, in order to also include in the first row the contribution of , due to the particular nature of operands it is necessary to perform a carry propagation (i.e., bit addition) in the sum

Thus, for rectangular multipliers, the proposed approach can be applied with the cost of a -bit addition.

The complete or even partial execution overlap of the first row with other rows generation clearly depends on a number of factors, including the value of m' and the way that the -bit addition is implemented, but still the proposed approach offers an interesting alternative that can possibly be explored for designing and implementing rectangular multipliers.
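The effect of the assumption m ≥ n on the number of rows can be sketched as below. This is only a row-count illustration under the assumption that the shorter, n-bit operand is the one being Booth-recoded; the function name and the returned field names are hypothetical, and the width of the extra addition is not modeled.

```python
import math


def rows_for_rectangular(m: int, n: int) -> dict:
    """Row counts for an m x n multiplier when the shorter operand is recoded."""
    shorter = min(m, n)
    rows = math.ceil(shorter / 2)  # one radix-4 Booth digit per two bits
    return {
        "recoded_width": shorter,
        "m_prime": abs(m - n),          # m' = m - n when m >= n
        "standard_height": rows + 1,    # extra line for the last neg bit
        "proposed_height": rows,        # last neg bit folded into the first row
    }


if __name__ == "__main__":
    print(rows_for_rectangular(16, 8))
```

Recoding the longer operand instead would give ⌈m/2⌉ rows, which is why the shorter one is chosen.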

EVALUATION AND COMPARISONS:

In this section, the proposed method based on the addition of the last neg signal to the first row is first evaluated. The designed architecture is then compared with an implementation based on the computation of the two's complement of the last row (referred to as the "Two's complement" method) using the designs for the 3-5 decoders, 4-1 multiplexers, and two's complement tree in [9]. Moreover, in the analysis, the standard MBE implementations for the first and for a generic partial product row are also taken into account (as summarized in Table 2).

For all the implementations, we explicitly evaluate the most common case of an n × n multiplier, although we have shown in Section 4 that the proposed approach can also be extended to m × n rectangular multipliers. While studying the framework of possible implementations, we considered the first phase of the multiplication algorithm (i.e., the partial product generation) and we focused our attention on the issues of area occupancy and modular design, since it is reasonable to expect that they lead to a possibly small multiplier with regular layout. The detailed results of some extensive evaluations and comparisons, both based on theoretical analysis and related implementations, are reported in [11]. Results encompass the following:



could be to have the generation of the first bits of the first row carried out by the circuit of Fig. 9, followed by the cascaded addition provided by Fig. 7 (Section 4).

Based on all of the above, our architecture has been designed to perform the following operations:

1. Generation of the three most significant bit weights of the first row (through the very small and regular circuitry of Fig. 9) and addition to these bits of the neg signal (by means of the circuitry of Fig. 7);

2. Generation of the other bits of the first row, using the circuitry depicted in Fig. 8; and

3. Generation of the bits of the other rows, using the circuitry of Fig. 1.

As these three operations can be carried out in parallel, the overall critical path of the proposed architecture emerges from the largest delay among the above paths.
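As a small illustration of this reasoning, the sketch below takes the critical path as the maximum delay over the three parallel paths. The path names and the delay values used in the example are hypothetical placeholders, not figures from the tables.

```python
def critical_path(path_delays: dict[str, float]) -> tuple[str, float]:
    """Return the name and delay of the slowest of the parallel paths."""
    name = max(path_delays, key=path_delays.get)
    return name, path_delays[name]


def introduces_penalty(
    path_delays: dict[str, float],
    first_row_msb: str = "first_row_msbs_plus_neg",
    standard: str = "other_rows",
) -> bool:
    """True if the first-row MSB path is slower than a standard row."""
    return path_delays[first_row_msb] > path_delays[standard]


if __name__ == "__main__":
    # Hypothetical delays in ns, for illustration only.
    delays = {
        "first_row_msbs_plus_neg": 0.30,
        "first_row_other_bits": 0.25,
        "other_rows": 0.32,
    }
    print(critical_path(delays), introduces_penalty(delays))
```

With these placeholder values the standard rows set the critical path, which is the situation the proposed method aims for.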

Critical path and area cost for the proposed architecture, as well as for the other implementations in Table 2, were computed with reference to a 130 nm HCMOS standard cell library from STMicroelectronics [22] (later used also for obtaining overall synthesis results). In this analysis, the contribution of wires was neglected, and a buffer-free configuration was considered. Nonetheless, details regarding buffer stages location and size are discussed in [11]. Data concerning area and delay for elementary cells used in this work (as well as in [9]) are reported in Table 3. Results are reported in Tables 4 and 5, respectively. It is worth observing that results may vary depending on specific parameters selected for the synthesis such as logic implementation, optimization strategies, and target libraries.

We observe that the "Two's Complement" approach has a delay that is longer than the delay to generate the standard partial product rows, becoming even longer as the size n of the multiplier increases (e.g., exceeding the delay of an XNOR2 gate starting from n ≥ 16). On the other hand, according to theoretical estimations, we can see that the delay for generating the first row in the proposed method is estimated to be lower than the delay for generating the standard rows. This means that the extra row is eliminated without any penalty on the overall critical path.

With respect to area costs, it can be observed that the proposed method hardly introduces any area overhead with respect to the standard generation of a partial product row. On the other hand, the "Two's Complement" approach requires additional hardware, which increases with the size of the multiplier.



Implementation Results:

In order to further check the validity of our estimations in an implementation technology, we implemented the designs in Table 2 through logic synthesis and technology mapping to an industrial standard cell library. Specifically, for the logic synthesis, we used Synopsys Design Compiler and the designs were mapped to a 130 nm HCMOS industrial library from STMicroelectronics [22].

To perform the evaluation, we obtained the area-delay space for the sole generation of the partial product row of interest (i.e., the first row in the proposed approach, the last row in the implementation presented in [9]). In order to support the comparison, the area-delay space for the generation of the partial product rows using standard MBE implementations was also evaluated, by considering the first row and the other rows of the partial product array separately (Table 2). The results, obtained for n = 8, 16, and 32, are depicted in Fig. 10.

The delays are shown both in absolute units (ns) and normalized to the delay of an inverter with a fan-out of four (68 ps for the technology used, under worst-case conditions). Accordingly, the area is presented both in absolute units (µm²) and normalized to equivalent gates using the area of a NAND2 gate (4.39 µm² for the technology used). We obtained several design points (using different target delays) for each approach, and the minimum delay shown corresponds to the fastest design that the tool was capable of synthesizing.
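The two normalizations can be written out as a short Python sketch. The constants are the FO4 delay (68 ps) and the NAND2 area as read from the text (4.39 µm²); the function names and the sample inputs are illustrative assumptions, not data from the paper.

```python
FO4_DELAY_NS = 0.068     # delay of an inverter with a fan-out of four (68 ps)
NAND2_AREA_UM2 = 4.39    # area of one NAND2 gate, as read from the text


def to_fo4(delay_ns: float) -> float:
    """Express an absolute delay in units of FO4 inverter delays."""
    return delay_ns / FO4_DELAY_NS


def to_equivalent_gates(area_um2: float) -> float:
    """Express an absolute area in NAND2-equivalent gates."""
    return area_um2 / NAND2_AREA_UM2


if __name__ == "__main__":
    # Hypothetical design point: 0.34 ns and 43.9 um^2.
    print(to_fo4(0.34), to_equivalent_gates(43.9))
```

Normalizing in this way makes design points easier to compare across technologies, since both units scale with the process.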

We observe that the "Proposed method" implementation produces a curve in the delay-area graph bounded by the curve for the generation of a standard partial product (upper bound) and by the curve for the standard generation of the first partial product (lower bound) for the three values of n considered. Moreover, the minimum delay that is achieved is very similar to the case of the generation of a standard partial product for n = 8, 16 (with our approach it is about 0.5-0.7 FO4 higher), and is even less for n = 32 due to the predominant effect of the higher loading of the control signals. Therefore, our scheme does not introduce any additional delay in the partial product generation stage for target delays higher than about 5 FO4.

The curve for our scheme gets closer to the curve corresponding to the standard generation of the first partial product as n increases. This is due to the fact that as n increases, the short addition of the leading part achieves more overlap with the generation of the rest of the partial product (with higher input load capacitance, as n increases).

The "Two's Complement" scheme achieves minimum delays between 7 and 10 FO4, at the cost of requiring more than four times the area at this point, compared to the "Proposed method" approach. Most importantly, its delay is much higher than the one of any standard row.



CONCLUSIONS:

Two's complement n × n multipliers using radix-4 Modified Booth Encoding produce ⌈n/2⌉ partial products but, due to the sign handling, the partial product array has a maximum height of ⌈n/2⌉ + 1. We presented a scheme that produces a partial product array with a maximum height of ⌈n/2⌉, without introducing any extra delay in the partial product generation stage. With the extra hardware of a (short) 3-bit addition, and the simpler generation of the first partial product row, we have been able to achieve a delay for the proposed scheme within the bound of the delay of a standard partial product row generation. The outcome of the above is that the reduction of the maximum height of the partial product array by one unit may simplify the partial product reduction tree, both in terms of delay and regularity of the layout. This is of special interest for all multipliers, and especially for single-cycle short bit-width multipliers for high performance embedded cores, where short bit-width multiplications are common operations. We have also compared our approach with a recent proposal with the same aim, considering results using a widely used industrial synthesis tool and a modern industrial technology library, and concluded that our approach may improve both the performance and area requirements of square multiplier designs. The proposed approach also applies with minor modifications to rectangular and to general radix-B Modified Booth Encoding multipliers.
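To illustrate why one row less may simplify the reduction tree, the sketch below counts the levels of 3:2 carry-save adders needed to reduce an array to two rows. The Wallace-style grouping and the function names are assumptions of this illustration; the text does not specify a particular reduction tree.

```python
import math


def max_height(n: int, proposed: bool) -> int:
    """Maximum partial product array height for an n x n radix-4 MBE multiplier."""
    rows = math.ceil(n / 2)
    return rows if proposed else rows + 1


def csa_levels(height: int) -> int:
    """Levels of 3:2 carry-save adders needed to reduce `height` rows to 2.

    Each level turns every full group of three rows into two rows and
    passes the remaining rows through unchanged.
    """
    levels = 0
    while height > 2:
        height = 2 * (height // 3) + height % 3
        levels += 1
    return levels


if __name__ == "__main__":
    for n in (8, 16, 32):
        standard = csa_levels(max_height(n, proposed=False))
        proposed = csa_levels(max_height(n, proposed=True))
        print(n, standard, proposed)
```

Under this particular grouping a level is saved for n = 8 but not for n = 16 or n = 32, so the benefit depends on the operand width and on the compressors used.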



References:

1. M.D. Ercegovac and T. klobdzija, R.K. Krishnamurthy, and S.Y. Borkar, "A 110GOPS/W 16-Bit Multiplier and Reconfigurable PPW Reconfigurable Dual-Supply 4-Way SIMD Vector Processing Accelerator in 45 nm CMOS," IEEE J. Solid State Circuits, vol. 45, no. 1, pp. 95-101, Jan. 2010.

4. M.S. Schmookler, M. Putrino, A. Mather, J. Tyler, H.V. Nguyen, C. Roth, M. Sharma, M.N. Pham, and J. Optimal Circuits for Parallel Multipliers," IEEE Trans. Computers, vol. 47, no. 3, pp. 273-285, Mar. 1998.

17. J.-Y. Kang and J.- Complementation," Proc. Int'l Conf. Computational Science, pp. 212-219, 2005.

18. K. Hwang, Computer Arithmetic Principles, Architectures, and Design. Wiley, 1979.

19. R. Hashemian and C.P. Chen, "A New Parallel Technique for Design of Decrement/Increment and Two's Complement Circuits," Proc. 34th Midwest Symp. Circuits and Systems, vol. 2, pp. 887-890, 1991.

20. D. Gajski, Principles of Digital Design. Prentice-Hall, 1997.

STMicroelectronics, "130nm HCMOS9 Cell



No errors in compilation
Analysis of file <"Partial_product.prj"> succeeded.

Process "Check Syntax" completed successfully

Synthesis report

Release 9.2i - xst J.36
Copyright (c) 1995-2007 Xilinx, Inc.  All rights reserved.
--> Parameter TMPDIR set to ./xst/projnav.tmp
CPU : 0.00 / 0.13 s | Elapsed : 0.00 / 0.00 s

--> Parameter xsthdpdir set to ./xst
CPU : 0.00 / 0.13 s | Elapsed : 0.00 / 0.00 s

--> Reading design: Partial_product.prj

TABLE OF CONTENTS
  1) Synthesis Options Summary
  2) HDL Compilation
  3) Design Hierarchy Analysis
  4) HDL Analysis
  5) HDL Synthesis
     5.1) HDL Synthesis Report
  6) Advanced HDL Synthesis
     6.1) Advanced HDL Synthesis Report
  7) Low Level Synthesis
  8) Partition Report
  9) Final Report
     9.1) Device utilization summary
     9.2) Partition Resource Summary
     9.3) TIMING REPORT

=========================================================================
*                      Synthesis Options Summary                        *
=========================================================================
---- Source Parameters
Input File Name                    : "Partial_product.prj"
Input Format                       : mixed
Ignore Synthesis Constraint File   : NO

---- Target Parameters
Output File Name                   : "Partial_product"
Output Format                      : NGC
Target Device                      : xc3s500e-5-cp132

---- Source Options
Top Module Name                    : Partial_product
Automatic FSM Extraction           : YES
FSM Encoding Algorithm             : Auto

Use Synchronous Set                : Yes
Use Synchronous Reset              : Yes
Pack IO Registers into IOBs        : auto
Equivalent register Removal        : YES

---- General Options
Optimization Goal                  : Speed
Optimization Effort                : 1
Library Search Order               : Partial_product.lso
Keep Hierarchy                     : NO
RTL Output                         : Yes
Global Optimization                : AllClockNets
Read Cores                         : YES
Write Timing Constraints           : NO
Cross Clock Analysis               : NO
Hierarchy Separator                : /
Bus Delimiter                      : <>
Case Specifier                     : maintain
Slice Utilization Ratio            : 100
BRAM Utilization Ratio             : 100
Verilog 2001                       : YES
Auto BRAM Packing                  : NO
Slice Utilization Ratio Delta      : 5

=========================================================================


=========================================================================
*                          HDL Compilation                              *
=========================================================================
Compiling verilog file "fig8b.v" in library work
Compiling verilog file "fig8a.v" in library work
Module <fig8b> compiled
Compiling verilog file "fig1b.v" in library work
Module <fig8a> compiled
Compiling verilog file "fig1a.v" in library work
Module <fig1b> compiled
Compiling verilog file "fig1.v" in library work
Module <fig1a> compiled
Module <Partial_product> compiled
No errors in compilation
Analysis of file <"Partial_product.prj"> succeeded.

=========================================================================
*                     Design Hierarchy Analysis                         *
=========================================================================
Analyzing hierarchy for module <Partial_product> in library <work>.

Analyzing hierarchy for module <fig8a> in library <work>.

=========================================================================
*                           HDL Synthesis                               *
=========================================================================

Performing bidirectional port resolution...

Synthesizing Unit <fig8a>.
    Related source file is "fig8a.v".
Unit <fig8a> synthesized.

Synthesizing Unit <fig8b>.
    Related source file is "fig8b.v".
    Found 1-bit xor2 for signal <0_0$xor0000>.
Unit <fig8b> synthesized.

Synthesizing Unit <fig1a>.
    Related source file is "fig1a.v".
    Found 1-bit xor2 for signal <onei0>.
Unit <fig1a> synthesized.

WARNING:Xst:646 - Signal <42> is assigned but never used.
WARNING:Xst:1780 - Signal <80> is never used or assigned.
Unit <Partial_product> synthesized.

=========================================================================
HDL Synthesis Report

Macro Statistics
# Xors                             : 35
 1-bit xor2                        : 35

=========================================================================

WARNING:Xst:1290 - Hierarchical block <ro241> is unconnected in block <Partial_product>.
   It will be removed from the design.




=========================================================================
Advanced HDL Synthesis Report

Macro Statistics
# Xors                             : 35
 1-bit xor2                        : 35

=========================================================================

=========================================================================
*                         Low Level Synthesis                           *
=========================================================================

Optimizing unit <Partial_product> ...

Mapping all equations...
Building and optimizing final netlist ...
Found area constraint ratio of 100 (+ 5) on block Partial_product, actual ratio is 0.

Final Macro Processing ...

=========================================================================
Final Register Report

Found no macro
=========================================================================

=========================================================================
*                          Partition Report                             *
=========================================================================

Partition Implementation Status

  No Partitions were found in this design.

=========================================================================
*                            Final Report                               *
=========================================================================

Final Results
RTL Top Level Output File Name     : Partial_product.ngr
Top Level Output File Name         : Partial_product
Output Format                      : NGC
Optimization Goal                  : Speed
Keep Hierarchy                     : NO

Design Statistics
# IOs                              : 82

Cell Usage :
# BELS                             : 6
#      LUT3                        : 1
#      LUT4                        : 5
# IO Buffers                       : 44
#      IBUF                        : 8
#      OBUF                        : 36
=========================================================================

Device utilization summary:

Selected Device : 3s500ecp132-5

 Number of Slices:                       3  out of   4656     0%
 Number of 4 input LUTs:                 6  out of   9312     0%
 Number of IOs:                         82
 Number of bonded IOBs:                 44  out of     92    47%

Partition Resource Summary:

  No Partitions were found in this design.

=========================================================================
TIMING REPORT

NOTE: THESE TIMING NUMBERS ARE ONLY A SYNTHESIS ESTIMATE.
      FOR ACCURATE TIMING INFORMATION PLEASE REFER TO THE TRACE REPORT
      GENERATED AFTER PLACE-and-ROUTE.

Clock Information:
No clock signals found in this design

Asynchronous Control Signals Information:
No asynchronous control signals found in this design

Timing Summary:
Speed Grade: -5

   Minimum period: No path found
   Minimum input arrival time before clock: No path found
   Maximum output required time after clock: No path found
   Maximum combinational path delay: 6.176ns

Timing Detail:
All values displayed in nanoseconds (ns)

=========================================================================
Timing constraint: Default path analysis
  Total number of paths / destination ports: 138 / 36
Delay:               6.176ns (Levels of Logic = 3)
  Source:            Mr<1> (PAD)
  Destination:       005 (PAD)

  Data Path: Mr<1> to 005
                                Gate     Net
    Cell:in->out      fanout   Delay   Delay  Logical Name (Net Name)

//   /   /\/   /
//  /___/  \  /    Vendor: Xilinx
//  \   \   \/     Version : 9.2i
//   \   \         Application : ISE
//   /   /         Filename : t_tb_selfcheck.tfw
//  /___/   /\     Timestamp : Mon Jan 23 18:06:08 2012
//  \   \  /  \
//   \___\/\___\
//
//Command:
//Design Name: t_tb_selfcheck_beh
//Device: Xilinx
//
`timescale 1ns/1ps

module t_tb_selfcheck_beh;
    reg [7:0] Md = 8'b00000000;
    reg [7:0] Mr = 8'b00000000;
    wire [15:0] km;
    wire [15:0] k1;
    wire [15:0] k2;
    wire [15:0] k3;
    wire [15:0] k4;

    test UUT (

        .Md(Md)

        // Current Time: 300ns
        #50;
        Md = 8'b00110111;
        //
        // Current Time: 350ns
        #50;
        CHECK_km(16'b0000001010010100);
        CHECK_k2(16'b0000101100100000);
        CHECK_k3(16'b0011001101110100);
        //
        #250;
        Md = 8'b01111101;
        Mr = 8'b10101101;
        //
        #50;
        CHECK_km(16'b1101011101111001);
        CHECK_k1(16'b0000010010111101);
        CHECK_k2(16'b0000101000001000);
        CHECK_k3(16'b0010100000100100);
        CHECK_k4(16'b1010000010010000);
        //
        #150;
        Md = 8'b10011001;
        Mr = 8'b10011001;
        //
        #50;
        CHECK_km(16'b0010100101110001);
        CHECK_k1(16'b0000001111011001);
        CHECK_k2(16'b0000111100110100);
        CHECK_k3(16'b0010001100100100);
        CHECK_k4(16'b1111001101000000);

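    end

    // Each CHECK task compares an expected value with the corresponding output
    // and counts a mismatch in TX_ERROR.
    task CHECK_km;
        input [15:0] NEXT_km;

        #0 begin
            if (NEXT_km !== km) begin
                $display("Error at time=%dns km=%b, expected=%b", $time, km, NEXT_km);
                TX_ERROR = TX_ERROR + 1;
            end
        end
    endtask

    task CHECK_k1;
        input [15:0] NEXT_k1;

        #0 begin
            if (NEXT_k1 !== k1) begin
                $display("Error at time=%dns k1=%b, expected=%b", $time, k1, NEXT_k1);
                TX_ERROR = TX_ERROR + 1;
            end
        end
    endtask

    task CHECK_k2;
        input [15:0] NEXT_k2;

        #0 begin
            if (NEXT_k2 !== k2) begin
                $display("Error at time=%dns k2=%b, expected=%b", $time, k2, NEXT_k2);
                TX_ERROR = TX_ERROR + 1;
            end
        end
    endtask

    task CHECK_k3;
        input [15:0] NEXT_k3;

        #0 begin
            if (NEXT_k3 !== k3) begin
                $display("Error at time=%dns k3=%b, expected=%b", $time, k3, NEXT_k3);
                TX_ERROR = TX_ERROR + 1;
            end
        end
    endtask

    task CHECK_k4;
        input [15:0] NEXT_k4;

        #0 begin
            if (NEXT_k4 !== k4) begin
                $display("Error at time=%dns k4=%b, expected=%b", $time, k4, NEXT_k4);
                TX_ERROR = TX_ERROR + 1;
            end
        end
    endtask

endmodule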
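The expected values checked on the km output can be cross-checked with a small reference model. The Python sketch below assumes that Md and Mr are 8-bit two's complement operands and that km is their 16-bit product; this reading is inferred from the test vectors and is not stated in the testbench. The function names are illustrative.

```python
def to_signed(value: int, width: int) -> int:
    """Interpret the low `width` bits of value as a two's complement number."""
    value &= (1 << width) - 1
    return value - (1 << width) if value >> (width - 1) else value


def expected_product(md: int, mr: int, width: int = 8) -> int:
    """Return the 2*width-bit two's complement product as an unsigned pattern."""
    product = to_signed(md, width) * to_signed(mr, width)
    return product & ((1 << (2 * width)) - 1)


if __name__ == "__main__":
    # Second and third stimulus pairs from the testbench.
    print(format(expected_product(0b01111101, 0b10101101), "016b"))
    print(format(expected_product(0b10011001, 0b10011001), "016b"))
```

Both stimulus pairs for which the two operands are visible in the testbench reproduce the km patterns passed to the check task. The values expected on k1 to k4 depend on the row encoding and are not modeled here.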



Output waveform:

Schematic diagram:

Technical schematic diagram:
