

This article can be downloaded from http://www.ijerst.com/currentissue.php

Int. J. Engg. Res. & Sci. & Tech. 2015 V L Prathyusha and N Suresh Babu, 2015

IMPLEMENTATION OF POWER OPTIMIZED RAZOR-BASED MULTIPLIER

This project presents a multiprecision (MP) reconfigurable multiplier that incorporates variable precision, Parallel Processing (PP), razor-based Dynamic Voltage Scaling (DVS), and dedicated MP operands scheduling to provide optimum performance for a variety of operating conditions. All of the building blocks of the proposed reconfigurable multiplier can either work as independent smaller-precision multipliers or work in parallel to perform higher-precision multiplications. Given the user's requirements (e.g., throughput), a dynamic voltage/frequency scaling management unit configures the multiplier to operate at the proper precision and frequency. Adapting to the run-time workload of the targeted application, razor flip-flops together with a dithering voltage unit then configure the multiplier to achieve the lowest power consumption. The single-switch dithering voltage unit and razor flip-flops help to reduce the voltage safety margins and overhead typically associated with DVS to the lowest level. The large silicon area and power overhead typically associated with reconfigurability features are removed. Finally, the proposed novel MP multiplier can further benefit from an operands scheduler that rearranges the input data so as to determine the optimum voltage and frequency operating conditions for minimum power consumption.

Keywords: Razor multiplier, Clock domain, XILINX ISE, Verilog

V L Prathyusha1* and N Suresh Babu2

*Corresponding Author: V L Prathyusha [email protected]

1 M.Tech Student, Department of E.C.E., Chirala Engineering College, Ramapuram Beach Rd, Chirala, Andhra Pradesh 523157, India.
2 Associate Professor, Department of E.C.E., Chirala Engineering College, Ramapuram Beach Rd, Chirala, Andhra Pradesh 523157, India.

ISSN 2319-5991 www.ijerst.com Vol. 4, No. 4, November 2015

© 2015 IJERST. All Rights Reserved

Research Paper

INTRODUCTION

A binary multiplier is an electronic circuit used in digital electronics, such as a computer, to multiply two binary numbers. It is built using binary adders. A variety of computer arithmetic techniques can be used to implement a digital multiplier. Most techniques involve computing a set of partial products, and then summing the partial products together. This process is similar to the method taught to primary schoolchildren for conducting long multiplication on base-10 integers, but has been modified here for application to a base-2 (binary) numeral system.

Until the late 1970s, most minicomputers did not have a multiply instruction, and so programmers used a “multiply routine” which repeatedly shifts and accumulates partial results, often written using loop unwinding. Mainframe computers had multiply instructions, but they did the same sorts of shifts and adds as a “multiply routine”. Early microprocessors also had no multiply instruction. The Motorola 6809, introduced in 1978, was one of the earliest microprocessors with a dedicated hardware multiply instruction. It did the same sorts of shifts and adds as a “multiply routine”, but implemented in the microcode of the MUL instruction.

The method taught in school for multiplying decimal numbers is based on calculating partial products, shifting them to the left and then adding them together. The most difficult part is to obtain the partial products, as that involves multiplying a long number by one digit (from 0 to 9):

    123
  x 456
  =====
    738   (this is 123 x 6)
   615    (this is 123 x 5, shifted one position to the left)
+ 492     (this is 123 x 4, shifted two positions to the left)
  =====
  56088

A binary computer does exactly the same, but with binary numbers. In binary encoding each long number is multiplied by one digit (either 0 or 1), and that is much easier than in decimal, as the product by 0 or 1 is just 0 or the same number. Therefore, the multiplication of two binary numbers comes down to calculating partial products (which are 0 or the first number), shifting them left, and then adding them together (a binary addition, of course):

       1011   (this is 11 in decimal)
     x 1110   (this is 14 in decimal)
     ======
       0000   (this is 1011 x 0)
      1011    (this is 1011 x 1, shifted one position to the left)
     1011     (this is 1011 x 1, shifted two positions to the left)
  + 1011      (this is 1011 x 1, shifted three positions to the left)
  =========
   10011010   (this is 154 in decimal)

This is much simpler than in the decimal system, as there is no table of multiplication to remember: just shifts and adds. This method is mathematically correct and has the advantage that a small CPU may perform the multiplication by using the shift and add features of its arithmetic logic unit rather than a specialized circuit. The method is slow, however, as it involves many intermediate additions. These additions take a lot of time. Faster multipliers may be engineered in order to do fewer additions; a modern processor can multiply two 64-bit numbers with 6 additions (rather than 64), and can do several steps in parallel. The second problem is that the basic school method handles the sign with a separate rule (“+ with + yields +”, “+ with - yields -”, etc.). Modern computers embed the sign of the number in the number itself, usually in the two’s complement representation. That forces the multiplication process to be adapted to handle two’s complement numbers, and that complicates the process a bit more. Similarly, processors that use ones’ complement, sign-and-magnitude, IEEE-754 or other binary representations require specific adjustments to the multiplication process.
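The shift-and-add procedure described above can be illustrated in software. The following is a minimal Python sketch, not the hardware design of this paper; the function names `shift_add_multiply` and `partial_products` are hypothetical, and only unsigned operands are handled (two's complement operands would need the adjustments mentioned above).

```python
def shift_add_multiply(a: int, b: int) -> int:
    """Multiply two non-negative integers using only shifts and adds."""
    if a < 0 or b < 0:
        raise ValueError("this sketch handles unsigned operands only")
    product = 0
    shift = 0
    while b:
        if b & 1:                      # current multiplier bit is 1
            product += a << shift      # add the shifted multiplicand
        b >>= 1                        # move on to the next multiplier bit
        shift += 1
    return product


def partial_products(a: int, b: int, width: int) -> list:
    """Return the shifted partial products, least significant bit first."""
    return [(a << i) if (b >> i) & 1 else 0 for i in range(width)]
```

For the worked example, `partial_products(0b1011, 0b1110, 4)` yields the four rows of the binary multiplication (as integers), and their sum equals `shift_add_multiply(0b1011, 0b1110)`.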

High speed arithmetic operations are very important in many signal processing applications. The speed of a Digital Signal Processor (DSP) is largely determined by the speed of its multipliers. In fact, multipliers are the most important part of all digital signal processors; they are essential in realizing many important functions such as fast Fourier transforms and convolutions. Since a processor spends a considerable amount of time performing multiplication, an improvement in multiplication speed can greatly improve system performance. Multiplication can be implemented using many algorithms, such as the array, Booth, carry save, and Wallace tree algorithms. The computational time required by the array multiplier is less because the partial products are computed independently in parallel. The delay associated with the array multiplier is the time taken by the signals to propagate through the gates that form the multiplication array.

Arrangement of adders is another way of improving multiplication speed. There are two methods for this: the Carry Save Array (CSA) method and the Wallace tree method. In the CSA method, bits are processed one by one to supply a carry signal to an adder located at a one-bit-higher position. The CSA method has its own limitations, since the execution time depends on the number of bits of the multiplier. In the Wallace tree method, three bit signals are passed to a one-bit full adder; the sum is supplied to the next-stage full adder of the same bit position, and the carry output is supplied to the next-stage full adder located at a one-bit-higher position. In this method, the circuit layout is not easy.
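The idea of passing three bits to a full adder and forwarding the carry to the next higher bit position can be sketched at word level. The Python below is an illustrative model only, assuming non-negative integer operands; `carry_save_add` and `wallace_style_sum` are hypothetical names, and the grouping of rows is a simplification of a real Wallace tree layout.

```python
def carry_save_add(x: int, y: int, z: int):
    """3:2 compressor: reduce three operands to a sum word and a carry word."""
    sum_word = x ^ y ^ z                             # per-bit sum, no carry propagation
    carry_word = ((x & y) | (x & z) | (y & z)) << 1  # per-bit carry, moved one position up
    return sum_word, carry_word


def wallace_style_sum(operands):
    """Reduce non-negative integers with 3:2 compressors, then do one final add."""
    rows = list(operands)
    while len(rows) > 2:
        next_rows = []
        # Compress each full group of three rows in this reduction stage.
        for i in range(0, len(rows) - 2, 3):
            next_rows.extend(carry_save_add(rows[i], rows[i + 1], rows[i + 2]))
        # Rows that did not fit into a group of three pass through unchanged.
        next_rows.extend(rows[len(rows) - len(rows) % 3:])
        rows = next_rows
    return sum(rows)  # final carry-propagate addition
```

Each stage turns three rows into two without propagating carries along the word, so only the last addition needs a full carry chain.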

The Booth algorithm reduces the number of partial products. However, large Booth arrays are required for high speed multiplication and exponential operations, which in turn require large partial sum and partial carry registers. Multiplication of two n-bit operands using a radix-4 Booth recoding multiplier requires approximately n/(2m) clock cycles to generate the least significant half of the final product, where m is the number of Booth recoded adder stages. Thus, a large propagation delay is associated with this case. The modified Booth encoded Wallace tree multiplier uses the modified Booth algorithm to reduce the partial products, and faster additions are performed using the Wallace tree.
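To make the reduction in partial products concrete, the sketch below recodes a two's complement multiplier into radix-4 Booth digits, giving one partial product per pair of multiplier bits. It is a behavioural Python illustration under the assumption of an even bit width; the names `booth_radix4_digits` and `booth_multiply` are hypothetical and do not describe the circuit used in this work.

```python
def booth_radix4_digits(multiplier: int, width: int) -> list:
    """Recode a two's complement multiplier of even `width` into digits in -2..2."""
    if width % 2:
        raise ValueError("width must be even for radix-4 recoding")
    bits = multiplier & ((1 << width) - 1)    # two's complement bit pattern
    digits = []
    prev = 0                                   # implicit bit to the right of the LSB
    for i in range(0, width, 2):
        b0 = (bits >> i) & 1
        b1 = (bits >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)      # standard radix-4 Booth digit
        prev = b1
    return digits                              # least significant digit first


def booth_multiply(multiplicand: int, multiplier: int, width: int) -> int:
    """Multiply using one partial product per radix-4 digit (width / 2 in total)."""
    digits = booth_radix4_digits(multiplier, width)
    return sum((d * multiplicand) << (2 * k) for k, d in enumerate(digits))
```

An 8-bit multiplier therefore produces four partial products instead of eight, each being 0, ±1 or ±2 times the multiplicand.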

RAZOR BASED MULTIPLIER

Consumers' demand for increasingly portable yet high performance multimedia and communication products imposes stringent constraints on the power consumption of individual internal components. Of these, multipliers perform one of the most frequently encountered arithmetic operations in Digital Signal Processors (DSPs). For embedded applications, it has become essential to design more power-aware multipliers. Given their fairly complex structure and interconnections, multipliers can exhibit a large number of unbalanced paths, resulting in substantial glitch generation and propagation.

Each pair of incoming operands is routed to the smallest multiplier that can compute the result, to take advantage of the lower energy consumption of the smaller circuit. This ensemble of point systems is reported to consume the least power, but at the cost of increased chip area given the ensemble structure used. To address this issue, sharing and reusing some functional modules within the ensemble has been proposed. In one such design, an 8-bit multiplier is reused for the 16-bit multiplication, adding scalability without a large area penalty. This method was later extended by implementing pipelining to further improve the multiplier’s performance.

A more flexible approach has also been proposed, with several multiplier elements grouped together to provide higher precisions and reconfigurability. An analysis of the overhead associated with such reconfigurable multipliers showed that around 10%-20% of extra chip area is needed for 8-16 bit multipliers. Combining multiprecision (MP) with Dynamic Voltage Scaling (DVS) can provide a reduction in power consumption by adjusting the supply voltage according to the circuit’s run-time workload rather than fixing it to cater for the worst case scenario. When adjusting the voltage, the actual performance of the multiplier running under scaled voltage has to be characterized to guarantee fail-safe operation. Conventional DVS techniques consist mainly of lookup table (LUT) and on-chip critical path replica approaches.

The LUT approach tunes the supply voltage according to a predefined voltage-frequency relationship stored in a LUT, which is formed considering worst case conditions (process variations, power supply droops, temperature hot-spots, coupling noise, and many more). Therefore, large margins are necessarily added, which in turn significantly decrease the effectiveness of the DVS technique. The critical path replica approach typically involves an on-chip critical path replica to approximate the actual critical path. Therefore, voltage could be scaled to the extent that the replica fails to meet the timing. However, safety margins are still needed to compensate for the intradie delay mismatch and address fast-changing transient effects. In addition, the critical path may change as a result of the varying supply voltage or process or temperature variations. If this occurs, computations will completely fail regardless of the safety margins.

The aforementioned limitations of conventional DVS techniques motivated recent research efforts into error-tolerant DVS approaches, which can operate the circuit at run time even at a voltage level at which timing errors occur. A recovery mechanism is then applied to detect error occurrences and restore the correct data. Because they completely remove worst case safety margins, error-tolerant DVS techniques can reduce power consumption more aggressively. In this paper, we propose a low power reconfigurable multiplier architecture that combines MP with an error-tolerant DVS approach based on razor flip-flops. The main contributions of this paper can be summarized as follows.

• A novel MP multiplier architecture featuring, respectively, 28.2% and 15.8% reduction in silicon area and power consumption compared with its conventional 32 × 32 bit fixed-width multiplier counterpart. All reported multipliers trade silicon area/power consumption for MP. In this paper, silicon area is optimized by applying an operation reduction technique that replaces a multiplier by adders/subtractors.

• A silicon implementation of this MP multiplier integrating an error-tolerant razor-based dynamic DVS approach. The fabricated chip demonstrates run-time adaptation to the actual workload by operating at the minimum supply voltage level and minimum clock frequency while meeting throughput requirements. Prior works combining MP with DVS have only considered a limited number of offline simulated precision-voltage pairs, with unnecessarily large safety margins added to cater for critical paths.

• A novel dedicated operand scheduler that rearranges operations on input operands so as to reduce the number of transitions of the supply voltage and, in turn, minimize the overall power consumption of the multiplier. Unlike reported scheduling works, the function of the proposed scheduler is not task scheduling but rather input operands scheduling for the proposed MP multiplier.

Figure 1: Overall Multiplier System Architecture

The proposed MP multiplier system (Figure 1) comprises three different modules, as follows:

• The MP multiplier;

• The Voltage Scaling Unit (VSU), implemented using a voltage dithering technique to limit silicon area overhead. Its function is to dynamically generate the supply voltage so as to minimize power consumption;

• The clock Frequency Scaling Unit (FSU), implemented using a frequency division technique to limit the clock rate. Its function is to dynamically generate the different frequency levels so as to minimize the clock rate.

ANALOG DIVIDERS

Regenerative Frequency Divider

A regenerative frequency divider, also known as a Miller frequency divider, mixes the input signal with the feedback signal from the mixer. The feedback signal is fin/2. This produces sum and difference frequencies fin/2 and 3fin/2 at the output of the mixer. A low pass filter removes the higher frequency, and the fin/2 frequency is amplified and fed back into the mixer. In order to establish a stable 1/2 frequency feedback, the amplifier gain at the half frequency must be greater than unity. The phase shift must also be an integer multiple of 2π.

Figure 2: Regenerative Frequency Divider
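The sum and difference frequencies produced in the regenerative divider follow from the product-to-sum identity. As a sketch, assuming an ideal multiplying mixer and unit-amplitude cosine signals (idealizations not stated in the text), with $f_{in}$ the input frequency and $f_{in}/2$ the feedback frequency:

```latex
\cos(2\pi f_{in} t)\,\cos\!\left(2\pi \tfrac{f_{in}}{2} t\right)
  = \tfrac{1}{2}\cos\!\left(2\pi \tfrac{f_{in}}{2} t\right)
  + \tfrac{1}{2}\cos\!\left(2\pi \tfrac{3 f_{in}}{2} t\right)
```

The first term is the difference frequency $f_{in} - f_{in}/2 = f_{in}/2$, which the low pass filter keeps, and the second is the sum frequency $f_{in} + f_{in}/2 = 3f_{in}/2$, which it removes.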

Injection-Locked Frequency Divider

A free-running oscillator which has a small amount of a higher-frequency signal fed to it will tend to oscillate in step with the input signal. Such frequency dividers were essential in the development of television. They operate similarly to an injection locked oscillator. In an injection locked frequency divider (ILFD), the frequency of the input signal is a multiple (or fraction) of the free-running frequency of the oscillator. While these frequency dividers tend to be lower power than broadband static (or flip-flop based) frequency dividers, the drawback is their low locking range. The ILFD locking range is inversely proportional to the quality factor (Q) of the oscillator tank. In integrated circuit designs, this makes an ILFD sensitive to process variations. Care must be taken to ensure that the tuning range of the driving circuit (for example, a voltage-controlled oscillator) falls within the input locking range of the ILFD.

Digital Dividers

For power-of-2 integer division, a simple binary counter can be used, clocked by the input signal. The least-significant output bit alternates at 1/2 the rate of the input clock, the next bit at 1/4 the rate, the third bit at 1/8 the rate, etc. An arrangement of flip-flops is a classic method for integer-n division. Such division is frequency and phase coherent to the source over environmental variations including temperature. The easiest configuration is a series where each flip-flop is a divide-by-2. For a series of three of these, such a system would be a divide-by-8. By adding additional logic gates to the chain of flip-flops, other division ratios can be obtained. Integrated circuit logic families can provide a single chip solution for some common division ratios.
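A binary counter used as a power-of-2 divider can be modelled in a few lines. The following Python sketch is a behavioural illustration only; the function names `counter_bit_waveforms` and `period_in_clocks` are hypothetical, and each sample corresponds to one input clock cycle.

```python
def counter_bit_waveforms(num_bits: int, num_clocks: int) -> list:
    """Simulate a binary counter clocked by the input; return one waveform per bit."""
    count = 0
    waves = [[] for _ in range(num_bits)]
    for _ in range(num_clocks):
        count = (count + 1) % (1 << num_bits)   # counter advances on each input clock
        for bit in range(num_bits):
            waves[bit].append((count >> bit) & 1)
    return waves


def period_in_clocks(wave: list) -> int:
    """Smallest shift p such that the sampled waveform repeats with period p."""
    for p in range(1, len(wave)):
        if all(wave[i] == wave[i + p] for i in range(len(wave) - p)):
            return p
    return len(wave)
```

For a 3-bit counter, the three output bits repeat every 2, 4 and 8 input clocks respectively, so the most significant bit acts as a divide-by-8 output.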

Another popular circuit to divide a digital signal by an even integer multiple is a Johnson counter. This is a type of shift register network that is clocked by the input signal. The last register’s complemented output is fed back to the first register’s input. The output signal is derived from one or more of the register outputs. For example, a divide-by-6 divider can be constructed with a 3-register Johnson counter. The six valid states of the registers are 000, 100, 110, 111, 011, and 001. This pattern repeats as the network is clocked by the input signal. The output of each register is an f/6 square wave with 60° of phase shift between registers. Additional registers can be added to provide additional integer divisors.
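The state sequence of the 3-register Johnson counter can be reproduced with a short behavioural model. This Python sketch is illustrative only; `johnson_counter_states` is a hypothetical name, and the registers are assumed to start from the all-zero state.

```python
def johnson_counter_states(num_registers: int, num_clocks: int) -> list:
    """Simulate a Johnson counter; each state is a string, first register leftmost."""
    regs = [0] * num_registers
    states = []
    for _ in range(num_clocks):
        states.append("".join(str(b) for b in regs))
        # Shift by one position; the inverted output of the last register
        # feeds the input of the first register.
        regs = [1 - regs[-1]] + regs[:-1]
    return states
```

With three registers the sequence has six states before it repeats, so each register output completes one cycle every six input clocks.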

Mixed Signal Division

An arrangement of D flip-flops is a classic method for integer-n division. Such division is frequency and phase coherent to the source over environmental variations including temperature. The easiest configuration is a series where each D flip-flop is a divide-by-2. For a series of three of these, such a system would be a divide-by-8. More complicated configurations have been found that generate odd factors such as a divide-by-5. Standard, classic logic chips that implement this or similar frequency division functions include the 7456, 7457, 74292, and 74294.

Fractional-n Dividers

A fractional-n frequency synthesizer can be constructed using two integer dividers, a divide-by-n and a divide-by-(n + 1) frequency divider. With a modulus controller, n is toggled between the two values so that the VCO alternates between one locked frequency and the other. The VCO stabilizes at a frequency that is the time average of the two locked frequencies. By varying the percentage of time the frequency divider spends at the two divider values, the frequency of the locked VCO can be selected with very fine granularity.
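The time-averaging behind fractional-n division can be shown with simple arithmetic. The Python sketch below is an illustration under the assumption that the loop is locked, so every divider output cycle lasts one reference period; `average_division_ratio` and its parameters are hypothetical names.

```python
from fractions import Fraction


def average_division_ratio(n: int, cycles_at_n: int, cycles_at_n_plus_1: int) -> Fraction:
    """Time-averaged ratio when the divider alternates between n and n + 1."""
    total = cycles_at_n + cycles_at_n_plus_1
    if total <= 0:
        raise ValueError("need at least one divider cycle")
    # VCO cycles counted over the whole pattern, divided by reference cycles.
    return Fraction(n * cycles_at_n + (n + 1) * cycles_at_n_plus_1, total)
```

For example, dividing by 10 for three cycles and by 11 for one cycle gives an average ratio of 41/4 = 10.25, so the VCO would settle at 10.25 times the reference frequency.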

Clock Distribution

Typically, the reference clock enters the chip and drives a Phase Locked Loop (PLL), which then drives the system’s clock distribution. The clock distribution is usually balanced so that the clock arrives at every endpoint simultaneously. One of those endpoints is the PLL’s feedback input. The function of the PLL is to compare the distributed clock to the incoming reference clock, and vary the phase and frequency of its output until the reference and feedback clocks are phase and frequency matched.

Figure 4 shows how three 8 × 8 bit PEs are used to realize a 16 × 16 bit multiplier. The 32 × 32 bit multiplier is constructed using a similar approach but requires 3 × 3 PEs. A 3-bit control word defines which PEs work concurrently and which PEs are disabled. Whenever the full precision (32 × 32 bit) is not exercised, the supply voltage and the clock frequency may be scaled down according to the actual workload.

Figure 3: Clock Distribution Network Circuit

Figure 4: PEs Combined to Form 16 × 16 Bit Multiplier
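One way to obtain a 16 × 16 bit product from only three 8 × 8 bit products is a Karatsuba-style decomposition, in which the fourth multiplication is replaced by additions and subtractions. The Python sketch below assumes that formulation for illustration; the paper does not spell out its exact arithmetic, and the names `pe_8x8` and `multiply_16x16_three_pes` are hypothetical.

```python
def pe_8x8(a: int, b: int) -> int:
    """Hypothetical 8 x 8 bit processing element: unsigned multiply."""
    assert 0 <= a < 256 and 0 <= b < 256
    return a * b


def multiply_16x16_three_pes(x: int, y: int) -> int:
    """16 x 16 bit unsigned multiply using three 8 x 8 products plus adds/subtracts."""
    assert 0 <= x < (1 << 16) and 0 <= y < (1 << 16)
    x_hi, x_lo = x >> 8, x & 0xFF
    y_hi, y_lo = y >> 8, y & 0xFF

    high = pe_8x8(x_hi, y_hi)
    low = pe_8x8(x_lo, y_lo)

    # Third product on the operand differences; the magnitudes fit in 8 bits,
    # so the sign is tracked separately.
    dx, dy = x_hi - x_lo, y_lo - y_hi
    mid = pe_8x8(abs(dx), abs(dy))
    if (dx < 0) != (dy < 0):
        mid = -mid

    # x_hi*y_lo + x_lo*y_hi == high + low + (x_hi - x_lo)*(y_lo - y_hi)
    cross = high + low + mid
    return (high << 16) + (cross << 8) + low
```

Under this assumption, applying the same split once more at the 32-bit level would need three 16 × 16 products of three PEs each, which is consistent with the 3 × 3 PEs mentioned for the 32 × 32 bit multiplier.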




Figure 5: RTL Schematic for Razor Multiplier

Figure 6: Internal View of RTL Schematic

RESULTS

RTL Schematic

The RTL schematic gives the user view of the design. The internal blocks contain the basic gate representation of the logic. This basic gate realization depends purely on the selected FPGA and the internal database information.

Waveform

In the waveform (Figure 7), the clk signal represents the clock, the rst signal represents the reset, and the a and b signals represent the inputs applied to the design. Similarly, prod (product) is the output signal of the design. Here the clock signal is generated for positive-edge operation.

Initially, the reset signal should be forced to logic 1 and, after one clock cycle, set to logic 0 so that the design performs its functional operation. To obtain the required outputs, force the inputs to the required values. The output product value is obtained by multiplying the inputs a and b.




CONCLUSION

A novel MP multiplier architecture has been presented and compared with its conventional 32 × 32 bit fixed-width multiplier counterpart. For this MP multiplier architecture, a power consumption of 1.2 mW and a delay of 4.28 are achieved, which shows the power optimization of the proposed design. The design adapts at run time to the actual workload while meeting throughput requirements. The HDL is developed in the Verilog language. The RTL is simulated and synthesized in XILINX ISE.

Table 1: Design and Summary Reports

Figure 7: Test Bench Waveform

REFERENCES

1. Bhardwaj M, Min R and Chandrakasan A (2001), “Quantifying and Enhancing Power Awareness of VLSI Systems”, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., Vol. 9, No. 6, pp. 757-772.

2. Carbognani F, Buergin F, Felber N, Kaeslin H and Fichtner W (2008), “Transmission Gates Combined with Level-Restoring CMOS Gates Reduce Glitches in Low-Power Low-Frequency Multipliers”, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., Vol. 16, No. 7, pp. 830-836.

3. Kuang S-R and Wang J-P (2010), “Design of Power-Efficient Configurable Booth Multiplier”, IEEE Trans. Circuits Syst. I, Reg. Papers, Vol. 57, No. 3, pp. 568-580.

4. Kuroda T (1999), “Low Power CMOS Digital Design for Multimedia Processors”, in Proc. Int. Conf. VLSI CAD, October, pp. 359-367.

5. Lee H (2004), “A Power-Aware Scalable Pipelined Booth Multiplier”, in Proc. IEEE Int. SOC Conf., September, pp. 123-126.

6. Ling W and Savaria Y (2004), “Variable-Precision Multiplier for Equalizer with Adaptive Modulation”, in Proc. 47th Midwest Symp. Circuits Syst., Vol. 1, July, pp. 1553-1556.

7. Min R, Bhardwaj M, Cho S-H, Ickes N, Shih E, Sinha A, Wang A and Chandrakasan A (2002), “Energy-Centric Enabling Technologies for Wireless Sensor Networks”, IEEE Wirel. Commun., Vol. 9, No. 4, pp. 28-39.

8. Pfander O A, Hacker R and Pfleiderer H-J (2004), “A Multiplexer-Based Concept for Reconfigurable Multiplier Arrays”, in Proc. Int. Conf. Field Program. Logic Appl., Vol. 3203, September, pp. 938-942.

9. Wang A and Chandrakasan A (2003), “Energy-Aware Architectures for a Realvalued FFT Implementation”, in Proc. IEEE Int. Symp. Low Power Electron. Design, August, pp. 360-365.

10. Yamanaka T and Moshnyaga V G (2004), “Reducing Multiplier Energy by Data-Driven Voltage Variation”, in Proc. IEEE Int. Symp. Circuits Syst., May, pp. 285-288.
