
7/31/2019 Liao


Automated Design Techniques for Low-Power High-Speed Circuits

On a Self-Configuring 64-bit Wallace Tree Multiplier

EE241: Advanced Digital Integrated Circuits

Midterm Report

Zhujie Lin ([email protected]) Michael Liao ([email protected])

ABSTRACT

This paper presents techniques to reduce power consumption in arithmetic logic units (ALUs) while improving performance. This design paradigm takes advantage of varying input widths to enable evaluation with partial ALU activation. We will demonstrate that partial ALU evaluations have shorter critical paths, which enables us to increase clock speed. A shorter clock period means the circuit can spend less time in operation mode and more time in a power-saving mode such as sleep mode or clock gating mode. The proposed techniques will be implemented and benchmarked on various large-input dynamic Wallace Tree multipliers using 90nm technology. Additional power-saving circuitry, influenced by static CMOS sleep mode circuits and clock gating in dynamic logic, is added onto an existing dynamic Wallace Tree multiplier. These additions look at the incoming input bits and determine which part of the tree to precharge using a self-generated variable clock. The design follows the philosophy of off until used by using a partial tree for partial computations. The goal is to see whether these circuit additions significantly reduce leakage power while increasing overall performance.

____________________________________________________________________________________________________

I. INTRODUCTION

The current trend in CPU and multimedia evolution emphasizes the increasing width and accuracy of ALUs. Flagship CPUs from both Intel and AMD have recently upgraded to 64 bits, and GPUs from NVidia and ATI now make use of 32-bit FPUs. While these hardware changes enable personal computers to perform on par with the high-end computers of yesteryear, software does not always take advantage of them. Powering unused hardware presents a direct obstacle in today's paradigm of low-power high-speed circuits.

We wish to approach the need for low-power high-speed circuits with a different philosophy: off by default. The circuit will intelligently turn on only the paths necessary for execution, and this process will be shown to have a faster execution time.

Conventional methods for power reduction rely on reducing voltage, and on reducing frequency to allow lower-voltage operation. By the equation $P = \alpha C V^2 F$, power consumption appears to be reduced.

A mobile processor of the current generation (Pentium M) represents this approach: ALUs are turned off when no computation is necessary, clock speed is reduced under light load, and voltage is reduced. This approach has several shortcomings: shutting down the ALU is impossible while processing a multimedia stream; a fixed computation requires the same number of clock cycles regardless of clock speed; and there is no way to increase performance without increasing power consumption.

We can reduce power consumption by relying on $\alpha$. Instead of depending only on the general usage of a block, $\alpha$ should also depend on the width of operation and the length of the clock. These are the revised equations:

$$P = \alpha' C V^2 F'$$

$$\alpha' = \alpha_0 \, \alpha_{width} \, \alpha_{clock}$$

$$F' = \frac{F_0}{\alpha_{clock}}$$

It is safe to assume a 64-bit multiplier will not always be performing 64-bit multiplications, so $\alpha_{width}$ will likely be much less than 1. Since a reduced width of operation shortens the critical path of computation, $\alpha_{clock}$ represents the reduction in the time a circuit spends in operation mode. The flip side of a reduced clock period is the option to increase frequency, and because of $\alpha_{width}$, there are still overall power savings. This is an attractive choice previously unavailable to circuit designers who reduce power by scaling down voltage.
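As a rough numerical illustration, the sketch below evaluates the power equation under one reading of the revised relations, namely $\alpha' = \alpha_0 \alpha_{width} \alpha_{clock}$ and $F' = F_0 / \alpha_{clock}$. The capacitance, voltage, base frequency, and activity factors are illustrative assumptions, not measured values from the design.

```python
def dynamic_power(alpha, c, v, f):
    """Dynamic switching power P = alpha * C * V^2 * F, in watts."""
    return alpha * c * v ** 2 * f


def revised_power(alpha0, alpha_width, alpha_clock, c, v, f0):
    """Power under the assumed revised relations.

    alpha' = alpha0 * alpha_width * alpha_clock
    F'     = F0 / alpha_clock
    """
    alpha_prime = alpha0 * alpha_width * alpha_clock
    f_prime = f0 / alpha_clock
    return dynamic_power(alpha_prime, c, v, f_prime)


# Illustrative placeholder values (assumptions, not from the report).
C = 1e-12   # switched capacitance in farads
V = 1.0     # supply voltage in volts
F0 = 1e9    # base clock frequency in hertz

# Full-width operation at the base clock.
baseline = dynamic_power(0.5, C, V, F0)

# Quarter-width activity with the clock period halved.
partial = revised_power(0.5, 0.25, 0.5, C, V, F0)

print(f"baseline: {baseline:.3e} W, partial: {partial:.3e} W")
```

Under this reading the $\alpha_{clock}$ terms cancel, so the ratio of the two results equals $\alpha_{width}$ even though the partial case runs at twice the frequency.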

Section II investigates the problems with our benchmark, a normal Wallace Tree multiplier implemented in dynamic logic. Section III discusses the circuits and concepts needed to solve the power and performance issues of the benchmark multiplier. We present the methods for automating such changes in Section IV. Finally, Section V outlines our testing methodology.

II. SHORTCOMINGS OF BENCHMARK MULTIPLIER

A. DYNAMIC LOGIC

In most dynamic logic designs, some kind of level-restoring device is used to alleviate the problem of charge leakage on the output nodes (see Figure 1). These level restorers add extra intrinsic capacitance as well as a static leakage current that increases the power consumption of the circuit. In high-speed applications (~5 GHz), each evaluation has only 0.1ns to complete (with the other 0.1ns for precharge), so a level-restoring device may incur too much performance overhead per operation to be viable. To prevent leakage power consumption at critical nodes through the precharge device, clock gating is introduced, which enables precharge of the device only when it is in use, not while the device is inactive. Section III explains in detail the use of clock gating to improve overall performance and prevent leakage power consumption.

Figure 1: Dynamic Logic w/ Level Restorer. [Schematic labels: VDD, PDN, GND, Level Restorer, F, Leakage Node.]

B. WALLACE TREE MULTIPLIER

In a variety of applications, a basic high-speed Wallace Tree multiplier implemented in dynamic logic does not have optimal performance or power consumption. In multimedia applications, where the multiplier is always on and the input bit lengths are highly correlated, the dynamic Wallace Tree consumes power unnecessarily because the multiplier is precharged every cycle. Since the input bits are correlated, if the inputs do not require the entire Wallace Tree to compute, then parts of the Wallace Tree will be inactive for long periods of time. But these parts still leak charge and still get recharged by the precharge devices. If those parts of the multiplier can be turned off, power consumption can be reduced significantly.

In microprocessors, where the multiplier can be idle for long periods of time, having a constant clock precharge the critical nodes also results in unnecessary power consumption. In this case, having a sleep mode to disconnect the multiplier from the supplies makes sense.

When running legacy code (such as 16-bit and 32-bit operations) on a 64-bit multiplier, the whole word length is never used, and therefore precharging only the active parts will yield optimal power consumption.

III. PROPOSED ADDITIONS TO THE MULTIPLIER

In the discussion of power saving, the following modes of operation need to be defined: clock gating, sleep, and operation. These modes define the power consumption of each datapath depending on the word length of the input.

Clock gating. When the circuit senses that a datapath will not be used to evaluate the current input, this particular datapath will remain in precharge mode. The ability to do so reduces dynamic power dissipation, since the clock charges and discharges the input capacitances of the precharge PMOS transistors. Power consumption is $P = \alpha C V^2 F$, and $\alpha$ for a continuously active clock is 1. But $\alpha$ for a gated clock depends on the input it receives and is likely to be much less than 1.

Sleep mode. In the event that a datapath remains unused for a prolonged period of time, that particular datapath will be shut down and all precharged nodes are allowed to leak away. Sleep mode is achieved by turning off both the precharge PMOS and the evaluate NMOS; this introduces two large resistances and thus minimizes the power consumption of unused circuitry.

Operation mode. If a datapath is determined to be necessary for processing data, that particular datapath will receive a clock signal to precharge the path if it was in sleep mode, or to evaluate directly if the path was in clock gating mode. The length of the clock may vary depending on the bit length of the input; this results in reduced circuit activity time in comparison to fixed-clock operation.

Figure 2: Clock Gating, Sleep Mode, Operation Mode. [Three dynamic gates, each drawn with VDD, PDN, and GND; the first panel is annotated with the values 0, 0 and the second with 1, 0.]

In order to obtain the aforementioned modes of operation and to maximize power savings, several circuits need to be implemented: most significant bit (MSB) detection, a variable duty cycle clock, a datapath state selector, and a data multiplexer.

Most significant bit detection. This circuit determines the bit length of the incoming data. MSB detection must be fast and efficient since it controls the length of the clock to reflect operation time, the arrangement of data for top calculation efficiency, and the state of every datapath.

Figure 3: MSB Detection Circuit
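The behavior expected of MSB detection can be modeled in software as follows. This is only a behavioral sketch, not the circuit itself; the function names and the 16-bit datapath granularity are assumptions chosen to match the 16-, 32-, and 64-bit legacy widths discussed in this report.

```python
def bit_length(value, width=64):
    """Position of the most significant set bit (0 when value is zero)."""
    if not 0 <= value < (1 << width):
        raise ValueError("value does not fit in the given width")
    return value.bit_length()


def active_width(a, b, width=64, granularity=16):
    """Width of the datapath that must be enabled for operands a and b.

    The longer operand decides the width, rounded up to the next
    datapath boundary. The granularity is a hypothetical parameter.
    """
    longest = max(bit_length(a, width), bit_length(b, width))
    if longest == 0:
        return 0  # both operands are zero, nothing needs to be enabled
    # Ceiling division, then scale back to a bit count.
    return -(-longest // granularity) * granularity


print(active_width(0xFFFF, 3))    # fits in the lowest 16-bit slice
print(active_width(1 << 16, 1))   # needs the next slice as well
```

The returned width could then drive both the clock length and the per-datapath state, which is the role the text assigns to MSB detection.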

Variable duty cycle clock. This clock generator is shown in Figure 4. The reason for having a variable clock is to reduce the operation time of the circuitry. This benefits us in two ways: first, we have the ability to run the circuit at a higher clock speed depending on the complexity of the operation; second, the less time a circuit spends in operation mode, the less current is leaked away.

Figure 4: Variable Clock Generator. [Signal labels: EN, CLK.]

Datapath state selector. Each datapath has a different utilization rate. In our benchmark multiplier, in highly uncorrelated operation, every bit can be considered a noise signal with a 50% utilization rate when active; in correlated operation, the most significant bits see very few transitions while the least significant bits still have a noise distribution; in sparse computation mode, idle prevails, so the utilization rate of every bit is minimal; in legacy mode, we are guaranteed that a select set of bits will never be used. Our datapath state selector must have the following attributes: minimal operational time, so that when data is highly uncorrelated the datapath does not take long to switch modes; and a careful choice between clock gating and sleep modes when the data is correlated or mostly idle, because it takes a while to bring elements from sleep to active as every node needs to be recharged.
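One simple policy for choosing between the three modes is sketched below. It is an illustrative assumption rather than the selector proposed here: a datapath that is needed goes to operation mode, a briefly idle one is clock gated, and one that has been idle for at least `sleep_threshold` cycles is put to sleep. The threshold value is hypothetical and would in practice be set by the cost of recharging every node on wake-up.

```python
from enum import Enum


class Mode(Enum):
    OPERATION = "operation"
    CLOCK_GATING = "clock_gating"
    SLEEP = "sleep"


def select_mode(needed, idle_cycles, sleep_threshold=8):
    """Pick a mode for one datapath in the current cycle."""
    if needed:
        return Mode.OPERATION
    if idle_cycles >= sleep_threshold:
        return Mode.SLEEP
    return Mode.CLOCK_GATING


def trace_modes(usage, sleep_threshold=8):
    """Apply select_mode over a per-cycle usage sequence of booleans."""
    modes = []
    idle = 0
    for needed in usage:
        # Count consecutive idle cycles, resetting when the path is used.
        idle = 0 if needed else idle + 1
        modes.append(select_mode(needed, idle, sleep_threshold))
    return modes


example = trace_modes([True, False, False, False, True], sleep_threshold=2)
print([m.value for m in example])
```

A small threshold favors leakage savings, while a large one avoids the wake-up penalty on correlated data.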

Data multiplexer. When dealing with two inputs, their relative bit lengths may vary, and a fixed circuit is more easily optimized for the condition that A is as long as or longer than B. A data multiplexer is thus needed to route the data into the operational circuitry so that this condition is always satisfied. This enables regularly structured operational circuitry to compute data more efficiently.
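The routing condition can be expressed in a few lines. The sketch below is a behavioral model with a hypothetical function name; it relies on multiplication being commutative, so swapping the operands does not change the product.

```python
def route_operands(a, b):
    """Order two non-negative operands so the first is at least as long.

    Returns (long_operand, short_operand, swapped), where swapped tells
    whether the inputs were exchanged.
    """
    if a < 0 or b < 0:
        raise ValueError("operands must be non-negative")
    if a.bit_length() >= b.bit_length():
        return a, b, False
    return b, a, True


long_op, short_op, swapped = route_operands(3, 0xFFFF)
print(hex(long_op), hex(short_op), swapped)
```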

IV. DESIGN METHODOLOGY

In showcasing our power reduction and performance boosting circuits for general application, we have developed a full suite of implementation techniques to quickly convert any standard ALU design. The philosophy of design automation requires scripting to reflect the structural regularity of circuits. The goal is to generate an ALU of any length containing the circuitry described in the previous section. Some parts of ALU structures are highly repetitive, while others are placed irregularly, so there are techniques to deal with each situation: scripting for regular circuits and scripting for irregular circuits.

Scripting for regular circuits. In our Wallace Tree example, we can see in Figure 5 that a 5-bit Wallace Tree is just a 4-bit tree with an extra row of adders and a lengthened vector add unit. We can exploit this structural regularity to generate Wallace Trees of any bit length. There are two fewer rows of parallel adders than the number of input bits, each row has two more full adders than the previous row, and the rows are then followed by a vector adder twice as long as the number of bits (see Figure 6). The most structured part of a Wallace Tree is the block of AND gates; it is simply a square with side width equal to the number of bits. The setup circuitry, such as bit detection and the data multiplexer, scales linearly with the number of bits. The result of such scripting will feature similarly named circuit elements with slight variations in numbering to differentiate one adder from another.

Figure 5: Regularity of a 4-bit by 4-bit Wallace Tree [1]
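A minimal sketch of such a generator is given below, following the counting rules stated above: an n-by-n block of AND gates, n - 2 rows of parallel adders that each grow by two, and a vector adder of length 2n. The instance naming scheme and the number of adders in the first row (`first_row_adders`) are assumptions made for illustration.

```python
def generate_wallace_names(n_bits, first_row_adders=2):
    """Generate instance names for the regular parts of an n-bit tree."""
    if n_bits < 3:
        raise ValueError("need at least 3 bits to have an adder row")

    # Square block of AND gates that forms the partial products.
    and_gates = [f"and_{r}_{c}" for r in range(n_bits) for c in range(n_bits)]

    # n - 2 rows of parallel adders, each two adders wider than the last.
    adder_rows = []
    for row in range(n_bits - 2):
        count = first_row_adders + 2 * row  # first-row size is assumed
        adder_rows.append([f"fa_r{row}_{i}" for i in range(count)])

    # Final vector adder, twice as long as the number of bits.
    vector_adder = [f"va_{i}" for i in range(2 * n_bits)]

    return {
        "and_gates": and_gates,
        "adder_rows": adder_rows,
        "vector_adder": vector_adder,
    }


tree4 = generate_wallace_names(4)
tree5 = generate_wallace_names(5)
print(len(tree4["adder_rows"]), len(tree5["adder_rows"]))
```

With this scheme the adder rows of the 5-bit tree begin with exactly the rows of the 4-bit tree, which mirrors the regularity described in the text.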

Scripting for irregular circuits. The input and output networks between every row of parallel adders in a Wallace Tree are highly irregular; some adders might take an original input, some might take a carry and a save, and some might take other combinations of original, carry, and save. On top of the irregular wiring, there is a need to simulate the wiring resistance and capacitance leading from one node to another, and the resulting model must also reflect the varying lengths of the paths. A data structure is necessary to automate the generation of such wiring networks. Whereas in a regularly structured script the circuit elements can be generated on the fly with small variations in numbering, an irregularly structured network requires the names to be entered into a database. The entries can be referred to by their relative positions in the circuit, and can also be updated with new names as more circuit elements are connected hierarchically. The wiring network in a Wallace Tree multiplier can be represented in 3D: the top level represents a row of adders, each outputting a sum and a carry wire, as can be seen in Figure 6. The relative positions of these two wires are known, so they can be entered into the database in the correct locations. The level below these adders is an interconnect network; its wires might extend an original wire or represent the sum or carry wires leading to the next adder. These vary in length depending on their originating points. With a database, these attributes are remembered and thus the correct values for resistance and capacitance can be extracted.

Figure 6: Hierarchical Structure of the Wallace Tree [1]
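One possible shape for such a database is sketched below. It is a hypothetical illustration: wires are keyed by their relative position (level, column) and kind, can be renamed as the hierarchy grows, and yield resistance and capacitance from their stored length. The class names and the per-micrometre resistance and capacitance values are assumptions, not figures from the 90nm process.

```python
from dataclasses import dataclass


@dataclass
class Wire:
    name: str          # net name used in the generated netlist
    kind: str          # "input", "sum", or "carry"
    length_um: float   # routed length in micrometres


class WireDatabase:
    """Wires indexed by relative position in the tree."""

    def __init__(self, r_per_um, c_per_um):
        self.r_per_um = r_per_um   # ohms per micrometre (assumed)
        self.c_per_um = c_per_um   # farads per micrometre (assumed)
        self._wires = {}

    def add(self, level, column, wire):
        self._wires[(level, column, wire.kind)] = wire

    def get(self, level, column, kind):
        return self._wires[(level, column, kind)]

    def rename(self, level, column, kind, new_name):
        """Update a name once the element is connected hierarchically."""
        self._wires[(level, column, kind)].name = new_name

    def parasitics(self, level, column, kind):
        """Return (resistance, capacitance) scaled by wire length."""
        wire = self.get(level, column, kind)
        return self.r_per_um * wire.length_um, self.c_per_um * wire.length_um


db = WireDatabase(r_per_um=0.5, c_per_um=2e-16)
db.add(0, 3, Wire("fa_r0_3_sum", "sum", 40.0))
db.add(0, 3, Wire("fa_r0_3_carry", "carry", 10.0))
db.rename(0, 3, "sum", "top.fa_r0_3_sum")
resistance, capacitance = db.parasitics(0, 3, "sum")
print(db.get(0, 3, "sum").name, resistance, capacitance)
```

A linear length model is used here only for simplicity; a netlist generator could substitute whatever wire model the simulation requires.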

V. TESTING METHODOLOGY

We will benchmark our proposed additions using 90nm ST Microelectronics standard cell technology. The proposed 64-bit by 64-bit multiplier will be compared against a static CMOS design and a basic dynamic design of the same input size, as well as smaller word length multipliers (i.e. 16b by 16b and 32b by 32b Wallace Trees), for power consumption and propagation delay. We will test the full range of operation by using a specific set of test values that vary input word lengths, interspersed with periods of inactivity.

Input Length Choices. We can see that for the basic Wallace Tree multiplier, a 1-bit by 64-bit multiplication activates a different part of the tree than a 64-bit by 1-bit multiplication. The two operations have different power consumption as well as different propagation delays. However, our proposed design should be unaffected by the input order. Also, a 32-bit by 32-bit multiply on our proposed 64-bit multiplier will be tested against a pure 32-bit Wallace Tree as well as the two basic 64-bit Wallace Trees. We want to see whether our design still has power and performance advantages over a dedicated 32-bit multiplier.

Input Sequences. We will choose input sequences that incur the maximal switching activity in the test multipliers to test the input extremes for dynamic switching power and propagation delay. Long periods of inactivity are injected to test the advantages of the sleep mode and inactivity detection in our proposed design. Figure 7 below shows the various input bit width sequences we will test.

Figure 7: Various Testing Input Sequences. [Four panels plotting No. of Bits against Time: Multimedia (Bits Correlated), Noise (Bits Uncorrelated), Idle Periods, Legacy (16-, 32-, or 64-bit).]
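The four kinds of bit width sequences could be generated along the following lines. This is a sketch under stated assumptions: the step size of the correlated walk, the burst and gap lengths, the use of 0 to mark an idle cycle, and all function names are illustrative choices rather than the actual test vectors.

```python
import random


def multimedia(n, width=64, step=2, seed=0):
    """Correlated widths: each value stays within `step` of the last."""
    rng = random.Random(seed)
    bits = width // 2
    out = []
    for _ in range(n):
        bits = min(width, max(1, bits + rng.randint(-step, step)))
        out.append(bits)
    return out


def noise(n, width=64, seed=0):
    """Uncorrelated widths drawn uniformly from 1..width."""
    rng = random.Random(seed)
    return [rng.randint(1, width) for _ in range(n)]


def idle_periods(n, width=64, burst=4, gap=12):
    """Bursts of full-width activity separated by idle cycles (0)."""
    return [width if i % (burst + gap) < burst else 0 for i in range(n)]


def legacy(n, seed=0):
    """Widths restricted to 16, 32, or 64 bits."""
    rng = random.Random(seed)
    return [rng.choice((16, 32, 64)) for _ in range(n)]


def operand_of_width(bits, rng):
    """Random operand whose most significant set bit is at `bits`."""
    if bits <= 1:
        return bits  # 0 for an idle cycle, 1 for a 1-bit operand
    return (1 << (bits - 1)) | rng.getrandbits(bits - 1)


rng = random.Random(1)
operands = [operand_of_width(b, rng) for b in legacy(8)]
print([x.bit_length() for x in operands])
```

Fixed seeds keep the sequences reproducible, so every multiplier under comparison sees the same stimulus.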

VI. REFERENCES

1. J. Rabaey, A. Chandrakasan, B. Nikolić, Digital Integrated Circuits, 2nd ed.
