a low-power delay buffer using gated driver tree

8
1212 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 17, NO. 9, SEPTEMBER 2009 A Low-Power Delay Buffer Using Gated Driver Tree Po-Chun Hsieh, Jing-Siang Jhuang, Pei-Yun Tsai, Member, IEEE, and Tzi-Dar Chiueh, Senior Member, IEEE Abstract—This paper presents circuit design of a low-power delay buffer. The proposed delay buffer uses several new tech- niques to reduce its power consumption. Since delay buffers are accessed sequentially, it adopts a ring-counter addressing scheme. In the ring counter, double-edge-triggered (DET) flip-flops are utilized to reduce the operating frequency by half and the C-ele- ment gated-clock strategy is proposed. A novel gated-clock-driver tree is then applied to further reduce the activity along the clock distribution network. Moreover, the gated-driver-tree idea is also employed in the input and output ports of the memory block to decrease their loading, thus saving even more power. Both simulation results and experimental results show great improve- ment in power consumption. A 256 8 delay buffer is fabricated and verified in 0.18 m CMOS technology and it dissipates only 2.56 mW when operating at 135 MHz from 1.8-V supply voltage. Index Terms—C-element, delay buffer, first-in–first-out (FIFO), gated-clock, ring-counter. I. INTRODUCTION P ORTABLE multimedia and communication devices have experienced explosive growth recently. Longer battery life is one of the crucial factors in the widespread success of these products. As such, low-power circuit design for multimedia and wireless communication applications has become very impor- tant. In many such products, delay buffers (line buffers, delay lines) make up a significant portion of their circuits [1]–[3]. Such serial access memory is needed in temporary storage of signals that are being processed, e.g., delay of one line of video signals, delay of signals within a fast Fourier transform (FFT) architec- tures [4], and delay of signals in a delay correlator [2]. Currently, most circuits adopt static random access memory (SRAM) plus some control/addressing logic to implement delay buffers. For smaller-length delay buffers, shift register can be used instead. The former approach is convenient since SRAM compilers are readily available and they are optimized to generate memory modules with low power consumption and high operation speed with a compact cell size. The latter approach is also convenient since shift register can be easily synthesized, though it may consume much power due to unnecessary data movement. Previously, a simplified and thus lower-power sequential ad- dressing scheme for SRAM application in delay buffers is pro- posed in [5]. A ring counter is used to point to the target words Manuscript received June 29, 2007; revised April 29, 2008. First published March 16, 2009; current version published August 19, 2009. This work was sup- ported in part by the National Science Council, Taiwan, under Grant NSC-95- 2219-E-002-020 and NSC-96-2220-E-002-020. P.-C. Hsieh, J.-S. Jhuang, and T.-D. Chiueh are with the Graduate Institute of Electronics Engineering and the Department of Electrical Engineering, National Taiwan University, Taipei 10617, Taiwan. P.-Y. Tsai was with the Graduate Institute of Electronics Engineering and the Department of Electrical Engineering, National Taiwan University, Taipei 10617, Taiwan. She is now with the Department of Electrical Engineering, Na- tional Central University, Jhong-Li 32001, Taiwan. Digital Object Identifier 10.1109/TVLSI.2008.2004704 to be written-in and read-out. Since the ring counter is made up of an array of D-type flip-flops (DFFs) triggered by a global clock signal and all except one DFFs have a value of “0,” it is possible to disable the clock signal to most DFFs. Such a gated-clock ring counter is implemented in [6] to compose a low-power first-in–first-out (FIFO) memory. In this paper, we propose to use double-edge-triggered (DET) flip-flops instead of traditional DFFs in the ring counter to halve the operating clock frequency. A novel approach using the C-el- ements instead of the R–S flip-flops in the control logic for gen- erating the clock-gating signals is adopted to avoid increasing the loading of the global clock signal. In addition to gating the clock signal going to the DET flip-flops in the ring counter, we also proposed to gate the drivers in the clock tree. The tech- nique will greatly decrease the loading on distribution network of the clock signal for the ring counter and thus the overall power consumption. The same technique is applied to the input driver and output driver of the memory part in the delay buffer. In a delay buffer based on the SRAM cell array such as the one in [6], the read/write circuitry is through the bit lines that work as data buses. In the proposed new delay buffer, we use a tree hier- archy for the read/write circuitry of the memory module. For the write circuitry, in each level of the driver tree, only one driver along the path leading to the addressed memory word is acti- vated. Similarly, a tree of multiplexers and gated drivers com- prise the read circuitry for the proposed delay buffer. Simula- tion results show the effectiveness of the above techniques in power reduction. As an example, a 256 8 delay buffer chip is designed and fabricated. Measured results indicate its much better power performance than the same-size delay buffer based on existing commercial SRAM. The rest of this paper is organized as follows. Section II first introduces the conventional architecture for implementing delay buffers. Next, the proposed delay buffer using the new ring counter and gated driver trees for the read and write circuits of the memory module is described in Section III. Section IV then presents experimental results of the new delay buffer. Also, comparison in power and area of the new delay buffer with conventional SRAM-based delay buffers are given. Section V then concludes this paper. II. CONVENTIONAL DELAY BUFFERS The simplest way to implement a delay buffer is to use shift registers as shown in Fig. 1. If the buffer length is and the word-length is , then a total of DFFs are required, and it can be quite large if a standard cell for DFF is used. In addition, this approach can consume huge amount of power since on the average binary signals make transitions in every clock cycle. As a result, this implementation is usually used in short delay buffers, where area and power are of less concern. 1063-8210/$26.00 © 2009 IEEE Authorized licensed use limited to: Dayananda Sagar College of Engineering. Downloaded on February 23,2010 at 02:13:21 EST from IEEE Xplore. Restrictions apply.

Upload: judeesh-jacob

Post on 17-Oct-2014

70 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: A Low-Power Delay Buffer Using Gated Driver Tree

1212 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 17, NO. 9, SEPTEMBER 2009

A Low-Power Delay Buffer Using Gated Driver TreePo-Chun Hsieh, Jing-Siang Jhuang, Pei-Yun Tsai, Member, IEEE, and Tzi-Dar Chiueh, Senior Member, IEEE

Abstract—This paper presents circuit design of a low-powerdelay buffer. The proposed delay buffer uses several new tech-niques to reduce its power consumption. Since delay buffers areaccessed sequentially, it adopts a ring-counter addressing scheme.In the ring counter, double-edge-triggered (DET) flip-flops areutilized to reduce the operating frequency by half and the C-ele-ment gated-clock strategy is proposed. A novel gated-clock-drivertree is then applied to further reduce the activity along the clockdistribution network. Moreover, the gated-driver-tree idea is alsoemployed in the input and output ports of the memory blockto decrease their loading, thus saving even more power. Bothsimulation results and experimental results show great improve-ment in power consumption. A 256 8 delay buffer is fabricatedand verified in 0.18 m CMOS technology and it dissipates only2.56 mW when operating at 135 MHz from 1.8-V supply voltage.

Index Terms—C-element, delay buffer, first-in–first-out (FIFO),gated-clock, ring-counter.

I. INTRODUCTION

P ORTABLE multimedia and communication devices haveexperienced explosive growth recently. Longer battery life

is one of the crucial factors in the widespread success of theseproducts. As such, low-power circuit design for multimedia andwireless communication applications has become very impor-tant. In many such products, delay buffers (line buffers, delaylines) make up a significant portion of their circuits [1]–[3]. Suchserial access memory is needed in temporary storage of signalsthat are being processed, e.g., delay of one line of video signals,delay of signals within a fast Fourier transform (FFT) architec-tures [4], and delay of signals in a delay correlator [2]. Currently,most circuits adopt static random access memory (SRAM) plussome control/addressing logic to implement delay buffers. Forsmaller-length delay buffers, shift register can be used instead.The former approach is convenient since SRAM compilers arereadily available and they are optimized to generate memorymodules with low power consumption and high operation speedwith a compact cell size. The latter approach is also convenientsince shift register can be easily synthesized, though it mayconsume much power due to unnecessary data movement.

Previously, a simplified and thus lower-power sequential ad-dressing scheme for SRAM application in delay buffers is pro-posed in [5]. A ring counter is used to point to the target words

Manuscript received June 29, 2007; revised April 29, 2008. First publishedMarch 16, 2009; current version published August 19, 2009. This work was sup-ported in part by the National Science Council, Taiwan, under Grant NSC-95-2219-E-002-020 and NSC-96-2220-E-002-020.

P.-C. Hsieh, J.-S. Jhuang, and T.-D. Chiueh are with the Graduate Institute ofElectronics Engineering and the Department of Electrical Engineering, NationalTaiwan University, Taipei 10617, Taiwan.

P.-Y. Tsai was with the Graduate Institute of Electronics Engineering andthe Department of Electrical Engineering, National Taiwan University, Taipei10617, Taiwan. She is now with the Department of Electrical Engineering, Na-tional Central University, Jhong-Li 32001, Taiwan.

Digital Object Identifier 10.1109/TVLSI.2008.2004704

to be written-in and read-out. Since the ring counter is made upof an array of D-type flip-flops (DFFs) triggered by a globalclock signal and all except one DFFs have a value of “0,” itis possible to disable the clock signal to most DFFs. Such agated-clock ring counter is implemented in [6] to compose alow-power first-in–first-out (FIFO) memory.

In this paper, we propose to use double-edge-triggered (DET)flip-flops instead of traditional DFFs in the ring counter to halvethe operating clock frequency. A novel approach using the C-el-ements instead of the R–S flip-flops in the control logic for gen-erating the clock-gating signals is adopted to avoid increasingthe loading of the global clock signal. In addition to gating theclock signal going to the DET flip-flops in the ring counter, wealso proposed to gate the drivers in the clock tree. The tech-nique will greatly decrease the loading on distribution networkof the clock signal for the ring counter and thus the overall powerconsumption. The same technique is applied to the input driverand output driver of the memory part in the delay buffer. In adelay buffer based on the SRAM cell array such as the one in[6], the read/write circuitry is through the bit lines that work asdata buses. In the proposed new delay buffer, we use a tree hier-archy for the read/write circuitry of the memory module. For thewrite circuitry, in each level of the driver tree, only one driveralong the path leading to the addressed memory word is acti-vated. Similarly, a tree of multiplexers and gated drivers com-prise the read circuitry for the proposed delay buffer. Simula-tion results show the effectiveness of the above techniques inpower reduction. As an example, a 256 8 delay buffer chipis designed and fabricated. Measured results indicate its muchbetter power performance than the same-size delay buffer basedon existing commercial SRAM.

The rest of this paper is organized as follows. Section IIfirst introduces the conventional architecture for implementingdelay buffers. Next, the proposed delay buffer using the newring counter and gated driver trees for the read and writecircuits of the memory module is described in Section III.Section IV then presents experimental results of the new delaybuffer. Also, comparison in power and area of the new delaybuffer with conventional SRAM-based delay buffers are given.Section V then concludes this paper.

II. CONVENTIONAL DELAY BUFFERS

The simplest way to implement a delay buffer is to use shiftregisters as shown in Fig. 1. If the buffer length is and theword-length is , then a total of DFFs are required, and itcan be quite large if a standard cell for DFF is used. In addition,this approach can consume huge amount of power since on theaverage binary signals make transitions in every clockcycle. As a result, this implementation is usually used in shortdelay buffers, where area and power are of less concern.

1063-8210/$26.00 © 2009 IEEE

Authorized licensed use limited to: Dayananda Sagar College of Engineering. Downloaded on February 23,2010 at 02:13:21 EST from IEEE Xplore. Restrictions apply.

Page 2: A Low-Power Delay Buffer Using Gated Driver Tree

HSIEH et al.: LOW-POWER DELAY BUFFER USING GATED DRIVER TREE 1213

Fig. 1. Delay buffer implemented by shift registers.

Fig. 2. Pointer-based delay buffer.

SRAM-based delay buffers are more popular in long delaybuffers because of the compact SRAM cell size and small totalarea. Also, the power consumption is much less than shift reg-isters because only two words are accessed in each clock cycle:one for write-in and the other for read-out. A binary counter canbe used for address generation since the memory words are ac-cessed sequentially.

Though the SRAM-based delay buffers do away with manydata transitions, there still can be considerable power consump-tion in the SRAM address decoder and the read/write circuits.In fact, since the memory words are accessed sequentially, wecan use a ring counter with only one rotating active cell to pointto the words for write-in and read-out. This method, known asthe pointer-based scheme [5], is illustrated in Fig. 2. The bottomrow of D-type flip-flops is initialized with only one “1” (the ac-tive cell) and all the other DFFs are kept at “0.” When a clockedge triggers the DFFs, this “1” signal is propagated forward.Consequently, the traditional binary address decoder can be re-placed by this “unary-coded” ring counter. Compared to theshift register delay buffers, this approach propagates only one“1” in the ring counter instead of propagating -bit words.Obviously, with much less data transitions, the pointer-baseddelay buffers can save a lot of power.

By observing the fact that only one of the DFFs in the ringcounter is activated, the gated-clock technique has then beenproposed to be applied to the DFFs in [6]. In their approach,every eight DFFs in the ring counter are grouped into one block.Then, a “gate” signal is computed for each block to gate thefrequently toggled clock signal when the block can be inactiveso that unnecessary power wasted in clock signal transitions issaved. As shown in Fig. 3, when the input of the first DFF in ablock is asserted, it sets the output of the R–S flip-flop to “1”at the next clock edge. Thus, the incoming “1” can be trappedin that block and continue to propagate inside the block. On theother hand, the successful propagation of “1” to the first DFF inthe next block can henceforth shut down the unnecessary clocksignal in the current block.

Fig. 3. Ring counter with clock gated by R–S flip-flop.

III. PROPOSED DELAY BUFFER

In the proposed delay buffer, several power reductiontechniques are adopted. Mainly, these circuit techniques aredesigned with a view to decreasing the loading on high fan-outnets, e.g., clock and read/write ports.

A. Gated-Clock Ring Counter

Although some power is indeed saved by gating the clocksignal in inactive blocks, the extra R–S flip-flops still serve asloading of the clock signal and demand more than necessaryclock power. We propose to replace the R–S flip-flop by a C-el-ement and to use tree-structured clock drivers with gating soas to greatly reduce the loading on active clock drivers. Addi-tionally, DET flip-flops are used to reduce the clock rate to halfand thus also reduce the power consumption on the clock signal.The proposed ring counter with hierarchical clock gating and thecontrol logic is shown in Fig. 4. Each block contains one C-el-ement to control the delivery of the local clock signal “CLK ”to the DET flip-flops, and only the “CKE signals along thepath passing the global clock source to the local clock signalare active. The “gate” signal (CKE ) can also be derived fromthe output of the DET flip-flops in the ring counter.

The C-element is an essential element in asynchronous cir-cuits for handshaking. One of its implementation is shown inFig. 5(a) [7]. The logic of the C-element is given by

(1)

where as well as are its two inputs and as well asare the next and current outputs. If , then the next output

will be the same as . Otherwise, and remainunchanged. Since the output of C-element can only be changedwhen , it can avoid the possibility of glitches, a cru-cial property for a clock gating signal. In order to reduce morepower, we replace DFFs by double-edge-triggered flip-flops [8][see Fig. 5(b)] and operate the ring counter at half speed . Withsuch changes, the clock gating control mechanism in Fig. 4(a) isdifferent from the one in Fig. 3. When the input of the last DETflip-flop in the previous block changes to “1” making both twoinputs of the C-element the same, the clock signal in the cur-rent block will be turned on. When the output of the first DETflip-flop in the current block is asserted, then both inputs of theC-element in the previous block go to “0” and the clock for theprevious block is disabled.

In order to further diminish the loading on the global clocksignal (“CLK”), we propose to use a driver tree distribution net-work for the global clock and activate only those drivers along

Authorized licensed use limited to: Dayananda Sagar College of Engineering. Downloaded on February 23,2010 at 02:13:21 EST from IEEE Xplore. Restrictions apply.

Page 3: A Low-Power Delay Buffer Using Gated Driver Tree

1214 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 17, NO. 9, SEPTEMBER 2009

Fig. 4. (a) Ring counter with clock gated by C-elements, (b) tree-structuredclock drivers with gating, and (c) control logic for clock enable signals.

Fig. 5. Circuit diagrams of (a) the C-element [7] and (b) the double-edge-trig-gered flip-flop [8].

the path from the clock source to the blocks that need to bedriven by the clock. The “gate” signal for those drivers can be

Fig. 6. (a) Circuit of the clock enable signal for the gated-clock driver tree and(b) its timing diagram.

derived from the same clock gating signals of the blocks thatthey drive. Thus, in a quad-tree clock distribution network, the“gate” signal of the th gate driver at the th level (CKE )should be asserted when the active DET flip-flop (whose outputis “1”) in the ring counter is inside the group of blocks withindex from to , where is the numberof blocks. To be precise, every clock gating signal will be on fortwo more cases, when “1” is at the input of last DET flip-flop inblock and when “1” is at the output of the first DETflip-flop of block . In a quad-tree driver architecturewith four times more drivers in each level, all drivers need be ac-tivated if no gating is applied and the number of active driversis given by

(2)

On the other hand, only drivers are activated in theworst case for the proposed gated-clock tree when two driversare activated in each of the level. On the average, thereare no more than drivers that are turned on,where is the number of DET flip-flops in one block.

An example is illustrated in Fig. 6 with . If the active“1” in the ring counter is propagated to the input of the last DETflip-flop, Q , then the clock enable signals, CKE andCKE , are turned on. Subsequently, CLK and CLK can

Authorized licensed use limited to: Dayananda Sagar College of Engineering. Downloaded on February 23,2010 at 02:13:21 EST from IEEE Xplore. Restrictions apply.

Page 4: A Low-Power Delay Buffer Using Gated Driver Tree

HSIEH et al.: LOW-POWER DELAY BUFFER USING GATED DRIVER TREE 1215

TABLE IPOWER CONSUMPTION OF THREE RING COUNTERS

be delivered to block 49. On the contrary, when Q rises, itwill turn off the clock enable signals of CKE and CKE . Asa result, CLK and CLK stop driving their loading blocks.

The loading of the clock signal in the proposed scheme can beanalyzed as follows. Assume that a quad tree is used for clockdrivers, then for a length- ring counter constituted by a totalof flip-flops partitioned in blocks.

• Loading of a traditional ring counter

(3)

• Loading of [6]

(4)

• Loading of the proposed ring counter

(5)

where , , and denote the loading of a D-typeflip-flop clock input, an AND gate input and an R–S flip-flopclock input, respectively. In Table I, we have simulated threedifferent ring counter structures using 0.18- m CMOS tech-nology and 1.8-V supply voltage at an operating frequency of50 MHz. The length of the ring counter is set to 1024, andeight flip-flops are grouped in one block ( , ).If we assume that the ratio between , , and is2 : 1 : 2, the estimated loading ratios of the three architecturesare also listed in the table. The simulation results indeed reflectthe power consumption analysis in (3)–(5). It also indicates thatit consumes less than 5% of the power of a same-length R–Sflip-flop-based ring counter [6] (Fig. 3).

B. Gated-Driver Tree

To save area, the memory module of a delay buffer is often inthe form of an SRAM array with input/output data bus as in [6].Special read/write circuitry, such as a sense amplifier, is neededfor fast and low-power operations. However, of all the memorycells, only two words will be activated: one is written by theinput data and the other is read to the output. Driving the inputsignal all the way to all memory cells seems to be a waste ofpower. The same can be said for the read circuitry of the outputport. In light of the previous gated-clock tree technique, we shallapply the same idea to the input driving/output sensing circuitryin the memory module of the delay buffer.

The memory words are also grouped into blocks. Eachmemory block associates with one DET flip-flop block in theproposed ring counter and one DET flip-flop output addresses acorresponding memory word for read-out and at the same timeaddresses the word that was read one-clock earlier for write-in.

Fig. 7. (a) Gated-driver tree of input driving circuitry and (b) its timingdiagram.

Fig. 7(a) depicts the tree-structured hierarchy of tri-state in-verters used for delivering the input word to the addressedmemory word. Note that only the driver tree for one input bitis shown. A -bit delay buffer needs sets of the circuitry inFig. 7. The “enable” signal of the th tri-state inverter at the thlevel ( ) should be asserted when the “1” is within one ofthe blocks from index to index in thering counter, where is the number of blocks and a quad treeis assumed. These signals are generated by a C-element and aninverter in a similar way as the circuit shown in Fig. 4(a) exceptthat the input signal to the C-element from left (start-up) isthe output of the first DET flip-flop in blockand the signal from right (shut-off) is the output of the firstDET flip-flop in block . The corresponding timingdiagram of control and data signals is shown in Fig. 7(b). Ascan be seen, the assertion of Q , the output of the firstDET flip-flop in block 49, activates , , and , andsimultaneously disables , , and The contentof the memory word addressed by Q , (address 384), iswritten by the input signal.

Authorized licensed use limited to: Dayananda Sagar College of Engineering. Downloaded on February 23,2010 at 02:13:21 EST from IEEE Xplore. Restrictions apply.

Page 5: A Low-Power Delay Buffer Using Gated Driver Tree

1216 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 17, NO. 9, SEPTEMBER 2009

TABLE IIPOWER CONSUMPTION OF THE INPUT DRIVER TREE

WITH AND WITHOUT THE GATING STRATEGY

The loading of the input write circuitry can be estimated.• Without gating strategy, it is

(6)

• With gated driver tree, it is

(7)

where and are the loading of one latch and onetri-state buffer, respectively. In the estimation, a quad tree is as-sumed. We can see the logarithmical decrease in loading candramatically reduce the power consumption. In Table II, wehave simulated input driver tree structures with and without thegating strategy by using 0.18- m CMOS technology and 1.8-Vsupply voltage at an operating frequency of 50 MHz. Due to thedriving capability, buffer sizing is considered in the simulationand equivalently extra loading of 12 must be included in(6) and (7) for the case . If the loading ratio between

and is 1:2.8, the estimated power consumption ratioby (6) and (7) also matches very well with the simulated results.

The output sensing circuit has the same structure as that inFig. 7(a) except that the signal flow direction is reversed. Datain the addressed memory element pass through several levels oftri-state inverters that work as the multiplexer. The same set ofenable signals (E ) can be used to control these output tri-stateinverters. Note that the proposed techniques can also be appliedto variable-length delay buffers. First of all, the proposed ringcounter can be made variable-length by including alternativesignal paths selected by multiplexers to bypass some DET flip-flops in the ring. Second, as all the control signals of the gated-clock/driver tree are derived from the ring counter, a completedelay buffer with variable length can be constructed quite easilyby including the latches and gated I/O driver trees. Of course,in this variable-length delay buffer, hardware corresponding tothe maximum length must be implemented.

IV. EXPERIMENTAL RESULTS

A. Physical Design

A delay buffer based on the proposed techniques is designedand implemented in 0.18- m CMOS technology. The standard6-T SRAM cell is used in the delay buffer. Eight DET flip-flops,eight memory words, and associated control logic are designedin a full-custom fashion and grouped as one block. We have sim-ulated the proposed delay buffer with various lengths in 0.18 mCMOS technology. The word-length is set to 8 bits. The area

Fig. 8. Simulated results of (a) power and (b) area of various delay buffersversus different lengths.

and power consumption are estimated from post layout sim-ulation. In addition, we compared the simulated results withthe values provided by a commercial SRAM compiler in thesame technology. Since in each clock cycle, one read and onewrite operations are necessary for the delay buffer of length

, either one two-port SRAM with words or two one-portSRAMs each with words is required. Fig. 8(a) shows thesimulated power consumption at 135-MHz operating frequencyand 1.8-V supply voltage. Fig. 8(b) depicts their occupied area.From Fig. 8, we can see that the proposed delay buffer out-performs both the two-port and single-port SRAM-based delaybuffers in terms of power consumption. In addition, the area ofthe proposed delay buffer is smaller than the SRAM-based delaybuffers when the buffer length is shorter than 256.

A 256 8 delay buffer is designed and fabricated. In this cir-cuit, a wordlength of one byte is adopted for this size is quitecommon in communication and video applications. Since eachoutput of the ring counter is to activate one word line of the

Authorized licensed use limited to: Dayananda Sagar College of Engineering. Downloaded on February 23,2010 at 02:13:21 EST from IEEE Xplore. Restrictions apply.

Page 6: A Low-Power Delay Buffer Using Gated Driver Tree

HSIEH et al.: LOW-POWER DELAY BUFFER USING GATED DRIVER TREE 1217

Fig. 9. Die photo of the proposed delay buffer.

Fig. 10. Experimental results of (a) maximum clock rate and (b) power con-sumption at different supply voltage.

memory array, to achieve compact area and high-speed opera-tion, the outputs of the ring counter must be pitch-matched tothe memory array and the ring counter is placed as close as pos-sible to the memory cell. As shown in Fig. 9, to make the aspect

TABLE IIICHIP SUMMARY

TABLE IVCOMPARISON OF MEASUREMENT RESULTS WITH

SRAM-BASED DELAY BUFFERS

ratio of the layout close to 1, the 256 8 delay buffer is foldedinto two 128 8 delay buffers. The core area of the 256 8delay buffer is 0.232 m 0.519 m. It is packaged in a 28-pinpackage, including eight output pins and twelve input pins.

B. Measurements

The measurement results are depicted in Fig. 10, where themaximum achievable clock rate under different supply voltagesand the power consumption versus different supply voltages atthe maximum operating frequency are shown. The maximumoperating frequency may decrease as the length of delay bufferincreases due to the propagation of the global clock signal in thegated driver tree. Lower supply voltage worsens this critical pathdelay. The proposed chip dissipates only 2556 W at 135 MHzfrom a supply voltage of 1.8 V. When the supply voltage islowered to 1.2 V, the chip consumes only 440 W at 58 MHz.Table III gives a brief summary of the low-power 256 8 delaybuffer chip.

Finally, Table IV lists the comparison between the proposedchip and the delay buffers using commercial single-port aswell as two-port SRAM. The proposed chip has a power con-sumption of about 17% of the single-port SRAM-based delaybuffer, or 13% of the two-port SRAM-based delay buffer. Theeffectiveness of the proposed techniques in lowering powerconsumption is quite obvious.

C. Simulations for Scalability

To obtain a further insight about the scalability of theproposed delay buffer architecture in nanometer CMOS tech-nology, we have run simulations of the proposed buffer withseveral different lengths in 90-nm and 65-nm CMOS tech-nology. To alleviate the leakage power problem, dual-Vt MOStransistors are adopted in the 65-nm simulation while only

Authorized licensed use limited to: Dayananda Sagar College of Engineering. Downloaded on February 23,2010 at 02:13:21 EST from IEEE Xplore. Restrictions apply.

Page 7: A Low-Power Delay Buffer Using Gated Driver Tree

1218 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 17, NO. 9, SEPTEMBER 2009

Fig. 11. Simulated power (with leakage power) of the proposed delay bufferarchitecture in (a) 90-nm CMOS technology and (b) 65-nm CMOS technology.

single-Vt MOS transistors are used in the 90-nm technology.The low Vt MOS transistors only exist in write-enable andread-enable pass gates between bit lines and memory cellsto provide enough driving. The supply voltages are 1 V and0.85 V in 90-nm and 65-nm cases, respectively, and the oper-ating frequency is 200 MHz.

Fig. 11 shows the total power consumption in normal oper-ation mode and the leakage power consumption in idle (dis-abled clock) mode for 90-nm and 65-nm technology, respec-tively. Note that the total power consumption in normal oper-ation mode is not logarithmically proportional to the length ofthe delay buffer. Instead, due to the quad tree structure for allthe driving circuitry, delay buffers of length and haveapproximate dynamic power because basically these two casesactivate the same number of drivers. We can see that the su-

periority of the proposed circuit is still obvious in 90-nm tech-nology in that the leakage power is almost negligible. Even inthe more advanced 65-nm technology, the leakage power canbe controlled to within an acceptable level for medium-lengthdelay buffers with the dual-Vt approach. For longer-length delaybuffers and for more advanced technology, other leakage reduc-tion techniques such as the “sleep” transistors in SRAM (Latch)cells can help to reduce leakage power [9].

V. CONCLUSION

In this paper, we presented a low-power delay buffer architec-ture which adopts several novel techniques to reduce power con-sumption. The ring counter with clock gated by the C-elementscan effectively eliminate the excessive data transition withoutincreasing loading on the global clock signal. The gated-drivertree technique used for the clock distribution networks can elim-inate the power wasted on drivers that need not be activated. An-other gated-demultiplexer tree and a gated-multiplexer tree areused for the input and output driving circuitry to decrease theloading of the input and output data bus. All gating signals areeasily generated by a C-element taking inputs from some DETflip-flop outputs of the ring counter. Measurement results indi-cate that the proposed architecture consumes only about 13% to17% of the conventional SRAM-based delay buffers in 0.18- mCMOS technology. Further simulations also demonstrate its ad-vantages in nanometer CMOS technology. We believe that withmore experienced layout techniques the cell size of the proposeddelay buffer can be further reduced, making it very useful in allkinds of multimedia/communication signal processing ICs.

ACKNOWLEDGMENT

The authors would like to thank the Chip ImplementationCenter (CIC) of the National Science Council, Taiwan, for chipfabrication.

REFERENCES

[1] W. Eberle et al., “80-Mb/s QPSK and 72-Mb/s 64-QAM flexible andscalable digital OFDM transceiver ASICs for wireless local area net-works in the 5-GHz band,” IEEE J. Solid-State Circuits, vol. 36, no.11, pp. 1829–1838, Nov. 2001.

[2] M. L. Liou, P. H. Lin, C. J. Jan, S. C. Lin, and T. D. Chiueh, “De-sign of an OFDM baseband receiver with space diversity,” IEE Proc.Commun., vol. 153, no. 6, pp. 894–900, Dec. 2006.

[3] G. Pastuszak, “A high-performance architecture for embedded blockcoding in JPEG 2000,” IEEE Trans. Circuits Syst. Video Technol., vol.15, no. 9, pp. 1182–1191, Sep. 2005.

[4] W. Li and L. Wanhammar, “A pipeline FFT processor,” in Proc. Work-shop Signal Process. Syst. Design Implement., 1999, pp. 654–662.

[5] E. K. Tsern and T. H. Meng, “A low-power video-rate pyramid VQdecoder,” IEEE J. Solid-State Circuits, vol. 31, no. 11, pp. 1789–1794,Nov. 1996.

[6] N. Shibata, M. Watanabe, and Y. Tanabe, “A current-sensed high-speedand low-power first-in-first-out memory using a wordline/bit-line-swapped dual-port SRAM cell,” IEEE J. Solid-State circuits, vol.37, no. 6, pp. 735–750, Jun. 2002.

[7] E. Sutherland, “Micropipelines,” Commun. ACM, vol. 32, no. 6, pp.720–738, Jun. 1989.

[8] R. Hosain, L. D. Wronshi, and A. albicki, “Low power design usingdouble edge triggered flip-flop,” IEEE Trans. Very Large Scale Integr.(VLSI ) Syst., vol. 2, no. 2, pp. 261–265, Jun. 1994.

[9] K. Zhang, U. Bhattacharya, Z. Chen, F. Hamzaoglu, D. Murray, N.Vallepalli, Y. Wang, B. Zheng, and M. Bohr, “SRAM design on 65-nmCMOS technology with dynamic sleep transistor for leakage reduc-tion,” IEEE J. Solid-State Circuits, vol. 40, no. 4, pp. 895–901, Apr.2005.

Authorized licensed use limited to: Dayananda Sagar College of Engineering. Downloaded on February 23,2010 at 02:13:21 EST from IEEE Xplore. Restrictions apply.

Page 8: A Low-Power Delay Buffer Using Gated Driver Tree

HSIEH et al.: LOW-POWER DELAY BUFFER USING GATED DRIVER TREE 1219

Po-Chun Hsieh was born in Kaohsiung, Taiwan, in1977. He received the B.S. degree in electrical en-gineering and the M.S. degree in electronics engi-neering from National Taiwan University, Taipei, in2002 and 2004, respectively.

He is currently with Faraday Technology Corpora-tion and participates on the DC–DC Converter team.His research interests include low-power circuit de-sign of communication systems and power manage-ment circuit design.

Jing-Siang Jhuang was born in Taipei, Taiwan in1981. He received the B.S. degree in electrical en-gineering and the M.S. degree in electronics engi-neering from National Taiwan University, Taipei, in2003 and 2005, respectively.

His research interests include baseband signal pro-cessing of communication systems and related VLSIdesign.

Pei-Yun Tsai (S’02–M’06) received the B.S., M.S.,and Ph.D. degrees in electrical engineering from theNational Taiwan University, Taipei, in 1994, 1996,and 2005, respectively.

From 1996 to 2000, she was with ASUStek andparticipated on the optical storage systems team.She is now an Assistant Professor of electricalengineering at National Central University, Taoyuan,Taiwan.

Prof. Tsai received the Acer Longtern Award,MXIC Golden Silicon Award, and the First Asian

Solid-State Circuit Conference Student Design Contest Outstanding Award in2005. Her research interests include baseband signal processing algorithms andVLSI design for digital communication systems.

Tzi-Dar Chiueh (S’87–M’90–SM’03) received theB.S. and Ph.D. degrees in electrical engineering fromNational Taiwan University (NTU) and California In-stitute of Technology, Pasadena, in 1983 and 1989,respectively.

He is currently a Professor of electrical engi-neering in the Graduate Institute of ElectronicsEngineering, National Taiwan University. Hisresearch interests include algorithm, architecture,and integrated circuits for baseband communicationsystems.

Prof. Chiueh received the Acer Longtern Award 11 times and the GoldenSilicon Award in 2002, 2005, and 2007. His teaching efforts were recognizedfive times by the Teaching Excellence Award from NTU. He received theOutstanding Research Award from National Science Council, Taiwan, in2004–2006 and was awarded the Himax Chair Professorship at NTU in 2006.

Authorized licensed use limited to: Dayananda Sagar College of Engineering. Downloaded on February 23,2010 at 02:13:21 EST from IEEE Xplore. Restrictions apply.