

5-2

RF2: A 1GHz FIR Filter with Distributed Resonant Clock Generator

Visvesh S. Sathe, Jerry C. Kao, and Marios C. Papaefthymiou
University of Michigan, Ann Arbor

2007 Symposium on VLSI Circuits Digest of Technical Papers, pp. 44-45 (978-4-90074-04-8)

ABSTRACT

In this paper we present the design and experimental validation of RF2, a 1GHz, two-phase resonant-clocked FIR filter test-chip with a distributed resonant clock generator and an on-chip inductor. RF2 is fabricated in a 0.13μm CMOS process and dissipates 124mW at resonance, with clock power accounting for only 16% of overall power. Implemented using a fully automated ASIC design flow, RF2 achieves 84% clock-power efficiency over CV²f, the highest for any fully-integrated resonant-clocked chip. Resonating at 1.01GHz, RF2 reports the highest operating frequency for a resonant-clocked datapath to date.

INTRODUCTION

Increasing throughput and performance requirements have resulted in clock power remaining a major contributor to total power dissipation. By efficiently resonating the entire clock network capacitance using inductance, resonant-clocking techniques offer the opportunity for energy-efficient clocking, resulting in substantial reduction of overall power dissipation. Previous work in the area involved the use of resonant clocks derived from off-chip inductors to drive flip-flops [1,2,3]. The poor slew of sinusoidal resonant clocks, however, degrades performance and robustness to variation in flop-based designs. Previous adiabatic techniques introduce additional resistance in the resonating clock network [1,2]. Another approach has been to resonate the global clock distribution network [5]. Such a mechanism, which is directed towards clock skew and jitter reduction, does not significantly impact clock power [6].

The basis for the design of RF2 is the key observation that latches are naturally suited to the design of resonant-clocked datapaths. Unlike flip-flops, latch-based systems can be designed so that their performance is nearly insensitive to the slew of resonant clock waveforms. Therefore, such a latch-based design methodology allows resonant clocks to directly drive latches (with no clock buffers) without incurring performance degradation. RF2 also benefits from the time-borrowing afforded by latch-based design.

For robust and skew-tolerant operation, RF2 utilizes a two-phase clocking scheme driven by a distributed clock generator using a blip topology [7]. The two phases, along with the appropriate latch design, achieve two non-overlapping latch transparency regions. The clock generator switches in RF2 are also driven by the resonant clock phases, resulting in increased energy efficiency. RF2 achieves 84% clock-power efficiency over CV²f while driving 80pF of clock load. Compared to its conventional flop-based counterpart, RF2 drives a lower clock load, resulting in even higher relative clock-power reduction.

LATCH DESIGN

Fig. 1 shows transistor schematics of B-LAT, the resonant-clocked level-sensitive latch used in RF2. B-LAT is a modified Svensson latch implementation optimized for low T_DQ [7]. Post-layout simulations of B-LAT-X2 with 15fF load yield T_DQ = 97ps with (E_CK, E_D→Q) = (0fJ, 23fJ) per toggle. As a result, RF2 does not require deeper pipelines and therefore does not incur longer latencies than a corresponding conventional implementation. To avoid crossover current resulting from the poor slews of unbuffered resonant clocks, clocked transistors are incorporated within the logic stack. As seen in Fig. 1, pipeline stages in RF2 are designed so that critical data arrives at the latch when the latching clock is nearly at Vdd, ensuring full gate overdrive for clocked transistors in B-LAT, thus maximizing performance. Data arriving early at the latch input leaves the latch earlier despite the voltage-dependent T_DQ delay, providing additional timing slack to the next pipeline stage. Our methodology can also be shown to work effectively in designs with feedback loops. Post-layout simulations at the 3σ process corner also demonstrate that even with zero logic delay, the setup in Fig. 1 is race-immune at up to 290ps of clock skew. RF2 exhibits low skew, bounded by its insertion delay of 15ps.

Fig. 1: B-LAT schematics and pipeline implementation

Fig. 2: RF2 block diagram

RF2 DESIGN

Fig. 2 shows a block diagram of RF2, a 14-tap 8-bit transpose-type FIR filter with BIST, designed for a target frequency of 1GHz. In each tap, pipelined multipliers merge partial products to generate sum and carry vector pairs.
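The transpose-type structure named above, in which every tap sees the same broadcast input and adds its product to the registered partial result passed along the tap chain, can be sketched behaviourally in Python. This is only an illustrative integer model: the function names and the example coefficients are assumptions, and it does not attempt to capture the carry-save arithmetic, pipelined multipliers, or latch-level timing of the actual chip.

```python
def fir_transpose(samples, coeffs):
    """Behavioural model of a transpose-form FIR filter.

    Each input sample is broadcast to all taps. Tap i adds its product
    to the registered partial sum held for tap i + 1, so the output is
    y[n] = sum(coeffs[i] * x[n - i]).
    """
    num_taps = len(coeffs)
    # partial[i] is the registered partial sum at tap i (tap 0 is the output).
    partial = [0] * num_taps
    outputs = []
    for x in samples:
        next_partial = [0] * num_taps
        for i in range(num_taps):
            upstream = partial[i + 1] if i + 1 < num_taps else 0
            next_partial[i] = coeffs[i] * x + upstream
        partial = next_partial
        outputs.append(partial[0])
    return outputs


def fir_direct(samples, coeffs):
    """Reference direct-form convolution, used only to check the model."""
    return [
        sum(c * samples[n - i] for i, c in enumerate(coeffs) if n - i >= 0)
        for n in range(len(samples))
    ]


if __name__ == "__main__":
    # 14 hypothetical signed 8-bit coefficients (not taken from the paper).
    coeffs = [3, -1, 4, 1, -5, 9, 2, -6, 5, 3, -5, 8, 9, -7]
    samples = [17, -42, 100, 0, -128, 127, 5, 64, -3, 9, 0, 0, 0, 0, 0, 0]
    assert fir_transpose(samples, coeffs) == fir_direct(samples, coeffs)
    # A unit impulse returns the coefficients in order.
    print(fir_transpose([1] + [0] * 13, coeffs))
```

Feeding a unit impulse through `fir_transpose` returns the coefficient list itself, which is a quick sanity check that the tap ordering of the model matches the direct-form definition.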

These sum and carry vector pairs are then merged with clock-delayed multiplier outputs from the previous tap using 4:2 compressors. The final vector-merge adder is implemented as a pipelined carry-save adder. The BIST generates a pseudo-random number sequence, processes the filter output to generate a signature waveform, and captures the state of the signature analyzer at a user-defined time. A center-tapped symmetric on-chip inductor is used to achieve efficient LC resonance with the distributed parasitic capacitance of the clock network.

Unlike previous work, RF2 does not use a separate clock generator block. Instead, the cross-coupled NMOS switches providing the required negative trans-conductance are embedded within the B-LAT latch, as shown in Fig. 1. Using a distributed clock generator reduces local clock skew and simplifies design.

A simplified representation of the clock generator and network is illustrated in Fig. 3. In RF2, distributed gain elements drive a lower level clock grid, connected to the inductor through a 2-level H-Tree.

Fig. 3: Distributed clock generator and network

To demonstrate the efficiency of the proposed design methodology, RF2 was implemented in a fully automated ASIC design flow. A gate-level netlist was obtained from Verilog using latch-based synthesis. Physical design was performed using auto place and route tools. In particular, the clock network was automatically generated from standard cell and power route placements using an in-house tool.

MEASUREMENT RESULTS

To investigate variation in the resonant frequency across multiple chips, the resonant frequencies of 10 randomly selected chips were measured. All 10 chips were tested successfully, with a mean resonant frequency μ_f = 1.012GHz and σ_f/μ_f = 0.012.

Clock insertion delays in RF2 are insensitive to Vdd variation and depend only on interconnect delay. Performance degradation in RF2 due to voltage scaling is therefore less pronounced compared to conventionally-clocked designs. Furthermore, efficient clocking in RF2 allows for additional supply-voltage scaling by compensating for the increased logic delay with higher clock amplitude.

Fig. 4 shows the clock, logic, and total energy-per-cycle dissipation vs. the clock supply voltage Vdc. The operating frequency of the design is 1.01GHz. At each value of Vdc, the supply voltage Vdd was scaled to achieve the minimum total power dissipation. For lower values of Vdc, a higher supply voltage is required due to lower clock amplitude. Increasing Vdc increases the clock amplitude, allowing for lower total energy dissipation through Vdd scaling. Beyond the optimal value of Vdc, the supply voltage scaling afforded by improved latch performance cannot compensate for the increasing clock power dissipation, resulting in an increase in total power dissipation. Driving 76pF of clock load, RF2 achieves 84% clock-power efficiency over CV²f, with an estimated system quality factor Q of 5. Clock power in RF2 is 19.9mW, accounting for only 16% of the overall 124mW chip power. At 133nW/MHz/Tap/InBit/CoeffBit, RF2 features the lowest figure of merit for digital FIR filters published to date. A chip microphotograph of RF2 is shown in Fig. 5.

Fig. 5: RF2 microphotograph

CONCLUSION

This paper describes RF2, a robust two-phase latch-based resonant-clocked FIR filter implemented using a fully-automated ASIC design flow. RF2 demonstrates that a latch-based methodology is natural to resonant-clocked design. The distributed clock generator featured in RF2 simplifies design and provides better control of local clock skew. At its resonant frequency of 1GHz, RF2 dissipates 124mW, achieving 84% energy efficiency over CV²f. RF2 is the first GHz-class implementation of a resonant-clocked datapath.

REFERENCES

1. Athas W. et al., "Clock-powered CMOS VLSI graphics processor for embedded display controller application," ISSCC, pp. 296-297, Feb 2000.

2. Ziesler C. et al., "A 225MHz resonant clocked ASIC chip," ISLPED, pp. 48-53, Aug 2003.

3. Drake A. et al., "Resonant Clocking Using Distributed Parasitic Capacitance," JSSC, vol. 39, pp. 1520-1528, 2004.

4. Chan S. et al., "A 4.6GHz resonant global clock distribution network," ISSCC, pp. 12-13, Feb 2004.

5. Restle P.J. et al., "A clock distribution network for microprocessors," JSSC, vol. 36, pp. 792-799, May 2001.

6. Athas W. et al., "A resonant signal driver for two-phase almost-nonoverlapping clocks," ISCAS, pp. 129-132, May 1996.

7. Yuan J. and Svensson C., "High-speed CMOS circuit technique," JSSC, vol. 24, pp. 62-70, Feb 1989.

Fig. 4: Clock, logic, and total energy per cycle vs. Vdc
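The clock-power figures quoted in MEASUREMENT RESULTS can be cross-checked with a few lines of arithmetic. The sketch below takes the reported 19.9mW clock power, 124mW chip power, 76pF clock load, and 1.01GHz operating frequency from the text. It assumes that "84% efficiency over CV²f" means the clock dissipates 84% less than a conventional CV²f clock driving the same load; that reading, the function names, and the derived clock swing are illustrative assumptions, not values reported in the paper.

```python
import math

# Values quoted in the text.
CHIP_POWER_MW = 124.0
CLOCK_POWER_MW = 19.9
EFFICIENCY_OVER_CV2F = 0.84
F_CLK_HZ = 1.01e9
C_CLK_F = 76e-12


def clock_fraction(clock_mw, chip_mw):
    """Share of total chip power spent on the clock."""
    return clock_mw / chip_mw


def cv2f_baseline_mw(clock_mw, efficiency):
    """Conventional CV^2 f clock power implied by the measured clock power.

    Assumption: 'efficiency over CV^2 f' is the fraction of CV^2 f saved.
    """
    return clock_mw / (1.0 - efficiency)


def implied_swing_v(baseline_mw, c_farads, f_hz):
    """Clock swing V for which C * V^2 * f equals the baseline power."""
    return math.sqrt(baseline_mw * 1e-3 / (c_farads * f_hz))


if __name__ == "__main__":
    frac = clock_fraction(CLOCK_POWER_MW, CHIP_POWER_MW)
    base = cv2f_baseline_mw(CLOCK_POWER_MW, EFFICIENCY_OVER_CV2F)
    swing = implied_swing_v(base, C_CLK_F, F_CLK_HZ)
    print(f"clock share of chip power: {frac:.1%}")
    print(f"implied CV^2 f baseline:   {base:.1f} mW")
    print(f"implied clock swing:       {swing:.2f} V")
```

The first ratio rounds to 16%, matching the share stated in the text. The baseline power and clock swing are inferences under the stated assumption and should be read as a consistency check only.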
