session 26 overview: processor-power management and …

436 • 2017 IEEE International Solid-State Circuits Conference 978-1-5090-3758-2/17/$31.00 ©2017 IEEE

ISSCC 2017 / SESSION 26 / PROCESSOR-POWER MANAGEMENT AND CLOCKING / OVERVIEW

Session 26 Overview: Processor-Power Management and Clocking

DIGITAL CIRCUITS SUBCOMMITTEE

Subcommittee Chair: Edith Beigné, CEA-LETI, Grenoble, France

The first paper in this session considers optimization of computing systems at multiple levels, from silicon to data center. A second paper pertains to power delivery network reliability and presents a software approach to mitigating worst-case droop. The remaining three papers deal with improving the power of the clock network by making it reconfigurable, by using adiabatic techniques, and through adaptive frequency throttling.

Session Chair: Kathy Wilcox, AMD, Boxborough, MA

Session Co-Chair: Youngmin Shin, Samsung, Hwaseong, Korea


ISSCC 2017 / February 8, 2017 / 3:15 PM

3:15 PM – INVITED TALK

26.1 Design Optimization of Computing Systems from the Nanoscale Transistor to the Datacentre
A. Nalamalpu, Intel, Hillsboro, OR

In Paper 26.1, Intel presents an invited paper on highly agile system design to support both highest performance for peak demand and best-in-class performance/W to minimize datacenter operating costs. Silicon and system co-optimization, reliability, and configurability are required to increase performance to meet the computing demands of the modern computing era while driving the total cost of ownership lower.

3:45 PM

26.2 Power Supply Noise in a 22nm z13™ Microprocessor
P. I-J. Chuang, IBM Research, Yorktown Heights, NY

In Paper 26.2, IBM demonstrates an RLC network model for processor power delivery and measures the 22nm multicore z13™ processor to analyze worst-case droops called “perfect storms”. Compared to the noise associated with synchronous activity on the cores, these events create droops that are 1.2× deeper and 1.9× faster. A network of 32 critical-path monitors senses the local droops and initiates activity throttling, improving voltage noise margin to provide a 2% increase in chip performance.

4:15 PM

26.3 Reconfigurable Clock Networks for Random Skew Mitigation from Subthreshold to Nominal Voltage
L. Lin, National University of Singapore, Singapore, Singapore

In Paper 26.3, the National University of Singapore presents a reconfigurable clock network design for operation from sub-threshold to nominal voltage. Clock skew is reduced by up to 2.5 standard deviations and 100mV Vmin reduction is achieved at 1.8% area penalty in a 40nm LP CMOS FFT testchip.

4:30 PM

26.4 A 0.4-to-1V 1MHz-to-2GHz Switched-Capacitor Adiabatic Clock Driver Achieving 55.6% Clock Power Reduction
L. G. Salem, University of California, San Diego, CA

In Paper 26.4, the University of California, San Diego introduces a fully integrated adiabatic clocking scheme via a switched-capacitor DC-AC multi-level inverter topology. Designed in 45nm SOI, the adiabatic driver is 0.019mm², enables up to 55.6% power savings, and features a 2000× dynamic frequency range.

4:45 PM

26.5 Adaptive Clocking in the POWER9™ Processor for Voltage Droop Protection
M. S. Floyd, IBM, Austin, TX

In Paper 26.5, IBM presents a novel adaptive clock strategy for the POWER9™ to reduce the timing margin needed during droop events by embedding analog Voltage Droop Monitors (VDMs) that direct a Digital Phase-Locked Loop (DPLL) to instantly reduce clock frequency in response to a droop. The 14nm implementation can respond in 6ns with 8mV precision with 0.12% area overhead. The quick response time results in 50% noise margin reduction, translating into 8% less power or 3.5% performance gain within the same power envelope.


26.2 Power Supply Noise in a 22nm z13™ Microprocessor

Pierce I-Jen Chuang¹, Christos Vezyrtzis¹, Divya Pathak², Richard Rizzolo³, Tobias Webel⁴, Thomas Strach⁴, Otto Torreiter⁴, Preetham Lobo⁵, Alper Buyuktosunoglu¹, Ramon Bertran¹, Michael Floyd⁶, Malcolm Ware⁶, Gerard Salem⁷, Sean Carey⁸, Phillip Restle¹

¹IBM Research, Yorktown Heights, NY
²Drexel University, Philadelphia, PA
³IBM, Poughkeepsie, NY
⁴IBM STG, Boeblingen, Germany
⁵IBM STG, Bangalore, India
⁶IBM STG, Austin, TX
⁷IBM STG, Burlington, NY
⁸IBM STG, Poughkeepsie, NY

Successful power supply noise mitigation requires a system-level approach that includes design and modeling of the mitigation circuits with the power delivery network (PDN) on the chip, the chip module, the backplane, and the voltage regulator module (VRM). Traditionally, periodic square-wave activity patterns with all cores in sync, which excite the low-frequency (LF) or mid-frequency (MF) impedance peaks associated with the backplane and chip/module, respectively, are considered to give rise to the worst-case power supply noise. However, voltage droops that are both deeper and faster at a single victim core are created when cores change activity in more complicated patterns, termed “perfect storms” in this work. These patterns excite high-frequency (HF) modes that are not stimulated when all cores switch simultaneously, and require an accurate model of the packaged chip, including effective core-to-core inductances due to currents traveling between cores through low-resistance module planes.

The two most widely used approaches for power-supply noise mitigation are instruction throttling [1] and adaptive clocking [2]. In a system with a synchronous global clock optimized for on-chip latency and bandwidth, adaptive clocking is less attractive. The goal of the power supply noise mitigation system on z13 is to reduce the magnitude of power supply droop occurring during the lifetime of the product, thus reducing the power supply noise margin. As shown in Fig. 26.2.1, for each of the 8 cores on z13 [3], 4 timing-based critical path monitors (CPMs) [4] are implemented to sense the local voltage droop, with a power management unit to initiate and relay an activity throttling event signal to 4 different units. The red and white dashed arrows indicate the signal paths of instruction throttling and CPM signals, respectively. When a droop is detected, instructions are rapidly throttled down to 25% of the un-throttled activity. After the droop, the activity is slowly ramped back up over a period of 512ns to avoid creating droops. This per-core mitigation scheme was chosen over a centralized scheme in order to minimize the sense-to-response latency. This scheme not only reduces the local droop, but also prevents a potential perfect storm from building up in another core. The CPM droop sensors on z13 utilize a latch-tapped delay line to measure the timing effects of power-supply noise on a circuit tuned to replicate the core’s late-mode critical timing path. Each CPM produces 8 thermometer-coded bits, numbered 1 through 8, with 8 representing the highest supply voltage. Each CPM output is ANDed with other CPM outputs as it propagates back to the power management unit, reducing the number of high-layer metals required.
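The CPM combining and throttling behaviour described above can be sketched in a few lines of Python. This is only an illustration: the bit-mask representation, the example CPM readings, and the linear shape of the 512ns recovery ramp are assumptions, not details given in the text.

```python
from functools import reduce

N_BITS = 8  # bits numbered 1..8; bit 8 set corresponds to the highest supply voltage


def thermometer(level: int) -> int:
    """Encode a level 0..8 as a thermometer code (bits 1..level set)."""
    if not 0 <= level <= N_BITS:
        raise ValueError("level out of range")
    return (1 << level) - 1


def combine(codes) -> int:
    """Bitwise AND of thermometer codes; the result encodes the lowest level."""
    return reduce(lambda a, b: a & b, codes, thermometer(N_BITS))


def level_of(code: int) -> int:
    """Decode a thermometer code back to its level by counting set bits."""
    return bin(code).count("1")


def activity_limit(t_ns: float, floor: float = 0.25, ramp_ns: float = 512.0) -> float:
    """Allowed activity t_ns after the droop ends (linear ramp is an assumption)."""
    if t_ns <= 0:
        return floor
    return min(1.0, floor + (1.0 - floor) * t_ns / ramp_ns)


cpm_levels = [6, 4, 7, 5]  # hypothetical readings from the 4 CPMs of one core
worst = level_of(combine(thermometer(v) for v in cpm_levels))
print("lowest CPM level seen by the power management unit:", worst)
print("activity limit at 0, 256, 512 ns:", [activity_limit(t) for t in (0, 256, 512)])
```

Because a thermometer code sets every bit below its level, the AND of several codes keeps only the bits common to all of them, which is the code of the lowest reading. That is why a single ANDed bus can carry the worst-case droop back to the power management unit.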

The schematic of the multi-node model with 12 on-chip nodes connected by RLC networks, representing the on-chip and module PDN wire and planes, is shown in Fig. 26.2.2. The current consumed by each core is modeled by a non-linear leakage conductance GL, in parallel with a separate non-linear conductance describing the active or switching current GA. Changes in core activity are modeled by modulating GA by an activity factor ranging from a low value representing idle power to a maximum activity designed to produce the thermal design point (TDP) chip power. A conductance instead of a current source is adopted to provide the necessary damping factor to create realistic voltage waveforms.
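To illustrate why a conductance load damps the response more than an ideal current source, the following sketch simulates a single-node series R-L supply feeding a decoupling capacitor. It is a toy stand-in for the 12-node model, the conductance here is linear rather than non-linear, and every component value is a hypothetical placeholder.

```python
def min_voltage(load, vs=1.0, r=0.2e-3, l=5e-12, c=1e-6,
                i_low=50.0, i_high=100.0, dt=1e-11, n_steps=10000):
    """Minimum node voltage after an activity step from i_low to i_high at t=0.

    load="current":     ideal current source, current independent of voltage.
    load="conductance": G = I / vs, so the drawn current scales with the voltage.
    All element values are hypothetical, chosen only to give an underdamped ring.
    """
    g_low, g_high = i_low / vs, i_high / vs
    # Start from the DC operating point of the low-activity state.
    if load == "current":
        v = vs - r * i_low
        i = i_low
    else:
        v = vs / (1.0 + r * g_low)
        i = g_low * v
    v_min = v
    for _ in range(n_steps):
        i_load = i_high if load == "current" else g_high * v
        i += dt * (vs - r * i - v) / l   # inductor current (semi-implicit Euler)
        v += dt * (i - i_load) / c       # node voltage, using the updated current
        v_min = min(v_min, v)
    return v_min


print("min voltage, current-source load:", round(min_voltage("current"), 4))
print("min voltage, conductance load:   ", round(min_voltage("conductance"), 4))
```

With the conductance load the drawn current falls as the voltage droops, which both reduces the effective step and adds a loss term in parallel with the capacitor, so the first droop is shallower and the ringing decays faster.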

Figure 26.2.3 shows the measured and modeled noise at core 0 (in the upper left corner of the chip) resulting from activity changes in different sets of cores. For the measurements, the noise was generated artificially by controlling a pipelined global clock-gate signal in each core, causing a simultaneous step-like change in the activity of the selected set of cores. An oscilloscope probe remained connected across the VDD and GND networks in core 0, while different sets of cores were stimulated with the clock gating signals. The measured and modeled waveforms are in good agreement, indicating a corner-to-corner noise propagation delay of 18ns. This delayed response to activity changes in distant cores is key to understanding the perfect storm pattern, in which the noise from earlier events all arrives at a victim core together.

The z13 system’s impedance profiles associated with 3 different core activity patterns are shown in Fig. 26.2.4. The impedance profile is typically determined with all cores having the identical current pattern, but, to include the across-chip noise propagation in Fig. 26.2.3, it is important to include the HF modes where the core activity patterns are not synchronized. An LF impedance peak determined by the backplane VRM and backplane design is not shown. The mid-frequency (MF) peak at 3.8MHz is determined primarily by the on-chip decoupling capacitance, with the effective inductance to the nearest off-chip decoupling capacitors. Two of the many HF impedance peaks are also shown, where half of the chip is 180 degrees out of phase with the other half.

Due to the complexity of the HF modes, Monte Carlo simulation methods were used to search for the worst-case relative timing of noise events in different cores using the multi-node model, as shown in Fig. 26.2.5 for the step current response. The activity step of every core is randomly distributed from 0ns to 200ns, and the droop amplitude is measured relative to the steady-state voltage when all cores are fully active. The worst perfect-storm alignment results in a noise that is 1.2× larger in magnitude and 1.9× steeper in slope, compared to all cores activated simultaneously. Out of 400,000 samples, 2.1% of the cases result in a simulated droop that is larger than the case of perfect alignment among all cores.
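The structure of such a Monte Carlo search can be sketched as follows. This is a deliberately simplified toy, not the multi-node model: each core contributes the same damped second-order step response, delayed by a hypothetical propagation time to the victim core, and the resonance frequency, decay time, and delay values are all assumptions.

```python
import math
import random

T_END_NS = 400                              # simulated window, 1 ns grid
DELAYS_NS = [0, 6, 12, 18, 6, 12, 18, 18]   # hypothetical core-to-victim delays


def step_response(t_ns, f_mhz=50.0, tau_ns=40.0):
    """Unit-step droop contribution of one core (toy damped ring)."""
    if t_ns < 0:
        return 0.0
    w = 2.0 * math.pi * f_mhz * 1e-3        # rad/ns
    return 1.0 - math.exp(-t_ns / tau_ns) * math.cos(w * t_ns)


H = [step_response(t) for t in range(T_END_NS)]   # precomputed lookup table


def peak_droop(starts_ns, delays_ns=DELAYS_NS):
    """Largest summed droop at the victim for the given per-core step times."""
    best = 0.0
    for t in range(T_END_NS):
        total = 0.0
        for s, d in zip(starts_ns, delays_ns):
            k = t - s - d
            if k >= 0:
                total += H[k]
        best = max(best, total)
    return best


simultaneous = peak_droop([0] * len(DELAYS_NS))
# Steps timed so that every core's noise reaches the victim at the same instant.
arrival_aligned = peak_droop([max(DELAYS_NS) - d for d in DELAYS_NS])

rng = random.Random(1)
n_samples = 500
samples = [peak_droop([rng.randint(0, 200) for _ in DELAYS_NS])
           for _ in range(n_samples)]
fraction_worse = sum(p > simultaneous for p in samples) / n_samples

print("simultaneous steps:      ", round(simultaneous, 3))
print("arrival-aligned steps:   ", round(arrival_aligned, 3))
print("worst random sample:     ", round(max(samples), 3))
print("fraction above simultaneous:", fraction_worse)
```

In this toy the arrival-aligned case is an upper bound by construction, since every contribution peaks at the same instant. It shows the same qualitative point as the text: switching all cores at once is not the worst case when the noise takes different times to reach the victim.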

The on-chip trace arrays in each z13 core were used to record the lowest CPM value in each 1ns period (5 cycles) for characterization purposes. Such CPM traces are useful to illustrate the complex core-to-core interactions that can produce a perfect storm. Figure 26.2.6 shows two of many traces collected when running a workload designed to cause a large increase in core activity on 7 of the 8 cores almost simultaneously, with the noise mitigation circuitry disabled. In most cases, as illustrated on the left of Fig. 26.2.6, small timing misalignments (less than 100ns) result in relatively small droops. Rare events, where the activity steps in all cores align within 20ns of each other in a worst-case pattern, produced a perfect storm such as that seen in core 0 on the right of Fig. 26.2.6, showing the importance of accurate modeling including HF modes.

In summary, to understand and characterize all possible power supply noise scenarios, an accurate model of the voltage regulator, backplane, module, and the on-chip droop mitigation circuits is necessary. Laboratory measurements of the noise waveforms using oscilloscopes and the on-chip CPM sensors with custom workloads are used to calibrate the model for complex HF noise phenomena. These phenomena result in perfect-storm events that produce rare droops that are 1.2× deeper and 1.9× faster than the synchronized-core step response, in the absence of noise mitigation. The reliable z13 per-core power supply noise detection and mitigation scheme reduces the required noise margin, which was exploited to give a 2% increase in chip performance. Analyzing data from systems running real workloads found that only 0.2% of the chips experienced throttling, and the maximum average throttling within any 16ms period was less than 0.02%, proving this noise mitigation resulted in negligible performance loss.

Acknowledgements:
The authors would like to thank SRDC and the VLSI department at IBM for technical support.

References:
[1] T. Webel, et al., “Robust Power Management in the IBM z13,” IBM J. of Research and Development, vol. 59, no. 4/5, pp. 16:1-16:12, 2015.
[2] K. A. Bowman, S. Raina, et al., “A 16nm Auto-Calibrating Dynamically Adaptive Clock Distribution for Maximizing Supply-Voltage-Droop Tolerance Across a Wide Operating Range,” ISSCC, pp. 152-153, 2015.
[3] J. Warnock, B. Curran, et al., “22nm Next-Generation IBM System z Microprocessor,” ISSCC, pp. 70-71, 2015.
[4] A. Drake, R. Senger, et al., “A Distributed Critical-Path Timing Monitor for a 65nm High-Performance Microprocessor,” ISSCC, pp. 398-399, 2007.




Figure 26.2.1: z13 chip view showing locations of the power supply noise detection and mitigation circuits within a core.

Figure 26.2.2: Circuit schematic of the power delivery network and multi-node chip model.

Figure 26.2.3: Core 0 voltage supply measured vs. simulated waveforms in the presence of local and/or remote power supply noise.

Figure 26.2.4: Impedance peaks and associated current patterns with an example waveform exciting all modes.

Figure 26.2.5: Power supply noise Monte Carlo simulations, where each core goes from low to high activity at a random time. The results are normalized with respect to the noise when all cores are in perfect alignment.

Figure 26.2.6: Measured CPM value traces from randomly aligned core events (left) and a perfect-storm-like alignment (right).




26.3 Reconfigurable Clock Networks for Random Skew Mitigation from Subthreshold to Nominal Voltage

Longyang Lin, Saurabh Jain, Massimo Alioto

National University of Singapore, Singapore, Singapore

Clock network optimization is substantially affected by the operating voltage VDD, as the clock skew is dominated by different mechanisms and has a different balance between wire and repeater delay at different VDD (Fig. 26.3.1). At above-threshold VDD, deep clock networks with several levels of repeaters are needed to control the clock slope in wires [1]. At sub-threshold VDD, shallow networks are needed as the gate delay dominates, and the random clock skew grows approximately in proportion to the square root of the number of levels [2]. At such VDD, the skew of clock networks designed at above-threshold VDD is much larger than at the nominal voltage VDD,nom, so assuring a reasonable skew budget across a wide range of VDD is challenging [1-3]. To date, clock skew at low VDD has been mitigated via moderately deep networks with long-channel LVT buffers [1], design methodologies [2], [4], and voltage-adaptive delay insertion across different clock domains [3]. However, no voltage adaptation has been performed within a clock domain.
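The square-root dependence of random skew on the number of levels can be checked with a short Monte Carlo sketch. It assumes that every repeater delay is an independent Gaussian variable and that the two clock paths share no repeaters; both are simplifications made only for this illustration.

```python
import random
import statistics


def skew_sigma(levels, sigma_stage=1.0, n_samples=20000, seed=0):
    """Standard deviation of the skew between two independent clock paths.

    Each path is a chain of `levels` repeaters whose random delay offsets are
    i.i.d. Gaussian with standard deviation `sigma_stage` (an assumption).
    """
    rng = random.Random(seed)
    skews = []
    for _ in range(n_samples):
        path_a = sum(rng.gauss(0.0, sigma_stage) for _ in range(levels))
        path_b = sum(rng.gauss(0.0, sigma_stage) for _ in range(levels))
        skews.append(path_a - path_b)
    return statistics.pstdev(skews)


for n in (2, 4, 8):
    print(n, "levels -> skew sigma", round(skew_sigma(n), 2),
          "(theory", round((2 * n) ** 0.5, 2), ")")
```

Under these assumptions the skew sigma is sqrt(2·levels) times the per-stage sigma, so going from 8 levels to 2 levels halves the random skew. This is the effect that favours shallow networks when gate delay dominates.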

Reconfigurable clock networks are introduced to optimize the number of clock levels to VDD, which is dynamically scaled from nominal down to sub-threshold. As in Fig. 26.3.2, reconfigurable clock repeaters are bypassed to progressively reduce the number of clock levels within the same clock domain, when VDD is scaled down. Among the several available clock configurations depending on the bypassed levels, an optimal configuration is defined at each VDD. Under dynamic voltage-frequency scaling, clock reconfiguration is set by an augmented look-up table that also includes the usual frequency setting for each VDD (Fig. 26.3.2). Consequently, skew is mitigated across a wide voltage range without being sacrificed at any specific VDD, as opposed to conventional fixed clock networks designed at a given VDD. Clock reconfiguration improves the functional yield and reduces the minimum allowed voltage Vmin, thanks to the mitigation of variation-induced hold (setup) time violations. It also facilitates timing closure at design time, as skew at different VDD settings no longer imposes conflicting requirements on the clock network.
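An augmented look-up table of this kind might look like the sketch below. The data layout and the function name are hypothetical, and every entry other than the 1.1V/350MHz design point is a placeholder rather than a measured value; the configuration strings use the 8-character naming that appears later in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingPoint:
    vdd_mv: int          # supply voltage in millivolts
    freq_mhz: float      # clock frequency setting
    clock_config: str    # one character per clock level


# Hypothetical table. Only the 1100 mV / 350 MHz point comes from the text;
# the other frequencies and voltage breakpoints are illustrative placeholders.
DVFS_LUT = {
    1100: OperatingPoint(1100, 350.0, "11111111"),
    450: OperatingPoint(450, 5.0, "10000011"),
    350: OperatingPoint(350, 1.0, "10000001"),
}


def select_operating_point(vdd_mv: int) -> OperatingPoint:
    """Return frequency and clock configuration for a supported VDD setting."""
    try:
        return DVFS_LUT[vdd_mv]
    except KeyError:
        raise ValueError(f"unsupported VDD setting: {vdd_mv} mV") from None


point = select_operating_point(1100)
print(point.freq_mhz, point.clock_config)
```

The only change relative to an ordinary voltage-frequency table is the extra `clock_config` field, so the clock depth is switched in the same step that changes voltage and frequency.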

Each reconfigurable repeater operates in either normal or bypass mode, depending on the bypass signal (Fig. 26.3.3). When bypass=0, the circuit operates as a conventional CMOS repeater (M1-M4 in Fig. 26.3.3). When bypass=1, the repeater circuitry is disabled via power gating (M5-M8) and bypassed by the pass transistor M9. Since repeaters are bypassed only at low VDD (see Fig. 26.3.2), the gate voltage of M9 in bypass mode is set at VDD,nom=1.1V instead of VDD, to avoid its threshold voltage (VTH) loss (indeed, VDD,nom>VDD+VTH at practical low VDD values at which repeaters are bypassed). This also increases the intrinsic strength of the pass transistor and thus saves area for a targeted strength, at no energy penalty (the gate voltage of M9 is kept constant in any mode). During place and route, bypassable repeaters are treated as standard cells, replacing conventional clock cells. From Fig. 26.3.3, at very low VDD (e.g. Vmin) all intermediate repeaters are bypassed, and the very first one thus needs to drive the entire clock network. This repeater is provided with adequate strength by boosting the NMOS and PMOS gate voltage by 300mV, at negligible energy penalty (1-3% of the FFT core energy, depending on VDD).

A testchip in 40nm was designed to implement a 256-point 16b complex FFT core with the architecture in [5], with conventional and reconfigurable clock. As in Fig. 26.3.4, the testchip includes 3,456 replicas of the critical 8-level clock tree path of the conventional and reconfigurable FFT. Each clock path replica is associated with the fastest path in the FFT core (i.e., with lowest hold margin). These clock path replicas are split into three banks and undergo time-to-digital conversion for statistical skew characterization, implemented by cascading a programmable delay line (for coarse-grain measurements) and a Vernier time-to-digital converter (for fine-grain measurements). The FFT core in the testchip also includes a skew injection mechanism to mimic different variations, so that different FFT cores can be compared fairly at the same hold margin in spite of different core-specific variations. The injected delay covers the entire range of variations expected at all process corners and additional random variations. Hold-time compensation based on hold-fix buffers with adjustable delay was also inserted to measure the hold-time margin improvement gained with skew reduction, based on the observation of timing fault occurrence. Fig. 26.3.4 shows the standard deviation of the clock skew that was measured in the clock tree replicas versus VDD. As expected, the configuration with minimum skew at low VDD has all intermediate levels bypassed and the first and last levels in normal mode (named 10000001, where a 1 marks a level in normal mode and a 0 a bypassed level). At larger VDD, the optimal configuration progressively has fewer bypassed levels (e.g., 10000011), and the conventional clock network with no bypassed level is optimal at nominal VDD. Compared to the latter, repeater bypassing brings benefits below 0.5V (near- and sub-threshold region), with up to 3.3× skew reduction.
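The coarse-plus-Vernier measurement principle mentioned above can be sketched behaviourally. This shows the generic technique rather than the testchip's circuit: the coarse step and the two Vernier stage delays are hypothetical values, and all times are integers in picoseconds.

```python
def measure_skew_ps(delta_ps, coarse_ps=100, slow_ps=25, fast_ps=20):
    """Behavioural coarse + Vernier time-to-digital conversion.

    delta_ps : true delay between the two clock edges (non-negative integer).
    coarse_ps: step of the programmable delay line (hypothetical value).
    slow_ps, fast_ps: stage delays of the two Vernier chains (hypothetical);
                      the fine resolution is slow_ps - fast_ps.
    Returns (coarse_code, fine_code, estimate_ps).
    """
    if delta_ps < 0 or fast_ps >= slow_ps:
        raise ValueError("need delta_ps >= 0 and fast_ps < slow_ps")
    coarse_code = delta_ps // coarse_ps
    residual = delta_ps - coarse_code * coarse_ps
    # The early edge travels down the slow chain, the late edge down the fast
    # chain; count stages until the late edge has caught up with the early one.
    fine_code = 0
    while residual + fine_code * fast_ps > fine_code * slow_ps:
        fine_code += 1
    estimate = coarse_code * coarse_ps + fine_code * (slow_ps - fast_ps)
    return coarse_code, fine_code, estimate


print(measure_skew_ps(237))
```

The coarse stage removes most of the delay so the Vernier chains only need enough stages to cover one coarse step, while the resolution is set by the small difference between the two stage delays rather than by a single gate delay.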

Compared to a conventional clock network, the reconfigurable clock network (Fig. 26.3.5) in 11…1 configuration has the same skew at above-threshold voltages, as expected. When VDD is decreased below 0.5V, the reconfigurable network takes advantage of the optimal choice of bypassed configurations to reduce the clock skew by up to 3.3×. This improvement is obtained by adopting progressively shallower configurations. The allowed VDD range for each clock network configuration is upper bounded due to the impact of wire delay (i.e., the clock slope is substantially degraded due to excessively long uninterrupted wires), and lower bounded due to the skew introduced by repeaters (i.e., the clock skew is too large due to the excessive number of non-bypassed repeaters).
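The configuration names used above can be decoded mechanically. The helper below is a hypothetical illustration that reads a 1 as a level in normal mode and a 0 as a bypassed level, which is the reading implied by 11…1 being equivalent to the conventional network; requiring the first and last levels to stay active is an assumption generalised from the 10000001 example.

```python
def describe_config(config: str) -> dict:
    """Summarise a clock configuration string such as "10000011".

    Assumed convention: "1" = repeater level in normal mode, "0" = bypassed.
    """
    if not config or set(config) - {"0", "1"}:
        raise ValueError("configuration must be a non-empty string of 0s and 1s")
    if config[0] != "1" or config[-1] != "1":
        # Assumption: the first and last levels are never bypassed.
        raise ValueError("first and last levels are expected to stay active")
    active = config.count("1")
    return {
        "levels": len(config),
        "active": active,
        "bypassed": len(config) - active,
    }


for name in ("11111111", "10000011", "10000001"):
    print(name, describe_config(name))
```

Sorting configurations by their `active` count gives the order in which they become optimal as VDD is lowered, from the full 8-level network down to the 2-level one.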

As a design example, the above FFT core was used to measure the benefits of the skew reduction in terms of hold margin and Vmin. The core is designed to run at 350MHz at 1.1V, and energy is reduced by 1.4× when VDD is scaled down to 0.35V. In Fig. 26.3.6, the hold margin is measured by tuning the hold-fix buffer delay to the point of first hold failure, and is clearly determined by the core-specific variations. To fairly compare the hold margin of the conventional and reconfigurable clock under different configurations, controlled skew was injected to set the first point of failure at VDD=0.45V in the conventional and reconfigurable clock with the equivalent 11…1 configuration. This corresponds to a hold margin degradation by 3.4 standard deviations around the nominal margin. Fig. 26.3.6 shows that the optimal configuration in reconfigurable clock networks improves the hold margin by up to 2.5 standard deviations σ of the clock skew at VDD<0.5V, compared to a conventional clock (σ was evaluated through the above experimental statistical skew characterization). Fig. 26.3.6 also shows the voltage range in which each configuration ensures correct operation (i.e., positive hold margin), which, as expected, shifts to lower voltages when more levels are bypassed. As a result of the skew reduction, the reconfigurable clock under configuration 100…01 reduces Vmin down to 0.34V, as compared to 0.45V when the conventional clock is used. From Fig. 26.3.6, the 110mV Vmin reduction comes at the cost of 1.8% in area, due to the larger area of reconfigurable buffers.

In summary, the reconfigurable clock network can mitigate clock skew across a wide range of voltages, adapting to the voltage-specific combination of wire and repeater delay. The dynamic reconfiguration at different voltages can be performed by using a tuning LUT-based scheme that is similar to mainstream voltage-frequency scaling approaches. The resulting skew mitigation simplifies the timing closure from sub-threshold to nominal voltage, increases robustness against variations, and ultimately reduces Vmin.

Acknowledgements:
The authors thank TSMC for chip fabrication and acknowledge the support by the Singaporean Ministry of Education grant MOE2014-T2-2-158.

References:
[1] J. Myers, A. Savanth, et al., “An 80nW Retention 11.7pJ/Cycle Active Subthreshold ARM Cortex-M0+ Subsystem in 65nm CMOS for WSN Applications,” ISSCC, pp. 144-146, 2015.
[2] M. Seok, D. Blaauw, D. Sylvester, “Robust Clock Network Design Methodology for Ultra-Low Voltage Operations,” IEEE JETCAS, vol. 1, no. 2, pp. 120-130, 2011.
[3] S. Jain, S. Khare, et al., “A 280mV-to-1.2V Wide-Operating-Range IA-32 Processor in 32nm CMOS,” ISSCC, pp. 66-67, 2012.
[4] X. Zhao, J. Tolbert, et al., “Variation-Aware Clock Network Design Methodology for Ultralow Voltage Circuits,” IEEE Trans. on CAD, vol. 31, no. 8, pp. 1222-1234, 2012.
[5] M. Seok, D. Jeon, et al., “A 0.27V 30MHz 17.7nJ/transform 1024-pt Complex FFT Core with Super-Pipelining,” ISSCC, pp. 342-344, 2011.




Figure 26.3.1: Clock networks optimized for above-threshold voltages and near/sub-threshold voltages.

Figure 26.3.2: Reconfigurable clock network with repeaters operating in normal/bypass mode (top) and clock tree configuration at different VDD (bottom).

Figure 26.3.3: Schematic of reconfigurable clock repeater (top) and details on gate boosting in the very first repeater (bottom).

Figure 26.3.4: Testchip architecture with FFT cores and testing harness (top), including clock path replicas and resulting skew statistical characterization (bottom).

Figure 26.3.5: Measured improvement of skew standard deviation in reconfigurable clock network with configuration optimized at each VDD.

Figure 26.3.6: FFT core as case study: reconfigurable clock improves hold margin (top left), and Vmin (bottom left). Energy and clock frequency normalized to 1.1V vs. VDD is also shown (top-right).


ISSCC 2017 PAPER CONTINUATIONS

Figure 26.3.7: Die micrograph and chip specifications.



26.4 A 0.4-to-1V 1MHz-to-2GHz Switched-Capacitor Adiabatic Clock Driver Achieving 55.6% Clock Power Reduction

Loai G. Salem, Patrick P. Mercier

University of California, San Diego, CA

Clock distribution in modern SoCs consumes a significant fraction of total chip power. To reduce clock distribution power, resonant clocking schemes, where an inductive reactance is used to cancel the capacitive reactance of global clock networks at a given resonance frequency, fo, have been proposed. Conventionally, such schemes are only suitable at high multi-GHz frequencies in order to be able to place the employed inductors on chip [1, 2]. Since many modern energy-efficient SoC designs optimize for clock frequencies <2GHz, with DVFS techniques bringing the core clock frequencies and the supply voltages VDD to the MHz and near-threshold regimes, respectively, there is a need to develop low-power clock distribution schemes that can work across increasingly wider operating ranges. Recent work in quasi-continuous resonant clocking has proposed intermittent cancelation of global clock-tree capacitance during edge transitions; however, such techniques require large off-chip inductors and are limited to 0.98MHz [3] and 150MHz [4], respectively, owing to the need to operate well below resonance (i.e., < fo/10). Thus, while prior art has shown power reduction for targeted applications, these approaches all require large on- or off-chip magnetics, and do not meet the MHz-to-GHz frequency-range needs of modern DVFS-enabled SoCs. To address these problems, this paper introduces a fully integrated adiabatic clocking scheme that efficiently synthesizes n-step clock waveforms from 1MHz to 2GHz via a switched-capacitor DC-AC multi-level inverter topology, theoretically reducing power to 1/n of its conventional value without using any magnetic component.

Figure 26.4.1 illustrates prior resonant clocking techniques, including intermittent resonant clocking (IRC) [3] and quasi-resonant clocking (QRC) [4] schemes. Conventional approaches utilize an array of on-chip inductors along with a per-inductor decoupling capacitor (>10× CCLK). Unfortunately, CLK power increases once the frequency moves ~±20% away from resonance (fo), thereby limiting DVFS opportunities. On the other hand, IRC and QRC techniques can enable DVFS up to ~fo/10 by employing large off-chip inductors (Fig. 26.4.1, right). However, such approaches can have severe ringing if accurate pulse-width timing is not ensured, thereby requiring power-expensive timing logic overhead (e.g., DLLs). Furthermore, special gate drivers or charge pumps are required to either boost the gate drive voltage of the footer NMOS in IRC techniques, or provide a -VDD/2 gate drive for QRC footer transistor Mf, to ensure that it turns off before its drain voltage goes to -VDD/2 (which is a further device reliability issue).

In contrast, clock power is reduced in the proposed approach through an adiabatic stepwise charging technique implemented using a 4-level switched-capacitor DC-AC inverter topology, shown in Fig. 26.4.1. In this scheme, the CLK capacitance (CCLK) is step-wise charged by sequentially turning on switches s1, s2, s3, and s4, which creates a 4-level voltage staircase whose levels are set by GND, self-balanced capacitors C1 and C2, and VDD. Afterwards, CLK is brought down to GND in the reverse order. Theoretically, 4-level adiabatic charging reduces CLK power by 3×. By repeating the same operation periodically at fCLK, a KVL-constrained multi-phase switched network is established which inherently enforces VL2=VDD/3 and VL3=2VDD/3 without any explicit DC-DC converter.
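The theoretical 3× figure follows from the standard stepwise-charging argument: charging a capacitor through a switch in equal voltage steps dissipates (1/2)·C·ΔV² per step. The sketch below works this through, assuming ideal tank capacitors that hold their voltage and steps that settle completely; the 15pF and 1V values are used only as an example.

```python
def level_voltages(vdd, levels):
    """Equally spaced staircase levels from GND to VDD."""
    steps = levels - 1
    return [vdd * k / steps for k in range(levels)]


def energy_per_cycle(c_clk, vdd, levels):
    """Energy dissipated per full charge/discharge cycle of the clock load.

    Assumes each step settles fully and the intermediate levels are ideal
    voltage sources, so every step dissipates 0.5 * C * dV**2 in the switch.
    """
    steps = levels - 1
    dv = vdd / steps
    per_step = 0.5 * c_clk * dv ** 2
    return 2 * steps * per_step          # charge steps plus discharge steps


C_CLK = 15e-12   # example load capacitance in farads
VDD = 1.0        # example supply in volts

conventional = energy_per_cycle(C_CLK, VDD, levels=2)   # equals C * VDD**2
for levels in (2, 3, 4):
    e = energy_per_cycle(C_CLK, VDD, levels)
    print(levels, "levels:", [round(v, 3) for v in level_voltages(VDD, levels)],
          "ideal saving", round(100 * (1 - e / conventional), 1), "%")
```

Under these ideal assumptions a 4-level staircase dissipates one third of the conventional C·VDD² and a 3-level staircase one half, which sets upper bounds of about 66.7% and 50% on the achievable savings before driver and timing overheads are counted.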

The proposed reconfigurable 4-level inverter, shown in Fig. 26.4.2, is composed of two standard CMOS inverters whose outputs are tied together: an outer inverter powered between VDD and GND, and an inner inverter with a floating supply and ground at 2VDD/3 and VDD/3, respectively. The outer inverter is controlled by signals P and N, periodically connecting its output to VDD or GND, while the inner inverter is controlled by signals Pi and Ni, periodically connecting its output to 2VDD/3 or VDD/3. These control signals are generated by passing the input clock through a tunable chain of inverters, producing three signals, A, B, and C, with equal delay times, Δt, and passing these signals through a custom House-of-Cards (HoC) timing gate (whose operation is logically represented by 8 combinational gates in Fig. 26.4.2). To enable adiabatic charging, the switch’s Ron/Wsw should be set such that the R·CCLK time constant, τ, is less than Δt/1.4. The 4-level inverter requires a total of ~6.7CCLK of self-balancing capacitance, which is 1.8× lower than the capacitance required in conventional resonant schemes. The 4-level inverter can be reconfigured into a 3-level inverter by overlapping pulses Ni, Pi, as shown in Fig. 26.4.2, coarsely decreasing the 10-90% rise/fall time from 0.8×3Δt to 0.8×2Δt. Fine rise/fall time configuration can be adjusted via the tunable delay chain. The 4-level inverter can also be reconfigured into a standard 2-level CMOS inverter by disabling the inner inverter.

The actual implementation of the custom HoC timing gate, optimized to generate N, Ni, Pi, and P with non-overlapping properties in minimal area and power, is shown in Fig. 26.4.3. Non-overlapping pulses are inherently generated in the HoC gate since, when the leaves of the HoC tree turn on, the output pulses must wait until the common root in the tree is charged or discharged. For instance, suppose that ABC=110, so that Ni=0. Then, if C transitions from 0→1, Cp1 and Cp2 (Cp3 and Cp4) are already discharged (charged) when the C / C̄ edge arrives, and hence all controlling pulses (N, Ni, Pi, P) are synchronized without overlap. The HoC gate can be folded to support 4-, 3-, or 2-level timing signals via configuration bits R1 and R0.

The overall architecture of the adiabatic clocking prototype is shown in Fig. 26.4.3. A 4b programmable-strength reconfigurable 4-level inverter is implemented, where all 16 slices share the same VL2 and VL3 nodes, each connected to 50pF of on-chip thick-oxide capacitance. An on-chip current-starved oscillator locked through an off-chip PLL is employed as the clock source. To ensure sufficient rise/fall time for adiabatic operation up to 2GHz, phases A, B, C are provided from the first 3 stages in the 5-stage ring oscillator such that the adiabatic CLK 10-90% (20-80%) rise/fall time is << 24% (<< 18%) of the CLK period. The 4-level inverter drives a 32× pipelined array of 64b MACs. Capacitance from digital logic, CLK wiring, and drain parasitics of the driver totals CCLK≈15pF (2:1:1).
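The 24% and 18% figures can be reproduced with simple arithmetic, under two assumptions that are not stated explicitly in the text: that the 5-stage ring oscillator period equals 2×5×Δt, and that the 20-80% edge time scales as 0.6×3Δt by analogy with the 0.8×3Δt quoted for the 10-90% edge.

```python
def edge_fraction(levels, ring_stages, swing_fraction):
    """Rise/fall time as a fraction of the clock period.

    levels        : number of staircase levels (4 -> 3 steps of delay dt each)
    ring_stages   : stages in the ring oscillator; period assumed 2*stages*dt
    swing_fraction: 0.8 for a 10-90% edge, 0.6 for a 20-80% edge (assumed)
    """
    steps = levels - 1
    return swing_fraction * steps / (2 * ring_stages)


def stage_delay_ps(f_clk_hz, ring_stages):
    """Per-stage delay dt implied by the clock frequency, in picoseconds."""
    return 1e12 / (2 * ring_stages * f_clk_hz)


print("10-90% edge / period:", round(edge_fraction(4, 5, 0.8), 3))
print("20-80% edge / period:", round(edge_fraction(4, 5, 0.6), 3))
print("stage delay at 2 GHz (ps):", round(stage_delay_ps(2e9, 5), 1))
```

Because the phases come from the oscillator itself, the edge time stays a fixed fraction of the period as the frequency is tuned, so the staircase steps scale with the clock rather than needing a separate delay calibration.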

Fabricated in 9M 45nm SOI, the designed global clock distribution, spanning ALOAD=550×550μm², takes the form of a tree-driven grid. The clock tree and grid (as well as the power distribution) occupy the top 2 UT metals M9 and M8, respectively. Each line of the 5-level H-tree is split into multiple fingers as shown in Fig. 26.4.3 to reduce inductance and enable rigid operation up to 10GHz. The adiabatic driver, including self-balancing capacitors, occupies only 0.0187mm² (<6.2% of ALOAD). To quantify the improvement over conventional clocking, the driver is configured into the 2-level mode with reduced drive strength for identical rise/fall time to the 3/4-level modes, while multi-level overhead circuits are off.
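The area and capacitance figures quoted in the preceding paragraphs can be cross-checked with a few lines of arithmetic. The sketch uses only numbers from the text; the variable names are illustrative.

```python
# Area overhead of the adiabatic driver relative to the clocked load.
load_side_mm = 0.550                     # 550 um
a_load_mm2 = load_side_mm ** 2           # load area in mm^2
a_driver_mm2 = 0.0187
area_overhead = a_driver_mm2 / a_load_mm2

# Self-balancing capacitance relative to the clock load.
c_tank_pf = 2 * 50.0                     # VL2 and VL3 nodes, 50 pF each
c_clk_pf = 15.0
cap_ratio = c_tank_pf / c_clk_pf

# Capacitance implied for a conventional resonant scheme (1.8x more).
resonant_ratio = 1.8 * round(cap_ratio, 1)

print("load area (mm^2):", round(a_load_mm2, 4))
print("driver area overhead (%):", round(100 * area_overhead, 2))
print("tank capacitance / CCLK:", round(cap_ratio, 2))
print("implied resonant decap / CCLK:", round(resonant_ratio, 2))
```

The results agree with the text: the driver is just under 6.2% of the load area, the two 50pF tanks amount to about 6.7 CCLK, and 1.8 times that figure is consistent with the >10× CCLK decoupling quoted for conventional resonant schemes.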

Measurement results at 1V in Fig. 26.4.4 indicate 4-level (3-level) clock power savings of at least 42% (28.4%) from 10MHz-2GHz, while successfully operating a digital load with 55.6% (45.5%) peak savings at 10MHz, where adiabatic clocking overhead is minimal. At 0.4V near-threshold operation, 4-level (3-level) clocking successfully achieves a measured power savings of at least 34.4% (22.5%) from 1MHz-267MHz. Fig. 26.4.4 also shows the measured CLK driver energy under DVFS operation between 0.4-1V, showing above 39.4% savings across the entire DVFS range, with 46.5% peak savings. Fig. 26.4.5 shows the measured power savings across all possible voltages and frequencies, indicating a 41.8% average savings across a 2000× dynamic frequency range. The measured transient waveforms of the 4-level operation at 10MHz from a 1V supply are shown in Fig. 26.4.5, via both a common-source PMOS analog buffer (open-drain driver) biased by 25Ω (50Ω on PCB and 50Ω input of a sampling scope) for 0.75V/V gain, and a cascaded inverter chain. Fig. 26.4.6 compares the proposed design to the state-of-the-art clocking schemes, demonstrating the widest adiabatic frequency and supply voltage dynamic ranges with the highest clock power savings, all with minimal overhead. A die photo is shown in Fig. 26.4.7.

References:
[1] S. Chan, P. Restle, et al., “A 4.6GHz Resonant Global Clock Distribution Network,” ISSCC, pp. 342-343, 2004.
[2] P. Restle, D. Shan, et al., “Wide-Frequency-Range Resonant Clock with On-The-Fly Mode Changing for the POWER8™ Microprocessor,” ISSCC, pp. 100-101, 2014.
[3] H. Fuketa, M. Nomura, et al., “Intermittent Resonant Clocking Enabling Power Reduction at Any Clock Frequency for 0.37V 980kHz Near-Threshold Logic Circuits,” ISSCC, pp. 436-437, 2013.
[4] F. Rahman, V. Sathe, “Voltage-Scalable Frequency-Independent Quasi-Resonant Clocking Implementation of a 0.7-to-1.2V DVFS System,” ISSCC, pp. 334-335, 2016.




Figure 26.4.1: Inductive resonant clocking techniques (top); proposed switched-capacitor multi-level adiabatic clocking technique (bottom).

Figure 26.4.2: Circuit schematics and timing diagrams of the 4-level inverter, showing how it can be reconfigured into 3- and 2-level modes.

Figure 26.4.3: Schematic of the custom HoC timing gate (top); architecture of the implemented test chip (bottom).

Figure 26.4.4: Measured clock power improvement of 4- and 3-level clocking compared to conventional clocking at 1V and 0.4V across frequency (left); measured CLK energy-per-bit improvement of the 4-level inverter across supply voltages (right).

Figure 26.4.5: Measured 4-level clocking savings across 1MHz-2GHz and 0.4-1V (top); measured waveforms of the 4-level clock (bottom).

Figure 26.4.6: Comparison of the proposed adiabatic clocking scheme vs. resonant clocking implementations.


Figure 26.4.7: Micrograph of the fabricated chip.



26.5 Adaptive Clocking in the POWER9™ Processor for Voltage Droop Protection

Michael S. Floyd1, Phillip J. Restle2, Michael A. Sperling3, Pawel Owczarczyk3, Eric J. Fluhr1, Joshua Friedrich1, Paul Muench3, Timothy Diemoz3, Pierce Chuang2, Christos Vezyrtzis2

1IBM, Austin, TX; 2IBM, Yorktown Heights, NY; 3IBM, Poughkeepsie, NY

Increasing transistor counts in modern processors can create instantaneous changes in current, driving nanosecond-speed supply voltage (VDD) droops that require extra guardband for correct product operation. The POWER9 processor uses an adaptive clock strategy to reduce timing margin needed during power supply droop events by embedding analog voltage-droop monitors (VDMs) that direct a digital phase-locked loop (DPLL) to immediately reduce clock frequency in response.

The POWER9 chip contains 24 cores organized into 6 structures called Quads, each containing 5 power domains, consisting of 4 cores and a cache region, whose clock grids all share a DPLL. As seen in Fig. 26.5.1, a voltage sense point from each domain is sampled (solid lines) by each of the 5 VDMs, whose outputs (dashed lines) then combine to feed the DPLL. The Quad uses an asynchronous interface to communicate to the nest (a.k.a. uncore), allowing per-Quad dynamic voltage and frequency scaling (DVFS). Other adaptive clocking schemes only support fixed clock ratio reduction [1,2], have high response latency, use canary circuits with significant overhead [3], or rely on complex full error-detection with retry [4]. These schemes cannot achieve meaningful guardband reduction without over-reaction or over-detection causing noticeable performance loss. The Quad’s asynchronous design and fractional-N DPLL allow frequency to be adjusted quickly at smaller granularity to maintain timing margin. A previous DLL mitigation scheme [5] similarly reduced frequency by less than a factor of 2, but with a significant increase in jitter, requiring a lowered frequency to maintain timing guardband. This technique has negligible area impact, since the VDMs, control and routing overhead, and added DPLL circuitry together occupy just 0.12% of the 65mm² Quad area.

VDMs were chosen over other sensors [6] that are difficult to calibrate over the wide range of DVFS required by the product. In contrast, the VDM needs only the voltage set point required at each frequency target, a relationship already stored in a DVFS table for each chip during manufacturing test. The VDM circuit is 105×123 microns in 14nm silicon and is derived from the sense circuit used by our internal voltage regulation macro. The VDM detects over- and under-voltage conditions by comparing the VDD supply grid to the desired voltage set point. Using an 8b voltage identification (VID) code corresponding to the target frequency, the VDM derives an analog voltage target from a fixed reference, as depicted in Fig. 26.5.2(a). An offset generator uses this target to create multiple threshold compare voltages, each tuned in 8mV steps, which then feed comparators against the differentially sensed VDD. These produce a negative-active thermometer code, where a zero signifies lack of voltage margin, to indicate when VDD has crossed each voltage detect threshold, as seen in Fig. 26.5.2(b). A code of “1111” is an Overvolt condition, and “1110” means that VDD is within the “Nominal” (acceptable) range for the target frequency when no droop is present. The codes “1100”, “1000”, and “0000” indicate Small, Large, and Extreme droops, respectively. The VDM filters out high frequency noise over a 1ns timescale and responds with a droop indication within 1.5ns, using digital comparators that output a new code every cycle. The VDM is designed within ±1% accuracy by using auto-zero techniques for the analog and comparator circuit blocks.
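The thermometer-code behavior described above can be illustrated with a small behavioral model. This is a minimal sketch, not the VDM circuit: the 800mV target, the specific threshold offsets (chosen here as multiples of 8mV), and all names are assumptions made for illustration. Only the five code-to-condition mappings come from the text.

```python
from enum import Enum
from typing import Sequence


class DroopLevel(Enum):
    """Code-to-condition mapping as listed in the text."""
    OVERVOLT = "1111"
    NOMINAL = "1110"
    SMALL = "1100"
    LARGE = "1000"
    EXTREME = "0000"


# Hypothetical values: an 800 mV set point with threshold offsets that are
# multiples of 8 mV. The real offsets are not given in the text.
TARGET_MV = 800
THRESHOLDS_MV = [TARGET_MV - 48, TARGET_MV - 32, TARGET_MV - 16, TARGET_MV + 16]


def thermometer_code(vdd_mv: float, thresholds_mv: Sequence[float]) -> str:
    """Model one comparator per threshold, lowest threshold first.

    Negative-active: a bit is '1' while VDD is above its threshold and
    '0' once VDD has fallen to or below it (voltage margin lost).
    """
    return "".join("1" if vdd_mv > t else "0" for t in thresholds_mv)


def classify(vdd_mv: float, thresholds_mv: Sequence[float]) -> DroopLevel:
    """Translate the sensed VDD into the named droop condition."""
    return DroopLevel(thermometer_code(vdd_mv, thresholds_mv))


for vdd in (820, 800, 780, 760, 740):
    print(vdd, thermometer_code(vdd, THRESHOLDS_MV), classify(vdd, THRESHOLDS_MV).name)
```

Because the thresholds are monotonically ordered, only the five listed codes can occur in this model, which is why a direct lookup is sufficient.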

The output codes from the 5 VDMs are combined with NAND-invert buffers and routed unlatched using fastest available wire to the DPLL, which accounts for metastability and determines if a response is needed. In the absence of droop protection, the DPLL calculates reduction amounts based upon the number of active devices in its digitally controlled oscillator (DCO). When a droop code activates, the DPLL jumps to a lower frequency by instantly turning off a percentage of active devices in its DCO (fast path shown in Fig. 26.5.3(a)). Each Small or Large jump amount is selectable in 3.125% increments “M”, where droop_DCO_setting = nominal_DCO_setting * (1 - M/32). The DPLL is designed to jump to within ±(M*0.25)% of the recovery target across the supported range of frequencies and operating conditions. The frequency filter setting (control path) similarly gets its multiplier reduced by the same amount (M/32) to more accurately lock to the new target until the droop event ends. The DPLL also supports intermediate jumps between Small and Large droop indications. Regardless of the jump amount, the total latency from threshold crossing until the clock is slowed at the circuits is 6ns at typical core frequencies (Fig. 26.5.3(b)), which is sufficient to recover timing margin before the droop can induce a circuit failure. Two options are available to quickly recover frequency as the droop subsides: either gradually add devices back to the DCO every reference clock cycle or instantly add the devices. In both cases, fewer devices are added back than the original amount to prevent overshoot. Then, the nominal frequency filter multiplier is restored, allowing normal DPLL dynamics to add the remaining devices back to the DCO as it naturally slews and locks to the original target. These droop mitigation circuits account for 15% of the 0.0333mm² DPLL area.
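The jump relation droop_DCO_setting = nominal_DCO_setting * (1 - M/32) and the ±(M*0.25)% accuracy target can be worked through numerically. The sketch below is illustrative only: the nominal setting of 640 active devices is a hypothetical value, the 0 to 32 range check on M is an assumption, and the function names are not from the paper.

```python
def droop_dco_setting(nominal_dco_setting: float, m: int) -> float:
    """Apply droop_DCO_setting = nominal_DCO_setting * (1 - M/32)."""
    if not 0 <= m <= 32:  # assumed valid range; the text does not state one
        raise ValueError("M must be between 0 and 32")
    return nominal_dco_setting * (32 - m) / 32


def jump_percent(m: int) -> float:
    """Jump size in percent: each increment of M is 1/32 = 3.125%."""
    return 100.0 * m / 32


def jump_tolerance_percent(m: int) -> float:
    """Design accuracy of the jump, +/-(M * 0.25)% of the recovery target."""
    return 0.25 * m


NOMINAL = 640  # hypothetical count of active DCO devices
for m in (1, 2, 4):
    print(
        f"M={m}: setting={droop_dco_setting(NOMINAL, m):.1f}, "
        f"jump={jump_percent(m):.3f}%, tolerance=+/-{jump_tolerance_percent(m):.2f}%"
    )
```

With these assumed numbers, M=1 and M=2 correspond to the roughly 3% and 6% responses, and M=2 and M=4 to the 6% and 12% settings.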

For initial hardware test, a slow core-clock frequency of 1GHz was used to allow cycle-accurate clock waveform measurements, and an artificial voltage droop was induced to a single core. Figure 26.5.4 shows a 6% clock frequency reduction 5 cycles after the VDM detects the droop. This design also leverages the multiple VDM detection thresholds and DPLL actions to optimize droop response. When a VDM within a Quad triggers a Small droop condition, its DPLL will reduce clock frequency by a small amount to start mitigation without causing a larger impact unless needed. For that case, when the VDM measures a Large droop, the DPLL further reduces its output by another multiple of 1/32. While the adaptive clock function was verified above 4GHz, cycle-accurate frequency measurements of the core mesh clock in our test setup are limited to 2.0GHz. Figure 26.5.5 shows both small and large frequency jumps as the corresponding VDM thresholds are crossed in response to a real voltage droop event, followed by a gradual recovery back to the target frequency. While our models and initial testing indicate that 3% and 6% frequency responses should be sufficient, these traces were collected using 6% and 12% due to measurement jitter. Analysis of typical POWER9 chip usage shows droop events to require a few microseconds of this modest frequency reduction at most every millisecond, so performance loss is less than 0.01%.
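The performance-loss bound follows from a duty-cycle argument: the clock is slowed only for a short interval in each period. The sketch below assumes that performance scales linearly with clock frequency and uses hypothetical inputs (a 2μs event once per millisecond at a single 1/32 step); none of these specific values are stated in the text.

```python
def droop_performance_loss(event_s: float, period_s: float, freq_reduction: float) -> float:
    """Average fractional performance loss from periodic clock slowdowns.

    Assumes performance is proportional to clock frequency, so the loss is
    the fraction of time spent slowed times the fractional slowdown.
    """
    return (event_s / period_s) * freq_reduction


# Hypothetical inputs: 2 us of slowdown per 1 ms, at one 1/32 (3.125%) step.
loss = droop_performance_loss(2e-6, 1e-3, 1 / 32)
print(f"estimated loss = {loss:.5%}")
```

Under these assumptions the estimate stays below the 0.01% figure quoted above; longer events or larger jump settings scale the result proportionally.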

Figure 26.5.6 contains a modeled voltage droop for a worst-case activity step synchronized across all cores, and the modeled adaptive clock response. The instantaneous demand for current in Fig. 26.5.6(c) triggers the two-step frequency reduction in response, as seen in Fig. 26.5.6(b). The second-level 6.25% frequency response reduces the power supply noise margin, in Fig. 26.5.6(d), required for the largest possible droops by 50% (-20ps for the dotted line versus 0ps, as compared to +20ps at final steady state). While the reduction amount is primarily chosen to maintain timing margin during the droop event, the resultant frequency reduction also lowers active power by the same amount, reducing droop magnitude by 17%, as seen by the dotted line in Fig. 26.5.6(a).

In summary, this adaptive clocking droop mitigation technique, demonstrated in 14nm silicon, has been shown to achieve a 6ns response time and reduce the noise margin needed by at least 50%. This reduction in required circuit guardband translates into either 8% less power, or 3.5% additional performance within the same power envelope, for typical operation of the POWER9 processor. This is due to the VDM’s capability to detect droops with 8mV resolution in under 2.5ns combined with the DPLL’s ability to accurately drop frequency at 1/32 granularity in 1.8ns without increased jitter. Unlike previous techniques, this is achieved with negligible silicon area, minimal performance loss during the droop event, and without complex recovery architecture or intrusive detection circuitry.

References:
[1] C. Takahashi, S. Shibahara, et al., “A 16nm FinFET Heterogeneous Nona-core SoC Complying with ISO26262 ASIL-B: Achieving 10⁻⁷ Random Hardware Failures per Hour Reliability,” ISSCC, pp. 80-81, 2016.
[2] K. Bowman, C. Tokunaga, et al., “A 22 nm All-Digital Dynamically Adaptive Clock Distribution for Supply Voltage Droop Tolerance,” JSSC, vol. 48, no. 4, pp. 907-916, 2013.
[3] D. Ernst, N. Kim, et al., “Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation,” IEEE Int’l. Symp. on Microarchitecture, 2003.
[4] D. Blaauw, S. Kalaiselvan, et al., “Razor II: In Situ Error Detection and Correction for PVT and SER Tolerance,” ISSCC, pp. 400-401, 2008.
[5] A. Grenat, S. Pant, et al., “Adaptive Clocking System for Improved Power Efficiency in a 28nm x86-64 Microprocessor,” ISSCC, pp. 106-107, 2014.
[6] A. Drake, R. Senger, et al., “A Distributed Critical-Path Timing Monitor for a 65nm High-Performance Microprocessor,” ISSCC, pp. 398-399, 2008.




Figure 26.5.1: Die photo of the POWER9 Quad structure showing 5 independent power regions and its associated adaptive clocking network. Logical connectivity is shown between voltage sense points (circles), VDMs (triangles), and DPLL.

Figure 26.5.2: (a) Block diagram of the voltage-droop monitor (VDM) circuit; (b) VDM digital output response to sensed input voltage (VDD).

Figure 26.5.3: (a) Block diagram of the DPLL control circuits with droop protection paths highlighted; (b) response latency breakout of each element of the adaptive droop protection mechanism.

Figure 26.5.4: Scope trace of 1GHz mesh clock, showing 6% frequency reduction 4 cycles after VDM detects a small droop.

Figure 26.5.5: Scope trace of 2.0GHz mesh clock, showing two-level frequency jumps followed by a gradual recovery, in response to VDM detection. Note that the ringing on the VDD trace after the initial droop is being amplified by the lab probe and scope setup. The gradual recovery from large droop can be seen between 10 and 50ns, while the small droop recovery occurs between 155 and 180ns. The DPLL then slews back to full target frequency by 220ns.

Figure 26.5.6: Full system noise model of maximum activity-step droop event on all cores, with adaptive clocking enabled and disabled.
