design - uc davis ecevojin/classes/eec280/protected/islped04/... · 200 300 400 500 600 700 0 200...

74
1 Managing Standby and Active Mode Leakage Power in Deep Sub-micron Design Lawrence T. Clark Dept. of Electrical Engineering Arizona State University Rakesh Patel Intel Corp. Timothy S. Beatty Intel Corp.

Upload: phunghanh

Post on 30-Mar-2018

213 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

1

Managing Standby and Active Mode Leakage Power in Deep Sub-micron

Design

Lawrence T. ClarkDept. of Electrical Engineering

Arizona State University

Rakesh PatelIntel Corp.

Timothy S. BeattyIntel Corp.

Page 2: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

2

Outline

• Introduction and motivation• Standby leakage management

– Drowsy mode: Reverse body bias and supply collapse• Circuit design and operation• System level results• Limitations

– Thick gate shadow latches• Active leakage management

– Multiple Vt and channel assignment– Drowsy memories– Thick gate SRAM

• Conclusions

Page 3: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

3

Outline

• Introduction and motivation• Standby leakage management

– Drowsy mode: Reverse body bias and supply collapse• Circuit design and operation• System level results• Limitations

– Thick gate shadow latches• Active leakage management

– Multiple Vt and channel assignment– Drowsy memories– Thick gate SRAM

• Conclusions

Page 4: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

4

Power in Hand-held Electronics• Battery capacity is limited

– Batteries are heavy• Capacity is proportional to weight

• Problematic for hand-held devices– Cell phone batteries typically 600 to 1200 mA hrs

• Power budget shared between analog, digital, and transmit– Digital IC budget decreasing while performance increases

• Two scenarios– Active operation—100’s mW

• Limits talk time (typically few hrs)– Standby—100 µW

• Limits time waiting for calls (typically 100’s hrs)– There is active power in standby mode

• Each contact with the cell is an active transmit/receive operation—occurs on the order of once every 1-2 seconds

Page 5: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

5

Voltage scaling• Voltages must scale as transistors scale to avoid excessively

high fields—this in turn requires Vt scaling�Higher leakage• Total IC power

• VDD2 active power dependence makes supply scaling the

most effective lever for low power design– Makes low power and high performance design the same

• Higher absolute (maximum VDD performance) affords meeting lower application demand for performance at lower voltages

• High performance equals low power if done efficiently– Assumes that operating voltage is not a constraint

• Frequency proportional to voltage so dropping voltage derives roughly VDD

3 change in power– Assumes that lower frequency is acceptable

• Also greatly affects leakage components on advanced processes

PTOTAL = αCVDD2F + ILEAKVDD

Page 6: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

6

0.1

100

200

300

400

500

600

700

0 200 400 600 800

Frequency (MHz)

Pow

er (m

W)

Vt=390mV

Vt=500mV

Voltage scaling: Effect of Vt• Low Vt helps active power if VDD scaled

Page 7: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

7

Voltage Scaling: Effect of Vt

0.1

1

10

100

0 50 100 150 200

Frequency (MHz)

Pow

er (m

W)

Vt=390mVVt=500mV

• Low Vt problematic for standby power

Page 8: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

8

Deep Sub-micron MOSFET Leakage• Four Primary Components

– Drain Source Leakage (Ioff)– Gate Leakage– Gate induced drain leakage (GIDL)– Junction band to band tunneling currents

Vdd

Gate

Drain

Bulk

Source

Igate

Ioff

IGIDL, Izener

POLY

OXIDE

Page 9: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

9

Circuit Methods for Leakage Control• Body bias techniques

– Reverse body bias (RBB) [1-2]– Forward body bias (FBB) [3]– These are the least invasive to the design, small area cost

• MTCMOS techniques [4]– State retentive

• Balloon latches [5]• Multi-Vt design [6-7]

– Non-state retentive• Thick gate storage

– Alleviates gate leakage [8-10]• Essentially store state in a generation N-x transistor

– Highest area cost, most effective, potentially difficult design

Page 10: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

10

Multi-threshold CMOS (MTCMOS)

• High Vt transistors gate power to low Vt circuits [4]– Leakage dominated by

the high Vt gating transistors

• Not inherently state retentive– Power cost of moving

state off chip is a penalty paid on entry and exit to low power state

VSSSUPpad

VDD pad

VSS

Logiccore

MGate

VSSSUP

Page 11: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

11

The Need for State Retention• Integrated circuits have increasing storage capacity

– The storage constitutes the “state” of the machine• Many commercially shipping low power standby

modes have not been state retentive– Save in external memory

• Incurs power penalty for the IO to save state• Still requires low power storage—somewhere

• Example: SA-1100 StrongARM microprocessor [11]– Write back cache state requires 16 µs at 66 MHz for 8 kB– 3.3V IO pins loaded with 35pF each gives 12.5 µJ

• This creates a power floor of 1.25 mW if used 100 times/second– Still must account for the external storage power– Standby power and leakage of IO ring and real-time clock

specified to be 165 µW

Page 12: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

12

Outline

• Introduction and motivation• Standby leakage management

– Drowsy mode: Reverse body bias and supply collapse• Circuit design and operation• System level results• Limitations

– Thick gate shadow latches• Active leakage management

– Multiple Vt and channel assignment– Drowsy memories– Thick gate SRAM

• Conclusions

Page 13: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

13

Reverse Body Bias• State retentive• Design only

– Can be used on any process– Allows use of a leakier, faster process at same ISB

• Increase VSB during standby– Raises Vt due to “body effect”

• Electrical control allows this only during standby– Use VSB = 0 during active operation

• Done first commercially on 0.25 µm microprocessor [1]– VDD = 1.8 V– N-well driven to IO voltage (3.3 V)– Charge pump drives P type substrate to –2 V– Fine granularity power supply grids

• 1000’s of local supply switches

Page 14: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

14

1

10

100

1000

10000

-1.4 -1.3 -1.2 -1.1 -1 -0.9 -0.8 -0.7 -0.6 -0.5

Vds (V)

Ids

(pA

/ µµ µµm

)

Vbs = 0 V

Vbs = 0.2 V

Vbs = 0.4 VVbs = 0.6 V

Vbs = 0.8 V

Vbs = 1, 2, 3, and 4 V

RBB and Power Supply Collapse• 0.18 µm PMOS transistor measurements

Page 15: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

15

Drowsy: RBB and Supply Collapse• Drain Source Leakage (Ioff)

– Decreases at linear or better rate with VDD collapse• Depends on process DIBL

– Decreases with a square root VSB dependency• Gate Leakage

– Decreases faster than VDG2 (can give V4 power impact)

• Sensitive to physical oxide thickness

• Gate induced drain leakage (GIDL)– Lower voltage has a very large effect

• Essentially eliminated with supply collapse

• Junction band to band tunneling currents– Unaffected!

• This requires careful transistor design and circuit design interaction

• Otherwise likely to be the limiting factor in future usage

Page 16: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

16

RBB Circuit Design• Apply body bias by raising the source

– Naturally applies supply collapse

Page 17: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

17

Power Supply Routing• Substrate and well taps on a 50 µm grid [11]

– Highly doped epi substrate for low VSSSUP impedance– N-wells contiguous to grid in substrate for VDDSUP

50 µµµµm

50 µµµµm

VSSSUP

VDDSUP

= tap cell

N well

N well

Page 18: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

18

RBB Circuit Design• Amplifier & reference voltage based VSS regulator

– Reference tracks with VDD• Allows larger VDS and VSB if needed

Page 19: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

19

Drowsy Operation• Well is actively pulled up for RBB

– Logic circuit leakage passively pulls VSS up to produce RBB and power supply collapse

0.00

0.20

0.40

0.60

0.80

1.00

1.20

1.40

1.60

1.80

0 0.0001 0.0002 0.0003 0.0004 0.0005 0.0006

time (s)

supp

ly v

olta

ge (V

)

Page 20: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

20

VSS Regulator Stability and Power• Design achieves

60o phase margin at all process corners

• Amplifier operates in subthreshold– Low gain

• VSS regulator consumes less than 4 µA– Key since it

contributes to total power consumption in Drowsy mode

-120

-100

-80

-60

-40

-20

0

20

1.00E+02 1.00E+03 1.00E+04 1.00E+05 1.00E+06 1.00E+07 1.00E+08

frequency(Hz)

-300

-250

-200

-150

-100-50

0

50

100

150

200

1.0E+02 1.0E+03 1.0E+04 1.0E+05 1.0E+06 1.0E+07 1.0E+08

frequency(Hz)

Page 21: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

21

N-well (VDDSUP) Regulator Design• Low value and high cost to using high voltage• Use a textbook bootstrapped voltage reference

– Low Vt VDNMOS source follower• Provides very low dropout even with high body bias• Note startup circuit

RESET

VDDSUP

VDDIO

VDD

Vbias

Vref

Page 22: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

22

Testing with Guardband• External access to the internal supply nodes essential

– Allows observability [12]– Allows controllability

• Find point of fail independent of regulator• Drive current into core to provide guardband

Regulator output Vss

Circ

uit s

tate

loss

Vss

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

0 0

.1 0

.2 0

.3 0

.4 0

.5 0

.6 0

.7 0

.8 0

.9

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

0 0

.1 0

.2 0

.3 0

.4 0

.5 0

.6 0

.7 0

.8 0

.9

Regulator output Vss

Circ

uit s

tate

loss

Vss

Page 23: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

23

Outline

• Introduction and motivation• Standby leakage management

– Drowsy mode: Reverse body bias and supply collapse• Circuit design and operation• System level results• Limitations

– Thick gate shadow latches• Active leakage management

– Multiple Vt and channel assignment– Drowsy memories– Thick gate SRAM

• Conclusions

Page 24: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

24

Time Division Multiplexed Drowsy• Time Multiplex between active operation and

Drowsy mode to simulate a low leakage process [11]– At low effective frequency (FEFF) burst operate at a high

frequency to make time for low standby power mode– E.g., 300 MHz operation, 30 bursts per second, 100k

instructions per burst achieves 3 MHz FEFF• 99% of the time is spent in the low standby power mode

• Energy cost of entry and exit must be small– Need to amortize this penalty with leakage savings– Can’t know a-priori duration of standby state

• Applicable to cellular communications– 1-2 seconds between contact with cells in standby

• Applicable to hand-held devices, e.g., PDAs– Between keystrokes or pen-strokes

Page 25: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

25

Experimental Operation• Board using an Intel XScale 80200 microprocessor

– Power supplies brought external to measure power consumption using Agilent ammeter and PC

– Separate measurements accounted for IR drop

Page 26: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

26

TDM Drowsy: Code and Behavior• Code was programmable to run a loop of instructions

as the interrupt handler– Loop counter determined the interrupt instruction count

• At the end of the loop, the microprocessor re-entered Drowsy mode– Drowsy mode is exited by interrupts

• Code:

• BTB holds state, no cache misses

outerLoop:MOV R0, #instructions_per_interrupt

work: SUBS R0, R0, #1 ; decrement countBNE work ; loop while count != 0DROWSE ; wait for interruptB outerLoop

Page 27: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

27

0.00001

0.0001

0.001

0.01

0.1

1

1.E+03 1.E+04 1.E+05 1.E+06 1.E+07 1.E+08 1.E+09

FEFF (Hz)

I (A

)

1e7 inst/int1e6 inst/int1e5 inst/int1e4 inst/intD model

TDM Drowsy: Results• Measured system results

– 300 MHz, VDD = 1V

Page 28: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

28

0.00001

0.0001

0.001

0.01

0.1

1

1.E+03 1.E+04 1.E+05 1.E+06 1.E+07 1.E+08 1.E+09

FEFF (Hz)

I (A

)

1e7 inst/int1e6 inst/int1e5 inst/intD modelI(pll) model

TDM Drowsy: Results• Comparison with “Standby” mode

– Standby has single IO clock interrupt latency

~100x

Page 29: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

29

0.00001

0.0001

0.001

0.01

0.1

1

1.E+03 1.E+04 1.E+05 1.E+06 1.E+07 1.E+08 1.E+09

FEFF (Hz)

I (A

)

1e7 inst/int1e6 inst/int1e5 inst/intD modelI(pll) modelI model

TDM Drowsy: Results• “Standby” with PLL disabled

– Leakage reduction limited by regulator resolution

28x

Page 30: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

30

Energy Cost of Entry and Exit• Power Components

– Active power• Active mode leakage—lumped with the above

– PLL power• During operation and for 20 µs before active operation to lock

– Drowsy mode leakage– Power supply movement power

• Passive entry saves ½ of the VSS component• Also saves power if resume soon after entering Drowsy• Both VSS and VSSSUP are small swing• CVSS = 55 nF• CVSSSUP = 5 nF

• Total energy overhead equivalent to approximately 60 clock cycles of active power at VDD = 1V

Page 31: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

31

Voltage scaling: Effect of Drowsy

0.1

1

10

100

0 50 100 150 200

Frequency (MHz)

Pow

er (m

W)

Vt=390mVwith Drowsy

Vt=390mV

Vt=500mV

Page 32: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

32

Interfacing Domians• It is easy to create sneak paths! [12]

– Signals between voltage domains must be driven full rail– Avoid pass gate interfaces

• Latches suffice to isolate domains

VDD=1V

0V

1V

Vss(GND)=0V

1V

0.65

V

Drowsy Domain Vss=0.65V

Vss(GND)=0V

See also [13] for a set of rules for MTCMOS designs

Non-Drowsy domain

Drowsy domain

Both “off” transistors

connect the domains

Page 33: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

33

Low Leakage ESD Clamping• Drowsy mode current small enough to make

otherwise negligible contributors significant– Large PMOS transistors in ESD clamps important– Fixed by RBB on clamp devices [14]

• Also provides FBB during ESD transients• Equivalent or better performance when tested using HBM, MM

VSSSUP

P1

VDDVDDIO

N1

N2

Page 34: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

34

Other Implementations• Same scheme used in [15] for 0.13 µm SRAM

– No PMOS body bias• We can speculate that the PMOS leakage was substantially

lower than NMOS, so no value in PMOS RBB• We have seen this on other 0.13 µm processes

• This approach will be used for 65 nm handheld devices [16]– 0.5 V VDD-VSS

Page 35: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

35

Outline

• Introduction and motivation• Standby leakage management

– Drowsy mode: Reverse body bias and supply collapse• Circuit design and operation• System level results• Limitations

– Thick gate shadow latches• Active leakage management

– Multiple Vt and channel assignment– Drowsy memories– Thick gate SRAM

• Conclusions

Page 36: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

36

Limitations of Drowsy Modes• State stability [2]

– High fan-in domino circuits can have N to P ratios of 100’s to 1

• Need highly balanced storage, MTCMOS logic

• Channel length– Aggressively scaled transistors have poor body

transconductance gMB• Need to back off from the highest performance possible

• Drain to bulk tunneling currents– Requires less steep halo doping gradient at drain

• No halo is best—this will limit transistor scaling

• Defects– Stacking faults generate nearly 10 µA of leakage

• Turns Drowsy cells into defect detectors• May be problematic for strained silicon

Page 37: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

37

Leakage Control Limitations• Body bias vs. channel length

– Bulk control lost as transistors approach punchthrough• This is where high performance processes are often targeted

NMOS Transistor Leakage25C, Vd=1V

0.0001

0.001

0.01

0.1

1

10

0.3 0.35 0.4 0.45 0.5 0.55

Drawn Channel Length (um)

Leak

age

(nA

/um

)

Vb=0V

Vb=0.2V

Vb=0.4V

Vb=0.6V

PMOS Transistor Leakage25C, Vd=1V

0.0001

0.001

0.01

0.1

1

10

0.3 0.35 0.4 0.45 0.5 0.55Drawn Channel Length (um)

Leak

age

(nA

/um

)

Vb=0V

Vb=0.2V

Vb=0.4V

Vb=0.6V

Page 38: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

38

Combining MTCMOS and Drowsy

State retaining domainMTCMOSLogic

domain

VSSmemVSS

To regulator

D

VSSSUP

Q

VDD

VDDSUPCLKVDD

M20

DI

• Only state elements have RBB applied [11]– The rest of the circuits are

“slept” using MTCMOS• This eliminates about 2/3 of

the total leakage– Allows highly balanced

state elements• Drowsy can be pushed to

even lower VDS• Leverage high Igate VGS

dependency

Page 39: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

39

“Balloon” Latches

• A state retentive MTCMOS scheme [5]– High Vt transistors gate

low Vt circuits– State retained in high Vt

balloons• Circuit speed remains a

function of low Vt

• Does not address Igateleakage component

STDBYEN

CLK

D

QN

High Vt“bubble”

HVT

HV

THVT

HV

T

HV

T HVTH

VTH

VT

VDDSTDBY

HVT

Gated supply domain

Page 40: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

40

1E-10

1E-09

1E-08

1E-07

0.20 0.40 0.60 0.80 1.00 1.20

VDD (V)

I (A

)

Igate

IoffVSB = 0 to 1V

Leakage on Advanced Processes• Many of the old techniques will still be applicable

– RBB and supply collapse still works• Supply collapse (VOLTAGE SCALING) is key

– 10 µm wide 65 nm NMOS characteristics using BPTM [17]

Page 41: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

41

Leakage on Advanced Processes

• Gate leakage suppressed 2 orders of magnitude– Consistent with previous results [18]

• RBB and supply collapse still works– Cutting the voltage is critical

• Lowers Ioff by the DIBL coefficient• Pulls the transistor away from punchthrough and gives control back to

bulk• Longer channel suppresses DIBL the same way

• Key point not obvious is drain to bulk band to band tunneling– This is becoming the dominant component and is helped by lower

voltage– Transistor design is also important

• Future devices will apply “Drowsy” style RBB and supply collapse for SRAM’s on 90 and 65 nm [16]

Page 42: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

42

Outline

• Introduction and motivation• Standby leakage management

– Drowsy mode: Reverse body bias and supply collapse• Circuit design and operation• System level results• Limitations

– Thick gate shadow latches• Active leakage management

– Multiple Vt and channel assignment– Drowsy memories– Thick gate SRAM

• Conclusions

Page 43: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

43

Addressing Igate: Thick Gate State Retention

• Add a “shadow” thick gate (and high Vt) latch to retain state during standby [8]– Using the thick gate IO

transistors implieshigher Vt

• The gate length can be pushed--not exposed to high drain voltages

• Eliminate thin gate latch if speed is unimportant

STDBYEN

CLK

D#

Q

M4 M5

SHADOWLATCH

Page 44: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

44

Thick Gate State Retention Operation

0.00 0.50 1.00 1.50 2.00 2.50 3.00 3.50 4.00

Time (ns)

PD

D0

CLK

ST0

STN

DN

D

PD

CLK

D

Q

ST0

M5

Thick gatearea

D0 DN

M4

STN

– TSMC 0.18 µm thick gate and BPTM 65 nm thin gate

Page 45: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

45

Safer Thick Gate State Retention• The VDD supply

needn’t be completely discharged if a uni-directional path is provided from the thick to thin gate circuitry [9]– Recall that it is best to

disable supplies, allowing movement via leakage rather than driving them low

ACT2LOW

CLK

D#

Q

M4 M5

SHADOWLATCH

LOW2ACT

M2 M3

M1

Page 46: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

46

Thick Gate State Retention: Another Approach• This approach uses

uni-directional circuits in both directions– More transistors– May ensure thick gate

write-ability over a wider voltage range

• Adding transistors to slave in MSFF does not incur a speed penalty

– From USPTO website [10]

REST#

CLK

D#

Q

M4 M5

SHADOWLATCH

SAVE M1

REST

M2

M3

Page 47: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

47

Thick Gate State Retention: Results• Master-slave flip-flop

– Using projected 65 nm thin gate transistors• Ioff = 10.1 nA/µm• Igate = 8.0 nA/µm

– Using TSMC 0.18 µm thick gate• Ioff = 10 pA/µm• 40 angstrom electrical tox provides negligable gate leakage

– Result is over 7200x leakage savings for just MSFF!• Does not include thin gate logic between the flip-flops

• So we’ve eliminated the Igate standby contribution• But… The real limiter will be drain edge band

to band tunneling– Not modeled in this analysis– Will require process work

• Cost?

Page 48: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

48

Outline

• Introduction and motivation• Standby leakage management

– Drowsy mode: Reverse body bias and supply collapse• Circuit design and operation• System level results• Limitations

– Thick gate shadow latches• Active leakage management

– Multiple Vt and channel assignment– Drowsy memories– Thick gate SRAM

• Conclusions

Page 49: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

49

Managing Active Leakage• Leakage is becoming a large part of overall power

– Up to 40% of total power at the worst-case corners on high performance processes

– Not as problematic on low power processes…yet• Transistor scaling will increasingly force “low

power,” really low-leakage processes to lower Vt’sor some of the scaling value will be lost– Designers that deal with the leakage will provide a

competitive advantage compared to those that don’t• But this is hard

– All schemes create some kind of cost– Cost must be optimized to balance active/standby power

• Including the energy cost of moving between the states

Page 50: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

50

Shutting Down Units• Can work well for low activity factor blocks

– Floating point, small granularity cache banks on µP– Some blocks may be unused for applications on SOC

• Key factor is the cost of discharging and charging the block power supply– The power cost must be amortized by the leakage

savings– Also must be careful about IR drop through the switches

• Very difficult

• Switch overhead is very low– Beware of inductive effects on supplies [18]– My solution is under-driving the switch transistor gates [2]

• [18] used staged turn-on of the switches

Page 51: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

51

Shutting down units: Example• Multiplier total transistor width is 60 mm• Leakage power savings is

P = VI = ½ (0.8) V Ileak = 6.5 mW(essentially 0.065 nJ per cycle at 1 GHz)

at VDD = 1.3 V and Ileak ~ Ioff = 100 nA/µm @ 100oC• Power supply capacitance of 0.5 nF

– One on gating is 0.33 nJ• That’s 5 clock cycles at 1 GHz

– Results will be very sensitive to decoupling capacitance• More is better for performance and noise• Less is better for gating

• It can be difficult to do this effectively and easy for particular behaviors to become higher power

Page 52: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

52

Outline

• Introduction and motivation• Standby leakage management

– Drowsy mode: Reverse body bias and supply collapse• Circuit design and operation• System level results• Limitations

– Thick gate shadow latches• Active leakage management

– Multiple Vt and channel assignment– Drowsy memories– Thick gate SRAM

• Conclusions

Page 53: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

53

Dual Threshold Voltages• About 5% fabrication cost adder

– No active power adder– Leakage cost is over 10x per low Vt compared to high Vt

• Minimize low Vt transistors for low power– Design with high Vt

• Don’t forget this increases active (switching) power• Higher supply voltage for the same speed

– Insert low Vt only on difficult speed paths– Tendency for widely separated high and low Vt targets

• Can be very difficult for high performance design– Tendency to over-insert– Difficulty with noise on dynamic circuits– Less separated high and low Vt targets– Timing accuaracy effects

• High and low Vt needn’t track each other

Page 54: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

54

Multiple Channel Lengths• Longer channel decreases DIBL rapidly

– About 7x Ioff decrease on aggressive 90 nm process• No process cost

– But a small size adder (1-2%)• Active power cost

– Up to 15%, depending on activity factor• Caution required with insertion on high speed circuits

– Slow circuits dominated by leakage can be all long L• High Vt and low Vt track each other at process

corners– Essentially, both get faster and slower together

Page 55: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

55

Multiple Channel Lengths: Physical Design

• Add one grid– Layout must have one grid of space to add the gate length– Small overall area cost

• Can also be added at mask synthesis– Better resolution, but harder to check

Page 56: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

56

Multiple Channel Lengths: Timing• Priority for insertion

– Low activity factor• Leave the clocks alone! Maximize leakage savings

– Wide transistors• Maximize leakage savings

– Low activity factor• Work on post-layout data

– Otherwise you don’t really know your timing margin• Fix hold time violations after long channel insertion

– Slower (long L) gates fix these for free• Logic block long L insertion must be automated

– Timing must be re-calculated after each insertion– We used an insertion tool using Langrangian Relaxation

Page 57: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

57

Timing Margin• Cumulative block timing shows timing slack

– Negative path fixes comprise the work to meet timing

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

-0.55

-0.42

-0.29

-0.15

-0.02 0.11

0.242

0.374

0.506

0.638 0.77

0.902

1.034

1.166

1.298 1.43

1.562

Cum. Path Slack [ns]

Path

s

Page 58: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

58

Timing Margin• Low Vt insertion moves paths from negative slack to zero

slack– Without Low Vt, this requires logic, sizing changes

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

-0.55

-0.42

-0.29

-0.15

-0.02 0.11

0.242

0.374

0.506

0.638 0.77

0.902

1.034

1.166

1.298 1.43

1.562

Cum. Path Slack [ns]

Path

s

Page 59: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

59

Timing Margin: Long L insertion• Don’t fix every path to zero timing margin

– Statistical variation will impact the yield– Timing tools are not perfect

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

-0.6

-0.4

-0.3

-0.2

-0.1 0

0.11

0.22

0.33

0.44

0.55

0.66

0.77

0.88

0.99 1.

1

1.21

1.32

1.43

1.54

Cum. Path Slack [ns]

Path

s

Page 60: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

60

Timing Margin: Effect of Variation• Variation modeled as channel length

– Likelihood of a path becoming worse than 0 ns slack with variation (creating a timing failure) vs. original path distance from critical [19]

0.0010

0.0100

0.1000

1.00000.

00

1.00

2.00

3.00

4.00

5.00

6.00

7.00

8.00

% Distance from TMaxCP

Prob

abili

ty o

f Fai

lure

Page 61: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

61

Results: Long L insertion• Automatic insertion on microprocessor logic blocks

– 90 nm process

Block 1 2 3 4 5 6 7 8 9 10 11 12 13

Ioff reduction (%)

39 43 44 48 45 38 49 49 48 47 39 42 39

Page 62: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

62

Outline

• Introduction and motivation• Standby leakage management

– Drowsy mode: Reverse body bias and supply collapse• Circuit design and operation• System level results• Limitations

– Thick gate shadow latches• Active leakage management

– Multiple Vt and channel assignment– Drowsy memories– Thick gate SRAM

• Conclusions

Page 63: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

63

Drowsy Memories• Leakage dominates memory power on deep submicron

processes– Low activity factor leads to low active power

• Supply collapse suggested for limiting cache memory leakage [20]– Can be done by bank or row

– Very aggressive processes have low RBB impact• Backing off the gate length fixes this—usually needed anyways

Bank 00

Vdd VddLow

Sel00 Supply collapsedto VDDLow

when not accessed

Normal supply voltagewhen accessed--Key for stability

Page 64: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

64

Drowsy Memories: Driving VSS

• Driving VSS allows greater leakage reduction– Can still be done on a row by row basis

• Helps write speed, Ref. [3] used FBB to improve read speed and stability

– Used effectively on a 0.13 µm process [21]• Note this is the same as the full-chip Drowsy mode described

previously• No PMOS RBB• NMOS and PMOS leakage not always balanced

VssDrowsy

Bank 00

Supply collapsed toVSSDrowsy and RBBApplied to NMOS

Supply is VSSfor read

Page 65: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

65

Memory Decode• Most transistor width in decode

is in WL drivers– WL low when inactive, PMOS

leakage predominates– Long channel can be used

• No logic change– Gating VDD more effective [22]

BSel VDD

Sel0 WL0

Sel1 WL1Virtual VDD

Page 66: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

66

Outline

• Introduction and motivation• Standby leakage management

– Drowsy mode: Reverse body bias and supply collapse• Circuit design and operation• System level results• Limitations

– Thick gate shadow latches• Active leakage management

– Multiple Vt and channel assignment– Drowsy memories– Thick gate SRAM

• Conclusions

Page 67: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

67

Thick Gate SRAM

VDD

WLEN

BL

WLhv

BLN

PRECHN

VDD

Sel2Sel1

VDDhv

VDDhv

• Bitlines precharged to the core VDD– SRAM cells operate from VDDhv –ensures stability– Level shift at the WL driver--keeps decoder low power

• Use thick gate transistors for SRAM– High Vt, no appreciable Ioff or Igate currents

Page 68: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

68

Thick Gate SRAM: Layout and Size

• Gate length cannot scale with thicker tOX

• Transistors must be re-targeted from the I/O transistors– Avoid punchthrough with high Vt– Eliminate halo for low drain to

bulk band to band tunneling– Longer gate to keep gate

control (but short as possible)• Cell about 20-40% larger than

thin gate cells

Page 69: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

69

Thick Gate SRAM: Array Layout

Page 70: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

70

SRAM Cell Read Stability

Vout(n1), Vin(n0)

Vou

t(n0)

, Vin

(n1)

BLBL

WL

Iread

n0 n1

• Current through inverter pulldown raises cell logic low level during read—particularly at low voltage– Due to mis-match (even RDF [23]) the static noise margin

can be much smaller than expected– Some cells flip when read

Page 71: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

71

0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6

VDD (V)

Failu

re p

roba

bilit

y (m

ax. 1

))

Baseline

TGSRAMVarying transistor sizing

Thick Gate SRAM Cell Stability

• Stability improved in thick gate SRAM design– Size matters! Better matching– Also greatly helped by lower precharge voltage…

• Until VDD – VtTG reached—then looks like a write

Page 72: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

72

Conclusions• Standby power is moving from a process to a

design problem– Process scaling increases leakage

• There is a lot of room for improvement– Huge and growing hand-held, wearable, and medical

markets will stimulate creative solutions• Design solutions can limit many components

– Low VGS limits, thick gate eliminates Igate– High Vt or body bias limits Ioff– Drowsy limits GIDL

• Combines low VGS, simulates high Vt

• Transistor design will also matter– Drain to bulk tunneling current will be limiting

• Requires limited or no halo implants

Page 73: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

73

Questions?

Acknowledgment• Many people contributed to this work:

Kim VelardeShay DemmonsFranco RicciManish BiyaniEd BawolekNeil DeutscherMike MorrowBill BrownDave McCarrollEric HoffmanAlfredo Barrenechea

Page 74: Design - UC Davis ECEvojin/CLASSES/EEC280/protected/islped04/... · 200 300 400 500 600 700 0 200 400 600 800 ... – Write back cache state requires 16 ... – Decreases with a square

74

References[1] H. Mizuno, et al., An 18-µA standby current 1.8-V, 200-

MHz microprocessor with self-substrate-biased data-retention mode, IEEE JSSC, 34, Nov., 1999, pp. 1492-1500.

[2] L. Clark, N. Deutscher, F. Ricci, and S. Demmons, Standby power management for a 0.18 µm microprocessor, Proc. ISLPED, 2002, pp. 7-12.

[3] H. Mizuno and T. Nagano, Driving source-line cell architecture for low-V high-speed low-power applications, IEEE JSSC, 31, April, 1996, pp. 552-558.

[4] S. Mutoh, et al., 1V power supply high-speed digital circuit technology with multithreshold-voltage CMOS, IEEE JSSC, 30, Aug., 1995, pp. 847-854.

[5] S. Shigematsu, et al., A 1-V high-speed MTCMOS circuit scheme for power-down application circuits, IEEE JSSC, 32, June, 1997, pp. 861-870.

[6] J. Kao and A. Chandrakasan, “Dual-threshold voltage techniques for low-power digital circuits,” IEEE JSSC, 25, July, 2000, pp. 1009-1018.

[7] Q. Wang and S. Vruhula, “Algorithms for minimizing standby power in deep submicrometer, dual-Vt CMOS circuits,” IEEE Trans. On Computer-aided Design of Int. Circuits and Systems, pp. 306-318.

[8] L. Clark and F. Ricci, “Low standby power using shadow storage,” US Patent #6,639,827

[9] L. Clark, F. Ricci, and M. Biyani, “Low standby power state retention for sub-130 nm processes,” accepted to IEEE JSSC.

[10] U. Ko, D. Scott, S. Gururajarao, H. Mair, “Retention register for system-transparent state retention,” US patent application 2004000871.

[11] L. Clark, M. Morrow, and W. Brown, Reverse Body Bias and Supply Collapse for Low Effective Standby Power, to appear in IEEE Trans. VLSI, Sept., 2004.

[12] L. Clark, D. McCarroll, and E. Bawolek, Characterization and debug of reverse body bias low power modes, Electronic Device Failure Analysis, 6, Feb., 2004, pp. 13-21.

[13] B. Calhoun, F. Honore, and A. Chandrakasan, “A leakagereduction methodology for distributed MTCMOS,” IEEE JSSC, 39, May, 2004, pp. 818-827.

[14] T. Maloney, S. Poon, and L. Clark, “Methods for DesigningLow-leakage Power Supply Clamps,” to appear in Journal of Electrostatics, October, 2004.

[15] K. Osada, Y. Saitoh, E. Ibe, K. Ishibashi, “16.7pA/cell tunnel-leakage-suppressed 16Mb SRAM for handling cosmic-ray-induced multi-errors,” ISSCC Proc., 2003.

[16] S. Zhao, et al., “Transistor optimization for leakage powermanagement in a 65 nm CMOS technology for wireless andmobile applications,” VLSI Symp. Tech. Dig., 2004, pp. 14-15.

[17] UC Berkeley Device Group. Berkeley predictive technologymodel [online]. http://www-device.eecs.berkeley.edu/~ptm.

[18] S. Kim, S. Kosonocky, and D Knebel, “Understanding andminimizing ground bounce during mode transition of power gating structures,” Proc. ISLPED, 2003, pp. 22 – 25.

[19] A. Barrenechea, “Design impact of process variation,” M.S.Thesis, University of New Mexico, 2004.

[20] K. Flautner, et al., Drowsy caches: Simple techniques forreducing leakage power, Proc. ISCA’02, p. 148, 2002.

[21] K. Min, K. Kanda, and T. Sakurai, “Row-by-row dynamic source-line voltage control (RRDSV) scheme for two orders of magnitude leakage current reduction of sub-1-V-VDD SRAM’s,” Proc. ISLPED, 2003, pp. 66-71.

[22] K. Itoh, “Trends in low-voltage embedded-RAM technology,” Proc. 23rd Int. Conf. on Microelectronics, 2002, p. 497.

[23] A. Bhavnagarwala, T. Xinghai, J. Meindl, The impact ofintrinsic device fluctuations on CMOS SRAM cell stability,IEEE JSSC, 36, no. 4, 2001, pp. 658-665.