Gate.1 ISCA Tutorial: Low Power Design ©MJI&VN, PSU, 2000
Tutorial Outline
8:30 - 8:45    Introduction and motivation
8:45 - 9:05    Sources of power in CMOS designs
9:05 - 9:30    Power analysis tools and techniques
9:30 - 10:30   Gate & functional unit design issues & techniques
10:30 - 10:50  BREAK
10:50 - 12:15  Architectural level issues and techniques
12:15 - 1:30   LUNCH
1:30 - 2:30    Low power memory system design
2:30 - 3:30    Software level issues and techniques
3:30 - 3:50    BREAK
3:50 - 4:30    Software level issues and techniques, con’t
4:30 - 4:45    Future challenges
Design Levels
Abstraction level      Analysis capacity  Analysis accuracy  Analysis speed  Analysis resources  Energy savings
Application            Most               Worst              Fastest         Least               Most
Behavioral
Architectural (RTL)
Logic (Gate)
Transistor (Circuit)   Least              Best               Slowest         Most                Least
Basic Principles of Low Power Design
• Reduce switching (supply) voltage
  » quadratic effect -> dramatic savings
  » negative effect on performance
• Reduce capacitance
• Reduce switching frequency
  » switching activity
  » clock rate
• Reduce glitching
• Reduce short circuit currents (slope engineering)
• Reduce leakage currents

$P = C_L V_{DD}^2 f_{0 \to 1} + t_{sc} V_{DD} I_{peak} f_{0 \to 1} + V_{DD} I_{leakage}$
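As a rough numerical illustration of the three terms in the power equation, the sketch below evaluates them at two supply voltages. The function name, parameter names, and every operating-point value are hypothetical, chosen only for illustration. It also keeps the peak short-circuit current fixed, which a real circuit would not.

```python
def cmos_power(c_load, vdd, f_switch, t_sc, i_peak, i_leak):
    """Return the (dynamic, short-circuit, leakage) power terms in watts."""
    dynamic = c_load * vdd ** 2 * f_switch          # C_L * VDD^2 * f(0->1)
    short_circuit = t_sc * vdd * i_peak * f_switch  # t_sc * VDD * I_peak * f(0->1)
    leakage = vdd * i_leak                          # VDD * I_leakage
    return dynamic, short_circuit, leakage


# Hypothetical operating point: 1 pF load, 100 MHz of 0->1 transitions,
# 50 ps short-circuit time, 100 uA peak current, 1 nA leakage current.
params = dict(c_load=1e-12, f_switch=100e6, t_sc=50e-12, i_peak=100e-6, i_leak=1e-9)

full = cmos_power(vdd=2.5, **params)
half = cmos_power(vdd=1.25, **params)

# Halving VDD cuts the dynamic term to one quarter (the quadratic effect).
dynamic_ratio = half[0] / full[0]
print(full, half, dynamic_ratio)
```

The short-circuit and leakage terms scale only linearly with VDD in this simplified model, which is why the supply voltage is listed first among the savings levers.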
Low Energy Gates: Transistor Sizing
• Use the smallest transistors that satisfy the delay constraints
  » slack time - difference between required time and arrival time of a signal at a gate output
    – Positive slack - size down
    – Negative slack - size up
• Make gates that toggle more frequently smaller
• Size for slope engineering to reduce short circuit currents
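The slack rule can be sketched in a few lines. This is only an illustration of the definition on the slide: the function names are hypothetical, and the "keep" result for zero slack is an assumption, since the slide covers only positive and negative slack.

```python
def slack(required_time, arrival_time):
    """Slack = required time minus arrival time at a gate output."""
    return required_time - arrival_time


def sizing_hint(required_time, arrival_time):
    """Map slack to the sizing direction given on the slide."""
    s = slack(required_time, arrival_time)
    if s > 0:
        return "size down"   # signal arrives early: smaller transistors suffice
    if s < 0:
        return "size up"     # signal arrives late: delay constraint is violated
    return "keep"            # assumption: zero slack leaves the size unchanged


print(sizing_hint(required_time=5.0, arrival_time=4.2))  # positive slack
print(sizing_hint(required_time=5.0, arrival_time=5.6))  # negative slack
```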
Low Energy Gates: Transistor Pin Ordering
• Logically equivalent inputs may not have identical energy/delay characteristics
[Figure: gate schematic with labels A, B, Out, Ci, and Cout]
• To conserve energy (and improve speed), connect inputs so that most active input is nearest output
• Need to know signal statistics
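Low Energy Gates: Dynamic Gate Pin Ordering
• Dynamic gates exhibit higher switching activity (and add to clock load) but are fast
[Figure: two dynamic gate schematics, one with data inputs !A, !B, !C and one with data inputs A, B, C, each with select inputs SelA, SelB, SelC; annotated "If A, B, and C have low signal probability"]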
Low Energy Gates: Gate Restructuring
• Logically equivalent CMOS gates may not have identical energy/delay characteristics
Low Energy Gate Networks: Balanced Delay Paths
Equalize lengths of timing paths through logic
[Figure: two implementations of a network of gates F1, F2, and F3, annotated with signal arrival times]
• Reduce glitching by balancing the delay path
Low Energy Gate Networks: Network Restructuring
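• Consider logic topology alternatives
[Figure: chain and tree implementations of F from inputs A, B, C, D using 2-input gates; every input has signal probability 0.5. Chain: W from A and B (3/16), X from W and C (7/64), F from X and D (15/256). Tree: Y from A and B (3/16), Z from C and D (3/16), F from Y and Z (15/256).]
Chain implementation has a lower overall switching activity than the tree implementation
Ignores glitching effects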
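The switching activities in the figure can be reproduced with a short calculation. The sketch assumes 2-input AND gates with statistically independent inputs, and takes the activity of a node as p(1 - p), where p is the probability that the node is 1. These modeling choices are assumptions that happen to match the figure's numbers; the helper names are hypothetical.

```python
from fractions import Fraction


def and_prob(pa, pb):
    """Probability that a 2-input AND output is 1 (independent inputs assumed)."""
    return pa * pb


def activity(p):
    """Switching activity of a node with signal probability p: (1 - p) * p."""
    return (1 - p) * p


half = Fraction(1, 2)

# Chain: W = A.B, X = W.C, F = X.D
w = and_prob(half, half)
x = and_prob(w, half)
f_chain = and_prob(x, half)
chain = [activity(w), activity(x), activity(f_chain)]

# Tree: Y = A.B, Z = C.D, F = Y.Z
y = and_prob(half, half)
z = and_prob(half, half)
f_tree = and_prob(y, z)
tree = [activity(y), activity(z), activity(f_tree)]

print("chain:", chain, "total", sum(chain))
print("tree: ", tree, "total", sum(tree))
```

Under these assumptions the chain totals 91/256 and the tree 111/256, which agrees with the slide's statement that the chain has the lower overall activity when glitching is ignored.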
Network Restructuring, con’t
• Logically equivalent gate networks may not have identical energy/delay characteristics
Technology mapping
[Figure: alternative mappings of F = ABCD compared on delay, area, and energy]
Low Energy Gate Networks: Network Input Ordering
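• Input ordering
[Figure: two orderings of inputs A, B, C (signal probabilities 0.5, 0.2, 0.1) through two 2-input gates, with intermediate node X and output F. A and B first: activity at X = (1 - 0.5 x 0.2) x (0.5 x 0.2) = 0.09. B and C first: activity at X = (1 - 0.2 x 0.1) x (0.2 x 0.1) = 0.0196.]
Beneficial to postpone the introduction of signals with a high transition rate (signals with signal probability close to 0.5)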
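To see which input pair is best introduced first, the sketch below ranks every pair by the activity of the intermediate node X, using the signal probabilities from the figure. It assumes a 2-input AND as the first gate and independent inputs; the names `probs`, `first_gate_activity`, and `ranking` are hypothetical.

```python
from itertools import combinations

# Signal probabilities taken from the figure.
probs = {"A": 0.5, "B": 0.2, "C": 0.1}


def first_gate_activity(pair):
    """Activity at X when the two inputs in `pair` feed the first AND gate."""
    p_one = probs[pair[0]] * probs[pair[1]]
    return (1 - p_one) * p_one


# Rank the three possible first-gate pairings from lowest to highest activity.
ranking = sorted(combinations(probs, 2), key=first_gate_activity)
best_pair = ranking[0]

for pair in ranking:
    print(pair, round(first_gate_activity(pair), 4))
```

The pairing that leaves A, the input closest to probability 0.5, until the last gate comes out lowest, which is the point the slide makes.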
Dual Supply Voltages
• Use two VDD’s (e.g., 2.5V and 1.5V)
  » use the higher supply for gates on the critical path
  » use the lower supply for gates off the critical path
• Reduces energy without a performance loss
• Cons
  » slight area penalty
  » increased design time
  » need level converters to interconnect gates on different supplies (to avoid static currents)
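A back-of-the-envelope estimate of the saving is sketched below. It assumes switching energy of C x VDD^2 per transition, uses the 2.5V and 1.5V supplies from the slide, and applies them to a small hypothetical set of gates. The gate list and capacitances are invented for illustration, and the overhead of level converters is ignored.

```python
VDD_HIGH = 2.5  # volts, from the slide
VDD_LOW = 1.5   # volts, from the slide


def switching_energy(cap, vdd):
    """Energy drawn from the supply per switching event: C * VDD^2."""
    return cap * vdd ** 2


# Hypothetical gates: (load capacitance in farads, on critical path?)
gates = [(10e-15, True), (10e-15, False), (20e-15, False), (10e-15, True)]

single_supply = sum(switching_energy(cap, VDD_HIGH) for cap, _ in gates)
dual_supply = sum(
    switching_energy(cap, VDD_HIGH if critical else VDD_LOW)
    for cap, critical in gates
)

saving = 1 - dual_supply / single_supply
print(f"energy saving: {saving:.1%}")
```

Each gate moved to the lower supply uses (1.5 / 2.5)^2 = 36% of its original switching energy, so the total saving depends on how much capacitance sits off the critical path.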
Dual Threshold Voltages
• Use two VT’s (e.g., 0.6V and 0.3V for VDD = 2.5V)
  » use the lower threshold for gates on the critical path
  » use the higher threshold for gates off the critical path
• Improves performance without an increase in power
• Cons
  » increased fabrication complexity
  » increased design time
  » beware of increased leakage in low VT portion of the circuit - could end up with increased power!
Functional Unit Energy Optimization
• Key processor core functional units
  » latches and (pipeline) registers
  » ALUs - adders, multipliers, barrel shifters
  » control logic (FSMs)
  » interconnect
  » multi-ported register file
• On-chip memories (ROMs, caches, SRAMs, eDRAMs)
• MMU, TLB
• Clock generation and distribution
• Off-chip interconnect (pads)
Flipflops and Pipeline Registers
• Consume a lot of energy because they are clocked every cycle
  » Clock energy (Ec)
    – energy dissipated when the ff is clocked with stable data
  » Data energy (Ed)
    – energy dissipated when the ff is clocked and the data has changed so that the ff changes state
  » Typically the data rate (fd) is much lower than the clock rate (fc)
• Also impacts clock energy since a large portion of clock energy is used to drive the sequential elements
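One way to combine Ec, Ed, fd, and fc into an average energy per clock cycle is sketched below. It assumes that each cycle costs either Ec (data stable) or Ed (data changed), with the fraction of data-changing cycles equal to fd/fc. That weighting is an interpretation of the definitions on the slide, not a formula the slide states, and the energy values in the example are hypothetical.

```python
def ff_energy_per_cycle(e_clock, e_data, data_rate, clock_rate):
    """Average flipflop energy per clock cycle.

    Assumption: a fraction fd/fc of cycles change state and cost Ed;
    the remaining cycles see stable data and cost Ec.
    """
    activity = data_rate / clock_rate
    return (1 - activity) * e_clock + activity * e_data


# Hypothetical flipflop: Ec = 10 fJ, Ed = 30 fJ, data toggling at 10% of the clock rate.
avg = ff_energy_per_cycle(e_clock=10.0, e_data=30.0, data_rate=10e6, clock_rate=100e6)
clock_share = (1 - 10e6 / 100e6) * 10.0 / avg
print(f"average energy per cycle: {avg:.1f} fJ, share spent on idle clocking: {clock_share:.0%}")
```

With a low data rate most of the energy goes to cycles in which nothing changes, which is the motivation for the clock-gating and low-clock-load flipflops on the following slides.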
Power Consumption in Latches
[Figure: share of latch power (% Power, 0 to 100) due to Data and Clock versus latch data activity factor (Latch Data AF, 0 to 0.5), with a latch schematic labeled D, Q, CLK, and CLKB]
From Tiwari, 1998
Some Typical CMOS FFs
[Figure: schematics of four flipflops, each with D, Q, and CLK: Static TG FF, Dynamic C2MOS FF, Dyn Precharged TSPC FF, Dyn Non-Precharged TSPC FF]
FF Power Comparison
[Figure: Relative Power Consumption (0 to 30) versus Latch Data AF (0.05 to 0.45) for TGFF, C2MOS, PTSPC, and NPTSPC]
From Svenson, 1996
Energy Efficient Flipflops
[Figure: schematics of the StrongArm SA110 FF and the Power PC 603 FF, with signals D, Q, CLK, CLKB, VDD, and GND. Annotations: "16 transistor, CLK & CLKB, 4 clock loads each" and "20 transistor, CLK, 3 clock loads"]
EDP of Some Low Power FFs
[Figure: bar chart of EDPtot (fJ), 0 to 80, with High, Low, and Average bars for HLFF, SDFF, PowerPC, mC2MOS, SA110FF, and K6ETL]
From Stojanovic, 1998
Self-Gating FF
• When ff input is equal to its output, suppress internal clocking to conserve energy
  » gating function is derived within the FF
[Figure: self-gating flipflop schematic with D, Q, CLK, and internal clock Φ; annotated "Strict rules on when D can change wrt CLK"]
Power of Self-Gated FF
[Figure: Power dissipation versus Data switching rate fd/fc for SG FF and Reg FF]
From Reyes, 1996
Double Edge Triggered FF
[Figure: double edge triggered flipflop schematic with D, Q, CLK, and CLKB]
Loads data at both rising and falling clock edges
DETFF Pros and Cons
• Advantages
  » Clock frequency can be halved to achieve the same computational throughput: Pd = 0.84Ps
  » Also get a 2X energy savings in the clock network
• Disadvantages
  » About 15% larger in transistor count
  » Maximum operating frequency less
  » Strict requirements on clock skew
  » Requires a strict 50% duty cycle
  » Larger clock load
Adders (Subtractors)
synchronous word parallel adders
  ripple carry adders (RCA)
  carry prop min adders
    signed-digit adders
    fast carry prop adders
      Manchester carry chain
      carry select
      carry lookahead
      conditional sum
      carry skip
    residue adders

Delay (T) and area (A) annotations from the figure (their attachment to individual adder types is not recoverable from the transcript):
T = O(n), A = O(n)
T = O(1), A = O(n)
T = O(log n), A = O(n log n)
T = O(n), A = O(n)
T = O(n**1/2), A = O(n)
PDP of Different Adders
[Figure: PDP (0 to 100) at 8, 16, 32, 48, and 64 bits for RCA, MCCA, CSkA, VSkA, CSlA, CLA, BKA, and ELMA]
From Nagendra, 1996
Brent-Kung (CLA) Adder
[Figure: 16-bit Brent-Kung Parallel Prefix Computation with inputs (g0, p0) through (g15, p15) and carry outputs c1 through c16. Annotations: T = log2 n, T = log2 n - 1, A = 2 log2 n, A = n/2]
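The prefix computation in the figure can be modeled at bit level as follows. This is a minimal sketch of a Brent-Kung style network, assuming the word length is a power of two, a carry-in of 0, and the usual generate/propagate combine rule. The function name and the in-place list representation are illustrative choices, not taken from the slides.

```python
def brent_kung_add(a, b, n=16):
    """Add two n-bit unsigned integers using a Brent-Kung style prefix network."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]    # bit generate
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]  # bit propagate
    G, P = g[:], p[:]                                  # group generate/propagate

    def combine(hi, lo):
        # (G, P)[hi] = (G, P)[hi] o (G, P)[lo]
        G[hi] |= P[hi] & G[lo]
        P[hi] &= P[lo]

    levels = n.bit_length() - 1  # log2(n)

    # Forward tree: log2(n) levels, n/2 cells in the first level.
    for d in range(levels):
        half, stride = 1 << d, 1 << (d + 1)
        for i in range(stride - 1, n, stride):
            combine(i, i - half)

    # Reverse tree: log2(n) - 1 levels fill in the remaining prefixes.
    for d in range(levels - 2, -1, -1):
        half, stride = 1 << d, 1 << (d + 1)
        for i in range(stride + half - 1, n, stride):
            combine(i, i - half)

    carries = [0] + G  # c0 = 0 and c(i+1) = G[i]
    total = sum((p[i] ^ carries[i]) << i for i in range(n))
    return total | (carries[n] << n)


print(hex(brent_kung_add(0x1234, 0xFEDC)))
```

Within each level the cells read only positions that the same level does not write, so every level could be evaluated in parallel in hardware even though this model runs it sequentially.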
BK and ELM Adder Optimization
[Figure: EDP (pJ), 0 to 20, versus Number of bits (16, 32, 64) for BK Classic, BK Hybrid, ELM Classic, and ELM Hybrid]
Parallel Multipliers
• Form partial product array in parallel and add it in parallel
  » can use multiplier recoding to reduce the height of the partial product array by half
  » recoding may cost more energy than it saves!
  » use delay balancing to reduce glitching
• Array multipliers (regularity)
• Pipelined multipliers (higher throughput, longer latency, less glitching but adds to clock load)
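How recoding halves the height of the partial product array can be illustrated with radix-4 Booth recoding. The slide does not name a recoding scheme, so choosing Booth here is an assumption, and the function name is hypothetical. The sketch turns an n-bit two's-complement multiplier into n/2 digits in the range -2 to 2, each of which selects one partial product.

```python
def booth_radix4_digits(multiplier, n):
    """Recode an n-bit two's-complement multiplier (n even) into n/2 radix-4 digits."""
    assert n % 2 == 0, "n must be even"
    bits = [(multiplier >> i) & 1 for i in range(n)]
    prev = 0  # implicit 0 to the right of the least significant bit
    digits = []
    for i in range(0, n, 2):
        # digit = b(2k-1) + b(2k) - 2 * b(2k+1), always in {-2, -1, 0, 1, 2}
        digits.append(prev + bits[i] - 2 * bits[i + 1])
        prev = bits[i + 1]
    return digits


def booth_multiply(multiplicand, multiplier, n):
    """Sum one partial product per digit: n/2 rows instead of n."""
    digits = booth_radix4_digits(multiplier, n)
    return sum(d * multiplicand * 4 ** k for k, d in enumerate(digits))


print(booth_radix4_digits(0b01101110, 8))  # four digits for an 8-bit multiplier
print(booth_multiply(37, -45, 8))
```

The multiples -2, -1, 0, 1, and 2 of the multiplicand need only shifts and negation, but the recoding and selection logic is extra circuitry, which is why the slide warns that recoding may cost more energy than it saves.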
Parallel Multiplier Structure
[Figure: Q (‘ier) and D (‘icand) feed the multiple forming circuits, followed by the partial product array reduction tree and a fast CPA that produces P (product)]
muxes + tree reduction (log n) + CPA
PP Array Reduction Process
[Figure: the ‘icand and ‘ier form the partial product array; (4,2) counters reduce it to a reduced partial product array that goes to the CPA]
(4,2) Counters
• Built out of (3,2) counters (FA’s)
• Tiles with neighboring (4,2) counters
• Can use delay balancing in cell design and interconnect to reduce glitching
[Figure: (4,2) counters built from (3,2) counter cells]
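A (4,2) counter built from two (3,2) counters can be modeled at bit level as below. This is a sketch of one common arrangement, assumed here rather than taken from the figure: the first full adder's carry goes sideways to the neighboring cell, and a carry from the neighbor enters the second full adder. The function and signal names are hypothetical.

```python
from itertools import product


def full_adder(a, b, c):
    """(3,2) counter: three input bits in, (sum, carry) out."""
    total = a + b + c
    return total & 1, total >> 1


def counter_4_2(x1, x2, x3, x4, cin):
    """(4,2) counter from two full adders; returns (sum, carry, cout)."""
    s1, cout = full_adder(x1, x2, x3)   # cout does not depend on cin
    s, carry = full_adder(s1, x4, cin)  # cin comes from the neighboring cell
    return s, carry, cout


# Check the arithmetic identity for every input combination.
for bits in product((0, 1), repeat=5):
    s, carry, cout = counter_4_2(*bits)
    assert sum(bits) == s + 2 * (carry + cout)
print("all 32 input combinations verified")
```

Because cout is independent of cin, a row of these cells has no ripple path from one end to the other, which is what lets them tile with their neighbors.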
PP Array Reduction Tree Structure
[Figure: the multiplicand drives the multiple generators, controlled by the multiple selection signals (‘ier); their outputs pass through successive levels of (4,2) counter slices to the CPA]
Glitch Reduction by Pipelining
• Glitches are dependent on the logic depth of the circuit
• Nodes logically deeper are more prone to glitching
  » arrival times of the gate inputs are more spread due to delay imbalances
  » usually affected by more primary input switching
• Reduce depth by adding pipeline registers
Pipelined Parallel Multiplier
[Figure: the parallel multiplier structure (Q (‘ier), D (‘icand), multiple forming circuits, partial product array reduction tree, fast CPA, P (product)) with pipeline registers clocked by clk]
helps to reduce glitching but adds to the clock load
CSA Array Multiplier
[Figure: 4 x 4 array of CSA cells M00 through M33 with inputs q0 to q3 and d0 to d3, zero inputs along the array edges, and product outputs p0 to p7. Each CSA cell has inputs qj, di, sum input, and carry in, and outputs sum output and carry out.]
Longest delay path: n + n - 1 = 2n - 1
Multiplier Cell Structure
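[Figure: multiplier cell built around a full adder, with signals Ai, Bj, sum input, carry in, and carry out, and delay elements labeled 1D and 2D]
add delay elements to minimize glitching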
Pipelined CSA Array Multiplier
[Figure: pipelined CSA array multiplier clocked by clk, with cells M00 through M33 plus M41, M42, M43, M52, M53, and M63; inputs q0 to q3 and d0 to d3, outputs p0 to p7]
Barrel Shifters
[Figure: two bar charts, Average Power (mW), 0 to 10, and Delay (ns), 0 to 12, for log PT, Array PT, log static, and log dynamic]
Influence of architecture: Logarithmic, Array
and gate types: Pass Transistor, Dynamic/Static Mux
From Acken, 1996
Control Unit Design
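[Figure: control unit block diagram with Combinational Logic, State FFs, Inputs, and Outputs, next to an example state diagram with states 00, 01, and 11 and edges labeled 0,1/1, 1/X, 1/X, 0/0, 0/0]
n! different possible encodings (n states)
State Encoding
One of most important factors determining area, speed, and energy of resulting control logic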
Energy State Encoding Heuristic
• Area driven -> try to reduce the distance in Boolean n-space between related states
• Energy driven -> try to minimize number of bit transitions in the state register
  » fewer transitions in state register
  » fewer transitions propagated to combinational logic
[Figure: state transition graph with states encoded 11, 00, and 01 and edges labeled 0.1, 0.1, 0.1, 0.4, and 0.3: the probability that a transition will occur (sum of all edges equals unity)]
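The energy-driven heuristic can be made concrete by scoring an encoding with the expected number of state-register bit transitions: the sum over all edges of the transition probability times the Hamming distance between the two state codes. The sketch below searches all 2-bit encodings of a three-state machine. The edge probabilities reuse the values from the figure, but which states each edge connects is not recoverable from the transcript, so the graph here is hypothetical, as are the names.

```python
from itertools import permutations

# Hypothetical transition graph; probabilities sum to 1 as in the figure.
edges = {
    ("S0", "S1"): 0.4,
    ("S1", "S0"): 0.3,
    ("S1", "S2"): 0.1,
    ("S2", "S0"): 0.1,
    ("S2", "S2"): 0.1,  # self-loop: never toggles a state bit
}
states = ["S0", "S1", "S2"]


def hamming(a, b):
    """Number of bit positions in which two state codes differ."""
    return bin(a ^ b).count("1")


def expected_transitions(codes):
    """Expected state-register bit flips per clock for a given encoding."""
    return sum(p * hamming(codes[src], codes[dst]) for (src, dst), p in edges.items())


# Try every assignment of distinct 2-bit codes to the three states.
costs = {
    assignment: expected_transitions(dict(zip(states, assignment)))
    for assignment in permutations(range(4), 3)
}
best = min(costs, key=costs.get)
worst = max(costs, key=costs.get)
print("best ", [format(c, "02b") for c in best], round(costs[best], 2))
print("worst", [format(c, "02b") for c in worst], round(costs[worst], 2))
```

In this example the good encodings place the two states joined by the high-probability edges at Hamming distance 1. Exhaustive search is only practical for tiny machines, since the number of encodings grows factorially with the number of states.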
Caveat
• Lowest E[M] may not be lowest in energy → it could require more gates and/or signal transitions in the combinational logic
• Experiments show that the area and energy dissipation of a state machine are correlated when the state encoding is varied
State Encoding Effects
[Figure: Power (500 to 750) versus Area (3300 to 4100) for different state encodings]
From Yeap, 1997
Practical Considerations
• Balance area-energy by forced encoding of only a subset of states that span the high probability edges
  » leave assignment of remaining states to the logic synthesis system for area optimization
  » fortunately, in practice, most state machines have this characteristic
• Unlike area encoding, energy encoding requires knowledge of probabilities of state transitions and input signals
A Low Power Processor Core
Example
M•CORE Architecture
[Figure: M•CORE block diagram. Register files: GP reg file (32 bit x 16), Alt reg file (32 bit x 16), Control reg file (32 bit x 13). Datapath: X port, Y port, Immed, Scale, Sign ext, Barrel shift, FF1, ALU, priority encode, 0 detect. Instruction side: Instr pipeline, Instr decoder, Branch adder, PC increment. Buses: Address bus, Data bus, Writeback bus, H/W acc bus.]
M•CORE Power Distribution
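[Figure: two pie charts, with values and labels listed in the order they appear in the transcript. First chart: 36%, 36%, 28% for Datapath, Clock, Control. Second chart: 42%, 14%, 9%, 8%, 7%, 6%, 5%, 9% for Reg File, Addr/Data Bus, Inst Reg, Barrel Shifter, X MUX, Y MUX, Addr Gen, Other.]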
Key References
Hossain, Low power design using double edge triggered flipflop, IEEE Trans. on VLSI Systems, 2(2):261-265, 1994.
Motorola, M•CORE Architecture microRISC Engine, MCORE 1/D, www.mot.com/SPS/MCORE/info_documentation.htm
Mutsunori, Low power design method using multiple supply voltages, SLPED, 1997.
Rabaey, Digital Integrated Circuits, Prentice-Hall, 1996.
Reyes, Low Power FF Circuit and Method Thereof, Patent No 5,498,988, 1996.
Roy, Power analysis and design at the system level, Low Power Design in Deep Submicron Electronics, Nebel and Mermet, Ed., Kluwer, 1997.
Sakuta, Delay balanced multipliers for low power, SLPE, 1995.
Scott, Designing the Low-Power M•CORE Architecture, Proc. Inter. Symp. Computer Architecture Power Driven Microarchitecture Workshop, June 1998.
Stojanovic, A unified approach in the analysis of latches and FFs for low power systems, ISLPED, 1998.
Tiwari, Reducing power in high-performance microprocessors, DAC, 1998.
Yeap, CPU controller optimization for HDL logic synthesis, CICC, 1997.
Yeap, Practical Low Power Digital VLSI Design, KAP, 1998.