optimizing power @ design time – interconnect and clocks … · 2014. 1. 16. · clb μprocessor...
TRANSCRIPT
Chapter 6
Optimizing Power @ Design Time – Interconnect and Clocks
Jan M. Rabaey
Optimizing Power @ Design Time
Interconnect and Clocks
Slide 6.1
Chapter Outline
� Trends and bounds
� An OSI approach to interconnect-optimization– Physical layer– Data link and MAC– Network– Application
� Clock distribution
Slide 6.2
ITRS Projections
Calendar Year 2012 2018 2020
Interconnect One Half Pitch 35 nm 18 nm 14 nm
MOSFET Physical Gate Length 14 nm 7 nm 6 nm
Number of Interconnect Levels 12–16
On-Chip Local Clock 20 GHz 53 GHz 73 GHz
Chip-to-Board Clock 15 GHz 56 GHz 89 GHz
# of Hi Perf. ASIC Signal I/O Pads 2500 3100 3100
# of Hi Perf. ASIC Power/Ground Pads 2500 3100 3100
Supply Voltage 0.7–0.9 V 0.5–0.7 V 0.5–0.7 V
Supply Current 283–220 A 396–283 A 396–283 A
14–18 14–18
Slide 6.3
Increasing Impact of Interconnect
� Interconnect is now exceeding transistors in
– Latency
– Power dissipation
– Manufacturing complexity
� Direct consequence of scaling
Slide 6.4
Communication Dominant Part of Power Budget
65%
21%
9%5%
Interconnect
Clock
I/OCLB
μProcessorClock
LogicMemory
I/O
ASSP
ClocksCaches
Execution Units
Control I/O Drivers
40%20%
15%
15% 10%
FPGA
Signal processor
Slide 6.5
Idealized Wire Scaling Model
Parameter Relation Local Wire Constant Length Global Wire
W, H, t 1/S 1/S 1/S
L 1/S 1 1/SC
C LW/t 1/S 1 1/SC
R L/WH S S 2 S 2/SC
tp ~ CR L2/Ht 1 S 2 S 2/SC2
E CV 2 1/SU 2 1/U 2 1/(SCU 2)
Slide 6.6
Distribution of Wire Lengths on Chip
© IEEE 1998
[Ref: J. Davis, C&S’98]
1.0E + 5
1.0E + 4
1.0E + 3
1.0E + 2
1.0E + 1
N – 86161p – 0.8k – 5.0
Inte
rcon
nect
Den
sity
Fun
ctio
n f(
t)
1.0E + 0
1.0E – 11 10
Interconnect Length, Γ[gate pitches]100
Actual DataStochastic Model
1000
Slide 6.7
Technology Innovations
Reduce dielectric permittivity (e.g., Aerogels or air)
Reduce resistivity (e.g., Copper)
Reduce wire lengths through 3D-integration
Novel interconnect media (carbon nanotubes, optical)
(Courtesy of IBM and IFC FCRP)
© IEEE 1998
Slide 6.8
Logic Scaling
ptp~1/s310–12
10–12
10–9
10–9
10–6
10–6
10–3
10–3
100
100
Pow
er P
(W)
Delay tp (s)
10–6J
10–9J
10–12J
10–15J
10–18J
[Ref: J. Davis, Proc’01]
Slide 6.9
Interconnect Scaling
Delay τ (s)
(Len
gth
)–2 L
–2 (
cm–2
)
Len
gth
L (
cm)
10
10–18 10–3
10–3
10–2
10–2
10–1
10–0100
102
104
106
108
1010
102
10–6
10–5
10–4
10–4
10–910–1210–15
L–2τ = 10–5 [s/cm–2](F = 0.1µ)
L–2τ ~ S 2
10–13
(1000µ)
10–11
(100µ)
10–9(10µ)
10–7(1µ)
[Ref: J. Davis, Proc’01]
Slide 6.10
Lower Bounds on Interconnect Energy
Claude Shannon
kTBC ≤ B log2 (1+ PS )
C : capacity in bits/sec B : bandwidth Ps : average signal power
Ebit = PS / C
Valid for an “infinitely long” bit transition (C/B→0)Equals 4.10–21J/bit at room temperature
)2ln()0/((min) kTBCEE bitbit =→=
Shannon’s theorem on maximum capacity of communication channel
[Ref: J. Davis, Proc’01]
Slide 6.11
Reducing Interconnect Power/Energy
� Same philosophy as with logic: reduce capacitance, voltage (or voltage swing), and/or activity
� A major difference: sending a bit(s) from one point to another is fundamentally a communications/ networking problem, and it helps to consider it as such
� Abstraction layers are different:– For computation: device, gate, logic, micro-architecture– For communication: wire, link, network, transport
� Helps to organize along abstraction layers, well- understood in the networking world: the OSI protocol stack
Slide 6.12
OSI Protocol Stack
� Reference model for wired and wireless protocol design — Also useful guide for conception and optimization of on-chip communication
� Layered approach allows for orthogonalization of concerns and decomposition of constraints
Network
Transport
Session
Data Link
Physical
Presentation/Application
� No requirement to implement all layers of the stack� Layered structure need not necessarily be maintained in
final implementation [Ref: M. Sgroi, DAC’01]
Slide 6.13
The Physical Layer
Transmits bits over physical interconnect medium (wire)�Physical medium
– Material choice, repeater insertion
� Signal waveform– Discrete levels, pulses,
modulated sinusoids
� Voltages– Reduced swing
� Timing, synchronization
Network
Transport
Session
Data Link
Physical
Presentation/Application
So far, on-chip communication almost uniquely “level-based”
Slide 6.14
Repeater Insertion
Optimal receiver insertion results in wire delay linear with L
))(( wwddp crCRLt ∝with RdCd and rwcw intrinsic delays of inverter and wire, respectively
But: At major energy cost!
VinVout
C/m C/m C/m C/m
R/m R/m R/m R/mIpdef
Slide 6.15
Repeater Insertion ─ Example
� 1 cm Cu wire in 90 nm technology (on intermediate layers)– rw = 250 Ω/mm; cw = 200 fF/mm– tp = 0.69rwcwL2= 3.45 ns
� Optimal driver insertion:– tp opt= 0.5 ns– Requires insertion of 13 repeaters– Energy per transition 8 times larger than just charging
the wire (6 pJ verus 0.75 pJ)!� It pays to back off!
Slide 6.16
Wire Energy–Delay Trade-off
1 2 3 4 5 6 7 80.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Eno
rm
Dnorm
wire energy only
L = 1 cm (Cu)90 nm CMOS
(dmin
, emax)
Rep
eate
r ov
erhe
ad
Slide 6.17
Multi-Dimensional Optimization
� Design parameters:Voltage, number of stages, buffer sizes
� Voltage scaling has thelargest impact, followed by selection of number of repeaters
� Transistor sizing secondary
1 2 3 4 5 6 7 80
2
4
6
8
10
12
Dnorm
Num
ber
of s
tage
s0.5
0.6
0.7
0.8
0.9
1
1.1
1.2
VD
D(V
)
Slide 6.18
Reduced Swing
� Ebit = CVDDVswing
� Concerns:– Overhead (area, delay)– Robustness (supply noise, crosstalk, process variations)– Repeaters?
Transmitter (TX) Receiver (RX)
Slide 6.19
Traditional Level Converter
� Requires two discrete voltage levels
� Asynchronous level conversion adds extra delay
V V
C
V
V
V
[Ref: H. Zhang, TVLSI’00]
Slide 6.20
Avoiding Extra References
Transient
VTC
[Ref: H. Zhang, VLSI’00]In
In2
VT
n
|VT
p|
Out
B
A
in
out
|VTHp|
VDD
VDD - VTn
Slide 6.21
Differential (Clocked) Signaling
� Allows for very low swings (200 mV)� Robust� Quadratic energy savings� But: doubling the wiring, extra clock signal, complexity
CL
CL
[Ref: T. Burd, UCB’01]
Slide 6.22
Lower Bound on Signal Swing?
� Reduction of signal swing translates into higher power dissipation in receiver – trade-off between wire and receiver energy dissipation
� Reduced SNR impacts reliability – current on-chip interconnect strategies require Bit Error Rate (BER) of zero (in contrast to communication and network links)– Noise sources: power supply noise, crosstalk
� Swings as low as 200 mV have been reported [Ref: Burd’00], 100 mV definitely possible
� Further reduction requires crosstalk suppression
shielding folding
GND
GND
GND
Slide 6.23
Quasi-Adiabatic Charging
t
V VDD
VDD/n
• Uses stepwise approximation of adiabatic (dis)charging• Capacitors acting as “charge reservoir”• Energy drawn from supply reduced by factor n
CT1
CT2
CTn–1
CT1 CL
CT2
CTn-1
[Ref: L. Svensson, ISLPED’96]
Slide 6.24
Charge Redistribution Schemes
VDD/2
VDD/4
3VDD/4
Precharge Eval Precharge
B0
B0
B1
B1
B0 = 0
B1 = 1
VDD
E
E
E
P
P
GND
RX1
RX0
1
0
B1
B1
B0
B0
� Charge recycled from top to bottom� Precharge phase equalizes differential lines� Energy/bit = 2C(VDD/n)2
� Challenges: Receiver design, noise margins[Ref: H. Yamauchi, JSSC’95]
Slide 6.25
Alternative Communication Schemes
� Example: Capacitively driven wires
Offers some compelling advantages � Reduced swing
Swing is VDD/(n+1) without extra supply
� Reduced loadAllows for smaller driver
� Reduced delayCapacitor pre-emphasizes edges Pitchfork capacitors exploit
sidewall capacitance
[Ref: D. Hopkins, ISSCC’07]
Slide 6.26
Signaling Protocols
Network
ProcessorModule
(μProc, ALU, MPY, SRAM…)
din reqin ackin dout req out ackout
Din
done
Globally Asynchronousself-timed handshaking protocol
Allows individual modulesto dynamically trade-off performancefor energy efficiency
Slide 6.27
Signaling Protocols
Network
Physical Layer Interface Module
ProcessorModule
(mProc, ALU, MPY, SRAM…)
din reqin ackindout reqout ackout
din dout clk
din
reqin
clk
done
Locally synchronous
done
Globally asynchronous
Slide 6.28
The Data Link/Media Access Layer
Reliable transmission over physical link and sharing interconnect medium between multiple sources and destinations (MAC)� Bundling, serialization, packetizing� Error detection and correction� Coding� Multiple-access schemes
Network
Transport
Session
Presentation/Application
Data Link
Physical
Slide 6.29
Coding
N N + k N
LinkTX RX
Adding redundancy to communication link (extra bits) to:� Reduce transitions (activity encoding)� Reduce energy/bit (error-correcting coding)
Slide 6.30
Activity Reduction Through Coding
NN + 1
N
Example: Bus-Invert Coding
Invert bit p
� Data word D inverted if Hamming distance from previous word is larger than N/2.
DDenc
D
D # T Denc p # T
0010101000111011110101000000110101110110…
-2756
0010101000111011001010110000110110001001…
00101
-21+13+12+1
[Ref: M. Stan, TVLSI’95]
Slide 6.31
Bus-Invert Coding
Gain:25 % (at best – for random data)
Overhead: Extra wire (and activity)Encoder, decoder
Not effective for correlated data
Gain:25% (at best – for random data)
Overhead: Extra wire (and activity)Encoder, decoder
Not effective for correlated data
P
Encode
Decode
D DDenc
p
Bus
[Ref: M. Stan, TVLSI’95]
Slide 6.32
Other Transition Coding Schemes
Advanced bus-invert coding (e.g. partition bus into sub-components) (e.g. [M.Stan, TVLSI’97])
Coding for address busses (which often display sequentiality) (e.g. [L. Benini, DATE’98])
Full-fledged channel coding, borrowed from communication links (e.g. [S. Ramprasad, TVLSI’99])
Coding to reduce impact of Miller capacitance between neighboring wires [Ref: Sotiriadis, ASPDAC’01]
Maximum capacitance transition – can be avoided by coding
bit k-1 bitk bit k+1 Delay factor g
1
− 1 + r
1 + 2r
− − 1 + 2r
− 1 + 3r
1 + 4r
Slide 6.33
Slide 6.33
++
Error-Correcting Codes
NN + k
ND
DencD
with
e.g.,
1
1
0
= 3
Example: (4,3,1) Hamming Code
B 3wrong Adding redundancy allows
for more aggressive scaling of signal swings and/or timing
Simpler codes such as Hamming prove most
Adding redundancy allows for more aggressive scaling of signal swings and/or timing
Simpler codes such as Hamming prove most effective
P1P2B3P4B5B6B7
P1 B3 B5 B 7 = 0
P4 B5 B5 B 7 = 0+
P2 B3 B6 B 7 = 0+
+
+++
+
+
+++
+
+
Slide 6.34
Media Access
Sharing of physical media over multiple data streams increases capacitance and activity (see Chapter 5), but reduces areaMany multi-access schemes known from communications – Time domain: Time-Division Multiple Access (TDMA)– Frequency domain: narrow band, code division multiplexing
Buses based on Arbitration-based TDMA most common in today’s ICs
Slide 6.35
Bus Protocols and Energy
Some lessons from the communications world:– When utilization is low, simple schemes are more effective – When traffic is intense, reservation of resources minimizes
overhead and latency (collisions, resends)
Combining the two leads to energy efficiencyExample: Silicon backplane micro-network
CurrentSlot
Independent arbitration for every cycle includes two phases:- Distributed TDMA for guaranteed latency/bandwidth- Round-robin for random access
Arbitration
Command
[Courtesy: Sonics, Inc]
Slide 6.36
The Network Layer
Topology-independent end-to-end communication over multiple data links (routing, bridging, repeaters) Topology Static versus dynamic
configuration/routing
Physical
Transport
Session
Data Link
Network
Presentation/Application
Becoming more important in today’s complex multi-processor designs“The Network-on-a-Chip (NoC)”
[Ref: G. De Micheli, Morgan-Kaufman’06]
Slide 6.37
Network-on-a-Chip (NoC)
Dedicated networks with reserved links preferable for high-traffic channels – but limited connectivity, area overheadFlexibility an increasing requirement in (many) multi-corechip implementations
or
Slide 6.38
The Network Trade-offs
Interconnect-oriented architecture trades off flexibility, latency, energy and area efficiency through the following concepts
Locality – eliminate global structuresHierarchy – expose locality in communication requirementsConcurrency/Multiplexing
Very Similar to Architectural Space Trade-offs
Dedicated wiring Network-on-a-Chip
LocalLogic
Router
Proc
NetworkWires
[Courtesy: B. Dally, Stanford]
Slide 6.39
Networking Topologies
Homogeneous– Crossbar, Butterfly, Torus, Mesh,Tree, …
Heterogeneous– Hierarchy
Mesh (FPGA)
Tree
Crossbar
FF
FF
FF
FF
Slide 6.40
Network Topology Exploration
Manhattan Distance
Ene
rgy
x D
elay
Mesh
Binary Tree
Manhattan Distance
Ene
rgy
x D
elay
Mesh
Binary Tree
Mesh + Inverse-clustering tree
Short connections in tree are redundant
Inverse clustering complements mesh
[Ref: V. George, Springer’01]
Slide 6.41
Circuit-Switched Versus Packet-Based
On-chip reality : Wires (bandwidth) are relatively cheap; buffering and routing expensivePacket-switched approach versatile– Preferred approach in large networks
– But … routers come with large overhead– Case study Intel: 18% of power in link, 82%
in router
Circuit-switched approach attractive for high-data-rate quasi-static linksHierarchical combination often preferred choice
BusBus
C C
C C
Bus to connect over short distances
Hierarchical circuit- and packet-switched networks for longer connections
Bus
C C
C C
Bus
C C
C C
Bus
C C
C C
Bus
C C
C C
R R
R R
Bus Bus
Bus Bus
Slide 6.42
Example: The Pleiades Network-on-a-Chip
Configuration Bus• Configurable platform for low-energy communication and signal-processing applications (see Chapter 5)• Allows for dynamic task-level reconfiguration of process networks
Energy-efficient flexible network essential to the concept
Configurable Interconnect
ArithmeticModule
ArithmeticModule
ArithmeticModule
ConfigurableLogic
ConfigurableLogicμPμP
Configuration
DedicatedArithmetic
Network Interface
[Ref: H. Zhang, JSSC’00]
Slide 6.43
Pleiades Network Layer
Universal Switchbox
hseM2-leveLhseM1-leveL
Hierarchical Switchbox
• Network statically configured at start of session and ripped up at end• Structured approach reduces interconnect energy by a factor of 7 over straightforward crossbar
Hierarchical reconfigurable mesh network
Cluster
Cluster
Slide 6.44
Top Layers of the OSI Stack
Abstracts communication architecture to system and performs data formatting and conversion
Establishes and maintains end-to-end communications – flow control, message
reordering, packet segmentation and reassembly
Physical
Transport
Session
Data Link
Presentation/Application
Network
Example: Establish, maintain, and rip up connections in dynamically reconfigurable systems-on-a-chip –Important in power management
Slide 6.45
What about Clock Distribution?
Clock easily the most energy-consuming signal of a chip– Largest length– Largest fan-out
– Most activity (α = 1)
Skew control adding major overhead– Intermediate clock repeaters– De-skewing elements
Opportunities– Reduced swing– Alternative clock distribution schemes– Avoiding a global clock altogether
Slide 6.46
Reduced-Swing Clock Distribution
Similar to reduced-swing interconnectRelatively easy to implementBut extra delay in flip-flops adds directly to clock period
Example: half-swing clock distribution scheme
Regular two-phase clock
Half-swing clock
V DD
VDD
Cp
Cn
Gnd
P-device
N-device
GND
V DD
GND
NMOS clock
PMOS clock
NMOS clock
PMOS clock
© IEEE 1995
[Ref: H. Kojima, JSSC’95]
Slide 6.47
Alternative Clock Distribution Schemes
Canceling skew in perfecttransmission line scenario
Example: Transmission-Line Based Clock Distribution
[Ref: V. Prodanov, CICC’06]
Slide 6.48
Summary
Interconnect important component of overall power dissipationStructured approach with exploration at different abstraction layers most effectiveLot to be learned from communications and networking community – yet, techniques must be applied judiciously – Cost relationship between active and passive
components different
Some exciting possibilities for the future: 3Dintegration, novel interconnect materials, optical or wireless I/O
Slide 6.49
Books and Book ChaptersT. Burd, Energy-Efficient Processor System Design,http://bwrc.eecs.berkeley.edu/Publications/2001/THESES/energ_eff_process-sys_des/index.htm, UCB, 2001.G. De Micheli and L. Benini, Networks on Chips: Technology and Tools, Morgan-Kaufman, 2006. V. George and J. Rabaey, “Low-energy FPGAs: Architecture and Design”, Springer 2001.J. Rabaey, A. Chandrakasan and B. Nikolic, Digital Integrated Circuits: A Design Perspective, 2nd ed, Prentice Hall 2003.C. Svensson, “Low-Power and Low-Voltage Communication for SoC’s,” in C. Piguet, Low-Power Electronics Design, Ch. 14, CRC Press, 2005.L. Svensson, “Adiabatic and Clock-Powered Circuits,” in C. Piguet, Low-Power Electronics Design, Ch. 15, CRC Press, 2005.G. Yeap, “Special Techniques”, in Practical Low Power Digital VLSI Design, Ch 6., Kluwer Academic Publishers, 1998.
ArticlesL. Benini et al., “Address Bus Encoding Techniques for System-Level Power Optimization,” Proceedings DATE’98, pp. 861–867, Paris, Feb. 1998.T. Burd et al., “A Dynamic Voltage Scaled Microprocessor System,” IEEE ISSCC Digest of Technical Papers, pp. 294–295, Feb. 2000.M. Chang et al., “CMP Network-on-Chip Overlaid with Multi-Band RF Interconnect”, International Symposium on High-Performance Computer Architecture, Feb. 2008.D.M. Chapiro, “Globally Asynchronous Locally Synchronous Systems,” PhD thesis, Stanford University, 1984.
References
Slide 6.50
W. Dally, “Route packets, not wires: On-chip interconnect networks,” Proceedings DAC 2001, pp. 684–689, Las Vegas, June 2001.J. Davis and J. Meindl, “Is Interconnect the Weak Link?,” IEEE Circuits and Systems Magazine, pp. 30–36, Mar. 1998.J. Davis et al., “Interconnect limits on gigascale integration (GSI) in the 21st century,” Proceedings of the IEEE, 89(3), pp. 305–324, Mar. 2001.D. Hopkins et al., "Circuit techniques to enable 430Gb/s/mm2 proximity communication," IEEE International Solid-State Circuits Conference, vol. XL, pp. 368–369, Feb. 2007. H. Kojima et al., “Half-swing clocking scheme for 75% power saving in clocking circuitry,” Journal of Solid Stated Circuits, 30(4), pp. 432–435, Apr. 1995.E. Kusse and J. Rabaey, “Low-energy embedded FPGA structures,” Proceedings ISLPED’98, pp.155–160, Monterey, Aug. 1998.V. Prodanov and M. Banu, “GHz serial passive clock distribution in VLSI using bidirectionalsignaling,” Proceedings CICC 06.S. Ramprasad et al., “A coding framework for low-power address and data busses,” IEEE Transactions on VLSI Signal Processing, 7(2), pp. 212–221, June 1999.M. Sgroi et al.,“Addressing the system-on-a-chip woes through communication-based design,”Proceedings DAC 2001, pp. 678–683, Las Vegas, June 2001.P. Sotiriadis and A. Chandrakasan, “Reducing bus delay in submicron technology using coding,”Proceedings ASPDAC Conference, Yokohama, Jan. 2001.
References (cont.)
Slide 6.51
References (cont.)
M. Stan and W. Burleson, “Bus-invert coding for low-power I/O,” IEEE Transactions on VLSI, pp. 48–58, Mar. 1995.M.. Stan and W. Burleson, "Low-power encodings for global communication in CMOS VLSI", IEEE Transactions on VLSI Systems, pp. 444–455, Dec. 1997.
V. Sathe, J.-Y. Chueh and M. C. Papaefthymiou, “Energy-efficient GHz-class charg-recovery logic”, IEEE JSSC, 42(1), pp. 38–47, Jan. 2007.
L. Svensson et al., “A sub-CV2 pad driver with 10 ns transition time,” Proc. ISLPED 96, Monterey, Aug. 12–14, 1996.
D. Wingard, “Micronetwork-based integration for SOCs,” Proceedings DAC 01, pp. 673–677, Las Vegas, June 2001.
H. Yamauchi et al., “An asymptotically zero power charge recycling bus,” IEEE Journal of Solid- Stated Circuits, 30(4), pp. 423–431, Apr. 1995.H. Zhang, V. George and J. Rabaey, “Low-swing on-chip signaling techniques: Effectiveness and robustness,” IEEE Transactions on VLSI Systems, 8(3), pp. 264–272, June 2000.
H. Zhang et al., “A 1V heterogeneous reconfigurable processor IC for baseband wireless applications,” IEEE Journal of Solid-State Circuits, 35(11), pp. 1697–1704, Nov. 2000.
Slide 6.52