optimizing power @ design time – interconnect and clocks … · 2014. 1. 16. · clb μprocessor...

Chapter 6

Optimizing Power @ Design Time – Interconnect and Clocks

Jan M. Rabaey

Optimizing Power @ Design Time

Interconnect and Clocks

Slide 6.1

Chapter Outline

� Trends and bounds

� An OSI approach to interconnect-optimization– Physical layer– Data link and MAC– Network– Application

� Clock distribution

Slide 6.2

ITRS Projections

Calendar Year 2012 2018 2020

Interconnect One Half Pitch 35 nm 18 nm 14 nm

MOSFET Physical Gate Length 14 nm 7 nm 6 nm

Number of Interconnect Levels 12–16

On-Chip Local Clock 20 GHz 53 GHz 73 GHz

Chip-to-Board Clock 15 GHz 56 GHz 89 GHz

# of Hi Perf. ASIC Signal I/O Pads 2500 3100 3100

# of Hi Perf. ASIC Power/Ground Pads 2500 3100 3100

Supply Voltage 0.7–0.9 V 0.5–0.7 V 0.5–0.7 V

Supply Current 283–220 A 396–283 A 396–283 A

14–18 14–18

Slide 6.3

Increasing Impact of Interconnect

� Interconnect is now exceeding transistors in

– Latency

– Power dissipation

– Manufacturing complexity

� Direct consequence of scaling

Slide 6.4

Communication Dominant Part of Power Budget

65%

21%

9%5%

Interconnect

Clock

I/OCLB

μProcessorClock

LogicMemory

I/O

ASSP

ClocksCaches

Execution Units

Control I/O Drivers

40%20%

15%

15% 10%

FPGA

Signal processor

Slide 6.5

Idealized Wire Scaling Model

Parameter Relation Local Wire Constant Length Global Wire

W, H, t 1/S 1/S 1/S

L 1/S 1 1/SC

C LW/t 1/S 1 1/SC

R L/WH S S 2 S 2/SC

tp ~ CR L2/Ht 1 S 2 S 2/SC2

E CV 2 1/SU 2 1/U 2 1/(SCU 2)

Slide 6.6

Distribution of Wire Lengths on Chip

© IEEE 1998

[Ref: J. Davis, C&S’98]

1.0E + 5

1.0E + 4

1.0E + 3

1.0E + 2

1.0E + 1

N – 86161p – 0.8k – 5.0

Inte

rcon

nect

Den

sity

Fun

ctio

n f(

t)

1.0E + 0

1.0E – 11 10

Interconnect Length, Γ[gate pitches]100

Actual DataStochastic Model

1000

Slide 6.7

Technology Innovations

Reduce dielectric permittivity (e.g., Aerogels or air)

Reduce resistivity (e.g., Copper)

Reduce wire lengths through 3D-integration

Novel interconnect media (carbon nanotubes, optical)

(Courtesy of IBM and IFC FCRP)

© IEEE 1998

Slide 6.8

Logic Scaling

ptp~1/s310–12

10–12

10–9

10–9

10–6

10–6

10–3

10–3

100

100

Pow

er P

(W)

Delay tp (s)

10–6J

10–9J

10–12J

10–15J

10–18J

[Ref: J. Davis, Proc’01]

Slide 6.9

Interconnect Scaling

Delay τ (s)

(Len

gth

)–2 L

–2 (

cm–2

)

Len

gth

L (

cm)

10

10–18 10–3

10–3

10–2

10–2

10–1

10–0100

102

104

106

108

1010

102

10–6

10–5

10–4

10–4

10–910–1210–15

L–2τ = 10–5 [s/cm–2](F = 0.1µ)

L–2τ ~ S 2

10–13

(1000µ)

10–11

(100µ)

10–9(10µ)

10–7(1µ)


Slide 6.10

Lower Bounds on Interconnect Energy

Claude Shannon

kTBC ≤ B log2 (1+ PS )

C : capacity in bits/sec B : bandwidth Ps : average signal power

Ebit = PS / C

Valid for an “infinitely long” bit transition (C/B→0)Equals 4.10–21J/bit at room temperature

)2ln()0/((min) kTBCEE bitbit =→=

Shannon’s theorem on maximum capacity of communication channel


Slide 6.11

Reducing Interconnect Power/Energy

� Same philosophy as with logic: reduce capacitance, voltage (or voltage swing), and/or activity

� A major difference: sending a bit(s) from one point to another is fundamentally a communications/ networking problem, and it helps to consider it as such

� Abstraction layers are different:– For computation: device, gate, logic, micro-architecture– For communication: wire, link, network, transport

� Helps to organize along abstraction layers, well- understood in the networking world: the OSI protocol stack

Slide 6.12

OSI Protocol Stack

� Reference model for wired and wireless protocol design — Also useful guide for conception and optimization of on-chip communication

� Layered approach allows for orthogonalization of concerns and decomposition of constraints

Network

Transport

Session

Data Link

Physical

Presentation/Application

� No requirement to implement all layers of the stack� Layered structure need not necessarily be maintained in

final implementation [Ref: M. Sgroi, DAC’01]

Slide 6.13

The Physical Layer

Transmits bits over physical interconnect medium (wire)�Physical medium

– Material choice, repeater insertion

� Signal waveform– Discrete levels, pulses,

modulated sinusoids

� Voltages– Reduced swing

� Timing, synchronization

Network

Transport

Session

Data Link

Physical


So far, on-chip communication almost uniquely “level-based”

Slide 6.14

Repeater Insertion

Optimal receiver insertion results in wire delay linear with L

))(( wwddp crCRLt ∝with RdCd and rwcw intrinsic delays of inverter and wire, respectively

But: At major energy cost!

VinVout

C/m C/m C/m C/m

R/m R/m R/m R/mIpdef

Slide 6.15

Repeater Insertion ─ Example

� 1 cm Cu wire in 90 nm technology (on intermediate layers)– rw = 250 Ω/mm; cw = 200 fF/mm– tp = 0.69rwcwL2= 3.45 ns

� Optimal driver insertion:– tp opt= 0.5 ns– Requires insertion of 13 repeaters– Energy per transition 8 times larger than just charging

the wire (6 pJ verus 0.75 pJ)!� It pays to back off!

Slide 6.16

Wire Energy–Delay Trade-off

1 2 3 4 5 6 7 80.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Eno

rm

Dnorm

wire energy only

L = 1 cm (Cu)90 nm CMOS

(dmin

, emax)

Rep

eate

r ov

erhe

ad

Slide 6.17

Multi-Dimensional Optimization

� Design parameters:Voltage, number of stages, buffer sizes

� Voltage scaling has thelargest impact, followed by selection of number of repeaters

� Transistor sizing secondary

1 2 3 4 5 6 7 80

2

4

6

8

10

12

Dnorm

Num

ber

of s

tage

s0.5

0.6

0.7

0.8

0.9

1

1.1

1.2

VD

D(V

)

Slide 6.18

Reduced Swing

� Ebit = CVDDVswing

� Concerns:– Overhead (area, delay)– Robustness (supply noise, crosstalk, process variations)– Repeaters?

Transmitter (TX) Receiver (RX)

Slide 6.19

Traditional Level Converter

� Requires two discrete voltage levels

� Asynchronous level conversion adds extra delay

V V

C

V

V

V

[Ref: H. Zhang, TVLSI’00]

Slide 6.20

Avoiding Extra References

Transient

VTC

[Ref: H. Zhang, VLSI’00]In

In2

VT

n

|VT

p|

Out

B

A

in

out

|VTHp|

VDD

VDD - VTn

Slide 6.21

Differential (Clocked) Signaling

� Allows for very low swings (200 mV)� Robust� Quadratic energy savings� But: doubling the wiring, extra clock signal, complexity

CL

CL

[Ref: T. Burd, UCB’01]

Slide 6.22

Lower Bound on Signal Swing?

� Reduction of signal swing translates into higher power dissipation in receiver – trade-off between wire and receiver energy dissipation

� Reduced SNR impacts reliability – current on-chip interconnect strategies require Bit Error Rate (BER) of zero (in contrast to communication and network links)– Noise sources: power supply noise, crosstalk

� Swings as low as 200 mV have been reported [Ref: Burd’00], 100 mV definitely possible

� Further reduction requires crosstalk suppression

shielding folding

GND

GND

GND

Slide 6.23

Quasi-Adiabatic Charging

t

V VDD

VDD/n

• Uses stepwise approximation of adiabatic (dis)charging• Capacitors acting as “charge reservoir”• Energy drawn from supply reduced by factor n

CT1

CT2

CTn–1

CT1 CL

CT2

CTn-1

[Ref: L. Svensson, ISLPED’96]

Slide 6.24

Charge Redistribution Schemes

VDD/2

VDD/4

3VDD/4

Precharge Eval Precharge

B0

B0

B1

B1

B0 = 0

B1 = 1

VDD

E

E

E

P

P

GND

RX1

RX0

1

0

B1

B1

B0

B0

� Charge recycled from top to bottom� Precharge phase equalizes differential lines� Energy/bit = 2C(VDD/n)2

� Challenges: Receiver design, noise margins[Ref: H. Yamauchi, JSSC’95]

Slide 6.25

Alternative Communication Schemes

� Example: Capacitively driven wires

Offers some compelling advantages � Reduced swing

Swing is VDD/(n+1) without extra supply

� Reduced loadAllows for smaller driver

� Reduced delayCapacitor pre-emphasizes edges Pitchfork capacitors exploit

sidewall capacitance

[Ref: D. Hopkins, ISSCC’07]

Slide 6.26

Signaling Protocols

Network

ProcessorModule

(μProc, ALU, MPY, SRAM…)

din reqin ackin dout req out ackout

Din

done

Globally Asynchronousself-timed handshaking protocol

Allows individual modulesto dynamically trade-off performancefor energy efficiency

Slide 6.27

Signaling Protocols

Network

Physical Layer Interface Module

ProcessorModule

(mProc, ALU, MPY, SRAM…)

din reqin ackindout reqout ackout

din dout clk

din

reqin

clk

done

Locally synchronous

done

Globally asynchronous

Slide 6.28

The Data Link/Media Access Layer

Reliable transmission over physical link and sharing interconnect medium between multiple sources and destinations (MAC)� Bundling, serialization, packetizing� Error detection and correction� Coding� Multiple-access schemes

Network

Transport

Session


Data Link

Physical

Slide 6.29

Coding

N N + k N

LinkTX RX

Adding redundancy to communication link (extra bits) to:� Reduce transitions (activity encoding)� Reduce energy/bit (error-correcting coding)

Slide 6.30

Activity Reduction Through Coding

NN + 1

N

Example: Bus-Invert Coding

Invert bit p

� Data word D inverted if Hamming distance from previous word is larger than N/2.

DDenc

D

D # T Denc p # T

0010101000111011110101000000110101110110…

-2756

0010101000111011001010110000110110001001…

00101

-21+13+12+1

[Ref: M. Stan, TVLSI’95]

Slide 6.31

Bus-Invert Coding

Gain:25 % (at best – for random data)

Overhead: Extra wire (and activity)Encoder, decoder

Not effective for correlated data

Gain:25% (at best – for random data)

Overhead: Extra wire (and activity)Encoder, decoder

Not effective for correlated data

P

Encode

Decode

D DDenc

p

Bus

[Ref: M. Stan, TVLSI’95]

Slide 6.32

Other Transition Coding Schemes

Advanced bus-invert coding (e.g. partition bus into sub-components) (e.g. [M.Stan, TVLSI’97])

Coding for address busses (which often display sequentiality) (e.g. [L. Benini, DATE’98])

Full-fledged channel coding, borrowed from communication links (e.g. [S. Ramprasad, TVLSI’99])

Coding to reduce impact of Miller capacitance between neighboring wires [Ref: Sotiriadis, ASPDAC’01]

Maximum capacitance transition – can be avoided by coding

bit k-1 bitk bit k+1 Delay factor g

1

− 1 + r

1 + 2r

− − 1 + 2r

− 1 + 3r

1 + 4r

Slide 6.33

Slide 6.33

++

Error-Correcting Codes

NN + k

ND

DencD

with

e.g.,

1

1

0

= 3

Example: (4,3,1) Hamming Code

B 3wrong Adding redundancy allows

for more aggressive scaling of signal swings and/or timing

Simpler codes such as Hamming prove most

Adding redundancy allows for more aggressive scaling of signal swings and/or timing

Simpler codes such as Hamming prove most effective

P1P2B3P4B5B6B7

P1 B3 B5 B 7 = 0

P4 B5 B5 B 7 = 0+

P2 B3 B6 B 7 = 0+

+

+++

+

+

+++

+

+

Slide 6.34

Media Access

Sharing of physical media over multiple data streams increases capacitance and activity (see Chapter 5), but reduces areaMany multi-access schemes known from communications – Time domain: Time-Division Multiple Access (TDMA)– Frequency domain: narrow band, code division multiplexing

Buses based on Arbitration-based TDMA most common in today’s ICs

Slide 6.35

Bus Protocols and Energy

Some lessons from the communications world:– When utilization is low, simple schemes are more effective – When traffic is intense, reservation of resources minimizes

overhead and latency (collisions, resends)

Combining the two leads to energy efficiencyExample: Silicon backplane micro-network

CurrentSlot

Independent arbitration for every cycle includes two phases:- Distributed TDMA for guaranteed latency/bandwidth- Round-robin for random access

Arbitration

Command

[Courtesy: Sonics, Inc]

Slide 6.36

The Network Layer

Topology-independent end-to-end communication over multiple data links (routing, bridging, repeaters) Topology Static versus dynamic

configuration/routing

Physical

Transport

Session

Data Link

Network


Becoming more important in today’s complex multi-processor designs“The Network-on-a-Chip (NoC)”

[Ref: G. De Micheli, Morgan-Kaufman’06]

Slide 6.37

Network-on-a-Chip (NoC)

Dedicated networks with reserved links preferable for high-traffic channels – but limited connectivity, area overheadFlexibility an increasing requirement in (many) multi-corechip implementations

or

Slide 6.38

The Network Trade-offs

Interconnect-oriented architecture trades off flexibility, latency, energy and area efficiency through the following concepts

Locality – eliminate global structuresHierarchy – expose locality in communication requirementsConcurrency/Multiplexing

Very Similar to Architectural Space Trade-offs

Dedicated wiring Network-on-a-Chip

LocalLogic

Router

Proc

NetworkWires

[Courtesy: B. Dally, Stanford]

Slide 6.39

Networking Topologies

Homogeneous– Crossbar, Butterfly, Torus, Mesh,Tree, …

Heterogeneous– Hierarchy

Mesh (FPGA)

Tree

Crossbar

FF

FF

FF

FF

Slide 6.40

Network Topology Exploration

Manhattan Distance

Ene

rgy

x D

elay

Mesh

Binary Tree

Manhattan Distance

Ene

rgy

x D

elay

Mesh

Binary Tree

Mesh + Inverse-clustering tree

Short connections in tree are redundant

Inverse clustering complements mesh

[Ref: V. George, Springer’01]

Slide 6.41

Circuit-Switched Versus Packet-Based

On-chip reality : Wires (bandwidth) are relatively cheap; buffering and routing expensivePacket-switched approach versatile– Preferred approach in large networks

– But … routers come with large overhead– Case study Intel: 18% of power in link, 82%

in router

Circuit-switched approach attractive for high-data-rate quasi-static linksHierarchical combination often preferred choice

BusBus

C C

C C

Bus to connect over short distances

Hierarchical circuit- and packet-switched networks for longer connections

Bus

C C

C C

Bus

C C

C C

Bus

C C

C C

Bus

C C

C C

R R

R R

Bus Bus

Bus Bus

Slide 6.42

Example: The Pleiades Network-on-a-Chip

Configuration Bus• Configurable platform for low-energy communication and signal-processing applications (see Chapter 5)• Allows for dynamic task-level reconfiguration of process networks

Energy-efficient flexible network essential to the concept

Configurable Interconnect

ArithmeticModule

ArithmeticModule

ArithmeticModule

ConfigurableLogic

ConfigurableLogicμPμP

Configuration

DedicatedArithmetic

Network Interface

[Ref: H. Zhang, JSSC’00]

Slide 6.43

Pleiades Network Layer

Universal Switchbox

hseM2-leveLhseM1-leveL

Hierarchical Switchbox

• Network statically configured at start of session and ripped up at end• Structured approach reduces interconnect energy by a factor of 7 over straightforward crossbar

Hierarchical reconfigurable mesh network

Cluster

Cluster

Slide 6.44

Top Layers of the OSI Stack

Abstracts communication architecture to system and performs data formatting and conversion

Establishes and maintains end-to-end communications – flow control, message

reordering, packet segmentation and reassembly

Physical

Transport

Session

Data Link


Network

Example: Establish, maintain, and rip up connections in dynamically reconfigurable systems-on-a-chip –Important in power management

Slide 6.45

What about Clock Distribution?

Clock easily the most energy-consuming signal of a chip– Largest length– Largest fan-out

– Most activity (α = 1)

Skew control adding major overhead– Intermediate clock repeaters– De-skewing elements

Opportunities– Reduced swing– Alternative clock distribution schemes– Avoiding a global clock altogether

Slide 6.46

Reduced-Swing Clock Distribution

Similar to reduced-swing interconnectRelatively easy to implementBut extra delay in flip-flops adds directly to clock period

Example: half-swing clock distribution scheme

Regular two-phase clock

Half-swing clock

V DD

VDD

Cp

Cn

Gnd

P-device

N-device

GND

V DD

GND

NMOS clock

PMOS clock

NMOS clock

PMOS clock

© IEEE 1995

[Ref: H. Kojima, JSSC’95]

Slide 6.47

Alternative Clock Distribution Schemes

Canceling skew in perfecttransmission line scenario

Example: Transmission-Line Based Clock Distribution

[Ref: V. Prodanov, CICC’06]

Slide 6.48

Summary

Interconnect important component of overall power dissipationStructured approach with exploration at different abstraction layers most effectiveLot to be learned from communications and networking community – yet, techniques must be applied judiciously – Cost relationship between active and passive

components different

Some exciting possibilities for the future: 3Dintegration, novel interconnect materials, optical or wireless I/O

Slide 6.49

Books and Book ChaptersT. Burd, Energy-Efficient Processor System Design,http://bwrc.eecs.berkeley.edu/Publications/2001/THESES/energ_eff_process-sys_des/index.htm, UCB, 2001.G. De Micheli and L. Benini, Networks on Chips: Technology and Tools, Morgan-Kaufman, 2006. V. George and J. Rabaey, “Low-energy FPGAs: Architecture and Design”, Springer 2001.J. Rabaey, A. Chandrakasan and B. Nikolic, Digital Integrated Circuits: A Design Perspective, 2nd ed, Prentice Hall 2003.C. Svensson, “Low-Power and Low-Voltage Communication for SoC’s,” in C. Piguet, Low-Power Electronics Design, Ch. 14, CRC Press, 2005.L. Svensson, “Adiabatic and Clock-Powered Circuits,” in C. Piguet, Low-Power Electronics Design, Ch. 15, CRC Press, 2005.G. Yeap, “Special Techniques”, in Practical Low Power Digital VLSI Design, Ch 6., Kluwer Academic Publishers, 1998.

ArticlesL. Benini et al., “Address Bus Encoding Techniques for System-Level Power Optimization,” Proceedings DATE’98, pp. 861–867, Paris, Feb. 1998.T. Burd et al., “A Dynamic Voltage Scaled Microprocessor System,” IEEE ISSCC Digest of Technical Papers, pp. 294–295, Feb. 2000.M. Chang et al., “CMP Network-on-Chip Overlaid with Multi-Band RF Interconnect”, International Symposium on High-Performance Computer Architecture, Feb. 2008.D.M. Chapiro, “Globally Asynchronous Locally Synchronous Systems,” PhD thesis, Stanford University, 1984.

References

Slide 6.50

W. Dally, “Route packets, not wires: On-chip interconnect networks,” Proceedings DAC 2001, pp. 684–689, Las Vegas, June 2001.J. Davis and J. Meindl, “Is Interconnect the Weak Link?,” IEEE Circuits and Systems Magazine, pp. 30–36, Mar. 1998.J. Davis et al., “Interconnect limits on gigascale integration (GSI) in the 21st century,” Proceedings of the IEEE, 89(3), pp. 305–324, Mar. 2001.D. Hopkins et al., "Circuit techniques to enable 430Gb/s/mm2 proximity communication," IEEE International Solid-State Circuits Conference, vol. XL, pp. 368–369, Feb. 2007. H. Kojima et al., “Half-swing clocking scheme for 75% power saving in clocking circuitry,” Journal of Solid Stated Circuits, 30(4), pp. 432–435, Apr. 1995.E. Kusse and J. Rabaey, “Low-energy embedded FPGA structures,” Proceedings ISLPED’98, pp.155–160, Monterey, Aug. 1998.V. Prodanov and M. Banu, “GHz serial passive clock distribution in VLSI using bidirectionalsignaling,” Proceedings CICC 06.S. Ramprasad et al., “A coding framework for low-power address and data busses,” IEEE Transactions on VLSI Signal Processing, 7(2), pp. 212–221, June 1999.M. Sgroi et al.,“Addressing the system-on-a-chip woes through communication-based design,”Proceedings DAC 2001, pp. 678–683, Las Vegas, June 2001.P. Sotiriadis and A. Chandrakasan, “Reducing bus delay in submicron technology using coding,”Proceedings ASPDAC Conference, Yokohama, Jan. 2001.

References (cont.)

Slide 6.51

References (cont.)

M. Stan and W. Burleson, “Bus-invert coding for low-power I/O,” IEEE Transactions on VLSI, pp. 48–58, Mar. 1995.M.. Stan and W. Burleson, "Low-power encodings for global communication in CMOS VLSI", IEEE Transactions on VLSI Systems, pp. 444–455, Dec. 1997.

V. Sathe, J.-Y. Chueh and M. C. Papaefthymiou, “Energy-efficient GHz-class charg-recovery logic”, IEEE JSSC, 42(1), pp. 38–47, Jan. 2007.

L. Svensson et al., “A sub-CV2 pad driver with 10 ns transition time,” Proc. ISLPED 96, Monterey, Aug. 12–14, 1996.

D. Wingard, “Micronetwork-based integration for SOCs,” Proceedings DAC 01, pp. 673–677, Las Vegas, June 2001.

H. Yamauchi et al., “An asymptotically zero power charge recycling bus,” IEEE Journal of Solid- Stated Circuits, 30(4), pp. 423–431, Apr. 1995.H. Zhang, V. George and J. Rabaey, “Low-swing on-chip signaling techniques: Effectiveness and robustness,” IEEE Transactions on VLSI Systems, 8(3), pp. 264–272, June 2000.

H. Zhang et al., “A 1V heterogeneous reconfigurable processor IC for baseband wireless applications,” IEEE Journal of Solid-State Circuits, 35(11), pp. 1697–1704, Nov. 2000.

Slide 6.52

optimizing power @ design time – interconnect and clocks … · 2014. 1. 16. · clb μprocessor...

Documents