
CSE241 L3 ASICs.1 Kahng & Cichy, UCSD ©2003

CSE241 VLSI Digital Circuits

Winter 2003

Lecture 07: Timing II


Delay Calculation

Cell Fall

Cap\Tr   0.05   0.2    0.5
0.01     0.02   0.16   0.30
0.5      0.04   0.32   0.60
2.0      0.08   0.64   1.20

Cell Rise

Cap\Tr   0.05   0.2    0.5
0.01     0.03   0.18   0.33
0.5      0.06   0.36   0.66
2.0      0.09   0.72   1.32

Fall Transition

Cap\Tr   0.05   0.2    0.5
0.01     0.01   0.09   0.15
0.5      0.03   0.27   0.45
2.0      0.06   0.54   0.90

Load: 1.0 pF; input transitions: 0.1 ns, 0.12 ns

Fall delay = 0.178 ns
Rise delay = 0.261 ns
Fall transition = 0.147 ns
Rise transition = …
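The three results on this slide are consistent with bilinear interpolation in the lookup tables at a 1.0 pF load, using the 0.1 ns input transition for the fall tables and 0.12 ns for the rise table. The sketch below reproduces them under that assumption; the function and variable names are illustrative and not taken from any library format or tool.

```python
from bisect import bisect_right

CAPS = [0.01, 0.5, 2.0]    # table rows: output load (Cap)
TRANS = [0.05, 0.2, 0.5]   # table columns: input transition (Tr)

CELL_FALL = [[0.02, 0.16, 0.30], [0.04, 0.32, 0.60], [0.08, 0.64, 1.20]]
CELL_RISE = [[0.03, 0.18, 0.33], [0.06, 0.36, 0.66], [0.09, 0.72, 1.32]]
FALL_TRANSITION = [[0.01, 0.09, 0.15], [0.03, 0.27, 0.45], [0.06, 0.54, 0.90]]


def _segment(axis, x):
    """Index i of the table segment [axis[i], axis[i+1]] used for x.

    Points outside the table reuse the nearest edge segment, so they are
    linearly extrapolated rather than rejected.
    """
    i = bisect_right(axis, x) - 1
    return max(0, min(i, len(axis) - 2))


def lookup(table, cap, tr):
    """Bilinear interpolation of a Cap x Tr table."""
    i = _segment(CAPS, cap)
    j = _segment(TRANS, tr)
    u = (cap - CAPS[i]) / (CAPS[i + 1] - CAPS[i])
    v = (tr - TRANS[j]) / (TRANS[j + 1] - TRANS[j])
    # Interpolate along Tr on the two bracketing Cap rows, then along Cap.
    low = table[i][j] + v * (table[i][j + 1] - table[i][j])
    high = table[i + 1][j] + v * (table[i + 1][j + 1] - table[i + 1][j])
    return low + u * (high - low)


fall_delay = round(lookup(CELL_FALL, 1.0, 0.1), 3)
rise_delay = round(lookup(CELL_RISE, 1.0, 0.12), 3)
fall_transition = round(lookup(FALL_TRANSITION, 1.0, 0.1), 3)
```

With these inputs the three values round to 0.178, 0.261 and 0.147, matching the slide.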


PVT (Process, Voltage, Temperature) Derating

Actual cell delay = Original delay x KPVT


PVT Derating: Example + Min/Typ/Max Triples

Proc_var (0.5 : 1.0 : 1.3)
Voltage (5.5 : 5.0 : 4.5)
Temperature (0 : 20 : 50)

KP = 0.80 : 1.00 : 1.30
KV = 0.93 : 1.00 : 1.08
KT = 0.80 : 1.07 : 1.35

KPVT = 0.60 : 1.07 : 1.90

Cell delay = 0.261 ns
Derated delay = 0.157 : 0.279 : 0.496 ns {min : typical : max}
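The triples above multiply corner by corner. The following sketch recomputes them; it assumes, as the slide's numbers suggest, that KPVT is rounded to two decimals before it is applied to the cell delay. The function names are illustrative only.

```python
K_P = (0.80, 1.00, 1.30)   # process factor      {min : typ : max}
K_V = (0.93, 1.00, 1.08)   # voltage factor
K_T = (0.80, 1.07, 1.35)   # temperature factor


def combined_derating(kp, kv, kt, digits=2):
    """Per-corner product KP * KV * KT, rounded as on the slide (assumption)."""
    return tuple(round(p * v * t, digits) for p, v, t in zip(kp, kv, kt))


def derate(delay_ns, k_pvt, digits=3):
    """Apply each corner's derating factor to an undegraded cell delay."""
    return tuple(round(delay_ns * k, digits) for k in k_pvt)


K_PVT = combined_derating(K_P, K_V, K_T)   # (0.6, 1.07, 1.9)
DERATED = derate(0.261, K_PVT)             # (0.157, 0.279, 0.496)
```

Skipping the intermediate rounding would shift the min and max values slightly (to about 0.155 and 0.495), which is why the rounding step is called out as an assumption.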


Conservatism of Gate Delay Modeling

True gate delay depends on input arrival time patterns

STA will assume that only 1 input is switching

Will use worst slope among several inputs

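[Figure: waveforms over time of inputs A, B and output F (with tpd) when both inputs switch, and of A and F when only A switches; two gate schematics with inputs A and B, output F, load CL, and supply Vdd]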


This Class + Logistics

Reading
- Smith, Chapters 15, 16
- http://vlsicad.ucsd.edu/Presentations/ICCAD00TUTORIAL/
- Possibly: Sarrafzadeh/Wong Chapters 2 - placement, 3 - routing, (4 – performance modeling)

Schedule
- MT will be take-home (and easy), BUT you lose 5% if you don’t show up on Thursday (attendance will be taken by Ben)
- Thursday: Surprise guest lecturer on floorplan / placement

HW #12: Suppose that you want to work on timing edges that are most critical according to some F(slack of the edge, #paths through the edge). How would you modify the STA calculation (longest path in a DAG) so that it also calculates the number of paths through each edge?
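One possible starting point for this exercise, offered as a sketch and not as the expected answer: alongside the usual forward longest-path pass, count the paths reaching each node from the sources, then count the paths from each node to the sinks in a backward pass. The number of source-to-sink paths through edge (u, v) is the product of the two counts. The edge representation (a dict keyed by node pairs) and all names are assumptions made for this illustration.

```python
from collections import defaultdict, deque


def sta_with_path_counts(edges):
    """edges maps (u, v) -> delay for a DAG with at most one edge per node pair.

    Returns (arrival, paths_through): the longest-path arrival time at each
    node and the number of source-to-sink paths that use each edge.
    """
    succ, pred = defaultdict(list), defaultdict(list)
    nodes = set()
    for (u, v) in edges:
        succ[u].append(v)
        pred[v].append(u)
        nodes.update((u, v))

    # Kahn's algorithm: topological order of the nodes.
    indeg = {n: len(pred[n]) for n in nodes}
    queue = deque(n for n in nodes if indeg[n] == 0)
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in succ[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                queue.append(m)

    # Forward pass: latest arrival time and number of paths from any source.
    arrival = {n: 0 for n in nodes}
    from_src = {n: (0 if pred[n] else 1) for n in nodes}
    for n in order:
        for p in pred[n]:
            arrival[n] = max(arrival[n], arrival[p] + edges[(p, n)])
            from_src[n] += from_src[p]

    # Backward pass: number of paths from each node to any sink.
    to_sink = {n: (0 if succ[n] else 1) for n in nodes}
    for n in reversed(order):
        for s in succ[n]:
            to_sink[n] += to_sink[s]

    paths_through = {(u, v): from_src[u] * to_sink[v] for (u, v) in edges}
    return arrival, paths_through
```

Both passes visit each edge once, so the path counts cost the same order of work as the longest-path computation itself. Edge slack would additionally need required times from a backward pass, which this sketch omits.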

Slide courtesy of S. P. Levitan, U. Pittsburgh


Buffer Clustering

Sylvester / Shepard, 2001

Hierarchical clustering connecting clock source (= root) to clock sinks (= leaves) of clustering tree

Fanout at each level between 5 and 200 (depends on buffer library)

Often specify a clock topology in the tool as, e.g., (1)-6-8-5: root has 6 children, each of which has 8 children, each of which has 5 (leaf) children → 240 clock sinks
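The sink count for such a topology is the running product of the per-level fanouts. A minimal sketch, with the function name and the list encoding of the topology chosen for illustration:

```python
def nodes_per_level(fanouts):
    """Number of nodes at each level below the root for fanouts such as [6, 8, 5]."""
    counts, total = [], 1
    for fanout in fanouts:
        total *= fanout
        counts.append(total)
    return counts


levels = nodes_per_level([6, 8, 5])   # [6, 48, 240]
clock_sinks = levels[-1]              # leaves of the clustering tree
buffers = 1 + sum(levels[:-1])        # root plus internal levels
```

For (1)-6-8-5 this gives 240 sinks driven through 55 buffers, counting the root driver as one of them.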

Big question: how to perform the hierarchical buffer clustering? What makes a “good” cluster?


Buffer Clustering by Space Partitioning


Example: Cadence CT-Gen

Pick fanout (e.g., 6-4)

Pick “long axis” of bounding box of sinks

Place buffers at medians (essentially) of chunks of sinks identified by space-partitioning

Why is this good? Uses (or assumes) min wire; easily routed (Steiner routing); robust to ECOs; …

Why is it bad? Oversizes drivers; commits to skew which could be avoided
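The steps on this slide can be illustrated with a small recursive sketch. This is not CT-Gen's actual algorithm: it is a simplified stand-in that sorts sinks along the long axis of their bounding box, cuts them into near-equal chunks, and places each buffer at the coordinate-wise median of the sinks below it. All names and the nested-dict output are hypothetical, and the sink list is assumed to be non-empty.

```python
from statistics import median


def long_axis(sinks):
    """0 if the bounding box is at least as wide as it is tall, else 1."""
    xs = [x for x, _ in sinks]
    ys = [y for _, y in sinks]
    return 0 if (max(xs) - min(xs)) >= (max(ys) - min(ys)) else 1


def partition(sinks, fanout):
    """Split sinks into up to `fanout` near-equal chunks along the long axis."""
    axis = long_axis(sinks)
    ordered = sorted(sinks, key=lambda p: p[axis])
    k = min(fanout, len(ordered))
    size, extra = divmod(len(ordered), k)
    chunks, start = [], 0
    for i in range(k):
        end = start + size + (1 if i < extra else 0)
        chunks.append(ordered[start:end])
        start = end
    return chunks


def build_tree(sinks, fanouts):
    """Buffer tree for a fanout spec such as [6, 4]; the last level drives sinks."""
    location = (median(x for x, _ in sinks), median(y for _, y in sinks))
    if len(fanouts) <= 1:
        return {"buffer": location, "sinks": list(sinks)}
    children = [build_tree(chunk, fanouts[1:])
                for chunk in partition(sinks, fanouts[0])]
    return {"buffer": location, "children": children}
```

Because each chunk is chosen purely by position, the resulting clusters are compact and easy to route, but nothing in the procedure balances load or delay between chunks, which is one way to read the slide's remark about committing to avoidable skew.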


Buffer Clustering by Traditional Clustering


Example: SPC, old Cell3 CTS

Pick fanout (e.g., 6)

Find clusters of size 6

Place buffers at centers or centroids or … of clusters

Recurse

Why is this good? Can get near-zero skew trees?

Why is this bad? ECOs; hard to route; more wire(?); difficult algorithms!

HW #13: Propose a hierarchical clustering strategy for buffered clock trees, and explain its pros and cons


Outline

Clocking

Storage elements

Clocking metrics and methodology

Clock distribution

Package and useful-skew degrees of freedom

Clock power issues

Gate timing models


Skew Reduction Using Package

• Most clock network latency occurs at global level (largest distances spanned)

• Latency Skew

• With reverse scaling, routing low-RC signals at global level becomes more difficult & area-consuming



[Figure: system clock, P/ASIC, solder bump, substrate]

Incorporate global clock distribution into the package

Flip-chip packaging allows for high density, low parasitic access from substrate to IC

• RC of package-level wiring up to 4 orders of magnitude smaller than on-chip wiring

• Global skew reduced

• Lower capacitance → lower power

• Opens up global routing tracks

• Results not yet conclusive




Useful Skew (= cycle-stealing)

[Figure: timing slacks (hold, setup) for an FF, fast path, FF, slow path, FF chain, shown under zero skew and under useful skew]

Zero skew

• Global skew constraint

• All skew is bad

Useful skew

• Local skew constraints

• Shift slack to critical paths

W. Dai, UC Santa Cruz


Skew = Local Constraint

D : longest path
d : shortest path

[Figure: FF to FF path with a skew axis split into three regions: race condition, safe (permissible range), cycle time violation]

-d + thold < Skew < Tperiod - D - tsetup

Timing is correct as long as the signal arrives in the permissible skew range

W. Dai, UC Santa Cruz
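The permissible range follows directly from the two bounds above. In the sketch below, skew is taken as launch-clock arrival minus capture-clock arrival, which is the sign convention the inequality implies (an inference, not stated on the slide). The function names and the picosecond values in the usage note are made up for illustration.

```python
def permissible_skew_range(d_short, d_long, t_setup, t_hold, t_period):
    """Return (lower, upper); the skew must lie strictly between them."""
    lower = -d_short + t_hold            # below this: race condition (hold)
    upper = t_period - d_long - t_setup  # above this: cycle time violation (setup)
    return lower, upper


def classify(skew, lower, upper):
    """Name the region of the skew axis that a given skew falls into."""
    if skew <= lower:
        return "race condition"
    if skew >= upper:
        return "cycle time violation"
    return "safe"
```

For example, with hypothetical values d = 1000 ps, D = 4000 ps, tsetup = 200 ps, thold = 100 ps and Tperiod = 6000 ps, the range is (-900, 1800) ps, so zero skew is safe with more margin on the setup side than on the hold side.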


Skew Scheduling for Design Robustness

FF → 2 ns → FF → 6 ns → FF, T = 6 ns

“0 0 0”: at verge of violation

“2 0 2”: more safety margin

[Figure annotations: 4 0, -2 2, 4 0]

Design will be more robust if clock signal arrival time is in the middle of permissible skew range, rather than on edge

Can solve a linear program to maximize robustness = determine prescribed sink skews
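The two schedules on this slide can be compared by computing each stage's distance from the edges of its permissible range. The sketch below does not solve the linear program; it only evaluates a given schedule. It assumes each stage has a single path (so d = D), zero setup and hold times, and skew measured as launch arrival minus capture arrival. All names are illustrative.

```python
def stage_margins(clock_arrivals, stage_delays, t_period, t_setup=0.0, t_hold=0.0):
    """Per-stage (setup_margin, hold_margin) for a chain of flip-flops.

    clock_arrivals[i] is the clock arrival time at flip-flop i;
    stage_delays[i] is the logic delay between flip-flops i and i + 1.
    """
    margins = []
    for i, delay in enumerate(stage_delays):
        skew = clock_arrivals[i] - clock_arrivals[i + 1]   # launch minus capture
        setup_margin = (t_period - delay - t_setup) - skew  # distance to upper bound
        hold_margin = skew - (-delay + t_hold)              # distance to lower bound
        margins.append((setup_margin, hold_margin))
    return margins


def robustness(margins):
    """Smallest margin over all stages; 0 means at the verge of violation."""
    return min(min(pair) for pair in margins)


zero_skew = stage_margins([0, 0, 0], [2, 6], t_period=6)
useful_skew = stage_margins([2, 0, 2], [2, 6], t_period=6)
```

Under these assumptions the “0 0 0” schedule has a worst margin of 0 ns (the 6 ns stage has no setup slack), while “2 0 2” has a worst margin of 2 ns on every stage. A linear program would search over the arrival times to maximize that worst margin.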



Potential Advantages of Useful Skew

[Figure: CLK under 0-skew vs. U-skew]

Reduce peak current consumption by distributing the FF switch point in the range of permissible skew

Affords extra margin to increase clock frequency or reduce sizing (= power)



Conventional Zero-Skew Flow

Placement

Synthesis

Extraction & Delay Calculation

Static Timing Analysis

0-Skew Clock Synthesis

Clock Routing

Signal Routing



Useful-Skew Flow

Existing Placement

Extraction & Delay Calculation

Static Timing Analysis

U-Skew Clock Synthesis

Clock Routing

Signal Routing

Permissible range generation

Initial skew scheduling

Clock tree topology synthesis

Clock net routing

Clock timing verification



Outline

Clocking

Storage elements

Clocking metrics and methodology

Clock distribution

Package and useful-skew degrees of freedom

Clock power issues

Gate timing models


Clock Power

Power consumption in clocks due to:

• Clock drivers

• Long interconnections

• Large clock loads – all clocked elements (latches, FF’s) are driven

Different components dominate

• Depending on type of clock network used

• Ex. Grid – huge pre-drivers & wire cap. drown out load cap.



Clock Power Is LARGE

Not only is the clock capacitance large, it switches every cycle!

P = C Vdd^2 f
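A small numeric sketch of the formula follows. The capacitance, supply and frequency are hypothetical values chosen for easy arithmetic, not figures from the lecture, and the `activity` parameter is an added assumption used only to contrast the clock (which switches every cycle) with a net that switches less often.

```python
def dynamic_power_w(cap_f, vdd_v, freq_hz, activity=1.0):
    """P = activity * C * Vdd^2 * f; activity = 1.0 models a net switching every cycle."""
    return activity * cap_f * vdd_v ** 2 * freq_hz


# Hypothetical values: 1 nF of switched capacitance, 1.5 V supply, 1 GHz clock.
clock_power = dynamic_power_w(cap_f=1e-9, vdd_v=1.5, freq_hz=1e9)
# Same capacitance on nets that switch in only 10% of cycles (assumed figure).
data_power = dynamic_power_w(cap_f=1e-9, vdd_v=1.5, freq_hz=1e9, activity=0.1)
```

With these numbers the clock dissipates about 2.25 W, ten times what the same capacitance would dissipate at 10% activity.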



Low-Power Clocking

Gated clocks

• Prevent switching in areas of chip not being used

• Easier in static designs

Edge-triggered flops in ARM rather than transparent latches in Alpha

• Reduced load on clock for each latch/flop

• Eliminated spurious power-consuming transitions during latch flow-through (transparency)



Clock Area

Clock networks consume silicon area (clock drivers, PLL, etc.) and routing area

Routing area is most vital

Top-level metals are used to reduce RC delays

• These levels are precious resources (unscaled)

• Power routing, clock routing, key global signals

Reducing area also reduces wiring capacitance and power

Typical #’s: Intel Itanium – 4% of M4/5 used in clock routing



Clock Slew Rates

To maintain signal integrity and latch performance, minimum slew rates are required

Too slow – clock is more susceptible to noise, latches are slowed down, setup times eat into timing budget [Tsetup = 200 + 0.33 * Tslew (ps)], more short-circuit power for large clock drivers

Too fast – burns too much power, overdesigned network, enhanced ground bounce

Rule-of-thumb: Trise and Tfall of clock are each between 10-20% of clock period (10% - aggressive target)

1 GHz clock; Trise = Tfall = 100-200ps
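The rule of thumb and the setup-time expression quoted above can be checked numerically. The helper names below are illustrative; the 10-20% window and the Tsetup formula come from this slide.

```python
def clock_edge_window_ps(freq_hz, low=0.10, high=0.20):
    """Target range for Trise and Tfall as fractions of the clock period, in ps."""
    period_ps = 1e12 / freq_hz
    return low * period_ps, high * period_ps


def setup_time_ps(t_slew_ps):
    """Slide's example dependence of setup time on clock slew, in ps."""
    return 200 + 0.33 * t_slew_ps


edge_min_ps, edge_max_ps = clock_edge_window_ps(1e9)   # 1 GHz clock
setup_fast_ps = setup_time_ps(edge_min_ps)
setup_slow_ps = setup_time_ps(edge_max_ps)
```

At 1 GHz the window is 100 to 200 ps, and moving from the aggressive to the relaxed end of it raises the example setup time from about 233 ps to about 266 ps.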



Example: Alpha 21264

Grid + H-tree approach

Power = 32% of total

Wire usage = 3% of metals 3 & 4

4 major clock quadrants, each with a large driver connected to local grid structures



Alpha 21264 Skew Map

Ref: Compaq, ASP-DAC00

Sylvester / Shepard, 2001


Power vs. Skew

Fundamental design decision

• Meeting skew requirements is easy with unlimited power budget

• Wide wires reduce RC product but increase total C

• Driver upsizing reduces latency (→ reduces skew as well) but increases buffer cap

• SOC context: plastic package power limit is 2-3 W



Clock Distribution Trends

Timing

• Clock period dropping fast, skew must follow

• Slew rates must also scale with cycle time

• Jitter – PLL’s get better with CMOS scaling but other sources of noise increase
  - Power supply noise more important
  - Switching-dependent temperature gradients

Materials

• Cu reduces RC slew degradation, potential skew

• Low-k decreases power, improves latency, skew, slews

Power

• Complexity, dynamic logic, pipelining → more clock sinks

• Larger chips → bigger clock networks

Sylvester / Shepard, 2001
