Design and optimization techniques of high-speed VLSI circuits



    Design and optimization techniques of

high-speed VLSI circuits

    Marco Delaurenti

    Politecnico di Torino


    Design and optimization techniques of

high-speed VLSI circuits

    Marco Delaurenti

    PhD Dissertation

    December 1999

    Politecnico di Torino

    Advisor

    Prof. Maurizio Zamboni

    Coordinator

    Prof. Ivo Montrosset


Copyright © 1999 Marco Delaurenti


Writing comes more easily if you have something to say.

(Sholem Asch)

When I use a word, Humpty Dumpty said in rather a scornful tone, it means just what I choose it to mean, neither more nor less.

(Lewis Carroll)


Acknowledgments

First of all I would like to thank my advisor, Prof. M. Zamboni, as well as Prof. G. Piccinini and Prof. G. Masera for their invaluable help, and Prof. P. Civera for being a bridge toward the real world. Many thanks also to the VLSI LAB members at the Politecnico of Turin, Italy: Mario for his input about the critical paths (no, I do not thank you for the jazz songs that you play all day long), Luca for the long discussions about books and movies (no, I haven't seen the last Kubrick movie), Andrea for his very good cocktails (especially the Negroni) and Danilo, because I forgot him every time we went to lunch. Thanks also to Max (for giving me the root password), and to Yuan & Svensson for the invention of the TSPC.

Special thanks, finally, to Mg, for her support and for having tolerated me till now.


    CONTENTS

    Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix

    Part I CMOS Logic 1

    1. Introduction to CMOS logic . . . . . . . . . . . . . . . . . . . . . 3

    1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

    1.2 CMOS logic families . . . . . . . . . . . . . . . . . . . . . . . . 4

    1.2.1 Static logic families . . . . . . . . . . . . . . . . . . . . 5

    1.2.2 Dynamic logic families . . . . . . . . . . . . . . . . . . 6

    1.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

    Part II Circuit Modeling 13

    2. A simple model . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.1 The Elmore model . . . . . . . . . . . . . . . . . . . . . . . . 16

    2.2 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

    3. A complex model . . . . . . . . . . . . . . . . . . . . . . . . . . 21

    3.1 The FAST model . . . . . . . . . . . . . . . . . . . . . . . . . . 22

3.1.1 MOS equations . . . . . . . . . . . . . . . . . . . . . . 23

    3.1.2 Internal nodes approximation . . . . . . . . . . . . . . 24


    3.1.3 Body effect . . . . . . . . . . . . . . . . . . . . . . . . . 26

    3.2 Delay estimation . . . . . . . . . . . . . . . . . . . . . . . . . 31

    3.2.1 Equation solving . . . . . . . . . . . . . . . . . . . . . 32

    3.3 Power estimation . . . . . . . . . . . . . . . . . . . . . . . . . 36

    3.3.1 Switching energy . . . . . . . . . . . . . . . . . . . . . 36

3.3.2 Short-circuit energy . . . . . . . . . . . . . . . . . . . 39

    3.3.3 Subthreshold energy . . . . . . . . . . . . . . . . . . 39

    3.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

    3.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

    Part III Optimization 45

4. Mathematical Optimization . . . . . . . . . . . . . . . . . . . . 47

    4.1 Optimization theory . . . . . . . . . . . . . . . . . . . . . . . 48

    4.1.1 Mono-objective optimization . . . . . . . . . . . . . . 49

    4.1.1.1 Unconstrained problem . . . . . . . . . . . . 51

    4.1.1.2 Constrained problem . . . . . . . . . . . . . 52

    Lagrange multiplier and Penalty functions . . 52

    4.1.2 Multi-objective optimization . . . . . . . . . . . . . . 54

    4.1.2.1 Unconstrained . . . . . . . . . . . . . . . . . 56

    4.1.2.2 Constrained . . . . . . . . . . . . . . . . . . 57

    Compromise solution . . . . . . . . . . . . . . 57

    4.2 Optimization Algorithms . . . . . . . . . . . . . . . . . . . . 58

    4.2.1 One-dimensional search techniques . . . . . . . . . . 59

    4.2.1.1 The section search . . . . . . . . . . . . . . . 59

Dichotomic search . . . . . . . . . . . . . . . . 59

    Fibonacci Search . . . . . . . . . . . . . . . . . 60


    The golden section search . . . . . . . . . . . . 60

    Convergence considerations . . . . . . . . . . . 61

    4.2.1.2 Parabolic interpolation . . . . . . . . . . . . 62

Brent's rule . . . . . . . . . . . . . . . . . . . . 62

    4.2.2 Multi-dimensional search . . . . . . . . . . . . . . . . 63

    4.2.2.1 The gradient direction: steepest (maximum)

    descent . . . . . . . . . . . . . . . . . . . . . 63

    4.2.2.2 The optimal gradient . . . . . . . . . . . . . 65

    Convergence considerations . . . . . . . . . . . 66

    4.2.3 The conjugate direction method . . . . . . . . . . . . 67

4.2.3.1 The Fletcher-Reeves conjugate gradient algorithm . . . 68

    4.2.3.2 The Powell conjugate gradient algorithm . . 69

    4.2.4 The SLOP algorithm . . . . . . . . . . . . . . . . . . 70

    4.2.5 The simulated-annealing algorithm . . . . . . . . . . 72

    4.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

    5. Circuit Optimization . . . . . . . . . . . . . . . . . . . . . . . . 77

    5.1 Optimization targets . . . . . . . . . . . . . . . . . . . . . . . 78

    5.1.1 Circuit delay . . . . . . . . . . . . . . . . . . . . . . . . 79

    Critical Paths . . . . . . . . . . . . . . . . . . . 80

    5.1.1.1 Delay formula obtained by the Elmore model 84

    5.1.1.2 Delay measurement obtained by the FAST

    model and by HSPICE . . . . . . . . . . . . . 86

    5.1.2 Power consumption . . . . . . . . . . . . . . . . . . . 87

    5.1.3 Area . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

    5.2 Optimization examples . . . . . . . . . . . . . . . . . . . . . . 91

    5.2.1 Algorithm choice . . . . . . . . . . . . . . . . . . . . . 94


    5.2.2 Mono-objective optimizations . . . . . . . . . . . . . . 95

    5.2.2.1 Area . . . . . . . . . . . . . . . . . . . . . . . 95

    5.2.2.2 Power . . . . . . . . . . . . . . . . . . . . . . 96

    5.2.2.3 Delay . . . . . . . . . . . . . . . . . . . . . . 97

    5.2.3 Multi-objective optimizations . . . . . . . . . . . . . . 102

    5.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

6. A CAD tool for optimization . . . . . . . . . . . . . . . . . . . . 107

    6.1 Logical description . . . . . . . . . . . . . . . . . . . . . . . . 107

6.1.1 The optimization algorithm module (OAM) . . . . . . 107

    6.1.2 The function evaluation module (FEM) . . . . . . . . . 109

    6.1.3 Core engine . . . . . . . . . . . . . . . . . . . . . . . . 109

    6.2 Code implementation . . . . . . . . . . . . . . . . . . . . . . . 110

    6.2.1 The classes CircuitNetlist and Circuit . . . . . . . . . 110

    6.2.2 The class EvaluationAlgorithm . . . . . . . . . . . . . 112

    6.2.3 The class OptimizationAlgorithm . . . . . . . . . . . 113

    6.2.4 The critical path retrieving . . . . . . . . . . . . . . . 115

    6.2.5 The derived classes . . . . . . . . . . . . . . . . . . . . 116

    6.3 Program flows . . . . . . . . . . . . . . . . . . . . . . . . . . . 117

    6.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

    7. Results and conclusions . . . . . . . . . . . . . . . . . . . . . . 121

    7.1 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121

    7.1.1 Mono-objective vs. Multiobjective . . . . . . . . . . . 122

    7.2 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135

    7.3 Future works . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141


    Appendix 143

    A. Class graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145

    B. Source code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149

    B.1 Main functions . . . . . . . . . . . . . . . . . . . . . . . . . . 149

    B.2 Optimization algorithms . . . . . . . . . . . . . . . . . . . . . 208

    B.3 Simulators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216


    LIST OF FIGURES

1.1 Static and gate . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

    1.2 Pass-transistor logic xor . . . . . . . . . . . . . . . . . . . . . 6

    1.3 Domino typical gate . . . . . . . . . . . . . . . . . . . . . . . 7

    1.4 CVSL typical gate . . . . . . . . . . . . . . . . . . . . . . . . . 8

    1.5 C2MOS typical gate . . . . . . . . . . . . . . . . . . . . . . . . 9

    1.6 TSPC Latches . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.1 RC MOS equivalence . . . . . . . . . . . . . . . . . . . . . . . 15

    2.2 RC chain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

    2.3 RC single cell . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

    2.4 Elmore impulse response . . . . . . . . . . . . . . . . . . . . . 18

    3.1 Inverter voltages waveform . . . . . . . . . . . . . . . . . . . 23

3.2 MOS chain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

    3.3 Node voltages . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

    3.4 Voltage waveforms in the nMOS chain . . . . . . . . . . . . . 27

    3.5 Voltage waveforms in the pMOS chain . . . . . . . . . . . . . 28

    3.6 VDS and VGS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

    3.7 MOSFET chain with static voltages . . . . . . . . . . . . . . . 30

    3.8 Threshold variation . . . . . . . . . . . . . . . . . . . . . . . . 31

    3.9 Delay comparison . . . . . . . . . . . . . . . . . . . . . . . . . 42

    3.10 Energy comparison . . . . . . . . . . . . . . . . . . . . . . . . 43


    4.1 Section search . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

    4.2 Minimization by Powell algorithm . . . . . . . . . . . . . . . 70

    4.3 Minimization by Powell algorithm . . . . . . . . . . . . . . . 71

    4.4 Minimization by SLOP algorithm . . . . . . . . . . . . . . . . 72

    4.5 Minimization by Simulated-annealing algorithm . . . . . . . 73

    4.6 Minimization by Simulated-annealing algorithm . . . . . . . 74

    5.1 Design flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

    5.2 Delay definition . . . . . . . . . . . . . . . . . . . . . . . . . . 79

    5.3 Critical paths . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

    5.4 Critical path tree . . . . . . . . . . . . . . . . . . . . . . . . . . 83

    5.5 Elmore delay . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

    5.6 Elmore delay . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

    5.7 HSPICE delay . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

    5.8 FAST delay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

    5.9 HSPICE Energy . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

    5.10 CMOS Inverter . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

    5.11 TSPC Latches . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

    5.12 TSPC And gates . . . . . . . . . . . . . . . . . . . . . . . . . . 96

    5.13 TSPC Or gates . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

    5.14 Static and-or gate . . . . . . . . . . . . . . . . . . . . . . . . . 98

    5.15 Static parity gate . . . . . . . . . . . . . . . . . . . . . . . . . . 99

    5.16 Static full-adder . . . . . . . . . . . . . . . . . . . . . . . . . . 100

5.17 TSPC full-adder (one stage) . . . . . . . . . . . . . . . . . . . 101

    6.1 Tool block diagram . . . . . . . . . . . . . . . . . . . . . . . . 108


7.1 Comparison of 0.7 µm and 0.25 µm gates @ minimum technology width . . . 124

    7.2 Delay optimization of 0.7 µm gates . . . . . . . . . . . . . . . 125

    7.3 Delay optimization of 0.25 µm gates . . . . . . . . . . . . . . 126

    7.4 Technology comparison of delay optimization . . . . . . . . . 127

    7.5 Several delay-power optimization policies of 0.7 µm gates . . 132

    7.6 Energy-dissipation variation (zoom of figure 7.5(b)) . . . . . 133

    7.7 Several delay-power optimization policies of 0.25 µm gates . 134

    7.8 Energy-dissipation variation (zoom of figure 7.7(b)) . . . . . 135

    7.9 Delay-power optimization (50%-50%) comparison of 0.7 µm and 0.25 µm gates . . . 136

    7.10 Delay and power trajectory during 4 different multi-objective optimizations for the and-or gate . . . 137

    7.11 Delay and power trajectory during 4 different multi-objective optimizations for the parity gate . . . 138

    7.12 Delay and power trajectory during 4 different multi-objective optimizations for the static full-adder . . . 139

    7.13 Delay and power trajectory during 4 different multi-objective optimizations for the dynamic full-adder . . . 140


    LIST OF TABLES

    3.1 Mean Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

    3.2 Execution time . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

    4.1 Optimization algorithms . . . . . . . . . . . . . . . . . . . . . 75

    5.1 Basic gates: complexity . . . . . . . . . . . . . . . . . . . . . . 92

    5.2 Basic gates: pre-optimization delay, power consumption and

    area . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

    5.3 Full-adder: delay optimization . . . . . . . . . . . . . . . . . 99

    5.4 Agreements of targets . . . . . . . . . . . . . . . . . . . . . . 103

    5.5 Full-adder: delay and power optimization . . . . . . . . . . 105

    5.6 Full-adder: optimizations comparison . . . . . . . . . . . . . 105

    7.1 Library gates list . . . . . . . . . . . . . . . . . . . . . . . . . . 122

    7.2 Delay and energy dissipation @ minimum width (HSPICE) . 123

    7.3 Delay decreasing and energy increasing (both relative) in a

    delay optimization. . . . . . . . . . . . . . . . . . . . . . . . . 128

    7.4 Elapsed time and total number of function evaluations for a

    full-delay optimization with HSPICE on a ULTRA-sparc 5 129

7.5 Constrained delay optimization of a few 0.25 µm gates . . . . 130

    7.6 Delay worsening and energy improvement between a full

    delay optimization and delay-power optimization . . . . . . 133


    Preface

The design of a high-speed integrated circuit is a long and complex operation; nonetheless, the total time-to-market allowed from the idea to the silicon masks keeps shrinking.

To help the designer along this long and winding road, several CAD tools are available. In the first step the only thing that exists is the description of the circuit behaviour (the idea); in the central steps of the design flow the designer knows only the logic function of each block composing the circuit, but ignores the technological realization of these blocks; in the last steps, finally, the designer knows exactly the technology implementation of every single gate of the circuit, and can compose the final layout with every gate. It goes without saying that CAD tools are nowadays of vital importance in the design flow, and that the quality of such tools strongly influences the quality of the final design.

Among all the possible instruments, the optimization tools have a primary role in all the phases of a project, starting from the optimization at the higher levels and descending to the optimization made at the electrical level.

This thesis focuses its efforts on developing new strategies and new techniques for the optimization made at the transistor-sizing level, that is, the one done by the cell library engineer, and also on developing a CAD instrument to make this work as painless as possible.


    Part I

    CMOS LOGIC


    Chapter 1

    INTRODUCTION TO CMOS LOGIC

THE optimization of VLSI circuits involves the optimization of single CMOS cells. This chapter briefly reports the basic CMOS logic families, with their pros and cons. The simple goal is to pick, among the static and dynamic logic families, the ones most appealing for use in VLSI circuits and, in some measure, the most widely used, and then apply to them the optimization techniques shown in the next chapters.

    1.1 Introduction

We might ask: why optimize a single cell in a VLSI circuit, when design nowadays is shifting toward higher and higher levels of abstraction? Some answers could be:

    Need for re-usable library cells. This makes it easier to reuse the same library for different projects. It is a must nowadays, in order to reduce the total time to target/market.

    An optimized library makes the design at higher levels easier: floorplanning and routing can have relaxed constraints, since the gates have a better behaviour. It is possible to reduce the number of times critical steps like floorplanning or routing must be repeated until all the specifications are met: these specifications are met earlier, since the cells globally have a better behaviour.

    Need for equivalent libraries with different kinds of optimization. It is possible to have different libraries that have different specifications but are functionally equivalent, so that different versions of a project can be created simply by substituting the basic library. It would be possible, for example, to have, of the same project, a version that runs at full speed and a version optimized for low-power dissipation.

    This swapping of libraries does not involve the higher levels of design, for it is totally transparent to the designer during floorplanning or routing. Just before layout production, during the cell mapping, it is possible to choose the library onto which the project will be mapped.

These answers have led us to consider the appropriateness of producing a tool able to perform the optimization of a cell library in a way convenient for the designer. The goal is to produce some results showing that this optimization is worthwhile during a design cycle, and also to make the insertion of the tool into a design cycle as smooth as possible.

In order to obtain results that are related to a real production cycle, we have to choose cells that are almost certainly present in a real library.

For this purpose we introduce a very brief description of the most used CMOS logic families, and among them we choose the cells used to develop and test the optimization framework.

    1.2 CMOS logic families

The first basic distinction inside the CMOS logic families is between the static logics and the dynamic logics ([1]).

Static logic: A static logic is one in which the functioning of the circuit is not synchronized by a global signal, namely the clock of the circuit. The output is solely a function of the inputs of the circuit, and it is asynchronous with respect to them. The timing of the circuit is defined exclusively by its internal delay.

Dynamic logic: A dynamic logic is one in which the output is synchronized by a global signal, viz. the clock. The output is then a function both of the inputs of the circuit and of the clock signal, and the timing of the circuit is defined both by its internal delay and by the timing of the clock.

Both static and dynamic logics comprise several logic families.

    1.2.1 Static logic families

    The principal static families are:

Conventional static logic: It is the logic normally meant when speaking of static logic. A static gate has the same number of NMOS and PMOS transistors, with the n and p branches each being the dual of the other. As an example see figure 1.1, which represents a static and gate: it has two NMOS transistors connected in series and two PMOS transistors connected in parallel.

    Fig. 1.1: Static and gate (inputs A, B; OUT = A and B)

    The static logic is quite fast, does not dissipate power in steady state and has very good noise margins.

Pseudo-NMOS: It is an evolution of the now superseded NMOS logic. It is obtained by substituting the whole PMOS branch of a static gate with a single PMOS transistor whose gate is connected to ground. This PMOS is therefore always conducting and pulls the output node to the high state. When the NMOS branch also conducts, the output discharges, provided the ratio between the NMOS and PMOS transistors is well designed.

    This logic is cited here only for historical reasons, since it is not so fast, it dissipates static power in steady state (when the output is in the low state) and it is sensitive to noise.

Pass logic: The pass logic is a relatively new logic and, for many digital designs, implementation in pass-transistor logic (PTL) has been shown to be superior to static CMOS in terms of area, timing and power characteristics. As an example see figure 1.2.

    Fig. 1.2: Pass-transistor logic xor gate (OUT = A xor B)

    1.2.2 Dynamic logic families

The principal dynamic families have one characteristic in common: every dynamic logic needs a pre-charge (or pre-discharge) transistor to bring some pre-charged nodes to a known state. This is done during the working phase known as the pre-charge phase or memory phase; during the other working phase, the evaluation phase, the output takes a stable value1.

    1 This brief introduction is limited to systems that have a single global clock, or one phase, where the word phase is intended as a synonym of clock, and not, as above, as a synonym of working period. There are systems that have two, or even four, phases, but they are not introduced here. The basic functioning, however, remains the same.


The principal dynamic logics are further divided into two sub-families, pipelined and non-pipelined. The first two of the families below are non-pipelined, while the others are pipelined:

Domino logic and N-P Domino logic: The typical domino gate is depicted in figure 1.3.

    Fig. 1.3: Domino typical gate

    During the pre-charge phase the clock is in its low state, so that the pre-charged node before the static inverter is high and the output is low. During the evaluation phase the clock is high, so that the inputs of the n-block (which can perform any logical function) can discharge the pre-charged node and bring the output to the high state.

    Several of these gates can be cascaded, given that each gate has its own output inverter, and every gate can be driven with the same clock signal, given that the evaluation phase lasts the time necessary for all the gates to finish evaluating their inputs. This last fact explains why this is a non-pipelined logic: the output of every cell is available when the cell has finished its evaluation phase.

    Moreover, this logic has a limited area occupancy, since it has a low number of PMOS transistors. On the other hand it is not possible to implement inverting structures and, as with all the other dynamic logics, this logic is subject to the charge-sharing problem2.

    2 The charge-sharing problem, or charge redistribution, is a problem that affects the dynamic logics. Basically, the charge stored in a pre-charged node during the memory phase does not remain fully stored in it. Consider a domino gate during the pre-charge phase, when the clock is low. If one input of the n-block is high, its corresponding transistor is conducting. The n-branch as a whole is still not conducting, since the clocked NMOS transistor is off, but some charge from the pre-charged node can flow to other nodes via the conducting transistors in the n-block. This redistribution of charge is simply a capacitive charge partition and leads to a state of the pre-charged node lower than the high state. This problem can produce logic errors, and certainly diminishes the noise margins of the gate.


    A natural evolution of the domino logic is the N-P domino logic, or zipper logic. It consists of two typical cells: the one depicted in figure 1.3, and its dual, obtained by simply swapping the n-block with a p-block and the PMOS pre-charge transistor with an NMOS pre-discharge transistor driven by the negated clock.

    This logic has a lower area occupancy, since there is no need for a static inverter, but it also has a lower speed, due to the presence of PMOS transistors.

Cascode voltage switch logic (CVSL): The CVSL is part of the large family of differential logics. It needs both the inputs and the negated inputs, and two complementary n-blocks that perform the logic function, as can be seen in figure 1.4.

    Fig. 1.4: CVSL typical gate

    It has the advantage of being quite fast, since the positive feedback of the two PMOS transistors accelerates the switching of the gate, and it also has very good noise margins. Moreover it produces both the outputs and the negated outputs without needing an inverter. As a drawback, it has a large area occupancy.

C2MOS logic: The typical C2MOS gate is shown in figure 1.5. It is basically a three-state gate, since when the clock is in the low state the output floats in the high-impedance state.

    Fig. 1.5: C2MOS typical gate

    It is principally used as a dynamic latch, as an interface between static logics and dynamic pipelined logics.

NO RAce logic (NORA): The NORA logic, an acronym for no race, is an evolution of the N-P domino logic. The static inverter of the domino logic is substituted with a C2MOS inverter. This is the first of the pipelined logics, since the output of every gate is available only when the clock switches its state, and not before.

    Since the output stage of every cell is also dynamic (a C2MOS inverter), this logic is more subject to the charge-sharing problem than the domino logic is.


True Single Phase Clock logic (TSPC): The final evolution of NORA is the TSPC logic, or true single phase clock logic ([2]). The TSPC logic is an n-p logic, since each gate exists in an n-version and a p-version. For example, the n-latch and the p-latch are shown in figure 1.6.

    Fig. 1.6: TSPC latches: (a) type n, (b) type p

    The ultimate advantage of the TSPC logic is the presence of a single clock, since its internal structure does not require the negated clock.

    The TSPC logic is among the fastest dynamic families, and it certainly has great appeal because of the very low number of transistors employed.


    1.3 Conclusion

After this very brief introduction to several CMOS families, we chose two different logics to which to apply the optimization techniques that are the object of this thesis. The criteria that drove us in choosing these families were both their diffusion in VLSI circuits and the presence of very good qualities, perhaps not yet fully exploited in the real production of circuits.

For these reasons we have chosen to include in our library a few static gates (an and gate, an or gate, and a few more) and a few dynamic gates, in particular gates from the TSPC family. This family has shown good characteristics in terms of speed, area occupancy and power dissipation; it also has the very important feature of needing only a single clock.

The complete list of the gates comprising the library can be found in table 7.1 (page 122), with their schematic diagrams of CMOS implementation.


    Part II

    CIRCUIT MODELING


    Chapter 2

    A SIMPLE MODEL

THE first model applied to the calculation of the delay in MOS circuits is the Elmore model ([3]). It is a simple RC delay model, and it is the basis of a switch-level MOS model (figure 2.1): the generic MOS is represented, during the ON state, by its dynamic resistance between the drain pin and the source pin, and by the parasitic capacitances and resistances at the drain and source pins.

    Fig. 2.1: RC MOS equivalence

If this simple MOS model is valid, then the Elmore delay formula can be used in every structure containing MOSFETs. The Elmore formula is appealing for its simplicity and its ease of use; however the accuracy of the formula worsens in the deep submicron domain, since modeling a MOS through its resistance is no longer valid.

Since the use of the Elmore model is almost entirely limited to comparisons with other models, or to an introduction to delay modelling, section 2.1 presents only the very basics of the Elmore model and section 2.2 draws the conclusions about the use of this model for VLSI circuits.

2.1 The Elmore model

The Elmore model, or the Elmore delay formula, can predict the delay of an RC chain such as the one shown in figure 2.2.

    Fig. 2.2: RC chain

In order to obtain the formula, let us start with a single RC cell, as shown in figure 2.3. We can express the voltage $V_1(t)$ by means of a differential equation such as:

    $$C_0\,\frac{dV_1}{dt} = \frac{V_0(t) - V_1(t)}{R_0} \qquad (2.1)$$

    Integrating equation (2.1), we can write

    $$V_1 = V_0(t)\left(1 - e^{-\frac{t}{R_0 C_0}}\right).$$

    The time constant is $\tau = R_0 C_0$, and with $t = \tau$ we obtain:


    Fig. 2.3: RC single cell

    $$V_1 = 0.63\,V_0(t).$$

    So the time $t_D = \tau$ represents the 63% delay from $V_0(t)$ to $V_1(t)$. Extending the formula of the time constant to the chain of figure 2.2, we obtain:

    $$t_D = \sum_{i=0}^{N}\left(\sum_{j=0}^{i} R_j\right) C_i.$$
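    As a concrete illustration of this ladder formula, the following is a minimal C++ sketch (names are illustrative; this is not part of the thesis' tool): it accumulates the resistance seen from the input up to each node and weights it by that node's capacitance.

        #include <vector>
        #include <cstddef>

        // Elmore delay of an RC ladder driven at node 0:
        //   t_D = sum_i ( sum_{j<=i} R_j ) * C_i
        // R[k] and C[k] are the resistance and capacitance of the k-th cell.
        double elmoreLadderDelay(const std::vector<double>& R,
                                 const std::vector<double>& C)
        {
            double delay = 0.0;
            double upstreamR = 0.0;            // R_0 + R_1 + ... + R_i so far
            for (std::size_t i = 0; i < R.size() && i < C.size(); ++i) {
                upstreamR += R[i];
                delay += upstreamR * C[i];     // contribution of node i
            }
            return delay;                      // 63% delay estimate of the chain
        }

    For a single cell (N = 0 in the sum, one R and one C) this reduces to $R_0 C_0 = \tau$, as expected.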

    This delay is the input-output delay. When there is the need to know the delay between the input and one of the inner nodes, a more complex (semi-empirical) formula can be used; for example, with N = 2:

    $$t_1 = R_0 C_0 + q\,R_1 C_1 \qquad \text{(delay from the input node to the first node)}$$

    $$t_2 = R_0 C_0 + (R_0 + R_1)\,C_1 \qquad \text{(delay from the input node to the output node)}$$

    where q is:

    $$q = \begin{cases} \dfrac{R_0}{R_0 + R_1} & \text{if } R_1 \le 2R_0, \\[2mm] \dfrac{R_0 C_0}{R_0 C_0 + R_1 C_1} & \text{if } R_1 > 2R_0. \end{cases}$$


    The first case (with $R_1 \le 2R_0$) is named strong coupling, while the second one is named weak coupling.

    Given the unit impulse response h(t) (figure 2.4) of the output node of the RC tree, Elmore proposed to approximate the delay by the mean of h(t), considering h(t) as a distribution. The 50% delay is given by:

    $$\int_0^{t_D} h(t)\,dt = 0.5,$$

    while the original work of Elmore proposed:

    $$t_D = m = \int_0^{\infty} t\,h(t)\,dt \qquad \text{with} \qquad \int_0^{\infty} h(t)\,dt = 1.$$

    Fig. 2.4: Elmore impulse response


    This approximation is valid only when h(t) is a symmetrical distribution, as in figure 2.4, while in real cases the h(t) distribution is asymmetrical; however in [4] it is proved that the Elmore approximation is an upper bound for the 50% delay even when the impulse response is not symmetrical, and, furthermore, that the real delay asymptotically approaches the Elmore bound as the input signal rise (or fall) time increases.
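    The bound can be verified directly on the single RC cell, whose impulse response is $h(t) = \frac{1}{\tau} e^{-t/\tau}$: the Elmore mean is $m = \tau$, while the exact 50% delay is $\tau \ln 2 \approx 0.69\,\tau$. A tiny numerical check in C++ (illustrative only, not from the thesis sources):

        #include <cmath>
        #include <cstdio>

        int main()
        {
            const double tau = 1.0;                      // R0*C0, normalized
            // Elmore delay: mean of h(t) = integral of t*h(t) dt = tau.
            // Exact 50% delay: solve integral_0^tD h(t) dt = 0.5 -> tD = tau*ln(2).
            const double elmoreBound = tau;
            const double halfDelay   = tau * std::log(2.0);
            std::printf("Elmore bound = %.3f, 50%% delay = %.3f\n",
                        elmoreBound, halfDelay);         // bound >= 50% delay
            return 0;
        }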

    2.2 Conclusions

The model shown in this chapter is quite appealing for the calculation of the delay in CMOS structures, but it becomes inaccurate as we move into the submicron domain, so its use should be limited to a first validation of an optimization algorithm, not to real production.

In this respect, it is important to note that the delay functions obtained with the Elmore formula satisfy some properties that are useful in the optimization realm (for example equation (4.1), page 50): the Elmore model is therefore very useful for testing optimization algorithms.


    Chapter 3

    A COMPLEX MODEL

THE target of the model developed here is to offer limited estimation errors with respect to physical SPICE simulations and to improve the computation speed by more than one order of magnitude. This can be useful in optimization algorithms. Thus the aim of the model is to evaluate the delay and the power dissipation of CMOS structures.

Several approaches have been used to evaluate the delays of CMOS structures: some models are derived from SPICE simulations by means of look-up tables [5]; some are analytical [6], while others approximate the evaluation of the delay with step or ramp inputs [7, 8, 9, 10, 11].

Regarding the power consumption, the main contributions are switching power, short-circuit current and subthreshold conduction. The first one occurs during the charge and discharge of internal capacitances; the short-circuit current originates from the simultaneous conduction of the p and n networks and is dominated by the slope of the node voltages; subthreshold currents are due to the weak-inversion conduction of MOSFETs and become relevant when the power supply is scaled in sub-micron technologies.

Most of the proposed power models use estimation algorithms that are not compatible with the delay analysis. The purpose of the FAST model is to combine delay and power evaluations in the same estimation procedure, allowing the simultaneous optimization of delay and power.


Section 3.1 reports the theory behind the FAST model: in particular, 3.1.1 shows the MOS equations used in the model, 3.1.2 shows the internal node voltage approximation made by the model and 3.1.3 explains how the threshold voltage variations are taken into account. Section 3.2 shows how the FAST model estimates the delay, and in particular 3.2.1 shows how the equations are solved; section 3.3 reports the method used for the calculation of the power consumption: 3.3.1 accounts for the switching power, 3.3.2 for the short-circuit power, and 3.3.3 for the subthreshold power. Finally, section 3.4 presents some results from the comparison of the model with HSPICE and section 3.5 draws some conclusions.

    3.1 The FAST model

The low complexity and the accuracy that can be obtained by taking care of the phenomenon of carrier velocity saturation, which is dominant in submicron technologies, suggested the use of the classical charge-control analysis and the gradual-channel approximation (Hodges model), described in 3.1.1.

Estimation accuracy and low computational effort can be achieved by operating both on the waveforms of the internal signals and on topology considerations: in particular, all the waveforms in the circuit are approximated with linear ramps.

By approximating the input waveform with a ramp, a strong simplification of the I(V) equations is obtained. Figure 3.1 shows the output voltage of an inverter driven by a ramp input. It can be noticed that a ramp can properly approximate the output voltage variation, especially in the central phases of the commutation. The increasing error on the tail of the switching does not significantly affect the delay and power estimation.

The voltage ramp approximations are described in 3.1.2.


    Fig. 3.1: Inverter voltage waveforms: input Vin, output Vout and the model approximation (voltage versus time in ns)

3.1.1 MOS equations

    The well known equations for the MOS transistors are (for the n-type and p-type transistors) [1]:

    below saturation

    $$I_{DS_{n,p}} = \beta_{n,p}\left[\left(V_{GS} - V_{T_{n,p}}\right)V_{DS} - \frac{V_{DS}^2}{2}\right] \qquad (3.1)$$

    above saturation

    $$I_{DS_{n,p}} = \frac{\beta_{n,p}}{2}\,V_{DS_{sat_{n,p}}}^2 \qquad (3.2)$$

    where $\beta_{n,p} = \mu_{n,p} C_{ox} \frac{W}{L}$, with $\mu_{n,p}$ modified by the carrier velocity saturation effect:

    $$\mu_n = \frac{\mu_{n0}}{1 + \frac{V_{DS}}{L E_c}} \qquad\qquad \mu_p = \frac{\mu_{p0}}{1 - \frac{V_{DS}}{L E_c}}$$

    The saturation voltage (drain-source), not including the carrier velocity saturation effect, is given by the well known formula:

    $$V_{DS_{sat_{n,p}}} = V_{GS_{n,p}} - V_{T_{n,p}}$$

    while considering the above-mentioned effect:

    $$V_{DS_{sat_{n,p}}} = V_c\left[\sqrt{1 \pm \frac{2\left(V_{GS_{n,p}} - V_{T_{n,p}}\right)}{V_c}} - 1\right] \qquad (3.3)$$

    where the plus signs are for the nMOSFETs and the minus signs are for the pMOSFETs, and $V_c = |E_c L|$.
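    The following C++ sketch illustrates how equations (3.1)-(3.3) combine for an nMOSFET; parameter names are illustrative and not taken from the FAST implementation, and the choice of where to evaluate the degraded mobility at the edge of saturation is one reasonable assumption.

        #include <cmath>

        struct NmosParams {
            double beta0;   // mu_n0 * Cox * W / L
            double vt;      // threshold voltage V_Tn
            double vc;      // |Ec * L|, velocity-saturation voltage
        };

        // Drain current of an nMOSFET following eqs. (3.1)-(3.3).
        double nmosIds(const NmosParams& p, double vgs, double vds)
        {
            if (vgs <= p.vt) return 0.0;   // subthreshold conduction neglected in this sketch
            // Saturation voltage including velocity saturation, eq. (3.3):
            const double vdsat = p.vc * (std::sqrt(1.0 + 2.0 * (vgs - p.vt) / p.vc) - 1.0);
            if (vds < vdsat) {
                const double beta = p.beta0 / (1.0 + vds / p.vc);    // degraded mobility
                return beta * ((vgs - p.vt) * vds - 0.5 * vds * vds); // eq. (3.1)
            }
            const double beta = p.beta0 / (1.0 + vdsat / p.vc);       // assumed evaluation point
            return 0.5 * beta * vdsat * vdsat;                         // eq. (3.2)
        }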

    3.1.2 Internal nodes approximation

    Fig. 3.2: MOS chain with proper numbering

    Let N be the number of nMOSFETs in the n-chain and P the number of pMOSFETs in the p-chain, and let us label the transistors in each chain from 1 to N or from 1 to P (figure 3.2). Let us assume that label 1 belongs to the driving transistor (i.e. the nMOSFET with source connected to VSS or the pMOSFET with source connected to VDD), as in figure 3.2. This hypothesis is only for the development of the discussion; in our model any (but only one) transistor can be the driving transistor, that is, the transistor with a changing gate voltage.

    Notation 3.1. In the following equations the superscript index refers to the node number (with the variable i always for the nMOSFETs and j always for the pMOSFETs), and the small-letter subscript indexes n and p refer, respectively, to nMOSFETs and pMOSFETs, both for the voltage variables and for the time variables; for the voltage variables the capital subscript indexes G and D refer to the gate node and the drain node, while the small-letter index d refers to the initial conditions of the drain nodes.

    So, for example, $V^i_{G_n}(t)$ is the gate voltage at node i for the nMOSFETs (a function of time), and $V^j_{d_p}$ is the initial condition of the drain voltage at node j for the pMOSFETs.

    The waveforms of the voltages are shown in figure 3.4 and figure 3.5, with the hypothesis $t^1_{0_n} = t^2_{0_n} = \dots = t^N_{0_n}$ and $t^1_{0_p} = t^2_{0_p} = \dots = t^P_{0_p}$; that is because we suppose that all the MOSFETs in a chain start conducting at the same time1.

    We can write, referring to figures 3.4 and 3.5:

    $$V^1_{G_n}(t) = \begin{cases} 0 & t < 0 \\ \dfrac{V_{DD}}{\tau^1_{i_n}}\,t & 0 \le t < \tau^1_{i_n} \\ V_{DD} & \tau^1_{i_n} \le t \end{cases} \qquad (3.4a)$$

    $$V^1_{G_p}(t) = \begin{cases} V_{DD} & t < 0 \\ V_{DD} - \dfrac{V_{DD}}{\tau^1_{i_p}}\,t & 0 \le t < \tau^1_{i_p} \\ 0 & \tau^1_{i_p} \le t \end{cases} \qquad (3.4b)$$

    $$V^i_{G_n}(t)\Big|_{i=2,3,\dots,N} = V_{DD} \quad \forall t \qquad (3.4c)$$

    1 This hypothesis is well supported by simulations


    $$V^j_{G_p}(t)\Big|_{j=2,3,\dots,P} = V_{SS} \quad \forall t \qquad (3.4d)$$

    $$V^i_{D_n}(t)\Big|_{i=1,2,\dots,N} = \begin{cases} V^i_{d_n} & t < t^i_{0_n} \\ V^i_{d_n} - V^i_{d_n}\,\dfrac{t - t^i_{0_n}}{\tau^i_{o_n} - t^i_{0_n}} & t^i_{0_n} \le t < \tau^i_{o_n} \\ V_{SS} & \tau^i_{o_n} \le t \end{cases} \qquad (3.4e)$$

    $$V^j_{D_p}(t)\Big|_{j=1,2,\dots,P} = \begin{cases} V^j_{d_p} & t < t^j_{0_p} \\ \dfrac{V_{DD} - V^j_{d_p}}{\tau^j_{o_p} - t^j_{0_p}}\,t + \dfrac{\tau^j_{o_p} V^j_{d_p} - t^j_{0_p} V_{DD}}{\tau^j_{o_p} - t^j_{0_p}} & t^j_{0_p} \le t < \tau^j_{o_p} \\ V_{DD} & \tau^j_{o_p} \le t \end{cases} \qquad (3.4f)$$

    Fig. 3.3: The ith and i+1th MOSFETs with node voltages

    It is also possible to define $\tau^i_{i_{n,p}} = \tau^{i-1}_{o_{n,p}}$ and to identify the source voltage of one MOSFET with the drain voltage of the adjacent one ($V^{i+1}_s = V^i_d$), as shown in figure 3.3 for the i-th nMOS. The same holds for the pMOSFETs.

    The starting levels $V_{d_{n,p}}$ are determined with a static analysis, described in 3.1.3.
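    Since the internal waveforms are simple piecewise-linear ramps, they are straightforward to evaluate in code. The following C++ sketch (illustrative names, not the FAST sources) evaluates the driving-gate ramp of eq. (3.4a) and the drain discharge ramp of eq. (3.4e), with VSS taken as 0.

        // Gate voltage of the driving nMOSFET, eq. (3.4a): rising ramp of duration tauI.
        double gateVoltageN1(double t, double vdd, double tauI)
        {
            if (t < 0.0)  return 0.0;
            if (t < tauI) return vdd * t / tauI;
            return vdd;
        }

        // Drain voltage of the i-th nMOSFET, eq. (3.4e): it stays at the initial
        // level vd until t0, then ramps linearly down to 0, reached at time tauO.
        double drainVoltageN(double t, double vd, double t0, double tauO)
        {
            if (t < t0)   return vd;                                  // static level
            if (t < tauO) return vd * (1.0 - (t - t0) / (tauO - t0)); // discharge ramp
            return 0.0;                                               // fully discharged
        }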

    3.1.3 Body effect: threshold variation and its approximation

    It is known that a MOS transistor with a source-body voltage different from zero has its threshold voltage modified by the body effect; for an nMOSFET,

    $$\Delta V_{T_n} = \gamma\left(\sqrt{2|\phi_p| + V_{sb}} - \sqrt{2|\phi_p|}\right),$$

    where $\gamma$ is the body-effect coefficient and $\phi_p$ the Fermi potential.

    Fig. 3.4: Voltage waveforms in the nMOS chain

    Fig. 3.5: Voltage waveforms in the pMOS chain

    For the nMOS chain with the static voltages depicted in figure 3.7(a), the source potential of the top transistor is

    $$V_s = V_{DD} - V_{T_n},$$

    and, if $V_{T_{n0}}$ is the threshold voltage with $V_{sb} = 0$, then $V_{T_n} = V_{T_{n0}} + \Delta V_{T_n}$ and we can solve for $V_{sb}$:

    $$V_{sb} = -\frac{\gamma}{2}\sqrt{4\gamma\sqrt{2|\phi_p|} + 8|\phi_p| + 4V_{DD} - 4V_{T_{n0}} + \gamma^2} + \gamma\sqrt{2|\phi_p|} + V_{DD} - V_{T_{n0}} + \frac{\gamma^2}{2} \qquad (> 0)$$

    We can find an analogous equation for the pMOSFETs: for the pMOS chain depicted in figure 3.7(b), the drain potential of transistor P is $V^P_{d_p} = 0$, while $V^P_{s_p} = -V_{T_p}$; for the middle transistors $V^j_{d_p} = V^j_{s_p} = -V_{T_p}$; and for the first (top) transistor $V^1_{d_p} = -V_{T_p}$ and $V^1_{s_p} = V_{DD}$.

    The threshold voltage variation as a function of $V_{sb}$ again is:


    Fig. 3.6: Drain-source ($V_{DS}$) and gate-source ($V_{GS}$) voltages of the i-th nMOS

    $$\Delta V_{T_p} = -\gamma\left(\sqrt{2|\phi_p| + |V_{sb}|} - \sqrt{2|\phi_p|}\right)$$

    (for pMOS transistors the threshold voltage is negative).

    Again, solving

    $$V_{sb} = -V_{DD} - V_{T_p} = -V_{DD} - V_{T_{p0}} + \gamma\left(\sqrt{2|\phi_p| + |V_{sb}|} - \sqrt{2|\phi_p|}\right),$$

    where $V_{T_{p0}}$ is the threshold voltage of a pMOSFET whose source is at $V_{DD}$, we find:

    $$V_{sb} = \frac{\gamma}{2}\sqrt{4\gamma\sqrt{2|\phi_p|} + 8|\phi_p| + 4V_{DD} + 4V_{T_{p0}} + \gamma^2} - \gamma\sqrt{2|\phi_p|} - V_{DD} - V_{T_{p0}} - \frac{\gamma^2}{2} \qquad (< 0)$$
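    As a small check of the nMOS closed form above, a C++ sketch is given below (symbol names are illustrative; gamma is the body-effect coefficient and phi stands for $|\phi_p|$): it computes the self-consistent $V_{sb}$ of a transistor whose source settles at $V_{DD} - V_{T_n}$.

        #include <cmath>

        // Solves Vsb = VDD - VTn0 - gamma*(sqrt(2*phi + Vsb) - sqrt(2*phi)) in closed form.
        double nmosSourceBodyVoltage(double vdd, double vtn0, double gamma, double phi)
        {
            const double root = std::sqrt(4.0 * gamma * std::sqrt(2.0 * phi)
                                          + 8.0 * phi + 4.0 * vdd - 4.0 * vtn0
                                          + gamma * gamma);
            return -0.5 * gamma * root
                   + gamma * std::sqrt(2.0 * phi)
                   + vdd - vtn0
                   + 0.5 * gamma * gamma;      // result is positive
        }

    For example, with VDD = 5 V, VTn0 = 0.8 V, gamma = 0.5 V^(1/2) and 2*phi = 0.7 V, the function returns about 3.58 V, which indeed satisfies $V_{sb} = V_{DD} - V_{T_n}(V_{sb})$.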

    The threshold variation is approximated in the model by a linear function of $V_{sb}$, as shown in figure 3.8.


    Fig. 3.8: Threshold variation with $V_{sb}$ (solid line) and its linear approximation (dashed line): (a) nMOSFET, (b) pMOSFET

    In figures 3.8(a) and 3.8(b) the actual threshold variation (of an nMOS transistor and of a pMOS transistor) when a $V_{sb}$ voltage is applied is compared with the linear approximation used in our model, for a 0.7 µm technology.

    The maximum error due to the linear approximation is limited to 7%.

    3.2 Delay estimation

    The delay estimation of the structures reported in figure 3.2 implies the evaluation of $\tau^i_{o_{n,p}}$ and $t^i_{0_{n,p}}$ for each transistor in the chains.

    The current in each transistor can be obtained from equations (3.1), (3.2) (page 23), with the voltages as functions of time defined in equations (3.4a)-(3.4f) (page 25). So we can calculate the quantity of charge at each node and thus apply the charge conservation law, i.e. at each node the total charge variation must be equal to zero:

    $$\Delta Q^i_n = 0 \qquad \Delta Q^j_p = 0 \qquad i = 1, 2, \dots, N \ \text{ and } \ j = 1, 2, \dots, P \qquad (3.5)$$

    The generic term $\Delta Q^i_n$ is the sum of three elements, $\Delta Q^i_n = Q^{i+1}_I - Q^i_I + Q^i_C$, defined below:


    $Q^{i+1}_I$ is the charge due to the (i+1)-th MOSFET placed above the i-th node:

    $$Q^{i+1}_I = \int_{t^{i+1}_{0_n}}^{t^{i+1}_{s_n}} I^{i+1}_{sat}(t)\,dt + \int_{t^{i+1}_{s_n}}^{\tau^{i+1}_{o_n}} I^{i+1}_{lin}(t)\,dt \qquad (3.6a)$$

    which includes the contributions due to the currents above and below saturation; $t_s$ is the time at which the MOSFET switches from the saturation to the linear region;

    $Q^i_I$ is the charge due to the i-th MOSFET below the i-th node:

    $$Q^i_I = \int_{t^i_{0_n}}^{t^i_{s_n}} I^i_{sat}(t)\,dt + \int_{t^i_{s_n}}^{\tau^i_{o_n}} I^i_{lin}(t)\,dt \qquad (3.6b)$$

    $Q^i_C$ is the charge due to the discharging of the capacitor $C^i$ at the i-th node:

    $$Q^i_C = C^i\,V^i_{d_n}. \qquad (3.6c)$$

    Similar equations apply for the pMOSFETs.

    For each circuit node, a charge conservation equation can be written.

    3.2.1 Equation solving

    Referring to the nMOS chain in figure 3.3, we can write at the output node N:

    $$Q^N_I = Q^N_C = C^N V^N_{d_n} \qquad (3.7)$$

    because, neglecting the contribution of the pMOS chain above (if it exists), $Q^{N+1}_I = 0$.

    At node $N-1$ we can write:

    $$\Delta Q^{N-1}_n = Q^N_I - Q^{N-1}_I + Q^{N-1}_C,$$

    and, combining with eq. (3.7) (page 32),

    $$\Delta Q^{N-1}_n = C^N V^N_{d_n} - Q^{N-1}_I + Q^{N-1}_C,$$

    and so on:

    $$\Delta Q^{N-2}_n = C^N V^N_{d_n} + C^{N-1} V^{N-1}_{d_n} - Q^{N-2}_I + Q^{N-2}_C.$$

    More generally:

    $$\Delta Q^i_n = \sum_{k=i+1}^{N} C^k V^k_{d_n} - Q^i_I + Q^i_C = \sum_{k=i}^{N} C^k V^k_{d_n} - Q^i_I = 0$$

    Proceeding down to the first transistor, we obtain:

    $$\Delta Q^1_n = \sum_{k=1}^{N} C^k V^k_{d_n} - Q^1_I = 0; \qquad (3.8)$$

    the same applies for the pMOSFETs.

    In order to solve the nonlinear equation (3.8) one must substitute the definition of the current into the charge Q, as in equations (3.6a), (3.6b) (page 32); moreover, one must substitute both the current calculated in the saturation region and the one calculated in the linear region, extending the integrals of the aforementioned equations to the proper extremes.

    Finally, we must distinguish among several different cases, depending on the instant of time at which the transistor switches from the saturation region to the linear region. For example, the first transistor can switch between the two regions when the rise of the input has already finished, or, on the contrary, while the input is still rising. All the possible cases are:

    $$t^1_0 \le t^1_s \le \tau^1_i \le \tau^1_o \qquad\quad t^1_0 \le \tau^1_i \le t^1_s \le \tau^1_o \qquad\quad t^1_s \le t^1_0 \le \tau^1_i \le \tau^1_o \qquad\quad \tau^1_i \le t^1_0 \le t^1_s \le \tau^1_o$$

    $$t^1_s \le \tau^1_i \le t^1_0 \le \tau^1_o \qquad\quad t^1_0 \le t^1_s \le \tau^1_o \le \tau^1_i \qquad\quad t^1_s \le t^1_0 \le \tau^1_o \le \tau^1_i \qquad\qquad (3.9)$$

    Evaluating all the possible cases, equation (3.8) becomes a nonlinear equation in the variables $t^1_s$, $t^1_0$, $\tau^1_o$, $\tau^1_i$, with $t^1_s$, $t^1_0$, $\tau^1_o$ as unknowns. A further step must be taken, with the purpose of eliminating all the variables but one. The real unknown is the time $\tau^1_o$, while all the other unknowns can be expressed as functions of $\tau^1_o$: in particular, the times $t^1_s$ and $t^1_0$ can be calculated together, using the saturation condition $V_{DS} = V_{GS} - V_T$ (or, taking into account the carrier velocity saturation effect, equation (3.3), page 24) and the equation that states the charge conservation at node 1 between time 0 and time $t^1_0$, similar to equation (3.5) (page 31), including the bootstrap effect due to the capacitive coupling between the gate and the drain of the first transistor.

    Both these equations are functions of $t^1_s$, $t^1_0$, $\tau^1_o$, $\tau^1_i$. In this way one has three equations with three unknowns, and by means of some approximate methods (the problem is always strictly nonlinear) it is possible to evaluate the three unknowns.

    This solution scheme ought to be repeated for all the seven cases shown in equation (3.9). Each case gives as a solution a triple $t^1_s$, $t^1_0$, $\tau^1_o$ that is compatible with one and only one of the conditions expressed by these cases. Thus, only one working condition is really selected, as can be expected.
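    The overall structure of this case-by-case solution can be sketched in C++ as follows. This is only an illustration of the scheme under simplifying assumptions (a scalar residual, a bracketing interval assumed to exist, illustrative names); the actual model solves the coupled equations analytically or with its own approximate methods.

        #include <functional>
        #include <cmath>

        // Simple bisection for the scalar charge-balance residual f(tauO) = 0,
        // assuming f changes sign on [lo, hi].
        double bisect(const std::function<double(double)>& f, double lo, double hi, int iters = 60)
        {
            double flo = f(lo);
            for (int k = 0; k < iters; ++k) {
                const double mid  = 0.5 * (lo + hi);
                const double fmid = f(mid);
                if ((flo < 0.0) == (fmid < 0.0)) { lo = mid; flo = fmid; }
                else                             { hi = mid; }
            }
            return 0.5 * (lo + hi);
        }

        // residual(tauO): eq. (3.8) with t1s and t10 expressed as functions of tauO
        // under the case's hypotheses; consistent(tauO): checks the assumed ordering.
        struct SwitchingCase {
            std::function<double(double)> residual;
            std::function<bool(double)>   consistent;
        };

        double solveOutputRamp(const SwitchingCase cases[], int nCases, double tMin, double tMax)
        {
            for (int c = 0; c < nCases; ++c) {
                const double tauO = bisect(cases[c].residual, tMin, tMax);
                if (cases[c].consistent(tauO))
                    return tauO;          // only one working condition is selected
            }
            return NAN;                   // no consistent case found
        }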

    Indeed, the above solving scheme holds only if equation (3.6c) (page 32) applies, i.e. only if the capacitance at node i is not a function of the voltage at that same node. But the capacitance actually is a function of the voltage in this manner:


    $$C^i = \frac{C^i_j}{\left(1 + \frac{V^i}{\phi_b}\right)^{m_j}} + \frac{C^i_p}{\left(1 + \frac{V^i}{\phi_b}\right)^{m_p}} \qquad (3.10)$$

    where $C_j$ and $C_p$ are, respectively, the area term and the perimeter term of a junction, because the capacitance at node i is due to the parasitic capacitances of the transistors connected to this node.
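    Equation (3.10) translates into a one-line C++ helper; the parameter names below (built-in potential phiB, grading exponents mj and mp) are illustrative assumptions.

        #include <cmath>

        // Voltage-dependent node capacitance, eq. (3.10): area (cj, mj) and
        // perimeter (cp, mp) junction terms, with built-in potential phiB.
        double nodeCapacitance(double v, double cj, double mj, double cp, double mp, double phiB)
        {
            return cj / std::pow(1.0 + v / phiB, mj)
                 + cp / std::pow(1.0 + v / phiB, mp);
        }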

    If the capacitances at the nodes are functions of the voltages at the nodes themselves, then one equation is no longer sufficient: one must write equations like equation (3.8) (page 33), one for each node, and then solve them with standard algorithms for nonlinear equations. The only difference between the equations applied at the nodes above the first one and the first-node equation is that not all of the cases of equation (3.9) are possible: in particular, those conditions apply only when the transistor can pass from the saturation region to the linear region and, moreover, only when the input rise time $\tau^1_i$ can assume any value. The passage from saturation to linearity can occur only for the first and the last transistors of the chain, as they are the only ones that can saturate3. But for the last transistor the time $\tau^N_i$ is governed by $\tau^N_i = \tau^{N-1}_o$, giving thus only two possible cases:

    $$t^N_0 \le t^N_s \le \tau^N_i \le \tau^N_o \qquad\qquad t^N_0 \le \tau^N_i \le t^N_s \le \tau^N_o$$

    In order to make the algorithm convergent, two other fictitious cases

    must be included:

    $$t^N_0 \le t^N_s, \quad \tau^N_o \le \tau^N_i \qquad\qquad t^N_s \le t^N_0, \quad \tau^N_o \le \tau^N_i$$

    These conditions can never occur in a real circuit, since they imply that the voltages at the source node and at the drain node of the last transistor cross, making the transistor current flow in the reverse direction (see figure 3.6 for a visual explanation of the terms $\tau_i$ and $\tau_o$ and of why their relative voltage waveforms cannot cross). Their inclusion helps in finding the real circuit conditions when solving equation (3.8) for each of these four cases: the solution of one of the fictitious cases gives only unknowns compatible with one of the real cases.

    3 This is because they are the only ones that have a full voltage swing at some node, i.e. the gate node for the first and the drain node for the last. All the transistors in the middle of the chain are prevented from saturating by the body effect, which makes the saturation condition $V_{DS} = V_{GS} - V_T$ (or, better, equation (3.3), page 24) impossible.

    All the other transistors, which cannot saturate during the switching from off to on, have only one possible working condition, again that the voltages at the source and drain nodes do not cross:

    $$\tau^j_i \le \tau^j_o \qquad j = 2, \dots, N-1$$

    Solving all the equations, one for each node, the unknowns $\tau^j_o$ can be evaluated, giving an estimate of the voltage waveform at each node of the chain. The rising/falling time of the last node of the chain also gives the delay of the chain itself.

    3.3 Power consumption estimation

    3.3.1 Switching energy

    The contribution to the power dissipation due to the charge and dis-

    charge of internal nodes for each MOSFET can be defined as the integral of

the voltage across the MOSFET times the current flowing through it.

    Theorem 3.2. The switching energy in generic n-networks and p-networks can be written as:

    $$E_{sw_n} = \frac{1}{2}\sum_{i=1}^{N} C^i\left({V'_i}^2 - {V''_i}^2\right) \qquad (3.11)$$

    $$E_{sw_p} = \frac{1}{2}\sum_{j=1}^{P} C^j\left[\left(V_{DD} - V'_j\right)^2 - \left(V_{DD} - V''_j\right)^2\right] \qquad (3.12)$$

    where $C^i$ is the total capacitance of the i-th node and $V'_i$, $V''_i$ are, respectively, the initial and final values of the voltage at that node.


Corollary 3.2.1. If the voltage swing of each node of the network is the full swing ΔV = V_DD − 0, then equations (3.11), (3.12) can be written as:

E_{sw_n} = \frac{1}{2} \sum_{i=1}^{N} C_i\, \Delta V^2    (3.13)

E_{sw_p} = \frac{1}{2} \sum_{i=1}^{P} C_i\, \Delta V^2    (3.14)

Proof of theorem 3.2. Since the internal voltages and currents are known from the delay analysis, the energy for the nMOS network can be written by summing all the contributions of the internal nodes (see figure 3.3):

E_{sw_n} = \sum_{i=1}^{N} \int \left[ V_{D_n}^{i}(t) - V_{D_n}^{i-1}(t) \right] I_{D_n}^{i}(t)\, dt

where the notation of figure 3.3 is adopted (with V_{D_n}^{0} = 0 at the ground node).

This equation can be rewritten as:

E_{sw_n} = \int \left[ V_{D_n}^{N}(t)\, I_{D_n}^{N}(t) + \sum_{i=1}^{N-1} V_{D_n}^{i}(t) \left( I_{D_n}^{i}(t) - I_{D_n}^{i+1}(t) \right) \right] dt    (3.15)

It is possible to rewrite the previous equations by noting that, in general,

I_{D_n}^{i+1} - I_{D_n}^{i} = C^{i}\, \frac{dV_{D_n}^{i}}{dt}

and, in particular, if we neglect the current of the pMOS chain above node N,

I_{D_n}^{N} = -\, C^{N}\, \frac{dV_{D_n}^{N}}{dt}.

    Thus, for the n network it is possible to define the Eswn energy in the

    following way:


E_{sw_n} = -\sum_{i=1}^{N} C^{i} \int_{t'_0}^{t''_0} V_{D_n}^{i}\, \frac{dV_{D_n}^{i}}{dt}\, dt = -\sum_{i=1}^{N} C^{i} \int_{V'_i}^{V''_i} V_{D_n}^{i}\, dV_{D_n}^{i} = \frac{1}{2} \sum_{i=1}^{N} C^{i} \left( {V'_i}^2 - {V''_i}^2 \right)

If we integrate equation (3.11) (page 36) only when the arguments of the integrals are non-zero, then the first integral in this equation goes from t'_0 = t_{0_n}^{i} to t''_0 = τ_{o_n}^{i}, so that the second integral goes from V'_i = V_{D_n}^{i}(t_{0_n}^{i}) to V''_i = V_{D_n}^{i}(τ_{o_n}^{i}). Since V_{D_n}^{i}(τ_{o_n}^{i}) = 0, we have E_{sw_n} = \frac{1}{2} \sum_{i=1}^{N} C^{i} {V'_i}^2, where V'_i is the actual voltage swing at node i.

The energy dissipated in the p–network (E_{sw_p}) can be calculated with similar considerations, leading to

E_{sw_p} = \sum_{j=1}^{P} C^{j} \int_{t'_0}^{t''_0} \left( V_{DD} - V_{D_p}^{j} \right) \frac{dV_{D_p}^{j}}{dt}\, dt = \sum_{j=1}^{P} C^{j} \int_{V'_j}^{V''_j} \left( V_{DD} - V_{D_p}^{j} \right) dV_{D_p}^{j} = \frac{1}{2} \sum_{j=1}^{P} C^{j} \left[ (V_{DD} - V'_j)^2 - (V_{DD} - V''_j)^2 \right]

Again, V'_j = V_{D_p}^{j}(t_{0_p}^{j}) and V''_j = V_{D_p}^{j}(τ_{o_p}^{j}), and in the same way V''_j = V_{DD}, so that E_{sw_p} = \frac{1}{2} \sum_{j=1}^{P} C^{j} (V_{DD} - V'_j)^2, where (V_{DD} − V'_j) is the actual voltage swing at node j.

In equations (3.11) and (3.12) (page 36) the voltage dependence of the capacitances must be included, obtaining expressions for E_{sw_{n,p}} that are slightly more complicated, but still in closed form.
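As a minimal numerical illustration of equations (3.11)–(3.14), the following sketch evaluates the switching energy of a small chain of nodes; the supply voltage, capacitances and voltage swings are illustrative assumptions, not values from this work.

    # Switching energy of an nMOS network, eq. (3.11)/(3.13); made-up values.
    VDD = 3.3  # supply voltage [V], assumed

    # (total node capacitance [F], initial voltage V', final voltage V'') per node
    nodes = [
        (20e-15, VDD, 0.0),   # output node: full swing
        (5e-15,  2.5, 0.0),   # internal node: reduced swing (body effect)
        (5e-15,  2.3, 0.0),
    ]

    E_sw_n = 0.5 * sum(C * (V1**2 - V2**2) for C, V1, V2 in nodes)
    print(f"switching energy of the n-network: {E_sw_n*1e15:.1f} fJ")

    # With full swing at every node this reduces to eq. (3.13): 1/2 * sum(C) * VDD^2
    E_full = 0.5 * sum(C for C, _, _ in nodes) * VDD**2
    print(f"full-swing upper bound (eq. 3.13): {E_full*1e15:.1f} fJ")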


3.3.2 Short–circuit energy

The short–circuit contribution (for an output falling transition) is given by:

E_{sc} = \int_{t_0}^{\tau_o} V_D\, I_D\, dt

where I_D is the current flowing through the pMOSFET whose gate voltage is changing during the output fall; of course, all the pMOSFETs between this one and the output node must be on for this contribution to the power dissipation to exist. So, if we neglect the small discharge of the source voltage of this MOSFET, we can easily calculate the short–circuit energy by calculating the current flowing through it.

A similar equation can be written for the nMOS network.

    Since voltage swings, internal currents and capacitances are known from

    the delay analysis, the power supply dissipation does not require addi-

    tional computations.
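For instance, once the voltage and current waveforms of the switching pMOSFET are available from the delay analysis, the short-circuit energy reduces to a simple numerical integration; the waveforms in the sketch below are synthetic placeholders, not the model's output.

    # Short-circuit energy as the time integral of V_D * I_D over the transition.
    import math

    t_end, steps = 200e-12, 200            # 200 ps transition, assumed
    dt = t_end / steps
    ts = [k * dt for k in range(steps + 1)]

    VDD = 3.3
    v_d = [VDD * (1 - t / t_end) for t in ts]                  # falling drain voltage (placeholder)
    i_d = [1e-4 * math.sin(math.pi * t / t_end) for t in ts]   # bell-shaped SC current (placeholder)

    # Trapezoidal rule on the product v_d(t) * i_d(t)
    E_sc = sum(0.5 * (v_d[k] * i_d[k] + v_d[k + 1] * i_d[k + 1]) * dt
               for k in range(steps))
    print(f"short-circuit energy: {E_sc*1e15:.2f} fJ")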

    3.3.3 Subthreshold energy

The subthreshold current in a MOSFET is given by ([12]):

I_{DS_{subth}} = \mu_0\, \frac{W}{L}\, \frac{kT}{q}\, Q(V_S) \left( 1 - e^{-\frac{q V_{DS}}{kT}} \right)

where

Q(V_S) \simeq \frac{kT}{q} \sqrt{\frac{q\, \varepsilon_s N_a}{|\phi_p|}}\; e^{\frac{q (V_G - V_T)}{\eta kT}}

and

\eta = 1 + \frac{1}{2 C_{ox}} \sqrt{\frac{q\, \varepsilon_s N_a}{|\phi_p|}}.

This current is proportional to the MOSFET width W, but it is usually negligible. However, with the scaling down of the dimensions, and hence of the threshold voltage, this current may no longer be negligible; furthermore, with low V_G and higher V_D the current becomes independent of V_G.

Moreover, while the short-circuit current is limited by the switching times of the circuit, the subthreshold current is not limited in time, so its dissipation can become comparable to the short-circuit dissipation.
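To give an idea of the orders of magnitude involved, the sketch below evaluates the generic textbook weak-inversion model I ∝ (W/L)·e^{(V_GS−V_T)/(η kT/q)}·(1 − e^{−V_DS/(kT/q)}) — a simplified stand-in for the expression above, with every parameter value assumed for illustration only.

    # Order-of-magnitude subthreshold leakage with a generic exponential model.
    import math

    vt = 0.0259          # thermal voltage kT/q at 300 K [V]
    I0 = 1e-7            # process-dependent prefactor [A] for W/L = 1, assumed
    n  = 1.5             # subthreshold slope factor (the eta above), assumed
    VT = 0.45            # threshold voltage [V], assumed
    W, L = 10e-6, 0.7e-6 # device geometry [m], assumed

    def i_subth(vgs: float, vds: float) -> float:
        return I0 * (W / L) * math.exp((vgs - VT) / (n * vt)) * (1.0 - math.exp(-vds / vt))

    # With the gate off (VGS = 0) and the drain high, the current no longer grows
    # with VDS once VDS >> kT/q, and it flows for the whole clock period, which is
    # why its energy can rival the short-circuit contribution.
    print(f"I_subth(VGS=0, VDS=3.3 V) = {i_subth(0.0, 3.3):.3e} A")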

    3.4 Results

The circuit in figure 3.2, with 2 nMOS and 2 pMOS transistors (in a 0.7 µm technology), has been simulated using HSPICE (level 6) and the proposed model, for each combination of MOSFET widths from 1 µm to 100 µm. Figure 3.9 shows the comparison between the delay (defined as the 50% delay between an input rising ramp of 200 ps and the output falling ramp) calculated by the model and the delay simulated by HSPICE for each combination of widths between 5 µm and 30 µm; similarly, figure 3.10 shows the comparison between the energy dissipated by the circuit (during the output discharge) as calculated by the model and by HSPICE.

Tab. 3.1: Mean Error

                          Mean error    Max error    Min error
    Delay                   6.115%       12.985%       0.905%
    Energy dissipated       2.1%          6.3%          0.11%

Tab. 3.2: Execution time

    HSPICE execution time    FAST execution time
    6384.3 sec.              188.91 sec.

The errors between the proposed model and the HSPICE simulation are reported in table 3.1, while table 3.2 shows the corresponding execution times. These results are taken from the analysis of the circuit, varying the dimensions of the MOSFETs continuously from 1 µm to 100 µm.


    3.5 Conclusions

The model presented in this chapter is suitable for the optimization application of chapter 5. It is able to compute the delay and the power consumption of CMOS structures with good accuracy and a considerable speedup with respect to the HSPICE simulation taken as a reference.

In a real production design cycle, this model might be used for a first pre-optimization of some basic cells; then, in the last steps of the design flow, an optimization based on a more accurate model for the delay (or power) evaluation must be used.


[Two 3-D surface plots: dissipated energy (200–1000 fJ) versus W1 [micron] and W2 [micron], both ranging from 5 to 30 µm.]

(a) FAST model          (b) HSPICE

Fig. 3.10: Energy dissipated by the circuit of figure 3.2 with several combinations of W1 and W2


    Part III

    OPTIMIZATION


    Chapter 4

    MATHEMATIC OPTIMIZATION

    THE very basic theory of optimization is introduced here, in order to

    develop some optimization schemes, useful later for the optimization

    of real circuits.

The theory of mono-objective optimization involves some properties and theorems regarding the search for the minima of functions, hence the vanishing of the function's first derivatives. These results can be extended (with some restrictions) to the case of multivariable functions, but when more than one function must be optimized simultaneously a new theory must be introduced.

The whole goal of this introduction to mathematical optimization is both the development of reliable algorithms and the justification of some assumptions made in chapter 5 (page 77), especially for the multi-objective case.

In section 4.1 some foundations of mathematical optimization are reported: in particular, §4.1.1 presents the theory of mono-objective optimization (unconstrained, §4.1.1.1, and constrained, §4.1.1.2), while §4.1.2 presents the theory of multi-objective optimization (unconstrained, §4.1.2.1, and constrained, §4.1.2.2).
Section 4.2 reports the basic and most useful numerical algorithms for optimization purposes: §4.2.1 covers some one-dimensional search techniques, §4.2.2 some multi-dimensional search techniques, and §4.2.4 and §4.2.5 some special algorithms.
Some conclusions and a summary of their characteristics are reported in section 4.3.


    4.1 Optimization theory

Notation 4.1. In the following section, the function f is defined as f: X ⊆ ℝᵖ → Y ⊆ ℝ. X is called the decision space, and Y is called the criteria space.

Problem 4.2 (Unconstrained optimization). Given the function f, which depends on one or more variables x ∈ X, the problem of optimizing f is, in this context, equal to finding

\min_{x \in X} f(x)

This is also known as unconstrained optimization, since there are no constraints on the values that the function f may assume.

The unconstrained optimization is seldom applied in the field of digital circuits, so the constrained optimization is defined as:

Problem 4.3 (Constrained optimization). Find

\min_{x \in X} f(x) \quad \text{subject to} \quad g_j(x) \le h_j, \qquad j = 1, 2, \dots, m

where the m inequalities g_j(x) ≤ h_j constitute the set of constraints of the optimization.

The function f is also called the objective of the optimization, or the cost function of the problem.

The above problems are classical optimization problems, or mono-objective problems. The multi-objective unconstrained optimization is defined as the problem of optimizing a vectorial function, so that the objective function is a vector of objective functions.

Notation 4.4. In the following (multi-objective optimization), the function f is defined as f: X ⊆ ℝᵖ → Y ⊆ ℝⁿ, or f = (f_1, f_2, …, f_n) with f_i: X ⊆ ℝᵖ → ℝ.

Problem 4.5 (Unconstrained multi-objective optimization). Find

\min_{x \in X} f_i(x), \qquad i = 1, 2, \dots, n


    where there are n objective functions.

    Finally, the multi-objective constrained optimization is defined as:

Problem 4.6 (Constrained multi-objective optimization). Find

\min_{x \in X} f_i(x), \quad i = 1, 2, \dots, n \qquad \text{subject to} \quad g_j(x) \le h_j, \quad j = 1, 2, \dots, m

where there are n objective functions and m constraints.

The multi-objective optimization is a very complex problem, since the problem of finding the minimum of two or more functions is only apparently trivial: the set of independent variables x_min that minimizes, let us say, the function f_1 is not supposed to minimize (and generally does not minimize) the other functions. So there must be a way to combine the information about the minima of all the functions. The intuitive way, a linear combination

f_{tot}(x) = \sum_{i=1}^{n} \alpha_i f_i(x), \qquad \alpha_i \in \mathbb{R},

is somewhat problematic, because the functions f_i may not be commensurable among themselves. For example, if there is one function f_j such that f_j ≫ f_i, ∀i ≠ j, then this function dominates the total objective, giving false results for the optimization problem. This problem is illustrated in §4.1.2.

    4.1.1 Mono-objective optimization

The mono-objective optimization is the standard optimization problem, and it is widely treated in the literature (see [13] for an introduction). With this preliminary statement, some results are reported here that are useful to find a solution to problems 4.2 and 4.3.

The existence of the minimum (at least one) is granted by the Weierstrass theorem¹, but these minima can be local or global:

Definition 4.7 (Local Minimum). The point x* ∈ X is a local (or relative) minimum of the function f iff

∃ε > 0 : f(x*) ≤ f(x) ∀x ∈ X such that |x − x*| < ε.

¹ If X is a compact set, as it is in this context.


Definition 4.8 (Global Minimum). The point x* ∈ X is a global (or absolute) minimum of the function f iff f(x*) ≤ f(x) ∀x ∈ X.

Definition 4.9 (Feasible direction). d ∈ ℝⁿ is a feasible direction if ∃ᾱ > 0 : x + αd ∈ X ∀α : 0 ≤ α ≤ ᾱ.

In an intuitive manner, the concept of feasible direction is useful to solve the minimization problem: we search for all the directions in which the function f is decreasing.

Lemma 4.10 (First order necessary condition). If x* ∈ X is a minimum of f ∈ C¹ then, for every feasible direction d ∈ ℝⁿ, dᵀ∇f(x*) ≥ 0, where (·)ᵀ(·) has the usual definition of scalar product in the space ℝⁿ.

Corollary 4.10.1. If x* ∈ X is an internal point of X, then dᵀ∇f(x*) = 0.

Lemma 4.11 (Second order necessary condition). If x* ∈ X is a minimum of f ∈ C² then, for every feasible direction d ∈ ℝⁿ,

i) dᵀ∇f(x*) ≥ 0;

ii) if dᵀ∇f(x*) = 0 then dᵀ∇²f(x*)d ≥ 0.

Corollary 4.11.1. If x* ∈ X is an internal point of X, then

i) dᵀ∇f(x*) = 0;

ii) dᵀ∇²f(x*)d ≥ 0.

The conditions of corollary 4.11.1 are necessary conditions for a local minimum, and they become sufficient when the inequality in ii) holds strictly. In order to have some information about the existence of a global minimum, the theory of convex functions must be very briefly reported.

Definition 4.12 (Convex function). The function f: X → Y, where X is a convex set², is convex if ∀x₁, x₂ ∈ X and ∀λ : 0 ≤ λ ≤ 1,

f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2)    (4.1)

² A set X ⊆ ℝⁿ is convex if ∀x, y ∈ X the segment [x, y] is totally contained in X.


If in equation (4.1) the sign < applies, then the function is said to be strictly convex.

Another way to write equation (4.1) is:

Lemma 4.13. The function f ∈ C¹: X → Y is convex over a convex set X if

f(y) \ge f(x) + \nabla f(x)^T (y - x), \qquad \forall y, x \in X

or, if f is twice differentiable,

Lemma 4.14. The function f ∈ C²: X → Y is convex over a convex set X if

\nabla^2 f(x) \ge 0 \ \text{(positive semidefinite)}, \qquad \forall x \in X.

Convex functions are a very useful mathematical tool in this class of optimization problems, mainly because of the next two results:

Theorem 4.15. If f: X → Y is convex over a convex set X, the set A of the minima of the function is convex, and every local minimum is also a global minimum.

Theorem 4.16. If f ∈ C¹: X → Y is convex over a convex set X, and if ∃x* ∈ X such that ∀x ∈ X, ∇f(x*)ᵀ(x − x*) ≥ 0, then x* is a global minimum of f over X.

Theorem 4.16 also implies that, for convex functions, the conditions of lemma 4.10 and corollary 4.10.1 (first-order conditions) are both necessary and sufficient for the existence of a global minimum.

    4.1.1.1 Unconstrained problem

All the previous results are, at least in theory, sufficient to solve problem 4.2. The theory of convex functions ensures the existence of a global minimum, while lemma 4.10, corollary 4.10.1, and theorem 4.16 suggest a method to find this minimum. We will see in §5.1 how these methods apply to real circuits, in which, for example, the derivatives of the functions are not available.
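When the objective is available only as a black box (as for a delay or power figure computed by a simulator or by the model of chapter 3), the first-order conditions cannot be checked directly and a derivative-free search is used instead. The following sketch of a simple coordinate (compass) search is purely illustrative and is not the procedure adopted in chapter 5; the cost function is a made-up placeholder.

    # Derivative-free (compass) search: shrink the step when no axis move improves.
    def compass_search(f, x0, step=1.0, tol=1e-6, max_iter=10_000):
        x, fx = list(x0), f(x0)
        for _ in range(max_iter):
            improved = False
            for i in range(len(x)):
                for delta in (+step, -step):
                    trial = x.copy()
                    trial[i] += delta
                    ft = f(trial)
                    if ft < fx:
                        x, fx, improved = trial, ft, True
            if not improved:
                step *= 0.5            # no improving direction: refine the mesh
                if step < tol:
                    break
        return x, fx

    # Example black-box cost (placeholder for a delay/power estimate)
    cost = lambda w: (w[0] - 3.0) ** 2 + 2.0 * (w[1] - 1.5) ** 2 + 4.0
    print(compass_search(cost, [10.0, 10.0]))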


    4.1.1.2 Constrained problem

The solution of problem 4.3 is slightly more complicated. The presence of constraints reduces the feasible set of independent variables that are solutions of the problem. So the solutions (i.e. the values of the independent variables that minimize the objective function) must be searched in the set C ⊆ X of points that satisfy all the constraints.
The most important method for solving the minimization problem while taking into account the satisfaction of the constraints (and, incidentally, the method most useful for our real problem) is the method of the Lagrange multipliers (and its derivative, the method of the penalty functions).

Lagrange multipliers and penalty functions   The first method defines a Lagrangian function:

L(x, \lambda) = f(x) + \sum_{i=1}^{m} \lambda_i g_i(x)    (4.2)

If we define x* as the solution of

\min_{x \in X} f(x) \quad \text{subject to} \quad g_i(x) \le 0, \quad i = 1, 2, \dots, m

then we can write the necessary Kuhn–Tucker conditions for the existence of the minimum:

\nabla_x L(x^*, \lambda^*) = 0    (4.3)

\nabla_\lambda L(x^*, \lambda^*) \le 0    (4.4)

(\lambda^*)^T g(x^*) = 0    (4.5)

\lambda^* \ge 0    (4.6)

In order to find sufficient conditions, we define the saddle-point conditions:

Theorem 4.17. A point (x*, λ*) with λ* ≥ 0 is a saddle-point of the Lagrangian L(x, λ) iff


i) x* minimizes L(x, λ*) over the whole X;

ii) g_i(x*) ≤ 0, i = 1, 2, …, m;

iii) λ*_i g_i(x*) = 0, i = 1, 2, …, m.

It can be proved that if the functions f, g are convex (even if not differentiable), then the saddle-point conditions are necessary and sufficient. Although these conditions must hold at the minimum, they are not very useful in determining the optimum point: the determination of the optimum by direct solution of these equations is rarely practicable.

A more feasible way is to convert the constrained problem into an unconstrained one, by defining the new objective function:

P(x, K) = f(x) + \sum_{i=1}^{m} K_i \left[ g_i(x) \right]^2    (4.7)

The sum added to the objective function is called penalty function, since it penalizes the objective function by adding positive quantities (recall that we want to minimize the cost function). The constants K = [K₁, K₂, …, K_m]ᵀ are positive weighting factors that define how strongly the i-th constraint must be satisfied, and that can also make the terms commensurable.

Wherever x is inside the feasible region we can ignore the constraints, so a new objective function can be defined as:

P(x, K) = f(x) + \sum_{i=1}^{m} K_i \left[ g_i(x) \right]^2 u_i(g_i)    (4.8)

where u_i(g_i) is the usual step function:

u_i(g_i) = \begin{cases} 0 & \text{if } g_i(x) \le 0 \\ 1 & \text{if } g_i(x) > 0 \end{cases}

The introduction of the step function makes it possible to relate the penalty function defined in (4.8) to the Lagrangian function of (4.2) (page 52):

P(x, K) = L(x, \lambda)

if we let λ_i = K_i g_i(x) u_i(g_i), so that all the previous results valid for the Lagrangian function are also valid for the penalty function.

Note that the solution x* found by optimizing the penalty function P(x, K) converges to (x*, λ*), defined by the Kuhn–Tucker conditions, only in the limit K → ∞.
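A minimal sketch of the penalty approach of equations (4.7)–(4.8) is shown below: the constrained problem is replaced by a sequence of unconstrained minimizations with increasing K, solved here, for instance, with SciPy's Nelder–Mead; the objective and the constraint are made-up placeholders.

    # Quadratic penalty method: minimize f(x) + K * max(0, g(x))**2 for growing K.
    import numpy as np
    from scipy.optimize import minimize

    f = lambda x: (x[0] - 2.0) ** 2 + (x[1] + 1.0) ** 2      # objective (assumed)
    g = lambda x: x[0] + x[1] - 0.5                           # constraint g(x) <= 0 (assumed)

    def penalized(x, K):
        viol = max(0.0, g(x))          # u(g) * g(x): active only when violated
        return f(x) + K * viol ** 2

    x = np.array([0.0, 0.0])
    for K in (1.0, 10.0, 100.0, 1000.0):                      # K -> infinity in theory
        res = minimize(lambda x, K=K: penalized(x, K), x, method="Nelder-Mead")
        x = res.x
        print(f"K={K:7.1f}  x={x}  f={f(x):.4f}  g={g(x):+.4f}")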

    4.1.2 Multi-objective optimization

The multi-objective optimization is not a standard problem in engineering, but it is quite common in economics ([14]). While in the mono-dimensional problem the concept of optimum as a minimum is quite clear and well defined (the idea of greater or lesser is intuitive with real numbers), with multiple objectives (also called multiple criteria) the concept of minimum is less intuitive. So we must define some order relation among the points of a multi-dimensional space.

Notation 4.18. Given x, y ∈ ℝⁿ, define

x = y iff x_k = y_k, k = 1, 2, …, n
x ≦ y iff x_k ≤ y_k, k = 1, 2, …, n
x ≤ y iff x ≦ y and x ≠ y (so ∃k : x_k < y_k)
x < y iff x_k < y_k, k = 1, 2, …, n

Notation 4.19. In the following section, the function f is defined as f: X → Y, X ⊆ ℝᵖ, Y ⊆ ℝⁿ. X is called the decision space, while Y is called the criteria space.

Given two outcomes y1, y2 of the cost functions, y1 = f(x1) and y2 = f(x2), we must define which one is better: we indicate that y1 is better than y2 with y1 ≻ y2, that y1 is worse than y2 with y1 ≺ y2, and, finally, that y1 is indifferent with respect to y2 with y1 ∼ y2.

In optimization theory, great importance is given to the definition of Pareto point or Pareto preference:

Definition 4.20 (Pareto preference). Given y1, y2 ∈ Y, the Pareto preference is defined by

y1 ≻ y2 iff y1 ≤ y2.

    A Pareto preference is intuitively guided by the relation lesser is better.

Definition 4.21 (Non-dominated and dominated set). If {≻} is a binary preference defined on Y, the dominated and the non-dominated set with respect to {≻} are defined as:

N({≻}, Y) = {y0 ∈ Y | ∄ y ∈ Y : y ≻ y0}
D({≻}, Y) = {y0 ∈ Y | ∃ y ∈ Y : y ≻ y0}

If y0 ∈ N({≻}, Y), y0 is an N–point. Similarly, if y0 ∈ D({≻}, Y), y0 is a D–point.

Definition 4.22 (Pareto optimum). y* ∈ Y is a Pareto optimum iff it is an N–point with respect to the Pareto preference.
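For a finite set of outcomes, the non-dominated set N({≻}, Y) of definition 4.21 can be computed directly from the definition; the (delay, energy) pairs below are invented for illustration.

    # Extract the Pareto (non-dominated) set under "lesser is better" on all criteria.
    def dominates(a, b):
        """True if outcome a is preferred to b: a <= b componentwise and a != b."""
        return all(ai <= bi for ai, bi in zip(a, b)) and a != b

    def non_dominated(Y):
        return [y0 for y0 in Y if not any(dominates(y, y0) for y in Y)]

    # Example criteria space: (delay, energy) pairs for a few candidate sizings
    Y = [(1.0, 9.0), (2.0, 5.0), (3.0, 4.0), (2.5, 6.0), (5.0, 3.5), (6.0, 3.6)]
    print(non_dominated(Y))   # Pareto front: [(1.0, 9.0), (2.0, 5.0), (3.0, 4.0), (5.0, 3.5)]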

We will now give two theorems that are fundamental for the solution of the multi-objective optimization problem; first we introduce the definition of convex cone in ℝⁿ:

Notation 4.23 (Convex cone).

Λ_> = {d ∈ ℝⁿ | d > 0}
Λ_≥ = {d ∈ ℝⁿ | d ≥ 0}
Λ_≧ = {d ∈ ℝⁿ | d ≧ 0}

Theorem 4.24. i) If y0 ∈ Y minimizes λᵀy over Y for some λ ∈ Λ_>, then y0 is an N–point;

ii) if y0 ∈ Y uniquely minimizes λᵀy over Y for some λ ∈ Λ_≧, then y0 is an N–point.


    4.1.2.2 Constrained

Again, the solution is to reduce the problem from a multi-objective to a mono-objective one. It is possible to combine the two previous methods, that is, to minimize a linear weighted function plus a sum of penalty functions; the only critical point is to ensure the same order of magnitude for each term of the sum, so that no single term becomes a dictatorship. A third way to solve an unconstrained problem (or a constrained one, with some care) is the method of the compromise solution:

Compromise solution   Given problem 4.5, it is possible to define y* as the ideal outcome of the cost function f(x) without any constraints, so that y* = inf_{x∈X} f(x); the compromise solution is defined as the minimum of the regret:

r(y) = y − y*;

typically, the L_p–norm (the distance between the actual solution and the ideal point) is used:

r(y) = r(y; p) = \left( \sum_{i=1}^{n} \left| y_i - y_i^* \right|^p \right)^{\frac{1}{p}}.

Again, a weight can be associated with each term of the sum:

r(y; p, w) = \left( \sum_{i=1}^{n} w_i^p \left| y_i - y_i^* \right|^p \right)^{\frac{1}{p}}.

Definition 4.26 (Compromise solution). The compromise solution with respect to the L_p–norm is the point y_p^* ∈ Y that minimizes r(y; p, w) over Y.

The compromise solution enjoys several properties; the most important is:

Property 4.27 (Pareto optimality). The compromise solution y_p^* ∈ Y is an N–point for 1 ≤ p < ∞, with respect to the Pareto preference (definition 4.20). If y_∞^* is unique, then it is also an N–point.


This implies that |b − a| = |x − c|, and that at each iteration the interval is scaled by the same ratio γ. Then we repeat the process with the new triplet. So the interval (a, c) is divided into two parts, a smaller and a larger one, and the ratio between the whole interval and the larger part is the same as the ratio between the larger and the smaller part; in other words,

\frac{1}{\gamma} = \frac{\gamma}{1 - \gamma},

giving for the positive solution

\gamma = \frac{\sqrt{5} - 1}{2}.

This fraction is known as the golden mean or golden section, whose aesthetic properties come from the ancient Pythagoreans.
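A minimal implementation of the golden-section search sketched above (the test function is arbitrary):

    # Golden-section search for the minimum of a unimodal 1-D function on [a, c].
    import math

    GAMMA = (math.sqrt(5.0) - 1.0) / 2.0      # the golden ratio ~0.618 derived above

    def golden_section(f, a, c, tol=1e-8):
        # Interior points keeping the golden proportion at every iteration
        b = c - GAMMA * (c - a)
        x = a + GAMMA * (c - a)
        fb, fx = f(b), f(x)
        while (c - a) > tol:
            if fb < fx:                        # minimum bracketed in (a, x)
                c, x, fx = x, b, fb
                b = c - GAMMA * (c - a)
                fb = f(b)
            else:                              # minimum bracketed in (b, c)
                a, b, fb = b, x, fx
                x = a + GAMMA * (c - a)
                fx = f(x)
        return 0.5 * (a + c)

    print(golden_section(lambda t: (t - 1.234) ** 2 + 0.5, 0.0, 10.0))  # ~1.234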

Convergence considerations   All three previous methods have a linear convergence, since at each iteration the ratio between the new interval containing x* and the previous one is

0 \le \frac{I_{k+1}}{I_k} \le 1.

The asymptotic convergence rate is defined as

\lim_{k \to \infty} \frac{I_{k+1}}{I_k}.

For the dichotomic search, since 2 I_{k+1} = I_k + \varepsilon, taking \varepsilon = 0 we have

\lim_{k \to \infty} \frac{I_{k+1}}{I_k} = \frac{1}{2}.

    For the Fibonacci search, first we must write the generic number of the

    Fibonacci sequence in a closed form:


f_k = \frac{1}{\sqrt{5}} \left[ \left( \frac{1 + \sqrt{5}}{2} \right)^{k+1} - \left( \frac{1 - \sqrt{5}}{2} \right)^{k+1} \right];

then it can be proved that, taking \varepsilon = 0,

\lim_{k \to \infty} \frac{I_{k+1}}{I_k} = \lim_{k \to \infty} \frac{f_k}{f_{k+1}} = \frac{\sqrt{5} - 1}{2}.

For the golden-section search, as previously said, \frac{I_{k+1}}{I_k} = \gamma, so

\lim_{k \to \infty} \frac{I_{k+1}}{I_k} = \gamma = \frac{\sqrt{5} - 1}{2}.

Thus the asymptotic convergence rates of the Fibonacci and the golden-section search are identical.

    4.2.1.2 Parabolic interpolation

    Given a triplet (a, b, c) that brackets a minimum, we approximate the

    objective function in the interval (a, c) with the parabola fitting the triplet.

    Then we find the minimum of this parabola with the formula (since we

    want the abscissa, the method is indeed an inverse parabolic interpolation):

x = b - \frac{1}{2}\, \frac{(b - a)^2 \left[ f(b) - f(c) \right] - (b - c)^2 \left[ f(b) - f(a) \right]}{(b - a) \left[ f(b) - f(c) \right] - (b - c) \left[ f(b) - f(a) \right]}

This method is useful only when the function is quite smooth in the interval, but it has the advantage that the convergence is nearly quadratic; in particular, when the function to be optimized is a quadratic form, a single step lands exactly on its minimum.
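A single inverse parabolic interpolation step, applied to an arbitrary test function; on a quadratic it lands exactly on the minimum, as stated above:

    # One inverse parabolic interpolation step on a bracketing triplet (a, b, c).
    def parabolic_step(f, a, b, c):
        fa, fb, fc = f(a), f(b), f(c)
        num = (b - a) ** 2 * (fb - fc) - (b - c) ** 2 * (fb - fa)
        den = (b - a) * (fb - fc) - (b - c) * (fb - fa)
        return b - 0.5 * num / den     # abscissa of the fitted parabola's minimum

    f = lambda t: (t - 2.0) ** 2 + 1.0          # quadratic: one step gives the minimum
    print(parabolic_step(f, 0.0, 1.0, 5.0))      # -> 2.0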

The Brent's rule   Brent's rule is a mix of the last two techniques: it uses the golden section when the function is not regular and switches to parabolic interpolation when the function is sufficiently regular. In particular, it always tries a parabolic step first; when the parabolic step is useless, the method uses the golden-section search.
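In practice such bookkeeping is rarely re-implemented by hand; as an illustration, SciPy exposes a Brent-style scalar minimizer:

    # Library implementation of a Brent-style scalar minimizer (illustrative usage).
    from scipy.optimize import minimize_scalar

    res = minimize_scalar(lambda t: (t - 1.234) ** 2 + 0.5,
                          bracket=(0.0, 10.0), method="brent")
    print(res.x)   # ~1.234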

    4.2.2 Multi-dimensional search

    Thi