
LOW-POWER HIGH-SPEED SERIAL LINK DESIGN

By

JIKAI CHEN

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT

OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2013


© 2013 Jikai Chen


To my wife and my parents


ACKNOWLEDGEMENTS

During the seven years as a PhD student at the University of Florida, I received much help from many people. Although there is only one person listed as the author, the work presented in this Dissertation would not have been possible without them. To each one of them I owe many thanks.

I want to thank my advisor, Dr. Rizwan Bashirullah, for his encouragement when things might go wrong, his tolerance and patience when things did go wrong, and his high standard, which I will carry through the rest of my life.

I want to thank Dr. Jenshan Lin, Dr. Robert Fox, and Dr. Sanjay Ranka for being on my committee and spending their precious time on this Dissertation.

My special thanks go to my friends at ICR. Walker Turner, Qiuzhong Wu, Hang Yu, Chris Dougherty, Chun-ming Tang, Lin Xue, Zhiming Xiao, Chun-chin Peng, Yan Hu, Pawan Sabharwal, Deepak Bhatia, Lawrence Fomundam, and Felipe Garay offered me help when I needed it the most, and brought fun to my supposedly dull PhD life. I will miss the basketball games that we played on those hot summer days.

I want to thank Professor Paul Kohl and his group at Georgia Institute of Technology for their wonderful cooperation, especially Brad Chen and Todd Spencer.

I feel blessed to have such wonderful friends outside ICR, including Shuo Cheng, Mingqi Chen, Changzhi Li, Xiaogang Yu and Yan Yan. There is no doubt I enjoyed and will always cherish our friendship.

I am grateful to my manager, Yanli Fan, and my colleagues, Karl Muth, Archie Hu and Huawen Jin, at Texas Instruments. Yanli was very supportive when I needed to take time off for my defense. I learnt a lot from each one of them, and look forward to making my own contribution to the team.


I want to thank my parents, my parents-in-law, and my sister. Throughout the ups and downs of the past years, they supported me with their unconditional love. If there is only one thing that I want to achieve in my life, I want to make them proud.

Finally, I want to thank my dear wife, Yuan Rao, the most caring and lovely woman in my life. I cannot thank her enough for her love, encouragement, patience, and everything she has done for me. Marrying her is by far the best thing that ever happened to me. I won’t hesitate a moment to give everything in the world for my wife, and dedicating this Dissertation to her is the least I can do.


TABLE OF CONTENTS

ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
ABSTRACT

CHAPTER

1 INTRODUCTION
1.1 Research Motivation
1.2 Dissertation Organization

2 HIGH-SPEED SERIAL LINK OVERVIEW
2.1 Chapter Overview
2.2 The Channel
2.3 Equalization
2.3.1 FFE
2.3.2 CTLE
2.3.3 DFE
2.4 Clocking
2.4.1 Clock Generation
2.4.2 Clock Recovery
2.5 Signaling
2.5.1 Signaling Efficiency
2.5.2 Effects of Channel Loss
2.5.3 Effects of FFE and DFE
2.5.4 Effects of Back Termination
2.5.5 Effects of Signaling and Termination Modes
2.6 Summary

3 AN ACTIVE LINK WITH AIR-CAVITY TRANSMISSION LINES
3.1 Chapter Overview
3.2 Transmission Line Design
3.3 Fabrication
3.4 Link Implementation
3.4.1 Link Architecture
3.4.2 TX Design
3.4.3 RX Design
3.4.3.1 Preamp design
3.4.3.2 DFE design
3.5 Experimental Results
3.5.1 Air-Cavity Transmission Line Measurement
3.5.2 Link Measurement
3.6 Summary

4 A 4.5-Gb/s 12.4-mW RX WITH BAUD-RATE CDR
4.1 Chapter Overview
4.2 Baud-Rate CDR
4.3 Majority-Voting DFE
4.4 Chip Implementation
4.4.1 Architecture
4.4.2 Slicer
4.4.3 DMUX
4.4.4 Clocking
4.5 Experimental Results
4.6 Summary

5 A 5-Gb/s 0.75-pJ/BIT VOLTAGE-MODE TRANSCEIVER
5.1 Chapter Overview
5.2 TX Implementation
5.2.1 TX Architecture
5.2.2 PRBS Generator
5.2.3 LDO
5.2.4 TX Driver
5.3 RX Implementation
5.3.1 RX Architecture
5.3.2 Slicer Design
5.3.3 Level Shifting and DFE Tap Generation
5.3.4 DFE with Look-Ahead Selection Tree
5.3.5 Decimated Baud-Rate CDR
5.4 Injection-Locking-Based Clock Generation
5.4.1 Clock Generation Overview
5.4.2 ILRO Core
5.4.3 Delay Line
5.5 Experimental Results
5.5.1 TX Measurement
5.5.2 Clocking Measurement
5.5.3 RX Measurement
5.5.4 Transceiver Measurement
5.6 Summary

6 A DIGITAL BACKGROUND ADC CALIBRATION TECHNIQUE
6.1 Chapter Overview
6.2 Background Calibration
6.2.1 Review of Prior Art
6.2.2 Proposed Background Calibration Scheme
6.2.2.1 Calibration accuracy
6.2.2.2 Convergence speed
6.2.2.3 Calibration overhead and performance considerations
6.3 Chip Implementation
6.3.1 ADC Architecture
6.3.2 Resistor Ladder
6.3.3 T/H
6.3.4 Comparator
6.3.5 Digital Backend
6.3.6 Reference ADC
6.3.7 Calibration Engine and Supporting Circuitry
6.3.8 Clock and Power Distribution
6.4 Experimental Results
6.5 Summary

7 CONCLUSIONS

LIST OF REFERENCES
BIOGRAPHICAL SKETCH


LIST OF TABLES

2-1 Summary of signaling and termination modes
3-1 Final air-cavity microstrip dimensions
3-2 Performance summary
4-1 CDR truth table
4-2 update
4-3 Clock phase update
4-4 Selector truth table
4-5 Majority-voter truth table
4-6 Performance summary
5-1 Performance summary of the receiver
5-2 Performance summary of the transceiver
6-1 Comparison of proposed and existing background calibration schemes
6-2 Comparison with recently published work


LIST OF FIGURES

1-1 Evolution of Intel Microprocessors
1-2 ITRS predictions for transistor count and on-chip clock frequency for the next decade
1-3 ITRS predictions of I/O and power for the next decade
1-4 Power efficiency of high-speed links vs. year
2-1 A typical high-speed serial link
2-2 Conductor loss
2-3 Physical mechanism of dielectric loss
2-4 Channel loss
2-5 A sample SBR
2-6 Main cursor vs. Nyquist loss
2-7 Eye degradation due to channel loss
2-8 FFE
2-9 CTLE
2-10 DFE block diagrams
2-11 Block diagrams of a PLL and a DLL
2-12 Block diagrams of an injection-locked 5-stage ring oscillator
2-13 Simulated phase noise suppression with injection-locking
2-14 CDR block diagram
2-15 Block diagram and principle of Alexander PD
2-16 Simulated performances of an inverter in a 0.13-μm CMOS technology
2-17 A typical link frontend
2-18 Main cursor amplitude and signaling power penalty vs. channel loss
2-19 Post-cursor amplitudes vs. channel loss
2-20 The effects of channel loss and equalization on
2-21 Effects of FFE and DFE in frequency domain
2-22 Lattice diagram for reflection calculation
2-23 Eye opening vs. RX mismatch
2-24 CM signaling
2-25 VM signaling
3-1 Cross-sections of microstrips
3-2 Simulated of conventional and air-cavity microstrip
3-3 Simulated of conventional and air-cavity microstrip
3-4 Simulated dielectric loss of conventional and air-cavity microstrip
3-5 Picture of the 3D model and simulated loss at various line widths
3-6 Simulated dielectric loss of air-cavity and conventional transmission lines
3-7 Improvement with air-cavity transmission line
3-8 Signaling power reduction with air-cavity
3-9 Fabrication process for the air-cavity structure
3-10 Picture and cross-section of the fabricated air-cavity structure
3-11 Link block diagram
3-12 Schematics of the latch and multiplexer
3-13 Schematic of the 5-b DAC
3-14 Preamp model for gain optimization
3-15 Preamp design
3-16 Input impedance tuning
3-17 Simulated RX eye diagrams
3-19 Layout of the test board with the air-cavity active link
3-20 Measured performances of a 5-cm air-cavity microstrip
3-21 Loss of the air-cavity line
3-22 Chip micrographs of the TX and the RX
3-23 Picture of the populated test board
3-24 Test setup
3-25 Measured waveforms
3-26 Measured link performances
4-1 Different ISI seen by the edge and data samples
4-2 CDR block diagrams
4-3 Operation principle of the proposed baud-rate CDR
4-4 Block diagram of a 1-tap speculative DFE
4-6 Proposed majority voter schematic
4-7 Simulated delay
4-8 Simulated selector and majority-voter performances
4-9 Block diagram of the RX
4-10 Schematic of the slicer with threshold control
4-11 Simulated slicer performances
4-12 Schematics of the CML and CMOS DMUX cells
4-13 Schematic of the divider for I/Q generation
4-14 Principle of PI
4-15 Schematic of the phase interpolator
4-16 Level-converter schematic
4-17 Die micrograph and board picture
4-18 Test setup
4-19 Measured 20” channel performances
4-20 Measured DFE performances
4-21 CDR measurement results
4-22 Measured CDR jitter tolerance
5-1 TX block diagram
5-2 PRBS block diagram
5-3 All-zero detector
5-4 Schematic of the self-biased comparator with offset
5-5 Simulated waveforms confirming the function of the all-zero detector
5-6 Stability of the LDO
5-7 RX block diagram
5-8 Schematic of the slicer
5-9 Level shifters
5-10 Detailed schematic of the level shifter
5-11 Simulated frequency response of the level shifter at different gain settings
5-12 Simulated pre-layout selector delay vs. power supply
5-13 DFE selection tree
5-14 Block diagram of the injection-locking-based clock generation
5-15 Schematic of the ILRO core
5-16 Start-up issue of the pseudo-differential oscillator
5-17 Schematic of the current-starved delay line
5-18 Simulated delay line tuning curve
5-19 Chip micrograph and transceiver layout
5-20 TX measurement results at 6.25 Gb/s
5-21 ILRO measurement results
5-22 Measured phase noise with and without injection locking
5-23 Measured CDR delay line tuning curve showing >2-UI tuning range
5-24 Measured loss characteristics of the 20” channel
5-25 Measured 4-Gb/s eye diagrams before and after the 20” channel
5-26 RX bathtubs with and without DFE
5-27 Jitter histogram of the recovered clock
5-28 Measured 5-Gb/s TX eye diagrams
5-29 Measured CDR waveforms
5-30 RX bathtubs with and without DFE
6-1 An ADC-based serial link
6-2 Schematic of a preamp
6-3 Correlation-based calibration
6-4 Redundancy-based calibration
6-5 Reference-ADC-based calibration
6-6 Principle of reference-ADC-based calibration
6-7 Proposed reconfigurable-comparator-based calibration
6-9 Mechanism of noise-induced calibration error
6-10 Required conversions for convergence with different resolutions
6-11 Block diagram of the ADC
6-12 T/H Design
6-13 T/H Bandwidth vs. switch width
6-14 Comparator block diagram
6-15 Schematics of the first two stages of the preamplifier
6-16 Effects of M3
6-19 Current-steering DAC and the DAC bias generator. The bias generator is shared by all the comparators.
6-20 Simulated comparator performances
6-21 Block diagram of the digital backend
6-22 FSM flow chart. N is the calibration index, which is also the SRAM address.
6-23 Chip micrograph
6-24 Measured ADC linearity
6-25 Test setup for dynamic performance evaluation
6-26 Output spectrums
6-27 ENOB w/ and w/o calibration


LIST OF ABBREVIATIONS

Term: Definition

ADC Analog-to-digital converter

CDR Clock and data recovery

CG Common-gate

CM Current mode

CML Current-mode logic

CTLE Continuous-time linear equalization

DFE Decision-feedback equalization

DLL Delay-locked loop

DMUX De-multiplexer

DNL Differential non-linearity

DSP Digital signal processor

ENOB Effective number of bits

FFE Feedforward equalization

FSM Finite-state machine

ILRO Injection-locked ring oscillator

INL Integral non-linearity

ISI Inter-symbol-interference

ITRS International Technology Roadmap for Semiconductors

I/O Input/output

LFSR Linear-feedback shift register

LPF Low-pass filter

LSB Least significant bit

MUX Multiplexer

NRZ Non-return-to-zero


PD Phase detector

PFD Phase-and-frequency detector

PI Phase interpolator

PLL Phase-locked loop

PM Phase modulation

PRBS Pseudo-random bit sequence

RX Receiver

SAFF Sense-amplifier flip-flop

SBR Single-bit response

SNR Signal-to-noise ratio

TX Transmitter

UI Unit interval

VCDL Voltage-controlled delay line

VCO Voltage-controlled oscillator

VM Voltage mode


Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

LOW-POWER HIGH-SPEED SERIAL LINK DESIGN

By

Jikai Chen

May 2013

Chair: Rizwan Bashirullah

Major: Electrical and Computer Engineering

With ever-increasing integrated functionality and on-chip clock frequency, the off-chip bandwidth of a processor is growing at an even higher rate. The ITRS predicts that the aggregate off-chip bandwidth of future processors will reach 100 Tb/s in the next ten years, delivered by multiple high-speed serial links in parallel, each running at multi-Gb/s rates. At the same time, the total power budget of a processor is practically flat due to package and cooling technology limitations. To accommodate the increase in off-chip bandwidth, the power efficiency of high-speed interconnects must be dramatically improved over the next decade.

Various factors come into play when improving the power efficiency of high-

speed serial links. For multi-Gb/s off-chip signaling, the electrical channel presents the

most difficult challenge with its latency and frequency-dependent attenuation. As a

result, clock and data recovery (CDR) and channel equalization have become essential

functions in all high-speed off-chip serial links. To truly optimize the link power

efficiency, the impact of channel condition, CDR and equalization on the link power

must be well understood, in addition to that of such design choices as signaling mode

and termination topology. This Dissertation is the result of such an effort.

The Dissertation starts with an overview of the high-speed serial link. The

channel loss mechanisms are first reviewed and dielectric loss is shown to be the

dominant factor in future high-speed channels. The dependence of the signaling power

on signaling modes, termination topologies and equalization techniques is analyzed to

identify power-efficient solutions. CDR is also briefly reviewed, revealing the need for a

better baud-rate scheme than existing ones.

To reduce the dielectric loss, a low-power active link is presented in Chapter 3

with an air-cavity transmission line which reduces the channel latency and the dielectric

loss by replacing the dielectric material between the signal lines and the ground plane

with air. Other techniques include the use of DFE, a current-sharing frontend, and the

removal of back termination for better power efficiency. The link works up to 6.25 Gb/s

with a power efficiency of 0.6 pJ/bit.

Clock recovery is addressed in Chapter 4. A novel digital baud-rate CDR scheme

is proposed which automatically tracks the maximum eye-opening. Chapter 4 also

proposes replacing the selectors in a traditional speculative DFE with majority voters,

which are faster and more power-efficient. A receiver that incorporates the proposed

baud-rate CDR and majority-voting DFE works at 4.5 Gb/s while consuming 12.4 mW,

yielding a power efficiency of 2.8 pJ/bit.

Building upon the results of Chapters 3 and 4, Chapter 5 presents a complete 5-

Gb/s transceiver which dissipates only 3.7 mW. To improve the power efficiency, the

transceiver uses exclusively static CMOS logic gates instead of the CML gates in

Chapters 3 and 4, and employs injection-locking based clock generation. Heavy

parallelism and speculation in the DFE selection tree further reduce the power

consumption. The measured 0.75-pJ/bit power efficiency is among the best reported to

date.

While currently most serial links still rely on some analog signal processing, the

continuous scaling of CMOS technology has recently made an ADC-based serial link

attractive in which equalization and timing recovery are all carried out in the digital

domain. One of the key challenges in this ADC-based architecture is the power

consumption of the high-speed ADC. Chapter 6 presents a novel digital background

calibration scheme suitable for high-speed ADCs which features negligible hardware

and power overhead. The efficacy of the proposed calibration scheme is experimentally

confirmed with a 50-mW 2.5-GS/s 5-bit full-flash ADC.

All the test chips in this Dissertation are implemented in a 0.13-µm bulk CMOS technology.

However, the proposed techniques are readily applicable to more advanced technologies. It is therefore

expected that techniques proposed in this Dissertation should help enable future off-

chip serial links with high aggregate bandwidth and low power consumption.

CHAPTER 1 INTRODUCTION

1.1 Research Motivation

The past few decades have witnessed the tremendous advancement of the

semiconductor technology. Governed by Moore’s Law [1] [2], the functionality

(represented by the number of transistors) integrated on a single chip and the on-chip

clock frequency both grew exponentially, as can be observed in Figure 1-1, which

shows the transistor number and on-chip clock frequency of Intel’s microprocessors

over the past 40 years. Consequently, higher and higher I/O bandwidth is needed for

the communication between microprocessors, accelerators, and memories [3].

Recently, the aggregate off-chip bandwidth has entered the Tb/s range, necessitating

the integration of multiple (tens or even hundreds of) high-speed serial-link transceivers

on the same chip, each operating at multi-Gb/s. For example, in [4], a 16-core SPARC

processor has 1.1 Tb/s aggregate I/O bandwidth provided by 112 transmitters and 176

receivers with peak signaling rate of 4.08 Gb/s each.

Such exponential growth of functionality and clock frequency is expected to

continue in the coming decade, as predicted by ITRS [5] and shown in Figure 1-2(A)

and (B), giving rise to even faster increase of the I/O bandwidth over the same period.

Figure 1-3(A) and Figure 1-3(B) show the predicted off-chip clock frequency and the

total number of pads, while the resulting aggregate off-chip bandwidth is plotted in

Figure 1-3(C), assuming that differential NRZ signaling is used and that 50% of the

pads are dedicated to off-chip signaling. It can be seen that within 10 years, the total

bandwidth will extend to the hundred Tb/s range.

Figure 1-1. Evolution of Intel microprocessors. A) Transistor count. B) On-chip clock frequency.

Figure 1-2. ITRS predictions for transistor count and on-chip clock frequency for the next decade. A) Transistor count. B) On-chip clock frequency.

However, due to packaging and cooling limitations, it is also predicted that the

total power consumption of a processor will be kept practically flat at about 140 W over the

same period, as shown in Figure 1-3(D) [5]. State-of-the-art power efficiency of high-

speed serial-link transceivers is around 1 pJ/bit (1 mW/Gb/s), which means 100 W of I/O

power consumption if 100 Tb/s aggregate bandwidth is desired. Clearly, the power

efficiency of high-speed transceivers must be greatly improved in order to maintain such

a growth of I/O bandwidth. For example, if the I/O power is to be kept around 20% of the

whole chip, the power efficiency should improve to approximately 0.2 pJ/bit in 2022.
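The arithmetic behind these figures can be sketched in a few lines. This is only an illustration of the unit conversion (pJ/bit times Tb/s gives watts); the function names are hypothetical, and the 100 Tb/s and 140 W inputs are the values quoted in the text.

```python
def io_power_watts(bandwidth_tbps, efficiency_pj_per_bit):
    """Return I/O power in watts for an aggregate bandwidth and a link efficiency."""
    bits_per_second = bandwidth_tbps * 1e12
    joules_per_bit = efficiency_pj_per_bit * 1e-12
    return bits_per_second * joules_per_bit


def io_power_fraction(bandwidth_tbps, efficiency_pj_per_bit, chip_power_w):
    """Return the share of the total chip power that is spent on I/O."""
    return io_power_watts(bandwidth_tbps, efficiency_pj_per_bit) / chip_power_w


if __name__ == "__main__":
    # Compare the state-of-the-art efficiency with the target efficiency.
    for eff in (1.0, 0.2):
        power = io_power_watts(100, eff)
        share = io_power_fraction(100, eff, 140)
        print(f"{eff} pJ/bit at 100 Tb/s: {power:.0f} W ({share:.0%} of a 140 W chip)")
```

At 1 pJ/bit the I/O alone would use most of the 140 W budget, whereas 0.2 pJ/bit brings it down to roughly 20 W.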

Figure 1-3. ITRS predictions of I/O and power for the next decade. A) Off-chip clock frequency. B) Number of I/O pads. C) Aggregate I/O bandwidth. D) Power.

Figure 1-4. Power efficiency of high-speed links vs. year

In response, the power efficiency of high-speed serial links has been steadily

improving at about 20% per year [6] [7], driven by the joint effort of

technology scaling and design innovations. Figure 1-4 shows the power efficiency of the

high-speed serial links published in ISSCC and the VLSI Symposium since 2000.

Extrapolating this trend to 2022 gives about 0.7 pJ/bit, which is 3× the 0.2 pJ/bit goal.

This clearly indicates that more drastic improvement is needed in the future and is the

motivation behind the research work presented in this Dissertation.

1.2 Dissertation Organization

A high-speed serial link involves functions such as equalization, clocking, and

signaling. To improve the power efficiency of the whole link, it is vital to understand

each of these components and their inter-dependencies, which is the topic of Chapter 2.

Chapter 2 starts with the channel, with special emphasis on the intrinsic loss of

transmission lines. It then introduces a few popular equalization techniques to

compensate channel loss. The important topic of clock generation and recovery follows,

revealing the attractiveness of injection-locking-based clock generation and baud-rate

CDR. After that, the signaling power is related to channel loss, equalization, termination,

and signaling modes. The advantages of DFE and voltage-mode signaling with

differential termination are demonstrated.

Chapter 3 focuses on reducing the signaling power by joint channel and circuit

optimization. An air-cavity transmission line structure is proposed to reduce the

dielectric loss which dominates at high frequencies. To further reduce the power

dissipation, the link also features speculative DFE and a current-sharing frontend

without back termination. The active link dissipates 3.7 mW at 6.25 Gb/s, which

translates to a power efficiency of 0.6 pJ/bit.

A digital eye-tracking baud-rate CDR scheme is proposed in Chapter 4. The

baud-rate CDR automatically tracks the maximum eye-opening while reducing the

clocking power by more than 50% compared to a conventional oversampling-based

CDR. A majority-voting 1-tap speculative DFE is also proposed which is more amenable

to low-power and high-speed designs than the selectors in conventional speculative

DFEs. Implemented with CML gates, a receiver with the proposed baud-rate CDR and

majority-voting DFE consumes 12.4 mW at 4.5 Gb/s including the clocking circuitry.

To further improve the power efficiency, Chapter 5 presents a complete

transceiver implemented exclusively with static CMOS gates. The RX employs heavy parallelism to

reduce the power supply from the nominal 1.2 V to 1.0 V. Other design features include

a speculative DFE with a look-ahead selection tree, a decimated baud-rate eye-tracking

CDR, and an injection-locked ring oscillator for multi-phase clock generation. The TX

uses a voltage-mode driver with differential termination to reduce the signaling power.

The transceiver consumes 3.7 mW at 5 Gb/s. At 0.75 pJ/bit, the power efficiency is

among the best to date.

With advanced CMOS technologies offering transistors with cut-off frequencies

above 100 GHz and gate delays of around 10 ps, it is now possible for the RX to directly

digitize the incoming signal and perform equalization and timing recovery in the digital

domain [8]. One of the key challenges, however, is the ADC’s power consumption. With

a given architecture, an ADC’s power consumption is limited by mismatch which

prevents the use of small transistors. In response, Chapter 6 describes a novel

background ADC calibration scheme that is suitable for high-speed ADCs and incurs

negligible hardware and power overhead. The proposed calibration scheme is

implemented in a 50-mW 2.5-GS/s 5-bit flash ADC and its effectiveness is

demonstrated with experimental results.

All the reported results are in 0.13-μm bulk CMOS technology. It is expected that

the migration to more advanced technologies will lead to even better performance. The

proposed techniques should therefore help pave the way toward low-power high-speed

serial links to meet the requirements of future high-performance electronic systems.

CHAPTER 2 HIGH-SPEED SERIAL LINK OVERVIEW

2.1 Chapter Overview

Figure 2-1 shows a typical high-speed serial link, which consists of a TX, a

channel, and an RX. The TX multiplexes a low-speed parallel bus into a high-speed

serial stream and drives it toward the channel. The RX resolves the stream into digital

bits with a slicer and de-multiplexes them back to a parallel format. The equalizer (EQ)

compensates the frequency-dependent loss of the channel, and the clock and data

recovery (CDR) unit adaptively adjusts the RX clock phase so that the slicer digitizes

the incoming stream with enough timing margin.

Figure 2-1. A typical high-speed serial link

To improve the power efficiency of a serial link, the various parts of the link must

be well understood. We first examine the channel, with emphasis on transmission line

loss because it plays a vital role in determining the link performance. We then introduce

some popular equalization techniques to compensate the channel loss, including FFE,

CTLE, and DFE. Clocking, including clock generation and clock recovery, is presented

next. We show in this part that injection-locking is an attractive clock-generation

technique, and that baud-rate CDR schemes are generally preferred over their

oversampling counterparts. In the end, we relate the signaling power to channel loss,

equalization, impedance mismatch, signaling modes, and termination schemes. We

demonstrate that DFE usually gives better signaling efficiency than FFE, and that

voltage-mode signaling with differential termination reduces the signaling power

significantly.

2.2 The Channel

At multi-Gb/s, the channel delay is comparable to or even larger than the bit time,

rendering the signaling sensitive to reflections due to impedance mismatch. For this

reason, the channel is usually a transmission line with controlled 50-Ω impedance to

accommodate measurement equipment and properly terminated at both the TX and RX.

Discontinuities along the channel such as vias, packages, and connectors should all be

carefully evaluated and controlled.

However, even a perfectly uniform transmission line with proper termination presents

challenges to high-speed signaling. At multi-Gb/s, the channel suffers from two

frequency-dependent loss mechanisms, and it is the channel rather than the transistors

that limits the total signaling bandwidth. For example, it is shown in [9] that, in theory, an

NMOS in 0.8-μm technology is able to resolve a 48-Gb/s binary bit stream. However, the

experimental results fall far short of the theoretical prediction due to the channel

bottleneck (including the pads and packages).

The first loss mechanism is the conductor resistance. At low frequencies, the

current flows evenly through the conductor cross-sectional area. At high frequencies,

however, the current tends to follow the path with least inductance, flowing only in a

shallow band underneath the conductor surface, a phenomenon known as skin effect,

as shown in Figure 2-2(A). The skin depth, the depth at which the current density

decays to $e^{-1}$ of that at the surface, is given by [10]

$$\delta = \frac{1}{\sqrt{\pi f \mu \sigma}}$$

where δ is the skin depth, $f$ is the frequency, μ is the permeability, and σ is the

conductivity. Figure 2-2(B) plots the skin depth in copper as a function of frequency. In

the GHz range, the skin depth is only on the order of μm.

Figure 2-2. Conductor loss. A) Skin effect. B) Skin depth vs. frequency in copper

The crowding of current to the conductor surface increases the effective

resistance at high frequencies. Since the skin depth is inversely proportional to $\sqrt{f}$, the

conductor loss (in dB) increases proportionally to $\sqrt{f}$.
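As a rough numerical check of the skin-depth behavior described above, the following sketch evaluates the standard expression δ = 1/√(πfμσ) for copper. The conductivity value (5.8e7 S/m) and the use of the free-space permeability are textbook assumptions, not values taken from this Dissertation.

```python
import math

MU_0 = 4e-7 * math.pi      # permeability of free space in H/m (copper is non-magnetic)
SIGMA_COPPER = 5.8e7       # assumed textbook conductivity of copper in S/m


def skin_depth_m(freq_hz, sigma=SIGMA_COPPER, mu=MU_0):
    """Return the skin depth in meters at the given frequency."""
    return 1.0 / math.sqrt(math.pi * freq_hz * mu * sigma)


if __name__ == "__main__":
    for f_ghz in (0.1, 1.0, 10.0):
        depth_um = skin_depth_m(f_ghz * 1e9) * 1e6
        print(f"{f_ghz:5.1f} GHz: skin depth = {depth_um:.2f} um")
```

Quadrupling the frequency halves the skin depth, which is the √f dependence that drives the conductor loss.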

The second loss mechanism is the dielectric dissipation, which originates from

the polarization of the molecules in the dielectric material. As illustrated in Figure 2-3,

when an alternating electric field is applied to a dielectric material, the molecules rotate

to align with the external field and in doing so rub against each other and convert some

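of the electric energy into heat [11]. Because the molecules rotate every time the field

polarity changes, the dielectric loss (in dB) is proportional to frequency, and is given

by [12]

$$\alpha_D = \frac{\pi f \sqrt{\varepsilon_r}}{c} \tan\delta_D$$

where $\tan\delta_D$ is the loss tangent of the dielectric material, $\varepsilon_r$ is its relative permittivity, $c$ is the speed of light in vacuum, and $\alpha_D$ is expressed in nepers per unit length (1 Np ≈ 8.686 dB).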

Figure 2-3. Physical mechanism of dielectric loss

The total loss is the combined effect of the conductor loss $\alpha_R$ and the dielectric loss $\alpha_D$, and can be expressed as

$$\alpha(f) = \alpha_R(f) + \alpha_D(f) = k_R \sqrt{f} + k_D f$$

where $k_R$ and $k_D$ are constants determined by the transmission line construction. Since

both $\alpha_R$ and $\alpha_D$ increase with frequency, the channel displays a low-pass profile. Figure

2-4 shows an example channel loss, where $f_{DR}$ is the data rate. The loss at half the data

rate, $\alpha(0.5 f_{DR})$, is also known as the Nyquist loss. $f_C$ denotes the frequency at which

the two loss mechanisms contribute the same and is given by

$$f_C = \left(\frac{k_R}{k_D}\right)^2$$

For a differential 100-Ω 8-mil 0.5-oz microstrip line on FR4, $f_C$ is around 2 GHz. For

high-quality cables, $f_C$ may be much higher. For example, a 50-Ω RG-58 cable with

polyethylene dielectric material may have an $f_C$ around 100 GHz.
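The two-term loss model can be explored numerically. The sketch below assumes a loss of the form k_R·√f + k_D·f (in dB, with f in GHz); the constants are hypothetical and are chosen only so that the crossover frequency lands at 2 GHz, the value quoted for the FR4 microstrip.

```python
import math


def conductor_loss_db(f_ghz, k_r):
    """Skin-effect loss, proportional to the square root of frequency."""
    return k_r * math.sqrt(f_ghz)


def dielectric_loss_db(f_ghz, k_d):
    """Dielectric loss, proportional to frequency."""
    return k_d * f_ghz


def total_loss_db(f_ghz, k_r, k_d):
    """Total channel loss as the sum of the two mechanisms."""
    return conductor_loss_db(f_ghz, k_r) + dielectric_loss_db(f_ghz, k_d)


def crossover_frequency_ghz(k_r, k_d):
    """Frequency at which both mechanisms contribute equally."""
    return (k_r / k_d) ** 2


if __name__ == "__main__":
    k_d = 1.0                  # hypothetical dielectric constant, dB/GHz
    k_r = math.sqrt(2.0)       # hypothetical conductor constant, dB/sqrt(GHz)
    f_c = crossover_frequency_ghz(k_r, k_d)
    print(f"crossover frequency: {f_c:.2f} GHz")
    for f in (0.5, 2.0, 5.0):
        print(f, conductor_loss_db(f, k_r), dielectric_loss_db(f, k_d))
```

Below the crossover the conductor term is the larger one; above it the dielectric term takes over, which is why dielectric loss dominates as data rates rise.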

Figure 2-4. Channel loss

In the time domain, this low-pass characteristic can be captured by the channel’s

single-bit response (SBR) $h_{ch}(n)$. Figure 2-5 shows a sample SBR, where $h_{ch}(0)$ is the

main cursor, those with negative index are pre-cursors, and those with positive index

are post-cursors. It can be seen that due to the limited channel bandwidth, a single bit

spans more than one UI and interferes with neighboring bits, a phenomenon known as

inter-symbol-interference (ISI).

To evaluate the impact of channel loss on the link performances, it is desirable to

establish a relationship between the Nyquist loss and the SBR. However, since the

Nyquist loss does not completely characterize the channel, an exact mapping between

the Nyquist loss and the SBR is not possible. Figure 2-6 shows the main cursor

amplitude at different Nyquist losses. Depending on the relationship between $f_C$ and

$f_{DR}$, $\alpha_R$ and $\alpha_D$ may have varying significances, and channels with the same Nyquist loss

may have different SBRs. Without loss of generality, the discussion in this chapter

considers the case in which $\alpha_R = \alpha_D$ at the Nyquist frequency.

Figure 2-5. A sample SBR

Figure 2-6. Main cursor vs. Nyquist loss

2.3 Equalization

Figure 2-7 shows the simulated eye diagrams for channels with 6-, 12-, and 18-

dB Nyquist losses. The channel loss degrades both the voltage and timing margins

seen by the RX. When the Nyquist loss is about 12 dB, the eye completely closes. To

extend the bandwidth of the channel, equalization is often employed in high-speed

serial links. This section reviews some of the most popular techniques.

Figure 2-7. Eye degradation due to channel loss

2.3.1 FFE

Since the ISI originates from the channel’s low-pass characteristic, it is possible

to reverse it with a linear high-pass filter. One way of doing this is through a discrete-

time FIR filter [13] [14] at the TX or RX, of which TX feedforward equalization (FFE) is

the most popular, as Figure 2-8(A) shows. By adjusting the tap weights, a relatively flat

composite frequency response can be obtained, as shown in Figure 2-8(B).

Figure 2-8. FFE. A) Block diagram. B) Working principle

Although more drivers are used for FFE, their total size is the same as the driver

without FFE if the same peak gain is maintained. The electronic power overhead of FFE

stems mainly from the additional flip-flops and the associated wiring.

2.3.2 CTLE

Figure 2-9. CTLE. A) Circuit detail. B) Frequency response

Another linear equalization technique is the continuous-time linear equalizer

(CTLE) [6] [7]. Figure 2-9(A) and (B) show the schematic and transfer function of such a

CTLE. The transfer function has two poles and one zero, which are given by

The product of the gain, the peaking factor, and the bandwidth satisfies [15]

which means the performance of the CTLE is limited by the cut-off frequency of the

technology. Due to the high bandwidth and linearity requirements, a CTLE tends to be

power hungry. For example, implemented in 90-nm CMOS, the CTLE in [6] provides

8.7-dB peaking and accounts for 27% of the total RX power at 6.25 Gb/s. For a 12.5-

Gb/s link implemented in 65-nm CMOS, the CTLE provides 7.5-dB peaking and

represents 38% of the RX power [7].

2.3.3 DFE

Besides the linear equalizers discussed above, a non-linear equalization

technique, known as decision-feedback equalization (DFE), has found interest in recent

high-speed serial links [16] [17] [18]. A 1-tap DFE is depicted in Figure 2-10(A). It works

by directly removing the ISI of the previous bit from the current analog sample. Another

way of viewing it is that the DFE adjusts the slicer threshold depending on the previous

bit. The power overhead of the DFE shown in Figure 2-10(A) consists mainly of the

summer.

The feedback path in Figure 2-10(A) must settle within one UI, a difficult design

challenge at high data rates. To relax this stringent timing requirement, speculative DFE

can be used, where possible results are pre-computed and then selected by the

previous bits [19], as shown in Figure 2-10(B). The power overhead of speculative DFE

is comprised of the additional slicers.
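The equivalence between the direct and the speculative 1-tap DFE can be illustrated with a small behavioral model. This is only a sketch under stated assumptions: symbols are ±1, the channel is the hypothetical two-cursor SBR [0.8, 0.2], there is no noise, and all function names are illustrative rather than taken from this Dissertation.

```python
def channel_output(symbols, h0, h1, initial=-1):
    """Received samples y[n] = h0*a[n] + h1*a[n-1] for +/-1 symbols."""
    samples = []
    prev = initial
    for a in symbols:
        samples.append(h0 * a + h1 * prev)
        prev = a
    return samples


def dfe_direct(samples, h1, initial=-1):
    """Direct DFE: subtract the ISI of the previous decision, then slice."""
    decisions = []
    prev = initial
    for y in samples:
        d = 1 if (y - h1 * prev) > 0 else -1
        decisions.append(d)
        prev = d
    return decisions


def dfe_speculative(samples, h1, initial=-1):
    """Speculative DFE: slice against both thresholds, then select by the previous bit."""
    decisions = []
    prev = initial
    for y in samples:
        d_if_prev_high = 1 if y > +h1 else -1   # assumes the previous bit was +1
        d_if_prev_low = 1 if y > -h1 else -1    # assumes the previous bit was -1
        d = d_if_prev_high if prev == 1 else d_if_prev_low
        decisions.append(d)
        prev = d
    return decisions


if __name__ == "__main__":
    tx = [1, -1, -1, 1, 1, -1, 1, 1]
    rx = channel_output(tx, h0=0.8, h1=0.2)
    print(dfe_direct(rx, 0.2) == tx, dfe_speculative(rx, 0.2) == tx)
```

In the speculative version the two comparisons do not depend on the previous decision, so only the final selection sits in the feedback path; the price is the second slicer.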

Figure 2-10. DFE block diagrams. A) Conventional DFE. B) Speculative DFE.

2.4 Clocking

At multi-Gb/s, both the timing offset and uncertainty must be well controlled, and

clocking, including clock generation and clock recovery, may constitute a significant or

even dominant portion of the total link power [6] [20]. This section looks at both clock

generation and clock recovery, and identifies ways to reduce the clocking power.

2.4.1 Clock Generation

Clock generation in high-speed serial links is usually done with a PLL or a DLL.

Figure 2-11(A) depicts a PLL block diagram, which consists of a phase detector (PD), a

low-pass loop filter (LPF), a voltage-controlled oscillator (VCO), and an optional divider.

At steady state, the negative feedback loop ensures that the VCO output phase is

aligned with that of the reference clock.

A DLL block diagram is shown in Figure 2-11(B), where the VCO in a PLL is

replaced with a voltage-controlled delay line (VCDL). Under locked condition, the delay

of the VCDL is equal to one reference clock cycle. Compared to a PLL, a DLL is usually

easier to design because the loop is of first order.

While the cores of a PLL and a DLL are the VCO and VCDL, the other loop

components may consume significant power. For example, in [6], the VCO consumes

only 12% of the total PLL power. Besides, the PD and loop filter also occupy

considerable area.

Figure 2-11. Block diagrams of a PLL and a DLL. A) PLL. B) DLL

Another clock generation technique that is found in some recent serial links is the

injection-locked oscillator [21] [22]. Figure 2-12 depicts the block diagram of an

injection-locked 5-stage ring oscillator. In the absence of an injection signal, each stage of

the oscillator contributes a delay of $T_D$, resulting in a free-running frequency of

$$f_0 = \frac{1}{10 T_D}$$

When a clock with frequency $f_{INJ}$ is injected into one of the nodes, the delay of the

injected stage changes by $\Delta T_R$ and $\Delta T_F$ at rising and falling edges respectively.

Designating $\Delta T = \Delta T_R + \Delta T_F$, under locked condition, the oscillation is sustained at $f_{INJ}$,

and the following equation holds:

$$\frac{1}{f_{INJ}} = 10 T_D + \Delta T$$
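The timing relationship can be made concrete with a short calculation. The sketch assumes that an N-stage ring oscillator has a free-running period of 2·N·T_D and that, under lock, the injected stage absorbs the whole difference between the injected period and the free-running period. The 40-ps stage delay and 2.4-GHz injection frequency are hypothetical values, not measurements from this work.

```python
def free_running_frequency_hz(num_stages, stage_delay_s):
    """Free-running frequency of a ring oscillator: one period is 2*N stage delays."""
    return 1.0 / (2 * num_stages * stage_delay_s)


def required_delay_shift_s(num_stages, stage_delay_s, f_inj_hz):
    """Total delay change per period that the injected stage must supply under lock."""
    return 1.0 / f_inj_hz - 2 * num_stages * stage_delay_s


if __name__ == "__main__":
    n, t_d = 5, 40e-12                      # hypothetical 5-stage ring, 40 ps per stage
    f0 = free_running_frequency_hz(n, t_d)
    shift = required_delay_shift_s(n, t_d, 2.4e9)
    print(f"free-running frequency: {f0 / 1e9:.2f} GHz")
    print(f"delay shift needed to lock to 2.4 GHz: {shift * 1e12:.2f} ps")
```

A positive shift means the injected stage must be slowed down; a negative one means it must be sped up. The farther the injected frequency is from the free-running one, the larger the shift the injected stage has to provide.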

Injection-locking a ring oscillator to a clean reference clock can dramatically

improve its noise performance because periodical correction by the injected clock

prevents jitter from accumulating indefinitely [23]. This can be observed in the frequency

domain as a reduction in the phase noise with injection-locking, as illustrated in Figure 2-13.

Compared to a PLL or a DLL, an injection-locked oscillator avoids the power and

area overhead of the PD, the LPF and the dividers, while still offering good jitter

performance [24] [23] [25]. Besides, since no feedback loop is involved, an injection-

locking-based clock generation does not have the stability issue of a PLL or DLL.

Figure 2-12. Block diagram of an injection-locked 5-stage ring oscillator

Figure 2-13. Simulated phase noise suppression with injection-locking

2.4.2 Clock Recovery

A clock recovery unit is essentially a feedback system consisting of three basic

blocks, namely a phase detector (PD), a phase shifter or rotator, and a loop filter, as

shown in Figure 2-14. The PD determines whether the sampling clock is too early or too

late. The early/late information, after being processed by a loop filter, is used to control

the phase shifter or rotator toward the desired position.

Figure 2-14. CDR block diagram

Various architectures exist for clock recovery [26]. The PD can be either linear

[27] or non-linear [28], with the former giving both the direction and magnitude of the

phase deviation, while the latter only the direction. In high-speed serial links, non-linear

PD is more popular because it does not require processing of narrow pulses [29]. The

loop filter can be analog [30], digital [31], or hybrid [32]. The phase shifter or rotator can

be implemented with an oscillator, a delay line, or a phase interpolator (PI) etc.

Non-linear phase detection is usually achieved via oversampling. Figure 2-15(A)

shows the block diagram of an Alexander PD [28]. The input signal is sliced twice for

each UI, once at the eye center (data) and once at the eye boundary (edge). Whenever a data

transition is detected, the edge sample in between is compared with the two data

samples to determine whether the sampling clock is too early or too late, as illustrated in

Figure 2-15(B). Assuming the clock phases are evenly spaced, at locked condition, the

data-sampling phase is automatically placed at the center.
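The early/late decision of the Alexander PD reduces to a small truth table over two consecutive data samples and the edge sample between them. The sketch below is a behavioral model only; the string return values and the vote-counting helper are illustrative choices, not part of the original design.

```python
def alexander_pd(d0, e0, d1):
    """Classify one (data, edge, data) triple of binary samples.

    d0 and d1 are consecutive data samples; e0 is the edge sample taken between them.
    """
    if d0 == d1:
        return "none"      # no data transition, so no timing information
    if e0 == d0:
        return "early"     # edge sample still equals the old bit: clock too early
    return "late"          # edge sample already equals the new bit: clock too late


def net_phase_vote(data, edges):
    """Sum votes over a stream: +1 for each 'early', -1 for each 'late'."""
    vote = 0
    for i in range(len(data) - 1):
        result = alexander_pd(data[i], edges[i], data[i + 1])
        vote += {"early": 1, "late": -1, "none": 0}[result]
    return vote


if __name__ == "__main__":
    data = [0, 1, 1, 0, 1]
    edges = [0, 1, 1, 0]   # edges[i] lies between data[i] and data[i + 1]
    print(net_phase_vote(data, edges))
```

A loop filter would accumulate such votes and move the sampling phase later when the net vote is positive and earlier when it is negative.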

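Figure 2-15. Block diagram and principle of Alexander PD. A) Block diagram. B) Principle: with consecutive data samples D0 and D1 and the edge sample E0 between them, D0 = E0 ≠ D1 indicates that CK is too early, D0 ≠ E0 = D1 indicates that CK is too late, and D0 = E0 = D1 indicates no transition.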

The power overhead of oversampling CDR consists of the additional slicers and

clocking circuitry. While the additional slicers may be disabled to reduce their power

consumption if a low CDR bandwidth is acceptable [6], it is still necessary to generate

the extra clock phases. Moreover, since oversampling requires timing resolution better

than the bit time, the clocking power overhead is more than it appears because doubling

the timing resolution requires more than doubling the clocking power. This can be

observed in Figure 2-16, which shows the delay and energy of an inverter in a 0.13-μm

CMOS technology. For this reason, baud-rate CDR is preferred to reduce clocking

power.

Figure 2-16. Simulated performances of an inverter in a 0.13-μm CMOS technology. A) Delay. B) Energy.

2.5 Signaling

In a high-speed serial link, the TX driver needs to produce a large enough

voltage swing over the low channel impedance. The power consumed by the TX driver,

also known as the signaling power, may constitute a significant portion of the total link

power. For instance, in [7], nearly 40% of the link power is consumed by the TX driver.

To improve the power efficiency of the whole link, it is imperative to gain insight into

the various factors that affect the signaling power.

2.5.1 Signaling Efficiency

Figure 2-17 shows a typical frontend found in high-speed links [17]. The analysis

in this section assumes that the DC loss of the channel is negligible. Without DC loss,

the signal swings at the TX and RX are the same, as shown in Figure 2-17. For the ideal

case with a lossless channel and perfect termination, the eye opening $V_{EYE}$ is the same as

the signal swing $V_{SW}$, and the signaling power is denoted $P_{SIG0}$.

Figure 2-17. A typical link frontend

Factors such as channel loss, equalization, termination, and signaling modes

cause the eye opening $V_{EYE}$ to deviate from the signal swing $V_{SW}$. If we define the signaling efficiency as

$$\eta = \frac{V_{EYE}}{V_{SW}}$$

the signaling power now becomes

$$P_{SIG} = \frac{P_{SIG0}}{\eta}$$

By studying the relationship between η and the various factors such as channel loss,

equalization, termination, and signaling mode, their impacts on the signaling power can

be understood.

2.5.2 Effects of Channel Loss

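With the SBR $h_{ch}(n)$ given, the worst-case eye opening $V_{EYE}$ for a signal swing $V_{SW}$ can be found using the

peak-distortion technique [33], and is calculated to be

$$V_{EYE} = V_{SW}\left(h_{ch}(0) - \sum_{n \neq 0} |h_{ch}(n)|\right) \qquad (2\text{-}9)$$

For a uniform channel with perfect matching, all the cursors are positive. Since

the DC loss is negligible, i.e.

$$\sum_{n} h_{ch}(n) = 1, \qquad (2\text{-}10)$$

Equation 2-9 can be simplified to

$$V_{EYE} = V_{SW}\left(2 h_{ch}(0) - 1\right). \qquad (2\text{-}11)$$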
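The peak-distortion bound can be cross-checked by brute force. The sketch below assumes ±1 signaling and an SBR normalized to unit DC gain; the four-cursor SBR is a hypothetical example, not one of the simulated channels. It computes the worst-case eye opening as the main cursor minus the sum of the magnitudes of all other cursors, and then verifies the result by enumerating every pattern of interfering bits.

```python
from itertools import product


def worst_case_eye(sbr, main_index):
    """Peak-distortion eye opening, normalized to the signal swing."""
    isi = sum(abs(h) for i, h in enumerate(sbr) if i != main_index)
    return sbr[main_index] - isi


def brute_force_eye(sbr, main_index):
    """Smallest sample value for a transmitted +1 over all interfering bit patterns."""
    others = [h for i, h in enumerate(sbr) if i != main_index]
    return min(
        sbr[main_index] + sum(a * h for a, h in zip(pattern, others))
        for pattern in product((-1, 1), repeat=len(others))
    )


if __name__ == "__main__":
    sbr = [0.05, 0.6, 0.25, 0.1]   # hypothetical: one pre-cursor, main, two post-cursors
    print(worst_case_eye(sbr, 1), brute_force_eye(sbr, 1), 2 * sbr[1] - 1)
```

For an all-positive SBR that sums to one, both results agree with the shortcut 2·h(0) − 1, so the eye closes once the main cursor drops to half of the swing.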

Figure 2-18. Main cursor amplitude and signaling power penalty vs. channel loss

Figure 2-18 shows the simulated amplitudes of the main cursor as a function of

the channel Nyquist loss. Assuming the post-cursors are completely removed by DFE,

the main cursor amplitude equals the RX eye opening. The signaling power penalty of

the channel loss is therefore calculated accordingly and is plotted also in Figure 2-18. It

can be seen that when the Nyquist loss exceeds about 9 dB, 50% more signaling power

is needed to restore the eye opening seen by the RX slicers.

Besides mandating more signaling power, higher channel loss also necessitates

more equalization and induces power penalty for signal processing thereof. This is

explained with the help of Figure 2-19, which shows the amplitudes of the first three

post-cursors normalized to the main cursor. Generally speaking, with increasing

channel loss, the post-cursors become more and more significant compared to the main

cursor. Specifically, when the Nyquist loss is 9 dB, the second post-cursor is around

10% of the main cursor. While 1-tap DFE may be enough when the Nyquist loss is less

than about 6 to 9 dB, extra DFE taps are desired beyond that, incurring power penalty for

the extra latches, etc.

Figure 2-19. Post-cursor amplitudes vs. channel loss

Figure 2-20 plots $P_{SIG}/P_{SIG0}$ for different channel losses. When the Nyquist loss

goes beyond 9 dB, the eye opening quickly degrades and error-free signaling without

equalization becomes impractical or even impossible near 12 dB.

Figure 2-20. The effects of channel loss and equalization on $P_{SIG}/P_{SIG0}$

2.5.3 Effects of FFE and DFE

To facilitate signaling over lossy channels, equalization is often employed in high-

speed serial links. The impacts on the signaling power depend on the specific

equalization scheme.

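The FFE operates with an FIR filter in cascade with the channel. With proper tap

weights, the FIR filter inverts the channel response so that the composite frequency

response is flat up to the Nyquist frequency $f_N = 0.5 f_{DR}$, i.e.

$$|H_{FFE}(f) H_{ch}(f)| = |H_{FFE}(f_N) H_{ch}(f_N)|, \qquad (2\text{-}12)$$

where $H_{FFE}(f)$ and $H_{ch}(f)$ are the frequency responses of the FIR filter and the channel.

The peak gain of the FIR filter occurs at the Nyquist frequency, and is kept at

unity for fair comparison, i.e.

$$|H_{FFE}(f_N)| = 1. \qquad (2\text{-}13)$$

Equation 2-12 can then be simplified to

$$|H_{FFE}(f)| \, |H_{ch}(f)| = |H_{ch}(f_N)|. \qquad (2\text{-}14)$$

The signaling efficiency with FFE is then given by [34]

$$\eta_{FFE} = |H_{ch}(f_N)|. \qquad (2\text{-}15)$$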

The DFE, on the other hand, directly removes the ISI of the previous bits and is

better understood in the time domain. In the absence of detection errors (no error

propagation), the DFE can be analyzed in a linear fashion and the composite SBR is

[Figure 2-20 plot: P_SIG/P_SIG0 vs. Nyquist loss (dB), without equalization, with FFE, and with DFE]

)

The signaling efficiency with DFE is then given by

∑| |

. )

The normalized signaling power with FFE and DFE is also plotted in Figure 2-20.

While both FFE and DFE extend the achievable data rate, DFE always yields the lowest

signaling power. For example, when the Nyquist loss is 9 dB, the signaling power with

DFE is 40% lower than that with FFE.

Figure 2-21. Effects of FFE and DFE in frequency domain

Intuitively, this benefit of DFE stems from the fact that DFE boosts the high-

frequency component [16]. This is in contrast to FFE, which merely attenuates the low-

frequency component of the signal so that the high- and low-frequency components

have the same amplitude when arriving at the RX. This is shown in Figure 2-21, which

compares the composite frequency responses with FFE and DFE of a hypothetical

channel which has an SBR of [0.8, 0.2].
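The comparison for the [0.8, 0.2] channel can be reproduced numerically. The sketch below rests on two idealizations that are assumptions of this example: the FFE is modeled as an ideal inverse filter with unity gain at the Nyquist frequency, so the composite response is flat at the channel's Nyquist gain, and the DFE is modeled as perfect cancellation of the single post-cursor with no error propagation, so only the main cursor remains.

```python
import cmath
import math


def channel_gain(sbr, f_norm):
    """|H(f)| of a symbol-spaced SBR; f_norm is frequency divided by data rate."""
    total = sum(h * cmath.exp(-2j * math.pi * f_norm * k) for k, h in enumerate(sbr))
    return abs(total)


def to_db(gain):
    return 20.0 * math.log10(gain)


sbr = [0.8, 0.2]
nyquist_gain = channel_gain(sbr, 0.5)   # 0.8 - 0.2 at half the data rate

# Composite gains in dB at DC and at the Nyquist frequency.
no_eq_db = (to_db(channel_gain(sbr, 0.0)), to_db(nyquist_gain))
ffe_db = to_db(nyquist_gain)            # flat: low frequencies attenuated to match
dfe_db = to_db(sbr[0])                  # flat: post-cursor removed, main cursor kept

dfe_vs_ffe_amplitude = sbr[0] / nyquist_gain
```

Under these assumptions the DFE response sits above the FFE response at every frequency, which corresponds to the "Boosting" label in Figure 2-21.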

2.5.4 Effects of Back Termination

As shown in Figure 2-17, a typical link has termination at both the TX and RX.

Although the TX back termination helps mitigate reflections, it reduces the signal swing

by 50%, which must be compensated for by doubling the signaling power. Note,

[Figure 2-21 plot: attenuation (dB) vs. frequency (0, 0.25 f_DR, 0.5 f_DR), without equalization, with DFE (labeled "Boosting"), and with FFE]

however, that this back termination is not necessary if the channel is relatively uniform

and a good impedance matching is ensured at the RX. With the back termination

removed and assuming perfect RX matching, the signaling power now becomes

. )

Comparing Equation 2-17 to Equation 2-7, removing the back termination reduces the

signaling power by half because it doubles the impedance seen by the TX driver [35].

However, without the damping of the back termination, reflections due to RX

impedance mismatch may make multiple trips along the channel before dying out. The

resulting degradation of the eye opening must be evaluated.

The effect of RX impedance mismatch can be studied with the help of the lattice diagram [36], as shown in Figure 2-22, where $\Gamma_{TX}$ and $\Gamma_{RX}$ are the reflection coefficients at the TX and RX respectively. When a pulse first arrives at the RX, the transmitted pulse is given by

. )

The reflected pulse travels back and gets fully reflected at the TX. When it arrives again

at the RX, the transmitted pulse is

, )

where $*$ denotes convolution. Since the channel DC loss is negligible, the worst case eye opening degradation due to the first reflected pulse is

∑| |

| |. )

Similarly, the degradation due to the nth reflection is

∑| |

| | . )


The total effect is obtained by taking the sum and is

∑∑| |

∑ | |

| |. )

The signaling efficiency without back termination is therefore

2( ∑| |

| |) , )

where the factor 2 accounts for the amplitude doubling due to the removal of the back

termination.
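The summation over successive reflections can be sketched numerically. This is only one plausible reading of the argument, under assumptions stated here rather than taken from the equations: the unterminated TX reflects fully, the channel has no DC loss so each round trip preserves the pulse area, and the worst-case ISI contributed by the nth reflection, relative to a unit-area pulse, is bounded by the nth power of the RX reflection magnitude. The function names are hypothetical.

```python
def reflection_coefficient(z_term, z0=50.0):
    """Reflection coefficient of a termination z_term on a line of impedance z0."""
    return (z_term - z0) / (z_term + z0)


def reflection_bound(gamma_rx, n_max=None):
    """Sum of |gamma_rx|**n over n >= 1 (closed form when n_max is None)."""
    g = abs(gamma_rx)
    if n_max is None:
        return g / (1.0 - g)          # geometric series, valid for g < 1
    return sum(g ** n for n in range(1, n_max + 1))


# Example: RX termination 10% above a 50-ohm line.
gamma = reflection_coefficient(55.0, 50.0)
bound_all = reflection_bound(gamma)
bound_first_three = reflection_bound(gamma, n_max=3)
```

For a 10% mismatch the bound evaluates to about 5% of the unit pulse area, and the first reflection already accounts for most of it.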

Figure 2-22. Lattice diagram for reflection calculation

Figure 2-23 depicts the eye opening improvement with the back termination

removed as a function of RX impedance mismatch. With 9-dB Nyquist loss and 10%

impedance mismatch, the signaling power is reduced by nearly 40%.

Figure 2-23. Eye opening vs. RX mismatch

Also plotted in Figure 2-23 is the effect of RX mismatch when the Nyquist loss is

12 dB. The eye degradation becomes more sensitive to RX mismatch without back

[Figure 2-23 plot: normalized eye opening (0% to 250%) vs. RX mismatch (-40% to 40%), for Nyquist losses of 9 dB and 12 dB]

termination when the channel loss increases. Intuitively, this is because the main cursor

decreases with increasing channel loss, while the reflection remains the same as long

as the DC loss is negligible.

Note the above discussion assumes negligible DC loss of the channel. If the

channel has substantial DC loss, the reflections may be heavily attenuated and good

termination may not be required at either TX or RX [37].

2.5.5 Effects of Signaling and Termination Modes

The above discussion considers exclusively current-mode signaling. However,

both current-mode (CM) [16] [20] and voltage-mode (VM) [6] [38] [39] signaling have

been used for high-speed serial links. Besides, the termination may be single-ended or

differential. Their signaling powers are analyzed below.

Figure 2-24(A) shows the schematic of a current-mode frontend with single-ended termination. The differential pair works in the saturation region and steers the tail current to either branch according to the bit being transmitted. The voltage levels at the TX outputs are

The voltage swing and the signaling power are therefore

When the termination is differential, as shown in Figure 2-24(B), the voltage levels become


while the single-ended voltage swing and the signaling power are the same as with single-ended termination.

Figure 2-24. CM signaling. A) Single-ended termination. B) Differential termination.

Figure 2-25 shows the schematic for VM signaling. The transistors work in the linear region and connect the outputs to either voltage rail according to the bit being transmitted. Termination is provided by series resistance, either the on-resistance of the transistors or explicit resistors in series with the transistors. With single-ended termination, the voltage levels at the TX outputs are

The single-ended voltage swing and the signaling power are


For the case of differential termination, the voltage levels become

The single-ended voltage swing and the signaling power now are

It can be seen that using differential termination reduces the signaling power by 50% for

VM signaling.

Figure 2-25. VM signaling. A) Single-ended termination. B) Differential termination.

Table 2-1 summarizes the performance of current-mode and voltage-mode drivers with single-ended and differential terminations. It can be seen that even with a linear regulator to generate VDRV, VM signaling with differential termination consumes only 25% of the CM signaling power.

Table 2-1. Summary of signaling and termination modes

Mode    CM    CM      VM    VM
Term.   SE    Diff.   SE    Diff.
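The power ratios quoted for the four cases can be checked with elementary circuit analysis. The sketch below is not a transcription of the equations in this section. It assumes a TX termination of R = Z0, an RX termination of Z0 per line (single-ended) or 2·Z0 across the pair (differential), and a VM driver supply VDRV produced by an ideal linear regulator fed from VDD, so that the power is VDD times the load current. The function names and numeric values are hypothetical.

```python
def cm_power(v_swing, vdd, z0=50.0):
    """Current-mode driver: the steered tail current sees Z0 || Z0 per side,
    so a single-ended swing v_swing needs I = 2 * v_swing / z0.
    The same current is needed for single-ended and differential termination."""
    return vdd * 2.0 * v_swing / z0


def vm_power_single_ended(v_swing, vdd, z0=50.0):
    """Voltage-mode, RX terminated to ground: VDRV = 2 * v_swing and only the
    high output conducts, I = VDRV / (2 * z0) = v_swing / z0."""
    return vdd * v_swing / z0


def vm_power_differential(v_swing, vdd, z0=50.0):
    """Voltage-mode, 2*Z0 across the pair: series path Z0 + 2*Z0 + Z0,
    so I = VDRV / (4 * z0) = v_swing / (2 * z0)."""
    return vdd * v_swing / (2.0 * z0)


# Hypothetical operating point: 100 mV single-ended swing from a 1.2 V supply.
p_cm = cm_power(0.1, 1.2)
p_vm_se = vm_power_single_ended(0.1, 1.2)
p_vm_diff = vm_power_differential(0.1, 1.2)
```

Under these assumptions the ratios reproduce the statements in the text: differential termination halves the VM signaling power, and VM signaling with differential termination needs a quarter of the CM signaling power.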

2.6 Summary

Various factors come into play when one tries to improve the power efficiency of a high-speed serial link, with the channel posing the most difficult challenge. At multi-Gb/s data rates, conductor loss and dielectric loss limit the channel bandwidth and cause temporal spreading of the transmitted pulses. To compensate for the resulting ISI, high-speed serial links usually employ equalization such as FFE, CTLE and DFE, each involving a different level of complexity.

Clocking, including clock generation and clock recovery, is challenging at high data rates and may sometimes dominate the total link power budget. Conventional solutions such as PLL and DLL entail considerable area and power overhead due to the PD and LPF. Injection-locking based clock generation, on the other hand, is a promising technique because it avoids such overhead while still featuring low jitter. To reduce the clocking power, baud-rate CDR is preferred over its oversampling counterpart, such as the Alexander-type CDR, which has found popular use in recent high-speed serial links.

Due to the low channel impedance, the signaling power, i.e. the power dissipated by the TX driver, accounts for a considerable percentage of the link power. Using the peak distortion technique and the concept of signaling efficiency, this chapter shows the attractiveness of DFE and of VM signaling with differential termination. It is also shown that with moderate channel loss and reasonable termination tolerance, back termination can be removed to further reduce the signaling power.

The rest of this Dissertation reports a few TX and RX implementations that embody the analysis results presented in this chapter. Their usefulness is demonstrated with experimental results.


CHAPTER 3 AN ACTIVE LINK WITH AIR-CAVITY TRANSMISSION LINES

3.1 Chapter Overview

As discussed in Chapter 2, the bandwidth of transmission lines is limited primarily by conductor loss $\alpha_c$ and dielectric loss $\alpha_d$. Because $\alpha_c$ is proportional to $\sqrt{f}$ while $\alpha_d$ is proportional to $f$ [36], the latter mechanism dominates at high frequencies. For conventional dielectric materials such as FR4, the dielectric loss significantly degrades the channel bandwidth for multi-Gb/s signaling. While resorting to materials with low loss tangents or even optics is possible, such solutions incur significant cost overhead.

Figure 3-1(A) shows the cross-sections of a conventional microstrip on FR4

( . Since the field of a microstrip transmission line resides in

both the air and FR4, the effective dielectric constant lies somewhere between that of air and that of FR4. The extent to which the FR4 dielectric constant dominates is characterized by a so-called filling factor [40], which satisfies

The effective loss tangent can also be related to the filling factor by [40]

Because the dielectric loss is determined by both the effective dielectric constant and the effective loss tangent through [12]

reduction of the filling factor will reduce the dielectric loss.
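The dependence of the dielectric loss on the filling factor can be sketched with the commonly quoted quasi-static microstrip relations. These are textbook forms and are an assumption of this example: whether they match the equations of this section term by term cannot be confirmed from the text. The loss tangent of 0.02 and the filling factors used below are likewise hypothetical.

```python
import math

C0 = 299_792_458.0            # speed of light in vacuum (m/s)
NEPER_TO_DB = 20.0 / math.log(10.0)


def effective_permittivity(eps_r, fill):
    """Effective dielectric constant for a filling factor between 0 and 1."""
    return 1.0 + fill * (eps_r - 1.0)


def effective_loss_tangent(eps_r, tan_delta, fill):
    """Effective loss tangent when only the substrate (not the air) is lossy."""
    return fill * eps_r / effective_permittivity(eps_r, fill) * tan_delta


def dielectric_loss_db_per_cm(freq_hz, eps_r, tan_delta, fill):
    """Dielectric attenuation: pi * f * sqrt(eps_eff) * tan_eff / c, in dB/cm."""
    eps_eff = effective_permittivity(eps_r, fill)
    tan_eff = effective_loss_tangent(eps_r, tan_delta, fill)
    alpha_np_per_m = math.pi * freq_hz * math.sqrt(eps_eff) * tan_eff / C0
    return alpha_np_per_m * NEPER_TO_DB / 100.0


# Hypothetical comparison at 5 GHz on FR4 (eps_r = 4.4, assumed tan_delta = 0.02).
loss_high_fill = dielectric_loss_db_per_cm(5e9, 4.4, 0.02, fill=0.6)
loss_low_fill = dielectric_loss_db_per_cm(5e9, 4.4, 0.02, fill=0.4)
```

A lower filling factor reduces both the effective dielectric constant and the effective loss tangent, so the two effects compound in the attenuation.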

Intuitively, since most of the field energy is confined between the signal lines and the ground plane, filling the space between them with air will reduce the filling factor. This can be done by employing the air-cavity microstrip structure (also known as inverted microstrip [41]), as shown in Figure 3-1(B). Air-cavity microstrips can be formed by selectively post-processing the FR4 boards for high-speed interconnects. This avoids the cost overhead associated with using expensive substrate materials for non-critical signals.

Figure 3-1. Cross-sections of microstrips. A) Conventional. B) Air-cavity.

Figure 3-2 shows the simulated effective dielectric constant of conventional and air-cavity differential microstrips, with the conductor thickness kept at 5 µm. The calculated filling factor is shown in Figure 3-3. It can be seen that the air-cavity microstrip has a lower effective dielectric constant and filling factor, and that when …, the filling factor is reduced by 30% by employing the air-cavity. According to Equation 3-7, such reductions translate to a 36% improvement in dielectric loss, as shown in Figure 3-4.

It should be noted that not only is the air-cavity structure attractive for low loss, it

also features lower latency for the same channel length because is reduced.

Encouraged by these results, this Chapter presents the design and fabrication of

air-cavity transmission lines, and their use in an active link. The active link features a


current-sharing frontend and speculative DFE to reduce the signaling power. Back

termination at the TX is also removed for further power saving. Experimental results

confirm the dielectric loss is reduced by 26% by the air-cavity structure. Operating at

6.25 Gb/s, the link consumes 3.7 mW, yielding a 0.6 pJ/bit power efficiency.

Figure 3-2. Simulated effective dielectric constant of conventional and air-cavity microstrip
[Plot: effective dielectric constant vs. W/H, conventional and air-cavity]

Figure 3-3. Simulated filling factor of conventional and air-cavity microstrip
[Plot: filling factor vs. W/H, conventional and air-cavity]

Figure 3-4. Simulated dielectric loss of conventional and air-cavity microstrip
[Plot: dielectric loss αd (dB/cm) vs. W/H, conventional and air-cavity]

3.2 Transmission Line Design

The main design parameters of the proposed air-cavity structure include the

signal line width W and spacing S, the conductor thickness t, and the height of the air-

cavity H. The design goals include 100 Ω differential impedance, low loss and high

density. Considering the process capability, the conductor thickness is chosen to be 5

μm. For simplicity, the signal line width W and spacing S are assumed to be the same.

A meandered transmission line length of 20 cm is used as a representative channel length for chip-to-chip interconnects [42]. The channel loss is evaluated at 5 GHz with a target of 10 dB or less, or equivalently an attenuation constant of 0.5 dB/cm or less at this frequency.

The transmission line is simulated in a 3D electromagnetic simulator. Figure

3-5(A) shows the picture of the 3D model. To reduce the requirement on computation

resources, a short line of 1 cm is simulated. The obtained S-parameters are then

cascaded to get the characteristics of longer lines.
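Cascading the S-parameters of a short section can be sketched as follows. This is a generic illustration and not the tool flow used in this work: the 2-port S-matrix at one frequency is converted to a transfer (T) matrix, the T-matrix is multiplied by itself once per section, and the product is converted back. The example section, with 0.4 dB of loss per centimeter and an arbitrary phase, is hypothetical.

```python
import cmath
import math


def s_to_t(s):
    """Convert a 2-port S-matrix to a transfer matrix with [b1, a1] = T [a2, b2]."""
    (s11, s12), (s21, s22) = s
    det = s11 * s22 - s12 * s21
    return ((-det / s21, s11 / s21), (-s22 / s21, 1.0 / s21))


def t_to_s(t):
    """Inverse of s_to_t."""
    (t11, t12), (t21, t22) = t
    return ((t12 / t22, t11 - t12 * t21 / t22), (1.0 / t22, -t21 / t22))


def matmul(a, b):
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )


def cascade(s, sections):
    """S-matrix of `sections` identical 2-ports connected in series."""
    t_unit = s_to_t(s)
    t_total = ((1.0, 0.0), (0.0, 1.0))
    for _ in range(sections):
        t_total = matmul(t_total, t_unit)
    return t_to_s(t_total)


# Hypothetical matched 1-cm section: 0.4 dB loss, 0.9 rad phase delay.
s21_unit = 10 ** (-0.4 / 20.0) * cmath.exp(-0.9j)
unit = ((0.0, s21_unit), (s21_unit, 0.0))
line_20cm = cascade(unit, 20)
loss_20cm_db = -20.0 * math.log10(abs(line_20cm[1][0]))
```

For a perfectly matched section the cascade simply adds the loss in dB; the T-matrix route matters when the sections have non-zero reflection, because the ripple from multiple reflections is then captured as well.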

Figure 3-5(B) shows the simulated air-cavity loss performance at 5 GHz at

various signal line widths. While the conductor loss decreases with increasing conductor

sizes due to larger effective conducting surface area, the dielectric loss stays relatively

constant since it is primarily determined by the material properties. From a loss

reduction perspective, it is desirable to use as large a W as possible. However, to achieve the desired impedance, a proper W/H ratio must be maintained. The fabrication process limits the air-cavity height to about 20 μm. Accordingly, the final W is chosen to be 40 μm, which gives an 8 dB total loss for a 20 cm channel at 5 GHz. The transmission line dimensions are listed in Table 3-1.

Figure 3-5. Picture of the 3D model and simulated loss at various line widths

Table 3-1. Final air-cavity microstrip dimensions

W       S       t      H
40 µm   40 µm   5 µm   19 µm

Figure 3-6 compares the dielectric loss in the proposed air-cavity transmission

line and the conventional FR4-based microstrip transmission line (in dB/cm) with the

same conductor width and spacing. The air-cavity structure reduces the dielectric loss

by around 26%.

Figure 3-6. Simulated dielectric loss of air-cavity and conventional transmission lines

The effective dielectric constants are calculated from the simulated phase

characteristic. The air-cavity structure reduces the effective dielectric constant by 25%

from 2.75 to 2.07.

[Figure 3-5 labels: 3D model with Signal P, Signal N, Ground and FR4; plot of S21 (dB/cm) vs. width (μm) showing total loss, dielectric loss and conductor loss]

[Figure 3-6 plot: dielectric loss (dB/cm) vs. frequency (GHz), conventional (FR4 tanD) and air-cavity (airgap tanD)]

Figure 3-7 compares the simulated losses of conventional and air-cavity

transmission lines. The loss of the air-cavity transmission line is 0.25 dB/cm at 3.125

GHz and is 8% less than the conventional structure. Figure 3-8 shows the signaling

power reduction with the air-cavity structure assuming FFE and DFE respectively. The

improvement of air-cavity topology becomes more pronounced at higher frequencies as

the dielectric loss becomes more significant. For example, at 10 GHz, the loss

improvement is nearly 15%, and for a 20-cm channel the signaling power is reduced by

more than 10% with DFE and 16% with FFE. It is therefore expected that the air-cavity

structure is especially attractive for future high-speed interconnects.

Figure 3-7. Improvement with air-cavity transmission line
[Plot: loss (dB/cm) vs. frequency (GHz), conventional and air-cavity]

Figure 3-8. Signaling power reduction with air-cavity. A) With FFE. B) With DFE.
[Contour plots: channel length (cm) vs. data rate (Gb/s); reduction bands from 10% to 55% with FFE and from 0% to 30% with DFE]

3.3 Fabrication

Figure 3-9 illustrates the process flow for fabricating the proposed air-cavity interconnects. The process begins with electroplating the first copper pattern, representing the differential signal lines, on an FR4 substrate (Figure 3-9(A)). Following this step, a sacrificial polymer layer is spin-coated to the desired thickness and patterned to act as a temporary placeholder in the formation of the air-cavity (Figure 3-9(B)). The sacrificial polymer contains polypropylene carbonate (PPC) (Novomer Inc., Ithaca, NY). A photoacid generator is added in order to obtain a photosensitive polymer mixture, and γ-butyrolactone (GBL) serves as the solvent. A similar formulation is available as Unity 2203P from Promerus LLC, Brecksville, OH. Two different approaches for patterning the PPC layer are studied: photo-patterning and self-patterning [43]. Photo-patterning uses a photo mask. The PPC self-patterning process needs no photo mask, and the slightly sloped sidewalls of the resulting PPC patterns give the subsequent layers better step coverage. The copper ground layer is then patterned on top of the PPC patterns. The entire surface is then overcoated with Avatrel 8000P (functionalized polynorbornene), which hermetically seals the transmission line and provides mechanical support for the top ground copper layer (Figure 3-9(C)). The PPC polymer backbone unzips upon heating to 220°C during the Avatrel overcoat curing, during which the solid PPC is converted to gaseous products. The gaseous products gradually permeate through the overcoat sidewalls and the openings in the ground layer patterns, leaving an air-cavity region of the same physical shape as the patterned PPC with little residue (Figure 3-9(D)). The air-cavity transmission line structure is thus formed. The overcoat also serves as a solder mask for later die and cable attachments.

Figure 3-9. Fabrication process for the air-cavity structure

Figure 3-10. Picture and cross-section of the fabricated air-cavity structure

Figure 3-10(A) shows the picture of the finished air-cavity differential

transmission lines. The ground plane is patterned in a grid style, with holes for gas

release during PPC evaporation. Figure 3-10(B) shows a cross-section of the finished

air-cavity structure.

[Figure 3-9 and Figure 3-10 labels: FR4, PPC, Avatrel, air, signal, ground, air-cavity]

3.4 Link Implementation

3.4.1 Link Architecture

Figure 3-11 shows the block diagram of the link. The RX has a common-gate (CG) preamp and a half-rate 1-tap speculative DFE. The TX consists of a half-rate $2^7-1$ PRBS core, a MUX, and an open-drain driver. To reduce signaling power, the back termination at the TX output usually found in high-speed serial links is removed in this design. For the same voltage swing seen by the RX, removing the back termination reduces the required signaling power by 50% because it doubles the impedance seen by the TX.
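A $2^7-1$ PRBS pattern can be modeled in software with a 7-bit linear feedback shift register. The sketch below is a behavioral illustration only; the feedback taps (the commonly used PRBS7 polynomial x^7 + x^6 + 1) and the seed are assumptions, since the text does not state which polynomial the on-chip core implements.

```python
def prbs7(length, seed=0x7F):
    """Generate `length` bits from a 7-bit LFSR with taps on stages 7 and 6."""
    state = seed & 0x7F
    if state == 0:
        raise ValueError("seed must be non-zero")
    bits = []
    for _ in range(length):
        new_bit = ((state >> 6) ^ (state >> 5)) & 1   # XOR of stages 7 and 6
        bits.append(new_bit)
        state = ((state << 1) | new_bit) & 0x7F       # shift in the feedback bit
    return bits


sequence = prbs7(254)
first_period = sequence[:127]
second_period = sequence[127:]
ones_per_period = sum(first_period)
```

The sequence repeats every 127 bits and contains 64 ones per period, the usual properties of a maximal-length sequence of this order.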

Figure 3-11. Link block diagram

Channel equalization is primarily done by the DFE for better power efficiency as

discussed above. However, because DFE only cancels post-cursors, a 2-tap FFE is still

[Figure 3-11 block diagram labels: 2^7-1 PRBS, TX MUX, driver, 20-cm air-cavity channel, current-sharing frontend with CG amp, impedance control and offset control, DFE with thresholds +h and -h; signals D0, D1, VB, CKTX, CKRX, Q0, Q1]

built in the TX driver for pre-cursor cancellation. Note that this TX FFE can also be

configured for post-cursor cancellation, and facilitates the comparison between FFE and

DFE in terms of power efficiency.

3.4.2 TX Design

The latches, multiplexers and drivers in the TX are all implemented in current-mode logic (CML) for fast operation and good supply-noise immunity, as shown in Figure 3-12. Considering the fact that the pre-cursor is usually only a fraction of the main cursor, the pre-cursor driver is sized at half of the main cursor driver. The multiplexers are sized in such a manner that the signal path comprising the latch, the multiplexer and the driver has a uniform fan-out.

Figure 3-12. Schematics of the latch and multiplexer. A) Latch. B) Multiplexer.

Figure 3-13. Schematic of the 5-b DAC

To facilitate debugging and testing, a serial interface is integrated on-chip. The

bias currents of all the gates are controlled with 5-b DACs, the schematic of which is

shown in Figure 3-13.

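[Figure 3-12 and Figure 3-13 schematic labels: latch (INP, INN, CKP, CKN, OUTP, OUTN); multiplexer (AP, AN, BP, BN, CKP, CKN, OUTP, OUTN); 5-b DAC (VDD, VREF, VBIAS, binary-weighted branches X1, X2, X4, X8, X16 controlled by W0 to W4)]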

3.4.3 RX Design

3.4.3.1 Preamp design

The RX consists of the CG preamp and the DFE. The CG frontend at the RX side

serves multiple purposes. First, it provides low-to-high impedance transformation and

increases the voltage swing seen by the following DFE stage. This accommodates a

smaller input voltage swing, which is important for high power efficiency as discussed

before. Second, it accomplishes level-shifting of the input signal so that NMOS input

stages can be used in the DFE. Third, the input impedance looking into the source of

the CG amplifier provides partial impedance matching for the channel.

The most important design metrics of the CG preamp are bandwidth and gain,

which are both closely related to power. With the bandwidth design target set to 67% of

the data rate, or 4.2 GHz for 6.25 Gb/s NRZ signaling, the gain of the CG preamp is

optimized for minimum link power. A higher preamp gain yields better RX sensitivity and

lower signaling power, but requires more power for the preamp. For a given channel

condition and technology, an optimum gain therefore exists that minimizes the total

frontend power PFE.

Figure 3-14. Preamp model for gain optimization

Figure 3-14 shows a preamp model for gain optimization. For a given load capacitance $C_L$, gain $A$ and 3-dB bandwidth $\omega_{3dB}$, the following equations hold:

$$A = g_m W R, \qquad \omega_{3dB} = \frac{1}{R\,(C_L + c_d W)},$$

where $W$ is the transistor width, $g_m$ is the transistor transconductance per unit width, $R$ is the load resistance, and $c_d$ is the transistor drain capacitance per unit width. For each transistor current density $J$, $W$ and $R$ can be solved and the amplifier current is found to be

$$I_{AMP} = J W = \frac{J A\,\omega_{3dB}\,C_L}{g_m - A\,\omega_{3dB}\,c_d}.$$

Figure 3-15. Preamp design. A) Amplifier current vs. current density. B) Frontend power vs. preamp gain.

Figure 3-15(A) plots the amplifier current as a function of the current density at different gains in the target 0.13-μm CMOS technology when driving the four slicers of the DFE. For each gain, there exists an optimum current density, and the optimum current density increases with increasing gain.

[Figure 3-15 plots: A) I_AMP (mA) vs. current density (μA/μm) for gains A = 1 to 8; B) power (mW) vs. preamp gain (0 to 10)]
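The existence of an optimum current density can be illustrated with a simple numerical model. The sketch below solves a gain constraint (gain equals transconductance per unit width times width times load resistance) and a single-pole bandwidth constraint for the width, then sweeps the current density. The square-law transconductance model and every numeric value are assumptions made for illustration; they do not represent the 0.13-μm process data behind Figure 3-15.

```python
import math

OMEGA = 2 * math.pi * 4.2e9   # target 3-dB bandwidth (rad/s)
C_LOAD = 40e-15               # hypothetical load capacitance (F)
C_DRAIN = 1e-15               # hypothetical drain capacitance per width (F/um)
K_GM = 0.1                    # hypothetical square-law constant: gm = K_GM * sqrt(J)


def gm_per_width(j):
    """Transconductance per unit width (S/um) at current density j (A/um)."""
    return K_GM * math.sqrt(j)


def amp_current(j, gain):
    """Amplifier current (A) that meets the gain and bandwidth targets."""
    margin = gm_per_width(j) - gain * OMEGA * C_DRAIN
    if margin <= 0:
        return math.inf       # self-loading makes the targets unreachable
    width = gain * OMEGA * C_LOAD / margin
    return j * width


def optimum_density(gain, points_per_decade=100):
    """Grid search over 0.1 uA/um to 1 mA/um for the minimum-current density."""
    grid = [10 ** (-7 + i / points_per_decade)
            for i in range(4 * points_per_decade + 1)]
    return min(grid, key=lambda j: amp_current(j, gain))


def closed_form_optimum(gain):
    """Analytic minimum for the square-law model: gm equals twice the self-loading term."""
    return 4 * (gain * OMEGA * C_DRAIN / K_GM) ** 2


j_opt_gain2 = optimum_density(2)
j_opt_gain4 = optimum_density(4)
```

In this model the optimum density scales with the square of the gain, which is at least qualitatively consistent with the trend described for Figure 3-15(A).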

Figure 3-15(B) shows the signaling power, the preamp power and the frontend

power at different gain with optimum current density over a channel with 9-dB Nyquist

loss. The slicer sensitivity is 100 mV, and it is assumed that DFE is used and that back

termination is removed. The minimum frontend power is attained when the preamp gain

is around 4, and is about 50% lower than the case without the preamp.

The frontend power is further reduced with a current-sharing frontend, as shown

in Figure 3-11. By stacking the CG preamp and the open-drain TX driver, the tail current

of the TX driver is reused by the RX amplifier. According to Figure 3-15(B), this reduces

the frontend power by nearly 50%. The fact that the TX driver is powered from the RX

supply also helps to suppress the noise coupling from the TX supply.

Back termination is removed in this work to reduce signaling power. The

downside of this practice is the risk of potential reflections due to TX impedance

mismatch. To mitigate the effect of reflections, a good impedance matching at the RX

side must be maintained. Since the input impedance of the CG frontend is bias

dependent and non-linear, a programmable resistor is connected across the RX inputs

to provide a better matching, as shown in Figure 3-16(A). The programmable range of

the resistor is chosen so that a differential input impedance of 100 Ω is maintained over

a wide bias range between 0.5 mA and 5 mA, as shown in Figure 3-16(B). Figure 3-17

compares the RX eye diagrams with and without back termination. It can be seen that, as expected, removing the back termination nearly doubles the RX eye opening (from 64 mV to 126 mV) without any noticeable degradation of the eye quality. Given the same RX sensitivity, this means the signaling power is reduced by nearly 50%.

Figure 3-16. Input impedance tuning. A) Schematic. B) Simulated result.
[Plot: Z_DM (Ω) vs. tail current (mA)]

Figure 3-17. Simulated RX eye diagrams. A) With back termination. B) Without back termination.
[Plots: voltage (mV) vs. time (UI); eye opening of 64 mV with back termination and 126 mV without]

To prevent the RX sensitivity degradation due to small transistor sizes, offset

cancellation is also built into the CG amplifier, as shown in Figure 3-11. The polarity and

magnitude of the offset cancellation are all adjustable via digital control.

3.4.3.2 DFE design

The DFE employs a speculative architecture and half-rate clocking to ease timing

requirement. The slicers are implemented as CML latches with adjustable built-in offset,

as shown in Figure 3-18. When the latch is in its amplification phase (CKP is HIGH), an

auxiliary differential amplifier injects static current into the output nodes to introduce a

desired offset. This is in contrast to [44], where the offset is introduced during the

regeneration phase. This leads to more robust latch operation since the regenerative

gain is not affected by the offset injecting differential pair. Another highlight of the DFE

design is that a single latch stage is employed before the selector, unlike [44] where a

complete flip-flop is used. To account for different channel profiles, both the polarity and

the magnitude of the offset injecting current are programmable via an on-chip serial

interface. The programmable range of the slicer threshold is simulated to be ±140 mV,

which is large enough to account for different DFE tap weights required by different

channel profiles.
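The speculative (loop-unrolled) decision process can be illustrated with a behavioral model. This is a full-rate sketch that ignores the half-rate clocking and the latch-level implementation: two slicers with thresholds of +h1 and -h1 resolve every sample for both possible values of the previous bit, and the previous decision then selects one of the two results. The channel values and the symbol pattern are hypothetical.

```python
def channel(symbols, h0, h1, initial_previous=1):
    """Received samples for +/-1 symbols over a main cursor h0 and post-cursor h1."""
    samples = []
    previous = initial_previous
    for d in symbols:
        samples.append(h0 * d + h1 * previous)
        previous = d
    return samples


def speculative_dfe(samples, h1, initial_previous=1):
    """1-tap speculative DFE: slice against both thresholds, then select."""
    decisions = []
    previous = initial_previous
    for y in samples:
        if_previous_high = 1 if y > +h1 else -1   # slicer with threshold +h1
        if_previous_low = 1 if y > -h1 else -1    # slicer with threshold -h1
        previous = if_previous_high if previous > 0 else if_previous_low
        decisions.append(previous)
    return decisions


def plain_slicer(samples):
    """Reference receiver without equalization (threshold at zero)."""
    return [1 if y > 0 else -1 for y in samples]


# Hypothetical heavily dispersive channel: post-cursor larger than main cursor.
symbols = [1, -1, -1, 1, 1, 1, -1, 1, -1, -1]
received = channel(symbols, h0=0.6, h1=0.7)
recovered = speculative_dfe(received, h1=0.7)
unequalized = plain_slicer(received)
```

Because both candidate decisions are available before the previous bit is resolved, only the selection has to complete within the feedback loop.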

Figure 3-18. Slicer schematic
[Schematic labels: INP, INN, CKP, CKN, SP, SN, OUTP, OUTN, tap control]

The designs of the CML latches and multiplexers in the DFE are the same as the

TX except sizing. Unlike the multiplexers in the TX which see the large input

capacitances of the pre-cursor and main cursor drivers, the multiplexers in the RX only

see the CML latch inputs. Accordingly, they are sized the same as the latches to save

power.

3.5 Experimental Results

To evaluate the performance of the proposed air-cavity structure, a test board is

designed. The layout of the test board is shown in Figure 3-19. The center area is occupied by the active link, which includes footprints for a TX chip and an RX chip, and the air-cavity transmission lines. The rectangular board for the active link is cut using a dicing

saw and interfaced with test equipment to evaluate overall link performance. CPW lines

are used to connect the SMA connectors to the chip footprint.

Figure 3-19. Layout of the test board with the air-cavity active link

The top and bottom areas of the test board are used to implement air-cavity test

structures of various lengths. To improve measurement accuracy, open-short-thru de-

embedding structures are also implemented. To facilitate processing, custom alignment

marks are placed at multiple locations. The entire board footprint is designed to fit into a

circular area with a diameter of 4” to accommodate the in-house fabrication capabilities.

[Figure 3-19 layout labels: active link (TX, RX) in the center, test structures at the top and bottom, 4” board diameter]

3.5.1 Air-Cavity Transmission Line Measurement

The performance of the air-cavity transmission line was obtained by measuring a

5-cm test structure using a vector network analyzer with high-frequency probes. Figure

3-20 shows the measured loss and phase responses. The effective dielectric constant is calculated to be 1.7 from the measured phase, which is lower than the simulated value of 2.07. This is probably because the dielectric constant of the base material (~3.9) is lower than the value of 4.4 used in the previous simulations. The lower dielectric constant also leads to a higher line impedance, which causes ripples in the measured loss due to impedance mismatch [12].
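The extraction of the effective dielectric constant from a phase measurement can be sketched as follows. The calculation assumes an unwrapped transmission phase, a known physical line length, and negligible phase contribution from the probes and launches; the function names and the 5 GHz evaluation point are hypothetical.

```python
import math

C0 = 299_792_458.0  # speed of light in vacuum (m/s)


def eps_eff_from_phase(phase_deg, freq_hz, length_m):
    """Effective dielectric constant from the unwrapped S21 phase of a line."""
    delay_s = math.radians(abs(phase_deg)) / (2.0 * math.pi * freq_hz)
    return (C0 * delay_s / length_m) ** 2


def phase_from_eps_eff(eps_eff, freq_hz, length_m):
    """Expected unwrapped S21 phase (degrees) for a given effective dielectric constant."""
    return -360.0 * freq_hz * length_m * math.sqrt(eps_eff) / C0


# A 5-cm line with an effective dielectric constant of 1.7, evaluated at 5 GHz.
expected_phase = phase_from_eps_eff(1.7, 5e9, 0.05)
recovered_eps = eps_eff_from_phase(expected_phase, 5e9, 0.05)
```

At 5 GHz the 5-cm line accumulates more than one full cycle of phase, so the measured phase must be unwrapped before this formula is applied.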

Figure 3-20. Measured performances of a 5-cm air-cavity microstrip. A) Loss. B) Phase.
[Plots: loss (dB) vs. frequency (GHz); phase (degree) vs. frequency (GHz)]

The true loss of the line (excluding the effects of impedance mismatch) is

calculated from extracted propagation constant using the technique in [45], and the

result is shown in Figure 3-21. The loss is 0.28 dB/cm at 3.125 GHz, which readily meets our design goal. The simulation result (with $\varepsilon_r = 3.9$) is also overlaid for comparison, demonstrating good agreement between measurement and simulation.

Figure 3-21. Loss of the air-cavity line

3.5.2 Link Measurement

The TX and RX test chips are fabricated in a 0.13-μm 1.2-V CMOS process. Figure 3-22 shows the chip micrographs. The TX and RX cores occupy 0.03 mm² and 0.02 mm², respectively. The test chips are wire-bonded to QFN packages and mounted on the test board with a 20-cm (8”) air-cavity interconnect. Figure 3-23 shows the picture of the populated test board with air-cavity lines in the center of the board.

Figure 3-22. Chip micrographs of the TX and the RX

[Figure 3-21 plot: loss (dB/cm) vs. frequency (GHz), measurement and simulation]

[Figure 3-22 micrograph labels: transmitter and receiver dies, each 1.3 mm by 1.5 mm, with 200 μm scale markers]

Figure 3-23. Picture of the populated test board

Figure 3-24. Test setup

The test setup is depicted in Figure 3-24. The TX and RX work mesochronously,

deriving their clocks from the same signal generator, with their phase relationship

adjusted by a mechanically-tunable delay-line.

The full link operates successfully at 6.25 Gb/s with a half-rate input clock of

3.125 GHz. Figure 3-25(A) and Figure 3-25(B) show the measured single-ended eye-

diagrams at the outputs of the RX CG amplifier (driven off-chip for testing purpose)

before and after enabling the TX FFE respectively. The closed eye-diagram is

[Figure 3-23 labels: TX, RX, 20 cm air-cavity]

[Figure 3-24 test setup labels: 3.125 GHz source, splitter, delay, TX balun, RX balun, CK TX, CK RX, TX OUT, RX IN, RX OUT, air-cavity channel, scope or BERT with trigger]

successfully opened by enabling the TX FFE. Figure 3-25(C) shows the eye-diagram at the output of the DFE for a $2^7-1$ PRBS pattern, with the corresponding transient waveform shown in Figure 3-25(D). The correct $2^7-1$ PRBS sequence is verified with both visual inspection and BER measurements.

Figure 3-25. Measured waveforms

Figure 3-26 shows the measured RX bathtub curves and energy-per-bit performance with different equalization settings. At 6.25 Gb/s and a BER of $10^{-12}$ with only the TX FFE enabled, the eye opening is 30% UI. Enabling the RX DFE and disabling the TX FFE improves the eye opening to 37% UI, while the overall power efficiency improves from 0.9 to 0.6 mW/(Gb/s). Enabling both FFE and DFE further improves the horizontal eye opening to 56% UI but degrades the power efficiency. When the link is operated at 6.25 Gb/s with only the DFE enabled, the TX core, the current-sharing front-end, and the DFE dissipate 1.44 mW, 1.2 mW and 1.06 mW, respectively.
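The reported energy per bit follows directly from the power breakdown. The short check below uses only the numbers quoted above; the variable names are hypothetical.

```python
# Measured power breakdown with only the DFE enabled (mW).
power_mw = {"tx_core": 1.44, "current_sharing_frontend": 1.2, "dfe": 1.06}
data_rate_gbps = 6.25

total_power_mw = sum(power_mw.values())
# 1 mW / (1 Gb/s) = 1 pJ/bit, so the ratio is already in pJ/bit.
energy_pj_per_bit = total_power_mw / data_rate_gbps
```

The total comes to 3.7 mW, and dividing by 6.25 Gb/s gives slightly under 0.6 pJ/bit, matching the DFE-only figures quoted in this chapter.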

Figure 3-26. Measured link performances. A) RX bathtub curves. B) Power efficiency.

Table 3-2 summarizes the link performance in relation to a recently published

paper. Compared to previously published results, a large portion of the TX and RX

power is decreased using the current-sharing frontend.

Table 3-2. Performance summary
                    This work                [7]
Technology          0.13 μm                  65 nm
Supply voltage      1.2 V                    1.0 V
Data rate           6.25 Gb/s                12.5 Gb/s
Front-end swing     125 mV                   100 mV
BER                 1e-12                    1e-12
Horizontal eye      56% UI @ 6.25 Gb/s       -
Power               3.7 mW                   12 mW
Energy-per-bit      0.6 pJ/bit               0.98 pJ/bit
TX/RX core area     0.03 mm² / 0.02 mm²      0.24 mm² / 0.24 mm²

3.6 Summary

The bandwidth of the channel poses difficult challenges for high-speed serial

links. At high frequencies, dielectric loss dominates over conductor loss. The design and


fabrication of the air-cavity transmission line structure is presented in this Chapter to

reduce the dielectric loss. The measured effective dielectric constant is 1.73 and the

loss is about 0.4 dB/cm.

The air-cavity transmission lines are used in an active link. The active link

features a low-power current-sharing frontend with a 1-tap speculative DFE. To further

reduce power consumption, the back termination is also removed. The active link

achieves successful 6.25 Gb/s operation and consumes 3.7 mW off a 1.2 V power

supply, demonstrating the potential of the techniques for future low-power high-speed

interconnects.


CHAPTER 4 A 4.5-Gb/s 12.4-mW RX WITH BAUD-RATE CDR

4.1 Chapter Overview

The receiver presented in Chapter 3 does not include CDR, an essential function

in high-speed receivers as discussed in Chapter 2. CDR in high-speed serial links is

usually achieved with oversampling. However, oversampling CDRs have a few issues.

One of the issues is explained in Chapter 2, which is the requirement for power-hungry

clock generation and distribution with sub-bit-time resolution. The second issue with

oversampling CDR lies in its assumption that the maximum voltage margin occurs at the

eye center [31]. When the input eye is horizontally asymmetric, locking to the eye center

may lead to sub-optimal voltage margin. The third issue with oversampling CDR is that
it tightens the already challenging settling-time requirement for the DFE [17] [46]. Because
the input signal is oversampled, the time allowed for the DFE to settle is now less than
one UI. Sampling at the eye edges may also require dedicated edge equalization, since
the edge samples experience different ISI than the data samples, as shown in Figure 4-1.
For low-power high-speed serial link design, a baud-rate CDR that circumvents
these issues is therefore of interest.

Figure 4-1. Different ISI seen by the edge and data samples


In this Chapter we present an RX with a novel digital baud-rate eye-tracking CDR

which employs an auxiliary slicer (CDR slicer) with adjustable threshold voltage. By

jointly updating the sampling phase and the threshold voltage of the CDR slicer, the

CDR loop drives the decision point of the CDR slicer to the peak of the eye opening,

and thus automatically locks to the maximum voltage-margin point. Because the CDR

slicer samples at exactly the same instant as the main data slicers, it does not interfere

with DFE operation.

We also present a majority-voting DFE architecture that replaces the selectors in

a traditional speculative DFE with majority-voters. Compared to a selector, a majority

voter is more amenable to low-power and high-speed designs because it reduces the

transistor stacking levels and features equal delay to all data inputs. A majority-voter

also eliminates the need for a level shifter in bipolar designs.

A receiver was implemented with the proposed CDR scheme and the majority-voting
DFE. Details of the RX implementation are given in this Chapter, together with
measurement results, which confirm the correct function of both techniques.

4.2 Baud-Rate CDR

A few baud-rate CDR schemes have been proposed in the past. The Mueller-

Muller CDR [47], used in several recently published serial link receivers [8] [46],

operates by adjusting the clock phase so that the sampled pulse response satisfies a

predefined timing criterion. However, this type of CDR does not necessarily ensure

maximum voltage margin of the sampled eye at lock. The CDR in [48] improves the

voltage sampling margin but is only suitable for integrating-type RX frontends. The

baud-rate CDR in [7] relies on auxiliary slicers that have a larger sampling window than

the main data slicers to keep the sampling phase away from the eye edges, but it does


not take into account the voltage margin. Another baud-rate CDR reported in [49] locks

to the maximum voltage-margin point, but requires analog slope detection circuitry and

is therefore not as amenable to technology scaling and migration as digital solutions.


Figure 4-2. CDR block diagrams. A) Alexander CDR. B) Proposed baud-rate CDR.

Figure 4-2 shows the block diagram of the Alexander CDR and the proposed

baud-rate CDR. The Alexander CDR employs two slicers, sampling half-UI away from

each other, hence 2× oversampling. The PD in an Alexander CDR only produces

information for updating the clock phase. The proposed CDR also employs two slicers

(main and CDR slicers). However, unlike the Alexander CDR, these two slicers sample

the input signal at the same time, therefore no oversampling is involved. The PD in the

proposed CDR not only controls the clock phase, but also the offset of the CDR slicer.


The algorithm of the proposed CDR is such that it drives the sampling point of the CDR

slicer to the position with maximum vertical eye opening. Since the CDR slicer and the

main slicer are triggered by the same clock phase, this automatically locks the clock

phase to the point with maximum voltage margin.

The operation principle of the proposed CDR is explained with the help of the
CDR truth table shown in Table 4-1, where an “×” denotes “don’t care”. $D_{n-1}$, $D_n$ and
$D_{n+1}$ are three consecutive outputs of the data slicer, $D_{CDR}$ is the output of the CDR
slicer sampled at the same time as $D_n$, and $V_{TH}$ is the threshold voltage of the CDR
slicer. The CDR takes action whenever $D_n$ is 1, tracking only the upper part of the eye.
The discussion below therefore considers the case $D_n = 1$ exclusively. If higher
CDR bandwidth is desired, the lower portion of the eye can also be utilized using an
additional CDR slicer.

Table 4-1. CDR truth table
$D_{n-1}$   $D_n$   $D_{n+1}$   $D_{CDR}$   Threshold update   Phase update
0           1       0           0           ↓                  --
0           1       0           1           ↑                  --
0           1       1           0           ↓                  →
0           1       1           1           ↑                  ←
1           1       0           0           ↓                  ←
1           1       0           1           ↑                  →
1           1       1           0           --                 --
1           1       1           1           --                 --
×           0       ×           ×           --                 --

Figure 4-3(A) illustrates an example eye diagram. The upper portion of the eye is

divided into five numbered regions by the different waveform trajectories corresponding

to input patterns (010), (011), (110) and (111). According to Table 4-1, the CDR updates
only when the three-bit data pattern equals (010), (011) or (110), since pattern (111)
does not contain any timing information. Assuming equal probability for pattern
occurrences, the CDR behavior is summarized in Table 4-2 and Table 4-3 and is
graphically depicted in Figure 4-3(B), where the circles indicate possible decision points
of the CDR slicer, the vertical arrows indicate the threshold updating direction, and the
horizontal arrows indicate the clock phase updating direction. By inspecting Figure

4-3(B), it can be seen that the CDR drives the CDR slicer’s decision point until it dithers

around the maximum eye-opening position (denoted by a star). Since the CDR slicer

and the DFE are clocked at the same phase, this automatically locks the DFE to the

maximum voltage-margin point.
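The convergence described above can be illustrated with a small behavioral simulation. This is a minimal sketch under loud assumptions: the eye is modeled with a hypothetical triangular pulse and ±1 signaling (so the upper-eye peak sits at phase 0 and level 1), the step sizes, starting point, and all function names are arbitrary illustrative choices, and the loop looks at the next bit directly instead of modeling the decision latency of the real hardware. Only the update rules are taken from Table 4-1.

```python
import random

def eye_level(d_prev, d_next, phase):
    """Level of a transmitted '1' sampled `phase` UI from the eye center,
    for a hypothetical triangular pulse and +/-1 signaling."""
    neighbor = d_prev if phase < 0 else d_next
    return (1 - abs(phase)) + (1 if neighbor else -1) * abs(phase)

def run_cdr(n_symbols=20000, phase=-0.3, threshold=0.2,
            phase_step=1 / 64, thr_step=0.01, seed=1):
    rng = random.Random(seed)
    bits = [rng.randint(0, 1) for _ in range(n_symbols + 2)]
    history = []
    for n in range(1, n_symbols + 1):
        d_prev, d_cur, d_next = bits[n - 1], bits[n], bits[n + 1]
        # Act only on the upper eye, and never on pattern (111).
        if d_cur == 1 and (d_prev, d_next) != (1, 1):
            d_cdr = eye_level(d_prev, d_next, phase) > threshold
            threshold += thr_step if d_cdr else -thr_step
            if d_prev != d_next:
                # (011): slicer 0 -> later, 1 -> earlier; (110) is mirrored.
                later = (not d_cdr) if d_next == 1 else d_cdr
                phase += phase_step if later else -phase_step
                phase = max(-0.5, min(0.5, phase))
        history.append((phase, threshold))
    return history

history = run_cdr()
tail = history[-2000:]
mean_phase = sum(p for p, _ in tail) / len(tail)
mean_thr = sum(t for _, t in tail) / len(tail)
print(f"mean phase = {mean_phase:+.3f} UI, mean threshold = {mean_thr:.3f}")
```

In this idealized model the decision point starts low and early, climbs until it meets the inner eye trajectory, and then walks along it until it dithers near the eye peak, which mirrors the behavior sketched in Figure 4-3(B).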

The proposed CDR has a few noteworthy advantages. First, baud-rate operation

saves clocking power by eliminating the need to generate extra clock phases for

oversampling. Second, the CDR automatically locks to the point with maximum voltage

margin without using any eye-opening monitor circuits. Third, the proposed CDR does

not constrain the frontend interface to any particular architecture. Moreover, decimation

of the CDR slicer output is easily accommodated in this CDR, whereas in some other

schemes this may be constrained because they require consecutive CDR slicer results

[46]. It should also be noted that the CDR slicer can be reused for equalization

adaptation to reduce hardware and power overhead.

Figure 4-3. Operation principle of the proposed baud-rate CDR


Table 4-2. CDR slicer threshold update
Region   (010)   (011)   (110)   (111)   Total
1        ↑       ↑       ↑       --      ↑
2        ↓       ↑       ↑       --      ↑
3        ↓       ↓       ↑       --      ↓
4        ↓       ↑       ↓       --      ↓
5        ↓       ↓       ↓       --      ↓

Table 4-3. Clock phase update
Region   (010)   (011)   (110)   (111)   Total
1        --      ←       →       --      --
2        --      ←       →       --      --
3        --      →       →       --      →
4        --      ←       ←       --      ←
5        --      →       ←       --      --
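The per-region totals in Table 4-2 and Table 4-3 follow from applying the Table 4-1 rules to each pattern and summing with equal weight. The sketch below encodes the truth table and reproduces those totals. The function and dictionary names are illustrative, the bit ordering (previous, current, next, CDR slicer output) is inferred from the table rows, and the per-region slicer outputs are read off Table 4-2 by assuming that an upward arrow corresponds to a CDR slicer output of 1.

```python
def cdr_update(d_prev, d_cur, d_next, d_cdr):
    """Return (threshold_step, phase_step) following Table 4-1.
    threshold_step: +1 raise, -1 lower, 0 hold.
    phase_step: +1 right arrow, -1 left arrow, 0 hold."""
    if d_cur != 1 or (d_prev, d_next) == (1, 1):
        return 0, 0                       # lower eye or pattern (111): no update
    thr = 1 if d_cdr else -1              # slicer output 1 -> raise threshold
    if (d_prev, d_next) == (0, 0):
        return thr, 0                     # (010): threshold update only
    if (d_prev, d_next) == (0, 1):
        return thr, (-1 if d_cdr else 1)  # (011)
    return thr, (1 if d_cdr else -1)      # (110)

# CDR slicer output seen in each eye region for each updating pattern
# (inferred from the arrows in Table 4-2).
REGION_SLICER_OUTPUT = {
    1: {"010": 1, "011": 1, "110": 1},
    2: {"010": 0, "011": 1, "110": 1},
    3: {"010": 0, "011": 0, "110": 1},
    4: {"010": 0, "011": 1, "110": 0},
    5: {"010": 0, "011": 0, "110": 0},
}

def sign(value):
    return (value > 0) - (value < 0)

def net_update(region):
    """Net update direction for equally likely patterns."""
    thr_total = phase_total = 0
    for pattern, d_cdr in REGION_SLICER_OUTPUT[region].items():
        d_prev, d_cur, d_next = (int(b) for b in pattern)
        thr, phase = cdr_update(d_prev, d_cur, d_next, d_cdr)
        thr_total += thr
        phase_total += phase
    return sign(thr_total), sign(phase_total)

for region in sorted(REGION_SLICER_OUTPUT):
    print(region, net_update(region))
```

Regions 1, 2, and 5 produce no net phase movement because the (011) and (110) updates cancel, so only the threshold moves there until the decision point enters region 3 or 4.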

4.3 Majority-Voting DFE

DFE has been used extensively in high-speed links to compensate for inter-symbol
interference (ISI) in band-limited electrical channels [12] [17] [16] due to its
noise immunity and high signaling power efficiency, as explained in Chapter 2. To relax the
stringent timing requirement, a speculative DFE architecture [19] [50] is often used. As
shown in Figure 4-4, a 1-tap speculative DFE makes two tentative decisions, $D_n^{+}$ and
$D_n^{-}$, assuming the previous bit $D_{n-1}$ is $+1$ and $-1$ respectively, and then the correct decision
is selected by $D_{n-1}$. The timing requirement for the DFE loop can be written as

$$t_{sel} + t_{cq} + t_{setup} < 1\ \mathrm{UI} \tag{4-1}$$

where $t_{sel}$ is the selector delay, and $t_{cq}$ and $t_{setup}$ are the delay and setup time of
the CML DFF.
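The speculative (loop-unrolled) operation described above can be sketched behaviorally as follows. This is only an illustration under stated assumptions: the two-tap pulse response values are hypothetical, the channel is noiseless, the initial previous bit is assumed known, and the function name is not taken from the design.

```python
import random

def speculative_dfe(samples, h1, d_init=-1):
    """1-tap speculative (loop-unrolled) DFE for +/-1 symbols."""
    decisions, d_prev = [], d_init
    for x in samples:
        d_plus = 1 if x > +h1 else -1    # tentative decision assuming previous bit = +1
        d_minus = 1 if x > -h1 else -1   # tentative decision assuming previous bit = -1
        d_cur = d_plus if d_prev == 1 else d_minus   # selector driven by the previous bit
        decisions.append(d_cur)
        d_prev = d_cur
    return decisions

H0, H1 = 1.0, 0.6                        # hypothetical pulse response [h0, h1]
rng = random.Random(0)
bits = [rng.choice((-1, 1)) for _ in range(1000)]
previous = [-1] + bits[:-1]              # bit preceding each symbol
received = [H0 * b + H1 * p for b, p in zip(bits, previous)]

decisions = speculative_dfe(received, H1, d_init=-1)
margin_no_dfe = min(abs(x) for x in received)
margin_dfe = min(abs(x - H1 * p) for x, p in zip(received, previous))
print(decisions == bits, margin_no_dfe, margin_dfe)
```

Both tentative decisions are available before the previous bit resolves, so only the selection step sits in the feedback loop. With these example coefficients the worst-case vertical margin rises from about 0.4 to about 1.0 once the post-cursor is cancelled.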


Figure 4-4. Block diagram of a 1-tap speculative DFE

From Equation 4-1, the selector and flip-flop delays in the critical timing path
determine the maximum operating speed of the 1-tap speculative DFE. While significant
work has been published on CML latches/FFs [51] [52], the following observations can
be made regarding the operation of a CML selector, which is shown in Figure 4-5. First,
because the selection of the current bit decision is made by series-connecting the
previous bit $D_{n-1}$, the CML selector employs three transistors in the stack (including the
tail current source), and is therefore not optimal for low-voltage/low-power designs. Second, to
maximize the timing margin of the critical DFE feedback loop, it is desirable to minimize
the delay from the previous bit $D_{n-1}$ to the current decision $D_n$, yet in Figure 4-5, $D_{n-1}$
experiences the largest delay among the three inputs. The third issue concerns the
common-mode level of $D_{n-1}$: since $D_{n-1}$ is supplied from a CML latch, its common-mode
level is close to VDD and this may necessitate an explicit level-shifting stage, which incurs
power and speed overhead (especially in bipolar implementations [53]).

Figure 4-5. Schematic of a CML selector


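Table 4-4. Selector truth table
$D_n^{+}$   $D_n^{-}$   $D_{n-1}$   $\overline{D}_{n-1}$   $D_n$
-1          -1          -1          +1                     -1
-1          -1          +1          -1                     -1
-1          +1          -1          +1                     +1
-1          +1          +1          -1                     -1
+1          -1          -1          +1                     -1   *
+1          -1          +1          -1                     +1   *
+1          +1          -1          +1                     +1
+1          +1          +1          -1                     +1
* Input combination that does not occur.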

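Table 4-4 shows the truth table of the CML selector in a speculative DFE, where $D_n^{+}$
and $D_n^{-}$ are the tentative decisions made assuming the previous bit $D_{n-1}$ is $+1$ and $-1$,
respectively. Note, however, that in a low-pass electrical channel with a pulse response of
[$h_0$, $h_1$], both coefficients $h_0$ and $h_1$ are positive, and thus the feedback tap weight in the
DFE is always negative. This implies that the combination $D_n^{+} = +1$ and $D_n^{-} = -1$
(the fifth and sixth rows of Table 4-4) does not occur, and inverting the
corresponding row outputs therefore does not affect the DFE function. Thus, the truth
table can be rewritten as shown in Table 4-5, and $D_n$ can be expressed as

$$D_n = \mathrm{sgn}\left(D_n^{+} + D_n^{-} + \overline{D}_{n-1}\right) \tag{4-2}$$

where $\overline{D}_{n-1} = -D_{n-1}$ is the inverted previous bit and $\mathrm{sgn}(\cdot)$ is the sign of the operand.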

Figure 4-6. Proposed majority voter schematic

Equation 4-2 can be readily implemented with a majority-voter, as shown in
Figure 4-6. Compared to the selector in Figure 4-5, the majority-voter avoids the
disadvantages mentioned previously. The number of transistors in the stack is reduced from
three to two, making the majority-voter more amenable to low-voltage designs. The
majority-voter is fully symmetric with respect to the three inputs, and as a result, the
critical delay from the previous bit $D_{n-1}$ to the current decision $D_n$ is the same as that
of the other two inputs. Moreover, no level-shifting is required for $D_{n-1}$.

Table 4-5. Majority-voter truth table
$D_n^{+}$   $D_n^{-}$   $D_{n-1}$   $\overline{D}_{n-1}$   $D_n$
-1          -1          -1          +1                     -1
-1          -1          +1          -1                     -1
-1          +1          -1          +1                     +1
-1          +1          +1          -1                     -1
+1          -1          -1          +1                     +1
+1          -1          +1          -1                     -1
+1          +1          -1          +1                     +1
+1          +1          +1          -1                     +1
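The equivalence between the selector and the majority-voter on all reachable inputs can be checked exhaustively. The sketch below assumes the column interpretation used for the truth tables above (two tentative decisions, the previous bit, and its inverse); the function names are illustrative.

```python
from itertools import product

def selector(d_plus, d_minus, d_prev):
    """Conventional speculative-DFE selection."""
    return d_plus if d_prev == 1 else d_minus

def majority_voter(d_plus, d_minus, d_prev):
    """Sign of the sum of both tentative decisions and the inverted previous bit."""
    total = d_plus + d_minus + (-d_prev)
    return 1 if total > 0 else -1

mismatches = []
for d_plus, d_minus, d_prev in product((-1, 1), repeat=3):
    if selector(d_plus, d_minus, d_prev) != majority_voter(d_plus, d_minus, d_prev):
        mismatches.append((d_plus, d_minus, d_prev))

# The two circuits differ only where the tentative decisions are (+1, -1),
# a combination that cannot occur when the post-cursor tap is positive.
print(mismatches)
```

The two mismatching rows are exactly the unreachable ones, which is why replacing the selector with a majority-voter leaves the DFE function unchanged.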

Figure 4-7(A) compares the simulated $D_{n-1}$-to-$D_n$ delay for a selector and a
majority-voter as a function of the input transistors’ current density. For comparison, the
input transistors are of the same size, the single-ended input swing is 300 mV, the
fan-out is assumed to be two, and the supply is set to 1.2 V. The load resistors are adjusted
so that both the selector and the majority-voter have a small-signal gain of one. The
delay of both the selector and the majority-voter decreases with larger current densities and
higher transistor $f_T$, and saturates as $f_T$ reaches its maximum. For equal current
densities, the majority-voter exhibits ~50% less delay.

Figure 4-7(B) shows the overall DFE loop delay using the proposed majority

voter and the traditional selector. In this comparison, the latches in the DFF are biased

with equal current density in both cases. The majority voter based DFE shows >10%

improvement in delay over a wide range of current densities. Further improvement can

be achieved by increasing the current-density bias point and speed of the CML DFFs.


Figure 4-7. Simulated delay. A) Selector and majority-voter. B) Overall DFE loop.

Figure 4-8(A) shows the selector and the majority-voter delay as a function of

bias current. Although the majority-voter has three static tail current paths compared to

the single current bias leg of the selector, the overall current consumption to achieve the

same delay is comparable. This is due to the fact that the majority-voter requires a

lower current density than the selector to achieve the same speed. That is, the majority-

voter has a lower effort delay [54], and thus it exhibits higher power efficiency. This can

be related to the majority-voter having one transistor less in the stack, which also

enables operation at lower supply voltages as shown in Figure 4-8(B). A comparison of

the selector and majority-voter delay normalized to their respective delays at the


nominal supply voltage of 1.2V shows that 1) the majority voter is significantly less

sensitive to supply voltage variation and 2) it can operate at a lower supply voltage. For

instance, the selector delay quickly degrades below 0.8V while the majority-voter

exhibits a more gradual degradation below 0.6 V.

Figure 4-8. Simulated selector and majority-voter performances. A) Delay vs. total bias current. B) Normalized delay variation with supply voltage (VDD) for a current density of 100 µA/µm.

4.4 Chip Implementation

4.4.1 Architecture

Figure 4-9 shows the block diagram of the RX core. The input data is sampled by

a half-rate 1-tap speculative DFE and a CDR slicer. The DFE output is then de-

multiplexed by 8, whereas the CDR slicer output is decimated by 8. A CDR logic block


processes the output of the DFE and the CDR slicer according to the CDR algorithm

described above, and updates both the threshold of the CDR slicer with a 6-b DAC and

the clock phase with a phase interpolator (PI). The I/Q inputs to the PI are generated by

dividing down a full-rate external clock.

To minimize power, the RX employs high-speed CML circuits only in the first two

stages and static CMOS logic for the later stages, as shown in Figure 4-9. In addition,

the data output of the CDR slicer is decimated by 8 instead of being fully de-multiplexed.
Although this decimation reduces the CDR bandwidth, experimental results
reported in the following sections confirm that the CDR bandwidth is sufficiently large for

plesiochronous chip-to-chip interconnects. All blocks are built with custom layout except

the CDR logic block which is synthesized with standard cells.

Figure 4-9. Block diagram of the RX


4.4.2 Slicer

The slicer is implemented as a CML latch with digital offset control, as shown in

Figure 4-10, where all transistors without length annotation are of minimum channel

length. During pre-amplification mode, a current is injected to the output nodes to

introduce a desired offset. To reduce power supply noise, the offset-injection current is

kept active even when the slicer is in regeneration mode. Both the polarity and

magnitude of the injected current are controlled through the serial interface. An

important design parameter of the slicer is the offset tuning range, which must be large

enough to override the intrinsic slicer offset while generating the desired DFE tap

weight. Figure 4-11(A) shows the simulated offset of the slicer, while the simulated

offset tuning characteristic of the slicer is shown in Figure 4-11(B) when the sign of

offset is set to 1. The slicer offset is 34 mV, and the offset tuning range is ±220 mV.

With 6-b digital control, this gives a maximum DFE tap weight of nearly 200 mV with a

nominal step of 3 mV.

Figure 4-10. Schematic of the slicer with threshold control


Figure 4-11. Simulated slicer performances. A) Slicer offset. B) Offset tuning.

4.4.3 DMUX

The DMUX is constructed by cascading 1:2 DMUX cells. Figure 4-12 illustrates

the schematics of the latch-based CML and CMOS 1:2 DMUX cells, together with their

transistor-level details. The CML latch has the same topology as the slicer, except that it

does not have the offset adjustment. Also note that the bias current and the transistor

sizes are reduced by 50% since offset is not critical. The CMOS latches are

implemented as sense-amplifier flip-flops (SAFFs).


Figure 4-12. Schematics of the CML and CMOS DMUX cells

4.4.4 Clocking

The clocking circuitry generates clocks for the DFE and the DMUX. A full-rate

external clock is first divided down by a CML divider to obtain I/Q clocks, as shown in

Figure 4-13. Since phase inversion is simply swapping the differential signal polarity, I

and IB are obtained simultaneously. The same is true for Q and QB.

Figure 4-13. Schematic of the divider for I/Q generation

A phase interpolator (PI) combines the I/Q clocks with digitally-controlled weights

to adjust the receiver sampling phase. The principle of PI is depicted in Figure 4-14.

Phase interpolation is achieved by combining the I/Q clock phases with different


weightings. Figure 4-15 shows the schematic of the PI, which consists of four differential

pairs. Phase tuning is achieved by adjusting the tail currents of the four differential pairs.

To guarantee monotonicity, the tail current in each differential pair is split into eight

identical current sources, and the binary phase control word PI[5:0] is converted to

thermometer code W[0:31] to control the 32 current sources. With this half-rate
architecture, one clock period spans 2 UI, so the phase resolution of the PI is
2/32 = 1/16 UI.
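The interpolation principle can be illustrated with an idealized model in which the I/Q clocks are pure sinusoids and the weights move linearly, eight unit steps per quadrant. This is a sketch under those assumptions, not a model of the actual circuit: the mapping of the thermometer code onto the four differential pairs and all names below are hypothetical.

```python
import math

STEPS_PER_QUADRANT = 8             # eight unit current sources per differential pair
N_CODES = 4 * STEPS_PER_QUADRANT   # 32 phase steps per clock period
CLOCK_PERIOD_UI = 2.0              # half-rate clock: one period spans two UI

def pi_phase_deg(code):
    """Output phase of an ideal I/Q interpolator with sinusoidal inputs."""
    quadrant, k = divmod(code % N_CODES, STEPS_PER_QUADRANT)
    w_current = STEPS_PER_QUADRANT - k   # units steered to the current quadrant phase
    w_next = k                           # units steered to the phase 90 degrees later
    return 90.0 * quadrant + math.degrees(math.atan2(w_next, w_current))

ideal_lsb_deg = 360.0 / N_CODES
lsb_ui = CLOCK_PERIOD_UI / N_CODES

edges = [pi_phase_deg(c) for c in range(N_CODES)] + [360.0]
dnl = [(edges[c + 1] - edges[c]) / ideal_lsb_deg - 1.0 for c in range(N_CODES)]

print(f"LSB = {lsb_ui} UI")
print(f"DNL range = {min(dnl):+.2f} .. {max(dnl):+.2f} LSB")
```

Even in this ideal case the step size is not uniform: the steps are smallest near the quadrant boundaries and largest near 45 degrees, and the pattern repeats every eight codes. Because weight is only ever moved in one direction between adjacent phases, the transfer curve stays monotonic.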


Figure 4-14. Principle of PI

Figure 4-15. Schematic of the phase interpolator


The output of the PI is further divided down to clock the DMUX. Figure 4-16

shows the level-converter schematic used to convert CML logic levels to full-swing

CMOS for clocking the SAFFs in the last two DMUX stages. The CML clock is AC-coupled
to inverters with resistive feedback. The feedback resistor and coupling

capacitor values are chosen so that the lower cut-off frequency is well below the target

clock frequency.

Figure 4-16. Level-converter schematic.

4.5 Experimental Results

The receiver chip was implemented in 0.13-μm bulk CMOS technology, mounted
in a QFN package and assembled on an FR4 test board. Figure 4-17 shows the die
micrograph along with the test board picture. The receiver core occupies an area of
0.14 mm².


Figure 4-17. Die micrograph and board picture


Figure 4-18 depicts the measurement setup. A PRBS generator and a 20-inch
differential microstrip FR4 channel were used to validate the receiver. The PRBS
generator and the RX were clocked by two different RF sources. When evaluating the
DFE, the two RF sources were synchronized and the RX CDR was disabled; when the
CDR loop was enabled, they ran independently. Phase modulation (PM) was
added for the jitter tolerance measurement. The recovered data was monitored using a
BERT and a high-speed sampling oscilloscope. Measurements were performed up to
4.5 Gb/s with a $2^7-1$ PRBS pattern, limited at higher data rates by equipment capability.

Figure 4-18. Test setup

Figure 4-19 shows the measured channel insertion loss and the resulting eye

diagram at 4.5 Gb/s, showing complete eye closure due to severe ISI. The loss at

Nyquist frequency is 22 dB. The measured bathtub curves at different DFE settings are shown
in Figure 4-20, which were obtained by sweeping the PI control code while monitoring

the receiver BER. Without DFE, error-free operation was not possible. The eye opening

enlarges with increasing DFE settings, and decreases due to over-equalization after

reaching the maximum eye-opening. The peak eye-opening is 0.5 UI.

Figure 4-21(A) shows the measured PI linearity. The minimum DNL of -0.64 LSB

indicates monotonic operation, as guaranteed by the thermometer coding. The

maximum DNL is 1.5 LSB, giving a maximum phase step of 0.09 (=1.5/16) UI. The


repetitive DNL and INL patterns are due to the use of a simple I/Q interpolation scheme

[55].


Figure 4-19. Measured 20” channel performances. A) Loss. B) Eye diagram.

The CDR function was evaluated by setting the frequency of the PRBS generator

slightly different from the RX clock source. The CDR lock range was measured to be

±100 ppm, confirming plesiochronous operation even though the CDR bandwidth is low

due to decimation. The histogram of the recovered clock at the limit of the lock range is
shown in Figure 4-21(B). The RMS jitter is 13 ps. The jitter is relatively high because

the clock output buffer chain shares the same power domain with the noisy digital

circuitry.
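To put the ±100 ppm lock range in perspective, the arithmetic below converts the frequency offset into a phase drift and into the rate at which the PI must be stepped to follow it. It is a back-of-the-envelope sketch only: it uses the 1/16 UI phase step implied by the "1.5/16 UI" figure quoted earlier for the PI linearity, ignores loop latency and filtering, and the variable names are illustrative.

```python
DATA_RATE_BPS = 4.5e9      # measured data rate
OFFSET_PPM = 100.0         # edge of the measured lock range
PI_LSB_UI = 1.0 / 16.0     # nominal PI phase step

freq_offset_hz = DATA_RATE_BPS * OFFSET_PPM * 1e-6
drift_ui_per_ui = OFFSET_PPM * 1e-6            # phase slip accumulated per bit
ui_per_lsb = PI_LSB_UI / drift_ui_per_ui       # bits between required PI steps
time_per_lsb_ns = ui_per_lsb / DATA_RATE_BPS * 1e9

print(f"frequency offset = {freq_offset_hz / 1e3:.0f} kHz")
print(f"one PI step every {ui_per_lsb:.0f} UI ({time_per_lsb_ns:.0f} ns)")
```

At the edge of the lock range the loop therefore needs a net phase step only about once every 625 UI, which is consistent with the decimated CDR path still being able to track a plesiochronous offset.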

Figure 4-20. Measured DFE performances. A) Bathtub curves. B) Eye openings.

Figure 4-21. CDR measurement results. A) PI linearity. B) Recovered clock.

Jitter tolerance of the CDR was measured by phase modulating the clock of the
PRBS generator and recording the modulation depth at which bit errors occurred. The
measured jitter tolerance is shown in Figure 4-22. Below a jitter frequency of 30 kHz, the
jitter tolerance is larger than 1 UI.


Figure 4-22. Measured CDR jitter tolerance

The RX core consumes 12.4 mW from a 1.2 V supply, which translates to an
FoM of about 2.8 pJ/bit at 4.5 Gb/s. Table 4-6 shows the performance summary.

Table 4-6. Performance summary
Input Data Rate     4.5 Gb/s
De-multiplexing     1:16
Equalization        1-tap speculative DFE
Clock Recovery      Baud-rate eye-tracking
Power Supply        1.2 V
Power               12.4 mW
Process             0.13 μm CMOS
Area                360 μm × 400 μm
FoM                 2.8 pJ/bit

4.6 Summary

Traditional oversampling CDR involves a few design issues, including the
requirement of power-hungry generation and distribution of clocks with sub-bit-time
resolution, the stringent constraint on the settling time of the DFE, and the possibility of
sub-optimal equalization of edge samples. It also locks to the center of the eye regardless of
the specific eye shape, potentially leading to degraded voltage margins. Various
baud-rate CDRs have been proposed over the years. However, they either do not take into
account the voltage margin, still require sampling at instants other than the data
sampling instants, entail analog circuitry for slope detection, or are only suitable for
integrating-type frontends.

In this Chapter, we propose a novel digital baud-rate eye-tracking CDR scheme

that obviates the above disadvantages. It employs a CDR slicer in parallel with the main

slicers, and the CDR algorithm controls both the clock phase and the threshold voltage

of the CDR slicer to drive the decision point of the CDR slicer to the peak of the eye

opening. Since the CDR slicer shares the same clock phase as the main slicer, this

automatically locks the RX to the point with the maximum eye-opening.

A majority-voting DFE architecture is also presented in this Chapter wherein the

selectors in a speculative DFE are replaced with majority-voters. The majority-voter has

one less level of transistors in the stack, and is therefore more amenable to low-power

and high-speed designs compared to a selector. It also reduces the DFE loop delay due

to its structural simplicity. Furthermore, the majority-voting DFE obviates the need for a

level shifter in bipolar designs.

Experimental results confirm the effectiveness of the proposed CDR scheme and

the majority-voting DFE. Implemented in 0.13-μm CMOS, the RX works reliably at 4.5

Gb/s while consuming 12.4 mW. Higher data rate is limited by the measurement

equipment. The CDR displays a lock range of ±100 ppm, and the DFE is able to

equalize a channel with 22 dB Nyquist loss while producing a 50% UI equalized eye-

opening.


CHAPTER 5 A 5-Gb/s 0.75-pJ/BIT VOLTAGE-MODE TRANSCEIVER

5.1 Chapter Overview

Chapter 3 and Chapter 4 apply some of the results from Chapter 2 to improve the

link power efficiency on the architecture level, namely the removal of back termination,

the channel loss reduction with air-cavity transmission lines, and the use of DFE and

baud-rate CDR. A few circuit techniques are also resorted to in Chapters 3 and 4, such

as the current-sharing frontend and the majority-voting speculative DFE. The 6.25-Gb/s

transceiver in Chapter 3 achieves 0.6-pJ/bit power efficiency without CDR, whereas the

4.5-Gb/s RX in Chapter 4 achieves 2.8-pJ/bit including CDR and clocking circuitry.

Based upon these results, this Chapter attempts to build a complete transceiver with

better power efficiency in the same technology. To attain this goal, the transceiver

employs a combination of architectural improvements and circuit techniques.

One major improvement is the signaling mode. The transceiver uses voltage-

mode signaling with differential termination in place of the current-mode signaling used

in the air-cavity active link in Chapter 3. According to Chapter 2, this reduces the

signaling power by 75%.

The other major improvement is the exclusive use of static CMOS logic gates

instead of the CML logic gates in Chapters 3 and 4. This avoids the static current

consumption of the CML gates since the CMOS gates only consume power during state

transitions. To further improve the power efficiency, the RX operates from a 1-V power

supply, instead of the nominal 1.2-V power supply. To cope with the resulting speed
degradation of the gates, the slicers are heavily parallelized and a look-ahead selection tree
is used in the DFE. Heavy parallelism in the frontend also saves power by eliminating

the need for an explicit DMUX.

The RX in this Chapter uses the same baud-rate CDR algorithm as presented in

Chapter 4. However, further decimation is applied to reduce the power consumption. An

injection-locked ring oscillator is used for clock generation to avoid the power overhead

of a PLL or DLL. In place of the PI for phase rotation in Chapter 4, a delay line is used

to adjust the injection clock phase so that the RX clock phases can be moved

simultaneously.

The result is a complete 5-Gb/s transceiver in 0.13-µm bulk CMOS process with

3.7-mW power consumption. This translates to a power efficiency of 0.75-pJ/bit, which

is among the best reported to date.

5.2 TX Implementation

5.2.1 TX Architecture

Figure 5-1 shows the TX block diagram. A full-swing restorer (FSR) converts the

output from a CML PRBS generator (reused from a previous design) to full swing

CMOS logic levels. A tapered inverter chain acts as a pre-driver between the FSR and

the VM driver. To preserve high speed, the fan-out of the predriver is designed to be

two. An on-chip LDO generates the supply VDRV for the VM driver from the un-regulated

chip supply.


Figure 5-1. TX block diagram

5.2.2 PRBS Generator

Figure 5-2 shows the block diagram of the PRBS generator. It consists of a clock

buffer, a PRBS core, a buffer, and an all-zero detector. This PRBS generator is reused

from a previous design, and all the buffers and gates are implemented in fully-

differential CML although the drawing is single-ended for simplicity.

The PRBS core is a linear feedback shift register (LFSR) comprised of 14 D
latches clocked at 2.5 GHz. The linear feedback through the XOR gates implements the
polynomial $x^7+x^6+1$ to generate a $2^7-1$ maximum-length sequence. A half-rate
architecture is chosen for easier clock distribution [56]. The two 2.5-Gb/s PRBS streams
with proper phase shift are multiplexed to obtain the 5-Gb/s PRBS.
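The sequence produced by this polynomial can be reproduced with a simple bit-serial model. The sketch below is a full-rate software LFSR, not the half-rate latch-based circuit described above; the function names and seed are illustrative choices.

```python
def lfsr_step(state):
    """Advance a 7-bit Fibonacci LFSR for x^7 + x^6 + 1 by one bit."""
    feedback = ((state >> 6) ^ (state >> 5)) & 1   # XOR of stages 7 and 6
    return ((state << 1) | feedback) & 0x7F

def prbs7(seed, n_bits):
    """Return n_bits of the serial output for a given 7-bit seed."""
    state, out = seed & 0x7F, []
    for _ in range(n_bits):
        out.append((state >> 6) & 1)   # serial output taken from stage 7
        state = lfsr_step(state)
    return out

def period(seed):
    """Number of steps until the LFSR state returns to the seed."""
    start = seed & 0x7F
    state, n = lfsr_step(start), 1
    while state != start:
        state, n = lfsr_step(state), n + 1
    return n

sequence = prbs7(0x7F, 127)
print(period(0x7F), sum(sequence))   # sequence length and number of ones
print(period(0x00))                  # the all-zero state maps onto itself
```

Any non-zero seed cycles through all 127 non-zero states, while a zero seed never leaves the all-zero state, which is the lock-up condition the detector in this section guards against.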

Figure 5-2. PRBS block diagram


One well-known design issue in PRBS generators is the all-zero state of the LFSR,
which will circulate indefinitely once the LFSR falls into it. To prevent this from
happening, [57] and [58] use a reset signal to manually insert a one into the LFSR. This
solution will not work if the LFSR accidentally falls into the all-zero state during normal
operation (for instance due to a power supply disturbance). A better solution is to monitor
the LFSR and automatically reset it if such an all-zero state is detected. [59] uses logic
gates to detect the all-zero state, which is complex and timing-critical. [60] instead
detects the average DC level of the LFSR outputs. Although this solution is not
timing-critical, it still needs additional routing for all the LFSR outputs and thereby incurs extra
loading and complicates the layout.

Note, however, that it is not necessary to monitor all the LFSR outputs to detect the all-zero state. Instead, monitoring the final generator output suffices. This avoids the loading and layout complication. Figure 5-3 shows the all-zero detector used in this work. The RC filter has a cut-off frequency of 2 MHz and filters out the high-frequency component. Since a PRBS is nearly DC balanced, P and N should have nearly the same DC voltages. When the LFSR falls into the all-zero state, however, P will have a lower DC voltage than N, and the comparator senses this condition and resets the LFSR. Figure 5-4 shows the schematic of the self-biased comparator. For robust operation, the comparator has a built-in offset of roughly 60 mV so that it will not activate reset during normal operation. Figure 5-5 shows the simulated waveforms of the all-zero detector. At start-up, the PRBS is stuck at the all-zero state. The detector senses this state and inserts ones into the LFSR so that the proper PRBS pattern can be initiated.
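As a rough numerical check of the detector, the sketch below computes the RC corner frequency from the 42 kΩ and 1.7 pF values annotated in Figure 5-3, and compares the filtered DC levels of P and N against the 60-mV comparator offset. The 0.3-V output swing and the function names are assumptions introduced only for illustration; the actual CML swing is not stated in this section.

```python
import math

R_OHMS = 42e3        # resistor value annotated in Figure 5-3
C_FARADS = 1.7e-12   # capacitor value annotated in Figure 5-3

def rc_cutoff_hz(r_ohms, c_farads):
    """-3 dB corner of a first-order RC low-pass filter."""
    return 1.0 / (2.0 * math.pi * r_ohms * c_farads)

def needs_reset(ones_fraction, swing_v, offset_v=0.06):
    """True when the filtered N level exceeds the filtered P level by the offset.

    ones_fraction is the fraction of ones at the generator output;
    swing_v is an assumed (hypothetical) single-ended output swing.
    """
    p_dc = ones_fraction * swing_v
    n_dc = (1.0 - ones_fraction) * swing_v
    return (n_dc - p_dc) > offset_v

f_c = rc_cutoff_hz(R_OHMS, C_FARADS)          # roughly 2 MHz
normal = needs_reset(64 / 127, swing_v=0.3)   # balanced PRBS: no reset
stuck = needs_reset(0.0, swing_v=0.3)         # all-zero state: reset
```

With a balanced pattern the two filtered levels differ by only a few millivolts, far below the built-in offset, which is why the offset keeps the reset inactive during normal operation.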

Figure 5-3. All-zero detector

[Figure 5-3 annotations: PRBS core, Reset, 42 kΩ, 1.7 pF, nodes P and N]

Figure 5-4. Schematic of the self-biased comparator with offset

[Figure 5-4 annotations: 0.2 µm/5 µm, 12 µm, 3 µm, 10 µm/0.5 µm, 10 µm/0.5 µm]

Figure 5-5. Simulated waveforms confirming the function of the all-zero detector

[Figure 5-5 annotations: all-zero, reset, normal operation]

5.2.3 LDO

The LDO powers the TX driver for better supply noise rejection and also provides a convenient means of adjusting the TX output swing. For a single-ended output swing of 100 mV, the driver current consumption is 1 mA with differential RX termination. With a width of , the pass element is large enough to source 10 mA to support larger swings in measurement. The error amplifier is a simple two-stage opamp. The dominant pole is located at the VDRV node due to the large decoupling capacitor. Figure 5-6 shows the stability simulation results. The phase margin is 72 degrees.

(A)

(B)

Figure 5-6. Stability of the LDO. A) Gain (dB) versus frequency (Hz). B) Phase (degree) versus frequency (Hz).

5.2.4 TX Driver

Since the targeted TX swing is less than 100 mV, the TX employs an N-over-N VM driver [6] [39], as shown in Figure 5-1. Exclusive use of NMOS in the driver reduces the input capacitance and therefore the predriver power consumption compared to an inverter driver [61]. The transistors are sized for 50-Ω Ron for proper channel back termination. Note that the top NMOS is sized slightly larger than the bottom one since it sees less overdrive voltage.

5.3 RX Implementation

5.3.1 RX Architecture

Figure 5-7 depicts the receiver block diagram with differential termination. Because the TX output has a low common-mode voltage, the input signals VP and VN are first shifted up to enable NMOS transistors at the input of the slicers. Equalization is done with a 1-tap speculative DFE for its high signaling efficiency compared to TX FFE [16]. A bank of 32 slicers performs digitization and direct 1:16 de-multiplexing. Two additional CDR slicers facilitate timing recovery. The slicer bank's 34 output bits are synchronized, and 17 of them are selected to accomplish DFE. The ILRO, locked to a 312.5-MHz external source, generates 16 clock phases CK[0:15] for the slicer bank. The CDR logic extracts timing information from the 17 bits and adjusts the phase of the injection clock to track the maximum eye opening.

Figure 5-7. RX block diagram

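[Figure 5-7 labels: 5 Gb/s input VP, VN; level shifters with shifts VLS − VDFE and VLS + VDFE; slicer inputs VP − VN + 2VDFE and VP − VN − 2VDFE; VCM; CDR slicers; SYNC; DFE selection tree; outputs Q[0], Q[7], Q[8], Q[15], Q*[8]; CDR logic; delay; ILRO with 312.5 MHz input and CK[0:15] outputs]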

5.3.2 Slicer Design

The most important design goals of the slicer include power, speed and

sensitivity. The slicers are implemented as SAFFs to avoid static power consumption,

as shown in Figure 5-8. With 16-way interleaving, the speed requirement on the slicer is much relaxed, leaving sensitivity as the focus of design optimization.

One factor that impacts the slicer sensitivity is transistor mismatch. To reduce the

input capacitance and power consumption, the slicers are sized to near minimum. As a

result, the simulated 1-σ slicer offset is 38 mV. To improve RX sensitivity, all the slicers

have 8-b offset trimming. The trimming range is designed to be ±160 mV, yielding a

trimming resolution of 1.25 mV.
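The trimming numbers above can be checked with a short calculation. The sketch also estimates how many slicers fall inside the trimming range if the offset is assumed to be Gaussian with the simulated 38-mV standard deviation; the Gaussian assumption and the function names are illustrative and not taken from the text.

```python
import math

def trim_lsb_mv(range_mv, bits):
    """Step size of a symmetric +/-range_mv trim DAC with 2**bits codes."""
    return 2.0 * range_mv / (2 ** bits)

def coverage(range_mv, sigma_mv):
    """Fraction of a zero-mean Gaussian offset population within +/-range_mv."""
    return math.erf(range_mv / (sigma_mv * math.sqrt(2.0)))

lsb = trim_lsb_mv(160.0, 8)        # 320 mV spread over 256 codes
sigmas = 160.0 / 38.0              # trimming range in units of sigma
covered = coverage(160.0, 38.0)    # fraction of slicers that can be trimmed
```

Under this assumption the ±160-mV range spans a little over four standard deviations, so nearly all of the 34 slicers on a chip can be trimmed.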

Figure 5-8. Schematic of the slicer

Another factor that impacts the slicer sensitivity is hysteresis, including the

hysteresis due to incomplete resetting of the SA core, and the hysteresis due to the

imbalanced input capacitances of the RS latch that follows the SA core [62]. With heavy

front-end parallelism, the SA core has enough time to completely reset and no

hysteresis is observed due to the SA core. To remove the hysteresis due to the

imbalanced RS latch input capacitance, a buffer stage is inserted between the SA core

and the RS latch, as shown in Figure 5-8. Simulation indicates that without this buffer stage, the slicer has a hysteresis of 30 mV, whereas inserting the buffer stage makes the hysteresis negligible.

5.3.3 Level Shifting and DFE Tap Generation

The slicers use NMOS input transistors for faster operation. However, the RX input has a common-mode level close to ground due to the use of VM signaling. A level shifter is therefore required before the slicers to shift up the input signals by VLS.

Level-shifting can be accomplished with an AC-coupling capacitor [63] or a

common-gate (CG) amplifier [64], as shown in Figure 5-9(A) and (B). AC-coupling does

not consume power but cuts off the low frequency component of the input signal. On the

other hand, a CG amplifier provides DC coverage but dissipates excessive power due

to the stringent bandwidth requirement. This is especially true when driving the large input capacitance of the heavily parallelized slicer bank. Figure 5-9(C) shows the basic idea of the proposed level shifter, which combines the advantages of both: a capacitor provides a high-frequency signal path while a source follower enables DC coverage.

(A) (B) (C)

Figure 5-9. Level shifters. A) Capacitor-based. B) CG-amp-based. C) Proposed.

Figure 5-10 shows the detailed schematic of the level shifter. The AC-coupling capacitor is implemented as an NMOS transistor with source and drain shorted to the input. The shifting voltage is adjusted by tuning VB. To control the low-frequency gain, the source follower is broken into 4 identical segments, with the input of each segment switchable between the input and the common-mode voltage by GAIN[3:0]. When all four inputs are switched to the common-mode voltage (GAIN=0), the DC path of the level shifter is disabled. Figure 5-11 shows the simulated frequency response of the level shifter at different gain settings. When the DC path is disabled, the level shifter has a low cut-off frequency of 3 MHz. Because of its much relaxed bandwidth requirement, the source follower consumes negligible (<10 μW) power.
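The interplay of the two paths can be illustrated with a first-order model. The sketch assumes an ideal coupling capacitor feeding the output node in parallel with a source follower of unity gain and finite output resistance, with no load capacitance, which gives H = (a + jf/fc)/(1 + jf/fc), where a is the fraction of follower segments connected to the input. This transfer function, the unity follower gain, and all names are assumptions made for illustration; only the 3-MHz corner and the four segments come from the text.

```python
import math

F_C_HZ = 3e6  # low cut-off frequency quoted for GAIN=0

def level_shifter_gain_db(freq_hz, segments_on, total_segments=4, follower_gain=1.0):
    """Magnitude in dB of the assumed first-order two-path model."""
    a_dc = follower_gain * segments_on / total_segments  # DC-path gain
    x = freq_hz / F_C_HZ
    h = complex(a_dc, x) / complex(1.0, x)               # AC path plus DC path
    mag = abs(h)
    return -math.inf if mag == 0 else 20.0 * math.log10(mag)

corner_gain0 = level_shifter_gain_db(3e6, 0)   # DC path disabled, at the corner
low_gain4 = level_shifter_gain_db(1e4, 4)      # all segments on, low frequency
low_gain2 = level_shifter_gain_db(1e4, 2)      # half of the segments on
high_any = level_shifter_gain_db(1e10, 1)      # capacitor path dominates
```

In this simplified picture the GAIN setting only changes the low-frequency plateau, while the high-frequency gain is set by the capacitor path, which matches the qualitative shape described for Figure 5-11.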

Figure 5-10. Detailed schematic of the level shifter

Figure 5-11. Simulated frequency response of the level shifter at different gain settings

The level shifters also provide a convenient means of generating the DFE tap.

This is achieved by introducing an offset in the shifting voltages of VP and VN, as

shown in Figure 5-7. Although it is possible to embed the DFE tap into the slicer offset, doing so would require too large a slicer trimming range when the input swing is high.

[Figure 5-10 labels: VB, VIN, VCM, GAIN[3:0], to slicers. Figure 5-11 axes: gain (dB), −60 to 0, versus frequency (Hz), 1.0E+04 to 1.0E+10; curves for Gain=0 through Gain=4.]

5.3.4 DFE with Look-Ahead Selection Tree

The slicer bank is implemented using a 16-way parallel architecture to relax the speed requirement and avoid the added power consumption of an explicit de-multiplexer. A critical issue in the speculative DFE is the stringent timing constraint, which occurs when decisions are selected based on previously received bits. For a straightforward implementation of the DFE selection tree shown in Figure 5-13(A), the previous bits must ripple through all 16 selectors under worst-case conditions, and the resulting timing constraint is

$$t_{CK\text{-}Q} + 16\,t_{SEL} + t_{SU} < 16\,T_b$$

where $t_{CK\text{-}Q}$ and $t_{SU}$ are the delay and set-up times of the D flip-flop, $t_{SEL}$ is the selector delay, and $T_b$ is the bit time. Figure 5-12 shows the simulated $t_{SEL}$ as a function of VDD before layout extraction. At 1.0 V, the delay is about 120 ps. Considering the parasitics due to wiring, such a delay is marginal for 5 Gb/s operation ($T_b = 200$ ps).

Figure 5-12. Simulated pre-layout selector delay vs. power supply (delay in ps versus VDD in V, 0.6 V to 1.2 V)

This work uses a look-ahead selection tree to expedite the selection process. Two possible sets of decisions for Q[8:15] are pre-computed and then selected, as shown in Figure 5-13(B). The timing constraint now becomes

$$t_{CK\text{-}Q} + 9\,t_{SEL} + t_{SU} < 16\,T_b$$

which is relaxed by nearly 50% compared to the straightforward implementation.
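A behavioral sketch can make the look-ahead idea concrete. Each slicer pair is modeled as two speculative decisions per bit, one for each possible value of the previous bit; the function names, the list-based data layout, and the selector-count model are assumptions made for illustration rather than a description of the actual logic. The sketch checks that the look-ahead tree resolves to the same bits as the serial ripple while shortening the chain of selectors on the critical path.

```python
import random

def ripple(prev, d0, d1):
    """Resolve speculative decisions serially, one selector per bit.

    d0[i] / d1[i] are the decisions for bit i assuming the previous bit was 0 / 1.
    """
    out = []
    for lo, hi in zip(d0, d1):
        prev = hi if prev else lo
        out.append(prev)
    return out

def look_ahead(prev, d0, d1, split=8):
    """Pre-compute both outcomes for the upper half, then pick one."""
    first = ripple(prev, d0[:split], d1[:split])
    cand0 = ripple(0, d0[split:], d1[split:])   # assumes the bit before the split is 0
    cand1 = ripple(1, d0[split:], d1[split:])   # assumes the bit before the split is 1
    second = cand1 if first[-1] else cand0      # one extra selector level
    return first + second

def critical_path_selectors(n=16, split=None):
    """Selectors in series on the longest path (simple counting model)."""
    return n if split is None else max(split, n - split) + 1

rng = random.Random(0)
d0 = [rng.randint(0, 1) for _ in range(16)]
d1 = [rng.randint(0, 1) for _ in range(16)]
same = look_ahead(1, d0, d1) == ripple(1, d0, d1)
```

Because the two candidate sets for the upper half are resolved in parallel with the lower half, the longest path in this model drops from 16 selectors to 9.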

(A) (B)

Figure 5-13. DFE selection tree. A) Conventional. B) Look-ahead.

5.3.5 Decimated Baud-Rate CDR

The RX employs the same baud-rate CDR scheme as that in Chapter 4 to reduce the clocking power compared to Alexander-type CDRs [64]. Monitoring all of CK[0:15] would require 32 more slicers, leading to considerable power and area overhead. To further reduce power consumption, only CK[8] is monitored in this work. This reduces the number of CDR slicers by more than 90%, from 32 to 2. Although this decimation reduces the CDR bandwidth, it is generally acceptable for mesochronous chip-to-chip links [64]. Note that because of the heavy parallelism, the reduction in input capacitance and area is more pronounced compared to the decimation in [64].

5.4 Injection-Locking-Based Clock Generation

5.4.1 Clock Generation Overview

Despite a 50% reduction in the number of clock phases by the baud-rate CDR, generating the required 16 phases for the slicer bank is still non-trivial. Injection-locking-based clock generation is chosen in place of PLL- or DLL-based schemes for its low power and superior jitter performance. Figure 5-14 shows the block diagram of the clock generation circuitry. At the core lie two cascaded (master and slave) low-power injection-locked ring oscillators (ILROs). Both ILROs are digitally trimmed to ensure reliable locking. The slave ILRO helps correct the master ILRO's phase mismatch and duty-cycle distortion due to injection locking [65]. A bank of current-starved delay lines facilitates further phase calibration.

Phase tuning of an ILRO is usually done by adjusting its free-run frequency [66] [22] [67]. However, tuning the free-run frequency may change the phase relationship between the ILRO outputs and degrade the RX timing margin. In this work, the phases of the ILRO outputs are tuned by adjusting the injection clock phase with an additional delay line controlled by the CDR logic, as shown in Figure 5-14.

Figure 5-14. Block diagram of the injection-locking-based clock generation

5.4.2 ILRO Core

The master and slave ILROs are of the same design. Figure 5-15 shows the

ILRO core schematic. Eight pseudo-differential delay cells constructed from inverters

are used instead of CML delay cells to avoid static current consumption. The input clock

phases are injected through NMOS transistors. To ensure locking, the free-run

frequency of the oscillator is digitally trimmed.


Figure 5-15. Schematic of the ILRO core

One design issue of the pseudo-differential oscillator is its start-up. Because there is an even number of delay stages, a stable DC solution exists where the whole ring behaves like a latch, as shown in Figure 5-16. To prevent that from happening, the cross-coupled inverters must be sized large enough compared to the main inverters. In

this design, the cross-coupled inverters are sized of the main inverters for reliable

start-up, as annotated in Figure 5-15.

Figure 5-16. Start-up issue of the pseudo-differential oscillator

5.4.3 Delay Line

The delay lines are constructed by cascading current-starved delay cells, the schematic of which is shown in Figure 5-17, where a 4-b digitally controlled current sets the bias current of the inverters. Figure 5-18 shows the simulated tuning curve of one delay cell. The tuning range is 30 ps. The CDR delay line consists of 8 delay cells. The total tuning range of 240 ps is larger than 1 UI for reliable CDR operation, especially when the extra delay caused by parasitics is considered.
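The tuning-range budget can be verified with simple arithmetic. The helper names below are illustrative, and the per-code step assumes a linear tuning curve across the 4-b code, which is a simplification of the simulated curve.

```python
def total_range_ps(cells, per_cell_ps):
    """Tuning range of a cascade of identical delay cells."""
    return cells * per_cell_ps

def unit_interval_ps(rate_gbps):
    """Bit time in picoseconds for a data rate in Gb/s."""
    return 1e3 / rate_gbps

cdr_range = total_range_ps(cells=8, per_cell_ps=30)   # simulated range of the CDR delay line
ui = unit_interval_ps(5.0)                            # bit time at 5 Gb/s
range_in_ui = cdr_range / ui                          # must exceed 1 UI
step_per_code_ps = 30 / (2 ** 4 - 1)                  # assumed linear 4-b tuning
```

The simulated range exceeds one bit time by 20%, so the CDR can reach any sampling phase within a UI with some margin left for parasitic delay.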

Figure 5-17. Schematic of the current-starved delay line

Figure 5-18. Simulated delay line tuning curve

5.5 Experimental Results

The transceiver was fabricated in a 0.13-μm bulk CMOS process using only

nominal-VT devices. The test chip was assembled in a 32-pin QFN package and

mounted on an FR4 board. Figure 5-19 shows the chip micrograph. The RX measures

, while the TX occupies .

5.5.1 TX Measurement

The TX is measured at different supply voltages. With a 1.5-V supply the TX is able to work up to 6.25 Gb/s, whereas at 1.2 V the TX is able to work at 5 Gb/s. Below 1.2 V the TX does not work properly, probably limited by the CML PRBS core. Figure 5-20(A) shows the measured TX eye diagram at 6.25 Gb/s. The RMS jitter is 11 ps. Figure 5-20(B) shows the captured transient of the TX output, which confirms correct $2^7-1$ pattern generation.

Figure 5-19. Chip micrograph and transceiver layout

[Figure 5-19 labels: RX with ILROs, DFE, CDR logic, level shifters, buffers and delay lines; TX with PRBS, decoupling cap, LDO, FSR and DRV; dimension markers 500 μm × 300 μm and 500 μm × 230 μm]

(A)

(B)

Figure 5-20. TX measurement results at 6.25 Gb/s. A) Output eye diagram (scale: 20 mV, 50 ps). B) TX transient showing the correct $2^7-1$ PRBS pattern (scale: 20 mV, 2.5 ns).

5.5.2 Clocking Measurement

Figure 5-21 shows the measured tuning curve and locking range of the ILRO.

The ILRO has a tuning range of more than 500 MHz, and the locking range is larger

than 10% when the free-run frequency is 312.5 MHz.

(A)

(B)

Figure 5-21. ILRO measurement results. A) Frequency tuning (frequency in MHz versus frequency control word). B) Locking range (in percent versus frequency control word).

Figure 5-22 shows the measured phase noise with and without injection. At a 100-kHz offset, injection locking suppresses the phase noise by more than 70 dB.

Figure 5-22. Measured phase noise with and without injection locking (phase noise in dBc versus frequency offset, 1.E+03 to 1.E+08; curves W/O injection and W/ injection)

The measured CDR delay line tuning curve is shown in Figure 5-23. The tuning range is 400 ps, which covers 2 UI when the data rate is 5 Gb/s. The measured tuning range is more than 60% larger than the simulated result, indicating heavy parasitics due to routing.

Figure 5-23. Measured CDR delay line tuning curve showing >2-UI tuning range (delay increase in ps versus control code)

5.5.3 RX Measurement

Standalone RX measurement is done up to 4 Gb/s due to equipment limitations. Figure 5-24 shows the measured loss profile of the 20-inch channel. The loss is 19.2 dB at 2 GHz. Figure 5-25 shows the 4-Gb/s eye diagrams before and after the channel. Due to severe channel loss, the eye is completely closed after the channel.

Figure 5-24. Measured loss characteristics of the 20-inch channel (S21 in dB versus frequency in GHz; −19.2 dB at 2 GHz)

Figure 5-25. Measured 4-Gb/s eye diagrams before and after the 20-inch channel (scale: 30 mV, 100 ps)

Figure 5-26 shows the measured bathtubs with and without DFE. Error-free operation cannot be attained without DFE, while the eye opening is 30% when DFE is enabled. Figure 5-27 shows the jitter histogram of the recovered clock. The RMS jitter is 4.85 ps, while the p-p jitter is 42 ps.

Figure 5-26. RX bathtubs with and without DFE (BER versus delay in UI; 30% opening with DFE)

Figure 5-27. Jitter histogram of the recovered clock

The receiver core is powered from a 1-V supply and dissipates 1.1 mW, which translates to a power efficiency of 0.28 pJ/bit. Table 5-1 compares the performance to some recently published work. The power efficiency is nearly a 2× improvement over the best result of previously published complete receivers.
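The figure of merit is simply power divided by data rate; with power in mW and rate in Gb/s the result is directly in pJ/bit. The short sketch below recomputes the FoM row of Table 5-1 from its power and data-rate rows. The dictionary layout and names are illustrative only.

```python
def energy_per_bit_pj(power_mw, rate_gbps):
    """mW divided by Gb/s gives pJ/bit."""
    return power_mw / rate_gbps

# (power in mW, data rate in Gb/s) taken from Table 5-1
receivers = {
    "[6]": (8.22, 6.25),
    "[7]": (6.6, 12.5),
    "[22] low": (1.3, 8.0),
    "[22] high": (1.98, 8.0),
    "This work": (1.1, 4.0),
}

fom = {name: energy_per_bit_pj(p, r) for name, (p, r) in receivers.items()}
improvement_over_7 = fom["[7]"] / fom["This work"]   # versus the complete receiver in [7]
```

The ratio against [7], the most efficient entry in the table that includes a CDR, comes out slightly below 2, consistent with the "nearly 2×" statement.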

Table 5-1. Performance summary of the receiver

                     [6]         [7]         [22]          This work
Data rate (Gb/s)     6.25        12.5        8             4
Equalization         CTLE        CTLE        CTLE          DFE
Nyquist loss (dB)    15          12          9.7           19
Sub-rate             1/2         1/2         1/10          1/16
Clock generation     PLL         PLL         ILRO          ILRO
CDR                  Alexander   Baud-rate   NA            Baud-rate eye-tracking
Jrms (ps)            NA          2.2         4             4.85
Technology           90-nm       65-nm       65-nm         0.13-μm
VDD (V)              1.0         1.0         0.6/1.0       1.0
Power (mW)           8.22        6.6         1.3-1.98      1.1
Area (mm2)           0.15        0.24        0.014-0.018   0.15
FoM (pJ/bit)         1.31        0.53        0.16-0.25     0.28

5.5.4 Transceiver Measurement

The whole link is then tested with a 10-inch channel on FR4 at 5 Gb/s, although the TX is capable of operating at 6.25 Gb/s.

Figure 5-28 shows the TX eye diagrams before and after passing the 10-inch channel. Although the Nyquist channel loss is less than in the standalone RX measurement, the eye is still completely closed due to the limited bandwidth and the jitter of the TX. The near-end TX RMS jitter is 13 ps.

(A)

(B)

Figure 5-28. Measured 5-Gb/s TX eye diagrams (scale: 20 mV, 50 ps). A) Before the channel. B) After the 10-inch channel.

Figure 5-29 shows the recovered data and clock of the RX. The recovered clock has an RMS jitter of 6.9 ps. Figure 5-30 shows the RX bathtubs before and after enabling the DFE. The eye opening with DFE enabled is 18%.

(A)

(B)

Figure 5-29. Measured CDR waveforms (scale: 20 mV, 1 ns). A) Recovered 312.5-Mb/s data. B) Recovered 312.5-MHz clock (JRMS = 6.9 ps, JP-P = 57.8 ps).

Figure 5-30. RX bathtubs with and without DFE (BER versus delay in UI; 18% opening with DFE)

The TX works from a 1.2-V supply and consumes 2.1 mW, while the RX consumes 1.6 mW from a 1-V supply. The total power consumption of the transceiver is 3.7 mW, and the power efficiency is 0.75 pJ/bit. Table 5-2 compares the transceiver performance with some recent publications. Even though we use a relatively less advanced technology, the power efficiency is among the best.

Table 5-2. Performance summary of the transceiver

                            [42]        [6]         [7]         [68]        This work
Technology                  65 nm       90 nm       65 nm       45 nm       0.13 μm
TX VDD (V)                  0.68        1.0         1.0         0.8         1.2
RX VDD (V)                  0.68        1.0         1.0         0.8         1.0
Data rate (Gb/s)            5           6.25        12.5        10          5
Nyquist loss (dB)           4           15          12          8           12
TX swing (mVpp)             100         200         150         150         160
BER                         1e-12       1e-15       1e-12       1e-14       1e-12
Eye opening (UI)            -           30%         43%         -           18%
Power (mW)                  13.5        14          12          14          3.7
Energy efficiency (pJ/bit)  2.7         2.24        0.98        1.4         0.75
TX/RX area (mm2)            0.03/0.06   0.31/0.31   0.24/0.24   0.07/0.07   0.15/0.12

5.6 Summary

Building on the results in Chapter 3 and Chapter 4, this Chapter presents a 5-Gb/s 0.75-pJ/bit transceiver in 0.13-μm bulk CMOS technology. Various design techniques are combined to attain this high power efficiency, including VM signaling with differential termination to reduce the signaling power by 75% compared to CM signaling, the exclusive use of static CMOS gates to avoid the static power consumption of CML gates, injection-locking-based clock generation, decimation in the CDR circuitry, and low-voltage RX operation enabled by the heavy front-end parallelism and the look-ahead DFE selection tree. The heavy parallelism also eliminates the need for an explicit DMUX, leading to further power reduction.

Even though the transceiver is implemented in a less advanced 0.13-μm CMOS technology, the achieved power efficiency of 0.75 pJ/bit is among the best reported to date at comparable data rates. It is therefore believed that the techniques presented in this Chapter will help enable the Tb/s aggregate off-chip signaling of future electronic systems.


CHAPTER 6 A DIGITAL BACKGROUND ADC CALIBRATION TECHNIQUE

6.1 Chapter Overview

The continuous scaling of CMOS technology has made digital signal processing

more powerful and affordable. Compared to analog signal processing, digital solutions

have the advantages of greater flexibility and better scalability. As a result, there is a

trend of moving more and more signal processing into the digital domain. This trend is

also reflected in high-speed serial links [8] [69] [70], where an ADC digitizes the

distorted incoming bit stream and a DSP carries out the signal processing such as

equalization and timing recovery in the digital domain, as shown in Figure 6-1.

Figure 6-1. An ADC-based serial link

One of the key challenges in such ADC-based serial links is the design of a high-

speed low-power ADC. Due to its high speed, a flash ADC is often the architecture of

choice. For low power consumption, it is desirable to use small transistors in the flash

ADC. However, the mismatch between transistors becomes worse with small transistor

sizes, which will degrade the linearity of the ADC if left unaddressed.

For example, consider the preamp in Figure 6-2, often found in flash ADCs. Around the balanced condition, the input and output are related by

$$V_O = A\,(V_{IN} - V_{REF} - V_{OS}) \qquad (6\text{-}1)$$

where $A$ is the preamp gain, and $V_O = V_{OP} - V_{ON}$, $V_{IN} = V_{INP} - V_{INN}$, and $V_{REF} = V_{RP} - V_{RN}$ are the differential output, input and reference voltages respectively. The last term, $V_{OS}$, is the offset voltage of the preamp due to device mismatches. With proper design and layout, $V_{OS}$ has a zero mean (no systematic offset) and a certain spread determined by circuit details and the fabrication technology. For typical bias conditions, $\sigma_{V_{OS}}$ is dominated by transistor threshold voltage mismatch [71] and can be expressed as

$$\sigma_{V_{OS}} = \frac{A_{VT}}{\sqrt{WL}} \qquad (6\text{-}2)$$

where $A_{VT}$ is a parameter determined by the technology, and $WL$ is the gate area of the transistors. To satisfy the linearity requirement, the transistors must be sized large enough so that $\sigma_{V_{OS}}$ is kept within a fraction of the ADC step size. With the transistor length $L$ and current density $I_D/W$ largely determined by the speed requirement, $W$ is the only design variable that can be exploited to reduce $\sigma_{V_{OS}}$. According to Equation 6-2, to decrease $\sigma_{V_{OS}}$ by half, the transistor width and therefore the current consumption must be increased by 4×, a very unfavorable tradeoff for low-power designs. As technology scales down, this tradeoff is expected to become more and more challenging due to effects such as random dopant fluctuation (RDF) and line-edge roughness (LER) [72].
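The cost of this tradeoff follows from the inverse-square-root dependence of threshold mismatch on gate area. The sketch below assumes the usual area-scaling mismatch model, sigma = A_VT / sqrt(W·L); the 5 mV·µm coefficient and the device dimensions are hypothetical values chosen only to make the numbers concrete.

```python
import math

def sigma_vos_mv(a_vt_mv_um, w_um, l_um):
    """Offset standard deviation for a given gate area (area-scaling model)."""
    return a_vt_mv_um / math.sqrt(w_um * l_um)

def width_scale_for(sigma_ratio):
    """Factor by which W must grow (fixed L) to scale sigma by sigma_ratio."""
    return 1.0 / sigma_ratio ** 2

A_VT = 5.0                                   # hypothetical coefficient, mV*um
base = sigma_vos_mv(A_VT, w_um=1.0, l_um=1.0)
wide = sigma_vos_mv(A_VT, w_um=4.0, l_um=1.0)
scale = width_scale_for(0.5)                 # width needed to halve sigma
```

With the length and current density fixed, quadrupling the width also quadruples the bias current, which is why sizing alone is an expensive way to buy linearity.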

Figure 6-2. Schematic of a preamp

[Figure 6-2 labels: inputs VINP, VINN; references VRP, VRN; outputs VOP, VON; load resistors RD; input devices W/L; tail currents 2ID]

Since offset changes slowly over time with environmental (supply voltage and temperature) variations and device aging, it can be cancelled effectively with some form of calibration. Various calibration schemes have been proposed in the past for flash ADCs, all of which fall into either the foreground [73] [74] [75] or the background [76] [77] category. A foreground calibration scheme mandates temporarily interrupting the normal ADC operation and is therefore usually done at power-up or during certain idle times when allowed by the system. However, as the supply voltage and temperature change over time, the calibration results may no longer be optimum, leading to degraded performance [78]. In contrast, a background calibration scheme does not require interrupting the ADC operation and can run continuously to track environmental variations and device aging. Thus, background calibration schemes are generally preferred.

Some of the critical challenges in background calibration for high-speed ADCs

are accuracy, convergence speed, area/power overhead, and performance penalty.

Despite the many background calibration techniques proposed in the past, a quick

literature review demonstrates the need for an improved background calibration scheme

that is suitable for high-speed ADCs. In response, this Chapter describes a novel

background calibration scheme for ADCs which features negligible hardware and power

overhead. The proposed calibration scheme is implemented in a 50-mW 2.5-GS/s 5-bit

flash ADC and its effectiveness is verified with experimental results.

6.2 Background Calibration

6.2.1 Review of Prior Art

Several background calibration schemes for flash ADCs have been reported in the literature, and are briefly reviewed here. Correlation-based calibration operates by modulating the analog input signal with pseudo-random sequences to extract offset information from the resulting statistics of the digital output, and has been proposed for both pipeline and flash ADCs [79] [80] [81] [82]. In [79] and [80], the analog input is converted to a white signal with little energy at DC by chopping it with a pseudo-random binary sequence. The DC component in the resulting signal stems mainly from the ADC offset. By forcing this DC component to zero, the comparator offset can be effectively removed. A more general approach is proposed in [81], where the offset of a comparator is detected by chopping the analog input with a sequence from an on-chip random-number generator (RNG) and observing the code distribution of the digital outputs, as illustrated in Figure 6-3 (drawn single-ended for simplicity). The chopping operation degrades the ADC sample rate because it needs finite time to settle. Due to this approach's statistical nature, the analog input must be uncorrelated with the on-chip generated random sequence, and the calibration results are prone to fluctuation, which can only be minimized at the cost of convergence speed [81]. Furthermore, correlation-based calibration invariably introduces a performance penalty because it interferes with the analog signal path through chopping or noise injection. For fast and robust calibration, deterministic schemes are generally preferred.

Figure 6-3. Correlation-based calibration

[Figure 6-3 annotations: RNG, chopping switches SH and SL, VIN, VR, comparator output Q, offset Vos; PDF of VIN − VR with P1 and ΔP1 marked around ±Vos. When RNG=1, Q = sgn(VIN − VR − Vos); when RNG=0, Q = sgn(VIN − VR + Vos).]

Redundancy-based calibration [83] [77] [84] achieves deterministic operation by employing redundant elements to enable uninterrupted ADC operation when some of the elements undergo calibration. Figure 6-4 shows the 6-b ADC block diagram with background calibration as reported in [76], where 64 instead of 63 comparators (C1-C64) are employed in parallel. When C1 is being calibrated, the other 63 comparators (C2-C64) work together as a normal ADC. After C1's calibration is done, the comparator array is reconfigured so that C1 and C3-C64 work together as a normal ADC and C2 undergoes calibration, with the ADC operation uninterrupted. This process repeats continuously and in the end all the comparators are calibrated. The advantage of this technique is its low hardware overhead. However, this technique still incurs a speed penalty because it needs to reconfigure the ADC during its normal operation.

Figure 6-4. Redundancy-based calibration

Reference-ADC-based calibration schemes proposed in [85] [86] [87] employ a slow but accurate reference ADC to improve the linearity of the fast but inaccurate main ADC. Figure 6-5 shows a simplified block diagram of the reference-ADC-based calibration scheme, while Figure 6-6 shows its working principle. For simplicity, we assume that the main ADC has 3-b resolution. In Figure 6-6, the transfer curves of the main ADC and the ideal reference ADC are overlaid. Any offset causes a transition level of the main ADC to differ from the corresponding transition level of the reference ADC. These differences are marked by gray bars in Figure 6-6 and are referred to as calibration windows hereafter. Whenever the input falls within a calibration window, a discrepancy occurs between the reference and main ADC outputs. The calibration engine then examines such discrepancies and drives the transition levels of the main ADC toward their ideal values.

Figure 6-5. Reference-ADC-based calibration

[Figure 6-5 labels: VIN, main ADC, decimation by M, reference ADC, calibration engine]

Figure 6-6. Principle of reference-ADC-based calibration.

[Figure 6-6 axes: output code versus VIN; curves for the reference ADC and the main ADC]

Although reference-ADC-based calibration is deterministic and incurs negligible performance penalty, there is considerable design overhead when the reference and main ADCs are entirely different; for example, a Σ-Δ ADC is used to calibrate a pipeline ADC in [87]. Furthermore, because the main and reference ADCs operate from different sampling clocks, mismatch in their track-and-hold (T/H) circuits can degrade the calibration accuracy. To alleviate this problem, one has to resort to either power-hungry T/H circuits to drive both ADCs [86] or dedicated timing calibration for the two sampling clocks [88], both of which are very challenging at high speeds. These disadvantages can be avoided with the so-called "split-ADC" architecture, where the reference ADC is simply a replica of the main ADC and operates at the same speed [78] [89]. The replica ADC, however, incurs significant area, input capacitance and power overhead.

6.2.2 Proposed Background Calibration Scheme

In the reference-ADC based calibration scheme, all the transition levels are

calibrated simultaneously. This necessitates a reference ADC with at least the same

resolution as the main ADC, and thus high overhead seems inevitable. However,

because offset varies slowly over time, the transition levels can be calibrated

sequentially instead of simultaneously. The benefit of this sequential calibration is the

greatly reduced complexity of the reference ADC. In the extreme case, as in our

proposed calibration scheme, 1-b resolution is sufficient, and the reference ADC

degenerates to a single comparator.

Figure 6-7 shows a block diagram of the proposed calibration scheme. The reference ADC is now replaced with a single comparator, whose threshold voltage is reconfigurable through a digital-to-analog converter (DAC). At the beginning, the calibration engine sets the comparator's threshold voltage to the ideal value of the first transition level, as shown in Figure 6-8(A). By monitoring the outputs of the ADC and the comparator, the calibration engine adjusts the first transition level of the main ADC, VTH[1], until it matches the comparator threshold. After calibrating VTH[1], the comparator's threshold voltage is set to the ideal value of the second transition level and calibration of VTH[2] begins, as shown in Figure 6-8(B). By iterating the same process, all the transition levels of the main ADC can be calibrated. The resulting fully-calibrated transfer curve of the ADC is shown in Figure 6-8(H). The performance metrics of the proposed calibration scheme are discussed below.
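The sequential procedure can be sketched as a small behavioral simulation. A 3-b flash ADC is modeled as seven comparators with static offsets and a digital trim per comparator, and the 1-b reference is an ideal comparator whose threshold is moved from one ideal transition level to the next. The offsets, the 50-mV level spacing, the 1-mV trim step, the uniformly distributed input, and all names are assumptions made for illustration; the update rule (one trim step per observed discrepancy) is one simple choice, not necessarily the one used in the chip.

```python
import random

IDEAL_MV = [-150.0, -100.0, -50.0, 0.0, 50.0, 100.0, 150.0]   # hypothetical levels
OFFSETS_MV = [12.3, -7.8, 25.1, -18.6, 3.2, -29.4, 9.9]       # hypothetical offsets

def calibrate(offsets_mv, ideal_mv, step_mv=1.0, samples=20000, seed=1):
    """Calibrate the transition levels one at a time against a 1-b reference."""
    rng = random.Random(seed)
    trims = [0] * len(ideal_mv)
    lo, hi = ideal_mv[0] - 100.0, ideal_mv[-1] + 100.0
    for i, ref_level in enumerate(ideal_mv):       # reference threshold set by the DAC
        for _ in range(samples):
            vin = rng.uniform(lo, hi)              # input keeps running in the background
            vth = ref_level + offsets_mv[i] + trims[i] * step_mv
            main_bit = vin > vth                   # comparator i of the main ADC
            ref_bit = vin > ref_level              # reconfigurable reference comparator
            if main_bit and not ref_bit:
                trims[i] += 1                      # main threshold too low: raise it
            elif ref_bit and not main_bit:
                trims[i] -= 1                      # main threshold too high: lower it
    return trims

trims = calibrate(OFFSETS_MV, IDEAL_MV)
residual_mv = [o + t * 1.0 for o, t in zip(OFFSETS_MV, trims)]
```

A discrepancy can only occur when the input lands between the main and reference thresholds, so each update moves the main threshold toward the reference, and the residual error settles to within one trim step.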

Figure 6-7. Proposed reconfigurable-comparator-based calibration

(A) (B) (C) (D)

(E) (F) (G) (H)

Figure 6-8. Principle of the proposed calibration scheme. The transition levels are

calibrated sequentially in A)-G), and the resulting transfer curve is shown in H).

[Figure 6-7 labels: VIN, main ADC, reconfigurable comparator, DAC, calibration engine. Figure 6-8 panels plot output code versus VIN: A) VTH[1] cal. B) VTH[2] cal. C) VTH[3] cal. D) VTH[4] cal. E) VTH[5] cal. F) VTH[6] cal. G) VTH[7] cal. H) Finished.]

6.2.2.1 Calibration accuracy

The calibration accuracy is determined by a few factors, including the reference

ADC accuracy, the calibration step size, and noise. The discussion above assumes an

ideal reference ADC. In reality, however, both the DAC and the comparator in the

reference ADC introduce errors and ultimately limit the calibration accuracy. Moreover,

due to the digital nature of the calibration scheme, the main ADC can only be adjusted

in discrete steps. The reference ADC accuracy, together with the finite calibration step

size, limits the overall calibration accuracy. Once the ADC is calibrated, the residual

error in the transition level is bounded by

- )

where is the DAC error, is the offset of the comparator in the reference

ADC, and is the calibration step size. The calibrated INL and DNL are bounded by

| | )

and

| | )

respectively. Notice that the offset of the reference comparator does not impact the calibrated DNL. This is because

it appears in all the calibrated transition levels and merely causes a DC shift in the

calibrated transfer curve.

The effect of noise on calibration accuracy is shown in Figure 6-9 for the case

, where denotes the mean of a random variable. For convenience,

the noise is lumped to in Figure 6-9. Ideally, whenever a discrepancy occurs, it

should indicate and correct calibration can be made. However, due to noise,

may be temporarily higher than , as indicated by the dashed line in Figure 6-9,

and this may cause incorrect calibration to occur. To improve immunity to noise, the

calibration engine can average multiple discrepancies before making a decision.

Figure 6-9. Mechanism of noise-induced calibration error

Because the reference ADC shares the same T/H and sampling clock as the

main ADC, the calibration accuracy of the proposed scheme does not suffer from the

T/H mismatch issue as the conventional reference-ADC based approach does. Nor is it

sensitive to the statistics of the input signal since it does not rely on the correlation

between the input signal and an on-chip pseudo-random sequence.

6.2.2.2 Convergence speed

To calculate the convergence speed, we assume the input VIN is uniformly distributed within

the full-scale input range VFS. Similar calculations can be carried out for other input

distributions, such as those of sine waves. Suppose the initial offset of a certain

transition level is . The probability that the input produces a discrepancy is

, and

on average ⌈

⌉ conversions are needed to reduce the offset by one step, where ⌈

is the smallest integer that is larger than |

|. Therefore, the number of conversions to

calibrate the offset is

)

If we assume the offset is a normal distribution with a mean of zero and a

standard deviation of σ, then the average number of conversions required to calibrate a

particular transition level is

(

)

)

Exploiting the symmetry of the integrand and assuming the offset is within [-3σ,

3σ], we can approximate the above integral as

(

)

)

For an N-bit ADC, there are 2^N - 1 transition levels. The total number of

conversions for the calibration to converge is

)

Since , Equations 6-8 and 6-9 are combined to yield

(

)

)

Figure 6-10 plots as a function of the ADC resolution with different σ when

. For a 5-bit ADC, when , the calibration takes about

conversions to converge. Note that while the number of conversions grows at a rate of 2^(2N), it is a relatively

weak function of σ. For example, tripling σ from 1 VLSB to 3 VLSB increases the required

number of conversions by only 37%. This is because calibrating small offsets takes

more conversions as the input has a lower chance of producing a discrepancy when the

offset is small.
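The reason small offsets dominate the conversion count can be illustrated with a minimal sketch. It is not the dissertation's closed-form result: it assumes a uniformly distributed input, one correction step per observed discrepancy (no averaging), and calibration that stops once the residual is within half a step. The function name and the stopping rule are assumptions made for illustration.

```python
def expected_conversions(offset_lsb: float, step_lsb: float, n_bits: int) -> float:
    """Expected number of conversions to walk one transition level back
    to within half a calibration step (illustrative model only)."""
    full_scale_lsb = 2 ** n_bits      # V_FS expressed in LSB
    remaining = abs(offset_lsb)       # current offset magnitude in LSB
    total = 0.0
    while remaining > step_lsb / 2:
        # With a uniform input, P(discrepancy) = |offset| / V_FS, so the
        # mean number of conversions until one occurs is its inverse.
        total += full_scale_lsb / remaining
        remaining -= step_lsb         # one correction step toward zero
    return total


if __name__ == "__main__":
    for offset in (0.5, 1.0, 3.0):
        print(offset, expected_conversions(offset, 0.25, 5))
```

Under these assumptions each additional step taken at a large offset is cheap, while the last few steps near zero offset each require many conversions, so the total grows much more slowly than the initial offset.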

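Figure 6-10. Required conversions for convergence with different resolutions. [Plot: number of conversions (log scale, 10^3 to 10^8) versus resolution (4 to 10 bits) for σ = 1 VLSB and σ = 3 VLSB.]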

6.2.2.3 Calibration overhead and performance considerations

The calibration overhead consists mainly of the reference ADC, the calibration

engine, the memory to store the offset control words, and the circuitry to adjust the main

ADC offset. With the calibration engine, the memory and the adjustment circuitry being

common to all digital calibration schemes, the major overhead advantage of the

proposed scheme lies in the simplicity of the reference ADC. The comparator in the

reference ADC can reuse the design available in the main ADC and entails no extra

design effort. The DAC in the reference ADC is only used to set the threshold voltage

and its speed requirement is much relaxed compared to the main ADC’s sample rate.

The power, area, and design overhead of the reference ADC is therefore trivial.

The proposed calibration scheme does not require noise injection or chopping as

seen in correlation-based calibrations. While redundancy-based calibration reconfigures

the main ADC during normal operation, the calibration scheme herein does not.

Moreover, it does not insert extra conversion cycles thereby avoiding any speed

penalty. Although the reference ADC does increase the input capacitance, this penalty

is minimal because only a single comparator is used. For example, calibrating a 5-b

flash ADC with the proposed scheme increases the input capacitance by less than 4%.

This is in stark contrast to the split-ADC architecture, which increases the input

capacitance by .

Table 6-1 shows a comparison of various background calibration schemes. The

proposed calibration engine achieves deterministic operation, introduces little

performance penalty, and incurs low hardware and design overhead. Because the

calibration is sequential, its convergence is slower than the split-ADC architecture. This

usually is not detrimental since environmental variations are slow. When fast

convergence is desired (for example, to reduce the test time during mass production),

foreground calibration can be performed at power up before the background calibration

is enabled.

Table 6-1. Comparison of proposed and existing background calibration schemes

Scheme             Deterministic  Performance  Hardware  Design  Converg.
                                  Penalty      Overhead  Effort  Speed
Correlation-based  No             Yes          Medium    Medium  Low
Redundancy-based   Yes            Yes          Low       Low     High
Ref.-ADC-based     Yes            No           High      High    Medium
Split-ADC          Yes            Yes          High      Low     High
This work          Yes            No           Low       Low     Medium

6.3 Chip Implementation

6.3.1 ADC Architecture

Figure 6-11 depicts a block diagram of the implemented 5-bit flash ADC with the

calibration circuitry (drawn single-ended for simplicity, though the real implementation is

differential). The main ADC consists of a track-and-hold (T/H), a resistor ladder, a

comparator array, and a digital backend. The comparator array is comprised of

comparators C[1:31], which digitize the sampled analog input against 31 evenly-spaced

reference voltages VR[1:31] from the resistor ladder. The resulting thermometer codes

are then converted to binary format by the digital backend which also corrects first-order

bubble errors.

Figure 6-11. Block diagram of the ADC

The calibration circuitry consists of the resistor ladder and the shaded blocks in

Figure 6-11. The switch bank SR, the resistor ladder and the comparator C[0] make up

the reference ADC. The SRAM stores the offset control words W[1:31] for

C[1:31]. The finite-state machine (FSM) communicates with the SRAM through the

address decoder and serves as the calibration engine.

The chip also houses a serial interface. This facilitates digital control of the bias

generator and allows clearing the SRAM content to disable calibration.

[Figure 6-11 labels: T/H; reference voltages VR[1]~VR[31] with VRP and VRN; comparators C[0]~C[31] with outputs Q[0]~Q[31]; offset control words W[1]~W[31]; select signals S[1]~S[31]; switch banks SR and SQ; address decoder (ADDR, DATA); SRAM (31 × 5 b); FSM; digital backend; serial interface; bias generator.]

6.3.2 Resistor Ladder

Since the resistor ladder generates the reference voltages for the reference ADC,

its linearity ultimately determines the achievable calibration accuracy. For an N-bit ADC,

the requirement on the resistors used in the ladder is [90]

where R is the nominal resistance and is the variance. The resistor ladder consists of

identical poly resistor units with W/L of 8μm/4μm with estimated mismatch <0.35%,

which is better than 8-bit accuracy [91]. To stabilize the reference voltages and

suppress input feedthrough, decoupling PMOS capacitors are connected to all resistor

ladder output taps [92]. The resistor ladder consumes 0.21 mW.

6.3.3 T/H

A passive T/H precedes the comparator array, the schematic of which is shown

in Figure 6-12(A). By presenting a static signal to the comparator array during

quantization, the T/H helps minimize linearity degradation due to signal dependent

comparator delays and the clock and signal skew between comparators. Since the input

voltage swing is from VDD-0.4V to VDD, PMOS transistors are used. This also eliminates

the need for a buffer to shift the input common mode level [93] [92].

The bandwidth of the T/H is determined by the on-resistance of the switch and

the sampling capacitor. Figure 6-12(B) shows the small signal model of the T/H, where

CPAD is the pad parasitic capacitance, Csample is the sampling capacitance, and the 25Ω

resistor is the parallel combination of the channel impedance and the on-chip

termination resistor. A simple π model is used in place of the transistors, with R' and C'

being the channel resistance and the gate capacitance of a unit-width transistor,

respectively. A larger transistor has a lower on-resistance and thus tends to give a

higher bandwidth. However, when the on-resistance is comparable to 25Ω, the

bandwidth will drop with increasing transistor width because the parasitic capacitance

begins to dominate. An optimum transistor size therefore exists which maximizes the

total T/H bandwidth. Figure 6-13 plots the T/H bandwidth as a function of the transistor

width. It can be seen that a width of 28 µm gives the highest bandwidth. However, the

optimum is broad. A transistor width of 14 µm is chosen instead, which costs only

a 10% drop in bandwidth while saving about 0.2 mW in clocking power.

Figure 6-12. T/H Design. A) Schematic. B) Its small-signal model. [Schematic labels: three PMOS devices of width 7 µm, 14 µm, and 7 µm, clocked by CKB, CK, and CKB. Model labels: R'/W, WC', CPAD, Csample, 25 Ω.]

Figure 6-13. T/H Bandwidth vs. switch width. [Plot: bandwidth (GHz, 2.0 to 4.5) versus width (µm, 0 to 50).]

A few mechanisms limit the T/H linearity, including signal-dependent charge

injection, clock feedthrough, and nonlinear channel resistance during track-mode [94].

Dummy switches driven by a delayed complementary clock are used at both sides of

the sampling switch to cancel the charge injection [92]. With second-order distortion

largely removed by differential signaling, the third-order term dominates the distortion

performance. Simulation shows that, when sampling a 1.4-GHz full-scale sine wave at

2.5 GS/s, the T/H achieves -45 dBc third-order harmonic distortion, of which 1.5 dB

is contributed by the dummy switches.

6.3.4 Comparator

Figure 6-14 shows the block diagram of the comparator. A three-stage

preamplifier followed by a regenerative latch digitizes the difference between input and

reference voltages. Another two latch stages reduce metastability and convert current-

mode-logic (CML) levels to full-swing CMOS logic levels. A current steering DAC

accepts the control word from the SRAM and injects static current into the output of the

first preamplifier stage to cancel the offset of the whole comparator.

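Figure 6-14. Comparator block diagram. [Blocks, in order: P1, P2, P3, L1, P4, L2, L3, with inputs VIN and VR; the DAC is controlled by the SRAM; the latches are labeled CML Latch, CML Latch, and SAFF.]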

Compared to a dynamic comparator [74], the preamplifier expedites the

regeneration in the latch [95], suppresses charge kickback, and provides better power

supply and common-mode rejections. The preamplifier consists of three stages (P1~P3)

for fast overdrive recovery [90] [93]. Figure 6-15 shows the schematics of P1, P2, and

the DAC. Resistor loads are used instead of diode-connected transistors to avoid the

voltage headroom lost to the transistor VT [93].

Figure 6-15. Schematics of the first two stages of the preamplifier

For high speed operation, the bandwidth of the preamplifiers must be maximized.

For that reason, it’s desirable to bias the transistors at high current densities. However,

this practice is limited by two factors. First, the transit frequency of a transistor

increases slowly at high current densities, as shown in Figure 6-16(A), which means the

current efficiency drops at high current densities, even without considering the fT drop

caused by velocity saturation. Second, the highest current density is limited by the

supply voltage due to voltage headroom issues. For P1, ignoring the currents through

M3, the gain is given by

where gm is the transconductance of M1 and M2, I1 is the current through M1 and M2, and VR is the voltage drop on R1 when the differential pair is balanced. One term of this expression arises because half of the bias current flows through M1B and M2B and does not produce any gain. Since VINP, VINN, VRP and VRN all vary between VDD-0.4 V and VDD, VR must be kept below about 0.25 V, considering the body effect, to prevent M1 and M2 from entering the linear region. Figure 6-16(B) plots VR as a function of the current density, assuming a moderate gain of 2. It can be seen that the speed and gain requirements cannot be met without violating this limit. To solve this problem, two transistors biased in the saturation region (M3A and M3B) are used to bypass half of the current, which halves the voltage drop on R1A and R1B [96], as also shown in Figure 6-16(B). The chosen current density is 50 μA/μm.

[Figure 6-15 labels: stages P1, P2, and DAC; inputs VINP, VRP, VRN, VINN; bias VB. Component values:]

M1A, M1B        1µ/0.12µ
M2A, M2B        1µ/0.12µ
M3A, M3B        1µ/0.12µ
M4A, M4B        0.4µ/0.12µ
M5A, M5B        1µ/0.12µ
R1A, R1B        6 kΩ
R2A, R2B        6 kΩ
IT1, IT2, IT3   100 µA
IDAC            0~40 µA

Since P2 has less self-loading, it can achieve a larger gain-bandwidth product (GBW) than P1 given the

same bias condition and fanout. The gain of P2 is therefore designed to be 70% higher than

that of P1, while the bandwidths of P1 and P2 are kept the same. To save area, no inductive

peaking is used.

Figure 6-16. Effects of M3. A) Transit frequency vs. current density. B) Required voltage drop on the load resistor vs. current density. [Panel A: fT (GHz, 0 to 120) versus current density (μA/μm, 0 to 300). Panel B: VR (V, 0 to 1.6) versus current density (μA/μm, 0 to 300), with and without M3.]

Figure 6-17. Schematic of the CML latches

A CML flip-flop and a sense-amplifier flip-flop (SAFF) complete the comparator.

Figure 6-17 shows the CML flip-flop, which is constructed with the conventional master-

slave topology. Figure 6-18 shows the SAFF schematic. It consists of a sense-amplifier

(SA) and a set-reset (SR) latch. The SAFF provides additional gain to suppress

metastability errors and convert CML levels to full-swing CMOS levels. With the

additional gains of the latches, the ADC’s BER is estimated to be better than [97].

Figure 6-18. Schematic of the SAFF

Figure 6-19 shows the current-steering DAC. A bias generator shared by all the

comparators generates three bias voltages. The offset control word W[N] selects from

these three bias voltages and VSS to inject an appropriate current to comparator C[N]

and cancel its offset.

[Figure 6-17 labels: stages P3, L1, P4, and L2, clocked by CK and CKB. Component values:]

M1A, M1B   1µ/0.12µ
M2A, M2B   0.8µ/0.12µ
M3A, M3B   1µ/0.12µ
M4A, M4B   0.8µ/0.12µ
M5A, M5B   2µ/0.12µ
M6A, M6B   2µ/0.12µ
R1A, R1B   8 kΩ
R2A, R2B   8 kΩ
IT1, IT2   60 µA

[Figure 6-18 labels: SR latch and L3, clocked by CKB.]

Figure 6-19. Current-steering DAC and the DAC bias generator. The bias generator is shared by all the comparators.

One important design parameter of the current-steering DAC is its calibration

range Vcal. This range is selected based on the comparator offset and the yield target.

To reduce area and power consumption, the transistors in the comparators are sized

close to the minimum. Figure 6-20(A) shows the simulated comparator offset, whose standard deviation σOS

is 22.5 mV (0.9 LSB) and is dominated by the preamplifier. For a certain calibration

range, the yield is the probability of all the 32 comparators' offsets falling within this

range, and, assuming a Gaussian distribution for the comparator offset, is given by

\[ Y = \left[ \mathrm{erf}\left( \frac{V_{cal}}{2\sqrt{2}\,\sigma_{OS}} \right) \right]^{32} \]

Figure 6-20(B) shows the yield as a function of the normalized calibration range Vcal/σOS. To

achieve a yield higher than 90%, the normalized calibration range should be higher

than 6. In this prototype, the maximum IDAC is programmable through the serial

interface, and the simulated can cover up to , as Figure 6-20(C) shows.
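The yield target can be checked numerically with a minimal sketch. It assumes zero-mean Gaussian offsets that are independent across the 32 comparators and a calibration range centered on zero, so that a normalized range of 6 corresponds to ±3 standard deviations; the function name and the default argument are illustrative.

```python
import math


def calibration_yield(normalized_range: float, n_comparators: int = 32) -> float:
    """Probability that every comparator offset falls inside a calibration
    range of `normalized_range` standard deviations (full width)."""
    half_range_sigmas = normalized_range / 2.0
    # P(|x| < k*sigma) for a zero-mean Gaussian is erf(k / sqrt(2)).
    p_single = math.erf(half_range_sigmas / math.sqrt(2.0))
    return p_single ** n_comparators


if __name__ == "__main__":
    for r in (4, 5, 6, 7, 8):
        print(r, round(calibration_yield(r), 4))
```

Under these assumptions a normalized range of 5 falls well short of the 90% target, while a range of 6 clears it, which agrees with the threshold stated above.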

The other key parameter of the current-steering DAC is its resolution, which

determines the calibration step and the achievable calibration accuracy as discussed

previously. In this prototype, 5-b resolution is chosen. When the calibration range is programmed to 5.4 LSB (a normalized range of 6), the calibration step is 0.19 LSB. With the resistor ladder providing higher than 8-b linearity, this guarantees a calibration accuracy of 0.5 LSB according to Equation 6-3.

[Figure 6-19 labels: the shared bias generator produces VB[1], VB[2], and VB[3] from currents IB×1, IB×2, and IB×3; in the current-steering DAC, W[N][1:0] and W[N][3:2] each select among VB[3], VB[2], VB[1], and VSS, and W[N][4] is the remaining control bit. Component values: M6A, M6B, M6C 2µ/0.16µ; M4A, M4B 0.4µ/0.12µ; M3A 1.6µ/0.16µ; M3B 0.4µ/0.16µ; IB 13 µA.]

Figure 6-20. Simulated comparator performances. A) Offset. B) Yield vs. normalized calibration range. C) Calibration range. [Panel A: histogram of offset (mV, -60 to 60). Panel B: yield (0% to 100%) versus normalized DAC range (0 to 12). Panel C: Vcal (mV, 0 to 200) versus maximum IDAC (µA, 0 to 50).]

6.3.5 Digital Backend

A digital backend converts the output thermometer codes of the comparator array

to binary format. It also provides the capability of correcting or minimizing errors due to

bubbles or metastabilities. Figure 6-21 shows the block diagram of the digital backend.

The three-input AND gate array converts the thermometer codes to one-hot codes and

provides 1st order bubble error correction. The one-hot codes are then used to address

a quasi-gray-code ROM encoder [98]. Simple XOR gates convert the quasi-gray code to

binary codes. The binary codes are then decimated by 64 to accommodate the limited

bandwidth of the test equipment.
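The encoding chain described above can be sketched behaviorally as follows. This is an illustration under stated assumptions rather than the implemented logic: the exact inputs of the three-input AND gates are not given in the text, so the sketch selects position k when t[k] is 1 and the next two outputs are 0, and a standard reflected Gray code stands in for the quasi-Gray code of [98]. The function name is hypothetical.

```python
def thermometer_to_binary(thermo):
    """Convert comparator outputs (thermo[0] is C[1]) to a binary code,
    suppressing first-order bubbles (behavioral sketch)."""
    n = len(thermo)
    # Pad with a constant 1 below the array and two 0s above it.
    t = [1] + list(thermo) + [0, 0]
    # Three-input AND: t[k] AND NOT t[k+1] AND NOT t[k+2].
    one_hot = [t[k] & (1 - t[k + 1]) & (1 - t[k + 2]) for k in range(n + 1)]
    # ROM encoder: each selected row contributes its Gray code word.
    gray = 0
    for k, hot in enumerate(one_hot):
        if hot:
            gray |= k ^ (k >> 1)
    # Gray-to-binary conversion with a chain of XORs.
    binary = 0
    while gray:
        binary ^= gray
        gray >>= 1
    return binary


if __name__ == "__main__":
    clean = [1] * 5 + [0] * 26                # five comparators tripped
    bubble = [1, 1, 1, 0, 1] + [0] * 26       # C[4] reads 0 by mistake
    print(thermometer_to_binary(clean), thermometer_to_binary(bubble))
```

With a two-input detector the bubbled pattern would activate two ROM rows at once; the third input suppresses the lower one, so only a single row is read.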

Figure 6-21. Block diagram of the digital backend

6.3.6 Reference ADC

The reference ADC comprises the resistor ladder, the switch bank SR, and the comparator C[0]. The resistor ladder is reused from the main ADC to reduce the calibration overhead. The switch bank SR is built with simple CMOS transmission gates and is controlled by the one-hot code S[1:31] to select the desired reference voltage for C[0] from the resistor ladder. C[0] shares the same design as C[1:31] and does not involve any extra design effort.

6.3.7 Calibration Engine and Supporting Circuitry

The other calibration circuitry includes the FSM as the calibration engine, the

SRAM to store the offset control words, the address decoder to facilitate the

communication between the FSM and SRAM, and the switch bank SQ. The FSM, the

SRAM, and the address decoder are all built with standard cells, while the switch bank

SQ is implemented with CMOS transmission gates, same as SR.

Figure 6-22. FSM flow chart. N is the calibration index, which is also the SRAM address.

Figure 6-22 shows the flow chart of the FSM operation. At the beginning, the

FSM sets N to 1. This sets S[1] to HIGH so that both C[0] and C[1]’s reference voltages

are connected to VR[1]. Meanwhile, C[1]’s output is also selected. To improve noise

immunity, the FSM then accumulates the results of 128 comparisons between C[0] and

C[1]’s outputs before updating the control word W[1] in the SRAM. After that, the FSM

sets N to 2 and calibrates C[2]. This process repeats cyclically for C[1:31] so that the

comparators are all continuously calibrated in the background.
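The loop described above can be sketched as a behavioral simulation. It is a sketch under several assumptions, not the implemented FSM: the update rule (move W[N] by one step in the direction of the net discrepancy), the signed word range of ±15, the ideal reference comparator, the uniformly distributed input, and the Gaussian noise on the main comparator are all illustrative choices, and every name is hypothetical. Voltages are expressed in LSB.

```python
import random

STEP_LSB = 0.19    # calibration step quoted in the text
N_VOTES = 128      # comparisons accumulated before each update
WORD_LIMIT = 15    # assumed range of the signed control word


def calibrate(offsets_lsb, sweeps=40, noise_lsb=0.1, seed=1):
    """Return the residual offset of each transition level after calibration."""
    rng = random.Random(seed)
    n_levels = len(offsets_lsb)
    full_scale = n_levels + 1
    words = [0] * n_levels                        # W[N], signed step counts
    for _ in range(sweeps):                       # background operation
        for n in range(n_levels):                 # N = 1 .. 31
            ideal = n + 1.0                       # VR[N], shared with C[0]
            counter = 0                           # error counter
            for _ in range(N_VOTES):
                vin = rng.uniform(0.0, full_scale)
                threshold = ideal + offsets_lsb[n] - words[n] * STEP_LSB
                q_main = vin + rng.gauss(0.0, noise_lsb) > threshold
                q_ref = vin > ideal               # C[0], assumed ideal here
                if q_ref and not q_main:
                    counter += 1                  # main threshold too high
                elif q_main and not q_ref:
                    counter -= 1                  # main threshold too low
            if counter > 0:
                words[n] = min(WORD_LIMIT, words[n] + 1)
            elif counter < 0:
                words[n] = max(-WORD_LIMIT, words[n] - 1)
    return [offsets_lsb[n] - words[n] * STEP_LSB for n in range(n_levels)]


if __name__ == "__main__":
    initial = [1.5 if n % 2 == 0 else -1.5 for n in range(31)]
    residual = calibrate(initial)
    print(max(abs(r) for r in residual))
```

Accumulating the votes before each update means an isolated noise-induced discrepancy is usually outvoted by the discrepancies caused by the real offset.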

[Figure 6-22 flow chart: N = 1; clear error counter; compare Q[N] and Q[0]; update error counter; repeat until 128 comparisons are reached; update W[N]; if N = 31, return to N = 1, otherwise N = N+1.]

Note that, with the help of SQ, the FSM directly reads the ADC’s raw

thermometer output instead of its decoded binary output. This eliminates the need for a

5-b digital comparator and bypasses the possible complication introduced by bubble

error correction.

6.3.8 Clock and Power Distribution

Clock distribution is of crucial importance in high speed ADC design. The clock

buffers are sized for the same fan-out. Dummy loads are inserted in the clock tree to

compensate for unbalanced loads. To account for the finite delay through the

preamplifier, the clock of the T/H leads that of the comparators by one inverter delay.

Since the clock of the FSM and the decimator is divided down from the full-speed clock

and its phase relationship with the full-speed clock is unknown, multiple phases are

generated for selection through the on-chip serial interface.

The power supply is split into analog and digital domains. Decoupling capacitors are

inserted wherever there is spare area. To prevent noise coupling through the substrate,

a guard ring is inserted between the analog part and the digital part. The guard ring is

connected to a dedicated ground pad, separate from the analog and digital ground pads

[99].

6.4 Experimental Results

The prototype 5-bit flash ADC was fabricated in a 0.13-μm 1-poly 8-metal bulk

CMOS process and was measured in a QFN package. Figure 6-23 shows the chip

micrograph. The ADC core occupies an active area of 0.24 mm^2. Even without any

layout optimization, the calibration circuitry takes less than 10% of the core area.

Figure 6-23. Chip micrograph.

The ADC was powered from a 1.2-V supply. The reference voltages VRP and VRN

were set to 1.2 V and 0.8 V respectively, giving a differential full-scale input range of 0.8

V. The ADC’s decimated digital output was captured by a mixed-signal oscilloscope and

post-processed in Matlab.

The ADC’s static performance was evaluated by stepping the DC input voltage to

the ADC and recording the levels at which the output toggles. The peak-to-peak noise

observed during DC measurement is 2.5 mV, or roughly 0.1 LSB. To remove the effect

of noise during the DC measurement, the output codes were averaged to find the

transition levels. Figure 6-24 shows the measured INL and DNL with and without

calibration. When calibration is disabled, i.e., when all the SRAM bits are cleared to 0

through the serial interface, the ADC has an INL of -1.85/1.48 LSB and a DNL of

-1.00/2.75 LSB. Enabling calibration improves the INL to -0.21/0.17 LSB and the DNL to

-0.07/0.04 LSB. The low calibrated DNL and INL clearly demonstrate the efficacy of

the proposed calibration scheme.

[Figure 6-23 labels: FSM, SRAM, comparator, digital backend, bias, clock tree, R ladder; scale bar 100 μm.]

Figure 6-24. Measured ADC linearity. A) INL. B) DNL. [Both panels plot the error in LSB (-3 to 3) versus output code (0 to 32), with and without calibration.]

Figure 6-25 shows the test setup for dynamic performance evaluation. The single-ended input signal from a signal generator is first converted to differential by a passive balun before being fed to the ADC. Figure 6-26 shows the output spectra before and after enabling the calibration. The input signal is a full-scale 1.172-GHz sine wave, and the sample rate is 2.5 GS/s. Note that due to the decimation by 64, the fundamental tone is aliased to 0.3 MHz and the spectrum spans from DC to 19.53125 MHz. The SFDR improves by nearly 12 dB, from 27.3 dB to 39.2 dB, with calibration.

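Figure 6-25. Test setup for dynamic performance evaluation. [Blocks: power supply (VDD, VRP, VRN), clock CK, balun, test board with the ADC and an LVDS driver, mixed-signal scope, Matlab.]

Figure 6-26. Output spectra. A) W/ calibration. B) W/o calibration. [Both panels plot magnitude (dB, -70 to 0) versus frequency (MHz, 0 to 18).]

Figure 6-27. ENOB w/ and w/o calibration. [Plot: ENOB (bit, 2.0 to 5.0) versus sample rate (GS/s, 1.0 to 3.0).]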

Figure 6-27 shows the measured ENOB at various sample rates with the input frequency kept at around 1.2 GHz. Without calibration, the highest ENOB is below 3.5 bits. With calibration, the ENOB improves to 4.7 bits below 2 GS/s and remains above 4.4 bits until 2.5 GS/s. For all sample rates, the calibration improves the ENOB by more than 1.2 bits.

The ADC core (excluding peripheral IO and termination) consumes 50 mW, of

which about 34 mW is consumed by the digital backend and the clocking circuitry. Even

without resorting to power-saving architectures such as interpolation and folding, our

design achieves a competitive figure-of-merit (FoM) of 0.95 pJ/conversion. Table 6-2

shows our design’s performance summary alongside some recently published flash

ADCs. Note that designs with similar or better FoM all employ interpolating or folding

techniques except [74], which uses fully dynamic comparators and a more advanced

technology.
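The quoted figure-of-merit appears consistent with the common definition FoM = P / (2^ENOB × fs). The sketch below assumes that definition (the text does not state it explicitly) and recomputes two of the values from the reported power, ENOB, and sample rate; the function name is illustrative.

```python
def walden_fom(power_w: float, enob_bits: float, sample_rate_hz: float) -> float:
    """Energy per conversion step in joules: P / (2^ENOB * fs)."""
    return power_w / (2 ** enob_bits * sample_rate_hz)


if __name__ == "__main__":
    # This work: 50 mW, 4.4 bits ENOB, 2.5 GS/s.
    print(walden_fom(50e-3, 4.4, 2.5e9) * 1e12, "pJ/conversion")
    # Reference [78]: 50 mW, 5.3 bits ENOB, 2.7 GS/s.
    print(walden_fom(50e-3, 5.3, 2.7e9) * 1e12, "pJ/conversion")
```

Both results round to the values listed for these two designs (0.95 and 0.47 pJ/conversion), which supports the assumed definition.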

Table 6-2. Comparison with recently published work

Reference        [77]     [78]     [100]    [74]     [92]     [101]    [102]    [103]    This work
Interpolating    Yes      Yes      No       No       Yes      Yes      No       Yes      No
Folding          No       Yes      No       No       No       No       No       No       No
Resolution (bit) 6        6        4        5        6        6        6        6        5
Fs (GS/s)        3        2.7      4        1.75     3.5      1.6      5        1.2      2.5
INL (LSB)        0.2      0.73     0.24     0.39     1        0.42     0.7      0.6      -0.21/0.17
DNL (LSB)        0.2      0.53     0.15     0.38     0.5      0.49     0.6      0.4      -0.07/0.04
ENOB             5.8(1)   5.3      3.5      4.7      4.9      5.4      5.0      5.7      4.4
Process (nm)     90       90       180      90       90       130      65       130      130
VDD (V)          1.2      1        1.8/2.5  1        0.9      1.5      1.3      1.5      1.2
Power (mW)       90       50       608      7.6      98       180      320      90       50
Calibration      BG(2)    BG       FG(3)    FG       No       No       No       No       BG
Area (mm^2)      0.28     0.36     0.88     0.03     0.15     0.42     0.3      0.12     0.24
FoM (pJ/conv.)   2.3      0.47     13.6     0.17     0.95     2.6      1.97     1.4      0.95

(1) With 10-MHz input. (2) Background. (3) Foreground.

6.5 Summary

As technology scales, ADC-based serial links are becoming attractive for their

flexibility and scalability, and a flash ADC architecture is usually used for its high-

speed capability. One of the key challenges in ADC-based serial links is the power

consumption of high-speed ADCs, reduction of which is limited by the mismatch

between components. By compensating for the offset due to mismatch, calibration

allows the use of small components in the ADC without performance degradation, thus

enabling low-power designs. Running the calibration in the background provides the

additional benefit of tracking environmental changes and device aging.

Key metrics for background calibration techniques include accuracy,

convergence speed, area/power overhead, and performance penalty. A brief survey of

currently available background calibration techniques against these metrics suggests

the need for improvement, especially for high-speed ADCs. A novel digital background

ADC calibration scheme has been proposed in this Chapter. By employing a single

reference comparator and reconfiguring its threshold voltage, the proposed scheme

calibrates the transition levels of the main ADC sequentially. Compared to the

simultaneous calibration of existing solutions, this sequential operation leads to

extremely low hardware and design overhead. Its impact on the ADC performance is

also minimal.

The effectiveness of the proposed calibration scheme is experimentally

demonstrated by the significant improvements in the static and dynamic performance of

a 50-mW 2.5-GS/s 5-bit full-flash ADC in 0.13-μm CMOS technology. Although a flash

ADC is used as a prototype in this work, the concept can be readily extended to other

architectures. This technique should help pave the way for future low-power ADC-based

serial links.

CHAPTER 7 CONCLUSIONS

The exponential increase of functionality integrated on a single microprocessor

requires ever higher aggregate I/O bandwidth. Meanwhile, the whole chip power budget

has been kept practically flat at around 140 W due to packaging and thermal

management limitations. As a result, the power efficiency of off-chip signaling must be

greatly improved to maintain the scaling of microprocessors.

At multi-Gb/s, the channel imposes a challenging bandwidth bottleneck because

of its frequency-dependent loss induced by skin effect and dielectric dissipation. As a

result, high-speed signaling usually resorts to sophisticated equalization such as FFE

and DFE to compensate for the channel loss. Besides equalization, other essential

functions in a high-speed link include clocking and signaling. To improve the link power

efficiency, the implementation options for each function must be carefully evaluated in

terms of their impact on the total link power so that informed tradeoffs can be made.

This Dissertation represents such an effort from both the circuit and channel

perspectives. On the circuit side, different schemes for equalization, clock generation

and recovery, and signaling modes are compared. The advantages of DFE, injection-

locking-based clock generation, baud-rate CDR, and voltage-mode signaling with

differential termination are identified. On the channel side, air-cavity transmission-lines

are proposed to reduce the dielectric loss of electrical channels at high frequencies. The

results of this effort include a 6.25-Gb/s 0.6-pJ/bit active with a current-sharing frontend

and an air-cavity channel, a 4.5-Gb/s 3.2-pJ/bit receiver with baud-rate eye-tracking

CDR and majority-voting DFE, and a 5-Gb/s 0.75-pJ/bit transceiver in exclusive static

CMOS logic style, which is among the best reported to date.

As semiconductor technology scales, digital signal processing has become

more and more power efficient compared to its analog counterpart. In the field of high-

speed off-chip signaling, this has recently led to the interest in ADC-based links. One

critical challenge in the ADC-based link architecture is to reduce the power consumption

of the high-speed ADC, which is limited by the component mismatches among other

factors. This Dissertation presents a digital background calibration technique that

features minimal overhead and performance penalty. The efficacy of the calibration

scheme is experimentally confirmed with a 50-mW 2.5-GS/s 5-b full-flash ADC.

All the silicon results in this Dissertation are based on a 0.13-µm bulk CMOS

technology. However, there are no fundamental reasons that prevent the presented

techniques from being extended to more advanced technologies. The work in this

Dissertation should therefore help pave the way toward more power-efficient off-chip

signaling in future electronic systems.

LIST OF REFERENCES

[1] G. E. Moore, "Cramming more components onto integrated circuits," Electronics, vol. 38, no. 8, pp. 114-117, April 1965.

[2] G. Moore, "Progress in Digital Electronics," in IEEE Technical Digest of the Int’l Electron Devices Meeting, 1975.

[3] B. Casper, G. Balamurugan, J. Jaussi, J. Kennedy and M. Mansuri, "Future microprocessor interfaces: analysis, design and optimization," in IEEE Custom Integrated Circuit Conf., 2007.

[4] J. Nasrullah, A. Amin, W. Ahmad, Z. Qin, Z. Mushtaq, O. Javed, J. Yoon, L. Chua, D. Huang, B. Huang, M. Vichare, K. Ho and M. Rashid, "A terabit/s-throughput, SerDes-based interface for a third-generation 16 Core 32 thread chip-multithreading SPARC processor," in IEEE Symp. VLSI Circuits, 2008.

[5] "The International Technology Roadmap for Semiconductors (ITRS)," 2011. [Online]. Available: http://public.itrs.net/. [Accessed 2011].

[6] J. Poulton, R. Palmer, A. M. Fuller, T. Greer, J. Eyles, W. J. Dally and M. Horowitz, "A 14-mW 6.25-Gb/s Transceiver in 90-nm CMOS," IEEE J. Solid-State Circuits, vol. 42, no. 12, pp. 2745-2757, December 2007.

[7] K. Fukuda, H. Yamashita, G. Ono, R. Nemoto, E. Suzuki, T. Takemoto, F. Yuki and T. Saito, "A 12.3 mW 12.5 Gb/s complete transceiver in 65nm CMOS," in ISSCC Dig. Tech. Papers, San Francisco, 2010.

[8] M. Harwood, N. Warke, R. Simpson, T. Leslie, A. Amerasekera, S. Batty, D. Colman, E. Carr, V. Gopinathan, S. Hubbins, P. Hunt, A. Joy, P. Khandelwal, B. Killips, T. Krause, S. Lytollis, A. Pickering, M. Saxton, D. Sebastio and G. Swanson, "A 12.5Gb/s SerDes in 65nm CMOS Using a Baud-Rate ADC with Digital receiver Equalization and Clock Recovery," in IEEE ISSCC Dig. Tech. Papers, San Francisco, 2007.

[9] H. Johansson and C. Svensson, "Time resolution of NMOS sampling switches used on low-swing signals," IEEE J. Solid-State Circuits, vol. 33, no. 2, pp. 237-245, February 1998.

[10] H. Johnson and M. Graham, High-speed digital design: a handbook of black magic, New Jersey: Prentice-Hall, 1993.

[11] E. Bogatin, "Essential principles of signal integrity," IEEE Microwave Magazine, vol. 12, no. 5, pp. 34-41, August 2011.

[12] E. Bogatin, Signal integrity: simplified, New Jersey: Prentice Hall, 2003.

[13] W. J. Dally and J. Poulton, "Transmitter equalization for 4-Gbps signaling," IEEE Micro, vol. 17, no. 1, pp. 48-56, 1997.

[14] J. Jaussi, G. Balamurugan, D. Johnson, B. Casper, A. Martin, J. Kennedy, N. Shanbhag and R. Mooney, "8-Gb/s source-synchronous I/O link with adaptive receiver equalization, offset cancellation, and clock de-skew," IEEE J. Solid-State Circuits, vol. 40, no. 1, pp. 80 - 88, January 2005.

[15] S. Gondi and B. Razavi, "Equalization and clock and data recovery techniques for 10-Gb/s CMOS serial-link receivers," IEEE J. Solid-State Circuits, vol. 42, no. 9, pp. 1999-2011, 2007.

[16] T. Beukema, M. Sorna, K. Selander, S. Zier, B. Ji, P. Murfet, J. Mason, W. Rhee, H. Ainspan, B. Parker and M. Beakes, "A 6.4Gb/s CMOS SerDes core with feed-forward and decision-feedback equalization," IEEE J. Solid-State Circuits, vol. 40, no. 12, pp. 2633-2645, 2005.

[17] R. Payne, P. Landman, B. Bhakta, S. Ramaswamy, S. Wu, J. D. Powers, M. U. Erdogan, A. Yee, R. Gu, L. Wu, Y. Xie, B. Parthasarathy, K. Brouse, W. Mohammed, K. Heragu, V. Gupta, L. Dyson and W. Lee, "A 6.25-Gb/s binary transceiver in 0.13-um CMOS for serial data transmission across high loss legacy backplane channels," IEEE J. Solid-State Circuits, vol. 40, no. 12, pp. 2646-2657, December 2005.

[18] A. Emami-Neyestanak, A. Varzaghani, J. Bulzacchelli, A. Rylyakov, C.-K. Yang and D. Friedman, "A 6.0 mW 10.0Gb/s receiver with switched-capacitor summation DFE," IEEE J. Solid-State Circuits, vol. 42, no. 4, pp. 889-896, 2007.

[19] S. Kasturia and J. H. Winters, "Techniques for high-speed implementation of nonlinear cancellation," IEEE J. Sel. Areas Commun., vol. 9, no. 5, pp. 711-717, June 1991.

[20] G. Balamurugan, J. Kennedy, G. Banerjee, J. Jaussi, M. Mansuri, F. O'Mahony, B. Casper and R. Mooney, "A scalable 5-15Gbps, 14-75mW low power I/O transceiver in 65nm CMOS," in IEEE Symp. VLSI Circuits, 2007.

[21] F. O'Mahony, S. Shekhar, M. Mansuri, G. Balamurugan, J. E. Jaussi, J. Kennedy, B. Casper, D. J. Allstot and R. Mooney, "A 27Gb/s forwarded-clock I/O receiver using an injection-locked LC-DCO in 45nm CMOS," in IEEE ISSCC Dig. Tech. Papers, San Francisco, 2008.

[22] K. Hu, R. Bai, T. Jiang, C. Ma, A. Ragab, S. Palermo and P. Y. Chiang, "0.16-0.25 pJ/bit, 8 Gb/s near-threshold serial link receiver with super-harmonic injection-locking," IEEE J. Solid-State Circuits, vol. 47, no. 8, pp. 1842-1853, 2012.

[23] B. Razavi, "A study of injection locking and pulling in oscillators," IEEE J. Solid-State Circuits, vol. 39, no. 9, pp. 1415-1424, 2004.

[24] J. Lee and H. Wang, "Study of subharmonically injection-locked PLLs," IEEE J. Solid-State Circuits, vol. 44, no. 5, pp. 1539-1553, 2009.

[25] J. Chen, A. Hu, Y. Fan and R. Bashirullah, "Noise suppression in injection-locked ring oscillators," Electronics Letters, vol. 48, no. 6, pp. 323-324, 2012.

[26] M. Hsieh and G. Sobelman, "Architectures for multi-gigabit wire-linked clock and data recovery," IEEE Circuits and Systems Magazine, vol. 8, no. 4, pp. 45-57, 2008.

[27] C. R. Hogge, "A self-correcting clock recovery circuit," IEEE J. Lightwave Tech., vol. 3, no. 12, pp. 1312-1314, 1985.

[28] J. D. H. Alexander, "Clock recovery from binary signals," Electronics Letters, vol. 11, no. 22, pp. 541-542, 30 October 1975.

[29] Y. M. Greshishchev, P. Schvan, J. L. Showell, M. Xu, J. J. Ojha and J. E. Rogers, "A fully integrated SiGe receiver IC for 10-Gb/s data rate," IEEE J. Solid-State Circuits, vol. 35, no. 12, pp. 1949-1957, 2000.

[30] J. Lee and B. Razavi, "A 40 Gb/s clock and data recovery circuit in 0.18um CMOS technology," in IEEE ISSCC Dig. Tech. Papers, San Francisco, 2003.

[31] T. Toifl, C. Menolfi, P. Buchmann, C. Hagleitner, M. Kossel, T. Morf, J. Weiss and M. Schmatz, "A 72mW 0.03mm2 inductorless 40Gb/s CDR in 65nm SOI CMOS," in IEEE ISSCC Dig. Tech. Papers, San Francisco, 2007.

[32] C. Kromer, G. Sialm, C. Menolfi, M. Schmatz, F. Ellinger and H. Jackel, "A 25-Gb/s CDR in 90-nm CMOS for high-density interconnects," IEEE J. Solid-State Circuits, vol. 41, no. 12, pp. 2921-2929, December 2006.

[33] B. K. Casper, M. Haycock and R. Mooney, "An accurate and efficient analysis method for multi-Gb/s chip-to-chip signaling schemes," in IEEE Symp. VLSI Circuits, 2002.

[34] H. Hatamkhani and C.-K. K. Yang, "A study of the optimal data rate for minimum power of I/Os," IEEE Trans. Circuits and Syst. II, vol. 53, no. 11, pp. 1230-1234, 2006.

[35] M.-S. Chen, Y.-N. Shih, C.-L. Lin, H.-W. Hung and J. Lee, "A Fully-Integrated 40-Gb/s Transceiver in 65-nm CMOS Technology," IEEE J. Solid-State Circuits, vol. 47, no. 3, pp. 627-640, March 2012.

[36] S. Hall and H. Heck, Advanced signal integrity for high-speed digital designs, New Jersey: John Wiley & Sons, 2009.

[37] B. Kim, Y. Liu, T. Dickson, J. Bulzacchelli and D. Friedman, "A 10-Gb/s Compact Low-Power Serial I/O With DFE-IIR Equalization in 65-nm CMOS," IEEE J. Solid-State Circuits, vol. 44, no. 12, pp. 3526-3538, 2009.

[38] T. Tanahashi, M. Kurisu, H. Yamaguchi, T. Nedachi, M. Arai, S. Tomari, T. Matsuzaki, K. Nakamura, M. Fukaishi, S. Naramoto and T. Sato, "A 2 Gb/s 21 CH low-latency transceiver circuit for inter-processor communication," in IEEE ISSCC Dig. Tech. Papers, San Francisco, 2001.

[39] K.-L. Wong, H. Hatamkhani, M. Mansuri and C.-K. Yang, "A 27-mW 3.6-Gb/s I/O transceiver," IEEE J. Solid-State Circuits, vol. 39, no. 4, pp. 602-612, April 2004.

[40] D. M. Pozar, Microwave engineering, New Jersey: John Wiley & Sons, 1998.

[41] M. V. Schneider, "Microstrip lines for microwave integrated circuits," Bell Syst. Tech. Journal, vol. 48, no. 5, pp. 1421-1444, 1969.

[42] G. Balamurugan, J. Kennedy, G. Banerjee, J. Jaussi, M. Mansuri, F. O'Mahony, B. Casper and R. Mooney, "A scalable 5–15 Gbps, 14–75 mW low-power I/O transceiver in 65 nm CMOS," IEEE J. Solid-State Circuits, vol. 43, no. 4, pp. 1010-1019, 2008.

[43] T. Spencer, Y. Chen, R. Saha and P. Kohl, "Stabilization of the thermal decomposition of poly(propylene carbonate) through copper ion incorporation and use in self-patterning," Journal of Electronic Materials, pp. 1350-1363, 2011.

[44] D. Z. Turker, A. Rylyakov, D. Friedman, S. Gowda and E. Sanchez-Sinencio, "A 19Gb/s 38mW 1-tap speculative DFE receiver in 90nm CMOS," in IEEE Symp. VLSI Circuits, 2009.

[45] W. R. Eisenstadt and Y. Eo, "S-parameter-based IC interconnect transmission line characterization," IEEE Trans. Components, Hybrids, and Manufacturing Technology, vol. 15, no. 4, pp. 483-490, 1992.

[46] V. Balan, J. Caroselli, J.-G. Chern, C. Chow, R. Dadi, C. Desai, L. Fang, D. Hsu, P. Joshi, H. Kimura, C. Liu, T.-W. Pan, R. Park, C. You, Y. Zeng, E. Zhang and F. Zhong, "A 4.8-6.4-Gb/s serial link for backplane applications using decision feedback equalization," IEEE J. Solid-State Circuits, vol. 40, no. 9, pp. 1957-1967, 2005.

[47] K. H. Mueller and M. Muller, "Timing recovery in digital synchronous data receivers," IEEE Trans. on Communications, vol. 24, no. 5, pp. 516-531, May 1976.

[48] A. Emami-Neyestanak, S. Palermo, H.-C. Lee and M. Horowitz, "CMOS transceiver with baud rate clock recovery for optical interconnects," in IEEE Symp. VLSI Circuits, 2004.

[49] F. Musa and A. C. Carusone, "A baud-rate timing recovery scheme with a dual-function analog filter," IEEE Trans. Circuits Syst. II, vol. 53, no. 12, pp. 1393-1397, December 2006.

[50] R. S. Kajley, P. Hurst and J. E. C. Brown, "A mixed-signal decision-feedback equalizer that uses a look-ahead architecture," IEEE J. Solid-State Circuits, vol. 32, no. 3, pp. 450-459, 1997.

[51] W. Fang, "Accurate analytical delay expressions for ECL and CML circuits and their applications to optimizing high-speed bipolar circuits," IEEE J. Solid-State Circuits, vol. 25, no. 2, pp. 572-583, 1990.

[52] T. E. Collins, V. Manan and S. I. Long, "Design analysis and circuit enhancements for high-speed bipolar flip-flops," IEEE J. Solid-State Circuits, vol. 40, no. 5, pp. 1166-1174, 2005.

[53] A. Garg, A. C. Carusone and S. P. Voinigescu, "A 1-tap 40-Gb/s look-ahead decision feedback equalizer in 0.18-um SiGe BiCMOS technology," IEEE J. Solid-State Circuits, vol. 41, no. 10, pp. 2224-2232, October 2006.

[54] A. Kapoor, Y. Hu and R. Bashirullah, "Design and optimization of high-speed CML gates using a current-centric LE model," to appear in IEEE Trans. Circuits & Syst. I.

[55] C. Kromer, G. Sialm, C. Menolfi, M. Schmatz, F. Ellinger and H. Jackel, "A 25-Gb/s CDR in 90-nm CMOS for high-density interconnects," IEEE J. Solid-State Circuits, vol. 41, no. 12, pp. 2921-2929, December 2006.

[56] M. G. Chen and J. K. Notthoff, "A 3.3-V 21-Gb/s PRBS generator in AlGaAs/GaAs HBT technology," IEEE J. Solid-State Circuits, vol. 35, no. 9, pp. 1266-1270, 2000.

[57] E. Laskin and S. P. Voinigescu, "A 60 mW per lane, 4×23-Gb/s 2^7-1 PRBS generator," IEEE J. Solid-State Circuits, vol. 41, no. 10, pp. 2198-2208, 2006.

[58] T. O. Dickson, E. Laskin, I. Khalid, R. Beerkens, J. Xie, B. Karajica and S. P. Voinigescu, "An 80-Gb/s 2^31-1 pseudorandom binary sequence generator in SiGe BiCMOS technology," IEEE J. Solid-State Circuits, vol. 40, no. 12, pp. 2735-2745, 2005.

[59] H. Knapp, M. Wurzer, T. F. Meister, J. Bock and K. Aufinger, "40 Gbit/s 2^7-1 PRBS generator IC in SiGe bipolar technology," in Proc. Bipolar/BiCMOS Circuits and Technology Meeting, Monterey, CA, 2002.

[60] H. Knapp, M. Wurzer, W. Perndl, K. Aufinger, J. Bock and T. F. Meister, "100-Gb/s 2^7-1 and 54-Gb/s 2^11-1 PRBS generators in SiGe bipolar technology," IEEE J. Solid-State Circuits, vol. 40, no. 10, pp. 2118-2125, 2005.

[61] K. Fukuda, H. Yamashita, F. Yuki, M. Yagyu, R. Nemoto, T. Takemoto, T. Saito, N. Chujo, K. Yamamoto, H. Yanai and A. Hayashi, "An 8Gb/s transceiver with 3X-oversampling 2-threshold eye-tracking CDR circuit for -36.8dB-loss backplane," in IEEE ISSCC Dig. Tech. Papers, San Francisco, 2008.

[62] M.-J. E. Lee, W. J. Dally and P. Chiang, "Low-power area-efficient high-speed I/O circuit techniques," IEEE J. Solid-State Circuits, vol. 35, no. 11, pp. 1591-1599, 2000.

[63] S. Quan, F. Zhong, W. Liu, et al., "A 1.0625-to-14.025Gb/s multimedia transceiver with full-rate source-series-terminated transmit driver and floating-tap decision-feedback equalizer in 40nm CMOS," in ISSCC Dig. Tech. Papers, San Francisco, 2011.

[64] R. Palmer, J. Poulton, W. J. Dally, J. Eyles, A. M. Fuller, T. Greer, M. Horowitz, M. Kellam, F. Quan and F. Zarkeshvari, "A 14mW 6.25Gb/s transceiver in 90nm CMOS for serial chip-to-chip communications," in ISSCC Dig. Tech. Papers, San Francisco, 2007.

[65] R. Farjad-Rad, A. Nguyen, J. M. Tran, T. Greer, J. Poulton, W. J. Dally, J. H. Edmondson, R. Senthinathan, R. Rathi, M.-J. E. Lee and H. Ng, "A 33-mW 8-Gb/s CMOS clock multiplier and CDR for highly integrated I/Os," IEEE J. Solid-State Circuits, vol. 39, no. 9, pp. 1553-1561, 2004.

[66] K. Hu, T. Jiang, J. Wang, F. O'Mahony and P. Y. Chiang, "A 0.6 mW/Gb/s, 6.4-7.2 Gb/s serial link receiver using local injection-locked ring oscillators in 90 nm CMOS," IEEE J. Solid-State Circuits, vol. 45, no. 4, pp. 899-908, 2010.

[67] S. Shekhar, M. Mansuri, F. O'Mahony, G. Balamurugan, J. E. Jaussi, J. Kennedy, D. J. Allstot, R. Mooney and B. Casper, "Strong injection locking in low-Q LC oscillators: modeling and application in a forwarded-clock I/O receiver," IEEE Trans. Circuits and Syst. I: Regular Papers, vol. 56, no. 8, pp. 1818-1829, 2009.

[68] F. O'Mahony, J. E. Jaussi, J. Kennedy, G. Balamurugan, M. Mansuri, C. Roberts, S. Shekhar, R. Mooney and B. Casper, "A 47×10 Gb/s 1.4mW/Gb/s parallel interface in 45 nm CMOS," IEEE J. Solid-State Circuits, vol. 45, no. 12, pp. 2828-2837, 2010.

[69] J. Cao, B. Zhang, U. Singh, D. Cui, A. Vasani, A. Garg, W. Zhang, N. Kocaman, D. Pi, B. Raghavan, H. Pan, I. Fujimori and A. Momtaz, "A 500mW digitally-calibrated AFE in 65nm CMOS for 10Gb/s serial links over backplane and multimode fiber," in IEEE ISSCC Dig. Tech. Papers, San Francisco, 2009.

[70] H. Yamaguchi, H. Tamura, Y. Doi, Y. Tomita, T. Hamada, M. Kibune, S. Ohmoto, K. Tateishi, O. Tyshchenko, A. Sheikholeslami, T. Higuchi, J. Ogawa, T. Saito, H. Ishida and K. Gotoh, "A 5Gb/s transceiver with an ADC-based feedforward CDR and CMA adaptive equalizer in 65nm CMOS," in IEEE ISSCC Dig. Tech. Papers, San Francisco, 2010.

[71] P. Kinget, "Device mismatch and tradeoffs in the design of analog circuits," IEEE J. Solid-State Circuits, vol. 40, no. 6, pp. 1212 - 1224, June 2005.

[72] I. Young, "Analog mixed-signal circuits in advanced nano-scale CMOS technology for microprocessors and SoCs," in Proceedings of the ESSCIRC, 2010.

[73] C. Chen, M. Le and K. Kim, "A low power 6-bit flash ADC with reference voltage and common-mode calibration," IEEE J. Solid-State Circuits, vol. 44, no. 4, pp. 1041-1046, 2009.

[74] B. Verbruggen, P. Wambacq, M. Kuijk and G. Van der Plas, "A 7.6 mW 1.75 GS/s 5 bit flash A/D converter in 90 nm digital CMOS," in IEEE Symp. VLSI Circuits, 2008.

[75] M. Flynn, C. Donovan and L. Sattler, "Digital calibration incorporating redundancy of flash ADCs," IEEE Trans. Circuits Syst. II, vol. 50, no. 5, pp. 205 - 213, May 2003.

[76] S. Tsukamoto, I. Dedic, T. Endo, K. Kikuta, K. Goto and O. Kobayashi, "A CMOS 6-b, 200 MSample/s, 3 V-supply A/D converter for a PRML read channel LSI," IEEE J. Solid-State Circuits, vol. 31, no. 11, pp. 1831-1836, 1996.

[77] M. Kijima, K. Ito, K. Kamei and S. Tsukamoto, "A 6b 3GS/s Flash ADC with Background Calibration," in IEEE Custom Integrated Circuits Conf., 2009.

[78] Y. Nakajima, A. Sakaguchi, T. Ohkido, N. Kato, T. Matsumoto and M. Yotsuyanagi, "A background self-calibrated 6b 2.7 GS/s ADC with cascade-calibrated folding-interpolating architecture," IEEE J. Solid-State Circuits, vol. 45, no. 4, pp. 707-718, April 2010.

[79] H. Ploeg, G. Hoogzaad, H. Termeer, M. Vertregt and R. Roovers, "A 2.5-V 12-b 54-Msample/s 0.25-um CMOS ADC in 1-mm2 with mixed-signal chopping and calibration," IEEE J. Solid-State Circuits, vol. 36, no. 12, pp. 1859-1867, December 2001.

[80] S. Jamal, D. Fu, N. Chang, P. Hurst and S. Lewis, "A 10-b 120-Msample/s time-interleaved analog-to-digital converter with digital background calibration," IEEE J. Solid-State Circuits, vol. 37, no. 12, pp. 1618-1627, December 2002.

[81] C. Huang and J. Wu, "A background comparator calibration technique for flash analog-to-digital converters," IEEE Trans. Circuits Syst., vol. 52, no. 9, pp. 1732-1740, September 2005.

[82] D. Fu, K. C. Dyer, S. H. Lewis and P. J. Hurst, "A digital background calibration technique for time-interleaved analog-to-digital converters," IEEE J. Solid-State Circuits, vol. 33, no. 12, pp. 1904 - 1911, 1998.

[83] S. Tsukamoto, I. Dedic, T. Endo, K. Kikuta, K. Goto and O. Kobayashi, "A CMOS 6-b, 200 MSample/s, 3 V-supply A/D converter for a PRML read channel LSI," IEEE J. Solid-State Circuits, vol. 31, no. 11, pp. 1831 - 1836, 1996.

[84] J. Ingino and B. Wooley, "A continuously calibrated 12-b, 10-MS/s, 3.3-V A/D converter," IEEE J. Solid-State Circuits, vol. 33, no. 12, pp. 1920 - 1931, 1998.

[85] Y. Chiu, C. Tsang, B. Nikolic and P. Gray, "Least-mean-square adaptive digital background calibration of pipelined analog-to-digital converters," IEEE Trans. Circuits Syst., vol. 51, no. 1, pp. 38-46, 2004.

[86] X. Wang, P. J. Hurst and S. H. Lewis, "A 12-bit 20-MSample/s pipelined analog-to-digital converter with nested digital background calibration," IEEE J. Solid-State Circuits, vol. 39, no. 11, pp. 1799-1808, November 2004.

[87] C. Tsang, Y. Chiu, J. Vanderhaegen, S. Hoyos, C. Chen, R. Brodersen and B. Nikolic, "Background ADC calibration in digital domain," in IEEE Custom Integrated Circuits Conf., 2008.

[88] H. Wang, X. Wang, P. J. Hurst and S. H. Lewis, "Nested digital background calibration of a 12-bit pipelined ADC without an input SHA," IEEE J. Solid-State Circuits, vol. 44, no. 10, pp. 2780-2789, 2009.

[89] J. McNeill, M. C. W. Coln and B. J. Larivee, "'Split ADC' architecture for deterministic digital background calibration of a 16-bit 1-MS/s ADC," IEEE J. Solid-State Circuits, vol. 40, no. 12, pp. 2437-2445, 2005.

[90] J. Doernberg, P. Gray and D. Hodges, "A 10-bit 5-Msample/s CMOS two-step flash ADC," IEEE J. Solid-State Circuits, vol. 24, no. 4, pp. 241-249, 1989.

[91] K. Uyttenhove and M. Steyaert, "A 1.8-V 6-bit 1.3-GHz flash ADC in 0.25-μm CMOS," IEEE J. Solid-State Circuits, vol. 38, no. 7, pp. 1115 - 1122, July 2003.

[92] K. Deguchi, N. Suwa, M. Ito, T. Kumamoto and T. Miki, "A 6b 3.5GS/s 0.9V 98mW flash ADC in 90nm CMOS," IEEE J. Solid-State Circuits, vol. 43, no. 10, pp. 2303-2310, 2008.

[93] M. Choi and A. Abidi, "A 6b 1.3GS/s A/D converter in 0.35um CMOS," IEEE J. Solid-State Circuits, vol. 36, no. 12, pp. 1847-1858, 2001.

[94] R. J. van de Plassche, Integrated analog-to-digital and digital-to-analog converters, Boston: Kluwer, 1994.

[95] P. Allen and D. Holberg, CMOS analog circuit design, New York: Oxford, 2002.

[96] B. Razavi, Design of analog CMOS integrated circuits, New York: McGraw-Hill, 2001.

[97] W. Evans, E. Naviasky, H. Tang and B. Allison, "Comparator metastability analysis," 1 January 2011. [Online]. Available: http://www.designers-guide.org/Analysis/metastability.pdf. [Accessed 1 July 2012].

[98] Y. Akazawa, A. Iwata, T. Wakimoto, T. Kamato, H. Nakamura and H. Ikawa, "A 400MSPS 8b flash AD conversion LSI," in IEEE ISSCC Dig. Tech. Papers, San Francisco, 1987.

[99] M. Ingels and M. S. J. Steyaert, Integrated CMOS circuits for optical communications, New York: Springer-Verlag, 2004.

[100] S. Park, Y. Palaskas and M. Flynn, "A 4GS/s 4b flash ADC in 0.18μm CMOS," IEEE J. Solid-State Circuits, vol. 42, no. 9, pp. 1865-1872, September 2007.

[101] A. Ismail and M. Elmasry, "A 6bit 1.6GS/s low power wideband flash ADC converter in 0.13um CMOS," IEEE J. Solid-State Circuits, vol. 43, no. 9, pp. 1982-1990, September 2008.

[102] M. Choi, J. Lee, J. Lee and H. Son, "A 6-bit 5-GSample/s Nyquist A/D Converter in 65nm CMOS," in Symp. VLSI Circuits, 2008.

[103] C. Sandner, M. Clara, A. Santner, T. Hartig and F. Kuttner, "A 6bit 1.2GS/s low power flash ADC in 0.13um CMOS," IEEE J. Solid-State Circuits, vol. 40, no. 7, pp. 1499-1505, July 2005.

[104] H. Hatamkhani and C.-K. K. Yang, "A study of the optimal data rate for minimum power of I/Os," IEEE Trans. Circuits and Systems II, vol. 53, no. 11, pp. 1230-1234, November 2006.

[105] A. Deutsch, C. Surovic, R. Krabbenhoft, G. Kopcsay and B. Chamberlin, "Prediction of losses caused by roughness of metallization in printed-circuit boards," IEEE Trans. Advanced Packaging, vol. 30, no. 2, pp. 279-287, 2007.

[106] P. M. Figueiredo, P. Cardoso, A. Lopes, C. Fachada, N. Hamanishi, K. Tanabe and J. Vital, "A 90 nm CMOS 1.2 V 6b 1 GS/s two-step subranging ADC," in IEEE ISSCC Dig. Tech. Papers, San Francisco, 2006.

[107] X. Wang, P. Hurst and S. Lewis, "A 12-bit 20-MSample/s pipelined analog-to-digital converter with nested digital background calibration," IEEE J. Solid-State Circuits, vol. 39, no. 11, pp. 1799-1808, November 2004.

[108] W. Evans, E. Naviasky, H. Tang and B. Allison, "Comparator metastability analysis," 1 January 2011. [Online]. Available: http://www.designers-guide.org/Analysis/metastability.pdf. [Accessed 1 October 2011].

[109] H. Chen, I. Chen, H. Tseng and H. Chen, "1-GS/s 6-bit two-channel two-step ADC in 0.13-μm CMOS," IEEE J. Solid-State Circuits, vol. 44, no. 11, pp. 3051-3059, 2009.

[110] G. Balamurugan, F. O'Mahony, M. Mansuri, J. E. Jaussi, J. T. Kennedy and B. Casper, "A 5-to-25Gb/s 1.6-to-3.8mW/(Gb/s) reconfigurable transceiver in 45nm CMOS," in ISSCC Dig. Tech. Papers, San Francisco, 2010.

BIOGRAPHICAL SKETCH

Jikai Chen received the BSEE and MSEE degrees from East China Normal University, Shanghai, China, and Zhejiang University, Hangzhou, China, respectively. He received his PhD from the University of Florida, Gainesville, FL, in 2013. From 2003 to 2004, he was an analog IC design engineer with Realsil Microelectronics, working on PLL-based clock buffers. From 2004 to 2006, he was a senior analog IC design engineer with Philips Semiconductors (now NXP), designing high-voltage LCD drivers. From 2006 to 2012, he was a research assistant with the Integrated Circuit Research lab of the University of Florida, where his research focused on low-power circuit design for high-speed serial links. Since 2012, he has been with Texas Instruments as an analog circuit designer working on high-speed circuit design for optical communications.