
Efficient Self-Timed Interfaces for Crossing Clock Domains

by

Ajanta Chakraborty

B.Eng. Bhopal Engineering College, 2001

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF

THE REQUIREMENTS FOR THE DEGREE OF

Master of Science

in

THE FACULTY OF GRADUATE STUDIES

(Department of Computer Science)

We accept this thesis as conforming to the required standard

The University of British Columbia

August 2003

© Ajanta Chakraborty, 2003

Abstract

With increasing integration densities, large chip designs are commonly partitioned into

multiple clock domains. While the computation within each individual domain may be

synchronous, the interfaces between these domains often use asynchronous methods. One

such approach is the STARI technique[Gre93, Gre95] where a self-timed FIFO compensates

for clock-skew between the sender and receiver. This dissertation presents implementations

of STARI where the FIFO consists of a single, handshaking stage. I start with the simplest

case where the sender and receiver operate at exactly the same frequency with an unknown

skew. I then generalize this design for links with clocks whose frequencies are rational

multiples of each other, clocks whose frequencies are closely matched, and arbitrary clocks.

In each of these cases, the STARI interface can exploit the stability of typical clocks to

achieve low latencies and negligible probabilities of synchronization failure using very simple

hardware. I have designed and tested a proof-of-concept chip fabricated with the TSMC

0.18µ CMOS process for the scenario where clocks of different domains are exactly matched

in frequency. The tests have demonstrated our claims about the skew tolerance of the design

and I am now in the process of designing the interface for further generalizations.


Contents

Abstract
Contents
List of Figures
Acknowledgements
1 Introduction
1.1 Multiple Clock Domain Scenarios
1.2 Contributions
1.3 Overview
2 Related Work
2.1 Skew and Jitter
2.2 Generation and Distribution of Clocks
2.2.1 Clock Generation
2.2.2 Clock Distribution
2.3 Skew Compensation Techniques
2.3.1 GALS
2.3.2 Synchronizing Buffers
2.3.3 Mesochronous Designs
2.3.4 Plesiochronous Designs
2.4 STARI
3 MinSTARI: A Single Stage FIFO Interface
3.1 Description
3.2 Skew Tolerance
3.3 Initialization
3.3.1 Maximum Robustness
3.3.2 Minimum Latency
4 Implementation and Test Results
4.1 Implementation
4.1.1 Design Overview
4.1.2 Implementation Details
4.2 Test Results
5 Generalizations: Rational, Close and Arbitrary Clocks
5.1 Rational Clock Frequency Multiples
5.2 Plesiochronous Interfaces
5.3 Arbitrary Clock Frequencies
5.4 A FIFO Interface
6 Conclusion
Bibliography

List of Figures

1.1 Exactly Matched Clock scenario: A Chip Multiprocessor
1.2 Rationally Related Clock scenario: A “typical” wireless SOC application
1.3 Nearly Matched Clock scenario
2.1 Clock Generation Techniques
2.2 A GALS System
2.3 A GALDS System
2.4 Communication Scheme with Resampling
2.5 Mesochronous Timing
2.6 FIFO with Local Clock Control
2.7 Globally Updated Mesochronous Method
2.8 Plesiochronous Retiming
2.9 Source Synchronous Communication
3.1 Interface as Latch with 2 clock inputs
3.2 The Single Stage FIFO
3.3 Clock Timing For The FIFO
3.4 Latch Controller State Diagram
3.5 A traditional C-element
3.6 The Latch Controller
3.7 Drifting Skew
3.8 Five timing Scenarios
4.1 Design Setup
4.2 LFSR
4.3 Shift Register
4.4 DAC
4.5 Modified Yuan-Svensson Latch
4.6 en generation
4.7 Receiver Shift Register
4.8 Error Detection Circuit
4.9 Test Setup
4.10 Skew Tolerance Window
4.11 Phase Modulation Tolerance at 20MHz
4.12 Phase Modulation Tolerance at 30MHz
5.1 An Interface with Rational Clocks
5.2 Exploiting Periodic jitter
5.3 A Miss Detector
5.4 Receiver Frequency vs. Cycle Time constraint
5.5 Interface for Nearly Matched Clocks
5.6 Interface for Arbitrary Clocks
5.7 Implementing a FIFO interface
5.8 Symmetry in FIFO interface
5.9 Timing scenarios with nearly full FIFO-R

Acknowledgements

This work has been possible through the direct and indirect support of a variety of people. I would like to sincerely thank my supervisor, Dr. Mark Greenstreet, for his unparalleled guidance, encouragement and enthusiasm throughout my stay at UBC as a graduate student. As I was a total beginner in the field of VLSI design, the help of colleagues and friends was invaluable. Among them are Brian Winters, who answered my unending questions and helped with design; Roberto Rosales, who taught me the basics of chip testing and relentlessly helped me with testing; Roozbeh, who helped with various CAD tools; and the entire System-on-Chip lab of the ECE department at UBC, which gave me the opportunity to use their test lab and equipment.

I have no words to express my gratitude to my loving and doting parents, who have taught me to take on new challenges with vigor, and also to my brother, Alexy, and my sister, Jhinuk, without whom I would not have been here today. Thanks, Rinkesh, for always being there for me and for the constant support and encouragement I have received from you.

Ajanta Chakraborty

The University of British Columbia

August 2003


Chapter 1

Introduction

1.1 Multiple Clock Domain Scenarios

As we move into very deep submicron technology, increasing integration densities and clock frequencies drive designers to implement increasing numbers of on-chip clock domains. This keeps the skew in clock and data within a domain small enough to ensure reliable transfer of data. As tight timing tolerances cannot be guaranteed between timing domains, communication between domains often takes place at a rate slower than the system clock (e.g., one transfer for every two cycles of the clock) or uses some kind of mixed synchronous/asynchronous design. Various multiple clock domain scenarios can be categorized as:

1. Exactly matched clock frequencies

2. Rationally related clock frequencies

3. Nearly matched clock frequencies

4. Arbitrary clock frequencies

1. Exactly matched clock frequencies

Figure 1.1: Exactly Matched Clock scenario: A Chip Multiprocessor

Figure 1.2: Rationally Related Clock scenario: A “typical” wireless SOC application

Figure 1.3: Nearly Matched Clock scenario

In the scenario with exactly matched clocks, as shown in Figure 1.1, all the domains receive their clocks from the same source and are thus operating at exactly the same frequency.

This is a typical situation in high-performance designs such as microprocessors for general

purpose computers. In these designs, clock and data skew [HN01] arise from a variety of

sources [BDM02]. First, scaling trends with decreasing feature sizes decrease gate delays

causing a corresponding increase in clock frequencies. For high performance designs, clock

frequencies are further increased by architectural trends favoring deeper pipelines with fewer

gates per pipeline stage.

Wire delays within a domain with a fixed number of transistors remain relatively constant with scaling. For long wires, the performance gap relative to gates becomes very severe with shrinking feature size.

Thus, although within each domain the clock skews are relatively small allowing op-

eration at high clock rates, between clock domains, skews may be much larger. For ex-

ample two domains, which are at separate leaves of a clock tree distribution network,

might communicate with each other. Although circuits in these two domains may be

physically adjacent, there may be large, unpredictable phase differences between their

clock signals. Traditionally, designers target clock skews of about 10% of the clock pe-

riod [KB+01, RM+01, KA+01, IM02, KN+02]. Long wire delays and variations in buffer de-

lay make these targets challenging. Accordingly, designers resort to careful layout [BC+99]

and active skew compensation [TR+00]. Likewise, fabrication engineers reduce wiring de-

lays by deploying copper wires [Dav99] to lower resistance and low-k dielectrics [GA+02]

to reduce capacitance. These approaches come at a cost: circuit and layout approaches to

lowering clock skew often do so at an increase in circuit complexity and power consumption;

improvements in materials are limited by physical constants.

An alternate approach is to devise measures which compensate for skew when data is transferred between domains. Source synchronous designs can be of great use in such situations. The designs presented here are based on one such source synchronous communication technique, namely the STARI technique. Here, a self-timed FIFO is placed between the communicating domains. The self-timed FIFO is used to compensate for skew between two synchronous systems operating with a common clock. Section 2.4 describes the STARI method in greater detail.

2. Rationally related clock frequencies

In the scenario with rationally related clock frequencies, the clocks of the different domains

operate at frequencies that are rational multiples of each other. Figure 1.2 depicts a design

where different portions of the chip operate at different frequencies, and the various clocks

are derived from a common source. This commonly occurs in system-on-chip designs where

different IP blocks may be designed with different target clock frequencies or in multi-rate

digital-signal-processing designs. Although multiple clock frequencies are used, these fre-

quencies are exact rational multiples of each other, and these ratios are typically known in

advance or are determined by pre-designed operating modes of the design. Knowing the

exact relationship between the various clock frequencies enables the design of an interface

that operates with low latency and without synchronization.
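To make the last point concrete, the following sketch (an illustration only, not part of the thesis design) lists where each sender clock edge falls within the receiver's clock period when both clocks are derived from a common source. The function name `sender_edge_phases`, the assumption of ideal jitter-free clocks, and the alignment of both clocks at time zero are all illustrative assumptions; 700 MHz and 300 MHz are simply example frequencies.

```python
from fractions import Fraction
from math import gcd


def sender_edge_phases(sender_mhz: int, receiver_mhz: int) -> list[Fraction]:
    """Phase of each sender edge inside the receiver period, over one repeat interval.

    Assumes ideal clocks derived from a common source and aligned at time zero
    (an illustrative simplification, not a claim about the thesis hardware).
    """
    common = gcd(sender_mhz, receiver_mhz)
    sender_edges = sender_mhz // common      # sender edges per repeat interval
    receiver_edges = receiver_mhz // common  # receiver edges per repeat interval
    # Sender edge k occurs at k * receiver_edges / sender_edges receiver periods;
    # the fractional part is its phase within the current receiver cycle.
    return [Fraction(k * receiver_edges, sender_edges) % 1
            for k in range(sender_edges)]


phases = sender_edge_phases(700, 300)
print([str(p) for p in phases])
```

Under these assumptions the phase pattern repeats exactly after seven sender cycles, so the relative timing of every transfer can be known in advance rather than discovered by a synchronizer.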

3. Nearly matched clock frequencies

As shown in Figure 1.3, in the scenario with nearly matched clocks, the different domains

operate with independent sources that are closely matched in frequency. This occurs, for

example, in the design of network routers [KP+99] where each line card receives a bit

stream with an embedded clock from a different source. Although each stream comes with

its own clock, typically these clocks are very closely matched in frequency. For example,

ATM standards specify that bit rates be within one part per million of their nominal values.
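A rough arithmetic sketch shows how slowly the relative phase of two such clocks moves. It assumes a constant frequency mismatch and, for the example call, that the two clocks sit at opposite ends of a band of one part per million around the nominal value, giving a relative mismatch of 2 ppm; the function name is hypothetical.

```python
from fractions import Fraction


def cycles_per_period_of_drift(mismatch_ppm) -> Fraction:
    """Clock cycles needed for the relative phase to drift by one full period.

    A relative frequency mismatch of d (as a fraction) shifts the phase by
    d periods per cycle, so one full period of drift takes 1/d cycles.
    """
    mismatch = Fraction(mismatch_ppm) / 1_000_000
    return 1 / mismatch


# Two clocks at opposite ends of a +/-1 ppm band (assumed worst case).
print(cycles_per_period_of_drift(2))
```

With a 2 ppm mismatch the phase takes half a million cycles to slip by one period, which suggests why an interface can track the drift with very infrequent adjustments.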

4. Arbitrary clock frequencies

Finally, in the scenario with arbitrary clock frequencies, the clocks are derived from independent sources and can have any arbitrary frequencies. While the frequencies may be arbitrary, typical synchronous designs use clocks that are very stable. Thus, the relationship between the clock frequencies changes very slowly over time. This enables the design of an interface where synchronization, while necessary, is not critical to the latency of data transfers.

In this thesis, I present interfaces ensuring reliable communication in all of these multiple clock domain scenarios. The designs described use a self-timed FIFO which has a single handshaking stage and is thus a minimalist version of the original design, namely STARI.

The essential observation behind my designs is that clocks for synchronous systems are

designed to be extremely stable. Thus, I can design STARI style interfaces that provide

moderate amounts of skew tolerance and dynamically compensate for any long-term drift

in skew or frequency. As described above, I present designs where the sender and receiver

operate at exactly the same frequency; at frequencies that are rational multiples of each

other; at closely matched frequencies; and at arbitrary, relatively stable frequencies. In the

remainder of the thesis, I show that these designs are small and can operate at high clock

frequencies with low latencies.

1.2 Contributions

In this thesis, I show that multiple clock domains with exactly matched, rationally related, nearly matched and arbitrary frequencies can communicate reliably and efficiently using a single stage FIFO that offers nearly two clock periods of skew tolerance.

The contributions of this dissertation can be summarized as:

• Detailed study of the minimalist version of the STARI design along with an analysis

of its skew tolerance.

• A novel initialization mechanism that achieves maximum robustness.

• Designing a proof-of-concept chip implementing the minSTARI design (described in Chapter 3) and sufficient test circuitry to verify its functionality.

• Detailed analysis and design of extensions of the basic design to apply to the more general multiple clock domain scenarios.

1.3 Overview

The document is organized as follows:

I begin with a brief description of clock skew and jitter in digital designs and why they pose a problem with shrinking die sizes and increasing transistor density. Next, a variety of techniques that have been used or suggested to handle clock skew are described, including the STARI method which this research extends.

Next, in Chapter 3, I describe minSTARI, a minimalist version of the original STARI method, which can achieve a skew tolerance of almost two clock periods. Thus, minSTARI forms the solution for the exactly matched clock scenario, where clocks of different domains operate at the same frequency. Chapter 4 describes a proof-of-concept chip demonstrating the operation of the design along with the test results obtained.

In the further generalizations of the basic idea, additional circuitry is used to reduce each scenario to the exactly matched clock frequencies scenario, and then minSTARI is used to handle the skew. For example, the design for the rationally related clock frequencies scenario generates a rational approximation of clocks to reduce the design to the matched frequency scenario. Similarly, the closely matched and arbitrary clock frequencies scenarios also use a combination of various other techniques to achieve the same objective. The generalizations are explained in Chapter 5.

Chapter 2

Related Work

2.1 Skew and Jitter

Skew can be viewed conceptually as the uncertainty in the timing of clock or data. More specifically, clock skew can be defined as the difference in time between simultaneous transitions of the clock within a system [Kat98], which is introduced by the clock distribution system. Clock jitter can be defined as short-term variations of the significant instants of the clock signal from their ideal positions in time. Various factors can contribute to clock skew [BC+99], such as:

1. Process variation between transistors: each buffer stage introduces uncertainties due to

process variations. To reduce skew, designers can reduce the number of buffer stages in the

clock distribution network.

2. Variation in parameters of the wires used: long wires introduce uncertainties and thus,

should be avoided.

3. Different sizing of each buffering stage according to the load it has to drive.

4. Presence of adjacent wires and the amount of switching activities between them.

5. Inductive reactances of the wires.

6. And finally, variations in factors such as temperature, power supply voltage etc.

Skew is becoming a major limiting factor in increasing the global clock frequency. Compensating for skew generally involves introducing complicated architectures or faster logic.
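The definitions of skew and jitter given at the start of this section can be illustrated with a small sketch. The function names and the measurement values below are hypothetical, and times are kept as integer picoseconds to avoid rounding noise; this is only one simple way to turn the definitions into arithmetic.

```python
def clock_skew(arrival_a_ps: int, arrival_b_ps: int) -> int:
    """Skew between two points: difference in arrival time of the same clock edge."""
    return arrival_b_ps - arrival_a_ps


def edge_jitter(edges_ps: list[int], nominal_period_ps: int) -> list[int]:
    """Deviation of each edge from its ideal position, using the first edge as reference."""
    start = edges_ps[0]
    return [t - (start + k * nominal_period_ps) for k, t in enumerate(edges_ps)]


# Hypothetical measurements in picoseconds (made-up numbers for illustration).
print(clock_skew(120, 175))
print(edge_jitter([0, 1010, 1990, 3000], 1000))
```

Here skew compares the same edge at two places, whereas jitter compares successive edges at one place against an ideal periodic schedule.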

2.2 Generation and Distribution of Clocks

2.2.1 Clock Generation

Three standard approaches to clock generation are Phase Locked Loops (PLLs), off-chip oscillators and Delay Locked Loops (DLLs).

A. PLL based design

In the PLL based designs, a low speed reference clock is distributed throughout the chip

and PLLs are used to obtain different multiples of the base frequency as required by various

sub-components. Clock jitter in this case is typically 5% of the clock cycle[BC+99].

B. Off-Chip Oscillator

In this approach, as the name suggests, an off-chip oscillator is used to generate very stable clock signals. Synchronizing the clock signal with the system is a difficult task, but the approach can achieve very low clock jitter, ranging from as little as 1% of the clock cycle to a maximum of 5% [BC+99].

C. DLL based design

In DLL based designs, the local clock is delayed sufficiently to line up with the edge of system

clock and thus can be used for latency correction or introducing a desired phase difference

if needed. DLL based designs form an attractive alternative to PLL based designs due to

their better jitter performance, inherent stability and simpler design but are difficult to use

for frequency synthesis [SH97].

2.2.2 Clock Distribution

Figure 2.1: Clock Generation Techniques (a. PLL based design; b. Off-chip oscillator; c. DLL based design)

The topology and technique for clock distribution play a very important part in determining skew and jitter and in the overall performance of the system. Thus, much effort is spent in designing and optimizing clock networks which can balance factors such as skew and clock tree delays. Table 2.1 gives a comparison of some of the standard clock distribution techniques:

Table 2.1: Comparison of Different Clock Distribution Techniques

1. Distributed Buffers
Technique: Buffers are distributed throughout the chip.
Advantages: Flexibility, wireability, low power.
Disadvantages: High sensitivity to process skew.

2. Water Main
Technique: All the buffering is done at the main clock signal before it is distributed using a wide wire.
Advantages: Low skew if horizontal flow in the design exists.
Disadvantages: Not suitable if the interconnect resistance of the buffer is large compared to the buffer output resistance.

3. H-Tree
Technique: Wires between buffer stages are configured in a balanced, hierarchical “H” wiring pattern. Primary clock driver is connected to the center of the main “H” structure and the clock signal is distributed through the corners following a recursive design.
Advantages: Good wireability, zero skew for identical loads.
Disadvantages: Physical layout gets constrained, interconnect capacitance increases due to longer wires, difficult to balance, poor automatic clock routing.

4. Mesh
Technique: Rooted tree structure of clock buffers with shunt paths in the later stages.
Advantages: Minimizes interconnect resistance, places branch resistance in parallel reducing skew.
Disadvantages: May have high device and wiring process variations, poor wireability.

Example Implementations of Clock Networks

1. DEC Alpha series: The design is primarily based on mesh or grid techniques where wires are cross-connected with vertical and horizontal straps in a mesh pattern which keeps the clocks in phase across the whole chip. The Alpha microprocessors require that a substantial capacitive load be driven at high speed while maintaining a fast edge rate. Thus, in the earlier designs, five levels of buffering configured as a tree were used. For example, the Alphaserver 4100 clock distribution system uses a combination of both a balanced H-Tree and a shared output tree to distribute the clock signal [Dam97]. The balanced H-Tree takes care of the fixed load blocks, whereas the shared output tree is used where various module configurations could alter clock loading. Later, for a 600 MHz Alpha processor, a hierarchy of clocks was used in which a gridded global clock with a windowpane arrangement of final distributed drivers lowered the skew [BB98]. More clocks were derived from these to provide more flexibility. Thus, a very complicated structure was adopted to enhance performance and save power. Similarly, for a processor running at 1.2 GHz, the difficulties in distributing clocks over large areas using low resistance grids are avoided by moving away from a single chip-wide clock distribution to multiple phase locked clocks controlling different components of the chip for better skew and jitter control. Networks of various kinds, ranging from P-shaped grids and rectangular X-trees to partial H-trees, were used [XB+01].
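The balanced H-tree mentioned above owes its "zero skew for identical loads" property to symmetry: every leaf is reached through the same wire length. The sketch below is a geometric illustration under idealized assumptions (uniform wires, identical loads, Manhattan routing, segment length halving at each level); the function name and parameters are hypothetical and it does not model any of the processors discussed.

```python
def h_tree_leaves(x: float, y: float, half: float, depth: int, length: float = 0.0):
    """Yield (x, y, wire_length_from_root) for every leaf of a recursive H-tree.

    At each level the wire runs `half` horizontally from the centre and then
    `half` vertically to reach one of the four corners of the "H".
    """
    if depth == 0:
        yield (x, y, length)
        return
    for dx in (-half, half):
        for dy in (-half, half):
            yield from h_tree_leaves(x + dx, y + dy, half / 2, depth - 1,
                                     length + 2 * half)


leaves = list(h_tree_leaves(0.0, 0.0, 1.0, depth=3))
lengths = {length for _, _, length in leaves}
print(len(leaves), lengths)
```

All 64 leaves report the same root-to-leaf wire length, so in this idealized model the clock arrives everywhere at the same time; real skew then comes from the process, load and wiring variations listed in Section 2.1.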

2. IBM S/390 Microprocessor: This is a 400 MHz CMOS microprocessor which

primarily uses tree-like structures. A single clock is distributed from a centrally located

on-chip PLL through a single buffer to 580 distribution points[RJDC98]. The distribution

is achieved in two levels of balanced H-like trees. The first level tree distributes the central

clock to nine buffers which are then further distributed in the second level. Using a small number of large buffers reduces skew and jitter from on-chip process variations but results in more complicated wiring networks.

3. Intel “Itanium” Microprocessor: This is another microprocessor designed for

running at GHz level frequency. It primarily uses programmable deskew circuits while

supporting local optimization of the clock distribution network[RT00]. The architecture

consists of three components: a balanced tree for distributing the global clock, multiple

deskew buffers with balanced tree structures driving the regional clock grids, and multi-

ple local clock buffers tapping these regional grids. A reference clock is also distributed

throughout the chip for phase correction.

4. Other techniques: The signal integrity problems due to clock jitter, clock skew and signal reflection have motivated researchers to look into alternative methods of interconnection. Thus, apart from digital interconnect techniques, optical interconnect and RF and microwave interconnect have also appeared in the picture. Optical interconnects have lower power consumption at very high frequencies and good signal integrity properties but are bulky, expensive and difficult to fabricate [RWW+02]. The two major kinds of optical interconnect are based on either free-space technology or guided-wave technology. RF and microwave interconnects are an intermediate technology between metal-based and optical interconnect. My work is focused on the existing approach, i.e., digital interconnects.

In summary, various methods have been devised for proper clock generation and distri-

bution which can lead to stable clock signals and reduced skew. However, in most large VLSI

designs, long wires and large numbers of components create large interconnect impedances.


Moreover, factors including variation in temperature, power supply voltage etc., introduce

an arbitrary amount of skew in the signals both along the clock and data path. Thus, it

becomes necessary to adopt some kind of clock skew compensation technique.

2.3 Skew Compensation Techniques

Given the challenges of transmitting signals between clock domains, researchers have ex-

plored a variety of asynchronous solutions. These range from building completely asyn-

chronous chips [MB+89, ML+97, FEG00, RB+01] to various combinations of synchronous

and asynchronous modules in the same design. Here, I focus on the latter approach. The

various methods for combining asynchronous and synchronous modules vary according to

the requirements that are placed on the clock. At one extreme, GALS (i.e. “Globally

Asynchronous, Locally Synchronous”) designs make very minimal assumptions about clock

timing, effectively turning clocks into bundled completion signals and adding handshaking

to clock generation [Cha84]. At the other extreme, “mesochronous” and “plesiochronous”

methods rely on exact or nearly exact frequency matching of the clocks [Mes90]. Be-

tween these two extremes, synchronizing buffers allow each domain to operate with its own

clock, but make minimal assumptions about the relationships between these clocks. My

approaches fall squarely in the mesochronous and plesiochronous camps. I summarize these

various approaches below.

2.3.1 GALS

As originally proposed by Chapiro [Cha84], GALS (i.e. “Globally Asynchronous, Locally Synchronous”) designs use stoppable clocks to allow synchronous modules to communicate using asynchronous protocols. Each synchronous domain has its own clock generator that consists of a ring oscillator with a handshaking stage. When domain X has a value to send to domain Y, X outputs the value, sends a request to Y and stalls its clock until it receives an acknowledgement. Likewise, when domain Y is prepared to receive a value from domain X, it stalls its clock, waits for a request from X, latches the value, sends an acknowledgement to X and restarts its own clock.

Figure 2.2: A GALS System
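The request/acknowledge sequence described above can be traced with a small behavioural sketch. It runs the steps sequentially in software rather than modelling real concurrent hardware, and the class names, the `transfer` function and the clearing of `req`/`ack` at the end are assumptions made for illustration, not details taken from Chapiro's design.

```python
from dataclasses import dataclass, field


@dataclass
class Channel:
    data: object = None
    req: bool = False
    ack: bool = False


@dataclass
class Domain:
    name: str
    clock_running: bool = True
    events: list = field(default_factory=list)

    def stall(self):
        self.clock_running = False
        self.events.append("stall")

    def restart(self):
        self.clock_running = True
        self.events.append("restart")


def transfer(sender: Domain, receiver: Domain, ch: Channel, value):
    """One X -> Y transfer, executed sequentially in the order the text gives."""
    # Sender: output the value, send a request, stall until acknowledged.
    ch.data = value
    ch.req = True
    sender.stall()
    # Receiver: stall, wait for the request, latch, acknowledge, restart.
    receiver.stall()
    assert ch.req, "receiver would keep waiting here until a request arrives"
    latched = ch.data
    ch.ack = True
    receiver.restart()
    # Sender: the acknowledgement releases its clock. Clearing req/ack
    # afterwards is an assumption of this sketch, not taken from the text.
    assert ch.ack
    ch.req = ch.ack = False
    sender.restart()
    return latched


x, y = Domain("X"), Domain("Y")
print(transfer(x, y, Channel(), 0x2A), x.events, y.events)
```

The trace makes the cost visible: both clocks are stopped for part of every transfer, which is the source of the added latency discussed later in this section.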

Yun and Donohue [YD96] extended Chapiro’s approach by adding a mutual-exclusion

element to the ring oscillator. This allows each locally synchronous block to continue oper-

ating while polling for input from its neighbours. The mutual exclusion element delays the

next clock event, if needed, to allow metastability [CM73] arising from the polling to resolve.

Yun and Donohue’s approach allows GALS designs to be very flexible, and their methods

have been extended by several research groups, e.g. [MVF00, SM00, MT+02, SPL02]. Their

design is not metastability free but no clock events occur while it is being resolved which

prevents the communicating domains from accepting incorrect data. However, pausing or

stretching the clock increases latency in the system.

Chattopadhyay and Zilic [CZ02] extended this work further by designing a GALDS system, which stands for Globally Asynchronous Locally Dynamic System, as shown in Figure 2.3. GALDS is based on the observation that dynamically switching clock frequencies is an effective method of saving power. Thus, instead of generating a fixed-frequency clock signal for the local domains, it uses a local clock controller which dynamically varies the clock frequency according to the requirement. The design uses a unidirectional synchronizer, which is essentially a single-stage asynchronous FIFO, for communication between adjacent clock domains, and a bidirectional synchronizer, which is a combination of two FIFO stages in parallel, for non-adjacent domains. Async/Sync converters, which are essentially single-stage synchronizers, are used to interface the synchronous domains with the asynchronous wrapper; they shift asynchronous signals into local clock domains, thereby introducing latency into the system. This design provides less than a clock period to resolve metastability for the control signals, which might not be adequate at higher clock rates. Although the design ensures that data is successfully latched into the receiving domain even when metastable control signals are obtained, this does not ensure proper operation of the system, since the uncertainty in the validity of data introduces inconsistency in the overall operation.

Figure 2.3: A GALDS System

Another communication scheme for mixed-timing designs has been proposed in [HN03], where a handshaking protocol is followed between two independent clock domains for asynchronous signals and the asP* protocol [MJ+97] is used for generating signals synchronous to either clock. Figure 2.4 shows the setup of this scheme. The sequence of signals generated is shown by the numbers attached to each signal. The metastability caused in receiving the external signals, En or Val, is resolved by data resampling. In this method, a fixed time interval is allotted for resolving the metastability and, upon unsuccessful resolution, the process is repeated. On the receiver end, data corresponding to metastable signals is overwritten by data of a successful attempt.

Figure 2.4: Communication Scheme with Resampling (SENDER on Clk1 and RECEIVER on Clk2 exchanging Rst(1), En(2), Req(3), Val(4), Ack(1/5) and Data; blocks EnGen and ValGen)

GALS makes minimal assumptions about clock stability; in fact, GALS discards the

stability and low-jitter of clocks that are the hallmarks of synchronous design. The frequency

stability of traditional clocks allows us to determine the relative phase of two independent clocks thousands of cycles or more in advance. As I describe in Sections 5.2 and 5.3, this predictability enables moving metastability off the latency-critical paths in my designs. Also, in GALS designs, a logic signal with low drive controls the clock, which has a very high fan-out. Thus, a large amount of amplification is required before the signal is fed into the clock control circuit, which introduces latency. Jitter is the variation in the

time between successive clock events. This variation directly degrades the performance of

synchronous designs, and clock pausing exacerbates jitter. After pausing a clock, the first

edge through the ring oscillator and clock buffer will propagate slower than subsequent

events [WGG02]. The loss of long-term timing predictability and the increase of jitter are

consequences of the GALS approach of converting synchronous designs into asynchronous

ones. In this thesis, I show that more efficient designs are achieved by letting synchronous

modules be synchronous and using simple asynchronous interfaces to compensate for clock-

skew and other timing uncertainties.


2.3.2 Synchronizing Buffers

The next step in our taxonomy allows independent, free-running clocks in each domain and

makes minimal assumptions about the timing relationships between them. A common rule-

of-thumb for design specifies the use of two or three synchronizing latches whenever a clock

domain is crossed [JG93, Chapter 3.11.4]. [JG93] also expresses the clock-to-output delay

of a latch as a function of the difference of the setup and the critical switching time of the

latch. This motivates increasing the number of stages of the synchronizer with increasing

clock rates. Thus, for high-performance designs with a small number of gate-delays per

clock period, even longer chains may be needed to achieve acceptably low probabilities of

failure. Seizovic [Sei94] recognized that these synchronizations can be pipelined allowing

high throughput even when the time for reliable synchronization is many clock periods.

Chelcea and Nowick [CN01] further optimized this approach by noting that synchronizations

are only needed for the receiver when the buffer is close to empty and only needed for the

sender when the buffer is close to full. All of these approaches still incur worst-case latency

due to either buffering latency when the buffer is not nearly empty or due to synchronization

latency when the buffers are nearly empty. Iyer and Marculescu [IM02] evaluated the

performance of a superscalar microprocessor design decomposed using Chelcea and Nowick’s

FIFOs. Superscalars are particularly sensitive to latency, and Iyer and Marculescu found

that the performance penalties arising from the added latency outweighed the power savings

for the design that they considered.
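The growth of the synchronizer chain with clock rate can be illustrated with the widely used exponential model of synchronizer reliability, MTBF = exp(t_r/τ) / (T_w · f_clk · f_data). The sketch below is illustrative only. It uses this textbook model rather than the specific expression in [JG93], the function names are hypothetical, and the values of τ, T_w, the data rate and the target MTBF are assumptions chosen to give round numbers. It also assumes, optimistically, that each added stage contributes one full clock period of resolving time.

```python
import math


def synchronizer_mtbf(resolve_time, tau, t_window, f_clock, f_data):
    """Mean time between failures (seconds) under the exponential model."""
    return math.exp(resolve_time / tau) / (t_window * f_clock * f_data)


def stages_needed(target_mtbf, period, tau, t_window, f_data, max_stages=64):
    """Smallest number of resolving periods k whose MTBF meets the target.

    Assumption: every stage adds exactly one clock period of resolving time.
    """
    f_clock = 1.0 / period
    for k in range(1, max_stages + 1):
        mtbf = synchronizer_mtbf(k * period, tau, t_window, f_clock, f_data)
        if mtbf >= target_mtbf:
            return k
    raise ValueError("target MTBF not reachable within max_stages")


# Assumed technology numbers: tau = T_w = 50 ps, 100 MHz data rate,
# target MTBF of 1e10 seconds (a few hundred years).
slow = stages_needed(1e10, period=2e-9, tau=50e-12, t_window=50e-12, f_data=100e6)
fast = stages_needed(1e10, period=0.5e-9, tau=50e-12, t_window=50e-12, f_data=100e6)
print(slow, fast)
```

With these assumed numbers, the clock with the shorter period needs the longer chain, which is the trend the rule of thumb describes.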

2.3.3 Mesochronous Designs

A Mesochronous Design is marked by multiple clock domains which run at the same

frequency but have unknown phase relations between them. Figure 2.5 shows a generic

mesochronous design.

Here, the clocks of the different domains are derived from the same clock source and are exactly matched in frequency. Due to delays in the clock path, however, the phase relation between the clocks is uncertain. Thus, the output data DT is not synchronized with the receiver’s clock ΦR. Moreover, the delay in the data path exacerbates the uncertainty.

Figure 2.5: Mesochronous Timing (a common clock Φ reaches the transmitter’s domain as ΦT and the receiver’s domain as ΦR through separate delays; data DT crosses an interconnect delay)

Various techniques have been suggested to handle mesochronous timing. One of them,

which reduces or eliminates the synchronization latency of the designs described above by

taking advantage of the stability of the clocks, is the STARI method. The stability of

the clocks enables prediction of the timing relationship of clocks in different domains well

into the future. STARI interfaces [Gre93, Gre95] (or “source-synchronous” interfaces [YH00]) have a

common clock source for the sender and receiver, guaranteeing that both operate at the

same frequency although the phase difference between the two may be unknown. A FIFO at

the receiver is initialized to be roughly half full. During each clock period, the transmitter

inserts one item into the FIFO and the receiver removes one item from the FIFO. The

FIFO occupancy remains within one of half-full; in particular, overflows and underflows are

excluded. This removes the need for testing full and empty conditions and thereby removes

the need for synchronization and synchronizers. STARI interfaces are described in more

detail in section 2.4.
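The occupancy argument can be checked with a small event simulation. This is a sketch under stated assumptions, not a model of any particular STARI circuit: the function name is hypothetical, time is in integer picoseconds, both clocks have exactly the same period, and the two phases are taken within one period of each other.

```python
def occupancy_range(period, phase_t, phase_r, initial, cycles=1000):
    """Return (min, max) FIFO occupancy over a run of insert/remove events.

    The transmitter inserts at phase_t + k*period and the receiver removes
    at phase_r + k*period. Times are integers, so ordering is exact.
    """
    events = []
    for k in range(cycles):
        events.append((phase_t + k * period, +1))  # insertion
        events.append((phase_r + k * period, -1))  # removal
    events.sort()  # on a tie, the removal (-1) sorts first

    occupancy = initial
    lowest = highest = occupancy
    for _, change in events:
        occupancy += change
        lowest = min(lowest, occupancy)
        highest = max(highest, occupancy)
    return lowest, highest


# An 8-deep FIFO initialised half full, 1 ns period, 300 ps vs 750 ps phases.
print(occupancy_range(1000, 300, 750, initial=4))
```

Whatever phases are chosen within a period, the occupancy moves by at most one item from its initial value, so neither the full nor the empty condition is ever reached.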

A variation of this design has been implemented by Kim and Sridhar [KS96]. Their design, shown in Figure 2.6, uses a self-timed FIFO for insertion and removal of data, but the clock at the receiver end is regulated by a local clock control (LCC) circuit. This LCC waits for a request signal from the FIFO and then starts the receiver clock for removal of data. Synchronization is required for the first datum, and subsequent data are removed every clock cycle. Thus, instead of using phase detectors, this method relies on adjusting the data arrival time so that it is synchronized at the receiver end. The LCC is implemented through a series of three C-elements (described in Section 3.1), and metastability is resolved through a comparator. This method could suffer from the disadvantages of pausing clocks, similar to GALS, and could have restrictions in high-performance designs [MS01]. Also, the modifications required when multiple domains communicate with the receiver are unclear.

Figure 2.6: FIFO with Local Clock Control (transmitter, self-timed FIFO and receiver; a clock generator drives the transmitter, and the LCC produces the receiver’s local clock lclk from the FIFO handshake signals)

A second variation, called Globally Updated Mesochronous (GUM) design, is provided by [S03]. Here, instead of using a FIFO, clocks with adjustable delays are used in each synchronous domain. A calibration process determines the ideal phase offset between the clocks by measuring the round-trip latency of the data path between the communicating domains. In particular, it starts from an arbitrary initial operating point and slowly increases the delay of a clock signal until it enters a failure zone with respect to the other clock. Once the window of operation is determined, the clock pulse is positioned in the center of the window. This calibration process resembles the dynamic initialization process described in Sections 3.3.1 and 3.3.2. The difference is that the GUM method cannot dynamically account for skew, which our method can, and thus cannot track factors like variations in temperature, power supply noise, etc. Furthermore, the initialization process has to be repeated many times to account for any dynamic skew. Moreover, the additional complexity of maintaining an adjustable delay and repeating the careful measurement process several times with each clock can be quite tedious.

Figure 2.7: Globally Updated Mesochronous Method (a common clock Φ feeds Domain 1 and Domain 2 through adjustable delays)

There has also been an effort to apply mesochronous techniques to on-chip networks

[Wik03]. In this technique, the transmitter sends its data and a strobe signal which is kept

at half the frequency of the transmitter’s clock. The receiver multiplies the strobe signal

to generate the clock signal and latches the incoming data using this generated clock. A

phase comparator compares the incoming data at every cycle with the receiver’s clock and

then selects either the receiver clock signal or the receiver clock delayed by half a clock

period, to trigger a second latch which generates the final data for the receiver’s domain.

The comparator requires synchronization and may take an arbitrarily long time to resolve if both selection signals are equally good.

My solution for mesochronous designs is a STARI-based technique and is described in

more detail in the subsequent chapters.

2.3.4 Plesiochronous Designs

In plesiochronous designs, the sender’s and receiver’s clocks are generated separately but

are closely matched in frequency [Mes90, DDX95]. Accordingly, the relative phase between

the sender and receiver changes very slowly.

Figure 2.8: Plesiochronous Retiming (transmitter data and a copy delayed by half a period feed a MUX at the receiver; a flip region detector selects between them)

Rather than detecting FIFO-full and FIFO-empty conditions, a plesiochronous interface

can include circuitry to detect FIFO-nearly-full or FIFO-nearly-empty conditions. These

conditions can be synchronized to the appropriate clock domain with extremely reliable,

high-latency synchronizers.

The approach of [DDX95] is based on data retiming, where two versions of the transmitter’s data, the original and one delayed by half a clock period, are maintained. Based on the timing of the receiver’s signal, dynamic mode switching chooses the version that has more tolerance for frequency mismatch. This method is similar to my work in that it also separates synchronization from the latency-critical path. In [DDX95], the transmitter is always kept at a slower pace by introducing non-data items, which constitute a fixed percentage of the total bandwidth, to avoid overflow of data; this degrades performance by 2δf. This method requires real-time switching between different input modes and hence provides a smaller interval in which to complete switching without duplicating or missing data. It also requires prior knowledge of data and non-data elements for successful switching.

Figure 2.9: Source Synchronous Communication (the sender’s domain forwards its data and clock ΦT′ to the FIFO at the receiver; ΦT and ΦR are derived from a common clock Φ through unknown delays)

In my method, real-time switching is avoided by using “near miss” detectors, as described in Section 5.2, that enable us to predict overflow and underflow well before they would actually occur. As an example, consider the case where the sender and receiver clocks are guaranteed to be matched to within 1 ppm (a typical requirement for high-speed networks). Let a “near-empty” detector report the condition that data from the transmitter arrived such that it was available for removal from the FIFO less than 10% of a clock period before the actual removal. With the close clock matching, at least 100,000 clock cycles will elapse before an underflow occurs! Thus, a synchronizer with ten or more stages can be used to report this “near-empty” condition to the receiver without risk of underflow. The receiver can then skip removing a datum in a subsequent clock cycle, and it will be at least one million clock cycles until the next near-empty condition occurs. While the synchronization path for flow control signals can have very high latency in a plesiochronous design, the latency of the data path can be kept very low. In Section 5.2 we show how our designs can be used to implement plesiochronous interfaces. Section 5.3 generalizes this to interfaces with arbitrary, stable clocks.
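The cycle counts in this example follow from dividing a phase margin by the worst-case drift per cycle. The sketch below reproduces the arithmetic with exact rational numbers; the function name is hypothetical and the only inputs are the 1 ppm and 10% figures from the example.

```python
from fractions import Fraction


def cycles_to_drift(phase_fraction, mismatch_ppm):
    """Clock cycles needed for the relative phase to drift by
    phase_fraction of a period at a frequency mismatch of mismatch_ppm."""
    drift_per_cycle = Fraction(mismatch_ppm, 1_000_000)  # periods per cycle
    return Fraction(phase_fraction) / drift_per_cycle


# Margin left when the near-empty detector fires: 10% of a period.
warning_to_underflow = cycles_to_drift(Fraction(1, 10), 1)

# Skipping one removal restores a full period of margin, so the next
# warning is a full period of drift away.
between_warnings = cycles_to_drift(1, 1)

print(warning_to_underflow, between_warnings)
```

A synchronizer that takes on the order of ten cycles to report the warning therefore uses a negligible part of the 100,000-cycle budget.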

2.4 STARI

In this section, I describe a particular implementation of a mesochronous interface, STARI, which stands for Self-Timed At Receiver’s Input [Gre93]. The n-stage FIFO, placed between clock domains operating at the same frequency but with an unknown phase relation between them, achieves extremely high skew tolerance but with some added latency. My interface,

minSTARI, is a minimalist version of this design with reduced latency and is described in

the next chapter.

To handle clock skew, self-timed methods have proven to be very effective but at the

additional cost of communication delay. Thus, a technique combining both synchronous


and asynchronous approaches can be more effective than either. STARI is such a combi-

nation where an n-stage FIFO is placed between the communicating domains. Figure 2.9

describes a STARI interface. Both the transmitter and receiver derive their clocks, ΦT and

ΦR respectively, from the same clock generator Φ. The delays in the path from clock gen-

erator to any clock signal are assumed to be arbitrary. The transmitter’s domain forwards

both its data and its clock ΦT′ to the FIFO. Because ΦT′ and ΦR are exactly matched in frequency, the insertion rate of the FIFO is the same as its removal rate. If the FIFO is initialized to be roughly half full, then throughout operation, the occupancy of the FIFO remains roughly half full. Thus, the need to check for overflow and underflow is

avoided. The FIFO can be implemented with individual C-elements or using simple latches

acting as buffers. Thus STARI offers clear advantages over purely synchronous or purely

asynchronous systems because it does not require the absolute synchronization of purely

synchronous methods nor does it require the explicit flow control mechanism of purely

asynchronous ones.

In this thesis, I present both specializations and generalizations of the original STARI

work. In Chapter 3 a specialization of STARI is described by focusing on the case where

the FIFO consists of a single stage. Such an implementation provides nearly two clock

periods of skew tolerance. By optimizing for the single-stage case, I obtained very simple

interfaces between the edge-triggered conventions common in synchronous design and the

handshaking communication that is characteristic of self-timed designs. I then generalize

STARI to relax the requirement of exactly matched clocks at the sender and receiver. I

present interfaces where the sender and receiver clock frequencies are rational multiples of each other (Section 5.1), closely matched (Section 5.2), and arbitrary (Section 5.3). All of

these designs exploit the long-term stability of clocks to obtain simple interfaces with small

latencies.

Chapter 3

MinSTARI: A Single Stage FIFO Interface

This chapter describes a simple implementation of STARI communication where the FIFO

has a single stage. As shown in Figure 3.2, the FIFO consists of a single latch, and a latch

controller that generates a clock for this latch based on the clocks from the transmitter and

receiver. To the user, this FIFO appears as a latch with two clock inputs (Figure 3.1).

In this chapter and the next, it is assumed that the transmitter and receiver operate at

exactly the same frequency; only the relative phase difference is unknown. This is easily

achieved if both of their clocks are derived from a common source.

Figure 3.1: Interface as Latch with 2 clock inputs

Figure 3.2: The Single Stage FIFO (latch-T in the transmitter, latch-X in the single-stage FIFO and latch-R in the receiver, in series from data_in to data_out; the latch controller generates ΦX from ΦT and ΦR)

Figure 3.3: Clock Timing For The FIFO (waveforms for ΦT, ΦR and ΦX showing δTR, δRT, γTR, γRT and the exclusion regions ts and th)


3.1 Description

Figure 3.3 depicts the timing for the single-stage FIFO. For simplicity, I assume that the

latches are positive-edge-triggered. My design easily generalizes to other latching styles.

For proper operation, the latch controller must generate ΦX so as to satisfy the set-up and

hold requirements of latch-X and latch-R. To satisfy the requirements of latch-X, the rising

edge of ΦX must occur at least tset−up +tprop (abbreviated ts in the figure) after the previous

ΦT event, and at least thold − tprop (abbreviated th in the figure) before the next ΦT event,

where tset−up , thold , and tprop denote the set-up and hold times and propagation delay of

the latches respectively. To satisfy the requirements of latch-R, the rising edge of ΦX must

occur at least thold − tprop after the previous ΦR event, and at least tset−up + tprop before the

next ΦR event. The “exclusion” regions corresponding to these requirements are indicated

by cross-hatched regions for ΦX in Figure 3.3.

There are two windows of opportunity for generating ΦX : a rising edge of ΦX may occur

between a rising edge of ΦT and the subsequent rising edge of ΦR, or between the rising

edge of ΦR and the subsequent rising edge of ΦT . I refer to these scenarios according to the

last event (ΦT or ΦR) that occurs prior to each ΦX event. Thus, if ΦX occurs after a ΦT

event but before the next ΦR event, I refer to this situation as “transmitter-last”. Likewise,

I use “receiver-last” to refer to the other case. In Figure 3.3, δTR denotes the time from the

rising edge of ΦT to the next rising edge of ΦR. Likewise, δRT denotes the time from the

rising edge of ΦR to the next rising edge of ΦT . Let P denote the clock period. Now, let

γTR denote the width of the window of opportunity for the transmitter-last scenario, and

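γRT denote the width of the window of opportunity for the receiver-last case. We have:

\begin{align*}
\gamma_{TR} &= \delta_{TR} - 2(t_{\text{set-up}} + t_{\text{prop}}) \\
\gamma_{RT} &= \delta_{RT} - 2(t_{\text{hold}} - t_{\text{prop}}) \\
\Rightarrow\quad \gamma_{TR} + \gamma_{RT} &= \delta_{TR} + \delta_{RT} - 2(t_{\text{set-up}} + t_{\text{hold}}) \\
&= P - 2(t_{\text{set-up}} + t_{\text{hold}}) \\
\Rightarrow\quad \max(\gamma_{TR}, \gamma_{RT}) &\ge P/2 - (t_{\text{set-up}} + t_{\text{hold}}) \tag{3.1}
\end{align*}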

In other words, if the clock period is greater than 2(tset−up + thold ), then the window of

opportunity for at least one of the transmitter-last or the receiver-last case is non-empty, and

the latch-controller can generate a clock that ensures proper operation of the interface. In

particular, if γTR > 0, then the latch controller can safely generate a rising edge tset−up + tprop after the rising edge of ΦT; otherwise, γRT must be positive, and the latch controller can safely generate a rising edge thold − tprop after the rising edge of ΦR. Section 3.3 shows

how the latch controller can be initialized to operate in one of these two scenarios. The

remainder of this section considers steady-state operation.
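The window widths of equation 3.1 and the resulting choice of mode can be checked numerically. The following is a minimal sketch: the function names are hypothetical, and the latch parameters (a 1000 ps period, 50 ps set-up, 100 ps hold and 30 ps propagation delay) are assumed values picked only to make the arithmetic easy to follow.

```python
def windows(period, delta_tr, t_setup, t_hold, t_prop):
    """Return (gamma_tr, gamma_rt), the widths of the two windows."""
    delta_rt = period - delta_tr
    gamma_tr = delta_tr - 2 * (t_setup + t_prop)
    gamma_rt = delta_rt - 2 * (t_hold - t_prop)
    return gamma_tr, gamma_rt


def choose_mode(period, delta_tr, t_setup, t_hold, t_prop):
    """Pick a feasible mode and the delay from the last clock edge to PhiX."""
    gamma_tr, gamma_rt = windows(period, delta_tr, t_setup, t_hold, t_prop)
    if gamma_tr > 0:
        return "transmitter-last", t_setup + t_prop
    if gamma_rt > 0:
        return "receiver-last", t_hold - t_prop
    return None  # period too short: neither window is open


# All times in picoseconds (assumed values).
P, T_SETUP, T_HOLD, T_PROP = 1000, 50, 100, 30
print(windows(P, 100, T_SETUP, T_HOLD, T_PROP))      # (-60, 760)
print(choose_mode(P, 100, T_SETUP, T_HOLD, T_PROP))  # ('receiver-last', 70)
print(choose_mode(P, 600, T_SETUP, T_HOLD, T_PROP))  # ('transmitter-last', 80)
```

For every value of δTR the two widths sum to P − 2(tset−up + thold), which is 700 ps with these numbers, so at least one of them is positive.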

Figure 3.4 shows a finite state machine that implements the operations of the latch

controller. One event is output on ΦX each time it has received an event on ΦT and an

event on ΦR. For ∆TR ≥ 2(tset−up + tpropmax), the controller starts in state 0. Upon

receiving a ΦR event, it moves to state R. When the controller receives a ΦT event, it moves

to state TR. After a delay of tset−up + tprop , the controller outputs a ΦX event and returns

to state 0. Likewise, for the case with P − ∆TR ≥ 2(thold − tprop), the controller starts in state 0, moves to state T upon receiving a ΦT event, moves to state TR upon receiving a ΦR event, and after a delay of thold − tprop, outputs a ΦX event and returns to state 0.
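The behaviour of this state machine can be sketched in a few lines. The sketch is a behavioural abstraction only: the function name is hypothetical, the delay between reaching state TR and emitting ΦX is not modelled, and a repeated event on the same clock is simply ignored, which is an assumption since the state diagram does not cover that case.

```python
def run_latch_controller(events):
    """Feed a sequence of 'T' / 'R' clock events to the controller.

    Returns one label per PhiX event, naming the clock that arrived last.
    """
    state = "0"
    fired = []
    for clock in events:
        if clock not in ("T", "R"):
            raise ValueError(f"unknown clock event: {clock!r}")
        if state == "0":
            state = clock  # move to state T or state R
        elif state != clock:
            # Both clocks seen: state TR, output PhiX, return to state 0.
            fired.append("transmitter-last" if clock == "T" else "receiver-last")
            state = "0"
        # else: second event on the same clock, ignored in this sketch
    return fired


print(run_latch_controller("RTRTRT"))
print(run_latch_controller("TRTRTR"))
```

One ΦX event is produced per pair of input events, and the order of the pair determines only the label, not whether the controller fires.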

The latch controller performs the function of a C-element. A traditional C-element (Figure 3.5)

drives its output to the value of its inputs when they agree. When the inputs differ, the

output retains its old value [Sei79]. My designs use an edge triggered C-element: after

detecting rising edges on each input, it generates a pulse on its output.
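For comparison, the level-sensitive behaviour of the traditional C-element is easy to state in code. This is a small behavioural sketch with a hypothetical class name; it models logic values only and says nothing about the transistor-level implementation.

```python
class CElement:
    """Traditional C-element: output follows the inputs when they agree."""

    def __init__(self, initial=0):
        self.c = initial

    def update(self, a, b):
        if a == b:
            self.c = a  # inputs agree: drive the output to that value
        return self.c   # inputs differ: output holds its old value


gate = CElement()
trace = [gate.update(a, b) for a, b in [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]]
print(trace)  # [0, 0, 1, 1, 0]
```

The output rises only after both inputs have risen and falls only after both have fallen, which is the "wait for both" behaviour that the edge-triggered version provides for clock events.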

Now, first consider operation in a transmitter-last scenario with γTR > 0. Following each rising edge of ΦT, the latch controller outputs a corresponding rising edge for ΦX. Then, there will be the next rising event for ΦR followed by a rising event for ΦT before the controller outputs the next rising edge of ΦX. Conversely, for the receiver-last case, following each rising edge of ΦX, the controller sees a rising event for ΦT followed by a ΦR event before generating the next rising edge for ΦX. In either case, between producing consecutive rising edges of ΦX, the latch controller receives rising edges from both ΦT and ΦR.

Figure 3.4: Latch Controller State Diagram (states 0, T, R and TR; transitions on ΦT and ΦR; ΦX is output on the return from TR to 0)

Figure 3.5: A traditional C-element

a b | c
0 0 | 0
0 1 | unchanged
1 0 | unchanged
1 1 | 1

In my design, timing is determined by the rising edges of the clocks. Accordingly, I use an edge-triggered, self-resetting [CC+91, SF01] implementation, as shown in Figure 3.6. On a rising edge of ΦT, transistors m1 and m2 pull node aT low. The three-inverter chain to the gate of m1 disables the pull-down path shortly after ΦT has gone high to make the circuit edge-sensitive rather than level-sensitive. Likewise, node aR drops on a rising edge of ΦR. When both have dropped, node c goes low, which generates a pulse on ΦX. The low value of node c forms a pull-up path on nodes aT and aR, which in turn resets node c back to its initial high value. At this point, one cycle of operation is complete and the interface is ready to accept the next set of inputs. Delay δT ensures that the delay from a rising edge of ΦT to a rising edge of ΦX is greater than tset−up + tprop. Likewise, delay δR ensures that the delay from ΦR to ΦX is greater than thold − tprop. The keeper inverters on nodes aT and aR help in resolving metastability that may occur during initialization and ensure correct operation at arbitrarily low clock frequencies. As promised, the design is extremely simple and requires very little layout area.

Figure 3.6: The Latch Controller (inputs ΦT and ΦR, delayed clocks ΦT′ and ΦR′ after delays δT and δR, internal nodes aT, aR and c, transistors m1 to m6, output ΦX)

3.2 Skew Tolerance

To analyze the skew tolerance of my design, we start with a transmitter-last scenario; proper

operation requires γTR > 0 which is equivalent to δTR > 2(tset−up +tprop). If the initial time

difference, δTR,0, is greater than this value, then the transmitter may be further delayed by

up to δTR,0 − 2(tset−up + tprop) without malfunction of the interface.

Figure 3.7 shows what happens starting from a transmitter-last scenario where transmitter events occur progressively earlier due to drift in the skew. In this figure, the transmitter outputs the sequence of values [−1, 0, A, B, C, ...] on node QT. The values shown for QX and QR show how transmitter data propagates to the other two latches. For each ΦX event, the figure shows a vertical dotted line labeled with the value loaded into latch-X by that event, and arrows from the ΦT and ΦR events that triggered the latch controller. The rising edges of ΦX for values B and C are transmitter-last events; for value D, the ΦT and ΦR events are coincident; and for values E and F, ΦX is generated by receiver-last events. In all cases, the latch controller waits until it has received events on both inputs. The relative order of arrival of the rising edges of ΦT and ΦR does not matter; thus, no synchronization is necessary.

Figure 3.7: Drifting Skew (waveforms for ΦT, QT, ΦX, QX, ΦR and QR)

When the ΦT event precedes the ΦR event, the interface operates in the receiver-last mode, starting with δRT = P. The interface continues to operate without dropping a value as long as δRT > 2(thold − tprop). Starting with an initial time difference of δTR,0, transmitter events can occur up to P − δTR,0 time units earlier before the receiver-last scenario occurs. At this point δRT = P, and transmitter events can occur up to a further P − 2(thold − tprop) time units earlier without malfunction.

To summarize, if the interface starts in a transmitter-last scenario with δTR = δTR,0, then ΦT can be delayed with respect to ΦR by up to δTR,0 − 2(tset−up + tprop) time units, and it can be advanced by up to 2P − δTR,0 − 2(thold − tprop) time units without

malfunction of the interface. The total width of the interval of relative delays for which the

interface operates correctly is 2(P − tset−up − thold ). Equivalent arguments hold starting

from a receiver-last scenario. Thus, if the latch set-up and hold window is small relative to

the clock period, then my design offers nearly two clock periods of skew tolerance.
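The two margins and their sum can be verified with a short calculation. The sketch below uses a hypothetical function name and assumed latch parameters (times in picoseconds); only the formulas come from the analysis above.

```python
def skew_margins(period, delta_tr0, t_setup, t_hold, t_prop):
    """Margins for a transmitter-last start with initial offset delta_tr0.

    Returns (delay_margin, advance_margin): how far PhiT may move later
    or earlier relative to PhiR before the interface malfunctions.
    """
    delay_margin = delta_tr0 - 2 * (t_setup + t_prop)
    advance_margin = 2 * period - delta_tr0 - 2 * (t_hold - t_prop)
    return delay_margin, advance_margin


# Assumed values: 1000 ps period, 50 ps set-up, 100 ps hold, 30 ps prop.
delay, advance = skew_margins(1000, 600, t_setup=50, t_hold=100, t_prop=30)
print(delay, advance, delay + advance)  # 440 1260 1700
```

The total of 1700 ps equals 2(P − tset−up − thold) for these numbers and does not depend on the initial offset, which only shifts the split between the two directions.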

In addition to the set-up and hold requirements, node c in the latch controller must

return high following the generation of a ΦX pulse before the arrival of the next rising edge

on ΦT or ΦR. Let η be the time between triggering the latch controller and the subsequent

return of node c to a high value. Let δT ′R′ and δR′T ′ denote the time from a rising edge of

ΦT ′ to the next rising edge of ΦR′ and vice-versa. As shown in scenario 1 in figure 3.8, proper

operation with transmitter last requires that δT ′R′ > η be satisfied. Similarly, scenario 5

requires δR′T ′ > η be satisfied. At least one of the two modes is feasible if

P > 2η (3.2)

For my proof-of-concept test-chip [CG02], our latch set-up and hold windows were signif-

icantly smaller than the latch controller’s cycle time. Thus, equation 3.2 is the critical

constraint for my design.
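The sufficiency of condition 3.2 follows in one step from the fact that the two intervals add up to a clock period. The derivation below is a sketch that assumes ΦT′ and ΦR′ are delayed copies of ΦT and ΦR and therefore share the period P:

```latex
\begin{align*}
\delta_{T'R'} + \delta_{R'T'} &= P \\
\Rightarrow\quad \max(\delta_{T'R'}, \delta_{R'T'}) &\ge P/2 \\
P > 2\eta \;\Rightarrow\; \max(\delta_{T'R'}, \delta_{R'T'}) &> \eta
\end{align*}
```

So whenever the period exceeds twice the controller's cycle time, the longer of the two intervals leaves enough time for node c to return high.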

3.3 Initialization

Under many circumstances, γTR and γRT are both positive. When this occurs, the interface

can operate in either transmitter-last or receiver-last mode. This section describes two

criteria for selecting the “better” mode and initialization procedures to achieve it. We

assume that it is acceptable for the interface to drop values, duplicate values, and/or exhibit

metastable behavior during initialization. Of course, it must deliver data without error after

completion of initialization.


3.3.1 Maximum Robustness

Clock jitter, temperature drift, and other fluctuations cause the skew on physical chips

to vary while the chip is operating. Typically, this variation is just as likely to make the

transmitter earlier as it is to make it later. To maximize robustness to skew variation, ini-

tialization should reflect the mode that tolerates the largest skew change in either direction.

This corresponds to starting in the transmitter-last mode if δTR > δRT and in the receiver-last mode if δTR < δRT. For example, in Figure 3.8, scenarios 1 and 3 have the same value of ∆TR. While scenario 3 can tolerate substantial changes of the skew in either direction, scenario 1 will fail if ΦR arrives any earlier relative to ΦT. Thus, scenario 3 is the preferred initialization. Similarly, scenarios 2 and 4 have the same ∆TR. Scenario 2 is slightly more robust to later skew variations and will be the preferred initialization for many designs.

An easy way to achieve this is to insert an adjustable delay into the self-reset cycle of

the latch-controller. If this delay is initially very large, then neither mode is feasible and

the latch-controller will generate ill-timed clock signals. By gradually decreasing this delay,

the circuit will reach a point where exactly one of the two modes is feasible and after one

or two cycles the latch controller will operate stably in that mode. As the delay is further

decreased, the latch controller will remain in the first mode that became feasible. This is

the mode with the larger skew margin. Thus, the analog dynamics of our circuit provide a

very simple mechanism for initialization.
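The selection effect of the delay sweep can be illustrated with a first-order model. The sketch makes strong simplifying assumptions that are not part of the circuit description: a mode is treated as feasible exactly when its interval (δTR or δRT) exceeds the controller's reset time η, set-up and hold regions are ignored, and η is lowered in fixed steps. The function name is hypothetical.

```python
def settle_mode(period, delta_tr, eta_start, eta_step):
    """Lower the reset time eta until a mode becomes feasible.

    Returns (mode, eta) for the first feasible mode found, or (None, 0).
    """
    delta_rt = period - delta_tr
    eta = eta_start
    while eta > 0:
        transmitter_last_ok = delta_tr > eta
        receiver_last_ok = delta_rt > eta
        if transmitter_last_ok and receiver_last_ok:
            # Both became feasible in the same step (delta_tr ~ delta_rt):
            # this is the potentially metastable case.
            return "ambiguous", eta
        if transmitter_last_ok:
            return "transmitter-last", eta
        if receiver_last_ok:
            return "receiver-last", eta
        eta -= eta_step
    return None, 0


# Times in picoseconds, 1000 ps period (assumed values).
print(settle_mode(1000, 300, eta_start=2000, eta_step=10))
print(settle_mode(1000, 700, eta_start=2000, eta_step=10))
print(settle_mode(1000, 500, eta_start=2000, eta_step=10))
```

In this model the first mode to become feasible is always the one with the longer interval, which is the mode with the larger skew margin.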

We employ a training period during which the internal delays of the latch controller are greater than those of normal operation. We do this in our implementation by

using a separate ground signal for the latch controller connected to an internal voltage

reference. This voltage sweeps from 1.8V (equal to Vdd) down to 0V (normal operation).

The controller speeds up during this sweep according to the linear relationship between

power supply voltage and speed.

When the controller is sufficiently slow, it cannot cycle as fast as the clocks. Under these conditions, nodes aT and aR in Figure 3.6 will still go low in response to their respective clock inputs, and when both go low, the controller will generate a ΦX event and return to state 0. However, the controller may miss incoming clock events that occur before the reset is complete.

Figure 3.8: Five timing Scenarios (waveforms for ΦT, ΦR and ΦX in scenarios 1 to 5, with the set-up and hold regions marked)

Assume that ∆TR < ∆RT as in scenarios 1 and 3, and consider operation at a time

during the initialization when the controller takes time ∆TR to traverse a path from state

TR to state 0 as in Scenario 1. If the latch controller reaches state TR (Figure 3.4) in

response to a ΦR event, then it will return to state 0 in time for the next ΦT event and

will continue to cycle correctly. On the other hand, if the controller reaches state TR in

response to a ΦT event, then it will return to state 0 after the next ΦR event. It will

remain in state 0 until the next ΦT event and then transition to state T. With the next ΦR

event, the controller will move to state TR and continue to cycle properly from there. This

corresponds to scenario 3, the more robust initialization as noted above. Having reached this

cycle, the controller will continue to complete all transitions on time with further reductions

of its internal delays. Thus, it will remain in the preferred cycle.

Metastable behaviour [CM73] is possible if ∆TR ≈ P/2. In this case, the controller can

settle to either of two scenarios that are nearly equally robust to future variations in the

skew. As with other metastable situations, the probability of remaining in an indetermi-

nate state decays exponentially with time. Accordingly, my circuit can be initialized very

reliably, and no metastability occurs after successful initialization. If the interface has a

skew tolerance greater than one clock period (i.e. the clock period is large enough), then our

initialization method can find a robust operating point for any initial phase difference between the transmitter and receiver. To ensure robust operation, an implementation should either use “strong keeper” circuits for the keeper inverters on nodes aT and aR or allow extra time for initialization. Metastability produces intermediate voltage levels on nodes aT and aR, and strong feedback from the keepers helps these nodes settle to stable values.


3.3.2 Minimum Latency

When both transmitter-last and receiver-last modes are feasible, the transmitter-last mode

has a latency that is one clock period less than that of the receiver-last mode. For designs

where latency is critical for performance (e.g. [DDX95, IM02]), it may be desirable to select

transmitter-last mode whenever possible. The following initialization procedure achieves

this behavior:

1. Start the interface running at full-speed (no need for the speed adjustment used in

section 3.3.1). The latch controller will settle into one of its two modes.

2. Wait long enough to ensure that the probability of metastability failures is insignifi-

cant.

3. Suppress one transmitter clock event. If the latch controller had been in the transmitter-

last mode, it will now see two receiver events before the next transmitter event and

continue in transmitter-last mode. On the other hand, had it been in the receiver-

last mode, the latch controller will see one receiver event before the next transmitter

event and switch to transmitter-last. If δT ′R′ > η, then the controller will remain in

transmitter-last, otherwise it will miss a receiver event when the controller’s internal

reset completes after the arrival of a rising edge of ΦR′ and then resume operation in

receiver-last.

4. Allow adequate time for the resolution of metastability that can occur if δT ′R′ ≈ η.

As described, this procedure makes no guarantees of robustness when forcing the transmitter-last mode. To provide some robustness against skew drift and clock jitter, the latch controller can be operated with a slight slow-down during this initialization and brought to full speed for normal operation. Alternatively, section 5.1 describes a near-miss detector circuit that can detect when the controller is close to its limits, in which case the controller can be returned to the receiver-last mode by suppressing a ΦR event.
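The effect of suppressing one transmitter clock event can be illustrated with a purely event-ordered toy model. This is only a sketch under strong assumptions: the controller is modelled as a join that fires once it has seen both a T and an R event since its last firing, and the reset delay η and metastability are ignored entirely. The function names are hypothetical and not taken from the design.

```python
def firing_modes(events):
    """For each firing of the (idealised) controller, report which clock arrived last.

    Assumption: the controller fires as soon as it has seen both a "T" and an "R"
    event since its previous firing; a repeated event of the same kind is absorbed.
    Timing (the reset delay eta) is not modelled.
    """
    pending = set()
    modes = []
    for ev in events:
        pending.add(ev)
        if pending == {"T", "R"}:
            modes.append("transmitter-last" if ev == "T" else "receiver-last")
            pending.clear()
    return modes


def suppress_one_transmitter_event(events, index):
    """Return a copy of the event stream with the index-th "T" event (0-based) removed."""
    out = []
    seen = 0
    for ev in events:
        if ev == "T":
            if seen == index:
                seen += 1
                continue  # this is the suppressed transmitter clock event
            seen += 1
        out.append(ev)
    return out


# A stream that starts in receiver-last mode ends up in transmitter-last mode.
receiver_last = ["T", "R"] * 6
print(firing_modes(suppress_one_transmitter_event(receiver_last, 2)))

# A stream that starts in transmitter-last mode stays there.
transmitter_last = ["R", "T"] * 6
print(firing_modes(suppress_one_transmitter_event(transmitter_last, 2)))
```

In this idealised model the forced mode is always reached; the condition δT′R′ > η in step 3 is exactly the timing constraint that the model leaves out.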


In this chapter, I presented a single-stage FIFO design which can be used to interface two clock domains running at the same frequency but with an unknown phase relation between them. The design can work with an arbitrary amount of initial clock skew and can dynamically account for almost two clock periods of skew. It uses very simple hardware and is very robust. Provisions also exist for operating the interface at minimum latency for latency-critical applications. This design appears to the user as a latch with two clock inputs and can be made part of a standard cell library and used in ASIC design flows.

Chapter 4

Implementation and Test Results

4.1 Implementation

We have designed a proof-of-concept chip for our interface and fabricated it in the TSMC 0.18 µm process through CMC, the Canadian Microelectronics Corporation.

4.1.1 Design Overview

The design of the chip is shown in Figure 4.1. The transmitter’s domain consists of a Linear Feedback Shift Register (LFSR) which generates a pseudo-random sequence of numbers for

Figure 4.1: Design Setup

data transmission. The transmitter’s latch, T, takes the output of the LFSR and forwards it to the intermediate latch X. Both the LFSR and latch T are triggered by the transmitter’s clock signal ΦT. Latch X takes data from latch T and forwards it to latch R. The latch controller takes both ΦT and ΦR as its inputs and generates en, which triggers latch X. On the receiving end, the shift register SR takes the output of the receiver’s latch R and regenerates the same sequence of data as generated by the transmitter LFSR. The error detection circuit compares the output of latch R (obtained data) with the output of SR (expected data) and generates an error signal if they do not match. More specifically, SR predicts the next expected bit based on the previous eleven bits and the error detection circuit reports an error if there is a discrepancy. The shift register and DAC modulate the ground signal of the latch controller and thus implement the dynamic initialization process.

4.1.2 Implementation Details

This section describes each component of the design in greater detail.

Transmitter stage

The LFSR consists of a series of Yuan-Svensson latches [YS89] which generate a pattern of length 2047. The latch was chosen primarily for its ability to operate at high speeds, which enables the interface to operate at the maximum data rate. As shown in Figure 4.2, this LFSR has a tap at the 3rd cell which generates the pseudo-random sequence. The reset signal is an external signal which forces a logic high into the LFSR and triggers the pattern generation. A synchronizer is used to synchronize the reset signal with ΦT.

Intermediate stage

The intermediate stage consists of the shift register, DAC, latch controller, intermediate latch X, and toggle circuit. The shift register, as shown in Figure 4.3, takes two clock inputs, ΦD and ΦU, and one data input, dr. One bit is shifted into the shift register on every rising edge of

Figure 4.2: LFSR

Figure 4.3: Shift Register

ΦD. Once the shift register has a set of six data values, ΦU goes high, generating the outputs C0 through C4 and the gndReset signal. With a proper sequence of data, this circuit can act as a shift register. The DAC (Figure 4.4) takes the output from the shift register and generates a ground signal of the corresponding voltage for the latch controller. This is achieved as follows: the inverter blocks B0 through B4 are of varying gate width, with B0 having the widest gate and B4 the narrowest. This leads to B0 providing the

Figure 4.4: DAC

Figure 4.5: Modified Yuan-Svensson Latch

weakest pull-up on the p-channel transistor of the voltage follower. Similarly, B4 provides the strongest pull-up. This provides the first 24 steps of the gradient on the signal falseGround, which forms the ground signal for the latch controller. The n-channel transistors, which start conducting only when both C3 and C4 are high, provide the next 8 steps of the gradient. The reset signal indicates the end of dynamic initialization; it brings falseGround completely low and allows the latch controller to run at its highest speed.

Using the DAC circuit, the ground signal of the latch controller can be lowered gradually. The resetGnd signal goes high at the end of the sequence, which brings the falseGround signal to its low value. Thus, the latch controller can be initialized at a very slow speed, and it gradually speeds up following the gradient of the ground signal.
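The split of the ramp into 24 steps followed by 8 steps is consistent with a five-bit code in which the n-channel path turns on only when C3 and C4 are both high. The sketch below merely counts codes; it assumes that C0 through C4 form a binary code with C4 as the most significant bit and that the ramp steps through the codes in increasing order, neither of which is stated explicitly in the text. The function name is hypothetical.

```python
def ramp_region(code):
    """Classify a 5-bit DAC code by which part of the circuit sets the step.

    Assumption: bit 3 of the code is C3 and bit 4 is C4 (C4 most significant).
    """
    c3 = (code >> 3) & 1
    c4 = (code >> 4) & 1
    return "n-channel" if (c3 and c4) else "pull-up"


regions = [ramp_region(code) for code in range(32)]
print(regions.count("pull-up"), regions.count("n-channel"))
```

Under these assumptions the first 24 codes fall in the pull-up region and the last 8 codes in the n-channel region, matching the step counts quoted for falseGround.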

The Yuan-Svensson latch, due to its sensitivity towards a high data input when the clock input is high, has a rather large set-up time requirement for a low data input and thus is not suitable for the latches T, X and R of the design. The modified latch, shown in Figure 4.5, avoids the problem by attaching the data input to the pull-down path of the output of both the p-stage and the n-stage of the latch. Thus, while the clock input is high, if the data goes low and then high, the output of the latch remains stable, which removes the large setup time requirement for low data inputs.

The set-up and hold times for our latches were found to be roughly 150ps and 90ps

respectively from the simulation studies. These times are much shorter than the delays

Figure 4.6: en generation

Figure 4.7: Receiver Shift Register

through the latch control circuitry. I widen the skew tolerance window by delaying the clocks for latch-T and latch-R (by δ1 and δ2). With this padding, the skew tolerance of the interface is determined by the minimum cycle time of the latch controller. This cycle time is 340ps. Thus, the skew tolerance window has width 2P − 680ps. The skew window is wider than the clock period for clock frequencies of 1400MHz or lower. Under these conditions, the interface can operate with an arbitrary fixed skew. Initially, the self-resetting latch controller generated pulses on en that were of marginal width for triggering the latches. I did not want to modify the latch controller as this would increase its cycle time and decrease its skew tolerance. Instead, I used a self-resetting buffer to generate en, as shown in Figure 4.6, and widened the pulse by including sufficient delay in the reset path for the buffer. This buffer takes the c output from the latch controller (see Figure 3.6) and generates the en signal which triggers latch X. Finally, the toggle circuit converts each narrow pulse on en into one transition to facilitate off-chip observation.
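The frequency limit quoted for an arbitrary fixed skew follows from the window width. A short worked version, writing W for the window width and η for the 340ps controller cycle time (W is a symbol introduced here for illustration):

```latex
\[
W = 2P - 2\eta = 2P - 680\,\text{ps}, \qquad
W > P \iff P > 680\,\text{ps} \iff f = \frac{1}{P} < \frac{1}{680\,\text{ps}} \approx 1.47\,\text{GHz}
\]
```

The 1400MHz figure in the text is therefore a slightly conservative rounding of this bound.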

Receiver stage

The receiver stage consists of receiver’s latch R, shift register SR and an error detection

Figure 4.8: Error Detection Circuit

circuit. The shift register SR takes the output of latch R, recvIP, as its input and recreates an ideal pattern of data which is sent as input to the error circuit. The error detection circuit consists of a series of Yuan-Svensson latches along with an RS latch. As seen in Figure 4.8, the XOR gate generates a high signal whenever the obtained data differs from the expected data. The rest of the latches keep the error signal high for one out of six receiver clock cycles. Thus, an error propagating through the receiver shift register does not produce multiple error reports.
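The reason a single bad bit could otherwise be reported more than once can be seen in a small software model of a self-synchronizing checker. This is a sketch only: the feedback taps used here (an 11-bit register with recurrence a[n] = a[n−11] XOR a[n−9], which has period 2047) are an assumption chosen for illustration, not necessarily the tap positions used on the chip, and the function names are hypothetical.

```python
N = 11            # register length, giving a pattern of length 2**11 - 1 = 2047
TAPS = (0, 2)     # ASSUMED tap positions within the 11-bit window (oldest bit = 0)


def lfsr_bits(count, seed=1):
    """Generate `count` bits of the assumed LFSR sequence."""
    state = seed
    bits = []
    for _ in range(count):
        bits.append(state & 1)
        feedback = 0
        for t in TAPS:
            feedback ^= (state >> t) & 1
        state = (state >> 1) | (feedback << (N - 1))
    return bits


def mismatches(received):
    """Indices where a received bit differs from the prediction made
    from the previous N received bits."""
    bad = []
    for n in range(N, len(received)):
        window = received[n - N:n]          # window[0] is the oldest bit
        predicted = 0
        for t in TAPS:
            predicted ^= window[t]
        if predicted != received[n]:
            bad.append(n)
    return bad


stream = lfsr_bits(300)
stream[100] ^= 1                            # inject a single bit error
print(mismatches(stream))
```

In this model one flipped bit produces one mismatch when it arrives and one more each time it reaches a tap of the predicting register, which is the kind of repeated report that the extra latches in the error detection circuit are described as suppressing.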

4.2 Test Results

Figure 4.9 shows the structure of the chip.

It has a set of six input signals as follows:

1. ΦT : transmitter’s clock

2. ΦR: receiver’s clock

3. reset: resets the LFSR of the transmitter’s domain

4. ΦU , ΦD and dr: input to the ground modulator of the latch controller.

The four output signals are:

1. transOP: output data of the transmitter’s domain

2. recvIP: input data to the receiver’s domain

3. enToggle: Toggled version of the signal en

4. error: shows error when data from the two domains do not agree.

Figure 4.9: Test Setup

I tested the chip in the System-on-Chip Lab of UBC. The signal ΦT was generated by a Rohde & Schwarz SMT03 signal generator with phase modulation capability. The signal ΦR, generated by an Agilent 8373B, also served as the external clock reference for the Agilent 81200. To achieve complete synchronization between ΦT and ΦR, a synthesized function generator provided a common time base. The parallel BERT, an Agilent 81200, generated the other programmed input signals, namely reset, ΦD, ΦU and dr.

When the reset signal went high, I observed the correct pseudo-random sequence on transOP using an oscilloscope. As soon as the initialization process was completed using ΦD, ΦU and dr, the recvIP signal exhibited the same pattern at a fixed offset to transOP. This verified the functioning of the latch controller. To verify the functionality of the LFSR, I programmed the Agilent 81200 to generate a pulse with a period of 2047 units, to match the pattern length of the LFSR, and used this pulse to trigger the oscilloscope. Stable data patterns were then observed on transOP and recvIP. Due to a top-level wiring error, the output of the signal Error was shorted, and thus I devised additional methods of detecting errors as explained later.

To measure the skew tolerance of the design, I phase modulated the transmitter gradually until the interface no longer operated successfully, as indicated by the error detection methods described later. The observed error free zone is based on the following principle: at any arbitrary phase offset between ΦT and ΦR, the interface can handle a skew tolerance of almost two clock periods; but the amount of skew tolerance in either direction, meaning with the transmitter earlier or later than the receiver, is non-uniform. As an example, Figure 4.10 shows two scenarios with different displacements of the skew tolerance window. In the first scenario, ΦT and ΦR are exactly matched in phase and thus ΦT can happen a clock period earlier or later than the receiver. Thus, an error free zone of almost two clock periods can be observed in this case. In the second scenario, ΦT and ΦR are 180 degrees out of phase. In this case, ΦT can happen half a clock period later and one and a half clock periods earlier than the receiver. Because ΦT moves equally in either direction during phase modulation,

Figure 4.10: Skew Tolerance Window

Figure 4.11: Phase Modulation Tolerance at 20MHz (phase modulation tolerance in radians vs. initial phase offset in degrees; peak at (168, 6.2))

I observed errors as soon as it hit the closer boundary. Thus, a phase modulation of more than half a clock period resulted in errors. Any phase offset between these two values resulted in an error free zone of some intermediate displacement. Thus, the ideal plot would depict a minimum error free zone at zero phase offset between ΦT and ΦR, which would gradually increase as the phase difference approaches 180 degrees. Further increase in phase offset would result in a gradual decrease in the error free zone until it reaches the minimum again.

I started with ΦT and ΦR operating at about 20MHz. The plot obtained is shown in Figure 4.11. As expected, the phase modulation tolerance first increased and then decreased with increasing initial phase offset between transmitter and receiver. The maximum phase modulation obtained, about 6.2 radians, suggests a skew tolerance of almost two clock periods and implies a cycle time for the latch controller of 662 picoseconds. Similarly, at 30MHz, as shown in Figure 4.12, a maximum phase tolerance of 6.17 radians was obtained, which is once again almost two clock periods.
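The implied cycle time can be recovered from the measured peak with a few lines of arithmetic. The sketch below assumes that the quoted tolerance is the peak phase deviation in each direction, so the error free window spans twice that angle, and that the window width equals 2P − 2η as in Section 4.1.2. The function name is hypothetical.

```python
import math


def implied_cycle_time(freq_hz, peak_tolerance_rad):
    """Estimate the latch controller cycle time eta (in seconds).

    Assumptions: the error free window is 2 * peak_tolerance_rad wide
    (deviation in both directions) and equals 2P - 2*eta.
    """
    period = 1.0 / freq_hz
    window = 2.0 * peak_tolerance_rad / (2.0 * math.pi) * period
    return (2.0 * period - window) / 2.0


# 20MHz measurement with a peak tolerance of 6.2 radians
eta_20 = implied_cycle_time(20e6, 6.2)
print(f"{eta_20 * 1e12:.0f} ps")
```

With the 20MHz numbers this reproduces the 662 picosecond figure quoted above; the same function can be applied to the 30MHz measurement.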

To detect the boundaries of the error free zone, and thus determine the skew tolerance

of the design, I devised two methods:

Figure 4.12: Phase Modulation Tolerance at 30MHz (phase modulation tolerance in radians vs. initial phase offset in degrees; peak at (157, 6.17))

• Method 1: In this method, the transmitter and receiver are initialized at some phase offset from each other. The outputs transOP and recvIP are observed on an oscilloscope. The transmitter is then phase modulated with the amount of phase modulation increased gradually. Initially the output from recvIP is stable and does not drift with time. This is possible due to operation of the latch controller in the receiver-last mode. But when the amount of phase modulation is such that it hits one of the boundaries of the error-free zone, the recvIP output drifts by a clock period at every cycle due to shifts in the mode of operation from receiver-last to transmitter-last as the controller crosses the boundaries of the feasible window of operation.

• Method 2: In this method, the Agilent 81200 was controlled programmatically. The output recvIP was attached to the Agilent 81200, captured and observed. A lookup table was constructed internally, consisting of the same pattern of data as produced by the transmitter LFSR. The first 11 bits of any block of captured data indicated its position in the lookup table; the subsequent bits were then compared against the lookup table and any inconsistencies were reported as “errors”. Once again the transmitter was phase modulated with the amount of phase modulation gradually increased. When the amount of phase modulation crossed the feasible range, “errors” were reported by the software, which determined the position of the error free zone.
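The lookup-table check of Method 2 can be sketched in software as follows. This is an illustration rather than the test program actually used: the LFSR feedback taps are an assumption (any maximal-length 11-bit LFSR gives 2047 distinct 11-bit windows, which is all the method relies on), and the function names are hypothetical.

```python
N = 11
PERIOD = 2 ** N - 1   # 2047, the pattern length of the LFSR


def reference_pattern():
    """One full period of an ASSUMED maximal-length 11-bit LFSR sequence."""
    state, bits = 1, []
    for _ in range(PERIOD):
        bits.append(state & 1)
        feedback = (state ^ (state >> 2)) & 1      # assumed taps
        state = (state >> 1) | (feedback << (N - 1))
    return bits


def build_lookup(pattern):
    """Map every 11-bit window (taken cyclically) to its position in the pattern."""
    wrapped = pattern + pattern[:N]
    return {tuple(wrapped[i:i + N]): i for i in range(len(pattern))}


def count_errors(captured, pattern, lookup):
    """Locate the capture using its first 11 bits, then compare the remaining bits.

    Returns None if the first 11 bits do not occur in the table.
    """
    start = lookup.get(tuple(captured[:N]))
    if start is None:
        return None
    errors = 0
    for k in range(N, len(captured)):
        if captured[k] != pattern[(start + k) % len(pattern)]:
            errors += 1
    return errors


pattern = reference_pattern()
lookup = build_lookup(pattern)
capture = [pattern[(500 + i) % PERIOD] for i in range(200)]   # error free capture
print(count_errors(capture, pattern, lookup))
```

A limitation of this scheme, visible in the sketch, is that an error within the first 11 captured bits misplaces the block in the table, so the comparison then reports many errors rather than one.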

This chapter described the implementation details of a proof-of-concept chip demonstrating the functionality of minSTARI. The test data is generated and verified on the chip itself using additional structures. The chapter also described the test setup and the various observations and results obtained from testing the silicon. The design has been shown to possess very high skew tolerance at lower frequencies. I am in the process of further testing the interface at higher frequencies ranging up to 2 GHz. Once the skew tolerance limits are confirmed at higher frequencies, the robustness of the interface will be tested using dynamic initialization.

Chapter 5

Generalizations: Rational, Close and Arbitrary Clocks

This chapter describes extensions of our basic interface, which can handle multiple clock domain designs with rational, nearly matched, and arbitrary but stable clocks. As the frequencies of the clocks of different domains can differ, data is not transmitted or received on every cycle of the faster clock. Thus, proper flow control is necessary to ensure reliable communication between domains. I start with the scenario where clock frequencies are rational multiples of each other. Sections 5.2 and 5.3 extend the design further to scenarios where clocks are nearly matched in frequency or have arbitrary and stable frequencies.

5.1 Rational Clock Frequency Multiples

Consider the situation depicted in Figure 1.2: the frequencies of the sender’s and receiver’s

clocks are pre-determined rational multiples of each other. Let PT be the period of the

transmitter’s clock and PR be the period of the receiver’s clock. Let NR and NT be positive, mutually prime integers with NR/NT = PT/PR (NT and NR correspond to the frequencies of the respective clocks). I developed my designs assuming NR > NT and describe the NT > NR case at the end of this section.

Figure 5.1: An Interface with Rational Clocks

Figure 5.1 shows the design for NR > NT . By the assumption that the receiver operates

at a higher rate than the transmitter, there will be receiver cycles for which no new trans-

mitted data is available. The rate multiplier outputs NT pulses on node ΦU to the latch

controller for every NR cycles of the receiver’s clock as shown below:

    sum := 0;
    for each cycle of ΦR do
        if sum ≥ 0 then
            output a pulse on ΦU;
            sum := sum + NT − NR;
        else
            sum := sum + NT;
        endif
    od
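The rate multiplier loop above translates directly into executable form. The following is a minimal software sketch for checking its behaviour, not a description of the hardware; the function name and the `start` parameter (the initial value of sum) are introduced here for illustration.

```python
def rate_multiplier(n_t, n_r, cycles, start=0):
    """Simulate the rate multiplier for `cycles` cycles of the receiver clock.

    Returns (pulses, sums): pulses[i] is 1 if a pulse is output on Phi_U in
    receiver cycle i, and sums[i] is the value of sum at the start of that cycle.
    """
    total = start
    pulses, sums = [], []
    for _ in range(cycles):
        sums.append(total)
        if total >= 0:
            pulses.append(1)          # output a pulse on Phi_U
            total += n_t - n_r
        else:
            pulses.append(0)
            total += n_t
    return pulses, sums


# Transmitter frequency 3/5 of the receiver frequency: NT = 3, NR = 5
pulses, sums = rate_multiplier(3, 5, 5)
print(pulses)   # [1, 0, 1, 0, 1]
print(sums)     # [0, -2, 1, -1, 2]
```

For NT = 3 and NR = 5 the loop emits three pulses in every five receiver cycles, and sum visits each value from NT − NR to NT − 1 once per period.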

By analogy with ΦR and ΦR′, let ΦU′ be the internally delayed version of ΦU in the latch controller, and let δT′U′ and δU′T′ be defined as δT′R′ and δR′T′, respectively. As noted above, the rate multiplier introduces periodic jitter into ΦU with a period of NR PR = NT PT. Let δU′T′,0 denote the time from the rising edge of ΦU′ produced by the (kNR)th rising edge of ΦR to the next rising edge of ΦT′ for any integer k. It is straightforward to show:

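\[
\begin{aligned}
\min(\delta_{U'T'}) &= \delta_{U'T',0} \\
\max(\delta_{U'T'}) &= \delta_{U'T',0} + P_R - \frac{P_R}{N_T} \\
\min(\delta_{T'U'}) &= P_T - \delta_{U'T',0} - P_R + \frac{P_R}{N_T} \\
\max(\delta_{T'U'}) &= P_T - \delta_{U'T',0}
\end{aligned}
\tag{5.1}
\]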

Figure 5.2: Exploiting Periodic jitter

The cycle time constraints of the latch controller can be satisfied if:

\[
\max\left(\delta_{U'T',0},\; P_T - \delta_{U'T',0} - P_R + \frac{P_R}{N_T}\right) > \eta \tag{5.2}
\]

which holds for any value of δU′T′,0 if:

\[
P_T - \left(1 - \frac{1}{N_T}\right) P_R > 2\eta \tag{5.3}
\]
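The step from equation 5.2 to equation 5.3 can be spelled out as follows, writing C for the sum of the two arguments of the max (C is a symbol introduced here for illustration):

```latex
\[
C = P_T - P_R + \frac{P_R}{N_T}, \qquad
\max\bigl(\delta_{U'T',0},\; C - \delta_{U'T',0}\bigr) \ge \frac{C}{2}
\quad\text{with equality at } \delta_{U'T',0} = \frac{C}{2}.
\]
\[
\text{Hence (5.2) holds for every } \delta_{U'T',0}
\iff \frac{C}{2} > \eta
\iff P_T - \left(1 - \frac{1}{N_T}\right) P_R > 2\eta .
\]
```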

For designs where the latch set-up and hold requirements are the dominant constraints, sim-

ilar bounds for PT and PR can be derived. Thus, the one-stage FIFO can interface between

synchronous domains operating at different, rationally related frequencies. Furthermore,

the initialization methods described in Section 3.3.1 apply directly to the rational clocks

case. Comparing equation 5.3 with equation 3.2 shows that the minimum period required

for the slower clock has been increased by the jitter of ΦU created by the rate multiplier.

The rate multiplier introduces periodic jitter. We can exploit this predictability to

increase the robustness of the interface. For every NR consecutive cycles of the receiver’s

clock, the variable sum takes on each value in {NT − NR, . . . , NT − 1} exactly once. The

initial value of sum is arbitrary, and we can use this freedom to increase the skew tolerance

of our design.


Figure 5.2 shows the operation of our interface where the transmitter clock frequency

is 3/5 that of the receiver. The traces for sum1 and ΦU,1 show the worst-case sequence for

sum: with this choice η < PR/2 must hold for proper operation. In particular, if the ΦU

event generated when sum1 transitions from 2 to 0 triggers the latch controller to produce

a ΦX event, then the self-reset cycle must complete in time for the rising edge of ΦT that

occurs PR/2 later (indicated by the arrow labelled A in the diagram). On the other hand,

if this ΦU event does not trigger the latch controller, then the subsequent ΦT event must,

and the resulting self-reset cycle must complete in time for the next ΦU event, again PR/2

later (indicated by the arrow labelled B in the diagram).

The traces for sum2 and ΦU,2 show the optimal sequence for sum for the same transmitter

and receiver clocks. For this scenario, the critical timing occurs when the rising edge of

ΦU that is produced when sum2 goes from 2 to 0 triggers a ΦX pulse. The self-reset of

the latch controller must complete prior to the next ΦU pulse PR time units later. Thus,

with this choice of the sum sequence, the latch controller can operate at half the rate as

required by the worst-case sequence. I first derive the constraints on PT and PR that ensure

proper operation for any phase difference between the two clocks assuming the optimal sum

sequence. I then describe how our initialization technique from Section 3.3.1 can be adapted

to find this sequence.

Regardless of how the sum sequence is chosen, the ΦU clock has a jitter of (1 − 1/NT)PR with respect to an evenly spaced clock with period PT. The maximally robust sequence for sum centers the ΦU jitter interval as closely as possible on the ΦT clock. The interval may be off center by as much as gcd(PR, PT) = PR/NT due to the discrete set of choices for the sum sequence. From these observations, the smaller of the time from a rising edge of ΦT′ that triggers the latch controller to the next rising edge of ΦU′ or vice-versa is PT − PR/2. It is also possible that a rising edge of ΦU′ triggers the latch controller and that the next input event for the controller is the next rising edge of ΦU′. The minimum time between two such rising edges is ⌊NR/NT⌋PR = PT − ((NR mod NT)/NT)PR. Combining these two constraints yields that there is

Figure 5.3: A Miss Detector

Figure 5.4: Receiver Frequency vs. Cycle Time constraint (maximum permitted cycle time vs. receiver frequency; transmitter frequency = 7 units; curves: Improved Margin, Original Margin)

a feasible sequence for sum such that the cycle time constraints of the latch controller are satisfied as long as:

\[
P_T - \max\left(\frac{1}{2},\; \frac{N_R \bmod N_T}{N_T}\right) P_R > \eta \tag{5.4}
\]

Comparing with equation 5.3 we see that choosing the optimal sum sequence can greatly

relax the cycle time requirement for the latch controller, or, equivalently, greatly increase

the robustness of the interface. For example with PR = 1ns and PT = 1.2ns (NR = 6, NT = 5), equation 5.3 (fixed choice for the sum sequence) requires η < 0.2ns. With the optimal choice for the sum sequence, equation 5.4 requires η < 0.7ns, a reduction in the speed required by a factor of 3.5


for this example. As another example, Figure 5.4 shows the improved margin obtained by

choosing the optimal sum sequence with a fixed transmitter frequency which in this case is

7 units.
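The two bounds on η are easy to evaluate exactly with rational arithmetic. The helper functions below are a sketch that simply evaluates equations 5.3 and 5.4; their names are hypothetical, and periods are expressed in units of PR.

```python
from fractions import Fraction


def eta_bound_fixed(p_t, p_r, n_t):
    """Upper bound on eta from equation 5.3 (arbitrary, fixed sum sequence)."""
    return (p_t - (1 - Fraction(1, n_t)) * p_r) / 2


def eta_bound_optimal(p_t, p_r, n_t, n_r):
    """Upper bound on eta from equation 5.4 (optimal sum sequence)."""
    return p_t - max(Fraction(1, 2), Fraction(n_r % n_t, n_t)) * p_r


# Figure 5.2 case: transmitter at 3/5 of the receiver frequency, PR = 1, PT = 5/3
print(eta_bound_fixed(Fraction(5, 3), 1, 3))        # 1/2
print(eta_bound_optimal(Fraction(5, 3), 1, 3, 5))   # 1

# PR = 1ns, PT = 1.2ns (NR = 6, NT = 5), optimal sequence
print(eta_bound_optimal(Fraction(6, 5), 1, 5, 6))   # 7/10
```

For the Figure 5.2 case this gives η < PR/2 for the worst-case sequence and η < PR for the optimal one, in line with the factor of two discussed for that figure.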

The sequence for sum that maximizes robustness can be selected as part of the initial-

ization of the interface. This approach is based on two observations. First, the optimal

sequence works with a larger value of η than any other sequence. Second, we can shift from

one sequence to another by adding NT +1 (resp. NT −NR +1) to sum instead of NT (resp.

NT −NR). Generalizing the initialization technique described in Section 3.3.1, we can start

with a large value for η and gradually decrease it. Each time the latch-controller fails to

reset in time for the next ΦT ′ or ΦU ′ event, we shift to the next sum sequence. When η

is small enough that the latch controller can operate with the optimal sum sequence, but

not the others, then the rate multiplier will switch from one sequence to the next until it

reaches the optimal one. At this point, the latch controller will successfully reset after each

cycle in time for the next ΦT ′ and ΦU ′ events, and the rate multiplier will remain with the

optimal sequence.

Figure 5.3 shows my circuit that reports when a rising edge of ΦT or ΦU arrives at the latch controller prior to the completion of the controller’s internal reset. Such an event is called a “miss” and I call the circuit a “miss detector”. A miss occurs if a rising edge is received while the c signal of the latch controller (see Figure 3.6) is still low. Noting that ΦX is an inverted version of c, we can use the ΦX signal in the series stacks of transistors that detect such events. The delay of the inverter that produces ΦX gives our circuit a little extra margin: it also reports “near misses”. When a (near) miss occurs, node x goes low and node y goes high. These transitions occur asynchronously with respect to ΦR. It is assumed that the pulse width of ΦX is less than the period of ΦR to ensure that the “miss” signal is asserted for only one period of ΦR. The synchronizer provides a delayed version of y in the receiver’s clock domain and the receiver switches to the “next” sequence on the reported “miss”. The synchronizer is only active during initialization and does not contribute to the

Figure 5.5: Interface for Nearly Matched Clocks

latency of data transfers under steady-state operation. Accordingly, the synchronizer can

have a large latency and correspondingly minuscule probability of failure.

This section has shown how minSTARI can be extended for use in designs with multiple,

rationally related clock frequencies. I have focused on the case where the receiver clock

frequency is greater than that of the transmitter. If the transmitter has the higher clock

frequency, equivalent designs can be used with the rate multiplier in the transmitter’s clock

domain.

5.2 Plesiochronous Interfaces

Now consider designs with multiple clock domains with independent clocks that are closely

matched in frequency; these are called “plesiochronous” interfaces (see [Mes90, DDX95]).

Such designs occur, for example, when the sender and receiver are physically separated (e.g.

networks), or when separate clock generators are used to avoid introducing a single point

of failure into the design. Typically, the clock frequencies will be matched to within a few

parts per million, a tolerance that is easily achieved with crystal oscillators.

With close frequency matching, the relative timing of clock edges at the latch interface

changes very slowly. In particular, critical synchronization events occur at a rate corre-

sponding to the difference between the clock frequencies. We can modify the miss detector

circuit from Figure 5.3 to provide an output indicating when a rising edge from ΦT occurs


shortly after the latch controller completes its reset, and another output indicating when

the timing of ΦR is close to the margin. Furthermore, we can use a delayed version of ΦX

so that near-misses will be reported when a significant margin still remains. If these signals

indicate, for example, when only 0.1P of margin remains, thousands of cycles remain before

an error could actually occur. Thus, we can synchronize these signals to the transmitter

and receiver clocks with extremely high reliability, and use the synchronized versions to take

appropriate corrective action. For example, if a rising edge of ΦR occurs less than 0.1P after

the latch controller completes its reset, then the receiver can skip clocking the latch con-

troller on a subsequent cycle. This will switch the interface from operating with the rising

edge of the transmitter’s clock arriving much after the corresponding edge of the receiver’s

clock to operation where the receiver’s clock edge arrives slightly after the transmitter’s.

Likewise, if a rising edge of ΦT occurs less than 0.1P after the latch controller completes

its reset, then the transmitter can skip sending data and clock on a subsequent cycle. Such

protocols are commonly implemented using “stuff bytes” [DDX95] where extra bytes are

padded at the end of a data packet, with padding added if the transmitter lags behind or

deleted if the receiver is slow. Such protocols are easily implemented in the framework of

our latch controller and miss detector.
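The claim that thousands of cycles remain once a near miss is reported can be checked with simple arithmetic. The sketch below assumes that the relative edge timing drifts by roughly (frequency mismatch) × P per cycle; the 10 ppm value is an illustrative stand-in for "a few parts per million", and the function name is hypothetical.

```python
def cycles_until_margin_used(margin_fraction, mismatch_ppm):
    """Cycles needed for the clock edges to drift through the remaining margin.

    margin_fraction: remaining margin as a fraction of the clock period P
    mismatch_ppm:    relative frequency mismatch in parts per million (assumed)
    """
    drift_per_cycle = mismatch_ppm * 1e-6    # drift per cycle, as a fraction of P
    return margin_fraction / drift_per_cycle


# 0.1P of margin remaining, clocks mismatched by an assumed 10 ppm
print(round(cycles_until_margin_used(0.1, 10)))
```

Under these assumptions about ten thousand cycles are available for synchronizing the near-miss signal and taking corrective action, and a smaller mismatch leaves proportionally more.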

Although synchronizations are required during operation, the latency of these synchronizations is not critical for the latency of the data path. The data latency for our interface is always less than 2P. By adding an arbiter to detect when the latch controller is in receiver-last mode with enough margin to be able to safely switch to transmitter-last, the worst-case data latency can be reduced to slightly greater than P with an average latency slightly greater than P/2.

Figure 5.6: Interface for Arbitrary Clocks

5.3 Arbitrary Clock Frequencies

We now consider the case where the transmitter and receiver operate with independent

clocks at arbitrary frequencies. Initially, it might seem that such a design requires the

overhead of synchronizing buffers as described in Section 2.3.2. However, clock frequencies

are extremely stable in nearly all synchronous designs. We can exploit this stability even

if the frequencies aren’t known in advance. We combine our designs from the previous two

sections to support communication with arbitrary clock frequencies.

Firstly, the transmitter and receiver forward their clocks to each other. Each uses a

counter to produce an initial estimate of the clock frequency of the other. These estimates

provide a rational approximation of the ratio of two clock frequencies. If the nominal clock

frequencies are known in advance, this step can be skipped.
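One way to turn the counter values into a rational approximation is to reduce the ratio of the two counts to a fraction with a bounded denominator. The sketch below uses Python's standard `fractions` module; the counting window, the counts, and the denominator limit are illustrative assumptions, and the function name is hypothetical.

```python
from fractions import Fraction


def estimate_ratio(own_count, other_count, max_denominator):
    """Approximate (other clock frequency) / (own clock frequency).

    own_count:       cycles of the local clock in the measurement window
    other_count:     cycles of the forwarded clock seen in the same window
    max_denominator: largest denominator allowed (assumed hardware limit)
    """
    return Fraction(other_count, own_count).limit_denominator(max_denominator)


# The receiver counts 1000 of its own cycles and sees 601 transmitter cycles.
ratio = estimate_ratio(1000, 601, 16)
print(ratio)   # 3/5
```

Here the numerator and denominator of the result play the roles of NT and NR for the rate multiplier; a larger denominator limit gives a finer approximation at the cost of a longer jitter period.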

Secondly, if the receiver’s clock frequency is higher than that of the transmitter, it

uses a rate-multiplier to create an approximation of the transmitter’s clock. Likewise, the

transmitter uses a rate multiplier if it has the higher clock frequency. The latch controller

operates with the (possibly rate-multiplied) clocks provided by the transmitter and receiver.

Because the frequency values that we have for the two clocks are only approximations,

albeit very accurate ones, the FIFO will be prone to occasional underflow or overflow.

Separate “near-miss” signals for ΦT and ΦR forward near miss events to the client with the


faster clock, i.e. the one using the rate-multiplier. This client updates its estimate of the

other client’s clock frequency thus changing the rate of events output by its rate multiplier.

This is a second-order control system, so some care is needed to ensure stability.

A simple approach is that the client with the faster clock uses a counter to measure the

time between near-miss events and uses this information to update its estimate of the other

client’s clock frequency. This process is quadratically convergent and stable. At the same

time as updating the frequency estimate, a first-order correction can be applied by adding

an offset to the rate multiplier’s sum to bring the latch controller back to a point near the center of its safe

operating region. If the near miss was for the clock generated by the rate multiplier, then

this offset should be negative (to retard the rate multiplier); otherwise it should be positive.
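The update rule can be illustrated with a toy numerical model. The sketch assumes that the phase error grows by the ratio error on every fast-clock cycle, that a near miss is flagged when the error leaves a symmetric margin, and that the first-order correction re-centers the phase to zero. The names `cycles_until_near_miss` and `update_estimate` and the constants are hypothetical; this is not the circuit itself.

```python
def cycles_until_near_miss(r_true, r_est, margin, max_cycles=10_000_000):
    """Count fast-clock cycles until the accumulated phase error leaves +/-margin."""
    step = r_est - r_true          # phase error gained per fast-clock cycle
    phase = 0.0                    # start at the center of the safe region
    for n in range(1, max_cycles + 1):
        phase += step
        if abs(phase) >= margin:
            return n, phase
    raise RuntimeError("no near miss within max_cycles")


def update_estimate(r_est, n, phase, margin):
    """Correct the ratio estimate from the measured interval n.

    phase > 0 models a near miss caused by the rate-multiplied clock
    running ahead, so the estimate is reduced; otherwise it is increased.
    """
    direction = 1.0 if phase > 0 else -1.0
    return r_est - direction * margin / n


R_TRUE = 0.7312    # assumed true frequency ratio (unknown to the hardware)
MARGIN = 0.25      # assumed phase margin, as a fraction of a period
r_est = 0.73       # assumed initial estimate from the counters
errors = [abs(r_est - R_TRUE)]
for _ in range(2):
    n, phase = cycles_until_near_miss(R_TRUE, r_est, MARGIN)
    r_est = update_estimate(r_est, n, phase, MARGIN)   # phase is then re-centered
    errors.append(abs(r_est - R_TRUE))
```

In this model the error after each update is bounded by roughly the square of the previous error divided by the margin, which is consistent with the quadratic convergence stated above.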

As for the plesiochronous interface described in the previous section, synchronizations

are required during operation. Again, the latency of these synchronizations is not critical for

the latency of the data path. These synchronizations are infrequent; their rate is determined

by the resolution of the rate-multiplier and the drift rate of the clock frequencies.
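As a rough illustration of how infrequent these synchronizations can be, consider an estimate that ignores clock drift. The 16-bit fractional resolution and the quarter-period margin are assumed values chosen only for the arithmetic.

```latex
% Assumed symbols (not from the text above):
%   \epsilon = residual error of the frequency-ratio estimate,
%              in slow-clock events per fast-clock cycle
%   \delta   = phase margin before a near miss, as a fraction of a period
%   N        = fast-clock cycles between near-miss events
N \approx \frac{\delta}{\epsilon},
\qquad
\epsilon \le 2^{-17},\quad \delta = \tfrac{1}{4}
\;\Longrightarrow\;
N \ge \frac{1/4}{2^{-17}} = 2^{15} = 32768 .
```

Here the bound on the residual error is half of the least significant bit of a 16-bit fraction. Frequency drift between the two clocks shortens this interval.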

5.4 A FIFO Interface

The interfaces presented in Sections 5.1 through 5.3 do not provide data transfers on

every cycle of the transmitter and receiver clocks. The cycles on which transfers occur are

determined by the interface. In many designs, the sender and receiver need more control

of when transfers occur. Having solved the problems of mismatched clocks, flow control is

straightforward. For example, Figure 5.7 shows an implementation that presents a FIFO

interface to both the transmitter and receiver. The equations for Empty, gr, and pr are:

Empty = er ∧ ¬pt

gr = Get ∧ ¬er

pr = pt ∧ ¬(er ∧ Get)

(5.5)


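Figure 5.7: Implementing a FIFO interface

[Diagram not reproduced. Labelled elements: FIFO-1, FIFO-R, two rate multipliers, a multiplexer (a0, a1, s, y), clocks ΦT, ΦR, φT, φR, φT′, φR′, internal signals pt, pr, gr, er, d_in, put, near_full, d_out, get, empty, and client signals Put, D_in, Full, D_out, Get, Empty.]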

FIFO-R is a purely synchronous FIFO clocked by ΦR. FIFO-1 is our single-stage FIFO

design. The two rate-multipliers compensate for frequency differences between ΦR and ΦT,

and the box labelled “sync” is a synchronizer.

When the transmitter performs a put, it must assert Put and the data at D_in until

the next φT′ event. If the transmitter’s clock frequency is no greater than the receiver’s,

then there can be a φT′ event for every event of ΦT and the interface can accept data from

the transmitter on every cycle. If the transmitter does not perform a Put for some φT′

event, then the resulting “empty” value will be noted by a false value of pt at the output

of FIFO-1.

If FIFO-R is non-empty, Get requests from the receiver are forwarded to FIFO-R on

gr and data from FIFO-R is output on D_out. If FIFO-R is empty, then the receiver is

signalled that data is available if and only if the transmitter performed a Put for the last

transfer of FIFO-1: i.e., pt is true. In this case, the value at the output of FIFO-1 bypasses

FIFO-R and goes directly through the multiplexer to D_out. The only increase in latency

for our design is the latency of this multiplexer.
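The receiver-side behaviour given by Equation 5.5 and the bypass path can be checked with a small behavioural model. The sketch below treats one ΦR cycle as one function call, models FIFO-R as an unbounded queue, and assumes the multiplexer selects the FIFO-1 output exactly when er is true. The names `interface_signals` and `receiver_step` are hypothetical, and the capacity and near_full logic are omitted.

```python
from collections import deque


def interface_signals(er: bool, pt: bool, get: bool):
    """Equation 5.5: returns (Empty, gr, pr)."""
    empty = er and not pt             # nothing in FIFO-R and nothing arriving
    gr = get and not er               # forward Get to FIFO-R only if it holds data
    pr = pt and not (er and get)      # store the arrival unless it is bypassed
    return empty, gr, pr


def receiver_step(fifo_r: deque, pt: bool, d, get: bool):
    """Model one receiver cycle; returns (Empty, D_out or None)."""
    er = len(fifo_r) == 0
    empty, gr, pr = interface_signals(er, pt, get)
    d_out = None
    if get and not empty:
        # Assumed mux select: bypass FIFO-R when it is empty.
        d_out = d if er else fifo_r[0]
    if gr:
        fifo_r.popleft()
    if pr:
        fifo_r.append(d)
    return empty, d_out


fifo = deque()
trace = [
    receiver_step(fifo, True, "A", False),   # A arrives, no Get: stored
    receiver_step(fifo, True, "B", True),    # Get returns A, B is stored
    receiver_step(fifo, False, None, True),  # Get returns B
    receiver_step(fifo, True, "C", True),    # FIFO-R empty: C is bypassed
    receiver_step(fifo, False, None, True),  # nothing available
]
```

The fourth step shows the bypass: the item reaches D_out in the cycle it arrives and never enters FIFO-R, so data order is preserved without adding a FIFO-R stage to the latency.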

If many cycles elapse without the receiver performing a Get, then FIFO-R may become

full. The near_full output of this FIFO indicates that FIFO-R has little remaining capacity.


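Figure 5.8: Symmetry in FIFO interface

[Diagram not reproduced. Labelled elements: two rows of three D/Q latches, latch controller with en, clocks ΦT and ΦR, and signals D_in, d_in, d_out, near_full, Full, Full_T.]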

This signal must be conveyed to the ΦT domain to indicate to the transmitter that the

interface is full. A basic symmetry underlying our design helps us in achieving this transfer

without much effort. The latches which generate D_in and Full_T belong to the transmitter’s

domain. Let Full_T be the signal which stops further generation of data. The timing

diagram of Figure 5.9 shows the scenarios where generation of data “A” leads to a nearly

full FIFO-R. As seen, regardless of the mode of operation of the latch controller, the

return path latency of the interface is always two clock periods. Thus, it is sufficient for FIFO-R

to have spare capacity for one data item when near_full goes high.

In this chapter, I have presented interfaces for various multiple clock domain scenarios by

adding functionality to the basic single-stage FIFO interface. The designs either

need no synchronization or keep the required synchronization off the latency-critical

path, and thus introduce no extra latency. I also introduced a novel “miss-detector”

that can be used for detecting “misses” or “near-misses” in communication. Finally, I presented a

generic FIFO interface which exploits the existing symmetry in our design and

can be applied to all of the timing scenarios discussed earlier.


Figure 5.9: Timing scenarios with nearly full FIFO-R

[Timing diagram not reproduced. Panels: (a) transmitter-last scenario; (b) receiver-last scenario. Traces in each panel: D_in, d_in, near_full, d_out, Full, Full_T, carrying data items A and B.]

Chapter 6

Conclusion

I have presented a very simple design for source-synchronous communication. It is based on

a self-timed, ripple FIFO with a single stage. Whereas a single-stage, pointer FIFO provides

no skew compensation (such a pointer FIFO is simply a latch clocked by the transmitter),

the single-stage ripple FIFO provides nearly two clock periods of skew tolerance and can

operate correctly for any initial phase offset between the transmitter and the receiver of the

channel. The simplicity of the single-stage FIFO enables simplifications and optimization,

thus taking good advantage of self-timed design. I presented a design consisting of a self-

resetting, edge-triggered C-element that generates a clock intermediate to the clocks of the

transmitter and receiver. This intermediate clock strobes a latch that conveys data from

the transmitter to the receiver. The timing of this clock signal ensures that the set-up and

hold requirements of the receiver and the intermediate latch are all satisfied. My design can

be initialized to provide maximal robustness against clock jitter and skew drift by adjusting

the speed of the self-resetting C-element during its initial operation. Alternatively, the

interface can be initialized for minimum latency by deliberately suppressing a transmitter

clock event to the latch during initialization.

Chapter 5 showed how this design can be adapted for more generalized clocking scenarios

including clocks with rationally related frequencies, closely matched clocks, and arbitrary


clocks. In all of these designs, any synchronization is carried out on a path whose latency

does not impact the data path. Thus, my designs can achieve latencies that are at most

slightly more than one clock period and typically about half the clock period. To achieve

this performance, we exploit the frequency stability of clocks in synchronous designs.

In general, the overheads of synchronization and handshaking are only needed to address

timing issues that cannot be resolved statically. When the clocks of the transmitter and

receiver are identical in frequency, then only the relative phase needs to be resolved, and

this can be done by a simple handshaking circuit such as the latch controller shown in

Figure 3.6. When the clocks are rationally related, the client with the faster clock can

use a rate multiplier to construct an approximation of the other client’s clock. There are

numerous isomorphic sequences of events that can be generated by the rate multiplier, and

synchronization occurs during initialization to determine the optimal sequence. When the

clocks are closely matched in frequency, only the long-term drift needs to be identified. The

synchronizer that detects this drift can have high latency without impacting the data path

latency. Finally, when arbitrary clock frequencies are used, the frequency stability of these

clocks enables accurate approximation of the frequency ratio. Again, synchronization is

only needed to detect long-term drift, and this does not impact the data path latency.

I designed and fabricated a proof-of-concept chip in the TSMC 0.18µ CMOS process

that implements the single-stage FIFO interface for clients operating at identical clock

frequencies. Initial test results support my claims about the skew tolerance of the design,

and more extensive testing is in progress. I am also currently designing a chip to

demonstrate interfaces based on miss-detectors for rational, closely matched, and arbitrary

clock frequency designs and will fabricate it in the near future.

Bibliography

[BB98] Daniel W. Bailey and Bradley J. Benschneider. Clocking design and analysis

for a 600MHz Alpha microprocessor. IEEE Journal of Solid-State Circuits,

33(11):1627–1633, November 1998.

[BC+99] Kerry Bernstein, Keith M. Carrig, et al. High Speed CMOS Design Styles.

Kluwer, 1999.

[BDM02] Keith A. Bowman, Steven G. Duvall, and James D. Meindl. Impact of die-

to-die and within-die parameter fluctuations on the maximum clock frequency

distribution for gigascale integration. IEEE Journal of Solid-State Circuits,

37(2):183–190, February 2002.

[CC+91] Terry I. Chappell, Barbara A. Chappell, et al. A 2-ns cycle, 3.8-ns access 512-

KB CMOS ECL SRAM with a fully pipelined architecture. IEEE Journal of

Solid-State Circuits, 26(11):1577–1585, November 1991.

[CG02] Ajanta Chakraborty and Mark R. Greenstreet. A minimalist source-

synchronous interface. In Proceedings of the 15th IEEE ASIC/SOC Conference,

pages 443–447, September 2002.

[Cha84] Daniel M. Chapiro. Globally-Asynchronous, Locally-Synchronous Systems. PhD

thesis, Department of Computer Science, Stanford University, October 1984.

Tech. Report STAN-CS-84-1026.

[CM73] T.J. Chaney and C.E. Molnar. Anomalous behavior of synchronizer and arbiter

circuits. IEEE Transactions on Computers, C-22(4):421–422, April 1973.

[CN01] Tiberiu Chelcea and Steven M. Nowick. Robust interfaces for mixed-timing

systems with application to latency-insensitive protocols. In Proceedings of the

38th ACM/IEEE Design Automation Conference, pages 21–26, June 2001.

[CZ02] Atanu Chattopadhyay and Zeljko Zilic. High speed asynchronous structures for

inter-clock domain communication. In Proceedings of the 2002 International

Symposium on Circuits and Systems, pages 517–520, September 2002.

[Dam97] Roger A. Dame. The Alphaserver 4100 low-cost clock distribution system.

Compaq Digital Technical Journal, 8(4):38–47, April 1997.


[Dav99] Bijan Davari. CMOS technology: Present and future. In Proceedings of 1999

Symposium on VLSI Circuits, pages 5–10. IEEE, June 1999.

[DDX95] Larry R. Dennison, William J. Dally, and Duke Xanthopoulos. Low-latency

plesiochronous data retiming. In Proceedings of the Sixteenth Anniversary Con-

ference on Advanced Research in VLSI, pages 304–315, 1995.

[FEG00] S.B. Furber, D. A. Edwards, and J. D. Garside. AMULET3: a 100 MIPS

asynchronous embedded processor. In Proceedings of the 2000 International

Conference on Computer Design, pages 329–334, September 2000.

[GA+02] Stephen Geissler, David Appenzeller, et al. A low-power RISC microprocessor

using dual PLLs in a 0.13µ SOI technology with copper interconnect and low-k

BEOL dielectric. In Proceedings of the 2002 International Solid-State Circuits

Conference, pages 148–149, February 2002.

[Gre93] Mark R. Greenstreet. STARI: A Technique for High-Bandwidth Communi-

cation. PhD thesis, Department of Computer Science, Princeton University,

January 1993.

[Gre95] Mark R. Greenstreet. Implementing a STARI chip. In Proceedings of the 1995

International Conference on Computer Design, pages 38–43, Austin, Texas,

October 1995.

[HN01] David Harris and Sam Naffziger. Statistical clock skew modeling with data

delay variations. IEEE Transactions on VLSI Systems, 9(1):888–898, December

2001.

[HN03] Shaomei Huang and Radu Negulescu. High-performance mixed-clock commu-

nication with resampling, 2003.

[IM02] Anoop Iyer and Diana Marculescu. Power-performance evaluation of Globally

Asynchronous, Locally Synchronous processors. In Proceedings of the 29th In-

ternational Symposium on Computer Architecture, pages 158–168, June 2002.

[JG93] Howard Johnson and Martin Graham. High-Speed Digital Design: A Handbook

of Black Magic. Prentice Hall, 1993.

[KA+01] Andre Kowalczyk, Victor Adler, et al. The first MAJC microprocessor: A dual

CPU system-on-a-chip. IEEE Journal of Solid-State Circuits, 36(11):1609–1616,

November 2001.

[Kat98] Cameron Katrai. Managing clock distribution and optimizing clock skew in

networking applications. Application Notes #14, Pericom Semiconductor Cor-

poration, 1998.


[KB+01] Nasser A. Kurd, Javed S. Barkatullah, et al. Multi-GHz clocking scheme for

Intel® Pentium® 4 microprocessor. In Proceedings of the 2001 International

Solid-State Circuits Conference, pages 404–405, February 2001.

[KN+02] Georgios K. Konstadinidis, Kevin Normoyle, et al. Implementation of a third-

generation 1.1-GHz 64-bit microprocessor. IEEE Journal of Solid-State Cir-

cuits, 37(11):1461–1469, November 2002.

[KP+99] Georgios Kornaros, Dionisios Pnevmatikatos, et al. ATLAS 1: Implementing a

single-chip ATM switch with backpressure. IEEE Micro, 19(1):30–41, Jan/Feb

1999.

[KS96] S Kim and R Sridhar. Self-timed mesochronous interconnection for high-speed

VLSI systems. In Sixth Great Lakes Symposium on VLSI, pages 122–125, March

1996.

[MB+89] Alain J. Martin, Steven M. Burns, et al. The design of an asynchronous mi-

croprocessor. In Proceedings of the Conference on Advanced Research in VLSI,

Caltech, 1989.

[Mes90] David G. Messerschmitt. Synchronization in digital system design. IEEE Jour-

nal on Selected Areas in Communications, 8(8):1404–1419, October 1990.

[MJ+97] Charles E. Molnar, Ian W. Jones, et al. A FIFO ring oscillator performance

experiment. In Proc. International Symposium on Advanced Research in Asyn-

chronous Circuits and Systems, pages 279–289. IEEE Computer Society Press,

April 1997.

[ML+97] Alain J. Martin, Andrew Lines, et al. The design of an asynchronous MIPS

R3000 microprocessor. In Proceedings of the 17th Conference on Advanced

Research in VLSI, pages 164–181, September 1997.

[MS01] Fenghao Mu and Christer Svensson. Self-tested self-synchronization circuit for

mesochronous clocking. In IEEE Transactions on Circuits and Systems, pages

129–140, 2001.

[MT+02] S.W. Moore, George Taylor, et al. Point to point GALS interconnect. In

Proceedings of the Eighth International Symposium on Advanced Research in

Asynchronous Circuits and Systems, pages 62–68, April 2002.

[MVF00] Jens Mutterbach, Tomas Villiger, and Wolfgang Fichtner. Practical design of

globally-asynchronous, locally-synchronous systems. In Proceedings of the Sixth

International Symposium on Advanced Research in Asynchronous Circuits and

Systems, pages 52–59, April 2000.


[RB+01] P.A. Riocreux, L.E.M. Brackenbury, et al. A low-power self-timed Viterbi

decoder. In Proceedings of the Seventh International Symposium on Advanced

Research in Asynchronous Circuits and Systems, pages 15–24, 2001.

[RJDC98] Phillip J. Restle, K. A. Jenkins, A Deutsch, and P W Cook. Measurement

and modelling of on-chip transmission line effects in a 400 MHz microprocessor.

IEEE Journal of Solid-State Circuits, 33(4):662–665, April 1998.

[RM+01] Phillip J. Restle, Timothy G. McNamara, et al. A clock distribution network

for microprocessors. IEEE Journal of Solid-State Circuits, 36(5):792–799, May

2001.

[RT00] Stefan Rusu and Simon Tam. Clock generation and distribution for the first

IA-64 microprocessor. In Proceedings of the 2000 International Solid-State Circuits Conference, pages 176–177, 2000.

[RWW+02] Woonghwan Ryu, Albert Lu Chee Wai, Fan Wei, Wai Lai Lai, and Joungho

Kim. Over GHz low-power RF clock distribution for a multiprocessor digital

system. IEEE Transactions on Advanced Packaging, 25(1):18–27, February

2002.

[S03] Ingemar Soderquist. Globally updated mesochronous design style. IEEE Journal of

Solid-State Circuits, 38(7):1242–1249, 2003.

[Sei79] Charles L. Seitz. System timing. In Introduction to VLSI Systems (Carver

Mead and Lynn Conway), chapter 7, pages 218–262. Addison Wesley, 1979.

[Sei94] Jakov N. Seizovic. Pipeline synchronization. In Proceedings of the First In-

ternational Symposium on Advanced Research in Asynchronous Circuits and

Systems, pages 87–96. IEEE Computer Society Press, 1994.

[SF01] Ivan Sutherland and Scott Fairbanks. GasP: A minimal FIFO control. In

Proceedings of the Seventh International Symposium on Advanced Research in

Asynchronous Circuits and Systems, pages 46–53, April 2001.

[SH97] S Sidiropoulos and M Horowitz. A semi-digital DLL with unlimited phase

shift capability and 0.08-400 MHz operating range. In Proceedings of the 1997

International Solid-State Circuits Conference, pages 332–333, February 1997.

[SM00] Allen E. Sjogren and Chris J. Myers. Interfacing synchronous and asynchronous

modules within a high-speed pipeline. IEEE Transactions on VLSI Systems,

8(5):573–583, October 2000.

[SPL02] Tiberiu Seceleanu, Juha Plosila, and Pasi Liljeberg. On-chip segmented bus: A

self-timed approach. In Proceedings of the 15th IEEE ASIC/SOC Conference,

pages 216–220, September 2002.


[TR+00] Simon Tam, Stefan Rusu, et al. Clock generation and distribution for the first

IA-64 microprocessor. IEEE Journal of Solid-State Circuits, 35(11):1545–1552,

November 2000.

[WGG02] Anthony J. Winstanley, Aurelien Garivier, and Mark R. Greenstreet. An

event spacing experiment. In Proceedings of the Eighth International Symposium

on Advanced Research in Asynchronous Circuits and Systems, pages 42–51,

Manchester, UK, April 2002.

[Wik03] Daniel Wiklund. Mesochronous clocking and communication in on-chip net-

works. In Proceedings of Swedish System-on-chip conference, April 2003.

[XB+01] Thucydides Xanthopoulos, Daniel W. Bailey, et al. The design and analysis of

the clock distribution network for a 1.2 GHz Alpha microprocessor. In Proceedings of the 2001

International Solid-State Circuits Conference, pages 402–403, 2001.

[YD96] Ken Y. Yun and Ryan P. Donohue. Pausible clocking: A first step toward

heterogeneous systems. In Proceedings of the 1996 International Conference on

Computer Design, pages 118–123, October 1996.

[YH00] Evelina Yeung and Mark A. Horowitz. A 2.4 Gb/s/pin simultaneous bidirec-

tional parallel link with per-pin skew compensation. IEEE Journal of Solid-

State Circuits, 35(11):1619–1628, November 2000.

[YS89] Jiren Yuan and Christer Svensson. High-speed CMOS circuit technique. IEEE

Journal of Solid-State Circuits, 24(1):62–70, February 1989.