low power digital receivers for multi- gb/s wireline
TRANSCRIPT
Low Power Digital Receivers for Multi-
Gb/s Wireline/Optical Communication
by
A K M Delwar Hossain
A thesis submitted in partial fulfillment of the requirements for the degree of
Master of Science
in
Integrated Circuits and Systems
Department of Electrical and Computer Engineering
University of Alberta
© A K M Delwar Hossain, 2017
ii
Abstract
As the gate length scaling continues to less than 10nm, digital performances of integrated
circuits (IC) continue to improve at a faster rate than their analog performances. This
naturally leads to a trend where traditional high-speed mixed signal transceivers have to be
replaced by their digital counterparts. For a wireline receiver, this means replacing
traditional analog mixed signal equalizers with digital equalization techniques while still
fitting within challenging power budget. Similarly, traditional optical receivers use analog
mixed signal techniques to solve data dependent DC offset, burst mode timing recovery,
etc. This work introduces fully digital techniques to address these challenges to design
compact, low power digitally-enhanced optical receivers.
The first receiver architecture of this dissertation describes the design technique of energy-
efficient sequence detection and equalization without the use of any ADC and DSP. This
scheme takes advantage of the inter-symbol-interference (ISI) in the channel to reconstruct
the time domain bit sequence. It is the most power efficient digital receiver reported to
date. It improves power efficiency i.e. power consumed per bit by 2.5X and power
consumed per bit per dB channel loss by 2.65X of the state-of-art.
The second receiver takes the architecture of the first receiver and modifies it to introduce
data trace-back. This is the first-time implementation of data trace-back in SerDes. The
data trace-back improves noise immunity and the voltage margin of the system. The added
iii
data trace-back with the decision feedback equalization (DFE) improved BER from 10-10
(DFE only) to 10-12.
The third receiver architecture describes a low power 7-10 Gb/s burst mode DC-coupled
receiver for photonic switch networks. The concurrent operation of DC and timing
recovery implies low latency in burst mode receivers. DC is recovered using SAR
(successive approximation register) logic within 6 cycles of 1/8th of data rate clock, which
is 6.5X improvement from the current state-of-art. It consumes only 32.7 mW during
runtime.
iv
Preface
The concept, architecture, and measurement results of the work in Chapter 2 has been
published in Symposium on VLSI Circuits: “"A 35 mW 10 Gb/s ADC-DSP less direct
digital sequence detector and equalizer in 65nm CMOS," A. D. Hossain, Aurangozeb, M.
Mohammad and M. Hossain, 2016 IEEE Symposium on VLSI Circuits (VLSI-Circuits),
Honolulu, HI, 2016, pp. 1-2.” I was responsible for the system design, data collection, and
manuscripts composition. Aurangozeb assisted me during design and data collection. Dr.
Masum Hossain was the supervisor of the project.
The paper containing overall concept and measurement results of optical receiver described
in Chapter 4 has been submitted for the peer review journal of IEEE Transactions on
Circuits and Systems I: Regular Papers. “"Burst Mode Optical Receiver with 10ns Lock
Time Based on Concurrent DC Offset and Timing Recovery Technique," A. D. Hossain,
Aurangozeb and M. Hossain, IEEE Transactions on Circuits and Systems I.” I was
responsible for the system design, data collection, and manuscripts composition.
Aurangozeb assisted me during system design, manuscript editing, and data collection. Dr.
Masum Hossain was the supervisor of the project.
v
Acknowledgements
First and foremost, I would like to express my sincere gratitude to my supervisor Prof.
Masum Hossain, for giving me the golden opportunity of being part of his group when I
knew little about circuits. Dr. Hossain has been a great inspiration for me in last three years.
His way of teaching and supervising has helped me a lot in understanding very basics of
circuit design. His friendly endeavor to me made working with him productive and fun. He
will be a role model that I will look up to as I start my career. Special thanks to him for
providing me with proper funds to carry out my studies.
I would also like to thank Prof. Duncan Elliott and Prof. Kambiz Moez for being on my
supervisory committee. I would like to thank Prof. Pedram Mousavi and CMC
microsystems for lending us the equipment for testing the chips.
It has been my privilege to work with team members of Mixed-signal Integrated systems
in Nano-Technology (MINT) lab at University of Alberta. I would like to thank Amlan,
Aurangozeb and Waleed for their friendship and technical insight. My sincere thanks to all
friends and well-wishers, particularly Manir, Ishtiza and Shibli for making my life in
Edmonton enjoyable.
Thanks to my better half, Raka, for all her support and encouragement. This three year in
Edmonton would be impossible without her support and patience. Finally, I thank my
parents and siblings for always believing in me.
vi
Contents
ABSTRACT ............................................................................................................ II
PREFACE ............................................................................................................. IV
ACKNOWLEDGEMENTS ................................................................................... V
CONTENTS .......................................................................................................... VI
LIST OF TABLES ................................................................................................ IX
LIST OF FIGURES ............................................................................................... X
CHAPTER 1. INTRODUCTION ........................................................................... 1
1.1. MOTIVATION .............................................................................................. 3
1.2. ORGANISATION OF THE THESIS .................................................................. 3
CHAPTER 2. A 10 GB/S DIRECT DIGITAL SEQUENCE DETECTOR AND
EQUALIZER WITHOUT ADC-DSP IN 65NM CMOS ................................................... 5
2.1. EQUALIZATION: MOVING FROM ANALOG TO DIGITAL ............................... 6
2.2. CONCEPT OF SEQUENCE DECODING ......................................................... 11
2.2.1. Reference Placement: ADC vs. Sequence Detector ........................... 12
2.2.2. Digital Sequence Detection ................................................................ 14
2.2.3. Advantages & Challenges in Direct Sequence Decoding .................. 17
2.2.4. Sequence DFE .................................................................................... 19
2.3. LINK DESIGN ............................................................................................ 21
2.3.1. Comparator Reference for Sequence Generation .............................. 23
2.3.2. Sequence Generation and DFE Implementation ................................ 24
2.4. HARDWARE COST AND PERFORMANCE OF THE SYSTEM ........................... 31
2.5. ERROR TOLERANCE .................................................................................. 33
2.5.1. Bottom Prediction Error Tolerance ................................................... 34
2.5.2. Mid Prediction Error Tolerance ........................................................ 35
2.5.3. Top Prediction Error Tolerance ......................................................... 38
vii
2.5.4. In-Bank Comparator Error Tolerance ............................................... 39
2.6. SYSTEM DESIGN ....................................................................................... 40
2.6.1. Passive Equalizer ............................................................................... 41
2.6.2. Sample & Hold (SH) ........................................................................... 44
2.6.3. Comparator ........................................................................................ 48
2.6.4. SR Latch ............................................................................................. 51
2.6.5. Reference Muxing Timing Issue ......................................................... 52
2.6.6. Sequence DFE Critical Timing .......................................................... 53
2.7. IMPLEMENTATION & EXPERIMENTAL RESULTS ........................................ 54
CHAPTER 3. A 16 GB/S DIRECT DIGITAL SEQUENCE DETECTOR WITH
DATA TRACE-BACK AND EQUALIZER IN 65NM CMOS ....................................... 60
3.1. NOISE TOLERANCE LIMIT OF CURRENT RECEIVER ................................... 61
3.1.1. Case I .................................................................................................. 62
3.1.2. Case II ................................................................................................ 63
3.2. IMPROVING NOISE TOLERANCE OF CURRENT RECEIVER .......................... 64
3.2.1. Setting Fixed Data Comparators ....................................................... 65
3.2.2. Case I .................................................................................................. 66
3.2.3. Case II ................................................................................................ 69
3.2.4. Trace-back Sequence Generation ...................................................... 70
3.2.5. Conditions of Data Trace-back .......................................................... 71
3.2.6. Improved Noise Margin ...................................................................... 72
3.3. SYSTEM DESIGN ....................................................................................... 74
3.3.1. Reference Muxing Timing .................................................................. 75
3.3.2. DFE and Trace-back Feedback ......................................................... 76
3.4. EXPERIMENTAL RESULTS ......................................................................... 77
CHAPTER 4. BURST MODE OPTICAL RECEIVER WITH 10NS LOCK TIME
BASED ON CONCURRENT DC OFFSET AND TIMING RECOVERY TECHNIQUE .
........................................................................................................... 82
4.1. CONVENTIONAL OPTICAL RECEIVER ........................................................ 83
4.2. CHALLENGES IN BURST-MODE RECEIVER ................................................ 85
viii
4.3. PROPOSED BURST MODE RECEIVER ......................................................... 90
4.3.1. Trans-impedance Amplifier (TIA) ...................................................... 91
4.3.2. DC Recovery ....................................................................................... 99
4.3.3. Clock Recovery ................................................................................. 101
4.3.4. Timing Skew Correction ................................................................... 104
4.4. IMPLEMENTATION AND MEASUREMENT RESULTS .................................. 106
4.5. COMPARISON WITH STATE-OF-ART ........................................................ 111
CHAPTER 5. CONCLUDING REMARKS....................................................... 113
5.1. FUTURE WORKS ..................................................................................... 116
BIBLIOGRAPHY ............................................................................................... 117
ix
List of Tables
Table 2.1: Performance Summary of the 10 Gb/s Receiver. .............................................59
Table 3.1: Reference placement for checking probable bank miss. ..................................68
Table 3.2: Trace-back choice logic. ...................................................................................71
Table 3.3: Detecting Strong 1/0. ........................................................................................72
Table 3.4: Receiver Summary. ...........................................................................................81
Table 4.1: Performance Comparison Summary of the TIA. ............................................111
Table 4.2: Performance Comparison Summary of the Receiver. ....................................112
Table 5.1: Performance summary of the implemented receivers. ...................................113
x
List of Figures
Figure 1.1: Data rate trend in digital receivers ([3]–[12]). ..................................................2
Figure 1.2: Power efficiency of high-speed digital receivers over the years ([4],
[11]–[16]). .......................................................................................................2
Figure 2.1: Conventional wireline receiver. ........................................................................6
Figure 2.2: N-tap feedforward equalization technique. .......................................................7
Figure 2.3: N-tap decision feedback equalization technique. ..............................................7
Figure 2.4: 2-tap loop unrolled DFE. ...................................................................................8
Figure 2.5: Conventional ADC-based solution for the wireline receivers...........................9
Figure 2.6: Digital receiver using FFE and maximum likelihood sequence
detector. .........................................................................................................10
Figure 2.7: ADC-based receiver block diagram and reference placement. .......................12
Figure 2.8: Sequence detector reference placement depending on tap coefficient. ...........14
Figure 2.9: Digital sequence detection technique at t=t0 and t=t1. .....................................15
Figure 2.10: Digital sequence detection technique at t=t2 and t=t3. ...................................15
Figure 2.11: Digital sequence detection technique (summarized). ....................................16
Figure 2.12: ADC vs. Sequence Detection. .......................................................................18
Figure 2.13: Sequence DFE choices for 3 taps. .................................................................20
Figure 2.14: Example showing sequence DFE technique. ................................................21
Figure 2.15: Channel response and single bit response before and after the passive
equalizer of example transmission link. ........................................................22
xi
Figure 2.16: References for the link. .................................................................................23
Figure 2.17: Placement of edge comparator for prediction ...............................................24
Figure 2.18: Use of the edge comparators for placing the data comparators. ...................26
Figure 2.19: Sequence DFE feedback for the link. ............................................................27
Figure 2.20: Reduction of bank comparators using DFE. .................................................27
Figure 2.21: (a) Placement of floating comparators. (b) Verification of edge
prediction. (c) Example sequence generation and DFE
implementation. .............................................................................................29
Figure 2.22: Sequence sorting process of the overall system. ...........................................30
Figure 2.23: Comparison of the required number of comparators among loop
unrolled DFE, ADC, and sequence DFE. .....................................................32
Figure 2.24: Comparison of noise margin between ADC-based DFE and sequence
DFE. 33
Figure 2.25: Error tolerance of edge comparator prediction to send the floating
data comparators to bottom at time t=t1. .......................................................35
Figure 2.26: Error tolerance of edge comparator prediction to send the floating
data comparators to the middle at time t=t2. .................................................36
Figure 2.27: Error tolerance of edge comparator prediction to send the floating
data comparators to bottom at time t=t3. .......................................................37
Figure 2.28: Error tolerance of edge comparator prediction to send the floating
data comparators to the top at time t=t3. .......................................................38
Figure 2.29: In-bank comparator error tolerance of the system using DFE. .....................39
Figure 2.30: Maximum likelihood sequence detector with passive equalization
and timing recovery.......................................................................................41
Figure 2.31: Schematic of passive equalizer with its AC response for different
settings. .........................................................................................................42
xii
Figure 2.32: Passive equalizer response for example link. (a) AC analysis, (b)
input to the equalizer and (c) output of the equalizer....................................43
Figure 2.33: Layout of the implemented passive equalizer. ..............................................43
Figure 2.34: Concept of sample and hold. .........................................................................44
Figure 2.35: Implemented sample and hold circuitry (a) and its simulated
differential operation (b). ..............................................................................46
Figure 2.36: Layout of sample and hold circuit. ................................................................48
Figure 2.37: Schematic of the implemented comparator. ..................................................49
Figure 2.38: Operational stages of the comparator. ...........................................................49
Figure 2.39: Simulation results showing sampling, regeneration, decision, and
reset stages of the comparator. ......................................................................50
Figure 2.40: Layout of the implemented comparator. .......................................................50
Figure 2.41: SR Latch implementation and truth table. .....................................................51
Figure 2.42: Layout of the implemented SR latch. ............................................................52
Figure 2.43: Edge and data SH and comparator clocking for CH0. ..................................53
Figure 2.44: Reference settling behavior. ..........................................................................53
Figure 2.45: Sequence DFE and loop unrolling for B+1 feedback (red box). ....................54
Figure 2.46: Implemented prototype in 65nm CMOS. ......................................................55
Figure 2.47: Complete receiver with its power consumption. ...........................................56
Figure 2.48: Pin diagram and test setup of the prototype. .................................................56
Figure 2.49: Measured half rate 4-bit sequence DAC output. ...........................................57
xiii
Figure 2.50: Measured 10 Gb/s input eye (a) and 16 levels output eye of sequence
detector (b). ...................................................................................................57
Figure 2.51: Measured 2.5 GHz recovered clock eye (a) and histogram (b). ....................57
Figure 2.52: BER bathtub (a) and the recovered 2.5 Gb/s PRBS (Pseudo-random
bit sequence) checked data eye. ....................................................................58
Figure 3.1: Case I – error in sequence detection. B0 and B-1 detected incorrect. ..............63
Figure 3.2: Case II – error in sequence detection. B-1 detected incorrect. .........................64
Figure 3.3: Introduction of two fixed data comparator reference instead of fixed
edge comparators. .........................................................................................65
Figure 3.4: Fixing error with little margin using fixed data comparators. .........................66
Figure 3.5: Introduction of check comparator to find out whether it may miss a
bank or not. ...................................................................................................67
Figure 3.6: Overall reference placement of the architecture with trace-back. ...................69
Figure 3.7: Resolving the issue of case II. .........................................................................70
Figure 3.8: Comparison of noise margins of 4 tap sequence DFE with and without
data trace-back with ADC-based DFE. .........................................................73
Figure 3.9: System architecture of quad-rate receiver with data trace-back. .....................74
Figure 3.10: Reference Muxing for quadrate channel running at 4 GHz. .........................75
Figure 3.11: Feedback for DFE and Trace-back. ...............................................................76
Figure 3.12: Implemented prototype in 65nm. ..................................................................78
Figure 3.13: Pin diagram and test setup of the prototype. .................................................79
Figure 3.14: Channel Response at 8 GHz. .........................................................................79
Figure 3.15: Measured 16 Gb/s input eye after passive equalizer (a) and 16-level
output eye of the sequence decoder (b). ........................................................80
xiv
Figure 3.16: Measured 4 GHz clock eye (a) and PRBS checked recovered 4 Gb/s
data eye (b). ...................................................................................................80
Figure 3.17: BER bathtub comparing DFE and Trace-back. .............................................80
Figure 3.18: 2D color-map of BER showing voltage margin in Y-axis and timing
margin in X-axis of sequence DFE (a) and data trace-back (b). ...................81
Figure 4.1: Conventional vs. proposed implementation of the optical receiver. ...............84
Figure 4.2: Conventional implementation of burst mode receiver with DC offset
calibration. .....................................................................................................86
Figure 4.3: Settling time vs. bandwidth of DC recovery loop. ..........................................87
Figure 4.4: Conventional timing recovery loop. ................................................................88
Figure 4.5: Effect of DC on duty cycle of the data. ...........................................................89
Figure 4.6: Block Diagram of the receiver with power breakdown. .................................91
Figure 4.7: Detailed schematic of TIA and its DC recovery loop. ....................................92
Figure 4.8: TIA transfer function and its simulated and measured gain. ...........................93
Figure 4.9: Measured 10 Gb/s output eye of the front end. ...............................................93
Figure 4.10: Simulated amplifier stage output eye with (left) and without (right)
C1 & C2 for input currents of 100 µA, 250 µA and 600 µA (from top
to bottom). .....................................................................................................94
Figure 4.11: Simulated and calculated input referred noise of TIA. .................................99
Figure 4.12: Proposed DC recovery technique. ...............................................................100
Figure 4.13: Measured scope shot of DC recovery operation within 4.8 ns. ...................101
Figure 4.14: DC recovery without LPF and use of rising edge pulses allowing
concurrent operation of DC and timing recovery. ......................................102
Figure 4.15: Ring oscillator with pulse filtering and its timing diagram. ........................103
xv
Figure 4.16: Timing Skew compensation. .......................................................................104
Figure 4.17: Slope Detection. ..........................................................................................105
Figure 4.18: Slope detection logic. ..................................................................................106
Figure 4.19: Implemented die photo in 0.13 µm. ............................................................107
Figure 4.20: Pin diagram and test setup of the prototype. ...............................................107
Figure 4.21: Measured scope shot of DC and timing recovery with preamble and
data pattern. .................................................................................................108
Figure 4.22: Recovered clock eye. ...................................................................................108
Figure 4.23: Recovered clock histogram. ........................................................................109
Figure 4.24: Phase noise plot of the recovered clock. .....................................................109
Figure 4.25: On-chip PRBS checked recovered 2.5 Gb/s channel data eye. ...................110
1
Chapter 1.
Introduction
Wired data communication systems can be classified into two major categories based on
their medium of transmission: wireline and optical. Wireline systems use copper medium
for short distance data transmission. For chip-to-chip backplane data transmission, copper
is a cost-effective option. On the other hand, optical systems use fibre optic cable to carry
a large amount of data over a long distance. Fibre optic cables can provide significantly
higher data bandwidth than its counterpart. As a result, optical receivers are finding their
place in data centers. Wireline and optical receivers implemented in the analog domain
have shown excellent performance and power efficiency at high speed ([1], [2]). However,
with shrinking technology nodes and increasing data rates, the performance of these
receivers is not improving as much as their digital counterparts due to supply scaling,
increased device leakage, increased noise and process variations. Therefore, the recent
trend shows that digital implementations of these receivers are getting higher interest from
the circuit design industry than their analog counterparts. Figure 1.1 shows the increasing
data rates of state-of-art digital receivers over the years implying we can achieve good
high-speed performance in digital receivers as well. However, digital implementations
require analog-to-digital converters (ADC) followed by digital signal processing (DSP)
2
units which make them exceed existing transceiver power budget. Figure 1.2 illustrates
power efficiency (consumed power per bit) of high-speed digital receivers over the years.
Figure 1.1: Data rate trend in digital receivers ([3]–[12]).
Figure 1.2: Power efficiency of high-speed digital receivers over the years ([4], [11]–
[16]).
2011 2012 2013 2014 2015 2016
5
10
15
20
25
30
35
Year of Publication
Da
ta R
ate
(G
b/s
)
Ting, ISSCC[5]
Varzaghani, JSSC[6]
Zhang, ISSCC[8]
Shafik, ISSCC [11]
Hossain, VLSI [12]
Rylov, ISSCC[10]
Cui, ISSCC[9]
Abiri, ISSCC[3]
Tabasy, JSSC[7] Chen, JSSC[4]
2008 2009 2010 2011 2012 2013 2014 2015 2016
101
102
Year
Po
wer
Eff
icie
ncy
(p
J/b
it i
n l
og
sca
le) Agazzi, JSSC [13]
Cao, ISSCC [14]
Yamaguchi, ISSCC [15]
Chen, JSSC [4]
Zhang, ISSCC [16]
Shafik, ISSCC [11]
Hossain, VLSI [12]
3
1.1. Motivation
The main motivation of this thesis is to make digital receivers power efficient in both
wireline and optical cases. For wireline receivers, as the high-speed data passes through
the copper medium, it suffers from dielectric loss and parasitic crosstalk. The channel
appears as a low pass filter to the high-frequency component of the signal and attenuates it
by adding random and deterministic noise. So, at the presence of high-speed data and
channel loss, un-equalized transceivers cannot provide adequate performance. Therefore,
different equalization schemes have become a necessary part of receiver design. One of the
motivations of this thesis is to search for a cost-effective low power equalization scheme
for wireline digital receivers.
Burst mode optical receivers are used in rapidly reconfigurable photonic switch networks.
This reconfiguration operation requires both burst mode DC and timing recovery in optical
receivers. This thesis also concentrates on a low latency low power digital burst mode
optical receiver.
Although pulse amplitude modulation (PAM) provides us with lower bandwidth
requirement, for the sake of simplicity and higher signal to noise ratio (SNR), non-return-
to-zero (NRZ) binary data will be used in this thesis.
1.2. Organisation of the Thesis
A total of three data receivers are designed for this thesis. Two of them are focused on
wireline applications and the other one is an optical receiver.
4
Chapter 2 describes an energy efficient 10 Gb/s sequence detector and equalizer without
the use of any ADC and DSP implemented in 65nm CMOS technology. The chapter
contains the concept of sequence detection and equalization using a highly digital
architecture. Implemented analog front end, digital logics and timing issues of the system
are also described in detail.
Chapter 3 takes the receiver design from Chapter 2 to have improved noise immunity. The
receiver design is modified to have 1-bit data trace-back that provides better noise
performance and voltage margin than before. Front-end and digital logics are made to
handle a higher data rate of 16 Gb/s for this receiver.
Chapter 4 describes a 7-10 Gb/s burst mode optical receiver implemented in 0.13 μm. DC
recovery operation without low-pass filter enables us to have timing recovery operation
going on at the same time resulting in a low latency receiver. The chapter covers analog
front-end with its noise performance, successive approximation register (SAR) logic based
DC recovery, injection locking based timing recovery and timing skew correction using
slope detection of the signal.
Chapter 5 summarizes the performance of all three receivers and proposes future work that
can be done.
5
Chapter 2.
A 10 Gb/s Direct Digital Sequence
Detector and Equalizer without ADC-
DSP in 65nm CMOS
This chapter describes a low power 10 Gb/s sequence detector and equalizer without the
use of ADC and DSP in TSMC 65nm. The chapter begins with a brief description of the
conventional receivers in Section 2.1. This section compares state-of-art mixed signal and
ADC-based solutions. Section 2.2 discusses the technique of digital sequence detection and
compares this technique with ADC-based solutions. The digital sequence decoding
naturally leads us to sequence DFE technique. Section 2.3 takes an example transmission
link and discusses sequence generation and DFE implementation techniques used in this
receiver. Section 2.4 discusses hardware cost and noise margin of the system and compares
it with ADC-based solutions. Section 2.5 discusses error tolerance of the receiver with
examples. Section 2.6 provides details of the receiver design starting from the top-level
block diagram and then descending to the overall design. This section includes description
of sensitive analog front-end blocks and critical timing margins of the system. Section 2.7
shows experimental results and comparison with the state-of-art ADC-based receivers.
6
2.1. Equalization: Moving from Analog to Digital
In high-speed wireline transceivers, the frequency dependent channel loss is the main
source of inter-symbol interference (ISI). In simple words, ISI is the residue of the current
symbol that affects the following symbols (pre-cursor) as well as the previous symbols
(post-cursor). For high loss channels, conventional receiver designs ([1], [17]–[20]) usually
feature analog continuous time linear equalization (CTLE), implemented using active or
passive elements, in the front end as shown in Figure 2.1. In addition, feed-forward
equalization (FFE) (Figure 2.2) and decision feedback equalization (DFE) (Figure 2.3)
techniques are used for further ISI cancellation and bit detection. There are feedback timing
concerns for the first tap cancellation in DFE for high-speed data since this feedback has
to be done within 1 unit interval (UI). Although prior work [19] has demonstrated that
Figure 2.1: Conventional wireline receiver.
CTLE
• Passive
• Active
Channel
From
TX
Z-1
Z-1
Clock
Recovery
Recovered
Data
7
Figure 2.2: N-tap feedforward equalization technique.
Figure 2.3: N-tap decision feedback equalization technique.
direct cancellation of the first tap is possible, but as the data rates go up, it becomes hard
to comply with the DFE timing margin for the first tap. To resolve this timing issue, a
commonly used approach is loop unrolled DFE ([21], [22]). In this approach, signals are
pre-calculated and then one of them is selected using previously decoded bits. This
approach can be extended to other taps if there are feedback timing concerns with them.
Figure 2.4 shows 2-tap loop unrolled DFE architecture. The loop unrolled approach
increases front-end hardware and power by two times as the number of taps with timing
concerns increase.
T Ty(t) y(t-T)
Ty(t-2T)
Ty(t-NT)
∑
α0 α1 α2αN
f(2)f(1)f(0) f(N)
FFE Output
Decision
CircuitT
d(t) d(t-T)T T
d(t-NT)
+y(t) +
∑
β0 β1 βN
b(1)b(0) b(N)
-
8
Figure 2.4: 2-tap loop unrolled DFE.
In general, analog mixed-signal solution can equalize with excellent energy efficiency
(around ~3pJ/bit ([1], [20])). However, their performance can be limited by the analog
performance of the technology such as gain-bandwidth product and comparator resolution.
First, there is SNR degradation which results from the CTLE, that generally inverts the
channel, which amplifies noise, including crosstalk. Second, the linearity requirement is
hard to achieve - scaled supply reduces maximum achievable linear swing. Third, process
variation makes it very difficult to achieve reliable control over zero and pole frequencies
to achieve the desirable frequency response. All these factors limit the performance of the
symbol-by-symbol detection technique. In such SNR-limited cases, sequence decoders
d(t)
Comparator
y(t)+
-+β1+β2
Comparator
y(t)+
-+β1-β2
Comparator
y(t)+
-+β1+β2
Comparator
y(t)+
-+β1-β2
Z-1 Z-1MUX4X1
d(t-T) d(t-2T)
9
Figure 2.5: Conventional ADC-based solution for the wireline receivers.
outperform symbol-by-symbol detectors. However, existing maximum likelihood
sequence detectors (MLSD) require an analog-to-digital (ADC) converter in the front-end
([13], [16]). There are previously-reported ADC-based solutions where equalization
(FFE/DFE) is moved to the digital domain (Figure 2.6) ([4], [14], [16], [23]).
Harwood et. al. [23] implemented a receiver with 2-tap FFE and 5-tap DFE in digital
domain using baud rate sampling ADC for 12.5 Gb/s operation. The architecture used two
6.25 Gb/s (half rate) 4.5-bit flash ADC and DSP to perform the numerical FFE and DFE
to compensate 24dB channel loss with a power efficiency of 26.4 pJ/bit. Zhang et. al. [16]
combined both conventional approaches by designing a dual path receiver; CTLE and DFE
for high SNR input and ADC-based solution for low SNR data. The time-interleaved 6-bit
ADC alone takes 195 mW for 10.3125 GS/s operation. Moreover, additional DSP
consumes 500 mW [14]. In recent works ([4], [11]), it is demonstrated that ADC-based
solutions can achieve a power efficiency of ~10pJ/bit. Chen et. al. [4] demonstrated a
power efficiency of 13 pJ/bit using a four-way time interleaved 4-bit flash ADC. A three-
Channel
From
TX
Clock
Recovery
Recovered
DataADCDigital
Equalization
• DFE
• FFE/IIR
CTLE
• Passive
• Active
10
stage continuous time high pass filter and a 2-tap FFE were implemented in the analog
front-end and a 5-tap DFE was implemented in digital domain to equalize 29 dB channel
loss. Shafik et. al.[11] demonstrated a power efficiency of 8.7 pJ/bit using a 32-way time
interleaved 6-bit SAR ADC. A 4-tap FFE and 3-tap DFE were implemented in digital
domain for compensating channel loss up to 25.3 dB.
In [13], Agazzi et. al. demonstrated the first maximum likelihood sequence detector
(MLSD) for multimode fibers (Figure 2.6). The reported sequence decoder outperforms
symbol-by-symbol detectors by at least 2 electrical dB for 10.3125 Gb/s operation. The
analog front-end of the digital receiver incorporates an 8-way time-interleaved 10-stage
pipelined ADC with self-calibration where each path has two slices. At any given time in
one path, one slice is in normal operation mode and the other one is in calibration or power
down mode. For on-chip DSP, the outputs of the ADCs are further demultiplexed by a
factor of 2. The digital back-end of the receiver has a nonlinear MIMO (multiple-input,
Figure 2.6: Digital receiver using FFE and maximum likelihood sequence detector.
ADC Z-1 Z-1y(n) y(n-1)
Z-1y(n-2) y(n-N)
∑
α0 α1 α2αN
f(2)f(1)f(0) f(N)
FFE Output
Maximum
Likelihood
Sequence
Detector
N=5→25
11
multiple-output) channel estimator, a 5-25 tap FFE and an 8-state sliding block Viterbi
decoder (SBVD).
In all these cases, while ADC alone approaches 10 pJ/bit power consumption, including
the DSP [16] significantly exceeds the receiver power budget for SerDes solution space.
Therefore, a solution without ADC and DSP is an attractive low-power alternative to
existing approaches.
2.2. Concept of Sequence Decoding
Any high-speed data going through a lossy medium suffers from ISI, crosstalk, and random
noise. In most cases, ISI is the most dominant factor contributing to channel loss. Due to
ISI, we can say the channel has memory. This memory can extend up to hundreds of
previous symbols and a few following symbols. For simplicity, we can consider a channel
where ISI is limited to within three taps - one pre-cursor (h-1), main (h0) and one post-
cursor (h+1). We can assume a transmitter from which bits are transmitted, and a receiver
receives those bits with ISI after passing through the lossy channel. The received signal is
essentially the result of convolution between the transmitted bit symbols with the impulse
response of the channel. Therefore, baud spaced sampled values will be the combination
of these three taps. An example of the transmitted bit sequence and the received signal after
channel loss are illustrated in Figure 2.7 which will be used throughout this section to
discuss the concept of sequence detection.
12
2.2.1. Reference Placement: ADC vs. Sequence Detector
For ADC-based solutions, if we consider a flash ADC, the received signal is fed to a
comparator bank (Figure 2.7). The total signal space is divided into 2N levels for N-bit
ADC to cover the whole dynamic range. As we are considering 3-taps of ISI for sequence
detector, we can also consider a 3-bit ADC where the signal space will be divided into 23=8
levels. The comparator bank output is ideally thermometric in nature. A thermometric to
Figure 2.7: ADC-based receiver block diagram and reference placement.
0 0 0 1 0 1 1 1
time
Transmitted
Bits
Received
Signal Total
Signal
Space
Channel
From
TX
h0
h-1 h+1
Comp
Bank
Therm.
to Binary
Digital
FFE & DFE
111
110
101
100
011
010
001
000
MSB LSB
13
binary converter gives us the ADC output, which is then processed in the digital domain
for DFE and FFE operation. The digital implementation of the FFE is realized by
subtracting the ADC output bits of nearby cursors from the ADC output of the main cursor.
In digital subtractors, the main cursor output of the ADC will go as is and the nearby-cursor
taps will be subtracted using the ratio of their tap coefficients. For the DFE implementation
in the digital domain, the previous bit decisions are fed back which is then subtracted from
the ADC output to get the current bit decision.
For sequence detection, the received signal at any point in time can be seen as a
combination of pre-cursor and post-cursor components of the neighboring bit stream and
main cursor component of the current bit. Therefore, from the received signal sample, we
can reconstruct the corresponding time sequence of previous, current and next bit. In this
case (Figure 2.8), as there are only three taps, the bit sequence is B+1B0B-1. Here, B+1 is the
previous bit, B0 is the current bit, and B-1 is the next bit. The received signal has to go
through a comparator bank. To set the references for these comparators, the simplest
approach is to directly calculate the distance from the received sampled value to each
sequence constellation. It can be done by comparing the sampled value to a set of references
based on different combinations of main (h0), pre-cursor (h-1) and post-cursor (h+1) taps. In
general, we will have a time sequence of length N if there are N number of un-equalized
taps in the single bit response. In the signal space, these N taps can combine in 2N number
of ways, creating 2N signal levels corresponding to 2N unique sequences.
14
Figure 2.8: Sequence detector reference placement depending on tap coefficient.
2.2.2. Digital Sequence Detection
Figure 2.11-Figure 2.11 presents the sequence detection technique. For digital sequence
detection, we are placing the comparator references as a combination of tap values
representing time sequence as illustrated in Figure 2.8. At time t=t0, the signal is at the
bottom of signal space (Figure 2.9 (a)). At this time, from transmitted bit sequence, the
previous bit is 0, the current bit is 0, and the next bit is also 0. So, at this point, the 3-bit
sequence decoder should give
0 0 0 1 0 1 1 1
time
Transmitted
Bits
Received
Signal Total
Signal
Space
Prev.
Bit
B+1
110
111
011
101
010
100
001
000
Curr.
Bit
B0
Next
Bit
B-1
Channel
From
TX
h0
h-1 h+1
Comp
Bank
Sequence
Decoder
15
(a) t=t1
(b) t=t2
Figure 2.9: Digital sequence detection technique at t=t0 and t=t1.
000. The current bit for t=t0 becomes the previous bit at t=t1 (Figure 2.9 (b)). Similarly, the
next bit for t=t0 becomes the current bit at t=t1. Now at t=t1, the next bit is 1. Due to this
bit, the received signal starts to increase. The decoded sequence at t=t1 should be 001.
(a) t=t2
(b) t=t3
Figure 2.10: Digital sequence detection technique at t=t2 and t=t3.
0 0 0 1 0 1 1 1
time
Transmitted Bits
Received Signal
110
111
011
101
010
100
001
000
t5t4t3t2t1t0
0 0 0 1 0 1 1 1
time
110
111
011
101
010
100
001
000
t5t4t3t2t1t0
Transmitted Bits
Received Signal
0 0 0 1 0 1 1 1
time
110
111
011
101
010
100
001
000
t5t4t3t2t1t0
Transmitted Bits
Received Signal
0 0 0 1 0 1 1 1
time
110
111
011
101
010
100
001
000
t5t4t3t2t1t0
Transmitted Bits
Received Signal
16
Figure 2.11: Digital sequence detection technique (summarized).
Similar things happen at t=t2 (Figure 2.10 (a)) and t=t3 (Figure 2.10 (b)). At t=t2 and t=t3,
the signal is in the middle of signal space. Although at t=t3 the main bit of the sequence is
0, it cannot go down due to its post-cursor and pre-cursor taps. Figure 2.11 shows decoded
time sequence from t=t0 to t=t5. As we are considering a sequence length of 3 bits
corresponding to three taps of ISI from the channel, each bit is decoded three times: first
as next bit, then as current bit and after that as previous bit. In general, if there are N taps
present in the signal, each bit should be decoded N times. This N time decoding of same
bit results in improved signal to noise ratio (SNR) and can be used later for verification
purposes. The post-cursor bits can be used for DFE operation, whereas the pre-cursor bits
can be used for further error correction as will be discussed later. If the received signal is
pushed up or down due to noise, the DFE can compensate for those errors by the sequence
decoder.
t0 t1 t2 t3 t4 t5
Decoded
Sequence,
B+1
B0
B-1
0
0 0
0 0 0
1 1 1
0 0 0
1 1 1
1 1
1
Each Bit decoded 3 times
17
2.2.3. Advantages & Challenges in Direct Sequence Decoding
The basic concepts of ADC and sequence detection have been discussed in prior sections
which give us the background to compare between these two techniques and find out the
challenges in implementing a sequence decoder.
Advantages of sequence decoding over ADC-based links:
Sequence decoder uses ISI in a constructive way. References are set using ISI tap
coefficient for each dominant tap in single bit response. ADCs work independently
of ISI; however, the DSP and the FFE/DFE do care about ISI. In the FFE/DFE, ISI
components are subtracted. Due to this subtraction, we lose signal power as well as
add quantization noise.
If there are N dominant taps present in single bit response of sequence decoder,
each bit will be decoded N-times. The example discussed so far has 3 dominant
taps, thus each bit is detected 3 times which improves SNR. In ADC-based links,
additional quantization noise degrades SNR.
Sequence decoder will detect the time sequence directly. The main symbol comes
as a part of the sequence. There is no need for additional DSP.
In sequence decoder, we are decoding and predicting previous and upcoming bits
respectively. These bits can be used later for further error correction. However, for
ADC-based links, an additional power hungry DSP is required for the DFE/FFE
implementation.
18
Figure 2.12: ADC vs. Sequence Detection.
Challenges of sequence decoding over ADC-based links:
In ADC-based links, the comparator references are binary and monotonically
increasing. However, in sequence decoder, references are non-binary and non-
monotonic (Figure 2.12).
In ADC-based links, the required comparator resolution for N-bit ADC is: total
signal space/2N. The required comparator resolution is not well defined for
sequence decoding as the references are non-binary and non-monotonic. It can be
less than total signal space/2N for N-bit sequence detection.
In ADC-based links, the thermometric comparator output can be easily converted
to binary values. However, converting comparator output to possible sequences, so
far, seems difficult.
time
Received Signal
Total
Signal
Space
111
110
101
100
011
010
001
000
MSB LSB
time
Received Signal
Prev.
Bit
B+1
110
111
011
101
010
100
001
000
Curr.
Bit
B0
Next
Bit
B-1
19
2.2.4. Sequence DFE
The channel we assumed for understanding this concept has three dominant taps. Out of
these taps, only one is post-cursor (h+1) which corresponds to B+1 in the bit sequence. As
shown in sequence detection, this B+1 at any time is detected as B0 just 1-unit interval (UI)
before. We can use this B0 of previous time (1 UI before) as a feedback for current sequence
detection. This post-cursor bit (B+1) feedback eases the pressure on the sequence detector
to give us the correct sequence directly. From sequence decoder, we can take a set of two
possible sequences differentiable using B+1, and out of those two we can choose one
depending on the value of B+1 (Figure 2.13). This post-cursor feedback is same as DFE
implementation. As we are implementing it for sequence detector, we can call it sequence
DFE. In general, if there are N post-cursor taps present in the single bit response of the
received signal at sequence detector input, we can generate 2N sequences. Out of these 2N
sequences, only one will be chosen using N post-cursor bits.
For this example channel, to give choices to DFE, let us have eight comparators in the
comparator bank. Each of them is getting references for different possible combinations of
h+1, h0, and h-1. Although sequences are not increasing monotonically, comparators C0-7,
here are set in monotonic increasing order. The truth table in Figure 2.13 shows the choices
for DFE implementation as each comparator output changes. We will discuss how easily
we can generate these choices in a later section. In the truth table, if all the comparators
give 0 or 1, there is only one choice to DFE. For all 0 case, 000 goes as a choice, which is
already paired with 001 when only one comparator is 1. We can get rid of this single option
20
Figure 2.13: Sequence DFE choices for 3 taps.
as this choice is already available. The same thing goes for all 1 case. So, this DFE feedback
gives us the option to remove the top and bottom most comparators and still have all the
possible sequences in the DFE choice list.
Figure 2.14 shows an example of the DFE implementation. At time t=t2, four of the bottom
comparators are 1 and the rest of them are 0. According to the truth table in Figure 2.13,
DFE choices would be 010 and 101 going to a 2:1 MUX. As the previously decoded bit is
0, out of these two choices the one with B+1=0 will go to the output. So, the output here is
010. Sequence generation for DFE, DFE implementation and timing margin of DFE will
be discussed in detail in later sections.
110
111
011
101
010
100
001
000
C6
C5
C4
C3
C2
C1
C0
C7
Comp
Bank
Sequence
Decoder
DFE Output
B+1
Feedback
Refs
Received SignalC7 C6 C5 C4 C3 C2 C1 C0 DFE Choice
0 0 0 0 0 0 0 0 000
0 0 0 0 0 0 0 1000
100
0 0 0 0 0 0 1 1001
100
0 0 0 0 0 1 1 1100
010
0 0 0 0 1 1 1 1010
101
0 0 0 1 1 1 1 1101
011
0 0 1 1 1 1 1 1011
110
0 1 1 1 1 1 1 1011
111
1 1 1 1 1 1 1 1 111
Single
choice
&
already
given in
another
pair
Redundant Comparator
21
Figure 2.14: Example showing sequence DFE technique.
2.3. Link Design
The receiver to be designed considers a voltage mode transmitter where the transmitter
differential swing varies from 600mVpp to 1Vpp. Therefore, signal attenuation is needed
to scale the received signal to match the dynamic range at the comparator bank input. Since
the high-frequency signal is already attenuated by the channel (top of Figure 2.15), only
the low-frequency signal is attenuated which translates to passive equalization.
0 0 0 1 0 1 1 1
time
Transmitted
Bits
Received
Signal
110
111
011
101
010
100
001
000
t5t4t3t2t1t0
C6=0
C5=0
C4=0
C3=1
C2=1
C1=1
C0=1
C7=0
@t2
Comp
Bank
Sequence
DecoderDFE
Output
B+1=0
010
101010
22
Figure 2.15: Channel response and single bit response before and after the passive
equalizer of example transmission link.
For a channel giving 27 dB loss, only 5 to 7 dB boost (or DC attenuation) is sufficient to
contain ISI components within four taps – one pre-cursor, main and two post-cursor taps
(Figure 2.15). This partially equalized signal is then fed to a 4-bit sequence decoder. The
same scheme works for other channels as well if we can have a programmable boost from
the passive equalizer.
-27 dB Loss at 5 GHz
Ch
an
nel
Res
po
nse
Frequency (GHz)
Passive
EqualizationChannel
From
TXComp
Bank
Sequence
Gen.
Sequence DFE
Passive EQ O --- Data Sample
X — Edge Sample
23
2.3.1. Comparator Reference for Sequence Generation
The channel we assumed in section 2.2 had only 3 taps before going to comparator bank
or sequence decoder. The practical channel after passive equalization has 4 dominant taps
in its single bit response. For reference generation, instead of having the taps in terms of
their time sequence we can have them organized using the descending order of their
weights. Main cursor tap (h0) will always have the highest weight among them. In most
cases, we will find: h0>h+1>h-1>h+2. So, we will have MSB bit representing B0, MSB-1 bit
representing B+1, MSB-2 giving B-1 and LSB representing B+2 (Figure 2.16).
Figure 2.16: References for the link.
0000
0001
0010
0011
0100
0101
0110
0111
1000
1001
1010
1011
1100
1101
1110
1111
11
11
10
01
00
01
11
10
01
00
10
11
10
01
00
00
11
10
01
00
Passive
EqualizationChannel
From
TX
h0
h-1
h+1
h+2
Reference Sequence: B0B+1B-1B+2
24
The taps don’t have binary relation with each other which gives rise to overlaps between
reference levels. For example, here 0011 goes above 0100 levels. There are a total of 3
overlaps for this link. If the channel loss increases further, the number of overlap will
increase. However, if we divide the references into different banks of B0 and B+1, within
each bank there will be no overlap as usually h-1>h+2. The banks will have overlap between
them, but not within them.
2.3.2. Sequence Generation and DFE Implementation
For an N-bit sequence detector, the input signal needs to be compared to 2N reference levels
that result in a power and area penalty similar to a flash ADC. So, for a 4-bit sequence, 16
reference levels will be there. Note that due to ISI, sample to sample signal variation is
limited – therefore, it is not necessary to cover the entire signal space. Rather based on
previous sample position, covering only 50% of the signal space is sufficient.
Figure 2.17: Placement of edge comparator for prediction
time
Seqn=B0B+1B-1B+2
0000
0001
0010
0011
0100
0101
0110
0111
1000
1001
1010
1011
1100
1101
1110
1111
time
Seqn=B0B+1B-1B+2
0000
0001
0010
0011
0100
0101
0110
0111
1000
1001
1010
1011
1100
1101
1110
1111
Edge
Comparators
for Prediction
25
As we are covering 50% of signal space, rather than having 24=16 comparators, we can
have half of them. In other words, rather than having 4 banks, we can have only 2 of them
at a time. This selection of two banks depends on previous sample value. In clock and data
recovery circuits, the sample before a data sample is an edge sample. We can place two
comparators at the edge sample and use the output of these comparators to recycle the data
comparators each time (Figure 2.17).
Figure 2.18 illustrates how two edge comparators (Cedge1 & Cedge0) predict data position
and place the references for the data comparators. In Figure 2.18(a) at time t=t0, the data
was at the bottom of signal space. The edge sample in between t=t0 and t=t1, results in
Cedge1=0 and Cedge0=1. This edge information implies that the next sample may be in the
middle of signal space. So, rather than having bottom bank (00) references, we can give 10
bank references at t=t2. However, 01 bank references remain in their position.
In Figure 2.18(b) at time t=t1, the data was in the middle of signal space. The edge sample
in between t=t1 and t=t2, results in Cedge1=1 and Cedge0=1. This edge data implies that the
next sample may be on the top part of signal space. So, rather than having one of the middle
banks (01) references, we can give 11 bank references at t=t3. However, 10 bank references
remain in their position. So, we can see these two edge comparators directly control
references going to the floating comparators. Cedge1 controls reference switching between
11 and 01 banks and Cedge0 does the same for 10 and 00 banks. Figure 2.18(c) illustrates
reference placement for all the cases. The number of comparators now reduces from 16 to
10 out of which 2 are edge comparator and 8 are floating data comparator.
26
(a)
(b)
(c)
Figure 2.18: Use of the edge comparators for placing the data comparators.
Ref.
Switch
time
0000
0001
0010
0011
0100
0101
0110
0111
t4t3t2t1t0
time
0100
0101
0110
0111
1000
1001
1010
1011
Cedge0
=1
Cedge1
=0
Edge
Sample
t4t3t2t1t0
Ref.
Switch
time
t4t3t2t1t0
time
0100
0101
0110
0111
1000
1001
1010
1011
Cedge0
=1
Cedge1
=1
Edge
Sample
t4t3t2t1t0
1000
1001
1010
1011
1100
1101
1110
1111
time
t4t3t2t1t0
0000
0001
0010
0011
0100
0101
0110
0111
1000
1001
1010
1011
1100
1101
1110
1111
27
In sequence detection, we have two post-cursor bits. Sequence DFE allows the sequence
decoder to give four possible sequences differentiated by B+1 and B+2. At first, B+2 feedback
will come in action as in Figure 2.19. Within the banks, sequences are differentiated by
B+2. So, when we are using a bank reference, we can take two possible sequences from
there with different B+2 values and sort out the correct value of B+2 using DFE feedback.
Figure 2.20 takes 00 bank as an example. Here even if we remove the comparators C0 and
C3 of the bank, we will get all possible sequences.
Figure 2.19: Sequence DFE feedback for the link.
Figure 2.20: Reduction of bank comparators using DFE.
DFE
Output
Comp
Bank
Sequence
Decoder
B+2
Feedback
B+1
Feedback
C3 C2 C1 C0
Generated
Sequence
0 0 0 0 0000
0 0 0 10000
0001
0 0 1 10001
0010
0 1 1 10010
0011
1 1 1 1 0011
0000
0001
0010
0011
0100
0101
0110
0111
1000
1001
1010
1011
1100
1101
1110
1111
11
11
10
01
00
01
11
10
01
00
10
11
10
01
00
00
11
10
01
00
Reference Sequence: B0B+1B-1B+2
C3
C2
C1
C0
28
So, the number of comparator per bank is now 2. In total, we need 6 comparators for
sequence detection. Two edge comparators are for prediction and four floating data
comparators (CF0-3) are for in-bank comparison.
In-bank comparison gives us possible combinations of B-1B+2. B0 and B+1 differentiate the
banks. But out of four banks, two are selected using the edge comparators. We can’t use
the edge prediction directly to generate possible combinations of B0B+1. A verification of
that prediction is needed using the data comparators. Although the data comparators are
floating, the data comparator having the top floating reference will be called CF3 and the
bottom one will be called CF0. Figure 2.21(a) shows the positions of all six comparators at
different times. For edge prediction verification in Figure 2.21(b), when the edge
comparators send the floating comparators to bottom two banks, we check the top floating
comparator. If CF3=0, it implies that the signal is actually at the bottom. For the middle
case, both the top (CF3) and the bottom (CF0) floating comparators are checked. CF3=0, in
this case, implies we have not missed the top bank and CF0=1 implies we have not missed
the bottom bank. The possible combinations of B0B+1 is also given in Figure 2.21(b).
Figure 2.21(c) shows two examples of how sequence generation and DFE works. At time
t=t2, the edge comparators predict and the data comparators verify that the signal is actually
on top. The top position implies 11 and 10 are the possible sequences of B0B+1. For the 11
bank, the floating comparators, CF3 and CF2, are 0 at t=t2 giving possible sequences of 00
and 01 for B-1B+2. So from the 11 bank the two possible sequences are: 1101 and 1100.
29
(a) (b)
(c)
Figure 2.21: (a) Placement of floating comparators. (b) Verification of edge prediction.
(c) Example sequence generation and DFE implementation.
CF1 and CF0 do the in-bank comparison for the 10 bank. At t=t2, both of them are 1 which
implies 11 and 10 are the two possible combinations of B-1B+2 in the 10 bank. So from the
10 bank, the two possible combinations are 1011 and 1010. The generated four
combinations go to the DFE multiplexer. One sequence from each bank is selected using
B+2 feedback. At t=t2, B+2 feedback is 0. As B+2 is LSB here, feedback is compared with
the LSB and the combinations with B+2=0 is passed to next MUX. Now, there are two
time
t4t3t2t1t0
0000
0001
0010
0011
0100
0101
0110
0111
1000
1001
1010
1011
1100
1101
1110
1111
Prediction
Verification Position
Possible
Sequence
of (B0B+1)Cedge1 Cedge0
0 0 CF3=0 Bottom00
01
0 1
CF3=0
&
CF0=1
Middle
01
10
1 1 CF0=1 Top10
11
Time Position CF3 CF2 CF1 CF0
Gen.
Seq.
B+2
Feedback
After B+2
Feedback
B+1
FeedbackOutput
t2 Top
0 01 1 0 1
0
1 1 0 0
1 11001 1 0 0
1 11 0 1 1
1 0 1 01 0 1 0
t3 Mid
0 01 0 0 1
1
1 0 0 1
1 01111 0 0 0
1 10 1 1 1
0 1 1 10 1 1 0
30
sequences that can be differentiated using B+1. At t=t2, B+1 feedback is 1. In the
combinations coming to MUX, B+1 bit is the MSB-1 bit. After comparison of B+1 feedback,
1100 sequence is passed as DFE output. A similar example is shown for t=t3.
Figure 2.22 shows the overall sorting process of the system. As there are 4 dominant taps
left in the single bit response after partial equalization by the passive equalizer, we have
total 16 sequences- 4 banks. The edge comparator prediction and the floating data
comparator verification give us 8 possible sequences from 2 banks. The in-bank floating
comparators get rid of half of them giving us a total of four sequences each from 2 banks.
These four sequences enter the DFE MUX. B+2 feedback chooses one sequence from each
bank. B+1 feedback chooses a final sequence that comes out as the DFE output.
Figure 2.22: Sequence sorting process of the overall system.
Prediction
Floating
Comparators in
Bank
Post-Cursor B+2
Feedback
Post-Cursor B+1
Feedback
h0
h-1
h+1
h+2
Input24=16
Sequence
8 Sequence
4 Sequence
2 Sequence
DFE Output
31
2.4. Hardware Cost and Performance of the System
For comparison among loop unrolled DFE, ADC-based DFE, and sequence DFE, we can
consider a general channel, which has N dominant taps remaining to be equalized. Out of
these taps, let us consider L taps are precursors and M taps are post-cursors. There is always
one main cursor. Therefore, the number of comparators required for the loop unrolled DFE
architecture can be written as
1
DFE UnrolledLoop 2or 2 LNMC . 2.1
For N tap channel, if we consider P-bit ADC, the number of comparators required can be
written as
12ADC PC . 2.2
In the sequence DFE architecture discussed so far, there are edge comparators for
prediction and data comparators for sequence generation and verification of prediction.
The number of comparators required can be written as
Factor Prediction
222 1
DFE Sequence
LMMC
2.3
where, the prediction factor is the measure of prediction capability of the edge comparators.
The first term in Eq. 2.3 gives the number of comparators required for prediction and the
second term gives the number of comparators required for in-bank comparison.
Figure 2.23 shows the comparison among the three cases. For the sequence DFE, the
prediction factor for the example 4 tap channel is considered to be 2, as we are covering
half of the total signal space. This figure shows the required number of comparators for the
sequence DFE for three values of the prediction factor (2, 3, and 4). In all cases, the number
32
of pre-cursors is considered 1. Figure 2.23 shows the number of comparators required for
the sequence DFE and the loop unrolled DFE is comparable.
Figure 2.23: Comparison of the required number of comparators among loop unrolled
DFE, ADC, and sequence DFE.
The noise margin of the sequence DFE can be defined as half the distance between two
banks having similar B+1 values. For an N tap sequence DFE having L taps of precursors
(h-i), main cursor (h0), and M taps of post-cursors (h+i), the noise margin can be written as
2
1
0
DFE Sequence
M
L
i
i hhh
NM
. 2.4
We get the noise margin of a conventional DFE when the un-equalized taps (in this case
precursors (h-i)) interfere destructively with the main cursor (h0). In the case of an ADC-
based DFE, the noise margin will further reduce due to the quantization noise (QN) from
the ADC. The quantization noise reduces as the resolution of the ADC increases.
3 3.5 4 4.5 5 5.5 6 6.5 70
20
40
60
80
100
120
140No. of Comparators Vs. Resolution
Resolution
No
. o
f C
om
pa
ra
tors
Loop Unrolled DFE
ADC
Sequence DFE-Prediction Factor=2
Sequence DFE-Prediction Factor=3
Sequence DFE-Prediction Factor=4
33
Figure 2.24 shows the comparison of noise margins between ADC-based DFE and
sequence DFE. In this figure, practical channels are considered that have 4-7 dominant
taps. The tap values are: h0=0.26 mV, h+1=0.16 mV, h-1=0.12 mV, h+2=0.08 mV, h+3=0.04
mV, h+4=0.02 mV, h+5=0.01 mV. For all cases, only one precursor is considered. For N tap
channel, N bit sequence DFE and N bit ADC-based DFE are considered. As the number of
bits of the ADC increases the noise margin also increases. For sequence DFE, the noise
margin increases as the number of post-cursor taps increases.
Figure 2.24: Comparison of noise margin between ADC-based DFE and sequence
DFE.
2.5. Error Tolerance
Section 2.3.2 discusses sequence generation and DFE implementation. There are prediction
error tolerance logic implementations that give the system margin to work with errors.
3 4 5 6 7
-10
10
30
50
70Voltage Margin Vs. Resolution
No. of Bits in ADC
Volt
age M
argin
(m
V)
4-tap Channel-ADC Based DFE
4-tap Channel-Sequence DFE
5-tap Channel-ADC Based DFE
5-tap Channel-Sequence DFE
6-tap Channel-ADC Based DFE
6-tap Channel-Sequence DFE
7-tap Channel-ADC Based DFE
7-tap Channel-Sequence DFE
34
Figure 2.21(b) discusses the verification of prediction done by the edge comparators.
Sections 2.5.1-2.5.3 discuss the prediction error correction techniques of the edge
comparators. Section 2.5.4 discusses the in-bank comparator error tolerance due to DFE
feedback.
2.5.1. Bottom Prediction Error Tolerance
Figure 2.25 shows one of the examples of edge comparator prediction error. With correct
prediction, the floating data comparators should be in the middle at time t=t1. Due to noise
or sampling error, the edge comparators predicted wrong and the floating data comparators
stayed at the bottom. At bottom position, possible B0B+1 combinations are 00 and 01. DFE
feedback at time t=t1 should be 0 as the previous bit is 0. This feedback will make final
output for B0 to be 0 after DFE feedback. So, due to the error in prediction, the main cursor
will be detected wrong. However, the are prediction error tolerance logic in the system
prevents it from happening. As the edge comparators send the data comparators to the
bottom, all the floating data comparators output go to 1. At the bottom, if all the data
comparators give 1, this implies an error in prediction of position of the floating data
comparators. So, we should overwrite the position of the data comparators to the middle.
For the middle position, CF3 and CF2 become CF1 and CF0 respectively. Their outputs are
overwritten in CF1 and CF0. In this error case, if the comparators were placed in the middle,
the most probable outputs of the top two comparators are 0. So, 0 is overwritten there. This
error check gives the correct choice for DFE operation.
35
Figure 2.25: Error tolerance of edge comparator prediction to send the floating data
comparators to bottom at time t=t1.
2.5.2. Mid Prediction Error Tolerance
The prediction error tolerance of the mid position has two cases. In case I, the correct
position for the floating data comparators are on top, but due to noise, they ended up in the
middle (Figure 2.26). Any time an error happens in predicting the position of the floating
data comparators, the system will make an error in B0. Case I for the mid position prediction
error tolerance is similar to the bottom position prediction error tolerance. While in the
With Noise Wrong Prediction Correction using Floating Comp. OP
Position CF3 CF2 CF1 CF0 Position CF3 CF2 CF1 CF0
Bot
1 1
Mid
0 0
1 1 1 1
time
t4t3t2t1t0
0000
0001
0010
0011
0100
0101
0110
0111
1000
1001
1010
1011
1100
1101
1110
1111
Comp. Position
for correct
prediction
Comp. Position
with noise
While in bottom,
if all floating
comps give 1, it
implies error
Needs
Correction
36
middle, if all the floating data comparators give 1, it suggests an error has occurred. As all
of them are 1, the signal may be on top instead. So, after verification, mid position is
overwritten to the top here. For the top position, CF3 and CF2 become CF1 and CF0
respectively. Their output is overwritten in CF1 and CF0. CF3 and CF2 will be overwritten
with 0 as it is the most probable outcome.
Figure 2.26: Error tolerance of edge comparator prediction to send the floating data
comparators to the middle at time t=t2.
Case II of the mid prediction error tolerance is illustrated in Figure 2.27. At time t=t2, the
correct floating comparator position is at the bottom but due to an error, it ended up in the
With Noise Wrong Prediction Correction using Floating Comp. OP
Position CF3 CF2 CF1 CF0 Position CF3 CF2 CF1 CF0
Mid
1 1
Top
0 0
1 1 1 1
While in mid, if
all floating comps
give 1, it implies
error
Needs
Correction
time
0000
0001
0010
0011
0100
0101
0110
0111
1000
1001
1010
1011
1100
1101
1110
1111Comp. Position
for correct
prediction
Comp. Position
with noise
t4t3t2t1t0
37
middle. While in the mid position, if all the floating data comparators give 0, it implies an
error has occurred. As all of them are 0, the signal may be at the bottom instead. So, now
mid position is overwritten to the bottom. For the bottom position, CF1 and CF0 become CF3
and CF2 respectively. Their output is overwritten in CF3 and CF2. CF0 and CF1 will be
overwritten with 1 as it is the most probable outcome.
Figure 2.27: Error tolerance of edge comparator prediction to send the floating data
comparators to bottom at time t=t3.
With Noise Wrong Prediction Correction using Floating Comp. OP
Position CF3 CF2 CF1 CF0 Position CF3 CF2 CF1 CF0
Mid
0 0
Bot
0 0
0 0 1 1
While in mid, if
all floating comps
give 0s, it implies
error
Needs
Correction
time
0000
0001
0010
0011
0100
0101
0110
0111
1000
1001
1010
1011
1100
1101
1110
1111
Comp. Position for
correct prediction
Comp. Position
with noise
t4t3t2t1t0
38
2.5.3. Top Prediction Error Tolerance
Top prediction error tolerance is similar to case II of the mid prediction error tolerance.
From the top position, the most probable miss will be the mid position. While on top, if all
the data comparators are 0, prediction verification overwrites position to the mid. Outputs
of CF1 and CF0 go to CF3 and CF2 respectively. CF1 and CF0 are overwritten with 1.
Figure 2.28: Error tolerance of edge comparator prediction to send the floating data
comparators to the top at time t=t3.
With Noise Wrong Prediction Correction using Floating Comp. OP
Position CF3 CF2 CF1 CF0 Position CF3 CF2 CF1 CF0
Top
0 0
Mid
0 0
0 0 1 1
While in top, if all
floating comps
give 0, it implies
error
Needs
Correction
time
t4t3t2t1t0
0000
0001
0010
0011
0100
0101
0110
0111
1000
1001
1010
1011
1100
1101
1110
1111
Comp. Position
for correct
prediction
Comp.
Position
with noise
39
2.5.4. In-Bank Comparator Error Tolerance
In prior sections, we discussed how we could fix the issues of comparator and sampling
error that result in incorrect positioning of the floating data comparators. We showed that
if we make an error in predicting the position of the floating data comparators, we can fix
that using the verification logics. Now, if there is an error in data comparators, the DFE is
there to fix it.
Figure 2.29: In-bank comparator error tolerance of the system using DFE.
Noise Position CF3 CF2 CF1 CF0
Gen.
Seq.
B+2
Feedback
After B+2
Feedback
B+1
FeedbackOutput
w/o Mid
0 01 0 0 1
1
1 0 0 1
1 01111 0 0 0
1 10 1 1 1
0 1 1 10 1 1 0
w
+ve
noise
Mid
0 11 0 1 0
1
1 0 0 1
1 01111 0 0 1
1 10 1 1 1
0 1 1 10 1 1 0
w
-ve
noise
Mid
0 01 0 0 1
1
1 0 0 1
1 01111 0 0 0
1 10 1 1 1
0 1 1 10 1 1 0
time
t4t3t2t1t0
0000
0001
0010
0011
0100
0101
0110
0111
1000
1001
1010
1011
1100
1101
1110
1111
Sample
w/o
Noise
Sample
with -ve
Noise
Sample
with +ve
Noise
40
Figure 2.29 shows three cases of sampling at time t=t3. In the first case, when there is no
noise, the correct output is 0111. Now, with some noise, if the signal is pushed up, one of
the floating comparators, CF2, gives 1 which was giving 0 in the no noise case. So, from 10
bank DFE choices are now 1010 and 1001. However, from 01 bank DFE choices do not
change, as comparators related to this bank are still giving the correct result. Now, the DFE
comes in action and detects the correct sequence. If noise pushes signal down, as long as it
is above CF1 reference level, no error will occur.
2.6. System Design
The overall block diagram of the quadrate implementation of the system is illustrated in
Figure 2.30. The architecture requires three edge comparators and four data comparators
in each time-interleaved path. Three edge comparators serve dual purposes: (a) provide the
timing error information with higher resolution and (b) place the data comparators in the
vicinity of the next sample. In addition, the edge samples with the decoded sequence allow
us to filter edges with ISI. Two edge comparators and four data comparators are used to
generate sequences with redundancy for sequence DFE. The decoded main bits from
previous two time-interleaved paths come in as B+1 and B+2 feedbacks and help in DFE
operation.
In subsequent sections, we will talk about the analog front-end and the critical timing issues
of the system. Section 2.6.1 - 2.6.4 discuss passive equalizer, sample and hold (SH),
comparator, and SR latch respectively. Section 2.6.5 - 2.6.6 discuss the critical timing
issues of the system.
41
Figure 2.30: Maximum likelihood sequence detector with passive equalization and
timing recovery.
2.6.1. Passive Equalizer
The passive equalizer used in this system is a C-R high pass filter that attenuates low-
frequency portion of the incoming signal. Figure 2.31 shows the schematic and AC
response of the equalizer. The transfer function of the equalizer can be written as:
.1
Equalizer, theof Zero
.1GainFrequency High
.Gain DC
.)(
Function,Transfer
4
4
44
44
EQ
z
EQ
EQEQ
EQ
out
in
CR
RR
R
RsCRRR
RsCRR
V
V
2.5
CH270
Passive
Equalizer
2.5 GHz
Clock GenTDC
φ0
φ315
Reference
Generator
Sequence
GeneratorB0B1B-1B2
Sequence
DFE
CH0
CH90
CH180
(Data S/H)
(Edge S/H)
CH270
42
)(
1 Equalizer, theof Pole
4RRC EQ
p .
The choice of values of resistors and capacitor is critical. Small values of R imply
degradation in input matching and the equalizer needs a bigger capacitor to have zero at
proper frequency. Larger values of R give higher input time constant. The thermal noise
generated from these resistors is small enough that it does not effect the sequence DFE
operation. The tunable REQ of the equalizer makes it possible to have a DC attenuation
from 3 to 8 dB. It also moves the pole and the zero of the system. The tunable boost also
makes it possible to work with channels of different loss. As long as the passive equalizer
gives us a single bit response with four dominant taps remaining to be equalized, the system
can equalize the rest. If channel taps remain unequalised, their impact will come as
deterministic noise. Figure 2.32(a) shows the AC analysis of the equalizer for the example
link with 27 dB loss. Here, the passive equalizer equalizes 6 dB of channel loss and the
maximum likelihood sequence detector will equalize the remaining 21 dB of loss. The
transient eyes of input and output of the passive equalizer are shown in Figure 2.32(b) and
Figure 2.32(c). As the equalizer output only has 4 dominant taps, 16 levels with overlap
appear at the equalizer output. Figure 2.33 shows the layout of the passive equalizer.
Figure 2.31: Schematic of passive equalizer with its AC response for different settings.
107
108
109
1010
1011
-8
-7
-6
-5
-4
-3
-2
-1
0
Passive Equalizer AC Response for Different Settings of REQ
Frequency (Hz)
Ga
in (
V/V
)
REQ
=R1
REQ
=R1||R
2
REQ
=R1||R
3
REQ
=R1||R
2||R
3
INN
C
R1
R2
R3R4
INP
C
R1
R2
R3
R4
REQ
REQ
VCM
OUTP
OUTN
C R1 R2 R3 R4
1 pF 405 Ω 810 Ω 405 Ω 203 ΩCL
CL
43
Figure 2.32: Passive equalizer response for example link. (a) AC analysis, (b) input to
the equalizer and (c) output of the equalizer.
Figure 2.33: Layout of the implemented passive equalizer.
-50 0 50
-0.6
-0.3
0
0.3
0.6
Input to the Passive Equalizer
Time (ps)
Am
pli
tud
e (V
)
107
108
109
1010
-35
-30
-25
-20
-15
-10
-5
0
Passive Equalizer AC Analysis
Frequency (Hz)
Ga
in (
V/V
)
Channel Attenuation
Equalizer Boost
After Equalization
-50 0 50
-0.3
-0.15
0
0.15
0.3
Output of the Passive Equalizer
Time (ps)
Am
pli
tud
e (V
)
(a)
(b) (c)
6 dB DC
Attenuation 27 dB
Lossy
Channel
Loss yet
to be
equalized
Capacitor Capacitor
Resistors
Boost Control
44
2.6.2. Sample & Hold (SH)
Figure 2.34 shows the concept of a conventional sample and hold circuit. The input data,
Vin, comes in and hits a switch. A Clock signal controls the switch. Whenever the Clock is
high, the switch gives a direct path from the input to the output. The RC time constant of
resistance from switch and Chold has to be small so that the SH has enough bandwidth to
follow the high-speed input and change accordingly. When Clock goes low, Chold will hold
the last value it gets from the input.
Figure 2.34: Concept of sample and hold.
High performance and high-speed data receivers require good linear performance from
their SH circuits. High-performance SH circuits are usually implemented using Switched
Capacitor (SC) circuits. The input sampling switch limits the linearity of these SH circuits.
Non-linearity associated with the sampling switch is mainly attributed to non-linear on-
resistance (RON) and associated parasitic capacitance of the transistors. These transistor
switches produce harmonic distortion when sampling high-frequency signals. As a result,
SNDR (Signal to Noise and Distortion Ratio), SFDR (Spurious Free Dynamic Range), and
THD (Total Harmonic Distortion) of the incoming signal deteriorate.
The on-resistance (RON) of a transistor switch is given by [24]
Vin Vout
Clock
holdC
45
where,
(t)VtVL
WμC
(t)R
THgsox
ON
1
r. transisto theof voltageThreshold)(V
voltagesource toGate
areaunit per eCapacitanc Oxide
r transistoofcarrier charge ofMobility
TH
t
(t)V
Cox
μ
GS
2.6
VGS(t) and VTH(t) depend on the incoming data signal, Vin(t).
tVtVtV inggs 2.7
where,
)φ2φ2(γ)( 0 FSBFTHTH tVVtV
.parameters Deviceφ γ,
gebody volta toSource
0 when voltageThresholdV
F
TH0
(t)V
V
SB
SB
2.8
where,
BinSB VtVtV
VB=Body Voltage.
2.9
The RON along with the Chold define the tracking bandwidth of the SH circuit as given by
Eq. 2.10. The dependence of RON on the time varying input signal means that f-3dB will be
different for different values of the input signal. Therefore, the SH will not track all values
of the input signal equally.
hold
THgsoxn
holdON
dBC
(t)VtVL
WCμ
CRf
π2
11
π2
13 .
2.10
46
Figure 2.35: Implemented sample and hold circuitry (a) and its simulated differential
operation (b).
For a fixed Chold, the RON has to be small enough to achieve high bandwidth in the SH. So,
in our design, as illustrated in Figure 2.35 (a), MP1 and MP2 have high W/L ratio, which
allows low RON and high bandwidth. Differential operation gets rid of clock feedthrough
that comes when sampling pulses come in or when the comparator in the next stage of the
SH is clocked. The expected signal swing at the input of SH is about 600 mV with a
common mode of around 800 mV. Using transmission gates here as switches will not help.
Bootstrapping makes RON independent of Vin(t) but adds a lot of complexity.
P1M
INP
INN
CLKB
CLK
OUTP
OUTNP2M
P3MP4M
2C
1C
-0.4
-0.2
0
0.2
0.4
Am
pli
tud
e (V
)
1.3 1.4 1.5 1.6 1.7 1.8 1.9 2
0
1.2
Time (ns)
Am
pli
tud
e (V
)
Input=INP-INN Sampled Value=OUTP-OUTN
Sampling
Pulse
Track
TimeHold Time
(a)
(b)
47
The sampling time is also a function of Vin(t). When the voltage of the capacitor node is
lower than the input voltage, the input node acts as the source of the sampling PMOS. The
relation among Vin(t), Vgs(t), and VTH(t) is shown in Eq. 2.7 - 2.9. In the implemented SH,
for low values of Vin(t), the PMOS switch turns off too fast while for high values of Vin(t)
it turns off too late causing distortion.
Charge injection is another source of non-linearity in the SH. In the case of PMOS
switches, when CLK signal goes high, the charge in the channel is distributed between the
drain and the source of the MOSFET. The amount of charge escaping from the channel is
a complex function of impedance defined by the amount of charge to the ground and the
transition time of the controlling clock. This charge injection gives us gain error and DC
offset in the SH. The transistors, MP3 and MP4, are there to absorb the injected charge from
the channel, which is a crude way to get rid of this problem without adding complexity to
the SH. The transistors MP3 and MP4 are sized half of MP1 and MP2, assuming that half of
the injected charge will flow in each way. Figure 2.35(b) shows the differential operation
of the implemented SH. The SH in this design is triggered with 25% duty cycle pulses. As
a quadrate system, the SH tracks the incoming data for 1 UI and holds the data for 3 UI.
Figure 2.36 shows the layout of the implemented sample and hold circuit. The figure
specifies the positions of sampling PMOS, compensating PMOS, and capacitor in the
sample and hold.
48
Figure 2.36: Layout of sample and hold circuit.
2.6.3. Comparator
Figure 2.37 shows the implemented 4-input comparator. First, for simplicity, we can
remove MN5, MN6, and MN7 and consider this as a 2-input comparator. The strong-arm latch
based comparator gets the sampled value from the SH circuit. The theoretical depiction in
Figure 2.38 shows four stages of operation of the comparator. At first, the sampling phase
starts when CLK signal goes from low to high at t=t0. The transistors MN3 and MN4
discharge VN and VP node respectively. MN1 and MN2 discharge OUTN and OUTP
respectively. The sampling phase ends when PMOS transistors MP5 and MP6 turn on. The
regeneration phase starts at t=t1. At this stage, as PMOS transistors are ON, we get cross-
coupled inverters formed by MN1 and MP5 pair and MN2 and MP6 pair. The positive feedback
from the cross-coupled inverters amplifies the signal. When the comparator outputs touch
the rail at t=t2, it enters the decision phase. The decision from this stage is latched using an
SR latch afterwards. When the CLK signal goes low at t=t3, the comparator outputs resets,
i.e. both outputs go to VDD. The speed of the comparator depends on its sampling and
CapacitorSampling
PMOS
Compensating
Devices
49
regeneration stage. Figure 2.39 and Figure 2.40 show the simulation results and the layout
of the implemented comparator.
Figure 2.37: Schematic of the implemented comparator.
Figure 2.38: Operational stages of the comparator.
N1M
P3M
N2M
P6M
INP
OUTP
VDD
P1MP2M
P4MP5M
INN REFPREFN
OUTN
CLK CLK
CLK CLK
5NM 3NM 4NMN6M
N8MN7M
VN VP
CLK
Sampling Regeneration Decision Reset
SH_OUTP
SH_OUTN
COMP_OUTP
COMP_OUTN
t=t0 t=t1 t=t2 t=t3
50
Figure 2.39: Simulation results showing sampling, regeneration, decision, and reset
stages of the comparator.
Figure 2.40: Layout of the implemented comparator.
-0.4
-0.2
0
0.2
0.4
Am
pli
tud
e (V
)
0
1.2
1.5 1.6 1.7 1.8 1.9
0
1.2
Time (ns)
Am
pli
tud
e (V
)
0
Sampled Value → Input to the comparator
Clock
Sampling
Regeneration
Decision Reset
OUTN
OUTP
Strong-
Arm
Input
51
2.6.4. SR Latch
The implemented comparator resets to VDD, as CLK goes low. Therefore, the
implemented set-reset (SR) latch has to hold the previous value when both of its inputs are
high. Figure 2.41 shows the NAND implementation of SR latch. In this implementation,
both S=0 and R=0 are invalid inputs. In our case, the comparator outputs (OUTP and
OUTN) will never go to VSS at the same time. When both signals are high, the SR latch
retains its previous value. When R=0, it forces Q(t+1) =1. As, S=1 and Q(t+1) =1, QB(t+1)
goes to 0. When S=0, it forces QB(t+1) =1. As, R=1 and QB(t+1) =1, Q(t+1) goes to ‘0’.
In this way, cascading a strong-arm based comparator with an SR latch, we get a strong-
arm flip-flop. The advantages of this strong-arm flip-flop are zero static power
consumption and full rail to rail swing.
Figure 2.41: SR Latch implementation and truth table.
S
RQB
Q
S R Q (t+1) QB (t+1)
1 1 Q (t) QB (t)
1 0 0 1
0 1 1 0
0 0 Not Allowed
N1M
P1M
N2M
P4M
S
QB
VDD
P2MP3M
R
Q
N3M N4M
52
Figure 2.42: Layout of the implemented SR latch.
2.6.5. Reference Muxing Timing Issue
The receiver architecture utilizes the output of the edge comparators to place the references
of the data comparators. The references of the data comparators have to settle before they
are clocked. The timing diagram of the SHs and comparators for the edge and data are
shown in Figure 2.43. For CH0, the edge SH is clocked by P315, i.e. the pulse rising in-line
with Ф315. The edge comparators are clocked 0.5 UI later using Ф0, and at the same time,
the data signal is sampled using P0. As the generated pulses have 3 UIs of hold time, the
data comparators have to be clocked within these 3 UIs. In this receiver, we have clocked
the data comparators using Ф180, which is 2 UI later. The simulation results of Figure 2.44
shows that reference settles within 10% of its final value which is within 150 ps. Therefore,
in this case, it gives 50 ps timing margin.
53
Figure 2.43: Edge and data SH and comparator clocking for CH0.
Figure 2.44: Reference settling behavior.
2.6.6. Sequence DFE Critical Timing
The sequence generator gives the sequence DFE four 4-bit sequences which are
differentiable using post-cursors, B+1 and B+2. In this quadrate system, DFE feedback is
shown in Figure 2.45. In this 2-tap DFE, B+2 feedback comparison is done using the usual
XOR operation. However, like all DFE architecture, 1st-tap DFE feedback timing is critical
here. To facilitate this timing, for B+1 feedback, sequences are pre-calculated for XOR
comparison which save one gate delay. Whenever B+1 feedback arrives, the data has to
pass through just one MUX (Figure 2.45). The logic implemented is given by the equations
below:
P315
P0
Φ0
Φ180
Edge SH
Edge Comp
Data SH
Data Comp
2 UI
6.15 6.2 6.25 6.3 6.35 6.4 6.45 6.5 6.55
-0.3
-0.2
-0.1
0
0.1
6.15 6.2 6.25 6.3 6.35 6.4 6.45 6.5 6.55
-0.3
-0.2
-0.1
0
0.1
6.15 6.2 6.25 6.3 6.35 6.4 6.45 6.5 6.55
-0.1
-0.05
0
0.05
0.1
0.15
References
Edge Comparator
output & Reference
update
Edge Reference
50 ps
Margin
Edge SampleData Sample
φ0
φ180
Time (ns)
Φ0
Φ180
References
Input
Edge Reference
54
Figure 2.45: Sequence DFE and loop unrolling for B+1 feedback (red box).
]2D1B1_FB[0:3D1]2D1B1_FB[0:3D0
]2D1B1_FB2D1B1_FB[0:3D1
]2D1B1_FB2D1B1_FB[0:3D0
]2D10:3D1_PR2D10:3D0[B1_FB
]2D10:3D1_PR2D10:3D0[B1_FB
B1_FB0:3D1_PRB1_FB0:3D0_PR0:3OP
2D10:3D1_PR2D10:3D00:3D1_PR
2D10:3D1_PR2D10:3D00:3D0_PR
2.11
2.12
2.13
2.7. Implementation & Experimental Results
The implemented quadrate receiver shown in Figure 2.46 occupies only 0.23 mm2 with
each time-interleaved path taking 240 μm X 240 μm. The digital sequence decoder
consumes only 35 mW of power out of which 14 mW is taken in digital operation (Figure
B-1 B0 B+1 B+2CH0
CH90
CH180
CH270
B+2 Comparison
Four 4b
Sequence
DFE_OP
Two 4b
Sequence
B+1Comparison
4b
B-1 B0 B+1 B+2
B-1 B0 B+1 B+2
B-1 B0 B+1 B+2
OP<3:0>
B1_FB
0
1
0
1
D0<3:0>
D1<3:0>
D0<3:0>
D1<3:0>
D1<2>
0
1
D1_PR<3:0>
D0_PR<3:0>
Loop Unrolling for B+1 Feedback
55
2.47). The additional clock recovery circuit consumes around 9 mW. The test setup with
all the required instruments is shown in Figure 2.48. Different amount of channel loss is
realized by having different lengths of coaxial cable. Each time with different channel loss,
the single bit response of the channel is observed and the dominant tap values are measured.
These tap values are used to set the reference values for the comparators. Tunable current
sources of the reference generator of the chip allow us to do the tuning of references for
different channels. The received signal and the decoded half-rate time sequence DAC
output are put on top of one another in Figure 2.49. Each time the decoded sequence DAC
output relates close to the received signal after the passive equalizer. The 10 Gb/s input eye
and the 16 level sequence detector output eye from one of the four channels are shown in
Figure 2.50. Figure 2.51 shows the measured recovered clock of the system with only 11.56
ps peak to peak jitter. Without any transmit equalization, 4-bit sequence decoder operates
error-free over a 27 dB loss channel with 90 mV voltage margin and 25 ps timing margin
(Figure 2.52).
Figure 2.46: Implemented prototype in 65nm CMOS.
Test & Interfacing
Circuit
CH90
Clocking
CH0
CH270CH180
240 µ
m
Digital
Back-end
Analog
Front-end
240 µm
Co
mp
ara
tor
Ba
nk
REF
Lines
RE
F G
EN
56
Figure 2.47: Complete receiver with its power consumption.
Figure 2.48: Pin diagram and test setup of the prototype.
CH270
Passive
Equalizer
2.5 GHz
Clock GenTDC
φ0
φ315
Reference
Generator
Sequence
GeneratorB0B1B-1B2
Sequence
DFE
CH0
CH90
CH180
(Data S/H)
(Edge S/H)
CH270
9 mW
9 mW
12 mW
14 mW
OU
TP
_E
Q
OU
TN
_E
Q
IRE
F_M
IRV
SS
VC
MIN
P
INN
OUTP_B0
OUTN_B0
PRBS_ERROR
AVDD
VSS
AVDD
VSSDVDD
IREF_H1
IREF_H2
IREF_H0
IREF_H_1
IRE
F_O
UT
DVDD
SR_IN
SR_CLK
CV
DD
VS
S
OUTP_16
OUTN_16CLK_OUTP
CLK_OUTN
RE
SE
T
CV
DD
Tektronix
DPO 71604C
Oscilloscope
Agilent
86100D
Infiniium
DCA-X
Oscilloscope
Used for
eye
diagrams
Transient Single bit response is used
for characterizing the channel
Tektronix
AWG
70002A
Input Data Pattern
Channel
Control Bit
Load using
Arduino
Used for
tuning
references
On-Chip
PRBS
Checker
Output
CHIP ID:
ICSAADEQ
57
Figure 2.49: Measured half rate 4-bit sequence DAC output.
Figure 2.50: Measured 10 Gb/s input eye (a) and 16 levels output eye of sequence
detector (b).
Figure 2.51: Measured 2.5 GHz recovered clock eye (a) and histogram (b).
4-bit Sequence
Input
(a) 10 Gb/s Input Eye (b) 16-level output eye of the sequence decoder
20 ps/div 80 ps/div
(a) Recovered Clock Eye (b) Recovered Clock Histogram
50 ps/div 10 ps/div
58
Figure 2.52: BER bathtub (a) and the recovered 2.5 Gb/s PRBS (Pseudo-random bit
sequence) checked data eye coming out of the chip using 1-bit DAC (50Ω buffer).
The comparison between this work and the existing state-of-art ADC-based solutions is
listed in Table 2.1. Chen et al. [12] and Zhang et al. [16] worked with 4-way time-
interleaved flash ADC architecture with baud-rate timing recovery in place. Although
Zhang et al. can compensate channel loss up to 34 dB, the additional DSP required for that
consumes 500 mW of power. Shafik et al.[11] demonstrated a power efficiency of 8.7
pJ/bit using a 32-way time-interleaved SAR ADC without any clock recovery system. The
architecture can compensate up to 25.3 dB channel loss, which gives it a figure of merit
(FoM) of 0.344 pJ/bit/dB. This work improves the state-of-art by compensating 27dB
channel loss at 10 Gb/s consuming only 35 mW. Therefore, the design achieves a power
efficiency of 3.5 pJ/bit and FoM of only 0.13 pJ/bit/dB.
-0.5 -0.4 -0.3 -0.2 -0.1 0 0.1 0.2 0.3 0.4 0.510
-12
10-10
10-5
100
BE
R
Clock Phase (UI)
25%
(a) BER bathtub (b) Recovered Data Eye
100 ps/div
59
Table 2.1: Performance Summary of the 10 Gb/s Receiver.
Chen
JSSC’12[4]
Zhang
ISSCC’13[16]
Shafik
ISSCC’15[11] This work
Equalizer
Architecture
4x Variable
Ref. Flash
ADC
4x Rectified
Flash 32xSAR
4x Sequence
DFE
Data Rate 10 Gb/s 10.3125 Gb/s 10 Gb/s 10 Gb/s
Technology 65nm 40nm 65nm 65nm
Supply Voltage
(V) 1.1 N/A 1 1.2
Compensated
Channel loss
29/23 dB
@10 Gb/s
34 dB
@10.3125 GS/s
25.3 dB
@10 GS/s
27 dB
@10 GS/s
BER <10-12 <10-12 <10-10 <10-12
Timing
Recovery Baud-rate Baud-rate None
Data-Edge
Sampled
Power
Consumption
(mW)
Rx– 130 ADC-195
DSP- 500
ADC – 79
Dig. EQ – 8
Sampler – 12
Digital – 14
Clocking – 9
Area (mm2) 0.26 0.82 0.81 0.23
Efficiency
(pJ/bit) 13/10.6 67.4 8.7 3.5
FoM (pJ/bit/dB) 0.45/0.46 1.98 0.344 0.13
60
Chapter 3.
A 16 Gb/s Direct Digital Sequence
Detector with Data Trace-back and
Equalizer in 65nm CMOS
The previous chapter introduced the concept of sequence decoding where we use the ISI
components in a constructive way to recreate the time sequence of bits transmitted from
the other side of the channel. This naturally leads to a fundamentally different way of
processing the pre-cursor. In the traditional approach, the pre-cursor is reduced at the cost
of SNR whereas here the pre-cursor ISI is used to predict the next bit. This opens up the
potential for better equalization strategy that results in higher voltage margin. Due to the
fixed supply, SNR will degrade especially when higher order modulations are introduced.
The system may benefit from techniques as discussed in this chapter by achieving lower
BER. The main concept of the pre-cursor ISI utilization is to correct and mitigate the DFE
error propagation. Since the next bit at time TN-1 is part of the sequence that is selected
through the sequence DFE, one UI later at time TN we can compare the current bit with the
next bit from the previous cycle to detect error. In addition to error detection, we can also
correct the error. The error correction mechanism involves three steps. First, we identify
bit decisions that are not corrupted by the DFE – we name those decisions as high
61
confidence decisions. We only use such high confidence decisions for error correction.
Second, we make use of the pre-cursor to generate the next bit. Rather than generating a
single sequence, we generate two potential sequences differentiable based on the next bit.
Third, when the bit is detected with high confidence, we use that to select the sequence
with the next bit that matches the high confidence correct bit.
The chapter begins with performance limiting factors of the receiver architecture described
in Chapter 2. Section 3.1 discusses two cases where noise limits the performance of the
current architecture. Section 3.2 discusses what we can do to overcome these limitations to
give better noise immunity to the system. This section arrives at the technique of 1-bit data
trace-back. Sequence generation and feedback for data trace-back are also discussed here.
Section 3.3 describes the modified block diagram of the receiver. The receiver in this
chapter also works at a higher speed than the previous one. Section 3.4 shows the
experimental results and summary of the modified receiver performance with data trace-
back.
3.1. Noise Tolerance Limit of Current Receiver
The receiver architecture discussed in Chapter 2 is capable of compensating 27 dB channel
loss with a voltage margin of around 10% of total signal space at BER of 10-12. Section 2.5
of Chapter 2 discussed how we can recover from errors introduced by noise and prediction
by the edge comparators. If there is an error in prediction by the edge comparators, we can
verify that prediction using the floating data comparators. Errors caused by the floating
data comparators can be fixed using the DFE feedback.
62
In this section, we will discuss two cases where noise limits the performance of the current
receiver architecture. In case I, the edge comparators predict incorrectly and the
verification using the data comparators cannot fix that error. In case II, the edge
comparators predict correctly, but the data comparators see a large noise for in-bank
comparison.
3.1.1. Case I
Figure 3.1 shows a case of simultaneous error demonstrated by both the edge and the data
comparators due to noise at time t=t3. In the correct operational case, the prediction by the
edge comparators will send the data comparators to the middle position with references
from the 10 and 01 banks. In this position, two of the floating comparators, CF1,0, give 1 as
output and the other two, CF3,2, give 0. So, the generated sequences from the 10 bank are
1000 and 1001; the generated sequences from the 01 bank are 0110 and 0111. B+2 feedback
will come at first and as it is 1, 1001 sequence from the 10 bank and 0111 from the 01 bank
will be forwarded for B+1 feedback sorting. At t=t3, B+1 feedback is 1. The correct DFE
output should be 0111 sequence. For the error case, the edge sample of the incoming data
was pushed up by noise and this noise sent the data comparators to the top. The position
verification system does not work here as the bottom data comparators see data that is
higher than its reference level. Here, CF0=1 and the rest are 0. For the top position,
sequences from the 11 bank are 1100 and 1101 and from the 10 bank are 1001 and 1010.
If we assume the DFE feedbacks are still there and working the way they should work, the
final output will be 1101. In this situation, both B0 and B-1, that are not in feedback, are
incorrect.
63
Figure 3.1: Case I – error in sequence detection. B0 and B-1 detected incorrect.
3.1.2. Case II
Figure 3.2 shows an error by the in-bank floating comparators at time t=t3. The correct case
is already discussed in Section 3.1.1. In this case, the prediction by the edge comparators
is correct. At time t=t3, the possible banks are 10 and 01. Due to large noise, the incoming
signal is pushed down and CF1 becomes 0; however, in the ideal case, it should be 1. So,
from the 01 bank possible sequences are 0101 and 0110. After DFE feedback, 0101 is
selected as the output which implies an error in B-1 detection.
Noise Position CF3 CF2 CF1 CF0
Gen.
Seq.
B+2
Feedback
After B+2
Feedback
B+1
FeedbackOutput
w/o Mid
0 01 0 0 1
1
1 0 0 1
1 01111 0 0 0
1 10 1 1 1
0 1 1 10 1 1 0
w
+ve
noise
Top
0 01 1 0 1
1
1 1 0 1
1 11011 1 0 0
0 11 0 1 0
1 0 0 11 0 0 1
Comp. Position
for correct
prediction
Comp.
Position
with noise
time
t4t3t2t1t0
0000
0001
0010
0011
0100
0101
0110
0111
1000
1001
1010
1011
1100
1101
1110
1111
Sample
w/o
Noise
Sample
with +ve
Noise
64
Figure 3.2: Case II – error in sequence detection. B-1 detected incorrect.
3.2. Improving Noise Tolerance of Current Receiver
The current architecture as discussed in Chapter 2 uses the DFE feedback and detects the
main cursor, B0. The precursor, B-1, comes as a by-product. The by-product can be used
further to verify the detection of B0. The current architecture works with the prediction
from the edge comparators that is later verified using the floating data comparators. Instead
of the prediction, we can use sub-ranging ADC approach, where the floating data
comparators will be placed based on the data sample rather than the edge sample.
Noise Position CF3 CF2 CF1 CF0
Gen.
Seq.
B+2
Feedback
After B+2
Feedback
B+1
FeedbackOutput
w
-ve
noise
Mid
0 01 0 0 1
1
1 0 0 1
1 0 1 0 11 0 0 0
0 10 1 1 0
0 1 0 10 1 0 1
time
t4t3t2t1t0
0000
0001
0010
0011
0100
0101
0110
0111
1000
1001
1010
1011
1100
1101
1110
1111
Sample
w/o
Noise Sample
with
huge -ve
Noise
65
3.2.1. Setting Fixed Data Comparators
Two fixed data comparators can be used instead of using the edge comparators. These two
data comparators will directly drive the reference multiplexers of the floating comparators.
The reference levels of these two comparators are well defined (Figure 3.3). The two fixed
comparators have to differentiate between banks. CFIX1 is there to differentiate between the
11 and 01 banks. So, it can be placed in between the top of the 01 bank and the bottom of
the 11 bank, i.e. in between levels 0111 and 1100. If CFIX1=0, the reference levels of the 01
bank will be passed to the floating comparators. If it is 1, the reference levels of the 11
bank will be passed to the floating comparators. The same is true for the 10 and 00 banks.
CFIX0 is in between 0011 and 1000 levels. CFIX0=0 implies references of the 00 bank will
be passed to the floating comparators and CFIX0=1 implies references of the 10 bank will
be passed to the floating comparators.
Figure 3.3: Introduction of two fixed data comparator reference instead of fixed edge
comparators.
0000
0001
0010
0011
0100
0101
0110
0111
1000
1001
1010
1011
1100
1101
1110
1111
11
11
10
01
00
01
11
10
01
00
10
11
10
01
00
00
11
10
01
00
CFIX1
CFIX0
1
0
1
0
66
3.2.2. Case I
The fixed data comparators fix the error of case I with little margin (Figure 3.4(a)). If there
is higher noise, it will push the signal up a little bit more again producing error (Figure
3.4(b)). The additional noise sends the floating comparators to the top again and the
verification logic cannot recover: the error prevails. To fix this issue, we can use the
decoded B-1 bit of the sequence. Figure 3.5 shows how the concept of using B-1 bit works.
At time t=t3, both B+2 and B+1 are 1. To illustrate the concept, we can start with the feedback
first. B+2 and B+1 DFE feedback leave 4 possible combinations out of 16 possible
sequences; two combinations are from the 11 bank and two are from the 01 bank. As the
fixed data comparators place the floating comparators to the top position, it is highly
unlikely to have 0101 as a possible sequence. 0101 is the bottom-most possible sequence
while the fixed data comparators placed the floating comparators on the top. So, we can
Figure 3.4: Fixing error with little margin using fixed data comparators.
time
t4t3t2t1t0
0000
0001
0010
0011
0100
0101
0110
0111
1000
1001
1010
1011
1100
1101
1110
1111
Sample
w/o
Noise
Sample
with +ve
Noise
New Data
Comps. fixes
error with little
margin
time
t4t3t2t1t0
0000
0001
0010
0011
0100
0101
0110
0111
1000
1001
1010
1011
1100
1101
1110
1111
Sample
w/o
Noise
Sample
with +ve
Noise
Correct
Comp.
Position
Comp.
Position
with noise
(a) (b)
67
Figure 3.5: Introduction of check comparator to find out whether it may miss a bank or
not.
neglect 0101 sequence. When the fixed data comparators are placing the floating
comparators for the 11 and 10 banks, the most probable sequence we can miss is the top of
the 01 bank i.e. 0111 sequence. To see if there is any chance of missing the top of the 01
bank, we can use a check comparator. In this case, the placement of the check comparator
is in between the top of the probable missing bank and the in-bank bottom sequence having
the same values of B+2 and B+1. The check comparator at t=t3 gives 0. Out of the probable
3 sequences, this implies that the top most can be neglected. As the next bit is a definite 1,
it can come as feedback and correct the bit decision. Here, the tracing back options are
1101 and 0111 sequence. LSB+1 bit is B-1 bit. After comparing B-1 bit with LSB+1 bit, the
output will be 0111, which is the correct output. If we go back to what we had at the
0000
0001
0010
0011
0100
0101
0110
0111
1000
1001
1010
1011
1100
1101
1110
1111
0000
0001
0010
0011
0100
0101
0110
0111
1000
1001
1010
1011
1100
1101
1110
1111
Sequence DFE
(B+1 and B+2
Feedback)
In this case,
B+1 = 1
B+2 = 1
0000
0001
0010
0011
0100
0101
0110
0111
1000
1001
1010
1011
1100
1101
1110
1111
In this case,
CFIX1 = 1
CFIX0 = 1
Unlikely
CCHK = 0
t4t3
Sample
w/o
Noise
Sample
with +ve
Noise
Correct
Comp.
Position
Comp.
Position
with noise
Next Bit,
Definite 1
Probable
SequencesCCHK
Trace-back
ChoicesNext Bit
Trace-back
Output
1111 0
(implies it may
miss top of 01
bank)
1 01111101 1 1 0 1
0111 0 1 1 1
DFE
Output
68
beginning of case I, DFE output was already 1101. We added the check comparator and it
said the data might be from the missing bank. As the next bit was a definite 1, we checked
back to see if we decoded the correct sequence (0111) and found out we had the wrong
sequence (1101). We corrected the error using B-1 bit.
When the floating data comparators are on the top position (11 and 10 bank), the check
comparator checks the probability of a missing sequence from the nearest bank (01 bank).
Table 3.1 shows all the possible cases of missing the probable banks and the corresponding
comparator placement. When the signal is supposed to be in the middle (10 and 01 bank),
there are two cases: (i) missing the bottom of the 11 bank (1100 sequence) and (ii) missing
the top of the 00 bank (0011 sequence). 1100 has B+1=1 and B+2=0. The closest in-bank
sequence having the same B+1 and B+2 is 0110. So, the check comparator reference will be
set in between 1100 and 0110. The case of missing the top of 00 bank is similar to the case
of missing the top of 01 bank. So, for the mid position we need two comparators to check
whether there is any probable missing bank. However, for the top and the bottom position,
we need only one comparator.
Table 3.1: Reference placement for checking probable bank miss.
PositionAvailable
Banks
Most
Probable
Bank
Missed
Most
Probable
Sequence
outside Banks
B+1 of the
Most
Probable
Sequence
B+2 of the
Most
Probable
Sequence
Closest Ref.
in bank
having same
B+1 & B+2
Check
Comp
Placement
Top11
1001 0111 1 1 1101
1101
0111
Mid10
01
11 1100 1 0 01101100
0110
00 0011 0 1 10011001
0011
Bottom01
0010 1000 0 0 0010
1000
0010
69
The references for the check comparator can also be multiplexed using the fixed data
comparators. However, when the signal is on the top or on the bottom, one of the check
comparators will not be clocked and that one will receive common mode voltage (VCM) as
reference. Figure 3.6 illustrates placement of all the comparator references; for the check
comparators, it also describes why it is being used.
Figure 3.6: Overall reference placement of the architecture with trace-back.
3.2.3. Case II
The check comparators also resolve the issue of Section 3.1.2. At t=t3, the check
comparators check whether the system has missed the bottom of the 11 bank and the top
of the 00 bank. As the DFE output was 0101 in case II, the check comparator that checks
whether it missed the 11 bank becomes the one to be considered for trace-back. As it says
0, there is no chance of data being on top. Another probable sequence is the 0111 in-bank
0000
0001
0010
0011
0100
0101
0110
0111
1000
1001
1010
1011
1100
1101
1110
1111
11
11
10
01
00
01
11
10
01
00
10
11
10
01
00
00
11
10
01
00
CFIX1
CFIX0
1
0
1
0
CFIX<1:0> = 01
CFIX<1:0> = 11 CFIX<1:0> = 00
Checks if it missed
bottom of 11 bank
Checks if it missed
top of 00 bank
Checks if it missed
top of 01 bankChecks if it missed
bottom of 10 bank
70
option. As the next bit is a definite 1, it comes for trace-back and fixes the sequence as
shown in Figure 3.7.
Figure 3.7: Resolving the issue of case II.
3.2.4. Trace-back Sequence Generation
There are two types of sequence generated for trace-back in this system: (i) outside bank
trace-back sequence (case I) and (ii) within bank trace-back sequence (case II). Table 3.2
lists the logic showing which trace-back option will be considered for the different outputs
of the check comparators.
time
t4t3t2
0000
0001
0010
0011
0100
0101
0110
0111
1000
1001
1010
1011
1100
1101
1110
1111
Sample
w/o
Noise
Sample
with
huge -ve
Noise
Checks if it missed
bottom of 11 bank
Checks if it missed
top of 00 bank
CCHK
Trace-back
ChoicesNext Bit
Trace-back
Output
0
(implies it is
not missing
bottom of 11
bank)
0 1 0 1
1 0111
0 1 1 1
71
Table 3.2: Trace-back choice logic.
Bank Check Comp Output Trace Back Choice
11 0 Outside
1 Within
10 0 Outside
1 Within
01 0 Within
1 Outside
00 0 Within
1 Outside
Now, for outside bank trace-back, there may be an error in B0 and B-1. So, the sequence
choices given for the trace-back are:
DFE selected one (B0B1B-1B2)
The other option having B0 B-1 flipped.
Within bank trace-back implies there may be an error in B-1 only. So, the sequence choices
given for the trace-back are:
DFE selected one (B0B1B-1B2)
The other option having B-1 flipped.
3.2.5. Conditions of Data Trace-back
For the data trace-back discussed so far, we have always mentioned the next bit to be
definite 1/0. When both the fixed data comparators detect the received signal to be on the
top or on the bottom position and the floating top and bottom check comparators verify the
sample location, the data is considered to be a strong 1/0. The logic for detecting strong
72
1/0 is listed in Table 3.3. If the next bit is not strong 1/0, the incorrect DFE feedback will
propagate. The strong 1/0 logic ensures the next bit is not dependent on the DFE feedback
for its main cursor decision. If the next bit is a strong 1/0, it can be used for error detection
by the DFE.
Table 3.3: Detecting Strong 1/0.
Fixed Data Comparators
Floating Top/Bottom Check
Comparators Bit
CFIX1 CFIX0 CCHK_TOP CCHK_BOT
1 1
1
x
Strong 1
0 Not strong 1
0 0 x
1 Not strong 0
0 Strong 0
We cannot do trace-back when top-bottom-mid verification has already resolved the fixed
data comparators errors.
3.2.6. Improved Noise Margin
The data trace-back adds two additional comparators to increase the noise margin of the
system. The required number of comparators increases by 2 only. We can modify Eq. 2.3
to get the required number of comparators:
2Factor Prediction
222 1
back-Trace and DFE Sequence LM
MC . 3.1
73
The noise margin of the modified architecture of the sequence DFE can be defined as half
the distance between two banks having similar B+1 value as discussed in Section 2.4. For a
4 tap sequence DFE having one precursor (h-1), main cursor (h0) and two post-cursors (h+1
& h+2), the noise margin can be written as
2
210DFE Sequence
hhhNM . 3.2
The noise margin of the data trace-back increases as we consider similar values of B+2 as
well. However, the additional noise margin will be there if and only if the next bit is a
strong 1/0.
Figure 3.8 shows the comparison of noise margins for these three cases. As the number of
bits of the ADC increases, the noise margin also increases. However, the noise margin
achieved by the 4 tap sequence DFE with data trace-back is always higher than the ADC-
based DFE. In this figure, h0=0.26 mV, h+1=0.16 mV, h-1=0.12 mV and h+2=0.08 mV.
Figure 3.8: Comparison of noise margins of 4 tap sequence DFE with and
without data trace-back with ADC-based DFE.
4 5 6 7
30
40
50
60
70
Voltage Margin Vs. Resolution
No. of Bits
Volt
age M
argin
(m
V)
ADC Based DFE
Sequence DFE with Trace-back
Sequence DFE without Trace-back
4 tap Sequence DFE
noise margin
74
3.3. System Design
The overall implementation of the quadrate system is shown in Figure 3.9. The receiver
uses similar passive equalizer at the front end as used in Chapter 2. The system requires
two fixed data comparators, four floating data comparators and two check comparators –
overall eight comparators. The output of the fixed comparators places the references of the
floating and check comparators. For the check comparators, when in the top/bottom, only
one of them is clocked. If the comparator is not clocked, it gets common mode voltage as
reference input. The system uses an external clock as it does not have any built-in timing
recovery in place. The receiver works with 16 Gb/s input data with each time-interleaved
path running at 4 GHz.
Figure 3.9: System architecture of quad-rate receiver with data trace-back.
Φ0
(COARSE S/H)
Reference
Generator
Sequence
GeneratorB0B1B-1B2
Φ0
(FINE S/H)
Sequence
DFE
Passive
Equalizer
Trace
BackTop Bottom
Check
One hot code
from Fix
Comp Op (3b)
CH0
CH90
CH180
CH270
75
3.3.1. Reference Muxing Timing
As the system is running at a higher speed than the receiver described in Chapter 2, the
comparators are resized to get similar timing margin for reference muxing. As we are not
using the prediction by the edge comparators, coarse (SH for fixed data comparators) and
fine (SH for floating comparators) SHs sample the data at P0, i.e. the pulse rising at Φ0.
The coarse comparators are clocked instantly using a delayed version of Φ0. Now, we have
time up to the hold time of P0, which is less than 3 UI. The fine and top/bottom check
comparators are clocked 2 UI later using Φ180. The outputs of the fixed comparators take
around 100 ps to update the references of the floating comparators (Figure 3.10). It still
gives 25 ps timing margin to clock the floating comparators to achieve 16 Gb/s operation.
The check comparators are clocked using the same scheme.
Figure 3.10: Reference Muxing for quadrate channel running at 4 GHz.
5.8 5.9 6 6.1 6.2 6.3 6.4 6.5 6.6-0.3
-0.15
0
0.15
0.3
Time (ns)
Am
pli
tud
e (
V)
Input
Data Sample
Coarse Comp. OP & Ref. update ≈100 ps
P0 Φ0 Φ180
Margin ≈ 25 ps
Top Check Ref.
Bot Check Ref.
Fine Ref.
76
3.3.2. DFE and Trace-back Feedback
The DFE feedback works as described in Section 2.6.6. Loop unrolled architecture is
carried over from the previous architecture described in Figure 2.33. The new addition here
is a sequence generator after the DFE output. The sequence generator generates a choice
for trace-back as described in Section 3.2.4. The DFE output and generated sequence have
different values of B-1. Now, B-1 feedback comes only if the next channel has strong 1/0
and the current channel does not have strong 1/0. After this comparison, we get trace-back
output.
Figure 3.11: Feedback for DFE and Trace-back.
B+2 Comparison
Four 4b
Sequence
DFE
Two 4b
Sequence
B+1Comparison
4b
B-1Comparison after
strong 1/0
Seq_Gen
for TBTB_OP
4bDFE_OP
TB
B-1 B0 B+1 B+2CH0
CH90
CH180
CH270
B-1 B0 B+1 B+2
B-1 B0 B+1 B+2
B-1 B0 B+1 B+2
77
3.4. Experimental Results
The prototype of the architecture implemented in TSMC 65nm is shown in Figure 3.12.
The design takes an area of 300 µm X 560 µm with each channel taking 300 µm X 140 µm
area. This prototype works with 16 Gb/s input data. The pin diagram and test setup of the
chip with all the required instruments are shown in Figure 3.13. Different amount of
channel loss is realized by having different lengths of coaxial cable as discussed in Chapter
2. Each time with different channel loss, the single bit response of the channel is observed
and the dominant tap values are measured. These tap values are used to set the reference
values for the comparators. Tunable current sources of the reference generator of the chip
allow us to do the tuning of references for different channels. The outputs of the sequence
DFE with and without data trace-back is taken separately. Figure 3.14 shows the channel
response for which the receiver was tested. Figure 3.15 shows the input after passive
equalization and the channel output side-by-side. The 16 Gb/s input after passive equalizer
has 4 dominant taps. The generated clock and PRBS checked recovered data eye is given
in Figure 3.16. The BER curves in Figure 3.17 (BER bathtub curve) and Figure 3.18 (2D
color-map of BER showing voltage margin in Y-axis and timing margin in X-axis) show
improvement in BER after the data trace-back. For 16 Gb/s operation, the receiver achieves
a BER of 10-10 when DFE works alone. However, with added trace-back BER of the
receiver improves to 10-12. The BER plots of Figure 3.18 show improvement in voltage
margin of the modified architecture. The design consumes 71.8 mW power with DFE and
data trace-back in action. However, in low loss case, the data trace-back can be turned off
to save power. Without the trace-back, it consumes only 55 mW power. Table 3.4
summarizes overall receiver performance. At higher loss cases (~35 dB), both the trace-
78
back and DFE is applied; whereas in lower loss cases, the system can achieve desirable
performance using DFE only while consuming less power. For 16 Gb/s operation, when
both the trace-back and DFE are active, the receiver consumes 71.8 mW at an efficiency
of 4.4875 pJ/bit. The figure of merit (FoM) during 35 dB operation turns out to be 0.1282
pJ/bit/dB, which is better than the FoM of 0.1375 pJ/bit/dB during DFE-only operation as
it compensates less loss.
Figure 3.12: Implemented prototype in 65nm.
CH90 CH180
Clo
ckin
g
560 µm
30
0 µ
m
CH0 CH180
Clo
ckin
g
RE
F G
EN
AF
E RE
F L
ines
DF
ET
B
PG
EN
CL
K B
uff
er
79
Figure 3.13: Pin diagram and test setup of the prototype.
Figure 3.14: Channel Response at 8 GHz.
OUTP_FINAL
OUTN_FINAL
OUTP_DFE
OUTN_DFE
OU
TP
_F
INA
L_B
0
OU
TN
_F
INA
L_B
0
OU
TP
_D
FE
_B
0
OU
TN
_D
FE
_B
0
CL
K_O
UT
CV
DD
PR
BS
_E
RR
OR
RE
SE
T
DVDD
IREF_H1
IREF_H2
IREF_H0
AVDD
VSS
VS
S
AVDD
VSSDVDD
IREF_H_1
IREF_OUT
VDD_OUT
SR_INSR_CLKSR_OUT
INN
VC
MIN
P
OU
TP
_E
Q
OU
TN
_E
Q
DV
DD
CL
K_IN
P
CL
K_IN
N
Tektronix
AWG
70002A
Tektronix
DPO 71604C
Oscilloscope
Signal
generator
SMB 100A
Agilent
86100D
Infiniium
DCA-X
Oscilloscope
Used for
eye
diagrams
Clock
Generator
Input Data Pattern
Transient Single
bit response is
used for
characterizing
the channel
Channel
Control Bit
Load using
Arduino
Used for
tuning
references
On-Chip
PRBS
Checker
Output
CHIP ID:
ICSAALSE
35 dB Loss at 8 GHz
Ch
an
nel
Res
po
nse
Frequency (GHz)
80
Figure 3.15: Measured 16 Gb/s input eye after passive equalizer (a) and 16-level output
eye of the sequence decoder (b).
Figure 3.16: Measured 4 GHz clock eye (a) and PRBS checked recovered 4 Gb/s data
eye (b).
Figure 3.17: BER bathtub comparing DFE and Trace-back.
(a) 16 Gb/s input eye after passive equalizer (b) 16-level output eye of the sequence decoder
10 ps/div 50 ps/div
(a) 4 GHz Clock Eye (b) Recovered 4 Gb/s Data Eye
40 ps/div 50 ps/div
-0.5 -0.25 0 0.25 0.5-12
-9
-6
-3
0
Clock Phase(UI)
BE
R
DFE
Trace-Back
81
Figure 3.18: 2D color-map of BER showing voltage margin in Y-axis and timing
margin in X-axis of sequence DFE (a) and data trace-back (b).
Table 3.4: Receiver Summary.
Shafik
ISSCC’15 [11]
Hossain
VLSI’16 [12] This Work
Technology TSMC 65 nm TSMC 65 nm TSMC 65 nm
Supply Voltage (V) 1 1.2 1.2
Data Rate (Gb/s) 10 10 16 Gb/s
Active Die Area (mm2) 0.81 0.23 0.17
Timing Margin (UI) 0.2 0.25 0.25
BER 10-10 10-12 10-12 (DFE and TB)
10-10 (DFE only)
Loss Compensation (dB) 25.3 27 35
Power Consumption
(mW) 87 35
71.8 (DFE and TB)
55 (DFE only)
Power Efficiency (pJ/bit) 8.7 3.5 4.4875 (DFE and TB)
3.4375 (DFE only)
FoM (pJ/bit/dB) 0.343 0.13 0.1282 (DFE and TB)
0.1375 (DFE only)
-0.5 -0.25 0 0.25 0.5-0.1
-0.05
0
0.05
0.1
Clock Phase (UI)
Vo
lt(V
)
BER of DFE
-12
-10
-8
-6
-4
-0.5 -0.25 0 0.25 0.5-0.1
-0.05
0
0.05
0.1
Clock Phase (UI)
Vo
lt(V
)
BER of Trace-back
-12
-10
-8
-6
-4
10-6
10-10
10-12
10-4
10-8
(a) (b)
82
Chapter 4.
Burst Mode Optical Receiver with
10ns Lock Time Based on Concurrent
DC Offset and Timing Recovery
Technique
This chapter describes a low power low latency 7-10 Gb/s burst mode DC-coupled receiver
for photonic switch networks. The chapter begins with a discussion of the conventional
optical receivers in section 4.1. Section 4.2 covers the state-of-art DC and timing recovery
methods applied in burst-mode optical receivers and their performance limiting factors.
Section 4.3 discusses overall architecture of the proposed receiver. Analog front-end
comprising of a trans-impedance amplifier (TIA) followed by three stage of amplifiers, DC
recovery loop, timing recovery loop and timing skew correction loop are also discussed in
details in section 4.3. Section 4.4 covers the implementation of the system in IBM 0.13μm
and the measured experimental results. Section 4.5 compares the proposed burst mode
optical receiver with the current state-of-art optical receivers. The implemented trans-
impedance amplifier (TIA) is also compared with the state-of-art TIAs.
83
4.1. Conventional Optical Receiver
The continuous growth of cloud computing, “big data” and social media applications is
placing enormous bandwidth demand on data communication networks. The
interconnection between servers, which has traditionally relied on electrical cables, is
becoming a major bottleneck in today’s data networks, as we approach the limits of copper
wires in terms of speed, loss and power consumption. To meet this demand, optical
interconnects are already finding their place in data centers to connect racks which are only
a few meters apart. Optics can provide large transmission bandwidth, which reduces
latency and fast processing speed. Extending these benefits, cross-point network switches
can also be optical to provide significant increase in cross-sectional bandwidth. In addition,
rapidly reconfigurable optical switching network can significantly improve latency and
throughput [25].
There are electrical transceivers to interface these optical switches. However, unlike
existing SerDes, these optical transceivers need to accommodate different optical power
levels when the cross point is reconfigured. Interestingly, existing passive optical network
(PON) provides such functionalities. Unfortunately, to fit in the data center solution space,
existing PON solutions need to improve significantly: First, in existing PON receivers TIA
and LA are located on one IC and clock and data recovery (CDR) is located on a different
IC ([26], Figure 4.1). It is desirable to integrate them into a single complete receiver and
then to integrate multiple receivers in a single IC. Second, the power efficiency of burst
mode PON transceivers is in the order of 50 to 100 pJ/bit. This efficiency has to improve
to better than 10 pJ/bit considering the cooling cost of data centers.
84
Figure 4.1: Conventional vs. proposed implementation of the optical receiver.
Currently, most of these links are VCSEL based, and their performance is excellent but
lacks in integration density and bandwidth. To meet next generation’s bandwidth demand,
VCSEL has to improve its bandwidth and at the same time, we have to look for alternative
solutions such as silicon-photonic modulators that have higher integration density. Optical
interconnects based on silicon integrated photonic waveguides and devices aim to replace
metal wires with optical dielectric waveguides on the same CMOS chip. They have been
considered as a promising technology that can meet the projected bandwidth demand, while
offering the advantage of being compatible with CMOS electronics.
Energy efficiency of the link has become as critical as its bandwidth. From an application
point of view, there is growing demand for energy efficiency improvement over the entire
traffic of data including idle mode, not just at the peak of data rate. Therefore, transceivers
have to adapt to this bursty nature of data traffic and thereby achieve both the highest
bandwidth and excellent efficiency. When used in such power-down mode, it is required
TX1
TX2
TXN
RX1
RX2
RXN
Burst #1 Burst #2
TIA+LA CDR
IC1 IC2
Existing Solution
Proposed Solution in a single IC
TIA + AMPs + CDR
TIA + AMPs + CDR
TIA + AMPs + CDR
85
that the transceiver would be able to wake-up, receive and transmit data without significant
preamble to keep high burst efficiency. Transceivers require nanosecond scale burst to take
advantage of the dynamically reconfigurable high-speed optical switches based on Silicon
Photonics. After each switching event, the receiver has to support the data stream that may
come from a different transmitter having different modulator with different extinction ratio
and different switching loss. In addition, the receiver has to support un-encoded data stream
where the data pattern desired to be received can have long strings of consecutive identical
data (CID) bits, i.e. long strings of 1’s and 0’s. In an AC coupled link, a long string of 1’s
and 0’s may cause baseline wandering issue. The length of the non-transition period of the
incoming optical signal has to be small compared to the time constant determined by the
capacitance used for coupling between optical and electrical interface to avoid wander
issue. Encoding can avoid DC balancing problem, but it adds latency and degrades the
overall efficiency of the transmission. DC coupled interfaces can easily overcome these
issues. However, the receiver has to compensate for the DC offset created due to switching
event during its fast lock time in this burst mode format.
4.2. Challenges in Burst-Mode Receiver
In conventional implementations ([26], Figure 4.1), TIA and CDR are designed in two
different ICs interfaced with high speed interconnect that mandates sufficient signal swing
to meet the signal integrity requirement. Usually, in a 50-ohm environment, meeting the
swing specification leads to significant power consumption in the limiting amplifier (LA).
86
Figure 4.2: Conventional implementation of burst mode receiver with DC offset
calibration.
In addition, multiple chip implementations have cost penalty. Therefore, in state-of-art DC-
coupled burst mode optical receiver [2], TIA and CDR are integrated on the same chip.
However, lock time still remains an issue that can be broken into two components – DC
offset correction and timing recovery. DC offset correction loop can be based on either
feedback ([2], Figure 4.2(a)) or feedforward ([27], Figure 4.2(b)) concept – in both cases,
a low pass filter senses the DC offset and subtracts that at the input of the receiver.
Feedback can be implemented in analog domain using V-to-I [28] or in digital using DAC
[2]. However, in both cases, the pole frequency of the feedback path (ωp) sets the tradeoff
between lock time and bandwidth. For feedback system, if AF with pole frequency ωf and
ADC are the transfer functions of the forward path and DC recovery path respectively, then
the system transfer function (AFB) can be written as shown in Eq. 4.1. The step response of
the feedback system gives us the lock time and the high pass pole frequency fc defines the
usable bandwidth (Eq. 4.2) [29].
TIA
LPF
(a) DC Recovery using feedback path
To CDR
DC
Calibration
Logic
AF = Forward path TF
ADC = DC recovery TF
TIA
LPF
(b) DC Recovery using feedforward path
To CDR
AF = Forward path TF
ADC = DC recovery TF
87
DCF
FFB
AA
AA
1
where, f
Fs
AA
1,
p
DCs
A
1
1.
4.1
FF
cCR
Af
2
12
where, LPF. theofconstant TimeFFCR
4.2
In general, if the pole (ωp) in the feedback path is at a relatively higher frequency, feedback
loop settles faster, but that also reduces the usable effective bandwidth (Figure 4.3). This
problem can be avoided by shifting the feedback pole at a lower frequency, but that also
increases the lock time. A good compromise can be to place the pole at 1/10th of the
preamble frequency and with additional digital post-processing; this way, it is possible to
reduce lock time to less than 50 cycles of the digital clock [2].
Figure 4.3: Settling time vs. bandwidth of DC recovery loop.
(c) Lock Time and BW Penalty vs. Frequency
0.001 0.01 0.1 1 100
2.5
5
0.001 0.01 0.1 1 100
0.5
1
1.5
0.001 0.01 0.1 1 100
0.5
1
Frequency (MHz)
Lock
Tim
e (m
s)
BW
Pen
alt
y (
GH
z)Feedback lock time
Feedforward lock time
Feedback BW Penalty
1
1.5 5
0.5
2.5
88
For feedforward DC recovery concept, the system transfer function is
pf
FFss
A
1
1
1
1.
4.3
Compared to feedback solution, DC settles faster in this concept and bandwidth penalty is
set by ωp. As a result, the fundamental trade-off between usable bandwidth and settling
time is still significant. In both feedback and feedforward cases of DC recovery, if the
recovered DC information during preamble period is not stored using memory, the system
faces baseline wandering during CID pattern.
Similarly, feedback based timing recovery loops (Figure 4.4(a)) also suffer from longer
lock time issues in burst mode applications. Fortunately, open loop approaches such as
injection lock based timing recovery loops have reduced the lock time to few cycles ([30],
Figure 4.4(b)). However, open loop timing recovery loops suffer from several limitations;
first, VCO drifts away in the absence of injection pulses which leads to poor tolerance to
CID and suboptimal lock position. Second, unlike traditional SerDes, where half-rate or
quarter rate solutions are often used to reduce power, injection locking is usually done at
full rate – therefore, receiver power is usually high. Although injection locking can achieve
Figure 4.4: Conventional timing recovery loop.
Variable
Delay D-FF
Non-
linearity
Pulse
Gen
(a) (b)
BBPD
CP
+
LF
89
fast lock time, receiver initialization time is still gated by DC offset correction as shown in
Figure 4.5. At the beginning, when DC offset is still uncorrected, outputs of the limiting
amplifier also have non 50% duty cycle. That means generated injection pulses are spaced
such that the generated frequency is at an offset from the correct one. Therefore, when
injected with these pulses, VCO does not lock at the correct phase or frequency. Only after
the DC offset is corrected, generated pulses are properly spaced to lock the VCO. Although
existing works have shown very fast burst mode clock recovery [31], without addressing
the DC offset corrections, receiver lock time does not improve.
Figure 4.5: Effect of DC on duty cycle of the data.
TIA
Clock
Recovery
Offset
Correction
Offset
corrected
Offset
TIA Output
T=UI
LA Output
T<UIT>UI
Injection
Pulses
Oscillator
Output
DC Recovery Timing
Recovery
90
This work introduces a burst mode quad-rate 7-10 Gb/s receiver architecture for
dynamically reconfigurable photonic switch network. An entire receiver including TIA,
three stage amplifiers, and CDR are implemented on the same IC; in fact, the inductor-less
receiver fits within 465 µm X 265 µm area. Use of limiting amplifier leads to a power and
area consuming design; therefore, in this design, we eliminate the limiting amplifier to
reduce power. Unlike traditional DC offset correction that relies on sensing DC at the
output, we propose a novel DC offset correction that can directly measure DC offset from
alternating data. Since there is no LPF needed, this DC offset correction loop can settle
significantly faster compared to the conventional approach. To achieve fast wake up, DC
offset correction and timing recovery loops run concurrently, and that reduces the lock time
to less than 10ns. To achieve a robust solution, timing skew adaptation algorithm is used
to get optimum timing margin for the receiver.
4.3. Proposed Burst Mode Receiver
The overall block diagram of the implemented receiver is shown in Figure 4.6. The system
consists of an analog front-end (TIA followed by three stage differential amplifiers), a DC
recovery loop, an injection lock based quad-rate CDR and a timing skew adaptation loop.
91
Figure 4.6: Block Diagram of the receiver with power breakdown.
4.3.1. Trans-impedance Amplifier (TIA)
TIA receives the current from the photodiode and the single-ended output voltage goes to
one of the inputs of differential amplifier chain whereas the other input is recovered DC.
A CMOS common gate amplifier is used as TIA followed by a common source amplifier
with source degeneration (Figure 4.7). So, trans-impedance gain of the common gate stage
is
1
1
1
1 111
11
Gain, impedance-Trans
21
2 R
sCgg
sCgA
mm
m
F
. 4.4
Voltage gain of the common source stage with source degeneration impedance is
)
1(1
Gain, Voltage
2
4
3
2
3
3
sCRg
RgA
m
m
F
.
4.5
So, overall gain of the forward path is
TIA
Sample
&
Hold
S(n) + S(n-T)
Data Comparators +
Timing Skew
Adaptation Logic
4.3 mW
3.6 mW 1.3 mW4.4 mW
2.3 mW
1.8 mW
3.6 mW
7.8 mW
4.2 mW 0.9 mW
S(n-3T)
S(n-2T)
S(n)
S(n-T)
CML to
CMOS
Pulse
Gen
Quarter
Rate
Oscillator
Phase
Rotator
3.6 mW + 3.8mW
5 bit
DAC
SAR
Logic
--- Timing recovery loop
--- DC recovery loop
--- Onetime initialization power
--- Runtime power
4Φ4Φ
Frequency
(f) code
Φ code
92
Figure 4.7: Detailed schematic of TIA and its DC recovery loop.
)1
(1111
11
2
4
3
1
1
1
3
3
21
2
sCRg
RgR
sCgg
sCgA
m
m
mm
m
F
. 4.6
The forward path through M2 serves two purposes: first, the SAR DAC creates the DC
output (OUTN) that corrects the data dependent DC offset. Second, the signal path through
AC coupling capacitor C1 goes through the high pass transfer function,
2
1
111
1
21
1 R
sCgg
gA
mm
m
HP
. 4.7
At the input of the differential amplifier, these two paths appear such that the DC
components of the signals cancel each other but the high-frequency parts of the signals
To
CDR
Boost Path, AHP
I I/2 I/4 I/8 I/16
SAR
C8
R1 R2 R3
R4
C1
M1 M2
M3Vb
Φn-T
S(n-T)S(n) + S(n-T)
OUTP
OUTN
S(n)
Φn
C2
Forward Path, AF
93
combine constructively. By boosting the high-frequency signals, the front-end bandwidth
is extended to 11 GHz without using any inductor.
Figure 4.8 shows simulated gain of the analog front end with and without high pass path.
Figure 4.8: TIA transfer function and its simulated and measured gain.
Figure 4.9: Measured 10 Gb/s output eye of the front end.
0.1 1 10 1248
50
52
54
56
58
60
62
64
66
68
Frequency (GHz)
Gain
(d
B)
Gain W/ Cap Gain W/O Cap Measured Gain
Forward Path
High Pass Path
94
Figure 4.10: Simulated amplifier stage output eye with (left) and without (right) C1 &
C2 for input currents of 100 µA, 250 µA and 600 µA (from top to bottom).
-50 0 50
-0.04
0
0.04
Time (ps)
Am
pli
tud
e (V
)
90
mV
(a) Input current 100 µA with Cap C1 and C2
-50 0 50
-0.04
0
0.04
Time (ps)
Am
pli
tud
e (V
)
(b) Input current 100 µA w/o Cap C1 and C2
75
mV
-50 0 50-0.15
-0.075
0
0.075
0.15
Time (ps)
Am
pli
tud
e (
V)
250
mV
(c) Input current 250 µA with Cap C1 and C2
-50 0 50-0.15
-0.075
0
0.075
0.15
Time (ps)
Am
pli
tud
e (V
)
(d) Input current 250 µA w/o Cap C1 and C2
180
mV
-50 0 50
-0.3
-0.15
0
0.15
0.3
Time (ps)
Am
pli
tud
e (
V)
600
mV
(c) Input current 600 µA with Cap C1 and C2
-50 0 50
-0.3
-0.15
0
0.15
0.3
Time (ps)
Am
pli
tud
e (V
)
(f) Input current 600 µA w/o Cap C1 and C2
420
mV
95
Benefit of the high pass frequency response is visible in the simulated eye diagram shown
in Figure 4.10 for different input currents. Measured TIA output eye is consistent with the
simulated results and indicates the extended bandwidth benefit of the front-end.
TIA has to sense small (~60 μA) current from the photodiode and convert it to voltage. If
the noise component from this TIA front end is high, it will propagate through the whole
front end which may result in incorrect operation of data and clock recovery. One of the
design considerations of this TIA is to have the noise component as low as possible while
getting the most amount of gain and boost out of it. The trade-off is that if we try to increase
the boost in the DC recovery path, it produces higher noise. So, a nominal point was chosen
where we can get a high boost and low noise. The noise sources in this TIA are all the
transistors and resistors handling 7~10 Gb/s high-speed data. Noise from these sources will
appear at both OUTP and OUTN of the TIA as shown in Figure 4.7. However, these two
paths have different gains. The two path gains are derived in Eq. 4.6 and 4.7. For noise
calculation, first, we will show all the noise accumulated at OUTP and OUTN. Then, this
noise at OUTP and OUTN will be referred to input by dividing them using their respective
path gain.
Noise current from any transistor is given by [24]
mMn gKTI 4 2
, . 4.8
Noise generated by MN1 sees two impedances in parallel; one from MN1 and the other from
C1 and MN2. The part of noise current that passes through MN1 appears at OUTP with gains
96
from the TIA common gate stage and the common source with source degeneration stage
(ACS).
22
1
2
1
12
,,N1 111
11
4 ,M todue OUTPat Noise
21
2
11 CS
mm
m
mMOUTPn AR
sCgg
sCggKTV
N
. 4.9
The part of noise current passing through DC path will appear at OUTN node.
2
2
2
1
2
,,N1 111
1
4 ,M todue OUTNat Noise
21
1
11R
sCgg
ggKTV
mm
m
mMOUTNn N
. 4.10
Noise generated by MN2 sees two impedances in parallel; one from MN2 and the other from
C1 and MN1. The noise appearing at OUTP will pass through C1 and MN1. For OUTN, this
noise will come through MN2 and will see the gain of its path.
22
1
2
1
2
,,N2 111
1
4 ,M todue OUTPat Noise
21
2
22 CS
mm
m
mMOUTPn AR
sCgg
ggKTV
N
. 4.11
2
2
2
1
12
,,N2 111
11
4 ,M todue OUTNat Noise
21
1
22R
sCgg
sCggKTV
mm
m
mMOUTNn N
. 4.12
The noise due to MN3 appears at OUTP only. It sees the gain from the common source with
degeneration stage.
22
3
2
,,N3 334 ,M todue OUTPat Noise CSmMOUTPn ARgKTV
N . 4.13
The noise voltage and current from any resistor, R, are given by [24]
97
KTRV Rn 4 2
, 4.14
R
KTI Rn
4 2
, . 4.15
Now, the noise from R1 of TIA appears both at OUTP and OUTN. The noise voltage
generated from R1 sees the gain from the common source with degeneration stage and
shows up at OUTP.
2
1
2
,,1 4 ,R todue OUTPat Noise1 CSROUTPn AKTRV . 4.16
For noise from R1 to appear at OUTN, it has to go through the impedances introduced by
MN1, C1, and MN2. The noise current from R1 will generate a voltage across MN2 which will
see a gain from MN2 and appear at OUTN node.
2
2
2
2
2
1
1
1
2
,,1 111
11
4 ,R todue OUTNat Noise
21
1
1Rg
sCgg
sCg
R
KTV m
mm
m
ROUTNn
. 4.17
Noise from R2 only appears across OUTN.
2
2
,,2 4 ,R todue OUTNat Noise2
KTRV ROUTNn . 4.18
Noise from R3 and R4 only appear in OUTP node.
3
2
,,3 4 ,R todue OUTPat Noise3
KTRV ROUTPn . 4.19
2
3
2
2
4
2
4
4
2
,,4
3
4 11
1
4 ,R todue OUTPat Noise R
gsCR
sCR
R
KTV
m
ROUTPn
. 4.20
98
All the noise from different sources appearing at OUTP and OUTN will be summed up to
get total noise at respective nodes.
.411
1
4
111
1
4
44111
11
4 OUTP,at Noise
22
3
2
3
2
2
4
2
4
4
22
1
2
1
3
2
1
22
1
2
1
12
,
3
321
2
2
21
2
1
CSm
m
CS
mm
m
m
CSCS
mm
m
mOUTPn
ARgKTR
gsCR
sCR
R
KTAR
sCgg
ggKT
KTRAKTRAR
sCgg
sCggKTV
4.21
.111
11
4
111
11
4
4111
1
4 OUTN,at Noise
2
2
2
2
2
1
1
1
2
2
2
1
1
2
2
2
2
1
2
,
21
1
21
1
2
21
1
1
Rg
sCgg
sCg
R
KTR
sCgg
sCggKT
KTRR
sCgg
ggKTV
m
mm
m
mm
m
m
mm
m
mOUTNn
4.22
To get the input referred noise we need to divide noise at OUTP and OUTN with their path
gains, which we get from Eq. 4.6 and 4.7.
2
2
,
2
2
,2
, Noise, ReferredInput HP
OUTNn
F
OUTPn
INnA
V
A
VI . 4.23
Noise estimated by the above equation correlates well with the simulated noise over the
frequency band as shown in Figure 4.11. Consuming only 4.3 mW, the TIA achieves a
sensitivity of -13.98dBm assuming photodetector responsivity of 0.5A/W at BER of 10-12.
99
Figure 4.11: Simulated and calculated input referred noise of TIA.
4.3.2. DC Recovery
DC recovery loop works during the preamble period (101010…. pattern). The differential
voltage from amplifier chain is sampled using the sample and hold (SH) circuit. Two
consecutive samples, S(n) and S(n-T) from two nearby time-interleaved paths, are
compared to get S(n)+S(n-T), which is then fed to SAR logic block (Figure 4.6). The SAR
block is clocked using 1/8th of data rate clock (C8 clock) of the receiver to have the DC
well settled before next decision cycle. The overall operation takes six cycles to complete.
The 1st cycle is used to reset the SAR logic. The other five cycles are used to update the
DAC. Figure 4.12 illustrates the implemented DC recovery technique. In Figure 4.12,
AMP_OUTP and AMP_OUTN are the outputs of the amplifier stage. At the beginning of
each burst, in worst case scenario the DC offset may even fully saturate the amplifier
0.1 0.2 0.4 0.6 0.8 1 2 4 8 10 12.525
30
35
40
45
50
Frequency (GHz)
Input-
refe
rred
curr
ent
nois
e(
pA
/p
Hz
)
Simulated
Calculated
Frequency (GHz)
)H
z(p
A/
2
INn
,I
Nois
e
Refe
rred
Inp
ut
,
100
Figure 4.12: Proposed DC recovery technique.
output. In such scenario, both sampled values appear to be –ve. Therefore, DC offset
continues to reduce until output starts to toggle. During the preamble period, the signal
goes through a transition in each cycle. Therefore, the sampler that is sampling ‘1’ will
have +ve value and the other one that is sampling ‘0’ should have –ve value. Since we are
not using any limiting amplifier, their amplitude will not saturate. When these two samples
are added, their summation indicates the DC offset in the input signal. To design a binary
search algorithm, we only consider the polarity of S(n)+S(n-T). Each time SAR logic
updates the DAC depending on this output. As the algorithm progresses, their amplitudes
will vary until the SAR DAC output DC voltage matches the incoming data dependent
offset. As the DC recovery information is stored digitally during the preamble period, the
system does not face any baseline wandering issue during CID pattern. Figure 4.13 shows
S(n) -ve +ve +ve +ve +ve
S(n-1) -ve -ve -ve -ve -ve
|S(n)| > |S(n-1)| x Yes No Yes No
S(n)+S(n-1) -ve +ve -ve +ve -ve
SAR DCSN DOWN UP DOWN UP DOWN
AMP_OUTP
AMP_OUTN
Offset
TIA output
Burst #1 Burst #2
101
a measured scope shot of DC recovery operation taking only 4.8 ns for incoming 10 Gb/s
data.
Figure 4.13: Measured scope shot of DC recovery operation within 4.8 ns.
4.3.3. Clock Recovery
In the proposed receiver architecture, DC and timing recovery works concurrently. During
DC recovery, as there is no LPF, DC settles faster as the SAR updates the DAC at every
C8 clock. As the DC is moving through different SAR logic, if it settles within the signal
range of the TIA output, injection pulses begin to appear (Figure 4.14). Non 50% duty
cycle amplifier output during DC recovery gives pulses that are not 1-unit interval (UI)
apart. For the injection-locked oscillator (ILO) to lock at the proper frequency, the
separation between pulses has to be an integer multiple of UI. Although the distance
between rising and falling edge pulses depend on DC offset, the distance from one rising
edge to next rising edge is always 2UI. Therefore, rather than using both rising/falling edge,
only rising edge pulses are used to facilitate concurrent operation. When it comes to the
regular data pattern, this distance will always be N.UI where N≥2.
SAR
Operation
4.8ns2ns/div
Preamble Data
102
Figure 4.14: DC recovery without LPF and use of rising edge pulses allowing
concurrent operation of DC and timing recovery.
A ring oscillator is chosen for this work as it provides a wide tuning range with a compact
area requirement. The quadrate operation of the oscillator helps to achieve higher data rate
with better power efficiency in comparison to half rate or full rate clocking. The oscillator
has to operate in 1.75-2.5 GHz range as the data rate is between 7-10 Gb/s. The ring
oscillator has four stages as shown in Figure 4.15. Stage I, which gives Φ0 and Φ180, is
chosen for data pulse injection. The injected pulse corrects the zero crossings of Φ0 and
Φ180. As injection is happening in only one stage of the quadrate oscillator, only the rising
edge pulses can do the correction which are separated by 4UI. The 4UI separated pulses
also have to arrive at the time of zero crossing of Φ0. In this work, a pulse window is created
by having an AND between Φ225 and Φ315. However, any harmonic component of the
pulses generated from data can still go through and provide the phase correction if it falls
AMP_OUTP
AMP_OUTN
TIA output
2UI 2UI 2UI> UI <UI UI
Rising Edge Pulses
DC Recovery
Falling Edge Pulses
Injection
Pulses
103
Figure 4.15: Ring oscillator with pulse filtering and its timing diagram.
within the pulse window. Figure 4.15 illustrates the whole pulse filtering function of the
oscillator. With this pulse filtering in place during the preamble period each window gets
a pulse transferred to the oscillator which in turn makes it lock instantaneously. During DC
recovery operation as the DC goes up and down the lock point of the oscillator may shift,
but it always remains locked during that time enabling concurrent operation.
Φ0
Φ225
Φ315
Pulse Window
Filtered Pulse
Generated Pulse
DATADATAB
Φ0
Φ180
Φ225
Φ45
Φ90
Φ270
Φ315
Φ135
Generated Pulse
Pulse
Filtering
Filtered Pulse
Φ225
Φ315
Pulse Window
1
0 0
104
4.3.4. Timing Skew Correction
There are delays associated with the CML-to-CMOS and the injection pulse generator that
go on to correct the phase of the oscillator. The corrected data phases of the injection locked
oscillator go through buffers and pulse generators to do the sampling of the incoming data.
In this way, the timing margin for data samplers becomes a function of path delay (Figure
4.16). Ideally, this path delay should be 0.5 UI. However, in different corners, this delay
varies which reduces the timing and voltage margin of the data recovery path. To get rid
of this issue, a slope detection technique is applied. The slope detection logic only works
during the no data transition period as the ILO already takes care of transition edges. If the
data is sampled at the middle of the eye (Figure 4.17(a)), the slope between two consecutive
samples should ideally be zero. This is true for both 11 and 00 data sequences. However,
Figure 4.16: Timing Skew compensation.
TIA
Sample
&
Hold
Timing
Skew
Adaptation
LogicS(n-3T)
S(n-2T)
S(n)
S(n-T)
CML to
CMOS
Pulse
Gen
Quarter
Rate
Oscillator
Phase
Rotator
0.5 UI<0.5 UI
>0.5 UI
4Φ
4ΦFrequency
(f) code
Φ code
105
Figure 4.17: Slope Detection.
if the non-transition data is sampled before/after 0.5 UI from the edge, there will be a +ve/-
ve slope. As illustrated in Figure 4.17(b), when the data sequence is 011x, if the slope
between two samples of consecutive 1s is +ve, the sampling phase should be moved to the
right to have the phase in the middle of the eye. However, for 100x data sequence it has to
move right if the slope between two samples of consecutive 0s is -ve. Figure 4.17(c) shows
the situations where the data sampling phase has to move left. The data sequences to
consider here are x110 and x001. This adaptation logic runs after the preamble period
during regular data pattern and updates the phase rotator to have correct sampling phase
with the highest timing margin. To implement this slope detection, we can reuse S/H,
comparator and SAR algorithm used for DC recovery. The comparison between two
consecutive samples, S(n) and S(n-T) during the non-transition data period will give the
slope of the incoming signal. The phase rotator has 5-bit control allowing 32 phase steps
x
1 1
x
x
00
x
Slope
=0
Slope
=0
(a) Sampling at the middle
of the eye
0
1 1
x
1
00
x
Slope
=+ve
Slope
=-ve
(b) Sampling before 0.5
UI from edge
0
11
x
1
00
x
Slope
=-ve
Slope
=+ve
(c) Sampling after 0.5
UI from edge
106
between two data phases. The slope detection algorithm output goes to a majority voter
and C16 (1/16th of data rate) is used to take data out of the majority voter. After majority
voting, the phase rotator control bits are updated using the successive approximation
algorithm. This logic runs once during each burst. There is a frequency (f) code that
controls the frequency of the oscillator. This frequency code ensures that the oscillator is
at the correct frequency. As the frequency is locked, the oscillator can’t drift away during
the CID data pattern.
Figure 4.18: Slope detection logic.
4.4. Implementation and Measurement Results
The implemented die photo is shown in Figure 4.19. The quadrate receiver takes only 465
µm x 265 µm area in 0.13 µm technology. Figure 4.20 shows the test setup with all the
necessary instruments. The receiver recovers DC offset in only 4.8 ns compared to current
state-of-art architecture taking 12.5 ns. The burst mode clock recovery takes 1 ns more (i.e.
5.8 ns) during the preamble period (Figure 4.21). The receiver works at 10 Gb/s consuming
only 41.6 mW out of which 5.1 mW goes for the initial DC recovery, and 3.8 mW goes for
Data Pattern Slope Phase Movement
011x +ve Right
x110 -ve Left
100x -ve Right
x001 +ve Left
Slope Detection
Logic
Φn
Φn-T
S(n-T)
S(n)
Left/Right/
Do Nothing
Data Decision
bits
S(n)-
S(n-T)
107
the timing skew correction. The power efficiency of the receiver during runtime is only
3.27 pJ/bit. During low data rates, the consumed power goes down, but power efficiency
is reduced. The recovered clock has around 10 ps measured peak to peak jitter (Figure
4.23). Figure 4.24 shows the phase noise plot for 7 Gb/s operation giving only 2.3 ps rms
jitter when jitter is integrated from 1 KHz to 1 GHz.
Figure 4.19: Implemented die photo in 0.13 µm.
Figure 4.20: Pin diagram and test setup of the prototype.
ANALOG
FRONT
END
SAR
DAC
SAR
DIGITAL
BLOCK
COMPARATORS
ILOBUF
V2I
Area =
465 µm
x
265 µm
Ph
ase
Rota
tor
TIA_INFEND_OUT
DV
DD
AV
DD
AVDD
CVDD
D0_BUF
CLK0_BUF
VCO_BIAS
INIT
IREF_CHIP
VSS
IRE
F_
AM
PS
VD
C_
OU
T
TIA
_O
UT
TIA
_IN
IRE
F_
TIA
VSS
DVDD
SR_CLK
SR_IN
Tektronix
AWG
70002A
Agilent
86100D
Infiniium
DCA-X
Oscilloscope
Tektronix DPO
71604C
Oscilloscope
Control
Bit Load
using
Arduino
Spectrum
analyzer R&S
FSV 40GHz
Used for eye
diagrams
Used for phase
noise plots
Used for high
speed transient
screenshots
Preamble and
PRBS Data
pattern loading
CHIP ID:
ICGAAEND
108
Figure 4.21: Measured scope shot of DC and timing recovery with preamble and data
pattern.
Figure 4.22: Recovered clock eye.
SAR Operation
4.8 ns
DC & Timing Recovery Locked
Preamble Data
2ns/div
109
Figure 4.23: Recovered clock histogram.
Figure 4.24: Phase noise plot of the recovered clock.
110
Figure 4.25: On-chip PRBS checked recovered 2.5 Gb/s channel data eye.
111
4.5. Comparison with State-of-Art
The proposed optical receiver is compared to existing works in Table 4.1 and Table 4.2.
The existing 10 Gb/s solutions require inductive peaking to meet the gain-bandwidth
product requirement. In the proposed TIA, we have achieved comparable performance
without any inductive peaking. The entire receiver takes only 465 µm x 265 µm area in
0.13 µm technology with comparable performance when compared to state-of-art works.
While most of the works concentrate on either timing or DC recovery, here we have
addressed both challenges with excellent power efficiency. Prior work [2] shows that, for
dynamically reconfigurable optical networks, a receiver lock time of 100 ns is required.
Receiver lock time of 50 ns is highly desirable, which contributes to overall network
performance. In this work, we have a lock time of 5.8 ns only, which improves overall
state-of-art by 6.5X.
Table 4.1: Performance Comparison Summary of the TIA.
RFIC ’09 [32] ASSCC ’07 [33] JSSC ’15 [2] This Work
Technology 0.13 µm CMOS 0.18 µm CMOS 32 nm CMOS 0.13 µm CMOS
Gain (dB Ω) 57 59 46.4 60.9
Bandwidth
(GHz) 10 10 18.4 11.7
Sensitivity
(dBm) N/A N/A -14. 4 -13.98
Power (mW) 1.8 18 2.7 4.3
FoM (GHz
Ω/mW) 3933 495 1431 3018
Inductive
Peaking Yes Yes No No
No. of Stages 1 2 1 2
112
Table 4.2: Performance Comparison Summary of the Receiver.
ISSCC ’12 [28] ISSCC ’11 [12] JSSC ’15 [2] This Work
Data Rate
(Gb/s) 10 1-6 25 7 – 10
Technology 0.13 μm SiGe
BiCMOS 65 nm CMOS 32 nm CMOS 0.13 µm CMOS
Area (mm2) 1.6202
(TIA+LA) 0.0175 0.06 0.123225
DC Recovery +
Timing
Recovery
Operation
DC only Timing Only Cascaded Concurrent
DC Recovery
Technique
Feedback type
AOC circuit
with switchable
loop BW
-
LPF with
Calibration
logic
Successive
Approximation
Algorithm
DC Recovery
Cycle/Time
(Data_rate/8
Clock Cycle)
75 ns - 39 Cycle/
12.5 ns
6 Cycle/
4.8 ns
Clock Recovery
Technique -
Phase
Interpolator
Phase
Interpolator &
Bang Bang PD
Quarter Rate ILO
Clock Recovery
Time (ns) - <0.16-1 18.5 <1
Total Lock
Time (ns) - - 31 5.8
Power
Consumption
(mW)
630 (TIA+LA) 22 (CDR only) 109
TIA+Amps →12.1
Timing
Recovery→11.6
Data Recovery→5.4
DC Recovery → 8.7
Skew Adaptation→3.8
Power
Efficiency
(pJ/bit)
63 (TIA+LA) 3.67 (CDR
only) 4.4
3.27
(Runtime power-
TIA+Amps+CDR)
113
Chapter 5.
Concluding Remarks
The main concentration of this thesis has been to develop the concept and architecture of
low power energy efficient digital receiver. This dissertation has featured a total of three
receivers, out of which two are for wireline application, and the other one is an optical
receiver. The performance summary of the implemented receivers is listed in Table 5.1.
Table 5.1: Performance summary of the implemented receivers.
Receiver1 Receiver2 Receiver3
Application Wireline Wireline Optical
Technology TSMC 65nm TSMC 65nm IBM 0.13µm
Supply Voltage (V) 1.2 1.2 1.2
Area (mm2) 0.23 0.168 0.123225
Area per channel 240 µm X 240 µm 300 µm X 140 µm --
Data Rate (Gb/s) 10 16 7-10
Architecture 4X Sequence DFE 4X Sequence DFE
and Trace-back
4X Time-
interleaved
Clock Recovery Edge and Data
Sampled -- Injection locked
DC Recovery N/A N/A SAR Algorithm
114
Consumed Power
(mW) 35@27 dB
71.8 @ 35 dB
55 @ 25 dB 32.7 (runtime)
Power Efficiency
(pJ/bit) 3.5@ 27 dB
4.4875 @ 35 dB
3.4375@25 dB 3.27
FoM
(pJ/bit/dB) 0.13
0.1282@ 35 dB
0.1375@ 25 dB --
Improvement from
current state-of-art
2.5X in Power
Efficiency [11]
1.94X in Power
Efficiency [11] 1.35X in Power
Efficiency [2] 2.65X in FoM [11] 2.68X in FoM [11]
We have proposed a low power maximum likelihood sequence equalization technique
without ADC-DSP for wireline application. The implemented prototype in TSMC 65nm
receiver has worked at 10 Gb/s achieving a BER of 10-12. The receiver has improved the
state-of-art by consuming only 3.5 pJ/bit while compensating 27 dB channel loss, which
gives it a figure of merit (FoM) of 0.13 pJ/bit/dB. In comparison with the current state-of-
art ([11]), FoM has improved by 2.65X as shown in Table 5.1. The concept, architecture
and measurement results of this work has been published in Symposium on VLSI Circuits:
“"A 35 mW 10 Gb/s ADC-DSP less direct digital sequence detector and equalizer in 65nm
CMOS," A. D. Hossain, Aurangozeb, M. Mohammad and M. Hossain, 2016 IEEE
Symposium on VLSI Circuits (VLSI-Circuits), Honolulu, HI, 2016, pp. 1-2.”
The equalization technique applied for the 10 Gb/s receiver in Chapter 2 has been improved
in Chapter 3 to have a better noise immunity. The receiver prototype with 1-bit data trace-
back has worked at 16 Gb/s demonstrating BER of 10-10 for DFE operation only and BER
of 10-12 when DFE and trace-back worked together. For the data traceback, two additional
115
comparators and additional digital logics have been added in the architecture which results
in higher power consumption of about 71.8 mW. Therefore, the power efficiency turns out
to be 4.4875 pJ/bit; higher than previous architecture but with improved voltage margin.
However, as it could compensate higher loss of up to 35 dB, FoM has been 0.1282 pJ/bit/dB
which was similar to the first receiver. Traditionally error correction capabilities have been
introduced through encoders and decoders such as FEC (forward error correction). These
encoders and decoders not only add overhead, but also add latency to the system that in
turn negatively impacts compute performance. Traditional transceivers by themselves did
not have error corrections capability. To the best of our knowledge, this has been for the
first time that DFE error correction capability has been introduced within SerDes
transceivers and this has opened up the potential for very low overhead and low latency
error correction.
The optical receiver described in Chapter 4 has concentrated both on power and latency for
burst mode application. The receiver implemented in 0.13 µm technology has
demonstrated a lock time of less than 10 ns while consuming only 32.7 mW during runtime.
The recovered clock for the receiver has had around 10 ps measured peak to peak jitter.
The paper containing overall concept and measurement results has been submitted for the
peer review journal of IEEE Transactions on Circuits and Systems I: Regular Papers.
“"Burst Mode Optical Receiver with 10ns Lock Time Based on Concurrent DC Offset and
Timing Recovery Technique," A. D. Hossain, Aurangozeb and M. Hossain, IEEE
Transactions on Circuits and Systems I.”
116
5.1. Future Works
The equalization technique developed in this dissertation can compensate channel loss up
to 35 dB while employing highly digital architecture. Digital designs are inherently
scalable and portable to deep sub-micron CMOS processes. This technique can be
implemented in smaller technology nodes such as 28nm FD-SOI or 14nm FinFET to
achieve higher data rate and better power efficiency. With the increased use of pulse-
amplitude modulation (PAM) in receiver design, this concept of equalization can be
implemented for PAM-4/PAM-8 data streams where SNR is lower than NRZ streams.
The optical receiver architecture provides comparable performance to state-of-art, although
it is implemented in 0.13 µm. This portable architecture is attractive for higher data rates
in scaled technology.
117
Bibliography
[1] T. Toifl et al., “A 2.6 mW/Gbps 12.5 Gbps RX With 8-Tap Switched-Capacitor
DFE in 32 nm CMOS,” IEEE J. Solid-State Circuits, vol. 47, no. 4, pp. 897–910,
Apr. 2012.
[2] A. Rylyakov et al., “A 25 Gb/s Burst-Mode Receiver for Low Latency Photonic
Switch Networks,” IEEE J. Solid-State Circuits, vol. 50, no. 12, pp. 3120–3132,
Dec. 2015.
[3] B. Abiri, A. Sheikholeslami, H. Tamura, and M. Kibune, “A 5Gb/s adaptive DFE
for 2x blind ADC-based CDR in 65nm CMOS,” in 2011 IEEE International Solid-
State Circuits Conference, 2011, pp. 436–438.
[4] E. H. Chen, R. Yousry, and C. K. K. Yang, “Power Optimized ADC-Based Serial
Link Receiver,” IEEE J. Solid-State Circuits, vol. 47, no. 4, pp. 938–951, Apr.
2012.
[5] C. Ting, J. Liang, A. Sheikholeslami, M. Kibune, and H. Tamura, “A blind baud-
rate ADC-based CDR,” in 2013 IEEE International Solid-State Circuits Conference
Digest of Technical Papers, 2013, pp. 122–123.
[6] A. Varzaghani et al., “A 10.3-GS/s, 6-Bit Flash ADC for 10G Ethernet
Applications,” IEEE J. Solid-State Circuits, vol. 48, no. 12, pp. 3038–3048, Dec.
2013.
[7] E. Z. Tabasy, A. Shafik, K. Lee, S. Hoyos, and S. Palermo, “A 6 bit 10 GS/s TI-
SAR ADC With Low-Overhead Embedded FFE/DFE Equalization for Wireline
Receiver Applications,” IEEE J. Solid-State Circuits, vol. 49, no. 11, pp. 2560–
2574, Nov. 2014.
118
[8] B. Zhang et al., “3.1 A 28Gb/s multi-standard serial-link transceiver for backplane
applications in 28nm CMOS,” in 2015 IEEE International Solid-State Circuits
Conference - (ISSCC) Digest of Technical Papers, 2015, pp. 1–3.
[9] D. Cui et al., “3.2 A 320mW 32Gb/s 8b ADC-based PAM-4 analog front-end with
programmable gain control and analog peaking in 28nm CMOS,” in 2016 IEEE
International Solid-State Circuits Conference (ISSCC), 2016, pp. 58–59.
[10] S. Rylov et al., “3.1 A 25Gb/s ADC-based serial line receiver in 32nm CMOS SOI,”
in 2016 IEEE International Solid-State Circuits Conference (ISSCC), 2016, pp. 56–
57.
[11] A. Shafik, E. Z. Tabasy, S. Cai, K. Lee, S. Hoyos, and S. Palermo, “3.6 A 10Gb/s
hybrid ADC-based receiver with embedded 3-tap analog FFE and dynamically-
enabled digital equalization in 65nm CMOS,” in 2015 IEEE International Solid-
State Circuits Conference - (ISSCC) Digest of Technical Papers, 2015, pp. 1–3.
[12] A. D. Hossain, Aurangozeb, M. Mohammad, and M. Hossain, “A 35 mW 10 Gb/s
ADC-DSP less direct digital sequence detector and equalizer in 65nm CMOS,” in
2016 IEEE Symposium on VLSI Circuits (VLSI-Circuits), 2016, pp. 1–2.
[13] O. E. Agazzi et al., “A 90 nm CMOS DSP MLSD Transceiver With Integrated AFE
for Electronic Dispersion Compensation of Multimode Optical Fibers at 10 Gb/s,”
IEEE J. Solid-State Circuits, vol. 43, no. 12, pp. 2939–2957, Dec. 2008.
[14] J. Cao et al., “21.7 A 500mW digitally calibrated AFE in 65nm CMOS for 10Gb/s
Serial links over backplane and multimode fiber,” in 2009 IEEE International Solid-
State Circuits Conference - Digest of Technical Papers, 2009, p. 370–371,371a.
[15] H. Yamaguchi et al., “A 5Gb/s transceiver with an ADC-based feedforward CDR
and CMA adaptive equalizer in 65nm CMOS,” in 2010 IEEE International Solid-
State Circuits Conference - (ISSCC), 2010, pp. 168–169.
119
[16] B. Zhang et al., “A 195mW / 55mW dual-path receiver AFE for multistandard 8.5-
to-11.5 Gb/s serial links in 40nm CMOS,” in 2013 IEEE International Solid-State
Circuits Conference Digest of Technical Papers, 2013, pp. 34–35.
[17] T. Beukema et al., “A 6.4-Gb/s CMOS SerDes core with feed-forward and decision-
feedback equalization,” IEEE J. Solid-State Circuits, vol. 40, no. 12, pp. 2633–
2645, Dec. 2005.
[18] V. Balan et al., “A 4.8-6.4-Gb/s serial link for backplane applications using decision
feedback equalization,” IEEE J. Solid-State Circuits, vol. 40, no. 9, pp. 1957–1967,
Sep. 2005.
[19] R. Payne et al., “A 6.25Gb/s binary adaptive DFE with first post-cursor tap
cancellation for serial backplane communications,” in ISSCC. 2005 IEEE
International Digest of Technical Papers. Solid-State Circuits Conference, 2005.,
2005, p. 68–585 Vol. 1.
[20] N. Kocaman et al., “A 3.8 mW/Gbps Quad-Channel 8.5-13 Gbps Serial Link With a
5 Tap DFE and a 4 Tap Transmit FFE in 28 nm CMOS,” IEEE J. Solid-State
Circuits, vol. 51, no. 4, pp. 881–892, Apr. 2016.
[21] S. Kasturia and J. H. Winters, “Techniques for high-speed implementation of
nonlinear cancellation,” IEEE J. Sel. Areas Commun., vol. 9, no. 5, pp. 711–717,
Jun. 1991.
[22] V. Stojanovic et al., “Adaptive equalization and data recovery in a dual-mode
(PAM2/4) serial link transceiver,” in 2004 Symposium on VLSI Circuits, 2004.
Digest of Technical Papers, 2004, pp. 348–351.
[23] M. Harwood et al., “A 12.5Gb/s SerDes in 65nm CMOS Using a Baud-Rate ADC
with Digital Receiver Equalization and Clock Recovery,” in 2007 IEEE
International Solid-State Circuits Conference. Digest of Technical Papers, 2007,
pp. 436–591.
120
[24] B. Razavi, Design of Analog CMOS Integrated Circuits, 1 edition. Boston, MA:
McGraw-Hill Education, 2000.
[25] R. G. Beausoleil, M. McLaren, and N. P. Jouppi, “Photonic Architectures for High-
Performance Data Centers,” IEEE J. Sel. Top. Quantum Electron., vol. 19, no. 2, pp.
3700109–3700109, Mar. 2013.
[26] K. Hara, S. Kimura, H. Nakamura, N. Yoshimoto, and H. Hadama, “New AC-
Coupled Burst-Mode Optical Receiver Using Transient-Phenomena Cancellation
Techniques for 10 Gbit/s-Class High-Speed TDM-PON Systems,” J. Light.
Technol., vol. 28, no. 19, pp. 2775–2782, Oct. 2010.
[27] T. D. Ridder et al., “10 Gbit/s burst-mode post-amplifier with automatic reset,”
Electron. Lett., vol. 44, no. 23, pp. 1371–1373, Nov. 2008.
[28] X. Yin et al., “A 10Gb/s burst-mode TIA with on-chip reset/lock CM signaling
detection and limiting amplifier with a 75ns settling time,” in 2012 IEEE
International Solid-State Circuits Conference, 2012, pp. 416–418.
[29] S. Galal and B. Razavi, “10-Gb/s limiting amplifier and laser/modulator driver in
0.18- µm CMOS technology,” IEEE J. Solid-State Circuits, vol. 38, no. 12, pp.
2138–2146, Dec. 2003.
[30] M. Hossain and A. C. Carusone, “5 #x2013;10 Gb/s 70 mW Burst Mode AC
Coupled Receiver in 90-nm CMOS,” IEEE J. Solid-State Circuits, vol. 45, no. 3, pp.
524–537, Mar. 2010.
[31] J. Luo, J. Parra-Cetina, P. Landais, H. J. S. Dorren, and N. Calabretta, “Performance
Assessment of 40 Gb/s Burst Optical Clock Recovery Based on Quantum Dash
Laser,” IEEE Photonics Technol. Lett., vol. 25, no. 22, pp. 2221–2224, Nov. 2013.
121
[32] F. Aflatouni and H. Hashemi, “A 1.8mW Wideband 57dBΩ transimpedance
amplifier in 0.13 µm CMOS,” in 2009 IEEE Radio Frequency Integrated Circuits
Symposium, 2009, pp. 57–60.
[33] C.-Y. Wang, C.-S. Wang, and C.-K. Wang, “An 18-mW two-stage CMOS
transimpedance amplifier for 10 Gb/s optical application,” in Solid-State Circuits
Conference, 2007. ASSCC ’07. IEEE Asian, 2007, pp. 412–415.