ee141 system-on-chip test architectures ch. 8 – physical failures - p. 1 1 chapter 8 coping with...
Post on 21-Dec-2015
EE1411
System-on-Chip Test Architectures Ch. 8 – Physical Failures - P. 1
Chapter 8
Coping with Physical Failures, Soft Errors, and Reliability Issues
What is this chapter about?
Gives an overview of the causes of manufacturing defects and soft errors, and of promising solutions to them
Focus on
Signal Integrity
Defect-Based Tests
Process Sensors and Adaptive Design
Soft Errors
– BISER
– Circuit-Level Approaches
Defect and Error Tolerance
Coping with Physical Failures, Soft Errors, and Reliability Issues
Introduction
Signal Integrity
Manufacture Defects, Process Variations, and Reliability
Soft Errors
Defect and Error Tolerance
Concluding Remarks
Introduction
Defects
Random defects
– Caused by manufacturing imperfections and occur in random places
Systematic defects
– Caused by process or manufacturing variations
Defect level (DL) is a function of process yield (Y) and fault coverage (FC):
$DL = 1 - Y^{(1 - FC)}$
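The defect-level relation DL = 1 - Y^(1 - FC) can be explored numerically. The sketch below is only an illustration: the function name `defect_level` and the sample yield and coverage values are assumptions, not values from the chapter.

```python
def defect_level(process_yield: float, fault_coverage: float) -> float:
    """Return DL = 1 - Y**(1 - FC), with Y and FC given as fractions in [0, 1]."""
    if not 0.0 < process_yield <= 1.0:
        raise ValueError("process_yield must be in (0, 1]")
    if not 0.0 <= fault_coverage <= 1.0:
        raise ValueError("fault_coverage must be in [0, 1]")
    return 1.0 - process_yield ** (1.0 - fault_coverage)


if __name__ == "__main__":
    # Hypothetical process with 50% yield, swept over a few coverage values.
    for fc in (0.90, 0.99, 1.00):
        dl = defect_level(0.5, fc)
        print(f"Y=0.50  FC={fc:.2f}  DL={dl:.5f}")
```

With FC = 1 the defect level is zero regardless of yield, and with FC = 0 it equals 1 - Y, which matches the two limiting cases of the formula.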
Concept of Signal Integrity
Signal integrity is the ability of a signal to generate correct responses in a circuit.
A signal with good integrity stays within safe margins for its voltage amplitude and transition time.
Basic Concept of Integrity Loss
Integrity Loss: any portion of a signal that exceeds the amplitude-safe and time-safe margins.
$\text{Integrity Loss (IL)} = \sum_i \int_{b_i}^{e_i} |V_i - f(t)| \, dt$
where $V_i$ is one of the acceptable amplitude levels and $[b_i, e_i]$ is a time frame during which integrity loss occurs.
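As a rough illustration of integrity loss as an area measure (the sum over time frames of the integral of |V_i - f(t)|), the sketch below integrates the part of a sampled waveform that rises above a single amplitude level. The function name, the single upper level, and the trapezoidal approximation are assumptions made for this example; a real monitor would also treat undershoot and timing margins.

```python
from typing import Sequence


def integrity_loss(times: Sequence[float],
                   samples: Sequence[float],
                   v_level: float) -> float:
    """Approximate the area of the waveform that exceeds v_level.

    Each sample is clipped to its excess above v_level, then the excess is
    integrated with the trapezoidal rule. Crossing points between samples
    are not interpolated, so the result is only an approximation.
    """
    if len(times) != len(samples):
        raise ValueError("times and samples must have the same length")
    excess = [max(s - v_level, 0.0) for s in samples]
    total = 0.0
    for k in range(1, len(times)):
        dt = times[k] - times[k - 1]
        total += 0.5 * (excess[k] + excess[k - 1]) * dt
    return total


if __name__ == "__main__":
    # Hypothetical overshoot: a 0.2 V triangular bump above a 1.0 V level.
    t = [0.0, 1.0, 2.0]
    v = [1.0, 1.2, 1.0]
    print(integrity_loss(t, v, v_level=1.0))
```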
Sources of Integrity Loss
Interconnects
Power Supply Noise
Process Variations
Integrity Loss Sensors/Monitors (1)
Current Sensor
Current sensors are often used to detect the completion of asynchronous circuits.
Integrity Loss Sensors/Monitors (2)
Power Supply Noise Sensor
The voltage $V_x$ depends on the power/ground bounces: the higher the PSN is, the longer the propagation and the higher the voltage $V_x$ will be.
Integrity Loss Sensors/Monitors (3)
Noise Detector (ND) Sensor
The ND sensor is designed to detect integrity loss due to voltage violations.
Integrity Loss Sensors/Monitors (4)
Integrity Loss Sensor (ILS)
The integrity loss sensor is a delay violation sensor.
Integrity Loss Sensors/Monitors (5)
Jitter Monitor
Jitter is often defined as the time deviation of a signal from its ideal location in time.
Integrity Loss Sensors/Monitors (6)
A ring oscillator can work as a Process Variation Sensor
The variation of delay caused by PV-faults in any of the inverters in the loop results in a deviation in the frequency of the oscillator, which can be detected.
$f_{RO} = \dfrac{1}{N_{inv} \cdot T_{inv}}$, where $N_{inv}$ is an odd number of inverters and $T_{inv}$ is the delay of one inverter.
[Equation: expanded form of $f_{RO}$ in terms of $N_{inv}$, $C_{Load}$, $V_{dd}$, $T_{ox}$, $W$, $L_{eff}$, $K$, $V_{GS}$, $V_t$, and $V_{DS}$]
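A small sketch of how a ring oscillator exposes a delay variation, using the slide's relation f_RO = 1 / (N_inv · T_inv). The helper names, the generalisation to per-stage delays (frequency taken as the reciprocal of the summed stage delays), and the picosecond values are assumptions for illustration only.

```python
from typing import Sequence


def ro_frequency(n_inv: int, t_inv_s: float) -> float:
    """f_RO = 1 / (N_inv * T_inv) for N_inv identical inverters (N_inv odd)."""
    if n_inv < 1 or n_inv % 2 == 0:
        raise ValueError("n_inv must be a positive odd number")
    return 1.0 / (n_inv * t_inv_s)


def relative_shift(nominal_delay_s: float, stage_delays_s: Sequence[float]) -> float:
    """Relative frequency deviation of a loop whose stages have varying delays.

    Assumes the loop frequency is the reciprocal of the summed stage delays,
    which reduces to 1 / (N_inv * T_inv) when all stages are equal.
    """
    n_inv = len(stage_delays_s)
    f_nominal = ro_frequency(n_inv, nominal_delay_s)
    f_actual = 1.0 / sum(stage_delays_s)
    return (f_actual - f_nominal) / f_nominal


if __name__ == "__main__":
    # Hypothetical 5-stage loop, 20 ps per inverter, one stage slowed to 30 ps.
    delays = [20e-12] * 4 + [30e-12]
    print(f"nominal f_RO = {ro_frequency(5, 20e-12):.3e} Hz")
    print(f"relative shift = {relative_shift(20e-12, delays):+.3%}")
```

A counter that samples the oscillator over a fixed window would see this shift as a change in count, which is the detection principle the slide describes.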
Readout Architectures (1)
BIST-Based Architecture
When a noise or delay violation occurs (flag = 1), the contents of all scan cells are scanned out through Sout for further reliability and diagnosis analysis.
[Figure: BIST architecture and readout circuitry]
Readout Architectures (2)
Scan-Based Architecture
At the driving side of an interconnect, a pattern generation boundary-scan cell (PGBSC) is used to generate test patterns. At the receiving side of the interconnect, an observation boundary-scan cell (OBSC) is used to detect integrity loss.
Readout Architectures (3)
Basic Concept of PV-Test Architecture
On-chip ROs with counters, embedded in a test chip, are used to detect process variation by measuring the ROs' frequency shifts.
Manufacture Defects, Process Variations, and Reliability
100% single stuck-at fault coverage cannot guarantee perfect product quality, because there are remaining defects that are:
– Timing-dependent
– Sequence-dependent
– Attributed to timing-dependent, non-single-stuck-at faults
Structural Tests
A Defect-Based Test Architecture
[Figure: defect-based test architecture. Blocks: Synthesis, ATPG, Modeling, Timing Analysis, RC Extraction, Path Extractor, Defect-Based Fault Enumeration, Fault Mapping, Defect-Based Fault Simulator, Defect-Based ATPG. Data: RTL, Library, Layout, Gate-Level Netlist, Structural Tests, Functional Tests, Critical Path List, Physical Faults, Logical Fault List, Fault List, Defect-Based Tests]
Defect-Based Tests
Small Delay Defect Tests
Bridge Defect Tests
N-Detect Tests
$I_{DDQ}$ Tests
$MinV_{DD}$ Tests
VLV Tests
Reliability Stress
Concept of Infant Mortality
Methods to screen infant mortality
Method I - Burn-in
$ttf = C \cdot e^{E_A / kT}$
where ttf is the time to failure, C is a constant, $E_A$ is the activation energy (eV), k is Boltzmann's constant, and T is the absolute temperature.
Method II - Elevated Voltage Stress
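Because burn-in relies on the relation ttf = C · e^(E_A / kT), the ratio of time to failure at use temperature to time to failure at stress temperature gives an acceleration factor in which the constant C cancels. The sketch below works that ratio out; the function names, the 0.7 eV activation energy, and the two temperatures are illustrative assumptions, not data from the chapter.

```python
import math

# Boltzmann's constant in eV/K, so E_A can be given in eV as on the slide.
BOLTZMANN_EV_PER_K = 8.617333262e-5


def time_to_failure(c: float, e_a_ev: float, temp_k: float) -> float:
    """ttf = C * exp(E_A / (k * T)) with T as absolute temperature in kelvin."""
    return c * math.exp(e_a_ev / (BOLTZMANN_EV_PER_K * temp_k))


def acceleration_factor(e_a_ev: float, t_use_k: float, t_stress_k: float) -> float:
    """Ratio ttf(T_use) / ttf(T_stress); the constant C cancels out."""
    return time_to_failure(1.0, e_a_ev, t_use_k) / time_to_failure(1.0, e_a_ev, t_stress_k)


if __name__ == "__main__":
    # Hypothetical values: E_A = 0.7 eV, 55 C use condition, 125 C burn-in.
    af = acceleration_factor(0.7, 55.0 + 273.15, 125.0 + 273.15)
    print(f"acceleration factor: {af:.1f}")
```

A factor above 1 means one hour at the stress temperature ages the part as much as that many hours at the use temperature, which is why burn-in can expose infant-mortality failures in a short time.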
Redundancy and Memory Repair
Redundancy:
Spare rows, columns, or blocks
Repair schemes:
Pellston Technology [Wuu 2005]: If repeated errors are detected, disable the cache line (set the “not to use” bit)
Perform memory BIST at new operating conditions; exclude failing cells and resize the cache (the cache can grow or shrink, depending on whether the new conditions are more or less favourable)
Process Sensors and Adaptive Design
In addition to the traditional test structures placed on the scribe lines, embed additional process sensors on-chip.
On-Chip Process Sensors:
Process Variation Sensor
Thermal Sensor
Dynamic Voltage Scaling
Process Variation Sensor
Ring oscillators: Many factors can affect the frequency of the ring oscillator, such as process variation, temperature, and voltage.
Analog Process Variation Sensor: The analog circuit will be sensitive to different process parameters.
Neither can report the process variation at a specific spot on the die, and neither is likely to allow the data to be extracted and analyzed in real time.
Thermal Sensor
On-chip thermal sensors are the last line of defence against system crashes or permanent damage to the chip.
Figure 8.14: Thermal sensor example
Dynamic Voltage Scaling
Figure 8.15: Dynamic voltage scaling scheme
[Figure labels: VccNOM, VIDnom, VIDmin, fMAX, fMIN, Frequency, Time, Request frequency change, transitions 1 to 4]
Transitions 1 and 3 are in the range of 100s of ps
Transitions 2 and 4 are in the range of 100s of μs
Dynamic Voltage Scaling (cont’d)
Use sleep transistors and dynamic biasing to save power
Use the adaptive test method for smart binning
Soft Errors
Introduction
Sources of Soft Errors and SER Trends
Coping with Soft Errors
Introduction
Soft errors
Soft errors are transient single-event upsets (SEUs) caused by various types of radiation.
Cosmic radiation is the major source of soft errors, especially in memories.
Terrestrial radiation is another source of soft errors.
Sources of Soft Errors and SER Trends
If a glitch is induced at the junction (red label) in a memory element, its state can be reversed.
Figure 8.16: Induced soft error on an SRAM cell
Sources of Soft Errors and SER Trends
Logic circuits are less susceptible to these glitches than memories for the following reasons.
– The glitch must be of sufficient strength to propagate from the location of the strike.
– The glitch needs to have a functionally sensitized path to be latched.
– The glitch must arrive at a latch during its latching window.
Figure 8.18: Masking factors of soft errors in combinational logic
Coping with Soft Errors
As chips are susceptible to soft errors, many soft error protection schemes targeting chip designs have been proposed.
Fault tolerance
Error-resilient microarchitectures
Soft error mitigation
Fault Tolerance
Removing the source of soft errors to improve the reliability of a chip.
Three fundamental fault tolerance schemes:
Hardware (spatial) redundancy
– assumes that defects and radiation particles will hit only one specific device and not another
Time (temporal) redundancy
– assumes that a radiation strike will not hit the same circuitry again at a slightly later time
Information redundancy
– uses an error-detecting code or error-correcting code to represent the information contents
Fault Tolerance
Common fault tolerance schemes used in high-reliability systems:
Duplicate and compare
– used in mainframes and high-end servers
Triple modular redundancy
– used for systems that cannot fail
Redundant multithreading
– runs redundant copies of a thread and compares their results
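The difference between duplicate-and-compare (detection only) and triple modular redundancy (detection plus masking) can be sketched with a bitwise majority voter. The function names and the sample bit patterns are assumptions for illustration; they model module outputs as plain integers.

```python
from typing import Tuple


def duplicate_and_compare(out_a: int, out_b: int) -> Tuple[int, bool]:
    """Return (value, mismatch). A mismatch flags an error but cannot correct it."""
    return out_a, out_a != out_b


def majority_vote(out_a: int, out_b: int, out_c: int) -> int:
    """Bitwise 2-of-3 majority of three module outputs (TMR voter)."""
    return (out_a & out_b) | (out_b & out_c) | (out_a & out_c)


if __name__ == "__main__":
    golden = 0b1010
    upset = golden ^ 0b1100  # hypothetical soft error flipping two bits in one copy

    _, mismatch = duplicate_and_compare(golden, upset)
    print("duplicate-and-compare detected an error:", mismatch)

    voted = majority_vote(golden, golden, upset)
    print("TMR output equals the fault-free value:", voted == golden)
```

The voter masks any error confined to one copy, which reflects the spatial-redundancy assumption that a strike affects one device and not another.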
Error-Resilient Microarchitectures
Two representative error-resilient processor microarchitectures: DIVA and Razor
DIVA: Dynamic Implementation Verification Architecture
DIVA Checker
– a smaller and simpler shadow processor
– contains a functional checker stage (CHK), a commit stage (CT), and a watchdog timer (WT)
DIVA Core
– the main processor that fetches, decodes, and executes instructions, holding their speculative results in the reorder buffer (ROB)
Error-Resilient Microarchitectures
Razor
Dynamic voltage scaling (DVS) is one of the most effective and widely used methods for power-aware computing.
The key idea of Razor is to tune the supply voltage by monitoring the error rate during circuit operation; this is accomplished with a shadow unit that has been pushed all the way down into the Razor flip-flop.
This Razor flip-flop is shown in Figure 8.21a.
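The detection principle of a Razor flip-flop (a main flip-flop clocked normally, a shadow latch sampling on a delayed clock, and a comparator raising an error when they disagree) can be captured in a tiny behavioural model. This is a sketch under simplifying assumptions: the names, the abstract time units, and the omission of setup/hold effects and metastability are all choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class RazorSample:
    main_value: int    # value captured by the main flip-flop at the clock edge
    shadow_value: int  # value captured by the shadow latch on the delayed clock
    error: bool        # comparator output: main and shadow disagree


def razor_sample(old_value: int, new_value: int, arrival_time: float,
                 clk_edge: float, shadow_delay: float) -> RazorSample:
    """Model one capture: data switches from old_value to new_value at arrival_time."""
    main = new_value if arrival_time <= clk_edge else old_value
    shadow = new_value if arrival_time <= clk_edge + shadow_delay else old_value
    return RazorSample(main, shadow, main != shadow)


if __name__ == "__main__":
    # Hypothetical clock edge at t = 1.0 with a 0.3 shadow-clock delay.
    for arrival in (0.9, 1.1, 1.5):
        print(arrival, razor_sample(0, 1, arrival, clk_edge=1.0, shadow_delay=0.3))
```

Data arriving inside the window between the two clock edges is caught and flagged; data arriving after the delayed clock is missed by both elements, which is why the supply voltage must not be lowered so far that paths exceed the shadow window.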
Error-Resilient Microarchitectures
Figure 8.21(a): Schematic of the Razor flip-flop
[Figure labels: Logic Stage L1, Logic Stage L2, Main Flip-Flop, Shadow Latch, comparator, D1, Q1, clk, clk_del, Error, Error_L]
Error-Resilient Microarchitectures
Razor
A reduced-overhead Razor flip-flop with the metastability detection circuit is illustrated in Figure 8.21b.
Figure 8.21(b): Reduced-overhead Razor flip-flop with metastability detection circuit
[Figure labels: D, Q, clk, clk_b, clk_del, clk_del_b, Inv_n, Inv_p, Shadow Latch, Metastability Detector, Error_L]
Soft Error Mitigation
Soft error mitigation techniques provide a design with partial immunity to potential soft errors at a significantly lower cost than fault tolerance schemes.
There are three soft error mitigation methods:
(1) Built-In Soft-Error Resilience (BISER)
BISER, proposed in [Mitra 2005], allows the scan design to protect a device from soft errors during normal operation.
Soft Error Mitigation
Figure 8.22 shows the BISER scan cell design, which reduces the impact of soft errors affecting storage elements by more than 20 times.
Figure 8.22: Built-in soft-error resilience (BISER) scan cell
[Figure labels: scan portion with latches LA and LB, system flip-flop with latches PH1 and PH2, C-element, keeper; signals D, CLK, Q, O1, O2, SI, SO, SCA, SCB, UPDATE, CAPTURE, TEST]
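The BISER figure labels a C-element and a keeper at the cell output. The sketch below models the general behaviour of such a pair: the output follows the inputs only when both redundant storage elements agree, and otherwise holds its previous value. It is a behavioural assumption for illustration (a non-inverting C-element with an ideal keeper), not a transcription of the circuit in the figure.

```python
class CElement:
    """Non-inverting C-element with a keeper that holds the last agreed value."""

    def __init__(self, initial: int = 0) -> None:
        self.out = initial

    def update(self, in_a: int, in_b: int) -> int:
        # Output changes only when both inputs agree; on disagreement the
        # keeper retains the previous output.
        if in_a == in_b:
            self.out = in_a
        return self.out


if __name__ == "__main__":
    cell = CElement(initial=0)
    print(cell.update(1, 1))  # both copies hold 1, output follows
    print(cell.update(0, 1))  # one copy upset, output is held by the keeper
    print(cell.update(0, 0))  # both copies hold 0, output follows
```

An upset in only one of the two storage elements therefore never reaches the output, which is the filtering effect that makes the redundant scan latches useful during normal operation.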
Soft Error Mitigation
Circuit-level approaches
(2) Gate resizing for soft error mitigation [Zhou 2006] is based on physical-level design modifications.
Figure 8.23 illustrates the effect of gate resizing on the amplitude and width of a 0-to-1 transient at the output of a gate.
Figure 8.23: Effect of gate resizing on the amplitude/width of SETs [Zhou 2006]
Soft Error Mitigation
Circuit-level approaches
(3) Netlist transformation for soft error mitigation [Almukhaizim 2006] is based on logic-level design modifications.
Figure 8.24: Example of rewiring to reduce the soft error failure rate
Defect and Error Tolerance
Defect Tolerance
– Insert redundant circuitry in a circuit under test
– The circuit can continue correct operation in the presence of defects.
Error Tolerance
– Allow the circuit to continue acceptable operation in the presence of errors
Random Spot Defects
Assume a design consists of N submodules.
Each submodule has n unique positions where a defect would cause it to fail its tests.
D defects are uniformly distributed over the submodules.
The number of defects in any submodule is independent of the number of defects in other submodules.
Defect Probability
Probability that an arbitrary position on a submodule is associated with a defect is:
p = D / (nN)
Probability of having d defects in a given submodule is:
$P(d) = C(n,d) \, p^d (1-p)^{n-d}$
where
C(n,d) = n! / (d!(n-d)!)
Poisson Distribution
Since P(d) is binomially distributed, the average number of defects in an arbitrary submodule is:
E(d) = λ = np = D / N
For large n and small p, the binomial distribution can be approximated by the Poisson distribution:
$P(d) = \lambda^d e^{-\lambda} / d!$
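To see how close the Poisson approximation P(d) = λ^d e^(-λ) / d! is to the binomial P(d) = C(n,d) p^d (1-p)^(n-d), the sketch below evaluates both for the same λ = np. The values of n, N, and D are hypothetical, chosen only so that n is large and p is small.

```python
import math


def binomial_pmf(d: int, n: int, p: float) -> float:
    """P(d) = C(n, d) * p**d * (1 - p)**(n - d)."""
    return math.comb(n, d) * p ** d * (1.0 - p) ** (n - d)


def poisson_pmf(d: int, lam: float) -> float:
    """P(d) = lam**d * exp(-lam) / d!."""
    return lam ** d * math.exp(-lam) / math.factorial(d)


if __name__ == "__main__":
    # Hypothetical design: N = 1000 submodules, n = 10000 positions each, D = 693 defects.
    n_positions, n_submodules, n_defects = 10_000, 1_000, 693
    p = n_defects / (n_positions * n_submodules)   # p = D / (nN)
    lam = n_positions * p                          # lambda = np = D / N
    for d in range(5):
        print(d, f"{binomial_pmf(d, n_positions, p):.6f}", f"{poisson_pmf(d, lam):.6f}")
```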
Example
Assume a submodule is equally likely to be defect-free or defective:
$P(d = 0) = \lambda^0 e^{-\lambda} / 0! = e^{-\lambda} = 0.5$
Thus, λ = 0.693.
Effective yield can increase significantly if the system can accept some defective submodules.
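The example can be checked numerically: a defect-free probability of 0.5 gives λ = -ln(0.5) ≈ 0.693 under the Poisson model. The sketch below also sums P(d) up to a tolerated defect count to show how the usable fraction of submodules grows. Treating "acceptable" as "at most k defects in a submodule" is an assumption made for this illustration, as are the function names.

```python
import math


def poisson_pmf(d: int, lam: float) -> float:
    """P(d) = lam**d * exp(-lam) / d!."""
    return lam ** d * math.exp(-lam) / math.factorial(d)


def lambda_from_yield(defect_free_probability: float) -> float:
    """Invert P(0) = exp(-lam) to get the average defect count per submodule."""
    return -math.log(defect_free_probability)


def effective_yield(lam: float, max_tolerated_defects: int) -> float:
    """Probability that a submodule has at most max_tolerated_defects defects."""
    return sum(poisson_pmf(d, lam) for d in range(max_tolerated_defects + 1))


if __name__ == "__main__":
    lam = lambda_from_yield(0.5)
    print(f"lambda = {lam:.3f}")
    for k in range(3):
        print(f"tolerating up to {k} defect(s): {effective_yield(lam, k):.3f}")
```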
Probability of Having Exactly d Defects at a Submodule as a Function of Yield (Y) for Various Values of Failure Rate λ
The d = 0 row is the yield Y; "-" marks cells with no entry.
d   λ=0.105  λ=0.223  λ=0.357  λ=0.511  λ=0.693  λ=0.916  λ=1.204  λ=1.609  λ=2.303
0   0.90     0.80     0.70     0.60     0.50     0.40     0.30     0.20     0.10
1   0.09     0.18     0.25     0.31     0.35     0.37     0.36     0.32     0.23
2   -        0.02     0.04     0.08     0.12     0.17     0.22     0.26     0.27
3   -        -        0.01     0.01     0.03     0.05     0.09     0.14     0.20
4   -        -        -        -        -        0.01     0.03     0.06     0.12
5   -        -        -        -        -        -        0.01     0.02     0.05
6   -        -        -        -        -        -        -        -        0.02
7   -        -        -        -        -        -        -        -        0.01
Defect Tolerance
Used to be called redundancy repair
A typical defect-tolerant design (see figure):
– Two spares (identical modules)
– A switch used to select one module
[Figure: three identical modules (M) connected to a switch]
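For a design like this one (one module plus two identical spares behind a switch), a first-order estimate of the benefit is the probability that at least one copy is defect-free. The sketch below computes it under two assumptions that are not stated on the slide: module defects are independent and the switch itself never fails.

```python
def yield_with_spares(module_yield: float, spares: int) -> float:
    """Probability that at least one of (1 + spares) identical modules is defect-free.

    Assumes independent modules and a defect-free selection switch.
    """
    if not 0.0 <= module_yield <= 1.0:
        raise ValueError("module_yield must be in [0, 1]")
    if spares < 0:
        raise ValueError("spares must be non-negative")
    return 1.0 - (1.0 - module_yield) ** (1 + spares)


if __name__ == "__main__":
    # Hypothetical module yield of 0.5, with zero, one, and two spares.
    for spares in range(3):
        print(spares, yield_with_spares(0.5, spares))
```

In practice the area added by the spares and the switch lowers the per-module yield somewhat, so the gain shown by this estimate is an upper bound.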
Error Tolerance
The main objective of error tolerance is to increase the effective yield of a process by identifying defective but acceptable chips
This lies in the development of
– An accurate method to estimate error rate
– An effective method to predict yield
Fault-Oriented Test Methodology
Enhance effective yield based on error-rate analysis
– Estimate the error rate of each modeled fault
– A set of acceptable faults is identified based on their error rates
[Figure blocks: IC Fabrication, Fault Ranking, Testing, Acceptable Chips, Unacceptable Chips]
System-on-Chip Test Architectures Ch. 8 – Physical Failures - P. 51
Error-Oriented Test MethodologyError-Oriented Test Methodology
Focus on errors produced by defective chips rather than on modeled faults estimate the error rates of
these chips determine the
acceptability of the faulty chips by estimated results
Error-RateEstimation
EstimatedError Rate
ClassificationBased on Estimated
Error Rate
AcceptableChip Set 1
AcceptableChip Set 2
UnacceptableChips
…
TestingGoodChips
ICFabrication
BadChips
Concluding Remarks
Circuit errors can be caused by manufacturing defects and soft errors.
Design for Manufacturability (DFM) – Fault avoidance schemes to cope with physical failures caused by signal integrity, defects, and process variations during manufacturing.
Design for Reliability (DFR) – Embedded error resilience and defect tolerance circuitry on-chip to tolerate soft errors and manufacturing defects.