
EE141

System-on-Chip Test Architectures Ch. 8 – Physical Failures - P. 1

Chapter 8

Coping with Physical Failures, Soft Errors, and Reliability Issues

What is this chapter about?

Gives an overview of the causes of manufacturing defects and soft errors, and of promising solutions to them.

Focus on:

Signal Integrity
Defect-Based Tests
Process Sensors and Adaptive Design
Soft Errors
– BISER
– Circuit-Level Approaches
Defect and Error Tolerance

Coping with Physical Failures, Soft Errors, and Reliability Issues

Introduction
Signal Integrity
Manufacturing Defects, Process Variations, and Reliability
Soft Errors
Defect and Error Tolerance
Concluding Remarks

Introduction

Defects

Random defects
– Caused by manufacturing imperfections; occur in random places

Systematic defects
– Caused by process or manufacturing variations

Defect level (DL) is a function of process yield (Y) and fault coverage (FC):

$DL = 1 - Y^{(1 - FC)}$
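As a quick numerical illustration of the defect level relation above, the following minimal Python sketch evaluates DL = 1 - Y^(1 - FC). The function name `defect_level` and the sample yield and coverage values are assumptions chosen for illustration; they are not figures from the chapter.

```python
def defect_level(process_yield: float, fault_coverage: float) -> float:
    """Return DL = 1 - Y**(1 - FC), with Y and FC given as fractions."""
    if not (0.0 < process_yield <= 1.0):
        raise ValueError("process_yield must be in (0, 1]")
    if not (0.0 <= fault_coverage <= 1.0):
        raise ValueError("fault_coverage must be in [0, 1]")
    return 1.0 - process_yield ** (1.0 - fault_coverage)


# Hypothetical values: 90% process yield, 99% fault coverage.
dl = defect_level(0.90, 0.99)
print(f"DL = {dl * 1e6:.0f} defective parts per million")
```

With full fault coverage (FC = 1) the defect level drops to zero regardless of yield, and with no testing at all (FC = 0) it equals 1 - Y.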

Concept of Signal Integrity

Signal integrity is the ability of a signal to generate correct responses in a circuit.

A signal with good integrity stays within safe margins for its voltage amplitude and transition time.

Basic Concept of Integrity Loss

Integrity loss (IL): any portion of a signal that exceeds the amplitude-safe and time-safe margins.

$IL = \sum_i \int_{b_i}^{e_i} |V_i - f(t)| \, dt$

where $f(t)$ is the signal, $V_i$ is one of the acceptable amplitude levels, and $[b_i, e_i]$ is a time frame during which integrity loss occurs.
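Integrity loss as defined above is an area between the waveform and a safe amplitude level, accumulated over the time frames in which the signal leaves the safe region. The sketch below approximates that area for a sampled waveform and a single upper amplitude level. The function name `integrity_loss`, the single-level simplification, and the sample waveform are assumptions made for illustration.

```python
def integrity_loss(times, samples, v_safe):
    """Approximate the area by which a sampled waveform exceeds v_safe.

    Applies the trapezoidal rule to max(f(t) - v_safe, 0). Crossing points
    between samples are not interpolated, so accuracy depends on sampling.
    """
    if len(times) != len(samples):
        raise ValueError("times and samples must have the same length")
    excess = [max(v - v_safe, 0.0) for v in samples]
    area = 0.0
    for k in range(1, len(times)):
        dt = times[k] - times[k - 1]
        area += 0.5 * (excess[k] + excess[k - 1]) * dt
    return area


# Hypothetical overshoot: a triangular pulse peaking at 2 V, safe level 1 V.
il = integrity_loss([0, 1, 2, 3, 4], [0.0, 1.0, 2.0, 1.0, 0.0], v_safe=1.0)
print(il)
```

A lower amplitude bound (undershoot) could be handled the same way by integrating max(v_low - f(t), 0) and adding the two areas.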

Sources of Integrity Loss

Interconnects
Power Supply Noise
Process Variations

Integrity Loss Sensors/Monitors (1)

Current Sensor

Current sensors are often used to detect the completion of asynchronous circuits.

Power Supply Noise Sensor

The voltage $V_x$ depends on the power/ground bounce: the higher the power supply noise (PSN), the longer the propagation delay and the higher $V_x$ will be.

Noise Detector (ND) Sensor

The ND sensor is designed to detect integrity loss due to voltage violations.


Integrity Loss Sensor (ILS)

The integrity loss sensor is a delay violation sensor.

Integrity Loss Sensors/Monitors (5)

Jitter Monitor

Jitter is often defined as the time deviation of a signal from its ideal location in time.

A ring oscillator can work as a Process Variation Sensor. The variation of delay caused by PV-faults in any of the inverters in the loop results in a deviation in the frequency of the oscillator, which can be detected.

$f_{RO} = \frac{1}{N_{inv} \cdot T_{inv}}$

where $N_{inv}$ is an odd number of inverters and $T_{inv}$ is the delay of one inverter.

[Equation for $T_{inv}$ not recoverable from the source; the surviving symbols are $C_{Load}$, $V_{dd}$, $C_{ox}$, $V_{GS}$, and $V_t$.]

Readout Architectures (1)

BIST-Based Architecture

When a noise or delay violation occurs (flag=1), the contents of all scan cells are scanned out through Sout for further reliability and diagnosis analysis.

[Figure: BIST architecture and readout circuitry]

Readout Architectures (2)

Scan-Based Architecture

At the driving side of an interconnect, a pattern generation boundary-scan cell (PGBSC) is used to generate test patterns. At the receiving side of the interconnect, an observation boundary-scan cell (OBSC) is used to detect integrity loss.

Readout Architectures (3)

Basic Concept of PV-Test Architecture

On-chip ROs with counters, embedded in a test chip, are used to detect process variation by measuring the ROs' frequency shifts.
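The counter-based readout can be pictured as follows: a counter accumulates ring-oscillator cycles over a fixed gate time, the count is converted to a frequency, and the relative shift from a nominal value is compared against a limit. This is only a sketch; the function names, the gate time, the nominal frequency, and the 5% tolerance are hypothetical and do not come from the chapter.

```python
def ro_frequency(count: int, gate_time_s: float) -> float:
    """Frequency in Hz from the number of RO cycles counted in gate_time_s."""
    if gate_time_s <= 0:
        raise ValueError("gate_time_s must be positive")
    return count / gate_time_s


def frequency_shift(count: int, gate_time_s: float, nominal_hz: float) -> float:
    """Relative deviation of the measured RO frequency from nominal."""
    return (ro_frequency(count, gate_time_s) - nominal_hz) / nominal_hz


def flags_process_variation(count, gate_time_s, nominal_hz, tolerance=0.05):
    """True when the relative shift exceeds the (assumed) tolerance."""
    return abs(frequency_shift(count, gate_time_s, nominal_hz)) > tolerance


# Hypothetical readout: 900 cycles in 1 microsecond against a 1 GHz nominal RO.
print(flags_process_variation(900, 1e-6, 1e9))
```

Because temperature and supply voltage also move the oscillator frequency, a real flow would need to control or calibrate for them before attributing a shift to process variation.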

Manufacturing Defects, Process Variations, and Reliability

100% single stuck-at fault coverage cannot guarantee perfect product quality, because there are remaining defects that are:

Timing-dependent
Sequence-dependent
Attributed to timing-dependent, non-single-stuck-at faults

Structural Tests

A Defect-Based Test Architecture

[Figure: defect-based test flow. Block labels: RTL, Synthesis, Gate-level Netlist, Layout, RC Extraction, Library, Modeling, Timing Analysis, Path Extractor, Critical Path List, ATPG, Structural Tests, Functional Tests, Defect-Based Fault Enumeration, Physical Faults, Fault Mapping, Logical Fault List, Fault List, Defect-Based Fault Simulator, Defect-Based ATPG, Defect-Based Tests]

Defect-Based Tests

Small Delay Defect Tests
Bridge Defect Tests
N-Detect Tests
$I_{ddq}$ Tests
Min$V_{DD}$ Tests
VLV Tests

Reliability Stress

Concept of Infant Mortality

Methods to screen infant mortality

Method I - Burn-in

$ttf = C \cdot e^{E_A / (kT)}$

where ttf is the time to failure, C is a constant, $E_A$ is the activation energy (eV), k is Boltzmann's constant, and T is the absolute temperature.

Method II - Elevated Voltage Stress
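The burn-in relation above implies that raising the temperature shortens the time to failure, which is why burn-in accelerates infant-mortality failures. The sketch below evaluates the relation and the ratio of time to failure at use and stress temperatures, in which the constant C cancels. The function names, the 0.7 eV activation energy, and the two temperatures are assumed values for illustration only.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann's constant in eV/K


def time_to_failure(c, activation_energy_ev, temp_k):
    """ttf = C * exp(E_A / (k * T)), with T in kelvin."""
    return c * math.exp(activation_energy_ev / (BOLTZMANN_EV_PER_K * temp_k))


def acceleration_factor(activation_energy_ev, use_temp_k, stress_temp_k):
    """Ratio ttf(use) / ttf(stress); the constant C cancels out."""
    return math.exp(
        (activation_energy_ev / BOLTZMANN_EV_PER_K)
        * (1.0 / use_temp_k - 1.0 / stress_temp_k)
    )


# Hypothetical case: E_A = 0.7 eV, use at 55 C, burn-in at 125 C.
af = acceleration_factor(0.7, 328.15, 398.15)
print(f"Burn-in acceleration factor: {af:.1f}")
```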

Redundancy and Memory Repair

Redundancy:

Spare rows, columns, or blocks

Repair schemes:

Pellston Technology [Wuu 2005]: If repeated errors are detected, disable the cache line (set the “not to use” bit)

Perform memory BIST at new operating conditions; exclude failing cells and resize the cache (the cache size can grow or shrink, depending on whether the new conditions are more favourable or worse)

Process Sensors and Adaptive Design

Compared with traditional test structures placed on the scribe lines, additional process sensors are embedded on-chip.

On-Chip Process Sensors:

Process Variation Sensor
Thermal Sensor
Dynamic Voltage Scaling

Process Variation Sensor

Ring oscillators: Many factors can affect the frequency of a ring oscillator, such as process variation, temperature, and voltage.

Analog Process Variation Sensor: The analog circuit will be sensitive to different process parameters.

Neither can report the process variation at a specific spot on the die, and the data are unlikely to be extracted and analyzed in real time.

Thermal Sensor

On-chip thermal sensors are the last defence to prevent a system crash or permanent damage to the chip.

Thermal sensor example:

Figure 8.14: Thermal sensor example (signal labels: Vref_diode, Vb, I1, I2, I3, Vref-1 to Vref-n, R1, R2, Vref_TTLEVEL, Tx Detect)

Dynamic Voltage Scaling

Figure 8.15: Dynamic voltage scaling scheme (labels: VccNOM, Frequency, VIDmin, VIDnom, Request frequency change)

Transitions 1 and 3 are in the range of 100s of ps.

Transitions 2 and 4 are in the range of 100s of μs.

Dynamic Voltage Scaling (cont’d)

Use sleep transistors and dynamic biasing to save power

Use the adaptive test method for smart binning

Soft Errors

Introduction

Sources of Soft Errors and SER Trends

Coping with Soft Errors

Introduction

Soft errors

Soft errors are transient single-event upsets (SEUs) caused by various types of radiation.

Cosmic radiation is the major source of soft errors, especially in memories.

Terrestrial radiation is another source of soft errors.

Sources of Soft Errors and SER Trends

If a glitch is induced at the junction (red label) in a memory element, its state can be reversed.

Figure 8.16: Induced soft error on an SRAM

Sources of Soft Errors and SER Trends

Logic circuits are less susceptible to these glitches than memories for the following reasons:

The glitch must be of sufficient strength to propagate from the location of the strike.
The glitch needs to have a functionally sensitized path to be latched.
The glitch must arrive at a latch during its latching window.

Figure 8.18: Masking factors of soft errors in combinational logic

Coping with Soft Errors

As chips are susceptible to soft errors, many soft error protection schemes targeting chip designs have been proposed.

Fault tolerance

Error-resilient microarchitectures

Soft error mitigation

Fault Tolerance

Removing the source of soft errors to improve the reliability of a chip.

Three fundamental fault tolerance schemes:

Hardware (spatial) redundancy
– assumes that defects and radiation particles will hit only a specific device and not another device

Time (temporal) redundancy
– assumes that a radiation strike will not hit the same circuitry again at a slightly later time

Information redundancy
– uses an error-detecting code or error-correcting code to represent information contents
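As a small illustration of information redundancy, the sketch below uses a single even-parity bit, one of the simplest error-detecting codes. It is not a scheme taken from the chapter; the function names and the sample word are chosen here for illustration. A parity bit detects any odd number of bit flips but cannot locate or correct them.

```python
def add_even_parity(bits):
    """Append one parity bit so the codeword has an even number of 1s."""
    return list(bits) + [sum(bits) % 2]


def parity_ok(codeword):
    """True if the codeword still has even parity."""
    return sum(codeword) % 2 == 0


word = [1, 0, 1, 1]
codeword = add_even_parity(word)

# Model a single-event upset by flipping one stored bit.
upset = codeword.copy()
upset[2] ^= 1

print(parity_ok(codeword), parity_ok(upset))
```

Correcting the error as well as detecting it requires more check bits, as in a single-error-correcting code.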

Fault Tolerance

Common fault tolerance schemes used in high-reliability systems:

Duplicate and compare
– used in mainframes and high-end servers

Triple modular redundancy
– used for systems that cannot fail

Redundant multithreading
– runs redundant copies of a thread and compares their results
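The difference between duplicate-and-compare and triple modular redundancy can be sketched at the behavioral level: two copies can only detect a mismatch, whereas three copies allow a majority vote that masks a single faulty copy. The function names and the sample bit patterns below are illustrative assumptions.

```python
def duplicate_and_compare(a: int, b: int) -> bool:
    """True when the two module outputs disagree (detection only)."""
    return a != b


def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority of three module outputs."""
    return (a & b) | (a & c) | (b & c)


good = 0b1010
faulty = 0b0110  # hypothetical output of a module hit by an upset

print(duplicate_and_compare(good, faulty))
print(bin(majority_vote(good, good, faulty)))
```

The vote masks errors only while at most one of the three copies is wrong in any given bit position, which matches the spatial-redundancy assumption that a strike affects one device and not another.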

Error-Resilient Microarchitectures

Two representative error-resilient processor microarchitectures:

DIVA
Razor

DIVA

Dynamic Implementation Verification Architecture (DIVA)

DIVA Checker
– a smaller and simpler shadow processor
– contains a functional checker stage (CHK), a commit stage (CT), and a watchdog timer (WT)

DIVA Core
– the main processor that fetches, decodes, and executes instructions, holding their speculative results in the reorder buffer (ROB)

Razor

Dynamic voltage scaling (DVS) is one of the most effective and widely used methods for power-aware computing.

The key idea of Razor is to tune the supply voltage by monitoring the error rate during circuit operation; this is accomplished with a shadow unit, but the shadow unit has been pushed all the way down into a Razor flip-flop.

This Razor flip-flop is shown in Figure 8.21a.
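The detection idea can be sketched with a purely behavioral model: the main flip-flop samples at the clock edge, the shadow latch samples the same data a little later on a delayed clock, and a mismatch between the two raises the error signal. The function name, the timing values, and the simplifications (no setup time, no metastability, a single data transition per cycle) are assumptions made here and are not details of the actual circuit.

```python
def razor_capture(arrival_time, clock_period, shadow_delay, old_value, new_value):
    """Return (main, shadow, error) for one cycle of an idealized Razor flip-flop.

    The main flip-flop sees new_value only if it arrives by the clock edge;
    the shadow latch, clocked shadow_delay later, gets extra time.
    """
    main = new_value if arrival_time <= clock_period else old_value
    shadow = new_value if arrival_time <= clock_period + shadow_delay else old_value
    error = main != shadow
    return main, shadow, error


# Hypothetical timing in ns: 1.0 ns clock period, 0.3 ns shadow-latch delay.
print(razor_capture(0.9, 1.0, 0.3, old_value=0, new_value=1))  # data on time
print(razor_capture(1.1, 1.0, 0.3, old_value=0, new_value=1))  # data late
```

In a voltage-tuning loop, the observed rate of such error events would indicate whether the supply can be lowered further or must be raised.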

Figure 8.21(a): Schematic of the Razor flip-flop (labels: Logic Stage, Main Flip-Flop, Shadow Latch, comparator, clk_del, Error_L, RAZOR FF)

Razor

A reduced-overhead Razor flip-flop with the metastability detection circuit is illustrated in Figure 8.21b.

Figure 8.21(b): Reduced-overhead Razor flip-flop with metastability detection circuit (labels: Metastability Detector, Shadow Latch, clk_del, clk_del_b, Error_L)

Soft Error Mitigation

Soft error mitigation techniques provide partial immunity of a design to potential soft errors at a significantly lower cost than fault tolerance schemes.

There are three soft error mitigation methods:

(1) Built-In Soft-Error Resilience (BISER)

BISER, proposed in [Mitra 2005], can be used to allow scan design to protect a device from soft errors during normal operation.

Soft Error Mitigation

Figure 8.22 shows the BISER scan cell design that reduces the impact of soft errors affecting storage elements by more than 20 times.

Figure 8.22: Built-in soft-error resilience (BISER) scan cell (labels: Scan portion, System flip-flop, 1D, C1, 2D, C2, UPDATE, CAPTURE, C-element, Keeper)
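The role of the C-element and keeper in the figure can be sketched behaviorally: the output follows the two stored values only when they agree, and otherwise holds its previous state, so an upset in one storage element does not reach the output. The model below uses a non-inverting C-element for simplicity, which is an assumption and may differ in polarity from the actual cell; the function name is also chosen here for illustration.

```python
def c_element(a: int, b: int, previous_output: int) -> int:
    """Non-inverting C-element: follow the inputs when they agree, else hold.

    The 'hold' branch plays the part of the keeper in this simplified model.
    """
    return a if a == b else previous_output


# Both storage elements hold 1, so the output settles to 1.
out = c_element(1, 1, previous_output=0)

# A soft error flips one storage element; the inputs now disagree,
# so the output keeps its previous (correct) value.
out_after_upset = c_element(0, 1, previous_output=out)

print(out, out_after_upset)
```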

Soft Error Mitigation

Circuit-level approaches

(2) Gate resizing for soft error mitigation [Zhou 2006] is based on physical-level design modifications.

Figure 8.23 illustrates the effect of gate resizing on the amplitude and width of a 0-to-1 transient at the output of a gate.

Figure 8.23: Effect of gate resizing on the amplitude/width of SETs [Zhou 2006]

Soft Error Mitigation

Circuit-level approaches

(3) Netlist transformation for soft error mitigation [Almukhaizim 2006] is based on logic-level design modifications.

Figure 8.24: Example of rewiring to reduce the soft error failure rate

Defect and Error Tolerance

Defect Tolerance
– Insert redundant circuitry in a circuit under test
– The circuit can continue correct operation in the presence of defects

Error Tolerance
– Allow the circuit to continue acceptable operation in the presence of errors

Random Spot Defects

Assume a design consists of N submodules.

Each submodule has n unique positions where a defect would cause it to fail its tests.

D defects are uniformly distributed over the submodules.

The number of defects in any submodule is independent of the number of defects in other submodules.

Defect Probability

The probability that an arbitrary position on a submodule is associated with a defect is:

$p = D / (nN)$

The probability of having d defects in a given submodule is:

$P(d) = C(n,d) \, p^d (1-p)^{n-d}$

$C(n,d) = n! / (d!(n-d)!)$

Poisson Distribution

P(d) is binomially distributed, so the average number of defects in an arbitrary submodule is:

$E(d) = \lambda = np = D / N$

For large n and small p, the binomial distribution can be approximated by the Poisson distribution:

$P(d) = \frac{\lambda^d e^{-\lambda}}{d!}$
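The quality of the Poisson approximation can be checked numerically by evaluating both distributions for the same λ = np = D / N. The sketch below does this with the Python standard library; the values of n, N, and D are hypothetical and chosen only so that n is large and p is small.

```python
import math


def binomial_pmf(d: int, n: int, p: float) -> float:
    """P(d) = C(n, d) * p**d * (1 - p)**(n - d)."""
    return math.comb(n, d) * p**d * (1.0 - p) ** (n - d)


def poisson_pmf(d: int, lam: float) -> float:
    """P(d) = lam**d * exp(-lam) / d!."""
    return lam**d * math.exp(-lam) / math.factorial(d)


# Hypothetical design: N submodules, n defect positions each, D defects in total.
n, N, D = 10_000, 100, 50
p = D / (n * N)
lam = n * p  # equals D / N

for d in range(5):
    print(d, round(binomial_pmf(d, n, p), 6), round(poisson_pmf(d, lam), 6))
```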

Example

Assume a submodule is equally likely to be defect-free or defective:

$P(d = 0) = \frac{\lambda^0 e^{-\lambda}}{0!} = e^{-\lambda} = 0.5$

Thus, λ = 0.693. Effective yield can increase significantly if the system can accept some defective submodules.

Probability of Having Exactly d Defects at a Submodule as a Function of Yield (Y) for Various Values of Failure Rate λ

λ        d=0 (Y)   d=1    d=2    d=3    d=4    d=5    d=6    d=7
0.105    0.90      0.09
0.223    0.80      0.18   0.02
0.357    0.70      0.25   0.04   0.01
0.511    0.60      0.31   0.08   0.01
0.693    0.50      0.35   0.12   0.03
0.916    0.40      0.37   0.17   0.05   0.01
1.204    0.30      0.36   0.22   0.09   0.03   0.01
1.609    0.20      0.32   0.26   0.14   0.06   0.02
2.303    0.10      0.23   0.27   0.20   0.12   0.05   0.02   0.01
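Each row of the table follows from the Poisson model: the yield is the probability of zero defects, Y = P(0) = e^(-λ), so λ = -ln Y, and the remaining entries are P(d) for d ≥ 1. The sketch below recomputes a row from its yield; the function name `defect_distribution` is chosen here for illustration.

```python
import math


def defect_distribution(yield_fraction: float, max_defects: int):
    """Return [P(0), ..., P(max_defects)] for a Poisson model with Y = P(0)."""
    lam = -math.log(yield_fraction)
    return [
        lam**d * math.exp(-lam) / math.factorial(d)
        for d in range(max_defects + 1)
    ]


# Row for Y = 0.50, rounded to two decimals as in the table.
row = [round(prob, 2) for prob in defect_distribution(0.50, 3)]
print(row)
```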

Defect Tolerance

Used to be called redundancy repair

A typical defect-tolerant design is shown in the figure
– Two spares (identical modules)
– A switch used to select one module

[Figure: module M with spares and a switch]

Error Tolerance

The main objective of error tolerance is to increase the effective yield of a process by identifying defective but acceptable chips.

This relies on the development of:
– An accurate method to estimate error rate
– An effective method to predict yield

Fault-Oriented Test Methodology

Enhance effective yield based on error-rate analysis
– Estimate the error rate of each modeled fault
– A set of acceptable faults is identified based on their error rates

[Figure: flow with blocks IC Fabrication, Fault Ranking, Testing, Acceptable Chips, Unacceptable Chips]

Error-Oriented Test Methodology

Focus on errors produced by defective chips rather than on modeled faults
– Estimate the error rates of these chips
– Determine the acceptability of the faulty chips from the estimated results

[Figure: flow with blocks IC Fabrication, Testing, Good Chips, Bad Chips, Error-Rate Estimation, Estimated Error Rate, Classification Based on Estimated Error Rate, Acceptable Chip Set 1, Acceptable Chip Set 2, Unacceptable Chips]

Concluding Remarks

Circuit errors can be caused by manufacturing defects and soft errors.

Design for Manufacturability (DFM) – Fault avoidance schemes to cope with physical failures caused by signal integrity, defects, and process variations during manufacturing.

Design for Reliability (DFR) – Embedded error resilience and defect tolerance circuitry on-chip to tolerate soft errors and manufacturing defects.
