
EE141

System-on-Chip Test Architectures Ch. 8 – Physical Failures - P. 1

Chapter 8

Coping with Physical Failures, Soft Errors, and Reliability Issues

What is this chapter about?

Gives an overview of the causes of manufacturing defects and soft errors, and of promising solutions to them.

Focus on:

– Signal Integrity
– Defect-Based Tests
– Process Sensors and Adaptive Design
– Soft Errors
    – BISER
    – Circuit-Level Approaches
– Defect and Error Tolerance

Coping with Physical Failures, Soft Errors, and Reliability Issues

– Introduction
– Signal Integrity
– Manufacture Defects, Process Variations, and Reliability
– Soft Errors
– Defect and Error Tolerance
– Concluding Remarks

Introduction

Defects

Random defects
– Caused by manufacturing imperfections and occur in random places

Systematic defects
– Caused by process or manufacturing variations

Defect level (DL) is a function of process yield (Y) and fault coverage (FC):

$$DL = 1 - Y^{(1 - FC)}$$
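The relation DL = 1 - Y^(1 - FC) is easy to evaluate numerically. The following is a minimal sketch; the function name `defect_level` and the example yield and coverage values are illustrative assumptions, not figures from the slides.

```python
def defect_level(process_yield: float, fault_coverage: float) -> float:
    """Return DL = 1 - Y**(1 - FC), with Y and FC given as fractions."""
    if not (0.0 < process_yield <= 1.0):
        raise ValueError("process_yield must be in (0, 1]")
    if not (0.0 <= fault_coverage <= 1.0):
        raise ValueError("fault_coverage must be in [0, 1]")
    return 1.0 - process_yield ** (1.0 - fault_coverage)


if __name__ == "__main__":
    # Illustrative numbers only: a 90% yield process at several coverage levels.
    for fc in (0.90, 0.99, 1.00):
        dl = defect_level(0.90, fc)
        print(f"FC = {fc:.2f} -> DL = {dl * 1e6:.0f} defective parts per million")
```

With full fault coverage the exponent becomes zero and the defect level drops to zero, while with no coverage the defect level equals 1 - Y.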

Concept of Signal Integrity

Signal integrity is the ability of a signal to generate correct responses in a circuit.

A signal with good integrity stays within safe margins for its voltage amplitude and transition time.

Basic Concept of Integrity Loss

Integrity Loss: any portion of a signal that exceeds the amplitude-safe and time-safe margins.

$$\text{Integrity Loss (IL)} = \sum_i \int_{b_i}^{e_i} \lvert V_i - f(t) \rvert \, dt$$

where $V_i$ is one of the acceptable amplitude levels and $[b_i, e_i]$ is a time frame during which integrity loss occurs.
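As a rough illustration, the integrity-loss integral can be approximated from a sampled waveform. This sketch makes simplifying assumptions that are not in the slides: it considers a single upper amplitude level `v_safe` (overshoot only), and it uses the trapezoidal rule on the clipped excess, so threshold crossings between samples are only approximated. The function name is hypothetical.

```python
def integrity_loss(times, samples, v_safe):
    """Approximate the integral of |v_safe - f(t)| over the spans where f(t) > v_safe.

    times and samples are equal-length sequences describing the waveform f(t).
    """
    if len(times) != len(samples) or len(times) < 2:
        raise ValueError("need at least two matching time/sample points")
    # Excess above the safe level; zero wherever the signal is within margin.
    excess = [max(s - v_safe, 0.0) for s in samples]
    total = 0.0
    for k in range(1, len(times)):
        dt = times[k] - times[k - 1]
        total += 0.5 * (excess[k] + excess[k - 1]) * dt  # trapezoidal rule
    return total
```

A fuller model would repeat the same accumulation for each acceptable level $V_i$, including undershoot below a lower level.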

Sources of Integrity Loss

– Interconnects
– Power Supply Noise
– Process Variations

Integrity Loss Sensors/Monitors (1)

Current Sensor

Current sensors are often used to detect the completion of asynchronous circuits.

Integrity Loss Sensors/Monitors (2)

Power Supply Noise Sensor

The voltage $V_x$ depends on the power/ground bounces: the higher the PSN is, the longer the propagation delay and the higher $V_x$ will be.

Integrity Loss Sensors/Monitors (3)

Noise Detector (ND) Sensor

The ND sensor is designed to detect integrity loss due to voltage violations.

Integrity Loss Sensors/Monitors (4)



Integrity Loss Sensor (ILS)

The integrity loss sensor is a delay violation sensor.

Integrity Loss Sensors/Monitors (5)

Jitter Monitor

Jitter is often defined as the time deviation of a signal from its ideal location in time.

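Process Variation Sensor

A ring oscillator can work as a process variation sensor. The variation of delay caused by PV-faults in any of the inverters in the loop results in a deviation in the frequency of the oscillator, which can be detected.

$$f_{RO} = \frac{1}{N_{inv} \cdot T_{inv}}$$

where $N_{inv}$ is an odd number of inverters and $T_{inv}$ is the delay of one inverter. $T_{inv}$ in turn depends on the load capacitance $C_{Load}$, the supply voltage $V_{dd}$, the transistor width $W$, the effective channel length $L_{eff}$, the oxide thickness $T_{ox}$, and the bias voltages $V_{GS}$, $V_t$, and $V_{DS}$.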
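To make the relation f_RO = 1 / (N_inv · T_inv) concrete, the sketch below computes the oscillator frequency and flags a deviation from the nominal value. The function names and the 5% tolerance are assumptions chosen for illustration; the slides do not specify a detection threshold.

```python
def ro_frequency(n_inv: int, t_inv: float) -> float:
    """Ring-oscillator frequency per the slide's relation f_RO = 1 / (N_inv * T_inv)."""
    if n_inv < 1 or n_inv % 2 == 0:
        raise ValueError("n_inv must be a positive odd number of inverters")
    if t_inv <= 0:
        raise ValueError("t_inv must be positive")
    return 1.0 / (n_inv * t_inv)


def pv_fault_detected(f_measured: float, f_nominal: float, tolerance: float = 0.05) -> bool:
    """Flag a process-variation fault when the relative shift exceeds the tolerance.

    The 5% default is an assumed threshold, not a value from the source.
    """
    return abs(f_measured - f_nominal) / f_nominal > tolerance


if __name__ == "__main__":
    nominal = ro_frequency(5, 20e-12)   # assumed 20 ps inverter delay
    slow = ro_frequency(5, 22e-12)      # one assumed 10% slower process corner
    print(nominal, slow, pv_fault_detected(slow, nominal))
```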

Readout Architectures (1)

BIST-Based Architecture

When a noise or delay violation occurs (flag=1), the contents of all scan cells are then scanned out through Sout for further reliability and diagnosis analysis.

[Figure: BIST architecture and readout circuitry]

Readout Architectures (2)

Scan-Based Architecture

At the driving side of an interconnect, a pattern generation boundary scan cell (PGBSC) is used to generate test patterns. At the receiving side of the interconnect, an observation boundary scan cell (OBSC) is used to detect integrity loss.

Readout Architectures (3)

Basic Concept of PV-Test Architecture

On-chip ROs with counters, embedded in a test chip, are used to detect process variation by measuring the ROs' frequency shifts.

Manufacture Defects, Process Variations, and Reliability

100% single stuck-at fault coverage cannot guarantee perfect product quality, because there are remaining defects that are:

– Timing-dependent
– Sequence-dependent
– Attributed to timing-dependent, non-single-stuck-at faults

Structural Tests

A Defect-Based Test Architecture

[Figure: defect-based test architecture. Blocks: RTL, Synthesis, Library, Gate-level Netlist, Layout, RC Extraction, Timing Analysis, Path Extractor, Critical Path List, Modeling, ATPG, Structural Tests, Functional Tests, Defect-Based Fault Enumeration, Physical Faults, Fault Mapping, Logical Fault List, Fault List, Defect-Based Fault Simulator, Defect-Based ATPG, Defect-Based Tests]

Defect-Based Tests

– Small Delay Defect Tests
– Bridge Defect Tests
– N-Detect Tests
– $I_{ddq}$ Tests
– MinV$_{DD}$ Tests
– VLV Tests

Reliability Stress

Concept of Infant Mortality

Methods to screen infant mortality:

Method I - Burn-in

$$ttf = C \cdot e^{E_A / kT}$$

where ttf is the time to failure, C is a constant, $E_A$ is the activation energy (eV), k is Boltzmann's constant, and T is the absolute temperature.

Method II - Elevated Voltage Stress
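One consequence of the burn-in relation ttf = C · e^(E_A / kT) is that the ratio of time to failure at use temperature to that at stress temperature does not depend on C. The sketch below computes this acceleration factor. The function names, the 0.7 eV activation energy, and the two temperatures are illustrative assumptions only.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann's constant in eV/K


def time_to_failure(c: float, e_a: float, temp_k: float) -> float:
    """ttf = C * exp(E_A / (k * T)), with E_A in eV and T in kelvin."""
    return c * math.exp(e_a / (BOLTZMANN_EV_PER_K * temp_k))


def acceleration_factor(e_a: float, temp_use_k: float, temp_stress_k: float) -> float:
    """Ratio ttf(use) / ttf(stress); the constant C cancels out."""
    return time_to_failure(1.0, e_a, temp_use_k) / time_to_failure(1.0, e_a, temp_stress_k)


if __name__ == "__main__":
    # Assumed values: E_A = 0.7 eV, 55 C use temperature, 125 C burn-in temperature.
    print(acceleration_factor(0.7, 328.15, 398.15))
```

A factor above 1 means that one hour at the stress temperature ages the part as much as that many hours at the use temperature, under this single-mechanism model.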

Redundancy and Memory Repair

Redundancy:

Spare rows, columns, or blocks

Repair schemes:

Pellston Technology [Wuu 2005]: If repeated errors are detected, disable the cache line (set the “not to use” bit)

Perform memory BIST at new operating conditions; exclude failing cells and resize cache (cache size can vary larger or smaller, depending on whether new conditions are more favourable or worse)
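The "disable on repeated errors" idea can be sketched as a small bookkeeping structure. This is not the Pellston implementation: the class name, the per-line error counter, and the threshold of two errors are all hypothetical choices made for illustration.

```python
class CacheLineMonitor:
    """Tracks detected errors per cache line and sets a 'not to use' bit."""

    def __init__(self, num_lines: int, error_threshold: int = 2):
        # error_threshold is an assumed policy parameter, not from the source.
        self.error_counts = [0] * num_lines
        self.not_to_use = [False] * num_lines
        self.error_threshold = error_threshold

    def report_error(self, line: int) -> bool:
        """Record one detected error; return True if the line is now disabled."""
        self.error_counts[line] += 1
        if self.error_counts[line] >= self.error_threshold:
            self.not_to_use[line] = True
        return self.not_to_use[line]

    def usable_lines(self):
        """Indices of cache lines that are still enabled."""
        return [i for i, disabled in enumerate(self.not_to_use) if not disabled]
```

The effective cache size is then simply the number of usable lines, which matches the slide's point that the cache can be resized after excluding failing cells.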

Process Sensors and Adaptive Design

In contrast to traditional test structures placed on the scribe lines, additional process sensors are embedded on-chip.

On-Chip Process Sensors:

– Process Variation Sensor
– Thermal Sensor
– Dynamic Voltage Scaling

Process Variation Sensor

Ring oscillators: Many factors can affect the frequency of the ring oscillator, such as process variation, temperature, and voltage.

Analog Process Variation Sensor: The analog circuit will be sensitive to different process parameters.

Neither can report the process variation at a specific spot on the die, and neither is likely to allow the data to be extracted and analyzed in real time.

Thermal Sensor

On-chip thermal sensors are the last defence to prevent system crash or permanent damage to the chip.

Figure 8.14: Thermal sensor example (labeled elements: Vref_diode, Vb, Vc, I1, I2, I3, Vref-1 to Vref-n, R1, R2, MUX, Δvf, Tx Detect)

Dynamic Voltage Scaling

Figure 8.15: Dynamic voltage scaling scheme (axes: Frequency, Time; marked levels: fMIN, fMAX, VIDmin, VIDnom, VccNOM; event: request frequency change; transitions numbered 1 to 4)

Transitions 1 and 3 are in the range of 100s of ps.

Transitions 2 and 4 are in the range of 100s of μs.

Dynamic Voltage Scaling (cont’d)

Use sleep transistors and dynamic biasing to save power

Use the adaptive test method for smart binning

Soft Errors

Introduction

Sources of Soft Errors and SER Trends

Coping with Soft Errors

Introduction

Soft errors

Soft errors are transient single-event upsets (SEUs) caused by various types of radiation.

Cosmic radiation is the major source of soft errors, especially in memories.

Terrestrial radiation is another source of soft errors.

Sources of Soft Errors and SER Trends

If a glitch is induced at the junction (red label) in a memory element, its state can be reversed.

Figure 8.16: Induced soft error on an SRAM cell

Sources of Soft Errors and SER Trends

Logic circuits are less susceptible to these glitches than memories for the following reasons:

– The glitch must be of sufficient strength to propagate from the location of the strike.
– The glitch needs to have a functionally sensitized path to be latched.
– The glitch must arrive at a latch during its latching window.

Figure 8.18: Masking factors of soft errors in combinational logic

Coping with Soft Errors

As chips are susceptible to soft errors, many soft error protection schemes targeting chip designs have been proposed.

– Fault tolerance
– Error-resilient microarchitectures
– Soft error mitigation

Fault Tolerance

Removing the source of soft errors to improve the reliability of a chip.

Three fundamental fault tolerance schemes:

Hardware (spatial) redundancy
– assumes that defects and radiation particles will hit only a specific device and not another device

Time (temporal) redundancy
– assumes that a radiation strike will not hit the same circuitry again at a slightly later time

Information redundancy
– uses an error-detecting code or error-correcting code to represent information contents

Fault Tolerance

Common fault tolerance schemes used in high-reliability systems:

Duplicate and compare
– used in mainframes and high-end servers

Triple modular redundancy
– used for systems that cannot fail

Redundant multithreading
– runs redundant copies of a thread and compares their results
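Duplicate-and-compare and triple modular redundancy can be sketched in a few lines. The function names are hypothetical, the "units" are plain Python callables standing in for hardware copies, and the voter is a standard bitwise majority function; none of this is taken from a specific system in the slides.

```python
def duplicate_and_compare(unit_a, unit_b, x):
    """Run two copies of a unit on the same input; flag an error if they disagree.

    Detects a mismatch but cannot tell which copy is wrong.
    """
    result_a, result_b = unit_a(x), unit_b(x)
    return result_a, result_a != result_b


def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority vote over three redundant results.

    Each output bit takes the value held by at least two of the three inputs,
    so a fault confined to one copy is masked.
    """
    return (a & b) | (a & c) | (b & c)
```

Duplication only detects an error, so recovery needs a retry or rollback, whereas the majority voter corrects a single faulty copy at the price of a third unit.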

Error-Resilient Microarchitectures

Two representative error-resilient processor microarchitectures:

– DIVA
– Razor

DIVA: Dynamic Implementation Verification Architecture

DIVA Checker
– a smaller and simpler shadow processor
– contains a functional checker stage (CHK), a commit stage (CT), and a watchdog timer (WT)

DIVA Core
– the main processor that fetches, decodes, and executes instructions, holding their speculative results in the reorder buffer (ROB)

Razor

Dynamic voltage scaling (DVS) is one of the most effective and widely used methods for power-aware computing.

The key idea of Razor is to tune the supply voltage by monitoring the error rate during circuit operation; this is accomplished with a shadow unit, but this shadow unit has been pushed all the way down into a Razor flip-flop.

This Razor flip-flop is shown in Figure 8.21a.
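A behavioral sketch of a Razor-style flip-flop may help: the main flip-flop samples at the clock edge, the shadow latch samples the same data a little later on a delayed clock, and a mismatch raises the error signal. The timing model here is a deliberate simplification (a single data arrival time per cycle, and a shadow latch assumed to always see settled data in practice); all names and parameters are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RazorSample:
    main_value: int    # value captured by the main flip-flop
    shadow_value: int  # value captured by the shadow latch on the delayed clock
    error: bool        # True when the two captured values disagree


def razor_capture(old_value, new_value, arrival_time, clock_period, shadow_delay):
    """Model one clock cycle of a Razor-style flip-flop.

    The data input changes from old_value to new_value at arrival_time,
    measured from the launching clock edge.
    """
    main_value = new_value if arrival_time <= clock_period else old_value
    shadow_value = new_value if arrival_time <= clock_period + shadow_delay else old_value
    return RazorSample(main_value, shadow_value, main_value != shadow_value)


if __name__ == "__main__":
    # Lowering the supply voltage slows the logic; here the data arrives late.
    print(razor_capture(0, 1, arrival_time=1.2, clock_period=1.0, shadow_delay=0.3))
```

In a complete scheme the error flag would trigger recovery from the shadow value and feed an error-rate monitor that steers the supply voltage; that control loop is omitted here.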

Figure 8.21(a): Schematic of the Razor flip-flop (labeled elements: Logic Stage L1, Logic Stage L2, Main Flip-Flop, Shadow Latch, comparator, multiplexer inputs 0 and 1, D1, Q1, clk, clk_del, Error_L, Error)

Razor

A reduced-overhead Razor flip-flop with the metastability detection circuit is illustrated in Figure 8.21b.

Figure 8.21(b): Reduced-overhead Razor flip-flop with metastability detection circuit (labeled elements: D, Q, Shadow Latch, Metastability Detector, Inv_n, Inv_p, clk, clk_b, clk_del, clk_del_b, Error_L)

Soft Error Mitigation

Soft error mitigation techniques aim to give a design partial immunity to potential soft errors at a significantly lower cost than fault tolerance schemes.

There are three soft error mitigation methods:

(1) Built-In Soft-Error Resilience (BISER)

BISER, proposed in [Mitra 2005], can be used to allow scan design to protect a device from soft errors during normal operation.

Soft Error Mitigation

Figure 8.22 shows the BISER scan cell design that reduces the impact of soft errors affecting storage elements by more than 20 times.

Figure 8.22: Built-in soft-error resilience (BISER) scan cell (labeled elements: scan portion with latches LA and LB, system flip-flop, C-element, keeper; signals D, CLK, Q, SI, SO, SCA, SCB, CAPTURE, UPDATE, TEST, PH1, PH2, O1, O2)

Soft Error Mitigation

Circuit-level approaches

(2) Gate resizing for soft error mitigation [Zhou 2006] is based on physical-level design modifications.

Figure 8.23 illustrates the effect of gate resizing on the amplitude and width of a 0-to-1 transient at the output of a gate.

Figure 8.23: Effect of gate resizing on the amplitude/width of SETs [Zhou 2006]

Soft Error Mitigation

Circuit-level approaches

(3) Netlist transformation for soft error mitigation [Almukhaizim 2006] is based on logic-level design modifications.

Figure 8.24: Example of rewiring to reduce the soft error failure rate

Defect and Error Tolerance

Defect Tolerance
– Insert redundancy circuitry in a circuit under test
– The circuit can continue correct operation in the presence of defects.

Error Tolerance
– Allow the circuit to continue acceptable operation in the presence of errors

Random Spot Defects

Assume a design consists of N submodules.

Each submodule has n unique positions where a defect would cause it to fail its tests.

D defects are uniformly distributed over the submodules.

The number of defects in any submodule is independent of the number of defects in other submodules.

Defect Probability

The probability that an arbitrary position on a submodule is associated with a defect is:

$$p = \frac{D}{nN}$$

The probability of having d defects in a given submodule is:

$$P(d) = C(n,d)\, p^{d} (1-p)^{n-d}$$

where

$$C(n,d) = \frac{n!}{d!\,(n-d)!}$$
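The two formulas above, p = D / (nN) and the binomial P(d), translate directly into code. This is a minimal sketch with hypothetical function names; the numbers in the usage lines are made up for illustration.

```python
from math import comb


def position_defect_probability(total_defects: int, positions_per_module: int,
                                num_modules: int) -> float:
    """p = D / (n * N): chance that one position on a submodule holds a defect."""
    return total_defects / (positions_per_module * num_modules)


def prob_d_defects(d: int, n: int, p: float) -> float:
    """Binomial probability P(d) = C(n, d) * p**d * (1 - p)**(n - d)."""
    return comb(n, d) * p ** d * (1.0 - p) ** (n - d)


if __name__ == "__main__":
    # Assumed example: D = 10 defects, n = 100 positions, N = 10 submodules.
    p = position_defect_probability(10, 100, 10)
    for d in range(4):
        print(d, prob_d_defects(d, 100, p))
```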

Poisson Distribution

Since P(d) is binomially distributed, the average number of defects in an arbitrary submodule is:

$$E(d) = \lambda = np = \frac{D}{N}$$

For large n and small p, the binomial distribution can be approximated by the Poisson distribution:

$$P(d) = \frac{e^{-\lambda} \lambda^{d}}{d!}$$
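How close the Poisson approximation gets can be checked numerically by evaluating both distributions with λ = np. The sketch below uses hypothetical function names, and the values n = 1000 and p = 0.001 are chosen purely for illustration.

```python
from math import comb, exp, factorial


def binomial(d: int, n: int, p: float) -> float:
    """Exact binomial probability of d defects among n positions."""
    return comb(n, d) * p ** d * (1.0 - p) ** (n - d)


def poisson(d: int, lam: float) -> float:
    """Poisson probability P(d) = exp(-lam) * lam**d / d!."""
    return exp(-lam) * lam ** d / factorial(d)


if __name__ == "__main__":
    n, p = 1000, 0.001          # assumed: large n, small p, so lam = n * p = 1
    lam = n * p
    for d in range(5):
        print(d, binomial(d, n, p), poisson(d, lam))
```

For these values the two columns agree to about three decimal places, which is the regime the slide describes.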

Example

Assume a submodule is equally likely to be defect-free or defective:

$$P(d = 0) = \frac{e^{-\lambda} \lambda^{0}}{0!} = e^{-\lambda} = 0.5$$

Thus, λ = 0.693.

Effective yield can increase significantly if the system can accept some defective submodules.

Probability of Having Exact d Defects at a Submodule as a Function of Yield (Y) for Various Values of Failure Rate λ

λ       Y (d=0)   d=1    d=2    d=3    d=4    d=5    d=6    d=7
0.105   0.90      0.09
0.223   0.80      0.18   0.02
0.357   0.70      0.25   0.04   0.01
0.511   0.60      0.31   0.08   0.01
0.693   0.50      0.35   0.12   0.03
0.916   0.40      0.37   0.17   0.05   0.01
1.204   0.30      0.36   0.22   0.09   0.03   0.01
1.609   0.20      0.32   0.26   0.14   0.06   0.02
2.303   0.10      0.23   0.27   0.20   0.12   0.05   0.02   0.01
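The table entries follow from the Poisson model once the yield is identified with the defect-free probability, Y = P(0) = e^(-λ), so that λ = -ln(Y). The sketch below regenerates one row under that reading; the function names are hypothetical.

```python
from math import exp, factorial, log


def failure_rate_from_yield(y: float) -> float:
    """Solve Y = exp(-lam) for lam, assuming yield equals P(d = 0)."""
    if not (0.0 < y <= 1.0):
        raise ValueError("yield must be in (0, 1]")
    return -log(y)


def defect_distribution(y: float, max_d: int = 7):
    """Poisson probabilities P(0) .. P(max_d) for the failure rate implied by y."""
    lam = failure_rate_from_yield(y)
    return [exp(-lam) * lam ** d / factorial(d) for d in range(max_d + 1)]


if __name__ == "__main__":
    for y in (0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1):
        row = " ".join(f"{v:.2f}" for v in defect_distribution(y))
        print(f"lambda = {failure_rate_from_yield(y):.3f}: {row}")
```

Entries that print as 0.00 correspond to the cells left blank in the table.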

Defect Tolerance

Used to be called redundancy repair

A typical defect-tolerant design is shown on the left:

– Two spares (identical modules)
– A switch used to select one module

[Figure: three identical modules M connected to a switch]
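The benefit of spares can be estimated with a simple model. The sketch below assumes, beyond what the slide states, that modules fail independently with the same yield, that the switch is defect-free, and that the extra area does not change the per-module yield; the function name is hypothetical.

```python
def yield_with_spares(module_yield: float, num_spares: int) -> float:
    """Probability that at least one of (1 + num_spares) identical modules is good.

    Assumes independent module failures and a perfect selection switch.
    """
    if not (0.0 <= module_yield <= 1.0):
        raise ValueError("module_yield must be in [0, 1]")
    if num_spares < 0:
        raise ValueError("num_spares must be non-negative")
    return 1.0 - (1.0 - module_yield) ** (num_spares + 1)


if __name__ == "__main__":
    # With the slide's two spares and an assumed 50% module yield.
    print(yield_with_spares(0.5, 2))
```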

Error Tolerance

The main objective of error tolerance is to increase the effective yield of a process by identifying defective but acceptable chips.

This lies in the development of:

– An accurate method to estimate error rate
– An effective method to predict yield

Fault-Oriented Test Methodology

Enhance effective yield based on error-rate analysis

– Estimate the error rate of each modeled fault
– A set of acceptable faults is identified based on their error rates

[Figure: flow diagram with blocks IC Fabrication, Fault Ranking, Testing, Acceptable Chips, Unacceptable Chips]

Error-Oriented Test Methodology

Focus on errors produced by defective chips rather than on modeled faults

– Estimate the error rates of these chips
– Determine the acceptability of the faulty chips from the estimated results

[Figure: flow diagram with blocks IC Fabrication, Testing, Good Chips, Bad Chips, Error-Rate Estimation, Estimated Error Rate, Classification Based on Estimated Error Rate, Acceptable Chip Set 1, Acceptable Chip Set 2, …, Unacceptable Chips]
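The classification step of the error-oriented flow can be sketched as a threshold lookup on the estimated error rate. The function name and the threshold values are hypothetical; the slides do not say how acceptance limits are chosen.

```python
def classify_chip(estimated_error_rate: float, thresholds) -> str:
    """Assign a faulty chip to the first acceptable set whose limit it meets.

    thresholds is an ascending list of maximum error rates, one per
    acceptable chip set (assumed, application-specific values).
    """
    for k, limit in enumerate(thresholds, start=1):
        if estimated_error_rate <= limit:
            return f"Acceptable Chip Set {k}"
    return "Unacceptable Chips"


if __name__ == "__main__":
    limits = [0.001, 0.01]  # assumed limits for two acceptable sets
    for rate in (0.0005, 0.005, 0.5):
        print(rate, classify_chip(rate, limits))
```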

Concluding Remarks

Circuit errors can be caused by manufacturing defects and soft errors.

Design for Manufacturability (DFM) – Fault avoidance schemes to cope with physical failures caused by signal integrity, defects, and process variations during manufacturing.

Design for Reliability (DFR) – Embedded error resilience and defect tolerance circuitry on-chip to tolerate soft errors and manufacturing defects.
