Design of Soft Error Tolerant Memory and Logic Circuits

Shah M. Jahinuzzaman, PhD Student
http://vlsi.uwaterloo.ca/~smjahinu

Graduate Student Research Talks, E&CE
January 16, 2006

CMOS Design and Reliability Group


http://www.ece.uwaterloo.ca/~cdr/

Outline

• What is soft error
• Soft error sources and mechanism
• Soft error in logic circuits
• Soft error in memories
• Effect of technology scaling on soft error
• Mitigation techniques
• Summary


What is Soft Error?

• Transient data upset due to a particle strike: ‘1’ → ‘0’ or ‘0’ → ‘1’
• Minimum charge required for an upset: Qcrit
• No damage to hardware: a REWRITE or RESET can restore the changed data
• Random in time and space
• Affects latches, flip-flops, memory blocks, and even combinational logic circuits

[Figure: example bit pattern 1 0 1 1 0 1 0 0 1]


Soft Error Rate

• Typically expressed in FIT (failures in time): 1 FIT = 1 failure per 10^9 device-hours
• Sum of typical hard failure rates ≈ 50-200 FIT (oxide breakdown, latch-up, etc.)
• Soft failure rate in an unprotected chip ≈ 50,000 FIT
• Critical reliability concerns: microprocessors with large caches (e.g., in servers), SRAM-based FPGAs and ASICs, aircraft controllers, space-borne electronics, life-support devices such as cardiac defibrillators
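To put the FIT figures above in more familiar units, here is a minimal Python sketch that converts a FIT rate into a mean time between failures and an expected number of failures per year. The function names are illustrative, and a constant failure rate is assumed.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 h, ignoring leap years

def fit_to_mtbf_hours(fit):
    """Mean time between failures in hours; 1 FIT = 1 failure per 1e9 device-hours."""
    return 1e9 / fit

def failures_per_year(fit, devices=1):
    """Expected failures per year for `devices` parts, assuming a constant rate."""
    return fit * devices * HOURS_PER_YEAR / 1e9

soft_fit = 50_000  # unprotected chip, from the slide
hard_fit = 200     # upper end of the hard-failure range, from the slide

print(fit_to_mtbf_hours(soft_fit))  # hours between soft failures
print(failures_per_year(soft_fit))  # expected soft failures per chip-year
print(failures_per_year(hard_fit))  # expected hard failures per chip-year
```

Under these assumptions, 50,000 FIT works out to one soft failure roughly every 20,000 hours per chip, a little over two years.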


Source: www.fda.gov

Sources of soft error

• High energy (~MeV) particles
  – Alpha particles (~4-9 MeV)
  – Cosmic neutrons (~10-200 MeV)
  – Thermal neutrons and 10B in borophosphosilicate glass (BPSG)
• Only 3.6 eV is required to create 1 electron-hole pair (EHP) in Si
• BPSG is no longer a concern after the 0.25 μm technology node
• Alpha particles come from chip packaging materials
• Neutrons come from cosmic rays and are ever present (background radiation)
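The 3.6 eV per electron-hole pair figure gives a quick upper bound on how much charge a particle can generate. The sketch below is an idealized estimate that assumes all of the particle's energy goes into creating EHPs in silicon; the function name is illustrative.

```python
ELECTRON_CHARGE_C = 1.602176634e-19  # elementary charge in coulombs
EV_PER_EHP_SI = 3.6                  # energy per electron-hole pair in Si, from the slide

def deposited_charge_fc(energy_mev):
    """Charge in fC if all of `energy_mev` goes into creating EHPs (idealized)."""
    pairs = energy_mev * 1e6 / EV_PER_EHP_SI
    return pairs * ELECTRON_CHARGE_C * 1e15

for energy in (1.0, 4.0, 9.0):
    print(f"{energy} MeV -> {deposited_charge_fc(energy):.1f} fC")
```

Each MeV deposited corresponds to about 44.5 fC under this assumption.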


Alpha Particle

• Doubly ionized helium atom (4He2+)
• Sources: Pb in solders and U, Th in IC packaging materials; major concern: solder balls in flip-chip packages
• Penetrates 25 μm in Si
• Can be shielded by an ‘epoxy layer’ (not in flip-chip)


Cosmic Neutron

• Comes from the sun or inter-galactic rays
• Generates EHPs indirectly through Si recoil
• Cannot be practically shielded: 1 ft of concrete lowers the neutron flux by only 1.4x
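To see why concrete shielding is impractical, the following sketch extrapolates the 1.4x-per-foot figure. It assumes each additional foot attenuates the flux by the same factor (simple exponential attenuation), which is an assumption made here and not a claim from the talk.

```python
import math

REDUCTION_PER_FOOT = 1.4  # flux reduction for 1 ft of concrete, from the slide

def flux_reduction(thickness_ft):
    """Flux reduction factor, assuming each extra foot attenuates by the same 1.4x."""
    return REDUCTION_PER_FOOT ** thickness_ft

def thickness_for_reduction(factor):
    """Concrete thickness in feet needed for a given reduction (same assumption)."""
    return math.log(factor) / math.log(REDUCTION_PER_FOOT)

print(flux_reduction(3))            # reduction from 3 ft of concrete
print(thickness_for_reduction(10))  # feet needed for a 10x reduction
```

Under this assumption, a 10x reduction would need close to 7 ft of concrete.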


Relative Influence of SE Sources

• An alpha particle deposits 4-16 fC/μm; a neutron (Si recoil) deposits 25-150 fC/μm
• Rate-limiting SE source in scaled-down devices with high-purity materials: cosmic neutrons


Basic Mechanism of SE


• Qcoll = ηQdep, η being the collection efficiency
• Qcoll generates a current transient
• Qcoll depends on doping, collection volume, node voltage, carrier mobility, etc.
• Qcoll > Qcrit → soft error

R. Baumann, IEEE Design and Test of Computers, pp. 258-266, May-June 2005
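The upset condition Qcoll > Qcrit can be written as a small check. In this sketch the deposited charge is modeled as charge per micron times path length, which is a simplifying assumption, and every numeric value is hypothetical.

```python
def collected_charge_fc(let_fc_per_um, path_um, efficiency):
    """Qcoll = eta * Qdep, with Qdep modeled as charge per micron * path length."""
    q_dep = let_fc_per_um * path_um
    return efficiency * q_dep

def causes_soft_error(q_coll_fc, q_crit_fc):
    """A strike upsets the node when Qcoll exceeds Qcrit."""
    return q_coll_fc > q_crit_fc

# Hypothetical numbers for illustration only (not from the talk)
q_coll = collected_charge_fc(let_fc_per_um=10.0, path_um=1.0, efficiency=0.5)
print(q_coll, causes_soft_error(q_coll, q_crit_fc=2.0))
```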

Soft Error in Logic Circuits

• Also referred to as Single Event Transient (SET)
• Less troublesome: lower density compared to memories, and activity dependent
• Naturally masked by three mechanisms
  – Logical masking
  – Electrical masking
  – Latching window masking
• Increasing concern with scaling (90 nm onward)

“Robust enterprise platforms in sub-65nm technologies require design with built-in logic soft error protection,” S. Mitra, Intel Corp.


Logical Masking

• Struck node has to be in controlling state for a transient to pass from input to output
• In order for an error to propagate, there must be a sensitized path along the logic chain

A  B  NAND
0  0  1
0  1  1
1  0  1
1  1  0

Annotation (rows with A = 0): Output does not depend on B; A is in controlling state
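Logical masking on the NAND gate can be checked exhaustively. The sketch below flips one input, as a particle strike would, and reports whether the output changes; the function names are illustrative.

```python
from itertools import product

def nand(a, b):
    """Two-input NAND on 0/1 values."""
    return 0 if (a and b) else 1

def flip_propagates(a, b, struck):
    """True if flipping the struck input ('a' or 'b') changes the NAND output."""
    good = nand(a, b)
    bad = nand(1 - a, b) if struck == "a" else nand(a, 1 - b)
    return good != bad

for a, b in product((0, 1), repeat=2):
    print(a, b, "flip on B propagates:", flip_propagates(a, b, "b"))
```

A flip on B reaches the output only when A = 1; with A = 0 the output stays at 1 and the transient is logically masked.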

Electrical Masking

• Digital circuits have finite bandwidth and finite rise and fall times.
• Transients with bandwidths higher than the cut-off frequency are attenuated as they propagate (amplitude ↓, rise and fall times ↑) and eventually disappear.
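A rough feel for electrical masking comes from treating one logic stage as a first-order low-pass filter. This is a simplification assumed here, not the model used in the talk, and the time constant is hypothetical. For a rectangular input pulse of width w, the output of such a stage peaks at A(1 - exp(-w/τ)).

```python
import math

def peak_after_rc_stage(pulse_width_ps, tau_ps, amplitude=1.0):
    """Peak output of a first-order low-pass stage driven by a rectangular pulse."""
    return amplitude * (1.0 - math.exp(-pulse_width_ps / tau_ps))

TAU_PS = 20.0  # hypothetical stage time constant
for width in (10.0, 50.0, 200.0):
    print(width, round(peak_after_rc_stage(width, TAU_PS), 3))
```

Pulses much narrower than the time constant never reach half the supply swing in this model, so the next gate is unlikely to see them as a logic transition.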

Latching Window Masking

• A transient cannot be latched into a FF/register unless it occurs within the clock window

[Figure: timing diagram. Labels: "Transient has to occur here to be latched"; "Transient is not latched"]
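Latching window masking can be illustrated by sweeping the arrival time of a transient across one clock period and counting how often it overlaps the latching window. The sketch assumes that any overlap is enough to latch the transient, which is a pessimistic simplification, and all timing values are hypothetical.

```python
def overlaps_window(t_start, width, win_open, win_close):
    """True if the pulse [t_start, t_start + width] overlaps [win_open, win_close]."""
    return t_start < win_close and t_start + width > win_open

def latch_fraction(period, width, win_open, win_close, steps=10000):
    """Fraction of evenly spaced arrival times in one period that hit the window."""
    hits = sum(
        overlaps_window(i * period / steps, width, win_open, win_close)
        for i in range(steps)
    )
    return hits / steps

# Hypothetical timing in picoseconds: 50 ps transient, 50 ps latching window
print(latch_fraction(period=1000, width=50, win_open=900, win_close=950))
print(latch_fraction(period=500, width=50, win_open=400, win_close=450))
```

In this model the latching fraction is about (pulse width + window width) / clock period, so halving the clock period roughly doubles the chance that a transient is captured.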

Soft Error in Memories

• No masking effects, high density → most susceptible to soft error
• Memories to consider:
  – Main memory (DRAM)
  – Cache memory (SRAM)
• Soft errors change the stored bits, which may lead to catastrophic failures of microprocessors, SRAM-based FPGAs, etc.


Soft Error in DRAM

• Higher capacitance (3D, trench-like), smaller charge collection area, periodic refresh → decreasing bit error rate, constant system error rate

R. Baumann, IEEE Design and Test of Computers, pp. 258-266, May-June 2005

[Figure: DRAM cell array. Labels: word lines, bit lines, trench capacitors, metal bit line, poly word line]

Soft Error in SRAM

• Larger area per bit than DRAM; signal charge is stored by two cross-coupled inverters
• Two nodes are prone to particle hits; one node is more sensitive
• Critical reliability issue


Scaling Trends of SE

[Figure. Vertical axis: System SER in Semiconductor Memory (FITs)]

Source: Semico Research Inc. (June 2002)

R. Baumann, IEEE Design and Test of Computers, pp. 258-266, May-June 2005

• Signal charge is reduced: Q=CV, both C and V are scaled

• Particles with lower energy can cause soft error
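The effect of scaling on signal charge follows directly from Q = CV. The capacitance and supply values below are hypothetical and only show the trend; they are not taken from the talk.

```python
def signal_charge_fc(cap_ff, vdd_v):
    """Q = C * V; femtofarads times volts gives femtocoulombs."""
    return cap_ff * vdd_v

# Hypothetical node capacitances and supply voltages, for illustration only
older_node = signal_charge_fc(cap_ff=2.0, vdd_v=1.8)
scaled_node = signal_charge_fc(cap_ff=1.0, vdd_v=1.0)
print(older_node, scaled_node)
```

Because both C and V shrink, the stored charge drops faster than either factor alone, so a smaller deposited charge is enough to disturb the node.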

SE Sensitivity with Scaling

Process Node      Application              Soft Error Protection Required
180nm to 130nm    Consumer                 None
180nm to 130nm    Networking and storage   Memory
180nm to 130nm    Military and aerospace   Memory and logic
90nm              Consumer                 Memory and logic
90nm              Networking and storage   Memory and logic
90nm              Military and aerospace   Memory and logic
65nm and below    Consumer                 Memory and logic
65nm and below    Networking and storage   Memory and logic
65nm and below    Military and aerospace   Memory and logic

Source: iRoC Technologies & www.edn.com


Existing Mitigation Techniques

• Layout level
  – Reduction of sensitive area, using an extra doping layer (an epitaxial layer can help), SOI, etc.
• Circuit level
  – Circuit techniques to reduce sensitivity to transients
• System level
  – Space and time redundancy, parity protection (error detection only), Error Correction Code (ECC), Error Detection and Correction Code (EDAC)
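Parity protection is the simplest of the system-level techniques listed, and a short sketch shows why it only detects errors. The example uses even parity over an 8-bit word; the word value and function names are illustrative.

```python
def even_parity_bit(bits):
    """Parity bit that makes the total number of 1s even."""
    return sum(bits) % 2

def parity_error(bits, stored_parity):
    """True if the word no longer matches its stored parity bit."""
    return even_parity_bit(bits) != stored_parity

word = [1, 0, 1, 1, 0, 1, 0, 0]
parity = even_parity_bit(word)

upset = word.copy()
upset[3] ^= 1   # single-bit soft error
print(parity_error(upset, parity))   # mismatch is flagged, position unknown

double = upset.copy()
double[5] ^= 1  # second flip in the same word
print(parity_error(double, parity))  # an even number of flips goes unnoticed
```

A single parity bit cannot say which bit flipped, so correction needs the extra check bits that ECC/EDAC schemes add.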


Circuit Level Mitigation

[Figure: circuit schematics. Labels: VDD, WL, BL, BLB, R]

Cypress Semiconductors
Ootsuka et al., IEDM 1998
P. Roche et al., IRPS 2004
T. M. Mnich et al., IEEE Trans. Nucl. Sci., p. 4620, 1983

System Level Mitigation


• Redundancy, majority voting

• Parity protection, EDAC/ECC
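Redundancy with majority voting can be sketched in a few lines. The example keeps three copies of a word (space redundancy) and votes bit by bit; the stored value and the flipped bit are hypothetical.

```python
def majority(a, b, c):
    """Bitwise majority vote over three redundant copies of a word."""
    return (a & b) | (a & c) | (b & c)

stored = 0b10110100
copies = [stored, stored, stored]
copies[1] ^= 0b00001000  # soft error flips one bit in one copy

print(bin(majority(*copies)))
```

The vote recovers the original word as long as no bit position is corrupted in more than one copy, at the cost of three times the storage.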

Design of SE Tolerant SRAMs

• SRAM in SoC: currently 50%; expected to reach ~90% by the end of the decade
• SE tolerance of SRAM will determine the system reliability
• Scaling and low power approaches for SRAMs are making SE immunity harder to achieve
• Circuit and system level hardening within area-power-performance constraints is essential: the motivation of my research


Summary

• Soft errors cause silent data corruption; the probability increases with technology scaling
• Both memory and logic circuits are susceptible
• Logic circuits have inherent masking mechanisms
  – Higher frequency makes them vulnerable
• Memories, e.g., SRAM, are the most vulnerable
• Layout, circuit and system level mitigation techniques are used
• Mitigation techniques incur cost and degrade performance

It’s not a Microsoft error
It’s just a soft error!

THANK YOU

Wait!