CML CSE 591: Advances in Reliable Computing, Aviral Shrivastava


Post on 28-Dec-2015


Page 1: CSE 591: Advances in Reliable Computing

Aviral Shrivastava
CML

Page 2: Saving Galileo

Web page: aviral.lab.asu.edu

1978: Galileo commissioned for Jupiter exploration
1980: Design and architecture decided
  Use of AT 2901 for attitude control
1982: Voyager reaches Jupiter
  Intermittent resets
  Sulfur ions from Jupiter’s volcanic moon were being whipped up to high energy by the Jovian gravity.
After extensive testing of Galileo, chief engineer decided “not worth flying if soft error problem not solved”
Overheads: 5 years, 5 million dollars
  Sandia National Laboratories was subcontracted to custom-make radiation-hardened AT 2901

Page 3: Radiation Induced Soft Errors

Typically: [symbol missing] = 1.64 × 10^-10 sec, [symbol missing] = 5.10 × 10^-11 sec
Induced current has a rapid rise time but a more gradual fall time
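A pulse with a rapid rise and a more gradual fall is often modeled as a double exponential. The sketch below assumes that form, and it assumes the two values quoted on the slide are its fall (1.64e-10 s) and rise (5.10e-11 s) time constants; the slide's own symbols did not survive, so this pairing is a guess. The charge passed as `q_total` is a hypothetical illustrative value.

```python
import math

TAU_FALL = 1.64e-10  # s, assumed to be the slower (decay) time constant
TAU_RISE = 5.10e-11  # s, assumed to be the faster (rise) time constant


def induced_current(t, q_total, tau_fall=TAU_FALL, tau_rise=TAU_RISE):
    """Double-exponential current (A) at time t (s); integrates to q_total (C)."""
    if t < 0:
        return 0.0
    scale = q_total / (tau_fall - tau_rise)
    return scale * (math.exp(-t / tau_fall) - math.exp(-t / tau_rise))


def peak_time(tau_fall=TAU_FALL, tau_rise=TAU_RISE):
    """Time of the current peak, obtained by setting dI/dt = 0."""
    return (tau_fall * tau_rise / (tau_fall - tau_rise)) * math.log(tau_fall / tau_rise)


def integrate_charge(q_total, t_end=5e-9, steps=50_000):
    """Trapezoidal integral of the current over [0, t_end]."""
    dt = t_end / steps
    total = 0.0
    for i in range(steps):
        a = induced_current(i * dt, q_total)
        b = induced_current((i + 1) * dt, q_total)
        total += 0.5 * (a + b) * dt
    return total
```

Under these assumptions the peak falls between the two time constants, and integrating the pulse recovers the total collected charge.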

Page 4: It started with nuclear tests…

1954-57: Nuclear tests
  Electronic anomalies in monitoring equipment
  Could not be traced to any hardware fault
  Equipment worked properly after restart
1962: Wallmark and Marcus (RCA Labs, Princeton)
  "Minimum Size and Maximum Packing Density of Non-Redundant Semiconductor Devices," March 1962
  Predicted that cosmic rays would start affecting microelectronics
1962: Telstar, first communication satellite
July 9, 1962: Starfish Prime
  United States tested a high-altitude nuclear device (called Starfish Prime) which super-energized the Earth's Van Allen Belt where Telstar took orbit
  100X increase in radiation
  Rendered the satellite inoperable
  Worked after reboot

Page 5: Radioactive Contamination

1978: Intel could not deliver chips to AT&T to upgrade switching system from mechanical relays to ICs
  May and Woods traced problem to packaging
  Packaging modules were contaminated with uranium from an old uranium mine upstream.
  Also proposed the Q_critical model of soft errors
    Q_critical must be overcome by accumulated charge generated by particle strike to cause a fault.
1986-87: IBM faced problems of radioactive contamination
  Traced problem to a distant chemical plant that used a radioactive contaminant to clean bottles that were used to store an acid required in the chip manufacturing process.

Page 6: History of Radiation-induced SERs

1979: Ziegler and Lanford
  Presented solid evidence that the sensitivity of electronics to radiation-induced soft errors could become a nightmare for future technologies.
  Predicted that soft errors due to cosmic radiation would increase with altitude
1995: Baumann et al.
  Soft errors caused by Boron-10 isotopes activated by low-energy atmospheric neutrons.
1996: Normand
  Documented strikes in large servers found in error logs
  Discovered that memory error rates were very significantly correlated to the altitude of the computers; attributed them to soft errors (Z&L)
  High in servers in Los Alamos, and in fighter planes.
  “Single Event Upset at Ground Level,” IEEE Transactions on Nuclear Science, Vol. 43, No. 6, December 1996.

Page 7: Here comes the Sun…

11 year solar cycle of sun-spots
  Major solar storms this year and next
10^9 kg/s of material lost by the Sun as ejected solar wind.
  Protons (~70%), electrons, ionized helium, less than 0.5% minor ions.
  2 × 10^10 protons/cm^2
Loss of satellites

Page 8: Fault, Error and Failure

FAULT (Physical Universe: physical entities making up a system)
  A physical defect that occurs within hw or sw components
  HW defect, SW bug
Fault to error: activation (fault latency)
ERROR (Informational Universe: units of information, e.g. data words)
  A deviation from accuracy or correctness
  Manifestation of a fault
Error to failure: propagation (error latency)
FAILURE (External Universe: the user of a system ultimately sees the effects)
  Nonperformance of some action that is due or expected
  Malfunction

[Geffroy and Motet, 02] Jean-Claude Geffroy and Gilles Motet, “Design of Dependable Computing Systems”, Kluwer Academic Publishers, 2002, ISBN 1-4020-0437-0

Page 9: Electrical Masking

Pulse attenuated by electrical resistance in the circuit
Pulse still strong enough to be latched at output

Page 10: Single Event Latchup

SEL: Single Event Latchup
  Parasitic circuit elements forming a silicon controlled rectifier (SCR)
Potentially destructive
  The device current may destroy the device if not current limited and removed "in time."
  Removal of power to the device is required in all non-catastrophic SEL conditions in order to recover device operations.
SEL probability increases with temperature!

Page 11: Logical Masking

Value unchanged at the gate

Page 12: Logical Masking

Error propagated to the output
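Both logical-masking outcomes can be reproduced with a single gate. The sketch below uses a two-input AND gate purely as an illustration (the gate choice and the helper name `is_logically_masked` are assumptions, not taken from the slides): flipping one input is masked when the other input already forces the output, and propagates otherwise.

```python
def and_gate(a, b):
    """Two-input AND on single bits."""
    return a & b


def is_logically_masked(gate, inputs, flipped_index):
    """Return True if flipping one input bit leaves the gate output unchanged."""
    faulty = list(inputs)
    faulty[flipped_index] ^= 1  # model the upset as a single bit flip
    return gate(*inputs) == gate(*faulty)


# Other input is 0: the AND output stays 0, so the flip is masked.
masked_case = is_logically_masked(and_gate, (0, 0), 0)

# Other input is 1: the flip changes the output, so the error propagates.
propagated_case = not is_logically_masked(and_gate, (0, 1), 0)
```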

Page 13: Temporal Masking

Transient Fault → Soft Error
A transient pulse at the latching window:
1) Before tsetup: masked (not latched)
2) After tsetup, before thold: race condition
3) At the latching window: not masked (latched)
[Firouzi ROCS 2010]
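One possible reading of the three cases, as a sketch: treat the latching window as the interval from `t_clk - t_setup` to `t_clk + t_hold`, and classify a pulse by how it overlaps that interval. The function name, the interval model, and the choice to treat a pulse arriving entirely after the window as masked are assumptions made for illustration.

```python
def classify_pulse(pulse_start, pulse_end, t_clk, t_setup, t_hold):
    """Classify a transient pulse against the latching window of one clock edge."""
    window_start = t_clk - t_setup
    window_end = t_clk + t_hold
    # No overlap with the window: the pulse is never sampled.
    if pulse_end <= window_start or pulse_start >= window_end:
        return "masked"
    # Pulse is stable across the whole window: the wrong value is latched.
    if pulse_start <= window_start and pulse_end >= window_end:
        return "latched"
    # Pulse changes inside the window: outcome is a race.
    return "race"
```

For example, with a clock edge at 1.0 and setup and hold times of 0.1, a pulse over 0.0 to 0.5 is masked, one over 0.85 to 1.15 is latched, and one over 0.95 to 1.3 is a race.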

Page 14: Soft Error Trends

DRAM: System error rate of DRAMs is fairly constant
SRAM: Increasing exponentially
Logic: Increasing exponentially

Page 15: Increasing Soft Error Rates

Reducing feature sizes and lower supply voltage
  Decreasing node capacitances and noise margins
  Q_critical reducing
  Exponentially more low-energy particles than high-energy ones
More transistors per chip
  More functionality is moving on-chip
  Higher probability of error due to more faults.
Increasing clock rates
  A larger fraction of each clock period falls between the setup and hold times, so transients are more likely to be latched
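To see why a shrinking Q_critical matters so much, the sketch below makes two assumptions that are not stated on the slide: Q_critical is approximated as node capacitance times supply voltage, and the per-node error rate follows an empirical exponential form, proportional to flux × area × exp(-Q_critical / Q_s), that is often used in the soft-error literature. All capacitances, voltages, and the charge-collection parameter `q_collect` are hypothetical values.

```python
import math


def critical_charge(c_node_farads, vdd_volts):
    """Rough Q_critical estimate (C), assuming Q_crit ~ C_node * V_dd."""
    return c_node_farads * vdd_volts


def relative_ser(q_crit, q_collect, area=1.0, flux=1.0):
    """Relative soft error rate under the assumed exponential model."""
    return flux * area * math.exp(-q_crit / q_collect)


Q_COLLECT = 1e-15  # C, hypothetical charge-collection efficiency parameter

old_q = critical_charge(2e-15, 1.8)  # hypothetical older node: 2 fF at 1.8 V
new_q = critical_charge(1e-15, 1.0)  # hypothetical newer node: 1 fF at 1.0 V

# How many times more upsets per node, holding flux and area fixed.
ratio = relative_ser(new_q, Q_COLLECT) / relative_ser(old_q, Q_COLLECT)
```

With these made-up numbers Q_critical drops from 3.6 fC to 1.0 fC and the per-node rate rises by exp(2.6), roughly 13 times. The sketch holds the sensitive area fixed, so it isolates the Q_critical effect only.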

Page 16: One Failure per Day per Chip

Soft error rates could increase from one error per year to one error per day in a decade!
[Shivakumar et al 2002]
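Taking the slide's figures at face value, and assuming the rate compounds steadily from one error per year to 365 errors per year over ten years (an assumption; the projection may not be smooth), the implied yearly growth factor can be worked out as follows.

```python
def annual_growth_factor(start_per_year, end_per_year, years):
    """Constant yearly multiplier that turns start_per_year into end_per_year."""
    return (end_per_year / start_per_year) ** (1.0 / years)


# One error per year growing to one per day (365 per year) in a decade.
growth = annual_growth_factor(1.0, 365.0, 10)
```

Under that assumption the error rate grows by a factor of about 1.8 every year, that is, it nearly doubles annually.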

Page 17: Processing and Packaging Solutions

Reduce the number of particles that strike
  Reduce upsets
  Use of highly purified fabrication materials
    Remove traces of boron and heavy metals
  Surround by metallic frame
    Reduce low-energy particles
    But neutrons can pass through > 10 ft of concrete
Process Technology Solutions
  Partially depleted SOI: no help after 250 nm
  Fully depleted SOI: very expensive

Page 18: Transistor Level Techniques

□ Normally CMOS inverter is scaled with 2:1 ratio between p- and n-channel devices
  □ To compensate for electron and hole mobilities
□ Changing this ratio can increase the tolerance

Page 19: Gate-Level Techniques

Some gates are more vulnerable than others
  Radiation hardened designs use NAND gates
  When all inputs are low, drive of the p-stack is low and leakage of the n-transistors is high → rise in the output is slow → functional failure
Gate vulnerability may change by 5X depending on the state
  NAND gate
    Extremely vulnerable when inputs are 10
    Not vulnerable when inputs are 00
How to synthesize to minimize vulnerability?

Page 20: Circuit-Level Techniques

Adding resistance introduces additional time constants that filter out the very fast SEU-induced transients
  High temperature coefficients of poly-silicon resistors
  Difficult to control variation of resistance
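The filtering effect of added resistance can be sketched with a first-order RC low-pass model. This is a simplification chosen for illustration: the transient is idealized as a rectangular voltage pulse, the node is a single RC stage, and the switching threshold and all component values are hypothetical.

```python
import math


def filtered_peak(v_pulse, pulse_width, resistance, capacitance):
    """Peak voltage (V) reached at an RC node driven by a rectangular pulse."""
    tau = resistance * capacitance  # added time constant (s)
    return v_pulse * (1.0 - math.exp(-pulse_width / tau))


def is_filtered(v_pulse, pulse_width, resistance, capacitance, v_threshold):
    """True if the pulse never reaches the (assumed) switching threshold."""
    return filtered_peak(v_pulse, pulse_width, resistance, capacitance) < v_threshold


PULSE_V = 1.0       # V, hypothetical transient amplitude
PULSE_W = 100e-12   # s, hypothetical transient width
NODE_C = 10e-15     # F, hypothetical node capacitance
THRESHOLD = 0.5     # V, hypothetical switching threshold

small_r = is_filtered(PULSE_V, PULSE_W, 1e3, NODE_C, THRESHOLD)   # tau = 10 ps
large_r = is_filtered(PULSE_V, PULSE_W, 50e3, NODE_C, THRESHOLD)  # tau = 500 ps
```

With the small resistance the time constant is much shorter than the pulse and the transient passes almost unattenuated; with the large resistance the node only charges to about 18% of the pulse amplitude and the transient is filtered. Because the result depends directly on the resistance value, variation in that value changes how well the filter works.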

Page 21: D-Cache: Flushing

4x reduction in vulnerability
Copyright 2005, M. Tahoori

Page 22: D-Cache: Write Policy

10x reduction in vulnerability
Copyright 2005, M. Tahoori

Page 23: D-Cache: Refresh

3x reduction in vulnerability using write-thru (30x total)
Copyright 2005, M. Tahoori

Page 24: Replica Cache

Page 25: PPC (Partially Protected Caches)

2 caches at the same level of memory hierarchy
  Main cache, and the protected mini-cache
Mini-cache
  Low power, low latency
  Timing slack to harden it
Compiler maps data to the two caches
  Map Failure-Critical (FC) data to the protected mini-cache
  Map Not Failure-Critical (FNC) data to unprotected main cache
Intuition is to provide protection to only the FC data
  In multimedia applications, the multimedia data is NOT failure critical
  An error → loss in Quality of Service
How to use PPCs for general applications?

[Figure: Processor pipeline with an unprotected main cache and a protected mini cache; memory controller with page mapping of FNC and FC data to memory. Labels: HPC, PPC, Processor.]
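The routing idea can be sketched at page granularity, which is what the "Page Mapping" label in the figure suggests. Everything concrete below is an assumption for illustration: the page size, the cache names, and the idea that a compiler-produced set of failure-critical page numbers (`fc_pages`) is available at run time.

```python
PAGE_SIZE = 4096  # bytes; hypothetical page size


def select_cache(address, fc_pages, page_size=PAGE_SIZE):
    """Pick the cache for an access based on whether its page is failure-critical."""
    page = address // page_size
    # FC pages go to the protected mini-cache, everything else to the main cache.
    return "mini" if page in fc_pages else "main"


# Hypothetical mapping: page 2 holds control data, other pages hold media samples.
fc_pages = {2}
control_access = select_cache(2 * PAGE_SIZE + 16, fc_pages)
media_access = select_cache(7 * PAGE_SIZE + 128, fc_pages)
```

Here the control-data access lands in the protected mini-cache and the media access in the unprotected main cache, so an upset in the latter would degrade quality of service rather than cause a failure.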

Page 26: Cache Scrubbing

Periodically read memory and correct all single bit errors
Disallows accumulation of temporal double bit errors
Standard technique in main memories (DRAMs)
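A minimal scrubbing sketch, assuming each stored word is protected by a single-error-correcting Hamming(7,4) code. The slide does not name a code; this one is chosen only because it is small enough to show in full, and real memories typically use wider codes. All function names are illustrative.

```python
DATA_POSITIONS = (3, 5, 6, 7)  # 1-based codeword positions holding data bits
PARITY_POSITIONS = (1, 2, 4)   # 1-based codeword positions holding parity bits


def syndrome(word):
    """XOR of the 1-based positions of all set bits; 0 for a valid codeword."""
    s = 0
    for pos in range(1, 8):
        if (word >> (pos - 1)) & 1:
            s ^= pos
    return s


def encode(nibble):
    """Encode a 4-bit value as a 7-bit Hamming codeword."""
    word = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (nibble >> i) & 1:
            word |= 1 << (pos - 1)
    s = syndrome(word)
    for p in PARITY_POSITIONS:
        if s & p:
            word |= 1 << (p - 1)  # set parity so the overall syndrome becomes 0
    return word


def decode(word):
    """Extract the 4 data bits (no correction is attempted here)."""
    nibble = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (word >> (pos - 1)) & 1:
            nibble |= 1 << i
    return nibble


def scrub(memory):
    """Read every word, fix any single-bit error in place, return the fix count."""
    corrected = 0
    for i, word in enumerate(memory):
        s = syndrome(word)
        if s:
            memory[i] = word ^ (1 << (s - 1))  # syndrome names the flipped position
            corrected += 1
    return corrected
```

If a scrub pass runs between two upsets to the same word, each is corrected as a single-bit error. If both upsets land before the next pass, this code can no longer recover the data, which is the accumulation that periodic scrubbing is meant to prevent.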

Page 27: Pipeline Protection: Razor

Originally proposed to tolerate process variations
Shadow latch clocked with a delayed clock
If difference in values latched, raise error
How to use it to detect soft errors?
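One way to think about the closing question is a behavioral sketch of the double-sampling idea: sample the same signal at the clock edge and again after a delay, and flag a mismatch. The model below is an illustration, not the Razor circuit itself; the signal is an idealized function of time and the names `razor_stage` and `with_transient` are assumptions.

```python
def razor_stage(signal, t_clk, shadow_delay):
    """Sample signal(t) at the clock edge and again shadow_delay later."""
    main_value = signal(t_clk)                   # main flip-flop
    shadow_value = signal(t_clk + shadow_delay)  # shadow latch on the delayed clock
    error = main_value != shadow_value
    return main_value, shadow_value, error


def with_transient(base_value, pulse_start, pulse_end):
    """Signal holding base_value except for an inverted pulse in [pulse_start, pulse_end)."""
    def signal(t):
        return base_value ^ 1 if pulse_start <= t < pulse_end else base_value
    return signal


# A short transient that overlaps the clock edge but is gone by the delayed edge.
hit = razor_stage(with_transient(0, 0.95, 1.05), t_clk=1.0, shadow_delay=0.2)

# A transient that ends well before the clock edge is never sampled.
miss = razor_stage(with_transient(0, 0.2, 0.3), t_clk=1.0, shadow_delay=0.2)
```

In this model a transient shorter than the delay is caught because only one of the two samples sees it. A pulse that spans both sampling points would be latched by both and go unflagged, and a pulse that hits only the delayed sample would raise a false alarm.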