Song Peng - ASYNC06
Self-Healing Asynchronous Arrays
Song Peng and Rajit Manohar
Computer Systems Laboratory, Cornell University, USA
Motivation
Fault tolerant (FT) VLSI design
  increase fabrication yield
  improve circuit reliability
No efficient general FT method for asynchronous circuits yet
  no clock → different fault behavior
  traditional FT techniques ineffective or inefficient
Contributions
Developed a general design for self-healing asynchronous arrays
  suitable for clockless circuits
  tolerates any number (K) of hard and soft errors
  automatic reconfiguration: self-healing
  small hardware cost, low overheads, and good scalability (with K)
Outline
A general self-healing asynchronous design framework
Implementing self-healing async arrays
Experimental evaluation
Conclusions
FT Design Philosophies
Hardwired replication-and-voting (NMR)
  deadlock complicates voting procedure
  voter: performance bottleneck
  large H/W overhead
Reconfigurable fault tolerant design
  self-checking logic, spare resources, reconfiguration
  no voting, less H/W cost
  fault recovery time → but little overall impact
General Framework of Self-healing Asynchronous Circuit
[Block diagram: Reconfiguration Logic, Deadlock Detection, Fail-stop Circuit of K-FT Graph Topology]
Design Overview
Quasi-delay-insensitive (QDI) circuits
Asynchronous arrays
  valid model for most VLSI modules with identical components (adder, multiplier, FIR)
  node: a VLSI component
  edge: a communication channel between two neighboring components
  K-FT graph: K-FT array with external in/outs
Implement Fail-Stop Behavior*
PCHB (Precharge Half Buffer) template
  QDI circuit with domino style
  can construct any QDI logic
FS-PCHB template
  PCHB + self-checking logic
  a stuck-at fault or single event upset → FS-PCHB deadlocks*
* S. Peng and R. Manohar, “Efficient failure detection in pipelined asynchronous circuits”, in DFT 2005
[Figure: PCHB circuit diagram. Labels: In, Ine, F, Control, f0/f1, fe, en]
Detect Deadlock
A timer (delay element) watches the data channel activity
  current-starved cascaded inverter chain
  a valid transition + the next transition expected → start timer
  the next valid transition does not occur for a specific amount of time → timer expires: deadlock
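The timer behavior described on this slide can be modeled in software as a watchdog. The sketch below is a behavioral model only, not the current-starved inverter chain itself; the class name `DeadlockTimer`, the `timeout` parameter, and the abstract time units are assumptions made for illustration.

```python
class DeadlockTimer:
    """Behavioral model of a watchdog on one data channel (hypothetical names)."""

    def __init__(self, timeout):
        self.timeout = timeout      # assumed timer delay, in abstract time units
        self.started_at = None      # None means the timer is idle

    def on_transition(self, now, next_expected):
        """Record a valid transition at time `now`.

        If another transition is expected, (re)start the timer;
        otherwise the channel is legitimately idle and the timer stops.
        """
        self.started_at = now if next_expected else None

    def expired(self, now):
        """True if the expected transition has not arrived within `timeout`."""
        return self.started_at is not None and now - self.started_at > self.timeout


timer = DeadlockTimer(timeout=10)
timer.on_transition(now=0, next_expected=True)
print(timer.expired(now=5))                      # still inside the window
timer.on_transition(now=6, next_expected=True)   # handshake progressed, timer restarts
print(timer.expired(now=20))                     # 14 units of silence: deadlock reported
```

In this model the timeout would have to exceed the slowest legitimate handshake cycle, otherwise a slow but healthy channel would be reported as deadlocked.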
Online Self-Reconfiguration
Reconfiguration overview
  pass gates → connections of QDI circuit
  reconfiguration → pass gate control signals
No fault location
  search for a workable configuration
  hardware cost reduced
  longer fault recovery
    little overall performance impact
General Block Diagram of Self-Reconfiguration Logic
[Block diagram: Deadlock Detector, Finite State Machine, Combinational Logic, Reset Logic. Signals: Data Channel, Local Reset, Pass Gate Control Signals]
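The search for a workable configuration without fault location can be sketched as a loop: each time the deadlock detector fires, the finite state machine advances to the next configuration and the circuit is reset and retried. This is a minimal sketch under stated assumptions: the function name `self_heal`, the `deadlocks` predicate, and the cyclic ordering of configurations are hypothetical and not taken from the slides.

```python
def self_heal(num_configs, deadlocks, start=0):
    """Step through configurations until one runs without deadlock.

    num_configs : number of FSM states (assumed finite and cyclic)
    deadlocks   : hypothetical predicate, deadlocks(cfg) -> True if the
                  deadlock detector would fire under configuration cfg
    Returns (configuration, number of reconfiguration steps), or
    (None, num_configs) if every configuration deadlocks.
    """
    cfg = start
    for steps in range(num_configs):
        if not deadlocks(cfg):
            return cfg, steps
        # Deadlock detected: advance the FSM, which changes the pass-gate
        # control signals; the local reset is modeled by simply retrying.
        cfg = (cfg + 1) % num_configs
    return None, num_configs


# Example: 4 configurations, of which 0 and 1 route through a faulty node.
faulty = {0, 1}
print(self_heal(4, lambda c: c in faulty))
```

The number of steps is bounded by the number of configurations, which is why a search of this kind trades longer fault recovery for not needing fault-location hardware.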
K-FT Array Models
Full-duplication model
  high redundancy, simple reconfiguration
Min-spare model*
  min redundancy (K spares), complex reconfiguration
Small-degree model
  medium redundancy, medium reconfiguration
All three models
  each external in/out: K+1 copies
* S. Peng and R. Manohar, “Fault tolerant asynchronous adder through dynamic self-reconfiguration”, in ICCD 2005.
Full-Duplication Array
Construction
  add K full copies of array: (K+1) in total
  only external in/outs reconfigurable
Self-reconfiguration
  reconfigure = switch to another copy
  simple reconfiguration logic
    (K+1)-bit one-hot counter
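A (K+1)-bit one-hot counter can be modeled as a rotating bit vector in which the single set bit selects the active copy. The sketch below is illustrative: the function name `one_hot_next`, the list representation, and the wrap-around after the last copy are assumptions rather than details given on the slide.

```python
def one_hot_next(state):
    """Rotate a one-hot list by one position: copy i -> copy i+1 (wrapping)."""
    assert state.count(1) == 1 and state.count(0) == len(state) - 1, "state must be one-hot"
    return [state[-1]] + state[:-1]


K = 2
state = [1] + [0] * K          # (K+1)-bit one-hot value, copy 0 selected
for _ in range(K + 1):
    print(state, "-> copy", state.index(1))
    state = one_hot_next(state)
```

Because exactly one bit is set at any time, each bit could drive the pass gates of one copy's external in/outs directly, with no decoder in between.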
Small-Degree Array
Construction
  add a medium number (>K) of spare nodes
  recursive construction
  resulting graph
    K+1 trees
    max node degree constant: small fanout
Self-reconfiguration
  search a configuration = select a path from all trees
  reconfiguration logic
    one-hot counter: (K+1)-bit, to select a tree
    multiple mod-3 counters: to choose different branches in a tree walk
Small-Degree Array Example: 1-FT 4-Node Array
[Figure: 1-FT 4-node small-degree array with pass gates; the reconfiguration counters are driven by the deadlock signal]
Experimental Evaluation
Target circuit: 64-bit QDI adder
  1-bit adder: FS-PCHB
  fine-grained fault tolerance
Evaluate three FT graph models
  H/W cost
  performance, energy overhead
  fault recovery time
Compare with NMR (with voter core)
Evaluation: H/W Cost
H/W cost: transistor count
Critical circuit: self-reconfiguration logic/voter
Normalize H/W costs to baseline adder

     Total                         Critical
K    MIN    SML    DUP    NMR      MIN    SML    DUP    NMR
1    2.48   3.16   4.09   3.62     0.14   0.03   0.01   0.62
2    3.01   5.21   6.14   6.41     0.21   0.04   0.01   1.41
4    4.68   7.44   10.2   26.3     0.21   0.09   0.02   17.3
8    8.71   11.9   18.4   5716     1.80   0.16   0.04   5699

MIN: min-spare, SML: small-degree, DUP: full-duplication
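As a rough consistency check on the NMR columns: majority voting that tolerates K faults needs 2K+1 replicas, so the normalized NMR total should be close to (2K+1) plus the voter cost. The sketch below assumes that each replica costs exactly one baseline adder and that the NMR entries are read from the table as listed in the code; both are assumptions of this check, not statements from the slides.

```python
# NMR entries as read from the table: K -> (total, critical),
# both normalized to the baseline adder.
nmr = {1: (3.62, 0.62), 2: (6.41, 1.41), 4: (26.3, 17.3), 8: (5716, 5699)}

for k, (total, critical) in nmr.items():
    replicas = 2 * k + 1                 # majority voting needs 2K+1 copies
    estimate = replicas + critical       # assumes one baseline adder per replica
    print(k, replicas, round(estimate, 2), total)
```

Under these assumptions the estimate matches the tabulated total in every row, which suggests the voter dominates the NMR cost as K grows.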
Other Evaluations
Performance and Energy
  HSPICE: TSMC 0.18 µm, 25 °C
  normalize throughputs and energy to baseline adder
Worst fault recovery time
  total number of configurations in FT array
  O(expected fault recovery time)
Evaluation: Performance
[Plot: normalized throughput vs. K (1 to 8) for MIN, SML, DUP, and NMR]
Evaluation: Energy
[Plot: normalized energy vs. K (1 to 8) for MIN, SML, DUP, and NMR]
Evaluation: Fault Recovery Time
[Plot: fault recovery time vs. K (1 to 8) for MIN, SML, and DUP]
Conclusion
A general design for self-healing QDI circuits
  applicable: no critical timing assumption
  general: K hard/soft errors tolerated
  self-healing: automatic reconfiguration
  efficient: low overheads, good scalability
Can be applied to synchronous design
  as long as it is fail-stop
Backup Slides
Quasi-delay-insensitive (QDI) Circuits
An important class of asynchronous circuits
  no gate/wiring timing assumption other than isochronic forks
  data communication by message passing
  handshake → causality and event-ordering
  self-checking potential
Implement Fail-Stop Behavior
Fault Modeling
  both hard and soft errors
  hard error → single stuck-at fault (SSAF)
    covers many defects and permanent faults
  soft error → single event upset (SEU)
    covers most transient faults
  high reliability potential
Baseline QDI Circuit Template
Pre-charge Half Buffer (PCHB)
  pre-charge domino logic style → fast
  can construct almost all QDI logic
[Figure: PCHB block diagram. Blocks: Control, Data Computation. Signals: X0, X1, Xe, Y0, Y1, Ye, en]
Full-Duplication Array Example: 2-FT 2-Node Array
[Figure: 2-FT 2-node full-duplication array with pass gates on the external in/outs]
Min-Spare Array
Construction*
  add K spare nodes
  replicate each external connection
  add redundant internal connections
Self-reconfiguration
  reconfigure = pick another set of N nodes from (N+K) nodes
  FSM = log2(C(N+K, K))-bit incrementer, where C(N+K, K) is the binomial coefficient
  combinational logic necessary
* S. Peng and R. Manohar, “Fault tolerant asynchronous adder through dynamic self-reconfiguration”, in ICCD 2005.
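The width of the min-spare incrementer follows from counting the ways to pick N active nodes out of N+K. The sketch below computes that width and lists the subsets for a small case. The function names, the rounding up to a whole number of bits, and the lexicographic enumeration order are assumptions for illustration; the slides do not say how the combinational logic maps counter values to node subsets.

```python
from itertools import combinations
from math import ceil, comb, log2


def incrementer_bits(n, k):
    """Bits needed to index every choice of n active nodes out of n + k."""
    return ceil(log2(comb(n + k, k)))


def configurations(n, k):
    """Enumerate the node subsets in one (assumed) fixed order."""
    return list(combinations(range(n + k), n))


print(incrementer_bits(2, 2))     # 2-FT 2-node array: C(4, 2) = 6 subsets
print(configurations(2, 2))
print(incrementer_bits(64, 1))    # 64-node array with one spare: C(65, 1) = 65
```

The subset count grows quickly with K, which fits the slides' description of min-spare reconfiguration as complex compared with the one-hot counter of the full-duplication model.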
Min-Spare Array Example: 2-FT 2-Node Array
[Figure: 2-FT 2-node min-spare array with pass gates, controlled by an incrementer and combinational logic]
Summary
All models outperform NMR for fine-grained FT design
  smaller overheads, better scalability with K
  Min-spare model: minimum H/W cost
  Full-duplication model: smallest critical circuit, shortest fault recovery time
  Small-degree model: medium overheads