Hardware Fault Tolerance Through Simultaneous Multithreading (part 2)
Post on 01-Feb-2016
CS717
Hardware Fault Tolerance Through Simultaneous Multithreading (part 2)
Jonathan Winter
Three SMT + Fault Tolerance Papers
• Eric Rotenberg, "AR-SMT - A Microarchitectural Approach to Fault Tolerance in Microprocessors", Symposium on Fault-Tolerant Computing, 1999.
• Steven K. Reinhardt and Shubhendu S. Mukherjee, "Transient Fault Detection via Simultaneous Multithreading", ISCA 2000.
• Shubhendu S. Mukherjee, Michael Kontz and Steven K. Reinhardt, "Detailed Design and Evaluation of Redundant Multithreading Alternatives", ISCA 2002.
Outline
1. Background
   • SMT
   • Hardware fault tolerance
2. AR-SMT
   • Basic mechanisms
   • Implementation issues
   • Simulation and Results
3. Transient Fault Detection via SMT
   • Sphere of replication
   • Basic mechanisms
   • Comparison to AR-SMT
   • Simulation and Results
4. Redundant Multithreading Alternatives
   • Realistic processor implementation
   • CRT
   • Simulation and Results
5. Fault Recovery
6. Next Lecture
Transient Fault Detection via SMT
• More detailed analysis of Simultaneous and Redundant Threading (SRT)
• Introduces the Sphere of Replication concept
• Explores the SRT design space
• Discussion of input replication
• Architecture for output comparison
• Performance-improving mechanisms
• More depth in simulation
Sphere of Replication
• Components inside sphere are protected against faults using replication
• External components must use other means of fault tolerance (parity, ECC, etc.)
• Inputs to sphere must be duplicated for each of the redundant processes
• Outputs of the redundant processes are compared to detect faults
• Simple to understand in lockstepping
• Larger sphere:
  – more state to replicate
  – less input replication and output comparison
Sphere of Replication (part 2)
• Size of the sphere of replication:
  – Two alternatives: with and without the register file
  – Instruction and data caches kept outside
Input Replication
• Must ensure that both threads receive the same inputs to guarantee they follow the same execution path
• Instructions – assume no self-modification
• Cached load data
  – Out-of-order execution issue
  – Multiprocessor cache coherence issues
• Uncached load data – must synchronize
• External interrupts
  – Stall the leading thread and deliver the interrupt synchronously, or
  – Record the interrupt delivery point and deliver later
Cached Load Data - ALAB
• Active Load Address Buffer (ALAB)
  – Delays cache-block replacement or invalidation
  – The ALAB is a table with an address tag, a counter, and a pending-invalidate bit
  – The counter tracks the trailing thread's outstanding loads
  – Blocks cannot be replaced or invalidated until the counter is zero
  – The pending-invalidate bit is set on an unevictable block
  – The leading thread stalls when the ALAB is full
  – Must detect and address deadlocks
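The ALAB bookkeeping described above can be pictured as a small software model. This is a hedged sketch, not the paper's hardware design: the class names, method signatures, and default table size are all illustrative.

```python
# Hedged software sketch of an Active Load Address Buffer (ALAB).
# Class and method names are illustrative, not from the paper.

class AlabEntry:
    def __init__(self):
        self.counter = 0                 # trailing thread's outstanding loads
        self.pending_invalidate = False  # set when the block becomes unevictable

class ALAB:
    def __init__(self, size=64):
        self.size = size
        self.entries = {}                # address tag -> AlabEntry

    def leading_load(self, tag):
        """Leading thread loads a cached block: bump its counter.
        Returns False (stall the leading thread) when the table is full."""
        entry = self.entries.get(tag)
        if entry is None:
            if len(self.entries) >= self.size:
                return False
            entry = self.entries[tag] = AlabEntry()
        entry.counter += 1
        return True

    def trailing_load(self, tag):
        """Trailing thread retires the matching load: drop the counter.
        A deferred invalidation proceeds once the counter reaches zero."""
        entry = self.entries[tag]
        entry.counter -= 1
        if entry.counter == 0 and entry.pending_invalidate:
            del self.entries[tag]

    def try_invalidate(self, tag):
        """Cache asks to replace/invalidate a block: allowed only when the
        trailing thread has no outstanding loads to it; otherwise defer."""
        entry = self.entries.get(tag)
        if entry is None:
            return True
        if entry.counter > 0:
            entry.pending_invalidate = True
            return False
        del self.entries[tag]
        return True
```

The deadlock concern on the slide shows up here as the interaction between a full table (leading thread stalls) and deferred invalidations that only clear when the trailing thread catches up.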
Cached Load Data - LVQ
• Load Value Queue (LVQ)
  – Explicit designation of leading and trailing threads
  – Only the leading thread issues loads and stores
  – Load addresses and values are forwarded to the trailing thread via the LVQ
  – The trailing thread executes loads in order and non-speculatively (why?)
  – Input replication guaranteed
  – Simpler design and less pressure on the cache
  – Earlier fault detection
  – Constrains scheduling of trailing-thread loads
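The LVQ mechanism can be sketched as a simple FIFO between the threads. This is a hedged model with illustrative names and capacity, not the paper's hardware interface.

```python
from collections import deque

class LoadValueQueue:
    """Hedged sketch of a Load Value Queue (LVQ); the class name and
    capacity parameter are illustrative."""

    def __init__(self, capacity=64):
        self.capacity = capacity
        self.q = deque()

    def leading_load(self, addr, value):
        """Leading thread issues the load to the cache, then forwards the
        (address, value) pair. Returns False when the queue is full, which
        would stall the leading thread."""
        if len(self.q) >= self.capacity:
            return False
        self.q.append((addr, value))
        return True

    def trailing_load(self, addr):
        """Trailing thread executes its loads in order, non-speculatively:
        it consumes the head entry instead of accessing the cache, so both
        threads are guaranteed to see the same value. An address mismatch
        is flagged immediately, giving earlier fault detection."""
        fwd_addr, value = self.q.popleft()
        if fwd_addr != addr:
            raise RuntimeError("fault: load address mismatch")
        return value
```

The in-order, non-speculative constraint on the trailing thread is what makes a plain FIFO sufficient: out-of-order or wrong-path loads would consume entries that don't correspond to the leader's load stream.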
Output Comparison
• A store buffer is used to verify the address and value of stores before they commit
  – The trailing thread searches for a matching entry
  – A mismatch means a fault occurred
• Cached load values require no checking
• Uncached load values could have side effects
  – Issued non-speculatively, so the leading thread stalls
  – Assumes uncached loads are always detected
• A Register Check Buffer is used to match register writebacks
  – Three register files required: a future file per thread plus the architectural file
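The store-buffer comparison can be sketched as follows; this is a hedged model (class name and interface are illustrative), showing only the match/verify logic, not the real commit pipeline.

```python
class StoreBufferChecker:
    """Hedged sketch of output comparison on stores. Leading-thread stores
    wait in the buffer until the trailing thread produces the matching
    store; only verified stores leave the sphere of replication."""

    def __init__(self):
        self.pending = []   # (address, value) pairs awaiting verification

    def leading_store(self, addr, value):
        self.pending.append((addr, value))

    def trailing_store(self, addr, value):
        """Trailing thread searches for the matching entry; a mismatch in
        either address or value signals a fault. A verified store is
        removed and may be released to the memory system."""
        for i, (a, v) in enumerate(self.pending):
            if a == addr:
                if v != value:
                    raise RuntimeError("fault: store value mismatch")
                return self.pending.pop(i)   # verified -> safe to commit
        raise RuntimeError("fault: no matching store address")
```

Because only checked stores exit the sphere, a corrupted value in either thread is caught before it can pollute the (parity/ECC-protected) memory system outside.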
Enhancing SRT Performance
• Slack Fetch
  – Maintain a constant lag between the threads' execution
  – The leading thread updates branch and data predictors
  – The leading thread prefetches loads
  – The traditional SMT ICount fetch policy is modified to maintain the slack
• Branch Outcome Queue
  – Deliver branch outcomes directly to the trailing thread
  – The trailing thread needs no control speculation
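One way to picture the modified fetch policy is a per-cycle arbiter that keeps the trailing thread a fixed number of instructions behind. This is a hedged sketch with illustrative names and a default slack of 256; the paper's actual policy is an ICount variant, which this simplification omits.

```python
def choose_fetch_thread(lead_fetched, trail_fetched, slack=256):
    """Pick the thread to fetch for this cycle so the trailing thread runs
    roughly `slack` instructions behind the leader: fetch for the leader
    while the lag is below the target, otherwise let the trailer catch up.
    A real design would combine this with the ICount heuristic."""
    lag = lead_fetched - trail_fetched
    return "leading" if lag < slack else "trailing"
```

Running the leader ahead is what lets it warm up the predictors and prefetch loads before the trailer needs them, which is where the performance gains on the next slides come from.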
AR-SMT versus SRT
• AR-SMT only has space redundancy in the functional units
• SRT can potentially have space redundancy across the pipeline
• AR-SMT is trace-processor-based while SRT is conventional
• The register file of the R-stream must be protected
• AR-SMT forwards load data values
• AR-SMT checks every instruction during fault detection
• SRT requires no operating system modifications
• AR-SMT doesn't support uncached loads and stores or multiprocessor coherence
• The delay buffer performs the function of the register check buffer and branch outcome queue
• All of main memory is in the AR-SMT sphere
  – Better fault coverage but very costly
Simulation Environment
• Modified SimpleScalar "sim-outorder" simulator
• Long front-end pipeline because of out-of-order nature and SMT
• Simple approximation of trace cache
• Used 11 SPEC95 benchmarks
Results
• Again, this paper only analyzes the performance impact of fault tolerance
• Baseline characterization
  – ORH-Dual: two pipelines, each with half the resources
  – SMT-Dual: replicated threads with no detection hardware
• ORH-Dual and SMT-Dual are 32% slower than SMT-Single
Slack Fetch & Branch Outcome Queue
• 10%, 14%, and 15% (27% max) performance improvements for SF, BOQ, and SF + BOQ, respectively
• Reduced memory stalls through prefetching
• Prevents trailing thread from wasting resources by speculating
• Performance is better with a slack of 256 instructions than with 32 or 128
Input Replication
• Assumes output comparison performed by oracle
• Almost no performance penalty paid for 64-entry ALAB or LVQ
• With a 16-entry ALAB or LVQ, benchmark performance degrades by 8% and 5%, respectively
Output Comparison
• Assumes inputs are replicated by an oracle
• The leading thread can stall if the store buffer is full
• A 64-entry store buffer eliminates almost all stalls
• A register check buffer of size 32, 64, or 128 entries degrades performance by 27%, 6%, and 1%, respectively
Overall Results
• Speedup of an SRT processor with a slack of 256 instructions, a 128-entry branch outcome queue, a 64-entry store buffer, and a 64-entry load value queue
• SRT demonstrates a 16% speedup on average (up to 29%) over a lockstepping processor with the "same" hardware
Multi-cycle and Permanent Faults
• Transient faults could potentially persist for multiple cycles and affect both threads
• Increasing slack fetch decreases this possibility
• Spatial redundancy can be increased by partitioning the functional units and forcing the threads to execute on different groups
• Performance loss for this approach is less than 2%
Conclusions
• Sphere of replication helps analysis of input replication and output comparison
• Keep the register file inside the sphere
• The LVQ is superior to the ALAB (simpler)
• The slack fetch and branch outcome queue mechanisms enhance performance
• The SRT fault tolerance method performs 16% better on average than lockstepping