Process Redundancy for Future-Generation CMP Fault Tolerance
Dan Gibson
ECE 753 Project Presentation

Overview
- Motivating Example
- Execution Redundancy: A Brief Tutorial
- Chip Multiprocessors
- Redundant Processes on CMPs
- Analytic Results
- Portcullis Prototype
- Measured Results

Today: Reliable Hardware
    tcsh(1)% ./add 2 2
    = 4
    tcsh(2)% ./add 3 7
    = 10
    tcsh(3)% ./add 2 2
    = 4
    tcsh(4)%

Tomorrow: Faulty Hardware
    tcsh(1)% ./add 2 2
    = 4
    tcsh(2)% ./add 3 7
    Segmentation Fault
    tcsh(3)% ./add 2 2
    = 5
    tcsh(4)%

What Happened?
- Transistors are shrinking: SRAM cell capacitance is falling, gates are smaller, drive strength is lower
- Chips are not shrinking: chips are larger relative to transistor size, wires are long, crosstalk grows, complexity grows

What Can Be Done?
- Build reliable HW: complexity is skyrocketing (mistakes are inevitable), and it hurts performance, since smaller devices are faster but less reliable while larger devices are reliable but slow
- Accept unreliable HW and make reliable SW: OK when the MTBF is reasonably large (e.g. databases), but not for the fainthearted programmer

One Solution: Execution Redundancy
- Run the same code many times
- Decide on the correct result (e.g. vote)
- Many flavors of redundancy: 3MR, 5MR, concurrent modular redundancy, pair & spare (e.g. early NonStop), sift-out, etc.
- Key questions: Where to run redundant code? When to run it? How to run it?

Re-Execution
- Where: same hardware
- When: one run after another
- How: provide common inputs, compare outputs
- Timeline: inputs, run once; inputs, run again; ... run N times; compare results

Re-Execution: Fault Detection
- First run: 2+2=4; second run: Segmentation Fault; the mismatch exposes the fault

Re-Execution: Fault Recovery
- Run N times and compare results; the agreeing runs give 2+2=4 (a C sketch of re-execution with voting appears a few slides below)

Re-Execution: Pros and Cons
- Pros: simple; tolerates transient faults; tolerates some intermittent faults; no redundant hardware needed
- Cons: (N-1) x 100% overhead, plus checking overhead; no tolerance for permanent faults

Lock-Step
- Where: on tightly-coupled redundant hardware
- When: cycle by cycle
- How: check every result
- Timeline: inputs; run one unit, check result; run one unit, check result; ...; outputs

Lock-Step: Fault Detection
- One unit computes 2+2=4, another computes 2+2=5; 5 != 4: fault!

Lock-Step: Fault Recovery
- Vote across units: "4" gets 2 votes, "5" gets 1 vote; ergo 2+2=4, and execution continues

Lock-Step: Pros and Cons
- Pros: tolerates transient faults; tolerates some intermittent faults; tolerates isolated permanent faults; may be invisible to software
- Cons: requires frequent breaks in execution; slow; expensive (area, $); wasteful, since lock-step is not needed for soft disagreements such as branch prediction
- Both IBM and HP use designs like this!
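The Re-Execution and Fault Recovery slides above amount to: run the same computation N times on the same hardware, then keep the answer most runs agree on. The slides give no code, so the following is only a minimal C sketch; add, vote, and N are illustrative names invented here, not part of the presentation.

    /* Re-execution sketch: run the same computation N times on the same
     * hardware, then majority-vote over the results (illustrative only). */
    #include <stdio.h>

    #define N 3                       /* number of redundant executions */

    /* Stand-in for the ./add program from the slides. */
    static int add(int a, int b) { return a + b; }

    /* Return the value that appears most often among n results. */
    static int vote(const int r[], int n) {
        int best = r[0], best_count = 0;
        for (int i = 0; i < n; i++) {
            int count = 0;
            for (int j = 0; j < n; j++)
                if (r[j] == r[i]) count++;
            if (count > best_count) { best_count = count; best = r[i]; }
        }
        return best;
    }

    int main(void) {
        int results[N];
        for (int i = 0; i < N; i++)   /* run once, run again, ... run N times */
            results[i] = add(2, 2);
        printf("2+2 = %d\n", vote(results, N));   /* compare results */
        return 0;
    }

A transient fault that corrupts one of the N results is simply outvoted; a permanent fault corrupts every run the same way, which is exactly the limitation listed under the cons.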
Trailing Checker
- Where: on (sometimes) tightly-coupled redundant hardware
- When: ad hoc
- How: a second thread checks the first
- Timeline: inputs; the leading thread commences execution; the checker thread starts behind it; outputs

Trailing Checker: Fault Detection
- The leading thread produces 2+2=5, the checker produces 2+2=4; the mismatch exposes the fault

Trailing Checker: Fault Recovery
- The following thread corrects the leading thread

Trailing Checker: Pros and Cons
- Pros: tolerates transient faults; tolerates some intermittent faults; high performance; the checker can be simpler than the main thread
- Cons: some HW redundancy; mutual interruptibility; pipeline and/or cache modifications; the checker is sometimes assumed non-faulty; on SMT, faults can be correlated
- Examples: DIVA, some SMT techniques

Insight
- NMR techniques have desirable properties: well understood; simple; the HW is isolated
- Trailing checkers have desirable properties: common-case performance; redundancy is parallel, but checks add overhead
- Non-blocking synchronization leads to asynchronous NMR techniques

Asynchronous NMR
- Where: on loosely-coupled redundant hardware
- When: ad hoc
- How: check some results (e.g. I/O)
- Timeline: inputs; start N identical executions; each computes 2+2=4, logs its output, and the results are compared (a fork-based sketch of this scheme appears a few slides below)

Asynchronous NMR: Fault Detection
- One execution logs 2+2=5 while the others log 2+2=4; 5 != 4: fault!

Asynchronous NMR: Fault Recovery
- Vote over the logged outputs: "4" gets 2 votes, "5" gets 1 vote; ergo 2+2=4

Asynchronous NMR: Pros and Cons
- Pros: tolerates transient faults; tolerates some intermittent faults; tolerates permanent faults; high performance (leverages parallelism); flexible
- Cons: some HW redundancy is needed for performance; software involvement; flexibility means many decisions; synchronization is needed (a performance/complexity tradeoff)
- Chip multiprocessors to the rescue!

Chip Multiprocessors (CMPs)
- Many processors (aka cores), one chip
- Quick inter-core communication
- Abundant parallelism (more than SW knows what to do with!)
- Resource sharing: caches, off-chip memories, on-chip interconnect

CMPs of Today
- Intel Core 2 Duo Extreme Edition: two cores; four-core coming (very) soon
- Sun Niagara (SunFire TX000): eight cores, 32 execution contexts
- Sun Niagara 2 (in the works): eight cores, 64 execution contexts
- IBM Cell: nine cores (one beefy PPC plus eight SPEs); virtualization: only 7 SPEs exposed

CMPs of Tomorrow
- Intel, IBM, Sun, and AMD all have CMPs today
- Intel announced a 100+ core CMP
- Technology scaling alone will enable 100s of cores inside of 10 years
- BUT: SW is largely serial code! How can future CMPs use the abundance of cores?
- Combat faults with core-level redundancy

Execution Redundancy on CMPs
- Combine asynchronous NMR and CMPs
- HW provides redundancy, isolation, and detection, plus support for managed redundancy
- SW manages concurrency: synchronization and fault recovery

HW Support for Managed Redundancy
- Heterogeneous processing elements
- Aggressive cores: scaled to the limit of the technology; susceptible to faults; small and numerous
- Reliable cores: conservatively sized; much higher MTTF, but significantly slower; large and few in number
- Example: tcsh(1)% add 2 2 ... = 4

Managing Redundancy
- Synergy between the POSIX process boundary and the cores on a CMP
- The OS runs only on the reliable core
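The Asynchronous NMR slides start N identical executions and compare their results only at the I/O boundary. The sketch below is a user-level approximation using fork() and pipes (the presentation assumes hardware-supported redundant cores, not fork): each copy runs the same add computation and "logs" its result through a pipe, and the parent acts as the voter once the results reach the I/O point. All names are illustrative.

    /* Asynchronous NMR sketch: N identical executions, checked only at I/O. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/wait.h>

    #define N 3

    static int add(int a, int b) { return a + b; }

    int main(void) {
        int fds[N][2];

        for (int i = 0; i < N; i++) {            /* start N identical executions */
            if (pipe(fds[i]) != 0) { perror("pipe"); exit(1); }
            if (fork() == 0) {
                int r = add(2, 2);               /* run the same code */
                write(fds[i][1], &r, sizeof r);  /* "log output" at the I/O point */
                _exit(0);
            }
        }

        int results[N];
        for (int i = 0; i < N; i++) {            /* voter: gather the logged outputs */
            read(fds[i][0], &results[i], sizeof results[i]);
            wait(NULL);
        }

        int agree = 0;
        for (int i = 1; i < N; i++)
            if (results[i] == results[0]) agree++;

        if (agree == N - 1)
            printf("all copies agree: 2+2 = %d\n", results[0]);
        else
            printf("disagreement detected: fault!\n");   /* e.g. one copy logged 5 */
        return 0;
    }

Only detection is shown; recovery would keep the majority value and restart the disagreeing copy, as in the Fault Recovery slide.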
- System calls (e.g. I/O) provide a natural opportunity to perform checks and voting, since the interruption overhead is already paid

Hardware Support
- Isolate faults to the aggressive cores (error propagation there is OK)
- System call proxy to the reliable core
- Localized software-initiated reset

Software Management
- The OS manages redundancy (Redundant Process Control Module = RPCM)
- The RPCM provides virtualization
- HW or SW fault detection
- SW fault recovery

Fault Detection
- Scenario 1: hardware-detected fault
- Scenario 2: software-detected fault (one copy reports 2+2=5, another 2+2=4)

Fault Recovery
1) Reset the affected core
2) Stop a non-faulty process
3) Move a copy of the stopped process to the faulty core
4) Resume both

Virtualization
- Processes expect to execute alone! We cannot allow processes to interfere, and we must ensure identical executions
- getpid(): every redundant copy is told "Your PID is 4"
- open(out.txt): every copy is told FD = 3, but the copies actually open out-0.txt and out-1.txt (a user-level sketch of this redirection appears after the backup slides)

Pros and Cons
- Pros: tolerates transient, intermittent, and permanent faults; high performance; flexible; invisible to user software
- Cons: HW redundancy; OS support required (redundancy management, virtualization); fault propagation is possible and must be combated with a larger N

Analytical Analysis 1

Analytical Analysis 2

Portcullis Prototype
- The future CMP isn't available yet
- Simulating it would take a LONG time; modifying the OS would take a LONG time
- Punt: make a user-level prototype for today's hardware

Portcullis Prototype
- Emulate the RPCM: trap system calls, detect faults, hide other processes
- Allow the OS to manage sharing
- Provide virtualization for redundancy

Portcullis Performance 1

Portcullis Performance 2

Portcullis Performance 3

Concluding Remarks
- Execution redundancy is a large field: we saw four techniques, and others exist
- CMPs represent a large field: we saw ~4 and designed 1
- The product of two large fields is a huge field!
- NMR: (N-1) * 100% overhead

Backup Slides

What's Really Going On?
    execve(/path/add, 2, 2)
    brk(0)                     // get memory
    access(/lib/glibc)         // find glibc
    open(/lib/glibc, r)
    mmap(1, 0xDEADBEEF)        // import glibc
    close()
    stat(/lib/mylib) open() stat()       // find other libraries
    open() stat() read() fstat() mmap()  // import other libraries
    close()
    mprotect()                 // make libs executable
    set_thread_area()
    write(1, "2+2=4")          // do output
    munmap()                   // cleanup
    exit_group()
- Nearly everything is process setup before the write and process teardown after it

What's Really Going On?
- write(1, "2+2=4");
- Map file descriptor 1 to the actual file descriptor 1
- Add "2+2=4" to the output queue for fd 1
- Compare "2+2=4" against the other processes' output for fd 1 (a queue-and-compare sketch also appears after the backup slides)

Tolerable Faults
- Illegal opcode: exception, or an eventual disagreement
- Segmentation fault, bus error: an eventual disagreement
- Writeback to the wrong address (e.g. cache tag corruption): redundant TLB and conservative tags, else system failure
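The Virtualization slides show two redundant copies both calling open(out.txt), both being told FD = 3, while their output really lands in out-0.txt and out-1.txt; in the presentation the RPCM does this inside the OS. As a purely illustrative user-level stand-in, an LD_PRELOAD interposer can produce a similar effect. The REDUNDANT_COPY environment variable and the hard-coded file name are assumptions of this sketch, not part of Portcullis.

    /* virt_open.c: user-level sketch of per-copy file redirection.
     * Each redundant copy opens its own out-<copy>.txt, but the program
     * itself just sees a successful open() of out.txt. */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <fcntl.h>
    #include <stdarg.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int open(const char *path, int flags, ...) {
        static int (*real_open)(const char *, int, ...) = NULL;
        if (!real_open)
            real_open = (int (*)(const char *, int, ...))dlsym(RTLD_NEXT, "open");

        mode_t mode = 0;
        if (flags & O_CREAT) {               /* the mode argument exists only with O_CREAT */
            va_list ap;
            va_start(ap, flags);
            mode = (mode_t)va_arg(ap, int);
            va_end(ap);
        }

        /* REDUNDANT_COPY (hypothetical) says which copy this process is: "0", "1", ... */
        const char *copy = getenv("REDUNDANT_COPY");
        if (copy && strcmp(path, "out.txt") == 0) {
            char private_path[64];
            snprintf(private_path, sizeof private_path, "out-%s.txt", copy);
            return real_open(private_path, flags, mode);
        }
        return real_open(path, flags, mode);
    }

Build it with gcc -shared -fPIC virt_open.c -o virt_open.so -ldl, then run each copy as LD_PRELOAD=./virt_open.so REDUNDANT_COPY=0 ./add 2 2 (and REDUNDANT_COPY=1 for the second copy). The descriptor numbers the two copies see will typically match, since descriptors are per-process.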
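The backup slide on write() describes queuing each copy's output per file descriptor and comparing the queues across copies before the output escapes. Below is a toy, single-process sketch of just that queue-and-compare step; queued_write, compare_and_release, and the buffer sizes are made up here, and the real RPCM would do this across separate processes.

    /* Queue-and-compare sketch for redundant output checking (illustrative). */
    #include <stdio.h>
    #include <string.h>

    #define COPIES 3
    #define FDS    8
    #define QLEN   256

    /* queue[copy][fd] buffers what each redundant copy has "written" to fd. */
    static char queue[COPIES][FDS][QLEN];

    /* Stand-in for a trapped write(fd, buf): just buffer the output. */
    static void queued_write(int copy, int fd, const char *buf) {
        strncat(queue[copy][fd], buf, QLEN - strlen(queue[copy][fd]) - 1);
    }

    /* Release the output for fd only if every copy's queue agrees. */
    static int compare_and_release(int fd) {
        for (int c = 1; c < COPIES; c++)
            if (strcmp(queue[c][fd], queue[0][fd]) != 0)
                return -1;                    /* disagreement: fault! */
        fputs(queue[0][fd], stdout);          /* all agree: do the real I/O */
        return 0;
    }

    int main(void) {
        queued_write(0, 1, "2+2=4\n");
        queued_write(1, 1, "2+2=4\n");
        queued_write(2, 1, "2+2=5\n");        /* the faulty copy */
        if (compare_and_release(1) != 0)
            fprintf(stderr, "output mismatch on fd 1: fault detected\n");
        return 0;
    }

Running it reports the mismatch because the third copy queued 2+2=5; only detection is shown here, with recovery handled as in the Fault Recovery slide (reset the core and restart the copy).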