Practical Reports on Dependability
Manifestation of System Failure
• Site unavailability
• System exception/access violation
• Incorrect result
• Data loss/corruption
• Slowdown
PAGE UNAVAILABLE
System Exception
Performance Slowdown
DOWNTIME
• Planned: 80 %
• Unplanned: 20 %
• 15% contribution
UNPLANNED DOWNTIME
• Software/human: 80 %
– Software: 40 %
– Operator: 40 %
• Other: 20 %
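The two breakdowns above can be combined to estimate how much of total downtime each unplanned cause accounts for. The sketch below assumes the unplanned-downtime percentages are fractions of the 20 % unplanned share and simply multiplies them. The slides do not state this composition, and the dictionary and function names are illustrative only.

```python
# Shares as given on the slides.
DOWNTIME = {"planned": 0.80, "unplanned": 0.20}
UNPLANNED = {"software": 0.40, "operator": 0.40, "other": 0.20}


def share_of_total_downtime():
    """Scale each unplanned cause by the unplanned share of all downtime.

    Assumption (not stated on the slides): the two breakdowns compose
    multiplicatively.
    """
    unplanned = DOWNTIME["unplanned"]
    return {cause: round(unplanned * frac, 4) for cause, frac in UNPLANNED.items()}


shares = share_of_total_downtime()
for cause, share in shares.items():
    print(f"{cause}: {share:.0%} of total downtime")
```

Under that assumption, software and operator errors would each account for about 8 % of all downtime, and other causes for about 4 %.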
Software Errors
Triggers
• Resource exhaustion
• Logical errors
• System overload
• Recovery code
• Failed upgrade
Logical Error
SYSTEM OVERLOAD
Operator Errors
Triggers
• Configurational
– Incorrect parameter setting
• Procedural
– Omitted/incorrect maintenance action
• Miscellaneous
FAILURE
DURATION
• Short (minutes)
• Long (weeks)
– Implies large fault chains
FREQUENCY
• Permanent (down until problem fixed)
• Transient (resolves without intervention)
• Intermittent (transient + occasional)
SCOPE
• Entire system
• Parts of the system
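The three dimensions above (duration, frequency, scope) can be read as a small record type for tagging incidents. The following is a minimal sketch of that idea. The class and field names are hypothetical and do not come from the slides. Only the category labels do.

```python
from dataclasses import dataclass
from enum import Enum


class Duration(Enum):
    SHORT = "minutes"
    LONG = "weeks"


class Frequency(Enum):
    PERMANENT = "down until problem fixed"
    TRANSIENT = "resolves without intervention"
    INTERMITTENT = "transient + occasional"


class Scope(Enum):
    ENTIRE_SYSTEM = "entire system"
    PARTS_OF_SYSTEM = "parts of the system"


@dataclass(frozen=True)
class FailureProfile:
    """One failure classified along the three dimensions (hypothetical type)."""
    duration: Duration
    frequency: Frequency
    scope: Scope

    def describe(self) -> str:
        return (
            f"{self.frequency.name.lower()} failure, "
            f"{self.duration.name.lower()} duration ({self.duration.value}), "
            f"affecting {self.scope.value}"
        )


# Illustrative, made-up incident.
example = FailureProfile(Duration.SHORT, Frequency.TRANSIENT, Scope.PARTS_OF_SYSTEM)
print(example.describe())
```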
Fault Chains
• “The series of component failures that led up to a user-visible failure”
• Uncoupled
– Independent failures
• Tightly coupled
– Cascading/correlated failure
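One way to make the uncoupled/tightly coupled distinction concrete is to record, for each component failure in a chain, whether an earlier failure triggered it. The sketch below is a simplified illustration under that assumption. The `ComponentFailure` type, the `caused_by` field, and the component names are hypothetical, and real correlated failures need not have an explicit recorded cause.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ComponentFailure:
    component: str                   # hypothetical component name
    caused_by: Optional[str] = None  # component whose failure triggered this one, if any


def classify_chain(chain: List[ComponentFailure]) -> str:
    """Return 'tightly coupled' if any failure was triggered by an earlier
    failure in the same chain, otherwise 'uncoupled'."""
    seen = set()
    for failure in chain:
        if failure.caused_by is not None and failure.caused_by in seen:
            return "tightly coupled"
        seen.add(failure.component)
    return "uncoupled"


# Made-up chains for illustration.
independent = [ComponentFailure("disk"), ComponentFailure("network card")]
cascading = [
    ComponentFailure("power unit"),
    ComponentFailure("disk", caused_by="power unit"),
]

print(classify_chain(independent))
print(classify_chain(cascading))
```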
Non-Malicious Software Failure
• Most common causes
– Routine maintenance
– Software upgrade
– System integration
• Other causes
– System overload
– Resource exhaustion
– Complex fault-tolerant routines
“ROUTINE” MAINTENANCE
• Danske Bank 2003
– March 11: routine operation to replace a defective electrical unit in IBM DB2 disk system
– System failure: disks become inaccessible
– 6 hours later: system restarted
– March 12: batch systems running incorrectly
– Three more errors discovered:
1. Recovery process on several tables won’t start
2. Recovery jobs won’t run simultaneously
3. Recovery jobs can’t reestablish data in tables
– March 14: all data recovered and system functional