DEPENDABILITY:
WHAT IS IT? Why do we care?
Dependability and security attributes
(Avizienis, et al. 2004)
Dependability: the ability to deliver service that can justifiably be trusted
Availability: readiness for correct service (also a security attribute)
Reliability: continuity of correct service
Safety: absence of catastrophic consequences on the user(s) and the environment
Integrity: absence of improper system alterations (also a security attribute)
Maintainability: ability to undergo modifications and repairs
Confidentiality: absence of unauthorized disclosure of information (a security attribute)
Security: the composite of availability, integrity, and confidentiality
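Availability and reliability are easy to conflate. The sketch below contrasts them using two standard textbook formulas that do not appear on this slide: steady-state availability A = MTTF / (MTTF + MTTR), and, assuming exponentially distributed failure times, mission reliability R(t) = exp(-t / MTTF). The MTTF and MTTR values are hypothetical.

```python
import math


def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Long-run fraction of time the service is ready (availability)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def reliability(mission_hours: float, mttf_hours: float) -> float:
    """Probability of no failure during the whole mission (reliability).
    Assumes a constant failure rate of 1 / MTTF (exponential model)."""
    return math.exp(-mission_hours / mttf_hours)


# Hypothetical system: fails on average every 100 h and is repaired in 1 h.
mttf, mttr = 100.0, 1.0
print(f"availability       = {steady_state_availability(mttf, mttr):.4f}")
print(f"reliability (24 h) = {reliability(24.0, mttf):.4f}")
```

The same hypothetical system is ready about 99% of the time, yet it gets through a 24-hour mission without interruption only about 79% of the time. Fast repairs improve availability but do nothing for continuity of service.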
When is it a good time for dependability?
Example: graduating from university
• Availability: you have to deliver passable work (projects, papers, exams)
• Reliability: if possible, continuously (without failing too many exams)
• Safety: while taking care not to kill/maim/etc. the teachers
• Integrity: without messing up the original system (enter the right classroom, with no cheating aids on you)
• Maintainability: while undergoing repairs (recovering from hangovers)
Note: “The dependability and security specification of a
system must include the requirements for the attributes
in terms of the acceptable frequency and severity of
service failures for specified classes of faults and a
given use environment.
One or more attributes may not be required at all for a
given system.” (Avizienis, et al. 2004)
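The quoted requirement can be read as a small data structure: for each attribute, an acceptable failure frequency and severity for a class of faults in a given environment, with some attributes possibly not required at all. The sketch below is purely illustrative; the class, field names, fault classes, and thresholds are hypothetical and are not taken from Avizienis et al.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Requirement:
    fault_class: str              # class of faults the requirement covers
    max_failures_per_year: float  # acceptable frequency of service failures
    max_severity: int             # acceptable severity, 1 (minor) .. 5 (catastrophic)


# Hypothetical specification for one system in one use environment.
# None marks an attribute that is not required at all for this system.
SPEC: Dict[str, Optional[Requirement]] = {
    "availability": Requirement("hardware crash", max_failures_per_year=12, max_severity=2),
    "integrity": Requirement("operator mistake", max_failures_per_year=1, max_severity=3),
    "confidentiality": None,
}


def violations(spec: Dict[str, Optional[Requirement]],
               observed: Dict[str, Tuple[float, int]]) -> List[str]:
    """Attributes whose observed (failures per year, worst severity) exceed the spec."""
    bad = []
    for attribute, req in spec.items():
        if req is None or attribute not in observed:
            continue  # attribute not required, or nothing observed for it
        failures, severity = observed[attribute]
        if failures > req.max_failures_per_year or severity > req.max_severity:
            bad.append(attribute)
    return bad


print(violations(SPEC, {"availability": (15, 2), "integrity": (0, 0), "confidentiality": (3, 5)}))
```

Here only availability is reported: 15 failures per year exceed the hypothetical limit of 12, while confidentiality failures are ignored because that attribute is not required for this system.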
When is it good to have security?
Availability: readiness for correct service
Integrity: absence of improper system alterations
Confidentiality: the absence of unauthorized disclosure of
information
Example: a system that deals with ATMs and banks must
• be ready to provide the correct service (availability),
• while no unsanctioned modifications change it, such as tiny cameras that catch you typing in your PIN (integrity),
• all the while keeping the identity of the client as private as possible (confidentiality).
Complex hardware/software
• NASA Space Shuttle
  • flies with 500,000 lines of software code on board
  • 3.5 million lines of code in ground control and processing
• International Space Station Alpha
  • millions of lines of software
  • innumerable hardware pieces for its navigation, communication, and experimentation
• Telecommunications industry
  • operations for phone carriers are supported by hundreds of software systems, with hundreds of millions of lines of source code
• Avionics industry
  • instruments contain their own microprocessor system with extensive embedded software
• A massive amount of hardware and complicated software also exists in the Federal Aviation Administration’s Advanced Automation System, the new-generation air traffic control system.
“More common” complexity
• Windows
• 1 to 5 million lines of code
• Your car
• Embedded controllers + software
• Your cell phone
• Your toaster?
What happens when things go wrong?
Inconvenience
• malfunctions of home appliances
Economic damage
• interruptions of banking systems
Loss of life
• failures of flight systems or medical software
Who’s to blame: software or hardware?
• hardware technology has advanced rapidly
• software development technology has failed to keep pace in all measures, including quality, productivity, cost, and performance
• by the last decade of the 20th century, computer software had already become the major source of reported outages in many systems
• software assumes a larger burden while resting on a less firm foundation than hardware
“Famous” (and a bit funny) failures
• NASA Voyager project
• the Uranus encounter was in jeopardy because of late software deliveries and reduced capability in the Deep Space Network
• Space Shuttle missions
• delayed due to hardware/software interaction problems
• F-16 jet fighter
• “software problems” caused the first flight of the AFTI/F-16 jet fighter to be delayed over a year
• none of the advanced modes originally planned could be used
• ozone hole over Antarctica
• data analysis program had suppressed the anomalous data because it was “out of range.”
• Denver International Airport
• stood empty for more than a year due to software glitches in an automated baggage-handling system
Tragic failures
• Therac-25 radiation therapy machine
• software errors in its sophisticated control system caused malfunctions that claimed several patients’ lives in 1985 and 1986
• the Computer Aided Dispatch system of the London Ambulance Service broke down right after its installation
• the service handles 5000 daily requests to transport patients
• aviation industry
• misunderstandings between computers and pilots have been
implicated in several airline crashes
• in some cases experts hold the software control responsible because of the aircraft’s inappropriate reaction to the pilots’ desperate inquiries during an abnormal flight.
Common failures
• Many software systems and packages
• Bugs
• Microsoft is fearful of “killer bugs” which can easily wipe out all the
profits of a glorious product if a recall is required on the tens of
millions of copies they have sold.
• Who’s seen a blue screen in Windows?
• a fault in a switching system’s newly released software
caused massive disruption of a major carrier’s long-
distance network
• a series of local phone outages was traced to software problems
Factors that play an instrumental role in increasing the importance of reliability
• Complex and sophisticated products.
• Over the years engineering products have become more sophisticated and complex. For example, in 1935 a farm tractor had 1200 critical parts and in 1990 the number increased to around 2900. Furthermore, today a typical Boeing 747 jumbo jet airplane is composed of around 4.5 million parts, including fasteners.
• High acquisition cost.
• Many engineering products cost millions of dollars (e.g., commercial airplanes, defense systems, and space satellites). Failure of such items could result in loss of millions of dollars.
• The past well-publicized system failures.
• Three examples of these failures are the Space Shuttle Challenger disaster, the Chernobyl nuclear reactor explosion, and the Point Pleasant Bridge disaster. These disasters occurred in January 1986, April 1986, and December 1967, respectively.
• Q: What about the Fukushima nuclear plant?
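One way to see why part counts matter: if a product works only when every critical part works (a series system) and parts fail independently, its reliability is the product of the part reliabilities, R = r^n. This model is an assumption added for illustration; the per-part reliability of 0.9999 is hypothetical, while the part counts are the tractor figures quoted above.

```python
def series_reliability(part_reliability: float, n_parts: int) -> float:
    """Reliability of n independent, identical parts in series: all must work."""
    return part_reliability ** n_parts


r = 0.9999  # hypothetical reliability of each critical part over a fixed period
for year, parts in [(1935, 1200), (1990, 2900)]:
    print(year, parts, round(series_reliability(r, parts), 3))
```

Even with very good parts, going from 1200 to 2900 critical parts drops the chance that everything works from roughly 89% to roughly 75%.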
The fundamental chain of dependability
and security threats
Example 1: Columbia space shuttle
• NASA communiqué:
• “There was a hole in the protective material of one wing of the Columbia shuttle. When the shuttle re-entered the Earth’s atmosphere, the hole caused the wing to overheat. Finally, the overheating of the wing caused the explosion of the shuttle.”
Solution:
Fault = hole in the wing
  • belongs to the physical universe
  • by activation causes an error
Error = overheating of the wing
  • belongs to the information universe
  • by propagation causes a failure
Failure = explosion
  • belongs to the external universe
  • by causation may create another fault
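The chain above can be sketched as a small data structure that also checks each link: a fault leads to an error by activation, an error to a failure by propagation, and a failure may lead to a new fault by causation. The class and function names are hypothetical; the kinds, universes, and mechanisms follow the slide.

```python
from dataclasses import dataclass
from typing import List

# What each kind of threat can lead to, and by which mechanism (as on the slide).
NEXT_KIND = {"fault": "error", "error": "failure", "failure": "fault"}
MECHANISM = {"fault": "activation", "error": "propagation", "failure": "causation"}


@dataclass
class Threat:
    kind: str      # "fault", "error" or "failure"
    what: str      # description of the concrete threat
    universe: str  # "physical", "information" or "external"


def describe_chain(chain: List[Threat]) -> List[str]:
    """Render a fault -> error -> failure chain, rejecting invalid links."""
    lines = []
    for i, threat in enumerate(chain):
        line = f"{threat.kind}: {threat.what} ({threat.universe} universe)"
        if i + 1 < len(chain):
            nxt = chain[i + 1]
            if nxt.kind != NEXT_KIND[threat.kind]:
                raise ValueError(f"a {threat.kind} cannot directly lead to a {nxt.kind}")
            line += f" --{MECHANISM[threat.kind]}--> {nxt.kind}"
        lines.append(line)
    return lines


columbia = [
    Threat("fault", "hole in the wing", "physical"),
    Threat("error", "overheating of the wing", "information"),
    Threat("failure", "explosion", "external"),
]
for line in describe_chain(columbia):
    print(line)
```

Skipping a step, for example going straight from the hole to the explosion, raises a `ValueError`, which mirrors the point of the slide: a fault only becomes a failure through an error.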
Example 2: Genetic mechanisms drive cancer progression
• Mutation inactivates tumor suppressor gene
• Cells proliferate
• Mutation inactivates DNA repair gene
• Mutation of proto-oncogene creates an oncogene
• Mutation inactivates several more tumor suppressor genes
• Cancer
Cancer seen as a causal chain of faults
Fault = mutation inactivates tumor suppressor gene
Error = mutated cells proliferate
Failure = cells cannot stop themselves from turning into something else
Fault = new mutation (oncogene)
Error = twice mutated cells proliferate
Failure = cells turn into something else
Fault = new mutation (inactivate DNA repair)
Error = thrice mutated cells proliferate
Failure = cells cannot be stopped by the local control
Fault = new mutation (other tumor suppressors inactivated)
Error = mutated cells proliferate
Failure = Cancer
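The cancer example repeats the fault, error, failure pattern at several levels, with each failure creating, by causation, the fault of the next level. A minimal sketch that flattens the levels listed above into one causal chain; the function name `walk` and the output format are hypothetical.

```python
# (fault, error, failure) at each level of the cancer progression, as listed above.
STAGES = [
    ("mutation inactivates tumor suppressor gene",
     "mutated cells proliferate",
     "cells cannot stop themselves from turning into something else"),
    ("new mutation (oncogene)",
     "twice mutated cells proliferate",
     "cells turn into something else"),
    ("new mutation (inactivate DNA repair)",
     "thrice mutated cells proliferate",
     "cells cannot be stopped by the local control"),
    ("new mutation (other tumor suppressors inactivated)",
     "mutated cells proliferate",
     "cancer"),
]


def walk(stages):
    """Flatten the levels into one chain: fault -> error -> failure, where each
    failure (by causation) may create the fault of the next level."""
    steps = []
    for level, (fault, error, failure) in enumerate(stages, start=1):
        steps.append(f"[{level}] fault: {fault}")
        steps.append(f"[{level}] error: {error}")
        steps.append(f"[{level}] failure: {failure}")
        if level < len(stages):
            steps.append(f"[{level}] failure --causation--> fault at level {level + 1}")
    return steps


for step in walk(STAGES):
    print(step)
```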
Ways to dependability
• Fault prevention means to prevent the occurrence or
introduction of faults.
• Fault tolerance means to avoid service failures in the
presence of faults.
• Fault removal means to reduce the number and severity
of faults.
• Fault forecasting means to estimate the present number,
the future incidence, and the likely consequences of faults.
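Fault tolerance is often illustrated with triple modular redundancy (TMR), a classic technique that is not covered on this slide: three replicas compute the same result and a voter takes the majority, so one faulty replica does not cause a service failure. A minimal sketch; the replica functions are hypothetical toys.

```python
from collections import Counter
from typing import Callable, List


def tmr(replicas: List[Callable[[int], int]], x: int) -> int:
    """Triple modular redundancy: run three replicas and return the majority result.
    Masks one faulty replica; raises if no two replicas agree."""
    if len(replicas) != 3:
        raise ValueError("TMR needs exactly three replicas")
    results = [replica(x) for replica in replicas]
    value, votes = Counter(results).most_common(1)[0]
    if votes < 2:
        raise RuntimeError(f"no majority among {results}")
    return value


def square(x: int) -> int:
    return x * x


def faulty_square(x: int) -> int:
    # Hypothetical faulty replica: an off-by-one fault activated on every input.
    return x * x + 1


print(tmr([square, square, faulty_square], 7))  # 49: the faulty replica is outvoted
```

The fault in `faulty_square` is still present and still produces an error in its own output, but the voter stops that error from propagating into a service failure. If two replicas fail differently, no majority exists and the voter signals it instead of guessing.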
Do we have that in the human body?
Note: Fault prevention and fault tolerance aim to provide the
ability to deliver a service that can be trusted, while fault
removal and fault forecasting aim to reach confidence in
that ability by justifying that the functional and the
dependability and security specifications are adequate
and that the system is likely to meet them.
Readings for this chapter:
• Avizienis, Algirdas, Jean-Claude Laprie, Brian Randell, and Carl Landwehr. "Basic Concepts and Taxonomy of Dependable and Secure Computing." IEEE Transactions on Dependable and Secure Computing 1, no. 1 (2004): 11-33.
• Dhillon, B. S. Reliability, Quality, and Safety for Engineers. Boca Raton, Florida: CRC Press, 2005.
• Lyu, Michael R. "Introduction." In Handbook of Software Reliability Engineering, by Michael R. Lyu, 3-25. IEEE Computer Society Press and McGraw-Hill Book Company, 2006.
• Xing, Liudong. ECE591-03 Dependable Computing and Networking Course. Spring 2012. http://www.ece.umassd.edu/Faculty/lxing/Homepage/ECE591-S12-Homepage/ECE591-S12-Lecture.html.
What’s next?
2. Software reliability 1 - introduction
• Software Reliability and System Reliability
• Software Reliability Modeling Survey
• Techniques for Prediction Analysis and Recalibration
• The Operational Profile
3. Software reliability 2 - data analysis and
measurement
4. Software reliability 3 - testing, simulation and
solutions
5. Hardware reliability - parallel to software
6. FMEA/FMECA, Six Sigma, HAZOP
7. Event & fault tree analysis
Administrative details
• Time:
• the course will last 7 weeks
• it should take approximately 6-10 hours per week of dedicated time to
complete, with all its readings and assignments
• To pass the class
• Application part:
• Select a project and present it in 5 minutes
• The list will be sent as a Google Doc tomorrow
• Projects may be done in teams, but each member must speak
• Final exam:
• Present 3 examples that illustrate ideas from any chapter of the course
• Examples from any domain
• Individually