1. introduction. 2 computing everywhere computing everywhere: – desktop, laptop, cars, cell phones...

Post on 11-Jan-2016

233 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

1. Introduction

2

Computing Everywhere

Computing everywhere:• – Desktop, Laptop, Cars, Cell phones

Input devices everywhere:• – Sensors, cameras, microphones

Connectivity everywhere:• – Rapid growth of bandwidth in the interior of the net• – Internet at home and office

Increased reliance on computers is inevitable Computer systems will become invisible only

when they are reliable

3

Fault-Tolerance

Why?• Computers are increasingly being used in

critical applications where system failures may have severe consequences.

How?• By introducing redundancy (extra resources) in

the computer system, e.g., hardware redundancy and software redundancy.

4

Need for Fault Tolerance: Universal

Natural objects:• • Fat deposits in body: survival in starvation• • Duplication of eyes: graceful degradation

upon failure

Man-made objects• • Redundancy in ordinary text• • Asking for password twice during initial set-

up• • Duplicate tires in trucks

5

Mission Specific Approaches High availability systems:

• – Telephone• – Transaction processing: banks/airlines

Long life missions:• – Unscheduled maintenance too costly• – Manned and unmanned space borne systems

Critical applications:• – Real-time industrial control• – Aircraft control systems• – Life support systems

General Purpose Systems:• – CDs: encoding• – Internet: packet retransmission

6

Example of Failures - eBay Crash

eBay: giant internet auction house• – A top 10 internet business• – Market value of $22 billion• – 3.8 million users as of March 1999

June 6, 1999• – eBay system is unavailable for 22 hours with

problems ongoing for several days• – Stock drops by 6.5%, $3-5 billion lost revenues• – Problems blamed on Sun server software

7

Example - Ariane 5 Rocket Crash

Ariane 5 and its payload destroyed about 40 seconds after liftoff

Error due to software bug:• – Conversion of floating point to 16-bit int• – Out of range error generated but not handled

Testing of full system under actual conditions not done due to budget limits

Estimated cost: 120 million $

8

Example - The Therac-25 Failure

Therac-25 is a linear accelerator used for radiation therapy

More dependent on software for safety than predecessors (Therac-20, Therac-6)

Machine reliably treated thousands of patients, but occasionally there were serious accidents, involving major injuries and 1 death.

Software problems:• – No locks on shared variables (race conditions).• – Timing sensitivity in user interface.

9

Example - Tele Denmark

Tele Denmark Internet, ISP August 31, 1999

• – Internet service down for 3 hours• – Truck drove into the power supply cabinet at

Tele Denmark• – Where were the UPSs?

• Old ones had been disconnected for upgrade

• New ones were on the truck!

10

Dependable

Webster Dictionary Dependable: capable of being depended

on: RELIABLE Reliable: suitable or fit to be relied on:

DEPENDABLE Rely:

• 1) to be dependent <the system for which we depend on water>

• 2) to have confidence based on experience <someone you can rely on>

11

Dependability Dependability is that property of a computer

system such that reliance can justifiably be placed on the service it delivers.

Attributes Of Dependability• Reliability• Availability• Safety• Confidentiality• Integrity• Maintainability

Fault tolerance is not a system requirement. Fault tolerance is one of the mechanisms that can be used to provide dependability

12

Motivation Extreme fault tolerance has always been around

• – NASA’s deep space probes• – Medical computing devices (e.g., pacemakers)

But now fault tolerance is becoming more important• – More reliance on computers

Extreme fault tolerance• – Car controllers (e.g., anti-lock brakes), etc.

High fault tolerance• – Commercial servers (databases, web servers), file servers,

etc. Some fault tolerance

• – Desktops, laptops (really!), etc.

13

Reliability R(t) - Unreliability

R(t) is the probability that the system performs as specified without interruption over the entire interval [0,t]. R(t) is conditioned on the system being operational at time t=0.

time t can be very long, e.g. years in case of space applications

Unreliability F(t) is the probability that the system fails at any time in the interval [0,t].

F(t) = 1 - R(t)

14

Availability A(t)

A(t) is the probability that the system is up and running correctly at time t

This is different from reliability.• – Reliability considers the interval [0,t]• – Availability takes an instance of time

examples: transaction processing systems, e.g. reservation systems

15

Reliability vs. Availability

Example: A system that fails, on average, once per

hour but which restarts automatically in ten milliseconds is not very reliable but is highly available

Availability= Uptime/(Uptime+Downtime) = (3600000-10)/(3600000) = 0.9999972

16

Nines of Availability

17

Question ?

Which has higher availability?• (1) two 4 hour outage / year• (2) 1 minute outage / day• A(1) = (365*24-2*4)/(365*24) = 0.9990• A(2) = (24*60-1)/(24*60) = 0.9993

For an Internet-base company such as EBay or AOL, which would be more desirable? Why?

For a Hacker? Need to specify details of acceptable outages

18

Safety S(t) S(t) is the probability that the system does

not fail in the interval [0,t] in such a manner as to cause unacceptable damage or other terrible effects.

S(t) is attribute of a system which either operates correctly or fails in a safe manner.

Safety is a measure of the fail-safe capability of the system

• – system can be unreliable, yet safe• – bias towards safe failure

19

Safety Safety is a property of a system that it will not

endanger human life or the environment

A safety-related system is one by which the safety of equipment or plant is assured

The term safety-critical system is normally used as synonym for a safety-related system, although it may suggest a system of high criticality

20

Confidentiality

Absence of unauthorized disclosure of information

• Microsoft source code vs. Linux source code• Web browsing• Operating Systems Security Model• Files• Medical records• Credit card transaction records• School grades

21

Integrity

Absence of improper system state alterations

• Operating systems:• Files, memory, network packets• Linux kernel backdoor attempt• Database records• Your bank account• File transfer• Did I really get the right version of software X?

22

Maintainability M(t)

M(t) is the probability that a failed system will be restored within a specified period of time t

Restoration process• – locating problem, e.g. via diagnostics• – physically repairing system• – bringing system back to its operational

condition

23

Ex. Dependability Requirements

Telecommunications:• Availability, maintainability

Transportation:• Reliability, availability, safety

Weapons:• Safety

Nuclear systems:• Safety

24

Dependability of Pacemaker What matters for this system?

• Correct computation?• Correct logic?• Usability?

No, The safety of the patient

25

Dependability of Pacemaker General Characteristics Eight-bit processors, moving to 32-bit Software:

• Approximately 30K lines, mostly “C”• Vastly more software in external programmer

Patient data storage example:• 200 samples/sec

Long battery life necessary-device• “sleeps” between heart beats

26

Dependability of Pacemaker Is reliability the goal?

• Typical battery life is five years, but persistent storage is needed

Is availability the goal? How about an availability of 0.99999?

• This corresponds to an average of five minutes per year of downtime

• Death would result if this occurred all at once Is safety the goal? It’s safe when it’s off - or is it?

• Leaving the system off might result in death very quickly.....

27

Security

Security is a combination of attributes:• Integrity• Confidentiality• Availability

Under different situations, these attributes are more or less important:

• Denial of service is an availability issue• Disclosure of information is a confidentiality

28

Performability P(L,t)

P(L,t) is the probability that the system performance will be at or above some level L at time t

Measure of the likelihood that some subset of the function is performed correctly

This differs from reliability, which dictates that all functions are performed correctly

29

Graceful Degradation

The ability of system to automatically decrease its level of performance to compensate for hardware failure and software errors

30

Testability

Testability: ease of detecting presence of a fault

top related