Software Design for Reliability - Presentation


TRANSCRIPT

  • Slide 1/42

    Software Design for Reliability (DfR)

    George de la Fuente
    [email protected]
    (408) 828-1105
    www.opsalacarte.com

    Mike Silverman
    [email protected]
    (408) 472-3889
    www.opsalacarte.com

  • Slide 2/42

    Introduction

  • Slide 3/42

    The Current Situation

    Engineering teams must balance cost, schedule, performance and reliability to achieve optimal customer satisfaction.

    [Diagram: Cost, Schedule, Reliability and Performance balanced around Customer Satisfaction]

  • Slide 4/42

    The Need for Design for Reliability

    Reliability is no longer a separate activity performed by a distinct group within the organization.

    Product reliability goals, concerns and activities are integrated into nearly every function and process of an organization.

    The organization must factor reliability into every decision.

  • Slide 5/42

    Design for Reliability

    Design for Reliability is made up of the following 4 key steps:

    1. Assess the customer's situation
    2. Develop goals
    3. Write the program plan
    4. Execute the program plan

    The focus is on developing reliable products and improving customer satisfaction.

  • Slide 6/42

    Different Views of Reliability

    Product development teams
    View reliability as the domain to address mechanical, electrical and software issues.

    Customers
    View reliability as a system-level issue, with minimal concern placed on the distinction into sub-domains.

    Since the primary measure of reliability is made by the customer, engineering teams must maintain a balance of both views (system and sub-domain) in order to develop a reliable product.

    [Diagram: System Reliability = Mechanical Reliability + Electrical Reliability + SW Reliability]

  • Slide 7/42

    Reliability vs. Cost

    Intuitively, an emphasis on reliability to achieve a reduction in warranty and in-service costs results in some minimal increase in development and manufacturing costs.

    Use of the proper tools during the proper life cycle phase will help to minimize total Life Cycle Cost (LCC).

  • Slide 8/42

    Reliability vs. Cost, continued

    To minimize total Life Cycle Costs (LCC), an organization must do two things:

    1. Choose the best tools from all of the tools available and apply these tools at the proper phases of the product life cycle.

    2. Properly integrate these tools together to assure that the proper information is fed forwards and backwards at the proper times.

    This is Design for Reliability.

  • Slide 9/42

    Hardware Reliability and Cost Reduction Strategy

    A product-driven focus on reliability in order to reduce product hardware and mechanical warranty and in-service costs.

    This emphasis results in some increase in development and manufacturing costs.

    A good reliability program will strive to optimize the total Life Cycle Costs.

    [Chart: Cost vs. Reliability. Reliability program costs rise with reliability while elec/mech warranty and in-service costs fall; the total cost curve has an optimum cost point.]

  • Slide 10/42

    HW and SW Reliability and Cost Reduction Strategy

    A product-driven focus on reliability in order to reduce product hardware and mechanical warranty and in-service costs.

    This emphasis results in some increase in development and manufacturing costs.

    A good reliability program will strive to optimize the total Life Cycle Costs.

    [Chart: the same cost vs. reliability curves as the previous slide, annotated with the question:]

    The software impact on overall warranty costs is minimal at best?

    Why?

  • Slide 11/42

    Software Reliability and Cost

    Software has no associated manufacturing costs and minimal replacement costs.

    Warranty costs and savings are almost entirely allocated to hardware.

    Software development and test teams are considered fixed product development costs.

    If software reliability does not result in a perceived cost savings, why not focus solely on improving hardware reliability in order to save money?

    Sadly, some organizations take just this approach.

    But since they sell systems, not just hardware, sometimes customers or market forces cause companies to reconsider this approach.

    The primary root cause of embedded system failures is usually software, not hardware, by ratios as high as 10:1.

    The benefits of improved software reliability are not in direct cost savings, but rather in:

    Increased availability of software staff for additional development due to reduced operational maintenance demands

    Improved customer satisfaction resulting from increased system reliability

  • Slide 12/42

    Software Design for Reliability

  • Slide 13/42

    Software Quality vs. Software Reliability

    Software Quality: the level to which the software characteristics conform to all the specifications.

    The ISO 9126 Quality Model breaks software quality into factors, each with associated criteria:

    Functionality: suitability, accuracy, interoperability, security
    Reliability: maturity, fault tolerance, recoverability
    Usability: understandability, learnability, operability, attractiveness
    Efficiency: time behavior, resource utilization
    Maintainability: analysability, changeability, stability, testability
    Portability: adaptability, installability, co-existence, replaceability

  • Slide 14/42

    Software Reliability Definitions

    Examine the key points:

    The customer perception of the software's ability to deliver the expected functionality in the target environment without failing.

    Practical rewording of the definition:

    Software reliability is a measure of the software failures that are visible to a customer and prevent a system from delivering essential functionality.

  • Slide 15/42

    Terminology - Defect

    A flaw in software requirements, design or source code that produces unintended or incomplete run-time behavior.

    Defects of commission:
    Incorrect requirements are specified
    Requirements are incorrectly translated into a design model
    The design is incorrectly translated into source code
    The source code logic is flawed

    Defects of omission:
    Not all requirements were used in creating a design model
    The source code did not implement all the design
    The source code has missing or incomplete logic
    These are amongst the most difficult classes of defects to detect.

    Defects are static and can be detected and removed without executing the source code.

    Defects that cannot trigger software failures are not tracked or measured for reliability purposes. These are quality defects that affect other aspects of software quality, such as maintenance defects and defects in test cases or documentation.

    [Diagram: Defect]

  • Slide 16/42

    Terminology - Fault

    The result of triggering a software defect by executing the associated source code.

    Faults are NOT customer-visible.

    Example: a memory leak, or a packet corruption that requires retransmission by the higher layer stack.

    A fault may be the transitional state that results in a failure.

    Trivially simple defects (e.g., display spelling errors) do not have intermediate fault states.

    [Diagram: Defect → Fault]

  • Slide 17/42

    Terminology - Failure

    A customer (or operational system) observation or detection that is perceived as an unacceptable departure of operation from the designed software behavior.

    Failures are the visible, run-time symptoms of faults.

    Failures MUST be observable by the customer or another operational system.

    Not all failures result in system outages.

    NOTE: for the remainder of this presentation, the term failure will refer only to failure of essential functionality, unless otherwise stated.

    [Diagram: Defect → Fault → Failure]

  • Slide 18/42

    Summary of Defects and Failures

    There are 3 types of run-time defects:

    1. Defects that are never executed (so they don't trigger faults)

    2. Defects that are executed and trigger faults that do NOT result in failures

    3. Defects that are executed and trigger faults that result in failures

    Practical Software Reliability focuses solely on defects that have the potential to cause failures by:

    1. Detecting and removing defects that result in failures during development

    2. Implementing fault tolerance techniques to prevent faults from producing failures, or to mitigate the effects of the resulting failures (see the sketch below)

    [Diagram: Defect | Defect → Fault | Defect → Fault → Failure]
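    As a minimal illustration of the second strategy (a Python sketch; the sensor names, fault type and values are hypothetical, not from the slides), the wrapper below catches a run-time fault and substitutes a recovery action so the defect never surfaces as a customer-visible failure:

```python
import logging
import random

def read_sensor_raw() -> float:
    # Hypothetical driver call with an injected transient fault for demo.
    if random.random() < 0.3:
        raise IOError("bus timeout")  # a fault: triggered, but not yet a failure
    return 21.5

def read_sensor(last_good: float, retries: int = 3) -> float:
    """Fault tolerance wrapper: retry, then degrade gracefully."""
    for attempt in range(retries):
        try:
            return read_sensor_raw()  # executing the defect may trigger a fault
        except IOError as fault:
            logging.warning("transient fault (attempt %d): %s", attempt, fault)
    # Recovery action: fall back to the last known-good value so the fault
    # is contained and never becomes an essential-functionality failure.
    return last_good

print(read_sensor(last_good=20.9))
```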

  • Slide 19/42

    Software Availability

    System outages caused by software can be attributed to:

    1. Recoverable software failures
    2. Software upgrades
    3. Unrecoverable software failures

    NOTE: Recoverable software failures are the most frequent software cause of system outages.

    For outages due to recoverable software failures, availability is defined as:

    A(T) = MTTF / (MTTF + MTTR)

    where,

    MTTF is Mean Time To [next] Failure

    MTTR (Mean Time To [operational] Restoration) is still the duration of the outage, but without the notion of a repair time. Instead, it is the time until the same system is restored to an operational state via a system reboot or some level of software restart.
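    To make the arithmetic concrete, here is a minimal Python sketch of this formula (the numbers are illustrative, not from the slides):

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: A(T) = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)

# Illustrative example: one recoverable failure every 1,000 hours,
# restored by a software restart in about 4 minutes.
a = availability(mttf_hours=1000.0, mttr_hours=4.0 / 60.0)
print(f"A(T) = {a:.6f}")  # ~0.999933, i.e. just above "four nines"
```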

  • Slide 20/42

    Software Availability (continued)

    A(T) can be increased by either:

    Increasing MTTF (i.e., increasing reliability) using software reliability practices

    Reducing MTTR (i.e., reducing downtime) using software availability practices

    MTTR can be reduced by:

    Implementing hardware redundancy (sparingly) to mask the most likely failures

    Increasing the speed of failure detection (the key step)

    Software and system recovery speeds can be increased by implementing Fast Fail and software restart designs

    Modular design practices allow software restarts to occur at the smallest possible scope, e.g., thread or process vs. system or subsystem (see the sketch below)

    Drastic reductions in MTTR are only possible when availability is part of the initial system/software design (like redundancy)

    Customers generally perceive enhanced software availability as a software reliability improvement

    Even if the failure rate remains unchanged
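    A minimal Python sketch of the fast-detection, smallest-scope-restart idea (the heartbeat interval, timeout and class names are illustrative assumptions, not from the slides):

```python
import threading
import time

HEARTBEAT_TIMEOUT_S = 1.0  # assumed "very fast" detection budget

class Worker:
    """A monitored unit of work that publishes a heartbeat."""

    def __init__(self) -> None:
        self.last_beat = time.monotonic()
        self.restarts = 0

    def run_once(self) -> None:
        # ... perform one bounded slice of real work here ...
        self.last_beat = time.monotonic()  # heartbeat after each slice

    def restart(self) -> None:
        # Fast Fail recovery at the smallest scope: reinitialize this
        # worker only, instead of rebooting the whole system.
        self.restarts += 1
        self.last_beat = time.monotonic()

def watchdog(worker: Worker, stop: threading.Event) -> None:
    """Detects a hung worker quickly; detection speed drives MTTR."""
    while not stop.is_set():
        if time.monotonic() - worker.last_beat > HEARTBEAT_TIMEOUT_S:
            worker.restart()
        time.sleep(HEARTBEAT_TIMEOUT_S / 4)  # poll faster than the timeout
```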

  • Slide 21/42

    System Availability Timeframes

    Timeframe vs. Mission Downtime (Unavailability Range)

    Availability | Availability Class | Downtime (3-month mission) | Downtime (1-year timeframe)
    90% (1 nine) | (1) Unmanaged | 9.13 days | 36.5 days/year (52,560 mins/year)
    99% (2 nines) | (2) Managed (good web servers) | 21.9 hours | 3.65 days/year (5,256 mins/year)
    99.9% (3 nines) | (3) Well-managed | 2.19 hours | 8.8 hours/year (525.6 mins/year)
    99.99% (4 nines) | (4) Fault Tolerant (better commercial systems) | 13.14 minutes | 52.6 mins/year
    99.999% (5 nines) | (5) High-Availability (high-reliability products) | 1.31 minutes | 5.3 mins/year
    99.9999% (6 nines) | (6) Very-High-Availability | 7.88 seconds | 31.5 secs/year (2.6 mins/5 years)
    99.99999% (7 nines) to 99.9999999% (9 nines) | (7) Ultra-Availability | 0.79 seconds | 3.2 secs/year to 31.5 millisecs/year (15.8 secs/5 years or less)
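    The downtime columns follow directly from the unavailability (1 - A) multiplied by the timeframe; a small Python sketch reproduces the yearly figures:

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000 s

def downtime_per_year_s(availability: float) -> float:
    """Allowed downtime per year (seconds) at a given availability."""
    return (1.0 - availability) * SECONDS_PER_YEAR

for nines, a in [(3, 0.999), (4, 0.9999), (5, 0.99999), (6, 0.999999)]:
    print(f"{nines} nines: {downtime_per_year_s(a):>9,.1f} s/year")
# 3 nines:  31,536.0 s/year  (~8.8 hours)
# 4 nines:   3,153.6 s/year  (~52.6 minutes)
# 5 nines:     315.4 s/year  (~5.3 minutes)
# 6 nines:      31.5 s/year
```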

  • Slide 22/42

    Software Availability

    Failure detection speed vs. recovery scope (per mission):

    1) Manual detection (no S/W mechanism)
       2) Implementation example: reboot/restart via user interface
       3) Failure detection speed: 10 min. (avg detection time)
       Slow recovery action: system reboot; recovery speed 4 minutes (average); MTTR 14 minutes; 1 failure/mission
       Fast recovery action: system restart; recovery speed 100 seconds (average); MTTR 11 mins & 40 secs; 1 failure/mission

    1) Slow, automatic detection mechanism
       2) Implementation example: health message protocols w/retries
       3) Failure detection speed: 30 seconds
       Slow recovery action: system reboot; recovery speed 4 minutes (average); MTTR 4 mins & 30 secs; 3 failures/mission
       Fast recovery action: system restart; recovery speed 100 seconds (average); MTTR 130 seconds; 7 failures/mission

    1) Fast, automatic detection mechanism
       2) Implementation example: watchdog timer reset
       3) Failure detection speed: 5-10 seconds
       Slow recovery action: board reboot; recovery speed 90 seconds; MTTR 100 seconds; 9 failures/mission
       Fast recovery action: board restart; recovery speed 30 seconds; MTTR 40 seconds; 24 failures/mission

    1) Very fast, automatic detection mechanism
       2) Implementation example: thread or driver monitors
       3) Failure detection speed: up to 1 second
       Slow recovery action: board reboot; recovery speed 90 seconds; MTTR 91 seconds; 10 failures/mission
       Fast recovery action: thread/driver restart; recovery speed 50 ms (average); MTTR 1.05 seconds; 914 failures/mission

    Key: 1) Software Failure Detection Mechanism; 2) Implementation Example; 3) Failure Detection Speed; each recovery line gives a) recovery action, b) recovery speed, c) MTTR, d) mission failure rate.
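    Note that each MTTR above is simply detection time plus recovery time, and the failures/mission figures are consistent with a fixed per-mission downtime budget of roughly 16 minutes (the budget value is an inference, not stated on the slide). A Python sketch of that arithmetic:

```python
MISSION_DOWNTIME_BUDGET_S = 960.0  # assumed ~16-minute budget (inferred)

def mttr_s(detection_s: float, recovery_s: float) -> float:
    """MTTR = failure detection time + recovery (restart/reboot) time."""
    return detection_s + recovery_s

def failures_per_mission(mttr: float) -> int:
    """Failures tolerable per mission within the downtime budget."""
    return int(MISSION_DOWNTIME_BUDGET_S // mttr)

# Watchdog detection (~10 s) plus a 30 s board restart:
m = mttr_s(detection_s=10.0, recovery_s=30.0)
print(m, failures_per_mission(m))  # 40.0 24, matching the table row
```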

  • Slide 23/42

    Reliable Software Characteristics Summary

    Operates within the reliability specification that satisfies customer expectations

    Measured in terms of failure rate and availability level

    The goal is rarely defect-free or ultra-high reliability

    Gracefully handles erroneous inputs from users, other systems, and transient hardware faults

    Attempts to prevent state or output data corruption from erroneous inputs

    Quickly detects, reports and recovers from software and transient hardware faults

    Software provides system behavior that is continuously monitoring, self-diagnosing and self-healing

    Prevents as many run-time faults as possible from becoming system-level failures

  • Slide 24/42

    Software Design for Reliability (DfR)

    What are the common options for software DfR?

    1. Use formal methods
    2. Follow an approach consistent with Hardware Reliability Programs
    3. Improve reliability through software process control
    4. Improve reliability by focusing on traditional software development best practices

    Formal Methods

    Methodologies that utilize mathematical modeling of a system's requirements and/or design in order to prove correctness.

    Primarily used for safety-critical systems that require very high degrees of:
    Confidence in expected system performance
    Quality audit information
    Targets of low or zero failure rates

    Formal methods are not applicable to every software project:
    Cannot be used for all aspects of system design (e.g., user interface design)
    Do not scale to handle large and complex system development

  • Slide 25/42

    Hardware Reliability Practices

    Software and hardware development practices are still fundamentally different:

    Architecture and design level modeling tools are not prevalent
    They provide limited simulation verification, especially if a real-time operating system is required
    2-way code generation is a common feature
    Software engineers find it difficult to complete a design with coding

    All software faults are from design, not manufacturing or wear-out

    Software is not built as an assembly of preexisting components
    Off-the-shelf software components do not provide reliability characteristics
    Most reused software components are modified and not re-certified before reuse

    Developing designs with very small defect densities is the Achilles' heel

    No true mechanisms available for general accelerated software testing
    Increasing processor speeds is usually implemented to find timing failures

  • Slide 26/42

    Hardware Reliability Practices (continued)

    Extending software designs after product deployment is commonplace
    Software updates are the preferred avenue for product extensions and customizations
    Software updates provide fast development turnaround and have little or no manufacturing or distribution costs

    Hardware failure analysis techniques (FMEAs and FTAs) are rarely used with software
    Software engineers find it difficult to adapt these techniques below the system level

  • Slide 27/42

    Software Process Control Methodologies

    Software process control assumes a correlation between development process maturity and latent defect density in the final software [Jones-1].

    If the current process level does not yield the desired software reliability, process audits and stricter process controls are implemented.

    Reliability improvement under this type of controlled methodology may increase very slowly.

    CMM Level | Inherent Defects/KSLOC | Delivered Defects/KSLOC
    1 | 124.8 | 5.8
    2 | 62.4 | 3.4
    3 | 31.2 | 2.1
    4 | 15.6 | 1.09
    5 | 7.8 | 0.39

  • Slide 28/42

    Defect Removal Potential of Best Practices

    Defect Removal Technique | Efficiency Range
    Design inspections | 45% to 60%
    Code inspections | 45% to 60%
    Unit testing | 15% to 45%
    Regression test | 15% to 30%
    Integration test | 25% to 40%
    Performance test | 20% to 40%
    System testing | 25% to 55%
    Acceptance test (1 customer) | 25% to 35%

    [Jones-2]
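    A common first-order way to reason about stacking these techniques (an assumption of this sketch, not a formula from [Jones-2]) treats each one as an independent filter, so a defect must escape every stage to be delivered:

```python
def combined_efficiency(efficiencies: list[float]) -> float:
    """1 - product of escape rates, assuming independent filters."""
    escape = 1.0
    for e in efficiencies:
        escape *= 1.0 - e
    return 1.0 - escape

# Range midpoints for design inspections, code inspections and
# system testing (illustrative inputs only):
print(round(combined_efficiency([0.525, 0.525, 0.40]), 3))  # ~0.865
```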

  • Slide 29/42

    Impact of Integrating Multiple Best Practices

    Formal Testing | Formal SQA Processes | Code Inspections/Reviews | Design Inspections/Reviews | Median Defect Efficiency
    n | n | n | n | 40%
    n | Y | n | n | 45%
    Y | n | n | n | 53%
    n | n | Y | n | 57%
    n | n | n | Y | 60%
    Y | Y | n | n | 65%
    n | n | Y | Y | 85%
    Y | Y | Y | Y | 99%

    Most development organizations will see greater improvements in software reliability with investments in design and code reviews than further investments in testing.

    [Jones-2]

  • Slide 30/42

    Best in Class Results by Application

    Application Type | Best in Class Defect Removal Efficiency
    Outsourced Software | 92%
    Web Software | 72%
    IT Software | 73%
    Military Software | 96%
    Commercial Software | 90%
    Embedded Software | 95%
    System Software | 94%

    [Jones-2]

  • Slide 31/42

    Phase Containment of Defects and Failures

    [Diagram: Typical behavior. Defects originate in the Reqts, Design and Coding phases but are discovered in Testing and Maintenance. Surprise!]

    [Diagram: Goal of phase containment. Defects are discovered in the same phase (Reqts, Design, Coding, Testing, Maintenance) in which they originate.]

  • Slide 32/42

    Typical Defect Reduction Behavior

    [Chart: defects found during system test across system builds SysBld #1 through #5; y-axis 50 to 200]

  • Slide 33/42

    DfR Phase Containment Behavior

    150

    50

    100

    200

    Req Design Code Unit SysBld SysBld SysBld SysBld SysBld Field

    Test #1 #2 #3 #4 #5 Failures

    Development System Test Deployment

    Goals are to:

    (1) Reduce field failure by finding most of the failures in-house

    (2) Find the majority of failures during development

  • Slide 34/42

    Requirements Phase DfR Practices

    Software reliability requirements:

    Identify important software functionality
    Essential, critical, and non-essential

    Explicitly define acceptable software failure rates

    Specify any behavior that impacts software availability
    Acceptable durations for software upgrades, reboots and restarts

    Define any operating cycles that apply to the system and the software in order to define opportunities for software restarts or rejuvenation
    Ex: maintenance or diagnostic periods, off-line periods, shutdown periods

  • Slide 35/42

    Design Phase DfR Practices

    The software designs should evolve using a multi-tiered approach:

    1. System Architecture
    Identify all essential system-level functionality that requires software
    Identify the role of software in detecting and handling hardware failure modes by performing system-level failure mode analysis

    2. High-level Design (HLD)
    Identify modules based on their functional importance and vulnerability to failures
    Essential functionality is most frequently executed
    Critical functionality is infrequently executed but implements key system operations (e.g., boot/restart, shutdown, backup, etc.)
    Vulnerability points that might flag defect clusters (e.g., synchronization points, hardware and module interfaces, initialization and restart code, etc.)
    Identify the visibility and access of major data objects outside of each module

    3. Low-level Design (LLD)
    Define availability behavior of the modules (e.g., restarts, retries, reboots, redundancy, etc.)
    Identify vulnerable sections of functionality in detail
    Functionality is targeted for fault tolerance techniques
    Focus on simple implementations and recovery actions

  • Slide 36/42

    Design Phase DfR Practices

    For software engineers, the highest ROI for defect and failure detection and removal is low-level design (LLD):

    LLD defines sufficient module logic and flow control details to allow analysis based on common failure categories and vulnerable portions of the design

    Failure handling behavior can be examined in sufficient detail

    LLD bridges the gap between traditional design specs and source code

    Most design defects that were previously caught during code reviews will now be caught in the LLD review

    Design defects are more likely to be fixed correctly, since the defect is caught in the design phase

    Most design defects found after the design phase are not properly fixed because the scheduling costs are too high

    Design defects require returning to the design phase to correct and review the design, and then correcting, re-reviewing and unit testing the code!

    LLD can also be reviewed for testability

    The goal of defining and validating all system test cases as part of an LLD review is achievable

  • Slide 37/42

    Code Phase DfR Practices

    Code reviews should be carried out in stages to remove the most defects:

    Properly focused design reviews, coupled with techniques to detect simple coding defects, will result in shorter code reviews

    Code reviews should focus on implementation, not design, issues

    Language defects can be detected with static and dynamic analysis tools

    Maintenance defects are caught with coding-standards pre-reviews

    Requiring authors to review their own code significantly reduces simple code defects and possible areas of code complexity

    The inspection portion of a review tries to identify missing exception handling points

    Software failure analysis will focus on the robustness of the exception handling behavior

    Software failure analysis should be performed as a separate code inspection once the code has undergone initial testing

  • Slide 38/42

    Test Phase DfR Practices

    Unit testing can be effectively driven using code coverage techniques (see the sketch below)

    Allows software engineers to define and execute unit testing adequacy requirements in a manner that is meaningful and easily measured

    Coverage requirements can vary based on the critical nature of a module

    System-level testing should measure reliability and validate as many customer operational profiles as possible

    Requires that most of the failure detection be performed prior to system testing

    System integration becomes the general functional validation phase
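    A minimal sketch of coverage-driven unit testing (the module under test is hypothetical; branch coverage would be measured with a tool such as coverage.py):

```python
import unittest

def clamp(value: float, low: float, high: float) -> float:
    """Hypothetical unit under test with three branches to cover."""
    if value < low:
        return low
    if value > high:
        return high
    return value

class ClampBranchCoverage(unittest.TestCase):
    """Adequacy requirement: every branch of clamp() is exercised."""

    def test_below_range(self):
        self.assertEqual(clamp(-1.0, 0.0, 10.0), 0.0)

    def test_above_range(self):
        self.assertEqual(clamp(11.0, 0.0, 10.0), 10.0)

    def test_in_range(self):
        self.assertEqual(clamp(5.0, 0.0, 10.0), 5.0)

if __name__ == "__main__":
    unittest.main()  # run under a coverage tool to verify the criterion
```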

  • Slide 39/42

    Reliability Measurements and Metrics

    Definitions:

    Measurements: data collected for tracking or to calculate meta-data (metrics)
    Ex: defect counts by phase, defect insertion phase, defect detection phase

    Metrics: information derived from measurements (meta-data)
    Ex: failure rate, defect removal efficiency, defect density

    Reliability measurements and metrics accomplish several goals (see the sketch below):

    Provide estimates of software reliability prior to customer deployment

    Track reliability growth throughout the life cycle of a release

    Identify defect clusters based on code sections with frequent fixes

    Determine where to focus improvements based on analysis of failure data

    SCM and defect tracking tools should be updated to facilitate the automatic tracking of this information:

    Allow for data entry in all phases, including development

    Distinguish code base updates for critical defect repair vs. any other changes (e.g., enhancements, minor defect repairs, coding standards updates, etc.)
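    For example, defect removal efficiency (a metric) is derived from two measurements: defects found in-house and defects found in the field. A minimal Python sketch with illustrative counts:

```python
def defect_removal_efficiency(in_house: int, field: int) -> float:
    """DRE = in-house defects / (in-house + field defects)."""
    total = in_house + field
    return in_house / total if total else 1.0

# Hypothetical release: 940 defects removed before release, 60 escaped.
print(f"DRE = {defect_removal_efficiency(940, 60):.0%}")  # DRE = 94%
```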

  • Slide 40/42

    Conclusion

  • Slide 41/42

    Conclusions

    Software developers can produce commercial software with high reliability by optimizing their best practices for defect removal.

    Most software organizations are not aware that this type of software DfR is a viable option, which is why much of our initial contact is made via the hardware organizations.

    Final Point: You don't need more tools, more testing or more processes to develop highly reliable software. The answer lies within the reach of your team's best practices!

  • Slide 42/42

    References

    [Jones-1] Jones, C., "Conflict and Litigation Between Software Clients and Developers," March 4, 1996.

    [Jones-2] Jones, C., "Software Quality in 2002: A Survey of the State of the Art," July 23, 2002.