Software Design for Reliability - Presentation


TRANSCRIPT

  • Slide 1/42

    Software Design for Reliability (DfR)

    George de la Fuente
    [email protected]
    (408) 828-1105
    www.opsalacarte.com

    Mike Silverman
    [email protected]
    (408) 472-3889
    www.opsalacarte.com

  • Slide 2/42

    Introduction

  • Slide 3/42

    The Current Situation

    Engineering teams must balance cost, schedule, performance and reliability to achieve optimal customer satisfaction.

    [Diagram: Cost, Schedule, Reliability and Performance balanced around Customer Satisfaction]

  • Slide 4/42

    The Need for Design for Reliability

    Reliability is no longer a separate activity performed by a distinct group within the organization.

    Product reliability goals, concerns and activities are integrated into nearly every function and process of an organization.

    The organization must factor reliability into every decision.

  • Slide 5/42

    Design for Reliability

    Design for Reliability is made up of the following 4 key steps:

    1. Assess the customer's situation
    2. Develop goals
    3. Write the program plan
    4. Execute the program plan

    The focus is on developing reliable products and improving customer satisfaction.

  • Slide 6/42

    Different Views of Reliability

    Product development teams
    View reliability as the domain to address mechanical, electrical and software issues.

    Customers
    View reliability as a system-level issue, with minimal concern placed on the distinction into sub-domains.

    Since the primary measure of reliability is made by the customer, engineering teams must maintain a balance of both views (system and sub-domain) in order to develop a reliable product.

    [Diagram: System Reliability = Mechanical Reliability + Electrical Reliability + SW Reliability]

  • Slide 7/42

    Reliability vs. Cost

    Intuitively, an emphasis on reliability to achieve a reduction in warranty and in-service costs results in some minimal increase in development and manufacturing costs.

    Use of the proper tools during the proper life cycle phase will help to minimize total Life Cycle Cost (LCC).

  • Slide 8/42

    Reliability vs. Cost, continued

    To minimize total Life Cycle Costs (LCC), an organization must do two things:

    1. Choose the best tools from all of the tools available and apply these tools at the proper phases of the product life cycle.

    2. Properly integrate these tools together to assure that the proper information is fed forwards and backwards at the proper times.

    This is Design for Reliability.

  • Slide 9/42

    Hardware Reliability and Cost Reduction Strategy

    A product-driven focus on reliability in order to reduce product hardware and mechanical warranty and in-service costs.

    This emphasis results in some increase in development and manufacturing costs.

    A good reliability program will strive to optimize the total Life Cycle Costs.

    [Chart: Cost vs. Reliability. Reliability program costs rise with reliability while elec/mech warranty and in-service costs fall; the total cost curve has an optimum cost point.]

  • Slide 10/42

    HW and SW Reliability and Cost Reduction Strategy

    A product-driven focus on reliability in order to reduce product hardware and mechanical warranty and in-service costs.

    This emphasis results in some increase in development and manufacturing costs.

    A good reliability program will strive to optimize the total Life Cycle Costs.

    [Chart: the same cost vs. reliability curves as the previous slide, annotated with the question:]

    The software impact on overall warranty costs is minimal at best?

    Why?

  • Slide 11/42

    Software Reliability and Cost

    Software has no associated manufacturing costs and minimal replacement costs.

    Warranty costs and savings are almost entirely allocated to hardware.

    Software development and test teams are considered fixed product development costs.

    If software reliability does not result in a perceived cost savings, why not focus solely on improving hardware reliability in order to save money?

    Sadly, some organizations take just this approach.

    But since they sell systems, not just hardware, sometimes customers or market forces cause companies to reconsider this approach.

    The primary root cause of embedded system failures is usually software, not hardware, by ratios as high as 10:1.

    The benefits of improved software reliability are not in direct cost savings, but rather in:

    Increased availability of software staff for additional development due to reduced operational maintenance demands

    Improved customer satisfaction resulting from increased system reliability

  • Slide 12/42

    Software Design for Reliability

  • Slide 13/42

    Software Quality vs. Software Reliability

    Software Quality: the level to which the software characteristics conform to all the specifications.

    The ISO 9126 Quality Model breaks software quality into factors, each with associated criteria:

    Functionality: suitability, accuracy, interoperability, security
    Reliability: maturity, fault tolerance, recoverability
    Usability: understandability, learnability, operability, attractiveness
    Efficiency: time behavior, resource utilization
    Maintainability: analysability, changeability, stability, testability
    Portability: adaptability, installability, co-existence, replaceability

  • Slide 14/42

    Software Reliability Definitions

    Examine the key points:

    The customer perception of the software's ability to deliver the expected functionality in the target environment without failing.

    Practical rewording of the definition:

    Software reliability is a measure of the software failures that are visible to a customer and prevent a system from delivering essential functionality.

  • Slide 15/42

    Terminology - Defect

    A flaw in software requirements, design or source code that produces unintended or incomplete run-time behavior.

    Defects of commission:
    Incorrect requirements are specified
    Requirements are incorrectly translated into a design model
    The design is incorrectly translated into source code
    The source code logic is flawed

    Defects of omission:
    Not all requirements were used in creating a design model
    The source code did not implement all the design
    The source code has missing or incomplete logic
    These are amongst the most difficult classes of defects to detect.

    Defects are static and can be detected and removed without executing the source code.

    Defects that cannot trigger software failures are not tracked or measured for reliability purposes. These are quality defects that affect other aspects of software quality, such as maintenance defects and defects in test cases or documentation.

    [Diagram: Defect]

  • Slide 16/42

    Terminology - Fault

    The result of triggering a software defect by executing the associated source code.

    Faults are NOT customer-visible.

    Example: a memory leak, or a packet corruption that requires retransmission by the higher layer stack.

    A fault may be the transitional state that results in a failure.

    Trivially simple defects (e.g., display spelling errors) do not have intermediate fault states.

    [Diagram: Defect → Fault]

  • Slide 17/42

    Terminology - Failure

    A customer (or operational system) observation or detection that is perceived as an unacceptable departure of operation from the designed software behavior.

    Failures are the visible, run-time symptoms of faults.

    Failures MUST be observable by the customer or another operational system.

    Not all failures result in system outages.

    NOTE: for the remainder of this presentation, the term failure will refer only to failure of essential functionality, unless otherwise stated.

    [Diagram: Defect → Fault → Failure]

  • Slide 18/42

    Summary of Defects and Failures

    There are 3 types of run-time defects:

    1. Defects that are never executed (so they don't trigger faults)

    2. Defects that are executed and trigger faults that do NOT result in failures

    3. Defects that are executed and trigger faults that result in failures

    Practical Software Reliability focuses solely on defects that have the potential to cause failures by:

    1. Detecting and removing defects that result in failures during development

    2. Implementing fault tolerance techniques to prevent faults from producing failures, or to mitigate the effects of the resulting failures (see the sketch below)

    [Diagram: Defect | Defect → Fault | Defect → Fault → Failure]
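    As a minimal illustration of the second strategy (a Python sketch; the sensor names, fault type and values are hypothetical, not from the slides), the wrapper below catches a run-time fault and substitutes a recovery action so the defect never surfaces as a customer-visible failure:

```python
import logging
import random

def read_sensor_raw() -> float:
    # Hypothetical driver call with an injected transient fault for demo.
    if random.random() < 0.3:
        raise IOError("bus timeout")  # a fault: triggered, but not yet a failure
    return 21.5

def read_sensor(last_good: float, retries: int = 3) -> float:
    """Fault tolerance wrapper: retry, then degrade gracefully."""
    for attempt in range(retries):
        try:
            return read_sensor_raw()  # executing the defect may trigger a fault
        except IOError as fault:
            logging.warning("transient fault (attempt %d): %s", attempt, fault)
    # Recovery action: fall back to the last known-good value so the fault
    # is contained and never becomes an essential-functionality failure.
    return last_good

print(read_sensor(last_good=20.9))
```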

  • Slide 19/42

    Software Availability

    System outages caused by software can be attributed to:

    1. Recoverable software failures
    2. Software upgrades
    3. Unrecoverable software failures

    NOTE: Recoverable software failures are the most frequent software cause of system outages.

    For outages due to recoverable software failures, availability is defined as:

    A(T) = MTTF / (MTTF + MTTR)

    where,

    MTTF is Mean Time To [next] Failure

    MTTR (Mean Time To [operational] Restoration) is still the duration of the outage, but without the notion of a repair time. Instead, it is the time until the same system is restored to an operational state via a system reboot or some level of software restart.
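    To make the arithmetic concrete, here is a minimal Python sketch of this formula (the numbers are illustrative, not from the slides):

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: A(T) = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)

# Illustrative example: one recoverable failure every 1,000 hours,
# restored by a software restart in about 4 minutes.
a = availability(mttf_hours=1000.0, mttr_hours=4.0 / 60.0)
print(f"A(T) = {a:.6f}")  # ~0.999933, i.e. just above "four nines"
```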

  • Slide 20/42

    Software Availability (continued)

    A(T) can be increased by either:

    Increasing MTTF (i.e., increasing reliability) using software reliability practices

    Reducing MTTR (i.e., reducing downtime) using software availability practices

    MTTR can be reduced by:

    Implementing hardware redundancy (sparingly) to mask the most likely failures

    Increasing the speed of failure detection (the key step)

    Software and system recovery speeds can be increased by implementing Fast Fail and software restart designs

    Modular design practices allow software restarts to occur at the smallest possible scope, e.g., thread or process vs. system or subsystem (see the sketch below)

    Drastic reductions in MTTR are only possible when availability is part of the initial system/software design (like redundancy)

    Customers generally perceive enhanced software availability as a software reliability improvement

    Even if the failure rate remains unchanged
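    A minimal Python sketch of the fast-detection, smallest-scope-restart idea (the heartbeat interval, timeout and class names are illustrative assumptions, not from the slides):

```python
import threading
import time

HEARTBEAT_TIMEOUT_S = 1.0  # assumed "very fast" detection budget

class Worker:
    """A monitored unit of work that publishes a heartbeat."""

    def __init__(self) -> None:
        self.last_beat = time.monotonic()
        self.restarts = 0

    def run_once(self) -> None:
        # ... perform one bounded slice of real work here ...
        self.last_beat = time.monotonic()  # heartbeat after each slice

    def restart(self) -> None:
        # Fast Fail recovery at the smallest scope: reinitialize this
        # worker only, instead of rebooting the whole system.
        self.restarts += 1
        self.last_beat = time.monotonic()

def watchdog(worker: Worker, stop: threading.Event) -> None:
    """Detects a hung worker quickly; detection speed drives MTTR."""
    while not stop.is_set():
        if time.monotonic() - worker.last_beat > HEARTBEAT_TIMEOUT_S:
            worker.restart()
        time.sleep(HEARTBEAT_TIMEOUT_S / 4)  # poll faster than the timeout
```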

  • Slide 21/42

    System Availability Timeframes

    Timeframe vs. Mission Downtime (Unavailability Range)

    Availability | Availability Class | Downtime (3-month mission) | Downtime (1-year timeframe)
    90% (1 nine) | (1) Unmanaged | 9.13 days | 36.5 days/year (52,560 mins/year)
    99% (2 nines) | (2) Managed (good web servers) | 21.9 hours | 3.65 days/year (5,256 mins/year)
    99.9% (3 nines) | (3) Well-managed | 2.19 hours | 8.8 hours/year (525.6 mins/year)
    99.99% (4 nines) | (4) Fault Tolerant (better commercial systems) | 13.14 minutes | 52.6 mins/year
    99.999% (5 nines) | (5) High-Availability (high-reliability products) | 1.31 minutes | 5.3 mins/year
    99.9999% (6 nines) | (6) Very-High-Availability | 7.88 seconds | 31.5 secs/year (2.6 mins/5 years)
    99.99999% (7 nines) to 99.9999999% (9 nines) | (7) Ultra-Availability | 0.79 seconds | 3.2 secs/year to 31.5 millisecs/year (15.8 secs/5 years or less)
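    The downtime columns follow directly from the unavailability (1 - A) multiplied by the timeframe; a small Python sketch reproduces the yearly figures:

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000 s

def downtime_per_year_s(availability: float) -> float:
    """Allowed downtime per year (seconds) at a given availability."""
    return (1.0 - availability) * SECONDS_PER_YEAR

for nines, a in [(3, 0.999), (4, 0.9999), (5, 0.99999), (6, 0.999999)]:
    print(f"{nines} nines: {downtime_per_year_s(a):>9,.1f} s/year")
# 3 nines:  31,536.0 s/year  (~8.8 hours)
# 4 nines:   3,153.6 s/year  (~52.6 minutes)
# 5 nines:     315.4 s/year  (~5.3 minutes)
# 6 nines:      31.5 s/year
```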

  • Slide 22/42

    Software Availability

    Failure detection speed vs. recovery scope (per mission):

    1) Manual detection (no S/W mechanism)
       2) Implementation example: reboot/restart via user interface
       3) Failure detection speed: 10 min. (avg detection time)
       Slow recovery action: system reboot; recovery speed 4 minutes (average); MTTR 14 minutes; 1 failure/mission
       Fast recovery action: system restart; recovery speed 100 seconds (average); MTTR 11 mins & 40 secs; 1 failure/mission

    1) Slow, automatic detection mechanism
       2) Implementation example: health message protocols w/retries
       3) Failure detection speed: 30 seconds
       Slow recovery action: system reboot; recovery speed 4 minutes (average); MTTR 4 mins & 30 secs; 3 failures/mission
       Fast recovery action: system restart; recovery speed 100 seconds (average); MTTR 130 seconds; 7 failures/mission

    1) Fast, automatic detection mechanism
       2) Implementation example: watchdog timer reset
       3) Failure detection speed: 5-10 seconds
       Slow recovery action: board reboot; recovery speed 90 seconds; MTTR 100 seconds; 9 failures/mission
       Fast recovery action: board restart; recovery speed 30 seconds; MTTR 40 seconds; 24 failures/mission

    1) Very fast, automatic detection mechanism
       2) Implementation example: thread or driver monitors
       3) Failure detection speed: up to 1 second
       Slow recovery action: board reboot; recovery speed 90 seconds; MTTR 91 seconds; 10 failures/mission
       Fast recovery action: thread/driver restart; recovery speed 50 ms (average); MTTR 1.05 seconds; 914 failures/mission

    Key: 1) Software Failure Detection Mechanism; 2) Implementation Example; 3) Failure Detection Speed; each recovery line gives a) recovery action, b) recovery speed, c) MTTR, d) mission failure rate.
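    Note that each MTTR above is simply detection time plus recovery time, and the failures/mission figures are consistent with a fixed per-mission downtime budget of roughly 16 minutes (the budget value is an inference, not stated on the slide). A Python sketch of that arithmetic:

```python
MISSION_DOWNTIME_BUDGET_S = 960.0  # assumed ~16-minute budget (inferred)

def mttr_s(detection_s: float, recovery_s: float) -> float:
    """MTTR = failure detection time + recovery (restart/reboot) time."""
    return detection_s + recovery_s

def failures_per_mission(mttr: float) -> int:
    """Failures tolerable per mission within the downtime budget."""
    return int(MISSION_DOWNTIME_BUDGET_S // mttr)

# Watchdog detection (~10 s) plus a 30 s board restart:
m = mttr_s(detection_s=10.0, recovery_s=30.0)
print(m, failures_per_mission(m))  # 40.0 24, matching the table row
```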

  • Slide 23/42

    Reliable Software Characteristics Summary

    Operates within the reliability specification that satisfies customer expectations

    Measured in terms of failure rate and availability level

    The goal is rarely defect-free or ultra-high reliability

    Gracefully handles erroneous inputs from users, other systems, and transient hardware faults

    Attempts to prevent state or output data corruption from erroneous inputs

    Quickly detects, reports and recovers from software and transient hardware faults

    Software provides system behavior that is continuously monitoring, self-diagnosing and self-healing

    Prevents as many run-time faults as possible from becoming system-level failures

  • Slide 24/42

    Software Design for Reliability (DfR)

    What are the common options for software DfR?

    1. Use formal methods
    2. Follow an approach consistent with Hardware Reliability Programs
    3. Improve reliability through software process control
    4. Improve reliability by focusing on traditional software development best practices

    Formal Methods

    Methodologies that utilize mathematical modeling of a system's requirements and/or design in order to prove correctness.

    Primarily used for safety-critical systems that require very high degrees of:
    Confidence in expected system performance
    Quality audit information
    Targets of low or zero failure rates

    Formal methods are not applicable to every software project:
    Cannot be used for all aspects of system design (e.g., user interface design)
    Do not scale to handle large and complex system development

  • Slide 25/42

    Hardware Reliability Practices

    Software and hardware development practices are still fundamentally different:

    Architecture and design level modeling tools are not prevalent
    They provide limited simulation verification, especially if a real-time operating system is required
    2-way code generation is a common feature
    Software engineers find it difficult to complete a design with coding

    All software faults are from design, not manufacturing or wear-out

    Software is not built as an assembly of preexisting components
    Off-the-shelf software components do not provide reliability characteristics
    Most reused software components are modified and not re-certified before reuse

    Developing designs with very small defect densities is the Achilles' heel

    No true mechanisms available for general accelerated software testing
    Increasing processor speeds is usually implemented to find timing failures

  • Slide 26/42

    Hardware Reliability Practices (continued)

    Extending software designs after product deployment is commonplace
    Software updates are the preferred avenue for product extensions and customizations
    Software updates provide fast development turnaround and have little or no manufacturing or distribution costs

    Hardware failure analysis techniques (FMEAs and FTAs) are rarely used with software
    Software engineers find it difficult to adapt these techniques below the system level

  • Slide 27/42

    Software Process Control Methodologies

    Software process control assumes a correlation between development process maturity and latent defect density in the final software [Jones-1].

    If the current process level does not yield the desired software reliability, process audits and stricter process controls are implemented.

    Reliability improvement under this type of controlled methodology may increase very slowly.

    CMM Level | Inherent Defects/KSLOC | Delivered Defects/KSLOC
    1 | 124.8 | 5.8
    2 | 62.4 | 3.4
    3 | 31.2 | 2.1
    4 | 15.6 | 1.09
    5 | 7.8 | 0.39

  • Slide 28/42

    Defect Removal Potential of Best Practices

    Defect Removal Technique | Efficiency Range
    Design inspections | 45% to 60%
    Code inspections | 45% to 60%
    Unit testing | 15% to 45%
    Regression test | 15% to 30%
    Integration test | 25% to 40%
    Performance test | 20% to 40%
    System testing | 25% to 55%
    Acceptance test (1 customer) | 25% to 35%

    [Jones-2]
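    A common first-order way to reason about stacking these techniques (an assumption of this sketch, not a formula from [Jones-2]) treats each one as an independent filter, so a defect must escape every stage to be delivered:

```python
def combined_efficiency(efficiencies: list[float]) -> float:
    """1 - product of escape rates, assuming independent filters."""
    escape = 1.0
    for e in efficiencies:
        escape *= 1.0 - e
    return 1.0 - escape

# Range midpoints for design inspections, code inspections and
# system testing (illustrative inputs only):
print(round(combined_efficiency([0.525, 0.525, 0.40]), 3))  # ~0.865
```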

  • Slide 29/42

    Impact of Integrating Multiple Best Practices

    Formal Testing | Formal SQA Processes | Code Inspections/Reviews | Design Inspections/Reviews | Median Defect Efficiency
    n | n | n | n | 40%
    n | Y | n | n | 45%
    Y | n | n | n | 53%
    n | n | Y | n | 57%
    n | n | n | Y | 60%
    Y | Y | n | n | 65%
    n | n | Y | Y | 85%
    Y | Y | Y | Y | 99%

    Most development organizations will see greater improvements in software reliability with investments in design and code reviews than further investments in testing.

    [Jones-2]

  • Slide 30/42

    Best in Class Results by Application

    Application Type | Best in Class Defect Removal Efficiency
    Outsourced Software | 92%
    Web Software | 72%
    IT Software | 73%
    Military Software | 96%
    Commercial Software | 90%
    Embedded Software | 95%
    System Software | 94%

    [Jones-2]

  • Slide 31/42

    Phase Containment of Defects and Failures

    [Diagram: Typical behavior. Defects originate in the Reqts, Design and Coding phases but are discovered in Testing and Maintenance. Surprise!]

    [Diagram: Goal of phase containment. Defects are discovered in the same phase (Reqts, Design, Coding, Testing, Maintenance) in which they originate.]

  • Slide 32/42

    Typical Defect Reduction Behavior

    [Chart: defects found during system test across system builds SysBld #1 through #5; y-axis 50 to 200]

  • Slide 33/42

    DfR Phase Containment Behavior

    150

    50

    100

    200

    Req Design Code Unit SysBld SysBld SysBld SysBld SysBld Field

    Test #1 #2 #3 #4 #5 Failures

    Development System Test Deployment

    Goals are to:

    (1) Reduce field failure by finding most of the failures in-house

    (2) Find the majority of failures during development

  • Slide 34/42

    Requirements Phase DfR Practices

    Software reliability requirements:

    Identify important software functionality
    Essential, critical, and non-essential

    Explicitly define acceptable software failure rates

    Specify any behavior that impacts software availability
    Acceptable durations for software upgrades, reboots and restarts

    Define any operating cycles that apply to the system and the software in order to define opportunities for software restarts or rejuvenation
    Ex: maintenance or diagnostic periods, off-line periods, shutdown periods

  • Slide 35/42

    Design Phase DfR Practices

    The software designs should evolve using a multi-tiered approach:

    1. System Architecture
    Identify all essential system-level functionality that requires software
    Identify the role of software in detecting and handling hardware failure modes by performing system-level failure mode analysis

    2. High-level Design (HLD)
    Identify modules based on their functional importance and vulnerability to failures
    Essential functionality is most frequently executed
    Critical functionality is infrequently executed but implements key system operations (e.g., boot/restart, shutdown, backup, etc.)
    Vulnerability points that might flag defect clusters (e.g., synchronization points, hardware and module interfaces, initialization and restart code, etc.)
    Identify the visibility and access of major data objects outside of each module

    3. Low-level Design (LLD)
    Define availability behavior of the modules (e.g., restarts, retries, reboots, redundancy, etc.)
    Identify vulnerable sections of functionality in detail
    Functionality is targeted for fault tolerance techniques
    Focus on simple implementations and recovery actions

  • Slide 36/42

    Design Phase DfR Practices

    For software engineers, the highest ROI for defect and failure detection and removal is low-level design (LLD):

    LLD defines sufficient module logic and flow control details to allow analysis based on common failure categories and vulnerable portions of the design

    Failure handling behavior can be examined in sufficient detail

    LLD bridges the gap between traditional design specs and source code

    Most design defects that were previously caught during code reviews will now be caught in the LLD review

    Design defects are more likely to be fixed correctly, since the defect is caught in the design phase

    Most design defects found after the design phase are not properly fixed because the scheduling costs are too high

    Design defects require returning to the design phase to correct and review the design, and then correcting, re-reviewing and unit testing the code!

    LLD can also be reviewed for testability

    The goal of defining and validating all system test cases as part of an LLD review is achievable

  • Slide 37/42

    Code Phase DfR Practices

    Code reviews should be carried out in stages to remove the most defects:

    Properly focused design reviews, coupled with techniques to detect simple coding defects, will result in shorter code reviews

    Code reviews should focus on implementation, not design, issues

    Language defects can be detected with static and dynamic analysis tools

    Maintenance defects are caught with coding-standards pre-reviews

    Requiring authors to review their own code significantly reduces simple code defects and possible areas of code complexity

    The inspection portion of a review tries to identify missing exception handling points

    Software failure analysis will focus on the robustness of the exception handling behavior

    Software failure analysis should be performed as a separate code inspection once the code has undergone initial testing

  • Slide 38/42

    Test Phase DfR Practices

    Unit testing can be effectively driven using code coverage techniques (see the sketch below)

    Allows software engineers to define and execute unit testing adequacy requirements in a manner that is meaningful and easily measured

    Coverage requirements can vary based on the critical nature of a module

    System-level testing should measure reliability and validate as many customer operational profiles as possible

    Requires that most of the failure detection be performed prior to system testing

    System integration becomes the general functional validation phase
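    A minimal sketch of coverage-driven unit testing (the module under test is hypothetical; branch coverage would be measured with a tool such as coverage.py):

```python
import unittest

def clamp(value: float, low: float, high: float) -> float:
    """Hypothetical unit under test with three branches to cover."""
    if value < low:
        return low
    if value > high:
        return high
    return value

class ClampBranchCoverage(unittest.TestCase):
    """Adequacy requirement: every branch of clamp() is exercised."""

    def test_below_range(self):
        self.assertEqual(clamp(-1.0, 0.0, 10.0), 0.0)

    def test_above_range(self):
        self.assertEqual(clamp(11.0, 0.0, 10.0), 10.0)

    def test_in_range(self):
        self.assertEqual(clamp(5.0, 0.0, 10.0), 5.0)

if __name__ == "__main__":
    unittest.main()  # run under a coverage tool to verify the criterion
```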

  • Slide 39/42

    Reliability Measurements and Metrics

    Definitions:

    Measurements: data collected for tracking or to calculate meta-data (metrics)
    Ex: defect counts by phase, defect insertion phase, defect detection phase

    Metrics: information derived from measurements (meta-data)
    Ex: failure rate, defect removal efficiency, defect density

    Reliability measurements and metrics accomplish several goals (see the sketch below):

    Provide estimates of software reliability prior to customer deployment

    Track reliability growth throughout the life cycle of a release

    Identify defect clusters based on code sections with frequent fixes

    Determine where to focus improvements based on analysis of failure data

    SCM and defect tracking tools should be updated to facilitate the automatic tracking of this information:

    Allow for data entry in all phases, including development

    Distinguish code base updates for critical defect repair vs. any other changes (e.g., enhancements, minor defect repairs, coding standards updates, etc.)
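    For example, defect removal efficiency (a metric) is derived from two measurements: defects found in-house and defects found in the field. A minimal Python sketch with illustrative counts:

```python
def defect_removal_efficiency(in_house: int, field: int) -> float:
    """DRE = in-house defects / (in-house + field defects)."""
    total = in_house + field
    return in_house / total if total else 1.0

# Hypothetical release: 940 defects removed before release, 60 escaped.
print(f"DRE = {defect_removal_efficiency(940, 60):.0%}")  # DRE = 94%
```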

  • Slide 40/42

    Conclusion

  • Slide 41/42

    Conclusions

    Software developers can produce commercial software with high reliability by optimizing their best practices for defect removal.

    Most software organizations are not aware that this type of software DfR is a viable option, which is why much of our initial contact is made via the hardware organizations.

    Final Point: You don't need more tools, more testing or more processes to develop highly reliable software. The answer lies within the reach of your team's best practices!

  • Slide 42/42

    References

    [Jones-1] Jones, C., "Conflict and Litigation Between Software Clients and Developers," March 4, 1996.

    [Jones-2] Jones, C., "Software Quality in 2002: A Survey of the State of the Art," July 23, 2002.