
Metrics to evaluate vendor-developed software based on test case execution results

by K. Bassin, S. Biyani, and P. Santhanam

Various business considerations have led a growing number of organizations to rely on external vendors to develop software for their needs. Much of the day-to-day data from vendors are not available to the vendee, and typically the vendee organization ends up with its own system or acceptance test to validate the software. The 2000 Summer Olympics in Sydney was one such project in which IBM evaluated vendor-delivered code to ensure that all elements of a highly complex system could be integrated successfully. The readiness of the vendor-delivered code was evaluated based primarily on the actual test execution results. New metrics were derived to measure the degree of risk associated with a variety of test case failures such as functionality not enabled, bad fixes, and defects not fixed during successive iterations. The relationship of these metrics to the actual cause was validated through explicit communications with the vendor and the subsequent actions to improve the quality and completeness of the delivered code. This paper describes how these metrics can be derived from the execution data and used in a software project execution environment. Even though we have applied these metrics in a vendor-related project, the underlying concepts are useful to many software projects.

Appropriate use of software metrics is vital to any software project.1–3 From the software release management point of view, there have been examples of metrics used in various companies.1,4–6 These metrics help track aspects of an ongoing software project, such as changing requirements, rates of finding and fixing defects, and growth in size and complexity of code. From the testing point of view,6,7 the metrics typically focus on the quantity and quality of software, the progress of testing, and the effectiveness of testing. Examples of typical metrics include product or release size over time, defect discovery rate over time, defect backlog over time, test progress over time (plan, attempted, actual), and percentage of test cases attempted. Although each of these metrics has merit, overall they were insufficient to address the special requirements imposed by the IBM project for the 2000 Summer Olympics, held in Sydney, Australia, particularly with regard to evaluating components developed by an external vendor.

The status of these metrics is typically reported either as measured against expected goals in the schedule or against a previous release with similar content for comparison. In the case of the 2000 Summer Olympics, data from prior Olympics were either not available or not sufficiently similar in content to be useful. Thus, the assessment of progress being made, the stability of the product in terms of functional content and code quality, and the effectiveness of test activities had to depend on being measured against expected goals in the current project rather than on historical evidence.

The trend toward increased outsourcing of software development demands the regular use of new metrics and methodology to assure adequate quality and schedule integrity. One of the challenges a vendee organization faces is the evaluation of the delivered code in terms of functionality, performance, etc.


Because of the implicit risks in software projects, the contractual commitments for quality and completeness are generally at a high level, and typically the vendee organization defines and executes its own system or acceptance test to validate expectations. Often the code is delivered in an incremental fashion through multiple iterations, and many of the detailed operational data from the vendor are not available to the vendee.

This paper presents a set of metrics that can be derived by the vendee organization based on the evaluation of its own test cases on code delivered to it, without depending on the sparse information typically provided by vendors. These metrics can significantly enhance the strength of vendee organizations in accepting vendor code by giving them objective measures of the vendor's performance. Though problems typical of vendor-developed software are known, there appears to be nothing in the literature describing the derivation and validation of quantitative metrics from actual test case execution records.

Background of project

The 2000 Summer Olympic Games were considered to be the largest single sporting event in the world. Table 1 shows some of the facts that reflect the scope of this endeavor. In addition to capturing and reporting event results, IBM's role in those Summer Olympics had expanded to include managing and monitoring the information technology (IT) infrastructure and many other aspects of a total of 39 venues. This role included everything from keeping track of each of the 15,836 athletes, coaches, and officials to managing food service for the 5.5 million spectators at the games. Preparing for the Olympic Games has been described as building a billion dollar company with 200,000 employees in less than four years, then tearing it down in 90 days. One critical challenge was the development, test, and integration of the venue results systems. The diagram in Figure 1 is a representation of the overall architecture and the complex data flows and interfaces to and from venue results. These systems are responsible for capturing and collecting the results of each sporting event and reporting these results to 15,000 accredited media personnel (journalists and broadcasters), sport officials, coaches, participants, and 3.7 billion remote spectators.

Figure 1  Major applications and critical data flows and timing requirements for the 2000 Summer Olympics (trigger in this figure not related to ODC trigger)

Table 1  The scope and complexity of Olympics events

                                        Winter Games                Summer Games
Number of medal events                  68                          300
Different sports                        7                           28
Competition venues                      12                          39
Accreditation venues                    25                          30
Concurrent events                       9                           7
CIS (Commentator Information System)
  screen formats                        535                         800
Languages supported                     English, French, Japanese   English, French
Timing vendor                           Seiko (Nagano)              Swiss T (Sydney)
Timing performance                      Subsecond                   Subsecond
Report types                            2,000                       3,000
INFO users                              84,000                      260,000
INFO terminals                          1,500                       2,000
News records                            4,000                       10,000
Biographical records                    8,000                       35,000
Historical records                      500,000                     1,500,000
Average INFO requests per day           710,000                     6,500,000
Internet hits per day                   56M peak                    874.5M peak

Project description. The focus of this paper is on the development and testing of the venue results components. These components were developed by a vendor in Spain, and the testing and integration of these and other components were executed by the IBM Games System Center (GSC) in Spain. Specific responsibilities were defined at the beginning as follows:

● Design of the venue results components was primarily the responsibility of the vendor, with IBM providing input and participating in design reviews.

● The vendor was responsible for coding and early testing (including code inspections, unit test, and simple functional testing for each sport).

● GSC was responsible for multiple test activities: verification test (straightforward tests intended to verify that the component was sufficiently stable to proceed to subsequent test activities), function test (thorough testing within each component in terms of basic functional requirements), function integration test (testing of integrated components to ensure that functional interaction and interfaces meet requirements), and system integration test (testing of the fully integrated system).


● Acceptance testing, performance test, and a more extensive and final system integration test were conducted by the team in Sydney, Australia.

Late in 1998 it was decided that advanced analysis and decision support, which included the use of Orthogonal Defect Classification (ODC),8–10 was required for the evaluation of progress and risk associated with testing and integrating components.

Metrics used prior to work reported in this paper. Many tools and methodologies to evaluate quality were already in place. Quality plans had been developed to define acceptable outcomes. Inspections of a wide variety of artifacts, including requirements, design, code, use cases, and object and class definitions, were defined as key process steps. Status reviews were an integral part of all major checkpoints during the test and integration activities. Many process and product evaluation metrics were already being used by the vendor and GSC. The vendor responsible for developing the venue results components was highly respected and experienced in delivering these types of applications, having participated in such high-profile (albeit single venue) international events as the Wimbledon tennis competition. The vendor had implemented a set of metrics based on Rational Rose technology, including such object-oriented measurements11 as attributes per class, associations per class, and number of children per class. In addition, they were systematically measuring code complexity, primarily by means of cyclomatic complexity.12 Given these considerations, why was there a perceived need for additional decision support?

Need for additional metrics. One concern was the cyclomatic complexity metrics used by the vendor. These metrics were able to detect variances in complexity and implied risk; however, the measurements could not be taken until after the design and code were complete, resulting in the risk that there would be insufficient time to address exposures and still meet the demanding schedule. Of greater significance was the fact that applying these measurements to the venue results components independently from the rest of the system precluded an evaluation of how the components would behave when integrated. This fact was particularly relevant with regard to the multivenue Olympics system, whose architecture consisted of many platforms, components, and technologies.

A key inhibiting factor was that the plan called not only for incremental deliveries of the vendor product overall but for individual functions for each sport component to be delivered incrementally on independent schedules. This delivery routine had implications in terms of defining and executing a test plan based not on the full functional content of a sport product upon delivery, but rather on those functional elements that were presumed to be part of the delivery. Thus, it was extremely difficult to define an expectation in terms of conventional metrics with regard to the overall product.

The extent to which defect backlog over time was a concern had implications with regard to both product stability and test effectiveness, and it ultimately could inhibit progress. Unfortunately, it was difficult to identify a trend for a model that called for defect fixes to be held for relatively long, inconsistent intervals and delivered in large batches, as was the case in the 2000 Summer Olympics. The ability to utilize an S-curve model to track progress was also impacted by the frequency and interval inconsistency of incremental code and fix deliveries. An S-curve model is ideal for an environment in which defect discovery is random and test effort is steadily executed. In the 2000 Summer Olympics, testing for each sport would be halted for long periods of time until the new increment had been received.

Organizational churn was also a factor. Up to 80 percent of the test team members changed every few weeks as attention was directed to each individual sport and corresponding specialists were brought in. Thus there was a relatively small core team of test experts able to provide continuity in terms of assessing status, identifying weaknesses, and putting actions in place to address them.

Because of the severe schedule stress, interaction between the technical team and the analyst was kept at an absolute minimum. This limitation meant that the evaluations had to be based almost entirely on what could be derived from available data. Fortunately, the collection of data by the test team was very thorough and, with a few exceptions, complete. The following section describes the data that were collected during the project and identifies those elements that were available for the specific analysis described in this paper.

Data

Although there were gaps in terms of defect data, as previously described, there was nevertheless a great deal of valuable data collected throughout the project. These data originated from three sources: vendor, GSC, and test activities.

Vendor metrics, data, and measurements. The vendor captured measurements that enabled its analyst to assess productivity and complexity, and to project quality for each sport. The vendor initially provided GSC with complexity values based on Rational Rose as well as a projection of quality for each sport, although not the underlying data. The measurements and projections were helpful in the initial planning stages. As time progressed, however, decisions were made to deviate from the original plan in terms of technology and function, creating a dynamic environment difficult to size and measure accurately. Neither the actual results for defect detection and removal activities, nor the defect record counts, nor the defect records themselves were made accessible to GSC. Additionally, information regarding the required fixes for defects that were discovered and reported by IBM (GSC and Australia) was not available to IBM.

GSC metrics, data, and measurements. The quality plan of GSC described many process- and product-related metrics designed to ensure a high standard of quality for all deliverables and to maintain the planned schedule.

The process-related metrics primarily targeted the assessment of current status and, to some degree, anticipated changing needs for resource and skill. Because of the nature of the Olympics project (i.e., it was a single engagement, executed once), continuous improvement over multiple generations of a product was not a priority in terms of the process-related metrics. It was, however, important to be able to identify process weaknesses and exposures over the series of incremental product deliveries. Process metrics included summary-level evaluations of effort, staffing, schedule, successful completion of an activity, cost, and productivity. These metrics could be used early in the project to anticipate resource and skill requirements, and during the project they could be used to provide executive-level views of status. They proved to be inadequate, however, at more granular views in isolating cause and effect and providing sufficient insight to enable management and technical teams to address deficiencies quickly.

Product-related metrics included many that are typical of software development activities, such as system size, defect density, defect removal rate, incoming defect rate, and defect closure rate.

The success of these metrics was, by definition, greatly affected by the availability of accurate information regarding component and increment sizes. As the project content and design evolved, the original form of these metrics was inadequate for keeping pace with the changes. The set of metrics was enhanced through the implementation of ODC,8–10 which enables evaluation based on distributions of semantic attributes rather than defect rates, and as such is not as dependent on the accuracy of product size measurements as metrics based on defect rates per kloc (thousands of lines of code). Unfortunately, since defect data were not complete, it was necessary to find other measurements that would "fill in the gaps" and improve the accuracy of the analysis.

Test data. Fortunately, the data collected in planning and executing test activities were extensive—far more extensive than for a typical project, and more thorough than for many large, complex projects. Considerable effort was invested in the early stages to plan and evaluate the comprehensiveness of the various test activities. ODC was used innovatively13 in classifying and analyzing test cases. This use made it possible to evaluate the completeness of the planned effort across many dimensions (for example, depth and variety of testing relative to each sport, event reporting with regard to media and content requirements, and adherence to test strategy) and to set expectations for test execution. The investment in capturing data in test cases and their associated execution and defect records proved to be invaluable in terms of the analysis that could be performed throughout the project.

Categorization. All data were organized first by a main category based on 39 competition venues plus 12 general information categories. The main categories were subdivided into scenario groups. Typically, these groups represented different events within a sport (such as men's singles within tennis), as well as certain additional categories, such as installation, configuration, and dynamic information. Another categorization, called function type, was used to describe the functionality under test, such as different types of data entry, and reports before, during, and after the event. Many scenarios and test cases were executed during multiple test activities (i.e., verification test, function test, etc.) to ensure that functions could be successfully invoked as a component became more complete, the configurations increased in complexity, and the environment evolved to be more realistic.


Hierarchical structure. The planned test effort, as well as the current status of successful execution, can be understood and evaluated by examining the hierarchy of records representing it. Figure 2 depicts this hierarchy and illustrates some typical sequences. Test Case 1 shows a successful execution under Activity 1 with no defects found. Test Case 2 shows an unsuccessful execution under Activity 1 with its associated defect. A subsequent execution under the same activity verifies the fix. Test Case 3 shows that the first execution under Activity 1 was successful, whereas the execution in Activity 2 failed. This failure may be caused by a more complex environment during Activity 2 or by regression of the code. Test Case 4 was executed for the first time in Activity 3 and revealed two defects. In Test Case 3 and Test Case 4, for simplicity we have not shown the execution records associated with the successful retesting of the fix.

Figure 2  Hierarchy and relationship of test scenarios, test cases, execution records, and defects

Test case data. The data collected for each test case included the categorization outlined above, the test activities during which the test case was to be executed, and some ancillary information such as the author and last modification time. In addition, an ODC trigger8 was assigned to each test case. ODC defines a trigger as the environment, catalyst, or special conditions that must exist in order for a defect to surface. The description or instruction normally associated with a test case includes this information. For example, the test setup would specify a certain hardware or software configuration, or an explicit condition such as simulation of a heavy workload. These descriptions map to specific definitions of ODC triggers. Thus, the focus or intention of the test case is understood when it is defined and can be represented by an ODC trigger at that time. Later, during test execution, if a defect was revealed, the trigger would be automatically copied to any defect records associated with that test case. The trigger values in the defect records were validated against the description of the failure for accuracy. Reference 13 describes this use of ODC in the Olympics project, providing details of the method, validation, and results achieved.

Execution data. When a test case execution was attempted for a given release (e.g., increment delivery), an execution record was created. Each execution record included an execution history consisting of an entry for each time the test case was attempted within the same release and the result of each attempt. These entries would reflect the date and time of the attempt and the status (pass, completed with errors, fail, not implemented, or blocked). A status of "failed" or "completed with errors" would result in the generation of an associated defect record. A status of "not implemented" indicated that the test case attempt did not succeed because the targeted function had not yet been implemented. "Blocked" status was used when the test case attempt did not succeed because access to the targeted area was blocked by code that was not functioning correctly. Defect records would not be recorded for these latter two statuses. Additional information in the execution record included the test activity, pointers to any defects found during execution, and other ancillary information.
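The record hierarchy and execution-history rules just described can be made concrete with a small data model. The sketch below is our own illustration in Python; the class and field names are assumptions, not the project's schema. It captures two rules from the text: only "fail" and "completed with errors" attempts generate defect records, and a defect inherits the ODC trigger assigned to its test case.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional

class Status(Enum):
    PASS = "pass"
    COMPLETED_WITH_ERRORS = "completed with errors"
    FAIL = "fail"
    NOT_IMPLEMENTED = "not implemented"
    BLOCKED = "blocked"

# Per the process above, only these two statuses generate defect records.
DEFECT_GENERATING = {Status.FAIL, Status.COMPLETED_WITH_ERRORS}

@dataclass
class Defect:
    severity: int            # set by the tester; 1 = most severe
    trigger: str             # ODC trigger, inherited from the test case
    state: str = "open"      # open, closed, canceled, ...

@dataclass
class Attempt:
    timestamp: str           # date and time of the attempt
    status: Status

@dataclass
class ExecutionRecord:
    release: str             # increment delivery to which the attempts belong
    activity: str            # verification test, function test, ...
    history: List[Attempt] = field(default_factory=list)
    defects: List[Defect] = field(default_factory=list)

@dataclass
class TestCase:
    name: str
    scenario_group: str      # e.g., men's singles within tennis
    function_type: str       # e.g., data entry, reports
    trigger: str             # ODC trigger assigned when the test case was defined
    planned_activities: List[str] = field(default_factory=list)
    executions: List[ExecutionRecord] = field(default_factory=list)

def record_attempt(case: TestCase, rec: ExecutionRecord, attempt: Attempt,
                   severity: int = 3) -> Optional[Defect]:
    """Append an attempt; open a defect (inheriting the trigger) only when required."""
    rec.history.append(attempt)
    if attempt.status in DEFECT_GENERATING:
        defect = Defect(severity=severity, trigger=case.trigger)
        rec.defects.append(defect)
        return defect
    return None
```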

Defect data. For each defect found during testing, the associated execution and test case information was recorded along with defect description, severity, priority, status (open, closed, canceled, etc.), open date, close date, etc. In addition, defects were classified according to ODC, providing information about the semantics of the defect, including test activity, trigger, and impact. Trigger has been previously defined. Activity refers to the task being performed when the defect was uncovered (for example, function test, performance test, or system test). Impact reflects the expected effect the defect would have on an end user should it have escaped the test (for example, reliability or usability). Aside from these attributes known when the defect was uncovered, additional defect attributes were classified based on the fix information: target, defect-type, and qualifier. ODC target reflects the high-level view of what needed to be fixed (such as design, code, or documentation). Defect-type expresses the complexity and scope of the fix, as well as its characteristics (for example, simple defects such as initializing a variable or complex defects such as timing or serialization). Qualifier describes the defect-type in terms of whether the defect was an incorrect, missing, or extraneous element. Since fix information was not provided by the vendor, it was necessary to classify fix attributes using the discovery information and a relationship model based on comparable projects and experience.

Data quality. The overall quality of the data was very good. As is often the case with large data sets, not all records were "clean." Some invalid values crept in, and it was necessary to apply some filters to eliminate them. Defect records were validated for accuracy. Inconsistency in date formats was another problem area. Although the time stamps were generated automatically by the test environment, variation in local settings caused a mix of two-digit and four-digit years as well as a mixture of U.S. format (month-day-year) and European format (day-month-year), often within the same sport. Although this problem could have been avoided with better input controls, avoidance is not always easy in a multinational environment, and careful handling of date data after the fact solved most of the problems.
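As an illustration of the kind of after-the-fact date cleanup described above, the short sketch below normalizes a mix of two-digit and four-digit years. It is our own example; the heuristic of treating a first field greater than 12 as a day, and otherwise falling back to a per-locale default ordering, is an assumption, since purely numeric dates are inherently ambiguous.

```python
from datetime import datetime

def normalize_date(raw: str, day_first_default: bool) -> datetime:
    """Normalize 'MM-DD-YY(YY)' or 'DD-MM-YY(YY)' strings to a datetime."""
    a, b, y = [int(p) for p in raw.replace("/", "-").split("-")]
    if y < 100:                     # expand two-digit years (project ran 1998-2000)
        y += 1900 if y >= 90 else 2000
    if a > 12:                      # first field cannot be a month: day-first
        day, month = a, b
    elif b > 12:                    # second field cannot be a month: month-first
        day, month = b, a
    else:                           # ambiguous: fall back to the locale default
        day, month = (a, b) if day_first_default else (b, a)
    return datetime(y, month, day)

print(normalize_date("25-09-00", day_first_default=True))   # 2000-09-25 00:00:00
```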

Metrics

The collection of data is critical, but the value the data provide can only be achieved through use of the data in analysis and feedback. In addition to overall project goals (for example, quality and delivery requirements), the depth of captured data made it possible to define goals and associated metrics to support technical decisions.

Goals associated with the metrics. Through analysis of test case, execution, and defect records it was possible to define a new set of metrics that could capture cause-and-effect relationships and isolate exposures and opportunities. The set of metrics was applied to the overall project but also to individual sports or test activities, making it possible to identify the critical opportunities or exposures by team. Goals that could be supported included:

● Establish clear and precise checkpoint exit criteria for each sport increment.

● Quickly identify actions that could be put in place to target exposures and mitigate risk.

● Track and evaluate the degree to which the executed actions were successful with regard to schedule and effectiveness.

● Anticipate the impact of projected product weaknesses and rework on the current and subsequent test activities.

● Enhance the team's ability to successfully negotiate the resolution of exposures with the vendor.

Metrics based on test data. To capture a good overview of the test execution process, both planned and executed, we extracted information for each main category, subcategorized two ways: by scenario group and by function type. The following discussion describes the basic and derived metrics that were used.

Metrics based on counts. The first group of metrics, presented below, represents a sample of the basic record counts used. Some were useful in the initial evaluation of the planned test effort, as well as during test to evaluate progress and risk. The related metrics on which we focused included:


● Number of scenarios
● Number of scenarios by function type
● Number of test cases defined
● Number of test cases executed
● Total number of execution records (over multiple releases)
● Number of defect records
● Number of test cases with failures but no associated defect records

The percentage of test cases attempted was used as an indicator of progress relative to the completeness of the planned test effort.

The number of defects per executed test case was an indicator of the quality of code as it progressed through the series of test activities. Since the comprehensiveness of the planned test cases had already been validated,13 this metric could also be used to evaluate the effectiveness of the test effort associated with each activity in uncovering defects.

The number of failing test cases without defect records was an indicator of the completeness of the defect recording process.
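A sketch of how these count-based metrics might be computed for one subcategory (for example, a scenario group). This is our own illustration on a simplified record shape, not the project's tooling; the dictionary fields are assumptions.

```python
def count_metrics(test_cases):
    """test_cases: list of dicts such as
    {"executions": [{"history": ["fail", "pass"], "defects": 1}, ...]}  (illustrative shape)."""
    FAILING = {"fail", "completed with errors"}
    defined = len(test_cases)
    executed = [tc for tc in test_cases if tc["executions"]]
    execution_records = sum(len(tc["executions"]) for tc in test_cases)
    defects = sum(rec["defects"] for tc in test_cases for rec in tc["executions"])
    # Failing executions that never produced a defect record
    unrecorded = sum(
        1 for tc in test_cases for rec in tc["executions"]
        if rec["defects"] == 0 and any(s in FAILING for s in rec["history"])
    )
    return {
        "test cases defined": defined,
        "percent of test cases attempted": 100.0 * len(executed) / defined if defined else 0.0,
        "execution records": execution_records,
        "defects per executed test case": defects / len(executed) if executed else 0.0,
        "failing test cases without defect records": unrecorded,
    }
```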

Metrics derived from execution history. The availability of execution results for each test case and their time sequence gave us a unique opportunity to consider some aspects not often scrutinized. Each of these aspects and the preceding list of metrics were input for the evaluation of risk at critical intervals in terms of product stability as well as test progress and effectiveness. Specifically, we considered the following:

Success rate: The percentage of test cases that passed at the last execution was an important indicator of code quality and stability.

Persistent failure rate: The percentage of test cases that consistently failed or completed with errors (but did not change from fail to error or vice versa) was an indicator of code quality. It also enabled the identification of areas that represented obstacles to progress through test activities.

Defect injection rate: We used the percentage of test cases whose states went from pass to fail or error, fail to error, or error to fail as an indicator of the degree to which inadequate or incorrect code changes were being made.

Code completeness: The percentage of test executions whose status remained "not implemented" or "blocked" throughout was important in terms of evaluating the completeness of the coding of component design elements. It was also useful to measure progress.
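These four history-based rates can be derived directly from each execution record's status sequence. The sketch below is one plausible reading of the definitions above; exactly which state transitions count is our interpretation, and the status strings are illustrative.

```python
FAILING = {"fail", "completed with errors"}
UNAVAILABLE = {"not implemented", "blocked"}

def history_metrics(histories):
    """histories: one list of attempt statuses per test case, e.g. ["fail", "fail", "pass"]."""
    n = len(histories)

    def pct(pred):
        return 100.0 * sum(1 for h in histories if h and pred(h)) / n if n else 0.0

    def injected(h):
        # pass -> fail/error, fail -> error, or error -> fail anywhere in the sequence
        return any(
            (a == "pass" and b in FAILING) or (a in FAILING and b in FAILING and a != b)
            for a, b in zip(h, h[1:])
        )

    return {
        "success rate": pct(lambda h: h[-1] == "pass"),
        "persistent failure rate": pct(lambda h: len(set(h)) == 1 and h[0] in FAILING),
        "defect injection rate": pct(injected),
        "code completeness gap": pct(lambda h: set(h) <= UNAVAILABLE),
    }

# Example: one persistently failing case and one case with an injected regression
print(history_metrics([["fail", "fail"], ["pass", "fail", "pass"]]))
```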

Statistically significant cases. A critical requirement we were asked to address was decision support for management, project management, and technical teams. Since it is not easy to make sense of a table with a large number of metrics for each subcategory, we highlighted the statistically "extreme" values of the following metrics as possibly needing attention. Some of the key, statistically significant cases are identified in this list.

● Low percentage of test cases executed
● High number of defects per test case
● High unrecorded defects
● Low success rate
● High persistent failure rate
● High defect injection rate
● Low code completeness

Identifying values as "significant" was based on the statistical distributions appropriate to the respective metrics. For the number of defects per test case, a Poisson model was used, with the expected number of defects in a subcategory proportional to the number of test cases. A subcategory was flagged as significant if the actual number of defects was within the upper five percent of the Poisson distribution. For all other metrics (based on percentages), the significance calculation was based on the hypergeometric distribution of the counts within a subcategory. For metrics where a high value would be of concern, values in the upper five percent of the distribution were flagged as significant. For metrics with undesirable low values, the values in the lower five percent of the distribution were flagged as significant.
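The flagging rule described above can be reproduced with standard distributions. The following is a minimal sketch using scipy; the five percent threshold follows the text, while the exact parameterization of the hypergeometric test is our reading of it.

```python
from scipy.stats import poisson, hypergeom

def flag_defect_count(defects_in_sub, cases_in_sub, total_defects, total_cases, alpha=0.05):
    """Poisson check: expected defects in a subcategory proportional to its test cases."""
    expected = total_defects * cases_in_sub / total_cases
    return poisson.sf(defects_in_sub - 1, expected) <= alpha    # P(X >= observed) in upper 5%

def flag_proportion(hits_in_sub, size_of_sub, total_hits, population,
                    high_is_bad, alpha=0.05):
    """Hypergeometric check for the percentage-based metrics within a subcategory."""
    rv = hypergeom(population, total_hits, size_of_sub)         # scipy's M, n, N
    if high_is_bad:                                             # e.g., persistent failure rate
        return rv.sf(hits_in_sub - 1) <= alpha                  # P(X >= observed)
    return rv.cdf(hits_in_sub) <= alpha                         # e.g., success rate: P(X <= observed)

# Example: 12 defects in a 40-test-case subcategory, against 60 defects over 600 test cases
print(flag_defect_count(12, 40, 60, 600))   # True: well above the expected 4
```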

Defect data summaries. We generated the following summaries for each sport using the ODC data.

Summary of defect state by severity: The focus of attention here is the possible presence of a large number of open defects of high severity. Priority in fixing and retesting was to have been given according to the severity of the defects, and this metric was one mechanism by which exceptions could be identified.

Summary of defects by test activity: Having first established an expectation for each activity based on the test case classification and analysis, these summaries were used as one element of evaluating the effectiveness of each test activity in terms of defect removal.

Summary of defects by trigger: This metric provides evidence of how comprehensive the test effort associated with a particular activity has been. A broad range of triggers would indicate that the code has been exercised in a diverse manner, both simple and complex.

Summary of defects by impact: This summary allows an assessment of the nature of the effect the defects would likely have had if they had escaped discovery by testing. A broad range of impacts reflected in the defect data suggests that the effort was comprehensive in terms of testing for such areas as reliability, usability, and capability.

Summary of defect type versus qualifier: Analysis of the relationships between defect type and qualifier uncovers weaknesses in explicit areas of requirements, design, and coding activities. These exposures can be targeted in subsequent increments by the developers. In addition, this information helps test management understand the implications of a retesting effort in terms of schedule and resource constraints, based on the implied scope of the fault.
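Given classified defect records, these summaries reduce to simple cross-tabulations. A minimal sketch with pandas follows; the column names are illustrative assumptions, not the project's schema.

```python
import pandas as pd

def odc_summaries(defects: pd.DataFrame) -> dict:
    """defects: one row per defect, with columns such as 'severity', 'state',
    'activity', 'trigger', 'impact', 'defect_type', and 'qualifier'."""
    return {
        "state by severity": pd.crosstab(defects["state"], defects["severity"]),
        "by activity":       defects["activity"].value_counts(),
        "by trigger":        defects["trigger"].value_counts(normalize=True),
        "by impact":         defects["impact"].value_counts(normalize=True),
        "type vs qualifier": pd.crosstab(defects["defect_type"], defects["qualifier"]),
    }
```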

The defect data summaries, along with the new metrics based on the execution data, provided detailed insight into various parts of the system. This insight was used to guide the test teams toward more effective defect discovery and evaluation of risk to the overall plan. In addition, the analysis pinpointed focus areas for early prevention or removal of defects by the vendor development team and provided accurate evaluations of progress.

Application of metrics

The decision of where the metrics should be applied to the ongoing project was dictated by when and how critical decisions would be made. Some of these decision points had been long established as part of the original test plan. Others evolved as the project came closer to the completion date and any exposure could mean the difference between success and failure. The comprehensiveness of the planned test effort had been previously verified, as alluded to earlier in this paper and described more fully in Reference 13. This verification was a key element in the analysis of progress, completeness, and effectiveness, and formed the baseline and expectation against which results could be measured. We will begin with a discussion of the formal checkpoints and how the new metrics were integrated into decision support.

Exit criteria. The project had several major checkpoints when progress and status were to be evaluated in terms of four criteria:

1. There may not be any open Severity 1 problems.
2. There should not be any open Severity 2 problems.
3. Analysis must demonstrate that the test effort has been sufficiently effective and comprehensive (against the planned objectives for this checkpoint).
4. Analysis must demonstrate that the component has reached a sufficient stability level in terms of design (functional content) and code (reliability).

Criteria 1 and 2—Severities. The first two criteria would appear to be quite clear: there either are, or are not, Severity 1 and Severity 2 defects. However, since the checkpoint represents a moment in time, but the analysis of risk does not, it was necessary to look beyond the current open defects. One obvious solution would be to build a metric that captures defect trends over time by severity. Unfortunately, the incremental delivery schedule was such that there would be relatively long periods of inactivity with regard to defects, followed much later by a substantial code drop including fixes for many defects. Standard growth models were simply not effective in this environment. Taking a series of "snapshots," however, using the applicable metrics outlined in the previous section, subset by severity, enabled specific trends to be identified. The relevant metrics and their application to these criteria are described in the following subsections.



Success rate. This rate is the percentage of test cases that passed at the last execution. In a mature organization, a typical set of metrics for measuring test progress would likely include one that calculates the percentage of the planned test activity that has been completed. Although this calculation may appear to be satisfied by a straightforward, simple metric, there are mitigating factors that must be considered. In this project, for example, the same test cases were frequently used in two and sometimes three test activities. It was not sufficient to count each unique test case only once. Even basing the same metric on each test activity was not sufficient to provide an accurate view of progress.

In the example reflected in Table 2, both Component A and Component B should have progressed at least 60 percent through the overall test effort at the time the data were analyzed. Using the first metric (percent of unique test cases attempted), we can see that both components are at risk, with Component A having achieved only 43 percent and Component B only 30 percent of the overall test effort. The second metric (percent of test cases attempted by activity) provides additional insight but leaves the observer perhaps even more confused as to the risk associated with each component. It does seem clear that Component A has progressed further than Component B and that risk is apparently only associated with the last test activity, function verification test (FVT). For Component B, it seems that both function test (FT) and FVT are at risk, perhaps FVT less so than FT, though this seems counterintuitive. When we use the metric of test case attempts that have successfully passed, however, we see that the risk associated with both components is actually significantly higher than the first two metrics would suggest.

Table 2  Results of three metrics intended to represent progress, shown relative to two components when approximately 60 percent of testing should have been completed

                                          Component A (%)   Component B (%)
Percent of unique test cases attempted    43                30
Percent by activity
  Percent of DVT attempted                95                100
  Percent of FT attempted                 100               27
  Percent of FVT attempted                6                 32
Percent of test execute = pass            24                14
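The three views in Table 2 can be computed from the same execution records. The sketch below, on an illustrative record shape of our own, makes explicit why "attempted" and "attempted and passing" diverge.

```python
def progress_metrics(test_cases, activities):
    """test_cases: dicts like {"planned": ["DVT", "FT"],
    "executions": [{"activity": "DVT", "last_status": "pass"}, ...]}  (illustrative shape)."""
    total = len(test_cases)
    attempted = [tc for tc in test_cases if tc["executions"]]
    passed = [
        tc for tc in attempted
        if all(rec["last_status"] == "pass" for rec in tc["executions"])
    ]
    by_activity = {}
    for act in activities:                                # e.g., ["DVT", "FT", "FVT"]
        planned = [tc for tc in test_cases if act in tc["planned"]]
        done = [tc for tc in planned
                if any(rec["activity"] == act for rec in tc["executions"])]
        by_activity[act] = 100.0 * len(done) / len(planned) if planned else 0.0
    return {
        "percent of unique test cases attempted": 100.0 * len(attempted) / total if total else 0.0,
        "percent attempted by activity": by_activity,
        "percent of test cases executed and passing": 100.0 * len(passed) / total if total else 0.0,
    }
```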

Defect injection rate. The percentage of test cases that showed an execution state sequence of pass to fail, fail to error, or error to fail was interpreted as the extent to which defects were injected while attempting to fix a defect, or the attempt to fix a defect was either incomplete or incorrect. The fluctuations in this rate and fluctuations in the volumes of Severity 1 or Severity 2 defects were studied to determine whether there was a correlation. If a relationship did exist, it was used to project the change in volumes for these severities in future test case executions.
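One way to test for such a relationship across snapshots is a simple correlation between the per-snapshot injection rate and the per-snapshot volume of Severity 1 or Severity 2 defects, followed by a linear fit for projection. The sketch below is our own suggestion, not the paper's stated procedure, and the numbers are made-up placeholders rather than project data.

```python
import numpy as np

injection_rate = np.array([4.0, 9.0, 6.5, 12.0, 3.0])   # percent, one value per snapshot (placeholder data)
sev2_volume    = np.array([5,   14,  9,   21,   4])      # open Severity 2 defects per snapshot (placeholder data)

r = np.corrcoef(injection_rate, sev2_volume)[0, 1]
if r > 0.7:                                              # threshold is an arbitrary illustration
    slope, intercept = np.polyfit(injection_rate, sev2_volume, 1)
    print(f"correlated (r={r:.2f}); about {slope:.1f} more Severity 2 defects per point of injection")
```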

Defects per test case, by severity. The rates at which test cases revealed defects, and how those rates changed in the course of multiple "snapshots," were an indicator of whether the product was becoming more stable. Measuring the rates, subset by severity, enabled this insight to be used in projecting future risk relative to each severity. Figure 3A and Figure 3B show both the cumulative rates and current-period rates relative to Component A and Component B, respectively. We can see that overall, the defects of Component A are primarily low severity, although the most recent period revealed a few Severity 2 defects. It is a very different picture with Component B, where we see a predominance of Severity 2 defects, and the most recent period is continuing this distribution.

Figure 3A  Results of the execution of test cases in exposing defects at two different stages during development for Component A

Figure 3B  Results of the execution of test cases in exposing defects at two different stages during development for Component B

Defect state, by severity—summary. This view is mainly concerned with revealing the possible presence of a significant volume of open defects of high severity.

Criterion 3—Test effectiveness. Test effectiveness requires measurements across many dimensions. It is not only necessary to evaluate the extent to which the planned effort has been successfully executed, but it is also important to validate that the plan was sufficiently comprehensive and that the actual results matched the plan. The validation of the plan was performed13 initially and on an ongoing basis as changes were made to product content and associated test efforts. Thus, the validated plan was used as the expectation against which the results were compared. Since the plan could be translated into aggregates of domain-specific test cases and their associated execution and defect records, it was possible to track progress and evaluate the effectiveness of a test relative to each domain. In addition to the question of test effectiveness, it is imperative that the test manager understand whether the test resource is being used efficiently, whether there are obstacles that will inhibit or delay the plan execution, and how these obstacles can be managed.

The metrics described in the following subsections were used to analyze these areas of concern.

Success rate. This metric is as critical, if not more so, for evaluating the effectiveness of a test as it was for assessing risk associated with severity. The implication for both Component A and Component B at the checkpoint reflected in Table 2 is that very little progress had been made and that there was still a long way to go.

Persistent failure rate. This metric, as a percentage of test cases that consistently failed, provides insight into one reason why the success rate is reflecting exposures for both components. Excessive and persistent failure rates result in rework for testers and wasted effort. These discoveries are not new but rather are old defects not yet addressed. In comparing our two components in Table 3, we can see that this situation was a very real concern for both components and continues to be a factor for Component B.

Table 3  Percentage of test cases having a pattern of multiple "fail" or "complete with errors" where the most current status is not "pass"

                                  Component A (%)   Component B (%)
Previous periods (cumulative)     25                70
Most recent period                0                 52

Defect injection rate. This metric is also relevant to test effectiveness and provides additional evidence of wasted effort. Both the persistent failure rate and the defect injection rate were used to calculate the degree to which the schedule had to be inflated and, in conjunction with other metrics, were used to assess the risk of not retesting specific areas.

Code completeness. This metric captures another aspect of wasted effort on the part of testers, that is, the extent to which test cases fail because the function has not yet been implemented or the successful execution of the test case is blocked because other pertinent function is not available. Table 4 shows that, as of the later period, this was not an exposure for either component. In previous periods, it was a consideration for Component A but had never been an exposure for Component B.

Table 4  Percentage of test cases that failed because the function had not yet been implemented or the execution was blocked because another function was not available

                                  Component A (%)   Component B (%)
Previous periods (cumulative)     9                 2
Most recent period                0                 0

Percentage of test cases attempted overall, percentage of test cases attempted by subset. These two simple metrics, although not reflecting as accurate a view of progress as other metrics, do serve to indicate the changing focus and to what extent the component has been verified. The percentage of test cases attempted by test activity, scenario, and schedule reflects to what extent each is being targeted, and the number of test cases organized by these subsets allows the overall plan to be scrutinized in terms of its comprehensiveness.

Failing test cases without defect records. Initially the results of this metric caused concern, especially in the area of severity and product stability, since it appeared that defect records were not being written against all defects. Eventually it was verified that these unrecorded defects were, in fact, duplicates of other defects. The metric is not without value, however, since it shows the extent to which the test team has wasted effort attempting to test a function that has known defects not yet fixed. As we see in Table 5, this was a greater concern for Component A than for Component B in the previous periods and continues to be so for the most recent period.

Table 5  Counts of unrecorded (duplicate) defects that potentially camouflage wasted effort on the part of testers in rediscoveries

                                  Component A    Component B
Previous periods (cumulative)
  Count of total defects          201 (100%)     356 (100%)
  Count of recorded defects       106 (53%)      317 (89%)
  Count of unrecorded defects     95 (47%)       39 (11%)
Most recent period
  Count of total defects          29 (100%)      101 (100%)
  Count of recorded defects       15 (52%)       101 (100%)
  Count of unrecorded defects     14 (48%)       0 (0%)

Defect data summaries. ODC-based analysis of the defect data was another critical element of the risk assessments. A summary by each ODC attribute revealed specific weaknesses or strengths pertinent to testing. For example, examining the relative distributions of the "trigger" attribute showed whether the majority of the effort had been focused on simple, basic testing, or whether test cases spanning simple to complex use cases had been attempted. The implication of distributions in which simple cases dominate is that the component has not been pervasively tested, it may not be stable, and the test effort to date has not yet been effective. The implication of distributions that show populations across all triggers relevant to the test activity being performed (as defined by the test team at the beginning of the project) is that testing has been comprehensive and the product is relatively stable, at least with regard to executing basic function. Another important contribution of the ODC-based analysis was to verify the extent to which added value was provided by those test cases executed across multiple test activities. Were these test cases serving as little more than a regression test, i.e., an insurance policy that the same failures were not recurring? If this were the case, we would expect to see similar relationships between trigger values and defect-types in any test activity in which the test case was attempted. Or, as the product became more stable and the underlying environment became more complex, would the same test case be able to reveal additional faults? Results for the majority of test case attempts showed that the behavior resembled a regression test—not much that was new was revealed. However, there was, in fact, some evidence in specific areas that the relationship of triggers to defect types did change somewhat with regard to the redundant test cases from early test activities to later ones. Additional investigation should be performed to understand the implications more fully.
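The regression-test question above amounts to asking whether the trigger distribution of defects found by the shared test cases changes from an early activity to a later one. A chi-square test of that contingency table is our suggested way to check for a shift; it is not the paper's stated method, and the column names are illustrative.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def trigger_shift(defects: pd.DataFrame, early_activity: str, late_activity: str):
    """Compare trigger distributions of defects revealed by the redundant test cases
    in an early versus a later test activity."""
    subset = defects[defects["activity"].isin([early_activity, late_activity])]
    table = pd.crosstab(subset["trigger"], subset["activity"])
    chi2, p, _, _ = chi2_contingency(table)
    return table, p   # a small p-value suggests the redundant cases reveal new behavior later
```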

Criterion 4—Product stability. Metrics discussed in the previous section were found to be sufficient for evaluating the stability of each component. Those that were pertinent included the following:

Success rate: The degree to which the execution attempts were completing without failure or error, and how this rate changed across the checkpoints, was an indicator of code quality and design stability.

Persistent failure rate, defect injection rate: Test cases that consistently failed, when viewed from the perspective of functional scenarios or test activity, enabled the identification of specific focus areas that may not have been well understood by the development team.

Code completeness: This metric represents the extent to which the design of the product is incomplete ("not implemented" or "blocked"). Although some omissions were planned for and expected (since the plan called for the incremental delivery of function within each component), an unusually high rate, or one that continued beyond the date of code drops, could be easily identified and scrutinized.

Defects per test case, summaries by defect-type, qualifier, and impact: The rate of defects per test case was expected to decrease, and exceptions were easily identified. The combinations of the ODC attributes defect-type, qualifier, and impact revealed a great deal with regard to high-level, detailed, and coding exposures associated with a particular component. Components with a large proportion of function, interface, timing or serialization, or relationship defect-types, particularly if they were "missing" elements, would be considered high risk as the delivery date drew near. If algorithm defects were the most prevalent, that suggested detail design was the weakness. In these cases, further investigation was performed to identify other characteristics of the design concerns. Since fixes associated with algorithm defects are relatively small, risk is considered to be less, both in terms of product stability and the impact to testing. Assignments and checking, particularly if incorrect, are usually associated with coding oversights. These have the lowest degree of risk since the fixes are very small, and there would be much less impact to test in terms of delays.

An overall risk value, as well as values for each individual criterion, were included in the assessments. These values were on a scale of 0 (no risk) to 5 (highest risk). The individual criterion risk values were derived by calculating the risk indicated by each of the measurements associated with that criterion, weighted by the importance of each measurement relative to that criterion. For example, the defect injection rate would only be important as a risk measurement relative to the Severity 1 criterion if there was a correlation between the rate of defect injection and the volume of Severity 1 defects. In contrast, the defect injection rate would always be critical in evaluating risk associated with product stability. The overall risk value was calculated as a weighted average of the criteria values. The two criteria based on severity were weighted as less significant than the criteria of test effectiveness and product stability, since that exposure could be addressed relatively swiftly. The primary purpose of the overall value was to provide a consistent mechanism for comparing risk across all of the components and to measure progress from one checkpoint to the next.
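A minimal sketch of the 0-to-5 roll-up described above. The particular weights, and the mapping from measurements to per-criterion scores, are placeholders of our own; the paper does not publish the values used on the project.

```python
def criterion_risk(measurements, weights):
    """measurements: {name: risk score 0..5}; weights: {name: importance of that measurement}."""
    total_w = sum(weights[m] for m in measurements)
    return sum(measurements[m] * weights[m] for m in measurements) / total_w

def overall_risk(criterion_scores):
    # Severity criteria weighted less than test effectiveness and product stability,
    # as described above; the numeric weights are placeholders.
    weights = {"severity 1": 0.5, "severity 2": 0.5,
               "test effectiveness": 1.5, "product stability": 1.5}
    total_w = sum(weights.values())
    return sum(criterion_scores[c] * weights[c] for c in weights) / total_w

# Example usage with placeholder scores
print(overall_risk({"severity 1": 1, "severity 2": 2,
                    "test effectiveness": 4, "product stability": 3}))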




Decision support

The analysis guided by exit criteria was a key element of decision support—at a summary level for executives and management, and at a more detailed level for the test manager. This analysis not only quantified and defined risk at these critical checkpoints, but the details of the analysis provided important information for resource, schedule, and product decisions.

The checkpoint analysis was also used on several occasions as the basis for negotiation between IBM and the vendor to resolve issues. A situation arose with regard to one component in which the vendor believed significant progress had been made but the test team did not agree. The vendor felt that much of the risk associated with the component had already been mitigated but had not yet been reflected in the test results because of the time needed to integrate the code increments. Analysis based on these metrics was able to show that even after allowing for the progress reported by the vendor, explicit exposures remained. As a result of the negotiations, the developer for the component worked side by side with the testers to resolve the issues, and the component was stabilized very quickly.

In addition to the in-depth checkpoint analysis, assessments and outlook reports were provided on a weekly basis. The reports included defect data summaries, defect projections, delivery outlooks, and statistically significant values that represented potential concerns. The reports were used to check status and progress by executives, management, and solution managers. Solution managers were responsible for ensuring that any issues or concerns relative to one or more sports were resolved expediently.

In fact, this analysis was instrumental in verifying the need for additional testing for certain components before their delivery to IBM for formal testing. An agreement was reached for IBM testers to participate in “joint tests” at the vendor’s site, ensuring that the components were adequately stable before delivery to the IBM test organization. The impact of the early tests was incorporated into the measurements, and the success of these efforts was evaluated in terms of subsequent test results. In each case, the components were shown to be significantly more stable than their predecessors.

Validation

In any commercial software development environment, a controlled experiment on establishing cause and effect is not feasible. In the Olympics project, which involved hundreds of people in many locations across at least three continents, there was enormous complexity in execution, making any attempt at a controlled experiment impossible. Nevertheless, validation of the approach described in this paper was performed, to the extent possible, in terms of two key aspects: verification of the correctness and completeness of data, and evaluation of the usefulness of the metrics.

A benefit of performing analysis at such a detailed level was that any specific problem with the data was quickly exposed. The introduction of the metric, number of failing test cases without defect records, is one example. The process required that a test case that failed or completed with errors had to have at least one associated defect record. It became clear during the analysis that this process was not always being followed and that it was necessary to understand the degree to which that was the case as well as to interpret the broader implications. Although the metric revealed that the situation was pervasive in some areas, discussions of the results with the technical teams led to the understanding that the unrecorded defects should be interpreted as duplicates of known defects. Thus, the usefulness of the metric shifted from evaluating the accuracy of the data to exposing an important concern with regard to wasted test effort and its impact on an already compressed schedule. It could be argued that in-depth analysis often serves the dual purpose of verifying the data and producing meaningful interpretations of the results achieved. In the case of the Olympics project, when inconsistencies or gaps in the data were revealed, explicit actions were taken to address the inconsistencies and fill the gaps with alternative metrics.
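The metric itself is inexpensive to compute once execution records and defect records can be joined on the test case. A minimal sketch, again with assumed record shapes rather than the project's actual database layout:

# Illustrative data: execution outcomes and the defect records that reference them.
executions = [
    ("TC-101", "failed"),
    ("TC-102", "completed_with_errors"),
    ("TC-103", "success"),
    ("TC-104", "failed"),
]
defect_records = [
    {"id": "D-1", "test_case_id": "TC-101"},
]

failing_outcomes = {"failed", "completed_with_errors"}
cases_with_defects = {d["test_case_id"] for d in defect_records}

# Failing test cases without defect records: either a process gap or, as on
# the Olympics project, an indication of duplicates of known defects.
unrecorded = [tc for tc, outcome in executions
              if outcome in failing_outcomes and tc not in cases_with_defects]
print(unrecorded)  # ['TC-102', 'TC-104']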

Validation of the metrics was accomplished throughout the course of the project in terms of understanding the value of the metric in decision-making, of verifying the results against expectation, and of ongoing discussions with management and technical leaders. The selection, adaptation, and introduction of metrics were made based on their usefulness in critical decision support. The earliest metrics were a key factor in defining the plan and expectations and were verified against the documented plans and interviews with management and technical leads. The early metrics were followed by metrics that were mainly intended to show progress in terms of product content and stability, to show thoroughness of test, and to identify specific exposures.

Initially the metrics were reviewed with management and technical leads to ensure that they addressed specific requirements. As the project progressed, the analysis was reviewed at critical checkpoints with the key participants. Checkpoint risk assessments were validated through subsequent results and through confirmation from the technical leaders and managers based on their observations and hands-on experiences. For the most part, the assessments matched the perception of the key participants, and on occasion areas of risk not previously known were revealed. The metrics in the last six months of the project, while continuing to focus on product content and stability, also targeted schedule and delivery dates. Status reports, defect projections, delivery date outlook, and associated risk assessments were all provided and tracked on a weekly basis. The analysis was done independently, based almost entirely on available data, with as little interaction with the teams as possible (given the severe schedule pressure on them). The weekly reviews were important not only to demonstrate progress and risk, but to identify and validate the analysis and address any anomalies that surfaced.

Results. Where actions were taken to target either product stability or test effectiveness concerns, the impact of those actions was estimated, and corresponding actual results were tracked. The assessments were consistent with the results observed at the time the assessment was made and until the next code drop took place. Since these metrics were available on a weekly basis, it was possible to anticipate—with only a few anomalies—any evidence of mitigated or increased risk. Where there were anomalies between the assessment and apparent actual results, it was discovered that additional test effort had taken place in partnership with the vendor, with defects being recorded only at the vendor site. Test case and related execution records were not captured in the IBM database for these special, joint tests. However, defect counts were provided and accounted for the difference between predicted and actual results.

In addition to the checkpoint risk assessments, weekly component reports were generated for the solution managers. These reports provided information relevant to each sport and also reported exceptions such as the statistically significant cases described earlier. The Appendix contains an example of this report.

In the previous section, examples were provided of decisions that were supported by the metrics described in this paper.

The collective set of metrics was an integral part of management decision support at all levels, and as such it is not possible to enumerate all of the ways in which the analysis was used. However, we can report some of the key contributions:

● They formed the set of metrics against which every activity and checkpoint exit was evaluated for each sport component and increment.

● They were the key factor in delivering defect and schedule projections as well as ongoing weekly status and risk assessments.

● They were cited as important tools in negotiations with the vendor, particularly in terms of risk mitigation actions.

Reaction. The executive and management team responded very favorably to the assessments and evaluations. They were satisfied with both the accuracy and timeliness of the assessments and were able to use the reports in regular status meetings to review status with the vendor and to demonstrate to the Olympic Committee the ability of GSC to manage the project.

The test manager in particular expressed appreciation for receiving insight not available to him by any other means. He indicated that the code completeness, defect injection rate, and persistent failure rates were especially useful in managing schedule risks and negotiating with the vendor for additional early testing.

Feedback received from solution managers indicated that they felt the summary reports and prioritized exposure lists enabled them to quickly target or verify areas of concern and mitigated, to at least some extent, the complexity of their responsibilities. They also appreciated the fact that the assessment was derived independently from the key participants in the project.

The test leaders were pleased to have quantified, irrefutable confirmation of their perceptions. In addition, they acknowledged that the assessments revealed exposures that were previously not known to them. They also expressed appreciation for the fact that they considered this method to be minimally intrusive and that it had not imposed additional overhead of any significance.

Value

The Summer Olympic Games in Sydney were considered to be a resounding success as an information technology project. According to GSC management, the analysis and metrics described in this paper were instrumental in helping them successfully manage one of the world’s most demanding projects and were tremendously helpful in allowing them to gain insight into process and schedule corrections that were required. In addition, there are implications for the industry as a whole. Any project that has been defined with developers in one organization and testers in another, whether or not it is a vendor-vendee relationship, could benefit from the execution record metrics as a means of verifying the quality levels after delivery. If acceptance test is an activity that is performed, these metrics provide a mechanism for performing such an evaluation in a quantifiable manner.

In the case of the Olympics project, the tester population was extraordinarily dynamic because experts had to be brought in for each sport as each increment was being tested. Many software organizations can be characterized as having a good deal of movement within the test organization for a variety of reasons. Distributed development and test also share similar concerns in terms of distant and diversified populations. These metrics were perceived as nonobtrusive by the test leaders, and consistency of data capture was not an issue in spite of the cultural diversity across the team. Each component was delivered incrementally through many builds as planned. Even when planned for, evaluating product stability and test effectiveness in conjunction with these types of development models is extremely complex. It becomes increasingly difficult with each decision to change functional or content deliveries. The use of these metrics at regular intervals throughout the test activities makes it possible to evaluate the risk associated with these characteristics in spite of the complexity inherent in an iterative model.

We described a means by which the value of redundant test cases was measured. Many software development teams are attempting to strike a balance between defining adequate but not excessive regression test suites and ensuring that the test cases are comprehensively testing the product across pertinent environments. Examining the ODC defect types revealed by regression suites or redundant test cases provides a means of identifying the value of the investment and suggests whether additional test cases are needed.
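One way to operationalize such an examination is to compare the defect-type distributions that the same test cases produce in successive test activities. The sketch below uses a simple total-variation distance for that comparison; the sample defect types and the choice of distance measure are assumptions for illustration, not necessarily how the comparison was performed on the Olympics project.

from collections import Counter

# Hypothetical ODC defect types revealed by the same (redundant) test cases
# in two successive test activities.
early_activity = ["function", "assignment", "checking", "function"]
later_activity = ["interface", "timing", "function", "interface"]

def distribution(defect_types):
    counts = Counter(defect_types)
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()}

early, later = distribution(early_activity), distribution(later_activity)

# Total-variation distance between the two distributions: values near 0 suggest
# the redundant cases behave like a regression (insurance) suite, while larger
# values suggest they are revealing new kinds of faults.
all_types = set(early) | set(later)
shift = 0.5 * sum(abs(early.get(t, 0) - later.get(t, 0)) for t in all_types)
print(round(shift, 2))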

It bears noting that the investment in collecting and validating data during planning and execution of the Olympics project was not trivial and not typical of most projects. The classification of test cases was necessary in order to evaluate the completeness of the planned test effort and establish the baseline against which results could be measured. The metrics described in this paper require a database containing scenarios, test cases, execution records, and defect records, with their hierarchical associations. In addition to the collection of data, a significant commitment was also made in terms of validating and analyzing the data. Although the scope and complexity of the Olympics project justified this investment, it would not be feasible or appropriate for all projects.
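For teams weighing a similar investment, the hierarchy the metrics depend on reduces to a few linked record types. The sketch below is a schematic data model only; the field names and attributes are assumptions, and the Olympics project itself kept these artifacts, with many more attributes, in a Lotus Notes-based repository.

from dataclasses import dataclass, field
from typing import List

@dataclass
class DefectRecord:
    defect_id: str
    defect_type: str      # ODC defect type, e.g., "function", "interface"
    qualifier: str        # e.g., "missing", "incorrect"
    severity: int

@dataclass
class ExecutionRecord:
    attempt_no: int
    outcome: str          # "success", "failed", "blocked", "not_implemented"
    defects: List[DefectRecord] = field(default_factory=list)

@dataclass
class TestCase:
    case_id: str
    executions: List[ExecutionRecord] = field(default_factory=list)

@dataclass
class Scenario:
    name: str
    test_cases: List[TestCase] = field(default_factory=list)

# Example: a scenario containing one test case with a single failing attempt.
scenario = Scenario("After-Event Reports", [
    TestCase("TC-90", [ExecutionRecord(1, "failed",
             [DefectRecord("D-7", "interface", "missing", 2)])])
])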

Conclusions

The 2000 Summer Olympics was an extraordinarily challenging software project due to many circumstances that characterized it. A highly complex software architecture was required in order to address many demands inherent in such an enormous undertaking, including tracking results of competition venues and events, extracting and incorporating biographical information on athletes, coaches, and officials, and delivering these results and information in many formats, on demand, to multitudes of media representatives, athletes, and spectators.

The specialized knowledge required to test each of the sports components resulted in a high degree of movement in terms of the test personnel, leaving only a small core team of testers to ensure continuity and understand the status of the test effort. Having access to reliable metrics to support their determinations was a valuable ingredient to their success. The delivery schedule was aggressive and final, and the risk associated with exceptions had to be clearly understood and addressed swiftly. Portions of the project were outsourced to vendors who delivered their components following an iterative model in many increments—a model not easily managed using standard software metrics. These and many other considerations resulted in the need for a new set of metrics that could quickly, but effectively, evaluate and measure risk, progress, product stability, and test effectiveness, without requiring extensive input and involvement from the vendor or the testers. The test metrics described in this paper were highly successful in addressing these needs and were found to be minimally intrusive.

Although the 2000 Summer Olympics could not be considered an average or typical project, the metrics developed to address specific needs in that project are applicable to a wide range of projects across the software industry. Projects in which various roles (such as developers and testers) are managed in separate organizations, including those that have outsourced portions of their product to third-party vendors, are often characterized by incomplete data and information. It has been shown that test-based metrics can be used with a high degree of success when little if any preceding data are available. Projects characterized as dynamic, in terms of a significantly changing workforce, would benefit from these metrics, which rely on input that is standard and easy to understand, requiring little, if any, specialized training. These metrics are particularly useful in projects that have adopted an iterative model with incremental code deliveries, where standard metrics such as S-curves are often inadequate for measuring progress. They were also shown to be useful in measuring the value or deficiencies inherent in using the same test case across multiple test activities.

The combination of ODC-based metrics and a new set of test execution-based metrics collectively addresses the measurement and evaluation of complex, dynamic projects, particularly those with vendor-provided components delivered incrementally. Specifically, the metrics provide decision support in quantified terms with regard to assessing progress, analyzing test effectiveness and product stability, and calculating the degree of risk associated with each of these topics.

Acknowledgments

We want to thank Tom Furey, Patricia Cronin, and Lois Dimpfel of the Olympics executive team for the opportunity to participate in such an exciting endeavor. We appreciate the management support from Toni Plana Castillo and Joe Ryan of the GSC, Spain, for close partnership during the critical times of the testing activities. We thank David Perez Corral and Fernando Dominguez Celorrio of the GSC for integrating ODC in the Quality Management Framework System and for the data capture and enablement of the Lotus Notes*-based infrastructure that was critical to the success of this project. We thank Roy Bauer for the use of Figure 1 and the data for Table 1.

Appendix: Solution manager report for Sport S

Summary:

The overall assessment places Sport S at high risk.

Criteria (each rated with an X under High Risk, Medium Risk, or Low Risk):

There may be no open Severity 1 defects.
  Volume of open Severity 1 defects: X
  Trend: X

There may be no open Severity 2 defects.
  Volume: X
  Trend: X

Is test progressing sufficiently?
  Test plan progress (against schedule): X
  Test effectiveness:
    Spread of ODC trigger values: X
    Spread of ODC defect types: X

Is product adequately stable?
  Defect type volumes: X
  Specific trends: X

Overall
* DVT = 100% complete, FT = 27% complete, FIT = 32% complete
* 65% of attempted executions concluded successfully

Detail
• Event: Woman’s Qualification (551) has a high defect rate—167/88 (number of defects per number of test cases)
• Event: Woman’s Qualification (551) has a high repeated failure pattern—170/239
• Test Case Title: After-Event Reports (90) has a high repeated failure pattern—139/159
• Test Case Title: Sports Level TV Graphics (51) has a very low percentage of attempted test cases—7/136



*Trademark or registered trademark of International Business Machines Corporation.

**Trademark or registered trademark of Rational Software Corporation.

Cited references

1. R. B. Grady and D. L. Caswell, Software Metrics: Establishing a Company-wide Program, Prentice-Hall, Inc., Englewood Cliffs, NJ (1987).

2. K. H. Moeller and D. J. Paulish, Software Metrics, Chapman and Hall, London (1993).

3. N. E. Fenton and S. L. Pfleeger, Software Metrics—A Rigorous and Practical Approach, PWS Publishing Co., Boston, MA (1997).

4. M. K. Daskalantonakis, “A Practical View of Software Measurement and Implementation Experiences Within Motorola,” IEEE Transactions on Software Engineering 18, No. 11, 998–1010 (1992).

5. G. Stark, R. C. Durst, and C. W. Vowell, “Using Metrics in Management Decision Making,” Computer 27, No. 9, 42–48 (September 1994).

6. D. M. Marks, Testing Very Big Systems, McGraw-Hill, New York (1992).

7. S. H. Kan, J. Parrish, and D. Manlove, “In-Process Metrics for Software Testing,” IBM Systems Journal 40, No. 1, 220–241 (2001).

8. R. Chillarege, I. S. Bhandari, J. K. Chaar, M. J. Halliday, D. S. Moebus, B. K. Ray, and M.-Y. Wong, “Orthogonal Defect Classification: A Concept for In-Process Measurements,” IEEE Transactions on Software Engineering 18, No. 11, 943–956 (1992).

9. IBM Research, Center for Software Engineering, http://www.research.ibm.com/softeng.

10. K. Bassin, T. Kratschmer, and P. Santhanam, “Evaluating Software Development Objectively,” IEEE Software 15, No. 6, 66–74 (1998).

11. M. Lorenz and J. Kidd, Object-Oriented Software Metrics: A Practical Guide, Prentice Hall, Englewood Cliffs, NJ (1994).

12. T. J. McCabe, Structured Testing, IEEE Computer Society, Los Alamitos, CA (1983).

13. K. Bassin, S. Biyani, and P. Santhanam, “Evaluating the Software Test Strategy for the 2000 Sydney Olympics,” Proceedings of the IEEE 12th International Symposium on Software Reliability Engineering (November 2001).

Accepted for publication October 1, 2001.

Kathryn Bassin IBM Research Division, Thomas J. Watson Research Center, P.O. Box 218, Yorktown Heights, New York 10598 (electronic mail: [email protected]). Ms. Bassin holds a B.S. degree from SUNY Brockport and an M.B.A. degree from Binghamton University. She is a senior software engineer with the Center for Software Engineering at the Watson Research Center. She has been with IBM since 1981 and, prior to joining IBM Research in 1993, worked on a wide range of products, performing in a variety of capacities, including management, development, test, and service. Her research interests, likewise, span many areas, most notably software metrics and modeling, and the application of scientific methods to define relationships and influences across the life of a software product. Ms. Bassin developed the Butterfly Model, which provides a comprehensive view of software development, linking design, development, test, and customer usage. Named after the popular analogy associated with chaos theory, the Butterfly Model exploits the use of categorical data, including ODC, to make rational, objective assessments of the software process and product.

Shriram (Ram) Biyani Dr. Biyani passed away since the original submission of this paper, in a hiking accident. He obtained a B.Sc. degree from Nagpur University, an M.S. degree from the Indian Agricultural Research Institute, a Ph.D. degree in statistics from Iowa State University, and an M.S. degree in computer science from North Carolina State University. He joined IBM Research in 1992 and was involved in statistical consulting, research, and tools development. His primary interests in software engineering were in software quality and test design. Earlier, he had provided statistical support to the quality assurance area at IBM East Fishkill and contributed to the development of ALORS2 (A Library of Reliability Specifications). He had also contributed to AGSS (A Graphical Statistical System), developed by the Watson Research Center. Before joining IBM, Dr. Biyani also served as a faculty member at the University of Minnesota and East Carolina University.

Padmanabhan Santhanam IBM Research Division, Thomas J. Watson Research Center, P.O. Box 218, Yorktown Heights, New York 10598 (electronic mail: [email protected]). Dr. Santhanam holds a B.Sc. degree from the University of Madras, India, an M.Sc. degree from the Indian Institute of Technology, Madras, an M.A. degree from Hunter College, The City University of New York, and a Ph.D. degree in applied physics from Yale University. He joined IBM Research in 1985 and has been with the Center for Software Engineering, which he currently manages, since 1993. He has worked on deploying Orthogonal Defect Classification across IBM software laboratories and with external customers. His interests include software metrics, structure-based testing algorithms, automation of test generation, and realistic modeling of processes in software development and service. Dr. Santhanam is a member of the ACM and a senior member of the IEEE.
