
STAT T&E Center of Excellence 2950 Hobson Way – Wright-Patterson AFB, OH 45433

Best practices for highly effective test design; Part 2 – Beginners’ guide to design of experiments in T&E

Luis A. Cortes, PE

The goal of the STAT T&E COE is to assist in developing rigorous,

defensible test strategies to more effectively quantify and characterize

system performance and provide information that reduces risk. This

and other COE products are available at www.AFIT.edu/STAT.


STAT COE Report 14-2014


Abstract

Test & Evaluation (T&E) produces knowledge about the true capability of a system

by comparing the analysis of empirical observations obtained from stimulating a

system to requirements and standards. The ultimate goal is to transform the

knowledge gained from testing into “decision-quality-information” to inform the

system engineering process and key acquisition decisions. A practical and efficient

method for transforming knowledge into information is design of experiments

(DOE). DOE is the systematic integration of well-defined and structured scientific

strategies for gathering empirical knowledge about a system and transforming it

into information by using statistical methods. Coleman & Montgomery (1993)

proposed guidelines for design of experiments. This document presents the

guidelines as best practices for planning, designing, executing, and analyzing a test

within the Department of Defense’s T&E framework: (1) formulating test objectives;

(2) selecting the measures to be analyzed; (3) defining the test space; (4) choosing a

test design; (5) performing the test; (6) conducting statistical data analysis; and

(7) formulating conclusions and recommendations. The use of these practices will

result in robust and disciplined T&E strategies that can yield better interpretation

of test observations and, consequently, a better understanding of the state of the

system capabilities and the risks associated with the decisions to be made.

Key words: design of experiments, DOD test and evaluation, test design



Table of Contents

4.0 Best Practices for Effective Test Design
4.1 Formulate a clear test objective
4.2 Identify the evaluation measures
4.3 Define the test space
4.4 Determine a test design structure
4.5 Randomize the test sequence
4.6 Use statistical techniques for the analysis
4.7 Inform the decision
5.0 Summary
Appendix A - Paradigm of well-defined tests
References




4.0 Best Practices for Effective Test Design

Coleman and Montgomery (1993) provide guidelines for design of experiments. Figure 6 illustrates the guidelines in the context of the plan-design-execute-analyze paradigm. These guidelines, or best practices for design of experiments, are the foundation for test design.

4.1 Formulate a clear test objective

“If I had an hour to solve a problem and my life depended on the solution, I would

spend the first 55 minutes determining the proper question to ask, for once I know

the proper question, I could solve the problem in less than five minutes.”

Albert Einstein

Identifying the type of problem being studied and formulating a clear and well-understood objective are the first steps of a well-designed test. This practice, as fundamental as it sounds, is sometimes ignored because the type of problem being studied (and consequently the type of test required) is unclear or because the objective is simply hard to write down. In addition to a clear and well-understood objective, a clearly defined problem statement has a defined solution path and, often, a defined expected solution.

Figure 6. Guidelines for design of experiments

The types of problems can be classified in several ways. One way is to classify

the problem as a state problem or as a process problem. In a state problem, the

interest is in the given state of a system at a particular time. In a process problem,

the interest is in the change of a system over time. For instance, the interest could

be in measuring the influence that several factors have on the reaction time of a

weapon system for a particular computer program baseline prior to a deployment

(state problem) or in measuring the change in two different baselines as a result of

added capabilities (process problem).

Similarly, a problem can be classified as a screening, characterization, optimization, or validation problem. In a screening problem, the interest is in identifying the few significant factors, from the trivial many, that affect the response of a system. Some people use the terms screening and characterization interchangeably. Other people refer to a characterization problem as one in which the interest is in developing a precise response surface or mathematical model. An optimization problem is one in which the interest is to determine the region of the significant factor space that leads to the best possible performance. A validation problem is one in which the interest is to establish that a system performs as advertised. The distinctions between the types of problems lead to different objectives and types of experimental designs.

For example, the results of a test to determine the probability of kill of a missile, depending on how the problem objectives are phrased, could be used to design a new missile (or to optimize some of its existing functions), to design a new fire control radar (or to optimize some of its existing functions), or both.



An effective test design requires systems engineers, testers, and analysts to have a common and clear understanding of the scope of the test: what is to be tested, why it is to be tested, how the test will be carried out, where and when the test will be carried out, who will carry out the test, what data is to be collected, how the data is to be analyzed, and what the criteria are for evaluating the data. These questions are typical of every test, and their answers may involve identifying regions for which previous test results are known, regions where it is impossible to test, the complexity of the relationship between factors and the response, the type of response expected, the type of possible interactions, how to control potential sources of natural and induced variation, how to recover if test runs cannot be carried out, how many individuals are required to carry out the test, how much time is available for testing, and what information is to be extracted from the test. The formulation of the problem should consider requirements such as test areas, supporting forces, threat systems or simulators, new instrumentation, hours of operation, environmental conditions, maintenance demonstrations, testing profiles, and other unique test requirements. The CTIs and COIs are part of the problem statement. Clear, testable, and measurable test objectives can be derived from systems engineering requirements, operational requirements, previously identified deficiencies, added capabilities, continuing programmatic assessments, verification and validation (V&V) requirements, and tactics, techniques, and procedures (TTP).

Guidelines

Identify the type of problem.

Formulate a clear, concise, testable, and measurable objective.

Answer the “what, when, who, where, why, and how” questions pertaining to the objective.



4.2 Identify the evaluation measures

“Today's scientists have substituted mathematics for experiments, and they wander

off through equation after equation, and eventually build a structure which has no

relation to reality.” Nikola Tesla

This is an extremely important and critical step in the process. The

responses are the output variables that are observed or measured by the test.

Suitable measures to assess or evaluate the objectives include KPP, KSA, CTP,

MOE, MOS, MOP, mission related parameters, or any other appropriate system

characteristic or responses to stimuli. All of the responses need to be completely

documented to give validity to the test.

It is also important to define the metrology system prior to carrying out a test: what to measure, how to measure it, where to measure, etc. The methods of measuring the response, whether directly or indirectly, produce variability in the response. This variability may have contributions from the system, materials, parts, or operators. In any case, identifying, separating, and removing the variability leads to improved response values. The attributes of the measurement system that need to be considered for measuring the response are accuracy, precision, stability, and capability.

It is equally important to state the objectives of the test in measurable terms. First, the problem statement should define the change in each response ($\Delta y_i$) that is important to detect. Second, the problem statement should contain an estimate of the experimental error ($\sigma$) for each response. The ratio of these two terms, the signal-to-noise ratio ($\Delta y/\sigma$), combined with the confidence, should then be used to drive the power of the design. Power is the probability of revealing an active effect of size $\Delta y$ relative to the noise, as measured by the signal-to-noise ratio $\Delta y/\sigma$. Power gives us the probability of detecting factor effects when they are really present. As a guideline, it is desirable to have a power of at least 80% for the effect size of interest. Power can be increased by increasing the sample size or the number of replicates.
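The relationship between signal-to-noise ratio, run count, and power can be sketched numerically. The example below is a rough illustration rather than the method used by any particular DOE package: it assumes a balanced two-level design, so that the standard error of an effect estimate is $2\sigma/\sqrt{N}$, and it uses a normal approximation that treats $\sigma$ as known. The function names `approx_power` and `runs_for_power` are hypothetical. With few error degrees of freedom, an exact t-based calculation would give lower power than this approximation.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(snr: float, n_runs: int, alpha: float = 0.05) -> float:
    """Approximate power to detect an effect with snr = delta_y / sigma.

    Assumes a balanced two-level design with n_runs total runs, so the
    standard error of an effect is 2 * sigma / sqrt(n_runs), and uses a
    normal approximation (sigma treated as known).
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)  # two-sided critical value
    shift = snr * sqrt(n_runs) / 2.0       # effect divided by its standard error
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


def runs_for_power(snr: float, target: float = 0.80, alpha: float = 0.05) -> int:
    """Smallest even run count whose approximate power reaches the target."""
    if snr <= 0:
        raise ValueError("snr must be positive")
    n = 4
    while approx_power(snr, n, alpha) < target:
        n += 2
    return n


if __name__ == "__main__":
    for n in (8, 16, 32, 64):
        print(f"runs={n:3d}  approx. power={approx_power(1.0, n):.2f}")
    print("runs for 80% power at S/N = 1:", runs_for_power(1.0))
```

Under these assumptions, doubling the number of runs raises the power for a fixed signal-to-noise ratio, which is the effect of added replicates described above.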

Guidelines

Identify the responses early in the planning phase.

Identify and document concerns associated with the responses.

Identify the metrology system that will be used and its accuracy, precision, stability, and capability.

Identify sources and magnitude of variability that could potentially affect the responses.

Express the problem in statistical terms: power, confidence, and statistical signal-to-noise ratio.

4.3 Define the test space

“One test is worth a thousand expert opinions.”

Wernher Von Braun

Key to defining the test space is identifying all of the potential factors that affect the response variables and the span, or levels, of their influence. Factors are the variables that can be manipulated throughout the test and that can have an effect on the response. They can be grouped into either potential design factors or nuisance factors, depending on how much interest there is in them. It is extremely important to identify all of the factors that are likely to have an effect on the response, even as background, and to reach consensus on which are of interest and are to be considered in the experiment. DOE provides a structured process for spanning the operational envelope using the best allocation of resources.



The response variables can be quantitative or qualitative. Qualitative variables are categorical in nature and are expressed in natural language (red/white/blue, short/long, pass/fail, yes/no) rather than numbers. They are discrete and cannot take on all values within the span of the variables. Quantitative variables are numerical in nature (time, length, composition, strength, etc.) and can assume any value within the limits of the variable ranges. It is important to identify potential problems associated with the response during the planning phase of the test. The richness of quantitative variables relative to qualitative variables is illustrated in Figure 7. Note that it is not feasible to generate a continuous response surface when the variables are categorical.

The potential design factors can be categorized as design factors (those factors to be controlled and varied for the experiment), held-constant factors (those factors that may have some effect on the response but are not of interest and are held constant), and allowed-to-vary factors (those factors that may have some effect on the response but are not of interest and are allowed to vary). Key to assessing the response is the assumption that the effects of both held-constant factors and allowed-to-vary factors are negligible from a test design perspective. Nuisance factors can be categorized as controllable factors, uncontrollable factors, and noise factors. These factors can have a large effect on the response. However, their effect is not of interest to the experimenter and can be “blocked” through statistical techniques.

Figure 7. Response surfaces for qualitative (left) and quantitative (right) variables.

Once the design factors have been identified, the next step is to identify the range or region of interest for each factor and the levels for the test runs. This task is not always easy, and may require consultation with subject matter experts. It is important to understand the feasibility of testing specific treatments, or combinations of factor levels, since some combinations may result in undesirable, unsafe, or costly outcomes. There are specific designs and analysis techniques that address debarred regions of interest.
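As a minimal sketch of what defining a test space can look like in practice, the example below enumerates every combination of factor levels and then removes the combinations that fall in a debarred region. The factor names, levels, and the safety rule in `is_debarred` are invented purely for illustration; a real test would take them from the requirements and from subject matter experts.

```python
from itertools import product

# Hypothetical factors and levels, invented for illustration only.
factors = {
    "altitude_kft": [5, 15, 25],
    "speed_mach": [0.6, 0.9],
    "target_type": ["small", "large"],  # categorical factor
}


def is_debarred(point: dict) -> bool:
    """Hypothetical safety rule: no high-speed runs at the lowest altitude."""
    return point["altitude_kft"] == 5 and point["speed_mach"] == 0.9


def candidate_points(factors: dict, debarred) -> list:
    """All combinations of factor levels that are feasible to test."""
    names = list(factors)
    all_points = [dict(zip(names, combo)) for combo in product(*factors.values())]
    return [p for p in all_points if not debarred(p)]


points = candidate_points(factors, is_debarred)
print(f"{len(points)} feasible treatments")
```

The surviving candidate list is the kind of input that designs for constrained regions work from; documenting the debarring rule alongside it records why certain treatments were excluded.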

Guidelines

Identify and understand all of the potential factors that can affect the responses.

Use factors that make sense and are consistent with the test requirements.

Define the test space, i.e., the region of interest and factor levels.

Understand and document the consequences of testing specific combinations of factor levels.

4.4 Determine a test design structure

“The first principle is that you must not fool yourself and you are the easiest person

to fool.” Richard P. Feynman

This is probably the proper point to develop the test protocol, a data collection

plan, and an analysis and assessment plan. This step should be relatively easy if

the previous steps are done correctly.

There are many different types of designs to choose from, each with its own distinguishing features. Some of the designs were developed for specific applications. Considerations that influence the selection of a test design include the nature of the problem to be studied, the number of potential factors and interactions that affect the response, the number of replicates required, the number of runs available, the efficiency of the design, the ease of carrying it out, restricted or debarred regions of interest, blocking of influential nuisance factors, restrictions on randomization, and the resources available to carry out the experiment.

It is common to find a large number of potential design factors for a test. Because the number of runs or treatments increases geometrically with the number of factors, it is common to run a screening test design to identify the few (statistically) significant factors from the many trivial ones that affect a response. Screening relies on the sparsity of effects principle, which states that the response is dominated by some main effects (the effects of each individual factor) and low-order factor interactions (effects of combined factors), and that high-order factor interactions are negligible. The significant factors serve as the basis for subsequent tests, while the insignificant factors are held constant in subsequent tests. Typical screening designs include full factorial designs ($2^k$), fractional factorial designs ($2^{k-p}$), minimum run resolution IV (MR Res 4), minimum run resolution V (MR Res 5), irregular fraction, general factorial, optimal, Plackett-Burman, and Taguchi orthogonal designs. Definitions of all of these designs can be found in Montgomery (2013). Table 1 shows some design alternatives.

Table 1. Some experimental design alternatives (completely randomized designs; model: ME + 2FI; power for a 1 std. dev. effect at $\alpha$ = 0.05)

Design        Runs  Center points  Power (%) (ME)  VIF        DOF Model  DOF LOF  DOF PE  Std. error (FDS = 0.8)
MR-Res IV     12    0              27-28           1.1 - 7.0  10         1        0       1.5
$2_V^{5-1}$   16    0              -               1.0        15         0        0       -
D-Optimal     21    0              54-57           1.1        15         5        0       1.0
$2^5$         32    0              76              1.0        15         16       0       0.7

ME – main effects; 2FI – two-factor interactions; LOF – lack-of-fit; DOF – degrees of freedom; VIF – variance inflation factor; PE – pure error; FDS – fraction of the design space
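The geometric growth in run count, and the number of terms in a main-effects-plus-two-factor-interactions model, can be tabulated with a few lines of arithmetic. This is a small sketch with hypothetical function names; for five factors it gives 32 runs for the full factorial, 16 for the half fraction, and 15 model terms.

```python
from math import comb


def full_factorial_runs(k: int) -> int:
    """Runs in a two-level full factorial with k factors."""
    return 2 ** k


def fractional_factorial_runs(k: int, p: int) -> int:
    """Runs in a 2^(k-p) fractional factorial."""
    return 2 ** (k - p)


def me_2fi_terms(k: int) -> int:
    """Number of main effects plus two-factor interactions for k factors."""
    return k + comb(k, 2)


print("k  full  half  ME+2FI terms")
for k in range(3, 9):
    print(k, full_factorial_runs(k), fractional_factorial_runs(k, 1), me_2fi_terms(k))
```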



Resolution refers to the degree to which the estimates of treatment effects are contaminated by (aliased with) other treatment effects. For example, resolution III indicates that main effects are aliased with two-factor interactions (2FIs), while resolution IV indicates that main effects are aliased with three-factor interactions and 2FIs are aliased with other 2FIs. This is an extremely important property that needs to be kept in mind while selecting a design.
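Aliasing can be made concrete by building the design columns and checking which main-effect columns coincide with two-factor-interaction columns. The sketch below uses common textbook generators as assumptions (E = ABCD for a resolution V half fraction, and D = AB, E = AC for a resolution III quarter fraction); the helper names are hypothetical.

```python
from itertools import combinations, product
from math import prod


def fractional_factorial(base: int, generators: dict) -> dict:
    """Columns (factor name -> list of -1/+1) for a two-level fraction.

    base: number of base factors, named A, B, C, ...
    generators: added factor -> word of base factors, e.g. {"E": "ABCD"}.
    """
    names = [chr(ord("A") + i) for i in range(base)]
    runs = list(product((-1, 1), repeat=base))
    cols = {n: [r[i] for r in runs] for i, n in enumerate(names)}
    for new, word in generators.items():
        cols[new] = [prod(cols[f][j] for f in word) for j in range(len(runs))]
    return cols


def aliased_pairs(cols: dict) -> list:
    """(main effect, 2FI) pairs whose columns are identical or opposite."""
    found = []
    for main in cols:
        for a, b in combinations(cols, 2):
            if main in (a, b):
                continue
            inter = [x * y for x, y in zip(cols[a], cols[b])]
            dot = sum(m * i for m, i in zip(cols[main], inter))
            if abs(dot) == len(inter):
                found.append((main, a + b))
    return found


res5 = fractional_factorial(4, {"E": "ABCD"})           # 2^(5-1), 16 runs
res3 = fractional_factorial(3, {"D": "AB", "E": "AC"})  # 2^(5-2), 8 runs

print("Resolution V main-effect/2FI aliases:", aliased_pairs(res5))
print("Resolution III main-effect/2FI aliases:", aliased_pairs(res3))
```

In the resolution III fraction, an apparent main effect cannot be separated from the two-factor interaction that shares its column, which is why the alias structure needs to be understood before a design is selected.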

Sometimes it is more practical to carry out a sequence of smaller tests than

one large test. In those cases, each smaller test is designed for a specific purpose

and serves as a “stepping stone” for the next test. This sequential nature of test

design is an extremely important and useful feature, especially when resource

competition is an issue. It leads to economical and effective tests, and affords

opportunities for changing responses, adding or deleting factors, modifying the

levels or ranges of the factors, and augmenting the designs.

A helpful and often used strategy for the screening phase involves the use of fractional factorials. The successful use of fractional factorial designs depends on three concepts: the sparsity of effects principle, the projection property, and sequential experimentation. The projection property states that fractional factorial designs can be projected into larger designs in the subset of significant factors. For example, consider that five factors were screened using a $2_V^{5-1}$ fractional factorial design (one-half fraction of a full factorial) and only four factors were determined to be significant. The $2_V^{5-1}$ fractional factorial can be projected into a $2^4$ full factorial design to improve the properties of the experiment. Figure 8 illustrates the concepts.

Figure 8. Key concepts involving fractional factorials
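The projection property in the example above can be checked directly. The sketch below builds a 16-run half fraction in five factors, assuming the generator E = ABCD (a common choice, not one stated in this report), drops each factor in turn, and confirms that the remaining four columns form a complete $2^4$ factorial. The names `project` and `projects_to_full_factorial` are hypothetical.

```python
from itertools import product
from math import prod

NAMES = "ABCDE"

# Half fraction: full factorial in A-D plus the generated column E = ABCD.
runs = [base + (prod(base),) for base in product((-1, 1), repeat=4)]


def project(runs: list, names: str, dropped: str) -> list:
    """Remove one factor's column and return the remaining settings per run."""
    keep = [i for i, n in enumerate(names) if n != dropped]
    return [tuple(r[i] for i in keep) for r in runs]


def projects_to_full_factorial(runs: list, names: str, dropped: str) -> bool:
    """True if the projected design contains each of the 16 combinations once."""
    projected = project(runs, names, dropped)
    full = set(product((-1, 1), repeat=len(names) - 1))
    return len(projected) == len(full) and set(projected) == full


for factor in NAMES:
    print(f"drop {factor}: full 2^4 factorial =",
          projects_to_full_factorial(runs, NAMES, factor))
```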

The curvature of the response surface is tested by augmenting the design with center points and evaluating lack-of-fit, which indicates how well the model fits the observations. Once those steps have been completed, the next phase is to fit an empirical model that relates each response to the significant factors and their interactions. This model can be used to predict performance within the region of interest or to find different combinations of factor levels that result in similar responses. The most frequently used designs for this phase are response surface designs, which include central composite designs (CCD) (standard, circumscribed, and inscribed), Draper-Lin designs, Box-Behnken designs (BBD), space-filling designs, uniform designs (UD), hybrid designs, uniform shell or Doehlert designs, Koshal designs, Hoke designs, optimal designs, and MR Res 5 CCDs. These designs allow for fitting the data to a second- or higher-order model.
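As an illustration of how one such design is assembled, the sketch below generates the coded points of a circumscribed central composite design: a two-level factorial core, axial points at distance alpha on each axis, and replicated center points. The default alpha is the rotatable choice for a full-factorial core, the fourth root of the number of factorial points. The function name and the default of four center points are assumptions made for this example.

```python
from itertools import product


def central_composite(k, n_center=4, alpha=None):
    """Coded design points for a circumscribed CCD in k factors."""
    if alpha is None:
        alpha = (2 ** k) ** 0.25  # rotatable choice for a full-factorial core
    factorial = [tuple(float(x) for x in p) for p in product((-1, 1), repeat=k)]
    axial = []
    for i in range(k):
        for sign in (-1.0, 1.0):
            point = [0.0] * k
            point[i] = sign * alpha  # move along one axis only
            axial.append(tuple(point))
    center = [tuple([0.0] * k)] * n_center
    return factorial + axial + center


def second_order_terms(k):
    """Coefficients in a full second-order model, including the intercept."""
    return (k + 1) * (k + 2) // 2


design = central_composite(2)
print(len(design), "runs;", second_order_terms(2), "model coefficients")
for point in design:
    print(point)
```

The axial points give each factor five levels, which is what allows the pure quadratic terms of a second-order model to be estimated, and the replicated center points provide an estimate of pure error.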

Response surface designs can be judged by both qualitative and quantitative criteria. The criteria, which have matured since 1951 and appear in leading textbooks such as Montgomery (2013), include:

Provides good information throughout the design space. The design should provide for a reasonable distribution of data points through the battle space.

Provides a good fit between the model and the data. The differences between the predictions made by the empirical model generated by the design and the true values at a given setting of the factor levels should be small.

Allows for investigating the model adequacy, including lack-of-fit and a good check on the homogeneous variance assumption. Lack-of-fit accounts for how well the model fits the observations (or the variation due to higher order terms not included in the model). Box & Draper (2007) point out that ideally the size of the design space should be adequate, the model should be as simple as possible, and the response function should be smooth.

Permits blocking. This is extremely important when considering nuisance factors involving the homogeneity of materials, availability of operators, and process control concerns.

Allows for the sequential build-up of designs of higher order. A small number of runs can provide information used to plan the next test.

Allows for the estimation of pure error. Pure error accounts for the natural variation between points at the same factor levels. The design should include a sufficient number of runs to allow for the calculation of error, especially if large experimental errors are expected.

Is robust. The design should be insensitive to outliers, missing observations, and errors in the control and setting of the factor levels.

Is compact (requires a small number of runs and a practical number of factor levels) and easy to analyze. The design should provide simple data patterns that allow for visual interpretation and should ensure simplicity of calculation of the model parameters.

Requires a practical number of factor levels.

Provides a good distribution of the prediction variance over the design space.

Is cost effective.

The last phase in the sequence is the verification phase. The objective is to select a good design and conduct some confirmatory testing to validate the prediction capability of the response surface model. Good designs for verification or confirmation are resolution III fractional factorial designs.

Clearly, there are compelling reasons for selecting a design with good features. Unfortunately, a single design rarely achieves all of these features. Frequently the choice of the most useful design is driven by a compromise between cost, schedule, and performance considerations. Nonetheless, it is more important to select an approximate test design for the exact problem than to select an exact test design for an approximate problem.

Guidelines

Choose a test design that fits the problem and a strategy that fits the resources, and adhere as much as possible to the principles of factorization, replication, randomization, and blocking.

Choose a design with good properties, such as resolution and power.

Apply a sequential test strategy if possible.

Understand the aliasing structure completely.

Allocate sufficient runs to estimate pure error and to test for curvature and lack-of-fit.

4.5 Randomize the test sequence

“In theory there is no difference between theory and practice. In practice there is.”

Yogi Berra

It is extremely important to carry out the test following the fundamental principles of design of experiments (factorization, replication, randomization, and local control of error) as outlined by the test protocol. Factorization, or the factorial principle, involves making simultaneous and intentional changes to the input variables of a process to find and exploit the relationships between them and the response. Replication is the assignment of a treatment to more than one experimental unit, and is key to obtaining (1) a valid and more precise estimate of the error and (2) a reduction in the error. Randomization is the random assignment of treatments to experimental units, which averages out the effects of the undesirable factors that are present in the experiment and generally enables the assumption that the errors are independently and identically distributed (a critical assumption). Local control of error attempts to eliminate the effects of nuisance factors on the response, which improves the precision of the comparison of factors of interest and reduces or eliminates the component of the variability that is transmitted from nuisance factors. Blocking is a form of local control of error. The run order is important and should be as specified in the design matrix. Procedural errors could have a catastrophic effect on the response if they are not eliminated or at least significantly minimized. The data must be collected in a manner that is consistent with the way the test was conducted. Before leaving the test stage, common analysis techniques such as plotting the data to visually identify outliers should be used to promptly identify any areas of risk.
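A randomized run order that also respects blocking can be produced with a seeded random number generator so that the test matrix is reproducible. The sketch below assumes, for illustration only, that each replicate is run as its own block (for example, one test day per replicate) and randomizes the treatment order within each block. The function name and seed value are hypothetical.

```python
import random
from itertools import product


def randomized_run_order(treatments, replicates=2, seed=None):
    """Randomize treatment order within each replicate, one block per replicate.

    Returns a list of (run number, block, treatment) tuples.
    """
    rng = random.Random(seed)  # seeded so the plan can be regenerated
    plan = []
    for block in range(1, replicates + 1):
        order = list(treatments)
        rng.shuffle(order)  # random order within the block
        plan.extend((block, t) for t in order)
    return [(run, block, t) for run, (block, t) in enumerate(plan, start=1)]


treatments = list(product((-1, 1), repeat=3))  # 2^3 factorial in coded units
plan = randomized_run_order(treatments, replicates=2, seed=2014)

for run, block, treatment in plan:
    print(f"run {run:2d}  block {block}  levels {treatment}")
```

Recording the seed with the test protocol makes it possible to show afterwards that the executed order matched the planned one, and to document any deviations.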

Guidelines

Carry out the test as it was designed and documented in the test matrix.

Identify and manage unplanned or unexpected sources of variability, particularly those involving test operations and measurements.

Document the test protocol and deviations from it.

Identify any outcomes or measurements that look suspect.

Perform a hot wash-up and quick-look analysis.

4.6 Use statistical techniques for the analysis

“Happy families are all alike; every unhappy family is unhappy in its own way.”

Leo Tolstoy

In 1998, the National Research Council concluded that T&E did not take full

advantage of the benefits afforded by state-of-the-art statistical methods. Statistics

is the mathematical science dealing with the planning, collection, analysis,

organization, explanation, and presentation of data. A common goal of statistics is



to investigate and establish a relationship between events. Clearly, it seems

practical to exploit the tremendous power of combining statistical methods and

testing with the specific goal of acquiring deeper knowledge and understanding of a

system and of maximizing the usefulness and accuracy of the information. STAT

allows us to exploit the power of combining sound test practices and statistical data

analysis.

It is also important to analyze the data immediately after the test. Once the factor effects and their interactions are estimated using techniques such as the analysis of variance (ANOVA), their significance is determined by comparing them to the error term. Ideally, the error term is just pure experimental error, which is estimated from the variability among the experimental units obtained from replication. Depending on the test objectives, mathematical techniques such as gradient methods, Newton methods, or conjugate methods can be used along with the empirical model to search for the optimum value.
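The comparison of factor effects against pure error can be sketched for a small replicated design. The observations below are made up for illustration and do not come from any test. The example estimates the effects of A, B, and AB in a replicated $2^2$ factorial, estimates pure error from the replicates, and forms the ratio of each effect to its standard error, $\sqrt{4\,MS_{PE}/N}$. The critical value 2.776 is the two-sided 5% point of Student's t distribution with 4 degrees of freedom.

```python
from math import sqrt
from statistics import mean

# Made-up observations: (A, B) coded levels -> replicate responses.
data = {
    (-1, -1): [28.0, 25.0],
    (+1, -1): [36.0, 32.0],
    (-1, +1): [18.0, 19.0],
    (+1, +1): [31.0, 30.0],
}

columns = {
    "A": lambda x: x[0],
    "B": lambda x: x[1],
    "AB": lambda x: x[0] * x[1],
}


def effect(column):
    """Average response at the high level minus average at the low level."""
    high = [y for x, ys in data.items() for y in ys if column(x) > 0]
    low = [y for x, ys in data.items() for y in ys if column(x) < 0]
    return mean(high) - mean(low)


effects = {name: effect(col) for name, col in columns.items()}

# Pure error: variation among replicates at identical factor settings.
ss_pe = sum((y - mean(ys)) ** 2 for ys in data.values() for y in ys)
df_pe = sum(len(ys) - 1 for ys in data.values())
ms_pe = ss_pe / df_pe

n_total = sum(len(ys) for ys in data.values())
se_effect = sqrt(4 * ms_pe / n_total)
t_ratios = {name: e / se_effect for name, e in effects.items()}

T_CRIT = 2.776  # two-sided 5% critical value of Student's t with 4 df
for name in effects:
    flag = "significant" if abs(t_ratios[name]) > T_CRIT else "not significant"
    print(f"{name:2s} effect={effects[name]:6.2f}  t={t_ratios[name]:6.2f}  {flag}")
```

With these illustrative numbers the two main effects exceed the critical value and the interaction does not. A full analysis would also check the residuals against the model assumptions.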

Guidelines

Analyze the data immediately after the test to identify outliers and other test protocol problems.

Re-run tests that have suspect test results, if possible.

Verify that fundamental assumptions have not been violated.

Apply the correct statistical analysis technique as dictated by the design matrix.

Validate the empirical results and the model.

4.7 Inform the decision

This is a very logical step towards the conclusion of the test process. Once the data has been analyzed, it is important to organize the results in order to draw appropriate conclusions and practical recommendations about the test process, structure, designs, and, more importantly, the performance of the system. This is the time to provide information to decision makers. Potential decision points involve assessing or modifying requirements, correcting training deficiencies, improving combat capability, determining performance limiters, documenting test results and outcomes, correcting system deficiencies (design, implementation, suitability), and improving models and simulations.

Guidelines

Draw rational conclusions about the performance of the system.

Draw rational conclusions about the overall test process.

Draw rational conclusions about the efficiency of the test design.

5.0 Summary

Test and Evaluation (T&E) is a means of providing essential information to decision makers regarding the capabilities and operational conditions of a system. Design of experiments is the integration of well-defined and structured scientific strategies for gathering empirical knowledge using statistical methods for planning, designing, executing, and analyzing a test. Design of experiments adds rigor and discipline to T&E and facilitates a comprehensive understanding of the tradeoffs in the techno-programmatic domains: risks, cost, and utility of information. Design of experiments can help reduce test assets, shorten the test schedule, and provide more information to the warfighter and decision makers.



Appendix A - Paradigm of well-defined tests

Clear, testable, and measurable test objectives

o Systems engineering requirements

o Operational requirements

o Previously identified deficiencies

o Added capabilities

o Continuing assessment

o Verification and validation (V&V)

o Tactics, techniques, and procedures (TTP)

Suitable measures to assess or evaluate the objectives

o Key performance parameters (KPP)

o Key system attributes (KSA)

o Critical technical parameters (CTP)

o Measures of effectiveness (MOE) and measures of suitability (MOS)

o Measures of performance (MOP)

Appropriate analysis and assessment methods and evaluation criteria

o Data analysis plan

o Evaluation criteria

Data needed

o Data collection plan (DCP)

o Data management plan (DMAP)

Adequate test venues and test resources

o Hardware in the loop (HWIL)

o Modeling and simulation (M&S)

o Live events

o Real world events



Test plans

o Scenario plans and certification

o Instrumentation plans

o Test execution plans

Data analysis and assessment

o Functionality

o Performance

o Suitability

Results and outcomes

o Modify requirements

o Correct training deficiencies

o Correct system deficiencies (design, implementation, suitability)

o Improve capability

o Determine performance limiters

o Improve models



References

Coleman, D. E., & Montgomery, D. C. (1993). A Systematic Approach to Planning for a Designed Industrial Experiment (with Discussion). Technometrics, 35, 1-27.

Montgomery, D. C. (2013). Design and Analysis of Experiments (8th ed.). New York: John Wiley & Sons.
