STAT T&E Center of Excellence 2950 Hobson Way – Wright-Patterson AFB, OH 45433
Best practices for highly effective test design; Part 2 – Beginners’ guide to
design of experiments in T&E
Luis A. Cortes, PE
The goal of the STAT T&E COE is to assist in developing rigorous,
defensible test strategies to more effectively quantify and characterize
system performance and provide information that reduces risk. This
and other COE products are available at www.AFIT.edu/STAT.
STAT COE Report 14-2014
Abstract
Test & Evaluation (T&E) produces knowledge about the true capability of a system
by comparing the analysis of empirical observations obtained from stimulating a
system to requirements and standards. The ultimate goal is to transform the
knowledge gained from testing into “decision-quality-information” to inform the
system engineering process and key acquisition decisions. A practical and efficient
method for transforming knowledge into information is design of experiments
(DOE). DOE is the systematic integration of well-defined and structured scientific
strategies for gathering empirical knowledge about a system and transforming it
into information by using statistical methods. Coleman & Montgomery (1993)
proposed guidelines for design of experiments. This document presents the
guidelines as best practices for planning, designing, executing, and analyzing a test
within the Department of Defense’s T&E framework: (1) formulating test objectives;
(2) selecting the measures to be analyzed; (3) defining the test space; (4) choosing a
test design; (5) performing the test; (6) conducting statistical data analysis; and
(7) formulating conclusions and recommendations. The use of these practices will
result in robust and disciplined T&E strategies that can yield better interpretation
of test observations and, consequently, a better understanding of the state of the
system capabilities and the risks associated with the decisions to be made.
Key words: design of experiments, DOD test and evaluation, test design
Table of Contents
4.0 Best Practices for Effective Test Design
4.1 Formulate a clear test objective
4.2 Identify the evaluation measures
4.3 Define the test space
4.4 Determine a test design structure
4.5 Randomize the test sequence
4.6 Use statistical techniques for the analysis
4.7 Inform the decision
5.0 Summary
Appendix A - Paradigm of well-defined tests
References
Best practices for highly effective test design;
Part 2 – Beginners’ guide to design of experiments in T&E
4.0 Best Practices for Effective Test Design
Coleman and Montgomery (1993) provide guidelines for design of experiments. Figure 6
illustrates the guidelines in the context of the plan-design-execute-analyze paradigm.
These guidelines, or best practices for design of experiments, are the foundation for
test design.
4.1 Formulate a clear test objective
“If I had an hour to solve a problem and my life depended on the solution, I would
spend the first 55 minutes determining the proper question to ask, for once I know
the proper question, I could solve the problem in less than five minutes.”
Albert Einstein
Identifying the type of problem being studied and formulating a clear and
well understood objective are the first steps of a well-designed test. This practice,
as fundamental as it sounds, sometimes is ignored because the type of problem
being studied (and consequently the type of test required) is unclear or because it is
just hard to write it down. In addition to a clear and well understood objective, a
clearly defined problem statement has a defined solution path and, often, a defined
expected solution.
Figure 6. Guidelines for design of experiments
The types of problems can be classified in several ways. One way is to classify
the problem as a state problem or as a process problem. In a state problem, the
interest is in the given state of a system at a particular time. In a process problem,
the interest is in the change of a system over time. For instance, the interest could
be in measuring the influence that several factors have on the reaction time of a
weapon system for a particular computer program baseline prior to a deployment
(state problem) or in measuring the change between two different baselines as a
result of added capabilities (process problem).
Similarly, a problem can be classified as a screening, characterization,
optimization, or validation problem. In a screening problem, the interest is in
identifying the few significant factors, from the trivial many, that affect the
response of a system. Some people use the terms screening and characterization
interchangeably. Other people refer to a characterization problem as one in which
the interest is in developing a precise response surface or mathematical model. An
optimization problem is one in which the interest is to determine the region of the
significant factor space that leads to the best possible performance. A validation
problem is one in which the interest is to establish that a system performs as
advertised. The distinctions between the types of problems lead to different
objectives and types of experimental designs.
For example, the results of a test to determine the probability of kill of a
missile, depending on how the problem objective is phrased, could be used to
design a new missile (or to optimize some of its existing functions), or to design a
new fire control radar (or to optimize some of its existing functions), or both.
An effective test design requires systems engineers, testers, and analysts to
have a common and clear understanding of the scope of the test: what is to be
tested, why it is going to be tested, how the test will be carried out, where and when
the test will be carried out, who will carry out the test, what data is to be collected,
how the data is to be analyzed, and what the criteria are for evaluating the data.
These types of questions are typical of every test, and their answers may involve
identifying regions for which previous test results are known, the identification of
regions where it is impossible to test, the complexity of the relationship between
factors and the response, the type of response expected, the type of possible
interactions, how to control potential sources of natural and induced variation, how
to recover if test runs cannot be carried out, how many individuals are required to
carry out the test, how much time is available for testing, and what information is
to be extracted from the test. The formulation of the problem should consider
requirements such as test areas, supporting forces, threat systems or simulators,
new instrumentation, hours of operation, environmental conditions, maintenance
demonstrations, testing profiles, and other unique test requirements. The CTIs and
COIs are part of the problem statement. Clear, testable, and measurable test
objectives can be derived from systems engineering requirements, operational
requirements, previously identified deficiencies, added capabilities, continuing
programmatic assessments, verification and validation (V&V) requirements, and
tactics, techniques, and procedures (TTP).
Guidelines
Identify the type of problem.
Formulate a clear, concise, testable, and measurable objective.
Answer the “what, when, who, where, why, and how” questions pertaining to
the objective.
4.2 Identify the evaluation measures
“Today's scientists have substituted mathematics for experiments, and they wander
off through equation after equation, and eventually build a structure which has no
relation to reality.” Nikola Tesla
This is an extremely important and critical step in the process. The
responses are the output variables that are observed or measured by the test.
Suitable measures to assess or evaluate the objectives include KPP, KSA, CTP,
MOE, MOS, MOP, mission related parameters, or any other appropriate system
characteristic or responses to stimuli. All of the responses need to be completely
documented to give validity to the test.
It is also important to define the metrology system prior to carrying out a
test: what to measure, how to measure it, where to measure, etc. The methods of
measuring the response, whether directly or indirectly, produce variability in the
response. This variability may have contributions from the system, materials,
parts, or from operators. In any case, identifying, separating, and removing the
variability leads to improved response values. The attributes of the measurement
system that need to be considered for measuring the response are accuracy,
precision, stability, and capability.
It is equally important to state the objectives of the test in measurable terms.
First, the problem statement should define the change in each response ($\Delta y_i$)
that is important to detect. Second, the problem statement should contain an
estimate of the experimental error ($\sigma$) for each response. The ratio of these
two terms, the signal-to-noise ratio ($\Delta y/\sigma$), combined with the confidence
level, should then be used to drive the power of the design. Power is the probability
of revealing an active effect of size $\Delta y$ relative to the noise, as measured by
the signal-to-noise ratio $\Delta y/\sigma$; that is, the probability of detecting
factor effects when they are really present. As a guideline, it is desirable to have a
power of at least 80% for the effect size of interest. Power can be increased by
increasing the sample size or the number of replicates.
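The relationship between run count, signal-to-noise ratio, and power can be sketched numerically. The example below is only an approximation under stated assumptions: a balanced two-level design, a known error standard deviation (so the normal distribution stands in for the t distribution), and a two-sided test. The function name `approx_power` is hypothetical, and the result will be somewhat optimistic for small designs with few error degrees of freedom.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(n_runs: int, snr: float, alpha: float = 0.05) -> float:
    """Approximate power to detect one effect in a two-level factorial design.

    Assumes a balanced design (half the runs at each level of the factor),
    a known error standard deviation, and a two-sided test.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # The effect estimate is mean(high) - mean(low); its standard error is
    # 2*sigma/sqrt(N), so the standardized effect is (delta/sigma)*sqrt(N)/2.
    z_effect = snr * sqrt(n_runs) / 2
    return std_normal.cdf(z_effect - z_crit) + std_normal.cdf(-z_effect - z_crit)


if __name__ == "__main__":
    for n in (8, 16, 32, 64):
        print(f"N = {n:3d}  power = {approx_power(n, snr=1.0):.2f}")
```

Under these assumptions, a signal-to-noise ratio of 1 needs roughly 32 runs to reach the 80% guideline, which shows how the required effect size and the noise estimate drive the size of the test.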
Guidelines
Identify the responses early in the planning phase.
Identify and document concerns associated with the responses.
Identify the metrology system that will be used and its accuracy, precision,
stability, and capability.
Identify sources and magnitude of variability that could potentially affect the
responses.
Express the problem in statistical terms: power, confidence, and statistical
signal-to-noise ratio.
4.3 Define the test space
“One test is worth a thousand expert opinions.”
Wernher Von Braun
Key to defining the test space is to identify all of the potential factors that
affect the response variables and the span, or levels, of their influence. Factors are
the variables that can be manipulated throughout the test and that can have an
effect on the response. They can be grouped into either potential design factors or
nuisance factors, depending on how much interest there is in them. It is extremely
important to identify all of the factors that are likely to have an effect on the
response, even as background, and to get consensus on which are of interest and to
be considered in the experiment. DOE provides a structured process for spanning
the operational envelope using the best allocation of resources.
The response variables can be quantitative or qualitative variables.
Qualitative variables are categorical in nature, and are expressed in natural
language (red/white/blue, short/long, pass/fail, yes/no) rather than numbers. They
are discrete, and cannot take on all values within the span of the variables.
Quantitative variables are numerical in nature (time, length, composition, strength,
etc.) and can assume any possible value within the limits of the variable ranges. It
is important to identify potential problems associated with the response during the
planning phase of the test. The richness of the quantitative variables over the
qualitative variables is illustrated in Figure 7. Note that it is not feasible to
generate a continuous response surface when the variables are categorical.
The potential design factors can be categorized as design factors (which are
those factors to be controlled and varied for the experiment), held-constant factors
(which are those factors that may have some effect on the response but are not of
interest and are held constant), and allowed-to-vary factors (which are those factors
that may have some effect on the response but are not of interest and are allowed to
vary). Key to assessing the response is the assumption that the effects of both held-
constant factors and allowed-to-vary factors are negligible from a test design
perspective. Nuisance factors can be categorized as controllable factors,
uncontrollable factors, and noise factors. These factors can have a large effect on
the response. However, their effect is not of interest to the experimenter and can be
“blocked” through statistical techniques.
Figure 7. Response surfaces for qualitative (left) and quantitative (right) variables.
Once the design factors have been identified, the next step is to identify the
range or region of interest for each factor and the levels for the test runs. This task
is not always easy, and may require consultation with subject matter experts. It is
important to have an understanding of the feasibility of testing specific treatments,
or combinations of factor levels, since some combinations may result in
undesirable, unsafe, or costly outcomes. There are specific designs and analysis
techniques that address debarred regions of interest.
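As a minimal sketch of this step, the test space can be written down as a set of factors and levels, enumerated, and then filtered for infeasible treatments. The factor names, levels, and the safety rule below are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical factors and levels, chosen only for illustration.
factors = {
    "altitude_ft": [500, 5000],
    "speed_kts": [200, 400],
    "target_type": ["small", "large"],
}


def is_debarred(run: dict) -> bool:
    """Hypothetical safety rule: no high-speed runs at low altitude."""
    return run["altitude_ft"] == 500 and run["speed_kts"] == 400


def build_test_space(factors: dict, debarred) -> list:
    """Enumerate every treatment, then drop the debarred combinations."""
    names = list(factors)
    all_runs = [dict(zip(names, levels)) for levels in product(*factors.values())]
    return [run for run in all_runs if not debarred(run)]


test_space = build_test_space(factors, is_debarred)
for run in test_space:
    print(run)
```

Dropping treatments this way leaves an unbalanced set of runs, which is one reason specific designs and analysis techniques exist for debarred regions.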
Guidelines
Identify and understand all of the potential factors that can affect the responses.
Use factors that make sense and are consistent with the test requirements.
Define the test space—i.e. the region of interest and factor levels.
Understand and document the consequences of testing specific combinations of factor
levels.
4.4 Determine a test design structure
“The first principle is that you must not fool yourself and you are the easiest person
to fool.” Richard P. Feynman
This is probably the proper point to develop the test protocol, a data collection
plan, and an analysis and assessment plan. This step should be relatively easy if
the previous steps are done correctly.
There are many different types of designs to choose from, each with its own
distinguishing features. Some of the designs were developed for specific
applications. Considerations that influence the selection of a test design include the
nature of the problem to be studied, the number of potential factors and interactions
that affect the response, the number of replicates required, the number of runs
available, the efficiency of the design, ease of execution, restricted or debarred
regions of interest, blocking of influential nuisance factors, restriction in
randomization, and resources available to carry out the experiment.
It is common to find a large number of potential design factors for a
test. Because the number of runs or treatments increases geometrically with the
number of factors, it is common to run a screening test design to identify the few
(statistically) significant factors from the many trivial ones that affect a response.
Screening relies on the sparsity of effects principle, which states that the response is
dominated by some main effects (the effects of each individual factor) and low-order
factor interactions (effects of combined factors), and that high-order factor
interactions are negligible. The significant factors serve as the basis for subsequent
tests, while the insignificant factors are held constant in subsequent tests. Typical
screening designs include full factorial designs ($2^k$), fractional factorial designs
($2^{k-p}$), minimum run resolution IV (MR Res 4), minimum run resolution V (MR Res 5),
irregular fraction, general factorial, optimal, Plackett-Burman, and Taguchi
orthogonal designs. Definitions of all of these designs can be found in Montgomery
(2013). Table 1 shows some design alternatives.
Table 1. Some experimental design alternatives (Completely Randomized Designs; Model: ME + 2FI; Power (1 std. dev.) at $\alpha$ = 0.05)

Design        Runs  Center  Power (%)  VIF        DOF    DOF  DOF  Std. Error
                    Points  (ME)                  Model  LOF  PE   (FDS = 0.8)
MR-Res IV     12    0       27-28      1.1 - 7.0  10     1    0    1.5
$2_V^{5-1}$   16    0       -          1.0        15     0    0    -
D-Optimal     21    0       54-57      1.1        15     5    0    1.0
$2^5$         32    0       76         1.0        15     16   0    0.7

ME: main effects; 2FI: two-factor interactions; LOF: lack-of-fit; DOF: degrees of freedom; VIF: variance inflation factor; PE: pure error; FDS: fraction of the design space
Resolution refers to the degree to which the estimates of treatment effects
are contaminated by the influence of other treatment effects (aliasing). For example,
resolution III indicates that main effects are aliased with two-factor interactions
(2FI), while resolution IV indicates that main effects are aliased with three-factor
interactions and 2FIs are aliased with other 2FIs. This is an extremely important
property that needs to be kept in mind while selecting a design.
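Aliasing can be checked directly from the design matrix: two terms are aliased when their contrast columns are identical. The sketch below assumes an 8-run half fraction of four factors built with the generator D = ABC, a standard resolution IV construction; the helper names are hypothetical.

```python
from itertools import product


def half_fraction_res_iv() -> list:
    """Build the 8-run 2^(4-1) design with generator D = ABC (coded -1/+1)."""
    return [
        {"A": a, "B": b, "C": c, "D": a * b * c}
        for a, b, c in product((-1, 1), repeat=3)
    ]


def column(design: list, term: str) -> list:
    """Contrast column for a term such as 'A' or 'AB' (product of its factors)."""
    values = []
    for run in design:
        value = 1
        for factor in term:
            value *= run[factor]
        values.append(value)
    return values


design = half_fraction_res_iv()
# Two terms are aliased when their contrast columns are identical.
print("AB aliased with CD:", column(design, "AB") == column(design, "CD"))
print("A aliased with BCD:", column(design, "A") == column(design, "BCD"))
print("A aliased with BC: ", column(design, "A") == column(design, "BC"))
```

In this design the main effects are clear of two-factor interactions, but the AB and CD interactions cannot be separated without additional runs.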
Sometimes it is more practical to carry out a sequence of smaller tests than
one large test. In those cases, each smaller test is designed for a specific purpose
and serves as a “stepping stone” for the next test. This sequential nature of test
design is an extremely important and useful feature, especially when resource
competition is an issue. It leads to economical and effective tests, and affords
opportunities for changing responses, adding or deleting factors, modifying the
levels or ranges of the factors, and augmenting the designs.
A helpful and often used strategy for the screening phase involves the use of
fractional factorials. The successful use of fractional factorial designs depends on
three concepts: the sparsity of effects principle, the projection property, and
sequential experimentation. The projection property states that fractional factorial
designs can be projected into larger designs in the subset of significant factors. For
example, consider that five factors were screened using a $2_V^{5-1}$ fractional
factorial design (one-half fraction of a full factorial) and only four factors were
determined to be significant. The $2_V^{5-1}$ fractional factorial can be projected
into a $2^4$ full factorial design to improve the properties of the experiment.
Figure 8 illustrates the concepts.
Figure 8. Key concepts involving fractional factorials
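The projection property in the five-factor example can be verified by construction. The sketch below assumes the half fraction is built with the generator E = ABCD (a standard resolution V choice); the function names are hypothetical.

```python
from itertools import product


def half_fraction_res_v() -> list:
    """Build the 16-run 2^(5-1) design with generator E = ABCD (coded -1/+1)."""
    return [
        {"A": a, "B": b, "C": c, "D": d, "E": a * b * c * d}
        for a, b, c, d in product((-1, 1), repeat=4)
    ]


def project(design: list, kept: str) -> list:
    """Keep only the columns of the factors named in `kept`."""
    return [tuple(run[f] for f in kept) for run in design]


design = half_fraction_res_v()
for dropped in "ABCDE":
    kept = "ABCDE".replace(dropped, "")
    distinct_runs = set(project(design, kept))
    # 16 distinct runs in 4 two-level factors is exactly a full 2^4 factorial.
    print(f"drop {dropped}: {len(distinct_runs)} distinct runs in {kept}")
```

Whichever single factor turns out to be insignificant, the 16 runs already collected form a full factorial in the remaining four factors.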
The curvature of the response surface is tested by augmenting the design
with center points and evaluating lack-of-fit, which explains how well the model fits
the observations. Once those steps have been completed, the next phase is to fit an
empirical model that relates each response to the significant factors and their
interactions. This model can be used to predict performance within the region of
interest or to find different combinations of factor levels that result in similar
responses. The most frequently used designs for this phase are response surface
designs, which include central composite designs (CCD) (standard, circumscribed,
and inscribed), Draper-Lin designs, Box-Behnken designs (BBD), space-
filling designs, uniform designs (UD), hybrid designs, uniform shell or Doehlert
designs, Koshal designs, Hoke designs, optimal designs, and MR Res 5 CCDs.
These designs allow for fitting the data to a second- or higher-order model.
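To make the structure of one of these designs concrete, the following sketch generates the coded runs of a circumscribed central composite design: factorial points, axial points, and center points. The choice of three center points is an assumption for illustration, and the axial distance is set to the fourth root of the number of factorial points, the usual choice for a rotatable design.

```python
from itertools import product


def central_composite(k: int, n_center: int = 3) -> list:
    """Coded runs of a circumscribed central composite design in k factors."""
    alpha = (2 ** k) ** 0.25  # axial distance that makes the design rotatable
    factorial_pts = [tuple(float(v) for v in p) for p in product((-1, 1), repeat=k)]
    axial_pts = []
    for i in range(k):
        for sign in (-1.0, 1.0):
            point = [0.0] * k
            point[i] = sign * alpha
            axial_pts.append(tuple(point))
    center_pts = [tuple([0.0] * k)] * n_center
    return factorial_pts + axial_pts + center_pts


design = central_composite(k=2)
for run in design:
    print(tuple(round(x, 3) for x in run))
```

The factorial points support the main effects and interactions, the axial points support the pure quadratic terms, and the repeated center points provide an estimate of pure error and a check for curvature.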
Response surface designs can be judged by both qualitative and quantitative
criteria. The criteria, which have matured since 1951 and appear in leading textbooks
such as Montgomery (2013), include:
• Provides good information throughout the design space. The design
  should provide for a reasonable distribution of data points through the
  battle space.
• Provides a good fit between the model and the data. The differences
  between the predictions made by the empirical model generated by the
  design and the true value at a setting of the factor levels should be small.
• Allows for investigating the model adequacy, including lack-of-fit and a
  good check on the homogeneous variance assumption. Lack-of-fit accounts
  for how well the model fits the observations (or the variation due to higher
  order terms not included in the model). Box & Draper (2007) point out
  that ideally the size of the design space should be adequate, the model
  should be as simple as possible, and the response function should be
  smooth.
• Permits blocking. This is extremely important when considering nuisance
  factors involving the homogeneity of materials, availability of operators,
  and process control concerns.
• Allows for the sequential build-up of designs of higher order. A small
  number of runs can provide information used to plan the next test.
• Allows for the estimation of pure error. Pure error accounts for the
  natural variation between points at the same factor levels. The design
  should include a sufficient number of runs to allow for the calculation of
  error, especially if large experimental errors are expected.
• Is robust. The design should be insensitive to outliers, missing
  observations, and errors in the control and setting of the factor levels.
• Is compact (requires a small number of runs and a practical number of
  factor levels) and easy to analyze. The design should provide simple data
  patterns that allow for visual interpretation and should ensure simplicity
  of calculation of the model parameters.
• Requires a practical number of factor levels.
• Provides a good distribution of the prediction variance over the design
  space.
• Is cost effective.
The last phase in the sequence is the verification phase. The objective is to
select a good design and conduct some confirmatory testing to validate the
prediction capability of the response surface model. Good designs for verification or
confirmation are Resolution III fractional factorial designs.
Clearly, there are compelling reasons for selecting a design with good
features. Unfortunately, all of the features are rarely achieved in a design.
Frequently the choice of the most useful design is driven by a compromise between
cost, schedule, and performance considerations. Nonetheless, it is more important
to select an approximate test design for the exact problem than to select an exact
test design for an approximate problem.
Guidelines
Choose a test design that fits the problem, a strategy that fits the resources, and
adhere as much as possible to the principles of factorization, replication,
randomization, and blocking.
Choose a design with good properties—resolution, power, etc.
Apply a sequential test strategy if possible.
Understand completely the aliasing structure.
Allocate sufficient runs to estimate pure error and to test for curvature and lack-of-fit.
4.5 Randomize the test sequence
“In theory there is no difference between theory and practice. In practice there is.”
Yogi Berra
It is extremely important to carry out the test following the fundamental
principles of design of experiments—factorization, replication, randomization, and
local control of error—as outlined by the test protocol. Factorization, or the factorial
principle, involves making simultaneous and intentional changes to the input
variables of a process to find and exploit the relationships between them and the
response. Replication is the assignment of a treatment to more than one
experimental unit, and is a key to obtaining: (1) a valid and more precise estimation
of the error; and (2) a reduction in the error. Randomization is the random
assignment of treatments to experimental units, which averages out the effects of
the undesirable factors that are present in the experiment and generally enables
the assumption that the errors are independently and identically distributed (a
critical assumption). Local control of error attempts to eliminate the effects of
nuisance factors on the response, which improves the precision of the
comparison of factors of interest and reduces or eliminates the component of the
variability that is transmitted from nuisance factors. Blocking is a form of local
control of error. The run order is important, and should be as specified in the
design matrix. Procedural errors could have a catastrophic effect on the response if
they are not eliminated or at least significantly minimized. The data must be
collected in a manner that is consistent with the way the test was conducted.
Before leaving the test stage, common analysis techniques such as plotting the data
to visually identify outliers should be used to promptly identify any areas of risk.
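Randomizing the run order while keeping it reproducible can be sketched as follows. The design, the number of replicates, and the seed are hypothetical; the point is that a seeded shuffle yields a run order that can be recorded in the design matrix and audited after the test.

```python
import random
from itertools import product


def randomized_run_order(treatments: list, replicates: int, seed: int) -> list:
    """Return every treatment `replicates` times in a random order.

    A fixed seed makes the order reproducible, so the same sequence can be
    written into the test matrix and checked after the test.
    """
    runs = [t for t in treatments for _ in range(replicates)]
    rng = random.Random(seed)
    rng.shuffle(runs)
    return runs


# Hypothetical 2^2 design in coded units, two replicates per treatment.
treatments = list(product((-1, 1), repeat=2))
run_order = randomized_run_order(treatments, replicates=2, seed=2014)
for run_number, treatment in enumerate(run_order, start=1):
    print(run_number, treatment)
```

When blocking is used, the same shuffle would be applied separately within each block rather than across the whole test.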
Guidelines
Carry out the test as it was designed and documented in the test matrix.
Identify and manage unplanned or unexpected sources of variability, particularly those
involving test operations and measurements.
Document the test protocol and deviations from it.
Identify any outcomes or measurements that look suspect.
Perform a hot wash-up and quick-look analysis.
4.6 Use statistical techniques for the analysis
“Happy families are all alike; every unhappy family is unhappy in its own way.”
Leo Tolstoy
In 1998, the National Research Council concluded that T&E did not take full
advantage of the benefits afforded by state-of-the-art statistical methods. Statistics
is the mathematical science dealing with the planning, collection, analysis,
organization, explanation, and presentation of data. A common goal of statistics is
to investigate and establish a relationship between events. Clearly, it seems
practical to exploit the tremendous power of combining statistical methods and
testing with the specific goal of acquiring deeper knowledge and understanding of a
system and of maximizing the usefulness and accuracy of the information. STAT
allows us to exploit the power of combining sound test practices and statistical data
analysis.
It is also important to analyze the data immediately after the test protocol
has been carried out. Once the factors and their interactions are estimated using techniques
such as the analysis of variance (ANOVA), their significance is determined by
comparing the factor effects and factor interactions to the error term. Ideally, the
error term is just pure experimental error, which is provided by the variability among
the experimental units obtained from replication. Depending on the test objectives,
mathematical techniques such as gradient methods, Newton methods, or conjugate
methods can be used along with the empirical model to search for the optimum
value.
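The comparison of factor effects to the error term can be illustrated with a small worked example. The responses below are made up for a replicated $2^2$ design, and the critical value is the tabulated F(0.05; 1, 4), which applies only to this design size; this is a sketch of the arithmetic, not a substitute for a statistical package.

```python
from statistics import mean

# Illustrative (made-up) responses for a replicated 2^2 design.
# Keys are coded levels (A, B); values are the replicate observations.
data = {
    (-1, -1): [28.0, 25.0],
    (+1, -1): [36.0, 32.0],
    (-1, +1): [18.0, 19.0],
    (+1, +1): [31.0, 30.0],
}
n_total = sum(len(obs) for obs in data.values())

# Contrast coefficient of each model term at a given treatment.
terms = {"A": lambda a, b: a, "B": lambda a, b: b, "AB": lambda a, b: a * b}

# Pure error: variation among replicates taken at the same factor levels.
ss_error = sum((y - mean(obs)) ** 2 for obs in data.values() for y in obs)
df_error = n_total - len(data)
ms_error = ss_error / df_error

results = {}
for name, coef in terms.items():
    contrast = sum(coef(a, b) * y for (a, b), obs in data.items() for y in obs)
    effect = contrast / (n_total / 2)  # mean(high) - mean(low)
    ss_term = contrast ** 2 / n_total  # sum of squares, 1 degree of freedom
    results[name] = (effect, ss_term / ms_error)

F_CRIT = 7.71  # tabulated F(0.05; 1, 4), valid only for this design size
for name, (effect, f_ratio) in results.items():
    verdict = "significant" if f_ratio > F_CRIT else "not significant"
    print(f"{name:2s} effect = {effect:6.2f}  F = {f_ratio:6.2f}  {verdict}")
```

With these illustrative numbers, both main effects exceed the critical value while the interaction does not, so only A and B would be judged significant at the 0.05 level.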
Guidelines
Analyze the data immediately after the test to identify outliers and other test protocol
problems.
Re-run tests that have suspect test results, if possible.
Verify that fundamental assumptions have not been violated.
Apply the correct statistical analysis technique as dictated by the design matrix.
Validate the empirical results and the model.
4.7 Inform the decision
This is a very logical step towards the conclusion of the test process. Once
the data has been analyzed, it is important to organize the results in order to draw
appropriate conclusions and practical recommendations about the test process,
structure, designs, and more importantly, about the performance of the system.
This is the time to provide information to decision makers. Potential
decision points involve assessing or modifying requirements, correcting training
deficiencies, improving combat capability, determining performance limiters,
documenting test results and outcomes, correcting system deficiencies (design,
implementation, suitability), and improving models and simulations.
Guidelines
Draw rational conclusions about the performance of the system.
Draw rational conclusions about the overall test process.
Draw rational conclusions about the efficiency of the test design.
5.0 Summary
Test and Evaluation (T&E) is a means of providing essential information to
decision makers regarding the capabilities and operational conditions of a system.
Design of experiments is the integration of well-defined and structured scientific
strategies for gathering empirical knowledge using statistical methods for planning,
designing, executing, and analyzing a test. Design of experiments adds rigor and
discipline to T&E and facilitates a comprehensive understanding of the tradeoffs in
the techno-programmatic domains: risks, cost, and utility of information. Design of
experiments can help reduce test assets, shorten the test schedule, and
provide more information to the warfighter and decision makers.
Appendix A - Paradigm of well-defined tests
Clear, testable, and measurable test objectives
o Systems engineering requirements
o Operational requirements
o Previously identified deficiencies
o Added capabilities
o Continuing assessment
o Verification and validation (V&V)
o Tactics, techniques, and procedures (TTP)
Suitable measures to assess or evaluate the objectives
o Key performance parameters (KPP)
o Key system attributes (KSA)
o Critical technical parameters (CTP)
o Measures of effectiveness (MOE) and measures of suitability (MOS)
o Measures of performance (MOP)
Appropriate analysis and assessment methods and evaluation criteria
o Data analysis plan
o Evaluation criteria
Data needed
o Data collection plan (DCP)
o Data management plan (DMAP)
Adequate test venues and test resources
o Hardware in the loop (HWIL)
o Modeling and simulation (M&S)
o Live events
o Real world events
Test plans
o Scenario plans and certification
o Instrumentation plans
o Test execution plans
Data analysis and assessment
o Functionality
o Performance
o Suitability
Results and outcomes
o Modify requirements
o Correct training deficiencies
o Correct system deficiencies (design, implementation, suitability)
o Improve capability
o Determine performance limiters
o Improve models
References
Coleman, D. E., & Montgomery, D. C. (1993). A Systematic Approach to Planning for
a Designed Industrial Experiment (with Discussion). Technometrics, 35, 1-27.
Montgomery, D. C. (2013). Design and Analysis of Experiments (8th ed.). New York:
Wiley & Sons.