Statistical Thinking and Smart Experimental Design 2016
7/26/2019 Statistical Thinking and Smart Experimental Design 2016
http://slidepdf.com/reader/full/statistical-thinking-and-smart-experimental-design-2016 1/82
Vlaams Instituut voor Biotechnologie
Statistical Thinking and Smart Experimental Design
Course Notes
Luc Wouters
April 2016
Contents

1 Introduction
2 Smart Research Design by Statistical Thinking
  2.1 The architecture of experimental research
    2.1.1 The controlled experiment
    2.1.2 Scientific research as a phased process
    2.1.3 Scientific research as an iterative, dynamic process
  2.2 Research styles - The smart researcher
  2.3 Principles of statistical thinking
3 Planning the Experiment
  3.1 The planning process
  3.2 Types of experiments
  3.3 The pilot study
4 Principles of Statistical Design
  4.1 Some terminology
  4.2 The structure of the response variable
  4.3 Defining the experimental unit
  4.4 Variation is omnipresent
  4.5 Balancing internal and external validity
  4.6 Bias and variability
  4.7 Requirements for a good experiment
  4.8 Strategies for minimizing bias and maximizing signal-to-noise ratio
    4.8.1 Strategies for minimizing bias - good experimental practice
      4.8.1.1 The use of controls
      4.8.1.2 Blinding
      4.8.1.3 The presence of a technical protocol
      4.8.1.4 Calibration
      4.8.1.5 Randomization
      4.8.1.6 Random sampling
      4.8.1.7 Standardization
    4.8.2 Strategies for controlling variability - good experimental design
      4.8.2.1 Replication
      4.8.2.2 Subsampling
      4.8.2.3 Blocking
      4.8.2.4 Covariates
  4.9 Simplicity of design
  4.10 The calculation of uncertainty
5 Common Designs in Biological Experimentation
  5.1 Error-control designs
    5.1.1 The completely randomized design
    5.1.2 The randomized complete block design
      5.1.2.1 The paired design
      5.1.2.2 Efficiency of the randomized complete block design
    5.1.3 Incomplete block designs
    5.1.4 The Latin square design
  5.2 Treatment designs
    5.2.1 One-way layout
    5.2.2 Factorial designs
  5.3 More complex designs
    5.3.1 The split-plot and strip-plot designs
    5.3.2 The repeated measures design
6 The Required Number of Replicates - Sample Size
  6.1 Determining sample size is a risk-cost assessment
  6.2 The context of biomedical experiments
  6.3 The hypothesis testing context - the population model
  6.4 Sample size calculations
    6.4.1 Power analysis computations
    6.4.2 Mead’s resource requirement equation
  6.5 How many subsamples
  6.6 Multiplicity and sample size
  6.7 The problem with underpowered studies
  6.8 Sequential plans
7 The Statistical Analysis
  7.1 The statistical triangle
  7.2 The statistical model revisited
  7.3 Significance tests
  7.4 Verifying the statistical assumptions
  7.5 The meaning of the p-value and statistical significance
  7.6 Multiplicity
8 The Study Protocol
9 Interpretation and Reporting
  9.1 The Methods section
    9.1.1 Experimental design
    9.1.2 Statistical methods
  9.2 The Results section
    9.2.1 Summarizing the data
    9.2.2 Graphical displays
    9.2.3 Interpreting and reporting significance tests
10 Concluding Remarks and Summary
  10.1 Role of the statistician
  10.2 Recommended reading
  10.3 Summary
References
Appendices
Appendix A Glossary of Statistical Terms
Appendix B Tools for randomization in MS Excel and R
  B.1 Completely randomized design
    B.1.1 MS Excel
    B.1.2 R-Language
  B.2 Randomized complete block design
    B.2.1 MS Excel
    B.2.2 R-Language
1. Introduction
More often than not, we are unable to reproduce findings published by researchers in journals.
Glenn Begley, Vice President Research Amgen (2015)
The way we do our research [with our animals] is stone-age.
Ulrich Dirnagl, Charité University Medicine Berlin (2013)
Over the past decade, the biosciences have been
plagued by problems with the replicability and re-
producibility2 of research findings. This lack of re-
liability can be attributed in large part to statistical
fallacies, misconceptions, and other methodolog-
ical issues (Begley and Ioannidis, 2015; Loscalzo,
2012; Peng, 2015; Prinz et al., 2011; Reinhart, 2015;
van der Worp et al., 2010). The following examples illustrate some of these problems and show that there is a definite need to transform and improve the research process.
Example 1.1. In 2006, a group of researchers from
Duke University led by Anil Potti published a pa-
per claiming that they had built an algorithm using genomic microarray data that allowed them to predict which cancer patients would respond to chemotherapy (Potti et al., 2006). This would spare patients
the side effects of ineffective treatments. Of course, this paper drew a lot of attention and many independent investigators tried to reproduce the results.
Keith Baggerly and Kevin Coombes, two statisti-
cians at MD Anderson Cancer Center, were also
asked to have a look at the data. What they
found was a mess of poorly conducted data analysis
(Baggerly and Coombes, 2009). Some of the data
was mislabeled, some samples were duplicated in
the data, sometimes samples were marked as both
sensitive and resistant, etc. Baggerly and Coombes
concluded that they were unable to reproduce the
analysis carried out by Potti et al. (2006), but the
damage was done. Several clinical trials had started
based on the erroneous results. In 2011, after sev-
eral corrections, the original study by Potti et al.
was retracted from Nature Medicine “because we have been unable to reproduce certain crucial experiments” (Potti et al., 2011).
Example 1.2. In 2009, a group of researchers from
the Harvard Medical School published a study
showing that cancer tumors could be destroyed by
targeting the STK33 protein (Scholl et al., 2009).
Scientists at Amgen Inc. pounced on the idea and
assigned a group of 24 researchers to try to repeat
the experiment with the objective of developing a
new medicine. After six months of intensive lab work, it turned out that the project was a waste of time and money, since it was impossible for the Amgen
scientists to replicate the results (Babij et al., 2011;
Naik, 2011). Unfortunately, this was not the only
problem of replicability the Amgen researchers encountered. Over the course of a decade, Begley and Ellis (2012) identified a set of 53 “landmark” publications in preclinical cancer research, i.e. papers in top journals
from reputable labs. A team of 100 scientists tried
to replicate the results. To their surprise, in 47 of
2 Formally, we consider replicability as the replication of scientific findings using independent investigators, methods, data, equipment, and protocols. Replicability has long been, and will continue to be, the standard by which scientific claims are evaluated. On the other hand, reproducibility means that, starting from the data gathered by the scientist, we can reproduce the same results, p-values, confidence intervals, tables, and figures as those reported by the scientist (Peng, 2009).
the 53 studies (i.e. 89%) the findings could not be
replicated. This outcome was particularly disturb-
ing since Begley and Ellis made every effort to work in
close collaboration with the authors of the original
papers and even tried to replicate the experiments
in the laboratory of the original investigator. In
some cases, 50 attempts were made to reproduce
the original data, without obtaining the claimed re-
sult (Begley, 2012). What is even more troubling is
that Amgen’s findings were consistent with those of
others. In a similar setting, Bayer researchers found
that only 25% of the original findings in target dis-
covery could be validated (Prinz et al., 2011).
Example 1.3. Seralini et al. (2012) published a 2-year feeding study in rats investigating the health effects of genetically modified (GM) maize NK603 with and without glyphosate-containing herbicides.
The authors of the study concluded that GM maize
NK603 and low levels of glyphosate herbicide for-
mulations, at concentrations well below officially-
set safe limits, induce severe adverse health effects,
such as tumors, in rats. Apart from the publica-
tion, Seralini also presented his findings in a press
conference, which was widely covered in the media, showing shocking photos of rats with enormous tumors. Consequently, this study had a severe impact on the general public and also on the interests of
industry. The paper was used in the debate over
a referendum on labeling of GM food in Califor-
nia, and it led to bans on importation of certain
GMOs in Russia and Kenya. However, shortly after
its publication many scientists, among them
researchers from the VIB (Vlaams Instituut voor
Biotechnologie, 2012), heavily criticized the study
and expressed their concerns about the validity of
the findings. A polemic debate started with oppo-
nents of GMOs and also within the scientific com-
munity, which inspired the media to refer to the controversy as The Seralini affair or Seralini tumor-gate.
Subsequently, the European Food Safety Authority
(2012) thoroughly scrutinized the study and found
that it was of inadequate design, analysis and re-
porting. Specifically, the number of animals was
considered too small and not sufficient for reaching
a solid conclusion. Eventually, the journal retracted
Seralini’s paper, claiming that it did not reach the
journal’s threshold of publication (Hayes, 2014) 1.
Example 1.4. Selwyn (1996) describes a study
where an investigator examined the effect of a test
compound on hepatocyte diameters. The experimenter decided to study eight rats per treatment
group, three different lobes of each rat’s liver, five
fields per lobe, and approximately 1,000 to 2,000
cells per field. At that time, most of the work,
i.e. measuring the cell diameters, was done man-
ually, making the total amount of work, i.e. 15,000
- 30,000 measurements per rat, substantial. The
experimenter complained about the overwhelming
amount of work in this study and the tight deadlines that were set. A sample size evaluation conducted after the study was completed indicated
that sampling as few as 100 cells per lobe would have caused no appreciable loss of information.
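Example 1.4 is a subsampling problem: cells are nested within lobes, and lobes within rats, so the payoff of measuring ever more cells can be sketched with a simple variance-components calculation. The variance of a per-rat mean with L lobes and m cells per lobe is s2_rat + s2_lobe/L + s2_cell/(L·m), so the cell-level term quickly becomes negligible. The numeric variance components below are hypothetical, chosen only for illustration; they are not values from Selwyn (1996).

```python
# Variance of a per-rat mean in a nested design (cells within lobes within rats).
# All variance components below are hypothetical, for illustration only.

def var_rat_mean(s2_rat, s2_lobe, s2_cell, n_lobes, cells_per_lobe):
    """Variance of one rat's mean diameter when averaging over
    n_lobes lobes and cells_per_lobe cells within each lobe."""
    return s2_rat + s2_lobe / n_lobes + s2_cell / (n_lobes * cells_per_lobe)

# Hypothetical components: most variation sits between rats and between lobes.
s2_rat, s2_lobe, s2_cell = 1.0, 0.5, 4.0

for m in (10, 100, 1000, 10000):
    v = var_rat_mean(s2_rat, s2_lobe, s2_cell, n_lobes=3, cells_per_lobe=m)
    print(f"{m:>6} cells per lobe -> variance of rat mean = {v:.4f}")
```

With these components, moving from 100 to 10,000 cells per lobe reduces the variance of a rat mean by less than 0.015, while the rat-to-rat component (1.0) is untouched; precision is bought with more rats, not more cells per lobe, which is exactly why around 100 cells per lobe sufficed.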
Figure 1.1. Categories of errors that contribute to the problem of replicability in life science research (source: Freedman et al. 2015): unreliable biological reagents and reference materials (36.1%), improper study design (27.6%), inadequate data analysis and reporting (25.5%), laboratory protocol errors (10.8%).
Doing good science and producing high-
quality data should be the concern of every serious
research scientist. Unfortunately, as shown by the
first three examples, this is not always the case. As
mentioned above, there is a genuine concern about
the reproducibility of research findings and it has
been argued that most research findings are false
(Ioannidis, 2005). In a recent paper, Begley and Ioannidis (2015) estimated that 85% of biomedical research is wasted. Freedman et al. (2015)
tried to identify the root causes of the replicability
problem and to estimate its economic impact. They
estimated that in the United States alone approxi-
mately US$28B/year is spent on research that can-
not be replicated. The main problems causing this

1 Seralini managed to republish the study in Environmental Sciences Europe (Seralini et al., 2014), a journal with a considerably lower impact factor.
lack of replicability are summarized in Figure 1.1.
Issues in study design and data analysis accounted
for more than 50% of the studies that could not be
replicated. This was also concluded by Kilkenny
et al. (2009) who surveyed 271 papers reporting
laboratory animal experiments. They found that
most of the studies had flaws in their design and
almost one third of the papers that used statistical
methods did not describe them or did not present
their results adequately.
Not only scientists, but also the journals have a
great responsibility in guarding the quality of their
publications. Peer reviewers and editors, who often have little or no statistical training, let methodological errors pass undetected. Moreover, high-impact
journals tend to focus on statistically significant results and unexpected findings, often without considering their practical importance. Especially in studies
with insufficient sample size, this publication bias
causes high numbers of irreproducible and even
false results (Ioannidis, 2005; Reinhart, 2015).
In addition to the problem of replicability of re-
search findings, there has also been a dramatic rise in the number of journal retractions over the last
decades (Cokol et al., 2008). In a review of all
2,047 biomedical and life-science research articles
indexed by PubMed as retracted on May 3, 2012,
Fang et al. (2012) found that 21.3% of the retrac-
tions were due to error, while 67.4% of the retrac-
tions were attributable to misconduct, including
fraud or suspected fraud (43.4%), duplicate pub-
lication (14.2%), and plagiarism (9.8%).
Studies such as those by Potti et al. (2006), Scholl
et al. (2009), and Séralini et al. (2012), as well
as the lack of replicability in general and the in-
creased number of retractions have also caught the
attention of mainstream media (Begley, 2012; Hotz,
2007; Lehrer, 2010; Naik, 2011; Zimmer, 2012) and
have put the integrity of science into question by
the general public.
To summarize, a substantial part of the issues of replicability can be attributed to a lack of quality in
the design and execution of the studies. When little
or no thought is given to methodological issues, in
particular to the statistical aspects of the study de-
sign, the studies are often seriously flawed and are
not capable of meeting their intended purpose. In
some cases, such as the Séralini study, the experiments were designed too small to enable an answer to the research question. Conversely, as in Example 1.4, there are also studies that use too
much experimental material, so that valuable re-
sources are wasted.
To improve on these issues of credibility and efficiency, we need effective interventions and a change in the way scientists look at the research process (Ioannidis, 2014; Reinhart, 2015). This can be accomplished by introducing statistical thinking and
statistical reasoning as powerful informed skills,
based on the fundamentals of statistics, that en-
hance the quality of the research data (Vanden-
broeck et al., 2006). While the science of statis-
tics is mostly involved with the complexities and
techniques of statistical analysis, statistical think-
ing and reasoning are generalist skills that focus on
the application of nontechnical concepts and prin-
ciples. There are no clear, generally accepted definitions of statistical thinking and reasoning. In our
conceptualization we consider statistical thinking
as a skill that helps to better understand how sta-
tistical methods can contribute to finding answers
to specific research problems and what the impli-
cations are in terms of data collection, experimen-
tal setup, data analysis, and reporting. Statistical
thinking will provide us with a generic methodol-
ogy to design insightful experiments. On the other
hand, we will consider statistical reasoning as being more involved with the presentation and interpretation of the statistical analysis. Of course, as
is apparent from the above, there is a large over-
lap between the concepts of statistical thinking and
reasoning.
Statistical thinking permeates the entire research
process and, when adequately implemented, can
lead to a highly successful and productive research
enterprise. This was demonstrated by the eminent scientist, the late Dr. Paul Janssen. As pointed out
by Lewi (2005), the success of Dr. Paul Janssen
could be attributed to a large extent to having a
set of statistical precepts being accepted by his col-
laborators. These formed the statistical founda-
tion upon which his research was built and ensured
that research proceeded in an orderly and planned
fashion, while at the same time having an open
mind for unexpected opportunities. His approach
was such a success that, when he retired in 1991,
his laboratory had produced 77 original medicines
over a period of less than 40 years. This still rep-
resents a world record. In addition, at its peak, the
Janssen laboratory produced more than 200 scien-
tific publications per year (Lewi and Smith, 2007).
2. Smart Research Design by Statistical Thinking
Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write!
Samuel S. Wilks (1951).
2.1 The architecture of experimental research
2.1.1 The controlled experiment
There are two basic approaches to implementing a scientific research project. One approach is to con-
duct an observational study2 in which we investi-
gate the effect of naturally occurring variation and
the assignment of treatments is outside the control
of the investigator. Although there are often good
and valid reasons for conducting an observational
study, its main drawback is that the presence of
concomitant confounding variables can never be
excluded, thus weakening the conclusions.
An alternative to an observational study is an
experimental or manipulative study in which the investigator manipulates the experimental system
and measures the effect of his manipulations on
the experimental material. Since the manipulation
of the experimental system is under control of the
experimenter, one also speaks of controlled experi-
ments. A well-designed experimental study elim-
inates the bias caused by confounding variables.
The great power of a controlled experiment, pro-
vided it is well conceived, lies in the fact that it al-
lows us to demonstrate causal relationships. We will focus on controlled experiments and how sta-
tistical thinking and reasoning can be of use to op-
timize their design and interpretation.
2.1.2 Scientific research as a phased pro-
cess
Phase ⇒ Deliverable
Definition ⇒ Research Proposal
Design ⇒ Protocol
Data Collection ⇒ Data set
Analysis ⇒ Conclusions
Reporting ⇒ Report
Figure 2.1. Research is a phased process with each of the phases having a specific deliverable
From a systems analysis point of view, the sci-
entific research process can be divided into five
distinct stages:
1. definition of the research question
2. design of the experiment
3. conduct of the experiment and data collec-
tion
4. data analysis
5. reporting
Each of these phases results in a specific deliverable (Figure 2.1). The definition of the research question will usually result in a research or grant
2 Also called a correlational study.
proposal, stating the hypothesis related to the re-
search (research hypothesis) and the implications
or predictions that follow from it. The design of
the experiment needed for testing the research hy-
pothesis is formalized in a written protocol. After
the experiment has been carried out, the data will
be collected providing the experimental data set.
Statistical analysis of this data set will yield conclu-
sions that answer the research question by accept-
ing or rejecting the formalized hypothesis. Finally,
a well carried out research project will result in a
report, thesis, or journal article.
2.1.3 Scientific research as an iterative,
dynamic process
Figure 2.2. Scientific research as an iterative process
Scientific research is not a simple static activ-
ity, but as depicted in Figure 2.2, an iterative and
highly dynamic process. A research project is car-
ried out within some organizational or manage-
ment context which can be rather authoritative;
this context can be academic, governmental, or corporate (business). In this context, the management
objectives of the research project are put forward.
The aim of our research project itself is to fill an
existing information gap. Therefore, the research
question is defined, the experiment is designed
and carried out and the data are analyzed. The
results of this analysis allow informed decisions
to be made and provide a way of feedback to ad-
just the definition of the research question. On the
other hand, the experimental results will trigger research management to reconsider their objectives
and possibly request more information.
2.2 Research styles - The smart
researcher
Figure 2.3. Modulating between the concrete and abstract world
The five phases that make up the research pro-
cess modulate between the concrete and the ab-
stract world (Figure 2.3). Definition and report-
ing are conceptual and complex tasks requiring a
great deal of abstract reasoning. Conversely, ex-
perimental work and data collection are very con-
crete, measurable tasks dealing with the practical
details and complications of the specific research
domain.
Figure 2.4. Archetypes of researchers based on the relative fraction of the available resources that they are willing to spend at each phase of the research process. D(1): definition phase, D(2): design phase, C: data collection, A: analysis, R: reporting
Scientists exhibit different styles in their research
depending on the relative fraction of the available
resources that they are willing to spend at each
phase of the research process. This allows us to
recognize different archetypes of researchers (Fig-
ure 2.4):
• the novelist who needs to spend a lot of time
distilling a report from an ill-conceived ex-
periment;
• the data salvager who believes that no matter
how you collect the data or set up the experiment, there is always a statistical fix-up at
analysis time;
Table 2.1. Statistical thinking versus statistics

Statistics               Statistical Thinking
Specialist skill         Generalist skill
Science                  Informed practice
Technology               Principles, patterns
Closure, seclusion       Ambiguous, dialogue
Introvert                Extravert
Discrete interventions   Permeates the research process
Builds on good thinking  Valued skill itself
• the lab freak who strongly believes that if
enough data are collected something inter-
esting will always emerge;
• the smart researcher who is aware of the ar-
chitecture of the experiment as a sequence of
steps and allocates a major part of his time
budget to the first two steps: definition and
design.
The smart researcher is convinced that time spent
planning and designing an experiment at the out-
set will save time and money in the long run. He
opposes the lab freak by trying to reduce the num-
ber of measurements to be taken, thus effectively
reducing the time spent in the lab. In contrast to
the data salvager, the smart researcher recognizes that the
design of the experiment will govern how the data
will be analyzed, thereby reducing time spent at
the data analysis stage to a minimum. By carefully
preparing and formalizing the definition and de-
sign phase, the smart researcher can look ahead to
the reporting phase with peace of mind, which is
in contrast to the novelist.
2.3 Principles of statistical think-
ing
The smart researcher recognizes the value of statis-
tical thinking for his application area and he him-
self is skilled in statistical thinking, or he collabo-
rates with a professional who masters this skill. As
noted before, statistical thinking is related to, but
distinct from statistical science (Table 2.1). While
statistics is a specialized technical skill based on mathematical statistics as a science on its own, sta-
tistical thinking is a generalist skill based on in-
formed practice and focused on the applications of
nontechnical concepts and principles.
The statistical thinker attempts to understand
how statistical methods can contribute to finding
answers to specific research problems in terms of data collection, experimental setup, data analysis
and reporting. He or she is able to postulate which
statistical expertise is required to enhance the re-
search project’s success. In this capacity the statis-
tical thinker acts as a diagnoser.
In contrast to statistics, which operates in a
closed and secluded mathematical context, statis-
tical thinking is a practice that is fully integrated
with the researcher’s scientific field, not merely an autonomous science. Hence, the statistical thinker
operates in a more ambiguous setting, where he
is deeply involved in applied research, with a
good working knowledge of the substantive sci-
ence. In this role, the statistical thinker acts as an
intermediary between scientists and statisticians
and goes into dialogue with them. He attempts to
integrate the several potentially competing priori-
ties that make up the success of a research project:
resource economy, statistical power, and scientific relevance, into a coherent and statistically under-
pinned research strategy.
While the impact of the statistician on the re-
search process is limited to discrete interventions,
the statistical thinker truly permeates the research
process. His combined skills lead to increased ef-
ficiency, which is important to increase the speed
with which research data, analyses, and conclu-
sions become available. Moreover, these skills help to enhance the quality and to reduce the asso-
ciated cost. Statistical thinking then helps the sci-
entist to build a case and negotiate it on fair and
objective grounds with those in the organization
seeking to contribute to more business-oriented
measures of performance. In that sense, the suc-
cessful statistical thinker is a persuasive communica-
tor. This comparison clearly shows that the power
of statistics in research is actually founded upon
good statistical thinking.
Smart research design is based on the seven ba-
sic principles of statistical thinking:
1. Time spent thinking on the conceptualization
and design of an experiment is time wisely
spent.
2. The design of an experiment reflects the con-
tributions from different sources of variabil-
ity.
3. The design of an experiment balances be-
tween its internal validity (proper control of
noise) and external validity (the experiment’s
generalizability).
4. Good experimental practice provides the
clue to bias minimization.
5. Good experimental design is the clue to the
control of variability.
6. Experimental design integrates various dis-
ciplines.
7. A priori consideration of statistical power is
an indispensable pillar of an effective experi-
ment.
3. Planning the Experiment
Experimental observations are only experience carefully planned in advance, and designed to form a
secure basis of new knowledge.
R. A. Fisher (1935).
3.1 The planning process
Figure 3.1. The planning process
The first step in planning an experiment (Fig-
ure 3.1) is the specification of its objectives. The
researcher should realize what the actual goal is
of his experiment and how it integrates into the
whole set of related studies on the subject. How
does it relate to management or other objectives?
How will the results from this particular study
contribute to knowledge about the subject? Sometimes a preliminary exploratory experiment is useful to generate clear questions that will be answered in the actual experiment. The study objectives should be well defined and written out as explicitly as possible. It is wise to limit the objectives of a study to a maximum of, say, three (Selwyn, 1996). Any more than that risks designing an
overly complex experiment and could compromise
the integrity of the study. Trying to accomplish
each of many objectives in a single study stretches
its resources too thin and as a result, often none of the study objectives is satisfied. Objectives should
also be reasonable and attainable and one should
be realistic in what can be accomplished in a single
study.
Example 3.1. The study by Séralini et al. (2012) is a typical example of a study where the research team tried to accomplish too many objectives. In this
study 10 treatments were examined in both female
and male rats. Since the research team apparently
had a very limited amount of resources available,
the investigators used only 10 animals per treat-
ment per sex. This was far below the 50 animals
per treatment group that are standard in long term
carcinogenicity studies (Gart et al., 1986; Haseman,
1984).
After having formulated the research objectives, the scientist will then try to translate them into scientific hypotheses that might answer the
question. Often it is impossible to study the re-
search objective directly, but some surrogate ex-
perimental model is used instead. For example
Séralini was not interested in whether GMOs were toxic in rats. The real objective was to establish
the toxicity in humans. As a surrogate for man, the
Sprague-Dawley strain of rat was chosen as exper-
imental model. By doing so, an auxiliary hypothesis
(Hempel, 1966) was put forward, namely that the
experimental model was adequate to the research
objectives. Séralini’s choice of the Sprague-Dawley
rat strain received much criticism (European Food
Safety Authority, 2012), since this strain is prone to
the development of tumors. Auxiliary hypotheses also play a role when it is difficult or even impossible to measure the variable of interest directly. In
this case, an indirect measure as a surrogate for the
target variable might be available and the investi-
gator relies on the premiss that the indirect mea-
sure is a valid surrogate for the actual target vari-
able.
Based on both the scientific and auxiliary hy-
potheses, the researcher will then predict the test
implications of what to expect if these hypotheses
are true. Each of these predictions should be the
strongest possible test of the scientific hypotheses.
The deduction of these test implications also in-
volves additional auxiliary hypotheses. As stated
by Hempel (1966), reliance on auxiliary hypotheses
is the rule, rather than the exception, in testing scientific hypotheses. Therefore, it is important that
the researcher is aware of the auxiliary assump-
tions he makes when predicting the test implica-
tions. Generating sensible predictions is one of the
key factors of good experimental design. Good
predictions will follow logically from the hypothe-
ses that we wish to test, and not from other rival
hypotheses. Good predictions will also lead to in-
sightful experiments that allow the predictions to
be tested.
The next step in the planning process is then to
decide which data are required to confirm or refute
the predicted test implications. Throughout the se-
quence of question, hypothesis, prediction it is es-
sential to assess each step critically with enough
skepticism and even ask a colleague to play the
devil’s advocate. During the design and planning
stage of the study one should already have the
person refereeing the manuscript in mind. It is
much better that problems are identified at this
early stage of the research process than after the
experiment started. At the end of the experiment
the scientist should be able to determine whether
the objectives have been met, i.e. whether the re-
search questions were answered to satisfaction.
3.2 Types of experiments
We first distinguish between exploratory, pilot and
confirmatory experiments. Exploratory experiments
are used to explore a new research area. They pro-
vide a powerful method for discovery (Hempel,
1966), i.e., they are performed to generate new hypotheses that can then be formally tested in confirmatory experiments. Replication, sample size and
formal hypothesis testing are less appropriate with
this type of experiment. Currently, the vast major-
ity of published research in the biomedical sciences
originates from this sort of experiment (Kimmel-
man et al., 2014). The exploratory nature of these
studies is also reflected in the way the data are
analysed. Exploratory data analysis as opposed to
confirmatory data analysis is a flexible approach,
based mainly on graphical displays, towards for-
mulating new theories (Tukey, 1980). Exploratory
studies aim primarily at developing these new re-
search hypotheses, but they do not answer unam-
biguously the research question, since using the
same data that generated the research hypothesis
also for its confirmation, involves circular reason-
ing. Exploratory studies tend to consist of a pack-
age of small and flexible experiments using differ-
ent methodologies (Kimmelman et al., 2014). The
study by Séralini et al. (2012) was in fact an exploratory experiment, and much of the controversy around this study would not have arisen had it been presented as such.
Pilot experiments are designed to make sure the research question is sensible: they make it possible to refine the experimental procedures, to determine how variables should be measured, to check whether the experimental setup is feasible, etc. Pilot experiments are especially useful when the actual experiment is large, time-consuming or expensive (Selwyn, 1996). Information obtained in the pilot experiment is of particular importance when writing the
technical and study protocol of such studies. Pilot
experiments are discussed in more detail in Section
3.3.
Confirmatory experiments are used to assess the
test implications of a scientific hypothesis. In
biomedical research this assessment is based on
statistical methodology. In contrast to exploratory studies, confirmatory experiments make use of rigid, pre-specified designs and a priori stated hypotheses. Exploratory and confirmatory studies
complement one another in the sense that the for-
mer generates the hypotheses that can be put to
“crucial testing” in the latter. Confirmatory exper-
iments are the main topic of this tutorial.
A further distinction between different types of
experiments is based on the type of objective of
the study in question. A comparative experiment is
one in which two or more techniques, treatments,
or levels of an explanatory variable are to be com-
pared with one another. There are many examples
of comparative experiments in biomedical areas.
For example in nutrition studies different diets can
be compared to one another in laboratory animals.
In clinical studies, the efficacy of an experimental
drug is assessed in a trial by comparing it to treat-
ment with placebo. We will focus primarily on de-
signing comparative experiments for confirmation
of research hypotheses.
A second type of experiment is the optimiza-
tion experiment which has the objective of finding
conditions that give rise to a maximum or minimum response. Optimization experiments are often used in product development, such as finding
the optimum combination of concentration, tem-
perature, and pressure that gives rise to the max-
imum yield in a chemical production plant. Dose
finding trials in clinical development are another
example of optimization experiments.
The third type of experiment is the prediction ex-
periment in which the objective is to provide some
statistical/mathematical model to predict new responses. Examples are dose-response experiments
in pharmacology and immuno-assay experiments.
The final experimental type is the variation ex-
periment. This type of experiment has as objec-
tive to study the size and structure of bias and
random variation. Variation experiments are im-
plemented as uniformity trials, i.e. studies with-
out different treatment conditions. For example,
the assessment of sources of variation in microtiter plate experiments. These sources of variation can be
plate effects, row effects, column effects, and the
combination of row and column effects (Burrows
et al., 1984). A variation experiment can also tell
us about the importance of cage location in animal
experiments, where animals are kept in racks of 24
cages. Animals in cages close to the ventilation
could respond differently from the rest (Young,
1989). Other examples of variation experiments
come from the area of plant research, where the position of plants in the greenhouse (Brien et al., 2013)
or the growth chamber (Potvin et al., 1990) could
be the subject of investigation.
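As an illustration of what such a uniformity trial can quantify, the following toy decomposition splits microtiter-plate readings into row and column effects around the grand mean; the readings are invented, not taken from any study cited here:

```python
import statistics

# Invented optical-density readings from an untreated (uniformity)
# 3-row x 4-column microtiter plate.
plate = [
    [0.90, 0.95, 0.93, 0.91],
    [1.00, 1.05, 1.03, 1.01],
    [1.10, 1.15, 1.13, 1.11],
]

# Decompose each well's expected value as grand mean + row effect
# + column effect, the additive structure a uniformity trial probes.
grand = statistics.mean(v for row in plate for v in row)
row_effects = [statistics.mean(row) - grand for row in plate]
col_effects = [statistics.mean(row[j] for row in plate) - grand
               for j in range(len(plate[0]))]

print([round(e, 3) for e in row_effects])  # rows drift by about ±0.1
```

In a real uniformity trial the sizes of these effects would tell the experimenter whether rows, columns or whole plates need to be accounted for in the design.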
3.3 The pilot study
As researchers are often under considerable time
pressure, there is the temptation to start as soon
as possible with the actual experiment. However,
a critical step in a new research project, one that is often missed, is to spend a bit of time and resources at the beginning of the study collecting some pilot
data. Preliminary experiments on a limited scale,
or pilot experiments, are especially useful when
we deal with time-consuming, important or expen-
sive studies and are of great value for assessing the
feasibility of the actual experiment. During the pi-
lot stage the researcher is allowed to make varia-
tions in experimental conditions such as measure-
ment method, experimental set-up, etc. The pilot
study can be of help to make sure that a sensi-
ble research question was asked. For instance, if
our research question was about whether there is
a difference in concentration of a certain protein
between diseased and non-diseased tissue, it is of
importance that this protein is present in a mea-
surable amount. Carrying out a pilot experiment
in this case can save considerable time, resources
and eventual embarrassment. One could also won-
der whether the effect of an intervention is large
enough to warrant further study. A pilot study can
then give a preliminary idea about the size of this
effect and could be of help in making such a strate-
gic decision.
A second crucial role for the pilot study is for the researcher to practice, validate and standardize the experimental techniques that will be used
in the full study. When appropriate, trial runs of
different types of assays allow fine-tuning them so
that they will give optimal results. Finally, the pilot
study provides basic data to debug and fine-tune
the experimental design. Provided the experimen-
tal techniques work well, carrying out a small-scale
version of the actual experiment will yield some
preliminary experimental data. These pilot data
can be very valuable, making it possible to calculate or adjust the required sample size of the experiment and
to set up the data analysis environment.
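As a sketch of how a pilot estimate of variability feeds into such a calculation, the classical normal-approximation formula n = 2((z₁₋α/₂ + z₁₋β)·σ/δ)² per group can be coded as follows; the numeric values are hypothetical, not from the text:

```python
import math
from statistics import NormalDist

def n_per_group(sd, delta, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sided, two-sample comparison
    of means, using the normal approximation."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # critical value
    z_b = NormalDist().inv_cdf(power)          # power quantile
    return math.ceil(2 * ((z_a + z_b) * sd / delta) ** 2)

# Suppose the pilot data suggest sd ~= 10 units and we wish to detect
# a mean difference of 8 units with 80% power at the 5% level.
print(n_per_group(sd=10, delta=8))  # → 25
```

Exact power calculations based on the t-distribution give slightly larger numbers; the sketch only shows the mechanics.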
The pilot study still belongs to the exploratory
phase of the research project and is not part of the
actual, final experiment. In order to preserve the
quality of the data and the validity of the statistical
analysis, the pilot data cannot be included in the final
dataset.
4. Principles of Statistical Design
It is easy to conduct an experiment in such a way that no useful inferences can be made.
William Cochran and Gertrude Cox (1957).
4.1 Some terminology
We refer to a factor as the condition or set of con-
ditions that we manipulate in the experiment, e.g.
the concentration of a drug. The factor level is the
particular value of a factor, e.g. 10-6M, 10-5M. A
treatment consists of a specific combination of fac-
tor levels. In single-factor studies a treatment cor-
responds to a factor level. The experimental unit is
defined as the smallest physical entity to which a
treatment is independently applied. The characteristic that is measured and on which the effect of the different treatments is investigated and analysed is
referred to as the response or dependent variable. The
observational unit is the unit on which the response
is measured or observed. Often the observational
unit is identical to the experimental unit, but this
is not necessarily always the case. The definition
of additional statistical terms can be found in Ap-
pendix ??.
4.2 The structure of the response
variable
Figure 4.1. The response variable as the result of an additive model
We assume that the response obtained for a particular experimental unit can be described by a simple additive model (Figure 4.1) consisting of the effect of the specific treatment, the effect of the experimental design, and an error component that describes the deviation of this particular experimental unit from the mean value of its treatment group. There are some strong assumptions associated with this simple model:
• the treatment terms add rather than, for example, multiply;
• treatment effects are constant;
• the response in one unit is unaffected by the
treatment applied to the other units.
These assumptions are particularly important in
the statistical analysis. A statistical analysis is only
valid when all of these assumptions are met.
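In symbols, the additive model of Figure 4.1 is commonly written as follows (the notation is ours, not the author's):

```latex
y_{ij} = \mu + \tau_i + \beta_j + \varepsilon_{ij}
```

where $\mu$ is the overall mean response, $\tau_i$ the effect of treatment $i$, $\beta_j$ the contribution of the experimental design (e.g. a block effect), and $\varepsilon_{ij}$ the random error for the unit. The three assumptions above correspond to additivity of the terms, constancy of the $\tau_i$, and independence of the $\varepsilon_{ij}$ across units.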
4.3 Defining the experimental
unit
The experimental unit corresponds to the smallest division of the experimental material to which a
treatment can (randomly) be assigned, such that
any two units can receive different treatments. It
is important that the experimental units respond
independently of one another, in the sense that a
treatment applied to one unit cannot affect the re-
sponse obtained in another unit and that the oc-
currence of a high or low result in one unit has no
effect on the result of another unit. Correct identification of the experimental unit is of paramount importance for a valid design and analysis of the
study.
In many experiments the choice of the experi-
mental unit is obvious. However, in studies where
replication is at multiple levels, or when replicates
cannot be considered independent, it often hap-
pens that investigators have difficulties recogniz-
ing the proper basic unit in their experimental ma-
terial. In these cases, the term pseudoreplication is
often used (Fry, 2014). Pseudoreplication can result in a false estimate of the precision of the experimental results, leading to invalid conclusions
(Lazic, 2010).
The following example represents a situation
commonly encountered in biomedical research
when multiple levels are present.
Figure 4.2. Morphometric analysis of the diameter of bile canaliculi in wild-type and Cx32-deficient liver. Means ± SEM from three livers. *: P<0.005 (after Temme et al. (2001))
Example 4.1. Temme et al. (2001) compared two
genetic strains of mice, wild-type and connexin 32
(Cx32)-deficient. They measured the diameters of
bile canaliculi in the livers of three wild-type and
of three Cx32-deficient animals, making several ob-
servations on each liver. Their results are shown in
Figure 4.2. It should be clear that Temme et al.
(2001) mistakenly took cells, which were the ob-
servational units, for experimental units and used
them also as units of analysis. If we consider the
genotype as the treatment, then it is clear that not
the cell but the animal is the experimental unit.
Moreover, cells from the same animal will be more
alike than cells from different animals. This in-
terdependency of the cells invalidates the statisti-
cal analysis, as it was carried out by the investiga-
tors. Therefore, the correct experimental unit and
unit of analysis is the animal, not the cell. Hence,
there were only three experimental units per treatment, certainly not 280 and 162 units¹. The correct method of analysis calculates for each animal
the average cell diameter and takes this value as the
response variable.
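A minimal sketch of this unit-of-analysis correction, using made-up diameters rather than the actual Temme et al. (2001) data:

```python
import statistics

# Hypothetical bile-canaliculi diameters (µm), several cells per animal;
# the real data of Temme et al. (2001) are not reproduced here.
wild_type = {"wt1": [1.2, 1.4, 1.3], "wt2": [1.6, 1.5, 1.7], "wt3": [1.4, 1.5]}
cx32_def = {"ko1": [2.1, 2.0, 2.2], "ko2": [1.9, 2.0], "ko3": [2.3, 2.2, 2.4]}

# Collapse to one mean per animal: the animal, not the cell, is the
# experimental unit, so any subsequent t-test runs on n = 3 values
# per group -- not on the 8 cells measured in each group.
wt_means = [statistics.mean(cells) for cells in wild_type.values()]
ko_means = [statistics.mean(cells) for cells in cx32_def.values()]

print(len(wt_means), [round(m, 2) for m in wt_means])  # → 3 [1.3, 1.6, 1.45]
```

The two lists of per-animal means are then the input to a two-sample comparison, with three experimental units per group.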
Mistakes as in the above example are abundant
whenever microscopy is concerned and the indi-
vidual cell is used as experimental unit. One could
wonder whether these are mistakes made out of ig-
norance or out of convenience. The concern is even
greater when such studies get published in peer re-
viewed high impact scientific journals.
Independence of units can be an issue in plant
research and in studies using laboratory animals, as is illustrated by the following example.
Example 4.2. (LeBlanc, 2004) In order to assess the
effect of fertilization on corn plant growth, two plots
are established: one fertilized and one without fer-
tilizer (control). In each plot 100 corn plants are
grown. To determine the effect of fertilization, the
average mass of the corn plants is compared be-
tween the two plots.
The choice of plants as experimental units is in-
correct, because the observations of growth made
on adjacent plants within a plot are not indepen-
dent of one another. If one plant is genetically su-
perior and grows particularly large, it will tend to
shade its inferior neighbors and cause them to grow
more slowly. An appropriate design would be to put the plants into separate pots and randomly treat each pot with fertilizer or not. In this case there
is no competition of resources between plants. An-
other alternative would make use of several fertil-
ized and unfertilized plots of ground. The outcome
would then be the crop yield in each plot.
Laboratory animals such as mice and rats are
often housed together in cages, and the treatment can then conveniently be supplied in the tap water, such as in the following example:
Example 4.3. Rivenson et al. (1988) studied the tox-
icity of N-nitrosamines in rats and described their
experimental set-up as:

¹If we recalculate the standard errors of the mean (SEM) using the appropriate number of experimental units, they are a factor of 7-10 larger than the reported ones.
The rats were housed in groups of 3 in
solid-bottomed polycarbonate cages with
hardwood bedding under standard conditions; diet and tap water, with or without N-nitrosamines, were given ad libitum.
Since the treatment was supplied in the drinking water, it is impossible to provide different treatments to two individual rats within the same cage. Furthermore, the responses obtained from the different animals within a cage can be considered to be dependent
upon one another in the sense that the occurrence
of extreme values in one unit can affect the result
of another unit. Therefore, the experimental unit
here is not the single rat, but the cage.
An identical problem with independence of the
basic units is found in the study by Séralini et al.
(2012). In their study, rats were housed in groups
of two per cage and the treatment was present in
the food delivered to the cages.
Even when the animals are individually treated,
e.g. by injection, group-housing can cause ani-
mals in the same cage to interact which would in-
validate the assumption of independence of units.
For instance, in studies with rats, a socially domi-
nant animal may prevent others from eating at cer-
tain times. Mice housed in a group usually lie to-
gether, thereby reducing their total surface area.
A reduced heat loss per animal in the group is
the result. Due to this behavioral thermoregulation
their metabolic rate is altered (Ritskes-Hoitinga
and Strubbe, 2007).
Nevertheless, single housing of gregarious ani-
mal species is considered detrimental to their wel-
fare and regulations in Europe concerning animal
welfare insist on group housing of such species
(Council of Europe, 2006). However, when animals
are housed together, the cage rather than the indi-
vidual animal should be considered as the experi-
mental unit (Fry, 2014; Gart et al., 1986). Statistical
analysis should take this into account by using appropriate techniques. Fortunately, as is pointed out by Fry (2014), when the cage is taken as the experimental unit, the total number of animals needed
is not just a simple multiple of the number of an-
imals in the cage and the required number of ex-
perimental units (cages). An experiment requiring
10 animals per treatment group when housed in-
dividually is almost equivalent to an experiment
with 12 animals distributed over 4 cages per treatment. For some outcomes, variability is expected to be reduced because animals are more content when they are group-housed, which would enhance the latter experiment’s efficiency (Fry, 2014).
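Fry's equivalence can be made concrete with the classical design-effect correction for clustered units; the intra-class correlation of 0.1 used below is an assumed value for illustration, not a figure from the text:

```python
def effective_n(n_animals, cage_size, icc):
    """Effective sample size when animals are housed `cage_size` to a
    cage and responses within a cage share an intra-class correlation
    `icc`: n_eff = n / (1 + (cage_size - 1) * icc)."""
    return n_animals / (1 + (cage_size - 1) * icc)

# 12 animals in 4 cages of 3, with an assumed within-cage correlation
# of 0.1, carry roughly the information of 10 independently housed
# animals -- close to Fry's 10-versus-12 equivalence.
print(round(effective_n(12, 3, 0.1), 1))  # → 10.0
```

The stronger the within-cage correlation, the more extra animals are needed to match an individually housed design.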
The choice of the experimental unit is also of par-
ticular concern in plant research when the treat-
ment condition has been applied to whole pots,
trays or growth chambers rather than to individual plants. By now, it should be clear that in these cases it is the pot, tray or growth chamber that constitutes the experimental unit (Fernandez, 2007).
Especially in the case of growth chambers, when
the treatment condition is an environmental fac-
tor, it is difficult to obtain a sufficient number of
genuine replications. Therefore, experimenters are
tempted to consider the individual plants, pots,
trays or Petri dishes as experimental units, which is
wrong. Because the treatment condition (temperature, light intensity/duration, etc.) was applied to
the entire chamber, it is indeed the growth cham-
ber itself that constitutes the experimental unit.
For a sufficient number of replications when the
number of growth chambers is limited, the exper-
iment should be replicated using the same growth
chamber repeatedly over time (Fernandez, 2007).
Two-generation reproductive studies which in-
volve exposure in utero are standard procedures
in teratology. Also here, the entire litter rather
than the individual pup constitutes the experimen-
tal unit (Gart et al., 1986). This also applies to other
experiments in reproductive biology.
Example 4.4. (Fry, 2014) A drug was tested for its
capacity to reduce the effect of a mutation causing
a common condition. To accomplish this, homozy-
gous mutant female rats were randomly assigned to
drug-treated and control groups. Then they were mated with homozygous mutant males, producing
homozygous mutant offspring. Litters were weaned
and pups grouped five to a cage and the effects on
the offspring were observed. Here, although obser-
vations on the individual offspring were made, the
experimental units are the mutant dams that were
randomly assigned to treatment. Therefore, the ob-
servations on the offspring should be averaged to
give a single figure for each dam and these data,
one for each dam, are to be used for comparing the
treatments.
A single individual can also relate to several ex-
perimental units. This is illustrated by the follow-
ing example.
Example 4.5. (Fry, 2014) The efficacy of two agents
at promoting regrowth of epithelium across a wound
was evaluated by making 12 small wounds in a stan-
dardized way in a grid pattern on the back of a pig.
The wounds were far enough apart for effects on
each to be independent. One of four treatments
would then be applied at random to the wound in
each square of the grid. In this case the experimen-
tal unit would be the wound and, as there are 12
wounds, for each treatment there would be three
replicates.
4.4 Variation is omnipresent
Variation is everywhere in the natural world and
is often substantial in the life sciences. Despite ac-
curate execution of the experiment, the measure-
ments obtained in identically treated objects will
yield different results. For example, cells grown in
test tubes will vary in their growth rates, and in animal research no two animals will behave exactly the same. In general, the more complex the system
that we study, the more factors will interact with
each other and the greater will be the variation
between the experimental units. Experiments in
whole animals will undoubtedly show more varia-
tion than in vitro studies on isolated organs. When
the variation cannot be controlled or its source can-
not be measured, we will refer to it as noise, ran-
dom variation or error. This uncontrollable variation
masks the effects under investigation and is the reason why replication of experimental units and
statistical methods are required to extract the nec-
essary information. This is in contrast to other sci-
entific areas such as physics, chemistry and engi-
neering where the studied effects are much larger
than the natural variation.
4.5 Balancing internal and exter-
nal validity
Figure 4.3. The basic dilemma: balancing between internal and external validity
Internal validity refers to the fact that in a well-
conceived experiment the effect of a given treat-
ment is unequivocally attributed to that treatment.
However, the effect of the treatment is masked by
the presence of the uncontrolled variation of the
experimental material.
An experiment with a high level of internal validity should have a great chance to detect the effect of the treatment. If we consider the treatment
effect as a signal and the inherent variation of our
experimental material as noise, then a good exper-
imental design will maximize the signal-to-noise ra-
tio. Increasing the signal can be accomplished by
choosing experimental material that is more sensi-
tive to the treatment. Identification of factors that
increase the sensitivity of the experimental material could be carried out in preliminary experiments. Reducing the noise is another way to increase the signal-to-noise ratio. This can be accomplished by repeating the experiment in a number
of animals, but this is not a very efficient way of
reducing the noise. An alternative for noise reduc-
tion is by using very uniform experimental mate-
rial. The use of cells harvested from a single animal
is an example of noise reduction by uniformity of
experimental material.
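Why mere replication is an inefficient noise reducer can be seen from the standard error of the mean, which shrinks only with the square root of the number of units; the standard deviation of 2 below is an arbitrary illustrative value:

```python
import math

sigma = 2.0  # assumed standard deviation between experimental units
for n in (4, 16, 64):
    sem = sigma / math.sqrt(n)  # standard error of the group mean
    print(f"n = {n:2d}  SEM = {sem:.2f}")
# Quadrupling the number of units merely halves the SEM.
```

Halving the noise thus costs four times the animals, which is why design-based noise reduction is usually preferred.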
External validity is related to the extent that our
conclusions can be generalized to the target population. The choice of the target population, how a
sample is selected from this population, and the experimental procedures used in the study are all determinants of its external validity. Clearly, the experimental material should mimic the target population as closely as possible. In animal experiments,
specifying species and strain of the animal, the age
and weight range and other characteristics deter-
mine the target population and make the study as
realistic and informative as possible. External va-
lidity can be very low if we work in a highly con-
trolled environment using very uniform material.
Thus there is a trade-off between internal and external validity (Figure 4.3): as one goes up, the other
comes down. Fortunately, as we will see, there are
statistical strategies for designing a study such that
the noise is reduced while the external validity is
maintained.
4.6 Bias and variability
Bias and variability (Figure 4.4) are two important
concepts when dealing with the design of exper-
iments. A good experiment will minimize or, at
best, try to eliminate bias and will control for vari-
ability. By bias, we mean a systematic deviation in
observed measurements from the true value. One
of the most important sources of bias in a study is
the way experimental units are allocated to treat-
ment groups.
Example 4.6. Consider an animal study in which
the effect of an experimental treatment is investi-
gated with respect to a control treatment. Suppose
the experimenter allocates all males to the control
treatment and all females to the experimental treat-
ment. Further assume that at the end of the ex-
periment the investigator finds a strong difference
between the two treatment groups. It is clear that
this treatment effect is a biased result, since the difference between the two groups is completely confounded with the difference between both sexes.
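A hypothetical sketch of how such confounding is avoided by randomizing within each sex (stratified allocation); the animal names and group sizes are invented:

```python
import random

def stratified_allocation(units_by_sex, treatments, seed=1):
    """Assign units to treatments at random within each sex, so that
    sex is balanced across the groups rather than confounded with them."""
    rng = random.Random(seed)
    allocation = {}
    for sex, units in units_by_sex.items():
        shuffled = list(units)
        rng.shuffle(shuffled)  # random order within the stratum
        for i, unit in enumerate(shuffled):
            allocation[unit] = treatments[i % len(treatments)]
    return allocation

animals = {"male": ["m1", "m2", "m3", "m4"],
           "female": ["f1", "f2", "f3", "f4"]}
alloc = stratified_allocation(animals, ["control", "experimental"])
# Each treatment group now contains two males and two females.
```

With this allocation, any sex difference contributes equally to both groups instead of masquerading as a treatment effect.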
Figure 4.4. Bias and variability illustrated by a marksman shot at a bull’s eye
Example 4.7. Confounding bias can enter a study
through less obvious routes. For example, microarrays allow biologists to measure the expression of
thousands of genes or proteins simultaneously. Minor differences in a number of non-biological variables, such as reagents from different lots, different technicians, or even changing atmospheric ozone levels, can impact the data. Microarrays are usually
processed in batches and in large studies different
institutes can collaborate using different equipment.
Consider a naive experimental setup in which on
Monday all control samples are measured and on
Tuesday all diseased samples, thus masking the ef-
fect of disease with that of the two batches. Worse
is that these batch effects do not affect the entire
microarray in the same manner. Correlation pat-
terns differ by batch, and even reverse sign across
batches (Leek et al., 2010).
Studies using laboratory animals can be subject
to bias when all animals assigned to a specific treat-
ment are kept in the same cage. In Section 4.3, we
already mentioned the problem of independence
when animals were housed in groups. In addi-
tion to these, when all animals in a cage receive the
same treatment, the effects due to the conditions
in the cage are superimposed on the effects of the
treatments. When the experiment is restricted to
a single cage per treatment, the comparisons be-
tween the treatments can be biased (Fry, 2014). The
same reasoning applies to the location of the cages
in the rack (Gart et al., 1986).
By variability, we mean a random fluctuation
about a central value. The terms bias and vari-
ability are also related to the concepts of accuracy
and precision of a measurement process. Absence
of bias means that our measurement is accurate,
while little variability means that the measurement
is precise. Good experiments are as free as possi-
ble from bias and variability. Of the two, bias is
the most important. Failure to minimize the bias
of an experiment leads to erroneous conclusions
and thereby jeopardizes the internal validity. Con-
versely, if the outcome of the experiment shows
too much variability, this can sometimes be reme-
died by refinement of the experimental methods,
increasing the sample size, or other techniques. In
this case the study may still reach the correct con-
clusions.
4.7 Requirements for a good experiment
Cox (1958) enunciated the following requirements
for a good experiment:
1. treatment comparisons should as far as pos-
sible be free of systematic error (bias);
2. the comparisons should also be made suffi-
ciently precise (signal-to-noise);
3. the conclusions should have a wide range of
validity (external validity);
4. the experimental arrangement should be as
simple as possible;
5. uncertainty in the conclusions should be as-
sessable.
These five requirements will determine the basic
elements of the design. We have discussed already
the first three requirements in the preceding sec-
tions. The following section provides some basic
strategies to fulfill these requirements.
Figure 4.5. Overview of strategies for minimizing the bias and maximizing the signal-to-noise ratio
4.8 Strategies for minimizing bias
and maximizing signal-to-
noise ratio
To safeguard the internal validity of his study, the
scientist needs to optimize the signal-to-noise ra-
tio (Figure 4.5). This constitutes the basic principle
of statistical design of experiments. The signal can
be maximized by the proper choice of the measur-
ing device and experimental domain. The noise is
minimized by reducing bias and variability. Strate-
gies for minimizing the bias are based on good ex-
perimental practice, such as: the use of controls,
blinding, the presence of a protocol, calibration,
randomization, random sampling, and standard-
ization. Variability can be minimized by elements
of experimental design: replication, blocking, co-
variate measurement, and sub-sampling. In ad-
dition, random sampling can be added to enhance
the external validity. We will now consider each of
these strategies in more detail.
4.8.1 Strategies for minimizing bias -
good experimental practice
4.8.1.1 The use of controls
In biomedical studies, a control or reference stan-
dard is a standard treatment condition against
which all others may be compared. The control
can either be a negative control or a positive control.
The term active control is also used for the latter. In
some studies, both negative and positive controls
are present. In this case, the purpose of the positive
control is mostly to provide an internal validation
of the experiment¹.
When negative controls are used, subjects can
sometimes act as their own control (self-control). In
this case the subject is first evaluated under stan-
dard conditions (i.e. untreated). Subsequently,
the treatment is applied and the subject is re-
evaluated. This design, also called pre-post de-
sign, has the property that all comparisons are
made within the same subject. In general, vari-
¹Active controls play a special role in so-called equivalence or non-inferiority studies, where the purpose is to show that
a given therapy is equivalent or non-inferior to an existing standard.
ability within a subject is smaller than between
subjects. Therefore, this is a more efficient design
than comparing control and treatment in two sep-
arate groups. However, the use of self-control has
the shortcoming that the effect of treatment is com-
pletely confounded with the effect of time, thus intro-
ducing a potential source of bias. Furthermore,
blinding, which is another method to minimize
bias, is impossible in this type of design.
Another type of negative control is where one
treated group does not receive any treatment at all,
i.e. the experimental units remain untouched. Just
as in the previous case of self-control, untreated con-
trols cannot be blinded. Furthermore, applying the
treatment (e.g. a drug) often requires extra manip-
ulation of the subjects (e.g. injection). The effect of
the treatment is then intertwined with that of the
manipulation and consequently it is potentially bi-
ased.
Vehicle control (laboratory experiments) or placebo
control (clinical trials) are terms that refer to a con-
trol group that receives a matching treatment con-
dition without the active ingredient. Another term
for this type of control, in the context of experi-
mental surgery, is sham control. In the sham control
group subjects or animals undergo a faked opera-
tive intervention that omits the step thought to be
therapeutically necessary. This type of vehicle con-
trol, placebo control or sham control is the most
desirable and truly minimizes bias. In clinical re-
search the placebo controlled trial has become the
gold standard. However, in the same context of
clinical research, ethical considerations may some-
times preclude its application.
4.8.1.2 Blinding
Researchers’ expectations may influence the study
outcome at many stages. For instance, the exper-
imental material may unintentionally be handled
differently based on the treatment group, or ob-
servations may be biased to confirm prior beliefs.
Blinding is a very useful strategy for minimizing
this subconscious experimenter bias.
In a recent survey of studies in evolutionary bi-
ology and of the life sciences at large, Holman
et al. (2015) found that in unblinded studies the
mean reported effect size was inflated by 27% and
the number of statistically significant findings was
substantially larger as compared to blinded stud-
ies. The importance of blinding in combination
with randomization in animal studies was also
highlighted by Hirst et al. (2014). Despite its im-
portance, blinding of experimenters is often ne-
glected in biomedical research. For example, in
a systematic review of studies on animals in non-
clinical research, van Luijk et al. (2014) found that
only 24% reported blinded assessment of the out-
come, while only 15% considered blinding of the
caretaker/investigator.
Two types of blinding must be distinguished. In
single blinding the investigators are uninformed re-
garding the treatment condition of the experimen-
tal subjects. Single blinding neutralizes investigator
bias. The term double blinding in laboratory exper-
iments means that both the experimenter and the
observer are uninformed about the treatment con-
dition of the experimental units. In clinical trials
double blinding means that both investigators and
subjects are unaware of the treatment condition.
Two methods of blinding have found their way
to the laboratory: group blinding and individual
blinding. Group blinding involves identical codes,
say A, B, C, etc., for entire treatment groups. This
approach is less appropriate, since when results ac-
cumulate and the investigator is able to break the
code, blindness is completely destroyed. A much better approach is to assign a code (e.g. sequence
number) to each experimental unit individually
and to maintain a list that indicates which code
corresponds to which particular treatment. The se-
quence of the treatments in the list should be ran-
domized. In practice, this individual blinding pro-
cedure often involves an independent person that
maintains the list and prepares the treatment con-
ditions (e.g. drugs).
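A minimal sketch of such an individual blinding list is given below. Appendix B describes the course's MS Excel and R methods; the Python code and its function name here are purely illustrative assumptions.

```python
import random

def make_blinding_list(treatments, n_per_group, seed=None):
    """Assign each experimental unit an individual code (sequence number)
    and pair it with a randomized treatment. The resulting list is kept
    by an independent person, who also prepares the treatments."""
    rng = random.Random(seed)
    labels = [t for t in treatments for _ in range(n_per_group)]
    rng.shuffle(labels)  # randomize the sequence of treatments
    return {code: t for code, t in enumerate(labels, start=1)}

# Experimenters see only codes 1..6; the key stays with a third party
key = make_blinding_list(["control", "drug"], n_per_group=3, seed=42)
print(key)
```

Because each unit carries its own code, accumulating results for one unit reveal nothing about the others, unlike group codes A, B, C.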
Especially when the outcome of the experiment
is subjectively evaluated, blinding must be consid-
ered. However, there is one situation where blind-
ing does not seem to be appropriate, namely in tox-
icologic histopathology. Here, the bias that would
be reduced by blinding is actually a bias favor-
ing diagnosis of a toxicological hazard and a con-
servative safety evaluation, which is appropriate
in this context. Blinded evaluation would result
in a reduction in the sensitivity to detect anoma-
lies. In this context, Holland and Holland (2011)
suggested that for toxicological work both an un-
blinded and blinded evaluation of histologic mate-
rial should be carried out.
4.8.1.3 The presence of a technical protocol
The presence of a written technical protocol de-
scribing in full detail the specific definitions of
measurement and scoring methods is imperative
to minimize potential bias. The technical protocol
specifies practical actions and gives guidelines for
lab technicians on how to manipulate the experi-
mental units (animals, etc.), the materials involved
in the experiment, the required logistics, etc. It
also gives details on data collection and process-
ing. Last but not least, the technical protocol lays
down the personal responsibilities of the techni-
cal staff. The importance and contents of the other
protocol, the study protocol, will be discussed fur-
ther in Chapter 8.
4.8.1.4 Calibration
Calibration is an operation that compares the out-
put of a measurement device to standards of
known value, leading to correction of the values
indicated by the measurement device. Calibration
neutralizes instrument bias, i.e. the bias in the in-
vestigator’s measurement system.
4.8.1.5 Randomization
Together with blinding, randomization is an im-
portant tool for bias elimination in animal stud-
ies. In an overview of systematic reviews on an-
imal studies Hirst et al. (2014) found that failure
to randomize is likely to result in overestimation
of treatment effects across a range of disease ar-
eas and outcome measures. Randomization has
been successfully applied in many areas of re-
search, such as microarray studies (Göhlmann and
Talloen, 2009; Yang et al., 2008) and studies using
96-well microplates (Faessel et al., 1999; Levasseur
et al., 1995).
Formal randomization, in our context, is the pro-
cess of allocating experimental units to treatment
groups or conditions according to a well-defined
stochastic law¹. Randomization is a critical ele-
ment in proper study design. It is an objective and
scientifically accepted method for the allocation of
experimental units to treatment groups. Formal
randomization ensures that the effect of uncon-
trolled sources of variability has equal probabil-
ity in all treatment groups. In the long run, ran-
domization balances treatment groups on unim-
portant or unobservable variables, of which we
are often unaware. Any differences that exist in
these variables after randomized treatment alloca-
tion are then to be attributed to the play of chance.
In other words, randomization is an operation that
effectively turns lethal bias into more manageable
random error (Vandenbroeck et al., 2006). The ran-
dom allocation of experimental units to treatment
conditions is also a necessary condition for a rigor-
ous statistical analysis, in the sense that it provides
an unbiased estimate of the standard error of the
treatment effects, makes experimental units inde-
pendent of one another and justifies the use of sig-
nificance tests (Cox, 1958; Fisher, 1935; Lehmann,
1975). In addition, randomization is also of use as
a device for blinding the experiment.
Example 4.8. In neurological research, animals are
randomly allocated to treatments. At the end of
the experimental procedures, the animals are sacri-
ficed, slides are made from certain target areas of
the brain and these slides are investigated micro-
scopically. At each of these stages errors can arise
leading to biased results if the original randomiza-
tion order is not maintained.
Errors may arise at various stages in the exper-
iment. Therefore, to eliminate all possible bias, it
¹By the term stochastic is meant that it involves some elements of chance, such as picking numbers out of a hat, or
preferably, using a computer program to assign experimental units to treatment groups.
is essential that the randomization procedure cov-
ers all important sources of variation connected
with the experimental units. In addition, as far
as practical, experimental units receiving the same
treatment should be dealt with separately and in-
dependently at all stages at which important er-
rors may arise. If this is not the case, additional
randomization procedures should be introduced
(Cox, 1958). To summarize, randomization should
apply to each stage of the experiment (Fry, 2014):
• allocation of independent experimental units
to treatment groups
• order of exposure to test alteration within an
environment
• order of measurement
Therefore, in animal experiments, when the cage
is the experimental unit, the arrangement of cages
within the rack or room, the administration of sub-
stances, taking of samples, etc. should also be ran-
domized, even though this adds an extra burden to
the laboratory staff. Of course, this can be accom-
plished by maintaining the original randomization
sequence throughout the experiment.
Formal randomization requires the use of a ran-
domization device. This can be the tossing of a coin,
use of randomization tables (Cox, 1958), or use of
computer software (Kilkenny et al., 2009). Meth-
ods of randomization using MS Excel and the R sys-
tem (R Core Team, 2013) are contained in Appendix
B.
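As an illustration of such a computer-based randomization device, the following Python sketch shuffles the units once and derives both the treatment allocation and the processing order from that single sequence (Appendix B gives the MS Excel and R methods; the function below is an assumed helper, not the course's own code):

```python
import random

def randomize(unit_ids, treatments, seed=None):
    """Shuffle the experimental units once; derive from that single
    randomized sequence both the treatment allocation and the order in
    which units are handled and measured at every later stage."""
    rng = random.Random(seed)
    order = list(unit_ids)
    rng.shuffle(order)
    k = len(treatments)
    allocation = {unit: treatments[i % k] for i, unit in enumerate(order)}
    return order, allocation

# Twelve rats allocated to treatments A and B
order, allocation = randomize(range(1, 13), ["A", "B"], seed=2016)
print(order)       # maintained throughout administration and measurement
print(allocation)
```

Keeping the one shuffled sequence for caging, dosing, and measurement is what lets a single randomization cover all later stages of the experiment.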
Some investigators are convinced that not ran-
domization, but a systematic arrangement is the pre-
ferred way to eliminate the influence of uncon-
trolled variables. For example when two treat-
ments A and B have to be compared, one possibil-
ity is to set up pairs of experimental units and al-
ways assign treatment A to the first member of the
pair and B to the remaining unit. However, if there
is a systematic effect such that the first member of
each pair consistently yields a higher or lower re-
sult than the second member, the estimated treat-
ment effect will be biased. Other arrangements are
more commonly used in the laboratory, e.g. the al-
ternating sequence AB, BA, AB, BA,. . . . Here too, it
cannot be excluded that a certain pattern in the un-
controlled variability coincides with this arrange-
ment. For instance, if 8 experimental units are
tested in one day, the first unit on a given day will
always receive treatment A. Furthermore, when a
systematic arrangement has been applied, the sta-
tistical analysis is based on the false assumption of
randomness and can be totally misleading.
Researchers are sometimes tempted to improve
on the random allocation of animals by re-arranging
individuals so that the mean weights are almost
identical. However, by reducing the variability
between the treatment groups as is done in Fig-
ure 4.6, the variability within the groups is altered
and can now differ between groups. This reduces
the precision of the experiment and invalidates the
subsequent statistical analysis. Later, we will see
that the randomized block design instead of sys-
tematic arrangement is the correct way of handling
these last two cases.
Figure 4.6. Trying to improve the random allocation by reducing the intergroup variability increases the intragroup variability
Formal randomization must be distinguished from
haphazard allocation to treatment groups (Kilkenny
et al., 2009). For example, an investigator wishes to
compare the effect of two treatments ( A, B) on the
body weight of rats. All twelve animals are deliv-
ered in a single cage to the laboratory. The labo-
ratory technician takes six animals out of the cage
and assigns them to treatment A, while the remain-
ing animals will receive treatment B. Although
many scientists would consider this a random
assignment, it is not. Indeed, one could imagine
the following scenario: heavy animals react slower
and are easier to catch than the smaller animals.
Consequently, the first six animals will on average
weigh more than the remaining six.
Example 4.9. An important issue in the design of
an experiment is the moment of randomization. In
![Page 26: Statistical Thinking and Smart Experimental Design 2016](https://reader030.vdocuments.net/reader030/viewer/2022021222/577c790f1a28abe054914517/html5/thumbnails/26.jpg)
7/26/2019 Statistical Thinking and Smart Experimental Design 2016
http://slidepdf.com/reader/full/statistical-thinking-and-smart-experimental-design-2016 26/82
22 CHAPTER 4. PRINCIPLES OF STATISTICAL DESIGN
an experiment brain cells were taken from animals
and placed in Petri dishes, such that one Petri dish
corresponded to one particular animal. The Petri
dishes were then randomly divided into two groups
and placed in an incubator. After 72 hrs incuba-
tion one group of Petri dishes was treated with the
experimental drug, while the other group received
solvent.
Although the investigators made a serious effort to
introduce randomization in their experiment, they
overlooked the fact that the placement of the Petri dishes
in the incubator introduced a systematic error. In-
stead of randomly dividing the Petri dishes in two
groups at the start of the experiment, they should
have made random treatment allocation after the
incubation period.
As pointed out before, it is important that the
randomization covers all substantial sources of varia-
tion connected with the experimental units. As a
rule, randomization should be performed imme-
diately before treatment application. Furthermore,
after the randomization process has been carried
out the randomized sequence of the experimental
units must be maintained, otherwise a new ran-
domization procedure is required.
4.8.1.6 Random sampling
Using a random sample in our study increases its
external validity and allows us to make a broad
inference, i.e. a population model of inference
(Lehmann, 1975). However, in practice it is of-
ten difficult and/or impractical to conduct studies
with true random sampling. Clinical trials are usu-
ally conducted using eligible patients from a small
number of study sites. Animal experiments are
based on the available animals. This certainly lim-
its the external validity of these studies and is one
of the reasons that the results are not always repli-
cable.
In some cases, maximizing the external validity
of the study is of great importance. This is es-
pecially the case in studies that attempt to make
a broad inference towards the target population
(population model), like gene expression exper-
iments that try to relate a specific pathology to
the differential expression of certain gene probes
(Nadon and Shoemaker, 2002). For such an exper-
iment, the bias in the results is minimized only if it
is based on a random sample from the target pop-
ulation.
4.8.1.7 Standardization
Standardization of the experimental conditions is
an effective way to minimize the bias. In addi-
tion, it can also be used to reduce the intrinsic
variability in the results. Examples of standard-
ization of the experimental conditions are the use
of genetically or phenotypically uniform animals
or plants, environmental and nutritional control,
acclimatization, and standardization of the mea-
surement system. As discussed before, too much
standardization of the experimental conditions can
jeopardize the external validity of the results.
4.8.2 Strategies for controlling variability
- good experimental design
4.8.2.1 Replication
Ronald Fisher¹ noted in his pioneering book The
Design of Experiments that replication at the level
of the experimental unit serves two purposes. The
first is to increase the precision of estimation and
the second is to supply an estimate of error by
which the significance of the comparisons is to be
judged.
The precision of an experiment depends on the
standard deviation² (σ) of the experimental mate-
rial and inversely on the number of experimental
units (n). In a comparative experiment with two
treatment groups and an equal number of experi-
mental units per treatment group, this precision
is quantified by the standard error of the differ-
ence between the two averages (X̄₁ − X̄₂) as:
¹Sir Ronald Aylmer Fisher (London, 1890 - Adelaide, 1962) is considered a genius who almost single-handedly created
the foundations of modern statistical science and experimental design.
²The standard deviation refers to the variation of the individual experimental units, whereas the standard error refers
to the random variation of an estimate (mostly the mean) from a whole experiment. The standard deviation is a basic
property of the underlying distribution and, unlike the standard error, is not altered by replication.
Figure 4.7. The effect of blocking illustrated by a study of the effect of diet on running speed of dogs. Not taking age of the dog into account (left panel) masks most of the effect of the diet. In the right panel dogs are grouped (blocked) according to age and comparisons are made within each age group. The latter design is much more efficient.
σ_(X̄₁−X̄₂) = σ × √(2/n), where σ is the common stan-
dard deviation and n is the number of experimen-
tal units in each treatment group.
The standard deviation is composed of the in-
trinsic variability of the experimental material and
the precision of the experimental work. Reduc-
tion of the standard deviation is only possible to
a limited extent by refining experimental proce-
dures. However, by increasing the number of
experimental units one can effectively enhance the
experiment’s precision.
inverse square root dependency of the standard er-
ror on the sample size, this is not an efficient way to
control the precision. Indeed, the standard error is
halved by a fourfold increase in the number of ex-
perimental units, but a hundredfold increase in the
number of units is required to divide the standard
error by ten. In other words, replication at the level of
the experimental unit is an effective but expensive strat-
egy to control variability. As we will see later, choos-
ing an appropriate experimental design that takes
into account the different sources of variability that
can be identified, is a more efficient way to increase
the precision.
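The inverse square-root relationship described above can be checked numerically. The short Python sketch below (the function name is an assumption, not from the course) shows the standard error halved by a fourfold increase in n and divided by ten only at a hundredfold increase:

```python
import math

def se_difference(sigma, n):
    """Standard error of the difference between two treatment means,
    each based on n independent experimental units."""
    return sigma * math.sqrt(2.0 / n)

sigma = 1.0
print(se_difference(sigma, 4))    # baseline
print(se_difference(sigma, 16))   # 4-fold n: standard error halved
print(se_difference(sigma, 400))  # 100-fold n: standard error divided by 10
```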
4.8.2.2 Subsampling
As mentioned above, reduction of the standard
deviation is only possible to a very limited ex-
tent. This can be accomplished by standardiza-
tion of the experimental conditions, but also this
method is limited and it jeopardizes the external
validity of the experiment. However, in some ex-
periments it is possible to manipulate the phys-
ical size of the experimental units. In this case
one could choose units of a larger size, such that
the estimates are more precise. In still other ex-
periments, there are multiple levels of sampling.
The process of taking samples below the primary
level of the experimental unit is known as subsam-
pling (Cox, 1958; Selwyn, 1996) or pseudoreplica-
tion (Fry, 2014; Lazic, 2010; LeBlanc, 2004). The ex-
periment reported by Temme et al. (2001) where
the diameter of many liver cells was measured in
3 animals/experimental condition, is an example
of subsampling with animals at the primary level
and cells at the subsample level. Multiple obser-
vations or measurements made over time are also
pseudoreplications or subsamples. In biological
and chemical analysis, it is common practice to du-
plicate or triplicate independent determinations on
samples from the same experimental unit. In this
case samples and determinations within samples
constitute two distinct levels of subsampling.
When subsampling is present, the standard de-
viation σ used in the comparison of the treat-
ment means is composed of the variability be-
tween the experimental units (between-unit vari-
ability) and the variability within the experimental
units (within-unit variability). It can be shown that
in the presence of subsampling the overall stan-
dard deviation of the experiment is equal to:

σ = √(σn² + σm²/m)

where n and m are the number of experimental
units and subsamples, and σn and σm the between-
unit and within-unit standard deviations.
In Section 4.8.2.1, we defined the precision of
the comparative 2-treatments experiment. This
now becomes:
σ_(X̄₁−X̄₂) = √((2/n)(σn² + σm²/m))
Thus, by increasing the number of experimental
units n we reduce the total variability, while sub-
sample replication m only affects the within-unit
variability. A large number of subsamples makes
only sense when the variability of the measure-
ment at the sublevel σm is substantial as compared
to the between-unit variability σn. How to deter-
mine the required number of subsamples will be
discussed in Section 6.5. As a conclusion, we can
say that subsample replication is not identical and not
as effective as replication on the level of the true experi-
mental unit.
4.8.2.3 Blocking
Example 4.10. Consider a (hypothetical) experi-
ment in which the effect of two diets on running
speed of dogs is studied. We can carry out the ex-
periment by taking 6 dogs with varying age and
randomly allocating 3 dogs to diet A and the 3 re-
maining to diet B. However, as shown in the left
panel of Figure 4.7, the effect of diet will be greatly
masked by the variability between dogs. A more
intelligent way to set up the experiment is to group
the dogs by age and make all comparisons within
the same age group, thus removing the effect of dif-
ferent ages. This is illustrated in the right panel
of Figure 4.7. With the variability due to age re-
moved, the effect of the diets within the age groups
is much more clear.
If one or more factors other than the treatment
conditions can be identified as potentially influenc-
ing the outcome of the experiment, it may be ap-
propriate to group the experimental units on these
factors. Such groupings are referred to as blocks
or strata. Units within a block are then randomly
assigned to the treatments. Examples of blocking
factors are plates (in microtiter plate experiments),
greenhouses, animals, litters, date of experiment,
or based on categorizations of continuous base-
line characteristics such as body weight, baseline
measurement of the response, etc. What we ef-
fectively do by blocking is dividing the variation
between the individuals into variation between
blocks and variation within blocks. If the blocking
factor has an important effect on the response, then
the between-block variation is much greater than
the within block variation. We will take this into
account in the analysis of the data (analysis of vari-
ance or ANOVA with blocks as additional factor).
Comparisons of treatments are then carried out
within blocks, where the variation is much smaller.
Blocking is an effective and efficient way to en-
hance the precision of the experiment. In addition,
blocking allows one to reduce the bias due to an im-
balance in baseline characteristics that are known
to affect the outcome. However, blocking does not
eliminate the need for randomization. Within each
block treatments are randomly assigned to the ex-
perimental units, thus removing the effect of the
remaining unknown sources of bias.
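The within-block randomization just described can be sketched in a few lines, using the hypothetical dogs of Example 4.10 (Python for illustration; the function, dog names, and block labels are assumptions):

```python
import random

def randomized_block_design(blocks, treatments, seed=None):
    """Within each block, randomly assign the treatments to the units.
    This sketch assumes each block contains exactly len(treatments) units."""
    rng = random.Random(seed)
    assignment = {}
    for block, units in blocks.items():
        shuffled = list(treatments)
        rng.shuffle(shuffled)  # fresh randomization within every block
        for unit, treatment in zip(units, shuffled):
            assignment[unit] = (block, treatment)
    return assignment

# Six dogs blocked (grouped) by age, as in Example 4.10
blocks = {"young": ["d1", "d2"], "adult": ["d3", "d4"], "old": ["d5", "d6"]}
print(randomized_block_design(blocks, ["A", "B"], seed=1))
```

Every block receives each diet exactly once, so all diet comparisons are made within an age group, while the shuffle inside each block preserves randomization.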
4.8.2.4 Covariates
Figure 4.8. Additive model with a linear covariate adjustment
Blocking on a baseline characteristic such as
body weight is one possible strategy to elimi-
nate the variability induced by the heterogeneity
in weight of the animals or patients. Instead of
grouping in blocks, or in addition to, one can also
make use of the actual value of the measurement.
Such a concomitant measurement is referred to as
a covariate. It is an uncontrollable but measur-
able attribute of the experimental units (or their
environment) that is unaffected by the treatments but
may have an influence on the measured response.
Examples of covariates are body weight, age, am-
bient temperature, measurement of the response
variable before treatment, etc. The covariate fil-
ters out the effect of a particular source of variabil-
ity. Unlike a blocking factor, it represents a quantifi-
able attribute of the system studied. The statisti-
cal model underlying the design of an experiment
with covariate adjustment is conceptualized in Fig-
ure 4.8. The model implies that there is a linear re-
lationship between the covariate and the response
and that this relationship is the same in all treat-
ment groups. In other words, there is a series of
parallel curves, one per treatment group, relating
the response to the covariate. This is exemplified
in Figure 4.9 showing the results of an experiment
with two treatment groups in which the baseline
measurement of the response variable served as
covariate. There is a linear relationship between
the covariate and the response and this is almost
the same in both treatment groups as is shown by
the fact that the two lines are almost parallel to
one another.
Figure 4.9. Results of an experiment with baseline as covari-
ate. There is a linear relationship between the covariate and
the response and this relationship is the same in both treatment
groups.
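The additive covariate model of Figure 4.8 corresponds to an analysis of covariance and can be fitted in R with lm(). The sketch below uses simulated data; the variable names (baseline, response) and all numbers are hypothetical, chosen only to mimic a two-group experiment with a baseline covariate.

```r
# Sketch of a covariate-adjusted analysis (ANCOVA) with simulated data.
# All names and numbers are hypothetical illustrations.
set.seed(1)
n <- 10                                        # units per treatment group
treatment <- factor(rep(c("Vehicle", "Drug"), each = n))
baseline  <- rnorm(2 * n, mean = 50, sd = 5)   # covariate
# additive model: common slope for the covariate, shift for treatment
response  <- 10 + 0.8 * baseline +
             5 * (treatment == "Drug") + rnorm(2 * n, sd = 2)

# parallel-lines model: one slope, one intercept shift per treatment
fit  <- lm(response ~ treatment + baseline)
# unadjusted model, for comparison of the treatment standard error
fit0 <- lm(response ~ treatment)
summary(fit)$coefficients
```

Because the covariate absorbs part of the unit-to-unit variability, the standard error of the treatment coefficient in fit is typically much smaller than in fit0.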
4.9 Simplicity of design
In addition to external validity, bias and precision,
Cox (1958) also stated that the design of our exper-
iment should be as simple as possible. When the
design of the experiment is too complex, it may
be difficult to ensure adherence to a complicated
schedule of alterations, especially if these are to be
carried out by relatively unskilled people. An ef-
ficient and simple experimental design has the ad-
ditional advantage that the statistical analysis will
also be simple without making unreasonable as-
sumptions.
4.10 The calculation of uncer-
tainty
This is the last of Cox’s precepts for a good experi-
ment (see Section 4.7, page 18). It is the only sta-
tistical requirement, but it is also the most impor-
tant one. Unfortunately, it is also the requirement
that is most often neglected. Fisher (1935) already
lamented that:
It is possible, and indeed it is all too fre-
quent, for an experiment to be so conducted
that no valid estimate of error is available.
Without the ability to estimate error, there is no ba-
sis for statistical inference. Therefore, in a well-
conceived experiment, we should always be able
to calculate the uncertainty in the estimates of the
treatment comparisons. This usually means es-
timating the standard error of the difference be-
tween the treatment means. To make this calcu-
lation in a rigorous manner, the set of experimen-
tal units must respond independently to a specific
treatment and may only differ in a random way
from the set of experimental units assigned to the
other treatments. This requirement again stresses
the importance of independence of experimental
units and randomness in treatment allocation.
Table 4.1. Multiplication factor to correct for the bias in esti-
mates of the standard deviation based on small samples, after
Bolch (1968).
n Factor
2 1.253
3 1.128
4 1.085
5 1.064
6 1.051
When the number of experimental units is
small, the sample estimate of the standard devia-
tion σ is biased and underestimates the true stan-
dard deviation. A multiplication factor to correct
for this bias for normal distributions is given in Ta-
ble 4.1. For a sample size of 3, the estimate should
be increased by 13% to obtain the actual stan-
dard deviation.
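The factors in Table 4.1 are the reciprocal of the constant commonly written c4 = sqrt(2/(n-1)) Γ(n/2)/Γ((n-1)/2) for a normal sample of size n. A short R sketch reproduces the table:

```r
# Bias-correction factor 1/c4 for the sample standard deviation of a
# normal sample of size n (reproduces Table 4.1, after Bolch, 1968)
sd.correction <- function(n) {
  c4 <- sqrt(2 / (n - 1)) * gamma(n / 2) / gamma((n - 1) / 2)
  1 / c4
}
round(sapply(2:6, sd.correction), 3)
# 1.253 1.128 1.085 1.064 1.051
```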
Alternatively, one can also make use of the re-
sults of previous experiments to guesstimate the
new experiment’s standard deviation. However,
we then make the strong assumption that random
variation is the same in the new experiment.
5. Common Designs in Biological
Experimentation
And so it was ... borne in upon me that very often, when the most elaborate statistical refinements
possible could increase the precision by only a few percent, yet a different design involving little or no
additional experimental labour might increase the precision two-fold, or five-fold or even more
R. A. Fisher (1962)
There is a multitude of designs that can be con-
sidered when planning an experiment and some
have been employed more commonly than oth-
ers in the area of biological research. Unfortu-
nately, the literature about experimental designs
is littered with technical jargon, which makes its
understanding quite a challenge. To name a few,
there are: completely randomized designs, ran-
domized complete block designs, factorial designs,
split plot designs, Latin square designs, Greco-
Latin squares, Youden square designs, lattice de-
signs, Plackett-Burman designs, simplex designs,
Box-Behnken designs, etc.
It helps to find our way through this jungle of
designs by keeping in mind that the major prin-
ciple of experimental design is to provide a syn-
thetic approach to minimize bias and control variabil-
ity. Furthermore, as shown in Figure 5.1, we can
consider each of the specialized experimental de-
signs as integrating three different aspects of the
design (Hinkelmann and Kempthorne, 2008):
• the treatment design,
• the error-control design,
• the sampling & observation design.
The treatment design is concerned with which
treatments are to be included in the experiment
and is closely linked to the goals and aims of the
study. Should a negative or positive control be
incorporated in the experiment, or should both
Figure 5.1. The three aspects of the design determine its complexity and the required resources
be present? How many doses or concentrations
should be tested and at which level? Is the inter-
action of two treatment factors of interest or not?
The error-control design implements the strategies
that we learned in Section 4.8.2 to filter out differ-
ent sources of variability. The sampling & observa-
tion aspect of our experiment is about how ex-
perimental units are sampled from the population,
how and how many subsamples should be drawn,
etc.
The complexity and the required resources of a
study are determined by these three aspects of ex-
perimental design. The required resources of a
study, namely the number of experimental units, are
governed by the number of treatments, the num-
ber of blocks and the standard error. The more
treatments, or the more blocks, the more experi-
mental units are needed. The complexity of the
experiment is determined by the underlying sta-
tistical model of Figure 4.1. In particular the error-
control design defines the study’s complexity. The
randomization process is a major part of this error-
control design. A justified and rigorous estimation
of the standard error is only possible in a random-
ized experiment. In addition, randomization has
the advantage that it distributes the effects of un-
controlled variability randomly over the treatment
groups. Replication of experimental units should
be sufficient, such that an adequate number of de-
grees of freedom are available for estimating the
experiment’s precision (standard error). This pa-
rameter is related to the sampling & observation
aspect of the design.
These three aspects of experimental design pro-
vide a framework for classifying and comparing
the different types of experimental design that are
used in the life sciences. As we will see, each
of these designs has its advantages and disadvan-
tages.
5.1 Error-control designs
5.1.1 The completely randomized design
This is the most common and simplest possible
error-control design for comparative experiments.
Each experimental unit is randomly assigned to ex-
actly one treatment condition. This is often the
default design used by investigators who do not
really think about design problems. The advan-
tage of the completely randomized design is that it
is simple to implement, as experimental units are
simply randomized to the various treatments. The
obvious disadvantage is the lack of precision in the
comparisons among the treatments, which is based
on the variation between the experimental units.
In the following example of a completely ran-
domized design the investigators used randomiza-
tion, blinding, and individual housing of animals
to guarantee the absence of systematic error and
independence of experimental units.
Example 5.1. To investigate the interaction of
chronic treatment with drugs on the proliferation
of gastric epithelial cells in rats, an experiment was
set up in which two experimental drugs were com-
pared with their respective vehicles. A total of 40
rats were randomly divided into four groups of
ten animals each, using the MS Excel randomization pro-
cedure described in Appendix B. To guarantee in-
dependence of the experimental units, the animals
were kept in separate cages. Cages were distributed
over the racks according to their sequence number.
Blinding was accomplished by letting the sequence
number of each animal correspond to a given treat-
ment. One laboratory worker was familiar with the
codes and prepared the daily drug solutions. Treat-
ment codes were concealed from the rest of the labo-
ratory staff that was responsible for the daily treat-
ment administration and final histological evalua-
tion.
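Example 5.1 used an MS Excel procedure (Appendix B) for the randomization. As a sketch, the same completely randomized allocation can be generated in R; the group labels below are hypothetical.

```r
# Completely randomized allocation of 40 rats to four groups of ten,
# as in Example 5.1. Group labels are illustrative only.
set.seed(42)                              # reproducible allocation
groups <- rep(c("DrugA", "VehicleA", "DrugB", "VehicleB"), each = 10)
allocation <- data.frame(rat = 1:40, treatment = sample(groups))
table(allocation$treatment)               # ten rats per group
```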
Example 5.2. Another example of a completely
randomized design is about eliminating the bias
present in experiments using 96-well microtiter
plates. Burrows et al. (1984) already described the
presence of plate location effects in ELISA assays
carried out in 96-well microtiter plates. Similar
findings were reported by Levasseur et al. (1995)
and Faessel et al. (1999) who described the presence
of parabolic patterns (Figure 5.2) in cell growth ex-
periments carried out in 96-well microtiter plates.
They were not able to show conclusively the un-
derlying causes of this systematic error, which, as
shown in Figure 5.2, could be of considerable mag-
nitude. Therefore, they concluded that only by ran-
dom allocation of the treatments to the wells could
these systematic errors be avoided.
Figure 5.2. Presence of bias in 96-well microtiter plates (after
Levasseur et al. (1995))
To accomplish this, they developed an ingenious
method for randomizing treatments in 96-well mi-
crotiter plates. Figure 5.3 sketches this procedure.
Drugs are serially diluted into tubes in a 96-tube
rack. Next, a randomization map is generated us-
ing an Excel macro. The randomization map is then
taped to the top of an empty tube rack. The original
tubes are then transferred to the new rack by push-
ing the numbered tubes through the correspond-
ing numbered rack position. Using a multipipette
system, drug-containing medium is then transferred
from the tubes to the wells of a 96-well microtiter
plate. At the end of the assay, the random data file
generated by the plate reader is imported into the
Excel spreadsheet and automatically resorted.
The above example is also a paradigm of how ran-
domization can be used to eliminate systematic er-
rors by transforming them into random noise. This
is one way of dealing with bias in microtiter plates.
Alternative methods consist of choosing an appro-
priate experimental design such as a Latin square
design (see Section 5.1.4, page 33), or to construct
a statistical model that allows one to correct for the
row-column bias (Schlain et al., 2001; Straetemans
et al., 2005).
Figure 5.3. Scheme for the addition and dilution of drug-containing medium to tubes in a 96-tube rack, randomization
of the tubes, and then addition of drug-containing medium to
cell-containing wells of a 96-well microtiter plate (after Faessel
et al. (1999))
5.1.2 The randomized complete block de-
sign
The concept of blocking as a tool to increase effi-
ciency by enhancing the signal-to-noise ratio has
already been introduced in Section 4.8.2.3 (page
24). In a randomized complete block design a sin-
gle isolated extraneous source of variability (block)
closely related to the response is eliminated from
the comparisons between treatment groups. The
design is complete, since all treatments are applied
within each block. Consequently, treatments can be
compared with one another within the blocks. The
randomization procedure now randomizes treat-
ments separately within each block. The random-
ized complete block design is a very useful and re-
liable error-control design, since all treatment com-
parisons can be made within a block. When a
study is designed such that the number of exper-
imental units within each block and treatment is
equal, the design is called a balanced design. A
few examples will illustrate its use in the labora-
tory.
Example 5.3. In the completely randomized design
of Example 5.1 (page 28), the rats were individu-
ally housed in a rack consisting of 5 shelves of
8 cages each. On different shelves, rats are likely to be
Figure 5.4. Outline of a paired experiment on isolated cardiomyocytes. Cardiomyocytes of a single animal were isolated and seededin plastic Petri dishes. From the resulting five pairs of Petri dishes, one member was randomly assigned to drug treatment, whilethe remaining member received the vehicle.
exposed to variations in light intensity, tem-
perature, humidity, sounds, views, etc. The inves-
tigators were convinced that shelf height affected
the results. Therefore, they decided to switch to
a randomized complete block design in which the
blocks corresponded to the 5 shelves of the rack.
Within each block separately the animals were ran-
domly allocated to the treatments, such that in
each block each treatment condition occurred ex-
actly twice. This example also illustrates that, al-
though all treatments are present in each block in a
randomized complete block design, more than one
experimental unit per block can be allocated to a
treatment condition.
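The within-block randomization of Example 5.3 can be sketched in R as follows; each shelf (block) of eight cages gets an independent random permutation in which every treatment occurs exactly twice (treatment labels hypothetical).

```r
# Randomized complete block design: 5 shelves (blocks) of 8 cages,
# 4 treatments, each occurring twice per shelf. Labels are illustrative.
set.seed(7)
treatments <- c("DrugA", "VehicleA", "DrugB", "VehicleB")
design <- do.call(rbind, lapply(1:5, function(shelf) {
  data.frame(shelf = shelf, cage = 1:8,
             # separate randomization within each block
             treatment = sample(rep(treatments, each = 2)))
}))
table(design$shelf, design$treatment)   # 2 in every cell
```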
There are two main reasons for choosing a ran-
domized complete block design over a com-
pletely randomized design. Suppose there is an
extraneous factor that is strongly related to the out-
come of the experiment. It would be most unfor-
tunate if our randomization procedure yielded a
design in which there was a great imbalance on
this factor. If this were the case, the comparisons
between treatment groups would be confounded
with differences in this nuisance factor and be bi-
ased. The second main reason for a randomized
complete block design is that it can consid-
erably reduce the error variation in our experi-
ment, thereby making the comparisons more pre-
cise. The main objection to a randomized complete
block design is that it makes the strong assumption
that there is no interaction between the treatment vari-
able and the blocking characteristics, i.e. that the effect
of the treatments is the same among all blocks.
The basic idea behind blocking is to partition the
total set of experimental units into subsets (blocks)
that are as homogeneous as possible (Hinkelmann
and Kempthorne, 2008). Sometimes, this is accom-
plished by combining several extraneous sources
of variation into one aggregate blocking factor as
shown by the following example.
Example 5.4. In addition to the shelf height, the in-
vestigators also suspected that the body weight of
the animals might affect the results. Therefore, the
animals were numbered in order of increasing body
weight. The first eight animals were placed on the
top shelf and randomized to the four treatment con-
ditions, then the next eight animals were placed on
the second shelf and randomized, etc. The resulting
design was now simultaneously controlled for shelf
height and body weight. With regard to these two
characteristics, animals within a block were as much
alike as possible.
5.1.2.1 The paired design
Table 5.1. Results of an experiment using a paired design for test-
ing the effect of a drug on the number of viable cardiomyocytes
after calcium overload
Rat No. Vehicle Drug Drug - Vehicle
1 44 46 2
2 64 75 11
3 60 67 7
4 50 64 14
5 76 77 1
When only two treatments are compared, the
randomized complete block design can be simpli-
fied to a paired design.
Example 5.5. Isolated cardiomyocytes provide an
easy tool to assess the effect of drugs on calcium-
overload (Ver Donck et al., 1986). Figure 5.4 illus-
trates the experimental setting. Cardiomyocytes of
a single animal were isolated and seeded in plastic
Petri dishes. The Petri dishes were treated with
the experimental drug or with its solvent. After
Figure 5.5. Gain in efficiency induced by blocking illustrated in a paired design. In the left panel, the myocyte experiment is
considered as a completely randomized design in which the two samples largely overlap one another. In the right panel the lines
connect the data of the same animal and show a marked effect of the treatment.
a stabilization period the cells were exposed to a
stimulating substance (i.e. veratridine) and the per-
centage viable, i.e. rod-shaped, cardiomyocytes in
a dish was counted. Although comparison of the
treatment with the solvent control within a single
animal provides the best precision, it lacks external
validity. Therefore, a paired experiment with my-
ocytes from different animals and with animal as
blocking factor was carried out. From each animal
two Petri dishes containing exactly 100 cardiomy-
ocytes were prepared. From the resulting five pairs
of Petri dishes, one member was randomly assigned
to drug treatment, while the remaining member re-
ceived the vehicle. After stabilization and exposure
to the stimulus, the number of viable cardiomy-
ocytes in each Petri dish was counted. The result-
ing data are contained in Table 5.1 and displayed
in Figure 5.5.
There are 10 experimental units, since Petri
dishes can be independently assigned to vehicle or
drug. However, the statistical analysis should take
the particular structure of the experiment into ac-
count. More specifically, the pairing has imposed
restrictions on the randomization such that data
obtained from one animal cannot be freely inter-
changed with that from another animal. This is il-
lustrated in the right panel of Figure 5.5 by the lines
that connect the data from the same animal. It is
clear that for each pair the drug-treated Petri dish
yielded consistently a higher result than its vehicle
control counterpart. Since the different pairs (ani-
mals) are independent from one another, the mean
difference and its standard error can be calculated.
The mean difference is 7.0 with a standard error of
2.51.
5.1.2.2 Efficiency of the randomized complete
block design
Example 5.6. Suppose that in Example 5.5 the ex-
perimenter had not used blocking, i.e. con-
sider it as if he had used myocytes originating from
10 completely different animals. The 10 Petri dishes
would then be randomly distributed over the two
treatment groups and we would have been con-
fronted with a completely randomized design. As-
sume also that the results of this hypothetical ex-
periment were identical to those obtained in the ac-
tual paired experiment. As is illustrated in the left
panel of Figure 5.5, the two groups largely over-
lap one another. Since all experimental units are
now independent of one another, the effect of the
drug is evaluated by calculating the difference be-
tween the two mean values and comparing it with
its standard error1. Obviously, the mean difference
is the same as in the paired experiment. However,
1As already mentioned in Section 4.8.2.1, page 22, the standard error of the difference between two means is equal to σ√(2/n).
the standard error on the mean difference has risen
considerably from a value of 2.51 to 7.83, i.e. the
use of blocking induced a substantial increase in the
precision of the experiment1.
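The two standard errors quoted above can be reproduced from Table 5.1. The following R sketch analyzes the same ten dishes first as a paired design and then as if they had come from a completely randomized design:

```r
# Data of Table 5.1: number of viable cardiomyocytes per Petri dish
vehicle <- c(44, 64, 60, 50, 76)
drug    <- c(46, 75, 67, 64, 77)

# Paired design: standard error of the mean within-animal difference
d <- drug - vehicle
se.paired <- sd(d) / sqrt(length(d))          # approx. 2.51

# Completely randomized design: pooled standard error of the
# difference between the two group means
n   <- length(vehicle)
sp2 <- ((n - 1) * var(vehicle) + (n - 1) * var(drug)) / (2 * n - 2)
se.crd <- sqrt(sp2 * (2 / n))                 # approx. 7.83

c(mean.diff = mean(d), se.paired = se.paired, se.crd = se.crd)
```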
Examples 5.5 and 5.6 clearly demonstrate that
carrying out a paired experiment can enhance the
precision of the experiment con-
siderably, while the conclusions have the same va-
lidity as in a completely randomized experiment.
However, the forming of blocks of experimental
units is only successful when the criterion upon
which the pairing is based is related to the out-
come of the experiment. Using a characteristic that
does not have an important effect on the response
variable as blocking factor is worse than useless,
since the statistical analysis will lose a bit of power
by taking the blocking into account. This can be of
particular importance for small-sized experiments.
5.1.3 Incomplete block designs
In some circumstances, the block size is smaller
than the number of treatments and consequently
it is impossible to assign all treatments within each
of the blocks. When a certain comparison is of spe-
cific interest, such as comparison with control, it is
wise to include it in each of the blocks.
Balanced incomplete block designs allow all
pairwise comparisons of treatments with equal
precision using a block size that is less than the
number of treatments. To achieve this, the bal-
anced incomplete block design has to satisfy the
following conditions (Bate and Clark, 2014; Cox,
1958):
• each block contains the same number of
units;
• each treatment occurs the same number of
times in the design;
• every pair of treatments occurs together in
the same number of blocks.
Balanced incomplete block designs (BIBs) exist
only for certain combinations of the number of treat-
ments and number and size of blocks. Software
to construct incomplete block designs and to test
whether a candidate design satisfies the criteria for
being balanced is provided by the crossdes package
in R (Sailer, 2013).
Table 5.2. Balanced incomplete block design for Example 5.7 with
treatments A, B, C and D
Lamb Sibling lamb pair
1 2 3 4 5 6
First A A A B B C
Second B C D C D D
Example 5.7. (Anderson and McLean, 1974; Bate
and Clark, 2014). An experiment was carried out
to assess the effect that vitamin A and a protein di-
etary supplement have on the weight gain of lambs
over a 2-month period. There were four treatments,
labeled A, B, C and D in the study, corresponding
to low dose and high dose of vitamin A combined
with low dose and high dose of protein. Three repli-
cates per treatment were considered sufficient and
blocking was carried out using pairs of sibling lambs, so
six pairs of siblings were used. With the number
of treatments restricted to two per block, the bal-
anced incomplete block design shown in Table 5.2
was used.
A possible layout of the experiment is obtained
in R by:
> require(crossdes)
> # find an incomplete block design
> inc.bl<-find.BIB(4,6,2)
> # check if it is a valid BIB
> isGYD(inc.bl)
[1] The design is a balanced incomplete block design w.r.t. rows.
> # if OK make a suitable display
> # replace numbers by treatment codes
> aBIB<-LETTERS[inc.bl]
> dim(aBIB)<-dim(inc.bl)
> # transpose displays matrix as in text
> t(aBIB)
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] "B" "A" "A" "C" "B" "A"
[2,] "D" "C" "B" "D" "C" "D"
1
Kutner et al. (2004) provide a method to compare designs on the basis of their relative efficiency. For the design inExample 5.5, the calculations show that this paired design is 7.7 times more efficient than the completely randomized designin Example 5.6. In other words, about 8 times as many replications per treatment with a completely randomized designare required to achieve the same results.
It is clear that the BIB-design generated in the R-
session is equivalent to the design shown in Table
5.2.
When looking for a suitable balanced incom-
plete block design, it can happen that if we add one
or more additional treatments, a more suitable de-
sign is found. Alternatively, omitting one or more
treatments can yield a more efficient design (Cox,
1958).
5.1.4 The Latin square design
The Latin square design is an extension of the ran-
domized complete block design, but now blocking
is done simultaneously on two characteristics that
affect the response variable.
Example 5.8. (Hu et al., 2015). A total of 16
large (3 m height, 6.7 m² area) hexagonal ecosystem
chambers were arranged in a Latin square with four
repetition chambers for each treatment. The four
climate treatments were (i) control (CO, with ambi-
ent conditions), (ii) increased air temperature = air
warming (AW), (iii) reduced irrigation = drought
(D), and (iv) air warming and drought (AWD).
In the above example a 4 × 4 Latin square was
used to control for the location in the ecosystem
chambers. Another example from laboratory prac-
tice concerns the simultaneous control, without
loss of precision, of the row and column effects in
microtiter plates (Burrows et al., 1984). Still an-
other example is about experiments on neuronal
protection, where a pair of animals was tested each
day and the investigator expected a systematic dif-
ference not only between the pairs but also be-
tween the animal tested in the morning and the
one tested in the afternoon.
Table 5.3. Arrangement for a 4×4 Latin square design controllingfor column and row effects. The letters A-D indicate the fourtreatments
Row Column
1 2 3 4
1 B A D C
2 C B A D
3 A D C B
4 D C B A
In a Latin square design the k treatments are ar-
ranged in a k × k square such as in Table 5.3. Each
of the four treatments A, B, C, and D occurs exactly
once in each row and exactly once in each column.
The Latin square design is categorized as a two-way
block error-control design.
The main advantage of the Latin square design
is that it simultaneously balances out two sources
of error. The disadvantage is the strong assump-
tion that there are no interactions between the
blocking variables or between the treatment vari-
able and blocking variables. In addition Latin
square designs are limited by the fact that the num-
ber of treatments, number of rows, and number of
columns must all be equal. Fortunately, there are
arrangements that do not have this limitation (Cox,
1958; Hinkelmann and Kempthorne, 2008).
In a k × k Latin square, only k experimental
units are assigned to each treatment group. How-
ever, it may happen that more experimental units
are required to obtain an adequate precision. The
Latin square can then be replicated and several
squares can be used to obtain the necessary sam-
ple size. In doing this, there are two possibilities
to consider. Either one stacks the squares on top
of each other and keeps them as separate indepen-
dent squares, or one completely randomizes the
order of the rows (or columns) of the design. In
general, keeping the squares separate is not a good
idea and leads to less precise estimation and loss
of degrees of freedom, especially in the case of a
2 × 2 Latin square. It is only when there is a rea-
son to believe that the column (or row) effects are
different between the squares, that keeping them
separate makes sense.
In the case of 96-well microplates consisting of
8 rows and 12 columns, scientists often omit the
edge rows and columns which are known to yield
extreme results. The remaining 6 rows and 10
columns are then used for experimental purposes.
However, using a Latin square design that appro-
priately corrects for row and column effects, the
complete 96-well microplate can be used, as is il-
lustrated by the following example.
Table 5.4. A balanced lattice square arrangement of 16 treatments with 5 replicate squares on a single microtiter plate with eight
lettered rows and twelve numbered columns (after Burrows et al., 1984)

      1   2   3   4   5   6   7   8   9  10  11  12
A    11   7   3  15  11  10  12   9  15   5  12   2
B    12   8   4  16   6   7   5   8   9   3  14   8
C     9   5   1  13   1   4   2   3   4  10   7  13
D    10   6   2  14  16  13  15  14   6  16   1  11
E    12  14   7   1   5  11   4  14
F     3   5  16  10   1  15   8  10
G    13  11   2   8   9   7  16   2
H     6   4   9  15  13   3  12   6
Example 5.9. Aoki et al. (2014) describe their assay
as follows: Samples diluted 1:1,600 and serially-
diluted standards (100 µL) were placed in wells.
Samples were arranged in triads, each containing
samples from a case and two-matched controls. In
order to minimize measurement error due to a spa-
tial gradient in binding efficiency within plate, mi-
croplate wells were grouped into 6 blocks of 16 (4 × 4)
wells and each set of 3 samples was placed in the
same block using a version of Latin square de-
sign along with standards. Placement patterns were
changed across blocks such that the influence of the
spatial gradient on the signal-standard level rela-
tionship was minimized. The authors do not ex-
plicitly state what they mean by version of Latin
square . Presumably lattice square designs were
used. The interested reader is referred to the litera-
ture (Cox, 1958) for this more advanced topic, but
the idea is clear that this systematic arrangement
allows for unbiased comparisons. By “placement
patterns were changed across blocks” they presum-
ably mean randomization of the rows and columns
of the individual Latin squares.
In microplate experiments, variations on Latin
square designs such as Latin rectangles (Hinkel-
mann and Kempthorne, 2008) and incomplete
Latin squares such as Youden squares and lattice
squares (Cox, 1958) can be of use. Burrows et al.
(1984) investigated the properties of a number of
these designs for quantitative ELISA. One of the
designs they considered (see Table 5.4) was a bal-
anced lattice square design that allows one to compare
16 treatments with one another within one plate,
with 5 replicates for each treatment. Other system-
atic arrangements specifically for estimating rela-
tive potencies in the presence of microplate loca-tion effects have also been proposed (Schlain et al.,
2001).
The principle of simultaneously controlling for
row and column effects by means of Latin square
designs is not restricted to 96-well microtiter
plates, but also applies to their larger variants, like
the 384- and 1536-well plates.
Random Latin squares can be produced using the R package magic (Hankin, 2005). A possible layout of the experiment in Example 5.8 is obtained by:
> require(magic)
> trts <- c("CO", "AW", "D", "AWD")   # treatment labels
> ii <- rlatin(4)                     # random 4 x 4 Latin square of 1..4
> matrix(trts[ii], nrow = 4, ncol = 4, byrow = FALSE)
[,1] [,2] [,3] [,4]
[1,] "D" "AW" "AWD" "CO"
[2,] "AWD" "CO" "D" "AW"
[3,] "AW" "D" "CO" "AWD"
[4,] "CO" "AWD" "AW" "D"
5.2 Treatment designs
5.2.1 One-way layout
The examples that we discussed up to now (apart from Examples 5.7 and 5.8) all considered the treatment aspect of the design as consisting of a single factor. This factor can represent the presence or absence of a single condition, or several different related treatment conditions (e.g. Drug A, Drug B, Drug C). The treatment aspect of these designs is referred to as a single-factor or one-way layout.
5.2.2 Factorial designs
In some types of experimental work, such as in Ex-
ample 5.7 (page 32), it can be of interest to assess
the joint effect of two or more factors, e.g. high
and low dose of Vitamin A combined with high
and low dose of protein. This is a typical example of a 2 × 2 full factorial design, the simplest and most frequently used factorial treatment design. In this
design, the factors and all combinations (hence full)
of the levels of the factors are studied. The factorial
treatment design of this example allows estimating
the main effects of the individual treatments, as well
as their interaction effect, i.e. the deviation from ad-
ditivity of their joint effect. We will use the 2×2 full
factorial design to explore the concepts of factorial
designs and of statistical interaction.
Example 5.10. (Bate and Clark, 2014) A study was
conducted to assess whether the serum chemokines
JE and KC could be used as markers of atheroscle-
rosis development in mice (Parkin et al., 2004). Two
strains of apolipoprotein-E-deficient (apoE-/-) mice,
C3H apoE-/- and C57BL apoE-/- were used in the
study. These mice were fed either a normal diet or a diet containing cholesterol (the Western diet).
After 12 weeks the animals were sacrificed and their
atherosclerotic lesion area was determined.
Figure 5.6. Plot of mean lesion area for the case where there is no interaction between strains and diets.
The study design consisted of two categorical
factors, Strain and Diet. The strain factor consisted
of two levels, C3H apoE-/- and C57BL apoE-/-. The
diet factor also contained two levels, the normal ro-
dent diet and the Western diet. In total there were
four combinations of factor levels:
1. C3H apoE-/- + normal diet
2. C3H apoE-/- + Western diet
3. C57BL apoE-/- + normal diet
4. C57BL apoE-/- + Western diet
Let us consider some possible outcomes of this
experiment.
No interaction When there is no interaction be-
tween strains and diets, the difference between the
two diets is the same, irrespective of the mouse
strain. The lines connecting the mean values of the
two diets are parallel to one another, as is shown in
Figure 5.6.
There is an overall effect of diet, animals fed
with the Western diet show larger lesions, and this
effect is the same in both strains. There is also an
overall effect of the strain. Lesions are larger in the
C3H apoE-/- than in the C57BL apoE-/- strain and
this difference is the same for both diets.
Since the difference between the diets is the same regardless of which strain of mice is fed them, it is appropriate to average the results from each
diet across both strains and make a single compari-
son between the diets rather than making the com-
parison separately for each strain. In doing this,
the external validity of the conclusions is broad-
ened since they apply to both strains. In addi-
tion, the comparison between the two strains can
be made, irrespective of the diet the animals are
receiving. C3H apoE-/- have larger lesions than
C57BL apoE-/- regardless of the diet. In both cases
the comparisons make use of all the experimental
units. This makes a factorial design a highly effi-
cient design, since all the animals are used to test
simultaneously two hypotheses.
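This arithmetic can be sketched directly in R. The cell means below are hypothetical, chosen only to mimic the parallel-lines situation of Figure 5.6; the main effects and the interaction (the deviation from additivity) are simple contrasts of the four cell means:

```r
# Hypothetical cell means (lesion area) for the 2 x 2 factorial:
# rows = strain, columns = diet
m <- matrix(c(3500, 5000,    # C3H:   normal, Western
              2500, 4000),   # C57BL: normal, Western
            nrow = 2, byrow = TRUE,
            dimnames = list(c("C3H", "C57BL"), c("Normal", "Western")))

diet.effect   <- mean(m[, "Western"]) - mean(m[, "Normal"])  # 1500: overall diet effect
strain.effect <- mean(m["C3H", ]) - mean(m["C57BL", ])       # 1000: overall strain effect
interaction   <- (m["C3H", "Western"] - m["C3H", "Normal"]) -
                 (m["C57BL", "Western"] - m["C57BL", "Normal"])  # 0: no deviation
                                                                 # from additivity
```

With parallel profiles the interaction contrast is exactly zero; repeating the calculation with cell means resembling Figures 5.7 or 5.8 yields a nonzero interaction.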
Figure 5.7. Plot of mean lesion area for the case where there is a moderate interaction between strains and diets.
Moderate interaction When there is a moderate in-
teraction, the direction of the effect of the first fac-
tor is the same regardless of the level of the sec-
ond factor. However, the size of the effect varies
with the level of the second factor. This is exem-
plified by Figure 5.7, where the lines are not paral-
lel, though both indicate an increase in lesion size
for the Western diet as compared to the normal
diet. This increase is more pronounced in the C3H
apoE-/- strain than in the C57BL apoE-/- animals.
Hence, the C3H apoE-/- strain is more sensitive to
changes in diet.
Strong interaction The effect of the first factor can
also be entirely dependent on the level of the sec-
ond factor. This is illustrated in Figure 5.8, where
feeding the Western diet to the C3H apoE-/- has
little effect on lesion size, whereas the effect on
C57BL apoE-/- mice is substantial. This is an exam-
ple of strong interaction. The Western diet always
results in bigger lesions, but the effect in the C57BL
apoE-/- strain is much more pronounced than in
the C3H apoE-/- mice. Furthermore, when fed with
normal diet the C3H apoE-/- mice show a larger lesion area than the C57BL apoE-/- strain. However, when the animals receive the Western diet the opposite is true.
Figure 5.8. Plot of mean lesion area for the case where there is a strong interaction between strains and diets.
The factorial treatment design can be combined with the error control designs that we encountered
in Section 5.1. Example 5.7 (page 32) illustrates a
2 × 2 factorial combined with a balanced incom-
plete block design, whereas a 2× 2 factorial design
is used in connection with a Latin square error con-
trol design in Example 5.8 (page 33).
Factorial designs are widely used in plant re-
search, process optimization, etc. Recent appli-
cations in the biomedical field include cDNA mi-
croarray experiments.
Example 5.11. Glonek and Solomon (2004) describe
an example of a 2 × 2 factorial design using cDNA
microarrays and, in doing this, highlight the impor-
tance of the interaction effect in this context:
A 2 × 2 factorial experiment was conducted to compare the two mutants at
times zero hours and 24 hours; it was
anticipated that measuring changes over
time would distinguish genes involved
in promoting or blocking differentiation,
or that suppress or enhance growth, as
genes potentially involved in leukaemia.
We are interested in genes differentially expressed between the two samples, i.e. in the sample main effect, but more particularly in those genes which are differentially expressed in the two samples
at time 24 hours but not at time zero
hours. This is the interaction of sample
and time.
In the case of cDNA experiments, the design
of a 2 × 2 factorial is not as straightforward as
in other application areas. More information on
how to set up factorial designs involving microarray experiments can be found in the specialized literature (Glonek and Solomon, 2004; Banerjee and Mukerjee, 2007).
It happens that researchers have actually
planned a factorial experiment, but in the design
and analysis phase failed to recognize it as such.
In that case they do not take full advantage of the
factorial structure for interpreting treatment effects
and often analyze and interpret the experiment with an incorrect procedure (Nieuwenhuis et al.,
2011). We will come back to that in Section 9.2.3.
When the two factors consist of concentrations
or dosages of drugs, researchers tend to confuse
the statistical concept of interaction with the phar-
macological concept of synergism. However, the
requirements for two drugs to be synergistic with
each other are much more stringent than just the
superadditivity associated with statistical interac-
tion (Greco et al., 1995; Tallarida, 2001). It is easy
to demonstrate that, due to the nonlinearity of the
log-dose response relationship, superadditive ef-
fects will always be present for the combination,
since the total drug dosage has increased, thus im-
plying that a drug could be synergistic with itself.
In connection with the 96-well plates and the pres-
ence of the plate location effects that we saw ear-
lier, Straetemans et al. (2005) provide a statistical
method for assessing synergism or antagonism.
Higher dimensional factorial designs Although
our discussion here is restricted to the 2 × 2 fac-
torial, more factors and more levels can be used.
However, the number of experimental units that
are involved may become rather large. For in-
stance, an experiment with three factors at three
levels involves 3 × 3 × 3 = 27 different treatment
combinations. When there are at least two experi-
mental units in each treatment group, then 54 units
will be required. Such an experiment may soon
be too large to be feasible. In addition, three fac-
tor and higher order interactions are difficult to in-
terpret (Clewer and Scarisbrick, 2001). Therefore,
when a large number of factors is involved, the presence of higher-order interactions is often neglected, so that only a fraction of all treatment combinations needs to be run. Such high-dimensional fractional factorial designs are mostly used in process optimization (Selwyn, 1996).
5.3 More complex designs
We will now consider some specialized experimental designs consisting of a somewhat more complex error-control design that is intertwined
with a factorial treatment design.
5.3.1 The split-plot and strip-plot designs
These types of design incorporate subsampling and allow comparisons among different treatments at two or more sampling levels. The split-plot design allows assessment of the effect of two independent factors using different experimental units, as is illustrated by the following examples.
Figure 5.9. Outline of the split-plot experiment of Example 5.12. Cages each containing two mice were assigned at random to a number of dietary treatments, and the color-marked mice within each cage were randomly selected to receive one of two vitamin treatments by injection.
Example 5.12. An example of a split-plot design is
the following hypothetical experiment on diets and
vitamins (see Figure 5.9). Cages each containing
two mice were assigned at random to a number of
dietary treatments (i.e. cage was the experimen-
tal unit for comparing diets), and the color-marked
mice within the cage were randomly selected to re-
ceive one of two vitamin treatments by injection
(i.e. mice were the experimental units for the vita-
min effect).
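The two-stage randomization of such a split-plot experiment can be sketched in R. The diet and vitamin labels below, and the choice of six cages, are hypothetical:

```r
# Split-plot randomization sketch for an experiment like Example 5.12
# (diet and vitamin labels are hypothetical)
set.seed(1)                                   # for a reproducible layout
diets <- rep(c("D1", "D2", "D3"), each = 2)   # three diets, two cages per diet
cage.diet <- sample(diets)                    # whole plots: randomize diets
                                              # to the 6 cages
vitamins <- t(sapply(1:6, function(cage)      # subplots: within each cage,
  sample(c("V1", "V2"))))                     # randomize the two color-marked
                                              # mice to the vitamins
cbind(cage = 1:6, diet = cage.diet,
      black.mouse = vitamins[, 1], white.mouse = vitamins[, 2])
```

Note that the randomization is carried out at two levels, matching the two kinds of experimental units: cages for the diet factor and individual mice for the vitamin factor.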
Example 5.13. Another example is about the effects
of temperature and growth medium on yeast growth
rate (Ruxton and Colegrave, 2003). In this experi-
ment Petri dishes are placed inside constant temper-
ature incubators (see Figure 5.10). Within each in-
cubator growth media are randomly assigned to the
individual Petri dishes. Temperature is then consid-
ered as the main-plot factor and growth medium the subplot factor. The experiment has to be repeated
using several incubators for each temperature.
Figure 5.10. Outline of the split-plot experiment of Example 5.13. Six incubators were randomly assigned to 3 temperature levels in duplicate. In each incubator 8 Petri dishes were placed. Four growth media were randomly applied to the Petri dishes.
The term split-plot originates from agricultural
research where fields are randomly assigned to dif-
ferent levels of a primary factor and smaller ar-
eas within the fields are randomly assigned to one
level of another secondary factor. Actually, the
split-plot design can be considered as two randomized complete block designs superimposed upon one another (Hinkelmann and Kempthorne, 2008). The split-plot design combines a two-way crossed (factorial) treatment design with a split-plot error-control design.
A special case of the split-plot design is called the strip-plot or split-block design. The strip-plot
design differs from the split-plot design by apply-
ing a special kind of randomization. In the strip-
plot or strip-block design, both factors are applied
to whole-plots which are placed “orthogonal” to
each other.
Figure 5.11. Schematic representation of a strip-plot experiment. Factor A is applied to whole plots in the horizontal direction at levels a1, a2, a3, a4, and a5. Factor B is also applied to whole plots at levels b1, b2, and b3.
Schematically, this is represented as in Fig-
ure 5.11. There are two factors A and B that are
applied to the two types of whole plots. As in Fig-
ure 5.11, both plots are placed orthogonal upon
one another. Although strip plot designs were
originally developed for agricultural field trials,
the next example shows that these designs have
found a place in the modern laboratory.
Example 5.14. Casella (2008) and Lansky (2002) describe strip-plot designs that accommodate serial dilutions in 96-well microtiter plates. In this case,
the 8 rows of the microplate are randomly assigned
to the samples (e.g. reference and three test sam-
ples in duplicate) and the columns are assigned to
the serial dilution level (dose). The dilution level is
fixed for a given column, such that all samples in
a column have the same serial dilution level, which
makes it a strip-plot design.
5.3.2 The repeated measures design
Figure 5.12. A typical repeated measures design. Animals are randomized to different treatment groups, and the variable of interest (e.g. blood pressure) is measured at the start of the experiment and at different time points following treatment application.
The repeated measures design is a special case of the split-plot and more specifically of the strip-
plot designs. In a repeated measures design, we
typically take multiple measurements on a subject
over time. If any treatment is applied to the sub-
jects, they immediately become the whole plots
and Time is the subplot factor. A typical exper-
imental set-up is displayed in Figure 5.12 where
two groups of animals are randomized over 2 (or
more) treatment groups and the variable of inter-
est is measured just before and at several timepoints following treatment application. The ma-
jor disadvantage of repeated measures designs is
the presence of carry-over effects by which the re-
sults obtained for a treatment are influenced by
the previous treatment. In addition, any confound-
ing of the treatment effect with time, as was the
case in the use of self-controls in Section 4.8.1.1,
must be avoided. Therefore, as in the example of Figure 5.12, a parallel control group, which protects against time-related bias, must always be included.
Table 5.5. Bioassay experiment of Example 5.14. Row and column indicators refer to the conventional coding of 96-well plates. The four samples (A, B, C, D) are applied in duplicate to the rows, and the serial dilution level (dose) is applied to an entire column. A1/1 in a cell means sample A, first replicate, at dilution level 1; B2/3 the second replicate of sample B at dilution 3, etc.
   1     2     3     4     5     6     7     8     9     10    11    12
A  B1/2  B1/8  B1/10 B1/1  B1/11 B1/3  B1/12 B1/7  B1/4  B1/5  B1/6  B1/9
B  D2/2  D2/8  D2/10 D2/1  D2/11 D2/3  D2/12 D2/7  D2/4  D2/5  D2/6  D2/9
C  B2/2  B2/8  B2/10 B2/1  B2/11 B2/3  B2/12 B2/7  B2/4  B2/5  B2/6  B2/9
D  C1/2  C1/8  C1/10 C1/1  C1/11 C1/3  C1/12 C1/7  C1/4  C1/5  C1/6  C1/9
E  A1/2  A1/8  A1/10 A1/1  A1/11 A1/3  A1/12 A1/7  A1/4  A1/5  A1/6  A1/9
F  A2/2  A2/8  A2/10 A2/1  A2/11 A2/3  A2/12 A2/7  A2/4  A2/5  A2/6  A2/9
G  C2/2  C2/8  C2/10 C2/1  C2/11 C2/3  C2/12 C2/7  C2/4  C2/5  C2/6  C2/9
H  D1/2  D1/8  D1/10 D1/1  D1/11 D1/3  D1/12 D1/7  D1/4  D1/5  D1/6  D1/9
6. The Required Number of Replicates -
Sample Size
”Data, data, data. I cannot make bricks without clay.”
Sherlock Holmes
The Adventure of the Copper Beeches, A.C. Doyle.
6.1 Determining sample size is a risk-cost assessment
Replication is the basis of all experimental design
and a natural question that arises in each study is
how many replicates are required. The more repli-
cates, the more confidence we have in our conclu-
sions. Therefore, we would prefer to carry out our
experiment on a sample that is as large as possi-
ble. However, increasing the number of replicates
incurs a rise in cost. Thus, the answer to how large
an experiment should be is that it should be just
big enough to give confidence that any biologically
meaningful effect that exists can be detected.
6.2 The context of biomedical ex-
periments
The estimation of the appropriate size of the experiment is straightforward and depends on the statistical context, the assumptions made, and the study specifications. Context and specifications in turn depend on the study objectives and the design of the experiment.

In practice, the most frequently encountered contexts in statistical inference are point estimation, interval estimation, and hypothesis testing, of which hypothesis testing is definitely the most important in biomedical studies.
6.3 The hypothesis testing con-
text - the population model
In the hypothesis testing context2 one defines a
null hypothesis and, for the purpose of sample size
estimation, an alternative hypothesis of interest.
The null hypothesis will often be that the response
variable does not really depend on the treatment
condition. For example, one may state as a null hy-
pothesis that the population means of a particular
measurement are equal under two or more differ-
ent treatment conditions and that any differences
found can be attributed to chance.
At the end of the study when the data are an-
alyzed (see Section 7.3), we will either accept or
reject the null hypothesis in favor of the alterna-
tive hypothesis. As is indicated in Table 6.1, there
are four possible outcomes at the end of the exper-
iment: When the null hypothesis is true and we
failed to reject it, we have made the correct deci-
sion. This is also the case when the null hypothesis
is false and we did reject it. However, there are two decisions that are erroneous. If the null hypothesis
2 The hypothesis testing context is also known in statistics as the Neyman-Pearson system.
Table 6.1. The decision process in hypothesis testing

                                          State of Nature
Decision made                  Null hypothesis true      Alternative hypothesis true
Do not reject null hypothesis  Correct decision (1 − α)  False negative (β)
Reject null hypothesis         False positive (α)        Correct decision (1 − β)
is true and we incorrectly rejected it, then we made
a false positive decision. Conversely, if the alternative
hypothesis is true (i.e. the null hypothesis is false)
and we failed to reject the null hypothesis we have
made a false negative decision. In statistics, a false
positive decision is also referred to as a type I error
and a false negative decision as a type II error.
The basis of sample size calculation is formed by
specifying an allowable rate of false positives and
an allowable rate of false negatives for a particu-
lar alternative hypothesis and then to estimate a
sample size just large enough, so that these low er-
ror rates can be achieved. The allowable rate for
false positives is called the level of significance or al-
pha level and is usually set at values of 0.01, 0.05, or
0.10. The false negative rate depends on the pos-
tulated alternative hypothesis and is usually de-
scribed by its complement, i.e. the probability of
rejecting the null hypothesis when the alternative
hypothesis holds. This is called the power of the
statistical hypothesis test. Power levels are usually
expressed as percentages and values of 80% or 90%
are standard in sample size calculations.
Significance level and power are already two of
the four major determinants of the sample size re-
quired for hypothesis testing. The remaining two
are the inherent variability in the study parame-
ter of interest and the size of the difference to be
detected in the postulated alternative hypothesis.
Other key factors that determine the sample size
are the number of treatments and the number of
blocks used in the experimental design.
When the significance level decreases or the
power increases, the required sample size will be-
come larger. Similarly, when the variability is
larger or the difference to be detected smaller, the
required sample size will also become larger. Con-
versely, when the difference to be detected is large
or variability low, the required sample size will be
small. It is convenient, for quantitative data, to ex-
press the difference in means as effect size by divid-
ing it by the standard deviation1. The effect size
then takes both the difference and inherent vari-
ability into account. Cohen (1988) argues that ef-
fect sizes of 0.2, 0.5 and 0.8 can be regarded respec-
tively as small, medium and large. However, in
basic biomedical research the size of the effect that
one is interested in, can be substantially larger.
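As a small illustration in R, using the summary statistics that appear in Example 6.1 below (two groups with standard deviations of about 12.5 and a difference of 10 myocytes):

```r
# Effect size: difference in means divided by the (pooled) standard deviation
# (numbers taken from the cardiomyocyte experiment of Example 6.1)
s1 <- 12.5; s2 <- 12.5          # the two group standard deviations
sp <- sqrt((s1^2 + s2^2) / 2)   # pooled sd (see footnote)
delta <- 10 / sp                # difference of 10 myocytes -> effect size 0.8
```

The resulting effect size of 0.8 corresponds to a "large" effect by Cohen's criteria.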
6.4 Sample size calculations
6.4.1 Power analysis computations
Now that we are familiar with the concepts of hy-
pothesis testing and the determinants of sample
size, we can proceed with the actual calculations.
There is a significant amount of free software avail-
able to make elementary sample size calculations.
In particular, there is the R-package pwr (Cham-
pely, 2009).
Example 6.1. Consider the completely randomized experiment about cardiomyocytes discussed in Example 5.6 (page 31). The standard deviations of the two groups are each about 12.5. A large effect of 0.8 in this case corresponds to a difference between both groups of 10 myocytes. Let's assume that we wish to plan a new experiment to detect such a difference with a power of 80% and we want to reject the null hypothesis of no difference at a level of significance of 0.05, whatever the direction of the difference between the two samples (i.e. a two-sided
1 When comparing mean values from two independent groups, the standard deviation for calculating the effect size can be taken from either group when the variances of the two groups are homogeneous, or alternatively a pooled standard deviation can be calculated as sp = √((s1² + s2²)/2).
1 See Section 7.3 for a discussion of one-sided and two-sided tests.
test1). The calculations are carried out in R in a single line of code and show that 26 experimental units are required in each of the two treatment groups:
> require(pwr)
> pwr.t.test(d=0.8, power=0.8, sig.level=0.05,
+            type="two.sample",
+            alternative="two.sided")
Two-sample t test power calculation
n = 25.52457
d = 0.8
sig.level = 0.05
power = 0.8
alternative = two.sided
NOTE: n is number in *each* group
Alternatively, we could ask what would be the power of an experiment with, say, 5 animals per treatment group:
> pwr.t.test(d=0.8,n=5,sig.level=0.05,
+ type="two.sample",
+ alternative="two.sided")
Two-sample t test power calculation
n = 5
d = 0.8
sig.level = 0.05
power = 0.2007395
alternative = two.sided
NOTE: n is number in *each* group
For a completely randomized experiment with 2
groups of 5 animals, as described in Example 5.6,
the power to detect a difference of 10 myocytes be-
tween treatment groups (∆ = 0.8) is only 20%.
A quick and dirty method for sample size calculation against a two-sided alternative in a two-group comparison, with a power of 0.8 and a Type I error of 0.05, is provided by Lehr's equation (Lehr, 1992; Van Belle, 2008):

n ≈ 16/∆²

where ∆ represents the effect size and n stands for the required sample size in each treatment group. For the above example, the equation results in 16/0.64 = 25 animals per treatment group. The numerator of Lehr's equation depends on the desired power. Alternative values for the numerator are 8 and 21 for powers of 50% and 90%, respectively.
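Lehr's rule is easily wrapped in a small helper function; the numerator values are those quoted above:

```r
# Lehr's quick formula: n per group = numerator / Delta^2, rounded up.
# numerator = 8, 16, or 21 for powers of 50%, 80%, and 90%
# (alpha = 0.05, two-sided)
lehr.n <- function(delta, numerator = 16) ceiling(numerator / delta^2)

lehr.n(0.8)                  # 25 animals per group, as computed above
lehr.n(0.8, numerator = 21)  # 33 per group for 90% power
```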
6.4.2 Mead’s resource requirement equa-
tion
There are occasions when it is difficult to use a
power analysis, because there is no information on
the inherent variability (i.e. standard deviation)
and/or because the effect size of interest is dif-
ficult to specify. An alternative, quick and dirty
method for approximate sample size determina-
tion was proposed by Mead (1988). The method
is appropriate for comparative experiments which
can be analyzed using analysis of variance (Grafen
and Hails, 2002; Kutner et al., 2004), such as:
• Exploratory experiments
• Complex biological experiments with several
factors and treatments
• Any experiment where the power analysis
method is not possible or practicable.
The method depends on the law of diminishing re-
turns: adding one experimental unit to a small ex-
periment gives good returns, while adding it to a
large experiment does not do so. It has been used
by statisticians for decades, but has been explicitly justified by Mead (1988). An appropriate sample
size can be roughly determined by the number of
degrees of freedom for the error term in the analy-
sis of variance (ANOVA) or t test given by the for-
mula:
E = N − T − B

where E, N, T, and B are the error, total, treatment, and block degrees of freedom (number of occurrences or levels minus 1) in the ANOVA. In order to ob-
tain a good estimate of error it is necessary to have
at least 10 degrees of freedom for E, and many
statisticians would take 12 or 15 degrees of free-
dom as their preferred lower limit. On the other
hand, if E is allowed to be large, say greater than
20, then the experimenter is wasting resources. It
is recommended that, in a non-blocked design, E should be between ten and twenty.
Example 6.2. Suppose an experiment is planned
with four treatments and eight animals per group (32 rats total). In this case N=31, B=0 (no blocking), T=3, hence E=28. This experiment is a bit
Figure 6.1. Power curves for a two-group comparison to detect a difference of 1 with a two-sided t-test at significance level α = 0.05, as a function of the number of subsamples m. Lines are drawn for different numbers of experimental units n in each group. For both the left and right panel the between sample standard deviation (σn) is 1, while the within sample standard deviation (σm) is 1 in the left panel and 2 in the right panel. The dots connected by the dashed line indicate where the total number of subsamples 2 × n × m equals 192. The vertical line indicates an upper bound to the useful number of subsamples of m = 4(σm²/σn²).
too large, and six animals per group might be more
appropriate (23 - 3 = 20).
There is one problem with this simple equation. It appears as though blocking is bad, because
it reduces the error degrees of freedom. If in the
above example, the experiment would be done in
eight blocks, then N = 31, B = 7, T = 3 and
E = 31 − 7 − 3 = 21 instead of 28. However,
blocking nearly always reduces the inherent vari-
ability, which more than compensates for the de-
crease in the error degrees of freedom, unless the
experiment is very small and the blocking criterion
was not well related to the response. Therefore, when blocking is present and the error degrees of
freedom is not less than about 6, the experiment is
probably still of an adequate size.
Example 6.3. If we consider again in this context the paired experiment of Example 5.5 (page 30), we
have N = 9, B = 4, T = 1. Hence, E = 9−4−1 = 4.
Obviously the sample size of 10 experimental units
was too small to allow an adequate estimate of the error. At least 2 experimental units should be
added.
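The bookkeeping of the resource equation can be sketched as a small R function (the counts of units, treatments, and blocks are converted to degrees of freedom internally):

```r
# Mead's resource equation: error df E = N - T - B,
# where N, T, B are total, treatment, and block degrees of freedom
mead.E <- function(units, treatments, blocks = 1) {
  (units - 1) - (treatments - 1) - (blocks - 1)
}

mead.E(32, 4)              # 28: Example 6.2, a bit too large
mead.E(24, 4)              # 20: six animals per group
mead.E(10, 2, blocks = 5)  # 4: the paired experiment of Example 6.3
```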
6.5 How many subsamples
In Section 4.8.2.2 we defined the standard error of the experiment when subsamples are present as:

√( (2/n)(σn² + σm²/m) )
where n and m are the number of experimental
units and subsamples and σn and σm the between
sample and within sample standard deviation. Us-
ing this expression, we can establish the power for
different configurations of an experiment.
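For example, the power for a given configuration can be sketched in R with the pwr package; this is a simplification that assumes the analysis is a two-sided t-test on the n unit means per group, whose variance is σn² + σm²/m:

```r
require(pwr)  # Champely (2009)
# Power to detect a difference of 1 between two groups of n experimental
# units, each measured on m subsamples; analysis on the unit means
subsample.power <- function(n, m, sigma.n = 1, sigma.m = 1) {
  d <- 1 / sqrt(sigma.n^2 + sigma.m^2 / m)  # effect size on unit means
  pwr.t.test(n = n, d = d, sig.level = 0.05,
             type = "two.sample", alternative = "two.sided")$power
}

subsample.power(n = 8, m = 4)               # left panel of Figure 6.1
subsample.power(n = 8, m = 4, sigma.m = 2)  # right panel of Figure 6.1
```

Increasing m shrinks only the within-unit part of the variance, which is why the curves in Figure 6.1 flatten out quickly.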
Figure 6.1 shows the influence of the number of
experimental units (n) and the number of subsam-
ples (m) per experimental unit on the power of two
experiments to detect a difference between two
mean values of 1. For both experiments σn = 1.
The left panel shows the case where σm = 1, while
in the right panel σm = 2. The dots connected by
a dashed line represent the power for experiments
where the total number of subsamples equals 192
(2 treatment groups × n × m).
As is illustrated in the left panel of Figure 6.1,
subsampling has only a limited effect on the power
Figure 6.2. Required sample size of a two-sided test with a significance level α of 0.05 and a power of 80% (left panel) and 90% (right panel), as a function of the number of comparisons that are carried out. Lines are drawn for different values of the effect size (∆ = 0.5, 0.8, 1.0, 1.5 and 2.0). Note that the y-axis is logarithmic.
of the experiment when the within sample variability σm is the same size as (or smaller than) the between sample variability σn. In this case it makes no sense to take more than, say, 4 subsamples per
experimental unit, as is indicated by the vertical
line in Figure 6.1. Furthermore, the sharp decline
of the dashed line connecting the points with the
same total number of subsamples indicates that
subsampling in this case is rather inefficient, at
least when the cost of subsamples and experimen-
tal units is not taken into consideration. An experi-
ment with 32 experimental units and 3 subsamples
has a power of more than 90%, while for an exper-
iment with the same total number of subsamples
but with 4 experimental units and 24 subsamples
per unit, the power is only about 20%.
The right panel of Figure 6.1 shows the case
where the within sample standard deviation σm
is twice the standard deviation between samples
σn. In this example, taking more subsamples does
make sense. The power curves keep increasing
until the number of subsamples is about 16. The
loss in efficiency by taking subsamples is also more moderate, as is indicated by the less sharp decline of the dashed line.
In both examples the power curves have flattened after crossing the vertical line where m = 4(σm²/σn²). This is known as Cox’s rule of thumb (Cox, 1958) about subsamples, which states that for the completely randomized design there is not much increase in power when the number of subsamples m is greater than 4(σm²/σn²). Cox’s ratio
provides an upper limit for a useful number of sub-
samples. However, this rule of thumb does not
take the different costs involved with experimental
units and subsamples into account. In many cases,
especially in animal research, the cost of the exper-
imental unit is substantially larger than that of the
subunit. Taking these differential costs into consid-
eration, the optimum number of subsamples can
be derived as (Snedecor and Cochran, 1980):
m = √[(cn/cm) × (σm²/σn²)]
This equation shows that taking subsamples is of
interest when the cost of experimental units cn is
large relative to the cost of subsamples cm, or when
the variation among subsamples σm is large rela-
tive to the variation among experimental units σn.
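This optimum can be evaluated directly (a sketch; the function and argument names are mine):

```python
import math

def optimal_subsamples(cost_unit, cost_sub, var_within, var_between):
    """Cost-optimal number of subsamples per experimental unit:
    m = sqrt((c_n / c_m) * (sigma_m^2 / sigma_n^2))
    (Snedecor and Cochran, 1980)."""
    return math.sqrt((cost_unit / cost_sub) * (var_within / var_between))

# Cardiomyocyte study of Example 6.4: an animal is assumed to cost
# about 100 times one diameter measurement; variances 13.7 and 4.58
print(round(optimal_subsamples(100, 1, 13.7, 4.58)))    # about 17
print(round(optimal_subsamples(1000, 1, 13.7, 4.58)))   # about 55
```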
Example 6.4. In a morphologic study (Verheyen
et al., 2014), the diameter of cardiomyocytes was
Figure 6.3. Running the cardiomyocyte experiment a large number of times, the measured effect sizes follow a broad distribution. In both plots the true effect size is 0.8. The dark area represents statistically significant results (two-sided p ≤ 0.05) and the vertical dotted line indicates the effect size that is just large enough to be statistically significant. Left panel: 26 animals are used per treatment group, which corresponds to a power of 80%. Right panel: only 5 animals per treatment group are used, which results in an underpowered experiment.
examined in 7 sheep that underwent surgery and 6
sheep that were used as controls. For each animal
the diameter of about 100 epicardial cells was mea-
sured. A sophisticated statistical technique known
as mixed model analysis of variance allowed σn² and σm² to be estimated from the data as 4.58 and 13.7, respectively. Surprisingly, variability within an animal was larger than between animals. If we were to
set up a new experiment, we could limit the num-
ber of measurements to 4× 13.7/4.58 ≈ 12 per ani-
mal. Alternatively, we can take the differential costs
of experimental units and subsamples into account.
It makes sense to assume that the cost of 1 ani-
mal is about 100 times the cost of one diameter
measurement. Making this assumption, the opti-
mum number of subsamples per animal would be
√(100 × 13.7/4.58) ≈ 17. Thus the total number
of diameter measurements could be reduced from
1300 to 220. Even if animals cost 1000 times more than a diameter measurement, the optimum number of subsamples per animal would be about 55, which is still a reduction of about 50% of the original workload. In conclusion, this is a typi-
cal example of a study in which statistical input at
the onset would have improved research efficiency
considerably.
6.6 Multiplicity and sample size
As we shall see in Section 7.6, when more than one
statistical test is carried out on the data, the over-
all rate of false positive findings is higher than the
false positive rate for each test separately. To cir-
cumvent this inflation of the false positive error
rate, the critical value of each individual test is usu-
ally set at a more stringent level. The simplest adjustment, Bonferroni’s adjustment, consists of just dividing the significance level of each individual test by the number of comparisons. Bonferroni’s adjustment maintains the error rate α of the
totality of tests that are carried out in the same con-
text at its original level. But, as we already noted
above, when the significance level is set at a lower
value, the required sample size will necessarily in-
crease. Fortunately, the increase in required num-
ber of replicates is surprisingly small.
Figure 6.2 shows for a two-sided Student t-test
with a significance level α of 0.05 and a power of
80% (left panel) and 90% (right panel), how the
required sample size increases with an increasing
number of comparisons. The percent increase in
sample size due to adding an extra comparison corresponds to the slope of the line segment connecting adjacent points in Figure 6.2. The evolu-
tion of the relative sample size is comparable for
all values of ∆ and power (100 × (1 − β)). For powers of 80% and 90%, carrying out two independent statistical tests instead of one involves a 20% larger sample size to maintain the overall error rate at its level of α = 0.05. Similarly, when 3 or 4 independent tests are involved, the required sample size increases by 30% or 40% respectively. After 4 comparisons, the effect tapers off and all curves approach linearity. Adding an extra comparison in the range of 4 - 10 comparisons will increase the required sample size by about 2.7%, leading to a total increase in sample size for 10 comparisons of about 70%. For a larger number of comparisons, Witte et al. (2000) noted that the relative sample size increases linearly with the logarithm of the number of comparisons.
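These figures can be approximated with a normal-approximation sample-size formula under Bonferroni adjustment (a sketch; the function name is mine, and the normal approximation runs slightly below the exact t-based values plotted in Figure 6.2):

```python
import math
from scipy.stats import norm

def n_per_group(delta, alpha=0.05, power=0.80, k=1):
    """Approximate sample size per group for a two-sided two-sample
    comparison of means (normal approximation), with the significance
    level Bonferroni-adjusted for k comparisons."""
    z_alpha = norm.ppf(1 - (alpha / k) / 2)
    z_beta = norm.ppf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / delta ** 2)

# Effect size delta = 1: one comparison versus two comparisons
print(n_per_group(1.0, k=1))   # 16
print(n_per_group(1.0, k=2))   # 20, roughly a 20-25% increase
```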
Figure 6.2 also illustrates how sample size depends on the effect size. Large sample sizes are indeed required for detecting small to moderate differences. However, for the large and very large differences that we usually want to detect in early research, the required sample size reduces to an attainable level.
6.7 The problem with underpow-
ered studies
A survey of articles published in 2011 (Tressoldi et al., 2013) showed that in prestigious journals such as Science and Nature, fewer than 3% of the publications calculated the statistical power before starting their study. More specifically, in the field of neuroscience, published studies have a power between 8% and 32% to detect a true effect
(Button et al., 2013). Low statistical power might
lead the researcher to wrongly conclude there is
no effect from an experimental treatment when in
fact an effect does exist. In addition, in research involving animals, underpowered studies raise a significant ethical concern. If each individual study is underpowered, the true effect will likely be discovered only after many studies using many animals have been completed and analyzed, using far more animal subjects than if the study had been done
properly the first time (Button et al., 2013). An-
other consequence of low statistical power is that
effect sizes are overestimated and results become
less reproducible (Button et al., 2013). This is best
illustrated by the following example.
Example 6.5. Consider the cardiomyocyte example
as discussed in Example 5.6. A sample size calcu-
lation, treating the experiment as if it was a com-
pletely randomized design (which it was not) was
carried out on page 42 and yielded a required sam-
ple size of 26 animals in each group to detect a
large treatment effect ∆ = 0.8 with a power of
80%. Imagine running several copies of this experiment, say 10,000. The effect sizes that are obtained from these experiments follow a distribution as displayed in the left-hand panel of Figure 6.3. The
dark shaded area corresponds to experiments that
yielded a statistically significant result. This subset
yields a slightly increased estimate of the effect size
of 0.89, which corresponds to an effect size infla-
tion of 11%. This effect size inflation is to be ex-
pected when an effect has to pass a certain threshold
such as statistical significance. A relative inflation
of 11% as in this case is acceptable.
The situation is completely different in the right-
hand panel of Figure 6.3 where the experiments now
use only 5 animals per treatment group and corre-
sponding power has dropped to 20%. The variabil-
ity of the results is substantially larger as displayed
by the larger scale of the x-axis. While the standard deviation of the effect sizes in the larger experiment was 0.28, it has now increased to 0.63, an increase by a factor of 2.25, which corresponds to √(26/5). The significant experiments now consti-
tute a much smaller part of the distribution. The
mean effect size in this subset has now increased
to 1.57, an inflation of 96%. In addition, the max-
imum effect detected in all studies is now 3.16 as
compared to 1.95 in the larger experiment. This
phenomenon is known as truth inflation, type M error (M stands for magnitude) or the winner’s curse
(Button et al., 2013; Reinhart, 2015).
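The winner's curse of Example 6.5 can be reproduced with a short simulation (a sketch assuming unit standard deviation and a two-sample t-test; the seed and number of simulations are arbitrary choices of mine):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2016)
n, true_effect, n_sim = 5, 0.8, 20000
significant = []
for _ in range(n_sim):
    control = rng.normal(0.0, 1.0, n)
    treated = rng.normal(true_effect, 1.0, n)
    # keep the observed effect only when the t-test is significant
    if ttest_ind(treated, control).pvalue <= 0.05:
        significant.append(treated.mean() - control.mean())

print(f"estimated power: {len(significant) / n_sim:.2f}")
print(f"mean significant effect: {np.mean(significant):.2f}")
```

The estimated power comes out near 20%, and the mean effect size among the significant experiments is far above the true value of 0.8, illustrating the inflation.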
Figure 6.4. The winner’s curse: effect size inflation (relative bias of the research finding, %) as a function of the statistical power of the study (%).
As shown in Figure 6.4, effect inflation is worst
for small low-powered studies which can only detect
treatment effects that happen to be large. There-
fore, significant research findings of small studies
are biased in favour of inflated effects. This has
consequences when an attempt is made to replicate
a published finding and the sample size is computed
based on the published effect. When this is an in-
flated estimate, sample size of the confirmatory ex-
periment will be too low. To summarize, effect in-
flation due to small, underpowered experiments is
one of the major reasons for the lack of replicability
in scientific research.
Figure 6.5. Outline of a sequential experiment
6.8 Sequential plans
Sequential plans allow investigators to save on
the experimental material, by testing at different
stages, as data accumulate. These procedures have
been used in clinical research and are now advo-
cated for use in animal experiments (Fitts, 2010,
2011). Sequential plans are sometimes referred
to as "sequential designs", but strictly speaking
all types of designs that we discussed before can
be implemented in a sequential manner. Sequen-
tial procedures are entirely based on the Neyman-
Pearson hypothesis decision making approach that
we saw in Section 6.3 and do not consider the ac-
curacy or precision of the treatment effect estima-
tion. Therefore, in the case of early termination for
a significant result, sequential plans are prone to
exaggerate the treatment effect. There is certainly a
place for these procedures in exploratory research
such as early screening, but a fixed sample size
confirmatory experiment is needed to provide an
unbiased and precise estimate of the effect size.
Example 6.6. In a search for compounds that of-
fer protection against traumatic brain injury, a rat
model was used as a screening test. Preliminary power calculations showed that at least 25 ani-
mals per treatment group were required to detect
a protective effect with a power of 80% against a
one-sided alternative with a type I error of 0.05.
Taking into consideration that a large number of
test compounds would be inactive, a fixed sam-
ple size approach was regarded as unethical and
inefficient. Therefore, a one-sided sequential pro-
cedure (Wilcoxon et al., 1963) was considered as
more appropriate. The procedure operated in different stages (Figure 6.5). At each stage a num-
ber of animals was selected, such that the group
was as homogeneous as possible. The animals were
then randomly allocated to the different treatment
groups, as three per group. At a given stage the
treatments consisted of several experimental com-
pounds and their control vehicle. After measuring
the response, the procedure allowed the investigator
to make the decision of rejecting the drug as unin-
teresting, accepting the drug as active, or continuing with a new group of animals in a next stage.
After having tested about 50 treatment conditions,
a candidate compound was selected for further de-
velopment. An advantage of this screening proce-
dure was that, given the biologically relevant level
of activity that must be detected, the expected frac-
tion of false positive and false negative results was
known and fixed. A disadvantage of the method was
that a dedicated computer program was required for
the follow-up of the results.
7. The Statistical Analysis
”How absurdly simple!”, I cried.
”Quite so!”, said he, a little nettled. ”Every problem becomes very childish when once it is explained to you.”
Dr. Watson and Sherlock Holmes
The Adventure of the Dancing Men, A.C. Doyle.
”We teach it because it’s what we do; we do it because it’s what we teach.” (on the use of p < 0.05)
George Cobb (2014)
7.1 The statistical triangle
There is a one-to-one correspondence between the
study objectives, the study design and the analy-
sis. The objectives of the study will indicate which
of the designs may be considered. Once a study de-
sign is selected, it will on its turn determine which
type of analysis is appropriate. This principle, that
the statistical analysis is determined by the way
the experiment is conducted, was enunciated by
Fisher (1935):
All that we need to emphasize immediately
is that, if an experiment does allow us to
calculate a valid estimate of error, its structure must completely determine the statis-
tical procedure by which this estimate is to
be calculated. If this were not so, no in-
terpretation of the data could ever be un-
ambiguous; for we could never be sure that
some other equally valid method of inter-
pretation would not lead to a different re-
sult.
In other words, the choice of the statistical methods follows directly from the objectives and design of the study. With this in mind, many of the complexities of the statistical analysis have now almost become trivial.
7.2 The statistical model revisited
Figure 7.1. The statistical triangle: a conceptual framework forthe statistical analysis
Example 7.1. We already stated that every experi-
mental design is underpinned by a statistical model
and that the experimental results should be consid-
ered as being generated by this experimental model.
This conceptual framework as illustrated in Figure
7.1, greatly simplifies the statistical analysis to just
fitting the statistical model to the data and compar-
ing the model component related to the treatment effect with the error component (Grafen and Hails,
2002; Kutner et al., 2004). Hence, the choice of the
51
Figure 7.2. Distribution of the test statistic t for the cardiomyocyte example, under the assumption that the null hypothesis of no difference between the samples is true. Left panel: the area to the right of t = 2.79 is shaded. Right panel: the areas beyond ±2.79 are shaded.
appropriate statistical analysis is straightforward.
However, some important statistical issues remain,
such as the type of data and the assumptions we make
about the distribution of the data.
7.3 Significance tests
Significance testing is related to, but not exactly the
same as hypothesis testing (see Section 6.3). Significance testing differs from the Neyman-Pearson hy-
pothesis testing approach in that there is no need to
define an alternative hypothesis. Here, we only de-
fine a null hypothesis and calculate the probability
to obtain results as extreme or more extreme than
what was actually observed, assuming the null hy-
pothesis is true. This is done by calculating, from
the experimental data, a quantity called the test statistic. Then, based on the statistical model, the
distribution of this test statistic is derived under
the null hypothesis. With this null-distribution, the
probability is calculated of obtaining a test statis-
tic that is as extreme or more extreme than the
one observed. This probability is referred to as the p-
value. It is common practice to compare this p-
value to a preset level of significance α (usually
0.05). When the p-value is smaller than α, the null
hypothesis is rejected, otherwise the result is incon-
clusive. However, this conflates the two worlds of
significance testing and the formal decision-making
approach of hypothesis testing. For Fisher, the p-
value was an informal measure to see how surpris-
ing the data were and whether they deserved a sec-
ond look (Nuzzo, 2014; Reinhart, 2015). It is good
practice to follow Fisher and to report the actual
p-values rather than p ≤ 0.05 (see Section 9.2.3),
since this allows anyone to construct their own hy-
pothesis tests. The cardiomyocytes experiment of
Example 5.5 (page 30) will help us to illustrate the idea of significance testing. The experiment was
set up to test the null hypothesis of no difference
between vehicle and drug. This null hypothesis is
tested at a level of significance α of 0.05, i.e. we
want to limit the probability of a false positive re-
sult to 0.05. The paired design of this experiment
is a special case of the randomized complete block
design with only two treatments and the response is
a continuously distributed variable. In this design,
calculations can be simplified by evaluating the treatment effect for each pair separately, thus removing
the block effect. This is done in Table 5.1 in the col-
umn with the Drug - Vehicle differences. We now
must make some assumptions about the statistical
model that generated the data. Specifically, we as-
sume that the differences are independent from one
another and originate from a normal distribution.
Next, we define a relevant test statistic, in this
case it is the mean value of the differences (7), divided by its standard error1. For our example, we obtain a value of 7/2.51 = 2.79 for this statistic.

1 The standard error of the mean of a sample is obtained by dividing the sample standard deviation by the square root of the sample size, i.e. sx̄ = SD/√n = 5.61/√5 = 2.51.
Under the assumptions made above and provided
the null hypothesis of no difference between the two
treatment conditions holds, the distribution of this
statistic is known2 and is depicted in Figure 7.2.
On the left panel of Figure 7.2, the value for the
test statistic 2.79 that was obtained from the ex-
perimental data is indicated, and the area under the curve to the right of this value is shaded in grey. This area corresponds to the one-sided p-value, i.e. the
probability of obtaining a greater value for the test
statistic than the one obtained in the experiment.
Since by definition the total area under the curve
equals one, we can calculate the value of the shaded
area. For our example, this results in a value of
0.024, which is the probability of obtaining a value
for the test statistic as extreme or more extreme
than the one obtained in the experiment, provided
the null hypothesis holds.
However, before the experiment was carried out,
we were also interested in finding the opposite re-
sult, i.e. we were also interested in a decrease in
viable myocytes. Therefore, when we consider more
extreme results, we should also look at values that
are less than -2.79. This is done in the right panel
of Figure 7.2. The sum of the two areas is called the
two-sided p-value and corresponds to the probabil-
ity of obtaining, under the null hypothesis, a more extreme result than ±2.79. In our example, the ob-
tained two-sided p-value is 0.049, which allows us
to reject the null-hypothesis at the pre-specified sig-
nificance level α of 0.05 using a two-sided test.
There is one important caveat in all this: Signif-
icance tests only reflect whether the obtained result
could be attributed to chance alone, but do not tell
whether the difference is meaningful or not from a
scientific point of view.
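The paired-test computation walked through above can be sketched in a few lines, starting from the summary numbers of the example (mean difference 7, SD 5.61, n = 5); scipy's t-distribution replaces the table lookup:

```python
from scipy.stats import t

mean_diff, sd, n = 7.0, 5.61, 5
se = sd / n ** 0.5             # standard error of the mean difference
t_stat = mean_diff / se        # test statistic, about 2.79
p_one_sided = t.sf(t_stat, df=n - 1)
p_two_sided = 2 * p_one_sided  # just below 0.05

print(round(t_stat, 2), round(p_one_sided, 3), round(p_two_sided, 3))
```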
7.4 Verifying the statistical as-
sumptions
When the inferential results are sensitive to the dis-
tributional and other assumptions of the statistical analysis, it is essential that these assumptions are
also verified. The aptness of the statistical model is
preferably assessed by informal methods such as
diagnostic plotting (Grafen and Hails, 2002; Kut-
ner et al., 2004). When planning the experiment,
historical data, or the results of exploratory or pi-
lot experiments can already be used for a prelim-
inary verification of the model assumptions. An-
other option is to use statistical methods that are
robust against departures from the assumptions (Lehmann, 1975). It is also wise, before carrying
out formal tests, to make graphical displays of the
data. This allows outliers to be identified and already gives indications of whether the statistical model is appropriate or not. Such exploratory work is also a tool for gaining insight into the research project and
can lead to new hypotheses.
Figure 7.3. One hundred drugs are tested for activity against a biological target. Each drug occupies a square in the grid; the top row contains the drugs that are truly active. Statistically significant results are obtained only for the darker-grey drugs. The black cells are false positives (after Reinhart (2015)).
7.5 The meaning of the p -value
and statistical significance
The literature in the life sciences is literally flooded
with p-values and yet this is also the most misun-
derstood, misinterpreted and sometimes miscalcu-
lated measure (Goodman, 2008).

2 Under the null hypothesis and when the assumptions are true, the test statistic is distributed as a Student t-distribution with n − 1 degrees of freedom.

When we obtain a result that is not significant, this does not
mean that there was no difference between treat-
ment groups. Sample size of the experiment could
just be too small to establish a statistically signifi-
cant result (See Chapter 6). But, what does a sig-
nificant result mean?
Example 7.2. In a laboratory 100 experimental
compounds are tested against a particular biologi-
cal target. Figure 7.3 illustrates the situation. Each
square in the grid represents a tested compound. In
reality only 10 drugs are active against the target,
these are located in the top row. We call this value
of 10% the prevalence or base rate . Let’s assume
that our statistical test has a power of 80%. This
means that of the 10 active drugs 8 will be cor-
rectly detected. These are shown in darker grey.
The threshold for the p-value to declare a drug sta-
tistically significant is set to 0.05, thus there is a
5% chance to declare an inactive compound incor-
rectly as active. There are 90 drugs that are in
reality inactive, so about 5 will yield a significant
effect. These are shown on the second row in black.
Hence, in the experiment 13 drugs are declared ac-
tive, of which only 8 are truly effective, i.e. the
positive predictive value is about 8/13 = 62%, or its
complement the false discovery rate (FDR) is about
38%.
Figure 7.4. False discovery rate as a function of the prevalence (π) and the power 100 × (1 − β), with α = 0.05. Lines are drawn for powers 100 × (1 − β) of 80%, 50% and 20%.
From the above reasoning, it follows (Colquhoun, 2014; Wacholder et al., 2004) that the FDR depends on the threshold α, the power (1 − β) and the prevalence or base rate π as:

FDR = α(1 − π)/[α(1 − π) + π(1 − β)]
    = 1/{1 + [π/(1 − π)][(1 − β)/α]}    (7.1)
For our example, the above derivation of the FDR
yields a value of 0.36 when the prevalence of ac-
tive drugs π is 10%, significance threshold α of 0.05
and a power 1 − β of 80%. This rises to 0.69 when the power is reduced to 20%, meaning that under these conditions 69% of the drugs (or other research questions) that were declared active are in fact false positives.
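Equation 7.1 can be checked directly against these numbers (a sketch; the function name is mine):

```python
def false_discovery_rate(alpha, power, prevalence):
    """Expected false discovery rate (Eq. 7.1): the fraction of
    significant results that are in fact false positives."""
    false_pos = alpha * (1 - prevalence)   # inactive declared active
    true_pos = power * prevalence          # active correctly detected
    return false_pos / (false_pos + true_pos)

# Example 7.2: prevalence 10%, alpha 0.05, power 80% and 20%
print(round(false_discovery_rate(0.05, 0.80, 0.10), 2))   # 0.36
print(round(false_discovery_rate(0.05, 0.20, 0.10), 2))   # 0.69
```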
The FDR depends highly on the prevalence rate
as is illustrated in Figure 7.4, leading to the conclu-
sion that when working in new areas where the a
priori probability of a finding is low, say 1/100, a
significant result does not necessarily imply a gen-
uine activity. In fact, under these circumstances even in a well-powered experiment (80% power) with a significance level of 0.05, about 86% of the positive findings are false. To make things worse, it is precisely such surprising, groundbreaking findings, often combined with exaggerated effect sizes due to a small sample size (Section 6.7), that are likely to be published in prestigious journals like Nature and Science.
Table 7.1. Minimum false discovery rate (MFDR) for some commonly used critical values of p

p-value   0.1     0.05    0.01    0.005   0.001
MFDR      0.385   0.289   0.111   0.067   0.0184
What is the value of p ≈ 0.05? Consider an experiment whose results yield a p-value close to 0.05,
say between 0.045 and 0.05. What does it actu-
ally mean? In how many instances does this re-
sult reflect a true difference? We already deduced
that, when the power or the prevalence rate are
low, the FDR can easily reach 70%. But what is
the most optimistic scenario? In other words what
is the minimum value for the FDR? Irrespective
of power, sample size and prior probability, Sellke
et al. (2001) derived an expression for what they call the conditional error probability, which is equivalent to the minimum FDR (MFDR). This gives the
minimum probability that, when a test is declared
”significant”, the null hypothesis is in fact true.
Some values of the MFDR are given in Table 7.1.
For p = 0.05 the MFDR = 0.289, which means
that a researcher who claims a discovery when p ≈ 0.05 is observed will make a fool of him-/herself in
about 30% of the cases. Even for a p-value of 0.01,
the null hypothesis can still be true in 11% of the
cases (Colquhoun, 2014).
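The MFDR values in Table 7.1 can be reproduced from the −e·p·ln(p) bound of Sellke et al. (2001) (a sketch, valid for p < 1/e; the function name is mine):

```python
import math

def minimum_fdr(p):
    """Minimum false discovery rate for an observed p-value, based
    on the -e*p*ln(p) bound on the Bayes factor (Sellke et al., 2001).
    Valid for p < 1/e."""
    bound = -math.e * p * math.log(p)
    return 1.0 / (1.0 + 1.0 / bound)

for p in (0.1, 0.05, 0.01, 0.005, 0.001):
    print(p, round(minimum_fdr(p), 3))
```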
The FDR is certainly one of the key factors responsible for the lack of replicability in research, and it puts into question the decision-theoretic approach with its irrational dichotomization of the p-value into significant and non-significant.
As already noted in the introductory chapter, the issues of reproducibility and replicability of research findings have deeply concerned the scientific, and certainly also the statistical, community. This led the board of the American Statistical Association to issue a statement on March 6, 2016 in which the society warns against the misuse of p-values (Wasserstein and Lazar, 2016). It was the first time in the association's 177-year history that explicit recommendations on a fundamental matter in statistics were made. In summary, the ASA advises researchers to avoid drawing scientific conclusions or making decisions based on p-values alone. P-values should certainly not be interpreted as measuring the probability that the studied hypothesis is true, or the probability that the data were produced by chance alone. Researchers should describe not only the data analyses that produced statistically significant results, the society says, but all statistical tests and all choices made in the calculations.
7.6 Multiplicity
In Section 3.1, we already pointed out that it is wise to limit the number of objectives in a study. As mentioned in Section 6.6, increasing the number of objectives not only increases the study's complexity, but also results in more hypotheses to be tested. Testing multiple related hypotheses inflates the type I error rate. The same problem of multiplicity arises when a study includes a large number of variables, or when measurements are made at a large number of time points. Only in studies of the most exploratory nature is the statistical analysis of every possible variable or time point acceptable. In that case, the exploratory nature of the study should be stressed and the results interpreted with great care.
Example 7.3. Suppose a drug is tested at 20 different doses on a specific variable. Further, suppose that we reject the null hypothesis of no treatment effect for each dose separately when the probability of falsely rejecting the null hypothesis (the significance level α) is less than or equal to 0.05. Then the overall probability of falsely declaring the existence of a treatment effect when all underlying null hypotheses are in fact true is 1 − (1 − 0.05)^20 = 0.64. This means that we are more likely than not to obtain at least one significant result. The same multiplicity problem arises when a single dose of the drug is tested on 20 mutually independent variables.
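The family-wise error rate in Example 7.3 follows directly from the complement rule; a quick check (variable names are mine):

```python
alpha, k = 0.05, 20           # per-test significance level, number of tests
fwer = 1 - (1 - alpha) ** k   # probability of at least one false positive
print(round(fwer, 2))         # 0.64, as in Example 7.3
```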
The problem of multiplicity is of particular importance and magnitude in gene expression microarray experiments (Bretz et al., 2005). Suppose, for example, that a microarray experiment examines the differential expression of 30,000 genes in a wildtype and in a mutant. Assume that for each gene an appropriate two-sided two-sample test is performed at the 5% significance level. Then we expect to obtain roughly 1,500 false positives. Strategies for dealing with what is often called the curse of multiplicity in microarrays are provided by Amaratunga and Cabrera (2004) and Bretz et al. (2005).
The multiplicity problem must at least be recog-
nized at the planning stage. Ways to deal with it
(Bretz et al., 2010; Curran-Everett, 2000) should be
investigated and specified in the protocol.
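One widely used way of dealing with multiplicity in large-scale settings such as microarrays is the Benjamini–Hochberg false discovery rate procedure, one of the strategies covered in the references above. A minimal sketch of the step-up procedure (the function name and the toy p-values are mine):

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return the indices of hypotheses rejected at FDR level q.
    Reject all hypotheses up to the largest rank k with p_(k) <= (k/m) * q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # ascending p-values
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * q:
            cutoff = rank   # largest rank passing the step-up criterion
    return sorted(order[:cutoff])

# Five tests: the four small p-values are rejected, the large one is not.
print(benjamini_hochberg([0.01, 0.02, 0.03, 0.04, 0.90]))  # [0, 1, 2, 3]
```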
8. The Study Protocol
Nothing clears up a case so much as stating it to another person.
Sherlock Holmes
The Memoirs of Sherlock Holmes. Silver Blaze. A.C. Doyle.
The writing of the study protocol marks the end of the research design phase. Every study should have a written formal protocol before it is started. The complete study protocol consists of a more conceptual research protocol and the technical protocol, which we already discussed in Section 4.8.1.3. The research protocol states the rationale for performing the study, the study's objectives, the related hypotheses and working hypotheses that are tested, and their consequential predictions. It should contain a section on experimental design, how treatments will be assigned to experimental units, information on and justification of the planned sample sizes, and a description of the statistical analysis that is to be performed. Defining the statistical methods in the protocol is important because it allows the data-analytic procedures to be prepared beforehand and guards against the misleading practice of data dredging or data snooping. Writing down the statistical analysis plan beforehand also prevents the investigator from trying several methods of analysis and reporting only those results that suit him or her. Such a practice is of course inappropriate, unscientific, and unethical. In this context, the study protocol is a safeguard for the reproducibility of research findings.
Many investigators consider writing a detailed protocol a waste of time. However, the smart researcher understands that by writing a good protocol he is actually preparing his final study report. A well-written protocol is even more essential when the design is complex or the study is collaborative. Once the protocol has been formalized, it is important that it is followed as closely as possible, and every deviation from it should be documented.
9. Interpretation and Reporting
No isolated experiment, however significant in itself, can suffice for the experimental demonstration
of any natural phenomenon; for the "one chance in a million" will undoubtedly occur, with no less
and no more than its appropriate frequency, however surprised we may be that it should occur to us.
R. A. Fisher (1935)
While the previous chapters focused on the planning phase of the study, with the protocol as final deliverable, this section deals with some points to consider when interpreting and reporting the results of the statistical analysis. Several journals (Altman et al., 1983; Bailar III and Mosteller, 1988) have published guidelines for reporting statistical results. Although these guidelines focus mostly on clinical research, many of the precepts also apply to other types of research.

As a general rule in writing reports containing statistical methodology, it is recommended not to use technical statistical terms such as random, normal, significant, correlation, and sample in their everyday meaning, i.e. out of the statistical context. We will now discuss the different sections of a scientific publication in which statistical reasoning is involved.
9.1 The Methods section
The Methods section should contain details of the experimental design, such as the size and number of experimental groups, how experimental units were assigned to treatment groups, how experimental outcomes were assessed, and what statistical and analytical methods were used.
9.1.1 Experimental design
Readers should be told about weaknesses and strengths in the study design, e.g. whether randomization and/or blinding was used, since these add to the reliability of the data. A detailed description of the randomization and blinding procedures, and of how and when they were applied, will allow the reader to judge the quality of the study. The reasons for blocking and the blocking factors should be given, as well as how blocking was dealt with in the statistical analysis. When there is ambiguity about the experimental unit, the unit used in the statistical analysis should be specified and a justification for its choice should be provided.
9.1.2 Statistical methods
Statistical methods should be described in enough detail to enable a knowledgeable reader with access to the original data to verify the reported results. The authors should report and justify which methods they used. A term like tests of significance is too vague and should be made more specific.
The level of significance and, when applicable, the direction of statistical tests should be specified, e.g. two-sided p-values less than or equal to 0.05 were considered to indicate statistical significance. Some procedures, e.g. analysis of variance and chi-square tests, are by definition two-sided. Issues about multiplicity (Section 7.6) and a justifi-
cation of the strategy that deals with them should
also be addressed here.
The software used in the statistical analysis and its version should also be specified. When the R system is used (R Core Team, 2013), both R and the packages that were used should be referenced.
9.2 The Results section
9.2.1 Summarizing the data
The number of experimental units used in the analysis should always be clearly specified. Any discrepancies with the number of units actually randomized to treatment conditions should be accounted for. Whenever possible, findings should be quantified and presented with appropriate indicators of measurement error or uncertainty. As measures of spread and precision, standard deviations (SD) and standard errors (SEM) should not be confused. Standard deviations are a measure of spread and as such a descriptive statistic, while standard errors are a measure of the precision of the mean. Normally distributed data should preferably be summarized as mean (SD), not as mean ± SD. For non-normally distributed data, medians and inter-quartile ranges are the most appropriate summary statistics. The practice of reporting mean ± SEM should preferably be replaced by the reporting of confidence intervals, which are more informative. Extremely small datasets should not be summarized at all, but should preferably be reported or displayed as raw data.
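The distinction between SD and SEM, and the preferred summaries, can be illustrated with a short sketch (the data values are invented for illustration):

```python
import math
import statistics as st

data = [4.2, 5.1, 3.8, 4.9, 5.5, 4.4, 4.7, 5.0]  # invented measurements

mean = st.mean(data)
sd = st.stdev(data)               # spread of the data: a descriptive statistic
sem = sd / math.sqrt(len(data))   # precision of the mean, not a measure of spread

print(f"mean (SD): {mean:.2f} ({sd:.2f})")   # preferred over 'mean ± SD'
print(f"SEM: {sem:.2f}")

# For skewed data, report the median and inter-quartile range instead.
q1, q2, q3 = st.quantiles(data, n=4)
print(f"median (IQR): {q2:.2f} ({q1:.2f}-{q3:.2f})")
```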
When reporting the SD (or SEM), one should realize that for positive variables (i.e. variables measured on a ratio scale) such as concentrations, durations, and counts, the mean minus 2 × SD (or minus 2 × SEM × √n), which indicates the lower 2.5% point of the distribution, can lead to a ridiculous negative value. In this case, an appropriate 95% confidence interval based on the lognormal distribution, or alternatively a distribution-free interval, will avoid such a pitfall.
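For a strictly positive variable, computing the interval on the log scale and back-transforming keeps both limits positive. A sketch of such a lognormal-based 95% interval (the data are invented; 1.96 is the usual normal quantile):

```python
import math
import statistics as st

concentrations = [1.2, 0.8, 2.5, 1.9, 0.6, 3.1, 1.4, 2.2]  # invented, all > 0

logs = [math.log(x) for x in concentrations]
m, s = st.mean(logs), st.stdev(logs)

# A naive mean - 2*SD interval on the raw scale can go negative;
# the back-transformed lognormal interval cannot.
lower, upper = math.exp(m - 1.96 * s), math.exp(m + 1.96 * s)
print(f"lognormal 95% interval: ({lower:.2f}, {upper:.2f})")
```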
Spurious precision detracts from a paper's readability and credibility. Therefore, unnecessary precision, particularly in tables, should be avoided. When presenting means and standard deviations, it is important to bear in mind the precision of the original data. Means should be given one decimal place more than the raw data. Standard deviations and standard errors usually require one further decimal place. Percentages should not be expressed to more than one decimal place and, with samples of less than 100, decimal places should be avoided altogether. Percentages should not be used at all for small samples. Note that these remarks about rounding apply only to the presentation of results; rounding should not be done before or during the statistical analysis.
9.2.2 Graphical displays
Figure 9.1. Scatter diagram with indication of median values and 95% distribution-free confidence intervals
Graphical displays complement tabular presentations of descriptive statistics. Generally, graphs are better suited than tables for identifying patterns in the data, whereas tables are better for providing large amounts of data with a high degree of numerical detail. Whenever possible, one should attempt to graph the individual data points. This is especially the case when treatment groups are small. Graphs such as Figure 9.1 and Figure 9.2 are much more informative than the usual bar and line graphs showing mean values ± SEM (Weissgerber et al., 2015). These graphs are easily constructed using the R language. Specifically, the R package beeswarm (Eklund, 2010) can be of great help.
Figure 9.2. Graphical display of longitudinal data showing individual subject profiles
9.2.3 Interpreting and reporting significance tests
When data are summarized in the Results section, the statistical methods that were used to analyze them should be specified. It is of little help to the reader to find in the Methods section a statement such as "statistical methods included analysis of variance, regression analysis, as well as tests of significance" without any reference to which specific procedure is reported in the Results part.
Tests of statistical significance should be two-sided. When comparing two means or two proportions, there is a choice between a two-sided and a one-sided test (see Section 7.3). In a one-sided test, the investigator's alternative hypothesis specifies the direction of the difference, e.g. experimental treatment greater than control. In a two-sided test, no such direction is specified. A one-sided test is rarely appropriate, and when one-sided tests are used, their use should be justified (Bland and Altman, 1994). For all two-group comparisons, the report should clearly state whether one-sided or two-sided p-values are reported.
Exact p-values, rather than statements such as "p < 0.05" or, even worse, "NS" (not significant), should be reported where possible. The practice of dichotomizing p-values into significant and not significant has no rational scientific basis at all and should be abandoned. This lack of rationality becomes apparent when one considers the situation where a study yielding a p-value of 0.049 would be flagged as significant, while an almost equivalent result of 0.051 would be flagged as "NS". Reporting exact p-values allows readers to compare the reported p-value with their own choice of significance level. One should also avoid reporting a p-value as p = 0.000, since a value with zero probability of occurrence is, by definition, an impossible value. No observed event can ever have a probability of zero. Therefore, such an extremely small p-value must be reported as p < 0.001. In rounding a p-value, it can happen that a value that is technically larger than the significance level of 0.05, say 0.051, is rounded down to p = 0.05. This is inaccurate and, to avoid this error, p-values should be reported to the third decimal. If a one-sided test is used and the result is in the wrong direction, then the report must state that p > 0.05 (Levine and Atkin, 2004), or, even better, report the complement of the p-value, i.e. 1 − p.
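These reporting rules (three decimals, never p = 0.000) are easy to encode; a hypothetical helper of my own devising:

```python
def format_p(p):
    """Format a p-value for reporting: three decimals, never 'p = 0.000'."""
    if not 0 < p <= 1:
        raise ValueError("p-value must lie in (0, 1]")
    if p < 0.001:
        return "p < 0.001"        # never report p = 0.000
    return f"p = {p:.3f}"         # three decimals avoid rounding 0.051 down to 0.05

print(format_p(0.0512))    # 'p = 0.051' -- not misleadingly rounded to 0.05
print(format_p(0.00004))   # 'p < 0.001'
```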
There is a common misconception among scientists that a nonsignificant result implies that the null hypothesis can be accepted. Consequently, they conclude that there is no effect of the treatment or that there is no difference between the treatment groups. However, from a philosophical point of view, one can never prove the nonexistence of something. As Fisher (1935) clearly pointed out:
it should be noted that the null hypothesis is
never proved or established, but is possibly
disproved, in the course of experimentation.
To state it otherwise: lack of evidence is no evidence for lack of effect. Conversely, an effect that is statistically significant is not necessarily of biomedical importance, nor is it necessarily replicable (see Section 7.5). Therefore, one should avoid sole reliance on statistical hypothesis testing and preferably supplement one's findings with confidence intervals, which are more informative. Confidence intervals on a difference of means or proportions provide information about the size of an effect and its uncertainty, and are of particular value when the results of the test fail to reject the null hypothesis. This is illustrated in Figure 9.3, which shows treatment effects and their 95% confidence intervals. The shaded area indicates the region in which results are important
from a scientific (biological) point of view. Three
possible outcomes for treatment effects are shown
here as mean values and 95% confidence intervals.
The region encompassed by the confidence inter-
val can be interpreted as the set of plausible val-
ues of the treatment effect. The top row shows a
result that is statistically significant, the 95% confi-
dence interval does not encompass the zero effect
line. However, effect sizes that have no biologi-
cal relevance are still plausible as is shown by the
upper limit of the confidence interval. The second
row shows the result of an experiment that was not
significant at the 0.05 level. However, the confi-
dence interval reaches well within the area of bio-
logical relevance. Therefore, notwithstanding the
nonsignificant outcome, this experiment is incon-
clusive. The third outcome concerns a result that
was not significant, but the 95% confidence inter-
val does not reach beyond the boundaries of scien-
tific relevance. The nonsignificant result here can
also be interpreted that, with 95% confidence, the
treatment effect was also irrelevant from a scien-
tific point of view.
Figure 9.3. Use of confidence intervals for interpreting statistical results. Estimated treatment effects are displayed with their 95% confidence intervals. The shaded area indicates the zone of biological relevance.
In the scientific literature, it is common practice to make a sharp distinction between significant and nonsignificant findings and to make comparisons of the sort "X is statistically significant, while Y is not". A representative, though fictitious, example of such a claim is:
The percentage of neurons showing cue-
related activity increased with training in
the mutant mice (P < 0.05), but not in the
control mice (P > 0.05). (Nieuwenhuis
et al., 2011)
Such comparisons are absurd, inappropriate, and can be misleading. Indeed, the difference between "significant" and "not significant" is not itself statistically significant (Gelman and Stern, 2006). Unfortunately, such a practice is commonplace. A recent review by Nieuwenhuis et al. (2011) showed that, in the area of cellular and molecular neuroscience, the majority of authors erroneously claim an interaction effect when they obtain a significant result in one group and a nonsignificant result in the other. In view of our discussion of p-values and nonsignificant findings, it is needless to say that this approach is completely wrong and misleading. The correct approach would be to design a factorial experiment and test the genotype-by-training interaction effect. In this context, it must also be noted that carrying out a statistical test to prove the equivalence of baseline measurements is pointless. Tests of significance are not tests of equivalence. When baseline measurements are present, their value should be included in the statistical model.
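The Gelman and Stern (2006) point can be made concrete with a stylized example (the numbers are illustrative, not taken from these notes): an effect of 25 (SE 10) is significant and an effect of 10 (SE 10) is not, yet the difference between the two effects is itself nonsignificant. A sketch using a normal approximation:

```python
import math

def z_and_p(effect, se):
    """Two-sided p-value for effect/se under a normal approximation."""
    z = effect / se
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p

z1, p1 = z_and_p(25, 10)   # "significant" group: z = 2.5
z2, p2 = z_and_p(10, 10)   # "nonsignificant" group: z = 1.0

# The proper comparison tests the difference between the two effects.
diff_se = math.sqrt(10**2 + 10**2)
zd, pd = z_and_p(25 - 10, diff_se)   # z ≈ 1.06: the difference is not significant

print(f"group 1: p = {p1:.3f}; group 2: p = {p2:.3f}; difference: p = {pd:.3f}")
```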
Finally, when interpreting the results of the experiment, the scientist should bear in mind the topics covered in Section 6.7 about effect size inflation and in Section 7.5 about the pitfalls of p-values.
10. Concluding Remarks and Summary
You know my methods. Apply them.
Sherlock Holmes
The Sign of the Four. A.C. Doyle.

To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.
R.A. Fisher (1938)
10.1 Role of the statistician
One topic we have not yet touched upon is the role of the statistician in the research project. The statistician is a professional particularly skilled in solving research problems. She should be considered a team member and often even a collaborator or partner in the research process, in which she can have a critical role. Whenever possible, the statistician should be consulted when there is doubt with regard to design, sample size, or statistical analysis. A statistician working closely together with a scientist can greatly improve the project's likelihood of success. Many applied statisticians become involved in the subject area and, by virtue of their statistical training, take on the role of statistical thinker, thereby permeating the research process. In a great many instances this key role of the statistician is recognized and rewarded with a co-authorship.

The most effective way to work with a consulting statistician is to include her or him from the very beginning of the project, when the study objectives are formulated (Hinkelmann and Kempthorne, 2008). What should always be avoided is contacting the statistical support group after the experiment has reached its completion; then perhaps the statisticians can only say what the experiment died of.
10.2 Recommended reading
Statistics Done Wrong: The Woefully Complete Guide by Reinhart (2015) is in my opinion essential reading material for all scientists working in biomedicine and in the life sciences in general. This small book (152 pages) provides a well-written, very accessible guide to the most popular statistical errors and slip-ups committed by scientists every day, in the lab and in peer-reviewed journals. Scientists working with laboratory animals should certainly read the article by Fry (2014) and the book The Design and Statistical Analysis of Animal Experiments by Bate and Clark (2014). For those interested in the history of statistics and the lives of famous statisticians, The Lady Tasting Tea by Salsburg (2001) is a lucidly written account of the history of statistics, experimental design, and how statistical thinking revolutionized 20th-century science. A clear, comprehensive, and highly recommended work on experimental design is the book by Selwyn (1996), while, on a more introductory level, there is the book by Ruxton and Colegrave (2003). A gentle introduction to statistics
in general and hypothesis testing, confidence in-
tervals and analysis of variance in particular, can be found in the highly recommended book by the two Wonnacott brothers (Wonnacott and Wonnacott, 1990). Comprehensive works at an advanced level on statistics and experimental design are the books by Kutner et al. (2004), Hinkelmann and Kempthorne (2008), Casella (2008), and Giesbrecht and Gumpertz (2004). The latter two also provide designs suitable for 96-well microtiter plates. For those who want to carry out their analyses in the freely available R language (R Core Team, 2013), the book by Dalgaard (2002) is a good starter, while the book by Everitt and Hothorn (2010) is at a more advanced level. Guidance and tips for efficient data visualization can be found in the work of Tufte (1983) and in the two books by William Cleveland (Cleveland, 1993, 1994). Finally, there is the freely available e-book Speaking of Graphics (Lewi, 2006), which takes the reader on a fascinating journey through the history of statistical graphics1.
10.3 Summary
We have looked at the complexities of the research process from the vantage point of a generalist. Statistical thinking was introduced as a non-specialist, generalist skill that permeates the entire research process. The seven principles of statistical thinking were formulated as: 1) time spent thinking on the conceptualization and design of an experiment is time wisely spent; 2) the design of an experiment reflects the contributions from different sources of variability; 3) the design of an experiment balances between its internal validity (proper control of noise) and external validity (the experiment's generalizability); 4) good experimental practice provides the clue to bias minimization; 5) good experimental design is the clue to the control of variability; 6) experimental design integrates various disciplines; 7) a priori consideration of statistical power is an indispensable pillar of an effective experiment.

We elaborated on each of these principles and finally discussed some points to consider in the interpretation and reporting of scientific results. In particular, the problems of blind trust in statistical hypothesis tests and of exaggerated effect sizes in small significant studies were highlighted.
1http://www.datascope.be
References
Altman, D. G., Gore, S. M., Gardner, M. J., and Pocock, S. J.
(1983). Statistical guidelines for contributors to medical jour-
nals. BMJ 286, 1489–1493.
URL http://www.bmj.com/content/286/6376/1489
Amaratunga, D. and Cabrera, J. (2004). Exploration and Analy-
sis of DNA Microarray and Protein Array Data. New York, NY: J.
Wiley.
Anderson, V. and McLean, R. (1974). Design of Experiments.
New York, NY: Marcel Dekker Inc.
Aoki, Y., Helzlsouer, K. J., and Strickland, P. T. (2014).
Arylesterase phenotype-specific positive association between
arylesterase activity and cholinesterase specific activity in hu-
man serum. Int. J. Environ. Res. Public Health 11 , 1422–1443.
doi:10.3390/ijerph110201422.
Babij, C., Zhang, Y., Kurzeja, R. J., Munzli, A., Shehabeldin, A., Fernando, M., Quon, K., Kassner, P. D., Ruefli-Brasse, A. A., Watson, V. J., Fajardo, F., Jackson, A., Zondlo, J., Sun, Y., Ellison, A. R., Plewa, C. A., San Miguel, T., Robinson, J., McCarter, J., Judd, T., Carnahan, J., and Dussault, I. (2011). STK33 kinase activity is nonessential in KRAS-dependent cancer cells. Cancer Research 71, 5818–5826. doi:10.1158/0008-5472.CAN-11-0778.
Baggerly, K. A. and Coombes, K. R. (2009). Deriving
chemosensitivity from cell lines: Forensic bioinformatics and
reproducible research in high-throughput biology. Annals of
Applied Statistics 3, 1309–1334. doi:10.1214/09-AOAS291.
Bailar III, J. C. and Mosteller, F. (1988). Guidelines for statistical reporting in articles for medical journals. Ann. Int. Med. 108, 266–273.
URL http://www.people.vcu.edu/~albest/Guidance/guidelines_for_statistical_reporting.htm
Banerjee, T. and Mukerjee, R. (2007). Optimal factorial designs
for cDNA microarray experiments. Ann. Appl. Stat. 2, 366–385.
doi:10.1214/07-AOAS144.
Bate, S. and Clark, R. (2014). The Design and Statistical Analysis
of Animal Experiments. Cambridge, UK: Cambridge University
Press.
Begley, C. G. and Ellis, L. M. (2012). Raise standards for pre-
clinical research. Nature 483, 531–533. doi:10.1038/483531a.
Begley, C. G. and Ioannidis, J. P. A. (2015). Reproducibil-
ity in science. Circ. Res. 116, 116–126. doi:10.1161/
CIRCRESAHA114.303819.
Begley, S. (2012). In cancer science, many "discoveries" don’t
hold up. Reuters March 28.
URL http://www.reuters.com/article/2012/03/28/us-
science-cancer-idUSBRE82R12P20120328
Bland, M. and Altman, D. (1994). One and two sided tests of
significance. BMJ 309, 248.
Bolch, B. (1968). More on unbiased estimation of the standard deviation. The American Statistician 22, 27.
Bretz, F., Hothorn, T., and Westfall, P. (2010). Multiple Compar-
isons Using R. Boca Raton, FL: CRC Press.
Bretz, F., Landgrebe, J., and Brunner, E. (2005). Multiplicity is-
sues in microarray experiments. Methods Inf. Med. 44, 431–437.
Brien, C. J., Berger, B., Rabie, H., and Tester, M. (2013). Accounting for variation in designing greenhouse experiments with special reference to greenhouses containing plants on conveyor systems. Plant Methods 9, 5–26. doi:10.1186/1746-4811-9-5.
Burrows, P. M., Scott, S. W., Barnett, O., and McLaughlin,
M. R. (1984). Use of experimental designs with quantitative
ELISA. J. Virol. Methods 8, 207–216.
Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A.,
Flint, J., Robinson, E. S. J., and Munafo, M. R. (2013). Power
failure: Why small sample size undermines the reliability
of neuroscience. Nature Reviews Neuroscience 14, 1–12. doi:
10.1038/nrn3475.
Casella, G. (2008). Statistical Design. New York, NY: Springer.
Champely, S. (2009). pwr: Basic functions for power analysis.
R package version 1.1.1.
URL http://CRAN.R-project.org/package=pwr
Cleveland, W. S. (1993). Visualizing Data. Summit, NJ: Hobart
Press.
Cleveland, W. S. (1994). The Elements of Graphing Data. Sum-
mit, NJ: Hobart Press.
Clewer, A. G. and Scarisbrick, D. H. (2001). Practical Statistics
and Experimental Design for Plant and Crop Science. Chichester,
UK: J. Wiley.
Cochran, W. and Cox, G. (1957). Experimental Designs. New
York, NY: John Wiley & Sons Inc., 2nd edition.
Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sci-
ences. Hillsdale, NJ: Lawrence Erlbaum Associates, 2nd edi-
tion.
Cokol, M., Ozbay, F., and Rodriguez-Esteban, R. (2008).
Retraction rates are on the rise. EMBO Rep. 9, 2. doi:
10.1038/sj.embor.7401143.
URL http://www.ncbi.nlm.nih.gov/pmc/articles/
PMC2246630/
Colquhoun, D. (2014). An investigation of the false discovery rate and the misinterpretation of p-values. R. Soc. Open Sci. 1, 140216. doi:10.1098/rsos.140216.
Council of Europe (2006). Appendix A of the European Convention for the Protection of Vertebrate Animals used for Experimental and other Scientific Purposes (ETS No. 123). Guidelines for accommodation and care of animals (Article 5 of the Convention). Approved by the multilateral consultation.
URL https://www.aaalac.org/about/AppA-ETS123.pdf
Cox, D. (1958). Planning of Experiments. New York, NY: J. Wi-
ley.
Curran-Everett, D. (2000). Multiple comparisons: philoso-
phies and illustrations. Am. J. Physiol. Regulatory Integrative
Comp. Physiol. 279, R1–R8.
Dalgaard, P. (2002). Introductory Statistics with R. New York,
NY: Springer.
Eklund, A. (2010). beeswarm: The bee swarm plot, an alterna-
tive to stripchart. R package version 0.0.7.
URL http://CRAN.R-project.org/package=beeswarm
European Food Safety Authority (2012). Final review of the
Séralini et al. (2012) publication on a 2-year rodent feeding
study with glyphosate formulations and GM maize NK603 as
published online on 19 September 2012 in Food and Chemical
Toxicology. EFSA Journal 10, 2986. doi:10.2903/j.efsa.2012.2986.
Everitt, B. S. and Hothorn, T. (2010). A Handbook of Statistical
Analyses using R. Boca Raton, FL: Chapman and Hall/CRC,
2nd edition.
Faessel, H., Levasseur, L., Slocum, H., and Greco, W. (1999).
Parabolic growth patterns in 96-well plate cell growth experi-
ments. In Vitro Cell. Dev. Biol. Anim. 35, 270–278.
Fang, F. C., Steen, R. G., and Casadevall, A. (2012). Miscon-
duct accounts for the majority of retracted scientific publi-
cations. Proc. Natl. Acad. Sci. U.S.A. 109, 17028–17033.
doi:10.1073/pnas.1212247109.
Fernandez, G. C. J. (2007). Design and analysis of commonly
used comparative horticultural experiments. HortScience 42,
1052–1069.
Fisher, R. (1962). The place of the design of experiments in the
logic of scientific inference. Colloques Int. Centre Natl. Recherche
Sci. Paris 110, 13–19.
Fisher, R. A. (1935). The Design of Experiments. Edinburgh, UK:
Oliver and Boyd.
Fisher, R. A. (1938). Presidential address: The first session of
the Indian Statistical Conference, Calcutta. Sankhya 4, 14–17.
Fitts, D. A. (2010). Improved stopping rules for the design of
small-scale experiments in biomedical and biobehavioral re-
search. Behavior Research Methods 42, 3–22. doi:10.3758/BRM.
42.1.3.
Fitts, D. A. (2011). Minimizing animal numbers: the variable-
criteria stopping rule. Comparative Medicine 61, 206–218.
Freedman, L. P., Cockburn, I. M., and Simcoe, T. S. (2015). The
economics of reproducibility in preclinical research. PLoS Biol.
13, e1002165. doi:10.1371/journal.pbio.1002165.
Fry, D. (2014). Experimental design: reduction and refinement
in studies using animals. In K. Bayne and P. Turner, editors,
Laboratory Animal Welfare, chapter 8, pages 95–112. London,
UK: Academic Press.
Gart, J., Krewski, D., Lee, P. N., Tarone, R., and Wahrendorf,
J. (1986). The design and analysis of long-term animal experi-
ments, volume 3 of Statistical Methods in Cancer Research. Lyon,
France: International Agency for Research on Cancer.
Gelman, A. and Stern, H. (2006). The difference between "sig-
nificant" and "not significant" is not itself statistically signifi-
cant. Am. Stat. 60, 328–331.
Giesbrecht, F. G. and Gumpertz, M. L. (2004). Planning, Con-
struction, and Statistical Analysis of Comparative Experiments.
New York, NY: J. Wiley.
Glonek, G. F. V. and Solomon, P. J. (2004). Factorial and time
course designs for cDNA microarray experiments. Biostatistics
5, 89–111.
Göhlmann, H. and Talloen, W. (2009). Gene Expression Studies
Using Affymetrix Microarrays. Boca Raton, FL: CRC Press.
Goodman, S. (2008). A dirty dozen: Twelve p-value mis-
conceptions. Semin Hematol 45, 135–140. doi:10.1053/j.
seminhematol.2008.04.003.
Grafen, A. and Hails, R. (2002). Modern Statistics for the Life
Sciences. Oxford, UK: Oxford University Press.
Greco, W. R., Bravo, G., and Parsons, J. C. (1995). The search
for synergy: a critical review from a response surface perspec-
tive. Pharmacol. Rev. 47, 331–385.
Hankin, R. K. S. (2005). Recreational mathematics with R: in-
troducing the ’magic’ package. R News 5.
Haseman, J. K. (1984). Statistical issues in the design, analysis
and interpretation of animal carcinogenicity studies. Environ-
mental Health Perspect. 58, 385–392.
URL http://www.ncbi.nlm.nih.gov/pmc/articles/
PMC1569418/
Hayes, W. (2014). Retraction notice to "Long term toxicity of a
Roundup herbicide and a Roundup-tolerant genetically mod-
ified maize" [Food Chem. Toxicol. 50 (2012): 4221-4231]. Food
Chem. Toxicol. 52, 244. doi:10.1016/j.fct.2013.11.047.
Hempel, C. G. (1966). Philosophy of Natural Science.
Englewood-Cliffs, NJ: Prentice-Hall.
Hinkelmann, K. and Kempthorne, O. (2008). Design and Analy-
sis of Experiments. Volume 1. Introduction to Experimental Design.
Hoboken, NJ: J. Wiley, 2nd edition.
Hirst, J. A., Howick, J., Aronson, J. K., Roberts, N., Perera, R.,
Koshiaris, C., and Heneghan, C. (2014). The need for random-
ization in animal trials: An overview of systematic reviews.
PLOS ONE 9, e98856. doi:10.1371/journal.pone.0098856.
Holland, T. and Holland, C. (2011). Unbiased histological
examinations in toxicological experiments (or, the informed
leading the blinded examination). Toxicol. Pathol. 39, 711–714.
doi:10.1177/0192623311406288.
Holman, L., Head, M. L., Lanfear, R., and Jennions, M. D.
(2015). Evidence of experimental bias in the life sciences: why
we need blind data recording. PLoS Biol. 13, e1002190. doi:
10.1371/journal.pbio.1002190.
Hotz, R. L. (2007). Most science studies appear to be tainted
by sloppy analysis. The Wall Street Journal September 14.
URL http://online.wsj.com/article/SB118972683557627104.
html
Hu, B., Simon, J., Günthardt-Goerg, M. S., Arend, M., Kuster,
T. M., and Rennenberg, H. (2015). Changes in the dynamics of
foliar N metabolites in oak saplings by drought and air warm-
ing depend on species and soil type. PLOS ONE 10, e0126701.
doi:10.1371/journal.pone.0126701.
Ioannidis, J. P. A. (2005). Why most published research find-
ings are false. PLoS Med. 2, e124. doi:10.1371/journal.pmed.
0020124.
Ioannidis, J. P. A. (2014). How to make more published re-
search true. PLoS Med. 11, e1001747. doi:10.1371/journal.
pmed.1001747.
Kilkenny, C., Parsons, N., Kadyszewski, E., Festing, M. F. W.,
Cuthill, I. C., Fry, D., Hutton, J., and Altman, D. G. (2009). Sur-
vey of the quality of experimental design, statistical analysis
and reporting of research using animals. PLOS ONE 4, e7824.
doi:10.1371/journal.pone.0007824.
Kimmelman, J., Mogil, J. S., and Dirnagl, U. (2014). Distin-
guishing between exploratory and confirmatory preclinical re-
search will improve translation. PLoS Biol. 12, e1001863. doi:
10.1371/journal.pbio.1001863.
Kutner, M. H., Nachtsheim, C., Neter, J., and Li, W. (2004).
Applied Linear Statistical Models. Chicago, IL: McGraw-
Hill/Irwin, 5th edition.
Lansky, D. (2002). Strip-plot designs, mixed models, and com-
parisons between linear and non-linear models for microtitre
plate bioassays. In W. Brown and A. R. Mire-Sluis, editors, The
Design and Analysis of Potency Assays for Biotechnology Products.
Dev. Biol., volume 107, pages 11–23. Basel: Karger.
Lazic, S. (2010). The problem of pseudoreplication in neuro-
scientific studies: is it affecting your analysis? BMC Neuroscience
11, 5. doi:10.1186/1471-2202-11-5.
LeBlanc, D. C. (2004). Statistics: Concepts and Applications for
Science. Sudbury, MA: Jones and Bartlett Publishers.
Leek, J. T., Scharpf, R. B., Bravo, H. C., Simcha, D., Langmead,
B., Johnson, E. W., Geman, D., Baggerly, K., and Irizarry, R. A.
(2010). Tackling the widespread and critical impact of batch
effects in high-throughput data. Nature Reviews Genetics 11,
733–739. doi:10.1038/nrg2825.
Lehmann, E. L. (1975). Nonparametrics: Statistical Methods
Based on Ranks. San Francisco, CA: Holden-Day.
Lehr, R. (1992). Sixteen s squared over d squared: a relation
for crude sample size estimates. Stat. Med. 11, 1099–1102.
Lehrer, J. (2010). The truth wears off. The New Yorker [online]
December 13.
URL http://www.newyorker.com/magazine/2010/12/13/
the-truth-wears-off
Levasseur, L., Faessel, H., Slocum, H., and Greco, W. (1995).
Precision and pattern in 96-well plate cell growth experiments.
In Proceedings of the American Statistical Association, Biopharma-
ceutical Section, pages 227–232. Alexandria, Virginia: Ameri-
can Statistical Association.
Levine, T. R. and Atkin, C. (2004). The accurate reporting of
software-generated p-values: a cautionary note. Comm. Res.
Rep. 21, 324–327. doi:10.1080/08824090409359995.
Lewi, P. J. (2005). The role of statistics in the success of a phar-
maceutical research laboratory: a historical case description. J
Chemometr. 19, 282–287.
Lewi, P. J. (2006). Speaking of graphics.
URL http://www.datascope.be
Lewi, P. J. and Smith, A. (2007). Successful pharmaceutical
discovery: Paul Janssen’s concept of drug research. R&D Man-
agement 37, 355–361. doi:10.1111/j.1467-9310.2007.00481.x.
Loscalzo, J. (2012). Irreproducible experimental results:
causes, (mis)interpretations, and consequences. Circula-
tion 125, 1211–1214. doi:10.1161/CIRCULATIONAHA.112.
098244.
Mead, R. (1988). The design of experiments: statistical principles
for practical application. Cambridge, UK: Cambridge University
Press.
Nadon, R. and Shoemaker, J. (2002). Statistical issues with mi-
croarrays: processing and analysis. Trends in Genetics 15, 265–
271.
Naik, G. (2011). Scientists’ elusive goal: Reproducing study
results. The Wall Street Journal December 2.
URL http://online.wsj.com/article/
SB10001424052970203764804577059841672541590.html
Nieuwenhuis, S., Forstmann, B. U., and Wagenmakers, E.-J.
(2011). Erroneous analysis of interactions in neuroscience: a
problem of significance. Nat. Neurosci. 14, 1105–1107.
Nuzzo, R. (2014). Scientific method: Statistical errors. Nature
506, 150–152. doi:10.1038/506150a.
Parkin, S., Pritchett, J., Grimsditch, D., Bruckdorfer, K., Sahota,
P., Lloyd, A., and Overend, P. (2004). Circulating levels of
the chemokines JE and KC in female C3H apolipoprotein-E-
deficient and C57BL apolipoprotein-E-deficient mice as poten-
tial markers of atherosclerosis development. Biochemical Soci-
ety Transactions 32, 128–130.
Peng, R. (2009). Reproducible research and biostatistics. Bio-
statistics 10, 405–408. doi:10.1093/biostatistics/kxp014.
Peng, R. (2015). The reproducibility crisis in science. Signifi-
cance 12, 30–32. doi:10.1111/j.1740-9713.2015.00827.x.
Potti, A., Dressman, H. K., Bild, A., Riedel, R., Chan, G., Sayer,
R., Cragun, J., Cottrill, H., Kelley, M. J., Petersen, R., Harpole,
D., Marks, J., Berchuck, A., Ginsburg, G. S., Febbo, P., Lan-
caster, J., and Nevins, J. R. (2006). Genomic signature to guidethe use of chemotherapeutics. Nature Medicine 12, 1294–1300.
doi:10.1038/nm1491. (Retracted).
Potti, A., Dressman, H. K., Bild, A., Riedel, R., Chan, G., Sayer,
R., Cragun, J., Cottrill, H., Kelley, M. J., Petersen, R., Harpole,
D., Marks, J., Berchuck, A., Ginsburg, G. S., Febbo, P., Lan-
caster, J., and Nevins, J. R. (2011). Retracted: Genomic signa-
ture to guide the use of chemotherapeutics. Nature Medicine
17, 135. doi:10.1038/nm0111-135. (Retracted).
Potvin, C., Lechowicz, M. J., Bell, G., and Schoen, D. (1990).
Spatial, temporal and species-specific patterns of heterogene-
ity in growth chamber experiments. Canadian Journal of Botany
68, 499–504.
Prinz, F., Schlange, T., and Asadullah, K. (2011). Believe it
or not: how much can we rely on published data on poten-
tial drug targets? Nature Rev. Drug Discov. 10, 712–713. doi:
10.1038/nrd3439-c1.
R Core Team (2013). R: A Language and Environment for Statis-
tical Computing. R Foundation for Statistical Computing, Vi-
enna, Austria.
URL http://www.R-project.org
Reinhart, A. (2015). Statistics Done Wrong: The Woefully Com-
plete Guide. San Francisco, CA: no starch press.
Ritskes-Hoitinga, M. and Strubbe, J. (2007). Nutrition and an-
imal welfare. In E. Kaliste, editor, The Welfare of Laboratory An-
imals, chapter 5, pages 95–112. Dordrecht, The Netherlands:
Springer.
Rivenson, A., Hoffmann, D., Prokopczyk, B., Amin, S., and
Hecht, S. S. (1988). Induction of lung and exocrine pancreas
tumors in F344 rats by tobacco-specific and Areca-derived N-
nitrosamines. Cancer Res. 48, 6912–6917.
Ruxton, G. D. and Colegrave, N. (2003). Experimental Design
for the Life Sciences. Oxford, UK: Oxford University Press.
Sailer, M. O. (2013). crossdes: Construction of Crossover Designs.
R package version 1.1-1.
URL http://CRAN.R-project.org/package=crossdes
Salsburg, D. (2001). The Lady Tasting Tea. New York, NY: Free-
man.
Schlain, B., Jethwa, H., Subramanyam, M., Moulder, K., Bhatt,
B., and Molley, M. (2001). Designs for bioassays with plate
location effects. BioPharm International 14, 40–44.
Scholl, C., Fröhling, S., Dunn, I., Schinzel, A. C., Barbie, D. A.,
Kim, S. Y., Silver, S. J., Tamayo, P., Wadlow, R. C., Ramaswamy,
S., Döhner, K., Bullinger, L., Sandy, P., Boehm, J. S., Root, D. E., Jacks,
T., Hahn, W., and Gilliland, D. G. (2009). Synthetic lethal in-
teraction between oncogenic KRAS dependency and STK33
suppression in human cancer cells. Cell 137, 821–834. doi:
10.1016/j.cell.2009.03.017.
Sellke, T., Bayarri, M., and Berger, J. (2001). Calibration of p
values for testing precise null hypotheses. The American Statis-
tician 55, 62–71.
Selwyn, M. R. (1996). Principles of Experimental Design for the
Life Sciences. Boca Raton, FL: CRC Press.
Séralini, G.-E., Clair, E., Mesnage, R., Gress, S., Defarge, N.,
Malatesta, M., Hennequin, D., and Vendômois, J. (2012). Long
term toxicity of a Roundup herbicide and a Roundup-tolerant
genetically modified maize. Food Chem. Toxicol. 50, 4221–4231.
doi:10.1016/j.fct.2012.08.005.
Séralini, G.-E., Clair, E., Mesnage, R., Gress, S., Defarge, N.,
Malatesta, M., Hennequin, D., and Vendômois, J. (2014). Re-
published study: long term toxicity of a Roundup herbicide
and a Roundup-tolerant genetically modified maize. Environ-
mental Sciences Europe 26, 14. doi:10.1186/s12302-014-0014-5 .
Snedecor, G. W. and Cochran, W. G. (1980). Statistical Methods.
Ames, IA: Iowa State University Press, 7th edition.
Straetemans, R., O’Brien, T., Wouters, L., Van Dun, J., Janicot,
M., Bijnens, L., Burzykowski, T., and Aerts, M. (2005). Design and
analysis of drug combination experiments. Biometrical J 47,
299–308.
Tallarida, R. J. (2001). Drug synergism: its detection and ap-
plications. J. Pharm. Exp. Ther. 298, 865–872.
Temme, A., Stümpel, F., Söhl, G., Rieber, E. P., Jungermann, K.,
Willecke, K., and Ott, T. (2001). Dilated bile canaliculi and attenuated
decrease of nerve-dependent bile secretion in connexin32-
deficient mouse liver. Eur. J. Physiol. 442, 961–966.
Tressoldi, P. E., Giofré, D., Sella, F., and Cumming, G. (2013).
High impact = high statistical standards? Not necessarily so.
PLOS ONE 8, e56180. doi:10.1371/journal.
pone.0056180.
Tufte, E. R. (1983). The Visual Display of Quantitative Informa-
tion. Cheshire, CT.: Graphics Press.
Tukey, J. W. (1980). We need both exploratory and confirma-
tory. The American Statistician 34, 23–25.
Van Belle, G. (2008). Statistical Rules of Thumb. Hoboken, NJ: J.
Wiley, 2nd edition.
van der Worp, B., Howells, D. W., Sena, E. S., Porritt, M.,
Rewell, S., O’Collins, V., and Macleod, M. R. (2010). Can an-
imal models of disease reliably inform human studies? PLoS
Med. 7, e1000245. doi:10.1371/journal.pmed.1000245.
van Luijk, J., Bakker, B., Rovers, M. M., Ritskes-Hoitinga, M.,
de Vries, R. B. M., and Leenaars, M. (2014). Systematic re-
views of animal studies; missing link in translational research?
PLOS ONE 9, e89981. doi:10.1371/journal.pone.0089981.
Vandenbroeck, P., Wouters, L., Molenberghs, G., Van Gestel,
J., and Bijnens, L. (2006). Teaching statistical thinking to life
scientists: a case-based approach. J. Biopharm. Stat. 16, 61–75.
Ver Donck, L., Pauwels, P. J., Vandeplassche, G., and Borgers,
M. (1986). Isolated rat cardiac myocytes as an experimental
model to study calcium overload: the effect of calcium-entry
blockers. Life Sci. 38, 765–772.
Verheyen, F., Racz, R., Borgers, M., Driesen, R. B., Lenders,
M. H., and Flameng, W. J. (2014). Chronic hibernating my-
ocardium in sheep can occur without degenerating events and
is reversed after revascularization. Cardiovasc Pathol. 23, 160–
168. doi:10.1016/j.carpath.2014.01.003.
Vlaams Instituut voor Biotechnologie (2012). A scientific
analysis of the rat study conducted by Gilles-Eric Séralini et
al.
URL http://www.vib.be/en/news/Documents/20121008_EN_Analyse%20rattenstudie%20Séralini%20et%20al.pdf
Wacholder, S., Chanock, S., Garcia-Closas, M., El ghormli, L.,
and Rothman, N. (2004). Assessing the probability that a posi-
tive report is false: an approach for molecular epidemiology
studies. J Natl Cancer Inst 96, 434–442. doi:10.1093/jnci/
djh075.
Wasserstein, R. and Lazar, N. (2016). The ASA’s statement on
p-values: context, process, and purpose. The American Statisti-
cian 70. doi:10.1080/00031305.2016.1154108. In press.
URL http://dx.doi.org/10.1080/00031305.2016.1154108
Weissgerber, T. L., Milic, N. M., Winham, S. J., and Garovic,
V. D. (2015). Beyond bar and line graphs: time for a new
data presentation paradigm. PLoS Biol. 13, e1002128. doi:
10.1371/journal.pbio.1002128.
Wilcoxon, F., Rhodes, L. J., and Bradley, R. A. (1963). Two se-
quential two-sample grouped rank tests with applications to
screening experiments. Biometrics 19, 58–84.
Wilks, S. S. (1951). Undergraduate statistical education. J.
Amer. Statist. Assoc. 46, 1–18.
Witte, J., Elston, R., and Cardon, L. (2000). On the relative
sample size required for multiple comparisons. Statist. Med.
19, 369–372.
Wonnacott, T. H. and Wonnacott, R. J. (1990). Introductory
Statistics. New York, NY: J. Wiley, 5th edition.
Yang, H., Harrington, C. A., Vartanian, K., Coldren, C. D.,
Hall, R., and Churchill, G. A. (2008). Randomization in lab-
oratory procedure is key to obtaining reproducible microar-
ray results. PLOS ONE 3, e3724. doi:10.1371/journal.pone.
0003724.
Young, S. S. (1989). Are there location/cage/systematic non-
treatment effects in long-term rodent studies? a question re-
visited. Fundam Appl Toxicol. 13, 183–188.
Zimmer, C. (2012). A sharp rise in retractions prompts calls
for reform. The New York Times April 17.
URL http://www.nytimes.com/2012/04/17/science/rise-in-scientific-journal-retractions-prompts-calls-for-reform.html
Appendices
A. Glossary of Statistical Terms
ANOVA : see analysis of variance.
Accuracy : the degree to which a measurement
process is free of bias.
Additive model : a model in which the combined
effect of several explanatory variables or fac-
tors is equal to the sum of their separate ef-
fects.
Alternative hypothesis : a hypothesis which is
presumed to hold if the null hypothesis does
not; the alternative hypothesis is necessary in
deciding upon the direction of the test and in
estimating sample sizes.
Analysis of variance : a statistical method of infer-
ence for making simultaneous comparisons
between two or more means.
Balanced design : a term usually applied to any
experimental design in which the same num-
ber of observations is taken for each combi-
nation of the experimental factors.
Bias : the long-run difference between the average
of a measurement process and its true value.
Blinding : the condition under which individu-
als are uninformed as to the treatment con-
ditions of the experimental units.
Block : a set of units which are expected to re-
spond similarly as a result of treatment.
Completely randomized design : a design in which
each experimental unit is randomized to a
single treatment condition or set of treat-
ments.
Confidence interval : a random interval that de-
pends on the data obtained in the study and
is used to indicate the reliability of an esti-
mate. For a given confidence level, if several
confidence intervals are constructed based
on independent repeats of the study, then in
the long run, the proportion of such intervals
that contain the true value of the parameter
will correspond to the confidence level.
Confounding : the phenomenon in which an extra-
neous variable, not under control of the in-
vestigator, influences both the factors under
study and the response variable.
Covariate : a concomitant measurement that is re-
lated to the response but is not affected by the
treatment.
Critical value : the cutoff or decision value in hy-
pothesis testing which separates the accep-
tance and rejection regions of a test.
Data set : a general term for observations and
measurements collected during any type of
scientific investigation.
Degrees of freedom : the number of values that are
free to vary in the calculation of a statistic,
e.g. for the standard deviation, the mean is
already calculated and puts a restriction on
the number of values that can vary; therefore
the degrees of freedom of the standard devi-
ation is the number of observations minus 1.
Effect size : when comparing treatment differ-
ences, the effect size is the mean difference
divided by the standard deviation (not stan-
dard error); the standard deviation can be
from either group, or a pooled standard de-
viation can be used.
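As a numerical illustration (the data are hypo-
thetical), the effect size based on a pooled stan-
dard deviation can be computed in R:

> # hypothetical data for two treatment groups
> x <- c(5.1, 6.0, 5.6, 6.2, 5.8)
> y <- c(6.8, 7.1, 6.5, 7.4, 6.9)
> # pooled standard deviation of the two groups
> sp <- sqrt(((length(x)-1)*var(x)+(length(y)-1)*var(y))/
+           (length(x)+length(y)-2))
> # effect size: mean difference divided by pooled SD
> (mean(y)-mean(x))/sp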
Estimation : an inferential process that uses the
value of a statistic derived from a sample to
estimate the value of a corresponding popu-
lation parameter.
Experimental unit : the smallest unit to which dif-
ferent treatments or experimental conditions
can be applied.
Explanatory variable : also called predictor, a vari-
able which is used in a relationship to explain
or to predict changes in the values of another
variable; the latter is called the dependent vari-
able.
External validity : extent to which the results of a
study can be generalized to other situations.
Factor : the condition or set of conditions that is
manipulated by the investigator.
Factor level : the particular value of a factor.
False negative : the error of accepting the null hy-
pothesis when it is false.
False positive : the error of rejecting the null hy-
pothesis when it is true.
Hypothesis testing : a formal statistical procedure
where one tests a particular hypothesis on
the basis of experimental data.
Internal validity : extent to which a causal conclu-
sion based on a study is warranted.
Level of significance : the allowable rate of false
positives, set prior to analysis of the data.
Null hypothesis : a hypothesis indicating “no dif-
ference” which will either be accepted or re-
jected as a result of a statistical test.
Observational unit : the unit on which the re-
sponse is measured or observed; this is
not necessarily identical to the experimental
unit.
One-sided test : a statistical test for which the re-
jection region consists of either very large or
very small values of the test statistic, but not
of both.
P-value : the probability of obtaining a test statis-
tic as extreme as or more extreme than the
observed one, provided the null hypothesis
is true; small p-values are unlikely when the
null hypothesis holds.
Parameter : a population quantity of interest, ex-
amples are the population mean and stan-
dard deviation of a normal distribution.
Pilot study : a preliminary study performed to
gain initial information to be used in plan-
ning a subsequent, definitive study; pilot
studies are used to refine experimental pro-
cedures and provide information on sources
of bias and variability.
Population : the collection of all subjects or units
about which inference is desired.
Precision : the degree to which a measurement
process is limited in terms of its variability
about a central value.
Protocol : a document describing the plan for
a study; protocols typically contain infor-
mation on the rationale for performing the
study, the study objectives, experimental
procedures to be followed, sample sizes and
their justification, and the statistical analy-
ses to be performed; the study protocol must
be distinguished from the technical protocol
which is more about lab instructions.
Randomization : a well-defined stochastic law
for assigning experimental units to differing
treatment conditions; randomization may
also be applied elsewhere in the experiment.
Statistic : a mathematical function of the observed
data.
Statistical inference : the process of drawing con-
clusions from data that is subject to random
variation.
Stochastic : non-deterministic, chance dependent.
Subsampling : the situation in which measure-
ments are taken at several nested levels; the
highest level is called the primary sampling
unit; the next level is called the secondary
sampling unit, etc. When subsampling is
present, it is of great importance to identify
the correct experimental unit.
Test statistic : a statistic used in hypothesis test-
ing; extreme values of the test statistic are un-
likely under the null hypothesis.
Treatment : a specific combination of factor levels.
Two-sided test : a statistical test for which the re-
jection region consists of both very large and
very small values of the test statistic.
Type II error : error made by not rejecting the
null hypothesis when the alternative hypoth-
esis is true.
Type I error : error made by the incorrect rejec-
tion of a true null hypothesis.
Variability : the random fluctuation of a measure-
ment process about its central value.
B. Tools for randomization in MS Excel
and R
B.1 Completely randomized de-
sign
Suppose 21 experimental units have to be ran-
domly assigned to three treatment groups, such
that each treatment group contains exactly seven
animals.
B.1.1 MS Excel
A randomization list is easily constructed using a
spreadsheet program like Microsoft Excel. This is
illustrated in Figure B.1. We enter in the first col-
umn of the spreadsheet the code for the treatment
(1, 2, 3). Using the RAND() function, we fill the
second column with pseudo-random numbers be-
tween 0 and 1. Subsequently the two columns are
selected and the Sort command from the Data-menu
is executed. In the Sort-window that appears now,
we select column B as column to be sorted by. The
treatment codes in column A are now in random
order, i.e. the first animal will receive treatment 2,
the second treatment 3, etc.
B.1.2 R-Language
In the open source statistical language R, the same
result is obtained by:
> # make randomization process reproducible
> set.seed(14391)
> # sequence of treatment codes A,B,C repeated 7 times
> x<-rep(c("A","B","C"),7)
> x
[1] "A" "B" "C" "A" "B" "C"
[7] "A" "B" "C" "A" "B" "C"
[13] "A" "B" "C" "A" "B" "C"
[19] "A" "B" "C"
> # randomize the sequence in x
> rx<-sample(x)
> rx
[1] "B" "B" "B" "A" "A" "C"
[7] "C" "C" "B" "A" "C" "C"
Figure B.1. Generating a completely randomized design in MS Excel
Figure B.2. Generating a randomized complete block design in MS Excel
[13] "A" "B" "C" "B" "A" "A"
[19] "C" "A" "B"
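A quick check of the randomized sequence rx generated above
confirms that the allocation is still balanced, with each
treatment occurring exactly seven times:

> # tabulate the randomized treatment codes
> table(rx)
rx
A B C
7 7 7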
B.2 Randomized complete block
design
Suppose 20 experimental units, organized in 5 blocks of size 4,
have to be randomly assigned to 4 treatment groups A, B, C, D,
such that each treatment occurs exactly once in each block.
B.2.1 MS Excel
To generate the design in MS Excel, follow the procedure that
is depicted in Figure B.2. We enter in the first column of the
spreadsheet the code for the treatment (A, B, C, D). The sec-
ond column (Column B) is filled with an indication of the block
(1:5). Using the RAND() function, we fill the third column with
pseudo-random numbers between 0 and 1. Subsequently, the
three columns are selected and the Sort command from the
Data-menu is executed. In the Sort window that appears now,
we select Column B as first sort criterion and Column C as sec-
ond sort criterion. The treatment codes in column A are now
for each block in random order, i.e. the first animal in block 1
will receive treatment A, the second treatment D, etc.
B.2.2 R-Language
> # treatment codes A,B,C,D repeated for each of the 5 blocks
> treat<-rep(c("A","B","C","D"),5)
> design<-data.frame(block=rep(1:5,rep(4,5)),treat=treat)
> head(design,10) # first 10 exp units
block treat
1 1 A
2 1 B
3 1 C
4 1 D
5 2 A
6 2 B
7 2 C
8 2 D
9 3 A
10 3 B
> # randomly distribute units
> rdesign<-design[sample(dim(design)[1]),]
> # order by blocks for convenience
> rdesign<-rdesign[order(rdesign[,"block"]),]
> # sequence of units within blocks
> # is randomly assigned to treatments
> head(rdesign,10)
block treat
3 1 C
4 1 D
1 1 A
2 1 B
8 2 D
6 2 B