CLINICAL TRIALS
OSAKA UNIVERSITY BIOSTATISTICS

Tatsuki Koyama, PhD
Center for Quantitative Sciences
Vanderbilt University School of Medicine
[email protected]

7/6/15 ∼ 7/10/15

Copyright 2015. T Koyama. All Rights Reserved.
Updated June 18, 2015




Contents

1 Observational Studies and Experimental Studies
  1.1 Definitions
  1.2 Observational studies
      1.2.1 Advantages of observational studies
      1.2.2 PCOS
  1.3 Clinical trials
      1.3.1 Essential requirements for clinical trials
      1.3.2 Some unique issues with clinical trials
      1.3.3 PIVOT

2 Clinical trial
  2.1 Phases of clinical trials
      2.1.1 Phase I
      2.1.2 Phase II
      2.1.3 Phase III
      2.1.4 Phase IV
  2.2 Clinical trial terminologies
      2.2.1 Intention-to-Treat (ITT)

3 Selected topics in basic statistics
  3.1 Stop when significant
  3.2 Repeat until significance
  3.3 Divide by the control mean

4 Brief introduction to simple Bayesian analysis
  4.1 Introduction
  4.2 Bayes theorem
  4.3 Example
      4.3.1 Normal distributions
      4.3.2 Beta-Binomial

5 Randomization in clinical trials
  5.1 Introduction
  5.2 Example: The Salk Vaccine Field Trial
  5.3 Simple randomization
  5.4 Imbalance in treatment allocation
      5.4.1 Block randomization
      5.4.2 Biased coin and urn model
  5.5 Imbalance in baseline patient characteristics
      5.5.1 Stratified randomization
      5.5.2 Adaptive and minimization randomization
  5.6 Response adaptive randomization

6 Phase I clinical trials
  6.1 Introduction
  6.2 Non-cancer, non-AIDS phase I
      6.2.1 TGN1412 disaster
  6.3 3 + 3 design
      6.3.1 Accelerated titration
  6.4 Bayesian approach: CRM
      6.4.1 Example
  6.5 Modified Toxicity Probability Interval Design

7 Phase II clinical trials
  7.1 Introduction
  7.2 Phase II trials in oncology
  7.3 Two-stage designs
      7.3.1 Gehan's design
      7.3.2 Fleming's design
      7.3.3 Simon's design
  7.4 Data analysis following a two-stage design in phase II clinical trials
      7.4.1 p-value
      7.4.2 Point estimate

8 Treatment effects monitoring
  8.1 Introduction
      8.1.1 Composition and organization of TEMC = DMC

9 Group Sequential Method
  9.1 Introduction
  9.2 Example
  9.3 General applications
      9.3.1 Beta blocker heart attack trial
      9.3.2 non-Hodgkin's lymphoma
  9.4 Alpha-spending
  9.5 One-sided test
  9.6 Repeated confidence intervals
  9.7 P-values

10 Two-stage adaptive designs
  10.1 Introduction
  10.2 Background
  10.3 Toward conditional power
  10.4 Conditional power functions
  10.5 Unspecified designs
  10.6 Ordering of sample space
  10.7 Predictive power

11 Factorial design
  11.1 Notation using cell means
  11.2 Efficiency when no interaction
  11.3 Example: the Physician's Health Study I (1989)
  11.4 Treatment interactions

12 Crossover design
  12.1 Some characteristics of crossover design
  12.2 Analysis of 2×2 crossover design
  12.3 Examples
  12.4 Analysis of simple crossover design
  12.5 A two-period crossover design for the comparison of two active treatments and placebo
  12.6 Latin squares
  12.7 Optimal designs

13 Pragmatic clinical trials
  13.1 Superiority, Noninferiority, and Equivalence
      13.1.1 Hypotheses
  13.2 Sample size
      13.2.1 Sample size adjustment for ITT analysis


Chapter 1

Observational Studies and Experimental Studies

1.1 Definitions

Observational study A study design in which the investigator does not control the assignment of treatment of individual study subjects. (Piantadosi)

Experiment A study in which the investigator makes a series of careful observations under controlled or arranged conditions. In particular, the investigator controls the treatment or exposure applied to the subject(s) by design and then carefully and thoroughly records outcome measurements.

Clinical trial An experiment in humans designed to accurately assess the effects of a treatment or treatments by reducing random error and bias. (Piantadosi)

Clinical trial A prospective study comparing the effect and value of an intervention against a control in human subjects. (Friedman)

Clinical trial An experiment designed to assess the efficacy of a test treatment by comparing its effects with those produced using some other test or control treatment in comparable groups of human beings. (Meinert CL 1994)



1.2 Observational studies

1.2.1 Advantages of observational studies

• Lower cost

• Greater timeliness

• A broader range of patients

• May be applied when a controlled clinical trial would be impossible or unethical

1.2.2 PCOS

Prostate cancer is the second leading cause of cancer death among American men (behind lung cancer). The common treatment choices for localized disease are surgery, radiation, and observation. Suppose we are interested in comparing the effectiveness of surgery and radiation therapies.

The Prostate Cancer Outcomes Study (PCOS): Subjects were identified through six sites participating in the NCI's SEER program (diagnosed with prostate cancer from 1994/10/1 to 1995/10/31). N = 1,655.

            Alive        Dead
Radiation   240 (49%)    251 (51%)
Surgery     838 (72%)    326 (28%)

Can we conclude that Surgery is better?

χ2 = 80.2, p < 0.001. Odds ratio: 2.69 (2.15, 3.36).
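These numbers can be reproduced from the 2×2 table above; a quick Python cross-check (the notes' own code elsewhere is R), using the Yates continuity correction, which matches the reported χ²:

```python
# PCOS 2x2 table: rows = treatment, columns = (alive, dead)
radiation = (240, 251)
surgery = (838, 326)

n = sum(radiation) + sum(surgery)
row_totals = [sum(radiation), sum(surgery)]
col_totals = [radiation[0] + surgery[0], radiation[1] + surgery[1]]

# Pearson chi-squared with Yates continuity correction
# (matches the reported 80.2; without the correction it is about 81.2)
chi2 = 0.0
for i, row in enumerate((radiation, surgery)):
    for j, observed in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        chi2 += (abs(observed - expected) - 0.5) ** 2 / expected

# Odds ratio of being alive: surgery vs. radiation
odds_ratio = (surgery[0] / surgery[1]) / (radiation[0] / radiation[1])

print(round(chi2, 1))        # 80.2
print(round(odds_ratio, 2))  # 2.69
```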

In general, it is difficult to establish a cause-and-effect association from an observational study because of confounders.


Confounder A prognostic factor that is associated with both the response (e.g., survival) and the explanatory variable (e.g., treatment choice).

                 Radiation           Surgery             P-value
                 N = 491             N = 1164
Age              69 (64, 71)         64 (59, 68)         P < 0.001
PSA              2.14 (1.78, 2.60)   1.93 (1.63, 2.42)   P < 0.001
Tumor grade                                              P = 0.019
  1              65% (292)           72% (743)
  2              25% (110)           21% (216)
  3              10% (46)            7% (73)
Gleason score                                            P = 0.049
  2−6            62% (223)           69% (520)
  7              27% (95)            23% (170)
  8−10           11% (44)            8% (60)

[Figure: boxplots of age (roughly 55 to 75 years) by treatment group; radiation n = 295, surgery n = 705.]


[Figure: boxplots of age by treatment group and vital status; radiation/alive n = 132, radiation/dead n = 163, surgery/alive n = 521, surgery/dead n = 184.]

Let's analyze the data with a method that accounts for the baseline differences in the two treatment groups.

Survival ∼ Treatment ∗ (Age + PSA + Tumor grade + Gleason score)

Many statistical methods exist to establish a causal relationship from an observational study (e.g., propensity score, instrumental variable).

• Propensity score analysis

– propensity score matching

– propensity score as weights

– propensity score as additional variable


[Figure: distribution of propensity scores (0.0 to 1.0) by treatment group, Radiation vs. Surgery.]

Can observational studies establish a cause-effect association?

“PM USA agrees with the overwhelming medical and scientific consensus that cigarette smoking causes lung cancer, heart disease, emphysema and other serious diseases in smokers. Smokers are far more likely to develop serious diseases, like lung cancer, than non-smokers. There is no safe cigarette.”


“Smoking and health” by JT
The Ministry of Health's claim that smoking is a risk factor for many diseases is primarily based on epidemiological studies of comparisons between smokers and non-smokers on disease rate. Epidemiological studies are useful in establishing exploratory associations between a disease and risk factors, but they cannot establish a cause-and-effect association without controlling for other factors such as genetic factors, diet, exercise and stress. Moreover, epidemiological studies are intended to compare populations and do not reveal the risk of disease for individual smokers.

1.3 Clinical trials

1.3.1 Essential requirements for clinical trials

• Human subjects

• Designed

• Involves intervention

• Comparable treatment groups

• Prospective follow-up for a specified outcome

Types of study designs that are not clinical trials:

• Case report

• Cross-sectional study

• Case-control study

• General observational study (prospective study)

• Animal study


Some typical characteristics of human studies

• Large variation among subjects

• Lengthy disease process

• Rare disease

• Non-compliance / dropouts

• Ethical issues

Advantages of clinical trials

• Can establish a cause-effect association (free of confounding)

1.3.2 Some unique issues with clinical trials

• Ethics

– Randomization

– Placebo

– One patient and the society

– When accumulating data show trends, what should we do?

• Safety / toxicity

• Patient consent

• Patient compliance

• Quality of life


1.3.3 PIVOT

Prostate Cancer Intervention Versus Observation Trial (PIVOT):

• Results presented at American Urological Association Annual Meeting (May 2011)

• Wilt et al. (PIVOT Study Group). “Radical prostatectomy versus observation for localized prostate cancer.” N Engl J Med. 2012. 367(3):203–213. Erratum in: N Engl J Med. 2012. 367(6):582.

• Patients with localized prostate cancer (enrollment 1994–2002); the last observation was made in 2010.

• Inclusion criteria:

– 75 years or younger
– Localized disease
– PSA < 50 ng/mL
– Diagnosed within 12 months
– Radical prostatectomy candidate

• Endpoint: All cause mortality

• intention-to-treat analysis

The study objective was “Among men with clinically localized prostate cancer detected during the early PSA era, does the intent to treat with radical prostatectomy reduce all-cause & prostate cancer mortality compared to observation?”

• 5,023 were eligible, 4,292 declined randomization.

• 731 were randomized (364 prostatectomy / 367 observation). Prostatectomy was performed in 281 (77%) of 364 in group 1 and in 35 (10%) of 367 in group 2.


                        Actual Treatment
Assigned Treatment      Surgery       Observation
Surgery                 281 (77%)     83 (23%)       364 (50%)
Observation             35 (10%)      332 (90%)      367 (50%)
                        316 (43%)     415 (57%)      731

Intention-to-treat analysis compares 364 surgery patients and 367 observation patients based on the assigned treatments.

On-treatment analysis compares 316 surgery patients and 415 observation patients based on the actual treatments.

Per-protocol analysis compares 281 surgery patients and 332 observation patients who adhered to the protocol.

• No statistically significant difference in baseline patient characteristics (age / race / marital status / comorbidities / PSA / stage).

• Median follow-up: 10.0 years (quartiles: 7.3, 12.6).

• All-cause mortality: 354/731 = 48.4%.

• Prostate cancer specific mortality: 52/731 = 7.1%.

• No difference in all-cause mortality between the treatments.

• Significant differences (all in favor of surgery) were found in:

– All-cause mortality among low-risk patients.

– Prostate cancer specific mortality in high-risk patients.

– Prostate cancer specific mortality in PSA high patients.

“While surgery did not reduce mortality more than observation in men with low PSA or low risk prostate cancer, our results suggest a benefit from surgery in men with higher PSA or higher risk disease.”


Chapter 2

Clinical trial

A clinical trial is an experiment testing a medical treatment on human subjects. The opposite term is perhaps a nonexperimental study, not an observational study.

2.1 Phases of clinical trials

Clinical trials are usually classified into phases (I to IV). This terminology may be inadequate, but it is very widely used.

2.1.1 Phase I

The main objective of a phase I clinical trial is to establish safety (to estimate the MTD, the maximum tolerated dose). These studies should provide information on the pharmacokinetics of the drug in humans. They may provide preliminary information on the pharmacodynamics of the drug.

Pharmacokinetics What the body does to the drug. The process by which the drug is absorbed, distributed, metabolized, and eliminated by the body. Some commonly used parameters to study pharmacokinetics are: concentration of drug (in plasma); biological half-life; Cmax, the peak plasma concentration of a drug; tmax, the time to achieve Cmax.

Pharmacodynamics What the drug does to the body. Effects of drugs on living organisms and systems.

Subjects for a phase I study are usually normal healthy volunteers. In oncology studies, patients may participate. Single-arm dose escalation (dose determination) trials are common in phase I studies. Historically, 3+3 designs have been frequently used; however, Bayesian designs, namely the continual reassessment method and the modified toxicity probability interval, are gaining popularity.

2.1.2 Phase II

Phase II trials primarily look for evidence of efficacy (activity), but safety should also be closely monitored. Sometimes, phase II trials are divided into phase IIa and IIb trials.

Phase IIa The primary objective is to establish a safe (and effective) dose.

Phase IIb The primary objective is to assess efficacy of the drug.

Phase II trials are often single-arm trials, where the response rate is compared to a historical control. They can be multi-arm with a placebo control. Two-stage and multi-stage designs are often applied to expedite a decision of futility.

2.1.3 Phase III

Phase III clinical trials are considered pivotal, and the new treatment is compared to standard treatment or placebo to establish effectiveness of the new treatment. These trials are often multi-center trials involving hundreds to thousands of patients. When there is already a good conventional treatment, establishing noninferiority (as opposed to superiority) may be the primary objective.

2.1.4 Phase IV

Phase IV is postmarketing surveillance, often looking for uncommon but serious side effects.

Phase I/II trials and phase II/III trials have become popular.

2.2 Clinical trial terminologies

• Clinical trial protocols

• Data monitoring

– for safety

– for efficacy

• Active control

• Patient populations

– As treated / Treatment received

– Intention to treat (ITT)

– Per protocol / Adherers only

• Hypothesis of interest

– Superiority

– Non-inferiority

– Equivalence


• Masking / Blinding

• Regulatory body

– FDA (Food and Drug Administration)

– EMA (European Medicines Agency)

– PMDA (Pharmaceuticals and Medical Devices Agency)

• ICH (International Conference on Harmonisation)

– ICH E9: Statistics

– ICH E6: Good Clinical Practice (GCP)

– ICH E3: Study reports

2.2.1 Intention-to-Treat (ITT)

Intention to treat is the idea that patients on a randomized clinical trial should be analyzed as part of the treatment group to which they were assigned, even if they did not actually receive the intended treatment. (Piantadosi)

Randomization in a clinical trial should eliminate observable and unobservable bias; therefore, if some patients are excluded or allowed to switch assignments, the potential for bias is re-introduced.

In practice, oftentimes, ITT only includes the patients who took at least one dose of treatment and provided any data.

• All patients screened

• All patients randomized

– ITT (largest group of patients intended for treatment)


– Perhaps overly conservative compared to the definition below

• All patients receiving at least one dose of the study drug

– Practical definition of ITT. May induce some bias.

Some patients may be missing some data, and imputation is required.

Reasons that may prompt exclusion:

• protocol violation

• incorrect drug administration

• use of disallowed medication

• poor compliance

• drop out due to adverse event

• drop out due to perceived lack of efficacy


Chapter 3

Selected topics in basic statistics

3.1 Stop when significant

Suppose we would like to test H0 : π = 0.1; H1 : π > 0.1 by taking a random sample of size 40 from a Bernoulli(π). Under H0, the number of “successes” out of 40 has a Binomial(40, 0.1) distribution. Moreover, we compute P0[X ≥ 8] = 0.042, so the test that rejects H0 if X ≥ 8 has a type I error rate of 0.042.
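The tail probability behind this cutoff is easy to verify directly from the Binomial(40, 0.1) distribution; a quick Python check (the notes' own code elsewhere is R):

```python
from math import comb

def binom_tail(k, n, p):
    """P[X >= k] for X ~ Binomial(n, p)."""
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(k, n + 1))

alpha = binom_tail(8, 40, 0.10)
print(round(alpha, 3))  # 0.042
```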

The data were observed sequentially, and the 8th “success” was observed after N = 32, and we decided to reject H0 and terminate the study. The conclusion was that π was significantly greater than 0.1, with π̂ = 8/32 = 0.25 and a 90% confidence interval of (0.131, 0.406).

Aside: confidence interval for a binomial proportion

• Asymptotic method

• Wilson score method (note: use p̃ = (X + 2)/(N + 4))



• Clopper-Pearson (exact) method

Given x and N, the exact-method confidence interval is (πL, πU), where πL and πU satisfy the following:

P[X ≥ x | π = πL] = ∑_{k=x}^{n} C(n,k) πL^k (1−πL)^{n−k} = α/2

P[X ≤ x | π = πU] = ∑_{k=0}^{x} C(n,k) πU^k (1−πU)^{n−k} = α/2

This confidence interval contains the π∗ values such that H0 : π = π∗ would not be rejected by the observed data. For π∗ < p, we need to find π∗ such that P[X ≥ 8 | π = π∗] = 0.05. And for π∗ > p, we need to find π∗ such that P[X ≤ 8 | π = π∗] = 0.05.

1 - pbinom(7, 32, 0.10)   # or sum( dbinom(8:32, 32, 0.10) )
## [1] 0.01168545   (this needs to be 0.05)

1 - pbinom(7, 32, 0.12)   # or sum( dbinom(8:32, 32, 0.12) )
## [1] 0.03193625   (this needs to be 0.05)

To solve for π∗, we can use the following relationship between a Binomial random variable and a Beta random variable. If X ∼ Binomial(n, p) and Y ∼ Beta(k, n − k + 1), then

P[X ≥ k] = P[Y ≤ p].

So instead of solving 0.05 = P[X ≥ 8 | π = π∗] for π∗ iteratively, we can solve 0.05 = P[Y ≤ π∗].

qbeta(0.05, 8, 32-8+1)
## [1] 0.1309329
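The Binomial-Beta relationship can be checked numerically: plugging the value 0.1309329 returned by qbeta back into the Binomial(32, π) tail should recover 0.05. A quick Python check (the notes' own code is R):

```python
from math import comb

def binom_tail(k, n, p):
    """P[X >= k] for X ~ Binomial(n, p)."""
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(k, n + 1))

# pi_L = qbeta(0.05, 8, 25); by P[X >= k] = P[Y <= p] it satisfies
# P[X >= 8 | pi = pi_L] = 0.05 for n = 32
pi_L = 0.1309329
tail = binom_tail(8, 32, pi_L)
print(round(tail, 4))  # 0.05
```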

Questions


• Was type I error rate controlled?

• Was π̂ unbiased?

π̂ is an estimator such that

π̂ = 8/Y   if Y ≤ 40,
π̂ = X/40  if X < 8,

where Y is the number of trials at which the 8th “success” happened. (Y has a negative binomial distribution.)

E_π[π̂] = ∑_{y=8}^{40} (8/y) P_π[Y = y] + ∑_{x=0}^{7} (x/40) P_π[X = x]

E_π[π̂] − π > 0

[Figure: bias of π̂, E_π[π̂] − π, plotted against the true π (0 to 1); the bias is positive, reaching roughly 0.03.]
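The positive bias can also be seen by Monte Carlo; a Python sketch of the stopping rule (the value π = 0.2 is an illustrative choice, not from the text):

```python
import random

def estimate(true_pi, rng, n_max=40, k_stop=8):
    """Run one sequential experiment; return the resulting estimate of pi."""
    successes = 0
    for trial in range(1, n_max + 1):
        if rng.random() < true_pi:
            successes += 1
            if successes == k_stop:
                return k_stop / trial      # stopped at the 8th success: 8/Y
    return successes / n_max               # fewer than 8 successes: X/40

rng = random.Random(1)
true_pi = 0.2
estimates = [estimate(true_pi, rng) for _ in range(100_000)]
bias = sum(estimates) / len(estimates) - true_pi
print(bias > 0)  # True: the estimator overestimates pi
```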


3.2 Repeat until significance

Suppose X is a random sample of size 10 from Normal(µ, σ²). If H0 : µ = 0 is not rejected, take another sample of size 15 and test again with n = 25. If each test has one-sided α = 0.05, what is the actual type I error rate of this procedure?

• If we have k significance tests of size α, the probability of at least one false positive result is 1 − (1−α)^k. For α = 0.05,

k                  1      2       3      4      5      10
P[false positive]  0.05   0.0975  0.143  0.186  0.226  0.401
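The entries of this table follow directly from 1 − (1 − α)^k; a quick Python check:

```python
alpha = 0.05
rates = {k: 1 - (1 - alpha) ** k for k in (1, 2, 3, 4, 5, 10)}
for k, p in rates.items():
    print(k, round(p, 4))
# e.g. k = 3 gives 0.1426 and k = 10 gives 0.4013,
# matching the 0.143 and 0.401 in the table
```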

We need the conditional distribution of Zt given Z1 = z1.

Under H0,

X̄1 ∼ N(0, σ²/n1),   X̄2 ∼ N(0, σ²/n2).

Z1 = √n1 X̄1/σ ∼ N(0,1),   Z2 = √n2 X̄2/σ ∼ N(0,1).

Let nt = n1 + n2 and write

Zt = (√nt/σ) X̄t = (√nt/σ) · (n1 X̄1 + n2 X̄2)/nt = (√nt/σ) · (σ√n1 Z1 + σ√n2 Z2)/nt = (√n1 Z1 + √n2 Z2)/√nt

Therefore, given Z1 = z1,

Zt = (√n1/√nt) z1 + (√n2/√nt) Z2,


and Zt > c is equivalent to

Z2 > (√nt/√n2) c − (√n1/√n2) z1.

The conditional type I error rate given Z1 = z1 is

P[ Z2 > (√nt/√n2) c − (√n1/√n2) z1 | Z1 = z1 ].

And by integrating this conditional type I error rate with respect to the distribution of Z1, we get the unconditional type I error rate for the second stage (α2):

α2 = ∫_{−∞}^{c} [1 − Φ((√nt/√n2) c − (√n1/√n2) z1)] φ(z1) dz1.

If we let n1/n2 → 0, the above expression tends to

α2 = ∫_{−∞}^{c} (1 − Φ(c)) φ(z1) dz1 = (1 − Φ(c)) Φ(c).

And if α = 0.05, i.e., c = 1.645, α2 = 0.05 × 0.95 = 0.0475.

Now, n1 = 10 and n2 = 15, and

α2 = ∫_{−∞}^{1.645} [1 − Φ((√25/√15) · 1.645 − (√10/√15) z1)] φ(z1) dz1
   = ∫_{−∞}^{1.645} (1 − Φ(2.12 − 0.82 z1)) φ(z1) dz1
   = 0.033
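This integral can be checked numerically using only the Python standard library (the normal CDF via erf); a sketch with a simple trapezoidal rule:

```python
from math import erf, exp, pi, sqrt

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def phi(z):
    """Standard normal density."""
    return exp(-z * z / 2) / sqrt(2 * pi)

n1, n2 = 10, 15
nt = n1 + n2
c = 1.645

# Trapezoidal rule for
#   alpha2 = int_{-inf}^{c} [1 - Phi(sqrt(nt/n2)*c - sqrt(n1/n2)*z)] phi(z) dz
a, b, steps = -10.0, c, 100_000
h = (b - a) / steps
total = 0.0
for i in range(steps + 1):
    z = a + i * h
    f = (1 - Phi(sqrt(nt / n2) * c - sqrt(n1 / n2) * z)) * phi(z)
    total += 0.5 * f if i in (0, steps) else f
alpha2 = total * h
print(round(alpha2, 3))  # 0.033
```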

3.3 Divide by the control mean

Suppose we are interested in comparing two treatment groups. Take random samples of size 9 from each group (x11, ..., x19; x21, ..., x29). We want to test H0 : µ1 = µ2. But we may be worried that the “background noise” is different for these two groups, so we take random samples of size 3 representing background: c11, c12, c13 and c21, c22, c23. Then we may “normalize” the x's by dividing by the average of the corresponding control group: yij = xij / c̄i, i = 1, 2; j = 1, ..., 9.

Suppose Xi ∼ Normal(µi, σx²) and Ci ∼ Normal(νi, σc²).

What is the distribution of Y?

Suppose that µ1 = µ2 = 120, ν1 = ν2 = 30; σx = 4, σc = 3. Then if we use a t-test on the yij, the type I error rate is about ... .

Nx    Nc    σx    σc     α
9     3     4     3      0.69
9     3     4     0.4    0.08
9     3     4     0      0.05
90    30    4     3      0.72
9     300   4     3      0.05
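The first row of this table (α ≈ 0.69) can be reproduced by simulation; a Python sketch under the stated model (the pooled-t critical value 2.120 for 16 df is hard-coded since the standard library has no t quantile, and B = 4,000 replicates are used for speed):

```python
import random
import statistics

# Pooled two-sample t-test on y = x / mean(c); crit = 2.120 is the
# two-sided 5% point of t with 16 degrees of freedom.
def reject(rng, nx=9, nc=3, sx=4.0, sc=3.0, mu=120.0, nu=30.0, crit=2.120):
    y = []
    for _group in range(2):
        x = [rng.gauss(mu, sx) for _ in range(nx)]
        cbar = statistics.mean(rng.gauss(nu, sc) for _ in range(nc))
        y.append([xi / cbar for xi in x])
    m1, m2 = statistics.mean(y[0]), statistics.mean(y[1])
    v1, v2 = statistics.variance(y[0]), statistics.variance(y[1])
    pooled = ((nx - 1) * v1 + (nx - 1) * v2) / (2 * nx - 2)
    t = (m1 - m2) / (pooled * 2 / nx) ** 0.5
    return abs(t) > crit

rng = random.Random(7)
B = 4000
rate = sum(reject(rng) for _ in range(B)) / B
print(rate > 0.5)  # True: far above the nominal 0.05
```

The inflation arises because dividing a whole group by the same noisy control mean shifts the entire group together, which the within-group variance in the t-test cannot see.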

A simple remedy is to use a regression approach (analysis of variance).

Y = β0 + β1 Xg + β2 Xt + β12 Xg Xt + ε,

where Xg = 0 if group 1, 1 otherwise; Xt = 0 if control, 1 otherwise.

Then the expected group means are

          Control      Treatment
Group 1   β0           β0 + β2
Group 2   β0 + β1      β0 + β1 + β2 + β12

Interpretation of the regression parameters:

β0 Group 1 control mean

β1 Difference of control means

β2 Treatment − Control in group 1

β12 (Group 2 Treatment − Group 1 Treatment) − Difference of control means

Thus testing β12 = 0 is testing for the treatment difference taking into account the control difference.

Type I error rate by simulation (B = 10,000)

                        unweighted   weighted (1/σ²)   weighted (1/s²)
Nx    Nc    σx    σc    α            α                 α
9     3     4     3     0.026        0.051             0.063
9     3     4     0.4   0.002        0.052             0.053
90    30    4     3     0.024        0.050             0.050
200   30    4     3     0.050        0.051             0.053
90    200   4     3     0.083        0.050             0.055
90    200   4     0.4   0.184        0.054             0.054

If ‘fold change’ is desired, as in this example, perhaps we can take the logarithm of all values before fitting the regression model.

log(Y) = β′0 + β′1 Xg + β′2 Xt + β′12 Xg Xt + ε

Then what does β′12 represent?


Chapter 4

Brief introduction to simple Bayesian analysis

4.1 Introduction

• What is the philosophical difference between frequentist and Bayesian statistics? To a frequentist, unknown model parameters are fixed and only estimable by replications of data from some experiment. A Bayesian thinks of parameters as random quantities that have distributions, like the data.

• How does it work?

– Start with a prior guess for θ (usually a distribution).
– Update the information by combining it with the data X.
– Obtain a posterior distribution of θ.
– All statistical inferences follow from the posterior distribution.

• Advantages of Bayesian methods.

– ability to incorporate prior information
– stopping early does not affect the inference in the way frequentist approaches do (e.g., inflation of the type I error rate)
– interpretation of the result is easier



4.2 Bayes theorem

P(A|B) = P(B|A)P(A) / P(B).

Using this theorem, you can switch the event of interest and the condition. For instance, we are usually interested in P[disease | test positive], but the observed data usually give P[test positive | disease]. As long as you can compute P[test positive] and know P[disease], you can make the switch.
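As a quick numeric illustration of the switch (the sensitivity, specificity, and prevalence values below are made up for illustration, not from the notes):

```python
def p_disease_given_positive(sens, spec, prev):
    """Bayes theorem: P[disease | test+] from P[+ | disease] (sensitivity),
    P[- | no disease] (specificity), and P[disease] (prevalence)."""
    p_positive = sens * prev + (1 - spec) * (1 - prev)  # total P[test+]
    return sens * prev / p_positive

# e.g., a 90% sensitive, 95% specific test at 1% prevalence:
# P[disease | test+] = 0.009 / 0.0585, only about 0.15
```

Even a good test yields a low post-test probability when the disease is rare, which is exactly why the switch of conditioning matters.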

In terms of updating distributions, the theorem is written as

p(θ|y) = p(y|θ)p(θ) / p(y),

where p(θ) is the prior distribution, p(y|θ) is the likelihood, and p(θ|y) is the posterior distribution. It is often written as

p(θ |y) ∝ p(y|θ)p(θ).

4.3 Example

4.3.1 Normal distributions

Suppose we are interested in the long-term systolic blood pressure (SBP), in mmHg, of a particular 60-year-old female. We take two independent readings 6 weeks apart, and their mean is 130. We know that SBP is measured with a standard deviation σ = 5. (From “Bayesian Approaches to Clinical Trials and Health-Care Evaluation” by Spiegelhalter et al.)

With a Frequentist approach, a 95% confidence interval is

130 ± 1.96(5/√2) = (123.1, 136.9).


With a Bayesian approach, we can incorporate a prior belief that females aged 60 have a mean long-term SBP of 120 with standard deviation 10. Let’s also assume that the prior distribution is normal (prior = normal, likelihood = normal), that is,

θ ∼ N[θ0, σ²/n0],  y|θ ∼ N[θ, σ²/m].

Then

p(θ|y) ∝ p(y|θ)p(θ)
       ∝ exp(−m(y−θ)²/(2σ²)) × exp(−n0(θ−θ0)²/(2σ²))
       = exp(−{m(y−θ)² + n0(θ−θ0)²}/(2σ²))
       = exp(−(n0+m){θ − (n0θ0+my)/(n0+m)}²/(2σ²) + g(y, θ0, n0, m))
       ∝ exp(−(n0+m){θ − (n0θ0+my)/(n0+m)}²/(2σ²)).

Therefore, the posterior distribution of θ is

Normal((n0θ0 + my)/(n0 + m), σ²/(n0 + m)).

In general, if the prior distribution is θ ∼ N(θ0, τ²) and the likelihood is y ∼ N(θ, σ²/m), then the posterior distribution is

p(θ|y) ∼ N( (θ0/τ² + my/σ²) / (1/τ² + m/σ²), 1 / (1/τ² + m/σ²) ).

Going back to the current example, we have as the prior distribution and the likelihood,

θ ∼ N(120, 10²),

y|θ ∼ N(130, 5²/2).

(We can solve for n0 to get n0 = σ²/τ² = 5²/10² = 0.25, which implies that the prior information is equivalent to a sample of size 0.25.)


Continuing, we have the posterior mean and variance:

Mean = (n0θ0 + my)/(n0 + m) = ((0.25)(120) + (2)(130))/(0.25 + 2) = 128.9,

Var = σ²/(n0 + m) = 5²/(0.25 + 2) = 3.33².

We can compute a 95% credible interval as:

128.9 ± 1.96(3.33) = (122.4, 135.4),

and say the probability that θ is between 122.4 and 135.4 is 95% and mean it.
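The whole conjugate update can be checked in a few lines (the function name is mine; the formulas simply restate those above):

```python
import math

def normal_posterior(theta0, tau, ybar, sigma, m):
    """Posterior of theta for prior N(theta0, tau^2) and
    likelihood ybar ~ N(theta, sigma^2 / m)."""
    precision = 1 / tau ** 2 + m / sigma ** 2
    mean = (theta0 / tau ** 2 + m * ybar / sigma ** 2) / precision
    return mean, math.sqrt(1 / precision)

mean, sd = normal_posterior(theta0=120, tau=10, ybar=130, sigma=5, m=2)
lo, hi = mean - 1.96 * sd, mean + 1.96 * sd
# mean ≈ 128.9, sd ≈ 3.33, 95% credible interval ≈ (122.4, 135.4)
```

Note how little the weak prior (n0 = 0.25) moves the answer away from the data mean of 130.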


[Figure: prior distribution, likelihood, and posterior distribution of the long-term SBP (100–140 mmHg)]

4.3.2 Beta-Binomial

In the last example, we start with a normal prior, use a normal likelihood, and arrive at a normal posterior. Things are not always that nice. Oftentimes, the posterior distribution does not have a closed form. The last example is a case of a conjugate analysis.


Conjugate models occur when the posterior distribution is of the same family as the prior distribution. Other examples include:

prior     likelihood   posterior
Normal    Normal       Normal
Beta      Binomial     Beta
Gamma     Poisson      Gamma

In a phase I or II clinical trial, the response of interest is often a probability (e.g., the probability of a toxic reaction or of a positive response), and we would like to use a Binomial distribution to model the number of certain events of interest. There, a Beta-binomial model can be used.

Example
In a safety study, we would like to estimate the probability of a severe drug-related adverse event associated with a treatment of interest. We have access to the data from another similar study, which showed that 7 out of 117 patients had a severe adverse event.

Let X be the number of toxic reactions out of m patients in our study, and we have

X |p∼ Binomial(m, p).

And let the prior distribution of p be Beta(a,b).

Beta distribution:

f(p) = {Γ(a+b)/(Γ(a)Γ(b))} p^(a−1) (1−p)^(b−1),

where 0≤ p≤ 1, a > 0, b > 0, and Γ(s) is defined by

Γ(s) = ∫₀^∞ t^(s−1) e^(−t) dt.

It can be shown that the expectation and variance of a Beta random variable are

E[p] = a/(a+b),  V[p] = ab/{(a+b)²(a+b+1)}.


The posterior distribution of p is

f(p|x) ∝ f(x|p) f(p)
       = [(m choose x) p^x (1−p)^(m−x)] × [{Γ(a+b)/(Γ(a)Γ(b))} p^(a−1) (1−p)^(b−1)]
       ∝ p^(x+a−1) (1−p)^(m−x+b−1),

and normalizing gives

f(p|x) = {Γ(a+b+m)/(Γ(x+a)Γ(m−x+b))} p^(x+a−1) (1−p)^(m−x+b−1).

(Using the fact that for an integer k, Γ(a+k) = Γ(a)·a(a+1)(a+2) · · · (a+k−1).) Therefore, the posterior distribution of p is Beta(a+x, b+m−x).

What do a and b mean in relation to x and m− x?

If we choose to use Beta(7, 110), this prior gives equal weight to the two studies, i.e., a patient in the previous study counts as much as a patient in the new study. If we want to almost completely disregard the prior information, we would use, e.g., Beta(1, 1), which is equivalent to having a sample of size 2 (one response / one non-response).

Suppose we want to use the prior information but discount it so that it is only equivalent to 20 patients with P[toxicity] = 7/117. The prior distribution of p is then Beta(1.2, 18.8).

Let’s say in our study, with a sample of size 40, there were 5 toxic reactions. Thus the likelihood function is f(p) ∝ p⁵(1−p)³⁵, and the posterior distribution is f(p|x) = Beta(1.2+5, 18.8+35).


[Figure: prior distribution, likelihood, and posterior distribution of p (0 to 0.5)]

Using the posterior distribution, Beta(6.2, 53.8), we can compute E[p] = 6.2/(6.2+53.8) = 0.10 and sd[p] = 0.039. Moreover,

P[p > 0.1] = 0.489,
P[p > 0.15] = 0.121,
P[p > 0.2] = 0.017.
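These posterior summaries can be verified with the standard library alone; the trapezoidal tail integral below is a sketch of my own (scipy.stats.beta.sf would do the same in one call):

```python
import math

def beta_pdf(p, a, b):
    """Beta(a, b) density, computed on the log scale for stability."""
    return math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                    + (a - 1) * math.log(p) + (b - 1) * math.log(1 - p))

def beta_tail(x, a, b, n=20000):
    """P[p > x] for Beta(a, b) by trapezoidal integration over (x, 1)."""
    h = (1 - x) / n
    total = 0.5 * (beta_pdf(x + 1e-12, a, b) + beta_pdf(1 - 1e-12, a, b))
    total += sum(beta_pdf(x + i * h, a, b) for i in range(1, n))
    return total * h

a, b = 6.2, 53.8                                           # posterior parameters
post_mean = a / (a + b)                                    # ≈ 0.103
post_sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))  # ≈ 0.039
# beta_tail(0.1, a, b) ≈ 0.489, beta_tail(0.2, a, b) ≈ 0.017
```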


From the data alone, a frequentist confidence interval on p is (0.042, 0.27). A Bayesian credible interval is (0.04, 0.19), which is much narrower.


Chapter 5

Randomization in clinical trials

5.1 Introduction

Randomization: assignment of patients or experimental subjects to two or more treatments by chance alone.

Main advantages of randomization

• It removes the potential for bias in the allocation of participants to the intervention group or to the control group (allocation bias).

• It tends to produce similar (comparable) groups in terms of measured as well as unmeasured confounders. See confounding by indication in observational studies.

• It guarantees the validity of statistical tests of significance. E.g., the t-test for comparing two means can be justified on the basis of randomization alone, without further assumptions about the distribution of baseline variables. (Permutation test / randomization test)



Randomization is considered so important that the intention-to-treat principle is considered sacrosanct: “Analyze by assigned treatment irrespective of actual treatment received.”

Perceived disadvantages of randomization are often about emotional and ethical issues.→ randomization before consent

5.2 Example: The Salk Vaccine Field Trial

• In 1954, the Public Health Service decided to organize an experiment to test thepolio vaccine developed by Jonas Salk.

• 2 million children in selected school districts throughout the US were involved.

– Should all children have been vaccinated? (60,000 cases in 1952; about half as many in 1953.)

– Needed the parents’ consent.

– Half of the children with consent were randomized into vaccine.

• The National Foundation for Infantile Paralysis (NFIP) wanted to vaccinate all grade 2 children with consent, with grades 1 and 3 acting as controls.

– Polio is a contagious disease!

– Grades 1 and 3 children did not require consent. → Systematic differences between groups.

– No way to blind the study.

Results of the Salk Vaccine Field Trial (source: Freedman et al., Statistics, second edition)

The randomized controlled double-blind experiment

              size      Rate¹
Treatment     200,000   26
Control       200,000   71
No consent    350,000   46


The NFIP design

                         size      Rate¹
Grade 2 (vaccine)        225,000   25
Grades 1 & 3 (control)   725,000   54
Grade 2, no consent      125,000   44

¹Rate of polio cases per 100,000.

5.3 Simple randomization

For each subject, flip a coin to determine the treatment assignment: P[treatment 1] = · · · = P[treatment k] = 1/k.

https://cqs.mc.vanderbilt.edu/shiny/GOLD

Problems with simple randomization and how to deal with them.

• Imbalance in treatment allocation

– replacement randomization

– block randomization

– adaptive randomization (biased coin / urn model etc.)

• Imbalance in baseline patient characteristics

– stratified randomization (stratified permuted block randomization)

– covariate adaptive randomization (minimization randomization)

Also response adaptive randomization (play the winner)


5.4 Imbalance in treatment allocation

If the number of patients N is 20, P[10 and 10] = 0.18, since Xt ∼ Binomial(20, 0.5). The probability of a 7–13 split or worse is 26%. The variance of the treatment-effect estimate for a 7–13 split relative to a 10–10 split is

(1/7 + 1/13) / (1/10 + 1/10) = 1.098.

A 7–13 split is only 1/1.098 = 0.91 times as efficient as a 10–10 split.
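These numbers are easy to verify (math.comb requires Python 3.8+; variable names are mine):

```python
from math import comb

N = 20
pmf = [comb(N, k) / 2 ** N for k in range(N + 1)]   # Xt ~ Binomial(20, 0.5)

p_balanced = pmf[10]                                # P[10-10 split] ≈ 0.176
p_7_13_or_worse = sum(pmf[:8]) + sum(pmf[13:])      # ≈ 0.26

# variance of the effect estimate, 7-13 split relative to 10-10 split
rel_var = (1 / 7 + 1 / 13) / (1 / 10 + 1 / 10)      # ≈ 1.098
efficiency = 1 / rel_var                            # ≈ 0.91
```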

Even if treatment allocation is balanced at the end of the trial, there may be a (severe) imbalance at some point. Because we monitor trials over time, we prefer to have balance over time.

5.4.1 Block randomization

To ensure a better balance (in terms of the number of patients) across groups over time, consider block randomization (random permuted blocks).

Block randomization ensures approximate balance between treatments by forcing balance after a small number of patients (say 4 or 6). For example, the first 4 patients are allocated to treatment A or B sequentially based on AABB.

There are 6 sequences of A, A, B, B, and we let each sequence have a 1/6 chance of being selected.

AABB ABAB ABBA BAAB BABA BBAA

• What’s wrong with block size of 2? block size of 200?

• Easily applicable to more than 2 groups (A, B, C)

• Easily applicable to unequal group sizes (Na = 40 and Nb = 20).
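A minimal generator for random permuted blocks (the function name and defaults are my own):

```python
import random

def permuted_blocks(n_patients, block=('A', 'A', 'B', 'B'), seed=0):
    """Assignment sequence built from independently shuffled copies of
    `block`; balance is forced after every len(block) patients."""
    rng = random.Random(seed)
    seq = []
    while len(seq) < n_patients:
        b = list(block)
        rng.shuffle(b)
        seq.extend(b)
    return seq[:n_patients]

# unequal 2:1 allocation simply uses block=('A', 'A', 'B');
# three arms use block=('A', 'B', 'C'), etc.
```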


Why might we want unequal group sizes?

• We may want to have a better estimate of the effect for the new treatment.

• Treatment costs may be greatly different. Given the total sample size and the relative cost of treatment 2 to treatment 1, we can find the optimal allocation ratio to minimize the total cost. (More in sample size computation.)

• Variances may be different. Suppose the means µ1 and µ2 of the treatment groups are being compared using

Z = {(X̄1 − X̄2) − (µ1 − µ2)} / √(σ1²/n1 + σ2²/n2).

For a given N = n1 + n2, the test statistic is maximized when the denominator is minimized. Solving

∂/∂n1 (σ1²/n1 + σ2²/(N − n1)) = 0,

we get

n1/N = σ1/(σ1 + σ2).

Therefore, the optimal allocation ratio is r = n1/n2 = σ1/σ2.
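A quick numerical check of this optimum (the values of N, σ1, σ2 are chosen arbitrarily for illustration):

```python
import math

def se_diff(n1, N, s1, s2):
    """Standard error of X̄1 - X̄2 with n1 and N - n1 subjects per arm."""
    return math.sqrt(s1 ** 2 / n1 + s2 ** 2 / (N - n1))

N, s1, s2 = 60, 4.0, 2.0
best_n1 = min(range(1, N), key=lambda n1: se_diff(n1, N, s1, s2))
# theory: n1/N = s1/(s1 + s2) = 2/3, so best_n1 = 40
```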

Analysis should account for the randomization scheme but often does not. Matts and McHugh (1978, J Chronic Dis) point out that

• because blocking guarantees balance between groups and increases the power of a study, blocked randomization with the appropriate analysis is more powerful than not blocking at all, or blocking and ignoring it in the analysis.

• not accounting for blocking in analysis is conservative.


5.4.2 Biased coin and urn model

These techniques are sometimes classified as “adaptive randomization”.

Allocation of the i-th patient depends on how many have been randomized to group A (na) and group B (nb).

At any given time, the probability of allocation to group A may be

P[A] = nb / (na + nb).

Or the rule may be to use P[A] = 2/3 when nb − na > 5, and P[B] = 2/3 when na − nb > 5. The characteristics of such a randomization scheme are often studied by simulation.

An urn model is one type of biased coin randomization.

• Prepare an urn with one Amber ball and one Blue ball.

• Pick one ball and make the corresponding treatment assignment (A/B).

• Put a ball of the opposite color in the urn.

5.5 Imbalance in baseline patient characteristics

Block randomization and the biased coin model ensure that the group sizes are reasonably balanced. In order to facilitate the comparison of treatment effects, balance on important baseline variables is sometimes desired.

• Randomization does not guarantee that all measured variables will be balanced. And imbalance does not mean randomization did not work.


• Quoting Senn (1994), “It is argued that this practice [testing baseline homogeneity] is philosophically unsound, of no practical value and potentially misleading. Instead it is recommended that prognostic variables be identified in the trial plan and fitted in an analysis of covariance regardless of their baseline distribution (statistical significance).”

• Quoting Piantadosi, “These methods, while theoretically unnecessary, encourage covariate balance in the treatment groups, which tends to enhance the credibility of trial results.”

5.5.1 Stratified randomization

Stratified randomization is applied to ensure that the groups are balanced on baseline variables that are thought to be significant.

• Create strata based on the variables for which balance is sought, e.g., (Male, 65 or younger), (Male, older), (Female, younger), (Female, older).

• Randomize to treatments within each stratum. Use block randomization! What’s wrong with

– using simple randomization within a stratum?

– using too many strata?

• Stratification should be accounted for in analysis.

– Pre-randomization stratification and post-randomization stratification (at the time of analysis) have no clear winner.

– If the trial is large, stratification may not be necessary.

– Stratification by center is a good idea from practical viewpoints.

∗ allows randomization to be hosted at each site
∗ allows sites to be removed while still maintaining balance
∗ If each stratum has a target size, plans need to be in place to close down recruitment based on the baseline characteristics, e.g., “We do not need any more (Male, older).”


– Block randomization is a special type of stratified randomization where strataare defined by ... .

5.5.2 Adaptive and minimization randomization

Adaptive randomization can be used to reduce baseline imbalance:

• Define an imbalance function based on factors thought to be important

• Then use a rule to define P[treatment A] so that the next assignment is more likely to reduce imbalance.

For example, suppose the factors to balance are sex (male/female) and hypertension (yes/no), and let the imbalance function be

I = 2× (sex imbalance)+3× (hypertension imbalance).

The patients randomized so far are

           Sex              Hypertension
           male   female    yes   no
Group 1    10     3         8     5
Group 2    8      3         6     5

The next patient is a male non-hypertensive. The imbalance will be

I = 2×(11−8) + 3×(6−5) = 9 if assigned to group 1,
I = 2×(10−9) + 3×(6−5) = 5 if assigned to group 2.

Thus let P[Group 2] = 2/3.

Minimization randomization uses the same idea but sets P[Group 2] = 1, eliminating randomness whenever there is some imbalance. Randomize only when assigning the next patient to either group gives the same value of I.
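The worked example above can be coded directly (the data structure and function name are my own; each factor contributes its weight times the count difference in the next patient's own category, as in the example):

```python
WEIGHTS = {'sex': 2, 'hypertension': 3}   # weights from the imbalance function

def imbalance_if(counts, group, patient):
    """Value of I if the next patient were assigned to `group`."""
    # deep-copy the counts, then add the hypothetical assignment
    new = {g: {f: dict(lv) for f, lv in v.items()} for g, v in counts.items()}
    for factor, level in patient.items():
        new[group][factor][level] += 1
    return sum(w * abs(new[1][f][patient[f]] - new[2][f][patient[f]])
               for f, w in WEIGHTS.items())

counts = {
    1: {'sex': {'male': 10, 'female': 3}, 'hypertension': {'yes': 8, 'no': 5}},
    2: {'sex': {'male': 8, 'female': 3}, 'hypertension': {'yes': 6, 'no': 5}},
}
patient = {'sex': 'male', 'hypertension': 'no'}
# imbalance_if(counts, 1, patient) == 9;  imbalance_if(counts, 2, patient) == 5
```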


5.6 Response adaptive randomization

As the name suggests, response adaptive randomization methods use the information about the responses so far to allocate the next patient.

Play the winner: The idea is to allocate more patients to the treatment that seems to be working better. To apply these methods, it is necessary to have the response rather quickly. An urn model can be used to make the treatment assignment imbalanced based on the results (success/failure) of each treatment so far (e.g., put in one blue ball if treatment B yields a success).

Instead of updating the probabilities of treatment assignment after each patient, we can update them after a group of patients’ results are available, to reduce administrative burden. In a phase II clinical trial, a play-the-winner design may be used to reduce the number of treatments in consideration (e.g., only retain the treatment arms that have P[positive response] > 0.4). Drop the loser has a rule like, “drop a treatment from further consideration if P[positive response] < 0.2.”


Chapter 6

Phase I clinical trials

6.1 Introduction

A phase I clinical trial is the first study in which a new drug is administered to humans. The primary objectives of phase I studies are 1) to collect pharmacokinetic and pharmacodynamic data, and 2) to establish safety, with the specific goal of estimating the maximum tolerated dose (MTD).

Customarily, the MTD is the dose level at which the probability of dose-limiting toxicity (DLT) is 33% (or maybe 20%).

Assumption: Higher dose is more effective.
We assume that the highest safe dose is the dose most likely to be effective; in other words, we are using dose-related toxicity as a surrogate for efficacy.

Phase I clinical trial is also known as ...

Treatment mechanism Early developmental trial that investigates the mechanism of the treatment effect (e.g., pharmacokinetics)



Dose escalation Design that specifies methods for increasing the dose for subsequent subjects

Dose-ranging Design that tests some or all of a prespecified set of doses (fixed doses)

Dose-finding Design that titrates the dose to a prespecified optimum based on biological or clinical considerations

In oncology and AIDS trials, patients usually participate in phase I trials, but in other disease areas, data are gathered on healthy volunteers.

Reasons patients do not participate in “other” phase I trials

• It may be difficult to recruit from patient population.

• Low risk of serious adverse events justifies participation of healthy volunteers.

• No potential confounding of adverse events with disease or concomitant medications.

The primary reason to recruit patients into a phase I clinical trial is known toxicity (e.g., a cytotoxic drug).

Who are the healthy volunteers?

• usually 18-35 years

• usually male

• non-smoker / non substance abuse

• no symptoms of disease

• no laboratory abnormalities


The FDA uses the terminology “normal volunteers” (Guidance for industry: General considerations for the clinical evaluation of drugs), but who are normal?

“With respect to the use of ‘normal’ subjects it should be recognized that few people are literally normal in all respects. This term should be interpreted with caution and should mean volunteers who are free from abnormalities which would complicate the interpretation of the experiment or which might increase the sensitivity of the subject to the toxic potential of the drug.”

6.2 Non-cancer, non-AIDS phase I

Most phase I studies are placebo controlled to reduce observer bias and facilitate comparison between active drug and placebo.

In a typical design, subjects are assigned to a cohort of size 8 to 10; 6 to 8 subjects are assigned to the active treatment (same dose) and 2 to placebo. If the dose is deemed safe, the next cohort of the same size is given one higher dose. The trial is stopped when an unacceptable number of adverse events is observed; the highest safe dose is the target dose that will be recommended for future trials.

More complex dose administration patterns involve the administration of multiple doses to one patient. (Grouped crossover escalation.)

Starting dose
One popular starting dose is based on the dose that causes 10% mortality in rodents on a mg/m² (per body surface area) basis (LD10). Usually we use LD10/10 as the starting dose. An FDA document (Guidance for industry: Estimating the maximum safe starting dose in initial clinical trials for therapeutics in adult healthy volunteers) provides a conversion table for this purpose.

Increments
Linear (arithmetic): D, 2D, 3D, 4D, · · · (known toxic drug)
Log (geometric): D, 2D, 4D, 8D, · · · (typical)


Table 6.1: Conversion of animal doses to human equivalent doses (HED) based on body surface area

Species          mg/kg to mg/m²     animal mg/kg to HED mg/kg
                 multiply by        multiply by
Human            37                 –
Child (20 kg)∗   25                 –
Mouse            3                  0.08
Hamster          5                  0.13
Rat              6                  0.16
Ferret           7                  0.19
Dog              20                 0.54
Monkey           12                 0.32
Baboon           20                 0.54

Also Fibonacci and “Modified Fibonacci” sequences for oncology trials.

6.2.1 TGN1412 disaster

Some highlights of the story

• intended for the treatment of B-cell chronic lymphocytic leukemia and rheumatoid arthritis

• phase I clinical trial in Britain on March 13, 2006

• 6 healthy volunteers took a sub-clinical dose of 0.1 mg/kg (1/500 of the dose found safe in animals)

– all male, aged 19 to 34 (median 29.5)

• Drugs were given by intravenous infusion with an interval of around 10 minutes between patients

• All men suffered a cytokine storm (the men’s white blood cells had vanished almost completely several hours after administration of TGN1412)


• All men were hospitalized, but none died.

6.3 3 + 3 design

Many variations of these so-called “up-and-down” / “3+3” designs exist, but the basic idea is:

1. 3 patients are allocated to a dose level.

(a) If there are 0 toxic reactions, then 3 patients will be assigned to the next dose level.

(b) If there is 1 toxic reaction, then 3 more patients will be assigned to the current dose level.

(c) If there are 2 or 3 toxic reactions, then the current dose will be closed (too toxic) and 3 patients will be assigned to the previous dose level.

2. Continue until

• the next dose level already has had 6 patients, or

• there is no more higher/lower dose

3. The MTD is the highest dose with at most 1 toxic reaction out of 6. (Generally, the MTD has to have data from 6 patients.)
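The rules above can be simulated to study which dose gets selected. This sketch follows the steps as listed and is only one of the many 3+3 variants; the function name and the example toxicity probabilities are my own:

```python
import random

def three_plus_three(tox_probs, rng):
    """One simulated 3+3 trial. tox_probs[i] = true P[DLT] at level i.
    Returns the selected MTD level (0-based), or -1 if none qualifies."""
    K = len(tox_probs)
    n, tox = [0] * K, [0] * K
    closed = [False] * K          # doses declared too toxic
    level = 0
    while 0 <= level < K and not closed[level] and n[level] < 6:
        new_tox = sum(rng.random() < tox_probs[level] for _ in range(3))
        n[level] += 3
        tox[level] += new_tox
        if tox[level] >= 2:
            closed[level] = True  # too toxic: close and de-escalate
            level -= 1
        elif tox[level] == 0 and n[level] == 3:
            level += 1            # 0/3: escalate
        elif n[level] == 6:
            level += 1            # at most 1/6: move up
        # otherwise 1/3 toxicity: stay and treat 3 more at the same level
    candidates = [i for i in range(K) if n[i] >= 6 and tox[i] <= 1]
    return max(candidates) if candidates else -1

rng = random.Random(0)
picks = [three_plus_three([0.05, 0.20, 0.45, 0.70], rng) for _ in range(1000)]
```

Tabulating `picks` shows the spread of selected levels; the level whose true DLT probability is nearest the target is by no means always the one chosen.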

In general there are four components to the typical dose ranging design:

1. selection of a starting dose

2. specification of the dose increments and cohort sizes

3. definition of dose-limiting toxicities: toxicities that, due to their severity or duration, are considered unacceptable and limit further dose escalation within the subject (these need to be pre-defined)


4. decision rules for escalation and de-escalation

Notes:

• The starting dose is usually the lowest dose, but it does not have to be.

• The dose level for the next patient may not be known when he/she is available to be allocated.

• The best outcome is selection of the dose level that is closest to, but does not exceed, the MTD.

• This design is not motivated with statistics in mind. The probability of selecting the right dose can be very low. The right dose may not even have the highest probability of being selected.

• Estimation of the true dose or the true probability of a toxic reaction for a given dose is difficult. This process usually underestimates the MTD.

• The frequency of stopping escalation at a certain dose level depends on the toxicity rate at that dose as well as the rates at all levels below.

• On average, the dose chosen by the 3+3 design has a toxicity probability of about 20% to 25%. The operating characteristics were studied by Reiner et al. (1999); Lin and Shih (2001); Kang and Ahn (2001, 2002).

Modified Fibonacci dosing
In addition to arithmetic and geometric sequences, the Fibonacci sequence is often used.

Fn = 0, 1, 1, 2, 3, 5, 8, 13, 21, · · ·

Based on this sequence, one possible dose escalation is


Fibonacci
Dose                 D    2×D    3×D    5×D    8×D    13×D    21×D
Relative increment        100%   50%    67%    60%    63%     62%

↓ Modified Fibonacci
Dose                 D    2×D    3.3×D   5×D    7×D    9.3×D   12.4×D
Relative increment        100%   67%     50%    40%    33%     33%

[Figure: relative dose escalation over dose levels 1–7 for log, linear, and modified Fibonacci schemes]


6.3.1 Accelerated titration

The basic idea is the same as in the up-and-down design; the key modification is that a cohort of one patient is used until the first toxic reaction is observed. Dose escalation may be 100% or 40% per step in the accelerated phase. Another improvement over the 3+3 design is that this design allows intra-patient dose escalation if no dose-limiting toxicity is observed for that patient.

6.4 Bayesian approach: CRM

Because of their sequential nature, the “3+3” design and its variants tend to yield a biased underestimate of the target dose when estimating the MTD. One dose-finding (ranging) design that is not subject to this bias as much is the continual reassessment method (CRM).

One obvious advantage of the CRM is its use of an explicit mathematical model describing the relationship between dose and toxicity. Parameters underlying a dose-toxicity curve are given priors. These prior values are updated sequentially and used to find the current “best” estimate of the dose that would produce the acceptable risk of a toxic event.

The CRM is an algorithm for updating the best guess regarding the optimal dose. It does not require a set of fixed dose levels. The CRM algorithms make no assumptions about

1. the actual dose used

2. the cohort size

3. ordering of doses

4. integer responses


6.4.1 Example

(Dougherty et al. (2000))
Suppose we want to find the dose that is associated with 20% toxicity, and the available doses are 0.25, 0.50, 0.75, and 1.00. The dose-toxicity relationship is written with a one-parameter logistic response model

log(pi / (1 − pi)) = 3 + α di,

where di is the dose level (i = 1, 2, 3, 4), pi is the toxicity probability for dose i, and α is an unknown parameter.

[Figure: probability of toxicity at dose levels 1–4 under the logistic model, for α = 0.50, 0.75, 1.00, 1.25, 1.50]

For the prior distribtion of α we choose to use an exponential distribution with mean 1.The actual prior information is usually written in terms of the prior probabilities of toxicityat each dose. In this example, let’s say we have 10%, 20%, 40%, and 80% for each of the

Page 57: Tatsuki Koyama, PhD - Vanderbilt Universitybiostat.mc.vanderbilt.edu/wiki/pub/Main/OsakaUnivSummer2015/Osa… · – propensity score matching – propensity score as weights –

CHAPTER 6. PHASE I CLINICAL TRIALS 49

four dose levels (p0i ). At ith dose, we have

di = logit(p0i )−3

Dose Level   Actual dose   p0i (prior guess)   di
1            0.25          0.10                −5.20
2            0.50          0.20                −4.39
3            0.75          0.40                −3.41
4            1.00          0.80                −1.61
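The transformation can be verified directly; a minimal sketch (in Python rather than the R used later in these notes):

```python
import math

# Prior guesses of the toxicity probability at each of the four doses (from the notes)
prior_guess = [0.10, 0.20, 0.40, 0.80]

# d_i = logit(p0_i) - 3, so that the logistic model with alpha = 1 reproduces the priors
d = [math.log(p / (1 - p)) - 3 for p in prior_guess]

print([round(v, 2) for v in d])  # [-5.2, -4.39, -3.41, -1.61]
```

This reproduces the di column of the table above.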

[Figure: prior dose-toxicity curve (a = 1): probability of toxicity versus dose level (1-4).]

Another possible model is

pi = {(tan−1(xi) + 1)/2}^a.


If we let the prior value of a = 1, we can solve for the xi corresponding to each p0i:

xi = tan(2 p0i − 1)

Dose Level   Actual dose   p0i (prior guess)   xi
1            0.25          0.10                −1.03
2            0.50          0.20                −0.68
3            0.75          0.40                −0.20
4            1.00          0.80                 0.68
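Again this is easy to verify numerically; a minimal Python sketch:

```python
import math

prior_guess = [0.10, 0.20, 0.40, 0.80]

# x_i = tan(2*p0_i - 1) inverts p = (arctan(x) + 1)/2 at a = 1
x = [math.tan(2 * p - 1) for p in prior_guess]

print([round(v, 2) for v in x])  # [-1.03, -0.68, -0.2, 0.68]
```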


[Figures: prior dose-toxicity curve (a = 1), and curves for varying a = 0.25, 0.50, 1.00, 1.50, 2.00; probability of toxicity versus dose level (1-4).]


Steps for updating information

1. Treat a cohort of 1 patient at the lowest dose.

2. Obtain a posterior distribution of α.

3. Find the optimal dose that gives 20% toxicity using the posterior distribution.

4. Treat another patient at the dose closest to the optimal dose.

5. Repeat steps 2, 3, 4.

The trial continues until a predetermined fixed sample size is reached, or some other trial-terminating condition is satisfied.

With regard to step 2 (finding the posterior distribution), we only need the expected value of α so that it can be plugged into the dose-toxicity equation. The expectation of α is

E[α] = ∫0∞ α L(α; dj, tj) g(α) dα / ∫0∞ L(α; dj, tj) g(α) dα,

where L is the likelihood function for the data, defined as

L(α; dj, tj) = Π_{j=1}^{J} [φ(dj; α)]^tj [1 − φ(dj; α)]^(1−tj),

where dj is the dose for the jth patient, tj = 1 if the jth patient experienced toxicity and 0 otherwise, and

φ(dj; α) = exp(3 + α dj) / (1 + exp(3 + α dj)).

This is usually beyond analytical solution, so we use simulation. One result from such a simulation is given below:

• The posterior means of pi show strong agreement with the prior.


Dose                        Prior               Observed data             Posterior
Level   Actual dose   p0i (prior guess)   di      # patients   # toxicity   mean   sd
1       0.25          0.10                −5.20   4            0            0.10   0.05
2       0.50          0.20                −4.39   18           3            0.19   0.08
3       0.75          0.40                −3.41   3            2            0.38   0.09
4       1.00          0.80                −1.61   0            0            0.79   0.03

• The actual doses do not enter into the model.

• The toxicity probability at dose 4 is estimated with considerable accuracy, even though no one was ever given the dose.

– → This method should be used with great caution.

The CRM is sometimes criticized for being falsely precise. However, it has been demonstrated that the CRM is more efficient and less biased than classic designs.
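The posterior expectation of α above can also be approximated with simple one-dimensional quadrature rather than simulation. A minimal sketch in Python (the single hypothetical observation, one non-toxic patient at d = −5.20, is an assumption chosen for illustration):

```python
import math

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

def posterior_mean_alpha(data, grid_max=10.0, n_grid=2000):
    """E[alpha] = (int a L(a) g(a) da) / (int L(a) g(a) da) with g = Exp(mean 1)
    prior, by trapezoidal quadrature on [0, grid_max] (the step h cancels)."""
    h = grid_max / n_grid
    num = den = 0.0
    for i in range(n_grid + 1):
        a = i * h
        like = 1.0
        for d, t in data:                 # d: transformed dose, t: 1 if toxicity
            phi = expit(3 + a * d)
            like *= phi if t == 1 else (1 - phi)
        w = 0.5 if i in (0, n_grid) else 1.0   # trapezoid endpoint weights
        g = math.exp(-a)                       # exponential(mean 1) prior density
        num += w * a * like * g
        den += w * like * g
    return num / den

# Hypothetical outcome: one patient at dose level 1 (d = -5.20), no toxicity.
# No toxicity at a low dose favors larger alpha, so E[alpha] moves above the prior mean 1.
print(posterior_mean_alpha([(-5.20, 0)]))
```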

Using R

A number of packages and functions are available in R. Over a hundred packages are listed on cran.r-project.org/web/views/Bayesian.html. For analysis (as opposed to design) of the data, bcrm (Bayesian continual reassessment method) seems comprehensive and easy to use. An alternative is CRM, available at https://biostatistics.mdanderson.org/SoftwareDownload/.

Let’s continue with the same example with four dose levels. Recall that the prior guesses of the toxicity probabilities are 10%, 20%, 40%, and 80% for the four dose levels, and the doses are 0.25, 0.50, 0.75, 1.00.

The function, bcrm, is used to update the posterior distribution sequentially. The key inputs are:

N The final sample size

tox number of toxicity reactions for each dose


notox number of patients with no toxicity reactions

p.tox0 prior probabilities

dose actual dose for plotting purposes

ff functional form of the dose-response curve. (logit1)

prior.alpha prior distribution for α. (Uniform)

cohort the size of each cohort

target.tox the target toxicity probability

constrain whether dose skipping is disallowed

6.5 Modified Toxicity Probability Interval Design

• Ji Y, Li Y, Bekele BN. Dose-finding in phase I clinical trials based on toxicity proba-bility intervals. Clinical Trials. 2007;4:235-44.

• Ji Y, Liu P, Li Y, Bekele BN. A modified toxicity probability interval method for dose-finding trials. Clinical Trials. 2010;7:653-63.

The 3 + 3 design is known to underestimate the MTD. In the following, the probabilities of concluding that a dose is the MTD are studied. Suppose that our task is to select one dose out of 5 as the MTD. The true dose-toxicity associations are depicted below. The pink curve marks the desired dose, at which the toxicity probability is 33%.


[Figure: Scenarios 1-4 of true dose-toxicity curves across dose levels 1-5, with gridlines at 10%, 30%, and 50% toxicity.]

The line indicates the probability of concluding each dose to be the MTD. The point to the left of dose 1 shows the probability that even the lowest dose is declared too toxic.

It is well known (among statisticians) that 3 + 3 designs are horrible. The CRM has not gained much acceptance from clinical investigators. The mTPI is placed somewhere between these two approaches.

                 3 + 3   CRM   mTPI
Bayesian         —       Yes   Yes
Model based      —       Yes   —
Coherent P[Tox]  —       Yes   —
Estimate P[Tox]  —       Yes   Yes
Any target       —       Yes   Yes
Any sample size  —       Yes   Yes
Any cohort size  —       Yes   Yes
Easy             Yes     —     Yes

Estimation of P[Toxicity] at each dose in mTPI uses the Beta-Binomial conjugate model. Typically, we use a non-informative prior for each dose level.


[Figure: Beta posterior densities of P[Toxicity]: the prior, and the posteriors after observing 1/5, 2/8, and 4/12 toxicities.]

The decision to go up, go down, or stay at the current dose is based on the Unit Probability Mass (UPM) computed from the posterior distribution. First, an interval around the target toxicity probability is defined as the “target interval”. For example, with Target = 20%, we may choose 14% to 22% as the target interval. Then anything below 14% is “too low”, and anything above 22% is “too high”.

When a decision is sought, we compute the UPM for each of the three intervals and pick the maximum UPM. The UPM is basically “Area” / “Width”, computed using the posterior distribution.
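The UPM calculation is short enough to sketch directly, here for the 4/9 example below with Target = 20% and target interval 14%-22%. A Beta(1,1) prior is assumed (the notes do not fix the prior here, so the resulting UPMs need not match the 0.12 / 0.74 / 1.18 printed below, although the de-escalation decision agrees); Python is used for illustration:

```python
from math import comb

def beta_cdf(x, a, b):
    """CDF of Beta(a, b) for integer a, b >= 1, via the identity
    P[Beta(a,b) <= x] = P[Bin(a+b-1, x) >= a]."""
    n = a + b - 1
    return sum(comb(n, k) * x**k * (1 - x)**(n - k) for k in range(a, n + 1))

# 4 toxicities in 9 patients with an assumed Beta(1,1) prior -> posterior Beta(5, 6)
a, b = 1 + 4, 1 + 9 - 4
lo, hi = 0.14, 0.22                     # target interval for Target = 20%

upm_low = beta_cdf(lo, a, b) / lo                          # "too low" interval
upm_mid = (beta_cdf(hi, a, b) - beta_cdf(lo, a, b)) / (hi - lo)  # target interval
upm_high = (1 - beta_cdf(hi, a, b)) / (1 - hi)             # "too high" interval

print(round(upm_low, 2), round(upm_mid, 2), round(upm_high, 2))
```

The “too high” UPM is the largest, so the decision is to de-escalate.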


[Figure: posterior density of P[Toxicity] after observing 4 toxicities in 9 patients.]

For this example, we have

Too low 0.12

Just right 0.74

Too high 1.18

Thus the decision is to de-escalate.

With mTPI, the “escalate/stay/de-escalate” decision is simple, and it depends only on the data at that dose. We can compute the UPM for each n-x combination and specify the course of action. The following table summarizes the design for a target toxicity probability of 33% with 33% ± 5% as the target interval.

Example: Target = 33%


Cumulative                  Cumulative sample size
Toxicity     1    2    3    4    5    6    7    8    9   · · ·
0            E    E    E    E    E    E    E    E    E
1            D    S    S    S    S    E    E    E    E
2                 DU   D    S    S    S    S    S    S
3                      DU   DU   D    S    S    S    S
4                           DU   DU   DU   D    S    S
5                                DU   DU   DU   DU   D
6                                     DU   DU   DU   DU
7                                          DU   DU   DU
8                                               DU   DU
9                                                    DU
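A table like this can be generated mechanically from the Beta posterior. The following sketch assumes a Beta(1,1) prior and a posterior-probability threshold of 0.95 for the dose-unacceptable (DU) exclusion rule; both are assumed conventions rather than values stated in the notes, but the sketch matches several entries of the table above:

```python
from math import comb

def beta_cdf(x, a, b):
    # CDF of Beta(a, b), integer a, b >= 1: P[Beta(a,b) <= x] = P[Bin(a+b-1, x) >= a]
    n = a + b - 1
    return sum(comb(n, k) * x**k * (1 - x)**(n - k) for k in range(a, n + 1))

def decide(n, tox, target=0.33, eps=0.05, du_cut=0.95):
    """mTPI-style decision for n patients with `tox` toxicities at the current dose."""
    a, b = 1 + tox, 1 + n - tox            # Beta(1,1) prior (assumed)
    lo, hi = target - eps, target + eps
    # exclusion rule (assumed threshold): de-escalate and never revisit
    # if the dose is very likely above the target toxicity
    if 1 - beta_cdf(target, a, b) > du_cut:
        return "DU"
    upm = {"E": beta_cdf(lo, a, b) / lo,
           "S": (beta_cdf(hi, a, b) - beta_cdf(lo, a, b)) / (hi - lo),
           "D": (1 - beta_cdf(hi, a, b)) / (1 - hi)}
    return max(upm, key=upm.get)

for n, tox in [(1, 0), (1, 1), (2, 2), (5, 1), (6, 1)]:
    print(n, tox, decide(n, tox))
```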

mTPI is simple and ...· · ·

• Allows estimation of toxicity probability.

• Tends to require a larger sample size than 3 + 3.

• Seems to have favorable characteristics.

It is still unclear (as of June 2015) how the regulatory bodies view the mTPI. The following is from a letter from CTEP (Cancer Therapy Evaluation Program, NCI) that provides some guidelines when a CRM or mTPI design is proposed.

“CTEP appreciates the desire to improve clinical trial designs. At the same time safety ofpatients and resource management are CTEP’s priority.”

1. New patients should not be assigned to a dose level at which 4 or more patientshave been fully evaluated and ≥ 33% of the patients have experienced a DLT.

2. New patients should not be assigned to a dose level above a dose level where≥ 33%of the fully evaluated patients have experienced a DLT.

3. At any dose level, limit the number of patients that should be enrolled (preferably nomore than 12).


4. After the first 3 patients have been fully evaluated for DLT, if 1 DLT out of 3 patients has been observed, then up to 3 additional patients may be accrued to the dose level.


Chapter 7

Phase II clinical trials

7.1 Introduction

Phase II clinical trial A clinical trial designed to test the feasibility of, and level of activityof, a new agent or procedure. (safety and activity)

Some characteristics of a typical phase II clinical trial include:

• it includes a placebo and two to four doses of the test drug.

• when the response is observed quickly, adaptive designs may be beneficial and used because they

– improve the quality of estimation of the MED (minimum effective dose: the lowest dose of a drug that produces the desired clinical effect)

– increase the number of patients allocated to the MED.

– allow for early stopping for futility.

The primary objectives of phase II trials are:



• to determine whether the drug is worthy of further study in a phase III trial. Significant treatment effect? / dose-response relationship?

• to gather information to help design phase III trial

– determine dose(s) to carry forward

– determine the primary and secondary endpoints

– estimate treatment effects for power/sample size analysis

– estimate recruitment rate

– examine feasibility of treatment (logistics of administration and cost)

– learn about side effects and toxicity

In phase II clinical trials, parallel group designs, crossover designs, and factorial designs are often used.

7.2 Phase II trials in oncology

A phase II clinical trial in oncology generally uses a fixed dose chosen in a phase I trial. The primary objective is to assess therapeutic response to treatment. In the simplest case, a single treatment arm is compared to a historical control. In other cases, a control group and/or multiple doses are included.

The treatment efficacy is often evaluated on surrogate markers for a timely (quick) evaluation of efficacy.

Surrogate outcome an outcome measurement in a clinical trial that substitutes for a definitive clinical outcome or disease status

• CD4 counts in AIDS study

• PSA (prostate-specific antigen) in prostate cancer study

• blood pressure in cardiovascular disease


• 3 months survival (binary) for survival

• tumor shrinkage for survival

Tumor response to treatment is evaluated according to Response Evaluation Criteria inSolid Tumors (RECIST)

Complete response (CR) Disappearance of all target lesions

Partial response (PR) At least a 30% decrease in the sum of the longest diameter (LD)of target lesions, taking as reference the baseline sum LD

Stable disease (SD) Neither sufficient shrinkage to qualify for PR nor sufficient increaseto qualify for PD, taking as reference the smallest sum LD since the treatment started

Progressive disease (PD) At least a 20% increase in the sum of the LD of target lesions,taking as reference the smallest sum LD recorded since the treatment started or theappearance of one or more new lesions

Generally, objective tumor response is defined as CR or PR in RECIST, so that the response variable has a binary endpoint. In the rest of the chapter, we will consider a single-arm trial with a binary response. The hypothesis of interest is one-sided, H1 : p > p0; the type I error rate is usually 5 to 10%, and the power is usually 80 to 90%.

7.3 Two-stage designs

It is crucial that these phase II studies have an opportunity to stop early for toxicity, and that is accomplished by a Data Monitoring Committee (DMC), aka Data and Safety Monitoring Board (DSMB). It is also desirable to discard an ineffective treatment early, and two-stage designs with a futility stop have been popular.

We will discuss the designs proposed by Gehan (1961), Fleming (1982), and Simon(1989), using the following unified notation:

Page 71: Tatsuki Koyama, PhD - Vanderbilt Universitybiostat.mc.vanderbilt.edu/wiki/pub/Main/OsakaUnivSummer2015/Osa… · – propensity score matching – propensity score as weights –

CHAPTER 7. PHASE II CLINICAL TRIALS 63

• stage I sample size · · ·n1.

• stage I data · · ·X1 ∼ Binomial(n1, p).

• stage I critical value · · ·r1 so that if X1 ≤ r1 then terminate the study for futility.

• stage II sample size · · ·n2.

• stage II data · · ·X2 ∼ Binomial(n2, p).

• total sample size · · ·nt = n1 +n2.

• total data · · ·Xt ≡ X1 +X2.

• stage II critical value · · · rt so that if Xt ≤ rt then terminate the study for futility, otherwise conclude efficacy.

7.3.1 Gehan’s design

It is old (1961) and outdated but may be OK to use in limited situations. The design calls for a first stage with n1 = 14 and r1 = 0, i.e., if no positive response is observed in 14 patients, then stop for futility. The rationale is that if the true response rate is at least 20%, then X1 = 0 is unlikely; in fact, its probability is 0.044. The second stage sample size depends on the desired precision for estimating p, and it ranges between 1 and 86. A typical n2 is 11, so that nt = 25. With nt = 25, the standard error of p̂ is approximately 0.10; such a standard error leads to a very wide confidence interval.

7.3.2 Fleming’s design

Fleming (1982) proposed a multistage design for phase II clinical trials. One of its keycharacteristics is stopping early for efficacy.

Example
H0 : p = 0.15, H1 : p = 0.30 (powered at 0.30); α = .05, β = .2.
Reject H0 in stage 1 if X1 ≥ s1.


n1   r1   s1   nt   rt   α       1−β     E0[N]   E1[N]
29   4    9    47   10   0.0490  0.8013  36.6    36.9

7.3.3 Simon’s design

In his 1989 paper, Simon introduced two criteria for choosing a two-stage design for single-arm, one-sided tests. The optimal design has the smallest expected sample size under H0 (n1 + Ep0[n2]), and the minimax design has the smallest total sample size (n1 + n2). For p0 = 0.15 and p1 = 0.30,

              n1   r1   nt   rt   α      1−β    E0[N]   pet0   E1[N]   pet1
optimal       19   3    55   12   0.048  0.801  30.4    0.68   50.2    0.13
minimax       23   3    48   11   0.046  0.804  34.5    0.54   46.7    0.05
single stage  ——   ——   48   11   0.048  0.819  48.0    0.00   48.0    0.00

Conditional power

To find a good design (sample sizes and critical values), we need to understand the conditional power of a design. The conditional power is the probability of rejecting H0 (in stage 2) given the stage 1 result, i.e., conditioned on X1 = x1. Clearly, when x1 > rt, the conditional power is 1, and when x1 ≤ r1 (futility stop), the conditional power is 0.

CP(x1) = P[Reject in stage 2 | x1] = P[x1 + X2 > rt | x1]
       = P[X2 > rt − x1 | x1]
       = Σ_{x2 = rt − x1 + 1}^{n2} C(n2, x2) p^x2 (1 − p)^(n2 − x2)

Conditional power is a function of p, x1, and n2, as well as rt.

To obtain the unconditional power, we need to integrate (sum) the conditional power over all possible x1 values:

ρ(p) = Σ_{x1 = 0}^{n1} CP(x1) Pp[X1 = x1]
     = Σ_{x1 = r1 + 1}^{n1} CP(x1) C(n1, x1) p^x1 (1 − p)^(n1 − x1).

Given α and β, a design is chosen so that ρ(p0) ≤ α and ρ(p1) ≥ 1 − β.

Unlike in a single-stage situation, there may be more than one good design. Simon used the optimal and minimax criteria to choose two reasonable designs among the many satisfying the type I error rate and power constraints.

Expected sample size under the null can be written as

Ep0[nt] = n1 + n2 × P[continue to stage 2 | p0]
        = n1 + n2 × P[X1 > r1 | p0]
        = n1 + n2 × Σ_{x1 = r1 + 1}^{n1} C(n1, x1) p0^x1 (1 − p0)^(n1 − x1).

Computing design characteristics

> simon.d(n1=23, r1=3, nt=48, rt=11, p0=.15, p1=.30)
[[1]]
   x1  pst1.0 pst1.1    cp0    cp1
1   0  0.0238 0.0003 0.0000 0.0000
2   1  0.0966 0.0027 0.0000 0.0000
3   2  0.1875 0.0127 0.0000 0.0000
4   3  0.2317 0.0382 0.0000 0.0000
5   4  0.2044 0.0818 0.0255 0.4882
6   5  0.1371 0.1332 0.0695 0.6593
7   6  0.0726 0.1712 0.1615 0.8065
8   7  0.0311 0.1782 0.3179 0.9095
9   8  0.0110 0.1527 0.5289 0.9668
10  9  0.0032 0.1091 0.7463 0.9910
11 10  0.0008 0.0655 0.9069 0.9984
12 11  0.0002 0.0332 0.9828 0.9999
13 12  0.0000 0.0142 1.0000 1.0000
14 13  0.0000 0.0052 1.0000 1.0000
15 14  0.0000 0.0016 1.0000 1.0000
16 15  0.0000 0.0004 1.0000 1.0000

[[2]]
  n1 r1 nt rt   p0  p1
1 23  3 48 11 0.15 0.30

[[3]]
    pow0   pow1   pet0   pet1
1 0.0455 0.8035 0.5396 0.0538
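The operating characteristics in the [[3]] component follow directly from the formulas above; a sketch in Python (rather than the notes' R) that reproduces pow0 = 0.0455, pet0 = 0.5396, E0[N] = 34.5, pow1 = 0.8035, and pet1 = 0.0538 up to rounding:

```python
from math import comb

def dbinom(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def simon_chars(n1, r1, nt, rt, p):
    """Power, probability of early termination (PET), and expected sample size
    of a Simon two-stage design at true response probability p."""
    n2 = nt - n1
    pet = sum(dbinom(x1, n1, p) for x1 in range(0, r1 + 1))
    power = sum(dbinom(x1, n1, p) *
                sum(dbinom(x2, n2, p)
                    for x2 in range(max(0, rt - x1 + 1), n2 + 1))
                for x1 in range(r1 + 1, n1 + 1))
    en = n1 + n2 * (1 - pet)            # E[N] = n1 + n2 * P[continue]
    return power, pet, en

pow0, pet0, en0 = simon_chars(23, 3, 48, 11, p=0.15)  # under H0
pow1, pet1, _ = simon_chars(23, 3, 48, 11, p=0.30)    # under H1
print(pow0, pet0, en0, pow1, pet1)
```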


[Figure: conditional power as a function of x1 (0-15) for the design above.]

Given a design, computing operating characteristics such as the type I error rate, power, and expected sample size is not difficult; however, solving for the optimal, minimax, and other preferable designs is not trivial. Simon's original paper shows how to do this.

A very good webpage is http://www.cscc.unc.edu/cscc/aivanova/SimonsTwoStageDesign.aspx


Something in between

The two criteria, optimal and minimax, give two designs that are extreme, and neither may fit the investigators' needs. For example, for testing H0 : p = 0.3 with α = 0.05 and β = 0.10 at p1 = 0.45, the optimal and minimax designs are:

          n1   r1   nt    rt   α      1−β    E0[N]   pet0
optimal   40   13   110   40   0.048  0.901  60.8    0.70
balanced  53   18   106   39   0.043  0.903  64.4    0.78
minimax   77   27   88    33   0.050  0.901  78.5    0.86

The optimal design tends to have a small n1, and the minimax design tends to have a large n1. Therefore, a simple approach to finding a good alternative design is to force n1 = n2 (the balanced design of Ye and Shyr, 2007).

A more systematic approach is to express the criterion for optimization as

q(w) = w × nt + (1 − w) × E0[N],

where 0 ≤ w ≤ 1. q(0) and q(1) correspond to the optimal and minimax designs, respectively. Computation shows that the minimax design is the best design with respect to q(w) for w ∈ (0.827, 1].

In between the optimal and minimax designs, the following “admissible” designs exist that optimize q(w) for certain ranges of w (Jung, Lee, Kim, and George, 2004).

              n1   r1   nt    rt   α      1−β    E0[N]   pet0   w
optimal       40   13   110   40   0.048  0.901  60.8    0.70   (0, 0.006)
admissible 1  43   14   104   38   0.050  0.903  60.8    0.70   (0.006, 0.136)
admissible 2  48   16   101   37   0.050  0.901  61.3    0.75   (0.136, 0.182)
admissible 3  40   12   94    35   0.048  0.902  62.8    0.58   (0.182, 0.303)
admissible 4  46   14   91    34   0.049  0.902  64.1    0.60   (0.304, 0.827)
minimax       77   27   88    33   0.050  0.901  78.5    0.86   (0.827, 1)
single stage  ——   ——   90    34   0.045  0.900  90.0    0.00


7.4 Data analysis following a two-stage design in phase II clinical trials

The primary objective of a (cancer) phase II clinical trial is to make a correct “go/no-go” decision; however, making a good inference for p is advantageous for planning the following phase III trial.

We have seen before (Chapter 3) that when we terminate a study based on an interim summary of the data, a usual statistic that we often compute may be biased. In this section, we will look in detail at the issue of bias in two-stage designs in phase II clinical trials. Simon's design will be our focus, but many of the general points apply to other designs as well.

7.4.1 p-value

If we ignore the fact that the data were gathered in a two-stage design and compute a p-value as if X ∼ Binomial(nt, p), it is bigger than the true p-value under the following definition/interpretation.

p-value the probability under the null hypothesis that we would observe the data as ormore extreme than what we have observed

The term “as or more extreme” can be interpreted as “as big or bigger evidence against H0”. In a simple single-stage design, the meaning of this is usually straightforward. We can all agree that Z = 2.0 is more extreme (more evidence against H0) than Z = 1.9. However, in two-stage designs, understanding the definition of the p-value sometimes gets tricky.

Example: H0 : p = 0.3, H1 : p > 0.3; α = 0.05 and the power is 0.80 at p = 0.5. Then the optimal design is: n1 = 15, r1 = 5, nt = 46, rt = 18.


Now suppose that we observe X1 = 7 in stage 1, so that we move on to the second stage. In stage 2, we observe an additional 12 positive responses in n2 = 31 patients (19 in 46 total), so that H0 is rejected because Xt = 19 > rt.

If we compute a p-value without taking the study design into account, we might use X ∼ Binomial(46, 0.3) and compute

pc = P0[X ≥ 19] = Σ_{i = 19}^{46} C(46, i) 0.3^i (1 − 0.3)^(46 − i) = 0.0681,

where pc is a conventional p-value. H0 is rejected, but this p-value is greater than α.

To see this inconsistency clearly, we rewrite the above as

pc = P0[X ≥ 19] = Σ_{x1 = 0}^{15} P0[X2 ≥ 19 − x1 | X1 = x1] P0[X1 = x1].

From this expression we see that in computing pc, we include sample paths that cannot be realized with this Simon design, namely X1 = 0, X2 ≥ 19; X1 = 1, X2 ≥ 18; · · · ; X1 = 5, X2 ≥ 14. A proper p-value that takes into account the actual sampling scheme used is

pp = Σ_{x1 = 6}^{15} P0[X2 ≥ 19 − x1 | X1 = x1] P0[X1 = x1].

In general, for Simon-like two-stage designs, the p-value should be calculated as

pp = Σ_{x1 = r1 + 1}^{n1} P0[X2 ≥ xt − x1 | X1 = x1] P0[X1 = x1],

if x1 > r1 (i.e., if there is a second stage).

The following simple R script computes this p-value:

Page 79: Tatsuki Koyama, PhD - Vanderbilt Universitybiostat.mc.vanderbilt.edu/wiki/pub/Main/OsakaUnivSummer2015/Osa… · – propensity score matching – propensity score as weights –

CHAPTER 7. PHASE II CLINICAL TRIALS 71

pp <- function(n1, r1, nt, rt, x1, xt, p0){
  x1v <- (r1 + 1):n1
  p.val <- sum((1 - pbinom(xt - x1v - 1, nt - n1, p0)) * dbinom(x1v, n1, p0))
  pc <- 1 - pbinom(xt - 1, nt, p0)
  if(x1 <= r1){ p.val <- pc <- 1 - pbinom(x1 - 1, n1, p0) }
  c(p.val = p.val, pc = pc)
}

pp(n1=15, r1=5, nt=46, rt=18, x1=7, xt=19, p0=.3)
     p.val         pc
0.04986501 0.06805442

When x1 ≤ r1, so that the trial is terminated in stage 1, we can define

pp = P0[X1 ≥ x1].

Thus we regard “moving on to the second stage” as stronger evidence against H0 than “terminating in the first stage for futility”, which makes sense.

The proper p-value (pp) has the following characteristics:

• It is always smaller than or equal to pc.

• It is consistent with the hypothesis testing, i.e., pp ≤ α if and only if H0 is rejected.

• If Xt = rt + 1, then pp is equal to the level of the test (the so-called actual type I error rate).

• It does not distinguish different sample paths that lead to the same Xt. That is, the evidence against H0 is identical whenever xt is the same, regardless of x1. For example, X1 = 8, X2 = 12 and X1 = 10, X2 = 10 yield the same p-value.

When does this pp break down? It breaks down when we allow n2 to differ for various values of X1. In some modifications of Simon's design (e.g., Banerjee A, Tsiatis AA. Stat Med 2006), the stage 2 sample size varies with x1. Then pp cannot be computed, because we cannot order the sample paths simply based on Xt.

A bigger concern is that this pp cannot be used when n2 is changed from that planned. An even bigger concern: if the actual n2 differs from that planned, how can we re-compute the critical value rt to control the type I error rate? The answer is not simple!

7.4.2 Point estimate

Because the results from a phase II clinical trial are often used in planning a phase III clinical trial, a good estimate of p is often of interest.

MLE

In a single-stage design, the MLE of p is p̂ = x/n. For Simon's design, we can write the likelihood, letting Yi denote the individual datum from a Bernoulli(p) population, as follows:

L(p|Y) = Π_{i=1}^{n1} p^yi (1 − p)^(1−yi)   if Σ_{i=1}^{n1} yi ≤ r1
       = Π_{i=1}^{nt} p^yi (1 − p)^(1−yi)   if Σ_{i=1}^{n1} yi > r1

l(p|X) = x1 log(p) + (n1 − x1) log(1 − p)   if x1 ≤ r1
       = xt log(p) + (nt − xt) log(1 − p)   if x1 > r1

Therefore, the MLE of p is

p̂(x) = x1/n1   if x1 ≤ r1
      = xt/nt   if x1 > r1

We have seen before that this p̂(x) has a downward bias, i.e., Ep[p̂(x)] ≤ p. A simple explanation is that when p̂ is small at the end of stage 1, we tend to terminate the study, and this downward bias tends to remain; however, when p̂ is large at the end of stage 1, more data are gathered, and the upward bias of stage 1 tends to be corrected.

Example: p0 = 0.3, p1 = 0.5, α = 0.05, β = 0.2. Then the minimax design is (n1 = 19, r1 = 6, nt = 39, rt = 16). Further suppose X1 = 8 and X2 = 12, so that Xt = 20. Then

p̂ = 20/39 = 0.513.

Whitehead

We can write the bias of the MLE as

B(p) = Ep[p̂(x)] − p.

So a good estimator would be

p̂ − B(p).

However, B(p) is unknown, so we need to estimate it. Let's use the resulting estimate of p itself inside B(·); that is, p̂w solves

p̂w = p̂ − B(p̂w).

This is Whitehead's estimator (1986, Biometrika). We can write

p̂w = p̂ − Ep̂w[p̂(x)] + p̂w,

which leads to

Ep̂w[p̂(x)] = p̂.

To find p̂w, we need to numerically solve for the p̂w that satisfies

Ep̂w[p̂(x)] = Σ_{x1=0}^{r1} (x1/n1) P[X1 = x1 | p = p̂w]
           + Σ_{x1=r1+1}^{n1} Σ_{x2=0}^{n2} ((x1 + x2)/nt) P[X1 = x1 | p = p̂w] P[X2 = x2 | p = p̂w]
           = p̂.

In the current example, p̂w = 0.520.

Page 82: Tatsuki Koyama, PhD - Vanderbilt Universitybiostat.mc.vanderbilt.edu/wiki/pub/Main/OsakaUnivSummer2015/Osa… · – propensity score matching – propensity score as weights –

CHAPTER 7. PHASE II CLINICAL TRIALS 74

Koyama

We can again write the bias of the MLE as B(p) = Ep[p̂(x)] − p, so a good estimator would be p̂ − B(p). However, B(p) is unknown, so let's simply use B(p̂); that is,

p̂k = p̂ − B(p̂).

This is simpler and more straightforward than Whitehead's estimator. We can write

p̂k = p̂ − Ep̂[p̂(x)] + p̂ = 2p̂ − Ep̂[p̂(x)].

Solving for p̂k is considerably easier. First compute

Ep̂[p̂(x)] = Σ_{x1=0}^{r1} (x1/n1) P[X1 = x1 | p = p̂]
          + Σ_{x1=r1+1}^{n1} Σ_{x2=0}^{n2} ((x1 + x2)/nt) P[X1 = x1 | p = p̂] P[X2 = x2 | p = p̂],

then subtract it from 2p̂. In the current example, p̂k = 0.521.

Unbiased estimator

For a general multistage design with early stopping for futility and efficacy, Jung and Kim (2004, Stat Med) found the unbiased estimator of p. They showed that the pair (M, S), where M is the stage at termination and S is the number of successes, is complete and sufficient for p. Since x1/n1 is clearly unbiased for p, the uniformly minimum variance unbiased estimator (UMVUE) is found through the Rao-Blackwell theorem.

The expression for p̂ub is complex, but for Simon's two-stage design (two stages with only a futility stop) it can be written as

p̂ub = [ Σ_{x1=(r1+1)∨(xt−n2)}^{n1∧xt} C(n1 − 1, x1 − 1) C(n2, xt − x1) ] / [ Σ_{x1=(r1+1)∨(xt−n2)}^{n1∧xt} C(n1, x1) C(n2, xt − x1) ],

Page 83: Tatsuki Koyama, PhD - Vanderbilt Universitybiostat.mc.vanderbilt.edu/wiki/pub/Main/OsakaUnivSummer2015/Osa… · – propensity score matching – propensity score as weights –

CHAPTER 7. PHASE II CLINICAL TRIALS 75

where a∧b = min(a,b) and a∨b = max(a,b).

For the current example, max(r1 + 1, xt − n2) = max(6 + 1, 20 − 20) = 7 and min(n1, xt) = min(19, 20) = 19, and

p̂ub = [ Σ_{x1=7}^{19} C(18, x1 − 1) C(20, 20 − x1) ] / [ Σ_{x1=7}^{19} C(19, x1) C(20, 20 − x1) ] = 0.517.

Median estimator

Another simple estimator is the value p0* such that the p-value for testing H0 : p = p0*, based on the realized sample path, is 0.5. Many adaptive designs for phase II clinical trials were originally motivated as hypothesis testing procedures, and computing this estimator should be fairly simple in many designs.

If the test statistic is continuous, this estimator is known as the median unbiased estimator (Cox and Hinkley, 1974). It is unbiased for the true median. The proof uses the fact that the p-value is distributed Unif(0, 1) under H0.

We need to find p0* such that

pp = Σ_{x1=r1+1}^{n1} Pp0*[X2 ≥ xt − x1 | X1 = x1] Pp0*[X1 = x1]
   = Σ_{x1=7}^{19} Pp0*[X2 ≥ 20 − x1 | X1 = x1] Pp0*[X1 = x1]
   = 0.5,

which gives p0* = 0.500.


Comparisons

To compare these methods, we compute the bias of each estimator for various true values of p, and use the bias and the mean squared error (MSE = variance + bias²) to compare them. For each estimator, compute p̂ for every sample path (indexed by X in [0, nt]) and compute

Ep[p̂] = Σ_{x=0}^{nt} p̂(x) Pp[X = x].

Mean squared errors can be computed by

MSEp[p̂(x)] = Σ_{x=0}^{nt} (p̂(x) − p)² Pp[X = x].

The following two plots show bias and MSE for the current example.


[Figures: bias and MSE versus true p (0.2-0.6) for the MLE, Whitehead, Unbiased, Koyama, and Median estimators.]


Chapter 8

Treatment effects monitoring

8.1 Introduction

A phase III clinical trial (comparative treatment efficacy phase) is a type of trial design that assesses the efficacy of a new treatment relative to an alternative: placebo, standard therapy, or no treatment.

DSMB Data and safety monitoring board

DMC Data monitoring committee

TEMC Treatment effects monitoring committee

• DMC should have no formal involvement with subjects or investigators.

• DMC should interact actively in data analysis and request additional analyses if necessary.

• DMC usually meets two to three times a year (or after a set number of patients contribute data).



Motivations for monitoring treatment effects

• Check protocol compliance (baseline variables)
Baseline imbalances alone are not likely to be a cause for much concern, but they can undermine the credibility of a trial, and some intervention might be proposed to correct them.

• Review accrual
Accrual tends to be slow at the beginning of a trial. The dropout rate may be higher than expected; the event rate may be lower than planned. Remedial actions include prolonging the accrual period and adding study centers.

• Review resource availability
Money! Human resources (loss of irreplaceable expertise), difficulty obtaining rare drugs.

• Review data quality
The DMC checks patient eligibility (minor deviations are common), minor deviations in baseline data acquisition, randomization, and misdiagnosis. Deviations occurring in more than 10% of all patients may be a sign of an internal quality-control problem. Treatment compliance/adherence.

• Report adverse events
Frequent side effects of low intensity may trigger dose reduction. In contrast, a rarely occurring fatal toxicity could be intolerable in studies where patients are basically healthy or have a long life expectancy.

• Monitor treatment efficacy
After all the previously mentioned checks are cleared, the TEMC assesses efficacy differences.

Specific questions for TEMC

• Should the trial continue?
There may be secondary outputs from the trial, such as secondary questions or the database.


• Should the protocol be modified?
e.g., terminating one of many arms; adjusting the timing or frequency of diagnostic tests; changing the consent process; improving the quality of data collection, ...

• Does the TEMC need other views of the data?

• Should the TEMC meet more/less frequently? If the timing is based on “information time,” the meetings may not occur at the recommended intervals (calendar time).

Reasons for stopping a trial (Table 14.1)

• Treatments are found to be convincingly different.

• Treatments are found to be convincingly not different.

• Side effects are too severe.

• The data are of poor quality.

• Accrual is too slow.

• Definitive information about the treatment becomes available, making the study unethical or unnecessary.

• The scientific questions are no longer important.

• Adherence to the treatment is unacceptably poor.

• Resources to perform the study are no longer available.

• The study integrity has been undermined.

Factors to consider before terminating a study

• Delays in reporting.

• Baseline differences.

• Bias in response assessment.


• Missing data.

• Credibility of results if stopped early.

Reasons for not stopping early:

• Increasing precision and reducing errors

• Subgroup analyses, interaction effects, secondary endpoints

Examples
ECMO (extracorporeal membrane oxygenation) vs standard treatment: 39 newborn infants were enrolled in a trial, and the trial was terminated when there were 4 deaths among 10 infants in the standard group compared with 0 of 9 on ECMO.

fisher.test(cbind(c(0,9),c(4,6)), alt=’less’)

The one-sided p-value is 0.054.
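The same one-sided p-value can be obtained from the hypergeometric distribution that underlies Fisher's exact test. A minimal Python check of the R call above (a sketch, not the course's R code):

```python
from math import comb

def hypergeom_pmf(x, N, K, n):
    """P[X = x] when x of the n sampled units come from the K 'marked' units of N."""
    return comb(K, x) * comb(N - K, n - x) / comb(N, n)

# Table: 0/9 deaths on ECMO vs 4/10 deaths on standard treatment.
# Conditioning on the margins: N = 19 infants, K = 4 deaths, n = 9 on ECMO.
# One-sided ('less') p-value: P[deaths on ECMO <= 0].
p_value = sum(hypergeom_pmf(x, 19, 4, 9) for x in range(0 + 1))
```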

8.1.1 Composition and organization of TEMC = DMC

The TEMC is intellectually and financially independent of the study investigators so that it can provide objective assessments.

Who should the TEMC make recommendations to? The trial sponsor, the trial investigators, or both?

“The TEMC has an obligation to inform the investigators of their opinions and recommendations about actions that carry ethics implications.”

FDA’s 1989 guideline has a very brief description of data monitoring and DMCs.

NIH policy (1998)


• All sponsored trials must have a monitoring system for safety, efficacy and validity

ICH guidelines (1998)
“When a sponsor assumes the role of monitoring efficacy or safety comparisons and therefore has access to unblinded comparative information, particular care should be taken to protect the integrity of the trial...”

“Any interim analysis that is not planned appropriately (with or without the consequences of stopping the trial early) may flaw the results of a trial and possibly weaken confidence in the conclusions drawn. · · · If an unplanned interim analysis is conducted, the clinical study report should explain why it was necessary, the degree to which blindness had to be broken, provide an assessment of the potential magnitude of bias introduced, and the impact on the interpretation of the results.”

“The IDMC should have written operating procedures and maintain records of all its meetings, ...”

“The IDMC is a separate entity from an Institutional Review Board (IRB) or an Independent Ethics Committee (IEC), and its composition should include clinical trial scientists knowledgeable in the appropriate disciplines including statistics.”

DMC membership
Data monitoring is a complex decision process and requires a variety of expertise in medicine, basic science, biostatistics, epidemiology, and medical ethics (and possibly a representative from a regulatory body).

DMC confidentiality
In general, interim data must remain confidential. The DMC rarely releases interim data, and its members must not share interim data with anyone outside of the DMC. Data leaks may affect

• Patient recruitment

• Protocol compliance

• Outcome assessment


• Market value

Then why not have the DMC use only blinded data?
Complete objectivity ≠ ethical

Revisit the question, “Should DMC include the study investigators?”

FDA draft guidance
“Knowledge of unblinded interim comparisons from a clinical trial is not necessary for those conducting or those sponsoring the trial; further, such knowledge can bias the outcome of the study by inappropriately influencing its continuing conduct or the plan of analyses. Therefore, interim data and the results of interim analyses should generally not be accessible by anyone other than DMC members.”

Guessing the between-group difference using only blinded data.
Suppose, in order to check data quality, the pooled variance is computed and reported to the DMC: for example, “We originally anticipated σ² = 20; however, the pooled variance after n1 = 50 from each group was s²1p = 25.” If the clinical trial scientist or the sponsor has access to this information, how bad is it?

By itself it is not too bad; if the data are from normal populations, the variance estimate and the mean estimate are independent.

However, without breaking the blind, they can compute the overall mean X̄1· and the overall variance s²1o, and from these recover

δ̂² = (4 − 2/n1) s²1o − (4 − 4/n1) s²1p.
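The identity can be checked numerically: with two groups of equal size n1, the overall (blinded) variance and the pooled within-group variance determine the squared difference of means exactly. A small Python sketch with made-up data:

```python
from statistics import mean, variance  # variance uses the n-1 denominator

# Hypothetical (unblinded) data: two groups of equal size n1.
t = [1.0, 2.0, 3.0, 4.0]
c = [3.0, 4.0, 5.0, 6.0]
n1 = len(t)

d_sq = (mean(t) - mean(c)) ** 2  # true squared difference of means

# Blinded summaries: overall variance and pooled within-group variance.
s2_o = variance(t + c)
s2_p = ((n1 - 1) * variance(t) + (n1 - 1) * variance(c)) / (2 * n1 - 2)

# Identity from the text: delta^2 = (4 - 2/n1) s2_o - (4 - 4/n1) s2_p
d_sq_blinded = (4 - 2 / n1) * s2_o - (4 - 4 / n1) * s2_p
```

For these data both expressions give 4.0, so the blinded summaries reveal the magnitude (though not the sign) of the treatment difference.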

DMC meeting format

• Open session

– monitor progress with blinded data

– Sponsor, Executive committee, DMC, SAC


• Closed session

– Unblinded data

– DMC and SAC

– Sponsor ?

• Executive session

– DMC only

• Debriefing session

– DMC chair, Sponsor representative, Executive committee representative


Chapter 9

Group Sequential Method

9.1 Introduction

Fully sequential method A test of significance is repeated after each observation.

Group sequential method A test of significance is repeated after a group of observations.

Some basic characteristics of a group sequential method

• The response variable needs to be observed immediately.

• Number of stages (or looks) can be 2 to 20.

• Looks are equally spaced. (This is not a critical requirement.)

• At each interim (and final) analysis, compute a summary statistic based on the cumulative data. (This is not a critical requirement.)

• A group sequential method is a strategy to stop early, whereas an “adaptive design” is often viewed as a strategy to extend the study if necessary.



• A set of critical values is computed so that the overall α is as specified.

– Haybittle-Peto (1971)
This is an ad hoc method in which a very conservative critical value (e.g., Z > 3) is used at every interim test. At the final analysis, no adjustment is used (i.e., Z > 1.96). It is highly unlikely to stop early.

– Pocock (1977)
A “repeated test of significance” at a constant significance level is used to analyze accumulating data.

– O’Brien-Fleming (1979)
The significance levels increase as the study progresses.


9.2 Example

[Figure: group sequential boundaries (Pocock and O’Brien-Fleming) plotted against the information fraction.]

O’Brien-Fleming (sample size expressed as a fraction of the fixed design)

Analysis   Fraction   n     Z      Nominal p   Spend
1          0.205      70    4.56   0.0000      0.0000
2          0.411      139   3.23   0.0006      0.0006
3          0.616      208   2.63   0.0042      0.0038
4          0.821      277   2.28   0.0113      0.0083
5          1.026      346   2.04   0.0207      0.0122
Total                                          0.0250

Page 96: Tatsuki Koyama, PhD - Vanderbilt Universitybiostat.mc.vanderbilt.edu/wiki/pub/Main/OsakaUnivSummer2015/Osa… · – propensity score matching – propensity score as weights –

CHAPTER 9. GROUP SEQUENTIAL METHOD 88

Pocock (sample size expressed as a fraction of the fixed design)

Analysis   Fraction   n     Z      Nominal p   Spend
1          0.241      82    2.41   0.0079      0.0079
2          0.483      163   2.41   0.0079      0.0059
3          0.724      244   2.41   0.0079      0.0045
4          0.965      325   2.41   0.0079      0.0037
5          1.207      407   2.41   0.0079      0.0031
Total                                          0.0250

The sample size fraction is expressed as a ratio to the sample size of the single-stage procedure.

Suppose we want to test H0 : µt − µc = 0 against H1 : µt − µc = 1 when Xt ∼ Normal(µt, σ²) and Xc ∼ Normal(µc, σ²), where σ = 4. Then for a single-stage procedure with equal sample sizes for the two groups (one-sided α = 2.5%, β = 10%; p. 46 of the lecture notes),

Nc = Nt = (zα + zβ)² (2σ²) / (δ1 − δ0)² = (1.96 + 1.28)² (2)(4²) / (1 − 0)² ≈ 337.
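The arithmetic can be reproduced with exact normal quantiles; a minimal Python check (NormalDist is in the standard library):

```python
from math import ceil
from statistics import NormalDist

alpha, beta = 0.025, 0.10
sigma, delta = 4.0, 1.0

z_a = NormalDist().inv_cdf(1 - alpha)  # ~1.960
z_b = NormalDist().inv_cdf(1 - beta)   # ~1.282

# Per-group sample size for the single-stage two-sample z-test.
n_per_group = (z_a + z_b) ** 2 * 2 * sigma**2 / delta**2
# ~336.2 before rounding; rounding up gives 337 per group
```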


[Figure: alpha-spending functions, cumulative type I error plotted against the information fraction.]

9.3 General applications

Let k = 1, · · · , K denote the stages, so that we have

X̄t(k) − X̄c(k) = (1/ntk) ∑_{i=1}^{ntk} Xti − (1/nck) ∑_{i=1}^{nck} Xci ∼ Normal( µt − µc, σ²/ntk + σ²/nck ),


where ntk and nck are the cumulative sample sizes for the treatment and control groups. Note that this is not a conditional distribution but a marginal distribution.

Define “information” as Ik = (σ²/ntk + σ²/nck)⁻¹. Roughly speaking, information is the reciprocal of the square of what appears in the denominator of the test statistic Z. When nk = ntk = nck, Ik = (2σ²/nk)⁻¹.

Define the test statistic for stage k as

Zk = (X̄t(k) − X̄c(k)) / √(2σ²/nk) = (X̄t(k) − X̄c(k)) √Ik.

The vector (Z1, · · · , ZK) is multivariate normal because each Zk is a linear combination of the independent normal variates Xti and Xci. The marginal distribution of Zk is

Zk ∼ Normal( (µt − µc)√Ik, 1 ).

How about the covariance of Zk1 and Zk2 for k1 < k2?

Cov(Zk1, Zk2) = Cov( (X̄t(k1) − X̄c(k1))√Ik1, (X̄t(k2) − X̄c(k2))√Ik2 )
= Cov( X̄t(k1) − X̄c(k1), X̄t(k2) − X̄c(k2) ) √Ik1 √Ik2
= [ Cov(X̄t(k1), X̄t(k2)) + Cov(X̄c(k1), X̄c(k2)) ] √Ik1 √Ik2.

For the treatment group (and likewise for the control group),

Cov( X̄t(k1), X̄t(k2) ) = Cov( (1/nk1) ∑_{i=1}^{nk1} Xi, (1/nk2) ∑_{i=1}^{nk1} Xi + (1/nk2) ∑_{i=nk1+1}^{nk2} Xi )
= (1/nk1)(1/nk2) Var( ∑_{i=1}^{nk1} Xi )
= σ²/nk2.

Hence,

Cov(Zk1, Zk2) = σ² (1/nk2 + 1/nk2) √Ik1 √Ik2 = (2σ²/nk2) √Ik1 √Ik2 = √(Ik1/Ik2).

Therefore,

Page 99: Tatsuki Koyama, PhD - Vanderbilt Universitybiostat.mc.vanderbilt.edu/wiki/pub/Main/OsakaUnivSummer2015/Osa… · – propensity score matching – propensity score as weights –

CHAPTER 9. GROUP SEQUENTIAL METHOD 91

• (Z1, · · · , ZK) is multivariate normal,

• E[Zk] = (µt − µc)√Ik, k = 1, · · · , K, and

• Cov(Zk1, Zk2) = √(Ik1/Ik2), 1 ≤ k1 ≤ k2 ≤ K.
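The covariance result can be checked by simulation. With equal per-look group sizes, Ik is proportional to nk, so Cov(Zk1, Zk2) = √(nk1/nk2); for simplicity this sketch simulates a one-sample version of the standardized cumulative statistics (the correlation structure is the same), with arbitrary look sizes n1 = 20, n2 = 40:

```python
import random
from math import sqrt

random.seed(1)
n1, n2, reps = 20, 40, 20000  # cumulative sample sizes at looks 1 and 2

z1s, z2s = [], []
for _ in range(reps):
    xs = [random.gauss(0, 1) for _ in range(n2)]
    z1s.append(sum(xs[:n1]) / sqrt(n1))  # standardized statistic at look 1
    z2s.append(sum(xs) / sqrt(n2))       # standardized statistic at look 2

m1, m2 = sum(z1s) / reps, sum(z2s) / reps
cov = sum((a - m1) * (b - m2) for a, b in zip(z1s, z2s)) / reps
# theoretical value: sqrt(I1/I2) = sqrt(n1/n2) = sqrt(0.5) ~ 0.707
```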

The general decision rule for a group sequential design is:

After group k = 1, · · · , K−1:
if |Zk| ≥ ck, stop and reject H0;
otherwise continue to group k+1.

After group K:
if |ZK| ≥ cK, stop and reject H0;
otherwise stop for futility.

The test's type I error rate can be expressed as

P{ |Zk| ≥ ck for some k = 1, · · · , K }.

The critical values ck are chosen so that the above probability equals α. The power of the study at δ1 is

P{ ⋃_{k=1}^{K} ( |Zj| < cj for j = 1, · · · , k−1 and |Zk| ≥ ck ) }.

Evaluation of this probability requires knowing the distribution of (Z1, · · · , ZK). Refer to tables of ck values or computer software.

For the Pocock method, the critical values are constant: ck = CP(K, α). For the example above, CP(5, 0.05) = 2.41. For the O'Brien-Fleming method, the critical values have the form ck = CB(K, α)√(K/k). For the same example, we have CB(5, 0.05) = 2.04. From this we can compute the critical values 2.04√(5/4) = 2.28, 2.04√(5/3) = 2.63, and so on. More generally, if the stage sample sizes differ, use Ik; that is, ck = CB(K, α, I)√(IK/Ik).
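The boundary form ck = CB(K, α)√(K/k) reproduces the tabulated O'Brien-Fleming critical values; a one-line Python check:

```python
from math import sqrt

C_B = 2.04  # final critical value for K = 5 from the example above
K = 5
# O'Brien-Fleming boundary: c_k = C_B * sqrt(K/k), k = 1, ..., K
c = [round(C_B * sqrt(K / k), 2) for k in range(1, K + 1)]
# matches the earlier table: 4.56, 3.23, 2.63, 2.28, 2.04
```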


9.3.1 Beta blocker heart attack trial

Seven analyses (including the final one) were planned (corresponding to the timing of the DMC meetings) using O'Brien-Fleming bounds with a two-sided type I error rate of 5%. The primary outcome was survival, and the log-rank test was used.

If a Pocock boundary had been used, N = 7 looks and α = 0.05 give Z = 2.485. Therefore, the trial would have been stopped at the same point.

9.3.2 non-Hodgkin’s lymphoma

Pocock 1983, Clinical Trials: A Practical Approach. A trial was conducted in patients with non-Hodgkin's lymphoma comparing two drug combinations (cytoxan-prednisone [CP] and cytoxan-vincristine-prednisone [CVP]). The primary endpoint was tumor shrinkage (Yes/No).

Statistical analyses were planned after approximately every 25 patients, with 5 looks and one-sided α = 0.05. The Pocock procedure requires a significance level of 0.0169 at each analysis.


gsDesign(k=5, test.type=1, alpha=0.05, n.fix=1, sfu=’Pocock’)

Tumor shrinkage

Analysis   CP      CVP     p-value
1          3/14    5/11    0.201
2          11/14   13/24   0.338
3          18/40   17/36   0.846
4          18/54   24/48   0.087
5          23/67   31/59   0.039

9.4 Alpha-spending

“Classical” group sequential designs have equal information (sample size) at every stage, but we may want to be a little more flexible. When Ik is not constant, we might want to change the α spent accordingly.

Decompose the rejection region:

R = P{ |Zk| ≥ ck for some k = 1, · · · , K }
= P{ (|Z1| ≥ c1) or (|Z1| < c1 and |Z2| ≥ c2) or · · · }
= P{ |Z1| ≥ c1 } + P{ |Z1| < c1 and |Z2| ≥ c2 } + P{ |Z1| < c1 and |Z2| < c2 and |Z3| ≥ c3 } + · · ·
= α(I1) + (α(I2) − α(I1)) + (α(I3) − α(I2)) + · · ·

The biggest advantage of the alpha-spending approach is its flexibility; neither the number nor the timing of the interim analyses needs to be specified in advance. The monitoring plan can be changed during the trial, and the type I error rate is still preserved.

Alpha-spending functions

O’Brien-Fleming       α(t) = 2[1 − Φ(zα/2/√t)]
Pocock                α(t) = α log(1 + (e − 1)t)
Kim-DeMets (power)    α(t, θ) = α t^θ   (for θ > 0)
Hwang-Shih-DeCani     α(t, φ) = α (1 − e^{−φt})/(1 − e^{−φ})   (for φ ≠ 0)
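All four families spend exactly α at information fraction t = 1, which is easy to verify; a minimal Python sketch of the formulas above:

```python
from math import exp, log, sqrt, e
from statistics import NormalDist

Phi = NormalDist().cdf
inv_Phi = NormalDist().inv_cdf

def obf(t, alpha):               # O'Brien-Fleming-type spending
    return 2 * (1 - Phi(inv_Phi(1 - alpha / 2) / sqrt(t)))

def pocock(t, alpha):            # Pocock-type spending
    return alpha * log(1 + (e - 1) * t)

def power_fam(t, alpha, theta):  # Kim-DeMets (power family)
    return alpha * t**theta

def hsd(t, alpha, phi):          # Hwang-Shih-DeCani
    return alpha * (1 - exp(-phi * t)) / (1 - exp(-phi))
```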


[Figure: alpha-spending functions (O-F, Pocock, Power(1), Power(0.5), Power(2), HSDC(4), HSDC(−4)) plotted against the information fraction.]

9.5 One-sided test

If “stop for futility” is not an option, the same boundary can be used. If a futility stop is an option, then


After group k = 1, · · · , K−1:
if Zk ≥ bk, stop and reject H0;
if Zk ≤ ak, stop for futility (accept H0);
otherwise continue to group k+1.

After group K:
if ZK ≥ bK, stop and reject H0;
if ZK < aK, stop for futility.

Note that aK = bK ensures that the test terminates at analysis K.

9.6 Repeated confidence intervals

If we compute unadjusted confidence intervals X̄so far ± 1.96 σ/√nso far at the end of each stage, we get low coverage probabilities. Armitage, McPherson, and Rowe (“Repeated significance tests on accumulating data,” JRSS-A 1969) computed the actual coverage probabilities (Table 2).

Number of looks   Overall probability that all intervals contain θ
1                 0.95
2                 0.92
3                 0.89
4                 0.87
5                 0.86
10                0.81
20                0.75
50                0.68
∞                 0
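The shrinking coverage is easy to reproduce by simulation. This sketch estimates the probability that all five unadjusted 95% intervals cover θ = 0, which is equivalent to |Zk| < 1.96 at every look:

```python
import random
from math import sqrt

random.seed(0)
K, reps = 5, 20000
covered = 0
for _ in range(reps):
    s, ok = 0.0, True
    for k in range(1, K + 1):
        s += random.gauss(0, 1)         # one new standardized observation group
        if abs(s / sqrt(k)) >= 1.96:    # Z_k at look k crosses +/-1.96
            ok = False
            break
    covered += ok
coverage = covered / reps
# close to the tabulated 0.86 for five looks
```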

The idea of repeated confidence intervals (RCIs) is to use an adjusted value ck(α, K) instead of 1.96 so that the overall coverage probability is 1 − α/2. The value ck(α, K) is the critical value (boundary) for each stage; it depends on α and K if a Pocock boundary is used, and additionally on k if an O'Brien-Fleming boundary is used.

Example:


Suppose we use a 6-stage group sequential design of O'Brien-Fleming type with a two-sided α = 5%. The critical values are:

> gsDesign(k=6, test.type=2, alpha=0.025, sfu=’OF’)

Symmetric two-sided group sequential design with
90% power and 2.5% Type I Error.
Spending computations assume trial stops
if a bound is crossed.

           Sample
            Size
Analysis   Ratio*   Z      Nominal p   Spend
1          0.172    5.03   0.0000      0.0000
2          0.343    3.56   0.0002      0.0002
3          0.515    2.90   0.0018      0.0017
4          0.686    2.51   0.0060      0.0047
5          0.858    2.25   0.0123      0.0079
6          1.030    2.05   0.0200      0.0105
Total                                  0.0250

++ alpha spending: O'Brien-Fleming boundary

* Sample size ratio compared to fixed design with no interim

The critical values are 5.03, 3.56, 2.90, 2.51, 2.25, 2.05.

First, let's confirm that the critical values have the form ck = COB(K, α)√(IK/Ik). The final critical value is COB(6, α) = 2.05, and assuming the looks are equidistant (same group sample size), we have

c1 = 2.05√(6/1) = 5.03    c2 = 2.05√(6/2) = 3.56
c3 = 2.05√(6/3) = 2.90    c4 = 2.05√(6/4) = 2.51
c5 = 2.05√(6/5) = 2.25    c6 = 2.05√(6/6) = 2.05

Then after stage 1, we would use 5.03 in place of the regular 1.96 when computing a 95% confidence interval. In general, after stage k (k = 1, · · · , 6),

(x̄kt − x̄kc) ± ck √(2σ²)/√(mk),

where m is the per-group sample size for each stage.

With a Pocock design, the critical value ck is a constant. (ck = 2.45 for k = 1, · · · ,6)

This method (RCI) is consistent with the corresponding hypothesis test: the confidence interval for stage k excludes the null value if and only if H0 is rejected at stage k. Thus, we can use the idea of “inverting the hypothesis test” to get the same confidence interval. (More later.)

9.7 P-values

Recall how we constructed a proper p-value for Simon's two-stage design in the phase II methodology. We needed to define “as or more extreme than the observed data.” To be able to do this, we need an ordering of all the sample paths. In a simple single-stage design, the ordering is usually based on z-values (or the absolute value of z-values for a two-sided test); i.e., the bigger the observed z, the stronger the evidence against H0. Then a one-sided p-value is computed by

p = P0[Z ≥ z].

With a group sequential design, or more generally with a multi-stage design with pre-specified group-wise sample sizes, the following orderings have been proposed. Notation: (k′, z′) ≻ (k, z) denotes that (k′, z′) is above (k, z).

• Stage-wise ordering.
(k′, z′) ≻ (k, z) if any of the following is true:

1. k′ = k and z′ ≥ z.


2. k′ < k and z′ ≥ bk′ (upper critical value).

3. k′ > k and z≤ ak (lower critical value).

• MLE ordering.
(k′, z′) ≻ (k, z) if z′/√Ik′ > z/√Ik. Originally proposed in connection with a test for a binomial proportion. The bigger value of the MLE gets a higher order. Sometimes called “sample mean ordering” because this is equivalent to ordering based on the sample mean (one sample) or the difference of sample means (two samples).

• Likelihood ratio ordering.
(k′, z′) ≻ (k, z) if z′ > z. (Stages do not matter.)

• Score test ordering.
(k′, z′) ≻ (k, z) if z′√Ik′ > z√Ik.

Whichever ordering is used, we can compute a one-sided p-value as

p = P0[(T, ZT) ≻ (k∗, z∗)].

For example, if we use the stage-wise ordering and the test terminates in stage K−1 with ZK−1 > bK−1 (reject H0), then

p = ∫_{b1}^{∞} g1(z; 0) dz + · · · + ∫_{z∗}^{∞} gK−1(z; 0) dz.

In the above expression, gk(z; θ) is the density function of z in stage k. Conceptually, the density of z in stage k depends on all the data in the previous stages 1, · · · , k−1, requiring multivariate integration.

Armitage, McPherson, and Rowe (1969) derived a recursive formula so that the computation is much simplified, requiring only a succession of univariate integrations. For k = 2, · · · , K,

gk(z; θ) = ∫_{Ck−1} gk−1(u; θ) (√Ik/√∆k) φ( (z√Ik − u√Ik−1 − ∆kθ)/√∆k ) du,

where Ck−1 is the continuation region of stage k−1, and ∆k = Ik − Ik−1 is the information increment.

If the stage-wise ordering is used, it automatically ensures that the p-value is less than the significance level α if and only if H0 is rejected.


Once we define the ordering to use with the group sequential test, we can compute a p-value for testing H0 : θ = 0 and, by “inverting the hypothesis test,” a confidence interval. A (1 − α/2) confidence interval is the collection of θ′0 such that H′0 : θ = θ′0 would be accepted given the observed sample path. (More details with the adaptive designs in the next chapter.)


Chapter 10

Two-stage adaptive designs

10.1 Introduction

Much of the discussion in the literature on flexible designs in phase III clinical trial methodology revolves around two-stage designs. Practically speaking, implementing flexible clinical trials beyond two stages is difficult, and such multi-stage flexible designs perhaps add little to designs with just two stages. Moreover, phase III clinical trials are for confirmatory purposes, and adaptively changing the design more than once in the middle of a confirmatory trial is not viewed favorably by regulatory agencies.

So we will only consider two-stage adaptive designs. Two-stage group sequential designsare examples of such designs.

10.2 Background

We will look at unmasked (unblinded) two-stage designs in which all the information from stage 1 is available. The design of the second stage (sample size and critical value) may be



specified as a function of the stage 1 data. If both the sample size and the critical value are constant in the stage 1 data, then the design reduces to a two-stage group sequential design.

Adaptive designs can be categorized into the following two types:

• prespecified designs
The design of the second stage (e.g., sample size and critical value) is specified before the first stage. There is nothing to decide at the end of stage 1. The design of the second stage is defined flexibly as functions of the stage 1 data. Group sequential designs fall into this category.

• unspecified designs
The design of the second stage is not specified in advance and is determined after the stage 1 data are observed.

Characteristics of these types of designs:

prespecified designs

• Type I error can be controlled.

• Type II error can be controlled. (you can specify the power.)

• You can compute characteristics of the design (e.g., expected and maximum sample size) prior to initiation of the study.

unspecified designs

• Type I error can be controlled.

• These designs give much flexibility to handle unexpected situations (e.g., a variance much bigger than anticipated).

Something in between: not specifying the stage 2 sample size is unrealistic because it makes it impossible to budget such a clinical trial. Instead of leaving stage 2 unspecified, maybe we should specify the maximum sample size. And perhaps we may want to specify the minimum power for the second stage, P[Reject H0 in stage 2 | Stage 1 data].

With these specifications, unspecified designs start to look like prespecified ones.

What do they say about adaptive designs

PhRMA (2006) “A clinical study design that uses accumulating data to decide how to modify aspects of the study as it continues, without undermining the validity and integrity of the trial.”
“... changes are made by design, and not on an ad hoc basis; therefore, adaptation is a design feature aimed to enhance the trial, not a remedy for inadequate planning.”

EMA (2006) “A study design is called ‘adaptive’ if statistical methodology allows the modification of a design element (e.g. sample size, randomisation ratio, number of treatment arms) at an interim analysis with full control of type I error rate.”
“Adaptive designs should not be seen as a means to alleviate the burden of rigorous planning of clinical trials.”

FDA (2010) “... adaptive design clinical study is defined as a study that includes a prospectively planned opportunity for modification of one or more aspects of the study design and hypotheses based on analysis of data (usually interim data) from subjects in the study.”

10.3 Toward conditional power

What are good two-stage adaptive designs?

• What do we use to compare different designs?

– Power between µ0 and µ1

– Expected sample size at different µs.

• Design of stage 1 tends to be more influential in terms of these characteristics.


• “Optimality” is not the only driving force in choosing a design. A design with a very small n1 may have different objectives from one with a large n1. (Ambitious designs vs. insurance-type designs.)

To test H0 : δ = 0, where δ = µt − µc, we take random samples from

Xt ∼ Normal(µt, σt²)    Xc ∼ Normal(µc, σc²).

Assume the true variances are equal and known: σt² = σc² = σ². Also assume the sample sizes are equal in the control and treatment groups: n1t = n1c = n1. Then

X̄1t ∼ Normal(µt, σ²/n1)    X̄1c ∼ Normal(µc, σ²/n1)

and

Z1 = √n1 (X̄1t − X̄1c) / (√2 σ).

The distribution of Z1 is Z1 ∼ Normal(√n1 ξ, 1), where ξ = δ/(√2 σ). Also define ζ = √n1 ξ for stage 1.

In stage 1, we observe Z1 and use the following decision rule:

• If Z1 < k1, stop for futility.

• If Z1 > k2, stop and reject H0.

• If k1 < Z1 < k2 then continue to stage 2.

In stage 2, we take a sample of size n2(z1) from each arm, and define

Z2 = √(n2(z1)) (X̄2t − X̄2c) / (√2 σ).

Conditioned on Z1 = z1 ∈ (k1, k2),

Z2 ∼ Normal(√(n2(z1)) ξ, 1).

The decision rule at the end of stage 2 is:


• If Z2 ≤ c(z1), stop and conclude futility.

• If Z2 > c(z1), stop and conclude efficacy.

We can use Z1 and Z2 to construct a two-stage design, and we can also construct a test statistic that combines the test statistics from both stages.

Let

Zw = (Z1 + Z2)/√2.

If Z1 and Z2 are independent,

Zw ∼ Normal( ((√n1 + √(n2(z1)))/√2) ξ, 1 ).

Under H0, Zw has the standard normal distribution. Zw is rarely used because it weights stage 1 and stage 2 data differently. To give equal weight to every datum, we should construct a test statistic that uses

Y = [ (∑_{i=1}^{n1} X1ti + ∑_{i=1}^{n2(z1)} X2ti) − (∑_{i=1}^{n1} X1ci + ∑_{i=1}^{n2(z1)} X2ci) ] / (n1 + n2(z1))

= [ (n1 X̄1t + n2(z1) X̄2t) − (n1 X̄1c + n2(z1) X̄2c) ] / (n1 + n2(z1))

= [ n1 (X̄1t − X̄1c) + n2(z1) (X̄2t − X̄2c) ] / (n1 + n2(z1))

= √2 σ ( √n1 Z1 + √(n2(z1)) Z2 ) / (n1 + n2(z1))

∼ Normal( √2 σ ξ, 2σ²/(n1 + n2(z1)) ).

Thus if we let

Zu =

√n1 +n2(z1)Y√

2σ,


then we have
\[
Z_u \sim \mathrm{Normal}\!\left(\sqrt{n_1 + n_2(z_1)}\,\xi,\; 1\right),
\]
if $Z_1$ and $Z_2$ are independent.

It is useful to write $Z_u$ as a weighted average of $Z_1$ and $Z_2$ as follows:
\[
Z_u = \frac{\sqrt{n_1}}{\sqrt{n_1 + n_2(z_1)}}\,Z_1 + \frac{\sqrt{n_2(z_1)}}{\sqrt{n_1 + n_2(z_1)}}\,Z_2.
\]
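This weighted-average identity can be checked numerically: computing $Z_u$ directly from $Y$ and via the weighted average of $Z_1$ and $Z_2$ gives the same value. A minimal sketch (the sample means below are arbitrary illustrative numbers, not from the text):

```python
from math import sqrt

# Illustrative values (not from the example in the text)
sigma, n1, n2 = 4.0, 135.0, 282.0
X1t, X1c, X2t, X2c = 1.3, 0.9, 1.1, 0.6  # stage-wise sample means

Z1 = sqrt(n1) * (X1t - X1c) / (sqrt(2) * sigma)
Z2 = sqrt(n2) * (X2t - X2c) / (sqrt(2) * sigma)

# Z_u computed directly from the pooled mean difference Y ...
Y = (n1 * (X1t - X1c) + n2 * (X2t - X2c)) / (n1 + n2)
Zu_direct = sqrt(n1 + n2) * Y / (sqrt(2) * sigma)

# ... and as the weighted average of Z1 and Z2
Zu_weighted = (sqrt(n1) * Z1 + sqrt(n2) * Z2) / sqrt(n1 + n2)
```

The two expressions agree to machine precision, which is exactly the algebra carried out above.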

Because even the existence of the second stage depends on $Z_1$, we need to think about stage 2 conditioned on stage 1 data.

Conditioned on $Z_1 = z_1$,
\[
\begin{aligned}
Z_u \mid (Z_1 = z_1) &= \frac{\sqrt{n_1}}{\sqrt{n_1 + n_2(z_1)}}\,z_1 + \frac{\sqrt{n_2(z_1)}}{\sqrt{n_1 + n_2(z_1)}}\,Z_2 \\
&\sim \mathrm{Normal}\!\left(\frac{\sqrt{n_1}\,z_1}{\sqrt{n_1 + n_2(z_1)}} + \frac{n_2(z_1)\,\xi}{\sqrt{n_1 + n_2(z_1)}},\; \frac{n_2(z_1)}{n_1 + n_2(z_1)}\right).
\end{aligned}
\]

The original decision rule at the end of stage 2 was written in terms of $Z_2$: if $Z_2 > c(z_1)$, then reject $H_0$. This can be written in terms of $Z_u$. Suppose the critical value that goes with $Z_u$ is $c_u(z_1)$. Conditioned on $Z_1 = z_1$, we have

\[
c_u(z_1) = \frac{\sqrt{n_1}}{\sqrt{n_1 + n_2(z_1)}}\,z_1 + \frac{\sqrt{n_2(z_1)}}{\sqrt{n_1 + n_2(z_1)}}\,c(z_1).
\]

The decision rule at the end of stage 2 can thus be written in terms of either $Z_2$ or $Z_u$.

There exist many different normalization schemes. For example, $Z_u$ can be rescaled to have a conditional variance of 1:
\[
\frac{\sqrt{n_1 + n_2(z_1)}}{\sqrt{n_2(z_1)}}\,Z_u \sim \mathrm{Normal}\!\left(\frac{\sqrt{n_1}}{\sqrt{n_2(z_1)}}\,z_1 + \sqrt{n_2(z_1)}\,\xi,\; 1\right).
\]


10.4 Conditional power functions

Conditional power is the probability of rejecting $H_0$ in stage 2, conditioned on the first stage data. Let $A(z_1, \xi)$ denote the conditional power at $\xi$ given $Z_1 = z_1$. Then we have
\[
A(z_1, \xi) = P_\xi\left[Z_2 > c(z_1) \mid Z_1 = z_1\right].
\]

The conditional distribution of $Z_2$ given $Z_1 = z_1$ is
\[
Z_2 \mid (Z_1 = z_1) \sim \mathrm{Normal}(\sqrt{n_2(z_1)}\,\xi,\; 1),
\]
and
\[
A(z_1, \xi) = 1 - \Phi\!\left[c(z_1) - \sqrt{n_2(z_1)}\,\xi\right].
\]
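This conditional power formula translates directly into a few lines of code. A sketch (here `c_z1` and `n2_z1` stand for the design's stage 2 critical value and per-arm sample size at the observed $z_1$; they are inputs, not computed here):

```python
from statistics import NormalDist

Phi = NormalDist().cdf  # standard normal CDF

def conditional_power(c_z1: float, n2_z1: float, xi: float) -> float:
    """A(z1, xi) = P_xi[Z2 > c(z1) | Z1 = z1], with Z2 ~ Normal(sqrt(n2(z1)) * xi, 1)."""
    return 1 - Phi(c_z1 - (n2_z1 ** 0.5) * xi)
```

At $\xi = 0$ this reduces to the conditional type I error rate $1 - \Phi[c(z_1)]$, as in the next display.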

The conditional type I error rate is $A(z_1, \xi_0)$. Specifically, when $\xi_0 = 0$, we have
\[
A(z_1, 0) = 1 - \Phi\left[c(z_1)\right].
\]
The conditional power at the alternative, $\xi_1$, is
\[
A(z_1, \xi_1) = 1 - \Phi\!\left[c(z_1) - \sqrt{n_2(z_1)}\,\xi_1\right].
\]

To specify a design that has type I error rate $\alpha$, we need to pick the conditional power functions (and other design parameters such as critical values and sample sizes) so that
\[
\alpha = \int_{k_2}^{\infty} g_1(z_1, \xi_0)\,dz_1 + \int_{k_1}^{k_2} A(z_1, \xi_0)\,g_1(z_1, \xi_0)\,dz_1
       = \alpha_1 + \int_{k_1}^{k_2} A(z_1, 0)\,g_1(z_1, 0)\,dz_1,
\]

where $g_1(z_1, \xi)$ is the probability density function of $Z_1 \sim \mathrm{Normal}(\sqrt{n_1}\,\xi, 1)$. Similarly, to ensure power of $\rho \equiv 1 - \beta$, we need
\[
1 - \beta = \rho = \int_{k_2}^{\infty} g_1(z_1, \xi_1)\,dz_1 + \int_{k_1}^{k_2} A(z_1, \xi_1)\,g_1(z_1, \xi_1)\,dz_1
          = \rho_1 + \int_{k_1}^{k_2} A(z_1, \xi_1)\,g_1(z_1, \xi_1)\,dz_1,
\]


Given that stage 1 is already designed ($n_1$, $k_1$, and $k_2$), we can choose any $A(z_1, 0)$ and $A(z_1, \xi_1)$ as long as they satisfy these $\alpha$ and $\rho$ conditions. Then we can find the critical value and sample size for stage 2 using
\[
A(z_1, \xi_0) = 1 - \Phi\!\left[c(z_1) - \sqrt{n_2(z_1)}\,\xi_0\right], \qquad
A(z_1, \xi_1) = 1 - \Phi\!\left[c(z_1) - \sqrt{n_2(z_1)}\,\xi_1\right].
\]
This relationship can be used to derive the following:
\[
n_2(z_1) = \frac{\left(z_{A(z_1,\xi_0)} - z_{A(z_1,\xi_1)}\right)^2}{(\xi_1 - \xi_0)^2}, \qquad
c(z_1) = z_{A(z_1,\xi_0)} + \frac{\xi_0}{\xi_1 - \xi_0}\left(z_{A(z_1,\xi_0)} - z_{A(z_1,\xi_1)}\right).
\]
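Solving the two displayed equations for the stage 2 design is mechanical once the two conditional power values at a given $z_1$ are fixed. A sketch (with $z_p$ the upper-$p$ standard normal quantile):

```python
from statistics import NormalDist

def upper_quantile(p: float) -> float:
    """z_p such that P[Z > z_p] = p for standard normal Z."""
    return NormalDist().inv_cdf(1 - p)

def stage2_from_A(A0: float, A1: float, xi0: float, xi1: float):
    """Solve A(z1, xi0) = A0 and A(z1, xi1) = A1 for n2(z1) and c(z1)."""
    zA0, zA1 = upper_quantile(A0), upper_quantile(A1)
    n2 = ((zA0 - zA1) / (xi1 - xi0)) ** 2
    c = zA0 + xi0 / (xi1 - xi0) * (zA0 - zA1)
    return n2, c
```

As a sanity check, with $\xi_0 = 0$, $A_0 = 0.025$, and $A_1 = 0.90$, the returned $n_2$ reproduces the familiar single-stage formula $(z_\alpha + z_\beta)^2/\xi_1^2$.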

Above, we specified the two $A(z_1, \xi)$ functions and solved for $n_2(z_1)$ and $c(z_1)$; however, we can specify any two of the four "design elements" and solve for the remaining two. Perhaps we want to specify $A(z_1, \xi_0)$ so that the type I error rate is controlled and $n_2(z_1)$ so that the sample size is controlled. In this case, we have

\[
c(z_1) = z_{A(z_1,\xi_0)} + \xi_0 \sqrt{n_2(z_1)}, \qquad
A(z_1, \xi_1) = 1 - \Phi\!\left[z_{A(z_1,\xi_0)} - \sqrt{n_2(z_1)}\,(\xi_1 - \xi_0)\right].
\]

Example. We wish to test $H_0: \mu_t - \mu_c = 0$ against $H_1: \mu_t - \mu_c > 0$. Assume $\sigma$ is known to be 4. We want the one-sided $\alpha$ to be 0.025 and the power to be 0.90 at $\mu_t - \mu_c = 1$. Then
\[
\xi_0 = 0, \qquad \xi_1 = \frac{1}{\sqrt{2}\,\sigma} = \frac{1}{4\sqrt{2}} = 0.1768.
\]

For a single-stage design, the sample size is
\[
N = \frac{(z_{0.025} + z_{0.10})^2}{0.1768^2} = 337.
\]

Let's decide to look at the data when $n_1 = 135$ observations are available from each group (approximately 40% of $N$). We also need to decide how much of $\alpha$ and $\beta$ we want to "spend"


in stage 1. Let's choose $\alpha_1 = 0.01$ and $\beta_1 = 0.025$.
\[
\alpha_1 = P_0[\text{Reject } H_0 \text{ in stage 1}] = P_0[Z_1 > k_2], \quad \text{where } Z_1 \sim \mathrm{Normal}(0, 1).
\]
Then $k_2 = 2.326$.
\[
\beta_1 = P_1[\text{Accept } H_0 \text{ in stage 1}] = P_1[Z_1 < k_1], \quad \text{where } Z_1 \sim \mathrm{Normal}(\sqrt{n_1}\,\xi_1, 1) = \mathrm{Normal}(2.054, 1).
\]
Then $k_1 = 0.094$.

Also, let's set the maximum sample size to be 500 (approximately a 50% increase from $N$).
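The design constants of this example can be reproduced in a few lines; a sketch of the arithmetic (all of these numbers appear in the text above):

```python
from math import ceil, sqrt
from statistics import NormalDist

nd = NormalDist()

def z(p: float) -> float:
    """Upper-p standard normal quantile."""
    return nd.inv_cdf(1 - p)

sigma, delta = 4.0, 1.0
xi1 = delta / (sqrt(2) * sigma)                 # 0.1768
N = ceil(((z(0.025) + z(0.10)) / xi1) ** 2)     # single-stage sample size: 337

n1 = 135                                        # interim look (about 40% of N)
alpha1, beta1 = 0.01, 0.025
k2 = z(alpha1)                                  # P0[Z1 > k2] = alpha1  ->  2.326
k1 = sqrt(n1) * xi1 - z(beta1)                  # P1[Z1 < k1] = beta1   ->  0.094
```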

[Figure: conditional power plotted against $z_1$ (0 to 2), with $\zeta_0$, $k_1$, $\zeta_1$, and $k_2$ marked on the horizontal axis.]


[Figure: total sample size $n_1 + n_2(z_1)$ (0 to 500) plotted against $z_1$, with $\zeta_0$, $k_1$, $\zeta_1$, and $k_2$ marked on the horizontal axis and horizontal reference lines at $n_1$ and $N$.]

First, let’s look at some stage 1 design characteristics:

Some stage 1 characteristics (stage 1 probabilities):

    µ     ξ        ζ       Accept   Continue   Reject
    0     0        0       0.538    0.453      0.010
    0.5   0.0884   1.027   0.175    0.728      0.097
    1     0.1768   2.054   0.025    0.582      0.393
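The rows of this table follow from $Z_1 \sim \mathrm{Normal}(\zeta, 1)$; a sketch reproducing them, using $k_1 = 0.094$ and $k_2 = 2.326$ from above:

```python
from math import sqrt
from statistics import NormalDist

Phi = NormalDist().cdf
k1, k2, n1, sigma = 0.094, 2.326, 135, 4.0

def stage1_probs(mu: float):
    """P[accept], P[continue], P[reject] in stage 1 when mu_t - mu_c = mu."""
    zeta = sqrt(n1) * mu / (sqrt(2) * sigma)
    accept = Phi(k1 - zeta)        # Z1 < k1
    reject = 1 - Phi(k2 - zeta)    # Z1 > k2
    return accept, 1 - accept - reject, reject
```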


If we use a flat $A(z_1, \xi_0)$, we need the conditional type I error rate to be
\[
A_{\mathrm{flat}}(z_1, \xi_0) = 0.015/0.4525 = 0.0331.
\]
Similarly, a flat conditional power is
\[
A_{\mathrm{flat}}(z_1, \xi_1) = (0.9 - 0.3928)/0.5822 = 0.8710.
\]

These $A$-functions give rise to
\[
n_2(z_1) = \frac{(z_{0.0331} - z_{0.8710})^2}{(0.1768 - 0)^2} = 282, \qquad
c(z_1) = z_{0.0331} = 1.837.
\]
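These flat-$A$ numbers can be verified end to end; a sketch, where the continuation and rejection probabilities are the stage 1 quantities from the table above:

```python
from statistics import NormalDist

nd = NormalDist()
Phi = nd.cdf

def z(p):
    """Upper-p standard normal quantile."""
    return nd.inv_cdf(1 - p)

k1, k2, zeta1, xi1 = 0.094, 2.326, 2.054, 0.1768

cont0 = Phi(k2) - Phi(k1)                      # P0[continue] = 0.4525
A0 = (0.025 - (1 - Phi(k2))) / cont0           # flat conditional type I error: 0.0331
cont1 = Phi(k2 - zeta1) - Phi(k1 - zeta1)      # P1[continue] = 0.5822
A1 = (0.90 - (1 - Phi(k2 - zeta1))) / cont1    # flat conditional power: 0.8710

n2 = ((z(A0) - z(A1)) / xi1) ** 2              # about 282
c = z(A0)                                      # about 1.837 (since xi0 = 0)

# By construction, the overall type I error rate comes back to 0.025:
alpha = (1 - Phi(k2)) + A0 * cont0
```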

Design with flat A0 and flat A1:

                       Stage 1                    Stage 2      This design              Single stage
    µ     ξ      n1    Accept  Continue  Reject   Reject   Power   E[N]    Max N   Power   N
    0     0      135   0.538   0.452     0.010    0.015    0.025   262.5   417     0.025   337
    0.25  0.044  135   0.337   0.628     0.035    0.086    0.121   311.9   417     0.125   337
    0.50  0.088  135   0.175   0.728     0.097    0.263    0.360   340.1   417     0.367   337
    0.75  0.133  135   0.074   0.710     0.216    0.462    0.679   335.1   417     0.681   337
    1     0.177  135   0.025   0.582     0.393    0.507    0.900   299.1   417     0.900   337

Now consider $A$-functions of the form $A(z_1, \xi_0) = a_0 + a_1 (z_1 - k_1)^2$ and $A(z_1, \xi_1) = b_0 + b_1 (z_1 - k_1)$. We can use any $A$ functions as long as they satisfy the $\alpha$ and power conditions.

First we pick $a_0$ (the value of $A(z_1, \xi_0)$ at $z_1 = k_1$) to be 0.002 and solve for $a_1$ so that

\[
\alpha_2 = 0.015 = \int_{k_1}^{k_2} A(z_1, \xi_0)\,g_1(z_1, \xi_0)\,dz_1.
\]
Numerical integration finds $a_1 = 5.2547$:
\[
A(z_1, \xi_0) = 0.002 + 5.2547\,(z_1 - k_1)^2.
\]


Similarly, for $A(z_1, \xi_1)$, by specifying $b_0 = 0.75$ we find $b_1 = 1.004$, so that
\[
A(z_1, \xi_1) = 0.75 + 1.004\,(z_1 - k_1).
\]

Using these two $A$ functions, we can compute $n_2(z_1)$, and it turns out that $\max\{n_1 + n_2(z_1)\} > 500$, so we need to modify the design a little. It is relatively simple to make small modifications to the design because we understand how the design elements $A(z_1, \xi_0)$, $A(z_1, \xi_1)$, $n_2(z_1)$, and $c(z_1)$ are interrelated.

First, while fixing $A(z_1, \xi_0)$, we cap $n_2(z_1)$ so that $\max\{n_1 + n_2(z_1)\} = 500$. This changes $A(z_1, \xi_1)$ slightly, resulting in a power smaller than 0.90. To make the power 0.90 again, we add a constant to the new $A(z_1, \xi_1)$, again capping the resulting $n_1 + n_2(z_1)$ at 500. The final design is shown below graphically.


[Figure: conditional power plotted against $z_1$, with $\zeta_0$, $k_1$, $\zeta_1$, and $k_2$ marked; three curves are shown, labeled $A(z_1, \xi_0)$, $A(z_1, \xi^*)$, and $A(z_1, \xi_1)$.]


[Figure: total sample size $n_1 + n_2(z_1)$ (0 to 500) plotted against $z_1$ for the final design, with $\zeta_0$, $k_1$, $\zeta_1$, and $k_2$ marked and reference lines at $n_1$ and $N$.]

Final design with the modified A functions:

                       Stage 1                    Stage 2      This design              Single stage
    µ     ξ      n1    Accept  Continue  Reject   Reject   Power   E[N]    Max N   Power   N
    0     0      135   0.538   0.452     0.010    0.015    0.025   264.2   500     0.025   337
    0.25  0.044  135   0.337   0.628     0.035    0.085    0.120   303.8   500     0.125   337
    0.50  0.088  135   0.175   0.728     0.097    0.264    0.360   318.4   500     0.367   337
    0.75  0.133  135   0.074   0.710     0.216    0.464    0.680   302.7   500     0.681   337
    1     0.177  135   0.025   0.582     0.393    0.507    0.900   264.7   500     0.900   337

In the literature, many specific A functions have been proposed. A few examples include:


• Proschan & Hunsberger (1995)
\[
A_{PH}(z_1, \xi) = 1 - \Phi\!\left[\sqrt{\left(k_2 - \xi\sqrt{n_1}\right)^2 - \left(z_1 - \xi\sqrt{n_1}\right)^2}\right]
\]
• Chen, DeMets & Lan (2004)
\[
A_{CDL}(z_1, \xi) = 1 - \Phi\!\left[\sqrt{2}\,z_\alpha - z_1 - \xi\sqrt{n_1}\right]
\]
• $A(z_1, z_1/\sqrt{n_1})$: "conditional power under the current trend"

10.5 Unspecified designs

The minimum requirement to control the type I error rate is to pre-specify an $A(z_1, \xi_0)$ function that satisfies the $\alpha$ condition. Then, after the first stage, when the actual $z_1$ from the data is available, we can pick $n_2(z_1)$ so that the conditional power at any value of $\xi$ (other than $\xi_0$) can be set.

If we allow even $A(z_1, \xi_0)$ to be specified after the first stage, the type I error rate cannot be controlled. There exist many (in fact, infinitely many) $A(z_1, \xi_0)$ functions that give the desired value of $\alpha_2$ (0.015 in our example). Depending on $z_1$, the required sample size to guarantee a certain conditional power differs, and we must not choose the $A(z_1, \xi_0)$ function that happens to give the minimum sample size for the observed $z_1$. Roughly speaking, when the conditional type I error rate at the observed $z_1$ is large, the required sample size for the same conditional power is small. All three conditional type I error rate functions in the following plot give $\alpha = 0.025$.


[Figure: three conditional type I error rate functions plotted against $z_1$ (vertical axis 0 to 0.25), with $\zeta_0$, $k_1$, $\zeta_1$, and $k_2$ marked; each integrates to the same overall $\alpha = 0.025$.]

10.6 Ordering of sample space

To compute p-values and confidence intervals (through inverting hypothesis tests), we need to define an ordering of the sample space; however, this task is difficult because the stage 2 sample size differs across potential values of $z_1$.

One useful fact (not too difficult to show) is that the decision rule "reject if $Z_2 > c(z_1)$" is equivalent to the rule "reject if the stage 2 (conditional) p-value is less than $A(z_1, \xi_0)$ evaluated at the observed $z_1$." So we can compute a (conditional) p-value using just the stage 2 data,


$P_0[Z_2 > z_2]$, and compare it to the conditional type I error rate computed at the observed $z_1$.

Suppose the red line in the previous plot is used as $A(z_1, \xi_0)$. Then:

• if $z_1 = 2.0$ and the stage 2 conditional p-value is 0.10, then $H_0$ will be rejected, because this p-value is less than $A(z_1, \xi_0)$, i.e., below the red line;

• if $z_1 = 1.0$ and the stage 2 conditional p-value is 0.10, then $H_0$ will not be rejected.

Therefore, we need an ordering of the sample space that takes into account not only the stage 2 sample size, $n_2(z_1)$, but also the stage 2 conditional type I error rate, $A(z_1, \xi_0)$.
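The equivalence of the two forms of the stage 2 decision rule is immediate because $1 - \Phi$ is strictly decreasing. A sketch (with `c_z1` the design's stage 2 critical value at the observed $z_1$):

```python
from statistics import NormalDist

Phi = NormalDist().cdf

def reject_by_critical_value(z2: float, c_z1: float) -> bool:
    """Original rule: reject H0 iff Z2 > c(z1)."""
    return z2 > c_z1

def reject_by_conditional_p(z2: float, c_z1: float) -> bool:
    """Equivalent rule: reject H0 iff the stage 2 p-value is below A(z1, xi0)."""
    p2 = 1 - Phi(z2)       # conditional p-value P0[Z2 > z2]
    A0 = 1 - Phi(c_z1)     # conditional type I error rate at the observed z1
    return p2 < A0
```

The two functions return the same answer for every input pair.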

Suppose we choose to use A(z1,ξ0) and A(z1,ξ1) shown in the plot below.

[Figure: conditional power functions plotted against $Z_1$ (0 to 2.5), showing $A(z_1, \zeta_1)$ and $A(z_1, \zeta_0) = 0.005 + 0.1118\,(z_1 - k_1)^2$.]


And let's consider the following five sample paths, indicated by their conditional p-values. Can we order the strength of evidence against $H_0$ for these data?

[Figure: the same plot of $A(z_1, \zeta_0) = 0.005 + 0.1118\,(z_1 - k_1)^2$ against $Z_1$, with five colored dots (blue, red, yellow, black, green) marking the $(z_1, \text{conditional p-value})$ pairs of the five sample paths.]

• When $z_1$ is the same, the second-stage sample size is the same, so it is simple to order the sample paths: the smaller the conditional p-value, the stronger the evidence against $H_0$. Blue ≻ Red ≻ Yellow.

• When the conditional p-values are the same, we can order them by the strength of evidence in the first stage. Blue ≻ Black ≻ Green.

• The black and red dots should indicate equal strength of evidence because they both result in "just" rejecting $H_0$.

So in the above picture, the only unclear ordering is between Green and Yellow.


The third rule gives a hint as to how to proceed: the data leading to the black and red dots have just enough evidence to reject $H_0$. The blue dot corresponds to a sample path that gives stronger evidence against $H_0$; we could have rejected an $H_0'$ that is more extreme.

We can find a value of $\xi_0^*$ (equivalently, $\mu_0^*$) so that $A(z_1, \xi_0^*)$ goes through the blue dot, and say we could have rejected $H_0^*: \mu = \mu_0^*$.

Technically speaking, we can find $\xi_0^*$ by solving the following:
\[
p = A(z_1, \xi_0^*) = 1 - \Phi\!\left[c(z_1) - \sqrt{n_2(z_1)}\,\xi_0^*\right].
\]
Note that $p = P_0[Z_2 > z_2]$, and once a value of $z_1$ is observed, we can evaluate $c(z_1)$ and $n_2(z_1)$, so the only unknown quantity in the above expression is $\xi_0^*$.

From the above picture, we know the ordering is: Blue ≻ Red = Black ≻ Green ≻ Yellow. "As or more extreme" than the observed data is anything on and below the line, and we can compute the p-value by computing
\[
\int_{k_2}^{\infty} g_1(z_1, \xi_0)\,dz_1 + \int_{k_1}^{k_2} A(z_1, \xi_0^*)\,g_1(z_1, \xi_0)\,dz_1.
\]


[Figure: the family of curves $A(z_1, \xi_0^*)$ plotted against $Z_1$ (0 to 2.5), corresponding to $\mu_0^*$ values of 2.29, 0.42, 0.29, 0, and $-0.22$ and labeled $A(z_1, 2.29)$, $A(z_1, .42)$, $A(z_1, .29)$, $A(z_1, 0)$, and $A(z_1, -.22)$.]

This method (ordering) guarantees that the p-value and the corresponding hypothesis test are consistent (p-value $< \alpha$ iff $H_0$ is rejected).

And it can be shown that when $n_2(z_1)$ and $c_u(z_1)$ (the critical value for the combined statistic) are constants, this ordering reduces to the stage-wise ordering.

10.7 Predictive power

With an unspecified design, some people are reluctant to use the conditional power to determine the design of the second stage. One issue is that it is not always clear where to compute the conditional power.

The original alternative is usually a reasonable choice ($A(z_1, \xi_1)$). However, when the observed $z_1$ is much smaller than expected under $\xi_1$, we may not be interested in the conditional power at $\xi_1$ but at some smaller value that is still clinically meaningful (the minimum clinically relevant alternative, $\xi_1^\dagger$).

Another popular choice is $\tilde{\xi} \equiv z_1/\sqrt{n_1}$ (the "alternative under the current trend").

Or maybe we should compute the conditional power somewhere between $\xi_1$ and $\xi_1^\dagger$.

Average? Now we are talking like a Bayesian, because we are averaging over values of $\xi$, which are, for a frequentist, parameters. Maybe we have a prior distribution for $\xi$ (or, equivalently, $\mu$) and a posterior distribution of $\xi$ after the first stage, $\pi(\xi \mid z_1)$, and we can compute a weighted average of the conditional power with respect to the posterior distribution. Something like
\[
\int_{-\infty}^{\infty} A(z_1, \xi)\,\pi(\xi \mid z_1)\,d\xi,
\]
and this is often called the predictive power given the stage 1 data.

The conditional power is a frequentist concept, and it is computed at one value of $\xi$. The predictive power is a Bayesian concept, and it is a weighted average of the conditional power with respect to a posterior distribution of $\xi$.
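As a sketch of the idea, assume (purely for illustration) that the posterior of $\zeta = \sqrt{n_1}\,\xi$ after stage 1 is normal; the predictive power is then a one-dimensional integral that simple trapezoidal quadrature handles well:

```python
from math import exp, pi, sqrt
from statistics import NormalDist

Phi = NormalDist().cdf

def predictive_power(c_z1, n2, n1, post_mean, post_sd, grid=4001):
    """Average the conditional power A(z1, xi) over an assumed
    Normal(post_mean, post_sd) posterior for zeta = sqrt(n1) * xi."""
    lo, hi = post_mean - 6 * post_sd, post_mean + 6 * post_sd
    h = (hi - lo) / (grid - 1)
    total = 0.0
    for i in range(grid):
        zeta = lo + i * h
        xi = zeta / sqrt(n1)
        A = 1 - Phi(c_z1 - sqrt(n2) * xi)  # conditional power at this xi
        dens = exp(-0.5 * ((zeta - post_mean) / post_sd) ** 2) / (post_sd * sqrt(2 * pi))
        w = 0.5 if i in (0, grid - 1) else 1.0  # trapezoid endpoint weights
        total += w * A * dens * h
    return total
```

As the posterior concentrates (post_sd → 0), the predictive power collapses to the conditional power at the posterior mean, recovering the frequentist quantity.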


Chapter 11

Factorial design

Factorial clinical trials (Piantadosi): Experiments that test the effect of more than one treatment using a design that permits an assessment of interactions among the treatments.

The simplest example of a factorial design is the two-treatment, four-group (2 by 2) design. With this design, one group receives both treatments, a second group receives neither, and the other two groups each receive one of A or B.

                   Treatment B
    Treatment A    No    Yes   Total
    No             n     n     2n
    Yes            n     n     2n
    Total          2n    2n    4n

Four treatment groups and sample sizes in a 2×2 balanced factorial design.

Alternatives to a 2×2 factorial design

• Two separate trials (for A and for B)

• Three arm trial (A, B, neither)



Two major advantages of the factorial design (though not attainable simultaneously):

• It allows investigation of interactions (drug synergy). Drug synergy occurs when drugs interact in ways that enhance the effects or side effects of those drugs.

• It reduces the cost (sample size) if the drugs do not interact.

Some requirements for conducting a clinical trial with a factorial design:

• The side effects of the two drugs must not be cumulative in a way that makes the combination unsafe to administer.

• The treatments need to be administered in combination without changing the dosage of the individual drugs.

• It must be ethical not to administer the individual drugs. (A and B may be given in addition to a standard drug, so that all groups receive some treatment.)

• We need to be genuinely interested in studying the drug combination; otherwise some treatment combinations are unnecessary.

Some terminology:

• Factors (how many different treatments are under consideration)

• Levels (2 if yes/no)

• $2^k$ factorial studies have $k$ factors, each with two levels (presence/absence)

• A full factorial design has no empty cells.

• An unreplicated study has one sample per cell (obviously not very common in clinical studies).

• Fractional factorial designs (some cells are left empty by design)


• Complete block designs / Incomplete block designs

• Latin squares

11.1 Notation using cell means

Notation (cell means)

                   Treatment B
    Treatment A    No     Yes
    No             η00    η01
    Yes            η10    η11

Here, $\eta$ represents the mean of each treatment group. Consider a saturated model:
\[
\eta_{ij} = \mu + \alpha_i + \beta_j + \gamma_{ij},
\]
where $i = 0, 1$ and $j = 0, 1$.

                   Treatment B
    Treatment A    No                     Yes
    No             µ + α0 + β0 + γ00      µ + α0 + β1 + γ01
    Yes            µ + α1 + β0 + γ10      µ + α1 + β1 + γ11

Then we have nine parameters to estimate from four cell means, so we need to impose some restrictions for the parameters to be estimable. One such set of restrictions, $\alpha_0 = 0$, $\beta_0 = 0$, $\gamma_{00} = \gamma_{01} = \gamma_{10} = 0$, leads to:

                   Treatment B
    Treatment A    No         Yes
    No             µ          µ + β1
    Yes            µ + α1     µ + α1 + β1 + γ11


and $\alpha_1$, $\beta_1$, and $\gamma_{11}$ are estimable because we can write
\[
\begin{aligned}
\mu &= \eta_{00} \\
\alpha_1 &= \eta_{10} - \eta_{00} \\
\beta_1 &= \eta_{01} - \eta_{00} \\
\gamma_{11} &= \eta_{11} - \eta_{01} - \eta_{10} + \eta_{00}.
\end{aligned}
\]
With this formulation, $\alpha_1$ is the effect of treatment A, $\beta_1$ is the effect of treatment B, and $\gamma_{11}$ is the interaction effect. (If the effects of A and B are additive with no interaction, $\gamma_{11} = 0$.)

For each cell, the observations are
\[
\begin{aligned}
Y_{0i} &= \mu + \varepsilon_{0i} \\
Y_{Ai} &= \mu + \alpha_1 + \varepsilon_{Ai} \\
Y_{Bi} &= \mu + \beta_1 + \varepsilon_{Bi} \\
Y_{ABi} &= \mu + \alpha_1 + \beta_1 + \gamma_{11} + \varepsilon_{ABi},
\end{aligned}
\]
and we assume $\mathrm{Var}(\varepsilon_{0i}) = \mathrm{Var}(\varepsilon_{Ai}) = \mathrm{Var}(\varepsilon_{Bi}) = \mathrm{Var}(\varepsilon_{ABi}) = \sigma^2$. We can test for any treatment effect by testing $H_0: \alpha_1 = \beta_1 = \gamma_{11} = 0$.

11.2 Efficiency when no interaction

The observed mean responses are:

                   Treatment B
    Treatment A    No        Yes
    No             Ȳ_0       Ȳ_B
    Yes            Ȳ_A       Ȳ_AB

Note that if we assume the sample size in each cell is $n$,
\[
\mathrm{Var}(\bar{Y}_0) = \mathrm{Var}(\bar{Y}_A) = \mathrm{Var}(\bar{Y}_B) = \mathrm{Var}(\bar{Y}_{AB}) = \frac{\sigma^2}{n}.
\]


Then the interaction effect may be estimated by
\[
\hat{\gamma}_{11} = (\bar{Y}_{AB} - \bar{Y}_B) - (\bar{Y}_A - \bar{Y}_0),
\]
and
\[
\mathrm{Var}(\hat{\gamma}_{11}) = \frac{4\sigma^2}{n}.
\]
The treatment A effect can be estimated as
\[
\hat{\alpha}_1 = \bar{Y}_A - \bar{Y}_0,
\]
and its variance is
\[
\mathrm{Var}(\hat{\alpha}_1) = \frac{2\sigma^2}{n}.
\]

If no interaction is present, then $\gamma_{11} \approx 0$, and $\tilde{\alpha}_1 = \bar{Y}_{AB} - \bar{Y}_B$ can also be used to estimate $\alpha_1$. If we use the average of $\hat{\alpha}_1$ and $\tilde{\alpha}_1$ to estimate $\alpha_1$, this estimator has a smaller variance:
\[
\bar{\alpha}_1 = \frac{\hat{\alpha}_1 + \tilde{\alpha}_1}{2} = \frac{(\bar{Y}_A - \bar{Y}_0) + (\bar{Y}_{AB} - \bar{Y}_B)}{2}, \qquad
\bar{\beta}_1 = \frac{\hat{\beta}_1 + \tilde{\beta}_1}{2} = \frac{(\bar{Y}_B - \bar{Y}_0) + (\bar{Y}_{AB} - \bar{Y}_A)}{2},
\]
\[
\mathrm{Var}(\bar{\alpha}_1) = \frac{1}{4}\,\mathrm{Var}\left(\bar{Y}_A - \bar{Y}_0 + \bar{Y}_{AB} - \bar{Y}_B\right) = \frac{\sigma^2}{n}.
\]

In order to have the same efficiency in a two-arm trial (A vs. placebo), we would need $2n$ patients in each treatment arm:
\[
\mathrm{Var}(\hat{\alpha}_1) = \frac{2\sigma^2}{2n} = \frac{\sigma^2}{n}.
\]

So if we were to test A and B in two separate experiments, we would need $2n$ per arm × 4 arms (A and placebo, B and placebo), totaling $8n$ subjects. Noticing that we are repeating the placebo in these hypothetical experiments, we might instead use a 3-arm experiment with A, B, and placebo arms. Then we would require a total of $6n$ subjects for the same precision.

Example


                   Treatment B
    Treatment A    No    Yes
    No             10    40
    Yes            30    60

If there is a synergistic effect, then $\eta_{11} > 60$:

                   Treatment B
    Treatment A    No    Yes
    No             10    40
    Yes            30    80

                   Treatment B
    Treatment A    No    Yes
    No             10    40
    Yes            30    120

In the last situation, the treatment effects may be multiplicative:

                   Treatment B
    Treatment A    No               Yes
    No             log(10) = 1      log(40) = 1.60
    Yes            log(30) = 1.48   log(120) = 2.08

Suppose samples of size 20 yield the following estimates of the cell means:

                   Treatment B
    Treatment A    No       Yes
    No             9.83     40.05
    Yes            28.94    59.76


Assuming no interaction, to estimate the drug A effect we compute either
\[
\hat{\alpha}_1 = \bar{Y}_A - \bar{Y}_0 = 28.94 - 9.83 = 19.11
\]
or
\[
\tilde{\alpha}_1 = \bar{Y}_{AB} - \bar{Y}_B = 59.76 - 40.05 = 19.71,
\]
or their average, $(19.11 + 19.71)/2 = 19.41$.
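The arithmetic above in code form (a sketch; the cell means come from the table):

```python
# Observed cell means: placebo, A alone, B alone, and the A+B combination
Y0, YA, YB, YAB = 9.83, 28.94, 40.05, 59.76

alpha_hat = YA - Y0                         # A effect in the absence of B: 19.11
alpha_tilde = YAB - YB                      # A effect in the presence of B: 19.71
alpha_bar = (alpha_hat + alpha_tilde) / 2   # pooled estimate: 19.41
gamma_hat = (YAB - YB) - (YA - Y0)          # interaction estimate: 0.60
```

The small value of `gamma_hat` relative to the main effects is what makes the pooled estimate reasonable here.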

How bad is it to estimate $\alpha_1$ this way when there is actually a significant interaction?
\[
\begin{aligned}
E\!\left[\frac{\hat{\alpha}_1 + \tilde{\alpha}_1}{2}\right] &= \frac{1}{2}\,E\!\left[(\bar{Y}_A - \bar{Y}_0) + (\bar{Y}_{AB} - \bar{Y}_B)\right] \\
&= \frac{1}{2}\left((\mu + \alpha_1) - \mu + (\mu + \alpha_1 + \beta_1 + \gamma_{11}) - (\mu + \beta_1)\right) \\
&= \alpha_1 + \frac{\gamma_{11}}{2}.
\end{aligned}
\]

11.3 Example: the Physicians' Health Study I (1989)

Read all about it at http://phs.bwh.harvard.edu/. The Physicians' Health Study was a randomized clinical trial designed to test the following two hypotheses:

• daily low-dose aspirin use reduces the risk of cardiovascular disease

• beta carotene reduces the risk of cancer

Population hierarchy

• 261,248 US male MDs aged 40 to 84

• 112,528 responded to questionnaires


• 59,285 willing to participate

• 33,332 willing and eligible MDs enrolled in run-in (18 weeks of active aspirin and beta-carotene placebo)

  (Run-in period: eligible patients are monitored for treatment compliance.)

• 22,071 randomized

                   Beta-carotene
    Aspirin        Active    Placebo   Total
    Active         5,517     5,520     11,037
    Placebo        5,519     5,515     11,034
    Total          11,036    11,035    22,071

Major findings

• The trial's DSMB stopped the aspirin arm several years ahead of schedule, on 1988/1/25, because it was clear that aspirin had a significant effect on the risk of a first myocardial infarction (it reduced the risk by 44%).

• There were too few strokes or deaths to support sound clinical judgment regarding aspirin and stroke or mortality.

• The beta-carotene arm terminated as scheduled on 1995/12/12 with the conclusion that 13 years of supplementation with beta-carotene produced neither benefit nor harm. Beta-carotene alone was not responsible for the health benefit seen among people who ate plenty of fruits and vegetables.

• Over 300 other findings have emerged from the trial so far.

11.4 Treatment interactions

Factorial designs are the only way to study treatment interactions. Recall that the interaction term is estimated by $\hat{\gamma}_{11} = (\bar{Y}_{AB} - \bar{Y}_B) - (\bar{Y}_A - \bar{Y}_0)$, and its variance is $\mathrm{Var}(\hat{\gamma}_{11}) = 4\sigma^2/n$. This variance is 4 times as large as that of either the A or the B main effect estimator, so to achieve the same precision for an estimate of an interaction effect, the sample size has to be four times as large. This means the two main advantages of factorial designs (efficiency and interaction objectives) cannot be realized simultaneously.

When there is an AB interaction, we cannot use the estimators $\bar{\alpha}_1$ and $\bar{\beta}_1$, which are only valid with no interaction effect. In fact, we cannot talk about an overall main effect in the presence of an interaction. Instead, we can talk about the effect of A in the absence of B,
\[
\alpha_1, \quad \text{estimated by } \bar{Y}_A - \bar{Y}_0,
\]
or the effect of A in the presence of B,
\[
\alpha_1' = \alpha_1 + \gamma_{11}, \quad \text{estimated by } \bar{Y}_{AB} - \bar{Y}_B.
\]

Some additional notes

• In the 2 × 2 × 2 design ($2^3$ design), there are 3 main effects and 4 possible interactions. The number of higher-order interactions grows quickly with $k$, but oftentimes they are (assumed to be) 0.

• A "quantitative" interaction does not affect the direction of the treatment effect: for example, treatment B is effective either with or without treatment A, but the magnitude of its effectiveness changes.

• With a "qualitative" interaction, the effects of A are reversed in the presence of B. In this case, an overall treatment A effect does not make sense.

• The factorial design can be analyzed with linear models (analysis of variance models).

Limitations of factorial designs

• A higher-level design can become complex quickly.

• A test for interaction requires a large sample size (or has very low power if the study is powered for the main effects).


• Combination therapy may be considered a treatment in its own right.

Of further interest...

• Partial (fractional) factorial designs have missing cells by design (especially when higher-order interactions are assumed to be zero).


Chapter 12

Crossover design

Crossover trials are those in which each patient is given more than one treatment, each at a different time in the study, with the intent of estimating differences between them.

In a simple 2 × 2 design (or AB/BA design), patients are randomized to either the "A then B" group or the "B then A" group.

                 Period
    Group        I              II
    AB           Treatment A    Treatment B
    BA           Treatment B    Treatment A

2 Treatments / 2 Periods / 2 Sequences

         P1   P2
    S1   A    B    n1
    S2   B    A    n2

2 Treatments / 2 Periods / 4 Sequences

         P1   P2
    S1   A    B    n1
    S2   B    A    n2
    S3   A    A    n3
    S4   B    B    n4

2 Treatments / 4 Periods / 2 Sequences

         P1   P2   P3   P4
    S1   A    B    A    B    n1
    S2   B    A    B    A    n2



12.1 Some characteristics of crossover design

• All subjects receive more than one treatment (though not simultaneously).

• Each subject acts as his or her own control. Therefore, the treatment groups are comparable without relying on randomization.

  – Treatment periods (the order of A and B) are often randomly assigned.

  – Baseline characteristics are identical with regard to many patient characteristics, but not with regard to their recent history of exposure to other potentially effective treatments (carryover effects).

  – The comparability of the treatment groups is not guaranteed by the structure of the trial alone. The investigators need to estimate the carryover effects.

• Crossover designs are not used ...

  – with any condition in which treatment could effect considerable change;

  – for acute illness.

• Crossover designs are most suitable for treatments intended for rapid relief of symptoms in chronic diseases, where the long-term condition of the patient remains fairly stable.

Precision
The primary strength of crossover trials is increased efficiency. Suppose the treatment responses are
\[
X_t \sim \mathrm{Normal}(\mu_t, \sigma^2), \qquad X_c \sim \mathrm{Normal}(\mu_c, \sigma^2),
\]
and we are interested in $\mu_t - \mu_c$. In a parallel design (with a per-group sample size of $n$), we have
\[
\hat{\Delta} = \bar{X}_t - \bar{X}_c \sim \mathrm{Normal}\!\left(\mu_t - \mu_c,\; \frac{2\sigma^2}{n}\right).
\]
With a TC/CT crossover design with sample size $n$,
\[
\mathrm{var}(\hat{\Delta}) = \frac{2\sigma^2}{n} - 2\,\mathrm{cov}(\bar{X}_t, \bar{X}_c) = \frac{2\sigma^2}{n}\,(1 - \rho_{tc}),
\]


where $\rho_{tc}$ is the within-subject correlation of responses on treatments T and C. Therefore, a crossover design is more efficient whenever $\rho_{tc} > 0$.
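A tiny sketch of this efficiency comparison (the numeric values in the usage example are illustrative only):

```python
def parallel_var(sigma2: float, n: int) -> float:
    """Variance of the effect estimate in a two-arm parallel design, n per arm."""
    return 2 * sigma2 / n

def crossover_var(sigma2: float, n: int, rho: float) -> float:
    """Variance in an AB/BA crossover with within-subject correlation rho."""
    return 2 * sigma2 / n * (1 - rho)
```

With $\rho_{tc} = 0.5$, the crossover variance is half the parallel-design variance, so the crossover trial needs half as many subjects for the same precision.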

Recruitment
Some patients may hesitate to participate in a clinical trial if there is a 50% probability of not receiving any effective treatment. With a crossover design, everyone is guaranteed to receive the test drug.

On the other hand, patients may hesitate to participate in a crossover trial because they will go through more than one treatment, especially when outcomes are assessed with diagnostic procedures such as X-rays, blood draws, or lengthy questionnaires.

Carryover effects
The biggest concern is the possibility that the treatment effect from one period might continue to be present during the following period. A sufficiently long "washout" period between the treatments may prevent significant carryover effects (but how long is sufficiently long?). If there are baseline measurements that represent the patient's disease status, the responses can be checked against these baseline levels.

If the treatment effects a permanent change or cure in the underlying condition, the treatment given afterward could look artificially superior.

Dropouts
In a crossover design, the trial duration tends to be longer than in a comparable study using independent groups, which may cause more dropouts. Also, because every patient takes more than one treatment, dropouts due to severe side effects may increase. The consequences of dropouts are more severe in a crossover trial; a simple analysis cannot just use the data from the first period.

And, in general, the analysis is more complex.


12.2 Analysis of 2×2 crossover design

              P1                  P2
    S1 = AB   Ȳ_A1 = β0           Ȳ_B2 = β0 + β1 + β2
    S2 = BA   Ȳ_B1 = β0 + β1      Ȳ_A2 = β0 + β2 + β3

    β0 · · · Treatment A effect
    β1 · · · Increment of treatment effect due to B
    β2 · · · Period effect
    β3 · · · Carryover effect

The treatment B effect is $\beta_0 + \beta_1$. ($\beta_2$ may be considered the carryover effect for S1.) We cannot estimate the treatment-by-period interaction because that term would only appear in the $\bar{Y}_{A2}$ cell and would not be separately estimable from $\beta_3$.

Suppose there is no treatment-by-period interaction, i.e., the carryover effects are the same for both sequences. This means that β3 = 0. We can then estimate β2 (the period effect) by

β2 = (1/2)(YB2 − YB1 + YA2 − YA1).

There are two estimates of the increment of treatment effect (or B-effect − A-effect), onefrom each period. We use their average to estimate β1.

β1 = (1/2)(YB2 − YA2 + YB1 − YA1).

Similarly, there are two estimates for β0, which we can average to get

β0 = (1/2)(YA1 + (YB1 + YA2 − YB2)).

More generally, we need to consider the case in which the carryover effect is non-zero. This means that the incremental effects are different for treatment A and for treatment B. We can estimate β3 by

β3 = YA2 − YB2 + YB1 − YA1.


If β3 is not 0, then we must estimate the treatment difference (β1) as

β1 = YB1 − YA1,

using only the data from the first period. Obviously, β0 is estimated by YA1, and the period effect is estimated by

β2 = YB2 − YB1.
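As a quick numerical illustration, the estimators above can be computed directly from the four cell means. The cell-mean values below are hypothetical, chosen only to show the arithmetic (a sketch, not data from any trial):

```python
# Hypothetical cell means (sequence AB: YA1, YB2; sequence BA: YB1, YA2)
YA1, YB2 = 10.0, 13.5
YB1, YA2 = 12.0, 11.0

# Assuming no carryover (beta3 = 0):
b2 = (YB2 - YB1 + YA2 - YA1) / 2    # period effect
b1 = (YB2 - YA2 + YB1 - YA1) / 2    # treatment increment (B - A)
b0 = (YA1 + (YB1 + YA2 - YB2)) / 2  # treatment A effect

# Allowing carryover:
b3 = YA2 - YB2 + YB1 - YA1          # carryover estimate
b1_p1 = YB1 - YA1                   # treatment increment, period 1 only
print(b0, b1, b2, b3, b1_p1)        # 9.75 2.25 1.25 -0.5 2.0
```

Note how the period-1-only estimate (2.0) differs from the two-period average (2.25) exactly by half the carryover estimate.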

Variances
Suppose that each Y is estimated with variance σ²/n and that the within-person correlation of responses is ρ. When β3 = 0,

Var{β1} = (1/4) Var{YB2 − YA2 + YB1 − YA1}
        = (1/4) Var{(YB2 − YA1) + (YB1 − YA2)}
        = (1/4) (2 (σ²/n)(1 − ρ) × 2)
        = (σ²/n)(1 − ρ).

Similarly, Var{β2} = σ²(1 − ρ)/n. Furthermore,

Var{β0} = (1/4) Var{(YA1 − YB2) + (YB1 + YA2)}
        = (1/4) (2σ²/n)((1 − ρ) + (1 + ρ))
        = σ²/n.

And when β3 ≠ 0,

Var{β0} = σ²/n,    Var{β1} = 2σ²/n,    Var{β2} = 2σ²/n.

Var{β3}= · · · (Homework),


which is at least twice as large as Var{β2} for ρ ≥ 0. Therefore, any crossover trial designed to detect main effects of treatment will have lower power for the carryover effect, which is critical to detect because its presence affects both the analysis and the interpretation of the trial. In the presence of a clinically important carryover effect, a crossover design is no more efficient than an independent-groups trial.

A two-stage procedure may be used: the presence of carryover effects is tested first, with a type I error rate of 5 to 10%, before moving on to the primary hypothesis test of the treatment effects. The estimates will differ depending on the conclusion of the first stage.

12.3 Examples

Capecitabine/Erlotinib Followed by Gemcitabine Versus Gemcitabine/Erlotinib Followed by Capecitabine
http://clinicaltrials.gov/ct2/show/NCT00440167

This crossover trial is performed in advanced and metastatic pancreatic cancer not previously exposed to chemotherapy. The study compares a standard arm of gemcitabine plus erlotinib with an experimental arm of capecitabine plus erlotinib. It is the first trial of its kind to incorporate second-line treatment into the study design. Patients who fail on first-line therapy are switched to the comparator chemotherapy without erlotinib. The trial therefore not only compares two different first-line regimens; it also compares two sequential treatment strategies.

Colchicine Randomized Double-Blind Controlled Crossover Study in Behcet's Disease
http://clinicaltrials.gov/ct2/show/study/NCT00700297

Method: patients were randomized at study entry to take either colchicine or placebo. At 4 months, they were crossed over: those who were taking colchicine went on placebo, and those on placebo went on colchicine. Each patient therefore tried both colchicine and placebo. The primary outcome was the effect of colchicine on the disease activity index, the IBDDAM (16-17). To calculate the overall baseline IBDDAM, the IBDDAM of each manifestation over the last 12 months (prior to the study) was calculated and summed. The overall disease activity index was then divided by the number of months (12) to give the mean activity index per month. IBDDAM was then measured every 2 months (in the middle and at the end of each arm of the study). The total IBDDAM over the 4 months was then divided by 4 to give the mean activity index per month. The secondary outcome was how the individual symptoms responded to colchicine (IBDDAM of each manifestation).

Statistical analysis: The analysis was done by the intention-to-treat method. As the difference between IBDDAM before and after treatment had a normal distribution, Student's t test for paired samples was used to evaluate the outcome in the colchicine and placebo groups. As Levene's test showed homogeneity of variance, one-way ANOVA was used to test the effect of treatment (colchicine or placebo) and gender on patients' outcome. The dependent variable was the difference in IBDDAM (before and after treatment). The independent variables were treatment and gender. SPSS 15 was used for all statistical calculations.

A Placebo-Controlled, Cross-Over Trial of Aripiprazole
http://clinicaltrials.gov/ct2/show/record/NCT00351936

Primary endpoint: Evaluate the effects of aripiprazole on weight, Body Mass Index (BMI), and waist/hip circumference.

This study is a ten-week, placebo-controlled, double-blind, cross-over, randomized trial of the novel antipsychotic agent aripiprazole, added to 20 obese, stable, olanzapine-treated patients with schizophrenia or schizoaffective disorder. The advantage of the crossover design is that each subject acts as their own control, so fewer subjects are required.

The double-blind, placebo-controlled, crossover study will consist of two random-order 4-week treatment arms (aripiprazole 15 mg or placebo) separated by a 2-week adjuvant-treatment washout. Following baseline, subjects will be randomized, double-blind, to either aripiprazole or placebo for 4 weeks. After the initial 4 weeks of medication, patients will be reassessed, have a 2-week washout period, and then cross over to the other treatment for another 4 weeks.

Data management and statistical analysis will be provided by Dr. David Schoenfeld fromthe Massachusetts General Hospital, Biostatistics Center.


12.4 Analysis of simple crossover design

An AB/BA crossover design may be analyzed using a regression model of the form,

E[Y] = β0 + β1 T + β2 P + β3 T × P,

where Y is the response, and T and P are indicator variables for treatment group and period, respectively. The carryover effect is the treatment-by-period interaction. Because observations within a subject are correlated, weighted (generalized) least squares estimates must be obtained using an appropriate covariance matrix,

β = (X′ Σ⁻¹ X)⁻¹ X′ Σ⁻¹ Y.

A popular choice for Σ is a block-diagonal structure in which observations within each individual are assumed to have correlation ρ and observations from different individuals are assumed to be independent:

Σ = σ² ×
    [ 1  ρ  0  0  ··· ]
    [ ρ  1  0  0  ··· ]
    [ 0  0  1  ρ  ··· ]
    [ 0  0  ρ  1  ··· ]
    [ ⋮  ⋮  ⋮  ⋮  ⋱  ]

The form of β depends on the model parametrization. If

T = +1/2 if Treatment 1, −1/2 if Treatment 2;    P = +1/2 if Period 1, −1/2 if Period 2.

Then

β = ( Y.. ,
      YA. − YB. ,
      (1/2){(YA1 − YB1) − (YA2 − YB2)} ,
      (YA1 − YA2) + (YB1 − YB2) )′.
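A minimal numerical sketch of this GLS fit with simulated data; the sample size, ρ, and "true" coefficients below are arbitrary illustration values, not from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50  # subjects per sequence (arbitrary)

# Effect coding from above: T = +1/2 or -1/2; P = +1/2 (period 1) or -1/2.
# Each subject contributes two consecutive rows (period 1, then period 2).
rows = [(+0.5, +0.5), (-0.5, -0.5)] * n + [(-0.5, +0.5), (+0.5, -0.5)] * n
T = np.array([t for t, p in rows])
P = np.array([p for t, p in rows])
X = np.column_stack([np.ones(4 * n), T, P, T * P])

# Block-diagonal covariance: correlation rho within a subject's two responses
rho, sigma2 = 0.6, 1.0
Sigma = np.kron(np.eye(2 * n), sigma2 * np.array([[1, rho], [rho, 1]]))

# Simulate responses under a chosen beta, then compute the GLS estimate
beta = np.array([5.0, 1.0, 0.3, 0.0])
Y = X @ beta + np.linalg.cholesky(Sigma) @ rng.standard_normal(4 * n)
Si = np.linalg.inv(Sigma)
beta_hat = np.linalg.solve(X.T @ Si @ X, X.T @ Si @ Y)
print(np.round(beta_hat, 2))
```

With 50 subjects per sequence, the estimates land close to the simulated coefficients; the estimated β1 recovers the treatment effect of 1.0.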

Example: Cushny and Peebles (1905)
Cushny AR, Peebles AR (1905) "The action of optical isomers II: Hyoscines". J Physiology. 32: 501-510.


• Clinical trial of the effect of 3 hypnotic drugs on duration of sleep

– Study population: inmates of the Michigan Asylum for the Insane

– Patients were given an active treatment on each alternate evening. A typical treatment plan was: X C X C X C Y C Y C Y C Z C Z C Z C X Y Z X Y Z X Y Z, where 'C' is a control evening on which no treatment was given.

• These data were used in “The probable error of a mean” (1908) Biometrika. 6(1):1-25.

The original table of the data

          Controls      Drug X               Drug Y               Drug Z
patient   N   mean      N   mean  increase   N   mean  increase   N   mean  increase
   1      9   0.6       6   1.3    0.7       6   2.5    1.9       6   2.1    1.5
   2      9   3.0       6   1.4   -1.6       6   3.8    0.8       6   4.4    1.4
   3      8   4.7       6   4.5   -0.2       6   5.8    1.1       6   4.7    0.0
   4      9   5.5       3   4.3   -1.2       3   5.6    0.1       3   4.8   -0.7
   5      9   6.2       3   6.1   -0.1       3   6.1   -0.1       3   6.7    0.5
   6      8   3.2       4   6.6    3.4       3   7.6    4.4       3   8.3    5.1
   7      8   2.5       3   6.2    3.7       3   8.0    5.5       3   8.2    5.7
   8      7   2.8       6   3.6    0.8       6   4.4    1.6       5   4.3    1.5
   9      8   1.1       5   1.1    0.0       6   5.7    4.6       5   5.8    4.7
  10      9   2.9       5   4.9    2.0       5   6.3    3.4       6   6.4    3.5
  11      -   -         2   6.3    -         2   6.8    -         2   7.3    -


[Figure: dot plot of hours of sleep (0 to 8) for patients 1-10 under control and each drug.]

Remarks

• Ethical issues? Potential risks of giving drugs to the mentally ill

• The primary interest was to study differences between treatments; treatment sequences were of incidental interest

• Drug X seems to have no effect, and Drugs Y and Z seem to have about the same positive influence in inducing sleep

• The sequences were not wisely chosen: X C X C X C Y C Y C Y C Z C Z C Z C X Y Z X Y Z X Y Z

• Patients did not receive an equal number of treatments (missing data?)

Example: Hills and Armitage (1979)
Hills M, Armitage P (1979) "The two-period cross-over clinical trial". Br J Clin Pharmacol. 8: 7-20.

• Children with enuresis were treated with a new drug or placebo for 14 days

• The primary data are the number of dry nights out of 14.

An estimate of the within-subject differences (treatment effects) is δ = YA − YB. The period effects may be tested with

Z1 = (δ1 − δ2) / √(var(δ1) + var(δ2)),

and Z1 is approximately normally distributed under H0. Similarly, the overall treatment effect can be tested with

Z2 = (δ1 + δ2) / √(var(δ1) + var(δ2)),

and this is approximately normal under H0.

z1 = (2.82 − 1.25) / √(0.8412² + 0.8627²) = 1.30

z2 = (2.82 + 1.25) / √(0.8412² + 0.8627²) = 3.38


12.5 A two-period crossover design for the comparison of two active treatments and placebo

By GG Koch, IA Amara, BW Brown, T Colton, and DB Gillings (1989).

Consider sequences of treatments TT, TC, and CT.

1. The first period is a parallel-group design addressing direct use in all patients.

2. The second period for TT versus TC is a parallel-group comparison addressing T versus C for patients who received T during the first period.

3. The second period for TT versus CT enables a "delayed start" assessment of T relative to C if dropout during the first period is minimal and non-informative.

4. The second period for CT versus TC is for assessment of T relative to C if carryover effects are small.

5. If the T − C comparisons from 1, 2, and 4 are similar (i.e., the carryover effects of T to T, T to C, and C to T are small), then an overall analysis of treatment effect differences has very high power.

6. More patients are allocated to receive T within each period.

           P1            P2
S1 = CT    β0            β0 + β1 + β2
S2 = TC    β0 + β1       β0 + β2 + β3
S3 = TT    β0 + β1       β0 + β1 + β2 + β3 + τ

β0 · · · Treatment C effect
β1 · · · Increment of treatment effect due to T
β2 · · · Period effect (for C)
β3 · · · Carryover effect (for T)


τ could represent additional treatment effects for longer duration.

The period 1 comparison between T and C addresses the primary treatment effects, and the period 2 comparisons address the effects of delayed start (CT vs. TT) and of longer treatment duration.

Now consider TT, TC, CT, and CC.

1. This design can estimate all the parameters in the TT, TC, CT case.

2. CC vs. CT enables estimation of treatment effects with run-in period.

3. It is relatively unethical to have many patients assigned to receive C.

           P1            P2
S0 = CC    β0            β0 + β2
S1 = CT    β0            β0 + β1 + β2
S2 = TC    β0 + β1       β0 + β2 + β3
S3 = TT    β0 + β1       β0 + β1 + β2 + β3 + τ

β0 · · · Treatment C effect
β1 · · · Increment of treatment effect due to T
β2 · · · Period effect (for C)
β3 · · · Carryover effect (for T)

τ could represent additional treatment effects for longer duration.

Example: Pincus T et al. (2004) "Patient preference for placebo, acetaminophen (paracetamol) or celecoxib efficacy studies (PACES): two randomised, double blind, placebo controlled, crossover clinical trials in patients with knee or hip osteoarthritis". Ann Rheum Dis. 63: 931-939.


12.6 Latin squares

When there are k treatments and each patient is to receive all k, there are k! possible sequences. Three treatments yield 6 sequences, four treatments yield 24, and five yield 120.

k = 3: ABC, ACB, BAC, BCA, CAB, CBA

The idea is to use a reduced number of sequences (and hence a reduced sample size) while maintaining good "representation", i.e., every treatment is represented in every period with the same frequency.

     P1  P2  P3             P1  P2  P3
S1   A   B   C         S1   A   C   B
S2   B   C   A         S2   B   A   C
S3   C   A   B         S3   C   B   A

There are 6!/(3! 3!) = 20 ways to choose 3 sequences from 6, but only 2 of those are Latin squares.
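A brute-force check of that count (a small enumeration sketch):

```python
from itertools import combinations, permutations

seqs = list(permutations("ABC"))  # the 6 possible treatment sequences

def is_latin(rows):
    # Latin square: each treatment appears exactly once in every period (column)
    return all(len({row[p] for row in rows}) == 3 for p in range(3))

choices = list(combinations(seqs, 3))
latin = [c for c in choices if is_latin(c)]
print(len(choices), len(latin))  # 20 2
```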

12.7 Optimal designs

There is an extensive literature on the optimal choice of sequences for measuring treatment effects in the presence of carryover.

• More advanced theory · · ·

• Optimality depends on assumptions about carryover effects

Concerns about carryover can be reduced by using designs with more than two periods. (Laska E, Meisner M, Kushner HB (1983) "Optimal crossover designs in the presence of carryover effects". Biometrics. 39(4): 1087-1091.)


Consider treatments A and B in two sequences: AABB and BBAA. This design is not uniquely optimal, but it can be used to estimate treatment effects more efficiently than using data from period 1 only.

       P1                 P2                      P3                      P4
AABB   µ11 = µ+π1+τa      µ12 = µ+π2+τa+λa        µ13 = µ+π3+τb+λa        µ14 = µ+π4+τb+λb
BBAA   µ21 = µ+π1+τb      µ22 = µ+π2+τb+λb        µ23 = µ+π3+τa+λb        µ24 = µ+π4+τa+λa

Note µ is the overall mean, π is the period effect, τ is the treatment effect, and λ is thecarryover effect.

To obtain an unadjusted (for carryover) estimate of the treatment effect (B − A), use the following weights.

        P1      P2      P3      P4
AABB   −1/4    −1/4     1/4     1/4
BBAA    1/4     1/4    −1/4    −1/4

• The weights on B cells sum to 1 and those on A cells to −1, forming the contrast B − A.

• Weights sum to 0 over sequence and period.

−(1/4)µ11 − (1/4)µ12 + (1/4)µ13 + (1/4)µ14 + (1/4)µ21 + (1/4)µ22 − (1/4)µ23 − (1/4)µ24
    = (τb − τa) + (λb − λa)/4

When carryover effects are present, we can construct weights so that the carryover effects are eliminated.

        P1     P2     P3     P4
AABB   −w1    −w2     w3     w4
BBAA    w1     w2    −w3    −w4


Constraints on the w's:

• w1 +w2 +w3 +w4 = 1

• w2−w3 +w4 = 0

−w1µ11 − w2µ12 + w3µ13 + w4µ14 + w1µ21 + w2µ22 − w3µ23 − w4µ24
  = −w1τa − w2(τa + λa) + w3(τb + λa) + w4(τb + λb) + w1τb + w2(τb + λb) − w3(τa + λb) − w4(τa + λa)
  = (w1 + w2 + w3 + w4)τb − (w1 + w2 + w3 + w4)τa − (w2 − w3 + w4)λa + (w2 − w3 + w4)λb
  = τb − τa
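A quick numeric check that weights satisfying the two constraints recover τb − τa exactly; the parameter values are drawn at random purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
mu = rng.normal()
pi = rng.normal(size=4)                        # period effects
tau_a, tau_b, lam_a, lam_b = rng.normal(size=4)

# Cell means for sequences AABB and BBAA, per the model table above
aabb = np.array([mu+pi[0]+tau_a, mu+pi[1]+tau_a+lam_a,
                 mu+pi[2]+tau_b+lam_a, mu+pi[3]+tau_b+lam_b])
bbaa = np.array([mu+pi[0]+tau_b, mu+pi[1]+tau_b+lam_b,
                 mu+pi[2]+tau_a+lam_b, mu+pi[3]+tau_a+lam_a])

w = np.array([0.4, 0.2, 0.3, 0.1])             # sum(w) = 1 and w2 - w3 + w4 = 0
signs = np.array([-1, -1, 1, 1])               # signs for the AABB row
contrast = (signs * w) @ aabb - (signs * w) @ bbaa
print(contrast, tau_b - tau_a)                 # identical: carryover cancels
```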

Let σ² be the within-patient variance and n the number of patients per sequence. The variance of the unadjusted estimator is

2 {(1/4)² + (1/4)² + (1/4)² + (1/4)²} σ²/n = 0.5 σ²/n.

And for the adjusted estimator:

2 {w1² + w2² + w3² + w4²} σ²/n.

If we pick w1 = 4/10, w2 = 2/10, w3 = 3/10, w4 = 1/10, we have

2 {(4/10)² + (2/10)² + (3/10)² + (1/10)²} σ²/n = 0.6 σ²/n.

If we only use data from the first period,

2 {1² + 0² + 0² + 0²} σ²/n = 2 σ²/n.

The adjusted estimator has a slightly higher variance, but it is unbiased in the presence of carryover.
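The three variance factors (in units of σ²/n) follow directly from 2·Σw²:

```python
weights = {
    "unadjusted":    [1/4, 1/4, 1/4, 1/4],
    "adjusted":      [4/10, 2/10, 3/10, 1/10],
    "period 1 only": [1.0, 0.0, 0.0, 0.0],
}
factors = {name: 2 * sum(x * x for x in w) for name, w in weights.items()}
print({name: round(f, 2) for name, f in factors.items()})
# {'unadjusted': 0.5, 'adjusted': 0.6, 'period 1 only': 2.0}
```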


Williams square
When an even number of treatments is considered in the same number of periods, a Williams square gives an optimal design. (Williams EJ (1949) "Experimental designs balanced for the estimation of residual effects of treatments". Australian Journal of Scientific Research, Series A. 2: 149-168.) It is a Latin square design in which every treatment immediately precedes every other treatment exactly once.

             P1  P2  P3  P4
sequence 1   A   B   C   D
sequence 2   B   D   A   C
sequence 3   C   A   D   B
sequence 4   D   C   B   A
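The defining balance property (each treatment immediately precedes every other treatment exactly once) is easy to verify:

```python
from collections import Counter

square = ["ABCD", "BDAC", "CADB", "DCBA"]
pairs = Counter((row[i], row[i + 1]) for row in square for i in range(3))

# 4 * 3 = 12 ordered pairs of distinct treatments, each occurring once
print(len(pairs), set(pairs.values()))  # 12 {1}
```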


Chapter 13

Pragmatic clinical trials

Compare and contrast pragmatic trials and explanatory trials on...

• Inclusion criteria

• Interventions

• Follow-up

• Intention-to-treat



13.1 Superiority, Noninferiority, and Equivalence

13.1.1 Hypotheses

Superiority

H0 : µt − µc = 0.        H0 : pt − pc = 0.
H1 : µt − µc > 0.        H1 : pt − pc > 0.

Powered at µt−µc = δs.

To increase the likelihood of a positive conclusion (H1) in a superiority trial:

• Small σ

• Large µt−µc from the relevant populations

Noninferiority

H0 : µt − µc = −δi.       H0 : pt − pc = −δi.
H1 : µt − µc > −δi.       H1 : pt − pc > −δi.

Powered at µt−µc = 0.

To increase the likelihood of a positive conclusion (H1) in a noninferiority trial:

• Small σ

• Small |µt−µc| from the relevant populations (assuming µt < µc).

Equivalence

H0 : µt − µc < −δe or µt − µc > δe.       H0 : pt − pc < −δe or pt − pc > δe.
H1 : −δe ≤ µt − µc ≤ δe.                  H1 : −δe ≤ pt − pc ≤ δe.


Powered at µt−µc = 0.

To increase the likelihood of a positive conclusion (H1) in an equivalence trial:

• Small σ

• Small |µt−µc| from the relevant populations (assuming µt < µc).

13.2 Sample size

Sample size formula for a continuous endpoint:

1/Nt + 1/Nc = (δ1 − δ0)² / (σ² (zα + zβ)²).

Let Nt = rNc and solve for Nc to get

Nc = (1 + r)(zα + zβ)² σ² / (r (δ1 − δ0)²),   (13.1)

which for equal allocation (r = 1) is the familiar Nc = 2(zα + zβ)² σ² / (δ1 − δ0)².

N does not depend on the individual values of δ1 and δ0; what matters is their difference.
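Formula (13.1) can be coded as a small function; the function name is mine, and the leading factor (1 + r) reduces to 2 when r = 1:

```python
from math import ceil
from statistics import NormalDist

def nc_continuous(alpha, beta, sigma, delta, r=1.0):
    """Control-group size from (13.1); Nt = r * Nc, delta = delta1 - delta0."""
    z = NormalDist().inv_cdf
    nc = (1 + r) * (z(1 - alpha) + z(1 - beta))**2 * sigma**2 / (r * delta**2)
    return ceil(nc)

# One-sided alpha = 0.025, power 0.90, sigma = 1, detectable difference 0.5:
print(nc_continuous(0.025, 0.10, 1.0, 0.5))  # 85
```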

For the test regarding proportions

N = ( zα √(2p̄(1 − p̄)) + zβ √(pt(1 − pt) + pc(1 − pc)) )² / (pt − pc)²,   (13.2)

where p̄ = (pt + pc)/2. N is different for different values of pt and pc even when their differences are the same.

Suppose we are testing H0 : pt = pc versus H1 : pt > pc with α = 0.025 and β = 0.10 when pt − pc = 0.20.

pc   0.01   0.05   0.10   0.20   0.30   0.40   0.50   0.60
pt   0.21   0.25   0.30   0.40   0.50   0.60   0.70   0.80
N    50     65     82     108    124    130    124    108


The sample size is largest when the proportions are near 0.5 because the variance of a binomial random variable is np(1 − p), which is maximized at p = 0.5.
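Formula (13.2) can be coded directly; rounding up reproduces the table above to within ±1 (the table's rounding convention appears to differ slightly):

```python
from math import ceil, sqrt
from statistics import NormalDist

def n_proportions(pc, pt, alpha=0.025, beta=0.10):
    """Per-group N from (13.2) for H1: pt > pc."""
    z = NormalDist().inv_cdf
    za, zb = z(1 - alpha), z(1 - beta)
    pbar = (pt + pc) / 2
    num = (za * sqrt(2 * pbar * (1 - pbar))
           + zb * sqrt(pt * (1 - pt) + pc * (1 - pc)))**2
    return ceil(num / (pt - pc)**2)

Ns = [n_proportions(pc, pc + 0.20)
      for pc in (0.01, 0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60)]
print(Ns)
```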

Sometimes, tests for proportions are parameterized in terms of odds ratios or risk ratios.

Odds ratio   2        2        2       2       2       2       2       2
pc           0.01     0.05     0.10    0.20    0.30    0.40    0.50    0.60
pt           0.0198   0.0952   0.182   0.333   0.462   0.571   0.667   0.750
N            3210     691      377     231     187     178     182     203
Risk ratio   1.98     1.90     1.82    1.67    1.54    1.43    1.33    1.25
Difference   0.0098   0.0452   0.082   0.133   0.162   0.171   0.167   0.150

Sample size for equivalence (α,β ,δ ) = Sample size for superiority (α,β/2,δ ).

13.2.1 Sample size adjustment for ITT analysis

Dropouts and drop-ins have no effect on the numerator of (13.1), and only a small effect on the numerator of (13.2), but they have a major impact on the denominators, which contain the expected differences of the means and proportions under the alternative; those differences get diluted by dropouts and drop-ins.

Suppose Rtc and Rct denote the proportions of subjects switching from the treatment group to the control group and from the control group to the treatment group, respectively. In the following, let X′t and X′c be the observed averages of those assigned to the t and c groups, and let Xt and Xc be the averages of those receiving each treatment.

X′t = Xt(1 − Rtc) + Xc Rtc,
X′c = Xc(1 − Rct) + Xt Rct.


Thus

µ′t ≡ E[X′t] = µt(1 − Rtc) + µc Rtc,
µ′c ≡ E[X′c] = µc(1 − Rct) + µt Rct,

and

(µ′t − µ′c)² = (µt(1 − Rtc − Rct) − µc(1 − Rct − Rtc))²
             = (µt − µc)² (1 − (Rtc + Rct))².

Similarly, for the test of a proportion difference,

p′t = pt(1 − Rtc) + pc Rtc,
p′c = pc(1 − Rct) + pt Rct,

(p′t − p′c)² = (pt(1 − Rtc − Rct) − pc(1 − Rct − Rtc))²
             = (pt − pc)² (1 − (Rtc + Rct))².

Thus, the necessary sample size will increase by the factor 1/(1 − (Rtc + Rct))². For example, if you guess Rtc = 0.05 and Rct = 0.10, then the final sample size will be close to 40% larger: 1/(1 − 0.10 − 0.05)² = 1.38.

Rtc + Rct    0.05   0.10   0.15   0.20   0.25   0.30   0.50
Multiplier   1.11   1.23   1.38   1.56   1.78   2.04   4.00
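The multipliers in the table are just 1/(1 − R)² evaluated at R = Rtc + Rct:

```python
multipliers = {R: round(1 / (1 - R) ** 2, 2)
               for R in (0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.50)}
print(multipliers)
# {0.05: 1.11, 0.1: 1.23, 0.15: 1.38, 0.2: 1.56, 0.25: 1.78, 0.3: 2.04, 0.5: 4.0}
```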

We can also compute the power of the study as a function of Rtc + Rct when the sample size is computed assuming Rtc + Rct = 0. Assuming, without loss of generality, δ0 = 0, σ = 1, and r = 1 in (13.1) and solving for zβ, we get

zβ = (√N/√2) δ1 − zα.   (13.3)

With subjects switching treatment groups, we have δ′1 = δ1(1 − (Rtc + Rct)), where δ1 is the true treatment difference under the alternative for a treatment-received analysis, and δ′1 is the true treatment difference for an intention-to-treat analysis. The actual power of the study is then a function of

zβ′ = (√N/√2) δ′1 − zα.   (13.4)

Solving for zβ′ in (13.4) using the original N from (13.3), we get:

zβ′ = (zα + zβ)(δ′1/δ1) − zα
     = (zα + zβ)(1 − (Rtc + Rct)) − zα
     = zβ(1 − (Rtc + Rct)) − zα(Rtc + Rct).

For α = 0.025, the following table shows the decrease in power as a function of Rtc + Rct.

Rtc + Rct              0      0.05   0.10   0.15   0.20   0.25   0.30   0.50
Power (1 − β = 0.90)   0.90   0.87   0.83   0.79   0.74   0.68   0.62   0.37
Power (1 − β = 0.80)   0.80   0.76   0.71   0.66   0.61   0.56   0.50   0.29
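Both rows of the table can be regenerated from the last expression for zβ′, using the standard normal CDF (the function name is mine):

```python
from statistics import NormalDist

def itt_power(R, beta, alpha=0.025):
    """Power when sized for 1 - beta assuming R = 0, but a fraction
    R = Rtc + Rct of subjects actually switch arms."""
    nd = NormalDist()
    z = nd.inv_cdf(1 - beta) * (1 - R) - nd.inv_cdf(1 - alpha) * R
    return nd.cdf(z)

Rs = (0, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.50)
row90 = [round(itt_power(R, 0.10), 2) for R in Rs]
row80 = [round(itt_power(R, 0.20), 2) for R in Rs]
print(row90)  # [0.9, 0.87, 0.83, 0.79, 0.74, 0.68, 0.62, 0.37]
print(row80)  # [0.8, 0.76, 0.71, 0.66, 0.61, 0.56, 0.5, 0.29]
```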

Price et al. (2011) "Leukotriene antagonists as first-line or add-on asthma-controller therapy". NEJM. 364(18).