Power and sample size when multiple endpoints are considered

PHARMACEUTICAL STATISTICS
Pharmaceut. Statist. 2007; 6: 161–170
Published online in Wiley InterScience (www.interscience.wiley.com) DOI: 10.1002/pst.301

Stephen Senn 1 and Frank Bretz 2,*

1 Department of Statistics, University of Glasgow, Glasgow, Scotland, UK
2 Statistical Methodology, Clinical Information Sciences, Novartis Pharma AG, Basel, Switzerland

A common approach to analysing clinical trials with multiple outcomes is to control the probability for the trial as a whole of making at least one incorrect positive finding under any configuration of true and false null hypotheses. Popular approaches are to use Bonferroni corrections or structured approaches such as, for example, closed-test procedures. As is well known, such strategies, which control the family-wise error rate, typically reduce the type I error for some or all the tests of the various null hypotheses to below the nominal level. In consequence, there is generally a loss of power for individual tests. What is less well appreciated, perhaps, is that depending on approach and circumstances, the test-wise loss of power does not necessarily lead to a family-wise loss of power. In fact, it may be possible to increase the overall power of a trial by carrying out tests on multiple outcomes without increasing the probability of making at least one type I error when all null hypotheses are true. We examine two types of problems to illustrate this. Unstructured testing problems arise typically (but not exclusively) when many outcomes are being measured. We consider the case of more than two hypotheses when a Bonferroni approach is being applied, while for illustration we assume compound symmetry to hold for the correlation of all variables. Using the device of a latent variable, it is easy to show that power is not reduced as the number of variables tested increases, provided that the common correlation coefficient is not too high (say, less than 0.75). Afterwards, we consider structured testing problems. Here, multiplicity problems arising from the comparison of more than two treatments, as opposed to more than one measurement, are typical. We conduct a numerical study and conclude again that power is not reduced as the number of tested variables increases. Copyright © 2007 John Wiley & Sons, Ltd.

Keywords: multiple testing; multiple endpoints; power

*Correspondence to: Frank Bretz, Statistical Methodology, Clinical Information Sciences, Novartis Pharma AG, 4002 Basel, Switzerland. E-mail: [email protected]




1. INTRODUCTION

The marginal cost of obtaining further measurements is so much less than the marginal cost of studying further patients, whether that cost is measured in money or in delay, that it would seem only common sense to measure and analyse many things when running a clinical trial. However, the general perception within drug development is frequently the opposite: having many outcomes from a clinical trial is a bad thing. In a regulatory context, the problem of multiplicity is usually taken extremely seriously. For example, John Lewis (then of the Medicines Control Agency) wrote in 2000, 'Much of the ICH-E9 guideline is concerned directly or indirectly with *avoiding* or dealing with different types of multiplicity' [1 (p. 157)]. The italics are ours, not Lewis's. However, we consider the word 'avoiding' used here to be significant.

Of course, the concerns regarding multiple endpoints reflect a well-known fact upon which all on both sides of the regulatory divide agree. If many outcomes are measured, and each is used to provide a test of efficacy of size $\alpha$, then the probability of at least one being significant is greater than $\alpha$. Furthermore, other things being equal, this probability increases with the number of outcomes measured. However, there is less agreement as to what extent this matters. Besides, the type I error rate that would be controlled by only having one outcome per trial would be the probability of making one type I error per trial, and it is not at all clear that this is scientifically relevant. In fact, at least four arguments militate against the view that the type I error rate per trial is relevant.

The first argument is that if two trials are run, each of which uses one of two correlated endpoints (for example, in asthma one might consider forced expiratory volume in one second, FEV1, for one trial and peak expiratory flow, PEF, for the other), the probability of making at least one type I error is greater than if both had been studied in one trial [2]. The reason is that the correlation between the two tests is positive provided they are measured on the same patients, but zero under the null hypothesis if measured in different trials. In general, if $A$ is the event that the first is significant and $B$ is the event that the second is significant, we have $P(A \cup B) = P(A) + P(B) - P(A \cap B)$, with $P(A \cap B) = P(A)P(B)$ under the null hypothesis in the case of two trials but $P(A \cap B) > P(A)P(B)$ in the case of a single trial, so that $P(A \cup B)$ is smaller in the latter case.
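This set identity can be checked numerically. The following sketch (our own illustration, not from the paper) computes $P(A \cup B)$ exactly for two one-sided tests at level 0.025, treating the two test statistics as bivariate standard normal with correlation $\rho$ under the global null; $\rho = 0$ corresponds to two separate trials, where the union probability reduces to $2\alpha - \alpha^2$.

```python
# Illustration (ours, not from the paper): P(A u B) for two one-sided
# 2.5% tests under the global null, as a function of the correlation
# rho between the two test statistics (rho = 0: two separate trials).
from scipy.stats import norm, multivariate_normal

alpha = 0.025
c = norm.ppf(1 - alpha)  # common one-sided critical value

def p_union(rho):
    mvn = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])
    # By symmetry, P(Z1 > c, Z2 > c) = P(Z1 < -c, Z2 < -c) = F(-c, -c)
    p_intersection = mvn.cdf([-c, -c])
    return 2 * alpha - p_intersection

# Positive correlation inflates P(A n B) above alpha**2, so P(A u B)
# is smaller for one trial with two endpoints than for two trials.
```

Raising the correlation only lowers the union probability further, in line with the argument in the text.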

The second argument is that if tight control of the type I error is appropriate, then it would seem to be more appropriate at the level of the drug development programme as a whole. Yet such control is not formally exercised. For example, there is a long-standing tradition of requiring two significant trials. If the usual standard of a two-sided type I error rate per trial of $0.05 = 1/20$ is observed, then, if only two trials are run, this would control the one-sided type I error rate per programme at $\frac{1}{40} \times \frac{1}{40} = \frac{1}{1600}$ and the two-sided rate at $2 \times \frac{1}{1600} = \frac{1}{800} = 0.00125$ [2–6]. In practice, however, there is no requirement for exactly two phase III trials to be run and a sponsor might gain registration if two out of three trials were significant.
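As a quick check of the arithmetic above (our own sketch, nothing beyond the fractions quoted in the text):

```python
# Programme-wise error rates implied by the two-trials rule.
one_sided_per_trial = 0.05 / 2                        # two-sided 0.05 = 1/40 one-sided
one_sided_per_programme = one_sided_per_trial ** 2    # (1/40)^2 = 1/1600
two_sided_per_programme = 2 * one_sided_per_programme # 1/800 = 0.00125
```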

The third argument is that increasingly the scientific community at large is judging the effects of treatments using meta-analysis. For this purpose, nobody cares much what the original intentions of the experimenter were. To return to the example of asthma considered above, the meta-analyst will prefer the situation that all trials of a similar type measured both FEV1 and PEF (and will pay no attention to any adjustments employed) rather than that half had used FEV1 only and the other half PEF only.

The fourth argument is that many methods for dealing with the type I error rate assume, which is not at all obviously reasonable, that the study's designers are mandated on behalf of all scientific posterity to make the decision (subject of course to the regulator's approval) as to the efficacy of a treatment. Now, although this is (approximately) a model as to how drug regulation works in practice, it cannot be a model as to how science works. In fact, a general argument against the Neyman–Pearson (NP) testing approach and in favour of P-values is that even if two scientists sign up to the NP approach there is no guarantee that the decision of one to reject a null hypothesis will be acceptable to the other and vice versa if they habitually use different type I error rates [7,8]. Similarly, if one scientist favours a different approach to adjusting for multiplicity than another, they need to communicate the results of unadjusted tests so that each can perform his preferred adjustment.

Of course, there is one potential problem with multiple outcomes that all statisticians, whatever their philosophy, can agree is serious. This is the possibility that many things will be measured but only a misleading subset will be reported [2]. The business of drug development requires utmost good faith: relevant information of which the sponsor is aware must be shared with the regulator. In principle, one could claim that this duty is discharged by the simple expedient of providing the regulator with all the data, provided that the data that will be collected are pre-specified in the protocol. In practice, of course, even regulators who reanalyse submissions would run the danger of missing some relevant features of the data and being misled by selective analyses.

Thus, multiplicity and how to deal with it is an issue that continues to divide scientists. For example, Ken Rothman has argued that all adjustments are unnecessary [9] and John Nelder has argued similarly [10]. One might call this the per test standard. Jim Lindsey has made a particularly strongly worded criticism of the cult of the single endpoint as follows: 'In spite of its name, present confirmatory inference is essentially negative: confirming that a null hypothesis can be rejected. This perhaps explains why confirmatory phase III trials, requiring tens of thousands of subjects while testing one endpoint as imposed by regulatory agencies, do not yield scientific information that is proportional to their size and cost' [11]. On the other hand, Peter Bauer and Robert O'Neill have argued strongly in favour of tight control and (implicitly at least) defended what one might call a per trial standard [12,13]. In between these two extremes, Cook and Farewell [14] have put forward an intermediate view.

Here, we take the view that the economics of drug development make it logical that many things should be studied in a given clinical trial. However, we also accept that regulatory reality is, and is likely to remain, that some sort of per trial control of the type I error rate is required of sponsors. Therefore, our aim is to see whether these points of view can be reconciled. For example, if it were the case that the necessity of controlling the type I error rate led to such a loss of power overall that trials were likely to be unsuccessful, then the economic argument for measuring many things would be vitiated. In fact, we shall show that the situation as regards power is not so gloomy and that power can be reasonably preserved or even increased by measuring many outcomes despite maintaining strict control of the per trial type I error rate.

The plan of this paper is as follows. In Section 2, we consider different sorts of power that can be defined for multiple testing problems. In Section 3, we discuss unstructured testing problems. These arise typically (but not exclusively) when many outcomes are being measured. In Section 4, we consider structured testing problems. Here, multiplicity problems arising from the comparison of more than two treatments, as opposed to more than one measurement, are typical. In Section 5, we make some tentative conclusions. A particular type of testing problem that leads to multiplicity, that of repeated sequential analysis, will not be covered. For recent treatments of the question of multiplicity that cover similar ground to this paper, the reader is referred to Chuang-Stein et al. [15] and Offen et al. [16].

2. TYPES OF POWER

Suppose that we consider testing a number of null hypotheses, $H_{0,1}, H_{0,2}, \ldots, H_{0,k}$, for which we have a series of alternative hypotheses $H_{1,1}, H_{1,2}, \ldots, H_{1,k}$. We might suppose that we do this using a series of test statistics $T_1, T_2, \ldots, T_k$ for which we have defined critical regions $R_1, R_2, \ldots, R_k$, but this would imply that we were testing each hypothesis marginally and independently, which need not necessarily be the case. (However, tests that are mainly of this sort will be considered in Section 3.) Suppose instead that we define the event that $H_{0,i}$ is rejected in favour of $H_{1,i}$ to be $S_i$. Then many conventional schemes to adjust for multiplicity are designed to control probabilities of disjunctive events of the form $D = S_1 \cup S_2 \cup \cdots \cup S_k = \bigcup_{i=1}^{k} S_i$. Occasionally, one may be interested in probabilities of conjunctive events of the form $C = \bigcap_{i=1}^{k} S_i$.

Note that the event $D$ corresponds to rejecting at least one null hypothesis, whereas the event $C$ corresponds to rejecting all null hypotheses. In connection with these events, one may consider $P(D)$ as disjunctive power and $P(C)$ as conjunctive power. (These probabilities have been referred to as minimal and complete power, respectively, by Westfall et al. [17], but these terms have the disadvantage that minimal power is never less than complete power.) Also, of course, the power itself depends on the particular alternative hypotheses considered. Full disjunctive power applies in the case where all null hypotheses are false, although even this, of course, depends on the magnitude of the various treatment effects.
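These definitions can be made concrete with a small simulation (our own sketch; the effect size, correlation and unadjusted critical value below are illustrative assumptions, not values from the paper):

```python
# Monte Carlo sketch of disjunctive power P(D) and conjunctive power P(C)
# for k equicorrelated standard-normal test statistics with common mean delta.
import numpy as np

rng = np.random.default_rng(0)
k, delta, rho, n_sim = 2, 2.8, 0.3, 100_000
c = 1.96  # unadjusted one-sided 2.5% critical value, for illustration only

cov = np.full((k, k), rho) + np.eye(k) * (1 - rho)
T = rng.multivariate_normal(mean=[delta] * k, cov=cov, size=n_sim)
rejections = T > c
p_disjunctive = np.mean(rejections.any(axis=1))  # P(D): at least one rejection
p_conjunctive = np.mean(rejections.all(axis=1))  # P(C): all rejected
```

By construction $P(C) \le P(\text{any single rejection}) \le P(D)$, which is the ordering behind the remark on 'minimal' versus 'complete' power.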

Note also that controlling $P(D) \le \alpha$ for the case where all $H_{0,i}$ are true does not on its own satisfy regulatory requirements, since, for example, testing may be carried out in a circumstance where one of the hypotheses is a 'stalking horse' hypothesis, known to be false. Thus, the closed testing principle needs to be observed in order to control not only $P(D)$ but also the probability of all true null hypotheses being rejected, given that other false null hypotheses have been rejected [18]. For example, in the case of a trial with three active treatments and a placebo, a global $F$-test at level $\alpha$ followed by pairwise comparisons of all treatments does not observe closed-test principles, since, if it is accepted that all active treatments are better than placebo, it would require a further $F$-test of the three active treatments before proceeding to pairwise comparisons amongst the active treatments.
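The closed testing principle itself is mechanical enough to sketch in code. The following illustration (ours, not from the paper) uses Bonferroni tests of every intersection hypothesis, in which case the closed procedure reduces to Holm's step-down method; the p-values are hypothetical.

```python
# Closed testing sketch: reject H_i iff every intersection hypothesis
# containing H_i is rejected by a local Bonferroni test at level alpha.
from itertools import combinations

def closed_test(pvals, alpha=0.025):
    k = len(pvals)

    def intersection_rejected(subset):
        # Bonferroni test of the intersection of the hypotheses in `subset`
        return min(pvals[j] for j in subset) <= alpha / len(subset)

    rejected = []
    for i in range(k):
        subsets = (s for size in range(1, k + 1)
                   for s in combinations(range(k), size) if i in s)
        rejected.append(all(intersection_rejected(s) for s in subsets))
    return rejected
```

With hypothetical p-values (0.001, 0.02) and $\alpha = 0.025$ both hypotheses are rejected, whereas with (0.02, 0.03) neither is, because the intersection hypothesis survives its Bonferroni test at level 0.0125.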

In the rest of this paper, we consider the simple case where disjunctive power can be calculated whilst controlling type I error rates by observing closed testing principles. In Section 3, we consider unstructured problems involving unordered hypotheses. In Section 4, we consider more structured cases arising where the hypotheses may be ordered.

3. UNSTRUCTURED ANALYSES

In this section, we consider disjunctive and conjunctive powers when dealing with unstructured problems. A typical example would be where we have many pre-specified endpoints to examine, decline to make a choice between them or even to prioritise them, and elect to make a Bonferroni correction.

We consider the case of $k$ outcomes that are measured on the same patients and are jointly multivariate normally distributed. We suppose that contrasts comparing two treatments have been calculated for each outcome. The simple results that follow are asymptotic and assume that the nuisance parameters are known. Given this assumption we can, without loss of generality, take the case where all contrasts have been divided by their standard deviations. It follows that marginally the contrast $T_i$ for outcome $i$ is distributed as $N(\Delta_i, 1)$, where $\Delta_i$ is the non-centrality parameter governing the power of the test associated with the contrast. Under $H_{0,i}$, $\Delta_i = 0$.

We now consider the simplest of all possible non-trivial cases to specify the joint distribution of $T = (T_1, T_2, \ldots, T_k)'$: that is, to assume that all non-centrality parameters are equal, so that $\Delta_i = \Delta$ for all $i$, and all correlations between different measures are identical, so that $\rho_{ij} = \rho$, $i \ne j$. It then follows that we can simplify the problem considerably by invoking the presence of a latent variable, $L \sim N(\Delta, \rho)$. Each of the $T_i$ is correlated with $L$ with correlation $\sqrt{\rho}$, but the $T_i$ are conditionally independent. We now show that this structure is sufficient to describe the joint distribution of $T$.

Since $L$ and $T_i$ are jointly normally distributed with identical mean $\Delta$, the regression of $T_i$ on $L$, that is to say $E[T_i \mid L = l]$, is
$$\mathrm{corr}(L, T_i)\,\frac{\mathrm{SD}(T_i)}{\mathrm{SD}(L)}\, l = \sqrt{\rho}\,\frac{1}{\sqrt{\rho}}\, l = l$$
Hence, $T_i = l + \varepsilon_i$ and $T_j = l + \varepsilon_j$, and $E[T_i] = E[T_j] = E[L] = \Delta$, where $\varepsilon_i$ and $\varepsilon_j$ are normally distributed independently of $l$ with expectation zero, $i \ne j$. However, $\mathrm{Var}(T_i) = \mathrm{Var}(L) + \mathrm{Var}(\varepsilon_i) = \rho + \mathrm{Var}(\varepsilon_i) = 1$. Hence, $\mathrm{Var}(\varepsilon_i) = 1 - \rho$ and, by analogy, $\mathrm{Var}(\varepsilon_j) = 1 - \rho$. However,
$$\mathrm{Cov}(T_i, T_j) = \mathrm{Var}(L) + \mathrm{Cov}(\varepsilon_i, \varepsilon_j) = \rho + \mathrm{Cov}(\varepsilon_i, \varepsilon_j) = \rho, \quad i \ne j$$
Hence, $\mathrm{Cov}(\varepsilon_i, \varepsilon_j) = 0$ and $T_i, T_j$, $i \ne j$, are uncorrelated given $L$.
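A quick simulation (our own check, with illustrative values $\Delta = 1$ and $\rho = 0.4$) confirms that this construction reproduces unit variances and common correlation $\rho$:

```python
# Simulate T_i = L + eps_i with L ~ N(delta, rho) and eps_i ~ N(0, 1 - rho),
# then verify Var(T_i) ~ 1 and Corr(T_i, T_j) ~ rho empirically.
import numpy as np

rng = np.random.default_rng(1)
delta, rho, k, n = 1.0, 0.4, 3, 200_000
L = rng.normal(delta, np.sqrt(rho), size=n)                # latent variable
eps = rng.normal(0.0, np.sqrt(1 - rho), size=(n, k))       # independent errors
T = L[:, None] + eps
emp_mean = T.mean(axis=0)
emp_var = T.var(axis=0)
emp_corr = np.corrcoef(T, rowvar=False)
```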

We suppose that each test is carried out with a marginal type I error rate of $\alpha$. The key to obtaining the disjunctive power is to consider the conditional type II error rate $\beta(T_i \mid L = l, \alpha)$ for any outcome. Since these probabilities are conditionally independent, we may calculate the conditional disjunctive power as
$$P(D \mid L = l, \alpha) = 1 - \prod_{i=1}^{k} \beta(T_i \mid L = l, \alpha) = 1 - [\beta(T_i \mid L = l, \alpha)]^k$$
Finally, in order to calculate the unconditional power we integrate over the distribution of $L$ to obtain
$$P(D; \Delta, \alpha) = \int_{-\infty}^{\infty} f(l; \Delta)\,\{1 - [\beta(T_i \mid L = l, \alpha)]^k\}\, dl = 1 - \int_{-\infty}^{\infty} f(l; \Delta)\,[\beta(T_i \mid L = l, \alpha)]^k\, dl \quad (1)$$
where $f(l; \Delta)$ is the density of the latent variable under the particular value of the alternative hypothesis being considered for power determination.
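Equation (1) is straightforward to evaluate numerically. The sketch below (ours) assumes one-sided Bonferroni-adjusted tests, so the critical value is $z_{1-\alpha/k}$ and the conditional type II error given $L = l$ is $\Phi((c - l)/\sqrt{1-\rho})$, since $T_i \mid L = l \sim N(l, 1 - \rho)$:

```python
# Numerical evaluation of the disjunctive power formula (1) under
# Bonferroni adjustment (one-sided tests at level alpha / k each).
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def disjunctive_power(k, rho, delta, alpha=0.025):
    c = norm.ppf(1 - alpha / k)  # Bonferroni-adjusted critical value
    def integrand(l):
        # conditional type II error beta(T_i | L = l, alpha)
        beta = norm.cdf((c - l) / np.sqrt(1 - rho))
        return norm.pdf(l, loc=delta, scale=np.sqrt(rho)) * beta ** k
    return 1 - quad(integrand, -np.inf, np.inf)[0]

# delta giving 80% power for a single endpoint at one-sided alpha = 0.025
delta80 = norm.ppf(1 - 0.025) + norm.ppf(0.8)
```

With these inputs, the two-endpoint power at a moderate correlation such as $\rho = 0.3$ exceeds the single-endpoint power of 0.8, consistent with Figure 1, while at a high correlation such as $\rho = 0.9$ it falls below 0.8.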

Figure 1 shows the power calculated using (1) as a function of the correlation coefficient $\rho$ for 2, 5 or 10 endpoints, where the power for a single endpoint would be 80%. It can be seen that for values of the correlation coefficient less than 0.7, the power is improved by having more endpoints.

In fact, by inverting (1) one can also establish critical values of $\rho$, that is to say, values for which the power would be less as a consequence of having multiple endpoints than it would be having a single endpoint. These critical values are plotted for a number of endpoints between 2 and 20 in Figure 2.

Note that the reversed pattern is seen when looking at conjunctive power. Power functions similar to Equation (1) can be derived. To illustrate the key effects, however, consider for simplicity two co-primary endpoints in a clinical trial, both of which must be declared significant in order for the clinical trial to be successful. Assuming a marginal power of 80% for each single endpoint, the conjunctive power of statistical significance for both endpoints is $0.64 = 0.8 \times 0.8$ under independence. We therefore require a high correlation to have reasonable power of all endpoints being significant and, of course, the more endpoints there are, the lower the power. In fact, the conjunctive power will never be larger than 80%, and sample sizes for clinical trials have to be adjusted accordingly.
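The two co-primary endpoint case can be computed exactly (our own sketch): each endpoint is tested at the full, unadjusted one-sided level $\alpha = 0.025$ with 80% marginal power, and the conjunctive power is a bivariate normal orthant probability.

```python
# Conjunctive power for two co-primary endpoints, each tested (unadjusted)
# at one-sided alpha = 0.025 with 80% marginal power.
from scipy.stats import norm, multivariate_normal

alpha = 0.025
c = norm.ppf(1 - alpha)
delta = c + norm.ppf(0.8)  # non-centrality giving 80% marginal power

def conjunctive_power(rho):
    # P(T1 > c, T2 > c) = P(Z1 <= delta - c, Z2 <= delta - c) by symmetry
    mvn = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])
    return mvn.cdf([delta - c, delta - c])
```

At $\rho = 0$ this reproduces the 0.64 of the text; raising $\rho$ increases the conjunctive power, but it can never exceed the 80% marginal power.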

Figure 1. Disjunctive power for trials with multiple equally correlated endpoints where the Bonferroni adjustment has been applied and the power for each individual test at the 2.5% level for a one-sided (unadjusted) test would be 80%.


Figure 3 illustrates the equivalent relative sample size required for a trial with many outcomes to have conjunctive power of 80% compared with a trial that would have 80% power for a single endpoint. The correlation structure is as previously assumed, and the clinically relevant difference as a ratio of the standard deviation (the effect size) is assumed identical for all outcomes.

4. STRUCTURED ANALYSES

In some cases, the test problems are structured in the sense that the correlation(s) are known and can be expressed in closed form. This is in contrast to Section 3, where the correlation(s) in the data were unknown and could at best be estimated.

In many confirmatory (phase III) clinical trials, two dose levels (or regimens) of an experimental treatment and one comparator treatment (placebo or a standard therapy) are used. The motivation for including two dose levels of the experimental treatment differs from trial to trial but, for example, could encompass the need to demonstrate superiority over the comparator within the specific dose range. Alternatively, both dose levels may be considered clinically viable. The intention may then be to register the high dose for patients with poorer prognosis, or simply that the higher of the two doses is needed to up-titrate those patients who failed to provide adequate clinical response on the lower dose. A common goal for such studies is to infer the benefit of the experimental treatment whilst maintaining the type I error rate at a pre-specified level.

In such cases, the responses are independent, since only one measurement per patient is considered. Correlations are introduced, however, through the structured comparisons. To fix ideas, we consider data obtained from a control and two dose regimens (say) that are normally distributed with means $\mu_i$, $i = 0$ (control), 1, 2, and common variance $\sigma^2$. Let $H_{0,i}: \mu_i - \mu_0 \le 0$, $i = 1, 2$, denote the two null hypotheses of interest. Dunnett's many-to-one test [19] is perhaps the most popular test for the comparison of several treatments with a control. This test takes the maximum $T = \max(T_1, T_2)$ of both pairwise comparisons $T_i = (\bar{X}_i - \bar{X}_0)/\sqrt{2S^2/n}$, where $\bar{X}_i$ is the arithmetic mean of treatment group $i$, $S^2$ is

Figure 2. Values of the correlation between measures at which disjunctive power for many outcomes is less than it would be for a single nominated outcome, where power for each outcome is 80% for a one-sided unadjusted test at the 2.5% level, but the Bonferroni correction has been applied.

Figure 3. Relative sample sizes required for a clinical trial to have conjunctive power of 80% when the effect size is identical for all outcomes and a common correlation applies.


the pooled variance estimate, and $n$ is the group sample size. Dunnett's test fully exploits the correlation $\rho = \mathrm{Corr}(T_1, T_2)$ between the pairwise comparisons with the control group in order to determine the nominal level to which the minimum $P$-value is compared. Note that $\rho = 0.5$ under equal allocation.
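In the known-variance (large-sample) limit, the Dunnett critical value can be sketched as the equicoordinate point of a bivariate normal with correlation 0.5 (our own illustration; the paper's power study uses the $t$-distribution with the appropriate degrees of freedom):

```python
# Large-sample Dunnett critical value for two comparisons with a control:
# solve P(T1 <= c, T2 <= c) = 1 - alpha with Corr(T1, T2) = 0.5 under the null.
from scipy.stats import norm, multivariate_normal
from scipy.optimize import brentq

alpha = 0.025
mvn = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, 0.5], [0.5, 1.0]])
c_dunnett = brentq(lambda c: mvn.cdf([c, c]) - (1 - alpha), 1.5, 3.0)
c_unadjusted = norm.ppf(1 - alpha)      # no adjustment
c_bonferroni = norm.ppf(1 - alpha / 2)  # Bonferroni, ignoring the correlation
```

The Dunnett value lies strictly between the unadjusted and Bonferroni critical values, which is exactly the gain from exploiting $\rho = 0.5$.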

In order to illustrate the impact of multiplicity in the context of structured testing problems, we consider the disjunctive power for different multiple test procedures. The methods under investigation include the Bonferroni procedure (based on the Dunnett-type pairwise comparisons with the control, but without taking the correlations into account), Simes' modified Bonferroni procedure [20], and Dunnett's test as described above. We set up the power study such that a standardized effect size of 1 is detected with 80% marginal power when applying an unadjusted $t$-test. That is, we have chosen $\sigma = 1$, $\alpha = 0.025$ (one-sided), and $n_i = 17$. Two different scenarios will be considered. Figure 4 gives the power results for the case that only one treatment is efficacious, that is, $\mu_0 = \mu_1 = 0 \le \delta = \mu_2 \le 1$. Figure 5 gives the power results for the case that one treatment is efficacious and the efficacy of the second treatment is somewhere between the efficacy of the control and the other treatment, that is, $\mu_0 = 0 \le \delta = \mu_1 \le \mu_2 = 1$.

If only one of the two treatments is efficacious, the disjunctive power is smaller than the marginal power for that treatment, since the adjustment for multiplicity leads to a larger critical value and thus reduces the probability of success. If, however, one treatment is fully efficacious (that is, $\mu_2 - \mu_0 = 1$, as in Figure 5) and the second treatment is still good (achieving at least 80% of the effect size $\mu_2 - \mu_0$), the disjunctive power is larger than the marginal power. This is because the probability gain from detecting at least one of two efficacious treatments outweighs the loss in power due to the multiplicity adjustment. This is also illustrated in Figure 6, where for Dunnett's test we have decomposed the disjunctive power $P(D) = P(\text{reject treatment 1 or 2})$ (event $D$ in Figure 6) into the components $P(\text{reject treatment 1})$ (event $A$), $P(\text{reject treatment 2})$ (event $B$) and the conjunctive power $P(C) = P(\text{reject treatments 1 and 2})$ (event $C$).

Figure 4. Power results for the case that only one treatment is efficacious. Details are given in the text.


Figure 5. Power results for the case where one treatment is efficacious and the efficacy of the second treatment is between the efficacy of the control and the other treatment. Details are given in the text.

Figure 6. Decomposed probabilities for Dunnett's test. Details are given in the text.


As in Section 3, we thus conclude that in many practically relevant situations the probability of success of a clinical trial is increased when investigating multiple treatments. The results remain qualitatively the same if more than two treatments are compared with a control. This conclusion is in contrast to the usual impression that testing multiple hypotheses always leads to a loss in power (see also Hsu [21 (pp. 236–240)], who also discussed the problem of power, multiplicity and sample size computations). One reason for this misconception may lie in the inadequate power calculations often encountered in clinical practice, where the sample size is calculated marginally at a Bonferroni type I error rate rather than using the disjunctive power. In the example discussed above, current statistical practice would recommend the use of 21 patients for each treatment group (calculated at the one-sided level $\alpha/2 = 0.0125$), whereas considering the disjunctive power for Dunnett's test requires only $n = 14$ patients per group.
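The sample-size comparison at the end of this section can be reproduced approximately with a normal (known-variance) sketch of the disjunctive power of Dunnett's test when both treatments have standardized effect size 1 (our own approximation; the paper's calculation uses the multivariate $t$, so small differences are expected):

```python
# Normal-approximation sketch: disjunctive power of Dunnett's test for two
# treatments vs control, both with standardized effect size 1, n per group.
import numpy as np
from scipy.stats import multivariate_normal
from scipy.optimize import brentq

alpha, effect = 0.025, 1.0
mvn0 = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, 0.5], [0.5, 1.0]])
# large-sample Dunnett critical value for two comparisons with a control
c = brentq(lambda x: mvn0.cdf([x, x]) - (1 - alpha), 1.5, 3.0)

def disjunctive_power(n):
    mu = effect * np.sqrt(n / 2)  # mean of each standardized contrast
    # P(T1 > c or T2 > c) = 1 - P(T1 <= c, T2 <= c)
    return 1 - mvn0.cdf([c - mu, c - mu])
```

Under this approximation, $n = 14$ per group already gives disjunctive power above 80%, while $n = 13$ falls just short, in line with the $n = 14$ quoted in the text (versus 21 from the marginal Bonferroni calculation).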

5. CONCLUSIONS

The power results shown in Figures 1 and 2 indicate that, unless the correlation is high, one may lose little and perhaps even gain in terms of disjunctive power when considering many endpoints in an unstructured case. On the other hand, for more structured problems, Figures 4 and 5 show the importance of selecting the right measure for the probability of success of a clinical trial. Use conjunctive power in, for example, fixed drug combination studies or studies involving two or more co-primary endpoints. Use disjunctive power in, for example, studies involving multiple comparisons with a control, or studies involving multiple endpoints where it is sufficient to show at least one significance. However, the results also suggest always investigating different scenarios and documenting them in the study protocol to better address the uncertainty about the parameters (rather than simply focusing on one single point in the alternative). Related to this is the investigation of different multiple test procedures. For example, the likelihood ratio test under the simple tree-order restriction [22] is not widely used in practice, although it outperforms the standard Dunnett test when only one treatment is effective. Further considerations beyond the investigation of operating characteristics may be equally important, such as the availability of simultaneous confidence intervals and unbiased estimates.

ACKNOWLEDGEMENTS

We thank Dr Willi Maurer for encouraging us to work on this topic and for his helpful comments throughout the preparation of this manuscript, and two reviewers for useful suggestions.

REFERENCES

1. Lewis JA. Discussion of the paper by Senn. The Statistician 2000; 49:156–157.

2. Senn SJ. Statistical issues in drug development. Wiley: Chichester, 1997.

3. Darken PF, Ho S-Y. A note on sample size savings with the use of a single well-controlled clinical trial to support the efficacy of a new drug. Pharmaceutical Statistics 2004; 3:61–63.

4. Fisher LD. One large, well-designed, multicenter study as an alternative to the usual FDA paradigm. Drug Information Journal 1999; 33:265–271.

5. Maca J, Gallo P, Branson M, Maurer W. Reconsidering some aspects of the two-trials paradigm. Journal of Biopharmaceutical Statistics 2002; 12:107–109.

6. Rosenkranz G. Is it possible to claim efficacy if one of two trials is significant while the other just shows a trend? Drug Information Journal 2002; 36:875–879.

7. Senn SJ. Consensus and controversy in pharmaceutical statistics (with discussion). The Statistician 2000; 49:135–176.

8. Senn SJ. Dicing with death. Cambridge University Press: Cambridge, 2003.

9. Rothman KJ. No adjustments are needed for multiple comparisons. Epidemiology 1990; 1:43–46.

10. Nelder JA. From statistics to statistical science – reply. Journal of the Royal Statistical Society Series D – The Statistician 1999; 48:269.

11. Lindsey JK. Some statistical heresies. Journal of the Royal Statistical Society Series D – The Statistician 1999; 48:1–26.


12. Bauer P. Multiple testing in clinical trials. Statistics in Medicine 1991; 10:871–890.

13. O'Neill RT. Secondary endpoints cannot be validly analyzed if the primary endpoint does not demonstrate clear statistical significance. Controlled Clinical Trials 1997; 18:550–556.

14. Cook RJ, Farewell VT. Multiplicity considerations in the design and analysis of clinical trials. Journal of the Royal Statistical Society Series A – Statistics in Society 1996; 159:93–110.

15. Chuang-Stein C, Stryszak P, Dmitrienko A, Offen W. Challenge of multiple co-primary endpoints: a new approach. Statistics in Medicine 2007; 26:1181–1192.

16. Offen W, Chuang-Stein C, Dmitrienko A, Littman G, Maca J, Meyerson L, Muirhead R, Stryszak P, Boddy A, Chen K, Copley-Merriman K, Dere W, Givens S, Hall D, Henry D, Jackson JD, Krishen A, Liu T, Ryder S, Sankoh AJ, Wang J, Yeh CH. Multiple co-primary endpoints: medical and statistical solutions – a report from the Multiple Endpoints Expert Team of the Pharmaceutical Research and Manufacturers of America. Drug Information Journal 2007; 41:31–46.

17. Westfall PH, Tobias RD, Rom D, Wolfinger RD, Hochberg Y. Multiple comparisons and multiple tests using the SAS system. SAS: Cary, NC, 1999.

18. Marcus R, Peritz E, Gabriel KR. On closed testing procedures with special reference to ordered analysis of variance. Biometrika 1976; 63:655–660.

19. Dunnett CW. A multiple comparison procedure for comparing several treatments with a control. Journal of the American Statistical Association 1955; 50:1096–1121.

20. Simes RJ. An improved Bonferroni procedure for multiple tests of significance. Biometrika 1986; 73:751–754.

21. Hsu JC. Multiple comparisons: theory and methods. Chapman & Hall: London, 1996.

22. Robertson T, Wright FT, Dykstra RL. Order restricted statistical inference. Wiley: New York, 1988.
