


‘Equivalence is Different’ – Some Comments on Therapeutic Equivalence

Stephen Senn*

Department of Statistics, University of Glasgow, 15 University Gardens, Glasgow, G12 8QQ, UK

It is a pleasure to be invited to comment on this set of papers, which is an impressive collection of current thinking on both sides of the regulatory divide as well as from interested academic third parties. At the time of writing the pharmaceutical industry and also the drug regulatory agencies are the subject of unpleasant criticisms (in my view largely mistaken) regarding standards of proof of efficacy and safety. As these papers clearly illustrate, however, such issues are taken very seriously indeed by those who work directly or indirectly in drug development, and the science of drug regulation consists of an unending and vigorous debate on approaches to evidence and proof. In the general spirit of this debate, I have decided to be critical of the individual contributions and pick out debatable points. I hope that the contributors will forgive me for not dwelling overlong on the many excellent, sensible and uncontroversial points they make.

Bristol (2005) considers alternative approaches to comparing safety of pharmaceuticals where efficacy also has to be taken into account. The author is surely correct that the former is meaningless without reference to the latter and he gives some valuable practical examples of the problem. Bristol implies, but perhaps does not make completely clear, that the three approaches considered, univariate, stepwise and maximum, can be implemented using the same statistic: the maximum of the standardised differences in efficacy and safety. What differs is the critical value, C, employed. The maximum procedure is much more liberal than the univariate approach. (In the discussion that follows I shall accept the author’s formulation in terms of one-sided tests at the 5% level, a point he discusses, although my own view is that one-sided tests at the 2.5% level are the regulatory norm, c.f. Senn, 1997. I note also that 2.5% one-sided is the standard adopted by Tsong and Hung, 2005.) Consider the critical value for the maximum procedure. The asymptotic values may be found as follows. Construct two new orthogonal variables, T for the total of, and D for the difference between, the standardised efficacy and safety statistics. The joint density under the null hypothesis of T and D is the product of two Normal distributions each with mean zero, but the former with variance 2(1 + ρ) and the latter with variance 2(1 − ρ). D may take any value at all but the condition on T that must be satisfied is

(T < 2C + D) ∩ (T < 2C − D).

Depending on the sign of D, satisfaction of one half of this condition guarantees the other half and vice versa. The value of C may be found such that the joint integral of the density of T and D over the relevant region is 0.05. For the cases considered in the paper where ρ = 0.1, 0.5 and 0.9, I find critical values of −0.834, −1.092 and −1.437. However, the case where ρ = 0 is particularly simple and revealing. The critical value is −0.762, corresponding to requested significance levels for each of the two univariate tests of only 0.223. The product of these is 0.05, which is the joint type I error rate because, since ρ = 0, the tests are independent. I do not believe, however, that any regulator will accept the liberal approach that this implies. This can be seen clearly by looking at Table 1 and the row for which Δx − Δ = 0, Δy = −0.5 and ρ = 0.1. The power of the procedure for the maximum approach is about 19% but only 5% for the univariate approach. However, if the null hypothesis is taken to be “at least one of efficacy and safety is unsatisfactory”, then for this combination of parameters it is true. Hence, we are speaking of type I error rates and not power, and the value of 5% is to be preferred to that of 19%.
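As a numerical check (my own sketch, not part of Bristol’s paper), the critical value C can be found by noting that the condition above is equivalent to requiring both standardised statistics to fall below C, so that C solves P(Z_eff < C, Z_safe < C) = 0.05 for a bivariate standard Normal with correlation ρ. The function name `find_C` and the bracketing interval are my own choices; for ρ = 0 the result is close to the −0.762 quoted above.

```python
# Sketch: critical value C for the "maximum" procedure, i.e. C such that
# P(Z_eff < C and Z_safe < C) = 0.05 when (Z_eff, Z_safe) are standard
# bivariate Normal with correlation rho. This integrates the same region
# as the (T, D) condition (T < 2C + D) ∩ (T < 2C - D) in the text.
from math import sqrt

from scipy.optimize import brentq
from scipy.stats import multivariate_normal, norm


def find_C(rho, alpha=0.05):
    """Critical value C giving joint one-sided type I error rate alpha."""
    mvn = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])
    # P(both statistics < C) is increasing in C, so bracket the root and solve.
    return brentq(lambda c: mvn.cdf([c, c]) - alpha, -5.0, 0.0)


if __name__ == "__main__":
    for rho in (0.0, 0.1, 0.5, 0.9):
        print(f"rho = {rho:3.1f}: C = {find_C(rho):+.3f}")
    # For rho = 0 the two tests are independent, so each is effectively run
    # at level sqrt(0.05) ≈ 0.224, i.e. C = Phi^{-1}(sqrt(0.05)).
    print(norm.ppf(sqrt(0.05)))
```

The critical value becomes more negative (more conservative) as ρ increases, since positive correlation makes it easier for both statistics to fall below C jointly.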

* Corresponding author: e-mail: [email protected], Phone: +44 141 330 5141, Fax: +44 141 330 4814

Biometrical Journal 47 (2005) 1, 104–107 DOI: 10.1002/bimj.200410088

© 2005 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim


Hauschke et al. (2005) consider the vexed problem of establishing a safe limit for the dose of a drug. As they explain, requiring no effect is not logical and, however unpleasant the thought and apparently contradictory the concept, clinical colleagues may have to establish a ‘safe level of toxicity’. The authors make skilful use of Fieller’s theorem and monotonicity. It is perhaps worth noting here that Fieller (1907–1960) worked for the Boots company from 1934 to 1942 (Irwin and Van Rest, 1961) and was thus a pharmaceutical statistician at the time of his famous paper (Fieller, 1940). The example they give is a very nice one exhibiting many interesting features including clear heteroscedasticity (Bartlett’s test yields a value of 24.2 on 5 DF, p < 0.001). Whether or not Bartholomew’s test (Bartholomew, 1961) is robust under the null hypothesis, this raises in my mind the issue as to whether a modelling approach might not be an attractive option. After all, the doses being investigated are in mg/kg and their eventual application in man will involve interspecies scaling based on very strong parametric assumptions. Surely, we need to believe in more than just monotonicity? Are we prepared to say nothing about a dose of 40 mg/kg that we did not study except that its effect can be no less than that of 30 mg/kg and no more than that of 50 mg/kg? Also, the observed heteroscedasticity is exactly what one would expect for count data, for which a Poisson distribution is a possible candidate. Poisson regression using a saturated model treating the six groups as unordered levels of one factor yields a residual deviance of 24.5 on 25 degrees of freedom. However, if we have five dose groups nested within treatment (taking vehicle as the zero dose) and fit orthogonal polynomials, then the first two suffice to produce a residual deviance of 27.3 on 27 degrees of freedom, a remarkably good fit. For this quadratic dose-response model, the fitted values are 2.5, 4, 7, 11.8, 21.4 and 25, compared to group means of 2.6, 3.8, 6.4, 14.0, 20.0 and 25.0, but the model has the advantage that predictions, and standard errors on predictions, are easily issued for any intermediate dose.
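The modelling strategy just described can be sketched as follows. Since Hauschke et al.’s raw data are not reproduced here, the doses and counts below are hypothetical stand-ins, chosen only so that the group means resemble those quoted; the fitting routine is a generic log-linear Poisson fit by iteratively reweighted least squares, not the authors’ own analysis. Orthogonal polynomials are obtained from a QR decomposition of the raw polynomial basis.

```python
# Sketch (HYPOTHETICAL data): quadratic Poisson dose-response model fitted
# by IRLS, with predictions and a delta-method standard error at an
# intermediate, unstudied dose.
import numpy as np


def fit_poisson(X, y, n_iter=25):
    """IRLS for a log-linear Poisson model; returns (beta, residual deviance)."""
    # Initialise from a least-squares fit to log(y + 0.5).
    beta, *_ = np.linalg.lstsq(X, np.log(y + 0.5), rcond=None)
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        z = X @ beta + (y - mu) / mu      # working response
        XtW = X.T * mu                    # Poisson working weights are mu
        beta = np.linalg.solve(XtW @ X, XtW @ z)
    mu = np.exp(X @ beta)
    dev = 2.0 * np.sum(np.where(y > 0, y * np.log(y / mu), 0.0) - (y - mu))
    return beta, dev


# Hypothetical doses (mg/kg), five animals per group; group means roughly
# match the 2.6, 3.8, 6.4, 14.0, 20.0, 25.0 quoted in the text.
dose = np.repeat([0.0, 10.0, 20.0, 30.0, 40.0, 50.0], 5)
y = np.array([2, 3, 2, 3, 3,   4, 3, 4, 4, 4,   6, 7, 6, 7, 6,
              14, 13, 14, 15, 14,   20, 19, 21, 20, 20,   25, 24, 26, 25, 25],
             dtype=float)

A = np.column_stack([np.ones_like(dose), dose, dose**2])
Q, R = np.linalg.qr(A)                    # orthogonal-polynomial basis
beta, dev = fit_poisson(Q, y)

# Prediction and log-scale standard error at an unstudied dose of 35 mg/kg.
a0 = np.array([1.0, 35.0, 35.0**2])
x0 = np.linalg.solve(R.T, a0)             # the same basis at the new dose
mu_hat = np.exp(Q @ beta)
cov = np.linalg.inv(Q.T @ (Q * mu_hat[:, None]))   # inverse Fisher information
pred = np.exp(x0 @ beta)
se_log = np.sqrt(x0 @ cov @ x0)

print(f"deviance = {dev:.1f}; prediction at 35 mg/kg = {pred:.1f}")
```

This is exactly the advantage claimed in the text: once a parametric dose-response curve is accepted, an interpolated prediction at 35 or 40 mg/kg comes with a standard error, rather than only the monotonicity bounds.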

Lange and Freitag (2005) have provided a thought-provoking, extensive and energetic empirical investigation and review of a controversial issue that will be familiar to all who have run and analysed trials in this area: the specification (or not) of the clinically irrelevant difference. I am confident that the paper will prove to be an invaluable resource to workers in this field. In my view, however, we will eventually come to see that the pre-specification by the sponsor of a non-inferiority margin does not form part of any rational approach to analysing such trials, although it may be highly relevant for planning them. The reason is that ultimately the issue as to what is or is not irrelevant must be a decision to be taken by the consumer including, but not exclusively limited to, the regulator acting on the patient’s behalf. In planning a trial it will be necessary for the sponsor to anticipate what most consumers would accept, but mere pre-specification cannot pre-empt this decision. Otherwise we would have the absurdity of sponsor A (say) being able to claim efficacy because the pre-specified margin of non-inferiority in a trial in asthma was 150 ml of forced expiratory volume in one second and the lower confidence limit was at −120 ml, but sponsor B not being able to do so because the margin specified was 50 ml, despite the fact that the lower confidence limit was at −80 ml. In the long run regulators themselves may not be able to avoid setting limits and perhaps even specific loss functions, an approach that has been proposed by Lindley (1998) for bioequivalence trials. One further point is worth mentioning: in synthesising results from so many areas, Lange and Freitag were forced to find some scale on which they could be compared. This has led them to using ‘sigma-divided’ measures of overlap, and they briefly acknowledge problems with such measures. These include the fact that the size of the effect measured on this scale is not only a function of the effect of the treatment but also of the variability in the sample, which in turn will reflect the degree of homogeneity of the patients, the trial design and the precision of measurement (c.f. Senn, 1997).

Röhmel (2005) tackles the problem of binary outcomes with his usual algorithmic insight and expertise. His recent retirement from the “BfArM” will be a considerable loss to drug regulation but will, I suspect, prove to be the gain of statistical research. It is somewhat surprising that something as innocent as the analysis of 2 × 2 tables can be so controversial, but the extensive and growing literature on the analysis of the 2 × 2 table, to which Röhmel himself has made distinguished contributions (I particularly like his paper on the subject with the late Berndt Streitberg, Streitberg and Röhmel, 1991), shows that it is indeed so. I have nothing of substance to add in the way of commentary to this paper except to note that the results are deeply worrying. Any statistician working in drug development has to tackle a wide variety of problems and cannot be expert in them all. In many matters he or she is reliant on the expertise of others. Röhmel’s paper shows that there is no room for complacency.

Wellek (2005) similarly expertly considers the problem of binary outcomes. He draws attention to the fact that randomized decisions are not acceptable in practice. In my view, however, one should be careful in concluding that this necessarily leaves the door open for superior alternatives to Fisher’s exact-type tests. Some such alternatives effectively smuggle in a randomising device. An example can be found in a recent issue of Biometrics (Ivanova and Berger, 2001). From one point of view, any resort to the sample space as a means of deciding between ‘significance’ and ‘non-significance’ that cannot be justified in terms of likelihood considerations is effectively using auxiliary randomisation. In this connection, it is pleasing to see the good frequentist properties of the Bayesian procedure that Wellek investigates, since all Bayesian procedures have likelihood at their heart.

Hung, Wang and O’Neill (2005) from the FDA look at the issue of indirect comparison to placebo using previous data comparing the control treatment to the placebo arm. The set-up is analogous to that of an incomplete blocks experiment with trials, including the current trial, forming the blocks, a point that is made very clearly by Tsong and Zhang (2005) in a companion paper from the FDA. Hung et al. draw attention to the constancy assumption involved in establishing the effect of the treatment compared to placebo. They make an interesting connection between a Bayesian approach and a previously published frequentist one. I am not, however, surprised that the Bayesian approach that they investigate pays no penalty in terms of uncertainty about the constancy assumption, since it implicitly assumes it. This can be seen by noting that as s²_PC0 goes to zero, the uncertainty about the effect of control to placebo in the current trial goes to zero. An alternative approach would use a random-effects meta-analysis to establish the size of the trial-by-control interaction (where ‘control’ in this context would be the control–placebo difference). To estimate this would require a meta-analysis with a number of trials and to apply it would require an unverifiable assumption of exchangeability between past and future trials. Note that this component of variation is estimated more precisely as more trials are included but that its expected size does not decrease with increasing numbers of patients studied, nor even with increasing numbers of trials. The uncertainty about the average difference over all trials may be reduced to zero by increasing the number of trials to infinity, but the uncertainty about the effect in a single given trial will not. In other words, there is a difference between belief about the average effect of all future trials of the control compared to placebo and belief about the corresponding average effect in a given randomly chosen trial, such as the trial comparing the new treatment to the control. (In my view, the formulation in Hung et al. does not make clear the difference between the two.) Note also that if consistency with the rest of drug development is sought, the random variation that may exist between treatment and control from trial to trial is irrelevant. The only thing that matters is the effect in the trial actually studied. To require more than this would be to require more than is currently required of placebo-controlled treatment trials, which only yield, after all, an estimate of the true causal effect in the current trial and not what effect would be seen in other trials.
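The distinction between uncertainty about the average effect and uncertainty about the effect in one further trial can be illustrated numerically. The sketch below (simulated data and parameter values of my own choosing, not from Hung et al.) uses a DerSimonian–Laird random-effects meta-analysis: as the number of trials grows, the standard error of the pooled mean shrinks towards zero, while the predictive standard deviation for a single new trial approaches the between-trial standard deviation τ and does not shrink.

```python
# Illustration: in a random-effects meta-analysis the SE of the AVERAGE
# control-vs-placebo effect shrinks as trials accumulate, but the predictive
# uncertainty for the effect in one further trial is bounded below by the
# between-trial SD tau. Simulated data; DerSimonian-Laird tau^2 estimate.
import numpy as np

rng = np.random.default_rng(42)


def dl_meta(theta_hat, v):
    """DerSimonian-Laird meta-analysis.
    Returns (pooled mean, SE of pooled mean, predictive SD for a new trial)."""
    w = 1.0 / v                                    # fixed-effect weights
    theta_fe = np.sum(w * theta_hat) / np.sum(w)
    Q = np.sum(w * (theta_hat - theta_fe) ** 2)    # heterogeneity statistic
    k = len(theta_hat)
    c = np.sum(w) - np.sum(w**2) / np.sum(w)
    tau2 = max(0.0, (Q - (k - 1)) / c)             # method-of-moments tau^2
    w_re = 1.0 / (v + tau2)                        # random-effects weights
    mu = np.sum(w_re * theta_hat) / np.sum(w_re)
    se_mu = np.sqrt(1.0 / np.sum(w_re))
    sd_new = np.sqrt(tau2 + se_mu**2)              # predictive SD, one new trial
    return mu, se_mu, sd_new


def simulate(k, tau=0.3, v_within=0.01):
    """k trials, true mean effect 1.0, between-trial SD tau, within-trial var v."""
    theta = 1.0 + tau * rng.standard_normal(k)
    theta_hat = theta + np.sqrt(v_within) * rng.standard_normal(k)
    return dl_meta(theta_hat, np.full(k, v_within))


for k in (5, 50, 500):
    mu, se_mu, sd_new = simulate(k)
    print(f"k = {k:3d}: SE(mean) = {se_mu:.3f}, SD(new trial) = {sd_new:.3f}")
```

With 500 trials the pooled mean is known very precisely, yet the predictive standard deviation for the control–placebo effect in the next trial remains close to τ, which is precisely the point made above.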

Tsong and Zhang (2005) consider what changes, if any, need to be adopted when considering superiority and non-inferiority in the ‘cross-trial’ situation. A small but extremely pleasing detail in their paper is worth drawing attention to: they explicitly include a calculation of the simulation error. This simple step is often overlooked by authors but is extremely helpful to the reader. They consider, amongst other matters, the effect of variance heterogeneity on certain testing approaches. They are surely right to flag this. Pooling variance estimates from groups other than the two being compared is not usually worthwhile unless degrees of freedom for error are scarce (c.f. Senn, 2000). The two-group t-test is robust when sample sizes are equal and in any case the equality of variances is part of the conventional null hypothesis usually tested, as pointed out many years ago by Fisher, who, in discussing the way that he had adapted Student’s t-test to the two-sample case, wrote:

“It has been repeatedly stated, . . . that our method involves the ‘assumption’ that the two variances are equal. This is an incorrect form of statement; the equality of variances is a necessary part of the hypothesis to be tested, namely that the two samples are drawn from the same normal population” (Fisher, 1925, pp. 124–125).




However, the situation is more complicated when a hypothesis is being tested, such as that of non-inferiority by some margin, that does not posit strict equality. Also, Fisher’s defence does not apply when variance estimates are drawn from populations that are not involved in the hypothesis being tested. In fact, taking external variance estimates from other trials can be extremely dangerous. Whereas randomisation may, at least in the absence of differential treatment effects, produce similar variances between groups within a single trial, there is nothing in the conduct of trials that forces one trial to be like another as regards degree of homogeneity of patients. The issue is similar to that raised above in discussing the sigma-divided measures used by Lange and Freitag (2005).
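The robustness claim above is easy to verify by simulation. The sketch below (settings entirely mine, chosen for illustration: standard deviations of 3 and 1, nominal two-sided 5% level) shows the pooled t-test holding its level approximately when group sizes are equal, but inflating badly when the smaller group has the larger variance.

```python
# Quick Monte Carlo sketch of the robustness point: the pooled two-sample
# t-test under variance heterogeneity (SDs 3 and 1 here), both means zero,
# so every rejection is a type I error. Illustrative settings only.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2005)
REPS = 20_000


def rejection_rate(n1, n2, sd1=3.0, sd2=1.0, alpha=0.05):
    """Type I error rate of the pooled t-test under the stated variances."""
    x = sd1 * rng.standard_normal((REPS, n1))
    y = sd2 * rng.standard_normal((REPS, n2))
    p = ttest_ind(x, y, axis=1).pvalue    # equal_var=True is the default
    return np.mean(p < alpha)


rate_equal = rejection_rate(25, 25)    # balanced groups: close to nominal 5%
rate_unequal = rejection_rate(10, 40)  # small group, large variance: inflated
print(rate_equal, rate_unequal)
```

With equal group sizes the pooled standard error is, in expectation, the correct one even when the variances differ, which is why the level survives; with the smaller group having the larger variance the pooled standard error is too small and the type I error rate rises severalfold.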

I hope that authors, editors and readers will forgive me that I have only been able to discuss a very few of the issues raised by this interesting series of papers. This issue of the Biometrical Journal will repay careful study. Gudrun Freitag and Axel Munk are to be congratulated on organising the colloquium in Düsseldorf reported in these pages, and they and the editors of the journal on putting together this interesting issue.

References

Bartholomew, D. J. (1961). Ordered tests in the analysis of variance. Biometrika 48, 325–332.

Bristol, D. R. (2005). Superior Safety in Noninferiority Trials. Biometrical Journal 47, 75–81.

Fieller, E. C. (1940). The biological standardization of insulin. Journal of the Royal Statistical Society (Supplement) 1, 1–54.

Fisher, R. A. (1925). Statistical Methods for Research Workers, in: Statistical Methods, Experimental Design and Scientific Inference. Bennet, J. H., Ed., Oxford University, Oxford.

Hauschke, D., Slacik-Erben, R., Hensen, S. and Kaufmann, R. (2005). Biostatistical Assessment of Mutagenicity Studies by Including the Positive Control. Biometrical Journal 47, 82–87.

Hung, H. M. J., Wang, S. J. and O’Neill, R. T. (2005). A Regulatory Perspective on Choice of Margin and Statistical Inference Issue in Non-Inferiority Trials. Biometrical Journal 47, 28–36.

Irwin, J. O. and Van Rest, E. D. (1961). Edgar Charles Fieller, 1907–1960. Journal of the Royal Statistical Society, Series A 124, 275–277.

Ivanova, A. and Berger, V. W. (2001). Drawbacks to integer scoring for ordered categorical data. Biometrics 57, 567–570.

Lange, S. and Freitag, G. (2005). Choice of Delta: Requirements and Reality – Result of a Systematic Review. Biometrical Journal 47, 12–27.

Lindley, D. V. (1998). Decision analysis and bioequivalence trials. Statistical Science 13, 136–141.

Röhmel, J. (2005). Problems with existing procedures to calculate exact unconditional P-values for non-inferiority/superiority and confidence intervals for two binomials and how to resolve them. Biometrical Journal 47, 37–47.

Senn, S. J. (1997). Statistical Issues in Drug Development. John Wiley, Chichester.

Senn, S. J. (1997). Testing for individual and population equivalence based on the proportion of similar responses [letter; comment]. Statistics in Medicine 16, 1303–1306.

Senn, S. J. (2000). Consensus and controversy in pharmaceutical statistics (with discussion). The Statistician 49, 135–176.

Streitberg, B. and Röhmel, J. (1991). Alternatives to Fisher’s exact test? Biometrie und Informatik in Medizin und Biologie 22, 139–146.

Tsong, Y. and Zhang, J. J. (2005). Testing Superiority and Non-Inferiority Hypothesis in Active Controlled Clinical Trials. Biometrical Journal 47, 62–74.

Wellek, S. (2005). Statistical Methods for the Analysis of Two-Arm Non-inferiority Trials with Binary Outcomes. Biometrical Journal 47, 48–61.

