
The Closed Test Procedure

Dr L. Banken, Dr H.U. Burger, Dr S. Kristiansen, M. Pasquier

Hoffmann-La Roche Ltd. Basel

1. Introduction

The closed test procedure introduced by Marcus, Peritz and Gabriel (1976) is one of the most general and efficient test procedures to control the experimentwise or multiple type I error rate when multiple hypotheses have to be tested.

The procedure is based on a set of hypotheses which is closed under intersection. This set can be ordered in a hierarchy of hypotheses, with the elementary hypotheses at the bottom and the global hypothesis at the top, in which every hypothesis is a subset of a hypothesis one level below. Using this hierarchy the closed test procedure is able to control the multiple $\alpha$-level in an efficient way.

The first part of this paper gives a rough introduction to the major ideas of multiple testing and to the approach of the closed test procedure. The second part discusses the implementation (as a macro) within SAS®, with emphasis on how to specify the user interface, how to create the closed set of hypotheses automatically, how to test the hypotheses and how to present the results at the end.

2. Theoretical Background

2.1 THE PROBLEM OF MULTIPLE TESTING

Testing several hypotheses simultaneously differs slightly from dealing with one single hypothesis. Let us recall: a main idea of hypothesis testing is that every level-$\alpha$ test for a single hypothesis controls the type I error rate in the sense that, if the null hypothesis is true, the probability of rejecting it is at most $\alpha$. Historically there has been a strong feeling that a similar concept is needed for handling multiple hypotheses, and that simply applying individual tests at a given ('local') level $\alpha$ is not a good approach, because then, under the assumption that all null hypotheses are simultaneously true, the probability of falsely rejecting at least one of them would exceed the given level $\alpha$. In particular, if all tests are independent, this probability would be $1-(1-\alpha)^k$, where $k$ is the number of hypotheses. One early solution to this problem is the Bonferroni adjustment: for testing $k$ hypotheses, the single tests are performed at the level $\alpha/k$. Another approach is Fisher's LSD test, where the global (null) hypothesis $H_{01} \cap \dots \cap H_{0k}$ is first tested at level $\alpha$, and the single tests at level $\alpha$ are performed only if the global hypothesis could be rejected. This leads to the following definition of controlling the type I error rate in a multiple setup:

Definition 1: A multiple test procedure is said to keep the global level $\alpha$ if the probability of rejecting at least one true null hypothesis is not larger than $\alpha$ when all null hypotheses are simultaneously true.

It is clear that both the Bonferroni procedure and Fisher's LSD test keep the global level. But problems still arise if only a subset of the set of null hypotheses is true. While the false null hypotheses should be rejected, any true null hypothesis should be rejected with probability at most $\alpha$. The same argument as above, restricted to this subset, yields

Definition 2: A multiple test procedure is said to keep the multiple level $\alpha$ if the probability of rejecting at least one true null hypothesis is not larger than $\alpha$, independently of which subset of the set of null hypotheses actually is true.

Note that by definition the global level is kept, too, if the multiple level is kept. It is also clear that the multiple level is the more adequate concept for transferring the approach of significance tests to multiple hypotheses. The Bonferroni procedure keeps the multiple level while Fisher's LSD test does not.
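As a small numerical illustration of the argument in this section, the following Python sketch evaluates $1-(1-\alpha)^k$ for independent tests, with and without the Bonferroni adjustment. The function names are hypothetical and chosen only for this illustration; they are not part of the macro described later.

```python
def familywise_error(alpha: float, k: int) -> float:
    """Probability of at least one false rejection among k independent
    level-alpha tests when all k null hypotheses are true."""
    return 1.0 - (1.0 - alpha) ** k


def bonferroni_level(alpha: float, k: int) -> float:
    """Local level used by the Bonferroni adjustment for k hypotheses."""
    return alpha / k


alpha = 0.05
for k in (1, 2, 6, 10):
    unadjusted = familywise_error(alpha, k)
    adjusted = familywise_error(bonferroni_level(alpha, k), k)
    print(f"k={k:2d}  unadjusted={unadjusted:.3f}  Bonferroni={adjusted:.3f}")
```

For $k = 6$ (the number of pairwise comparisons among four groups) the unadjusted probability is already about 0.26, whereas with the Bonferroni level it stays below 0.05.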

2.2 SOME EXAMPLES

The main examples in which multiple testing problems arise are comparisons of different groups within linear models. Let us assume the following one-way design

$X_{ij} = a_i + \varepsilon_{ij}, \quad i = 1, \dots, k, \quad j = 1, \dots, n_i,$

where the $\varepsilon_{ij}$ are independent, $N(0, \sigma^2)$ distributed random variables and $a_1, \dots, a_k$ represent the different groups to be compared, i.e. the expectations of $X$ within these groups. One of the interesting problems is then the pairwise problem, i.e. the problem of comparing each group with every other group, which means testing the hypotheses $H_0^{ij} = \{a_i = a_j\}$ for all $i < j$.

Let $\bar X_{i.}$ be the mean of $X$ in the $i$-th group, $s^2$ the usual estimate of the variance and $v = k(k-1)/2$. Then, general procedures for this problem are:

• Bonferroni
• Bonferroni-Holm

Special procedures for pairwise comparison in linear models are, e.g.:

• Tukey
• Hochberg
• Tukey-Kramer
• Multiple stage tests


The Bonferroni procedure, as mentioned above, tests at the level $\alpha/v$ instead of $\alpha$.

The Bonferroni-Holm procedure (Holm, 1979) is a further development of the Bonferroni procedure. The hypothesis with the $i$-th smallest p-value is tested at the level $\alpha/(v-i+1)$. The procedure starts with the hypothesis with the smallest p-value, proceeds in ascending order of the p-values and stops the first time a p-value exceeds its level. Only the hypotheses tested before this stop are rejected.
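The step-down rule just described can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions: the function name `holm_reject` and the six p-values are hypothetical and not taken from the paper.

```python
def holm_reject(p_values: dict[str, float], alpha: float) -> set[str]:
    """Return the hypotheses rejected by the Bonferroni-Holm procedure."""
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    v = len(ordered)
    rejected = set()
    for i, (name, p) in enumerate(ordered, start=1):
        # The i-th smallest p-value is compared with alpha / (v - i + 1).
        if p > alpha / (v - i + 1):
            break  # stop at the first p-value exceeding its level
        rejected.add(name)
    return rejected


# Hypothetical p-values for the six pairwise comparisons of four groups.
example = {"(1,2)": 0.001, "(1,3)": 0.012, "(1,4)": 0.009,
           "(2,3)": 0.300, "(2,4)": 0.040, "(3,4)": 0.020}
print(sorted(holm_reject(example, alpha=0.05)))
```

With these numbers the plain Bonferroni procedure (level 0.05/6 for every test) would reject only (1,2), while the step-down rule also rejects (1,4) and (1,3) before it stops at (3,4).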

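The Tukey test (Tukey, 1953) for equal sample sizes $n$ assumes that all $\bar X_{i.}$ are iid. The main idea is then the following equivalence:

$\max_{i,j} |\bar X_{i.} - \bar X_{j.}| = \max_i \bar X_{i.} - \min_i \bar X_{i.}$

With the appropriate standardization $s/\sqrt{n}$, the distribution of $\max_i \bar X_{i.} - \min_i \bar X_{i.}$ is the studentized range distribution and, hence, tabulated. Thus, we get a test of the following form: the hypothesis $H_0^{ij}$ is rejected if

$|\bar X_{i.} - \bar X_{j.}| > q_{k,\nu,\alpha} \, s/\sqrt{n}$

and is not rejected if $|\bar X_{i.} - \bar X_{j.}| \le q_{k,\nu,\alpha} \, s/\sqrt{n}$, where $q_{k,\nu,\alpha}$ denotes the tabulated level-$\alpha$ critical value of the studentized range distribution for $k$ groups and $\nu$ degrees of freedom of the variance estimate.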

Variations of the Tukey test for unequal sample sizes are the GT2 (Hochberg) test (Hochberg, 1974), based on the Sidak inequality and the maximum modulus distribution, and the Tukey-Kramer test (Kramer, 1956), which is similar to the Tukey test but uses another standardization to account for the unequal sample sizes.

Finally, multiple stage procedures (see e.g. Welsch, 1977, or Einot and Gabriel, 1975) look first at the homogeneity of all $k$ means at level $\alpha_k$. In the case of rejection, all subsets of $k-1$ means, i.e. all hypotheses comparing $k-1$ means, are tested at level $\alpha_{k-1}$, and so on. If at any stage the test for a set of means is not significant, the means in every subset of this set are also considered not to differ significantly, and none of these subsets is tested. The different procedures differ in the levels $\alpha_2, \dots, \alpha_k$ and in the test statistics used.

The test of Newman-Keuls uses a studentized range statistic and $\alpha_p = \alpha$. But this procedure does not control the multiple level $\alpha$. (An easy example to see this: assume five very different groups, each consisting of two equal means. Then the probability of at least one false rejection is approximately $1-(1-\alpha)^5$.)

The test of Ryan, Einot, Gabriel and Welsch uses a range or F-test and the adjusted levels $\alpha_p = 1-(1-\alpha)^{p/k}$ if $p < k-1$ and $\alpha_p = \alpha$ if $p \ge k-1$.
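To make the adjusted levels concrete, the following Python sketch evaluates $\alpha_p$ for every stage $p = 2, \dots, k$. The function name `regw_level` is hypothetical and used only for this illustration.

```python
def regw_level(p: int, k: int, alpha: float) -> float:
    """Adjusted level for testing the homogeneity of p out of k means
    in the procedure of Ryan, Einot, Gabriel and Welsch."""
    if p >= k - 1:
        return alpha
    return 1.0 - (1.0 - alpha) ** (p / k)


k, alpha = 4, 0.05
for p in range(k, 1, -1):
    print(f"p={p}  alpha_p={regw_level(p, k, alpha):.4f}")
```

For $k = 4$ and $\alpha = 0.05$ only the stage $p = 2$ uses a reduced level, about 0.0253; the stages $p = 3$ and $p = 4$ are tested at the full level 0.05.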

With the exception of the test of Newman-Keuls all mentioned procedures keep the multiple level. An overview is also available in SAS/STAT User's Guide, 1990.

2.3 THE CLOSED TEST PROCEDURE

For the following discussion of the closed test procedure we use the pairwise problem for the one-way design of section 2.2 with $k=4$ groups. Using the notation (1,2) for the hypothesis $H_0^{12} = \{a_1 = a_2\}$ etc., the pairwise problem discussed above can be written as a set of elementary hypotheses:

{ (1,2), (1,3), (1,4), (2,3), (2,4), (3,4) } .

Based on this set of elementary hypotheses a set that is closed under intersection may be constructed. The resulting closed set of hypotheses then includes the following single hypotheses

{ (1,2,3,4), (1,2,3), (1,2,4), (1,3,4), (2,3,4), (1,2/3,4), (1,3/2,4), (1,4/2,3), (1,2), (1,3), (1,4), (2,3), (2,4), (3,4) },

where e.g. (1,2,4) denotes the hypothesis $H_0^{12} \cap H_0^{14} = \{a_1 = a_2 = a_4\}$ and (1,2/3,4) denotes $H_0^{12} \cap H_0^{34} = \{a_1 = a_2, \; a_3 = a_4\}$.
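The construction of the closed set can be sketched in Python as follows. This is only an illustration under our own assumptions, not the algorithm of the macro described in section 3: a hypothesis is represented as a set of blocks of groups with equal means, and the intersection of two hypotheses merges all overlapping blocks. All function names are hypothetical.

```python
from itertools import combinations


def pair(i, j):
    """Elementary hypothesis (i,j): the means of groups i and j are equal."""
    return frozenset({frozenset({i, j})})


def intersect(h1, h2):
    """Intersection of two hypotheses: the equalities of both must hold,
    so blocks sharing a group are merged into one block."""
    merged = []
    for block in h1 | h2:
        block = set(block)
        disjoint = []
        for other in merged:
            if other & block:
                block |= other
            else:
                disjoint.append(other)
        merged = disjoint + [block]
    return frozenset(frozenset(b) for b in merged)


def closure(elementary):
    """Smallest set of hypotheses containing `elementary` that is closed
    under intersection."""
    closed = set(elementary)
    changed = True
    while changed:
        changed = False
        for h1, h2 in combinations(list(closed), 2):
            h = intersect(h1, h2)
            if h not in closed:
                closed.add(h)
                changed = True
    return closed


def label(h):
    """Notation of the paper, e.g. (1,2/3,4)."""
    blocks = sorted(sorted(b) for b in h)
    return "(" + "/".join(",".join(map(str, b)) for b in blocks) + ")"


pairwise = [pair(i, j) for i, j in combinations(range(1, 5), 2)]
closed = closure(pairwise)
print(len(closed), sorted(label(h) for h in closed))
```

For the pairwise problem with four groups this yields 14 hypotheses: the global hypothesis, four hypotheses on three groups, three hypotheses of the form (1,2/3,4) and the six elementary hypotheses.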

In the closed test procedure every single hypothesis of this set is tested at level $\alpha$ using a suitable test. An elementary hypothesis is rejected only if its own test and the tests of all single hypotheses which are subsets of it are significant. For example, the hypothesis (3,4) is rejected only if the tests for the hypotheses (1,2,3,4), (1,3,4), (2,3,4), (1,2/3,4) and (3,4) are all significant.

Let us assume that a single hypothesis is 'true'. The closed test procedure keeps the multiple level $\alpha$, since every elementary hypothesis which is a superset of this 'true' hypothesis can only be rejected if the level-$\alpha$ test of this 'true' hypothesis is significant.

A simple way to perform the closed test procedure is to build up a hierarchy within the set of single hypotheses where tests are connected which are direct sub- and supersets of each other:

Closed test procedure - 4 groups - Pairwise problem

(1,2,3,4)

(1,2,3) (1,2,4) (1,3,4) (2,3,4) (1,2/3,4) (1,3/2,4) (1,4/2,3)

(1,2) (1,3) (1,4) (2,3) (2,4) (3,4)

Then, the closed test procedure for this test problem can be performed in the following way: For every single hypothesis choose a suitable level-$\alpha$ test. The single hypotheses are tested in the order of the hierarchy, starting with the hypothesis on the highest level. A single hypothesis is tested only if all hypotheses connected to it one level above are significant. Otherwise it is not tested and is regarded as nonsignificant.
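The top-down rule above can be sketched as follows for a fixed level $\alpha$. The sketch uses the neighborhood hierarchy on four groups (elementary hypotheses (1,2), (2,3), (3,4)), and the p-values are hypothetical and chosen only for illustration; the helper names `H`, `implies`, `rank` and `closed_test` are our own.

```python
def H(*blocks):
    """Hypothesis as a set of blocks of groups with equal means,
    e.g. H((1, 2), (3, 4)) for (1,2/3,4)."""
    return frozenset(frozenset(b) for b in blocks)


def implies(g, h):
    """True if every equality stated by h also holds under g,
    i.e. g is a subset of h in the sense of the hierarchy."""
    return all(any(block <= other for other in g) for block in h)


def rank(h):
    """Number of independent equalities; a higher rank is a higher level."""
    return sum(len(block) - 1 for block in h)


def closed_test(p_values, alpha):
    """Test top-down; a hypothesis is tested only if all hypotheses
    above it in the hierarchy have been rejected."""
    rejected = set()
    for h in sorted(p_values, key=rank, reverse=True):
        above = [g for g in p_values if g != h and implies(g, h)]
        if all(g in rejected for g in above) and p_values[h] <= alpha:
            rejected.add(h)
    return rejected


# Hypothetical p-values for the closed set of the neighborhood problem.
p_values = {
    H((1, 2, 3, 4)): 0.001,
    H((1, 2, 3)): 0.200, H((2, 3, 4)): 0.004, H((1, 2), (3, 4)): 0.010,
    H((1, 2)): 0.030, H((2, 3)): 0.020, H((3, 4)): 0.002,
}
rejected = closed_test(p_values, alpha=0.05)
```

In this example only the elementary hypothesis (3,4) is rejected. The hypotheses (1,2) and (2,3) have local p-values below 0.05 but are not tested at all, because (1,2,3) one level above them is not significant.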

The Closed Test procedure has some special features:

• One can combine different level-$\alpha$ tests for testing the single hypotheses. The choice of those tests has an influence on the efficiency of the closed test procedure but not on keeping the type I error level. In a parametric setup like the one above, F-tests for linear hypotheses or contrasts could be used (with a nearly arbitrary model design, as in Proc GLM of SAS), or nonparametric procedures like the Kruskal-Wallis test or Jonckheere's test could be applied.
• The principle of the closed test procedure does not depend on a special testing problem. It only assumes a closed set of hypotheses and uses the natural order therein.

A further example is the neighborhood (trend) problem, where only adjacent groups are compared. It is defined by the set of elementary hypotheses

{ (1,2), (2,3), (3,4) }.

Another example is the one-to-many problem, i.e. the comparison of one special group with all others, with the elementary hypotheses

{ (1,2), (1,3), (1,4) }.

The corresponding closed sets of hypotheses are

CLOSED TEST PROCEDURE - 4 GROUPS - NEIGHBORS

(1,2,3,4)

(1,2,3) (1,2/3,4) (2,3,4)

(1,2) (2,3) (3,4)

and

CLOSED TEST PROCEDURE - 4 GROUPS - MANY-TO-1

(1,2,3,4)

(1,2,3) (1,2,4) (1,3,4)

(1,2) (1,3) (1,4)

Again, known efficient procedures can be used for testing the single hypotheses within the closed test procedure.

Remarks (Comparison to other multiple test procedures):

1. In the literature (e.g. SAS/STAT User's Guide, 1990) the closed test procedure is reported to be more powerful than many other multiple test procedures, including the multiple stage methods.

2. Classical procedures for the test problems above, like the procedures of Tukey, Scheffé, Tukey-Kramer and Hochberg, use special model assumptions (normal distribution). Nonparametric analogues do not exist. Additionally, they are appropriate only for one special test problem like the pairwise problem. The procedures of Bonferroni and Bonferroni-Holm and the closed test procedure are based on general methods which are in principle independent of model assumptions. In particular, nonparametric analogues exist. Furthermore, these tests could in principle be used for all three test problems mentioned above and for many others. For more information we refer to Bauer (1991).

3. Some of the classical procedures including multiple stage methods are implemented in SAS/STAT®, proc GLM, mainly for the pairwise problem. (One procedure is given for the One-to-many case). Tests for the Trend and other problems, especially the nonparametric ones, are not available. Closed test procedures are missing completely.

3. The Implementation within SAS

In the following we describe an implementation of the closed test procedure as a SAS macro, referred to as the CTP-macro.

3.1 THE GENERAL CONCEPT

The principle of the closed test procedure can be applied to many different testing problems, e.g. problems comparing more than two treatments within a clinical trial or problems dealing with more than one trial. We now focus on one major application, i.e. on handling two-sided problems of comparing different treatments or groups, as already introduced above. We generally assume a linear model with one classification variable whose levels have to be compared in some way. But we keep the design as flexible as possible and do not specify the kind of (elementary) hypotheses we are interested in.

There are several problems in implementing a closed test procedure. First, we have to handle the stopping rules within the procedure. One approach would be to program the procedure in such a way that it actually stops testing according to the hierarchy after receiving nonsignificant p-values. But this would mean, among other things, that the user has to specify the $\alpha$-level in advance. A better way is to calculate maximal p-values. If, e.g. in the setup of the first example in section 2.3, $p_{(1,2,3,4)}$, $p_{(1,2,3)}$, $p_{(1,2)}$ etc. denote the p-values of the hypotheses (1,2,3,4), (1,2,3), (1,2) etc., then the maximal p-values $p^M_{(1,2,3,4)}$ etc. are defined by

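$p^M_{(1,2,3,4)} = \max\{ p_{(1,2,3,4)} \}$

$p^M_{(1,2,3)} = \max\{ p_{(1,2,3)}, \; p^M_{(1,2,3,4)} \}$

$p^M_{(1,2)} = \max\{ p_{(1,2)}, \; p^M_{(1,2,3)}, \; p^M_{(1,2,4)}, \; p^M_{(1,2/3,4)} \}$

and so on for the remaining hypotheses.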

The stopping rule now corresponds to the fact that maximal p-values exceeding a specified $\alpha$-level are carried forward to all corresponding hypotheses below, such that their maximal p-values also exceed the $\alpha$-level and, hence, these hypotheses cannot be rejected either. However, note that these maximal p-values do not have a real meaning in the way the p-value of a single test has. Hence, every user ...

Another topic is how to report these maximal p-values. One way to do that is to use mainly graphics (with an additional table) which visualize the hierarchy of the corresponding hypotheses, as shown in the examples above. Additionally, graphics give good insight into the results and allow better control of them; see the example in the last section.

The CTP-macro deals on the one hand with several types of hypotheses

• Pairwise problems
• One-to-many problems
• Neighborhood (trend) problems
• User-defined problems

and on the other hand it uses several test procedures for handling the different hypotheses

• F-tests for linear contrasts in linear models with normally distributed variables
• Kruskal-Wallis tests in a nonparametric setup

Depending on the model assumptions, the testing problems and the tests can be combined in every possible way. The F-tests and the Kruskal-Wallis test are already implemented. It is planned to add a chi-square test and other nonparametric procedures, especially for neighborhood problems, in the near future.

3.2 THE STRUCTURE AND THE MAIN IDEAS OF THE CTP-MACRO

For validation, maintenance and further development a good structure of the macro is necessary. Using the macro facility of SAS, the macro is split up into several (sub-)macros assigned to the different logical steps within the procedure:

1. Scanning the table description file
2. Creating the hierarchy of hypotheses
3. Testing the hypotheses, either by
   a. creating the set of contrast statements and applying Proc GLM, or
   b. creating the different analysis data sets, applying another procedure like Proc NPAR1WAY and scanning the output
4. Calculating the maximal p-value
5. Reporting the results in a table
6. Reporting the results in graphical form

The first step within the algorithm is to generate the elementary and combined hypotheses automatically. For the three examples mentioned, the elementary hypotheses are created by the program after the type of hypotheses has been specified by one single item. For other test problems the set of elementary hypotheses has to be specified by the user in a SAS data set. Then, on the basis of this set of elementary hypotheses, the whole hierarchy is again prepared automatically by combining every hypothesis on one level with an elementary hypothesis to obtain a hypothesis one level above. For this, mainly the data step facility and proc SQL are used.

In a parametric setup, the hypotheses are tested using the contrast facility of proc GLM. Therefore, the contrast statements corresponding to the single hypotheses within the hierarchy have to be created in advance. These statements are then written to an ASCII file and included within the call of proc GLM. The nonparametric version applying the Kruskal-Wallis test is based on proc NPAR1WAY. Therefore, the data set has to be modified within a loop, according to the different hypotheses, before calling proc NPAR1WAY. (But NPAR1WAY has some known drawbacks here: it only provides the user with asymptotic tests, and it does not write its results to an output file, so the output has to be scanned to obtain the results.) For implementing other test procedures, the macro containing the NPAR1WAY call could easily be replaced by another one containing the new procedure.
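How contrast statements for the single hypotheses might be generated is sketched below in Python. The paper does not show the statements produced by the macro, so this is an assumption-laden illustration: the function names are hypothetical, the levels of the grouping variable are assumed to be coded 1, 2, ... in sort order, and only coefficients for the main effect are written. In a model with interaction terms further coefficients may be needed for a contrast to be estimable.

```python
def contrast_rows(blocks, n_levels):
    """One coefficient row per independent equality: within each block,
    consecutive levels are set equal (e.g. 1 -1 0 0 for levels 1 and 2)."""
    rows = []
    for block in blocks:
        levels = sorted(block)
        for a, b in zip(levels, levels[1:]):
            row = [0] * n_levels
            row[a - 1], row[b - 1] = 1, -1
            rows.append(row)
    return rows


def contrast_statement(blocks, n_levels, effect="Treat"):
    """Text of a PROC GLM CONTRAST statement for one hypothesis.
    Multiple rows are separated by commas and tested jointly."""
    label = "/".join(" ".join(str(g) for g in sorted(block)) for block in blocks)
    rows = ", ".join(
        f"{effect} " + " ".join(str(c) for c in row)
        for row in contrast_rows(blocks, n_levels)
    )
    return f"contrast '{label}' {rows};"


print(contrast_statement([(1, 2, 3)], 4))
print(contrast_statement([(1, 2), (3, 4)], 4))
```

The two calls produce `contrast '1 2 3' Treat 1 -1 0 0, Treat 0 1 -1 0;` and `contrast '1 2/3 4' Treat 1 -1 0 0, Treat 0 0 1 -1;`, i.e. one joint F-test with two degrees of freedom per hypothesis.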

In the nonparametric setup, combined hypotheses like (1,2/3,4) are handled by Fisher's combination test: the single hypotheses are tested separately and the resulting p-values are combined into a new one, using the fact that under the null hypothesis $-2 \ln p$ follows a chi-square distribution with two degrees of freedom for each p-value. In the parametric setup such an approach is not necessary, because there combined hypotheses can again be formulated as contrast statements.
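A minimal Python sketch of Fisher's combination is given below. It assumes independent p-values: the sum of the $m$ terms $-2 \ln p_i$ is then chi-square distributed with $2m$ degrees of freedom under the null hypothesis, and for an even number of degrees of freedom the tail probability has a closed form, so no statistics library is needed. The function name and the two input p-values are hypothetical.

```python
import math


def fisher_combination(p_values):
    """Combine m independent p-values: X = -2 * sum(ln p_i) is chi-square
    distributed with 2m degrees of freedom under the null hypothesis."""
    m = len(p_values)
    half = -sum(math.log(p) for p in p_values)  # X / 2
    # Upper tail of the chi-square distribution with 2m degrees of freedom:
    # exp(-X/2) * sum_{j<m} (X/2)^j / j!
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(m))


# Hypothetical p-values of the separate tests for (1,2) and (3,4).
print(round(fisher_combination([0.03, 0.002]), 6))
```

For two p-values the formula reduces to $p_1 p_2 (1 - \ln(p_1 p_2))$, and for a single p-value it returns that p-value unchanged.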

In the next step then, on the basis of the resulting p-values the maximal p-values are calculated within an algorithm making use of proc SQL again. After that the results are ready to be reported in the last step of the macro.

3.3 THE INTERFACE TO THE USER: INPUT AND OUTPUT

For running the closed test procedure the following information has to be handed over:

• the name of the analysis data set and the name of the grouping variable
• the model underlying the procedure and the test methods to be used
• the labels for the different values of the grouping variable
• headers and footers for the presentation of the results

This information is given partly in a table description file and partly by macro parameters. The table description file contains the information concerning headers and footers, the underlying model and the grouping variable, whereas information about the data sets and special options is put into macro variables.

Example: This example deals with an analysis of variance problem with the dependent variable X, the grouping variable Treat, an additional classification variable Centre and interactions between them. The data set is given by

OBS  DUMMY  TREAT  CENTRE    X  PNO

  1      1      1       1  5.4    1
  2      2      1       2  5.6    2
  3      3      1       1  5.5    3
  4      4      1       2  5.8    4
  5      5      2       1  6.1    5
  6      6      2       2  5.8    6
  7      7      2       1  5.5    7
  8      8      2       2  5.9    8
  9      9      3       1  5.4    9
 10     10      3       2  5.3   10
 11     11      3       1  5.7   11
 12     12      3       2  5.6   12
 13     13      4       1  5.8   13
 14     14      4       1  5.9   14
 15     15      4       2  6.1   15
 16     16      4       2  6.3   16

The information on the underlying analysis of variance model, titles, footnotes and labels is introduced by the following table description file:

H The Closed Test Procedure
H The Trend Problem
H -------------------------

M Class Treat centre;
M Model X=Treat centre Treat*centre;

R Treat | Treatment
F 1 Placebo
F 2 Treatment 1
F 3 Treatment 2
F 4 Treatment 3

F -----------------------
F APST.CTP.CTPUSER

Here, H and F in the first position identify headers and footers, R gives the name of the grouping variable plus its label (put after the bar), and M defines SAS source code specifying the statistical model. The F lines following the R line specify the labels for the levels of the grouping variable.

This table description has to be written to an ASCII file. Then, the closed test procedure is called by

%CTP(COMP=T, DATA=db.proglm4, LIB=LIB1, GCAT=GCT1,
     GSFN=CSKAD5.GRAPH.OUTPUT(GRAPH1));

Here, COMP specifies the underlying testing problem (trend, one-to-many or pairwise), DATA refers to the analysis data set, LIB to a SAS library in which the graphics are stored, GCAT to a graphics catalog and GSFN to a file to which the graphics code (PostScript or HPGL) of the produced graphics can be written.


The produced output: In a first step, the different kinds of p-values are simply listed using proc report. This listing contains

• a short description of the single hypotheses (the notation of the contrast statements in PROC GLM is used)
• the unadjusted p-value of the hypotheses
• the corresponding multiple p-value of the hypotheses

Furthermore, as an option it is possible to print the complete output of proc GLM or proc NPAR1WAY as a control. This could be necessary for solving problems, e.g. if some hypotheses are not estimable within the parametric setup.

THE CLOSED TEST PROCEDURE
THE TREND PROBLEM

Hypotheses   Unadjusted p-value   Multiple p-value

1 2          0.117                0.117
2 3          0.052                0.116
3 4          0.006                0.022
1 2 3        0.116                0.116
2 3 4        0.018                0.022
1 2 3 4      0.022                0.022
1 2/3 4      0.011                0.022

Legend: 1 = Placebo 2 = Treatment 1 3 = Treatment 2 4 = Treatment 3

APST.CTP.CTPUSER
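The multiple p-values of this listing can be reproduced from the unadjusted ones with the recursive maximum rule of section 3.1. The following Python sketch does so for the trend problem; the dictionary `above`, which lists for every hypothesis the hypotheses connected to it one level above, and the function name are our own and do not reflect the proc SQL implementation of the macro.

```python
# Unadjusted p-values taken from the listing above.
p = {
    "1 2 3 4": 0.022,
    "1 2 3": 0.116, "2 3 4": 0.018, "1 2/3 4": 0.011,
    "1 2": 0.117, "2 3": 0.052, "3 4": 0.006,
}

# Hypotheses connected to each hypothesis one level above in the hierarchy.
above = {
    "1 2 3 4": [],
    "1 2 3": ["1 2 3 4"],
    "2 3 4": ["1 2 3 4"],
    "1 2/3 4": ["1 2 3 4"],
    "1 2": ["1 2 3", "1 2/3 4"],
    "2 3": ["1 2 3", "2 3 4"],
    "3 4": ["2 3 4", "1 2/3 4"],
}


def maximal_p(h):
    """Maximum of the own p-value and the maximal p-values one level above."""
    return max([p[h]] + [maximal_p(g) for g in above[h]])


multiple = {h: maximal_p(h) for h in p}
for h in ("1 2", "2 3", "3 4", "1 2 3", "2 3 4", "1 2 3 4", "1 2/3 4"):
    print(f"{h:10s} {p[h]:.3f} {multiple[h]:.3f}")
```

For example, the hypothesis 2 3 has the unadjusted p-value 0.052 but inherits 0.116 from 1 2 3, while 3 4 inherits 0.022 from the global hypothesis.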

The graphical output based on SAS/GRAPH® makes the relations between unadjusted and multiple p-value used by the CTP-procedure more transparent. It repeats the hierarchy of the hypotheses with the unadjusted and multiple p-values in it.

The decisive p-value is the multiple p-value. In addition, the unadjusted p-value is given in brackets. Furthermore, as an option, non-significant hypotheses can be flagged according to the stopping rule of the closed test procedure.

[Figure: THE CLOSED TEST PROCEDURE, THE TREND PROBLEM. Hierarchy of the hypotheses, each node showing the multiple p-value with the unadjusted p-value in brackets, e.g. 1234: p=0.022 (p=0.022); 12: p=0.117 (p=0.117); 23: p=0.116 (p=0.052); 34: p=0.022 (p=0.006). Legend: p=0.xxx = p-value adjusted for multiple testing; (p=0.xxx) = unadjusted p-value; 1 = Placebo, 2 = Treatment 1, 3 = Treatment 2, 4 = Treatment 3. Footer: APST.CTP.CTPUSER]

References:

- P. Bauer (1991) 'Multiple Testing in Clinical Trials', Statistics in Medicine 10, 871-90.

- I. Einot, K.R. Gabriel (1975) 'A study of the powers of several methods of multiple comparisons', JASA 70, 351.

- Y. Hochberg (1974) 'Some conservative generalizations of the T-method in simultaneous inference', Journal of Multivariate Analysis 4, 224-34.

- S. Holm (1979) 'A simple sequentially rejective multiple test procedure', Scandinavian Journal of Statistics 6, 65-70.

- C.Y. Kramer (1956) 'Extension of multiple range tests to group means with unequal numbers of replications', Biometrics 12, 307-10.

- R. Marcus, E. Peritz, K.R. Gabriel (1976) 'On closed testing procedures with special reference to ordered analysis of variance', Biometrika 63, 655-60.

- SAS (1990), SAS/STAT User's Guide, Version 6, fourth edition, SAS Institute Inc., Cary.

- J.W. Tukey (1953) 'The problem of multiple comparisons', unpublished manuscript.

- R.E. Welsch (1977) 'Stepwise multiple comparison procedures', JASA 72, 359.

SAS, SAS/STAT and SAS/GRAPH are registered trademarks of SAS Institute Inc., Cary, NC, USA.
