
Strategies for Improving Precision in Group-Randomized Experiments

Stephen W. Raudenbush, University of Chicago

Andres Martinez and Jessaca Spybrook, University of Michigan

Interest has rapidly increased in studies that randomly assign classrooms or schools to interventions. When well implemented, such studies eliminate selection bias, providing strong evidence about the impact of the interventions. However, unless expected impacts are large, the number of units to be randomized needs to be quite large to achieve adequate statistical power, making these studies potentially quite expensive. This article considers when and to what extent matching or covariance adjustment can reduce the number of groups needed to achieve adequate power and when these approaches actually reduce power. The presentation is nontechnical.

Keywords: group-randomized experiments, multilevel research design

Educational Evaluation and Policy Analysis, March 2007, Vol. 29, No. 1, pp. 5–29. DOI: 10.3102/0162373707299460. ©2007 AERA. http://eepa.aera.net

Authors' note: The work reported here was supported by the grant “Building Capacity for Evaluating Group-Level Interventions,” sponsored by the William T. Grant Foundation. We are especially grateful to Bob Granger and Ed Seidman of the William T. Grant Foundation for their advice and encouragement. Special thanks also to Howard Bloom and Xiaofeng Liu for their consultation and advice on the issues discussed here and to three anonymous reviewers for valuable advice and careful reading of previous drafts.

INTEREST has rapidly increased in studies that randomly assign social units to alternative treatment conditions. Such units include whole schools (Borman et al., 2005; Cook, Hunt, & Murphy, 2000; Flay, 2000; Mosteller, Light, & Sachs, 1996; Porter, Blank, Smithson, & Osthoff, 2005), housing projects (Bloom & Riccio, 2005; Sikkema, 2005), neighborhoods or whole communities (Hannan, Murray, Jacobs, & McGovern, 1994; Sherman & Weisburd, 1995; Teruel & Davis, 2000; Weisburd, 2000, 2005), and physician practices (Donner & Klar, 2000; Grimshaw, Eccles, Campbell, & Elbourne, 2005; Leviton & Horbar, 2005; Murray, 1998). During the 1980s, the U.S. Department of Labor began supporting randomized trials as a means for understanding which employment and training programs effectively increased earnings. Similarly, during the 1990s, the U.S. Department of Health and Human Services started supporting randomized trials to determine which drug prevention programs and other risk reduction programs were effective (Bloom, 2005; Mosteller & Boruch, 2002). In 2002, with the creation of the new Institute of Education Sciences, the U.S. Department of Education established the conduct of randomized trials as a priority (Education Sciences Reform Act, 2002). In fact, most of these studies treat whole classrooms or schools rather than students as the units of randomization.

The decision to assign groups rather than individuals to treatments is essential when interventions are designed to treat entire collectives of persons. This is true, for example, in whole-school reform efforts designed to engage all of the teachers in a school in a joint effort to improve instruction (Borman et al., 2005) or when a violence prevention program is intended to operate by changing the normative climate in an entire school (Flay, 2000). Group-based trials are also often preferred when there is a danger of “spillover effects” or when logistical, ethical, or political considerations discourage person-based randomization (Bloom, 2005). Cook (2005) suggested that by targeting the group level, an intervention may improve group-level processes, resulting in greater long-term impacts on individuals who come into contact with the group.

In most group-based studies, the statistical power to detect treatment effects depends more strongly on the number of groups available than on the number of persons per group (Bloom, Bos, & Lee, 1999; Donner & Klar, 2000; Murray, 1998; Raudenbush, 1997). At the same time, the number of groups to be sampled will also typically drive costs. For example, in a school-based intervention, it is far more expensive to recruit a single school and to sustain that school’s engagement than it is to sample and assess an additional student once a school has agreed to participate. Thus, the number of groups drives cost and power more than the total sample size of students.

An evaluation of statistical power for a typical group-randomized study can create the impression that the study will be exceedingly expensive. Consider, for example, a hypothetical study in which J groups are to be assigned either to an experimental or to a control condition (J/2 groups in each), with n = 100 persons per group assessed after the treatment on a continuous outcome variable. Assume that the true mean difference between the two groups is equivalent to 0.25 on the scale of the outcome standard deviation. Such effect sizes (ESs) are widely regarded as nonnegligible in education research focused on academic achievement (Bloom, 2005). Finally, assume that 85% of the variation lies between persons within groups, with the remaining 15% lying between groups within treatment conditions. This variance ratio is common in school-based research, particularly for achievement outcomes in elementary schools (Bloom, Richburg-Hayes, & Black, 2005). The data will be analyzed by means of a simple t test using group means as the outcomes. Under these conditions, 82 schools (41 per condition) must be sampled to achieve statistical power of 0.80 to reject the null hypothesis of no program effect at the 5% level of significance (Raudenbush, 1997). In many research settings, the cost of studying 82 groups will be daunting. For example, in a study of whole-school reform, the cost of recruiting 82 schools, implementing the intervention in half of them, sustaining the involvement of those schools in data collection, and dispatching data collectors can be expected to be many millions of dollars. Given current research funding, such per study costs would severely limit the number of such studies and therefore limit the prospects of a plan to rely on school-based randomized studies for information about school improvement.
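To make this calculation concrete, the sketch below reproduces it under the standard two-level formulation implied here: a balanced two-arm trial analyzed by a t test on group means with J – 2 degrees of freedom, where the variance of one group's mean on a standardized scale is ICC + (1 – ICC)/n. The function name cr_power is ours; this is an illustration, not the authors' Optimal Design software.

```python
# A minimal sketch of the CR power calculation (ours, not the authors'
# Optimal Design software): a balanced two-arm trial analyzed by a t test
# on group means, with J - 2 degrees of freedom.
from scipy import stats

def cr_power(J, n, es, icc, alpha=0.05):
    """Two-tailed power for the completely randomized group design."""
    var_group_mean = icc + (1 - icc) / n   # standardized variance of one group mean
    se = (4 * var_group_mean / J) ** 0.5   # SE of the treatment-control difference
    ncp = es / se                          # noncentrality parameter
    df = J - 2
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)

# The hypothetical study above: J = 82, n = 100, ES = 0.25, ICC = .15.
print(round(cr_power(J=82, n=100, es=0.25, icc=0.15), 2))  # approximately 0.80
```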

In the hypothetical example above, no attempt was made to identify or use prior information that might have predicted participant outcomes. Yet it is well known that such information can often be ascertained at comparatively low cost and that when this information is used effectively, the sample-size requirements for experiments can be reduced, sometimes substantially. Given the high cost of sampling groups, it becomes essential to find efficient experimental designs and analytic approaches that capitalize on prior information to increase statistical power.

Early in the past century, the pioneers of experimental design in agriculture learned that pretreatment information can often be exploited to improve statistical power (Cochran, 1957; Fisher, 1926, 1936, 1949). Two key approaches are available:

1. Prerandomization blocking: Prior to treatment assignment, experimental units can be classified into subclasses called blocks, such that within each block, the units are likely to be similar on the outcome. Units are then assigned to treatment conditions within blocks. In this way, variation between blocks is eliminated from the assessment of experimental error. If the blocking is highly successful, such that variation between blocks on the outcome far exceeds variation within blocks, the number of experimental units required to achieve a given level of power can be dramatically smaller than the number of units required without such blocking.

2. Covariance adjustment: Once again, prior to randomization, the researcher identifies characteristics of units that strongly predict the outcome. More specifically, the researcher obtains data on a covariate (i.e., a pretreatment variable), typically measured on a continuous scale. After implementing the treatment and assessing outcomes, the researcher specifies a linear statistical model in which the covariate and indicators of treatment group membership are predictors. If the covariate has a strong linear association with the outcome, and if this association is similar within each treatment condition, the analysis can achieve much higher power than an analysis omitting the covariate. As a result, using a well-chosen covariate can greatly reduce the number of units needed to achieve a given power.

In sum, both approaches use prior information to reduce uncertainty about outcomes. Prerandomization blocking builds this information into the design such that randomization occurs within blocks. In contrast, analysis of covariance (ANCOVA) exploits the linear association between a covariate and the outcome in the analysis phase of the study.

The key aim of the current article is to clarify the conditions under which blocking and covariance adjustment can reduce the number of groups required to achieve adequate power in group-randomized studies. This effort may be regarded as the extension of classic Fisherian principles of experimental design to settings in which social units (“groups”), rather than individuals, are the units of randomization and treatment. Throughout the discussion, we draw on the key principles at play when persons are the unit of randomization and show how they extend nicely to the case of group-randomized studies, with certain key modifications that we shall highlight.

This is not the first article to consider such issues. Several authors have provided useful accounts of the trade-offs that arise in selecting alternative designs and analytic methods in group-randomized trials (for reviews, see Bloom, 2005; Donner & Klar, 2000; Hughes, 2005; Martin, Diehr, Perrin, & Koepsell, 1993; Murray, 1998; Raudenbush, 1997). Our aim is to provide a precise specification of how the power of two alternative approaches, blocking and ANCOVA, compares with the power of a standard design that does not exploit pretreatment information. We make comparisons for specific sample sizes and parameter values. We hope to enable readers to use these results and the freely available software we describe to improve the planning of group-randomized studies (Raudenbush, Liu, Spybrook, Martinez, & Congdon, 2006).¹

For simplicity, we focus on the case in which the aim is to compare two treatments: a novel “experimental” approach and a more standard “control” approach. In this case, blocks each consisting of two units may be formed. These blocks are called matched pairs. We consider the question of when and to what extent such matching will boost the statistical power for a given sample size or reduce the sample size needed to achieve a given level of power. We also consider the utility of ANCOVA in such a two-treatment setting.

In this setting, the aim is to provide sharp answers to specific questions about trade-offs in selecting approaches to design and analysis. We ask the following questions:

1. Under what conditions and to what extent will pretreatment matching reduce the number of groups needed to achieve adequate power to detect an ES of interest?

2. Under what conditions and to what extent will covariance adjustment reduce the number of groups needed to achieve adequate power to detect such an effect?

3. Suppose that a continuous pretreatment variable is available. Such a variable might be used either to form matched pairs or as a covariate in ANCOVA. Specifically, under what conditions and to what extent will matching approximate the efficiency of ANCOVA?

We restrict our attention to “two-level designs” (e.g., persons within groups, with groups as the unit of randomization) and draw on the literature from “single-level designs” (e.g., persons as units of randomization). We reserve a brief discussion of more complex designs (e.g., students within classrooms within schools that are assigned to treatments, repeated-measures designs) for a concluding section, noting that the software described can handle a number of more complex designs.

This article is organized as follows. In the next section, we consider matching in the context of group-randomized designs, addressing Question 1 above. In the following section, we turn to covariance adjustment (Question 2). Next, we compare matching and covariance adjustment (Question 3). A concluding section summarizes the results, provides guidance for planning new studies, and briefly explores more complex designs.

Matching Versus Not Matching

FIGURE 1. Completely randomized (CR) design versus matched-pairs (MP) design, intracluster correlation (ICC) = .20. Note. ES = effect size; ESV = ES variability.

The potential benefits associated with matching in cases in which individual participants are randomized to treatment groups have been a topic of methodological research for many decades (Cochran, 1957; Federer, 1955; Fisher, 1926, 1936, 1949; Kempthorne, 1952). Excellent summaries can be found in classic texts on experimental design (see, e.g., Kirk, 1982). The basic principle of blocking (of which matching is a special case) was elucidated by Fisher (1926) in the context of agricultural research. He criticized a simple experiment in which plots of land were assigned at random to receive one of five varieties of seed to assess the causal effect of such varieties on plant growth:

On most land, however, we shall obtain a smaller standard error, and consequently a more valuable experiment, if we proceed otherwise. The land is divided first into seven blocks, which, for the present purpose, should be as compact as possible; each of these blocks is divided into five plots, and these are assigned in this case to five varieties, independently, and wholly at random. If this is done, those components of soil heterogeneity which produce differences in fertility between plots of the same block will be completely randomized, while those components which produce differences in fertility between blocks will be completely eliminated. (p. 509)

Blocking is not always helpful, however. If little variation lies between blocks, blocking can actually reduce power because it entails a loss of degrees of freedom. Even when blocking does little to boost power, it can be useful because of a potential increase in the “face validity” of an experiment. We discuss these issues below in the context of person-randomized trials before turning to the main focus of this section: matching in group-randomized trials.

Loss of Degrees of Freedom

In a completely randomized (CR) design, individual participants are randomized to treatment groups, and the treatment group means are compared on an outcome variable. There is no attempt to use prior information to improve the efficiency of the design. Because only two quantities (the two group means) need to be computed to obtain an ES, the available degrees of freedom are M – 2, where M is the total number of participants.

In contrast, suppose that prior to randomization, the individuals are rank ordered on the basis of some information X collected beforehand. Next, the experimenters match the individuals such that the first two in the ranking constitute Pair 1, the next two constitute Pair 2, and so on. Within each pair, individuals are then assigned at random to treatment and control conditions, the treatment conditions are implemented, and an estimate of the treatment effect is obtained from each pair. In this case, prior information on the individuals is explicitly embedded in the design. The general name for such a design is a randomized block design; in the special case in which each block contains two units, it is called a matched-pairs (MP) design. The number of quantities computed here is (M/2) + 1 (a treatment effect for each pair and an average treatment effect), and the available degrees of freedom are then M – [(M/2) + 1] = (M/2) – 1, half those in the CR design. When the overall sample size M is small, losing degrees of freedom can substantially increase the critical value of the t test. So if matching is ineffective, this loss of degrees of freedom will reduce power. This penalty becomes negligible as M increases.
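A small illustration of this degrees-of-freedom penalty, using standard two-tailed t critical values at α = .05:

```python
# With M = 20 participants, the CR design leaves 18 df while the MP design
# leaves (20/2) - 1 = 9 df, raising the critical value of the t test.
from scipy import stats

M = 20
for design, df in [("CR", M - 2), ("MP", M // 2 - 1)]:
    print(design, df, round(stats.t.ppf(0.975, df), 3))
# CR 18 2.101
# MP 9 2.262
```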

Improving Face Validity

The foregoing text may suggest that if the sample size is small, matching is not a good strategy. However, this is not necessarily true. A benefit of matching other than its potential contribution to power involves face validity. For example, an experimenter may find pairs of individuals who have the same gender and ethnicity and who are also close on X. By matching also on gender and ethnicity, the experimenter ensures that after randomization, the two treatment groups will be identical with respect to their gender and ethnic compositions. In contrast, the CR design allows chance differences between the two treatment groups on these socially and perhaps politically salient variables. Such chance differences are unlikely to be large if M is large, but when M is modest or small, embarrassing chance differences in gender and ethnic composition between experimental participants and controls might arise, producing a comparison between groups that look different on salient variables.


Matching in Group-Randomized Studies

Many of the key findings in the classic literature on person-randomized studies can be extended to designs in which whole groups (e.g., classrooms) are the unit of randomization. In the context of group-randomized studies, the unit of the outcome variable (the individual) is nested within the unit of randomization (the group). Because of this nesting, the calculation of the power to detect treatment effects is slightly more complicated. To clarify how the MP design compares with the CR design in the context of group-randomized studies, consider a simple setting with two treatment conditions.

FIGURE 2. Completely randomized (CR) design versus matched-pairs (MP) design, intracluster correlation (ICC) = .10. Note. ES = effect size; ESV = ES variability.

Example

Suppose a researcher is planning to assess the effectiveness of a new schoolwide literacy program (the treatment condition) to be implemented in a number of schools in a given large city. A single classroom containing 20 students will provide data from each school. The schools under consideration vary considerably in mean achievement and are quite segregated ethnically: About two thirds of the schools serve predominately minority students, while the remaining schools enroll mainly non-Hispanic White children. The researcher wonders how many schools will be needed to achieve power to detect a given standardized ES and whether matching schools prior to randomization might help achieve this goal.

FIGURE 3. Completely randomized (CR) design versus matched-pairs (MP) design, intracluster correlation (ICC) = .05. Note. ES = effect size; ESV = ES variability.

Under the CR design, intact schools, rather than isolated students, are assigned at random to treatment and control conditions. The literacy programs are administered, and students are then assessed on an outcome variable Y, the score on a standardized reading test. The comparison of outcomes needs to take into account the nesting of the students within the schools.

Similarly, under the MP group-randomized design, researchers first locate pairs of schools, rather than of individuals, that are likely to be similar on their mean reading outcomes. Researchers then rank order the schools on the basis of expected achievement and create the pairs: The first two schools constitute Pair 1, the next two schools constitute Pair 2, and so on. Within each pair, one school is assigned at random to the new literacy program condition while the other is kept as a control. The programs are then implemented, and students are assessed on Y. Again, the comparison of outcomes must take into account the nesting of the individuals within the groups and of the groups within the pairs.

The question that naturally arises is whether the MP design has substantially more power than the CR design. Intuitively, the answer will be yes if X (the matching variable) strongly predicts Y (reading outcomes). If so, schools within any given pair, which are by design very similar on X, will also tend to be very similar on Y in the absence of a treatment effect. In the extreme case in which matches on X within pairs are perfect and in which X perfectly predicts Y in the absence of treatment, the difference between pair members on Y will perfectly reflect the impact of the new literacy program within that pair, and the average of these differences will give an excellent estimate of the population average effect of the new program.

Generalization

The power associated with the CR design depends on five factors:

1. the size of the true average ES, in units of the outcome standard deviation;

2. the fraction of the variance in the outcome that lies between schools, equivalent to the intracluster correlation (ICC);

3. the number of schools studied, denoted J, with J/2 schools assigned at random to the experimental condition and J/2 schools assigned at random to the control condition;

4. the number of students studied per school, denoted by n; and

5. the adopted level of statistical significance, α.

The power associated with the MP design depends on these five factors plus two additional factors:

6. the correlation between latent true outcome means of schools within pairs, ρpairs; and

7. the magnitude of variation in the treatment effect across pairs.

Factor 6, the within-pair correlation, is also equivalent to the proportion of variance in the latent mean outcomes that lies between pairs. The fact that a correlation is equivalent to a variance ratio may seem odd, but that is easily shown to be the case. Factor 7 is referred to as the ES variability (ESV). The ESV quantifies how much the treatment effect varies across pairs. For example, school differences in treatment implementation may result in school differences in the intervention effect. This variation in the treatment effect is captured in the design by the ESV.

Within the past 15 years, a number of authors have compared the power of the CR design with that of the MP design in the context of group-randomized studies (Freedman, Green, & Byar, 1990; Gail, Mark, Carroll, Green, & Pee, 1996; Martin et al., 1993). Freedman et al. (1990) examined how the MP design increased the precision of the estimate of the treatment effect relative to the CR design, restricting their attention to a comparison of the standard errors of the treatment effect (as noted by Martin et al., 1993). This approach, although useful, does not incorporate the effect on power of the loss of degrees of freedom using the MP design. Martin et al. (1993) then considered the power of CR and MP designs for small group-randomized studies, in which the loss of degrees of freedom would likely matter. They concluded that if there were fewer than 20 groups, matching should be used only if the correlation between the matching variable and the outcome is greater than about .45. Otherwise, the loss of degrees of freedom in the MP design would result in a less powerful design than the CR design (see also Hughes, 2005). Bloom (2005) provided calculations showing the required predictive power of matching to increase the power of the test relative to a CR design for varying numbers of groups in the context of social studies.

Making the Comparison Precise

Figure 1 compares the power of CR and MP designs. In both cases, ES = {0.15, 0.30}, J = {20, 40, 60}, and ICC = .20, with n = 20 and α = .05. This ICC is commonly regarded as close to the higher end of what is typically found for achievement outcomes (Shochet, 2005).

Power in the MP design will also depend on how much the ES varies from pair to pair. Such ESV is allowed to be 0.01 or 0.05. To clarify, suppose ES = 0.15 and ESV = 0.01. If the pair-specific ESs vary randomly across pairs according to the normal distribution, one would expect 95% of the pairs to yield ESs in the range of 0.15 ± 1.96√0.01 ES units, that is, over the interval (–0.05, 0.35). An ESV larger than 0.01 would yield quite large plausible value intervals, perhaps larger than is realistic, although 0.05 is considered an upper bound.
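The curves in Figures 1–3 come from the Optimal Design software; the sketch below approximates points on them under one common variance decomposition for a matched-pairs group trial, in which the variance of a pair's treatment-control mean difference on a standardized scale is ESV + 2(1 – ρpairs)ICC + 2(1 – ICC)/n, tested with K – 1 degrees of freedom. The function name mp_power is ours, and exact values may differ slightly from the published figures.

```python
# A hedged sketch of MP power under one common variance decomposition; the
# exact formulas of the Optimal Design software may differ slightly.
from scipy import stats

def mp_power(J, n, es, icc, rho_pairs, esv, alpha=0.05):
    """Two-tailed power for the matched-pairs group-randomized design."""
    K = J // 2                                            # number of pairs
    var_pair_diff = esv + 2 * (1 - rho_pairs) * icc + 2 * (1 - icc) / n
    se = (var_pair_diff / K) ** 0.5
    ncp = es / se
    df = K - 1
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)

# A point on the center right panel of Figure 1: explaining about 80% of the
# between-school variance pushes power past .80, as the text notes below.
print(round(mp_power(J=40, n=20, es=0.30, icc=0.20, rho_pairs=0.8, esv=0.01), 2))
```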

Looking at the upper left panel of Figure 1, when J = 20 (10 matched pairs) and ES = 0.15, power for the CR design is extremely low, at about 0.10. Power for the MP design is not much better, even when matching explains 90% of the outcome variation. When ES = 0.30 (upper right panel), things are not much better. The CR design yields about 0.25 power, and the MP design does not help much unless it explains a large fraction of the variation. Still, even if 90% of the variation is explained, power is unacceptable, at about 0.60. It is clear that the ES will need to be quite large if such a small study (J = 20) is to yield adequate power. Notice that the utility of matching also depends on the ESV, but this dependence is weak. Indeed, the curves for ESV = 0.01 and ESV = 0.05 are virtually indistinguishable despite the fact that ESV = 0.05 is very large indeed. Across all the scenarios shown in the figure, the smaller ESV of 0.01 gives slightly greater power than the larger ESV of 0.05.

Increasing the number of schools will naturally add power to the study. Consider the center right panel of Figure 1, in which ES = 0.30 and J = 40 (20 matched pairs). Power for the CR design is about 0.47. The utility of matching depends on the percentage of outcome variation explained. Until about 25% of the outcome variation is explained, matching actually makes things slightly worse. But matching increasingly helps as the percentage of variance explained increases toward 1.0 and can even lead to a well-powered study if it successfully explains about 80% of the variation in the outcome. Looking now at the bottom pair of panels, the benefit of increasing J to 60, especially for the larger ES, can be seen. The benefits of matching are again clear. When ES = 0.30, power increases beyond 0.80 if about 60% of the variance in school means is explained.

Recall that Figure 1 portrays a “bad case” in that the ICC is .20. When the ICC is large, there is considerable variation between schools, meaning that the CR design requires many schools to achieve adequate power for a given ES. Matching reduces the need to have so many schools by accounting for some of the large variation between schools. However, matching may actually hurt unless ρpairs, the percentage of variance explained between schools, achieves a minimum level. The minimum level is about 25% when J = 20 and declines to about 15% when J = 60.

Figure 2 gives a somewhat more optimistic scenario because now the ICC is .10 rather than .20. This means that about 10% of the overall variation in outcomes lies between schools. Power is uniformly higher than in Figure 1, holding constant J and ES. However, the benefit from matching is smaller in Figure 2 than in Figure 1. Now, ρpairs, the percentage of variance explained by matching, must be higher before matching begins to help. And the benefit of matching after reaching “the threshold” is smaller than in Figure 1.

These principles are even clearer when Figure 3 is examined, in which an ICC of only .05 is assumed. Now, the CR design achieves nearly respectable power when ES = 0.30 and J = 40 and quite respectable power when J = 60. Notice that now, there is little to gain from matching because little of the variation lies between schools, and matching schools can help only in reducing that small fraction of variation further.

The implications of this exercise are clear. Matching helps most when between-school variation is large and therefore a big problem for the CR design. The CR design then requires a very large number of schools to “cut down the noise” caused by random variation in school means within treatments. Matching helps by explaining some or most of this variation, so that fewer schools are required under matching than under the CR design, as long as the percentage of variation in school means explained by matching achieves a threshold. This threshold is higher when few schools are sampled and rises further as the ICC declines. When little variation lies between schools in the absence of matching, matching will not help much, because there is little variation for matching to explain. Indeed, matching is more likely to hurt when the ICC is small than when the ICC is large.

Using a Covariate Versus Not Using a Covariate

Matching prior to randomization is one way to increase power. A second major strategy for increasing the power of an experiment is the use of ANCOVA. The efficacy of ANCOVA in the context of group-randomized studies has been discussed by several authors (see, e.g., Bloom, 2005). A question arising in this literature is whether the covariate should be a person-level characteristic or a group characteristic. According to Bloom (2005), correlations at the group level in the social sciences are typically higher than correlations at the individual level. In addition, group-level data are often more accessible and may be less expensive to acquire. For example, consider a study designed to test the effect of a reading intervention in schools in which reading achievement is the outcome and all third grade classrooms within a school are assigned to the same treatment condition. A useful covariate would be last year’s third grade scores on the reading test. Finding last year’s average third grade scores for each school will typically be inexpensive. Examples of this type arise frequently, so we focus here on covariance adjustment for group-level covariates.

FIGURE 4. Analysis of covariance example for two-level design.

FIGURE 5. Completely randomized (CR) design versus analysis of covariance (AC), intracluster correlation (ICC) = .20. Note. ES = effect size.

Example

Consider once again a study in which schools are to be assigned at random either to an experimental group that receives a new literacy program or to a control group that does not. On completion of the program, the two groups are compared on their average values of Y, the school-level sample mean reading test score. Let µ denote the “true” school mean, of which Y is an estimate.

ANCOVA modifies the script in that the investigators decide to exploit the availability of information on W, the prior mean achievement of school j. Many studies have found a strong linear association between school-level mean achievement and school-level mean outcome, so such a W is an excellent candidate as a covariate.

FIGURE 6. Completely randomized (CR) design versus analysis of covariance (AC), intracluster correlation (ICC) = .10. Note. ES = effect size.

As in matching, the experimenter uses ANCOVA to exploit the existence of prior information about schools, information that hopefully is helpful in predicting school-level mean outcomes. One key difference is that although matching builds this prior information into the design of the study (by first matching and then randomizing within pairs), ANCOVA uses a statistical model to “hold constant” the prior variables when evaluating the impact of the treatment on the outcome. In fact, operationally, the ANCOVA design is identical to the CR design: Units are simply randomly assigned to either the experimental or the control group with no regard for any prior information. However, the analytic model is different. Specifically, the outcome, Y, is regarded as a linear function of two things: the covariate, W, and treatment group membership, indicated by a dummy variable, Z. The aim is to predict how two schools with the same value of W would differ if one received the experimental treatment (Z = 1) and one received the control treatment (Z = 0). Because treatments are assigned at random, only the treatment effect plus chance differences (other than the covariate) contribute to the predicted difference between two such schools.

FIGURE 7. Completely randomized (CR) design versus analysis of covariance (AC), intracluster correlation (ICC) = .05. Note. ES = effect size.

Thus, in the scenario described above, using linear regression, the experimenter uses W and treatment group membership Z to predict Y. In this scenario, the expected mean difference between the two groups, holding constant W, is the average impact of the treatment in the population from which the schools were sampled.

Intuitively, ANCOVA will boost power when W (the prior mean achievement of the school) has a strong linear association with Y (the school-level mean reading test score). If so, two schools that have very similar W values but experience different treatments will have very similar predicted Y values in the absence of a treatment effect. Therefore, if, on average, schools that are similar on W but vary with respect to treatment also vary systematically on Y, one will tend to find evidence of a treatment impact.

To illustrate this idea, data were generated in which W strongly predicts Y. To illustrate the logic of ANCOVA, assume an unrealistically high correlation of ρwµ = .95 between W and the true school mean, µ. For ease of interpretation, test scores are measured in standard deviation units. Schools receiving the new program receive a boost of 0.25 standard deviation units in average scores. Under these assumptions, W and Y are bivariate normal in distribution, each with a variance of 1.0.² The sample size per school, n, is set at 100; the number of schools, J, is set at 50; and the ICC is set at .10. Figure 4 shows box plots of the experimental and control groups’ mean Y values. The median Y is a bit higher in the experimental group (Z = 1) than in the control group (Z = 0), but the two groups’ distributions overlap considerably. Under the CR analysis, the t test of the mean difference between groups on mean Y taking the nesting into account results in t = 0.60, far from being significant at the 5% level.

Figure 4 gives a scatterplot of Y against W for the two treatment groups. The plot shows an extremely strong, positive, linear association between W and Y (note the sample prediction line within each of the two groups). Note that at nearly all values of W, the experimental group Y values tend to be elevated above those of the control group. Thus, when attention is restricted to schools that have similar values of W, it can be readily discerned that the treatment and control group means are different. The estimate of the average impact of the treatment is the vertical distance between the two nearly parallel lines in the bottom left panel of Figure 4. The test of the average treatment effect under the nested ANCOVA is positive and statistically highly significant (t = 7.01, p < .001). In effect, controlling for the covariate W has dramatically reduced the “noise” in Y, revealing the “signal,” that is, the impact of the treatment.
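The sketch below re-creates the logic of this illustration rather than the authors' exact generating model (whose details appear in their notes): school-level outcomes correlate .95 with W, half the schools receive a 0.25-SD boost, and the treatment test is run with and without the covariate over many replicates. Regressing school means on (Z, W) stands in for the nested ANCOVA, and within-school sampling error is omitted for simplicity.

```python
# A hedged re-creation of the covariance-adjustment logic, not the authors'
# exact simulation: compare rejection rates with and without the covariate W.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
J, rho, boost, reps, alpha = 50, 0.95, 0.25, 2000, 0.05
rejections = {"without W": 0, "with W": 0}
for _ in range(reps):
    W = rng.standard_normal(J)
    Z = rng.permutation(np.repeat([0.0, 1.0], J // 2))       # random assignment
    y = rho * W + np.sqrt(1 - rho**2) * rng.standard_normal(J) + boost * Z
    for label, X in (("without W", Z[:, None]), ("with W", np.column_stack([Z, W]))):
        pval = sm.OLS(y, sm.add_constant(X)).fit().pvalues[1]  # p value for Z
        rejections[label] += pval < alpha
for label, count in rejections.items():
    print(label, round(count / reps, 2))
# Expect rejection rates of roughly .14 without the covariate and roughly .8
# with it: controlling W strips the "noise" from Y and reveals the signal.
```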

Generalization

As in the case of matching, controlling for a useless covariate is of no help at all; in fact, it actually hurts, again because of the loss of degrees of freedom. Recall that the degrees of freedom for the CR analysis equals J – 2. ANCOVA requires the computation of one additional quantity: the association between the covariate and the outcome. So in this case, df = J – 3. The loss of one extra degree of freedom will have a negligible effect on power unless the sample size is very small. Thus, ANCOVA exacts a smaller penalty than the MP design for using useless information on W.

To see how the data look when a useless covariate is used, refer to the bottom right panel of Figure 4. Values of this covariate, labeled U, increase along the horizontal axis, with Y on the vertical axis. Confining attention to cases with similar values of U in no way reduces uncertainty about Y and therefore is of no help in discerning the treatment effect. In this case, ANCOVA, like the CR analysis, gives a nonsignificant test of the impact of the treatment (t = 0.25).


The key assumptions underlying ANCOVA are that

1. power can increase only to the extent that X and Y are linearly associated,

2. the magnitude of this association must be the same in the experimental and control groups, and

3. the residuals (the errors of prediction) must be normally distributed with constant variance.

Assumptions 1 and 2 are not needed in using the MP design. Thus, the assumptions for ANCOVA are more restrictive than those needed for the MP analysis, making the latter more flexible.

FIGURE 8. Analysis of covariance (AC) versus the matched-pairs (MP) design, intracluster correlation (ICC) = .20. Note. CR = completely randomized; ES = effect size; ESV = ES variability.

As noted in the matching discussion, the power associated with the CR design depends on five factors:

1. the true ES;

2. the ICC;

3. the total number of schools, J;

4. the number of students studied per school, n; and

5. the adopted level of statistical significance, α.

The power in the ANCOVA design depends on one additional factor:

6. ρwµ, the correlation between the school-level covariate and the true school-level mean outcome.

Making the Comparison Precise

Suppose an experimenter wonders how many schools will be needed to achieve power to detect a predetermined ES and whether ANCOVA might help achieve this goal. The results are graphed in Figures 5–7, following the same logic as the graphs presented in the previous section on matching. Thus, Figure 5 gives results when the ICC (unadjusted for the covariate) is .20, Figure 6 gives results for ICC = .10, and Figure 7 gives results for ICC = .05. In each case, results are seen for a “small” ES of 0.15 and for a larger ES of 0.30 using J = 20, J = 40, and J = 60 and holding n constant at 20 and α at .05. In every case, the results are reproduced for the CR design (i.e., the design that does not use the covariate). Power is shown as a function of the magnitude of the correlation between the school-level covariate and the school mean.
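A sketch of how such curves can be computed, assuming (as is standard) that a group-level covariate with correlation ρwµ removes the fraction ρ²wµ of the between-school variance and costs one degree of freedom; ancova_power is our hypothetical helper, not the Optimal Design routine itself.

```python
# A hedged sketch of ANCOVA power with a group-level covariate.
from scipy import stats

def ancova_power(J, n, es, icc, rho_wmu, alpha=0.05):
    """Two-tailed power for a CR design analyzed with a group-level covariate."""
    var_group_mean = (1 - rho_wmu**2) * icc + (1 - icc) / n
    se = (4 * var_group_mean / J) ** 0.5
    ncp = es / se
    df = J - 3                    # one extra df spent on the covariate slope
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)

# ICC = .20, J = 40, ES = 0.30, n = 20: rho_wmu = 0 reproduces the CR power
# of about .47 noted earlier, and power rises steeply with the correlation.
for r in (0.0, 0.5, 0.9):
    print(r, round(ancova_power(J=40, n=20, es=0.30, icc=0.20, rho_wmu=r), 2))
```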

In many ways, the results parallel those of the preceding section. Thus, a smaller benefit of using the covariate is seen when the ICC is .10 than when it is .20, and even less benefit is seen for ICC = .05. Once again, the school-level covariate can explain only the variation between schools, and when the ICC is very small, there is little between-school variation to be explained.

One obvious and important difference involves the potential negative effect of using the covariate. Recall that matching can actually undermine power unless the percentage of variance explained by matching reaches a threshold value. In contrast, Figures 5–7 suggest that the penalty for using a useless covariate is negligible in all cases. The reason is clear: Only one degree of freedom is sacrificed in using the covariate, compared with K – 1 (where K is the number of pairs) in the MP group-randomized design.

Matching Versus ANCOVA

We have explored how two approaches to using prior information can improve power compared with an approach that does not make use of such prior information. As long as there is substantial variation between schools to explain, matching will substantially improve power as the correlation within pairs on the outcome increases. And ANCOVA will substantially improve power as the correlation between the covariate and the outcome increases. The question then naturally arises: Which of these two approaches is most effective in increasing power?

A comparison between ANCOVA and matching makes sense only when ANCOVA is a reasonable approach, that is, when the assumptions required for ANCOVA are at least approximately correct. If the ANCOVA assumptions do hold, ANCOVA will be optimal. However, there are benefits of matching. Recall that matching can ensure balance between treatments on key salient variables, increasing face validity. Moreover, the weaker assumptions required for matching make it appealing.

The question therefore arises as to whether matching might be nearly as good as ANCOVA when the ANCOVA assumptions hold. If so, one might argue that matching ought to be adopted to purchase its unique benefits. The available literature does not provide a conclusive answer to this question.

To date, researchers evaluating matching techniques have tended to overestimate the utility of matching (Hughes, 2005; Klar & Donner, 1998). In this literature, it is common to postulate the existence of a continuous pretreatment variable, say W, having a correlation, say ρwµ, with the group mean outcome µ. One might then assume that matching on W will reduce the unexplained variance in the outcome by a factor of ρ²wµ. Such an approach overestimates the benefit of matching by overestimating the similarity of units within pairs on W, in effect assuming that pair members will be perfectly matched on W. Such an assumption is unrealistic in practice, yet there is no closed-form mathematical expression for determining the variance explained by matching on W when W randomly varies within pairs. We address this problem through simulation, allowing for a more rigorous comparison between matching and ANCOVA.

FIGURE 9. Analysis of covariance (AC) versus the matched-pairs (MP) design, intracluster correlation (ICC) = .10. Note. CR = completely randomized; ES = effect size; ESV = ES variability.

FIGURE 10. Analysis of covariance (AC) versus the matched-pairs (MP) design, intracluster correlation (ICC) = .05. Note. CR = completely randomized; ES = effect size; ESV = ES variability.

First, a group-level covariate W and, for each group, a latent “true” group mean outcome, µ, were generated as bivariate normal in distribution, each with variance 1.0 and correlation ρwµ. Then, cases were rank ordered on W, and the sample of groups was divided into pairs such that the first- and second-ranked groups constituted Pair 1, the third and fourth constituted Pair 2, and so on. The variances within and between pairs on µ were then computed, enabling the deduction of the correlation between pair members, ρpairs. To ensure the stability of this estimate, the process was repeated 1,000 times for each sample size, and the average of the estimates was used as the final value of ρpairs. Thus, for every sample size, it was possible to regard ρpairs as a function of ρwµ, enabling the computation of power for the ANCOVA and the MP analysis at each value of ρwµ. Power for the MP design can readily be shown to be a function of the number of groups, J; the sample size per group, n; the ICC (the percentage of variance lying between groups in the absence of matching); and ρpairs. Here, J = 2K, where K is the number of pairs.
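A sketch of the simulation just described, under the assumptions stated above (bivariate normal W and µ with unit variances; adjacent ranks on W form pairs); the function name is ours.

```python
# Deduce rho_pairs (the fraction of the variance of mu lying between pairs
# after sorting on W) as a function of rho_wmu, averaged over 1,000 replicates.
import numpy as np

def simulated_rho_pairs(J, rho_wmu, reps=1000, seed=0):
    rng = np.random.default_rng(seed)
    values = []
    for _ in range(reps):
        W = rng.standard_normal(J)
        mu = rho_wmu * W + np.sqrt(1 - rho_wmu**2) * rng.standard_normal(J)
        pairs = mu[np.argsort(W)].reshape(-1, 2)                # adjacent ranks pair up
        within = np.mean((pairs[:, 0] - pairs[:, 1]) ** 2) / 2  # within-pair variance
        values.append(1 - within / np.var(pairs))               # fraction between pairs
    return float(np.mean(values))

for r in (0.3, 0.6, 0.9):
    print(r, round(simulated_rho_pairs(J=40, rho_wmu=r), 2))
# rho_pairs falls short of rho_wmu**2 because real pair members are never
# perfectly matched on W, which is precisely the overestimation noted above.
```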

The results are graphed in Figures 8–10. These are the same results shown in Figures 5–7, except that the power for the MP design at each value of ρwµ has been added. For simplicity, only the case of ESV = 0.01, reasoned to be a plausible value of the ESV, was included.

As expected, the ANCOVA approach works best. In every case, power associated with matching increases at a rate nearly equal to that of ANCOVA, but always at a lower level. The two are most similar when J is large and the ICC is large. A key difference is that when ρwµ is small, matching can do worse than the CR design, whereas the penalty for a small correlation under ANCOVA is negligible. As seen earlier, there is virtually no penalty for using ANCOVA with a useless covariate, unless J is very small, smaller than studied here. In contrast, matching can reduce power, particularly when it is ineffective, when the number of pairs is small, and when the ICC is small.

Discussion

The logic of social experimentation often requires that groups, rather than individuals, be the unit of assignment and of treatment. Assignment by randomization confers the same advantages in group-based studies as in person-based studies: A well-implemented randomized study eliminates bias by statistically balancing treatment conditions on all prior characteristics. In this case, the only differences between treatments on prior characteristics are chance differences, and standard significance tests and confidence intervals correctly quantify uncertainty about the existence and magnitude of causal effects.

However, if no effort is made to identify and control prior group characteristics, group-randomized studies will tend to lack statistical power unless many groups are recruited. This need for many groups arises because the effects of short-term interventions typically implemented in social settings will tend to be modest. The number of groups required to achieve adequate power depends also on the intraclass correlation, interpretable as the fraction of variation in outcomes that lies between groups. Even when this fraction is modest (e.g., less than 10%), the number of groups required to achieve adequate power can be daunting if the ES is modest.

In this article, we have evaluated two alternative approaches to identifying and controlling prior group characteristics: matching and ANCOVA. Under certain circumstances, such uses of prior information can substantially increase power given the number of groups or, equivalently, reduce the number of groups needed given a desired level of power.

In this discussion, we first briefly summarize the key findings. Second, we provide advice on how to use existing data to assess the likely contribution of matching or ANCOVA to improving power. Third, we briefly explore other more complex designs and how available information can be used to increase the power to detect treatment effects.

Key Findings

Matching

It has been demonstrated that matching will tend to enhance statistical power when groups within pairs are well matched on prior characteristics and when those characteristics strongly predict outcomes. In this setting, groups within pairs will then tend to be similar on outcomes, except for the fact that one group experiences the treatment and one does not. Such similarity is measured by the correlation between pair members, ρpairs, which is equivalent both to the correlation between group means on the outcome within pairs and to the fraction of variation in the true mean outcomes that lies between pairs. The variation in the true ES across pairs also affects power; power will be greater when this variation is small.

A key factor to consider, however, is the fraction of variation that lies between groups in the absence of matching. This factor is indexed by the ICC. Matching is most helpful when the ICC is comparatively large. If the ICC is tiny, there is little variation between groups to be removed through matching. The threshold value of ρpairs decreases with the number of pairs and also with the ICC. Matching will therefore help least, and may in fact hurt, when the number of pairs is small and when the ICC is small. We found, in fact, that in quite plausible circumstances, matching actually hurt when the ICC was .05, even if ρpairs was moderate to large.

ANCOVA

In group-randomized studies, the most promising covariates are often group-level covariates, W (Bloom, 2005). They are typically cheap to measure and often very strongly correlated with the group mean outcome, labeled µ in this study. As one might expect, power tends to increase with the covariate-outcome correlation, ρwµ. However, the benefit of using the covariate depends on the ICC. When the ICC is large, the use of a group-level covariate can substantially increase power; for a small ICC, the utility of the covariate will be small. Only if the number of groups is exceedingly small does one pay a nonnegligible penalty for using a poorly chosen covariate. Note, however, that the required assumptions are more stringent than when matching on pretreatment characteristics.

Matching Versus ANCOVA

Its flexibility gives matching certain advantages: There is no requirement that the pretreatment characteristic have a linear association with the outcome. Moreover, as noted, matching on one or more salient variables can enhance face validity. However, when the assumptions of ANCOVA hold and when the correlation is large, it is difficult to beat ANCOVA for boosting power. A question that naturally arises is whether matching on the covariate can provide nearly as much power as ANCOVA when the assumptions of ANCOVA hold. If so, one might opt for matching to enjoy its other benefits.

We were surprised to find little precise guidance in the literature on this question. On reflection, this finding is understandable in that precise mathematical comparisons are difficult if possible at all. We therefore conducted a simulation study. In the simulation study, matching approximated the power of ANCOVA as the number of groups increased and as the ICC increased. However, for many plausible cases, ANCOVA did significantly better than matching, because in these scenarios, either the number of groups or the ICC was too small to enable matching to approximate the power of ANCOVA (see Figures 8–10).

Match and covary? Given the trade-offs, is it possible to benefit from matching and use X as a covariate? Although the precise study of such an option goes beyond the scope of this article, the results are suggestive. On one hand, adding a useless covariate to an MP study will cause little harm unless the number of groups is very small. On the other hand, if the covariate is powerfully related to the outcome, the benefit may be great, particularly if matching was not effective. Thus, it is quite plausible to envision a scenario in which matching is desirable to enhance face validity but adds little power, or even reduces it. Adding a strong covariate may then help. On the other hand, if a good covariate is already available, matching in addition to ANCOVA would seem unpromising in most cases of interest for the purpose of increasing power. In studies with small numbers of groups, matching in addition to ANCOVA may hurt because the loss of degrees of freedom associated with matching increases the critical value of the test statistic for the treatment effect.

Using These Ideas to Plan Group-Randomized Studies

Our hope is that this article will improve the planning of group-randomized studies. Good planning typically requires reasonable estimates of quantities that can be known precisely only after a study has been conducted. In many cases, pilot data or data from archives can be analyzed to provide good estimates of these quantities. If those estimates become available, the reader might use the information presented in the figures to guide planning. These figures cover only a small fraction of the cases that arise in practice, however. Details of how to replicate the results in the figures using the Optimal Design software package are in Appendix A. We now consider how to use the software (Raudenbush et al., 2006) to plan studies in cases not represented in those figures. The discussion is restricted to the case of continuous outcomes. However, the software documentation shows how to use the software for dichotomous outcomes and presents all equations needed to derive the results in the figures.

CR Design

In person-level studies, one may begin with an assumed standardized ES, that is, the mean difference between experimental subjects and controls in the scale of the outcome standard deviation. This is typically the “minimum ES worth detecting.” One then simply calculates the sample size, M, needed to ensure a given level of power at a given level of significance. Alternatively, one may fix the sample size and determine the “minimum detectable effect,” that is, the smallest ES that can be detected at the given power and significance level. In some cases, prior data will be needed to get a good approximation to the outcome standard deviation. The expected mean difference in outcomes divided by this quantity then constitutes the standardized ES.

Planning group-level CR studies is a little more complicated. To obtain power, one needs not only a hypothesized ES but also an estimate of the ICC, the fraction of variation lying between groups. Prior data will be useful in this regard (see Bloom, 2005; Shochet, 2005). Power is then a function of both the number of groups, J, and the sample size per group, n.
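To illustrate, the CR power computation can be scripted directly from the standard cluster-randomized formula (Raudenbush, 1997), in which the standardized treatment-effect estimate has variance 4(ICC + (1 − ICC)/n)/J and the test statistic follows a noncentral t distribution with J − 2 degrees of freedom. The sketch below assumes Python with scipy; the function name cr_power is ours, not part of Optimal Design.

```python
# Sketch: power for the CR (completely randomized group) design,
# computed from the standard cluster-randomized formula.
# Assumes scipy; cr_power is our name, not an Optimal Design function.
import math
from scipy import stats

def cr_power(es, icc, J, n, alpha=0.05):
    """Two-sided power for a balanced two-arm group-randomized trial.

    es  : standardized effect size
    icc : intraclass correlation
    J   : total number of groups (J/2 per arm)
    n   : persons per group
    """
    # Variance of the standardized treatment-effect estimate.
    var = 4.0 * (icc + (1.0 - icc) / n) / J
    ncp = es / math.sqrt(var)          # noncentrality parameter
    df = J - 2
    tcrit = stats.t.ppf(1.0 - alpha / 2.0, df)
    # Power = P(|T| > tcrit) under the noncentral t distribution.
    return stats.nct.sf(tcrit, df, ncp) + stats.nct.cdf(-tcrit, df, ncp)

print(round(cr_power(es=0.15, icc=0.20, J=20, n=20), 2))  # about 0.10
```

Run with ES = 0.15, ICC = .20, J = 20, and n = 20, this returns power of about .10, which matches Example 1 in Appendix A.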

MP Design

For person-randomized studies, the calculation of power requires the same quantity as needed in CR: the standardized ES. In addition, one needs an estimate of ρpairs, the correlation between outcomes of pair members in the absence of treatment (or, equivalently, the fraction of variance in the outcome that lies between pairs). This quantity can easily be estimated from pilot data if the matching variables and a good substitute for the outcome are available. To estimate the correlation between pairs, one simply matches cases on the matching variables and then computes a one-way analysis of variance on the outcome. A good estimate of ρpairs is then half the difference between the between- and within-pair mean squares.
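As an illustration, this recipe can be scripted in a few lines. The sketch below assumes Python with numpy and an outcome standardized to unit variance (which is what makes half the mean-square difference an estimate of the correlation); the simulated pairing is stylized, standing in for matching on the actual matching variables.

```python
# Sketch: estimating rho_pairs from person-level pilot data, following
# the one-way ANOVA recipe in the text. Assumes numpy and an outcome
# already standardized to unit variance.
import numpy as np

def rho_pairs_person(y_pairs):
    """y_pairs: array of shape (n_pairs, 2), outcomes of matched pairs."""
    grand = y_pairs.mean()
    pair_means = y_pairs.mean(axis=1)
    n_pairs = y_pairs.shape[0]
    # One-way ANOVA with two observations per pair.
    ms_between = 2.0 * np.sum((pair_means - grand) ** 2) / (n_pairs - 1)
    ms_within = np.sum((y_pairs - pair_means[:, None]) ** 2) / n_pairs
    # Half the difference between mean squares estimates the
    # between-pair variance, i.e., rho_pairs for a standardized outcome.
    return (ms_between - ms_within) / 2.0

# Illustrative check with pairs whose true correlation is 0.5.
rng = np.random.default_rng(0)
pairs = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=500)
print(round(rho_pairs_person(pairs), 2))  # close to 0.5
```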

In the group-randomized case, planning for MP is again a bit more complicated. One needs the same quantities as required in the CR case (the standardized ES and the ICC) plus an estimate of ρpairs. One might proceed in a way that is entirely analogous to the procedure described above for the person-level case: match groups, then compute a one-way analysis of variance using the group sample mean outcomes, computing ρpairs in the same way. However, this procedure is not correct. Such an estimate, call it ρpairs naive, will be attenuated as the sample size per group diminishes. What is needed is an estimate of ρpairs defined as the fraction of variance in "true" group means that is explained by matching (see Appendix B for details).

Analysis of Covariance

In the case of person-level studies, the computation of power will require the standardized ES and a reasonable estimate of ρxy, the covariate-outcome association. In the group-randomized case, one again needs the standardized ES, but also the ICC and an estimate of ρwµ, the correlation between the group-level covariate W and the group mean outcome µ. We emphasize that ρwµ is not equivalent to the correlation between W and the sample mean outcome (see Appendix B for details).

Other (More Complex) Designs

The discussion in this article has been limited to two-level designs. However, many studies involve additional levels of clustering. Although precise study of these other designs goes beyond the scope of this article, the findings described here provide some guidance on how to increase power in such cases.

One such design is the three-level group-randomized trial with treatment at level 3. In this case, there is an additional layer of clustering between the unit of analysis and the unit of randomization. Suppose a schoolwide study involving several schools is designed to determine the effects of a whole-school reform on students' academic achievement. The reform is implemented at the school level, and the outcomes of interest are at the individual level. However, the effectiveness of the reform may depend on the quality of the teachers or on other classroom-level characteristics. Thus, taking into account the clustering at the level of the classroom is essential.

To increase the power of such a study, an option might be to block the schools by district prior to randomization. This blocking can remove between-district variation from the error term and thus potentially increase the power. Again, whether the power increases will depend on the strength of districts as a blocking variable and on the number of schools in the study, among other variables. This type of blocking, however, may be desirable for face-validity purposes. Alternatively, a school-level covariate such as school mean prior achievement might be included in the analysis. With three levels in the design, a covariate at any level could be used; however, the school-level covariate in this case is the most logical for reasons outlined above in the discussion of covariance adjustment.

Another three-level design not discussed here is the cluster randomized trial with repeated measures at the individual level. Such a design involves repeated measures nested within individuals nested within groups (Raudenbush & Liu, 2001). Once again, the main findings about the power-increasing strategies discussed in this article can be extended to this case.

All the designs illustrated in the figures, plus the three-level group-randomized trial with treatment at all levels and the cluster-randomized trial with repeated measures at the individual level, are included in the Optimal Design software and its documentation, available from the William T. Grant Foundation's Web site or from the authors. We encourage researchers to use these resources in planning their studies of group-based initiatives. See Appendix A for details on how to use the software.

Appendix A

All the results in the figures included in this article may be replicated using the Optimal Design for Group Randomized Trials mode of the Optimal Design software. We include here a few examples. For a more in-depth discussion, refer to the software documentation (http://www.wtgrantfoundation.org/ or http://sitemaker.umich.edu/group-based).

Example 1: Replicating the Completely Randomized (CR) Results for Intracluster Correlation (ICC) = .20, n = 20, Effect Size (ES) = 0.15, and J = 20 (Figure 1, Top Left Panel)

Select Cluster Randomized Trial → Power for the main effect of treatment (continuous outcome) → Power vs. effect size → ρ = 0.20, J = 20, n = 20. Clicking on the figure for an ES of 0.15 yields power of about 0.10, which is the same as the power for the CR in the top left panel in Figure 1.

Example 2: Replicating the Matched-Pairs (MP) Results for ICC = .20, n = 20, ES = 0.15, J = 60, and ESV = 0.01 (Figure 1, Bottom Left Panel)

Select Multi-site CRT → Power for treatment effect → Power vs. effect size → n = 20, J = 2, K = 30, ρ = 0.20, σ²δ = 0.01, B = 0.8. Clicking on the plot for an effect size of 0.15 yields power of about 0.49, which approximates the power for the MP with ESV = 0.01 in the bottom left panel in Figure 1.

Example 3: Replicating the ANCOVA Results for ICC = .20, n = 20, ES = 0.15, and J = 60 (Figure 5, Bottom Left Panel)

Select Cluster Randomized Trial → Power for the main effect of treatment (continuous outcome) → Power vs. effect size → ρ = 0.20, J = 60, n = 20, R²L2 = .64. Clicking on the figure for an ES of 0.15 yields power of about 0.40, which is the same as the power for the ANCOVA with a covariate-outcome correlation of .80 (.80² = .64) in the bottom left panel in Figure 5. Note that R²L2 = .64 is entered because the software asks for the proportion of variance explained, whereas Figure 5 is indexed by the correlation.
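Example 3 can also be checked numerically. Under the usual ANCOVA extension of the cluster-randomized power formula, the between-group variance is reduced by the factor (1 − R²L2) and one degree of freedom is lost to the covariate. The sketch below assumes Python with scipy and is our cross-check, not output from Optimal Design.

```python
# Sketch: numerical cross-check of Example 3. Covariance adjustment
# shrinks the between-group variance by (1 - R2_L2) and costs one
# degree of freedom. Assumes scipy.
import math
from scipy import stats

def cr_ancova_power(es, icc, J, n, r2_l2, alpha=0.05):
    # Residual variance of the standardized treatment-effect estimate.
    var = 4.0 * ((1.0 - r2_l2) * icc + (1.0 - icc) / n) / J
    ncp = es / math.sqrt(var)
    df = J - 3  # one df lost to the group-level covariate
    tcrit = stats.t.ppf(1.0 - alpha / 2.0, df)
    return stats.nct.sf(tcrit, df, ncp) + stats.nct.cdf(-tcrit, df, ncp)

print(round(cr_ancova_power(0.15, 0.20, 60, 20, r2_l2=0.64), 2))  # about 0.40
```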

Appendix B

Estimating ρpairs for MP Designs for Group-Randomized Studies

Suppose that the planner has obtained pilot data on J schools, each with sample size n. First, the user computes a one-way random-effects analysis of variance, obtaining an estimate of the between-school variance, τ², the within-school variance, σ², and, from these, ICC = τ²/(τ² + σ²) and λ = τ²/(τ² + σ²/n).

The next step is to construct J/2 matched pairs just as they will be constructed in the study. One then proceeds as follows:

1. Compute the sample means for each of the J schools. This yields a data set with J/2 pairs of sample means.

2. Now compute a second one-way analysis of variance with two Level 1 units in each of the J/2 Level 2 units. This will yield a between-pair variance, τ*², a within-pair variance, σ*², and, from these, ICC* = τ*²/(τ*² + σ*²).

3. Compute ρpairs = ICC*/λ.
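The following sketch implements these three steps, assuming Python with numpy, pilot data arranged as a J × n array, and schools already ordered so that matched pairs occupy adjacent rows (a stand-in for the planner's actual matching procedure).

```python
# Sketch of the Appendix B recipe for rho_pairs in group-randomized
# MP designs. Assumes numpy; y has shape (J, n), with matched schools
# in consecutive rows.
import numpy as np

def variance_components(groups):
    """One-way random-effects ANOVA estimates (tau2, sigma2).

    groups: array of shape (n_groups, group_size).
    """
    J, n = groups.shape
    group_means = groups.mean(axis=1)
    ms_between = n * np.sum((group_means - groups.mean()) ** 2) / (J - 1)
    ms_within = np.sum((groups - group_means[:, None]) ** 2) / (J * (n - 1))
    return (ms_between - ms_within) / n, ms_within

def rho_pairs_group(y):
    J, n = y.shape
    tau2, sigma2 = variance_components(y)
    lam = tau2 / (tau2 + sigma2 / n)           # reliability of the group mean
    # Steps 1-2: school means, then a second ANOVA with 2 means per pair.
    means = y.mean(axis=1).reshape(J // 2, 2)  # adjacent rows form a pair
    tau2_star, sigma2_star = variance_components(means)
    icc_star = tau2_star / (tau2_star + sigma2_star)
    # Step 3: disattenuate by the reliability.
    return icc_star / lam
```

In practice, the pairing in step 1 would use the study's actual matching variables; the adjacent-row convention is adopted here only to keep the sketch short.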

Estimating ρwµ in the ANCOVA Design for Group-Randomized Studies

The correlation between the group-level covariate (denoted W) and the sample group mean outcome (denoted m), ρwm, will be an attenuated estimate of ρwµ as the sample size used to compute m diminishes. So it is not appropriate simply to compute the sample correlation between W and m. Instead, one can use the relationship ρwµ = ρwm/√λ, where λ is the reliability of m.

Now suppose that W itself is also the sample mean of a person-level covariate and is therefore a fallible estimate from the pilot sample of the "true mean W." Then let ρwµ naive be the correlation between the sample mean of W and the sample mean of the outcome. The relation ρwµ = ρwµ naive/√(λm × λw) can then be used. Thus, the naive estimate is divided by the square root of the product of the two reliabilities, λm and λw, the reliabilities of the sample mean outcome and the sample mean covariate, respectively.
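Both corrections are one-line computations. A minimal sketch follows, assuming Python, with reliabilities already estimated from the pilot data; the numerical values are illustrative only.

```python
# Sketch: disattenuating the covariate-outcome correlation, as in
# Appendix B. All inputs are assumed to come from a pilot analysis.
import math

def rho_w_mu(rho_wm, lam_m):
    """Covariate measured without error: divide by sqrt(reliability of m)."""
    return rho_wm / math.sqrt(lam_m)

def rho_w_mu_from_naive(rho_naive, lam_m, lam_w):
    """Covariate itself a fallible sample mean: divide by the square
    root of the product of the two reliabilities."""
    return rho_naive / math.sqrt(lam_m * lam_w)

print(round(rho_w_mu(0.70, 0.80), 2))                   # 0.78
print(round(rho_w_mu_from_naive(0.60, 0.80, 0.75), 2))  # 0.77
```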

Notes

1. The Optimal Design software is freely available from the Web site of the William T. Grant Foundation (http://www.wtgrantfoundation.org). It can be used to estimate the power of individual- and group-randomized studies following the designs discussed in this article, among others. Software documentation containing all the formulas for the power calculations is also accessible on the Web site.

2. The model generating the data was yij = 2 + 0.95Wj + 0.25Zj + rj + eij, where yij represents the outcome for student i = {1, . . ., n} in school j = {1, . . ., J}; Wj represents the school-level covariate, assumed normally distributed with mean 0 and variance 1; Zj represents a school-level indicator (1 for treatment, 0 for control); rj ~ N(0, τ²) represents a random error associated with each school; and eij ~ N(0, σ²) represents a random error associated with each student. The intraclass correlation is set at .10, n at 100, and J at 50.
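For readers who wish to reproduce this simulation, a minimal sketch follows, assuming Python with numpy. The note fixes only the intraclass correlation, so the particular variance split τ² = 0.1, σ² = 0.9 is our assumption.

```python
# Sketch: simulating the data-generating model in Note 2. The split
# tau2 = 0.1, sigma2 = 0.9 is our assumption to make the intraclass
# correlation .10; the note fixes only the ICC itself.
import numpy as np

rng = np.random.default_rng(42)
J, n = 50, 100
tau2, sigma2 = 0.1, 0.9  # ICC = tau2 / (tau2 + sigma2) = .10

W = rng.normal(0.0, 1.0, size=J)                 # school-level covariate
Z = rng.permutation(np.repeat([0, 1], J // 2))   # treatment indicator
r = rng.normal(0.0, np.sqrt(tau2), size=J)       # school-level error

# y_ij = 2 + 0.95*W_j + 0.25*Z_j + r_j + e_ij
school_part = 2.0 + 0.95 * W + 0.25 * Z + r
y = school_part[:, None] + rng.normal(0.0, np.sqrt(sigma2), size=(J, n))
print(y.shape)  # (50, 100)
```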

References

Bloom, H. S. (2005). Randomizing groups to evaluate place-based programs. In H. S. Bloom (Ed.), Learning more from social experiments: Evolving analytic approaches (pp. 115–172). New York: Russell Sage.

Bloom, H. S., Bos, J. M., & Lee, S.-W. (1999). Using cluster random assignment to measure program impacts: Statistical implications for the evaluation of education programs. Evaluation Review, 23(4), 445–469.

Bloom, H. S., & Riccio, J. A. (2005). Using place-based random assignment and comparative interrupted time-series analysis to evaluate the Jobs-Plus employment program for public housing residents. In R. Boruch (Ed.), Place randomized trials: Experimental tests of public policy (pp. 19–51). Thousand Oaks, CA: Sage.

Bloom, H. S., Richburg-Hayes, L., & Black, A. R. (2005). Using covariates to improve precision: Empirical guidance for studies that randomize schools to measure the impacts of educational interventions. New York: MDRC.

Borman, G. D., Slavin, R. E., Cheung, A., Chamberlain, A., Madden, N., & Chambers, B. (2005). Success for All: First-year results from the National Randomized Field Trial. Educational Evaluation and Policy Analysis, 27, 1–22.

Cochran, W. G. (1957). Analysis of covariance: Its nature and uses. Biometrics, 13(3), 261–281.

Cook, T. D. (2005). Emergent principles for the design, implementation and analysis of cluster-based experiments in social science. In R. Boruch (Ed.), Place randomized trials: Experimental tests of public policy (pp. 176–198). Thousand Oaks, CA: Sage.


Cook, T. D., Hunt, H. D., & Murphy, R. F. (2000). Comer's School Development Program in Chicago: A theory-based evaluation. American Educational Research Journal, 37, 535–597.

Donner, A., & Klar, N. (2000). Design and analysis of cluster randomization trials in health research. London: Arnold.

Education Sciences Reform Act, 108 Cong. 2nd Sess. 48 (2002).

Federer, W. T. (1955). Experimental design. New York: Macmillan.

Fisher, R. A. (1926). The arrangement of field experiments. Journal of the Ministry of Agriculture, 33, 503–513.

Fisher, R. A. (1936). Statistical methods for research workers. Edinburgh, UK: Oliver & Boyd.

Fisher, R. A. (1949). The design of experiments. Edinburgh, UK: Oliver & Boyd.

Flay, B. R. (2000). Approaches to substance use prevention utilizing school curriculum plus social environmental change. Addictive Behaviors, 25(6), 861–885.

Freedman, L. S., Green, S. B., & Byar, D. P. (1990). Assessing the gain in efficiency due to matching in a community intervention study. Statistics in Medicine, 9(8), 943–952.

Gail, M. H., Mark, S. D., Carroll, R., Green, S. B., & Pee, D. (1996). On design considerations and randomization-based inference for community intervention trials. Statistics in Medicine, 15(11), 1069–1092.

Grimshaw, J., Eccles, M., Campbell, M., & Elbourne, D. (2005). Cluster randomized trials of professional and organizational behavior change interventions in health care settings. In R. Boruch (Ed.), Place randomized trials: Experimental tests of public policy (pp. 71–93). Thousand Oaks, CA: Sage.

Hannan, P. J., Murray, D. M., Jacobs, D. J., & McGovern, P. (1994). Parameters to aid in the design and analysis of community trials: Intraclass correlations from the Minnesota Heart Health Program. Epidemiology, 5(1), 88–95.

Hughes, J. P. (2005). Using baseline data to design a group randomized trial. Statistics in Medicine, 24(13), 1983–1994.

Kempthorne, O. (1952). The design and analysis of experiments. New York: John Wiley.

Kirk, R. E. (1982). Experimental design: Procedures for the behavioral sciences (2nd ed.). Belmont, CA: Brooks/Cole.

Klar, N., & Donner, A. (1998). The merits of matching in community intervention trials: A cautionary tale. Statistics in Medicine, 16(15), 1753–1764.

Leviton, L. C., & Horbar, J. D. (2005). Cluster randomized trials for the evaluation of strategies designed to promote evidence-based practice in perinatal and neonatal medicine. In R. Boruch (Ed.), Place randomized trials: Experimental tests of public policy (pp. 94–114). Thousand Oaks, CA: Sage.

Martin, D. C., Diehr, P., Perrin, E. B., & Koepsell, T. D. (1993). The effect of matching on the power of randomized community intervention studies. Statistics in Medicine, 12(3–4), 329–338.

Mosteller, F., & Boruch, R. (Eds.). (2002). Evidence matters: Randomized trials in education. Washington, DC: Brookings Institution.

Mosteller, F., Light, R. J., & Sachs, J. A. (1996). Sustained inquiry in education: Lessons from skill grouping and class size. Harvard Educational Review, 66(4), 797–842.

Murray, D. M. (1998). Design and analysis of group-randomized trials. New York: Oxford University Press.

Porter, A. C., Blank, R. K., Smithson, J. L., & Osthoff, E. (2005). Place-based randomized trials to test the effects on instruction practices of a mathematics/science professional development program for teachers. In R. Boruch (Ed.), Place randomized trials: Experimental tests of public policy (pp. 147–175). Thousand Oaks, CA: Sage.

Raudenbush, S. W. (1997). Statistical analysis and optimal design for cluster randomized trials. Psychological Methods, 2(2), 173–185.

Raudenbush, S. W., & Liu, X. (2001). Effects of study duration, frequency of observation, and sample size on power in studies of group differences in polynomial change. Psychological Methods, 6(4), 387–401.

Raudenbush, S. W., Liu, X.-F., Spybrook, J., Martinez, A., & Congdon, R. (2006). Optimal Design software for multi-level and longitudinal research (Version 1.77) [Computer software]. Available at http://sitemaker.umich.edu/group-based

Sherman, L. W., & Weisburd, D. (1995). General deterrent effects of police patrol in crime "hot spots": A randomized, controlled trial. Justice Quarterly, 12(4), 625–648.

Shochet, P. A. (2005). Statistical power for random assignment evaluations of education programs. Princeton, NJ: Mathematica Policy Research.

Sikkema, K. J. (2005). HIV prevention among women in low-income housing developments: Issues and intervention outcomes in a place-based randomized controlled trial. In R. Boruch (Ed.), Place randomized trials: Experimental tests of public policy (pp. 52–70). Thousand Oaks, CA: Sage.

Teruel, G. M., & Davis, B. (2000). Final report: An evaluation of the impact of PROGRESA cash payments on private inter-household transfers. Washington, DC: International Food Policy Research Institute.

Weisburd, D. (2000). Randomized experiments in criminal justice policy: Prospects and problems. Crime and Delinquency, 46(2), 181–193.

Weisburd, D. (2005). Hot spots policing experiments and criminal justice research: Lessons from the field. In R. Boruch (Ed.), Place randomized trials: Experimental tests of public policy (pp. 220–245). Thousand Oaks, CA: Sage.

Authors

STEPHEN W. RAUDENBUSH is a professor in the Department of Sociology at the University of Chicago, 1126 E. 59th Street, Chicago, IL 60637; [email protected]. He is best known for his expertise in quantitative methodology using hierarchical linear models, which allow researchers to accurately evaluate data on school performance. His research pursues the development, testing, refinement, and application of statistical methods for studying individual change. He also researches the effects of social settings, such as schools and neighborhoods.

ANDRES MARTINEZ is a doctoral student in the combined program in Education and Statistics at the University of Michigan and a visiting research scholar at the University of Chicago, Social Science Research Building, Room 417, 1126 E. 59th Street, Chicago, IL 60637; [email protected]. His current research interests include causal inference in hierarchical settings, the measurement of educational settings, and the effectiveness of conditional cash transfer programs in education.

JESSACA SPYBROOK is a doctoral candidate in the combined program in Education and Statistics at the University of Michigan, Institute for Social Research #2050, 426 Thompson Street, Ann Arbor, MI 48106-1248; [email protected]. She has been a part of the "Building Capacity for Evaluating Group-Level Interventions" project sponsored by the William T. Grant Foundation since January 2004. She co-authored the documentation that accompanies the Optimal Design software and has been a part of the consulting team that assists researchers in the design and analysis of group-randomized trials.

Manuscript received March 20, 2006
Final revision received December 6, 2006
Accepted January 8, 2007
