

Principal Stratification Approach to Broken Randomized Experiments: A Case Study of School Choice Vouchers in New York City

John BARNARD, Constantine E. FRANGAKIS, Jennifer L. HILL, and Donald B. RUBIN

Journal of the American Statistical Association, June 2003, Vol. 98, No. 462, Applications and Case Studies. DOI 10.1198/016214503000071. © 2003 American Statistical Association.

The precarious state of the educational system in the inner cities of the United States, as well as its potential causes and solutions, has been a popular topic of debate in recent years. Part of the difficulty in resolving this debate is the lack of solid empirical evidence regarding the true impact of educational initiatives. The efficacy of so-called "school choice" programs has been a particularly contentious issue. A current multimillion-dollar program, the School Choice Scholarship Foundation Program in New York, randomized the distribution of vouchers in an attempt to shed some light on this issue. This is an important time for school choice, because on June 27, 2002, the U.S. Supreme Court upheld the constitutionality of a voucher program in Cleveland that provides scholarships both to secular and religious private schools. Although this study benefits immensely from a randomized design, it suffers from complications common to such research with human subjects: noncompliance with assigned "treatments" and missing data. Recent work has revealed threats to valid estimates of experimental effects that exist in the presence of noncompliance and missing data, even when the goal is to estimate simple intention-to-treat effects. Our goal was to create a better solution when faced with both noncompliance and missing data. This article presents a model that accommodates these complications, is based on the general framework of "principal stratification," and thus relies on more plausible assumptions than standard methodology. Our analyses revealed positive effects on math scores for children who applied to the program from certain types of schools: those with average test scores below the citywide median. Among these children, the effects are stronger for children who applied in the first grade and for African-American children.

KEY WORDS: Causal inference; Missing data; Noncompliance; Pattern mixture models; Principal stratification; Rubin causal model; School choice.

1. INTRODUCTION

There appears to be a crisis in America's public schools. "More than half of 4th and 8th graders fail to reach the most minimal standard on national tests in reading, math, and science, meaning that they probably have difficulty doing grade-level work" (Education Week 1998). The problem is worse in high-poverty urban areas. For instance, although only 43% of urban fourth-graders achieved a basic level of skill on a National Assessment of Educational Progress (NAEP) reading test, a meager 23% of those in high-poverty urban schools met this standard.

One of the most complicated and contentious of educational reforms currently being proposed is school choice. Debates about the equity and potential efficacy of school choice have increased in intensity over the past few years. Authors making a case for school choice include Cobb (1992), Brandl (1998), and Coulson (1999).

John Barnard is Senior Research Statistician, deCODE Genetics, Waltham, MA 02451 (E-mail: johnbarnard@decode.is). Constantine E. Frangakis is Assistant Professor, Department of Biostatistics, Johns Hopkins University, Baltimore, MD 21205 (E-mail: cfrangak@jhsph.edu). Jennifer L. Hill is Assistant Professor, School of International and Public Affairs, Columbia University, New York, NY 10027 (E-mail: jh1030@columbia.edu). Donald B. Rubin is John L. Loeb Professor of Statistics and Chair, Department of Statistics, Harvard University, Cambridge, MA 02138 (E-mail: rubin@stat.harvard.edu). The authors thank the editor, an associate editor, and three reviewers for very helpful comments; David Myers and Paul E. Peterson as principal coinvestigators for this evaluation; and the School Choice Scholarships Foundation (SCSF) for cooperating fully with this evaluation. The work was supported in part by National Institutes of Health (NIH) grant RO1 EY 014314-01; National Science Foundation grants SBR 9709359 and DMS 9705158; and by grants from the following foundations: Achelis Foundation, Bodman Foundation, Lynde and Harry Bradley Foundation, Donner Foundation, Milton and Rose D. Friedman Foundation, John M. Olin Foundation, David and Lucile Packard Foundation, Smith Richardson Foundation, and the Spencer Foundation. The authors also thank Kristin Kearns Jordan and other members of the SCSF staff for their cooperation and assistance with data collection, and Daniel Mayer and Julia Kim, from Mathematica Policy Research, for preparing the survey and test score data and answering questions about those data. The methodology, analyses of data, reported findings, and interpretations of findings are the sole responsibility of the authors and are not subject to the approval of SCSF or of any foundation providing support for this research.

A collection of essays that report mainly positive school choice effects has been published by Peterson and Hassel (1998). Recent critiques of school choice include those by the Carnegie Foundation for the Advancement of Teaching (1992), Cookson (1994), Fuller and Elmore (1996), and Levin (1998).

In this article we evaluate a randomized experiment conducted in New York City, made possible by the privately funded School Choice Scholarships Foundation (SCSF). The SCSF program provided the first opportunity to examine the question of the potential for improved school performance (as well as parental satisfaction and involvement, school mobility, and racial integration) in private schools versus public schools using a carefully designed and monitored randomized field experiment. Earlier studies were observational in nature and thus subject to selection bias (i.e., nonignorable treatment assignment). Studies finding positive educational benefits from attending private schools include those of Coleman, Hoffer, and Kilgore (1982), Chubb and Moe (1990), and Derek (1997). Critiques of these studies include those of Goldberger and Cain (1982) and Wilms (1985). On June 27, 2002, the U.S. Supreme Court upheld the constitutionality of a voucher program in Cleveland that provides scholarships both to secular and religious private schools.

As occurs in most research involving human subjects, however, our study, although carefully implemented, suffered from complications due to missing background and outcome data and also to noncompliance with the randomly assigned treatment. We focus on describing and addressing these complications in our study using a Bayesian approach within the framework of principal stratification (Frangakis and Rubin 2002).

We describe the study in Section 2 and summarize its data complications in Section 3. In Section 4 we place the study in the context of broken randomized experiments, a phrase apparently first coined by Barnard, Du, Hill, and Rubin (1998). We discuss the framework that we use in Section 5. We present our model's structural assumptions in Section 6 and its parametric assumptions in Section 7. We give the main results of the analysis in Section 8 (with some supplementary results in the Appendix). We discuss model building and checking in Section 9, and conclude the article in Section 10.

2. THE SCHOOL CHOICE SCHOLARSHIPS FOUNDATION PROGRAM

In February 1997, the SCSF announced that it would provide 1,300 private school scholarships to "eligible" low-income families. Eligibility required that the children, at the time of application, be living in New York City, entering grades 1–5, currently attending a public school, and from families with incomes low enough to qualify for free school lunch. That spring, SCSF received applications from more than 20,000 students. To participate in the lottery that would award the scholarships, a family had to attend a session during which (1) their eligibility was confirmed, (2) a questionnaire of background characteristics was administered to the parents/guardians, and (3) a pretest was administered to the eligible children. The final lottery, held in May 1997, was administered by Mathematica Policy Research (MPR), and the SCSF offered winning families help in finding placements in private schools.

Details of the design have been described by Hill, Rubin, and Thomas (2000). The final sample sizes of children are displayed in Table 1. PMPD refers to the randomized design developed for this study (propensity matched pairs design). This design relies on propensity score matching (Rosenbaum and Rubin 1983), which was used to choose a control group for the families in the first application period, where there were more applicants who did not win the scholarship than could be followed. The "single" and "multi" classifications describe families with one child and with more than one child participating in the program, respectively.

Table 1. Sample Sizes in the SCSF Program

                                          Randomized block
Family size   Treatment      PMPD     1     2     3     4    Subtotal   Total
Single        Scholarship     353    72    65    82   104       323       676
              Control         353    72    65    82   104       323       676
Multi         Scholarship     147    44    27    31    75       177       324
              Control         147    27    23    33    54       137       284
Total                        1000                               960      1960

For period 1 and single-child families, Table 2 (taken from Barnard, Frangakis, Hill, and Rubin 2002, table 16) compares the balance achieved on background variables with the PMPD and two other possible designs: a simple random sample (RS) and a stratified random sample (STRS) of the same size from the pool of all potential matching subjects at period 1. For the STRS, the strata are the "applicant's school" (low/high), which indicates whether the applicant child originates from a school that had average test scores below (low) or above (high) the citywide median in the year of application. Measures of comparison in Table 2 are Z statistics between the randomized arms. Overall, the PMPD produces better balance in 15 of the 21 variables compared with the RS design. The PMPD's balance was better in 11 variables and worse in 9 variables (1 tie) compared with the STRS, although the gains are generally larger in the former case than in the latter case. The table also demonstrates balance for the application periods 2–5, which were part of a standard randomized block design in which the blocks were each of the four periods cross-classified by family size and by applicant's school.

More generally, the entire experiment is a randomized design where the assignment probabilities are a function of the following design variables: period of application, applicant's school, family size (single child versus multichild), and the estimated propensity scores from the PMPD. (For additional information on the design, see Hill et al. 2000.) Next we focus on the study's main data complications.
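To make the matching step behind the PMPD concrete, the following is a minimal sketch of propensity-matched pair construction; it is our illustration, not the procedure of Hill et al. (2000), and the covariate matrix, variable names, and use of scikit-learn's logistic regression are our own assumptions. It estimates the propensity score e(x) = Pr(Z = 1 | X = x) and then greedily pairs each treated family with the closest still-unmatched control.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def propensity_matched_pairs(X, z):
        """Greedy nearest-neighbor matching on estimated propensity scores.

        X : (n, p) array of baseline covariates (e.g., pretest scores, grade).
        z : (n,) 0/1 array; 1 = won scholarship, 0 = potential control.
        Returns a list of (treated_index, control_index) pairs.
        """
        # Estimate e(x) = Pr(Z = 1 | X = x) with a logistic regression.
        e_hat = LogisticRegression(max_iter=1000).fit(X, z).predict_proba(X)[:, 1]
        treated = np.flatnonzero(z == 1)
        controls = list(np.flatnonzero(z == 0))
        pairs = []
        # Match hardest-to-match treated units (highest e_hat) first,
        # each to the closest remaining control on the propensity score.
        for i in treated[np.argsort(-e_hat[treated])]:
            if not controls:
                break
            j = min(controls, key=lambda k: abs(e_hat[k] - e_hat[i]))
            controls.remove(j)
            pairs.append((i, j))
        return pairs

    # Toy usage with simulated covariates:
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    z = rng.binomial(1, 1.0 / (1.0 + np.exp(-X[:, 0])))
    print(propensity_matched_pairs(X, z)[:5])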

Table 2. Design Comparisons in Balance of Background Variables, Single-Child Families. The Numbers Are Z Statistics From Comparing Observed Values of Variables Between Assignments

                                        Application period 1                Periods 2-5
Variable                         Simple random    Stratified        PMPD    Randomized
                                 sample           random sample             block
Applicant's school (low/high)        -.98             0              .11        .21
Grade level                         -1.63             .03           -.03       -.39
Pretest read score                   -.38             .65            .48      -1.05
Pretest math score                   -.51            1.17            .20      -1.37
African-American                     1.80            1.68           1.59       1.74
Mother's education                    .16             .14            .09       1.67
In special education                  .31            1.66           -.17        .22
In gifted program                     .42           -1.16           -.13        .75
English main language               -1.06            -.02          -1.03       -.44
AFDC                                 -.28             .49            .83      -1.57
Food stamps                         -1.08            -.27            .94      -1.31
Mother works                        -1.26            -.30          -1.18        .40
Educational expectations              .50            1.79            .57        .19
Children in household               -1.01           -1.75            .41      -1.02
Child born in U.S.                    .49             .73          -1.40       -.69
Length of residence                   .42             .71            .66       -.78
Father's work missing                1.09             .70           0           .16
Catholic religion                   -1.84            -.19           -.74       -.80
Male                                  .88            1.22            .76        .53
Income                               -.38            -.62            .74      -1.21
Age as of 4/97                      -1.57             .18           -.47       -.87


3. DATA COMPLICATIONS

The data that we use include the design variables, the background survey collected at the verification sessions, pretest data, and posttest data collected the following spring. The test scores used as outcomes in our analyses are grade-normed national percentile rankings of reading and math scores from the Iowa Test of Basic Skills (ITBS). The ITBS was used because it is not the test administered in the New York City public school system, which reduces the possibility of teachers in schools with participating children "teaching to the test."

Attempts to reduce missingness of data included requiring attendance at the initial verification sessions and providing financial incentives to attend the follow-up testing sessions. Despite these attempts, missingness of background variables did occur before randomization. In principle, such missingness is also a covariate and so does not directly create imbalance of subjects between randomized arms, although it does create loss in efficiency when, as in this study, background covariates are used in the analysis. For example, for single-child families, depending on application period and background strata, 34%–51% of the children's pretest scores were missing at the time of design planning. Since then, MPR has processed and provided an additional 7% and 14% of the reading and mathematics pretest scores. These scores were not as balanced between arms as when choosing the design (see Table 2), although the difference was not statistically significant at conventional levels. Hence, we used all available pretest scores in the final analysis; that is, the analysis conditions on these scores.

Outcomes were also incomplete. Among the observed outcomes in single-child families, the average (standard deviation) was 23.5 (22.5) percentile rankings for mathematics and 28.1 (23.9) percentile rankings for reading, and unadjusted comparisons between randomized arms do not point to a notable difference. However, in contrast to missingness of background variables, missingness of outcomes that occurs after randomization is not guaranteed to be balanced between the randomized arms. For example, depending on application period and background strata, 18%–27% of the children did not provide posttest scores, and, during design periods 2–5, the response is higher among scholarship winners (80%) than among the other children (73%). Analyses that would be limited to complete cases for these variables and the variables used to calculate the propensity score would discard more than half of the units. Moreover, standard adjustments for outcome missingness ignore its potential interaction with the other complications and generally make implicit and unrealistic assumptions.

Another complication was noncompliance: attendance at a private school was not perfectly correlated with winning a scholarship. For example, for single-child families, and depending on application period and background strata, 20%–27% of children who won scholarships did not use them (scholarships were in the amount of $1,400, which generally did not fully cover tuition at a private school), and 6%–10% of children who did not win scholarships were sent to private schools nevertheless.

Two additional complications limit our analysis sample. First, no pretest scores were obtained for applicants in kindergarten, because these children most likely had never been exposed to a standardized test; hence, considerable time would have been spent instructing them on how to take a test, and there was concern that separating such young children from their guardians in this new environment might lead to behavioral issues. Hence, we focus our analyses on the children who applied in first grade or above. Second, we do not yet have complete compliance data for the multichild families. Consequently, all analyses reported in this article are further limited to results for the 1,050 single-child families who were in grades 1–4 at the time of the spring 1997 application process.

4. THE STUDY AS A BROKEN RANDOMIZED EXPERIMENT

The foregoing deviations from the study's protocol clarify that our experiment does not really randomize attendance, but rather randomizes the "encouragement," using financial support, to attend a private school rather than a public school. Moreover, as in most encouragement studies, interest here focuses not only on the effect of encouragement itself (which will depend on what percentage of people encouraged would actually participate if the voucher program were to be more broadly implemented), but also on the effect of the treatment being encouraged, here attending private versus public schools. If there were perfect compliance, so that all those encouraged to attend private school actually did so and all those not so encouraged stayed in public school, then the effect being estimated typically would be attributed to private versus public school attendance, rather than simply to the encouragement.

We focus on defining and estimating two estimands: the intention-to-treat (ITT) effect, the effect of the randomized encouragement on all subjects; and the complier average causal effect (CACE), the effect of the randomized encouragement on all subjects who would comply with their treatment assignment no matter which assignment they would be given (here, the children who would have attended private school if they had won a scholarship and would not have attended had they not won a scholarship). These quantities are defined more formally in Section 8.
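Anticipating those formal definitions, and in the potential outcomes notation introduced in Section 5, the two estimands for a subject $i$ with policy variables $W^p_i$ can be sketched as

$$\text{ITT}: \; E\bigl(Y_i(1) - Y_i(0) \mid W^p_i, \theta\bigr), \qquad \text{CACE}: \; E\bigl(Y_i(1) - Y_i(0) \mid W^p_i, C_i = c, \theta\bigr),$$

where $Y_i(z)$ is the outcome under assignment $z$ and $C_i = c$ denotes the compliers.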

In recent years, there has been substantial progress in the analysis of encouragement designs, based on building bridges between statistical and econometric approaches to causal inference. In particular, the widely accepted approach in statistics to formulating causal questions is in terms of "potential outcomes." Although this approach has roots dating back to Neyman and Fisher in the context of perfect randomized experiments (Neyman 1923; Rubin 1990), it is generally referred to as Rubin's causal model (Holland 1986) for work extending the framework to observational studies (Rubin 1974, 1977) and including modes of inference other than randomization-based, in particular, Bayesian (Rubin 1978a, 1990). In economics, the technique of instrumental variables, due to Tinbergen (1930) and Haavelmo (1943), has been a main tool of causal inference in the type of nonrandomized studies prevalent in that field. Angrist, Imbens, and Rubin (1996) showed how the approaches can be viewed as completely compatible, thereby clarifying and strengthening each approach. The result was an interpretation of the instrumental variables technology as a way to approach a randomized experiment that suffers from noncompliance, such as a randomized encouragement design.


In encouragement designs with compliance as the only partially uncontrolled factor and with full outcome data, Imbens and Rubin (1997) extended the Bayesian approach to causal inference of Rubin (1978a) to handle simple randomized experiments with noncompliance, and Hirano, Imbens, Rubin, and Zhou (2000) further extended the approach to handle fully observed covariates.

In encouragement designs with more than one partially uncontrolled factor, as with noncompliance and missing outcomes in our study, defining and estimating treatment effects of interest becomes more challenging. Existing methods (e.g., Robins, Greenland, and Hu 1999) are designed for studies that differ from our study in the goals and the degree of control of the aforementioned factors. (For a detailed comparison of such frameworks, see Frangakis and Rubin 2002.) Frangakis and Rubin (1999) studied a more flexible framework for encouragement designs with both noncompliance to treatment and missing outcomes, and showed that for estimation of either the ITT effect or the CACE, one cannot in general obtain valid estimates using standard ITT analyses (i.e., analyses that ignore data on compliance behavior) or standard IV analyses (i.e., those that ignore the interaction between compliance behavior and outcome missingness). They also provided consistent moment-based estimators that can estimate both ITT and CACE under assumptions more plausible than those underlying more standard methods.

Barnard et al. (1998) extended that template to allow for missing covariate and multivariate outcome values; they stopped short, however, of introducing specific methodology for this framework. We present a solution to a still more challenging situation, in which we have a more complicated form of noncompliance: some children attend private school without receiving the monetary encouragement (thus receiving treatment without having been assigned to it). Under assumptions similar to those of Barnard et al. (1998), we next fully develop a Bayesian framework that yields valid estimates of quantities of interest and also properly accounts for our uncertainty about these quantities.

5. PRINCIPAL STRATIFICATION IN SCHOOL CHOICE AND ROLE FOR CAUSAL INFERENCE

To make explicit the assumptions necessary for valid causal inference in this study, we first introduce "potential outcomes" (see Rubin 1979; Holland 1986) for all of the posttreatment variables. Potential outcomes for any given variable represent the observable manifestations of this variable under each possible treatment assignment. In particular, if child $i$ in our study ($i = 1, \ldots, n$) is to be assigned to treatment $z$ (1 for private school and 0 for public school), we denote the following: $D_i(z)$ for the indicator equal to 1 if the child attends private school and 0 if the child attends public school; $Y_i(z)$ for the potential outcomes if the child were to take the tests; and $R^y_i(z)$ for the indicators equal to 1 if the child takes the tests. We denote by $Z_i$ the indicator equal to 1 for the observed assignment to private school or not, and let $D_i = D_i(Z_i)$, $Y_i = Y_i(Z_i)$, and $R^y_i = R^y_i(Z_i)$ designate the actual type of school, the outcome to be recorded by the test, and whether or not the child takes the test under the observed assignment.

The notation for these and the remaining definitions of relevant variables in this study is summarized in Table 3. In our study the outcomes are reading and math test scores, so the dimension of $Y_i(z)$ equals two, although more generally our template allows for repeated measurements over time.

Our design variables are application period, low/high indicators for the applicant's school, grade level, and propensity score. In addition, corresponding to each set of these individual-specific random variables is a variable (vector or matrix) without subscript $i$ that refers to the set of these variables across all study participants. For example, $Z$ is the vector of treatment assignments for all study participants, with $i$th element $Z_i$, and $X$ is the matrix of partially observed background variables, with $i$th row $X_i$.

The variable $C_i$ (see Table 3), the joint vector of treatment receipt under both treatment assignments, is particularly important. Specifically, $C_i$ defines four strata of people: compliers, who take the treatment if so assigned and take control if so assigned; never takers, who never take the treatment no matter the assignment; always takers, who always take the treatment no matter the assignment; and defiers, who do the opposite of the assignment no matter its value.

Table 3. Notation for the ith Subject

Notation          Specifics                                         General description
Z_i               1 if i assigned treatment;                        Binary indicator of treatment assignment
                  0 if i assigned control
D_i(z)            1 if i receives treatment under assignment z;     Potential outcome formulation of treatment receipt
                  0 if i receives control under assignment z
D_i               D_i(Z_i)                                          Binary indicator of treatment receipt
C_i               c if D_i(0) = 0 and D_i(1) = 1;                   Compliance principal stratum: c = complier,
                  n if D_i(0) = 0 and D_i(1) = 0;                   n = never taker, a = always taker, d = defier
                  a if D_i(0) = 1 and D_i(1) = 1;
                  d if D_i(0) = 1 and D_i(1) = 0
Y_i(z)            (Y_i^(math)(z), Y_i^(read)(z))                    Potential outcomes for math and reading
Y_i               (Y_i^(math)(Z_i), Y_i^(read)(Z_i))                Math and reading outcomes under observed assignment
R_i^y(math)(z)    1 if Y_i^(math)(z) would be observed;             Response indicator for Y_i^(math)(z) under assignment z;
                  0 if Y_i^(math)(z) would not be observed          similarly for R_i^y(read)(z)
R_i^y(z)          (R_i^y(math)(z), R_i^y(read)(z))                  Vector of response indicators for Y_i(z)
R_i^y             (R_i^y(math)(Z_i), R_i^y(read)(Z_i))              Vector of response indicators for Y_i
W_i               (W_i1, ..., W_iK)                                 Fully observed background and design variables
X_i               (X_i1, ..., X_iQ)                                 Partially observed background variables
R_i^x             (R_i^x1, ..., R_i^xQ)                             Vector of response indicators for X_i


These strata are not fully observed, in contrast to the observed strata of actual assignment and attendance $(Z_i, D_i)$. For example, children who are observed to attend private school when winning the lottery are a mixture of compliers ($C_i = c$) and always takers ($C_i = a$). Such explicit stratification on $C_i$ dates back at least to the work of Imbens and Rubin (1997) on randomized trials with noncompliance, was generalized and taxonomized by Frangakis and Rubin (2002) to posttreatment variables in partially controlled studies, and is termed a "principal stratification" based on the posttreatment variables.

The principal strata $C_i$ have two important properties. First, they are not affected by the assigned treatment. Second, comparisons of potential outcomes under different assignments within principal strata, called principal effects, are well-defined causal effects (Frangakis and Rubin 2002). These properties make principal stratification a powerful framework for evaluation, because it allows us to explicitly define estimands that better represent the effect of attendance in the presence of noncompliance, and to explore richer and explicit sets of assumptions that allow estimation of these effects under more plausible than standard conditions. Sections 6 and 7 discuss such a set of more flexible assumptions and estimands.

6. STRUCTURAL ASSUMPTIONS

First, we state explicitly our structural assumptions about the data with regard to the causal process, the missing-data mechanism, and the noncompliance structure. These assumptions are expressed without reference to a particular parametric distribution and are the ones that make the estimands of interest identifiable, as also discussed in Section 6.4.

6.1 Stable Unit Treatment Value Assumption

A standard assumption made in causal analyses is the stable unit treatment value assumption (SUTVA), formalized with potential outcomes by Rubin (1978a, 1980, 1990). SUTVA combines the no-interference assumption (Cox 1958), that one unit's treatment assignment does not affect another unit's outcomes, with the assumption that there are "no versions of treatments." For the no-interference assumption to hold, whether or not one family won a scholarship should not affect another family's outcomes, such as their choice to attend private school or their children's test scores. We expect our results to be robust to the types and degree of deviations from no interference that might be anticipated in this study. To satisfy the "no versions of treatments" assumption, we need to limit the definition of private and public schools to those participating in the experiment. Generalizability of the results to other schools must be judged separately.

6.2 Randomization

We assume that scholarships have been randomly assigned. This implies that

$$p\bigl(Z \mid Y(1), Y(0), X, W, C, R^y(0), R^y(1), R^x, \theta\bigr) = p(Z \mid W^*, \theta) = p(Z \mid W^*),$$

where $W^*$ represents the design variables in $W$, and $\theta$ is generic notation for the parameters governing the distribution of all the variables. There is no dependence on $\theta$ because there are no unknown parameters governing the treatment-assignment mechanism: MPR assigned the scholarships by lottery, and the randomization probabilities within levels of applicant's school (low/high) and application period are known.

6.3 Noncompliance Process Assumptions: Monotonicity and Compound Exclusion

We assume monotonicity: that there are no "defiers," that is, for all $i$, $D_i(1) \ge D_i(0)$ (Imbens and Angrist 1994). In the SCSF program, defiers would be families who would not use a scholarship if they won one, but would pay to go to private school if they did not win a scholarship. It seems implausible that such a group of people exists; therefore, the monotonicity assumption appears to be reasonable for our study.

By definition, the never takers and always takers will participate in the same treatment (control or treatment) regardless of which treatment they are randomly assigned. For this reason, and to facilitate estimation, we assume compound exclusion: the outcomes and missingness of outcomes for never takers and always takers are not affected by treatment assignment. This compound exclusion restriction (Frangakis and Rubin 1999) generalizes the standard exclusion restriction (Angrist et al. 1996; Imbens and Rubin 1997) and can be expressed formally for the distributions as

$$p\bigl(Y(1), R^y(1) \mid X^{obs}, R^x, W, C = n\bigr) = p\bigl(Y(0), R^y(0) \mid X^{obs}, R^x, W, C = n\bigr) \quad \text{for never takers}$$

and

$$p\bigl(Y(1), R^y(1) \mid X^{obs}, R^x, W, C = a\bigr) = p\bigl(Y(0), R^y(0) \mid X^{obs}, R^x, W, C = a\bigr) \quad \text{for always takers.}$$

Compound exclusion seems more plausible for never takers than for always takers in this study. Never takers stayed in the public school system whether they won a scholarship or not. Although disappointment about winning a scholarship but still not being able to take advantage of it can exist, it is unlikely to cause notable differences in subsequent test scores or response behaviors.

Always takers, on the other hand, might have been in one private school had they won a scholarship, or in another if they had not, particularly because those who won scholarships had access to resources to help them find an appropriate private school and had more money (up to $1,400) to use toward tuition. In addition, even if the child had attended the same private school under either treatment assignment, the extra $1,400 in family resources for those who won the scholarship could have had an effect on student outcomes. However, because in our application the estimated percentage of always takers is so small (approximately 9%), and because this estimate is robust, thanks to the randomization, to relaxing compound exclusion, there is reason to believe that this assumption will not have a substantial impact on the results.

Under the compound exclusion restriction, the ITT comparison of all outcomes $Y_i(0)$ versus $Y_i(1)$ includes the null comparison among the subgroups of never takers and always takers. Moreover, by monotonicity, the compliers ($C_i = c$) are the only group of children who would attend private school if and only if offered the scholarship.


For this reason, we take the CACE (Imbens and Rubin 1997), defined as the comparison of outcomes $Y_i(0)$ versus $Y_i(1)$ among the principal stratum of compliers, to represent the effect of attending public versus private school.

6.4 Latent Ignorability

We assume that potential outcomes are independent of missingness, given observed covariates, conditional on the compliance strata; that is,

$$p\bigl(R^y(0), R^y(1) \mid R^x, Y(1), Y(0), X^{obs}, W, C, \theta\bigr) = p\bigl(R^y(0), R^y(1) \mid R^x, X^{obs}, W, C, \theta\bigr).$$

This assumption represents a form of latent ignorability (LI) (Frangakis and Rubin 1999), in that it conditions on variables that are (at least partially) unobserved, or latent: here, the compliance principal strata $C$. We make it here, first, because it is more plausible than the assumption of standard ignorability (SI) (Rubin 1978a; Little and Rubin 1987), and second, because making it leads to different likelihood inferences.

LI is more plausible than SI to the extent that it provides a closer approximation to the missing-data mechanism. The intuition behind this assumption in our study is that, for a subgroup of people with the same covariate missing-data patterns $R^x$, similar values for covariates observed in that pattern $X^{obs}$, and the same compliance stratum $C$, a flip of a coin could determine which of these individuals shows up for a posttest. This is a more reasonable assumption than SI, because it seems quite likely that, for example, compliers would exhibit different attendance behavior for posttests than, say, never takers (even conditional on other background variables). Explorations of raw data from this study across individuals with known compliance status provide empirical support that $C$ is a strong factor in outcome missingness, even when other covariates are included in the model. This fact is also supported by the literature on noncompliance (see, e.g., The Coronary Drug Project Research Group 1980).
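For contrast, a sketch of the corresponding SI assumption, which conditions only on fully observed quantities and drops the latent stratum $C$ from the conditioning set, is

$$p\bigl(R^y(0), R^y(1) \mid R^x, Y(1), Y(0), X^{obs}, W, \theta\bigr) = p\bigl(R^y(0), R^y(1) \mid R^x, X^{obs}, W, \theta\bigr);$$

LI instead allows the response behavior to differ across compliance principal strata.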

Regarding improved estimation, when LI (and the preceding structural assumptions) hold but the likelihood is constructed assuming SI, the underlying causal effects are identifiable (alternatively, the posterior distribution with increasing sample size converges to the truth) only if the additional assumption is made that, within subclasses of subjects with similar observed variables, the partially missing compliance principal stratum $C$ is not associated with potential outcomes. However, as noted earlier, this assumption is not plausible.

Theoretically, only the structural assumptions described earlier are needed to identify the underlying ITT and CACE causal effects (Frangakis and Rubin 1999). Estimation based solely on those identifiability relations in principle requires very large sample size and explicit conditioning (i.e., stratification) on the subclasses defined by the observed part of the covariates $X^{obs}$ and the pattern of missingness of the covariates $R^x$, as well as implicit conditioning (i.e., deconvolution of mixtures) on the sometimes-missing compliance principal stratum $C$. This works in large samples because covariate missingness and compliance principal stratum are also covariates (i.e., defined pretreatment), so samples within subclasses defined by $X^{obs}$, $R^x$, and $C$ themselves represent randomized experiments. With our experiment's sample size, however, performing completely separate analyses on all of these strata is not necessarily desirable or feasible. Therefore, we consider more parsimonious modeling approaches, which have the role of assisting, not creating, inference, in the sense that results should be robust to different parametric specifications (Frangakis and Rubin 2001, rejoinder).

7. PARAMETRIC PATTERN MIXTURE MODEL

Generally speaking, constrained estimation of separate analyses within missing-data patterns is the motivation behind pattern mixture modeling. Various authors have taken pattern mixture model approaches to missing data, including Little (1993, 1996), Rubin (1978b), and Glynn, Laird, and Rubin (1993). Typically, pattern mixture models partition the data with respect to the missingness of the variables. Here we partition the data with respect to the covariate missing-data patterns $R^x$, as well as compliance principal strata $C$, design variables $W$, and the main covariates $X^{obs}$. This represents a partial pattern mixture approach. One argument in favor of this approach is that it focuses the model on the quantities of interest in such a way that parametric specifications for the marginal distributions of $R^x$, $W$, and $X^{obs}$ can be ignored. To capitalize on the structural assumptions, consider the factorization of the joint distribution for the potential outcomes and compliance strata, conditional on the covariates and their missing-data patterns:

$$p\bigl(Y_i(0), Y_i(1), R^y_i(0), R^y_i(1), C_i \mid W_i, X^{obs}_i, R^x_i, \theta\bigr)$$
$$= p\bigl(C_i \mid W_i, X^{obs}_i, R^x_i, \theta^C\bigr)$$
$$\times p\bigl(R^y_i(0), R^y_i(1) \mid W_i, X^{obs}_i, R^x_i, C_i, \theta^R\bigr)$$
$$\times p\bigl(Y_i(0), Y_i(1) \mid W_i, X^{obs}_i, R^x_i, C_i, \theta^Y\bigr),$$

where the product in the last line follows by latent ignorability, and $\theta = (\theta^C, \theta^R, \theta^Y)'$. Note that the response pattern of covariates for each individual is itself a covariate. The parametric specifications for each of these components are described next.

7.1 Compliance Principal Stratum Submodel

The compliance status model contains two conditional probit models, defined using indicator variables $C_{i,c}$ and $C_{i,n}$ for whether individual $i$ is a complier or a never taker:

$$C_{i,n} = 1 \quad \text{if} \quad C^*_{i,n} \equiv g_1(W_i, X^{obs}_i, R^x_i)'\beta^{C,1} + V_i \le 0$$

and

$$C_{i,c} = 1 \quad \text{if} \quad C^*_{i,n} > 0 \ \text{and} \ C^*_{i,c} \equiv g_0(W_i, X^{obs}_i, R^x_i)'\beta^{C,2} + U_i \le 0,$$

where $V_i \sim N(0,1)$ and $U_i \sim N(0,1)$, independently.

The link functions $g_0$ and $g_1$ attempt to strike a balance between, on the one hand, including all of the design variables, as well as the variables regarded as most important either in predicting compliance or in having interactions with the treatment effect, and, on the other hand, maintaining parsimony. The results discussed in Section 8 use a compliance component model whose link function $g_1$ is linear in, and fits distinct parameters for, an intercept; applicant's school (low/high); indicators for application period; propensity scores for subjects applying in


the PMPD period, and propensity scores for the other periods; indicators for grade of the student; ethnicity (1 if the child or guardian identifies herself as African-American, 0 otherwise); an indicator for whether or not the pretest scores of reading and math were available; and the pretest scores (reading and math) for the subjects with available scores. A propensity score for the students not in the PMPD period is not necessary from the design standpoint and is in fact constant. However, to increase efficiency and to reduce bias due to missing outcomes, here we include an "estimated propensity score value" for these periods, calculated as the function derived for the propensity score for students in the PMPD period and evaluated at the covariate values for the students in the other periods as well. Also, the foregoing link function $g_0$ is the same as $g_1$, except that it excludes the indicators for application period as well as the propensity scores for applicants who did not apply in the PMPD period (i.e., those for whom propensity score was not a design variable). This link function, a more parsimonious version of one we used in earlier models, was more appropriate to fit the relatively small proportion of always takers in this study. Finally, because the pretests were either jointly observed or jointly missing, one indicator for missingness of pretest scores is sufficient.

The prior distributions for the compliance submodel are $\beta^{C,1} \sim N(\beta^{C,1}_0, \{\sigma^{C,1}\}^2 I)$ and $\beta^{C,2} \sim N(0, \{\sigma^{C,2}\}^2 I)$, independently, where $\{\sigma^{C,1}\}^2$ and $\{\sigma^{C,2}\}^2$ are hyperparameters set at 10, and $\beta^{C,1}_0$ is a vector of 0s, with the exception of the first element, which is set equal to

$$-\Phi^{-1}(1/3) \ast \Bigl\{\sigma^{C,1}(1/n)\sum\nolimits_i^n g_1(i)'g_1(i) + 1\Bigr\}^{1/2},$$

where $g_1(i) = g_1(W_i, X^{obs}_i, R^x_i)$ and $n$ is our sample size. These priors reflect our a priori ignorance about the probability that any individual belongs to each compliance status, by approximately setting each of their prior probabilities at 1/3.

7.2 Outcome Submodel

We first describe the marginal distribution for the math outcome, $Y^{(math)}$. To address the pile-up of many scores of 0, we posit the censored model

$$Y^{(math)}_i(z) = \begin{cases} 0, & \text{if } Y^{(math)*}_i(z) \le 0, \\ 100, & \text{if } Y^{(math)*}_i(z) \ge 100, \\ Y^{(math)*}_i(z), & \text{otherwise,} \end{cases}$$

where

$$Y^{(math)*}_i(z) \mid W_i, X^{obs}_i, R^x_i, C_i, \theta^{(math)} \sim N\bigl(g_2(W_i, X^{obs}_i, R^x_i, C_i, z)'\beta^{(math)},\ \exp\bigl[g_3(X^{obs}_i, R^x_i, C_i, z)'\zeta^{(math)}\bigr]\bigr)$$

for $z = 0, 1$, and $\theta^{(math)} = (\beta^{(math)}, \zeta^{(math)})$. Here $Y^{(math)*}_i(0)$ and $Y^{(math)*}_i(1)$ are assumed conditionally independent, an assumption that has no effect on inference for our superpopulation parameters of interest (Rubin 1978a).
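A minimal simulation sketch of this censoring mechanism (ours; the mean and log-variance link terms are abbreviated to precomputed arrays) shows how the model generates the pile-up of scores at 0:

    import numpy as np

    def draw_censored_scores(mu, log_var, rng):
        """Draw latent Y* ~ N(mu, exp(log_var)) and censor to [0, 100]."""
        y_star = rng.normal(mu, np.exp(0.5 * log_var))  # sd = exp(log_var / 2)
        return np.clip(y_star, 0.0, 100.0)              # piles up at 0 and 100

    rng = np.random.default_rng(2)
    scores = draw_censored_scores(np.full(5000, 20.0),
                                  np.full(5000, np.log(400.0)), rng)
    print(round(float((scores == 0).mean()), 3))        # share censored at 0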

The results reported in Section 8 use an outcome component model whose outcome mean link function $g_2$ is linear in distinct parameters for the following:

1. For the students of the PMPD design: an intercept, applicant's school (low/high), ethnicity, indicators for grade, an indicator for whether or not the pretest scores were available, pretest score values for the subjects with available scores, and propensity score.
2. For the students of the other periods: the variables in item 1, along with indicators for application period.
3. An indicator for whether or not a person is an always taker.
4. An indicator for whether or not a person is a complier.
5. For compliers assigned treatment: an intercept, applicant's school (low/high), ethnicity, and indicators for the first three grades (with the variable for the fourth-grade treatment effect as a function of the already-included variables).

For the variance of the outcome component, the link function $g_3$ includes indicators that saturate the cross-classification of whether or not a person applied in the PMPD period and whether or not the pretest scores were available. This dependence is needed because each pattern conditions on a different set of covariates; that is, $X^{obs}$ varies from pattern to pattern.

The prior distributions for the outcome submodel are

$$\beta^{(math)} \mid \zeta^{(math)} \sim N\bigl(0, F(\zeta^{(math)})\,\xi I\bigr),$$

where

$$F\bigl(\zeta^{(math)}\bigr) = \frac{1}{K}\sum_k \exp\bigl(\zeta^{(math)}_k\bigr),$$

$\zeta^{(math)} = (\zeta^{(math)}_1, \ldots, \zeta^{(math)}_K)$, with one component for each of the $K$ (in our case $K = 4$) subgroups defined by cross-classifying the PMPD/non-PMPD classification and the missing-data indicator for pretest scores; $\xi$ is a hyperparameter set at 10; and $\exp(\zeta^{(math)}_k) \sim \text{inv-}\chi^2(\nu, \sigma^2)$, iid, where inv-$\chi^2(\nu, \sigma^2)$ refers to the distribution of the inverse of a chi-squared random variable with degrees of freedom $\nu$ (set at 3) and scale parameter $\sigma^2$ (set at 400, based on preliminary estimates of variances). The sets of values for these and the hyperparameters of the other model components were chosen to satisfy two criteria: giving posterior standard deviations for the estimands in the neighborhood of the respective standard errors of approximate maximum likelihood estimates from fits of similar preliminary likelihood models; and giving satisfactory model checks (Sec. 9.2) while producing quick enough mixing of the Bayesian analysis algorithm.

We specify the marginal distribution for the reading outcome, $Y^{(read)}$, in the same way as for the math outcome, with separate mean and variance regression parameters, $\beta^{(read)}$ and $\zeta^{(read)}$. Finally, we allow that, conditionally on $W_i$, $X^{obs}_i$, $R^x_i$, $C_i$, the math and reading outcomes at a given assignment, $Y^{(math)*}_i(z)$ and $Y^{(read)*}_i(z)$, may be dependent, with an unknown correlation $\rho$. We set the prior distribution for $\rho$ to be uniform in $(-1, 1)$, independently of the remaining parameters in their prior distributions.
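The joint draw of the two latent outcomes can be sketched as follows (our illustration, with arbitrary parameter values): the latent math and reading scores are drawn from a bivariate normal with correlation $\rho$, and each is then censored to [0, 100] as above.

    import numpy as np

    def draw_latent_pair(mu, var_math, var_read, rho, rng, size):
        """Draw (Y*_math, Y*_read) ~ bivariate normal with correlation rho."""
        cov = rho * np.sqrt(var_math * var_read)
        sigma = np.array([[var_math, cov], [cov, var_read]])
        y_star = rng.multivariate_normal(mu, sigma, size=size)
        return np.clip(y_star, 0.0, 100.0)   # censor both scores to [0, 100]

    rng = np.random.default_rng(3)
    pair = draw_latent_pair(np.array([20.0, 25.0]), 400.0, 420.0, 0.5, rng, 5000)
    print(np.corrcoef(pair[:, 0], pair[:, 1])[0, 1])  # attenuated by censoring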

7.3 Outcome Response Submodel

As with the pretests, the outcomes on mathematics and reading were either jointly observed or jointly missing; thus one indicator, $R^y_i(z)$, for missingness of outcomes is sufficient for each assignment $z = 0, 1$. For the submodel on this indicator, we also use a probit specification:

$$R^y_i(z) = 1 \quad \text{if} \quad R^{y*}_i(z) \equiv g_2(W_i, X^{obs}_i, R^x_i, C_i, z)'\beta^R + E_i(z) \ge 0,$$


where $R^y_i(0)$ and $R^y_i(1)$ are assumed conditionally independent (using the same justification as for the potential outcomes), and where $E_i(z) \sim N(0,1)$. The link function of the probit model on the outcome response, $g_2$, is the same as the link function for the mean of the outcome component. The prior distribution for the outcome response submodel is $\beta^R \sim N(0, \{\sigma^R\}^2 I)$, where $\{\sigma^R\}^2$ is a hyperparameter set at 10.

8. RESULTS

All results herein were obtained from the same Bayesian analysis. We report results for the ITT and CACE estimands, for proportions of compliance principal strata, and for outcome response rates. The reported estimands are not parameters of the model, but rather are functions of parameters and data. The results are reported by applicant's school (low/high) and grade. Both of these variables represent characteristics of children that potentially could be targeted differentially by government policies. Moreover, each was thought to have possible interaction effects with treatment assignment. Except when otherwise stated, the plain numbers are posterior means, and the numbers in parentheses are 2.5 and 97.5 percentiles of the posterior distribution.

8.1 Test Score Results

Here we address two questions:

1. What is the impact of being offered a scholarship on student outcomes, namely the ITT estimand?
2. What is the impact of attending a private school on student outcomes, namely the CACE estimand?

The math and reading posttest score outcomes represent the national percentile rankings within grade. They have been adjusted to correct for the fact that some children were kept behind while others skipped a grade, because students transferring to private schools are hypothesized to be more likely to have been kept behind by those schools. The individual-level causal estimates have also been weighted so that the subgroup causal estimates correspond to the effects for all eligible children belonging to that subgroup who attended a screening session.

8.1.1 Effect of Offering the Scholarship on Mathematics and Reading. We examine the impact of being offered a scholarship (the ITT effect) on posttest scores. The corresponding estimand for individual $i$ is defined as $E(Y_i(1) - Y_i(0) \mid W^p_i, \theta)$, where $W^p_i$ denotes the policy variables grade level and applicant's school (low/high) for that individual. The simulated posterior distribution of the ITT effect is summarized in Table 4. To draw from this distribution, we take the following steps:

Table 4. ITT Effect of Winning the Lottery on Math and Reading Test Scores, 1 Year Postrandomization

                        Applicant's school: Low            Applicant's school: High
Grade at application    Reading           Math             Reading           Math
1                       2.3 (-1.3, 5.8)   5.2 (2.0, 8.3)   1.4 (-4.8, 7.2)   5.1 (.1, 10.3)
2                        .5 (-2.6, 3.5)   1.3 (-1.7, 4.3)  -.6 (-6.2, 4.9)   1.0 (-4.3, 6.1)
3                        .7 (-2.7, 4.0)   3.3 (-.5, 7.0)   -.5 (-6.0, 5.0)   2.5 (-3.2, 8.0)
4                       3.0 (-1.1, 7.4)   3.1 (-1.2, 7.2)  1.8 (-4.1, 7.6)   2.3 (-3.3, 7.8)
Overall                 1.5 (-.6, 3.6)    3.2 (1.0, 5.4)    .4 (-4.6, 5.2)   2.8 (-1.8, 7.2)

NOTE: Plain numbers are means, and numbers in parentheses are central 95% intervals of the posterior distribution of the effects on percentile rank.

1. Draw $\theta$ and $C_i$ from the posterior distribution (see App. A).
2. Calculate the expected causal effect $E(Y_i(1) - Y_i(0) \mid W^p_i, X^{obs}_i, C_i, R^x_i, \theta)$ for each subject, based on the model in Section 7.2.
3. Average the values of step 2 over the subjects in the subclasses of $W^p$, with weights reflecting the known sampling weights of the study population for the target population.
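In code, steps 2 and 3 amount to a weighted average of subject-level expected effects within each policy cell, repeated over the posterior draws of step 1. A sketch (ours), assuming the draws have already been converted to an array of subject-level expected effects:

    import numpy as np

    def itt_by_cell(effect_draws, cells, weights):
        """Posterior summary of the ITT effect within policy cells.

        effect_draws : (n_draws, n_subjects) array; entry (s, i) is
                       E(Y_i(1) - Y_i(0) | W_i^p, X_i^obs, C_i, R_i^x, theta)
                       under posterior draw s.
        cells        : (n_subjects,) labels for W^p (grade x applicant's school).
        weights      : (n_subjects,) sampling weights for the target population.
        """
        summary = {}
        for cell in np.unique(cells):
            m = cells == cell
            w = weights[m] / weights[m].sum()
            draws = effect_draws[:, m] @ w                  # step 3, per draw
            summary[cell] = (draws.mean(), np.percentile(draws, [2.5, 97.5]))
        return summary

    rng = np.random.default_rng(4)
    demo = itt_by_cell(rng.normal(3.0, 1.0, size=(500, 40)),
                       np.repeat(np.array(["low", "high"]), 20), np.ones(40))
    print(demo["low"])   # posterior mean and central 95% interval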

These results indicate posterior distributions with mass primarily (>97.5%) to the right of 0 for the treatment effect on math scores, overall for children from low-applicant schools and also for the subgroup of first graders. Each effect indicates an average gain of greater than three percentile points for children who won a scholarship. All of the remaining intervals cover 0. As a more general pattern, estimates of effects are larger for mathematics than for reading, and larger for children from low-applicant schools than for children from high-applicant schools.

These results for the ITT estimand, using the method of this article, can also be contrasted with the results using a simpler method, reported in Table 5. The method used to obtain those results is the same as that used in the initial MPR study (Peterson, Myers, Howell, and Mayer 1999), but limited to the subset of single-child families and separated out by applicant's school (low/high). This method is based on a linear regression of the posttest scores on the indicator of treatment assignment, ignoring compliance data. In addition, the analysis includes the design variables and the pretest scores, and is limited to individuals for whom the pretest and posttest scores are known. Separate analyses are run for math and reading posttest scores, and the missingness of such scores is adjusted for by inverse probability weighting. Finally, weights are used to make the results for the study participants generalizable to the population of all eligible single-child families who were screened. (For more details of this method, see the appendix of Peterson et al. 1999.)

Table 5. ITT Effect of Winning the Lottery on Test Scores, Estimated Using the Original MPR Method

                        Applicant's school: Low             Applicant's school: High
Grade at application    Reading           Math              Reading             Math
1                       -1.0 (-7.1, 5.2)  2.1 (-2.6, 6.7)    4.8 (-10.0, 19.6)   2.6 (-15.5, 20.7)
2                        -.8 (-4.9, 3.2)  2.0 (-4.0, 8.0)   -3.4 (-16.5, 9.7)    2.7 (-10.3, 15.7)
3                       3.2 (-1.7, 8.1)   5.0 (-.8, 10.7)   -8.0 (-25.4, 9.3)    4.0 (-17.7, 25.6)
4                       2.7 (-3.5, 8.8)    .3 (-7.3, 7.9)   27.9 (8.0, 47.8)    22.7 (-1.5, 46.8)
Overall                  .6 (-2.0, 3.2)   2.0 (-.8, 4.8)     1.1 (-7.0, 9.1)      .3 (-9.6, 10.1)

NOTE: Plain numbers are point estimates, and parentheses are 95% confidence intervals for the mean effects on percentile rank.


Table 6. Effect of Private School Attendance on Test Scores

                        Applicant's school: Low             Applicant's school: High
Grade at application    Reading           Math              Reading            Math
1                       3.4 (-2.0, 8.7)   7.7 (3.0, 12.4)   1.9 (-7.3, 10.3)   7.4 (.2, 14.6)
2                        .7 (-3.7, 5.0)   1.9 (-2.4, 6.2)   -.9 (-9.4, 7.3)    1.5 (-6.2, 9.3)
3                       1.0 (-4.1, 6.1)   5.0 (-.8, 10.7)   -.8 (-9.5, 7.7)    4.0 (-4.9, 12.5)
4                       4.2 (-1.5, 10.1)  4.3 (-1.6, 10.1)  2.7 (-6.3, 11.3)   3.5 (-4.7, 11.9)
Overall                 2.2 (-.9, 5.3)    4.7 (1.4, 7.9)     .6 (-7.1, 7.7)    4.2 (-2.6, 10.9)

NOTE: Plain numbers are means, and parentheses are central 95% intervals of the posterior distribution of the effects on percentile rank.

The results of the new method (Table 4) are generally more stable than those of the original method (Table 5), which in some cases are not even credible. In the most extreme cases, the original method estimates a 22.7-point gain [95% confidence interval (-1.5, 46.8)] in mathematics and a 27.9-point gain [95% confidence interval (8.0, 47.8)] in reading for fourth-grade children from high-applicant schools. More generally, the new method's results display a more plausible pattern in comparing effects in high-applicant versus low-applicant schools and mathematics versus reading.

812 Effect of Private School Attendance on Mathemat-ics and Reading We also examine the effect of offering thescholarship when focusing only on the compliers (the CACE)The correspondingestimand for individual i is de ned as

$E\{Y_i(1) - Y_i(0) \mid W_i^p, C_i = c; \theta\}.$

This analysis defines the treatment as private school attendance (Sec. 6.3). The simulated posterior distribution of the CACE is summarized in Table 6. A draw from this distribution is obtained using steps 1-3 described in Section 8.1.1 for the ITT estimand, with the exception that, at step 3, the averaging is restricted to the subjects whose current draw of $C_i$ is "complier."
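To make the averaging in step 3 concrete, here is a minimal Python sketch of summarizing a CACE posterior from MCMC output; this is not the authors' code, and all arrays are hypothetical stand-ins for draws of the potential outcomes and compliance strata.

```python
import numpy as np

# Hypothetical stand-ins for MCMC output: imputed potential outcomes and
# each subject's current compliance-stratum draw, one row per posterior draw.
rng = np.random.default_rng(0)
n_draws, n_subj = 1000, 200
y1_draws = rng.normal(25, 10, (n_draws, n_subj))   # draws of Y_i(1)
y0_draws = rng.normal(22, 10, (n_draws, n_subj))   # draws of Y_i(0)
c_draws = rng.choice(["complier", "never", "always"],
                     size=(n_draws, n_subj), p=[.7, .2, .1])

# Step 3, restricted to compliers: within each draw, average Y(1) - Y(0)
# over the subjects currently imputed as compliers.
cace_draws = np.array([
    (y1_draws[d] - y0_draws[d])[c_draws[d] == "complier"].mean()
    for d in range(n_draws)
])
print(cace_draws.mean(), np.percentile(cace_draws, [2.5, 97.5]))
```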

The effects of private school attendance follow a pattern similar to that of the ITT effects, but the posterior means are slightly bigger in absolute value than the ITT effects. The intervals have also grown, reflecting that these effects are for only subgroups of all children, the "compliers" in each cell. As a result, the associated uncertainty for some of these effects (e.g., for fourth-graders applying from high-applicant schools) is large.

8.2 Compliance Principal Strata and Missing Outcomes

Table 7 summarizes the posterior distribution of the estimands of the probability of being in stratum C as a function of an applicant's school and grade, $p(C_i = t \mid W_i^p; \theta)$. To draw from this distribution, we use step 1 described in Section 8.1.1, and then calculate $p(C_i = t \mid W_i^p, X_i^{\mathrm{obs}}, R_i^x; \theta)$ for each subject, based on the model of Section 7.1, and average these values as in step 3 of Section 8.1.1. The clearest pattern revealed by Table 7 is that, for three out of four grades, children who applied from low-applicant schools are more likely to be compliers or always-takers than are children who applied from high-applicant schools.
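In code, a draw of a stratum proportion is just the within-draw average of the subjects' current stratum probabilities; a minimal sketch with hypothetical inputs (not the study's model output):

```python
import numpy as np

# Hypothetical per-draw, per-subject probabilities of membership in stratum t,
# standing in for p(C_i = t | W_i^p, X_i^obs, R_i^x; theta) from the model.
rng = np.random.default_rng(1)
n_draws, n_subj = 1000, 500
p_stratum = rng.beta(7, 3, (n_draws, n_subj))

prop_draws = p_stratum.mean(axis=1)         # average over subjects within each draw
print(prop_draws.mean(), prop_draws.std())  # posterior mean and SD, as in Table 7
```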

As stated earlier, theoretically, under our structural assumptions, standard ITT analyses or standard IV analyses that use ad hoc approaches to missing data are generally not appropriate for estimating the ITT or CACE estimands when the compliance principal strata have differential response (i.e., outcome missing-data) behaviors. To evaluate this here, we simulated the posterior distributions of

$p\{R_i(z) \mid C_i = t, W_i^p; \theta\}, \qquad z = 0, 1. \qquad (1)$

To draw from the distributions of the estimands in (1), we used step 1 from Section 8.1.1, and then, for z = 0, 1, for each subject, calculated $p(R_i(z) \mid W_i^p, X_i^{\mathrm{obs}}, R_i^x, C_i; \theta)$. We then averaged these values over subjects within subclasses defined by the different combinations of $W_i^p$ and $C_i$.

For each draw (across individuals) from the distribution in (1), we calculate the odds ratios (a) between compliers attending public schools and never-takers; (b) between compliers attending private schools and always-takers; and (c) between compliers attending private schools and compliers attending public schools. The results (omitted) showed that response was increasing in the following order: never-takers, compliers attending public schools, compliers attending private schools, and always-takers. These results confirm that the compliance principal stratum is an important factor in response, both alone and in interaction with assignment.
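A sketch of this odds-ratio comparison, computed draw by draw from hypothetical posterior response probabilities (the stratum labels and Beta draws below are illustrative only, not the fitted model):

```python
import numpy as np

rng = np.random.default_rng(2)
n_draws = 1000
# Illustrative posterior draws of p(R_i(z) = 1 | C_i = t, W_i^p; theta).
p = {
    ("never", 0): rng.beta(60, 40, n_draws),     # never-takers
    ("complier", 0): rng.beta(70, 30, n_draws),  # compliers attending public school
    ("complier", 1): rng.beta(80, 20, n_draws),  # compliers attending private school
    ("always", 1): rng.beta(85, 15, n_draws),    # always-takers
}

def odds(x):
    return x / (1 - x)

# Odds ratios (a), (b), and (c) as defined in the text, one value per draw.
or_a = odds(p[("complier", 0)]) / odds(p[("never", 0)])
or_b = odds(p[("complier", 1)]) / odds(p[("always", 1)])
or_c = odds(p[("complier", 1)]) / odds(p[("complier", 0)])
for name, d in [("(a)", or_a), ("(b)", or_b), ("(c)", or_c)]:
    print(name, round(d.mean(), 2), np.percentile(d, [2.5, 97.5]).round(2))
```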

9. MODEL BUILDING AND CHECKING

This model was built through a process of fitting and checking a succession of models. Of note in this model are the following features: a censored normal model that accommodates the

Table 7. Proportions of Compliance Principal Strata Across Grade and Applicant's School

Grade at        Applicant's school: Low                       Applicant's school: High
application     Never-taker   Complier      Always-taker      Never-taker   Complier      Always-taker
1               .245 (.029)   .671 (.038)   .084 (.024)       .250 (.050)   .693 (.061)   .057 (.033)
2               .205 (.027)   .694 (.037)   .101 (.025)       .253 (.051)   .672 (.062)   .075 (.034)
3               .245 (.032)   .659 (.040)   .096 (.025)       .288 (.056)   .641 (.067)   .071 (.035)
4               .184 (.033)   .728 (.046)   .088 (.030)       .270 (.055)   .667 (.067)   .063 (.036)

NOTE: Plain numbers are means, and parentheses are standard deviations of the posterior distribution of the estimands.


pile-up of 0s in the outcomes; incorporation of heteroscedasticity; and a multivariate model to accommodate both outcomes simultaneously. It is reassuring, from the perspective of model robustness, that the results from the last few models fit, including those in the conference proceedings report of Barnard et al. (2002), are consistent with the results from this final model.

9.1 Convergence Checks

Because posterior distributions were simulated from a Markov chain Monte Carlo algorithm (App. A), it is important to assess its convergence. To do this, we ran three chains from different starting values. To initiate each chain, we set any unknown compliance stratum equal to a Bernoulli draw, with probabilities obtained from moment estimates of the probabilities of being a complier given observed attendance and randomization data (D, Z) alone. Using the initialized compliance strata, for each chain, parameters were initialized to values based on generalized linear model estimates of the model components.

Each chain was run for 15,000 iterations. At 5,000 iterations, and based on the three chains for each model, we calculated the potential scale-reduction statistic (Gelman and Rubin 1992) for the 250 estimands (parameters and functions of parameters) that serve as building blocks for all other estimands. The results suggested good mixing of the chains (with the maximum potential scale-reduction statistic across parameters 1.04) and provided no evidence against convergence. Inference is based on the remaining 30,000 iterations, combining the three chains.
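For reference, a minimal sketch of the potential scale-reduction statistic for a single scalar estimand; the arrays are illustrative, not the study's chains:

```python
import numpy as np

def potential_scale_reduction(chains):
    """chains: (m, n) array of m chains with n post-burn-in draws each."""
    n = chains.shape[1]
    B = n * chains.mean(axis=1).var(ddof=1)  # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()    # within-chain variance
    var_plus = (n - 1) / n * W + B / n       # pooled variance estimate
    return np.sqrt(var_plus / W)             # values near 1 suggest good mixing

rng = np.random.default_rng(3)
chains = rng.normal(0, 1, (3, 10000))  # three well-mixed chains
print(potential_scale_reduction(chains))
```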

9.2 Model Checks

We evaluate the influence of the model presented in Section 7 with six posterior predictive checks: three checks for each of the two outcomes. A posterior predictive check generally involves (a) choosing a discrepancy measure that is a function of observed data, and possibly of missing data and the parameter vector θ; and (b) computing a posterior predictive p value (PPPV), which is the probability, over the posterior predictive distribution of the missing data and θ, that the discrepancy measure in a new study, drawn with the same θ as in our study, would be as or more extreme than in our study (Rubin 1984; Meng 1996; Gelman, Meng, and Stern 1996).

Posterior predictive checks in general, and PPPVs in particular, demonstrate whether the model can adequately preserve features of the data reflected in the discrepancy measure, where the model here includes the prior distribution as well as the likelihood (Meng 1996). As a result, properties of PPPVs are not exactly the same as properties of classical p values under frequency evaluations conditional on the unknown θ, just as they are not exactly the same for frequency evaluations over both levels of uncertainty, that is, the drawing of θ from the prior distribution and the drawing of data from the likelihood given θ. For example, over frequency evaluations of the latter type, a PPPV is stochastically less variable than, but has the same mean as, the uniform distribution, and so tends to be more conservative than a classical p value, although the reverse can be true over frequency evaluations of the first type (Meng 1996). (For more details on the interpretation and properties of PPPVs, see also Rubin 1984 and Gelman et al. 1996.)

Table 8. Posterior Predictive Checks: p Values

        Signal    Noise    Signal to Noise
Math    .34       .79      .32
Read    .40       .88      .39

The posterior predictive discrepancy measures that we choose here are functions of

$A_p^{\mathrm{rep}}(z) = \{\, Y_{ip}^{\mathrm{rep}} : I(R_{ip}^{y,\mathrm{rep}} = 1)\, I(Y_{ip}^{\mathrm{rep}} \neq 0)\, I(C_i^{\mathrm{rep}} = c)\, I(Z_i = z) = 1 \,\}$

for the measures that are functions of data $Y_{ip}^{\mathrm{rep}}$, $R_{ip}^{y,\mathrm{rep}}$, and $C_i^{\mathrm{rep}}$ from a replicated study, and

$A_p^{\mathrm{study}}(z) = \{\, Y_{ip} : I(R_{ip}^{y} = 1)\, I(Y_{ip} \neq 0)\, I(C_i = c)\, I(Z_i = z) = 1 \,\}$

for the measures that are functions of this study's data. Here $R_{ip}^{y,\mathrm{rep}}$ is defined so that it equals 1 if $Y_{ip}^{\mathrm{rep}}$ is observed and 0 otherwise, and p equals 1 for math outcomes and 2 for reading outcomes. The discrepancy measures, "rep" and "study," that we used for each outcome (p = 1, 2) were (a) the absolute value of the difference between the mean of $A_p(1)$ and the mean of $A_p(0)$ ("signal"); (b) the standard error, based on a simple two-sample comparison, for this difference ("noise"); and (c) the ratio of (a) to (b) ("signal to noise"). Although these measures are not treatment effects, we chose them here to assess whether the model can preserve broad features of signal, noise, and signal-to-noise ratio in the continuous part of the compliers' outcome distributions, which we think can be very influential in estimating the treatment effects of Section 8. More preferable measures might have been the posterior mean and standard deviation for the actual estimands in Section 8 for each replicated dataset, but this would have required a prohibitive amount of computer memory because of the nested structure of that algorithm. In settings such as these, additional future work on choices of discrepancy measures is of interest.

PPPVs for the discrepancy measures that we chose were calculated as the percentage of draws in which the replicated discrepancy measures exceeded the value of the study's discrepancy measure. Extreme values (close to 0 or 1) of a PPPV would indicate a failure of the prior distribution and likelihood to replicate the corresponding measure of location, dispersion, or their relative magnitude, and would indicate an undesirable influence of the model in the estimation of our estimands. Results from these checks, displayed in Table 8, provide no special evidence for such influences of the model.
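The mechanics of such a check can be sketched as follows; the data-generating lines are hypothetical stand-ins for the model's posterior predictive draws, and for simplicity the study discrepancy is treated as fixed, whereas in the article it can also depend on imputed missing data and θ:

```python
import numpy as np

def signal(y, z):
    """|mean of A_p(1) - mean of A_p(0)|, the 'signal' discrepancy."""
    return abs(y[z == 1].mean() - y[z == 0].mean())

rng = np.random.default_rng(4)
z = rng.integers(0, 2, 300)              # assignment indicator
y_study = rng.normal(20 + 3 * z, 10)     # stand-in for the study's complier outcomes

d_study = signal(y_study, z)
n_rep, exceed = 1000, 0
for _ in range(n_rep):
    # Each replicate should be drawn from the posterior predictive distribution
    # given a draw of theta; here we simply resample for illustration.
    y_rep = rng.normal(20 + 3 * z, 10)
    exceed += signal(y_rep, z) >= d_study   # "as or more extreme"
print("PPPV:", exceed / n_rep)              # values near 0 or 1 would signal trouble
```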

10. DISCUSSION

In this article we have defined the framework for principal stratification in broken randomized experiments to accommodate noncompliance, missing covariate information, missing outcome data, and multivariate outcomes. We make explicit a set of structural assumptions that can identify the causal effects of interest, and we also provide a parametric model that is appropriate for practical implementation of the framework in settings such as ours.

Results from our model in the school choice study do not indicate strong treatment effects for most of the subgroups examined. But we do estimate positive effects (on the order of 3 percentile points for ITT and 5 percentile points for the effect of attendance) on math scores overall for children who applied to the program from low-applicant schools, particularly for first graders. Also, the effects were larger for African-American children than for the remaining children (App. B).

Posterior distributions for the CACE estimand, which measures the effect of attendance in private school versus public school, are generally larger than the corresponding ITT effects, but are also associated with greater uncertainty. Of importance, because of the missing outcomes, a model like ours is needed for valid estimation even of the ITT effect.

The results from this randomized study are not subject to selection bias in the way that nonrandomized studies in school choice have been. Nevertheless, although we use the CACE, a well-defined causal effect, to represent the effect of attendance of private versus public schools, it is important to remember that the CACE is defined on a subset of the study children (those who would have complied with either assignment), and that for the other children, there is no information on such an effect of attendance in this study. Therefore, as with any randomized trial based on a subpopulation, external information, such as background variables, also must be used when generalizing the CACE from compliers to other target groups of children. Also, it is possible that a broader implementation of a voucher program can have a collective effect on the public schools, if a large shift of the children who might use vouchers were to have an impact on the quality of learning for children who would stay in public schools (a violation of the no-interference assumption). Because our study contains a small fraction of participants relative to all school children, it cannot provide direct information about any such collective effect, and additional external judgment would need to be used to address this issue.

The larger effects that we estimated for children applying from schools with low versus high past scores are also not, in principle, subject to the usual regression to the mean bias, in contrast to a simple before-after comparison. This is because, in our study, both types of children are randomized to be offered the scholarship or not, and both types in both treatment arms are evaluated at the same time after randomization. Instead, the differential effect for children from schools with different past scores is evidence supporting the claim that the school voucher program has greater potential benefit for children in lower-scoring schools.

Our results also reveal differences in compliance and missing-data patterns across groups. Differences in compliance indicate that children applying from low-applicant schools are generally more likely to comply with their treatment assignment; this could provide incentives for policy makers to target this subgroup of children. However, this group also exhibits higher levels of always-takers, indicating that parents of children attending these poorly performing schools are more likely to get their child into a private school even in the absence of scholarship availability. These considerations would, of course, have to be balanced with (and indeed might be dwarfed by) concerns about equity. Missing-data patterns reveal that perhaps relatively greater effort is needed in future interventions of this nature to retain study participants who stay in public schools, particularly the public-school students who actually won but did not use a scholarship.

The approach presented here also has some limitations. First, we have presented principal stratification only on a binary controlled factor Z and a binary uncontrolled factor D, and it is of interest to develop principal stratification for more levels of such factors. Approaches that extend our framework in that direction, including for time-dependent data, have been proposed (e.g., Frangakis et al. 2002b).

Also, the approach of explicitly conditioning on the patterns of missing covariates that we adopted in the parametric component of Section 7 is not as applicable when there are many patterns of missing covariates. In such cases, it would be more appropriate to use models related to those of D'Agostino and Rubin (2000) and integrate them with principal stratification for noncompliance. Moreover, it would also be interesting to investigate results for models fit to these data that allow deviations from the structural assumptions, such as weakened exclusion restrictions (e.g., Hirano et al. 2000; Frangakis, Rubin, and Zhou 2002a), although to ensure robustness of such models, it would be important to first investigate and rely on additional alternative assumptions that would be plausible.

APPENDIX A: COMPUTATIONS

Details are available at http://biosun01.biostat.jhsph.edu/~cfrangak/papers/sc

Table B.1. ITT Estimand

Grade at                   Applicant's school: Low              Applicant's school: High
application   Ethnicity    Reading            Math              Reading             Math
1             AA           2.8 (-1.5, 7.2)    6.7 (3.0, 10.4)   1.8 (-5.2, 8.4)     6.3 (0.8, 11.9)
              Other        1.7 (-1.9, 5.4)    3.9 (0.5, 7.2)    0.8 (-4.5, 6.1)     3.5 (-1.2, 8.2)
2             AA           0.8 (-2.8, 4.4)    2.4 (-1.0, 5.8)   -0.3 (-6.6, 6.0)    2.2 (-3.8, 8.0)
              Other        0.2 (-3.1, 3.4)    0.5 (-2.8, 3.7)   -0.9 (-6.1, 4.3)    0.0 (-5.2, 4.8)
3             AA           1.1 (-3.0, 5.2)    4.6 (0.3, 8.9)    -0.1 (-6.7, 6.4)    4.2 (-2.5, 10.6)
              Other        0.3 (-3.2, 3.8)    2.0 (-2.0, 5.8)   -0.7 (-5.8, 4.4)    1.4 (-3.9, 6.5)
4             AA           3.7 (-1.2, 8.7)    4.4 (-0.5, 9.3)   2.4 (-4.7, 9.4)     3.8 (-2.9, 10.5)
              Other        2.4 (-1.7, 6.7)    1.8 (-2.6, 6.0)   1.4 (-4.3, 7.0)     1.1 (-4.4, 6.4)
Overall       AA           2.0 (-0.9, 4.8)    4.5 (1.8, 7.2)    0.8 (-5.1, 6.5)     4.2 (-1.1, 9.3)
              Other        1.1 (-1.4, 3.6)    2.1 (-0.7, 4.7)   0.1 (-4.6, 4.6)     1.6 (-2.9, 5.8)

NOTE: Plain numbers are means, and parentheses are central 95% intervals of the posterior distribution of the effects on percentile rank.


Table B.2. CACE Estimand

Grade at                   Applicant's school: Low              Applicant's school: High
application   Ethnicity    Reading            Math              Reading             Math
1             AA           3.8 (-2.0, 9.6)    9.0 (4.1, 14.0)   2.3 (-7.3, 10.9)    8.3 (1.1, 15.5)
              Other        2.9 (-3.1, 8.8)    6.3 (0.8, 11.8)   1.3 (-8.0, 10.2)    5.9 (-2.2, 13.9)
2             AA           1.1 (-3.6, 5.7)    3.1 (-1.3, 7.5)   -0.4 (-9.0, 7.9)    2.9 (-5.0, 10.7)
              Other        0.3 (-4.9, 5.4)    0.8 (-4.4, 5.9)   -1.5 (-10.5, 7.3)   0.0 (-8.5, 8.5)
3             AA           1.5 (-4.1, 7.0)    6.3 (0.3, 12.2)   -0.2 (-9.1, 8.5)    5.6 (-3.2, 14.1)
              Other        0.5 (-5.4, 6.4)    3.5 (-3.4, 10.2)  -1.3 (-10.5, 7.9)   2.7 (-7.0, 12.0)
4             AA           4.6 (-1.6, 10.8)   5.5 (-0.7, 11.6)  3.0 (-6.0, 11.6)    4.8 (-3.6, 13.1)
              Other        3.7 (-2.6, 10.1)   2.8 (-3.8, 9.3)   2.3 (-7.4, 11.7)    2.0 (-7.0, 11.0)
Overall       AA           2.6 (-1.2, 6.3)    6.0 (2.4, 9.5)    1.0 (-6.9, 8.4)     5.5 (-1.5, 12.2)
              Other        1.7 (-2.3, 5.8)    3.3 (-1.2, 7.7)   0.1 (-8.1, 7.9)     2.7 (-5.0, 10.3)

NOTE: Plain numbers are means, and parentheses are central 95% intervals of the posterior distribution of the effects on percentile rank.

APPENDIX B: ETHNIC BREAKDOWN

The model of Section 7 can be used to estimate the effect of the program on finer strata that may be of interest. For example, to estimate effects stratified by ethnicity (AA for African-American), we obtain the posterior distribution of the causal effect of interest (ITT or CACE) using the same steps described in Section 8.1.1 or 8.1.2, but allowing ethnicity to be part of the definition of W^p. The results for this stratification are reported in Table B.1 for the estimands of ITT and Table B.2 for the estimands of CACE.

The results follow patterns similar to those in Section 8. For each grade and applicant's school (low/high) combination, however, the effects are more positive on average for the subgroup of African-American children. For both estimands, this leads to 95% intervals that are entirely above 0 for math scores for the following subgroups: African-American first- and third-graders, and overall, from low-applicant schools; African-American first-graders from high-applicant schools; and non-African-American first-graders from low-applicant schools. All intervals for reading covered the null value. This suggests that the positive effects reported on math scores in Section 8 for children originating from low-applicant schools are primarily attributable to gains among the African-American children in this subgroup.

[Received October 2002. Revised October 2002.]

REFERENCES

Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996), "Identification of Causal Effects Using Instrumental Variables" (with discussion), Journal of the American Statistical Association, 91, 444-472.

Barnard, J., Du, J., Hill, J. L., and Rubin, D. B. (1998), "A Broader Template for Analyzing Broken Randomized Experiments," Sociological Methods and Research, 27, 285-317.

Barnard, J., Frangakis, C., Hill, J. L., and Rubin, D. B. (2002), "School Choice in NY City: A Bayesian Analysis of an Imperfect Randomized Experiment," in Case Studies in Bayesian Statistics, Vol. V, eds. C. Gatsonis, R. E. Kass, B. Carlin, A. Carriquiry, A. Gelman, I. Verdinelli, and M. West, New York: Springer-Verlag, pp. 3-97.

Brandl, J. E. (1998), Money and Good Intentions Are Not Enough, or Why Liberal Democrats Think States Need Both Competition and Community, Washington, DC: Brookings Institute Press.

Carnegie Foundation for the Advancement of Teaching (1992), School Choice: A Special Report, San Francisco: Jossey-Bass.

Chubb, J. E., and Moe, T. M. (1990), Politics, Markets, and America's Schools, Washington, DC: Brookings Institute Press.

Cobb, C. W. (1992), Responsive Schools, Renewed Communities, San Francisco: Institute for Contemporary Studies.

Coleman, J. S., Hoffer, T., and Kilgore, S. (1982), High School Achievement, New York: Basic Books.

Cookson, P. W. (1994), School Choice: The Struggle for the Soul of American Education, New Haven, CT: Yale University Press.

Coulson, A. J. (1999), Market Education: The Unknown History, Bowling Green, OH: Social Philosophy & Policy Center.

Cox, D. R. (1958), Planning of Experiments, New York: Wiley.

D'Agostino, Ralph B., Jr., and Rubin, D. B. (2000), "Estimating and Using Propensity Scores With Incomplete Data," Journal of the American Statistical Association, 95, 749-759.

Education Week (1998), Quality Counts '98: The Urban Challenge: Public Education in the 50 States, Bethesda, MD: Editorial Projects in Education.

Frangakis, C. E., and Rubin, D. B. (1999), "Addressing Complications of Intention-to-Treat Analysis in the Combined Presence of All-or-None Treatment-Noncompliance and Subsequent Missing Outcomes," Biometrika, 86, 365-380.

——— (2001), "Addressing the Idiosyncrasy in Estimating Survival Curves Using Double-Sampling in the Presence of Self-Selected Right Censoring" (with discussion), Biometrics, 57, 333-353.

——— (2002), "Principal Stratification in Causal Inference," Biometrics, 58, 20-29.

Frangakis, C. E., Rubin, D. B., and Zhou, X. H. (2002a), "Clustered Encouragement Design With Individual Noncompliance: Bayesian Inference and Application to Advance Directive Forms" (with discussion), Biostatistics, 3, 147-164.

Frangakis, C. E., Brookmeyer, R. S., Varadhan, R., Mahboobeh, S., Vlahov, D., and Strathdee, S. A. (2002b), "Methodology for Evaluating a Partially Controlled Longitudinal Treatment Using Principal Stratification, With Application to a Needle Exchange Program," Technical Report NEP-06-02, Johns Hopkins University, Dept. of Biostatistics.

Fuller, B., and Elmore, R. F. (1996), Who Chooses? Who Loses? Culture, Institutions, and the Unequal Effects of School Choice, New York: Teachers College Press.

Gelfand, A. E., and Smith, A. F. M. (1990), "Sampling-Based Approaches to Calculating Marginal Densities," Journal of the American Statistical Association, 85, 398-409.

Gelman, A., Meng, X.-L., and Stern, H. (1996), "Posterior Predictive Assessment of Model Fitness via Realized Discrepancies" (with discussion), Statistica Sinica, 6, 733-760.

Gelman, A., and Rubin, D. B. (1992), "Inference from Iterative Simulation Using Multiple Sequences" (with discussion), Statistical Science, 7, 457-472.

Glynn, R. J., Laird, N. M., and Rubin, D. B. (1993), "Multiple Imputation in Mixture Models for Nonignorable Nonresponse With Follow-Ups," Journal of the American Statistical Association, 88, 984-993.

Goldberger, A. S., and Cain, G. G. (1982), "The Causal Analysis of Cognitive Outcomes in the Coleman, Hoffer, and Kilgore Report," Sociology of Education, 55, 103-122.

Haavelmo, T. (1943), "The Statistical Implications of a System of Simultaneous Equations," Econometrica, 11, 1-12.

Hastings, W. (1970), "Monte Carlo Sampling Methods Using Markov Chains and Their Applications," Biometrika, 57, 97-109.

Hill, J. L., Rubin, D. B., and Thomas, N. (2000), "The Design of the New York School Choice Scholarship Program Evaluation," in Donald Campbell's Legacy, ed. L. Bickman, Newbury Park, CA: Sage Publications.

Hirano, K., Imbens, G. W., Rubin, D. B., and Zhou, A. (2000), "Assessing the Effect of an Influenza Vaccine in an Encouragement Design," Biostatistics, 1, 69-88.

Holland, P. (1986), "Statistics and Causal Inference," Journal of the American Statistical Association, 81, 945-970.

Imbens, G. W., and Angrist, J. D. (1994), "Identification and Estimation of Local Average Treatment Effects," Econometrica, 62, 467-476.

Imbens, G. W., and Rubin, D. B. (1997), "Bayesian Inference for Causal Effects in Randomized Experiments With Noncompliance," The Annals of Statistics, 25, 305-327.

Levin, H. M. (1998), "Educational Vouchers: Effectiveness, Choice, and Costs," Journal of Policy Analysis and Management, 17, 373-392.

Little, R. J. A. (1993), "Pattern-Mixture Models for Multivariate Incomplete Data," Journal of the American Statistical Association, 88, 125-134.

——— (1996), "Pattern-Mixture Models for Multivariate Incomplete Data With Covariates," Biometrics, 52, 98-111.

Little, R. J. A., and Rubin, D. B. (1987), Statistical Analysis With Missing Data, New York: Wiley.

Meng, X. L. (1996), "Posterior Predictive p Values," The Annals of Statistics, 22, 1142-1160.

Neal, D. (1997), "The Effects of Catholic Secondary Schooling on Educational Achievement," Journal of Labor Economics, 15, 98-123.

Neyman, J. (1923), "On the Application of Probability Theory to Agricultural Experiments: Essay on Principles, Section 9," translated in Statistical Science, 5, 465-480.

Peterson, P. E., and Hassel, B. C. (eds.) (1998), Learning from School Choice, Washington, DC: Brookings Institute Press.

Peterson, P. E., Myers, D. E., Howell, W. G., and Mayer, D. P. (1999), "The Effects of School Choice in New York City," in Earning and Learning: How Schools Matter, eds. S. E. Mayer and P. E. Peterson, Washington, DC: Brookings Institute Press.

Robins, J. M., Greenland, S., and Hu, F.-C. (1999), "Estimation of the Causal Effect of a Time-Varying Exposure on the Marginal Mean of a Repeated Binary Outcome" (with discussion), Journal of the American Statistical Association, 94, 687-712.

Rosenbaum, P. R., and Rubin, D. B. (1983), "The Central Role of the Propensity Score in Observational Studies for Causal Effects," Biometrika, 70, 41-55.

Rubin, D. B. (1974), "Estimating Causal Effects of Treatments in Randomized and Non-Randomized Studies," Journal of Educational Psychology, 66, 688-701.

——— (1977), "Assignment to Treatment Groups on the Basis of a Covariate," Journal of Educational Statistics, 2, 1-26.

——— (1978a), "Bayesian Inference for Causal Effects: The Role of Randomization," The Annals of Statistics, 6, 34-58.

——— (1978b), "Multiple Imputations in Sample Surveys: A Phenomenological Bayesian Approach to Nonresponse," in Proceedings of the Survey Research Methods Section, American Statistical Association, pp. 20-28.

——— (1979), "Using Multivariate Matched Sampling and Regression Adjustment to Control Bias in Observational Studies," Journal of the American Statistical Association, 74, 318-328.

——— (1980), Comment on "Randomization Analysis of Experimental Data: The Fisher Randomization Test," Journal of the American Statistical Association, 75, 591-593.

——— (1984), "Bayesianly Justifiable and Relevant Frequency Calculations for the Applied Statistician," The Annals of Statistics, 12, 1151-1172.

——— (1990), "Comment: Neyman (1923) and Causal Inference in Experiments and Observational Studies," Statistical Science, 5, 472-480.

The Coronary Drug Project Research Group (1980), "Influence of Adherence to Treatment and Response of Cholesterol on Mortality in the Coronary Drug Project," New England Journal of Medicine, 303, 1038-1041.

Tinbergen, J. (1930), "Determination and Interpretation of Supply Curves: An Example," in The Foundations of Econometric Analysis, eds. D. Hendry and M. Morgan, Cambridge, U.K.: Cambridge University Press.

Wilms, D. J. (1985), "Catholic School Effect on Academic Achievement: New Evidence From the High School and Beyond Follow-Up Study," Sociology of Education, 58, 98-114.

Comment

Bengt MUTHÉN, Booil JO, and C. Hendricks BROWN

1. INTRODUCTION

The article by Barnard, Frangakis, Hill, and Rubin (BFHR) is timely in that the Department of Education is calling for more randomized studies in educational program evaluation. (See the discussion of the "No Child Left Behind" initiative in, e.g., Slavin 2002.) BFHR can serve as a valuable pedagogical example of a successful, sophisticated statistical analysis of a randomized study. Our commentary is intended to provide additional pedagogical value, to benefit the planning and analysis of future studies, drawing on experiences and research within our research group. [The Prevention Science Methodology Group (PSMG; www.psmg.hsc.usf.edu), co-PIs Brown and Muthén, has collaborated over the last 15 years with support from the National Institute of Mental Health and the National Institute on Drug Abuse.]

BFHR provides an exemplary analysis of the data from an imperfect randomized trial that suffers from several complications simultaneously: noncompliance, missing data in outcomes, and missing data in covariates. We are very pleased to

Bengt Muthén is Professor and Booil Jo is postdoc, Graduate School of Education and Information Studies, University of California, Los Angeles, Los Angeles, CA 90095 (E-mail: bmuthen@ucla.edu). C. Hendricks Brown is Professor, Department of Epidemiology and Biostatistics, University of South Florida, Tampa, FL 33620. The research of the first author was supported by National Institute on Alcohol Abuse and Alcoholism grant K02 AA 00230. The research of the second and third authors was supported by National Institute on Drug Abuse and National Institute of Mental Health grant MH40859. The authors thank Chen-Pin Wang for research assistance, Joyce Chappell for graphical assistance, and the members of the Prevention Science Methodology Group and the Fall ED299A class for helpful comments.

see their application of cutting-edge Bayesian methods for dealing with these complexities. In addition, we believe the methodological issues and the results of the study have important implications for the design and analysis of randomized trials in education and for related policy decisions.

BFHR provides results of the New York City school choice experiment based on 1-year achievement outcomes. With the planned addition of yearly follow-up data, growth models can provide an enhanced examination of causal impact. We discuss how such growth modeling can be incorporated, and provide a caution that applies to BFHR's use of only one posttest occasion. We also consider the sensitivity of the latent class ignorability assumption, in combination with the assumption of compound exclusion.

2. LONGITUDINAL MODELING ISSUES

BFHR focuses on variation in treatment effect across compliance classes. This part of the commentary considers variation in treatment effect across a different type of class, based on the notion that the private school treatment effect might very well be quite different for children with different achievement development. (Also of interest is potential variation in treatment effects across schools, with respect to both the public school the child originated in and the private school the child was moved to, but this multilevel aspect of the data is left aside here for lack of space.) To study such a "treatment-baseline interaction" (or "treatment-trajectory interaction"), we will switch from BFHR's pretest-posttest analysis framework (essentially a very advanced ANCOVA-type analysis) to the growth mixture modeling framework of Muthén et al. (2002). An underlying rationale for this modeling is that individuals at different initial status levels, and on different trajectories, may benefit differently from a given treatment. ANCOVA controls for initial status as measured by the observed pretest score. Unlike the observed pretest score, the latent variable of initial status is free of time-specific variation and measurement error.

The focus on longitudinal aspects of the New York School Choice Study (NYSCS) is both substantively and statistically motivated. First, treatment effects may not have gained full strength after only a 1-year stay in a private school. The NYSCS currently has data from three follow-ups, that is, providing repeated-measures data from four grades. Although BFHR used percentile scores that do not lend themselves to growth modeling, a conversion to "scale scores" (i.e., IRT-based equated scores) should be possible, enabling growth modeling. Unfortunately, educational research traditionally uses scales that are unsuitable for growth modeling, such as percentile scores, normal curve equivalents, and grade equivalents (for a comparison in a growth context, see Seltzer, Frank, and Bryk 1994). Hopefully, this tradition can be changed. Second, the use of information from more time points than pretest and posttest makes it possible to identify and estimate models that give a richer description of the normative development in the control group and of how the treatment changes this development.

Consider three types of latent variables for individual i. The first type, C_i, refers to BFHR's compliance principal strata. The next two relate to the achievement development as expressed by a growth mixture model: T_i refers to trajectory class, and η_i refers to random effects within trajectory class (the within-class model is a regular mixed-effects model). Unlike the latent class variable C_i, the latent class variable T_i is a fully unobserved variable, as is common in latent variable modeling (see, e.g., Muthén 2002a). Consider the likelihood expression for individual i, using the [·] notation to denote probabilities/densities:

$[C_i, T_i \mid X_i]\,[\eta_i \mid C_i, T_i, X_i]\,[Y_i \mid \eta_i, C_i, T_i, X_i] \times [U_i \mid \eta_i, C_i, T_i, X_i]\,[R_i \mid Y_i, \eta_i, C_i, T_i, X_i], \qquad (1)$

where X_i denotes covariates; U_i denotes a compliance stratum indicator (with C_i perfectly measured by U_i in the treatment group for never-takers and perfectly measured in the control group for always-takers, other group-class combinations having missing data); and R_i denotes indicators for missing data on the repeated-measures outcomes Y_i (pretreatment and posttreatment achievement scores). This type of model can be fitted into the latent variable modeling framework of the Mplus program (Muthén and Muthén 1998-2002, tech. app. 8), which has implemented an EM-based maximum likelihood estimator. (For related references, see the Mplus website, www.statmodel.com.) As a special case of (1), conventional random-effects growth modeling includes η_i but excludes C_i and T_i, and assumes missingness at random, so that the last term in (1) is ignored. Growth mixture modeling (Muthén and Shedden 1999; Muthén et al. 2002; Muthén and Muthén 1998-2002) includes η_i and T_i. BFHR includes C_i, but not T_i (or η_i), and includes the last term in (1), drawing on the latent ignorability of Frangakis and Rubin (1999). Muthén and Brown (2001) studied latent ignorability related to T_i in the last term of (1). In randomized studies, it would be of interest to study C_i and T_i classes jointly, because individuals in different trajectory classes may show different compliance, and missingness may be determined by these classes jointly.

If data have been generated by a growth mixture model with treatment effects varying across trajectory classes, what would pretest-posttest analysis such as that in BFHR reveal? To judge the possibility of such treatment-trajectory interaction in the NYSCS, we considered several recent applications of growth mixture modeling that have used T_i to represent qualitatively different types of trajectories for behavior and achievement scores on children in school settings. Drawing on these real-data studies, two growth mixture scenarios were investigated. (A detailed description of these real-data studies and scenarios, and their parameter values, is given in Mplus Web Note 5 at www.statmodel.com/mplus/examples/webnote.html.) For simplicity, no missing data on the outcome or pretest is assumed, and C_i classes are not present. In a three-class scenario, the treatment effect is noteworthy only for a 70% middle class, assuming that the low-class membership (10%) hinders individuals from benefiting from the treatment, and that the high-class membership (20%) does not really need the treatment. The achievement development in the three-class scenario is shown in Figure 1(a), and the corresponding posttest (y2)-pretest (y1) regressions are shown in Figure 1(b). The lines denoted ANCOVA show a regular ANCOVA analysis allowing for an interaction between treatment and pretest (different slopes). In the three-class scenario, the ANCOVA interaction is not significant at n = 2,000, and the treatment effect in the middle class is underestimated by 20%, but overestimated in the other two classes. In a two-class scenario (not shown here), where the treatment is noteworthy only for individuals in the low class (50%), ANCOVA detects an interaction that is significant at the NYSCS sample size of n = 2,000, but underestimates the treatment effect for most children in the low class. (At the low-class average pretest value of 0, the treatment effect is underestimated by 32%.)

Although the NYSCS children are selected from low-performing schools (the average NYSCS math and reading percentile rankings are around 23-28), there may still be sufficient heterogeneity among children in their achievement growth to make a treatment-trajectory interaction plausible. The three-class scenario is possible, perhaps with more children in the low class relative to the other two classes. If this is the case, the ANCOVA analysis shown in Figure 1 suggests a possible reason for BFHR's finding of low treatment effects. The empirical studies and the results in Figure 1 suggest that future program evaluations may benefit from exploring variation in treatment effects across children characterized by different development. Using data from at least two posttreatment time points (three time points total), the class-specific treatment effects generated in these data can be well recovered by growth mixture modeling. (Monte Carlo simulation results are given in Mplus Web Note 5 at www.statmodel.com/mplus/examples/webnote.html.)
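To illustrate the kind of scenario just described, the following small simulation sketch generates a treatment effect only in a 70% middle trajectory class and fits a regular ANCOVA with a treatment-by-pretest interaction; all parameter values are invented for illustration and are not those of Mplus Web Note 5.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 2000
cls = rng.choice(["low", "middle", "high"], n, p=[.1, .7, .2])
level = {"low": 10.0, "middle": 25.0, "high": 45.0}   # class-specific mean levels

y1 = np.array([level[c] for c in cls]) + rng.normal(0, 5, n)  # pretest
tx = rng.integers(0, 2, n)                                    # randomized treatment
effect = np.where(cls == "middle", 5.0, 0.0)                  # effect only in middle class
y2 = np.array([level[c] for c in cls]) + effect * tx + rng.normal(0, 5, n)  # posttest

# Regular ANCOVA allowing a treatment-by-pretest interaction, as in Figure 1.
df = pd.DataFrame({"y1": y1, "y2": y2, "tx": tx})
print(smf.ols("y2 ~ y1 * tx", data=df).fit().params)
```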


Figure 1. Growth Mixture Modeling Versus Pretest-Posttest Analysis.

A more flexible analysis is obtained with more posttreatment time points. An improved design for the determination of the latent trajectory classes is to use more than one pretreatment time point, so that the trajectory class membership is better determined before the treatment starts.

3. COMPOUND EXCLUSION AND LATENT IGNORABILITY

Based on the ideas of principal stratification (Frangakis and Rubin 2002) and latent ignorability (Frangakis and Rubin 1999), BFHR successfully demonstrates that the complexities of educational studies can be better handled under more explicit and flexible sets of assumptions. Although we think that their structural assumptions are reasonable in the NYSCS, we would like to add some thoughts on the plausibility of two other assumptions, considering more general situations.

Compound exclusion (CE) is one of the key structural assumptions in identifying principal effects under latent ignorability. However, the plausibility of this assumption can be questioned in practice (Frangakis et al. 2002; Hirano et al. 2000; Jo 2002a, 2002c; Shadish, Cook, and Campbell 2002; West and Sagarin 2000). In the NYSCS, it seems realistic to assume that winning a lottery has some positive impact on always-takers; however, it is less clear how winning a lottery will affect never-takers. One possibility is that winning a lottery has a negative impact on parents, because they fail to benefit from it. Discouraged parents may have a negative influence on a child's test scores or response behaviors. This negative effect may become more evident if noncompliance is due to parents' low expectation of, or lack of interest in, their child's education. Another possibility is that winning a lottery has a positive impact on a child's test scores or response behaviors. For example, parents who are discouraged by being unable to send their child to private schools even with vouchers may try harder to improve the quality of existing resources (e.g., in the public school their child attends) and be more motivated to support their child to improve his or her academic performance. Given these competing possibilities, it is not easy to predict whether and how CE is violated.

Depending on the situation, causal effect estimates can be quite sensitive to violation of the exclusion restriction in outcome missingness (Jo 2002b), which is less well known than the impact of violating the exclusion restriction in observed outcomes (Angrist et al. 1996; Jo 2002a). The implication of possible violation of CE, and its impact, is that the relative benefit of models assuming latent ignorability (LI) and standard ignorability (SI) depends on the degrees of deviation from CE and SI. Identification of causal effects under LI relies on the generalized (compound) exclusion restriction (i.e., both on the outcomes and on missingness of outcomes), whereas identification of causal effects under SI relies on the standard exclusion restriction (i.e., only on the outcomes). Therefore, in some situations, the impact of deviation from CE may outweigh the impact of deviation from SI, resulting in more biased causal effect estimates in models assuming LI than in models assuming SI (Jo 2002b). For example, if SI holds but CE is seriously violated (say, a 20% increase in the response rate due to treatment assignment for compliers and a 15% increase for never-takers), causal effect estimates and the coverage probability assuming LI and CE can drastically deviate from the true value and the nominal level. This type of violation does not affect models assuming SI and the standard exclusion restriction. To empirically examine the plausibility of SI, LI, and CE, it will be useful to do sensitivity analyses of models imposing different combinations of these assumptions. As BFHR points out, this investigation can be conducted by relaxing compound exclusion (e.g., Frangakis et al. 2002; Hirano et al. 2000) or by using alternative structural assumptions (e.g., Jo 2002c). More research is needed to examine the efficiency of these alternative models and to explore factors associated with insensitivity of LI models to violation of compound exclusion.

4. CONCLUSION

Causal inferences of the type BFHR provides are a dramatic improvement over the existing literature now available on the question of whether school choice will produce better achievement outcomes for children in an urban public school system. The randomized lottery provides an exceptionally powerful tool for examining the impact of a program, far more useful than observational studies that have causal change intertwined hopelessly with self-selection factors. Statisticians are just now investigating variations in such principal strata analyses, that is, those involving latent classes formed as a function of randomized trials involving intervention invitations (such as vouchers), encouragement designs, and field trial designs involving more than one randomization (Brown and Liao 1999). The latent categories in this article, which BFHR labels "complier," "never-taker," "always-taker," and "defier," represent only one type of design. Other terms may be more relevant to the scientific questions underlying trials in which subjects are randomly assigned to different levels of invitation (e.g., Angrist and Imbens 1995) or different levels of implementation. Such trials not only have great potential for examining questions of effectiveness, sustainability, and scalability, but also require terms more consistent with adherence than compliance. Again, we congratulate the authors on an important addition to the methodological literature that we predict will have lasting impact.

ADDITIONAL REFERENCES

Angrist, J. D., and Imbens, G. W. (1995), "Two-Stage Least Squares Estimation of Average Causal Effects in Models With Variable Treatment Intensity," Journal of the American Statistical Association, 90, 431-442.

Brown, C. H., and Liao, J. (1999), "Principles for Designing Randomized Preventive Trials in Mental Health: An Emerging Developmental Epidemiologic Perspective," American Journal of Community Psychology, 27, 673-709.

Jo, B. (2002a), "Model Misspecification Sensitivity Analysis in Estimating Causal Effects of Interventions With Noncompliance," Statistics in Medicine, 21, 3161-3181.

——— (2002b), "Sensitivity of Causal Effects Under Ignorable and Latent Ignorable Missing-Data Mechanisms," available at www.statmodel.com/mplus/examples/jo.

——— (2002c), "Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications" (with discussion), Journal of Educational and Behavioral Statistics, 27, 385-420.

Muthén, B. (2002a), "Beyond SEM: General Latent Variable Modeling," Behaviormetrika, 29, 81-117.

Muthén, B., and Brown, C. H. (2001), "Non-Ignorable Missing Data in a General Latent Variable Modeling Framework," presented at the annual meeting of the Society for Prevention Research, Washington, DC, June 2001.

Muthén, B., Brown, C. H., Masyn, K., Jo, B., Khoo, S. T., Yang, C. C., Wang, C. P., Kellam, S., Carlin, J., and Liao, J. (2002), "General Growth Mixture Modeling for Randomized Preventive Interventions," Biostatistics, 3, 459-475.

Muthén, L., and Muthén, B. (1998-2002), Mplus User's Guide, Los Angeles: authors.

Muthén, B., and Shedden, K. (1999), "Finite Mixture Modeling With Mixture Outcomes Using the EM Algorithm," Biometrics, 55, 463-469.

Seltzer, M. H., Frank, K. A., and Bryk, A. S. (1994), "The Metric Matters: The Sensitivity of Conclusions About Growth in Student Achievement to Choice of Metric," Educational Evaluation and Policy Analysis, 16, 41-49.

Shadish, W. R., Cook, T. D., and Campbell, D. T. (2002), Experimental and Quasi-Experimental Designs for Generalized Causal Inference, Boston: Houghton Mifflin.

Slavin, R. E. (2002), "Evidence-Based Education Policies: Transforming Educational Practice and Research," Educational Researcher, 31, 15-21.

West, S. G., and Sagarin, B. J. (2000), "Participant Selection and Loss in Randomized Experiments," in Research Design, ed. L. Bickman, Thousand Oaks, CA: Sage, pp. 117-154.

Comment

Alan B. KRUEGER and Pei ZHU

The New York City school voucher experiment provides some of the strongest evidence on the effect of private school vouchers on student test achievement yet available. Barnard et al. provide a useful reevaluation of the experiment, and some of the authors were integrally involved in the design of the experiment. We will leave it to other discussants to comment on the Bayesian methodological advances in their paper, and instead comment on the substantive lessons that can be learned from the experiment and the practical lessons that emerge from the novel design of the experiment.

We have the advantage of having access to more complete data than Barnard et al. used in preparing their paper. This includes more complete baseline test information; data on multichild families as well as single-child families; and three years of follow-up test data, instead of just one year of follow-up data.

Alan B. Krueger is Professor and Pei Zhu is graduate student, Economics Department, Princeton University, Princeton, NJ 08544 (E-mail: akrueger@princeton.edu). In part, this comment extends and summarizes results of Krueger and Zhu (2003). Readers are invited to read that article for a more in-depth analysis of many of the issues raised in this comment. Some of the results presented here differ slightly from those in our earlier article, however, because the definition of family size in this comment corresponds to the one used by Barnard et al., rather than to the definition in our earlier article.

In our comment, we use the more comprehensive sample, because results for this sample are more informative, but we highlight where differences arise from using the sample analyzed by Barnard et al.

Three themes emerge from our analysis. First, simplicity and transparency are underappreciated virtues in statistical analysis. Second, it is desirable to use the most recent, comprehensive data for the widest sample possible. Third, there is no substitute for probing the definitions and concepts that underlie the data.

1. RANDOM ASSIGNMENT

As Barnard et al. explain, the voucher experiment entailed a complicated block design, and different random assignment procedures were used in the first and second set of blocks. In the first block, a propensity matched-pairs design (PMPD) method was used. In this block, far more potential control families were available than was money to follow them up. Rather than select a random sample of controls for follow-up, after randomly



Table 1. Efficiency Comparison of Two Methods of Random Assignment: Estimated Treatment Effect and Standard Error, by Method of Random Assignment

                                      Propensity score            Randomized blocks         Relative            Sample-size-adjusted
                                      match subsample             subsample                 sampling variance   relative sampling
Year      Model                       Coefficient    N            Coefficient    N          (RB/PMPD)           variance (RB/PMPD)

All students
Year 1    Control for baseline       .33 (1.40)      721          2.10 (1.38)    734        .972                .989
          Omit baseline              .32 (1.40)      1,048        -1.02 (1.58)   1,032      1.274               1.254
Year 3    Control for baseline       1.02 (1.56)     613          .67 (1.74)     637        1.244               1.293
          Omit baseline              -.31 (1.57)     900          -.45 (1.78)    901        1.285               1.287

African-American students
Year 1    Control for baseline       .32 (1.88)      302          8.10 (1.71)    321        .827                .879
          Omit baseline              2.10 (1.99)     436          3.19 (2.07)    447        1.082               1.109
Year 3    Control for baseline       4.21 (2.51)     247          6.57 (2.09)    272        .693                .764
          Omit baseline              2.26 (2.47)     347          3.05 (2.31)    386        .875                .973

NOTE: Standard errors are in parentheses. The treatment effect coefficient is from a regression of test scores on a dummy indicating assignment to receive a voucher (1 = yes), lottery randomization strata dummies, and, in indicated models, baseline test scores. Bootstrap standard errors account for within-family correlation in residuals. Year 1 or 3 refers to follow-up year. The last column adjusts the relative sampling variances of the treatment effects for differences in relative sample sizes. The sample of African-American students consists of those whose mothers' race/ethnicity is identified as non-Hispanic Black/African-American.

selecting the treatments, the researchers estimated a propensity score model to align controls with the treatments, and then used a nearest-available-neighbor Mahalanobis match to select specific paired control families for follow-up. In principle, the PMPD design should improve the precision of the estimates by reducing the chance imbalances that can occur in a simple randomized block design. A standard randomized block design was used in the remaining four blocks.
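As a rough illustration of this matching step (the actual PMPD propensity model is unpublished and proprietary, so the covariates and distances here are simulated stand-ins), a greedy nearest-available-neighbor Mahalanobis match might look like this:

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(6)
X_treat = rng.normal(0.0, 1.0, (50, 4))   # covariates for treated families
X_ctrl = rng.normal(0.2, 1.0, (400, 4))   # covariate pool of potential controls

# Mahalanobis distances use the inverse covariance of the pooled covariates.
VI = np.linalg.inv(np.cov(np.vstack([X_treat, X_ctrl]).T))
D = cdist(X_treat, X_ctrl, metric="mahalanobis", VI=VI)

matched, taken = [], set()
for i in range(len(X_treat)):        # greedy: each treated unit takes its nearest
    for j in np.argsort(D[i]):       # control that has not already been used
        if j not in taken:
            matched.append((i, j))
            taken.add(j)
            break
print(matched[:5])
```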

Barnard et al. emphasize that the PMPD block was more balanced for 15 of the 21 baseline variables that they examined. But the key question is the extent to which the precision of the estimates was improved, not the covariate balance, because both blocks yield unbiased estimates. If the covariates explain relatively little of the outcome variable, then the PMPD will add very little. In Table 1 we provide standard intent-to-treat (ITT) estimates, that is, the difference in mean test scores between treatments and controls, conditional on the strata used for random assignment (family size by block by high/low-achieving school), and, more importantly, their standard errors, separately for the PMPD subsample and the randomized block subsample. The outcome variable is the average of the national percentile rank on the math and reading segments of the Iowa Test of Basic Skills, taken either 1 year or 3 years after random assignment. Two sets of results are presented: one set controlling for baseline test scores, for the subsample with baseline scores, and the other without controlling for scores, for the larger sample. The last column adjusts the relative sampling variances for differences in sample size between the blocks.
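For concreteness, the conventional ITT regression described here can be sketched as follows; the data are simulated, and the real analysis also applies sample weights and bootstrap standard errors that account for within-family correlation:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 1500
stratum = rng.integers(0, 30, n)    # family size x block x high/low-achieving school
offer = rng.integers(0, 2, n)       # randomized voucher offer (1 = yes)
score = 25 + 2.0 * offer + 0.3 * stratum + rng.normal(0, 15, n)

df = pd.DataFrame({"score": score, "offer": offer, "stratum": stratum})
fit = smf.ols("score ~ offer + C(stratum)", data=df).fit()
print(fit.params["offer"], fit.bse["offer"])   # ITT point estimate and SE
```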

For all students (panel A), the standard errors are about 10% smaller in the PMPD sample. For the Black students (panel B), however, the standard errors are essentially the same or slightly larger in the PMPD sample. This finding is unexpected, because Hill, Rubin, and Thomas (2000) predicted that analyses of subgroups would have more power in the PMPD design, because matching should lead to a more nearly equal representation of subgroups in the treatment and control groups. In addition, the covariate balance is less good for the Black students in the PMPD than in the randomized blocks, if we compute Z scores for the differences between treatments and controls for the baseline covariates. One possibility is that because Black students could have been matched to non-Black students in the PMPD design, this subsample was actually less balanced than the subsample of Blacks from the more conventional randomized block design.

The increase in power from the PMPD design is relatively modest, even in the full sample. Of much more practical importance is the fact that the complexity of the PMPD design caused Mathematica to initially calculate incorrect baseline sample weights. The baseline weights were designed to make the follow-up sample representative of all eligible applicants for vouchers. The initial weights assigned much too little importance to the control sample in the PMPD block. Because these families represented many unselected controls, they should have been weighted heavily. The revised weights increased the weight on the PMPD controls by 620%. In contrast, the weight increased by just 18% for the rest of the sample.

This error had grave consequences for the initial inferences and analyses of the data. First, using the faulty weights, Peterson, Myers, and Howell (1998) reported highly statistically significant differences in baseline test scores between the treatment and control groups, with control group members scoring higher on both the math and reading exams. After the mistake was discovered and the weights were revised, however, the baseline differences became small and statistically insignificant. (We note, however, that if we limit attention to the students from single-child families that Barnard et al. use, and use the more complete baseline test score data, then there is a statistically significant difference in baseline reading scores between treatments and controls, even with the revised baseline weights. If we pool together students from single-child and multichild families, then the baseline difference is insignificant. This is another reason why we think it is more appropriate to use the broader sample that includes children from all families.)

Second, because of the inaccurate inference that there was a difference in baseline ability between treatments and controls, Mathematica's researchers were discouraged from analyzing (or at least from presenting) results that did not condition on baseline scores. This is unfortunate, because conditioning on baseline scores caused the researchers to drop from the sample all the students who were initially in kindergarten (because these students were not given baseline tests) and 11% of students initially in grades 1-4. Including students with missing baseline scores increases the sample by more than 40% and expands the population to which the results can be generalized. As explained later, and in an article by Krueger and Zhu (2003), qualitatively different results are found if students with missing baseline scores are included in the sample. Because assignment to receive a voucher was random within lottery strata, estimates that do not condition on baseline scores are nonetheless unbiased. Moreover, it is inefficient to exclude students with missing baseline scores (most of whom had follow-up test scores), and such a sample restriction can potentially cause sample selection bias.

Of lesser importance is the fact that the PMPD design also complicates the calculation of standard errors. The matching of treatments and controls on selected covariates creates a dependence between paired observations. Moreover, if the propensity score model is misspecified, then the equation error also creates a dependence between observations. Researchers have not taken this dependence into account in the calculation of standard errors. We suspect, however, that this is likely to cause only a small understatement of the standard errors, because the covariates do not account for much of the residual variance of test scores and because the PMPD sample is only about half of the overall sample. (We tried to compute the propensity score to adjust the standard errors ourselves, but were unable to replicate the PMPD model, because it was not described in sufficient detail and because the computer programs are proprietary. This also prevented us from replicating estimates in Table 4 of Barnard et al.)

The problems created by the complicated experimental design, particularly concerning the weights, lead us to reiterate the advice of Cochran and Cox (1957): "A good working rule is to use the simplest experimental design that meets the needs of the occasion." It seems to us that in this case, the PMPD design introduced unnecessary complexity that inadvertently led to a consequential mistake in the computation of weights, with very little improvement in efficiency. This was not the fault of the architects of the PMPD design, who did not compute the weights, but it was related to the complexity of the design. Simplicity and transparency, in designs, methods, and procedures, can help avoid mistakes down the road, which are almost inevitable in large-scale empirical studies. Indeed, Mathematica recently informed us that they still do not have the baseline weights exactly correct, 5 years after random assignment.

2. INTENT-TO-TREAT ESTIMATION RESULTS

Barnard et al. devote much attention to addressing potential problems caused by missing data for the sample of students in grades 1-4 at baseline; this follows the practice of earlier reports by Mathematica, which restricted the analysis of student outcomes (but not parental responses, such as satisfaction) to those enrolled in grades 1-4 at baseline. In our opinion, a much more important substantive problem results from the exclusion of students in the kindergarten cohort, who were categorically dropped because they were not given baseline tests. Because assignment to treatment status was random (within strata), a simple comparison of means between treatments and controls, without conditioning on baseline scores, provides an unbiased estimate of the average treatment effect. Moreover, as Barnard et al. and others show, treatments and controls were well balanced in terms of baseline characteristics, so there is no reason to suspect that random assignment was somehow corrupted.

Table 2 presents regression estimates of the ITT effect using various samples. Because we cannot replicate the Bayesian estimates without knowing the propensity score, we present conventional ITT estimates. For each entry in the first column, we regressed the test score percentile rank on a voucher offer dummy, 30 dummies indicating lottery strata (block × family size × high/low school), and baseline test scores. The sample is limited to those with baseline test data. Our results differ from the ITT estimates in Table 5 of Barnard et al. because we use the revised weights and a more comprehensive sample that also includes students from multichild families. Barnard et al. report that their Bayesian estimates are "generally more stable" than the conventional ITT estimates, "which in some cases are not even credible," but the extreme ITT estimates that they refer to, for 4th graders from high-achieving schools, are based on only 15 students. For the samples that pool students across grades, the results of the conventional ITT estimates are a priori credible and probably not statistically different from their Bayesian estimates.
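To make the conventional specification concrete, a minimal sketch of the column-1 regression follows. The pandas column names (pctile for the test score national percentile rank, voucher for the offer dummy, stratum for the 30 lottery strata, base_read/base_math for baseline scores, and family_id) and the file name are hypothetical; the sampling weights and the bootstrap are omitted for brevity, with family-clustered standard errors standing in for the within-family correlation adjustment used in Table 2.

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("nyc_voucher.csv")  # hypothetical data layout

# Restrict to students with baseline tests, as in columns 1 and 2.
with_base = df.dropna(subset=["base_read", "base_math"])

# Test score percentile on a voucher-offer dummy, lottery-strata dummies
# (block x family size x high/low school), and baseline scores.
fit = smf.ols(
    "pctile ~ voucher + C(stratum) + base_read + base_math",
    data=with_base,
).fit(cov_type="cluster", cov_kwds={"groups": with_base["family_id"]})

print(fit.params["voucher"], fit.bse["voucher"])  # ITT estimate and SE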

In the second column, we use the same sample as in column 1 but omit the baseline test score from the model. In the third column, we expand the sample to include those with missing baseline scores. In the fourth column, we continue to include students with missing baseline scores but restrict the sample to those initially in low-achieving public schools.

We focus mainly on the results for Black students and students from low-achieving public schools, because the effect of offering a voucher on either reading or math scores for the overall sample is always statistically insignificant, and because most public policy attention has focused on these two groups.

Although it has not been made explicit in previous studies of these data, Black students have been defined as children with a non-Hispanic Black/African-American mother. We use this definition and a broader one to probe the sensitivity of the results.

Omitting baseline scores has little qualitative effect on the estimates when the same sample is used (compare columns 1 and 2); the coefficient typically changes by no more than half a standard error. For year 3 scores for Black students, for example, the treatment effect on the composite score is 5.2 points (t = 3.2) when controlling for baseline scores and 5.0 points (t = 2.6) when not controlling. This stability is expected in a randomized experiment.


Table 2. Estimated Treatment Effects With and Without Controlling for Baseline Scores: Coefficient on Voucher Dummy From Test Score Regressions for Various Samples

Columns: (1) subsample with baseline scores, controls for baseline scores; (2) subsample with baseline scores, omits baseline scores; (3) full sample, omits baseline scores; (4) subsample from low applicant schools, omits baseline scores.

Test        Sample                                  (1)           (2)           (3)           (4)

First follow-up test
Composite   Overall                             1.27 (0.96)   0.42 (1.28)  -0.33 (1.06)  -0.60 (1.06)
            Mother Black, non-Hispanic          4.31 (1.28)   3.64 (1.73)   2.66 (1.42)   1.75 (1.57)
            Either parent Black, non-Hispanic   3.55 (1.24)   2.52 (1.73)   1.38 (1.42)   0.53 (1.55)
Reading     Overall                             1.11 (1.04)   0.03 (1.36)  -1.13 (1.17)  -1.17 (1.18)
            Mother Black, non-Hispanic          3.42 (1.59)   2.60 (1.98)   1.38 (1.74)   0.93 (1.84)
            Either parent Black, non-Hispanic   2.68 (1.50)   1.49 (1.97)  -0.09 (1.71)  -0.53 (1.81)
Math        Overall                             1.44 (1.17)   0.80 (1.45)   0.47 (1.17)  -0.03 (1.15)
            Mother Black, non-Hispanic          5.28 (1.52)   4.81 (1.93)   4.01 (1.54)   2.68 (1.63)
            Either parent Black, non-Hispanic   4.42 (1.48)   3.54 (1.90)   2.86 (1.51)   1.59 (1.63)

Third follow-up test
Composite   Overall                             0.90 (1.14)   0.36 (1.40)  -0.38 (1.19)   0.10 (1.16)
            Mother Black, non-Hispanic          5.24 (1.63)   5.03 (1.92)   2.65 (1.68)   3.09 (1.67)
            Either parent Black, non-Hispanic   4.70 (1.60)   4.27 (1.92)   1.87 (1.68)   1.90 (1.67)
Reading     Overall                             0.25 (1.26)  -0.42 (1.51)  -1.00 (1.25)  -0.08 (1.24)
            Mother Black, non-Hispanic          3.64 (1.87)   3.22 (2.19)   1.25 (1.83)   2.06 (1.83)
            Either parent Black, non-Hispanic   3.37 (1.86)   2.75 (2.18)   0.76 (1.83)   1.23 (1.82)
Math        Overall                             1.54 (1.30)   1.15 (1.53)   0.24 (1.33)   0.29 (1.28)
            Mother Black, non-Hispanic          6.84 (1.86)   6.84 (2.09)   4.05 (1.86)   4.13 (1.85)
            Either parent Black, non-Hispanic   6.04 (1.83)   5.78 (2.09)   2.97 (1.85)   2.58 (1.82)

NOTE: Standard errors are in parentheses. The dependent variable is the test score national percentile rank (NPR); the reported coefficient is the coefficient on the voucher offer dummy. All regressions control for 30 randomization strata dummies; models in column (1) also control for baseline math and reading test scores. Bootstrap standard errors are robust to correlation in residuals among students in the same family. Bold font indicates that the absolute t ratio exceeds 1.96.

Sample sizes in year 1 for the subsample with baseline scores/without are: overall, 1,455/2,080; mother Black, 623/883; either parent Black, 684/968.

Sample sizes in year 3 for the subsample with baseline scores/without are: overall, 1,250/1,801; mother Black, 519/733; either parent Black, 572/807.

Sample sizes in year 1 for the subsample from low schools without baseline scores are: overall, 1,802; mother Black, 770; either parent Black, 843.

Sample sizes in year 3 for the subsample from low schools without baseline scores are: overall, 1,569; mother Black, 647; either parent Black, 712.

When the sample is expanded to include those with missing baseline scores in column 3, however, the ITT effect falls almost by half in year 3 for Black students, to 2.65 points, with a t ratio of 1.58. The effect on the math score is statistically significant, as Barnard et al. emphasize. Qualitatively similar results are found if baseline covariates, such as mother's education and family income, are included as regressors (see Krueger and Zhu 2003).

The classification of Black students used in the study is somewhat idiosyncratic. Race and ethnicity were collected in a single question in the parental survey, contrary to the OMB guidelines for government surveys, so it is impossible to identify Blacks of Hispanic origin (of which there are a large number in New York). In addition, the survey did not directly ask about the child's race or ethnicity, so children's race must be inferred from parents' race/ethnicity. A broader definition of Black students, and one that probably comes closer to the definition used in government surveys and most other educational research, would also include those who have a Black father in the sample of Black students. According to the 1990 Census, race was reported as Black for 85% of the children with a Black father and a Hispanic mother (the overwhelming non-Black portion of the sample) in the New York metropolitan region.

Hence we also present results for a sample in which either parent is listed as "Black, non-Hispanic." This increases the sample of Black students by about 10%, and the results are even weaker for this sample. For example, the ITT estimate using the more comprehensive sample of Black students enrolled in grades K–4 at baseline is 1.87 points (t = 1.11) on the third-year follow-up composite test, although even with this sample the effect on math is significant at the 10% level in year 1, as Barnard et al. have found, and almost significant in year 3. So we conclude that the effect of the opportunity to use a private school voucher on the composite score for the most comprehensive sample of Black students is insignificantly different from 0, although it is possible that there was initially a small beneficial effect on the math score for Black students.

In the final column, we present results for the subsample of students originally enrolled in schools with average test scores below the median score in New York City. In each case, the results for this subsample are quite similar to those for the full sample, and a formal test of the null hypothesis that students from low- and high-achieving schools benefit equally from vouchers is never close to rejecting. It also appears very unlikely that the differences between the treatment effects for applicants from low- and high-achieving schools in Barnard et al.'s Table 4 are statistically significant, either.

3. WHAT WAS BROKEN

We agree with Barnard et al. that the experiment was broken in the sense that attrition and missing data were common. Previous analyses were also strained, if not broken, for their neglect of the cohort of students originally in kindergarten, who were in third grade at the end of the experiment and whose follow-up test scores were ignored. Including these students qualitatively alters the results for Black students. The experiment was also broken in the sense that years passed before correct baseline weights were computed.

We disagree, however, with the interpretation that the experiment was broken because compliance was less than 100%. This depends on the question that one is interested in answering. Because most interest in the experiment for policy purposes centers on the impact of offering vouchers on achievement, not on compelling students to use vouchers, we think the ITT estimates, which reflect inevitable partial usage of vouchers, are most relevant (see also Rouse 1998; Angrist et al. 2003). Moreover, New York had a higher voucher take-up rate than the experiments in Dayton and the District of Columbia, so one could argue that the New York experiment provides an upper bound estimate of the effect of offering vouchers. On the other hand, if the goal is to use the experiment to estimate the effect of attending private school on achievement, then such methods as instrumental variables or those used by Barnard et al. are necessary to estimate the parameter of interest. We consider this of secondary interest in this case, however.

ADDITIONAL REFERENCES

Angrist, J., Bettinger, E., Bloom, E., King, E., and Kremer, M. (2003), "Vouchers for Private Schooling in Colombia: Evidence From a Randomized Natural Experiment," American Economic Review, 92, 1535–1558.

Cochran, W., and Cox, G. (1957), Experimental Designs (2nd ed.), New York: Wiley.

Krueger, A., and Zhu, P. (2003), "Another Look at the New York City School Voucher Experiment," American Behavioral Scientist, forthcoming. Also available from Industrial Relations Section, Princeton University, at http://www.irs.princeton.edu.

Peterson, P., Myers, D., and Howell, W. (1998), "An Evaluation of the New York City Scholarships Program: The First Year," Mathematica Policy Research, Inc., available at http://www.ksg.harvard.edu/pepg/pdf/ny1rpt.pdf.

Rouse, C. (1998), "Private School Vouchers and Student Achievement: An Evaluation of the Milwaukee Parental Choice Program," The Quarterly Journal of Economics, 113, 553–602.

Comment

Richard A. BERK and Hongquan XU

1. INTRODUCTION

The article by Barnard, Frangakis, Hill, and Rubin (hereafter BFHR) is a virtuoso performance. By applying the Neyman–Rubin model of causal effects and building on a series of important works on the estimation and interpretation of causal effects (e.g., Angrist, Imbens, and Rubin 1996), BFHR manage to extract a reasonable set of findings for a "broken" randomized experiment. The article more generally underscores the important difference between estimating relationships in a consistent manner and interpreting those estimates in causal terms; obtaining consistent estimates is only part of the enterprise. Finally, the article emphasizes that assumptions made so that proper estimates may be obtained are not just technical moves of convenience. Rather, they are statements about how the empirical world is supposed to work. As such, they need to be examined concretely, with reference to the phenomena being studied, evaluated empirically whenever possible, and at a minimum subjected to sensitivity tests.

Richard A. Berk is Professor and Hongquan Xu is Assistant Professor, Department of Statistics, University of California, Los Angeles, CA 90095 (E-mail: berk@stat.ucla.edu).

As battle-tested academics, we can certainly quibble here and there about some aspects of the article. For example, might it not have made sense to construct priors from the educators and economists who were responsible for designing the intervention? Moreover, might it not have made sense to interpret the results using a yardstick of treatment effects that are large enough to matter in practical terms? Nonetheless, we doubt that, overall, we could have done as well as BFHR did. Moreover, most of our concerns depend on features of the data that cannot be known from a distance; we would have had to examine the data carefully ourselves. As a result, we focus on why a virtuoso performance was needed to begin with. Why was this trip necessary?



2. NONCOMPLIANCE

The implementation of randomized field experiments, as desirable as they are, invites any number of well-known problems (Berk 1990). One of these problems is that human subjects often do not do what they are told, even when they say they will. There is recent scholarly literature on noncompliance (Metry and Meyer 1999; Johnson 2000) and a citation trail leading back to evaluations of the social programs of the War on Poverty (Haveman 1977). Some noncompliance can be expected except when there is unusual control over the study subjects. Our recent randomized trial of the inmate classification system used by the California Department of Corrections is one example (Berk, Ladd, Graziano, and Baek 2003).

An obvious question follows: Why does it seem that the individuals who designed this experiment were caught by surprise? As best we can tell, little was done to reduce noncompliance, and, more important for the analyses of BFHR, apparently almost no data were collected on factors that might be explicitly related to noncompliance (Hill, Rubin, and Thomas 2000).

Without knowing far more about the design stages of the experiment, we find it difficult to be very specific. But from the literatures on compliance and school choice, the following are probably illustrative indicators relevant for parents who would not be inclined to use a school voucher; for other forms of noncompliance, other measures would be needed:

1. Whether the adults in the household are employed outside the home during the day

2. Whether there is any child care at the private schools before classes begin and after they end, for children of employed parents

3. A household's ability to pay for private school beyond the value of the voucher

4. Travel time to the nearest private schools

5. Availability of transportation to those schools

6. Religious preference of the household and religious affiliation of the private school

From these examples, and others that individuals closer to the study would know about, one can imagine a variety of measures to reduce noncompliance. For instance, key problems for any working parent are child care before and after school and round-trip school transportation. These might be addressed directly with supplemental support. If no such resources could be obtained, then the study design might be altered to screen out, before random assignment, households for whom compliance was likely to be problematic.

3. MISSING DATA

No doubt BFHR know more about the subtleties of the missing data than we do. Nevertheless, the same issues arise for missing data that arise for noncompliance. Surely problems with missing data could have been anticipated. Factors related to nonresponse to particular items could have been explicitly measured, and prevention strategies perhaps could have been deployed.

4. DESIGN EFFICIENCY

A carefully designed experiment not only provides a solid basis for statistical inference, but also gives credible evidence for causation. The New York School Choice Scholarships Program (NYSCSP) was apparently the first good example of a randomized experiment to examine the potential benefits of vouchers for private schools. Offering a scholarship (treatment) randomly through a lottery can (1) provide a socially acceptable way to ration scarce resources, (2) protect against undesirable selection bias, and (3) produce balance between the observed and unobserved variables which may affect the response. In addition to the randomization and blocking, NYSCSP implemented a PMPD (propensity matched pairs design) to choose a control group from a large candidate pool and to balance many covariates explicitly. Table 2 of BFHR shows that various background variables were balanced properly between the treatment group and the control group.

But might it have been possible to do a bit better? BFHR did not mention that the sample size of the applicant's school was not balanced, although winners from high and low schools were balanced between the treatments and the controls. An ideal design would have had equal winners from "bad" (or low) and "good" (or high) public schools. In the NYSCSP, 85% of the winners were from bad schools and 15% were from good schools, to achieve the social goals of the School Choice Scholarships Foundation (Hill et al. 2000, p. 158). Bad schools are defined as "those for which the average test scores were below the median test scores for the city" (Hill et al. 2000, p. 158). A direct consequence is that the estimates from the good schools have much larger variation than those from the bad schools, which is evident in BFHR's Tables 4–6: the confidence intervals for good schools are wider.

Instead of classifying schools into two classes (bad and good), it might have been helpful to classify them into four classes: very bad, bad, good, and very good. To estimate the effect of applicant's school (assuming linear effects), an optimal design would assign half of the winners to the very bad schools, the other half to the very good schools, and none to the two middle classes. Avoiding the middle classes enlarges the distance between bad and good schools, reduces the source variation within the bad and good schools, and thus reduces the variation in the estimates. To be more consistent with the goals of the School Choice Scholarships Foundation, an alternative design might take 75% of winners from very bad schools, 10% from bad schools, 5% from good schools, and 10% from very good schools. This design would have a higher design efficiency than the original design, especially because the applicant's school was the most important design variable after family size (Hill et al. 2000, p. 160). Of course, people on the ground at the time may have considered these options and rejected them. Our point is that with better planning, it might have been possible to squeeze more efficiency out of the study design. Such an outcome would have made the job undertaken by BFHR a bit easier.
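The gain at stake here can be seen in a small calculation. Under a linear-effects model, the variance of the estimated slope of school quality is inversely proportional to the spread of the design points, so allocations that avoid the middle classes estimate the slope more precisely. The sketch below uses a hypothetical 0–3 coding for very bad through very good schools and an arbitrary sample size; it illustrates the design principle only, not a reanalysis of the NYSCSP.

import numpy as np

def slope_var(shares, n=1000, sigma2=1.0):
    # Variance of the OLS slope when the given shares of n units sit at
    # the (hypothetical) school-quality codes x = 0, 1, 2, 3.
    counts = (np.asarray(shares) * n).astype(int)
    x = np.repeat([0.0, 1.0, 2.0, 3.0], counts)
    return sigma2 / np.sum((x - x.mean()) ** 2)

print(slope_var([0.50, 0.00, 0.00, 0.50]))      # extremes only (optimal)
print(slope_var([0.75, 0.10, 0.05, 0.10]))      # the SCSF-friendly alternative
print(slope_var([0.425, 0.425, 0.075, 0.075]))  # a stylized 85/15 status quo

With these illustrative numbers, the extremes-only allocation roughly halves the slope variance relative to the alternative design, which in turn beats the stylized status quo.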

5. CONCLUSIONS

Scientific problems often provide a fertile ground for important statistical developments; BFHR's article, and the literature on which it depends, are excellent illustrations. However, from the perspective of public policy, the goal of program evaluation is to obtain usefully precise estimates of program impacts while using the most robust means of analysis available. Simplicity is also important, both for data analysts, who must do the work, and for policy makers, who must interpret the results. It follows that large investments at the front end of an evaluation, in research design, measurement, and study integrity, are essential. Virtuoso statistical analyses can certainly help when these investments are insufficient or when they fail, but they must not be seen as an inexpensive alternative to doing the study well to begin with.

ADDITIONAL REFERENCES

Angrist, J., Imbens, G. W., and Rubin, D. B. (1996), "Identification of Causal Effects Using Instrumental Variables" (with discussion), Journal of the American Statistical Association, 91, 441–472.

Berk, R. A. (1990), "What Your Mother Never Told You About Randomized Field Experiments," in Community-Based Care of People With AIDS: Developing a Research Agenda, AHCPR Conference Proceedings, Washington, DC: U.S. Department of Health and Human Services.

Berk, R. A., Ladd, H., Graziano, H., and Baek, J. (2003), "A Randomized Experiment Testing Inmate Classification Systems," Journal of Criminology and Public Policy, 2, 215–242.

Haveman, R. H. (ed.) (1977), A Decade of Federal Antipoverty Programs: Achievements, Failures, and Lessons, New York: Academic Press.

Hill, J. L., Rubin, D. B., and Thomas, N. (2000), "The Design of the New York School Choice Scholarships Program Evaluation," in Research Design: Donald Campbell's Legacy (Vol. II), ed. L. Bickman, Thousand Oaks, CA: Sage.

Johnson, S. B. (2000), "Compliance Behavior in Clinical Trials: Error or Opportunity?" in Promoting Adherence to Medical Treatment in Chronic Childhood Illness: Concepts, Methods, and Interventions, ed. D. Drotar, Mahwah, NJ: Lawrence Erlbaum.

Metry, J., and Meyer, U. A. (eds.) (1999), Drug Regimen Compliance: Issues in Clinical Trials and Patient Management, New York: Wiley.

Rejoinder

John BARNARD, Constantine E. FRANGAKIS, Jennifer L. HILL, and Donald B. RUBIN

We thank the editorial board and the thoughtful group of discussants that they arranged to comment on our article. The discussants have raised many salient issues and offered useful extensions of our work. We use this rejoinder primarily to address a few points of contention.

Muthén, Jo, and Brown

Muthén, Jo, and Brown (henceforth MJB) provide an enlightening discussion of an alternate, and potentially complementary, approach to treatment effect estimation. The growth models that they discuss are for data more extensive than what we analyzed, and so are beyond the scope of our article. Nevertheless, MJB propose an interesting idea for looking for interactions in longitudinal experimental data, and we look forward to future applications of the method to appropriate datasets.

Although models that incorporate both our compliance principal strata and latent growth trajectories are theoretically possible (assuming multiple years of outcome data) and could potentially yield an extremely rich description of the types of treatment effects manifested, the more time points that exist, the more complicated the compliance patterns can become (at least without further assumptions), and it may be difficult to find a dataset rich enough to support all of the interactions. Given a choice, we feel that it is probably more beneficial to handle real observed issues than to search for interactions with latent trajectories, but certainly it would always be nice to be able to do both. Of course, such a search for interactions in trajectories is more beneficial when there is enough prior knowledge about the scientific background to include additional structural assumptions that can support robust estimation.

MJB make an interesting point about the exploration of potential violations of exclusion for never-takers. Such analyses have been done in past work by some of the authors (Hirano et al. 2000; Frangakis, Rubin, and Zhou 2002), but were not pursued for our study because the robustness of such methods is still an issue in examples as complex as this one.

Regarding the trade-offs between the joint latent ignorability and compound exclusion assumptions versus the joint assumptions of standard ignorability and the standard exclusion restriction, this topic is up for debate. We would like to point out, however, that these standard structural assumptions come with their own level of arbitrariness. These issues have been explored by Jo (2002) and Mealli and Rubin (2002, 2003).

We thank MJB for providing a thought-provoking discussion of our model and assumptions, as well as ample fodder for additional directions that can be explored in this arena. Clearly, this general category of work has stimulated much new and exciting modeling.

Krueger and Zhu

Krueger and Zhu (henceforth KZ) focus primarily on the underlying study and on analyses that either we did not do or that used data we did not have. Clearly, they are interested in the substantive issues, which is admirable, but we cannot be responsible for analyses that we neither did nor were consulted on, and we are not prepared to comment on what analyses we might have undertaken with data that we did not have.

KZ state that "there is no substitute for probing the definition and concepts that underlie that data." Although we are not entirely sure what this refers to in particular, if it means that good science is more important than fancy techniques, then we certainly agree.

Another comment by KZ is that "it is desirable to use the most recent, comprehensive data for the widest sample possible." We agree that it is desirable to have access to the most comprehensive data available, but it is not necessary to use all of these data in any given analysis. Here we use only first-year data, because these were the only data available to us at the time that we were building the model. We used only single-child families, as explained in the article, because, regrettably, no compliance data were collected for individual children in the multichild families.

The exclusion of kindergarten children could be considered more of a judgment call regarding the quality of the pretest scores, not only for efficiency, but also for the reasonableness of our assumptions about noncompliance and missing data. Nothing in theory would prevent us from including these children and omitting these extra covariates.

These types of trade-offs (more data vs. more appropriate data) are common in statistical analyses, and the benefits are well known, for example, for avoiding model extrapolation or when some subsets of data do not mean the same thing as others. In fact, Krueger "ignored data" in his article on the minimum wage with David Card (Card and Krueger 1994) by limiting analyses to restaurants from a similar geographic region, in the hope that employees and employers across the two comparison states would face similar economic conditions, rather than looking at all restaurants in these states. This is a form of implicit matching, which we believe represented a good choice. Surely there are times when focusing on subsets of data produces more reliable results.

Furthermore, we would like to point out that KZ's assertion that "because assignment to treatment status was random (within strata), a simple comparison of means between treatments and controls, without conditioning on baseline scores, provides an unbiased estimate of the average treatment effect" is simply false, because there are missing outcomes. Even if there is perfect compliance, a simple comparison of means across treatment groups when there are missing outcomes will not lead, in general, to an unbiased estimate of the treatment effect. Assumptions need to be made about the missing-data process.
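A minimal simulation makes the point concrete. Under hypothetical numbers chosen purely for illustration (a true ITT effect of 2 percentile points, and a response model in which taking the follow-up test depends on both assignment and the outcome itself), the complete-case difference in means does not recover the true effect even with perfect compliance:

import numpy as np

rng = np.random.default_rng(0)
n = 200_000
z = rng.integers(0, 2, n)                  # randomized offer
y = 50 + 2.0 * z + rng.normal(0, 10, n)    # true ITT effect = 2.0
# Illustrative response model: follow-up scores are more likely to be
# observed for treated children and for higher-scoring children.
p_respond = 1 / (1 + np.exp(-(0.3 * z + 0.05 * (y - 50))))
observed = rng.random(n) < p_respond

naive = y[observed & (z == 1)].mean() - y[observed & (z == 0)].mean()
print(f"complete-case difference in means: {naive:.2f} (truth: 2.00)")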

When noncompliance exists in addition to the missing outcome data, Frangakis and Rubin (1999) have shown that estimation of even the ITT effect requires additional assumptions. Moreover, the ITT estimate, generally considered a conservative estimate of the treatment effect, can actually be anticonservative in this scenario. Finally, contrary to what KZ seem to imply in their final paragraph, instrumental variables methods alone are not sufficient to address the problems of both noncompliance and missing data.

KZ advocate "simplicity and transparency." We wholeheartedly agree, but disagree with the examples they use to defend their stance. Their first argument is that the PMPD did not yield strong efficiency gains and that we did not account for the PMPD in the calculation of standard errors. It may well be that precision gains were relatively modest; however, the analyses that they present are by no means conclusive in this regard. First, as with KZ's other ITT analyses, we have no clue as to how missing outcome data are handled. In addition, the PMPD was also used to balance compliance and missingness. It is unclear, however, how the gains in precision due to matching the missingness and noncompliance are displayed, if at all, in KZ's results. Finally, KZ did not account for the experimental design when estimating their results. Failure to account properly for the design typically results in estimates that overstate the variances of treatment effect estimates. In this design, it is the correlation between matched pair units that leads the sampling variances to be smaller than their estimates.

Even if precision gains were truly modest, they of course were unknown to us at the time we debated the use of the PMPD design. KZ's argument is like the complaint of a cross-country motorist who experiences no car problems during his long drive and then bemoans the money wasted on a spare tire purchased before the trip began. We prefer to avoid potential problems by using careful design.

Importantly, KZ's criticism that we ignored this matching in the analysis is incorrect. The matching was done based on the estimated propensity score, so it follows from Rubin (1978) that assignment is ignorable conditionally, given the covariates that were used to estimate it. Thus the Bayesian analysis that we conduct, which conditions on the estimated propensity score, is an approximately valid Bayesian analysis with respect to the matched assignment, assuming that our modeled relationship between the outcomes and the estimated propensity score captures the major features of the relationship to the covariates.

KZ's second argument is that use of the PMPD "led to grave consequences," because sampling weights were miscalculated by MPR. It seems to us odd to maintain, as KZ do, that the PMPD "caused" this error; nor is it clear that use of another design would have avoided the error. In any case, we have rerun our analyses, this time generalizing to the population of which the children in our analyses are representative (by setting constant weights in step 3 of Sec. 8.1.1). For this population, all results are very similar to our original results. These results are posted as Appendix C at http://biosun01.biostat.jhsph.edu/~cfrangak/papers/sc.

A third argument revolves around the protracted debate between use of ITT versus treatment-targeted measures such as the CACE. We addressed both ITT and CACE, which allows the reader to focus on one or the other or both, depending on which effect is expected to generalize to another environment: the ITT, or "sociological," effect of accepting or not accepting an offered voucher, or the CACE, or "structural" or "scientific," effect of using a voucher. One could argue that the effect of attending (CACE) is the more generalizable, whereas the sociological effect (ITT) is more easily influenced, for instance, by "hype" about how well the program works. That is, the ITT estimate will not reflect the "effect of offering" if, the next time the program is offered, the compliance rates change (which seems likely). We prefer to examine both, to provide a more comprehensive picture.

In sum, we agree that simplicity and transparency are important; this is the major motivation for our detailed description of our assumptions in terms of readily understandable concepts, rather than commonly invoked and less intuitive conditions involving correlated error terms. However, this desire for simplicity must be balanced with the prudence of guarding against potential problems through careful design.

Although we agree in theory with much of what KZ espouse, some of their criticisms seem misguided. Moreover, their presentation of analyses with incomplete exposition regarding missing-data complications makes these analyses difficult for us to evaluate properly.


Berk and Xu

Berk and Xu (henceforth BX) raise several interesting questions about the way in which the evaluation was designed and implemented. Many of these are actually addressed in a preceding and lengthier article by us on this topic (Barnard et al. 2002). That article describes, for instance, MPR's efforts to reduce missing data, without which the missingness rates would surely have been much higher. It also describes special challenges faced in this study. But in any study with human subjects, it is difficult to completely eliminate missing data, particularly if one desires more interesting information than can be obtained from administrative data. One of the challenges for the collection of outcome test scores was the fact that children needed to be tested outside of the public school setting, which allowed the use of a standardized test not used by the public school system, thereby greatly reducing the possibility of bias resulting from teachers "teaching to the test." Consequently, parents had to bring their children to testing areas on weekends, a formidable hurdle for some, even when monetary incentives were used.

An important consideration with this study was that the evaluators took the initiative to investigate a program that was sponsored by an organization (SCSF) whose primary goal was to provide scholarships, not to do research. We feel that MPR did an admirable job convincing the SCSF to allow the study to be evaluated, and raising money for this purpose, but this also meant that the evaluators could not control every aspect of the program. For example, the SCSF originally wanted to make the program available only to children from the lower-test-score schools, and the evaluators had to convince them to also allow a small percentage from those in higher-test-score schools, to increase the generalizability of the results. The designs that BX suggest, even though they have certain desirable statistical properties under appropriate assumptions, were not an option with this study.

Related to this issue of the alignment of the evaluators' and program administrators' goals is the issue of the potential for reducing noncompliance rates. Some effort was made, in that help was provided to scholarship winners in finding an appropriate private school. Arguably, however, 100% compliance should not be the goal in such a study, given that we would not expect 100% compliance if the program were expanded and offered, say, by the government on a larger scale. Thus it may not be wise to use extreme efforts to increase compliance if such efforts would not be offered in the large-scale public case. With the current study, the opportunity exists to examine the noncompliance process for this program in this setting: take-up rates, predictors of noncompliance, and so on. To this end, some sort of subexperiment that randomizes a reasonable set of compliance incentives might be useful in similar studies in the future, if sample sizes are sufficient.

We are not convinced that we should have solicited actual prior distributions from educators and economists. In our experience, this is a confusing, difficult, and often unrewarding task. Moreover, school choice is such a politically charged and contentious issue that it seems wiser to try to use relatively diffuse and objective prior distributions.

We agree that results should be interpreted in terms of a yardstick of practical effects. We are unaware of the most appropriate such yardstick to use in this context. One side could claim that even tiny average effects matter, not only because of the cumulative effects on parts of society, but also because of possible nonadditivity of such effects, which could imply huge effects for some rare individuals and minute effects for all others. Others could argue that seemingly quite large effects have no long-run practical impact on quality of life. The results that we present provide a basic empirical starting point for these debates between people whose expertise is in this area.

We thank BX for highlighting our emphasis on the role of the assumptions in our analyses. Our hope is that we have presented the assumptions in such a way as to enable readers to make their own judgment about whether or not they were likely to be satisfied, rather than only stating them technically, which would implicitly make those important decisions for the interested but nontechnical readers.

BX point out a dearth of certain types of covariates. Part of the problem here is that we presented only a small portion of the data collected, because our model in its current form did not handle many covariates. However, there were indeed discussions between some members of our team and MPR staff, during the development of the questionnaire, about what types of variables could be included that would be predictive of missing data or compliance, as well as of outcomes. Some of the variables proposed were discarded due to the surveyors' experience with them as subject to problems such as high measurement error. Also, not every variable discussed was included, due to a desire for a survey that would not be overly burdensome for the study families.

We raise a final, small, semantic question. We wonder about the propriety of the label "Neyman–Rubin model" to describe a framework for causal inference that views observational studies and randomized experiments on the same continuum and, moreover, encourages Bayesian analyses of both. Neyman (1923) certainly seems to have been the first to use the formal notation of potential outcomes for randomization-based inference in randomized experiments, which was a truly major advance, but neither he nor anyone else, to the best of our knowledge, used this formal notation to define causal effects in observational studies until Rubin (1974, 1975), nor did he nor anyone else ever advocate Bayesian inference for the unobserved potential outcomes until Rubin (1975, 1978), which is the perspective that we have implemented here.

Overall, we agree with BX about the paramount importance of study design, and we tried to design this study carefully, certainly more so than most social science analyses of secondary datasets like the PSID or CPS. We thank BX for their thoughtful commentary; they raise some important points.

We thank all of the discussants for stimulating comments, which we believe have added emphasis to the importance of both the substantive topic of school vouchers and the development of appropriate methods to evaluate this and other such broken randomized experiments.

ADDITIONAL REFERENCES

Card, D., and Krueger, A. (1994), "Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania," American Economic Review, 84, 772–793.


Jo, B. (2002), "Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications" (with discussion), Journal of Educational and Behavioral Statistics, 27, 385–420.

Mealli, F., and Rubin, D. B. (2002), Discussion of "Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications," by B. Jo, Journal of Educational and Behavioral Statistics, 27, 411–415.

(2003), "Assumptions Allowing the Estimation of Direct Causal Effects," Discussion of "Healthy, Wealthy, and Wise? Tests for Direct Causal Paths Between Health and Socioeconomic Status," by Adams et al., Journal of Econometrics, 112, 79–87.

Rubin, D. B. (1975), "Bayesian Inference for Causality: The Importance of Randomization," in Proceedings of the Social Statistics Section, American Statistical Association, pp. 233–239.


the context of broken randomized experiments, a phrase apparently first coined by Barnard, Du, Hill, and Rubin (1998). We discuss the framework that we use in Section 5. We present our model's structural assumptions in Section 6 and its parametric assumptions in Section 7. We give the main results of the analysis in Section 8 (with some supplementary results in the Appendix). We discuss model building and checking in Section 9, and conclude the article in Section 10.

2. THE SCHOOL CHOICE SCHOLARSHIPS FOUNDATION PROGRAM

In February 1997, the SCSF announced that it would provide 1,300 private school scholarships to "eligible" low-income families. Eligibility required that the children, at the time of application, be living in New York City, entering grades 1–5, currently attending a public school, and from families with incomes low enough to qualify for free school lunch. That spring, SCSF received applications from more than 20,000 students. To participate in the lottery that would award the scholarships, a family had to attend a session during which (1) their eligibility was confirmed, (2) a questionnaire of background characteristics was administered to the parents/guardians, and (3) a pretest was administered to the eligible children. The final lottery, held in May 1997, was administered by Mathematica Policy Research (MPR), and the SCSF offered winning families help in finding placements in private schools.

Details of the design have been described by Hill, Rubin, and Thomas (2000). The final sample sizes of children are displayed in Table 1. PMPD refers to the randomized design developed for this study (propensity matched pairs design). This design relies on propensity score matching (Rosenbaum and Rubin 1983), which was used to choose a control group for the families in the first application period, where there were more applicants that did not win the scholarship than could be followed. The "single" and "multi" classifications describe families that have one child and more than one child, respectively, participating in the program.

Table 1. Sample Sizes in the SCSF Program

                                      Randomized block
Family size   Treatment      PMPD     1    2    3    4   Subtotal   Total

Single        Scholarship     353    72   65   82  104      323      676
              Control         353    72   65   82  104      323      676
Multi         Scholarship     147    44   27   31   75      177      324
              Control         147    27   23   33   54      137      284

Total                        1000                           960     1960

For period 1 and single-child families, Table 2 (taken from Barnard, Frangakis, Hill, and Rubin 2002, table 16) compares the balance achieved on background variables with the PMPD and two other possible designs: a simple random sample (RS) and a stratified random sample (STRS) of the same size from the pool of all potential matching subjects at period 1. For the STRS, the strata are the "applicant's school" (low/high), which indicates whether the applicant child originates from a school that had average test scores below (low) or above (high) the citywide median in the year of application. Measures of comparison in Table 2 are Z statistics between the randomized arms. Overall, the PMPD produces better balance in 15 of the 21 variables compared with the RS design. The PMPD's balance was better in 11 variables and worse in 9 variables (1 tie) compared with the STRS, although the gains are generally larger in the former case than in the latter. The table also demonstrates balance for the application periods 2–5, which were part of a standard randomized block design in which the blocks were each of the four periods cross-classified by family size and by applicant's school.

More generally, the entire experiment is a randomized design where the assignment probabilities are a function of the following design variables: period of application, applicant's school, family size (single child versus multichild), and the estimated propensity scores from the PMPD. (For additional information on the design, see Hill et al. 2000.) Next we focus on the study's main data complications.
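For readers who want a concrete picture of the kind of matching a PMPD performs, the following generic sketch pairs each treated family with the closest available control on an estimated propensity score. The DataFrames treated and pool (sharing covariate columns covs) are hypothetical, and the pool is assumed at least as large as the treated group; the actual PMPD involved additional steps not reproduced here.

import numpy as np
import pandas as pd
import statsmodels.api as sm

def propensity_pair_match(treated, pool, covs):
    # Fit a logistic regression of group membership on covariates, then
    # greedily pair each treated unit with the nearest unused control.
    both = pd.concat([treated.assign(t=1), pool.assign(t=0)], ignore_index=True)
    X = sm.add_constant(both[covs].astype(float))
    score = np.asarray(sm.Logit(both["t"], X).fit(disp=0).predict(X))
    is_treated = (both["t"] == 1).to_numpy()
    e_t, e_c = score[is_treated], score[~is_treated]
    free = np.ones(len(e_c), dtype=bool)   # controls not yet matched
    pairs = []
    for i in np.argsort(-e_t):             # hardest-to-match (highest score) first
        dist = np.where(free, np.abs(e_c - e_t[i]), np.inf)
        j = int(np.argmin(dist))
        free[j] = False
        pairs.append((i, j))               # positions within treated and pool
    return pairs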

Table 2 Design Comparisons in Balance of Background Variables Single-Child Families The Numbers AreZ Statistics From Comparing Observed Values of Variables Between Assignments

Application period 1 Periods 2ndash5

Simple Strati ed RandomizedVariable random sample random sample PMPD block

Applicantrsquos school (lowhigh) iexcl98 0 11 21Grade level iexcl163 03 iexcl03 iexcl39Pretest read score iexcl38 65 48 iexcl105Pretest math score iexcl51 117 20 iexcl137African-American 180 168 159 174Motherrsquos education 16 14 09 167In special education 31 166 iexcl17 22In gifted program 42 iexcl116 iexcl13 75English main language iexcl106 iexcl02 iexcl103 iexcl44AFDC iexcl28 49 83 iexcl157Food stamps iexcl108 iexcl27 94 iexcl131Mother works iexcl126 iexcl30 iexcl118 40Educational expectations 50 179 57 19Children in household iexcl101 iexcl175 41 iexcl102Child born in US 49 73 iexcl140 iexcl69Length of residence 42 71 66 iexcl78Fatherrsquos work missing 109 70 0 16Catholic religion iexcl184 iexcl19 iexcl74 iexcl80Male 88 122 76 53Income iexcl38 iexcl62 74 iexcl121Age as of 497 iexcl157 18 iexcl47 iexcl87


3. DATA COMPLICATIONS

The data that we use include the design variables, the background survey collected at the verification sessions, pretest data, and posttest data collected the following spring. The test scores used as outcomes in our analyses are grade-normed national percentile rankings of reading and math scores from the Iowa Test of Basic Skills (ITBS). The ITBS was used because it is not the test administered in the New York City public school system, which reduces the possibility of teachers in schools with participating children "teaching to the test."

Attempts to reduce missingness of data included requiring attendance at the initial verification sessions and providing financial incentives to attend the follow-up testing sessions. Despite these attempts, missingness of background variables did occur before randomization. In principle, such missingness is also a covariate, and so does not directly create imbalance of subjects between randomized arms, although it does create loss in efficiency when, as in this study, background covariates are used in the analysis. For example, for single-child families, depending on application period and background strata, 34%–51% of the children's pretest scores were missing at the time of design planning. Since then, MPR has processed and provided an additional 7% and 14% of the reading and mathematics pretest scores, respectively. These scores were not as balanced between arms as when choosing the design (see Table 2), although the difference was not statistically significant at conventional levels. Hence we used all available pretest scores in the final analysis; that is, we condition on these scores.

Outcomes were also incomplete. Among the observed outcomes in single-child families, the average (standard deviation) was 23.5 (22.5) percentile rankings for mathematics and 28.1 (23.9) percentile rankings for reading, and unadjusted comparisons between randomized arms do not point to a notable difference. However, in contrast to missingness of background variables, missingness of outcomes, which occurs after randomization, is not guaranteed to be balanced between the randomized arms. For example, depending on application period and background strata, 18%–27% of the children did not provide posttest scores, and during design periods 2–5, the response is higher among scholarship winners (80%) than among the other children (73%). Analyses that would be limited to complete cases for these variables and the variables used to calculate the propensity score would discard more than half of the units. Moreover, standard adjustments for outcome missingness ignore its potential interaction with the other complications and generally make implicit and unrealistic assumptions.

Another complication was noncompliance: attendance at a private school was not perfectly correlated with winning a scholarship. For example, for single-child families, and depending on application period and background strata, 20%–27% of children who won scholarships did not use them (scholarships were in the amount of $1,400, which generally did not fully cover tuition at a private school), and 6%–10% of children that did not win scholarships were sent to private schools nevertheless.

Two additional complications limit our analysis sample. First, no pretest scores were obtained for applicants in kindergarten, because these children most likely had never been exposed to a standardized test; hence considerable time would have been spent instructing them on how to take a test, and there was concern that separating such young children from their guardians in this new environment might lead to behavioral issues. Hence we focus our analyses on the children who applied in first grade or above. Second, we do not yet have complete compliance data for the multichild families. Consequently, all analyses reported in this article are further limited to results for the 1,050 single-child families who were in grades 1–4 at the time of the spring 1997 application process.

4. THE STUDY AS A BROKEN RANDOMIZED EXPERIMENT

The foregoing deviations from the study's protocol clarify that our experiment does not really randomize attendance, but rather randomizes the "encouragement," using financial support, to attend a private school rather than a public school. Moreover, as in most encouragement studies, interest here focuses not only on the effect of encouragement itself (which will depend on what percentage of people encouraged would actually participate if the voucher program were to be more broadly implemented), but also on the effect of the treatment being encouraged: here, attending private versus public schools. If there were perfect compliance, so that all those encouraged to attend private school actually did so and all those not so encouraged stayed in public school, then the effect being estimated typically would be attributed to private versus public school attendance, rather than simply to the encouragement.

We focus on defining and estimating two estimands: the intention-to-treat (ITT) effect, the effect of the randomized encouragement on all subjects; and the complier average causal effect (CACE), the effect of the randomized encouragement on all subjects who would comply with their treatment assignment no matter which assignment they would be given (here, the children who would have attended private school had they won a scholarship and would not have attended had they not won a scholarship). These quantities are defined more formally in Section 8.

In recent years, there has been substantial progress in the analysis of encouragement designs, based on building bridges between statistical and econometric approaches to causal inference. In particular, the widely accepted approach in statistics to formulating causal questions is in terms of "potential outcomes." Although this approach has roots dating back to Neyman and Fisher in the context of perfect randomized experiments (Neyman 1923; Rubin 1990), it is generally referred to as Rubin's causal model (Holland 1986) for work extending the framework to observational studies (Rubin 1974, 1977) and including modes of inference other than randomization-based, in particular, Bayesian (Rubin 1978a, 1990). In economics, the technique of instrumental variables, due to Tinbergen (1930) and Haavelmo (1943), has been a main tool of causal inference in the type of nonrandomized studies prevalent in that field. Angrist, Imbens, and Rubin (1996) showed how the approaches can be viewed as completely compatible, thereby clarifying and strengthening each approach. The result was an interpretation of the instrumental variables technology as a way to approach a randomized experiment that suffers from noncompliance, such as a randomized encouragement design.


In encouragement designs with compliance as the only partially uncontrolled factor, and where there are full outcome data, Imbens and Rubin (1997) extended the Bayesian approach to causal inference of Rubin (1978a) to handle simple randomized experiments with noncompliance, and Hirano, Imbens, Rubin, and Zhou (2000) further extended the approach to handle fully observed covariates.

In encouragement designs with more than one partially uncontrolled factor, as with noncompliance and missing outcomes in our study, defining and estimating treatment effects of interest becomes more challenging. Existing methods (e.g., Robins, Greenland, and Hu 1999) are designed for studies that differ from our study in the goals and the degree of control of the aforementioned factors. (For a detailed comparison of such frameworks, see Frangakis and Rubin 2002.) Frangakis and Rubin (1999) studied a more flexible framework for encouragement designs with both noncompliance to treatment and missing outcomes, and showed that for estimation of either the ITT effect or the CACE, one cannot in general obtain valid estimates using standard ITT analyses (i.e., analyses that ignore data on compliance behavior) or standard IV analyses (i.e., those that ignore the interaction between compliance behavior and outcome missingness). They also provided consistent moment-based estimators that can estimate both ITT and CACE under assumptions more plausible than those underlying more standard methods.

Barnard et al. (1998) extended that template to allow for missing covariate and multivariate outcome values; they stopped short, however, of introducing specific methodology for this framework. We present a solution to a still more challenging situation, in which we have a more complicated form of noncompliance: some children attend private school without receiving the monetary encouragement (thus receiving treatment without having been assigned to it). Under assumptions similar to those of Barnard et al. (1998), we next fully develop a Bayesian framework that yields valid estimates of quantities of interest and also properly accounts for our uncertainty about these quantities.

5. PRINCIPAL STRATIFICATION IN SCHOOL CHOICE AND ROLE FOR CAUSAL INFERENCE

To make explicit the assumptions necessary for valid causal inference in this study, we first introduce "potential outcomes" (see Rubin 1979; Holland 1986) for all of the posttreatment variables. Potential outcomes for any given variable represent the observable manifestations of this variable under each possible treatment assignment. In particular, if child i in our study (i = 1, ..., n) is to be assigned to treatment z (1 for private school and 0 for public school), we denote the following: D_i(z) for the indicator equal to 1 if the child attends private school and 0 if the child attends public school; Y_i(z) for the potential outcomes if the child were to take the tests; and R_yi(z) for the indicators equal to 1 if the child takes the tests. We denote by Z_i the indicator equal to 1 for the observed assignment to private school or not, and let D_i = D_i(Z_i), Y_i = Y_i(Z_i), and R_yi = R_yi(Z_i) designate the actual type of school, the outcome to be recorded by the test, and whether or not the child takes the test under the observed assignment.

The notation for these and the remaining definitions of relevant variables in this study is summarized in Table 3. In our study, the outcomes are reading and math test scores, so the dimension of Y_i(z) equals two, although more generally our template allows for repeated measurements over time.

Our design variables are application period, low/high indicators for the applicant's school, grade level, and propensity score. In addition, corresponding to each set of these individual-specific random variables is a variable (vector or matrix) without subscript i that refers to the set of these variables across all study participants. For example, Z is the vector of treatment assignments for all study participants, with ith element Z_i, and X is the matrix of partially observed background variables, with ith row X_i.

The variable C_i (see Table 3), the joint vector of treatment receipt under both treatment assignments, is particularly important. Specifically, C_i defines four strata of people: compliers, who take the treatment if so assigned and take control if so assigned; never takers, who never take the treatment no matter

Table 3. Notation for the ith Subject

Notation          Specifics                                          General description

Z_i               1 if i assigned treatment,                         Binary indicator of treatment assignment
                  0 if i assigned control

D_i(z)            1 if i receives treatment under assignment z,      Potential outcome formulation of treatment receipt
                  0 if i receives control under assignment z

D_i               D_i(Z_i)                                           Binary indicator of treatment receipt

C_i               c if D_i(0) = 0 and D_i(1) = 1,                    Compliance principal stratum: c = complier,
                  n if D_i(0) = 0 and D_i(1) = 0,                    n = never taker, a = always taker, d = defier
                  a if D_i(0) = 1 and D_i(1) = 1,
                  d if D_i(0) = 1 and D_i(1) = 0

Y_i(z)            (Y_i^(math)(z), Y_i^(read)(z))                     Potential outcomes for math and reading

Y_i               (Y_i^(math)(Z_i), Y_i^(read)(Z_i))                 Math and reading outcomes under observed assignment

R_yi^(math)(z)    1 if Y_i^(math)(z) would be observed,              Response indicator for Y_i^(math)(z) under assignment z;
                  0 if Y_i^(math)(z) would not be observed           similarly for R_yi^(read)(z)

R_yi(z)           (R_yi^(math)(z), R_yi^(read)(z))                   Vector of response indicators for Y_i(z)

R_yi              (R_yi^(math)(Z_i), R_yi^(read)(Z_i))               Vector of response indicators for Y_i

W_i               (W_i1, ..., W_iK)                                  Fully observed background and design variables

X_i               (X_i1, ..., X_iQ)                                  Partially observed background variables

R_xi              (R_xi1, ..., R_xiQ)                                Vector of response indicators for X_i


the assignment; always takers, who always take the treatment no matter the assignment; and defiers, who do the opposite of the assignment, no matter its value. These strata are not fully observed, in contrast to the observed strata of actual assignment and attendance (Z_i, D_i). For example, children who are observed to attend private school when winning the lottery are a mixture of compliers (C_i = c) and always takers (C_i = a). Such explicit stratification on C_i dates back at least to the work of Imbens and Rubin (1997) on randomized trials with noncompliance, was generalized and taxonomized by Frangakis and Rubin (2002) to posttreatment variables in partially controlled studies, and is termed a "principal stratification" based on the posttreatment variables.

The principal strata $C_i$ have two important properties. First, they are not affected by the assigned treatment. Second, comparisons of potential outcomes under different assignments within principal strata, called principal effects, are well-defined causal effects (Frangakis and Rubin 2002). These properties make principal stratification a powerful framework for evaluation, because it allows us to explicitly define estimands that better represent the effect of attendance in the presence of noncompliance, and to explore richer and explicit sets of assumptions that allow estimation of these effects under conditions more plausible than the standard ones. Sections 6 and 7 discuss such a set of more flexible assumptions and estimands.
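To make the strata concrete, here is a minimal Python sketch (ours, not the authors' code) of how the four principal strata are determined by the pair of potential treatment receipts, and of why observed attendance mixes strata:

```python
# Sketch: map potential treatment receipts (D(0), D(1)) to principal strata.
# Only one of D(0), D(1) is observed per child, so C_i is partially latent.

def principal_stratum(d0: int, d1: int) -> str:
    """Return the stratum label for potential receipts (D(0), D(1))."""
    if (d0, d1) == (0, 1):
        return "c"  # complier
    if (d0, d1) == (0, 0):
        return "n"  # never taker
    if (d0, d1) == (1, 1):
        return "a"  # always taker
    return "d"      # defier (ruled out later by monotonicity)

# A child observed in private school after winning (Z = 1, D = 1) is
# consistent with two strata, hence the complier/always-taker mixture:
assert {principal_stratum(d0, 1) for d0 in (0, 1)} == {"c", "a"}
```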

6. STRUCTURAL ASSUMPTIONS

First, we state explicitly our structural assumptions about the data with regard to the causal process, the missing-data mechanism, and the noncompliance structure. These assumptions are expressed without reference to a particular parametric distribution and are the ones that make the estimands of interest identifiable, as also discussed in Section 6.4.

6.1 Stable Unit Treatment Value Assumption

A standard assumption made in causal analyses is the stable unit treatment value assumption (SUTVA), formalized with potential outcomes by Rubin (1978a, 1980, 1990). SUTVA combines the no-interference assumption (Cox 1958), that one unit's treatment assignment does not affect another unit's outcomes, with the assumption that there are "no versions of treatments." For the no-interference assumption to hold, whether or not one family won a scholarship should not affect another family's outcomes, such as their choice to attend private school or their children's test scores. We expect our results to be robust to the types and degree of deviations from no interference that might be anticipated in this study. To satisfy the "no versions of treatments" assumption, we need to limit the definition of private and public schools to those participating in the experiment. Generalizability of the results to other schools must be judged separately.

6.2 Randomization

We assume that scholarships have been randomly assigned. This implies that

$$p\bigl(Z \mid Y(1), Y(0), X, W, C, R_y(0), R_y(1), R_x, \theta\bigr) = p(Z \mid W^*, \theta) = p(Z \mid W^*),$$

where $W^*$ represents the design variables in $W$, and $\theta$ is generic notation for the parameters governing the distribution of all the variables. There is no dependence on $\theta$ because there are no unknown parameters governing the treatment-assignment mechanism: MPR assigned the scholarships by lottery, and the randomization probabilities within levels of applicant's school (low/high) and application period are known.

6.3 Noncompliance Process Assumptions: Monotonicity and Compound Exclusion

We assume monotonicity, that there are no "defiers"; that is, for all $i$, $D_i(1) \ge D_i(0)$ (Imbens and Angrist 1994). In the SCSF program, defiers would be families who would not use a scholarship if they won one, but would pay to go to private school if they did not win a scholarship. It seems implausible that such a group of people exists; therefore, the monotonicity assumption appears to be reasonable for our study.

By definition, the never takers and always takers will participate in the same treatment (control or treatment) regardless of which treatment they are randomly assigned. For this reason, and to facilitate estimation, we assume compound exclusion: the outcomes and missingness of outcomes for never takers and always takers are not affected by treatment assignment. This compound exclusion restriction (Frangakis and Rubin 1999) generalizes the standard exclusion restriction (Angrist et al. 1996; Imbens and Rubin 1997) and can be expressed formally for the distributions as

$$p\bigl(Y(1), R_y(1) \mid X^{\mathrm{obs}}, R_x, W, C = n\bigr) = p\bigl(Y(0), R_y(0) \mid X^{\mathrm{obs}}, R_x, W, C = n\bigr) \quad \text{for never takers}$$

and

$$p\bigl(Y(1), R_y(1) \mid X^{\mathrm{obs}}, R_x, W, C = a\bigr) = p\bigl(Y(0), R_y(0) \mid X^{\mathrm{obs}}, R_x, W, C = a\bigr) \quad \text{for always takers.}$$

Compound exclusion seems more plausible for never takers than for always takers in this study. Never takers stayed in the public school system whether they won a scholarship or not. Although disappointment about winning a scholarship but still not being able to take advantage of it can exist, it is unlikely to cause notable differences in subsequent test scores or response behaviors.

Always takers, on the other hand, might have been in one private school had they won a scholarship, or in another had they not, particularly because those who won scholarships had access to resources to help them find an appropriate private school and had more money (up to $1,400) to use toward tuition. In addition, even if the child had attended the same private school under either treatment assignment, the extra $1,400 in family resources for those who won the scholarship could have had an effect on student outcomes. However, because in our application the estimated percentage of always takers is so small (approximately 9%), an estimate that is robust, thanks to the randomization, to relaxing compound exclusion, there is reason to believe that this assumption will not have a substantial impact on the results.

Under the compound exclusion restriction, the ITT comparison of all outcomes $Y_i(0)$ versus $Y_i(1)$ includes the null comparison among the subgroups of never takers and always takers. Moreover, by monotonicity, the compliers ($C_i = c$) are the only group of children who would attend private school if and only if offered the scholarship.


For this reason, we take the CACE (Imbens and Rubin 1997), defined as the comparison of outcomes $Y_i(0)$ versus $Y_i(1)$ among the principal stratum of compliers, to represent the effect of attending public versus private school.

6.4 Latent Ignorability

We assume that potential outcomes are independent of missingness given observed covariates, conditional on the compliance strata; that is,

$$p\bigl(R_y(0), R_y(1) \mid R_x, Y(1), Y(0), X^{\mathrm{obs}}, W, C, \theta\bigr) = p\bigl(R_y(0), R_y(1) \mid R_x, X^{\mathrm{obs}}, W, C, \theta\bigr).$$

This assumption represents a form of latent ignorability (LI) (Frangakis and Rubin 1999), in that it conditions on variables that are (at least partially) unobserved, or latent; here, these are the compliance principal strata $C$. We make it here, first, because it is more plausible than the assumption of standard ignorability (SI) (Rubin 1978a; Little and Rubin 1987), and second, because making it leads to different likelihood inferences.

LI is more plausible than SI to the extent that it provides a closer approximation to the missing-data mechanism. The intuition behind this assumption in our study is that, for a subgroup of people with the same covariate missing-data patterns $R_x$, similar values for the covariates observed in that pattern $X^{\mathrm{obs}}$, and the same compliance stratum $C$, a flip of a coin could determine which of these individuals shows up for a posttest. This is a more reasonable assumption than SI because it seems quite likely that, for example, compliers would exhibit different attendance behavior for posttests than, say, never takers (even conditional on other background variables). Explorations of raw data from this study across individuals with known compliance status provide empirical support that $C$ is a strong factor in outcome missingness, even when other covariates are included in the model. This fact is also supported by the literature on noncompliance (see, e.g., The Coronary Drug Project Research Group 1980).

Regarding improved estimation: when LI (and the preceding structural assumptions) hold but the likelihood is constructed assuming SI, the underlying causal effects are identifiable (alternatively, the posterior distribution with increasing sample size converges to the truth) only if the additional assumption is made that, within subclasses of subjects with similar observed variables, the partially missing compliance principal stratum $C$ is not associated with potential outcomes. However, as noted earlier, this assumption is not plausible.

Theoretically, only the structural assumptions described earlier are needed to identify the underlying ITT and CACE causal effects (Frangakis and Rubin 1999). Estimation based solely on those identifiability relations in principle requires a very large sample size and explicit conditioning (i.e., stratification) on the subclasses defined by the observed part of the covariates $X^{\mathrm{obs}}$ and the pattern of missingness of the covariates $R_x$, as well as implicit conditioning (i.e., deconvolution of mixtures) on the sometimes-missing compliance principal stratum $C$. This works in large samples because covariate missingness and compliance principal stratum are also covariates (i.e., defined pretreatment), so samples within subclasses defined by $X^{\mathrm{obs}}$, $R_x$, and $C$ themselves represent randomized experiments. With our experiment's sample size, however, performing completely separate analyses on all of these strata is not necessarily desirable or feasible. Therefore, we consider more parsimonious modeling approaches, which have the role of assisting, not creating, inference, in the sense that results should be robust to different parametric specifications (Frangakis and Rubin 2001, rejoinder).

7. PARAMETRIC PATTERN MIXTURE MODEL

Generally speaking, constrained estimation of separate analyses within missing-data patterns is the motivation behind pattern mixture modeling. Various authors have taken pattern mixture model approaches to missing data, including Little (1993, 1996), Rubin (1978b), and Glynn, Laird, and Rubin (1993). Typically, pattern mixture models partition the data with respect to the missingness of the variables. Here we partition the data with respect to the covariate missing-data patterns $R_x$, as well as compliance principal strata $C$, design variables $W$, and the main covariates $X^{\mathrm{obs}}$. This represents a partial pattern mixture approach. One argument in favor of this approach is that it focuses the model on the quantities of interest in such a way that parametric specifications for the marginal distributions of $R_x$, $W$, and $X^{\mathrm{obs}}$ can be ignored. To capitalize on the structural assumptions, consider the factorization of the joint distribution for the potential outcomes and compliance strata, conditional on the covariates and their missing-data patterns:

$$
\begin{aligned}
p\bigl(Y_i(0), Y_i(1), R_{y,i}(0), R_{y,i}(1), C_i \mid W_i, X_i^{\mathrm{obs}}, R_{x,i}, \theta\bigr)
&= p\bigl(C_i \mid W_i, X_i^{\mathrm{obs}}, R_{x,i}, \theta^C\bigr) \\
&\quad \times p\bigl(R_{y,i}(0), R_{y,i}(1) \mid W_i, X_i^{\mathrm{obs}}, R_{x,i}, C_i, \theta^R\bigr) \\
&\quad \times p\bigl(Y_i(0), Y_i(1) \mid W_i, X_i^{\mathrm{obs}}, R_{x,i}, C_i, \theta^Y\bigr),
\end{aligned}
$$

where the product in the last line follows by latent ignorability, and $\theta = (\theta^C, \theta^R, \theta^Y)$. Note that the response pattern of covariates for each individual is itself a covariate. The parametric specifications for each of these components are described next.

7.1 Compliance Principal Stratum Submodel

The compliance status model contains two conditional probit models, defined using indicator variables $C_{i,c}$ and $C_{i,n}$ for whether individual $i$ is a complier or a never taker:

$$C_{i,n} = 1 \quad \text{if } C_{i,n}^* \equiv g_1(W_i, X_i^{\mathrm{obs}}, R_{x,i})' \beta^{(C,1)} + V_i \le 0$$

and

$$C_{i,c} = 1 \quad \text{if } C_{i,n}^* > 0 \text{ and } C_{i,c}^* \equiv g_0(W_i, X_i^{\mathrm{obs}}, R_{x,i})' \beta^{(C,2)} + U_i \le 0,$$

where $V_i \sim N(0,1)$ and $U_i \sim N(0,1)$, independently.

The link functions $g_0$ and $g_1$ attempt to strike a balance between, on the one hand, including all of the design variables as well as the variables regarded as most important either in predicting compliance or in having interactions with the treatment effect, and, on the other hand, maintaining parsimony. The results discussed in Section 8 use a compliance component model whose link function $g_1$ is linear in, and fits distinct parameters for: an intercept; applicant's school (low/high); indicators for application period; propensity scores for subjects applying in the PMPD period and propensity scores for the other periods; indicators for grade of the student; ethnicity (1 if the child or guardian identifies herself as African-American, 0 otherwise); an indicator for whether or not the pretest scores of reading and math were available; and the pretest scores (reading and math) for the subjects with available scores.


A propensity score for the students not in the PMPD period is not necessary from the design standpoint and is in fact constant. However, to increase efficiency and to reduce bias due to missing outcomes, here we include an "estimated propensity score value" for these periods, calculated as the function derived for the propensity score for students in the PMPD period, evaluated at the covariate values for the students in the other periods as well. Also, the foregoing link function $g_0$ is the same as $g_1$, except that it excludes the indicators for application period as well as the propensity scores for applicants who did not apply in the PMPD period (i.e., those for whom the propensity score was not a design variable). This link function, a more parsimonious version of one we used in earlier models, was more appropriate for fitting the relatively small proportion of always takers in this study. Finally, because the pretests were either jointly observed or jointly missing, one indicator for missingness of pretest scores is sufficient.

The prior distributions for the compliance submodel are

$$\beta^{(C,1)} \sim N\bigl(\beta_0^{(C,1)}, \{\sigma^{(C,1)}\}^2 I\bigr) \quad \text{and} \quad \beta^{(C,2)} \sim N\bigl(0, \{\sigma^{(C,2)}\}^2 I\bigr),$$

independently, where $\{\sigma^{(C,1)}\}^2$ and $\{\sigma^{(C,2)}\}^2$ are hyperparameters set at 10, and $\beta_0^{(C,1)}$ is a vector of 0s with the exception of the first element, which is set equal to $-\Phi^{-1}(1/3)\,\sigma^{(C,1)} \bigl\{ \tfrac{1}{n} \sum_i^n g_{1,i}' g_{1,i} + 1 \bigr\}^{1/2}$, where $g_{1,i} = g_1(W_i, X_i^{\mathrm{obs}}, R_{x,i})$ and $n$ is our sample size. These priors reflect our a priori ignorance about the probability that any individual belongs to each compliance status by approximately setting each of their prior probabilities at 1/3.

7.2 Outcome Submodel

We first describe the marginal distribution for the math outcome, $Y^{(\mathrm{math})}$. To address the pile-up of many scores of 0, we posit the censored model

$$
Y_i^{(\mathrm{math})}(z) =
\begin{cases}
0 & \text{if } Y_i^{(\mathrm{math})*}(z) \le 0, \\
100 & \text{if } Y_i^{(\mathrm{math})*}(z) \ge 100, \\
Y_i^{(\mathrm{math})*}(z) & \text{otherwise,}
\end{cases}
$$

where

$$
Y_i^{(\mathrm{math})*}(z) \mid W_i, X_i^{\mathrm{obs}}, R_{x,i}, C_i, \theta^{(\mathrm{math})}
\sim N\Bigl( g_2(W_i, X_i^{\mathrm{obs}}, R_{x,i}, C_i, z)' \beta^{(\mathrm{math})},\; \exp\bigl[ g_3(X_i^{\mathrm{obs}}, R_{x,i}, C_i, z)' \zeta^{(\mathrm{math})} \bigr] \Bigr)
$$

for $z = 0, 1$, and $\theta^{(\mathrm{math})} = (\beta^{(\mathrm{math})}, \zeta^{(\mathrm{math})})$. Here $Y_i^{(\mathrm{math})*}(0)$ and $Y_i^{(\mathrm{math})*}(1)$ are assumed conditionally independent, an assumption that has no effect on inference for our superpopulation parameters of interest (Rubin 1978a).
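A minimal sketch of this censored normal draw, assuming `g2_i` and `g3_i` hold the mean and variance link-function vectors for one subject at a given assignment:

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_math_outcome(g2_i, g3_i, beta_math, zeta_math):
    """One draw of the math percentile score, censored at 0 and 100."""
    mean = g2_i @ beta_math
    sd = np.sqrt(np.exp(g3_i @ zeta_math))  # log-linear variance model
    y_star = rng.normal(mean, sd)           # latent score Y*
    return min(max(y_star, 0.0), 100.0)     # pile-up at 0; ceiling at 100
```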

The results reported in Section 8 use an outcome component model whose outcome mean link function $g_2$ is linear in distinct parameters for the following:

1. For the students of the PMPD design: an intercept, applicant's school (low/high), ethnicity, indicators for grade, an indicator for whether or not the pretest scores were available, pretest score values for the subjects with available scores, and propensity score.
2. For the students of the other periods: the variables in item 1, along with indicators for application period.
3. An indicator for whether or not a person is an always taker.
4. An indicator for whether or not a person is a complier.
5. For compliers assigned treatment: an intercept, applicant's school (low/high), ethnicity, and indicators for the first three grades (with the variable for the fourth-grade treatment effect as a function of the already-included variables).

For the variance of the outcome component, the link function $g_3$ includes indicators that saturate the cross-classification of whether or not a person applied in the PMPD period and whether or not the pretest scores were available. This dependence is needed because each pattern conditions on a different set of covariates; that is, $X^{\mathrm{obs}}$ varies from pattern to pattern.

The prior distributions for the outcome submodel are

$$\beta^{(\mathrm{math})} \mid \zeta^{(\mathrm{math})} \sim N\bigl(0,\, F(\zeta^{(\mathrm{math})})\, \xi I\bigr),$$

where

$$F\bigl(\zeta^{(\mathrm{math})}\bigr) = \frac{1}{K} \sum_k \exp\bigl(\zeta_k^{(\mathrm{math})}\bigr),$$

$\zeta^{(\mathrm{math})} = (\zeta_1^{(\mathrm{math})}, \ldots, \zeta_K^{(\mathrm{math})})$, with one component for each of the $K$ (in our case $K = 4$) subgroups defined by cross-classifying the PMPD/non-PMPD classification and the missing-data indicator for pretest scores; where $\xi$ is a hyperparameter set at 10; and $\exp(\zeta_k^{(\mathrm{math})}) \stackrel{\mathrm{iid}}{\sim} \mathrm{inv}\text{-}\chi^2(\nu, \sigma^2)$, where $\mathrm{inv}\text{-}\chi^2(\nu, \sigma^2)$ refers to the distribution of the inverse of a chi-squared random variable with degrees of freedom $\nu$ (set at 3) and scale parameter $\sigma^2$ (set at 400, based on preliminary estimates of variances). The sets of values for these and the hyperparameters of the other model components were chosen to satisfy two criteria: giving posterior standard deviations for the estimands in the neighborhood of the respective standard errors of approximate maximum likelihood fits of similar preliminary likelihood models, and giving satisfactory model checks (Sec. 9.2) and quick enough mixing of the Bayesian analysis algorithm.

We specify the marginal distribution for the reading outcome, $Y^{(\mathrm{read})}$, in the same way as for the math outcome, with separate mean and variance regression parameters $\beta^{(\mathrm{read})}$ and $\zeta^{(\mathrm{read})}$. Finally, we allow that, conditionally on $W_i$, $X_i^{\mathrm{obs}}$, $R_{x,i}$, $C_i$, the math and reading outcomes at a given assignment, $Y_i^{(\mathrm{math})*}(z)$ and $Y_i^{(\mathrm{read})*}(z)$, may be dependent, with an unknown correlation $\rho$. We set the prior distribution for $\rho$ to be uniform on $(-1, 1)$, independently of the remaining parameters in their prior distributions.

7.3 Outcome Response Submodel

As with the pretests, the outcomes on mathematics and reading were either jointly observed or jointly missing; thus one indicator, $R_{y,i}(z)$, for missingness of outcomes is sufficient for each assignment $z = 0, 1$. For the submodel on this indicator, we also use a probit specification:

$$R_{y,i}(z) = 1 \quad \text{if } R_{y,i}^*(z) \equiv g_2(W_i, X_i^{\mathrm{obs}}, R_{x,i}, C_i, z)' \beta^R + E_i(z) \ge 0,$$


where $R_{y,i}(0)$ and $R_{y,i}(1)$ are assumed conditionally independent (using the same justification as for the potential outcomes), and where $E_i(z) \sim N(0,1)$. The link function of the probit model on the outcome response, $g_2$, is the same as the link function for the mean of the outcome component. The prior distribution for the outcome response submodel is $\beta^R \sim N(0, \{\sigma^R\}^2 I)$, where $\{\sigma^R\}^2$ is a hyperparameter set at 10.
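A matching sketch of the response probit, with `g2_iz` standing for the (assumed) link-function vector of subject $i$ at assignment $z$:

```python
import numpy as np

rng = np.random.default_rng(3)

def draw_response(g2_iz, beta_r):
    """Draw the indicator of whether the posttests are taken under z."""
    return int(g2_iz @ beta_r + rng.standard_normal() >= 0)
```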

8. RESULTS

All results herein were obtained from the same Bayesian analysis. We report results for the ITT and CACE estimands, for proportions of compliance principal strata, and for outcome response rates. The reported estimands are not parameters of the model, but rather are functions of parameters and data. The results are reported by applicant's school (low/high) and grade. Both of these variables represent characteristics of children that potentially could be targeted differentially by government policies. Moreover, each was thought to have possible interaction effects with treatment assignment. Except when otherwise stated, the plain numbers are posterior means, and the numbers in parentheses are 2.5 and 97.5 percentiles of the posterior distribution.

8.1 Test Score Results

Here we address two questions:

1. What is the impact of being offered a scholarship on student outcomes, namely the ITT estimand?
2. What is the impact of attending a private school on student outcomes, namely the CACE estimand?

The math and reading posttest score outcomes represent the national percentile rankings within grade. They have been adjusted to correct for the fact that some children were kept behind while others skipped a grade, because students transferring to private schools are hypothesized to be more likely to have been kept behind by those schools. The individual-level causal estimates have also been weighted so that the subgroup causal estimates correspond to the effects for all eligible children belonging to that subgroup who attended a screening session.

8.1.1 Effect of Offering the Scholarship on Mathematics and Reading. We examine the impact of being offered a scholarship (the ITT effect) on posttest scores. The corresponding estimand for individual $i$ is defined as $E(Y_i(1) - Y_i(0) \mid W_i^p, \theta)$, where $W_i^p$ denotes the policy variables, grade level and applicant's school (low/high), for that individual. The simulated posterior distribution of the ITT effect is summarized in Table 4. To draw from this distribution, we take the following steps:

Table 4. ITT Effect of Winning the Lottery on Math and Reading Test Scores, Year 1 Postrandomization

                       Applicant's school: Low              Applicant's school: High
Grade at application   Reading           Math               Reading           Math
1                      2.3 (-1.3, 5.8)   5.2 (2.0, 8.3)     1.4 (-4.8, 7.2)   5.1 (.1, 10.3)
2                       .5 (-2.6, 3.5)   1.3 (-1.7, 4.3)    -.6 (-6.2, 4.9)   1.0 (-4.3, 6.1)
3                       .7 (-2.7, 4.0)   3.3 (-.5, 7.0)     -.5 (-6.0, 5.0)   2.5 (-3.2, 8.0)
4                      3.0 (-1.1, 7.4)   3.1 (-1.2, 7.2)    1.8 (-4.1, 7.6)   2.3 (-3.3, 7.8)
Overall                1.5 (-.6, 3.6)    3.2 (1.0, 5.4)      .4 (-4.6, 5.2)   2.8 (-1.8, 7.2)

NOTE: Plain numbers are means, and numbers in parentheses are central 95% intervals of the posterior distribution of the effects on percentile rank.

1. Draw $\theta$ and $C_i$ from the posterior distribution (see App. A).
2. Calculate the expected causal effect $E(Y_i(1) - Y_i(0) \mid W_i^p, X_i^{\mathrm{obs}}, C_i, R_{x,i}, \theta)$ for each subject, based on the model in Section 7.2.
3. Average the values of step 2 over the subjects in the subclasses of $W^p$, with weights reflecting the known sampling weights of the study population for the target population.
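In code, one posterior draw of the ITT effect within a policy subclass might look like the following minimal sketch; `expected_effect` is a hypothetical helper standing in for the Section 7.2 model computation of $E(Y_i(1) - Y_i(0) \mid W_i^p, X_i^{\mathrm{obs}}, C_i, R_{x,i}, \theta)$:

```python
import numpy as np

def itt_draw(theta, subjects, weights, compliers_only=False):
    """One draw of the weighted average effect for a subclass of W^p."""
    effects, w = [], []
    for subj, wt in zip(subjects, weights):
        if compliers_only and subj["C"] != "c":
            continue  # CACE: keep only current draws of compliers (Sec. 8.1.2)
        effects.append(expected_effect(theta, subj))  # hypothetical helper
        w.append(wt)
    return np.average(effects, weights=w)
```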

These results indicate posterior distributions with mass primarily (greater than .975) to the right of 0 for the treatment effect on math scores overall for children from low-applicant schools, and also for the subgroup of first-graders. Each such effect indicates an average gain of greater than three percentile points for children who won a scholarship. All of the remaining intervals cover 0. As a more general pattern, estimates of effects are larger for mathematics than for reading, and larger for children from low-applicant schools than for children from high-applicant schools.

These results for the ITT estimand using the method of this article can also be contrasted with the results using a simpler method, reported in Table 5. The method used to obtain these results is the same as that used in the initial MPR study (Peterson, Myers, Howell, and Mayer 1999), but limited to the subset of single-child families and separated out by applicant's school (low/high). This method is based on a linear regression of the posttest scores on the indicator of treatment assignment, ignoring compliance data. In addition, the analysis includes the design variables and the pretest scores, and is limited to individuals for whom the pretest and posttest scores are known. Separate analyses are run for math and reading posttest scores, and the missingness of such scores is adjusted for by inverse probability weighting. Finally, weights are used to make the results for the study participants generalizable to the population of all eligible single-child families who were screened. (For more details of this method, see the appendix of Peterson et al. 1999.)

Table 5. ITT Effect of Winning the Lottery on Test Scores, Estimated Using the Original MPR Method

                       Applicant's school: Low              Applicant's school: High
Grade at application   Reading           Math               Reading              Math
1                      -1.0 (-7.1, 5.2)  2.1 (-2.6, 6.7)    4.8 (-10.0, 19.6)    2.6 (-15.5, 20.7)
2                       -.8 (-4.9, 3.2)  2.0 (-4.0, 8.0)    -3.4 (-16.5, 9.7)    2.7 (-10.3, 15.7)
3                       3.2 (-1.7, 8.1)  5.0 (-.8, 10.7)    -8.0 (-25.4, 9.3)    4.0 (-17.7, 25.6)
4                       2.7 (-3.5, 8.8)   .3 (-7.3, 7.9)    27.9 (8.0, 47.8)     22.7 (-1.5, 46.8)
Overall                  .6 (-2.0, 3.2)  2.0 (-.8, 4.8)      1.1 (-7.0, 9.1)       .3 (-9.6, 10.1)

NOTE: Plain numbers are point estimates, and parentheses are 95% confidence intervals for the mean effects on percentile rank.


Table 6. Effect of Private School Attendance on Test Scores

                       Applicant's school: Low              Applicant's school: High
Grade at application   Reading           Math               Reading            Math
1                      3.4 (-2.0, 8.7)   7.7 (3.0, 12.4)    1.9 (-7.3, 10.3)   7.4 (.2, 14.6)
2                       .7 (-3.7, 5.0)   1.9 (-2.4, 6.2)    -.9 (-9.4, 7.3)    1.5 (-6.2, 9.3)
3                      1.0 (-4.1, 6.1)   5.0 (-.8, 10.7)    -.8 (-9.5, 7.7)    4.0 (-4.9, 12.5)
4                      4.2 (-1.5, 10.1)  4.3 (-1.6, 10.1)   2.7 (-6.3, 11.3)   3.5 (-4.7, 11.9)
Overall                2.2 (-.9, 5.3)    4.7 (1.4, 7.9)      .6 (-7.1, 7.7)    4.2 (-2.6, 10.9)

NOTE: Plain numbers are means, and parentheses are central 95% intervals of the posterior distribution of the effects on percentile rank.

The results of the new method (Table 4) are generally more stable than those of the original method (Table 5), which in some cases are not even credible. In the most extreme cases, the original method estimates a 22.7-point gain [95% confidence interval (-1.5, 46.8)] in mathematics and a 27.9-point gain [95% confidence interval (8.0, 47.8)] in reading for fourth-grade children from high-applicant schools. More generally, the new method's results display a more plausible pattern in comparing effects in high-applicant versus low-applicant schools and mathematics versus reading.

8.1.2 Effect of Private School Attendance on Mathematics and Reading. We also examine the effect of offering the scholarship when focusing only on the compliers (the CACE). The corresponding estimand for individual $i$ is defined as

$$E\bigl(Y_i(1) - Y_i(0) \mid W_i^p, C_i = c, \theta\bigr).$$

This analysis defines the treatment as private school attendance (Sec. 6.3). The simulated posterior distribution of the CACE is summarized in Table 6. A draw from this distribution is obtained using steps 1–3 described in Section 8.1.1 for the ITT estimand, with the exception that at step 3 the averaging is restricted to the subjects whose current draw of $C_i$ is "complier."

The effects of private school attendance follow a pattern similar to that of the ITT effects, but the posterior means are slightly bigger in absolute value than for ITT. The intervals have also grown, reflecting that these effects are for only subgroups of all children, the "compliers" in each cell. As a result, the associated uncertainty for some of these effects (e.g., for fourth-graders applying from high-applicant schools) is large.

8.2 Compliance Principal Strata and Missing Outcomes

Table 7 summarizes the posterior distribution of the estimands of the probability of being in stratum $C$ as a function of applicant's school and grade, $p(C_i = t \mid W_i^p, \theta)$. To draw from this distribution, we use step 1 described in Section 8.1.1, then calculate $p(C_i = t \mid W_i^p, X_i^{\mathrm{obs}}, R_{x,i}, \theta)$ for each subject based on the model of Section 7.1, and average these values as in step 3 of Section 8.1.1. The clearest pattern revealed by Table 7 is that, for three out of four grades, children who applied from low-applicant schools are more likely to be compliers or always takers than are children who applied from high-applicant schools.

As stated earlier, theoretically, under our structural assumptions, standard ITT analyses or standard IV analyses that use ad hoc approaches to missing data are generally not appropriate for estimating the ITT or CACE estimands when the compliance principal strata have differential response (i.e., outcome missing-data) behaviors. To evaluate this here, we simulated the posterior distributions of

$$p\bigl(R_i(z) \mid C_i = t, W_i^p, \theta\bigr), \qquad z = 0, 1. \qquad (1)$$

To draw from the distributions of the estimands in (1), we used step 1 from Section 8.1.1 and then, for $z = 0, 1$, calculated $p(R_i(z) \mid W_i^p, X_i^{\mathrm{obs}}, R_{x,i}, C_i, \theta)$ for each subject. We then averaged these values over subjects within subclasses defined by the different combinations of $W_i^p$ and $C_i$.

For each draw (across individuals) from the distribution in (1), we calculate the odds ratios (a) between compliers attending public schools and never takers; (b) between compliers attending private schools and always takers; and (c) between compliers attending private schools and compliers attending public schools. The results (omitted) showed that response was increasing in the following order: never takers, compliers attending public schools, compliers attending private schools, and always takers. These results confirm that the compliance principal stratum is an important factor in response, both alone and in interaction with assignment.
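A minimal sketch of the odds-ratio computation applied to each posterior draw of the response rates in (1); `p_response` is a hypothetical mapping from (stratum, z) to a drawn rate:

```python
def odds_ratio(p1: float, p2: float) -> float:
    """Odds of posttest response in group 1 relative to group 2."""
    return (p1 / (1.0 - p1)) / (p2 / (1.0 - p2))

# e.g., compliers attending private school vs. always takers, per draw:
# odds_ratio(p_response[("c", 1)], p_response[("a", 1)])
```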

9. MODEL BUILDING AND CHECKING

This model was built through a process of fitting and checking a succession of models. Of note in this model are the following features: a censored normal model that accommodates the pile-up of 0s in the outcomes, incorporation of heteroscedasticity, and a multivariate model that accommodates both outcomes simultaneously.

Table 7. Proportions of Compliance Principal Strata Across Grade and Applicant's School

                       Applicant's school: Low                         Applicant's school: High
Grade at application   Never taker   Complier      Always taker        Never taker   Complier      Always taker
1                      .245 (.029)   .671 (.038)   .084 (.024)         .250 (.050)   .693 (.061)   .057 (.033)
2                      .205 (.027)   .694 (.037)   .101 (.025)         .253 (.051)   .672 (.062)   .075 (.034)
3                      .245 (.032)   .659 (.040)   .096 (.025)         .288 (.056)   .641 (.067)   .071 (.035)
4                      .184 (.033)   .728 (.046)   .088 (.030)         .270 (.055)   .667 (.067)   .063 (.036)

NOTE: Plain numbers are means, and parentheses are standard deviations of the posterior distribution of the estimands.


It is reassuring, from the perspective of model robustness, that the results from the last few models fit, including those in the conference proceedings report of Barnard et al. (2002), are consistent with the results from this final model.

9.1 Convergence Checks

Because posterior distributions were simulated from a Markov chain Monte Carlo algorithm (App. A), it is important to assess its convergence. To do this, we ran three chains from different starting values. To initiate each chain, we set any unknown compliance stratum equal to a Bernoulli draw, with probabilities obtained from moment estimates of the probabilities of being a complier given observed attendance and randomization data $(D, Z)$ alone. Using the initialized compliance strata, parameters for each chain were initialized to values based on generalized linear model estimates of the model components.

Each chain was run for 15,000 iterations. At 5,000 iterations, and based on the three chains for each model, we calculated the potential scale-reduction statistic (Gelman and Rubin 1992) for the 250 estimands (parameters and functions of parameters) that serve as building blocks for all other estimands. The results suggested good mixing of the chains (with the maximum potential scale-reduction statistic across parameters 1.04) and provided no evidence against convergence. Inference is based on the remaining 30,000 iterations, combining the three chains.
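For reference, here is a minimal sketch (our own rendering, not the authors' code) of the potential scale-reduction statistic for one scalar estimand, computed from m chains of n post-burn-in draws each:

```python
import numpy as np

def potential_scale_reduction(chains: np.ndarray) -> float:
    """chains: shape (m, n); values near 1 suggest good mixing."""
    m, n = chains.shape
    b = n * chains.mean(axis=1).var(ddof=1)   # between-chain variance
    w = chains.var(axis=1, ddof=1).mean()     # within-chain variance
    var_plus = (n - 1) / n * w + b / n        # pooled variance estimate
    return float(np.sqrt(var_plus / w))
```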

9.2 Model Checks

We evaluate the influence of the model presented in Section 7 with six posterior predictive checks, three checks for each of the two outcomes. A posterior predictive check generally involves (a) choosing a discrepancy measure that is a function of observed data, and possibly of missing data and the parameter vector $\theta$; and (b) computing a posterior predictive p value (PPPV), which is the probability, over the posterior predictive distribution of the missing data and $\theta$, that the discrepancy measure in a new study drawn with the same $\theta$ as in our study would be as or more extreme than in our study (Rubin 1984; Meng 1996; Gelman, Meng, and Stern 1996).

Posterior predictive checks in general, and PPPVs in particular, demonstrate whether the model can adequately preserve features of the data reflected in the discrepancy measure, where the model here includes the prior distribution as well as the likelihood (Meng 1996). As a result, the properties of PPPVs are not exactly the same as the properties of classical p values under frequency evaluations conditional on the unknown $\theta$, just as they are not exactly the same for frequency evaluations over both levels of uncertainty, that is, the drawing of $\theta$ from the prior distribution and the drawing of data from the likelihood given $\theta$. For example, over frequency evaluations of the latter type, a PPPV is stochastically less variable than, but has the same mean as, the uniform distribution, and so tends to be more conservative than a classical p value, although the reverse can be true over frequency evaluations of the first type (Meng 1996). (For more details on the interpretation and properties of PPPVs, see also Rubin 1984 and Gelman et al. 1996.)

Table 8. Posterior Predictive Checks: p Values

        Signal   Noise   Signal to Noise
Math    .34      .79     .32
Read    .40      .88     .39

The posterior predictive discrepancy measures that we choose here are functions of

$$A_p^{\mathrm{rep}}(z) = \bigl\{ Y_{ip}^{\mathrm{rep}} : I(R_{y,ip}^{\mathrm{rep}} = 1)\, I(Y_{ip}^{\mathrm{rep}} \ne 0)\, I(C_i^{\mathrm{rep}} = c)\, I(Z_i = z) = 1 \bigr\}$$

for the measures that are functions of data $(Y_{ip}^{\mathrm{rep}}, R_{y,ip}^{\mathrm{rep}}, C_i^{\mathrm{rep}})$ from a replicated study, and

$$A_p^{\mathrm{study}}(z) = \bigl\{ Y_{ip} : I(R_{y,ip} = 1)\, I(Y_{ip} \ne 0)\, I(C_i = c)\, I(Z_i = z) = 1 \bigr\}$$

for the measures that are functions of this study's data. Here $R_{y,ip}^{\mathrm{rep}}$ is defined so that it equals 1 if $Y_{ip}^{\mathrm{rep}}$ is observed and 0 otherwise, and $p$ equals 1 for math outcomes and 2 for reading outcomes. The discrepancy measures ("rep" and "study") that we used for each outcome ($p = 1, 2$) were (a) the absolute value of the difference between the mean of $A_p(1)$ and the mean of $A_p(0)$ ("signal"); (b) the standard error, based on a simple two-sample comparison, for this difference ("noise"); and (c) the ratio of (a) to (b) ("signal to noise"). Although these measures are not treatment effects, we chose them here to assess whether the model can preserve broad features of signal, noise, and signal-to-noise ratio in the continuous part of the compliers' outcome distributions, which we think can be very influential in estimating the treatment effects of Section 8. More preferable measures might have been the posterior mean and standard deviation for the actual estimands in Section 8 for each replicated dataset, but this would have required a prohibitive amount of computer memory because of the nested structure of that algorithm. In settings such as these, additional future work on choices of discrepancy measures is of interest.

PPPVs for the discrepancy measures that we chose were calculated as the percentage of draws in which the replicated discrepancy measure exceeded the value of the study's discrepancy measure. Extreme values (close to 0 or 1) of a PPPV would indicate a failure of the prior distribution and likelihood to replicate the corresponding measure of location, dispersion, or their relative magnitude, and would indicate an undesirable influence of the model in the estimation of our estimands. Results from these checks, displayed in Table 8, provide no special evidence for such influences of the model.
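Computationally, the PPPV is just the fraction of posterior draws in which the replicated discrepancy beats the study's; a minimal sketch:

```python
import numpy as np

def pppv(rep_disc: np.ndarray, study_disc: np.ndarray) -> float:
    """Arrays pair one replicated and one study discrepancy per draw
    (the study value can vary across draws through imputed C and theta)."""
    return float(np.mean(rep_disc > study_disc))
```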

10. DISCUSSION

In this article, we have defined the framework for principal stratification in broken randomized experiments to accommodate noncompliance, missing covariate information, missing outcome data, and multivariate outcomes. We make explicit a set of structural assumptions that can identify the causal effects of interest, and we also provide a parametric model that is appropriate for practical implementation of the framework in settings such as ours.

Results from our model in the school choice study do not indicate strong treatment effects for most of the subgroups examined. But we do estimate positive effects (on the order of 3 percentile points for ITT and 5 percentile points for the effect of attendance) on math scores overall for children who applied to the program from low-applicant schools, particularly for first-graders.


Also, the effects were larger for African-American children than for the remaining children (App. B).

Posterior distributions for the CACE estimand, which measures the effect of attendance in private school versus public school, are generally larger than the corresponding ITT effects, but are also associated with greater uncertainty. Of importance, because of the missing outcomes, a model like ours is needed for valid estimation even of the ITT effect.

The results from this randomized study are not subject to selection bias in the way that nonrandomized studies in school choice have been. Nevertheless, although we use the CACE, a well-defined causal effect, to represent the effect of attendance of private versus public schools, it is important to remember that the CACE is defined on a subset of the study children (those who would have complied with either assignment), and that for the other children there is no information on such an effect of attendance in this study. Therefore, as with any randomized trial based on a subpopulation, external information such as background variables also must be used when generalizing the CACE from compliers to other target groups of children. Also, it is possible that a broader implementation of a voucher program could have a collective effect on the public schools, if a large shift of the children who might use vouchers were to have an impact on the quality of learning for children who would stay in public schools (a violation of the no-interference assumption). Because our study contains a small fraction of participants relative to all school children, it cannot provide direct information about any such collective effect, and additional external judgment would need to be used to address this issue.

The larger effects that we estimated for children applying from schools with low versus high past scores are also not, in principle, subject to the usual regression-to-the-mean bias, in contrast to a simple before-and-after comparison. This is because, in our study, both types of children are randomized to be offered the scholarship or not, and both types in both treatment arms are evaluated at the same time after randomization. Instead, the differential effect for children from schools with different past scores is evidence supporting the claim that the school voucher program has greater potential benefit for children in lower-scoring schools.

Our results also reveal differences in compliance and missing-data patterns across groups. Differences in compliance indicate that children applying from low-applicant schools are generally more likely to comply with their treatment assignment; this could provide incentives for policy makers to target this subgroup of children. However, this group also exhibits higher levels of always takers, indicating that parents of children attending these poorly performing schools are more likely to get their child into a private school even in the absence of scholarship availability. These considerations would of course have to be balanced with (and indeed might be dwarfed by) concerns about equity. Missing-data patterns reveal that perhaps relatively greater effort is needed in future interventions of this nature to retain study participants who stay in public schools, particularly the public-school students who actually won but did not use a scholarship.

The approach presented here also has some limitations. First, we have presented principal stratification only on a binary controlled factor $Z$ and a binary uncontrolled factor $D$, and it is of interest to develop principal stratification for more levels of such factors. Approaches that extend our framework in that direction, including for time-dependent data, have been proposed (e.g., Frangakis et al. 2002b).

Also, the approach of explicitly conditioning on the patterns of missing covariates that we adopted in the parametric component of Section 7 is not as applicable when there are many patterns of missing covariates. In such cases, it would be more appropriate to use models related to those of D'Agostino and Rubin (2000) and integrate them with principal stratification for noncompliance. Moreover, it would also be interesting to investigate results for models fit to these data that allow deviations from the structural assumptions, such as weakened exclusion restrictions (e.g., Hirano et al. 2000; Frangakis, Rubin, and Zhou 2002a), although to ensure robustness of such models, it would be important first to investigate and rely on additional alternative assumptions that would be plausible.

APPENDIX A: COMPUTATIONS

Details are available at http://biosun01.biostat.jhsph.edu/~cfrangak/papers/sc.

Table B.1. ITT Estimand

                                   Applicant's school: Low              Applicant's school: High
Grade at application   Ethnicity   Reading           Math               Reading           Math
1                      AA          2.8 (-1.5, 7.2)   6.7 (3.0, 10.4)    1.8 (-5.2, 8.4)   6.3 (.8, 11.9)
                       Other       1.7 (-1.9, 5.4)   3.9 (.5, 7.2)       .8 (-4.5, 6.1)   3.5 (-1.2, 8.2)
2                      AA           .8 (-2.8, 4.4)   2.4 (-1.0, 5.8)    -.3 (-6.6, 6.0)   2.2 (-3.8, 8.0)
                       Other        .2 (-3.1, 3.4)    .5 (-2.8, 3.7)    -.9 (-6.1, 4.3)    .0 (-5.2, 4.8)
3                      AA          1.1 (-3.0, 5.2)   4.6 (.3, 8.9)      -.1 (-6.7, 6.4)   4.2 (-2.5, 10.6)
                       Other        .3 (-3.2, 3.8)   2.0 (-2.0, 5.8)    -.7 (-5.8, 4.4)   1.4 (-3.9, 6.5)
4                      AA          3.7 (-1.2, 8.7)   4.4 (-.5, 9.3)     2.4 (-4.7, 9.4)   3.8 (-2.9, 10.5)
                       Other       2.4 (-1.7, 6.7)   1.8 (-2.6, 6.0)    1.4 (-4.3, 7.0)   1.1 (-4.4, 6.4)
Overall                AA          2.0 (-.9, 4.8)    4.5 (1.8, 7.2)      .8 (-5.1, 6.5)   4.2 (-1.1, 9.3)
                       Other       1.1 (-1.4, 3.6)   2.1 (-.7, 4.7)      .1 (-4.6, 4.6)   1.6 (-2.9, 5.8)

NOTE: Plain numbers are means, and parentheses are central 95% intervals of the posterior distribution of the effects on percentile rank.


Table B.2. CACE Estimand

                                   Applicant's school: Low               Applicant's school: High
Grade at application   Ethnicity   Reading            Math               Reading             Math
1                      AA          3.8 (-2.0, 9.6)    9.0 (4.1, 14.0)    2.3 (-7.3, 10.9)    8.3 (1.1, 15.5)
                       Other       2.9 (-3.1, 8.8)    6.3 (.8, 11.8)     1.3 (-8.0, 10.2)    5.9 (-2.2, 13.9)
2                      AA          1.1 (-3.6, 5.7)    3.1 (-1.3, 7.5)    -.4 (-9.0, 7.9)     2.9 (-5.0, 10.7)
                       Other        .3 (-4.9, 5.4)     .8 (-4.4, 5.9)    -1.5 (-10.5, 7.3)    .0 (-8.5, 8.5)
3                      AA          1.5 (-4.1, 7.0)    6.3 (.3, 12.2)     -.2 (-9.1, 8.5)     5.6 (-3.2, 14.1)
                       Other        .5 (-5.4, 6.4)    3.5 (-3.4, 10.2)   -1.3 (-10.5, 7.9)   2.7 (-7.0, 12.0)
4                      AA          4.6 (-1.6, 10.8)   5.5 (-.7, 11.6)    3.0 (-6.0, 11.6)    4.8 (-3.6, 13.1)
                       Other       3.7 (-2.6, 10.1)   2.8 (-3.8, 9.3)    2.3 (-7.4, 11.7)    2.0 (-7.0, 11.0)
Overall                AA          2.6 (-1.2, 6.3)    6.0 (2.4, 9.5)     1.0 (-6.9, 8.4)     5.5 (-1.5, 12.2)
                       Other       1.7 (-2.3, 5.8)    3.3 (-1.2, 7.7)     .1 (-8.1, 7.9)     2.7 (-5.0, 10.3)

NOTE: Plain numbers are means, and parentheses are central 95% intervals of the posterior distribution of the effects on percentile rank.

APPENDIX B: ETHNIC BREAKDOWN

The model of Section 7 can be used to estimate the effect of the program on finer strata that may be of interest. For example, to estimate effects stratified by ethnicity (AA for African-American), we obtain the posterior distribution of the causal effect of interest (ITT or CACE) using the same steps described in Section 8.1.1 or 8.1.2, but allowing ethnicity to be part of the definition of $W^p$. The results for this stratification are reported in Table B.1 for the estimands of ITT and in Table B.2 for the estimands of CACE.

The results follow patterns similar to those in Section 8. For each grade and applicant's school (low/high) combination, however, the effects are more positive on average for the subgroup of African-American children. For both estimands, this leads to 95% intervals that are entirely above 0 for math scores for the following subgroups: African-American first- and third-graders, and overall, from low-applicant schools; African-American first-graders from high-applicant schools; and non-African-American first-graders from low-applicant schools. All intervals for reading covered the null value. This suggests that the positive effects reported on math scores in Section 8 for children originating from low-applicant schools are primarily attributable to gains among the African-American children in this subgroup.

[Received October 2002. Revised October 2002.]

REFERENCES

Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996), "Identification of Causal Effects Using Instrumental Variables" (with discussion), Journal of the American Statistical Association, 91, 444–472.

Barnard, J., Du, J., Hill, J. L., and Rubin, D. B. (1998), "A Broader Template for Analyzing Broken Randomized Experiments," Sociological Methods and Research, 27, 285–317.

Barnard, J., Frangakis, C., Hill, J. L., and Rubin, D. B. (2002), "School Choice in NY City: A Bayesian Analysis of an Imperfect Randomized Experiment," in Case Studies in Bayesian Statistics, Vol. V, eds. C. Gatsonis, R. E. Kass, B. Carlin, A. Carriquiry, A. Gelman, I. Verdinelli, and M. West, New York: Springer-Verlag, pp. 3–97.

Brandl, J. E. (1998), Money and Good Intentions Are Not Enough, or Why Liberal Democrats Think States Need Both Competition and Community, Washington, DC: Brookings Institute Press.

Carnegie Foundation for the Advancement of Teaching (1992), School Choice: A Special Report, San Francisco: Jossey-Bass.

Chubb, J. E., and Moe, T. M. (1990), Politics, Markets, and America's Schools, Washington, DC: Brookings Institute Press.

Cobb, C. W. (1992), Responsive Schools, Renewed Communities, San Francisco: Institute for Contemporary Studies.

Coleman, J. S., Hoffer, T., and Kilgore, S. (1982), High School Achievement, New York: Basic Books.

Cookson, P. W. (1994), School Choice: The Struggle for the Soul of American Education, New Haven, CT: Yale University Press.

Coulson, A. J. (1999), Market Education: The Unknown History, Bowling Green, OH: Social Philosophy & Policy Center.

Cox, D. R. (1958), Planning of Experiments, New York: Wiley.

D'Agostino, Ralph B., Jr., and Rubin, D. B. (2000), "Estimating and Using Propensity Scores With Incomplete Data," Journal of the American Statistical Association, 95, 749–759.

Education Week (1998), Quality Counts '98: The Urban Challenge, Public Education in the 50 States, Bethesda, MD: Editorial Projects in Education.

Frangakis, C. E., and Rubin, D. B. (1999), "Addressing Complications of Intention-to-Treat Analysis in the Combined Presence of All-or-None Treatment-Noncompliance and Subsequent Missing Outcomes," Biometrika, 86, 365–380.

Frangakis, C. E., and Rubin, D. B. (2001), "Addressing the Idiosyncrasy in Estimating Survival Curves Using Double-Sampling in the Presence of Self-Selected Right Censoring" (with discussion), Biometrics, 57, 333–353.

Frangakis, C. E., and Rubin, D. B. (2002), "Principal Stratification in Causal Inference," Biometrics, 58, 20–29.

Frangakis, C. E., Rubin, D. B., and Zhou, X. H. (2002a), "Clustered Encouragement Design With Individual Noncompliance: Bayesian Inference and Application to Advance Directive Forms" (with discussion), Biostatistics, 3, 147–164.

Frangakis, C. E., Brookmeyer, R. S., Varadhan, R., Mahboobeh, S., Vlahov, D., and Strathdee, S. A. (2002b), "Methodology for Evaluating a Partially Controlled Longitudinal Treatment Using Principal Stratification, With Application to a Needle Exchange Program," Technical Report NEP-06-02, Johns Hopkins University, Dept. of Biostatistics.

Fuller, B., and Elmore, R. F. (1996), Who Chooses? Who Loses? Culture, Institutions, and the Unequal Effects of School Choice, New York: Teachers College Press.

Gelfand, A. E., and Smith, A. F. M. (1990), "Sampling-Based Approaches to Calculating Marginal Densities," Journal of the American Statistical Association, 85, 398–409.

Gelman, A., Meng, X.-L., and Stern, H. (1996), "Posterior Predictive Assessment of Model Fitness via Realized Discrepancies" (with discussion), Statistica Sinica, 6, 733–760.

Gelman, A., and Rubin, D. B. (1992), "Inference From Iterative Simulation Using Multiple Sequences" (with discussion), Statistical Science, 7, 457–472.

Glynn, R. J., Laird, N. M., and Rubin, D. B. (1993), "Multiple Imputation in Mixture Models for Nonignorable Nonresponse With Follow-Ups," Journal of the American Statistical Association, 88, 984–993.

Goldberger, A. S., and Cain, G. G. (1982), "The Causal Analysis of Cognitive Outcomes in the Coleman, Hoffer and Kilgore Report," Sociology of Education, 55, 103–122.

Haavelmo, T. (1943), "The Statistical Implications of a System of Simultaneous Equations," Econometrica, 11, 1–12.

Hastings, W. (1970), "Monte Carlo Sampling Methods Using Markov Chains and Their Applications," Biometrika, 57, 97–109.

Hill, J. L., Rubin, D. B., and Thomas, N. (2000), "The Design of the New York School Choice Scholarship Program Evaluation," in Donald Campbell's Legacy, ed. L. Bickman, Newbury Park, CA: Sage Publications.

Hirano, K., Imbens, G. W., Rubin, D. B., and Zhou, A. (2000), "Assessing the Effect of an Influenza Vaccine in an Encouragement Design," Biostatistics, 1, 69–88.

Holland, P. (1986), "Statistics and Causal Inference," Journal of the American Statistical Association, 81, 945–970.

Imbens, G. W., and Angrist, J. D. (1994), "Identification and Estimation of Local Average Treatment Effects," Econometrica, 62, 467–476.


Imbens, G. W., and Rubin, D. B. (1997), "Bayesian Inference for Causal Effects in Randomized Experiments With Noncompliance," The Annals of Statistics, 25, 305–327.

Levin, H. M. (1998), "Educational Vouchers: Effectiveness, Choice, and Costs," Journal of Policy Analysis and Management, 17, 373–392.

Little, R. J. A. (1993), "Pattern-Mixture Models for Multivariate Incomplete Data," Journal of the American Statistical Association, 88, 125–134.

Little, R. J. A. (1996), "Pattern-Mixture Models for Multivariate Incomplete Data With Covariates," Biometrics, 52, 98–111.

Little, R. J. A., and Rubin, D. B. (1987), Statistical Analysis With Missing Data, New York: Wiley.

Meng, X. L. (1996), "Posterior Predictive p Values," The Annals of Statistics, 22, 1142–1160.

Neal, D. (1997), "The Effects of Catholic Secondary Schooling on Educational Achievement," Journal of Labor Economics, 15, 98–123.

Neyman, J. (1923), "On the Application of Probability Theory to Agricultural Experiments: Essay on Principles, Section 9," translated in Statistical Science, 5, 465–480.

Peterson, P. E., and Hassel, B. C. (eds.) (1998), Learning From School Choice, Washington, DC: Brookings Institute Press.

Peterson, P. E., Myers, D. E., Howell, W. G., and Mayer, D. P. (1999), "The Effects of School Choice in New York City," in Earning and Learning: How Schools Matter, eds. S. E. Mayer and P. E. Peterson, Washington, DC: Brookings Institute Press.

Robins, J. M., Greenland, S., and Hu, F.-C. (1999), "Estimation of the Causal Effect of a Time-Varying Exposure on the Marginal Mean of a Repeated Binary Outcome" (with discussion), Journal of the American Statistical Association, 94, 687–712.

Rosenbaum, P. R., and Rubin, D. B. (1983), "The Central Role of the Propensity Score in Observational Studies for Causal Effects," Biometrika, 70, 41–55.

Rubin, D. B. (1974), "Estimating Causal Effects of Treatments in Randomized and Non-Randomized Studies," Journal of Educational Psychology, 66, 688–701.

Rubin, D. B. (1977), "Assignment to Treatment Groups on the Basis of a Covariate," Journal of Educational Statistics, 2, 1–26.

Rubin, D. B. (1978a), "Bayesian Inference for Causal Effects: The Role of Randomization," The Annals of Statistics, 6, 34–58.

Rubin, D. B. (1978b), "Multiple Imputations in Sample Surveys: A Phenomenological Bayesian Approach to Nonresponse," in Proceedings of the Survey Research Methods Section, American Statistical Association, pp. 20–28.

Rubin, D. B. (1979), "Using Multivariate Matched Sampling and Regression Adjustment to Control Bias in Observational Studies," Journal of the American Statistical Association, 74, 318–328.

Rubin, D. B. (1980), Comment on "Randomization Analysis of Experimental Data: The Fisher Randomization Test," Journal of the American Statistical Association, 75, 591–593.

Rubin, D. B. (1984), "Bayesianly Justifiable and Relevant Frequency Calculations for the Applied Statistician," The Annals of Statistics, 12, 1151–1172.

Rubin, D. B. (1990), "Comment: Neyman (1923) and Causal Inference in Experiments and Observational Studies," Statistical Science, 5, 472–480.

The Coronary Drug Project Research Group (1980), "Influence of Adherence to Treatment and Response of Cholesterol on Mortality in the Coronary Drug Project," New England Journal of Medicine, 303, 1038–1041.

Tinbergen, J. (1930), "Determination and Interpretation of Supply Curves: An Example," reprinted in The Foundations of Econometric Analysis, eds. D. Hendry and M. Morgan, Cambridge, U.K.: Cambridge University Press.

Wilms, D. J. (1985), "Catholic School Effect on Academic Achievement: New Evidence From the High School and Beyond Follow-Up Study," Sociology of Education, 58, 98–114.

Comment

Bengt MUTHÉN, Booil JO, and C. Hendricks BROWN

1. INTRODUCTION

The article by Barnard, Frangakis, Hill, and Rubin (BFHR) is timely in that the Department of Education is calling for more randomized studies in educational program evaluation. (See the discussion of the "No Child Left Behind" initiative in, e.g., Slavin 2002.) BFHR can serve as a valuable pedagogical example of a successful, sophisticated statistical analysis of a randomized study. Our commentary is intended to provide additional pedagogical value to benefit the planning and analysis of future studies, drawing on experiences and research within our research group. [The Prevention Science Methodology Group (PSMG; www.psmg.hsc.usf.edu), co-PIs Brown and Muthén, has collaborated over the last 15 years with support from the National Institute of Mental Health and the National Institute on Drug Abuse.]

BFHR provides an exemplary analysis of the data from an imperfect randomized trial that suffers from several complications simultaneously: noncompliance, missing data in outcomes, and missing data in covariates. We are very pleased to see their application of cutting-edge Bayesian methods for dealing with these complexities.

Bengt Muthén is Professor, and Booil Jo is a postdoctoral fellow, Graduate School of Education and Information Studies, University of California, Los Angeles, Los Angeles, CA 90095 (E-mail: bmuthen@ucla.edu). C. Hendricks Brown is Professor, Department of Epidemiology and Biostatistics, University of South Florida, Tampa, FL 33620. The research of the first author was supported by National Institute on Alcohol Abuse and Alcoholism grant K02 AA 00230. The research of the second and third authors was supported by National Institute on Drug Abuse and National Institute of Mental Health grant MH40859. The authors thank Chen-Pin Wang for research assistance, Joyce Chappell for graphical assistance, and the members of the Prevention Science Methodology Group and the Fall ED299A class for helpful comments.

In addition, we believe the methodological issues and the results of the study have important implications for the design and analysis of randomized trials in education and for related policy decisions.

BFHR provides results of the New York City school choice experiment based on 1-year achievement outcomes. With the planned addition of yearly follow-up data, growth models can provide an enhanced examination of causal impact. We discuss how such growth modeling can be incorporated and provide a caution that applies to BFHR's use of only one posttest occasion. We also consider the sensitivity of the latent class ignorability assumption in combination with the assumption of compound exclusion.

2. LONGITUDINAL MODELING ISSUES

BFHR focuses on variation in treatment effect across compliance classes. This part of the commentary considers variation in treatment effect across a different type of class, based on the notion that the private school treatment effect might very well be quite different for children with different achievement development. (Also of interest is potential variation in treatment effects across schools, with respect to both the public school the child originated in and the private school the child was moved to, but this multilevel aspect of the data is left aside here for lack of space.)


To study such a "treatment-baseline interaction" (or "treatment-trajectory interaction"), we will switch from BFHR's pretest-posttest analysis framework (essentially a very advanced ANCOVA-type analysis) to the growth mixture modeling framework of Muthén et al. (2002). An underlying rationale for this modeling is that individuals at different initial status levels and on different trajectories may benefit differently from a given treatment. ANCOVA controls for initial status as measured by the observed pretest score. Unlike the observed pretest score, the latent variable of initial status is free of time-specific variation and measurement error.

The focus on longitudinal aspects of the New York School Choice Study (NYSCS) is both substantively and statistically motivated. First, treatment effects may not have gained full strength after only a 1-year stay in a private school. The NYSCS currently has data from three follow-ups, thus providing repeated-measures data from four grades. Although BFHR used percentile scores that do not lend themselves to growth modeling, a conversion to "scale scores" (i.e., IRT-based equated scores) should be possible, enabling growth modeling. Unfortunately, educational research traditionally uses scales that are unsuitable for growth modeling, such as percentile scores, normal curve equivalents, and grade equivalents (for a comparison in a growth context, see Seltzer, Frank, and Bryk 1994). Hopefully, this tradition can be changed. Second, the use of information from more time points than pretest and posttest makes it possible to identify and estimate models that give a richer description of the normative development in the control group and of how the treatment changes this development.

Consider three types of latent variables for individual i The rst type Ci refers to BFHRrsquos complianceprincipal strata Thenext two relate to the achievement development as expressedby a growth mixture model Ti refers to trajectory class andacutei refers to random effects within trajectory class (within-classmodel is a regular mixed-effects model) Unlike the latent classvariable Ci the latent class variable Ti is a fully unobservedvariable as is common in latent variable modeling (see egMutheacuten 2002a) Consider the likelihoodexpression for individ-ual i using the [] notation to denote probabilitiesdensities

[C_i, T_i | X_i] [η_i | C_i, T_i, X_i] [Y_i | η_i, C_i, T_i, X_i]
    × [U_i | η_i, C_i, T_i, X_i] [R_i | Y_i, η_i, C_i, T_i, X_i],     (1)

where X_i denotes covariates; U_i denotes a compliance stratum indicator (with C_i perfectly measured by U_i in the treatment group for never-takers and perfectly measured in the control group for always-takers, other group–class combinations having missing data); and R_i denotes indicators for missing data on the repeated-measures outcomes Y_i (pretreatment and posttreatment achievement scores). This type of model can be fitted in the latent variable modeling framework of the Mplus program (Muthén and Muthén 1998–2002, tech. app. 8), which has implemented an EM-based maximum likelihood estimator. (For related references, see the Mplus website, www.statmodel.com.) As a special case of (1), conventional random-effects growth modeling includes η_i but excludes C_i and T_i and assumes missingness at random, so that the last term in (1) is ignored. Growth mixture modeling (Muthén and Shedden 1999; Muthén et al. 2002; Muthén and Muthén 1998–2002) includes η_i and T_i. BFHR includes C_i, but not T_i (or η_i), and includes the last term in (1), drawing on the latent ignorability of Frangakis and Rubin (1999). Muthén and Brown (2001) studied latent ignorability related to T_i in the last term of (1). In randomized studies, it would be of interest to study C_i and T_i classes jointly, because individuals in different trajectory classes may show different compliance, and missingness may be determined by these classes jointly.
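For readers keeping track of the nesting, the special cases just listed can be written out in the same bracket notation. (This display is our transcription of the verbal description above, not notation appearing in the original.)

    [η_i | X_i] [Y_i | η_i, X_i]                  (conventional random-effects growth model; MAR, so the last term of (1) is dropped)
    [T_i | X_i] [η_i | T_i, X_i] [Y_i | η_i, T_i, X_i]                  (growth mixture modeling)
    [C_i | X_i] [Y_i | C_i, X_i] × [U_i | C_i, X_i] [R_i | Y_i, C_i, X_i]   (BFHR: compliance strata with latent ignorability; no T_i or η_i)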

If data have been generated by a growth mixture model with treatment effects varying across trajectory classes, what would a pretest–posttest analysis such as that in BFHR reveal? To judge the possibility of such treatment–trajectory interaction in the NYSCS, we considered several recent applications of growth mixture modeling that have used T_i to represent qualitatively different types of trajectories for behavior and achievement scores on children in school settings. Drawing on these real-data studies, two growth mixture scenarios were investigated. (A detailed description of these real-data studies and scenarios and their parameter values are given in Mplus Web Note 5 at www.statmodel.com/mplus/examples/webnote.html.) For simplicity, no missing data on the outcome or pretest is assumed, and C_i classes are not present. In a three-class scenario, the treatment effect is noteworthy only for a 70% middle class, assuming that the low-class membership (10%) hinders individuals from benefiting from the treatment and that the high-class membership (20%) does not really need the treatment. The achievement development in the three-class scenario is shown in Figure 1(a), and the corresponding posttest (y2)–pretest (y1) regressions are shown in Figure 1(b). The lines denoted ANCOVA show a regular ANCOVA analysis allowing for an interaction between treatment and pretest (different slopes). In the three-class scenario, the ANCOVA interaction is not significant at n = 2000, and the treatment effect in the middle class is underestimated by 20% but overestimated in the other two classes. In a two-class scenario (not shown here), where the treatment is noteworthy only for individuals in the low class (50%), ANCOVA detects an interaction that is significant at the NYSCS sample size of n = 2000 but underestimates the treatment effect for most children in the low class. (At the low-class average pretest value of 0, the treatment effect is underestimated by 32%.)

Although the NYSCS children are selected from low-performing schools (the average NYSCS math and reading percentile rankings are around 23–28), there may still be sufficient heterogeneity among children in their achievement growth to make a treatment–trajectory interaction plausible. The three-class scenario is possible, perhaps with more children in the low class relative to the other two classes. If this is the case, the ANCOVA analysis shown in Figure 1 suggests a possible reason for BFHR's finding of low treatment effects. The empirical studies and the results in Figure 1 suggest that future program evaluations may benefit from exploring variation in treatment effects across children characterized by different development. Using data from at least two posttreatment time points (three time points total), the class-specific treatment effects generated in these data can be well recovered by growth mixture modeling. (Monte Carlo simulation results are given in Mplus Web Note 5 at www.statmodel.com/mplus/examples/webnote.html.)
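To make the mechanism easy to reproduce, here is a minimal simulation in the spirit of the three-class scenario: only the middle class benefits from treatment, and an ANCOVA with a treatment-by-pretest interaction is fitted to the pretest and posttest scores. All parameter values (class means, the 5-point middle-class effect, noise levels) are illustrative inventions, not the values of Mplus Web Note 5.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 2000  # the NYSCS sample size discussed in the text

    # Three latent trajectory classes: low (10%), middle (70%), high (20%).
    cls = rng.choice(3, size=n, p=[0.10, 0.70, 0.20])
    mu = np.array([20.0, 45.0, 70.0])[cls]      # class-specific latent initial status
    z = rng.integers(0, 2, size=n)              # randomized treatment indicator
    effect = np.array([0.0, 5.0, 0.0])[cls]     # only the middle class benefits

    y1 = mu + rng.normal(0, 8, n)               # pretest = latent status + noise
    y2 = mu + effect * z + rng.normal(0, 8, n)  # posttest (normative growth omitted)

    # Regular ANCOVA with a treatment-by-pretest interaction, as in Figure 1(b).
    y1c = y1 - y1.mean()
    X = sm.add_constant(np.column_stack([z, y1c, z * y1c]))
    fit = sm.OLS(y2, X).fit()
    names = ["const", "treat", "pretest", "treat_x_pretest"]
    print(dict(zip(names, fit.params.round(2))))
    print("interaction t ratio:", round(fit.tvalues[3], 2))

    # True middle-class effect is 5; the ANCOVA effect evaluated at the
    # middle-class mean pretest is attenuated by the two unaffected classes.
    print("ANCOVA effect at middle-class mean pretest:",
          round(fit.params[1] + fit.params[3] * (45.0 - y1.mean()), 2))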


Figure 1. Growth Mixture Modeling Versus Pretest–Posttest Analysis: (a) achievement development in the three-class scenario; (b) the corresponding posttest (y2)–pretest (y1) regressions.

A more flexible analysis is obtained with more posttreatment time points. An improved design for the determination of the latent trajectory classes is to use more than one pretreatment time point, so that the trajectory class membership is better determined before the treatment starts.

3. COMPOUND EXCLUSION AND LATENT IGNORABILITY

Based on the ideas of principal stratification (Frangakis and Rubin 2002) and latent ignorability (Frangakis and Rubin 1999), BFHR successfully demonstrates that the complexities of educational studies can be better handled under more explicit and flexible sets of assumptions. Although we think that their structural assumptions are reasonable in the NYSCS, we would like to add some thoughts on the plausibility of two other assumptions, considering more general situations.

Compound exclusion (CE) is one of the key structural assumptions in identifying principal effects under latent ignorability. However, the plausibility of this assumption can be questioned in practice (Frangakis et al. 2002; Hirano et al. 2000; Jo 2002a, 2002c; Shadish, Cook, and Campbell 2002; West and Sagarin 2000). In the NYSCS, it seems realistic to assume that winning a lottery has some positive impact on always-takers; however, it is less clear how winning a lottery will affect never-takers. One possibility is that winning a lottery has a negative impact on parents, because they fail to benefit from it. Discouraged parents may have a negative influence on a child's test scores or response behaviors. This negative effect may become more evident if noncompliance is due to parents' low expectation of, or lack of interest in, their child's education. Another possibility is that winning a lottery has a positive impact on a child's test scores or response behaviors. For example, parents who are discouraged by being unable to send their child to private schools even with vouchers may try harder to improve the quality of existing resources (e.g., in the public school their child attends) and be more motivated to support their child to improve his or her academic performance. Given these competing possibilities, it is not easy to predict whether and how CE is violated.

Depending on the situation, causal effect estimates can be quite sensitive to violation of the exclusion restriction in outcome missingness (Jo 2002b), which is less well known than the impact of violating the exclusion restriction on observed outcomes (Angrist et al. 1996; Jo 2002a). The implication of possible violation of CE is that the relative benefit of models assuming latent ignorability (LI) and standard ignorability (SI) depends on degrees of deviation from CE and SI. Identification of causal effects under LI relies on the generalized (compound) exclusion restriction (i.e., both on the outcomes and on the missingness of outcomes), whereas identification of causal effects under SI relies on the standard exclusion restriction (i.e., only on the outcomes). Therefore, in some situations, the impact of deviation from CE may outweigh the impact of deviation from SI, resulting in more biased causal effect estimates in models assuming LI than in models assuming SI (Jo 2002b). For example, if SI holds but CE is seriously violated (say, a 20% increase in the response rate due to treatment assignment for compliers and a 15% increase for never-takers), causal effect estimates and the coverage probability assuming LI and CE can drastically deviate from the true value and the nominal level. This type of violation does not affect models assuming SI and the standard exclusion restriction. To empirically examine the plausibility of SI, LI, and CE, it will be useful to do sensitivity analyses of models imposing different combinations of these assumptions. As BFHR point out, this investigation can be conducted by relaxing compound exclusion (e.g., Frangakis et al. 2002; Hirano et al. 2000) or by using alternative structural assumptions (e.g., Jo 2002c). More research is needed to examine the efficiency of these alternative models and to explore factors associated with insensitivity of LI models to violation of compound exclusion.

4. CONCLUSION

Causal inferences of the type BFHR provides are a dramatic improvement over the existing literature now available on the question of whether school choice will produce better achievement outcomes for children in an urban public school system. The randomized lottery provides an exceptionally powerful tool for examining the impact of a program—far more useful than observational studies that have causal change intertwined hopelessly with self-selection factors. Statisticians are just now investigating variations in such principal strata analyses, that is, those involving latent classes formed as a function of randomized trials involving intervention invitations (such as vouchers), encouragement designs, and field trial designs involving more than one randomization (Brown and Liao 1999). The latent categories in this article, which BFHR labels "complier," "never-taker," "always-taker," and "defier," represent only one type of design. Other terms may be more relevant to the scientific questions underlying trials in which subjects are randomly assigned to different levels of invitation (e.g., Angrist and Imbens 1995) or different levels of implementation. Such trials not only have great potential for examining questions of effectiveness, sustainability, and scalability, but also require terms more consistent with adherence than compliance. Again, we congratulate the authors on an important addition to the methodological literature that we predict will have lasting impact.

ADDITIONAL REFERENCES

Angrist, J. D., and Imbens, G. W. (1995), "Two-Stage Least Squares Estimation of Average Causal Effects in Models With Variable Treatment Intensity," Journal of the American Statistical Association, 90, 431–442.

Brown, C. H., and Liao, J. (1999), "Principles for Designing Randomized Preventive Trials in Mental Health: An Emerging Developmental Epidemiologic Perspective," American Journal of Community Psychology, 27, 673–709.

Jo, B. (2002a), "Model Misspecification Sensitivity Analysis in Estimating Causal Effects of Interventions With Noncompliance," Statistics in Medicine, 21, 3161–3181.

——— (2002b), "Sensitivity of Causal Effects Under Ignorable and Latent Ignorable Missing-Data Mechanisms," available at www.statmodel.com/mplus/examples/jo.

——— (2002c), "Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications" (with discussion), Journal of Educational and Behavioral Statistics, 27, 385–420.

Muthén, B. (2002a), "Beyond SEM: General Latent Variable Modeling," Behaviormetrika, 29, 81–117.

Muthén, B., and Brown, C. H. (2001), "Non-Ignorable Missing Data in a General Latent Variable Modeling Framework," presented at the annual meeting of the Society for Prevention Research, Washington, DC, June 2001.

Muthén, B., Brown, C. H., Masyn, K., Jo, B., Khoo, S. T., Yang, C. C., Wang, C. P., Kellam, S., Carlin, J., and Liao, J. (2002), "General Growth Mixture Modeling for Randomized Preventive Interventions," Biostatistics, 3, 459–475.

Muthén, L., and Muthén, B. (1998–2002), Mplus User's Guide, Los Angeles: authors.

Muthén, B., and Shedden, K. (1999), "Finite Mixture Modeling With Mixture Outcomes Using the EM Algorithm," Biometrics, 55, 463–469.

Seltzer, M. H., Frank, K. A., and Bryk, A. S. (1994), "The Metric Matters: The Sensitivity of Conclusions About Growth in Student Achievement to Choice of Metric," Educational Evaluation and Policy Analysis, 16, 41–49.

Shadish, W. R., Cook, T. D., and Campbell, D. T. (2002), Experimental and Quasi-Experimental Designs for Generalized Causal Inference, Boston: Houghton Mifflin.

Slavin, R. E. (2002), "Evidence-Based Education Policies: Transforming Educational Practice and Research," Educational Researcher, 31, 15–21.

West, S. G., and Sagarin, B. J. (2000), "Participant Selection and Loss in Randomized Experiments," in Research Design, ed. L. Bickman, Thousand Oaks, CA: Sage, pp. 117–154.

Comment
Alan B. KRUEGER and Pei ZHU

The New York City school voucher experiment provides some of the strongest evidence yet available on the effect of private school vouchers on student test achievement. Barnard et al. provide a useful reevaluation of the experiment, and some of the authors were integrally involved in the design of the experiment. We will leave it to other discussants to comment on the Bayesian methodological advances in their paper, and instead comment on the substantive lessons that can be learned from the experiment and the practical lessons that emerge from the novel design of the experiment.

We have the advantage of having access to more complete data than Barnard et al. used in preparing their paper. This includes more complete baseline test information, data on multichild families as well as single-child families, and three years of follow-up test data instead of just one year of follow-up data.

Alan B. Krueger is Professor and Pei Zhu is graduate student, Economics Department, Princeton University, Princeton, NJ 08544 (E-mail: [email protected]). In part, this comment extends and summarizes results of Krueger and Zhu (2003). Readers are invited to read that article for a more in-depth analysis of many of the issues raised in this comment. Some of the results presented here differ slightly from those in our earlier article, however, because the definition of family size in this comment corresponds to the one used by Barnard et al., rather than to the definition in our earlier article.

In our comment, we use the more comprehensive sample because results for this sample are more informative, but we highlight where differences arise from using the sample analyzed by Barnard et al.

Three themes emerge from our analysis. First, simplicity and transparency are underappreciated virtues in statistical analysis. Second, it is desirable to use the most recent, comprehensive data for the widest sample possible. Third, there is no substitute for probing the definitions and concepts that underlie the data.

1. RANDOM ASSIGNMENT

As Barnard et al. explain, the voucher experiment entailed a complicated block design, and different random assignment procedures were used in the first and second set of blocks. In the first block, a propensity matched-pairs design (PMPD) method was used. In this block, far more potential control families were available than there was money to follow them up.


Table 1. Efficiency Comparison of Two Methods of Random Assignment: Estimated Treatment Effect and Standard Error, by Method of Random Assignment

                                  Propensity score         Randomized blocks        Relative       Sample-size-adjusted
                                  match subsample          subsample                sampling       relative sampling
Year    Model                     Coef.        No. obs.    Coef.        No. obs.    variance       variance
                                                                                    (RB/PMPD)      (RB/PMPD)

All students
Year 1  Control for baseline      0.33 (1.40)  721         2.10 (1.38)  734         .972           .989
        Omit baseline             0.32 (1.40)  1,048       -1.02 (1.58) 1,032       1.274          1.254
Year 3  Control for baseline      1.02 (1.56)  613         0.67 (1.74)  637         1.244          1.293
        Omit baseline             -0.31 (1.57) 900         -0.45 (1.78) 901         1.285          1.287

African-American students
Year 1  Control for baseline      0.32 (1.88)  302         8.10 (1.71)  321         .827           .879
        Omit baseline             2.10 (1.99)  436         3.19 (2.07)  447         1.082          1.109
Year 3  Control for baseline      4.21 (2.51)  247         6.57 (2.09)  272         .693           .764
        Omit baseline             2.26 (2.47)  347         3.05 (2.31)  386         .875           .973

NOTE: Standard errors are in parentheses. The treatment effect coefficient is from a regression of test scores on a dummy indicating assignment to receive a voucher (1 = yes), lottery randomization strata dummies, and, in the indicated models, baseline test scores. Bootstrap standard errors account for within-family correlation in residuals. Year 1 or 3 refers to the follow-up year. The last column adjusts the relative sampling variances of the treatment effects for differences in relative sample sizes. The sample of African-American students consists of those whose mothers' race/ethnicity is identified as non-Hispanic Black/African-American.

Rather than select a random sample of controls for follow-up after randomly selecting the treatments, the researchers estimated a propensity score model to align controls with the treatments and then used a nearest-available-neighbor Mahalanobis match to select specific paired control families for follow-up. In principle, the PMPD design should improve the precision of the estimates by reducing the chance imbalances that can occur in a simple randomized block design. A standard randomized block design was used in the remaining four blocks.
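The PMPD model itself was not described in sufficient detail to replicate (see the end of this section). Purely to fix ideas, here is a generic sketch of the two-step idea: a logistic propensity model, followed by greedy nearest-available Mahalanobis matching. The data frame and column names are hypothetical, and the actual design matched on more than is shown here.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from scipy.spatial.distance import mahalanobis

    # df: one row per family; treat = 1 for voucher winners, 0 for candidate
    # controls; covs lists baseline covariates (all names are hypothetical).
    def pmpd_match(df, covs):
        X = sm.add_constant(df[covs])
        ps = sm.Logit(df["treat"], X).fit(disp=0).predict(X)
        df = df.assign(ps=ps)
        cols = covs + ["ps"]                   # match on covariates plus the score
        VI = np.linalg.inv(np.cov(df[cols].values.T))
        controls = df[df["treat"] == 0].copy()
        pairs = []
        for i, row in df[df["treat"] == 1].iterrows():
            d = controls[cols].apply(lambda c: mahalanobis(row[cols], c, VI),
                                     axis=1)
            j = d.idxmin()                     # nearest available control
            pairs.append((i, j))
            controls = controls.drop(index=j)  # without replacement: pair is fixed
        return pairs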

Barnard et al. emphasize that the PMPD block was more balanced for 15 of the 21 baseline variables that they examined. But the key question is the extent to which the precision of the estimates was improved, not the covariate balance, because both blocks yield unbiased estimates. If the covariates explain relatively little of the outcome variable, then the PMPD will add very little. In Table 1 we provide standard intent-to-treat (ITT) estimates—that is, the difference in mean test scores between treatments and controls, conditional on the strata used for random assignment (family size by block by high/low-achieving school)—and, more importantly, their standard errors, separately for the PMPD subsample and the randomized block subsample. The outcome variable is the average of the national percentile rank on the math and reading segments of the Iowa Test of Basic Skills, taken either 1 year or 3 years after random assignment. Two sets of results are presented: one set controlling for baseline test scores, for the subsample with baseline scores, and the other without controlling for scores, for the larger sample. The last column adjusts the relative sampling variances for differences in sample size between the blocks.
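To spell out the last two columns with the first row of Table 1 (this is our reading of how they are constructed): the relative sampling variance is the squared ratio of the standard errors,

    (1.38 / 1.40)² ≈ .972,

and the sample-size-adjusted version multiplies by the ratio of the subsample sizes,

    .972 × (734 / 721) ≈ .989.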

For all students (panel A), the standard errors are about 10% smaller in the PMPD sample. For the Black students (panel B), however, the standard errors are essentially the same or slightly larger in the PMPD sample. This finding is unexpected, because Hill, Rubin, and Thomas (2000) predicted that analyses of subgroups would have more power in the PMPD design, because matching should lead to a more nearly equal representation of subgroups in the treatment and control groups. In addition, the covariate balance is less good for the Black students in the PMPD than in the randomized blocks if we compute Z scores for the differences between treatments and controls for the baseline covariates. One possibility is that because Black students could have been matched to non-Black students in the PMPD design, this subsample was actually less balanced than the subsample of Blacks from the more conventional randomized block design.

The increase in power from the PMPD design is relatively modest, even in the full sample. Of much more practical importance is the fact that the complexity of the PMPD design caused Mathematica to initially calculate incorrect baseline sample weights. The baseline weights were designed to make the follow-up sample representative of all eligible applicants for vouchers. The initial weights assigned much too little importance to the control sample in the PMPD block. Because these families represented many unselected controls, they should have been weighted heavily. The revised weights increased the weight on the PMPD controls by 620%; in contrast, the weight increased by just 18% for the rest of the sample.

This error had grave consequences for the initial inferences and analyses of the data. First, using the faulty weights, Peterson, Myers, and Howell (1998) reported highly statistically significant differences in baseline test scores between the treatment and control groups, with control group members scoring higher on both the math and reading exams. After the mistake was discovered and the weights were revised, however, the baseline differences became small and statistically insignificant. (We note, however, that if we limit attention to the students from single-child families that Barnard et al. use, and use the more complete baseline test score data, then there is a statistically significant difference in baseline reading scores between treatments and controls, even with the revised baseline weights. If we pool together students from single-child and multichild families, then the baseline difference is insignificant. This is another reason why we think it is more appropriate to use the broader sample that includes children from all families.)

Second, because of the inaccurate inference that there was a difference in baseline ability between treatments and controls, Mathematica's researchers were discouraged from analyzing (or at least from presenting) results that did not condition on baseline scores. This is unfortunate, because conditioning on baseline scores caused the researchers to drop from the sample all the students who were initially in kindergarten (because these students were not given baseline tests) and 11% of students initially in grades 1–4. Including students with missing baseline scores increases the sample by more than 40% and expands the population to which the results can be generalized. As explained later, and in an article by Krueger and Zhu (2003), qualitatively different results are found if students with missing baseline scores are included in the sample. Because assignment to receive a voucher was random within lottery strata, estimates that do not condition on baseline scores are nonetheless unbiased. Moreover, it is inefficient to exclude students with missing baseline scores (most of whom had follow-up test scores), and such a sample restriction can potentially cause sample selection bias.

Of lesser importance is the fact that the PMPD design also complicates the calculation of standard errors. The matching of treatments and controls on selected covariates creates a dependence between paired observations. Moreover, if the propensity score model is misspecified, then the equation error also creates a dependence between observations. Researchers have not taken this dependence into account in the calculation of standard errors. We suspect, however, that this is likely to cause only a small understatement of the standard errors, because the covariates do not account for much of the residual variance of test scores and because the PMPD sample is only about half of the overall sample. (We tried to compute the propensity score to adjust the standard errors ourselves, but were unable to replicate the PMPD model, because it was not described in sufficient detail and because the computer programs are proprietary. This also prevented us from replicating estimates in Table 4 of Barnard et al.)

The problems created by the complicated experimental design, particularly concerning the weights, lead us to reiterate the advice of Cochran and Cox (1957): "A good working rule is to use the simplest experimental design that meets the needs of the occasion." It seems to us that in this case the PMPD design introduced unnecessary complexity that inadvertently led to a consequential mistake in the computation of weights, with very little improvement in efficiency. This was not the fault of the architects of the PMPD design, who did not compute the weights, but it was related to the complexity of the design. Simplicity and transparency—in designs, methods, and procedures—can help avoid mistakes down the road, which are almost inevitable in large-scale empirical studies. Indeed, Mathematica recently informed us that they still do not have the baseline weights exactly correct, 5 years after random assignment.

2. INTENT-TO-TREAT ESTIMATION RESULTS

Barnard et al. devote much attention to addressing potential problems caused by missing data for the sample of students in grades 1–4 at baseline; this follows the practice of earlier reports by Mathematica, which restricted the analysis of student outcomes (but not parental responses, such as satisfaction) to those enrolled in grades 1–4 at baseline. In our opinion, a much more important substantive problem results from the exclusion of students in the kindergarten cohort, who were categorically dropped because they were not given baseline tests. Because assignment to treatment status was random (within strata), a simple comparison of means between treatments and controls, without conditioning on baseline scores, provides an unbiased estimate of the average treatment effect. Moreover, as Barnard et al. and others show, treatments and controls were well balanced in terms of baseline characteristics, so there is no reason to suspect that random assignment was somehow corrupted.

Table 2 presents regression estimates of the ITT effect using various samples. Because we cannot replicate the Bayesian estimates without knowing the propensity score, we present conventional ITT estimates. For each entry in the first column, we regressed the test score percentile rank on a voucher offer dummy, 30 dummies indicating lottery strata (block × family size × high/low school), and baseline test scores. The sample is limited to those with baseline test data. Our results differ from the ITT estimates in Table 5 of Barnard et al. because we use the revised weights and a more comprehensive sample that also includes students from multichild families. Barnard et al. report that their Bayesian estimates are "generally more stable" than the conventional ITT estimates, "which in some cases are not even credible," but the extreme ITT estimates that they refer to, for 4th graders from high-achieving schools, are based on only 15 students. For the samples that pool students across grades, the results of the conventional ITT estimates are a priori credible and probably not statistically different from their Bayesian estimates.
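As a schematic rendering of the regression just described, the following sketch computes the voucher-offer coefficient by weighted least squares and a family-level bootstrap standard error of the kind described in the table notes. All column names are hypothetical.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical columns: npr = follow-up national percentile rank,
    # offer = voucher-offer dummy, stratum = 30 lottery strata (block x
    # family size x high/low school), base_math/base_read = baseline
    # scores, weight = revised baseline weight, famid = family identifier.
    def itt_coef(df):
        fit = smf.wls("npr ~ offer + C(stratum) + base_math + base_read",
                      data=df, weights=df["weight"]).fit()
        return fit.params["offer"]

    def family_bootstrap_se(df, reps=500, seed=0):
        # Resample whole families, so the standard error reflects the
        # within-family correlation in residuals noted under the tables.
        rng = np.random.default_rng(seed)
        fams = df["famid"].unique()
        draws = [itt_coef(pd.concat([df[df["famid"] == f]
                                     for f in rng.choice(fams, len(fams))]))
                 for _ in range(reps)]
        return np.std(draws, ddof=1)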

In the second column, we use the same sample as in column 1 but omit the baseline test score from the model. In the third column, we expand the sample to include those with missing baseline scores. In the fourth column, we continue to include students with missing baseline scores, but restrict the sample to those initially in low-achieving public schools.

We focus mainly on the results for Black students and students from low-achieving public schools, because the effect of offering a voucher on either reading or math scores for the overall sample is always statistically insignificant and because most public policy attention has focused on these two groups.

Although it has not been made explicit in previous studies of these data, Black students have been defined as children with a non-Hispanic Black/African-American mother. We use this definition and a broader one to probe the sensitivity of the results.

Omitting baseline scores has little qualitative effect on the estimates when the same sample is used (compare columns 1 and 2); the coefficient typically changes by no more than half a standard error. For year 3 scores for Black students, for example, the treatment effect on the composite score is 5.2 points (t = 3.2) when controlling for baseline scores and 5.0 points (t = 2.6) when not controlling.


Table 2. Estimated Treatment Effects With and Without Controlling for Baseline Scores: Coefficient on Voucher Dummy From Test Score Regression, for Various Samples

                                                  Subsample with      Subsample with      Full sample,      Subsample, applicant
                                                  baseline scores,    baseline scores,    omits baseline    school Low, omits
Test        Sample                                controls for        omits baseline      scores            baseline scores
                                                  baseline scores     scores

First follow-up test
Composite   Overall                               1.27 (0.96)         0.42 (1.28)         -0.33 (1.06)      -0.60 (1.06)
            Mother Black, non-Hispanic            4.31 (1.28)         3.64 (1.73)         2.66 (1.42)       1.75 (1.57)
            Either parent Black, non-Hispanic     3.55 (1.24)         2.52 (1.73)         1.38 (1.42)       0.53 (1.55)
Reading     Overall                               1.11 (1.04)         0.03 (1.36)         -1.13 (1.17)      -1.17 (1.18)
            Mother Black, non-Hispanic            3.42 (1.59)         2.60 (1.98)         1.38 (1.74)       0.93 (1.84)
            Either parent Black, non-Hispanic     2.68 (1.50)         1.49 (1.97)         -0.09 (1.71)      -0.53 (1.81)
Math        Overall                               1.44 (1.17)         0.80 (1.45)         0.47 (1.17)       -0.03 (1.15)
            Mother Black, non-Hispanic            5.28 (1.52)         4.81 (1.93)         4.01 (1.54)       2.68 (1.63)
            Either parent Black, non-Hispanic     4.42 (1.48)         3.54 (1.90)         2.86 (1.51)       1.59 (1.63)

Third follow-up test
Composite   Overall                               0.90 (1.14)         0.36 (1.40)         -0.38 (1.19)      0.10 (1.16)
            Mother Black, non-Hispanic            5.24 (1.63)         5.03 (1.92)         2.65 (1.68)       3.09 (1.67)
            Either parent Black, non-Hispanic     4.70 (1.60)         4.27 (1.92)         1.87 (1.68)       1.90 (1.67)
Reading     Overall                               0.25 (1.26)         -0.42 (1.51)        -1.00 (1.25)      -0.08 (1.24)
            Mother Black, non-Hispanic            3.64 (1.87)         3.22 (2.19)         1.25 (1.83)       2.06 (1.83)
            Either parent Black, non-Hispanic     3.37 (1.86)         2.75 (2.18)         0.76 (1.83)       1.23 (1.82)
Math        Overall                               1.54 (1.30)         1.15 (1.53)         0.24 (1.33)       0.29 (1.28)
            Mother Black, non-Hispanic            6.84 (1.86)         6.84 (2.09)         4.05 (1.86)       4.13 (1.85)
            Either parent Black, non-Hispanic     6.04 (1.83)         5.78 (2.09)         2.97 (1.85)       2.58 (1.82)

NOTE: Standard errors are in parentheses. The dependent variable is the test score national percentile rank (NPR); the reported coefficient is the coefficient on the voucher offer dummy. All regressions control for 30 randomization strata dummies; models in the first column also control for baseline math and reading test scores. Bootstrap standard errors are robust to correlation in residuals among students in the same family. In the printed table, bold font indicates that the absolute t ratio exceeds 1.96.

Sample sizes in year 1, for the subsample with baseline scores / without, are: overall, 1,455/2,080; mother Black, 623/883; either parent Black, 684/968.

Sample sizes in year 3, for the subsample with baseline scores / without, are: overall, 1,250/1,801; mother Black, 519/733; either parent Black, 572/807.

Sample sizes in year 1 for the subsample from low schools, without baseline scores, are: overall, 1,802; mother Black, 770; either parent Black, 843.

Sample sizes in year 3 for the subsample from low schools, without baseline scores, are: overall, 1,569; mother Black, 647; either parent Black, 712.

This stability is expected in a randomized experiment. When the sample is expanded to include those with missing baseline scores in column 3, however, the ITT effect falls almost in half in year 3 for Black students, to 2.65 points, with a t ratio of 1.58. The effect on the math score is statistically significant, as Barnard et al. emphasize. Qualitatively similar results are found if baseline covariates, such as mother's education and family income, are included as regressors (see Krueger and Zhu 2003).

The classification of Black students used in the study is somewhat idiosyncratic. Race and ethnicity were collected in a single question in the parental survey—contrary to the OMB guidelines for government surveys—so it is impossible to identify Blacks of Hispanic origin (of which there are a large number in New York). In addition, the survey did not directly ask about the child's race or ethnicity, so children's race must be inferred from parents' race/ethnicity. A broader definition of Black students—and one that probably comes closer to the definition used in government surveys and most other educational research—would also include those who have a Black father in the sample of Black students. According to the 1990 Census, race was reported as Black for 85% of the children with a Black father and a Hispanic mother (the overwhelming non-Black portion of the sample) in the New York metropolitan region.

Hence, we also present results for a sample in which either parent is listed as "Black, non-Hispanic." This increases the sample of Black students by about 10%, and the results are even weaker for this sample. For example, the ITT estimate using the more comprehensive sample of Black students enrolled in grades K–4 at baseline is 1.87 points (t = 1.11) on the third-year follow-up composite test, although even with this sample the effect on math is significant at the 10% level in year 1, as Barnard et al. have found, and almost significant in year 3. So we conclude that the effect of the opportunity to use a private school voucher on the composite score for the most comprehensive sample of Black students is insignificantly different from 0, although it is possible that there was initially a small beneficial effect on the math score for Black students.

In the final column, we present results for the subsample of students originally enrolled in schools with average test scores below the median score in New York City. In each case, the results for this subsample are quite similar to those for the full sample, and a formal test of the null hypothesis that students from low- and high-achieving schools benefit equally from vouchers is never close to rejecting. It also appears very unlikely that the differences between the treatment effects for applicants from low- and high-achieving schools in Barnard et al.'s Table 4 are statistically significant, either.

3. WHAT WAS BROKEN

We agree with Barnard et al. that the experiment was broken in the sense that attrition and missing data were common. Previous analyses were also strained, if not broken, by their neglect of the cohort of students originally in kindergarten, who were in third grade at the end of the experiment and whose follow-up test scores were ignored. Including these students qualitatively alters the results for Black students. The experiment was also broken in the sense that years passed before correct baseline weights were computed.

We disagree, however, with the interpretation that the experiment was broken because compliance was less than 100%. This depends on the question that one is interested in answering. Because most interest in the experiment for policy purposes centers on the impact of offering vouchers on achievement—not on compelling students to use vouchers—we think the ITT estimates, which reflect inevitable partial usage of vouchers, are most relevant (see also Rouse 1998; Angrist et al. 2002). Moreover, New York had a higher voucher take-up rate than the experiments in Dayton and the District of Columbia, so one could argue that the New York experiment provides an upper bound estimate of the effect of offering vouchers. On the other hand, if the goal is to use the experiment to estimate the effect of attending private school on achievement, then such methods as instrumental variables or those used by Barnard et al. are necessary to estimate the parameter of interest. We consider this of secondary interest in this case, however.

ADDITIONAL REFERENCES

Angrist, J., Bettinger, E., Bloom, E., King, E., and Kremer, M. (2002), "Vouchers for Private Schooling in Colombia: Evidence From a Randomized Natural Experiment," American Economic Review, 92, 1535–1558.

Cochran, W., and Cox, G. (1957), Experimental Designs (2nd ed.), New York: Wiley.

Krueger, A., and Zhu, P. (2003), "Another Look at the New York City School Voucher Experiment," American Behavioral Scientist, forthcoming. Also available from Industrial Relations Section, Princeton University, at http://www.irs.princeton.edu.

Peterson, P., Myers, D., and Howell, W. (1998), "An Evaluation of the New York City Scholarships Program: The First Year," Mathematica Policy Research, Inc., available at http://www.ksg.harvard.edu/pepg/pdf/ny1rpt.pdf.

Rouse, C. (1998), "Private School Vouchers and Student Achievement: An Evaluation of the Milwaukee Parental Choice Program," The Quarterly Journal of Economics, 113, 553–602.

Comment
Richard A. BERK and Hongquan XU

1. INTRODUCTION

The article by Barnard, Frangakis, Hill, and Rubin (hereafter BFHR) is a virtuoso performance. By applying the Neyman–Rubin model of causal effects and building on a series of important works on the estimation and interpretation of causal effects (e.g., Angrist, Imbens, and Rubin 1996), BFHR manage to extract a reasonable set of findings for a "broken" randomized experiment. The article more generally underscores the important difference between estimating relationships in a consistent manner and interpreting those estimates in causal terms; obtaining consistent estimates is only part of the enterprise. Finally, the article emphasizes that assumptions made so that proper estimates may be obtained are not just technical moves of convenience. Rather, they are statements about how the empirical world is supposed to work. As such, they need to be examined

Richard A. Berk is Professor and Hongquan Xu is Assistant Professor, Department of Statistics, University of California, Los Angeles, CA 90095 (E-mail: [email protected]).

concretely, with reference to the phenomena being studied; evaluated empirically whenever possible; and, at a minimum, subjected to sensitivity tests.

As battle-tested academics, we can certainly quibble here and there about some aspects of the article. For example, might it not have made sense to construct priors from the educators and economists who were responsible for designing the intervention? Moreover, might it not have made sense to interpret the results using a yardstick of treatment effects that are large enough to matter in practical terms? Nonetheless, we doubt that overall we could have done as well as BFHR did. Moreover, most of our concerns depend on features of the data that cannot be known from a distance; we would have had to carefully examine the data ourselves. As a result, we focus on why a virtuoso performance was needed to begin with. Why was this trip necessary?


2. NONCOMPLIANCE

The implementation of randomized field experiments, as desirable as they are, invites any number of well-known problems (Berk 1990). One of these problems is that human subjects often do not do what they are told, even when they say they will. There is a recent scholarly literature on noncompliance (Metry and Meyer 1999; Johnson 2000) and a citation trail leading back to evaluations of the social programs of the War on Poverty (Haveman 1977). Some noncompliance can be expected except when there is unusual control over the study subjects. Our recent randomized trial of the inmate classification system used by the California Department of Corrections is one example (Berk, Ladd, Graziano, and Baek 2003).

An obvious question follows: Why does it seem that the individuals who designed this experiment were caught by surprise? As best we can tell, little was done to reduce noncompliance, and, more important for the analyses of BFHR, apparently almost no data were collected on factors that might be explicitly related to noncompliance (Hill, Rubin, and Thomas 2000).

Without knowing far more about the design stages of the experiment, we find it difficult to be very specific. But from the literatures on compliance and school choice, the following are probably illustrative indicators relevant for parents who would not be inclined to use a school voucher; for other forms of noncompliance, other measures would be needed:

1. Whether the adults in the household are employed outside the home during the day

2. Whether there is any child care at the private schools before classes begin and after they end, for children of employed parents

3. A household's ability to pay for private school beyond the value of the voucher

4. Travel time to the nearest private schools

5. Availability of transportation to those schools

6. Religious preference of the household and religious affiliation of the private school

From these examples, and others that individuals closer to the study would know about, one can imagine a variety of measures to reduce noncompliance. For instance, key problems for any working parent are child care before and after school and round-trip school transportation. These might be addressed directly with supplemental support. If no such resources could be obtained, then the study design might be altered to screen out, before random assignment, households for whom compliance was likely to be problematic.

3. MISSING DATA

No doubt BFHR know more about the subtleties of the missing data than we do. Nevertheless, the same issues arise for missing data that arise for noncompliance. Surely problems with missing data could have been anticipated. Factors related to nonresponse to particular items could have been explicitly measured, and prevention strategies perhaps could have been deployed.

4. DESIGN EFFICIENCY

A carefully designed experiment not only provides a solid basis for statistical inference, but also gives credible evidence for causation. The New York School Choice Scholarships Program (NYSCSP) was apparently the first good example of a randomized experiment to examine the potential benefits of vouchers for private schools. Offering a scholarship (treatment) randomly through a lottery can (1) provide a socially acceptable way to ration scarce resources, (2) protect against undesirable selection bias, and (3) produce balance between the observed and unobserved variables that may affect the response. In addition to the randomization and blocking, NYSCSP implemented a PMPD (propensity matched-pairs design) to choose a control group from a large candidate pool and to balance many covariates explicitly. Table 2 of BFHR shows that various background variables were balanced properly between the treatment group and the control group.

But might it have been possible to do a bit better? BFHR did not mention that the sample size of the applicant's school was not balanced, although winners from high and low schools were balanced between the treatments and the controls. An ideal design would have had equal winners from "bad" (or low) and "good" (or high) public schools. In the NYSCSP, 85% of the winners were from bad schools and 15% were from good schools, to achieve the social goals of the School Choice Scholarships Foundation (Hill et al. 2000, p. 158). Bad schools are defined as "those for which the average test scores were below the median test scores for the city" (Hill et al. 2000, p. 158). A direct consequence is that the estimates from the good schools have much larger variation than those from the bad schools, which is evident in BFHR's Tables 4–6: the confidence intervals for good schools are wider.

Instead of classifying schools into two classes (bad and good), it might have been helpful to classify them into four classes: very bad, bad, good, and very good. To estimate the effect of applicant's school (assuming linear effects), an optimal design would assign half of the winners to the very bad schools, the other half to the very good schools, and none to the two middle classes. Avoiding the middle schools enlarges the distance between bad and good schools and also reduces the source variation within the bad and good schools, and thus reduces the variation in the estimates. To be more consistent with the goals of the School Choice Scholarships Foundation, an alternative design might take 75% of winners from very bad schools, 10% from bad schools, 5% from good schools, and 10% from very good schools. This design would have a higher design efficiency than the original design, especially because the applicant's school was the most important design variable after family size (Hill et al. 2000, p. 160). Of course, people on the ground at the time may have considered these options and rejected them. Our point is that with better planning, it might have been possible to squeeze more efficiency out of the study design. Such an outcome would have made the job undertaken by BFHR a bit easier.
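The logic can be quantified with the textbook formula Var(slope) ∝ 1/Σ(x_i − x̄)² for a linear effect. The following sketch compares the allocations discussed here under an illustrative coding of the four classes; the coding and the total sample size are our assumptions, not values from the study.

    import numpy as np

    # Var(slope) for a linear school-quality effect is proportional to
    # 1 / sum((x_i - xbar)^2).  Compare allocations of 1,000 winners across
    # four equally spaced classes x = 0, 1, 2, 3 (very bad, bad, good,
    # very good); coding and sample size are illustrative only.
    def rel_var(shares, n=1000, x=np.array([0.0, 1.0, 2.0, 3.0])):
        counts = np.rint(np.asarray(shares) * n).astype(int)
        xs = np.repeat(x, counts)
        return 1.0 / np.sum((xs - xs.mean()) ** 2)

    print(rel_var([0.00, 0.85, 0.15, 0.00]))  # two-class 85/15 split, middle levels
    print(rel_var([0.50, 0.00, 0.00, 0.50]))  # optimal for a linear effect: extremes
    print(rel_var([0.75, 0.10, 0.05, 0.10]))  # the 75/10/5/10 alternative above

Under this coding, the extremes-only allocation has the smallest slope variance, and the 75/10/5/10 alternative is markedly more efficient than the middle-levels 85/15 split, which is the sense in which avoiding the middle enlarges the usable spread.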

5. CONCLUSIONS

Scientific problems often provide a fertile ground for important statistical developments, and BFHR's article and the literature on which it depends are excellent illustrations. However, from the perspective of public policy, the goal of program evaluation is to obtain usefully precise estimates of program impacts while using the most robust means of analysis available. Simplicity is also important, both for data analysts who must do the work and for policy makers who must interpret the results. It follows that large investments at the front end of an evaluation—in research design, measurement, and study integrity—are essential. Virtuoso statistical analyses can certainly help when these investments are insufficient or when they fail, but they must not be seen as an inexpensive alternative to doing the study well to begin with.

ADDITIONAL REFERENCES

Angrist, J., Imbens, G. W., and Rubin, D. B. (1996), "Identification of Causal Effects Using Instrumental Variables" (with discussion), Journal of the American Statistical Association, 91, 444–472.

Berk, R. A. (1990), "What Your Mother Never Told You About Randomized Field Experiments," in Community-Based Care of People With AIDS: Developing a Research Agenda, AHCPR Conference Proceedings, Washington, DC: U.S. Department of Health and Human Services.

Berk, R. A., Ladd, H., Graziano, H., and Baek, J. (2003), "A Randomized Experiment Testing Inmate Classification Systems," Journal of Criminology and Public Policy, 2, 215–242.

Haveman, R. H. (ed.) (1977), A Decade of Federal Antipoverty Programs: Achievements, Failures, and Lessons, New York: Academic Press.

Hill, J. L., Rubin, D. B., and Thomas, N. (2000), "The Design of the New York School Choice Scholarships Program Evaluation," in Research Design: Donald Campbell's Legacy (Vol. II), ed. L. Bickman, Thousand Oaks, CA: Sage.

Johnson, S. B. (2000), "Compliance Behavior in Clinical Trials: Error or Opportunity?," in Promoting Adherence to Medical Treatment in Chronic Childhood Illness: Concepts, Methods, and Interventions, ed. D. Drotar, Mahwah, NJ: Lawrence Erlbaum.

Metry, J., and Meyer, U. A. (eds.) (1999), Drug Regimen Compliance: Issues in Clinical Trials and Patient Management, New York: Wiley.

Rejoinder
John BARNARD, Constantine E. FRANGAKIS, Jennifer L. HILL, and Donald B. RUBIN

We thank the editorial board and the thoughtful group of discussants that they arranged to comment on our article. The discussants have raised many salient issues and offered useful extensions of our work. We use this rejoinder primarily to address a few points of contention.

Muthén, Jo, and Brown

Muthén, Jo, and Brown (henceforth MJB) provide an enlightening discussion of an alternate and potentially complementary approach to treatment effect estimation. The growth models that they discuss are for data more extensive than what we analyzed, and so are beyond the scope of our article. Nevertheless, MJB propose an interesting idea for looking for interactions in longitudinal experimental data, and we look forward to future applications of the method to appropriate datasets.

Although models that incorporate both our compliance principal strata and latent growth trajectories are theoretically possible (assuming multiple years of outcome data) and could potentially yield an extremely rich description of the types of treatment effects manifested, the more time points that exist, the more complicated the compliance patterns can become (at least without further assumptions), and it may be difficult to find a dataset rich enough to support all of the interactions. Given a choice, we feel that it is probably more beneficial to handle real observed issues than to search for interactions with latent trajectories, but certainly it would always be nice to be able to do both. Of course, such a search for interactions in trajectories is more beneficial when there is enough prior knowledge about the scientific background to include additional structural assumptions that can support robust estimation.

MJB make an interesting point about the exploration of potential violations of exclusion for never-takers. Such analyses have been done in past work by some of the authors (Hirano et al. 2000; Frangakis, Rubin, and Zhou 2002), but this was not pursued for our study because the robustness of such methods is still an issue in examples as complex as this one.

Regarding the trade-offs between the joint latent ignorability and compound exclusion assumptions versus the joint assumptions of standard ignorability and the standard exclusion restriction, this topic is up for debate. We would like to point out, however, that these standard structural assumptions come with their own level of arbitrariness. These issues have been explored by Jo (2002) and Mealli and Rubin (2002, 2003).

We thank MJB for providing a thought-provoking discussion of our model and assumptions, as well as ample fodder for additional directions that can be explored in this arena. Clearly, this general category of work has stimulated much new and exciting modeling.

Krueger and Zhu

Krueger and Zhu (henceforth KZ) focus primarily on the underlying study and on analyses that either we did not do or that used data we did not have. Clearly, they are interested in the substantive issues, which is admirable, but we cannot be responsible for analyses that we neither did nor were consulted on, and we are not prepared to comment on what analyses we might have undertaken with data that we did not have.

KZ state that "there is no substitute for probing the definition and concepts that underlie the data." Although we are not entirely sure what this refers to in particular, if it means that good science is more important than fancy techniques, then we certainly agree.

Another comment by KZ is that "it is desirable to use the most recent comprehensive data for the widest sample possible." We agree that it is desirable to have access to the most comprehensive data available, but it is not necessary to use all of these data in any given analysis. Here we used only first-year data, because these were the only data available to us at the time that we were building the model. We used only single-child families, as explained in the article, because, regrettably, no compliance data were collected for individual children in the multichild families.

The exclusion of kindergarten children could be considered more of a judgment call regarding the quality of the pretest scores, not only for efficiency, but also for the reasonableness of our assumptions about noncompliance and missing data. Nothing in theory would prevent us from including these children and omitting these extra covariates.

These types of trade-offs (more data vs. more appropriate data) are common in statistical analyses, and the benefits are well known, for example, for avoiding model extrapolation or when some subsets of data do not mean the same thing as others. In fact, Krueger "ignored data" in his article on the minimum wage with David Card (Card and Krueger 1994) by limiting analyses to restaurants from a similar geographic region, in the hope that employees and employers across the two comparison states would be faced with similar economic conditions, rather than looking at all restaurants in these states. This is a form of implicit matching, which we believe represented a good choice. Surely there are times when focusing on subsets of data produces more reliable results.

Furthermore, we would like to point out that KZ's assertion that "because assignment to treatment status was random (within strata), a simple comparison of means between treatments and controls, without conditioning on baseline scores, provides an unbiased estimate of the average treatment effect" is simply false, because there are missing outcomes. Even if there is perfect compliance, a simple comparison of means across treatment groups when there are missing outcomes will not lead, in general, to an unbiased estimate of the treatment effect. Assumptions need to be made about the missing-data process.
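A toy simulation makes the point concrete; the missingness mechanism and all numbers are invented purely for illustration.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 200_000
    z = rng.integers(0, 2, n)                   # randomized offer, perfect compliance
    y = 50 + 2.0 * z + rng.normal(0, 10, n)     # true average treatment effect = 2

    # Nonignorable missingness: low scorers respond less often when treated.
    p_respond = np.where(z == 1, np.clip(0.5 + 0.01 * (y - 50), 0, 1), 0.7)
    r = rng.random(n) < p_respond               # True if the outcome is observed

    naive = y[(z == 1) & r].mean() - y[(z == 0) & r].mean()
    print(naive)  # noticeably larger than 2: the complete-case contrast is biased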

When noncompliance exists in addition to the missing outcome data, Frangakis and Rubin (1999) have shown that estimation even of the ITT effect requires additional assumptions. Moreover, the ITT estimate, generally considered a conservative estimate of the treatment effect, can actually be anticonservative in this scenario. Finally, contrary to what KZ seem to imply in their final paragraph, instrumental variables methods alone are not sufficient to address the problems of both noncompliance and missing data.

KZ advocate "simplicity and transparency." We wholeheartedly agree, but disagree with the examples they use to defend their stance. Their first argument is that the PMPD did not yield strong efficiency gains and that we did not account for the PMPD in the calculation of standard errors. It may well be that precision gains were relatively modest; however, the analyses that they present are by no means conclusive in this regard. First, as with KZ's other ITT analyses, we have no clue as to how missing outcome data are handled. In addition, the PMPD was also used to balance compliance and missingness. It is unclear, however, how the gains in precision due to matching on the missingness and noncompliance are displayed, if at all, in KZ's results. Finally, KZ did not account for the experimental design when estimating their results. Failure to account properly for the design typically results in estimates that overstate the variances of treatment effect estimates. In this design, it is the correlation between matched-pair units that leads the sampling variances to be smaller than their estimates.

Even if precision gains were truly modest, they of course were unknown to us at the time we debated the use of the PMPD design. KZ's argument is like the complaint of a cross-country motorist who experiences no car problems during his long drive and then bemoans the money wasted on a spare tire purchased before the trip began. We prefer to avoid potential problems by using careful design.

Importantly, KZ's criticism that we ignored this matching in the analysis is incorrect. The matching was done based on the estimated propensity score, so it follows from Rubin (1978) that assignment is ignorable conditionally given the covariates that were used to estimate it. Thus the Bayesian analysis that we conduct, which conditions on the estimated propensity score, is an approximately valid Bayesian analysis with respect to the matched assignment, assuming that our modeled relationship between the outcomes and the estimated propensity score captures the major features of the relationship to the covariates.

KZ's second argument is that use of the PMPD "led to grave consequences" because sampling weights were miscalculated by MPR. It seems odd to us to maintain, as KZ do, that the PMPD "caused" this error; nor is it clear that use of another design would have avoided the error. In any case, we have rerun our analyses, this time generalizing to the population of which the children in our analyses are representative (by setting constant weights in step 3 of Sec. 8.1.1). For this population, all results are very similar to our original results. These results are posted as Appendix C at http://biosun01.biostat.jhsph.edu/~cfrangak/papers/sc.

A third argument revolves around the protracted debate over the use of ITT versus treatment-targeted measures such as CACE. We addressed both ITT and CACE, which allows the reader to focus on one or the other or both, depending on which effect is expected to generalize to another environment: the ITT, or "sociological," effect of accepting or not accepting an offered voucher, or the CACE, or "structural" or "scientific," effect of using a voucher. One could argue that the effect of attending (CACE) is the more generalizable, whereas the sociological effect (ITT) is more easily influenced, for instance, by "hype" about how well the program works. That is, the ITT estimate will not reflect the "effect of offering" if, the next time the program is offered, the compliance rates change (which seems likely). We prefer to examine both, to provide a more comprehensive picture.

In sum, we agree that simplicity and transparency are important; this is the major motivation for our detailed description of our assumptions in terms of readily understandable concepts, rather than commonly invoked and less intuitive conditions involving correlated error terms. However, this desire for simplicity must be balanced with the prudence of guarding against potential problems through careful design.

Although we agree in theory with much of what KZ espouse, some of their criticisms seem misguided. Moreover, their presentation of analyses with incomplete exposition regarding missing-data complications makes these analyses difficult for us to evaluate properly.


Berk and Xu

Berk and Xu (henceforth BX) raise several interesting questions about the way in which the evaluation was designed and implemented. Many of these are actually addressed in a preceding and lengthier article by us on this topic (Barnard et al. 2002). That article describes, for instance, MPR's efforts to reduce missing data, without which the missingness rates would surely have been much higher. It also describes special challenges faced in this study. But in any study with human subjects, it is difficult to completely eliminate missing data, particularly if one desires more interesting information than can be obtained from administrative data. One of the challenges for the collection of outcome test scores was the fact that children needed to be tested outside of the public school setting, which allowed the use of a standardized test not used by the public school system, thereby greatly reducing the possibility of bias resulting from teachers "teaching to the test." Consequently, parents had to bring their children to testing areas on weekends—a formidable hurdle for some, even when monetary incentives were used.

An important consideration with this study was that the evaluators took the initiative to investigate a program that was sponsored by an organization (SCSF) whose primary goal was to provide scholarships, not to do research. We feel that MPR did an admirable job convincing the SCSF to allow the study to be evaluated and raising money for this purpose, but this also meant that the evaluators could not control every aspect of the program. For example, the SCSF originally wanted to make the program available only to children from the lower-test-score schools, and the evaluators had to convince them to also allow a small percentage from those in higher-test-score schools, to increase the generalizability of the results. The designs that BX suggest, even though they have certain desirable statistical properties under appropriate assumptions, were not an option with this study.

Related to this issue of the alignment of the evaluators' and program administrators' goals is the issue of the potential for reducing noncompliance rates. Some effort was made, in that help was provided to scholarship winners in finding an appropriate private school. Arguably, however, 100% compliance should not be the goal in such a study, given that we would not expect 100% compliance if the program were expanded and offered, say, by the government on a larger scale. Thus it may not be wise to use extreme efforts to increase compliance if such efforts would not be offered in the large-scale public case. With the current study, the opportunity exists to examine the noncompliance process for this program in this setting: take-up rates, predictors of noncompliance, and so on. To this end, some sort of subexperiment that randomizes a reasonable set of compliance incentives might be useful in similar studies in the future, if sample sizes are sufficient.

We are not convinced that we should have solicited actual prior distributions from educators and economists. In our experience, this is a confusing, difficult, and often unrewarding task. Moreover, school choice is such a politically charged and contentious issue that it seems wiser to try to use relatively diffuse and objective prior distributions.

We agree that results should be interpreted in terms of a yardstick of practical effects. We are unaware of the most appropriate such yardstick to use in this context. One side could claim that even tiny average effects matter, not only because of the cumulative effects on parts of society, but also because of possible nonadditivity of such effects, which could imply huge effects for some rare individuals and minute effects for all others. Others could argue that seemingly quite large effects have no long-run practical impact on quality of life. The results that we present provide a basic empirical starting point for these debates between people whose expertise is in this area.

We thank BX for highlighting our emphasis on the role of the assumptions in our analyses. Our hope is that we have presented the assumptions in such a way as to enable readers to make their own judgment about whether or not they were likely to be satisfied, rather than only stating them technically, which would implicitly make those important decisions for the interested but nontechnical readers.

BX point out a dearth of certain types of covariates. Part of the problem here is that we presented only a small portion of the data collected, because our model in its current form did not handle many covariates. However, there were indeed discussions between some members of our team and MPR staff during the development of the questionnaire about what types of variables could be included that would be predictive of missing data or compliance, as well as outcomes. Some of the variables proposed were discarded due to the surveyors' experience with them as subject to problems such as high measurement error. Also, not every variable discussed was included, due to a desire for a survey that would not be overly burdensome for the study families.

We raise a final, small, semantic question. We wonder about the propriety of the label "Neyman–Rubin model" to describe a framework for causal inference that views observational studies and randomized experiments on the same continuum and, moreover, encourages Bayesian analyses of both. Neyman (1923) certainly seems to have been the first to use the formal notation of potential outcomes for randomization-based inference in randomized experiments, which was a truly major advance; but neither he nor anyone else, to the best of our knowledge, used this formal notation to define causal effects in observational studies until Rubin (1974, 1975), nor did he nor anyone else ever advocate Bayesian inference for the unobserved potential outcomes until Rubin (1975, 1978), which is the perspective that we have implemented here.

Overall, we agree with BX about the paramount importance of study design, and we tried to design this study carefully, certainly more so than most social science analyses of secondary datasets like the PSID or CPS. We thank BX for their thoughtful commentary; they raise some important points.

We thank all of the discussants for stimulating comments, which we believe have added emphasis to the importance of both the substantive topic of school vouchers and the development of appropriate methods to evaluate this and other such broken randomized experiments.

ADDITIONAL REFERENCES

Card, D., and Krueger, A. (1994), "Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania," American Economic Review, 84, 772–793.


Jo, B. (2002), "Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications" (with discussion), Journal of Educational and Behavioral Statistics, 27, 385–420.

Mealli, F., and Rubin, D. B. (2002), Discussion of "Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications," by B. Jo, Journal of Educational and Behavioral Statistics, 27, 411–415.

Mealli, F., and Rubin, D. B. (2003), "Assumptions Allowing the Estimation of Direct Causal Effects," Discussion of "Healthy, Wealthy, and Wise? Tests for Direct Causal Paths Between Health and Socioeconomic Status," by Adams et al., Journal of Econometrics, 112, 79–87.

Rubin, D. B. (1975), "Bayesian Inference for Causality: The Importance of Randomization," in Proceedings of the Social Statistics Section, American Statistical Association, pp. 233–239.


3. DATA COMPLICATIONS

The data that we use include the design variables, the background survey collected at the verification sessions, pretest data, and posttest data collected the following spring. The test scores used as outcomes in our analyses are grade-normed national percentile rankings of reading and math scores from the Iowa Test of Basic Skills (ITBS). The ITBS was used because it is not the test administered in the New York City public school system, which reduces the possibility of teachers in schools with participating children "teaching to the test."

Attempts to reduce missingness of data included requiring attendance at the initial verification sessions and providing financial incentives to attend the follow-up testing sessions. Despite these attempts, missingness of background variables did occur before randomization. In principle, such missingness is also a covariate and so does not directly create imbalance of subjects between randomized arms, although it does create loss in efficiency when, as in this study, background covariates are used in the analysis. For example, for single-child families, depending on application period and background strata, 34%–51% of the children's pretest scores were missing at the time of design planning. Since then, MPR has processed and provided an additional 7% and 14% of the reading and mathematics pretest scores. These scores were not as balanced between arms as when choosing the design (see Table 2), although the difference was not statistically significant at conventional levels. Hence we used all available pretest scores in the final analysis, which is conditional on these scores.

Outcomes were also incomplete. Among the observed outcomes in single-child families, the average (standard deviation) was 23.5 (22.5) percentile rankings for mathematics and 28.1 (23.9) percentile rankings for reading, and unadjusted comparisons between randomized arms do not point to a notable difference. However, in contrast to missingness of background variables, missingness of outcomes that occurs after randomization is not guaranteed to be balanced between the randomized arms. For example, depending on application period and background strata, 18%–27% of the children did not provide posttest scores, and during design periods 2–5 the response is higher among scholarship winners (80%) than among the other children (73%). Analyses that would be limited to complete cases for these variables and the variables used to calculate the propensity score would discard more than half of the units. Moreover, standard adjustments for outcome missingness ignore its potential interaction with the other complications and generally make implicit and unrealistic assumptions.

Another complication was noncompliance: attendance at a private school was not perfectly correlated with winning a scholarship. For example, for single-child families, and depending on application period and background strata, 20%–27% of children who won scholarships did not use them (scholarships were in the amount of $1,400, which generally did not fully cover tuition at a private school), and 6%–10% of children that did not win scholarships were sent to private schools nevertheless.

Two additional complications limit our analysis sample. First, no pretest scores were obtained for applicants in kindergarten, because these children most likely had never been exposed to a standardized test; hence considerable time would have been spent instructing them on how to take a test, and there was concern that separating such young children from their guardians in this new environment might lead to behavioral issues. Hence we focus our analyses on the children who applied in first grade or above. Second, we do not yet have complete compliance data for the multichild families. Consequently, all analyses reported in this article are further limited to results for the 1,050 single-child families who were in grades 1–4 at the time of the spring 1997 application process.

4. THE STUDY AS A BROKEN RANDOMIZED EXPERIMENT

The foregoing deviations from the study's protocol clarify that our experiment does not really randomize attendance, but rather randomizes the "encouragement" (using financial support) to attend a private school rather than a public school. Moreover, as in most encouragement studies, interest here focuses not only on the effect of encouragement itself (which will depend on what percentage of people encouraged would actually participate if the voucher program were to be more broadly implemented), but also on the effect of the treatment being encouraged, here attending private versus public schools. If there were perfect compliance, so that all those encouraged to attend private school actually did so and all those not so encouraged stayed in public school, then the effect being estimated typically would be attributed to private versus public school attendance, rather than simply to the encouragement.

We focus on defining and estimating two estimands: the intention-to-treat (ITT) effect, the effect of the randomized encouragement on all subjects; and the complier average causal effect (CACE), the effect of the randomized encouragement on all subjects who would comply with their treatment assignment no matter which assignment they would be given (here, the children who would have attended private school if they had won a scholarship and would not have attended had they not won a scholarship). These quantities are defined more formally in Section 8.

In recent years there has been substantial progress in the analysis of encouragement designs, based on building bridges between statistical and econometric approaches to causal inference. In particular, the widely accepted approach in statistics to formulating causal questions is in terms of "potential outcomes." Although this approach has roots dating back to Neyman and Fisher in the context of perfect randomized experiments (Neyman 1923; Rubin 1990), it is generally referred to as Rubin's causal model (Holland 1986) for work extending the framework to observational studies (Rubin 1974, 1977) and including modes of inference other than randomization-based, in particular Bayesian (Rubin 1978a, 1990). In economics, the technique of instrumental variables, due to Tinbergen (1930) and Haavelmo (1943), has been a main tool of causal inference in the type of nonrandomized studies prevalent in that field. Angrist, Imbens, and Rubin (1996) showed how the approaches can be viewed as completely compatible, thereby clarifying and strengthening each approach. The result was an interpretation of the instrumental variables technology as a way to approach a randomized experiment that suffers from noncompliance, such as a randomized encouragement design.


In encouragement designs with compliance as the only partially uncontrolled factor and where there are full outcome data, Imbens and Rubin (1997) extended the Bayesian approach to causal inference of Rubin (1978a) to handle simple randomized experiments with noncompliance, and Hirano, Imbens, Rubin, and Zhou (2000) further extended the approach to handle fully observed covariates.

In encouragement designs with more than one partially uncontrolled factor, as with noncompliance and missing outcomes in our study, defining and estimating treatment effects of interest becomes more challenging. Existing methods (e.g., Robins, Greenland, and Hu 1999) are designed for studies that differ from our study in the goals and the degree of control of the aforementioned factors. (For a detailed comparison of such frameworks, see Frangakis and Rubin 2002.) Frangakis and Rubin (1999) studied a more flexible framework for encouragement designs with both noncompliance to treatment and missing outcomes, and showed that for estimation of either the ITT effect or the CACE, one cannot in general obtain valid estimates using standard ITT analyses (i.e., analyses that ignore data on compliance behavior) or standard IV analyses (i.e., those that ignore the interaction between compliance behavior and outcome missingness). They also provided consistent moment-based estimators that can estimate both ITT and CACE under assumptions more plausible than those underlying more standard methods.

Barnard et al. (1998) extended that template to allow for missing covariate and multivariate outcome values; they stopped short, however, of introducing specific methodology for this framework. We present a solution to a still more challenging situation, in which we have a more complicated form of noncompliance: some children attend private school without receiving the monetary encouragement (thus receiving treatment without having been assigned to it). Under assumptions similar to those of Barnard et al. (1998), we next fully develop a Bayesian framework that yields valid estimates of quantities of interest and also properly accounts for our uncertainty about these quantities.

5. PRINCIPAL STRATIFICATION IN SCHOOL CHOICE AND ROLE FOR CAUSAL INFERENCE

To make explicit the assumptions necessary for valid causal inference in this study, we first introduce "potential outcomes" (see Rubin 1979; Holland 1986) for all of the posttreatment variables. Potential outcomes for any given variable represent the observable manifestations of this variable under each possible treatment assignment. In particular, if child i in our study (i = 1, ..., n) is to be assigned to treatment z (1 for private school and 0 for public school), we denote the following: D_i(z) for the indicator equal to 1 if the child attends private school and 0 if the child attends public school; Y_i(z) for the potential outcomes if the child were to take the tests; and Ry_i(z) for the indicators equal to 1 if the child takes the tests. We denote by Z_i the indicator equal to 1 for the observed assignment to private school or not, and let D_i = D_i(Z_i), Y_i = Y_i(Z_i), and Ry_i = Ry_i(Z_i) designate the actual type of school, the outcome to be recorded by the test, and whether or not the child takes the test under the observed assignment.

The notation for these and the remaining definitions of relevant variables in this study is summarized in Table 3. In our study the outcomes are reading and math test scores, so the dimension of Y_i(z) equals two, although more generally our template allows for repeated measurements over time.

Our design variables are application period, low/high indicators for the applicant's school, grade level, and propensity score. In addition, corresponding to each set of these individual-specific random variables is a variable (vector or matrix) without subscript i that refers to the set of these variables across all study participants. For example, Z is the vector of treatment assignments for all study participants, with ith element Z_i, and X is the matrix of partially observed background variables, with ith row X_i.

The variable C_i (see Table 3), the joint vector of treatment receipt under both treatment assignments, is particularly important. Specifically, C_i defines four strata of people: compliers, who take the treatment if so assigned and take control if so assigned; never takers, who never take the treatment no matter the assignment; always takers, who always take the treatment no matter the assignment; and defiers, who do the opposite of the assignment no matter its value. These strata are not fully observed, in contrast to observed strata of actual assignment and attendance (Z_i, D_i). For example, children who are observed to attend private school when winning the lottery are a mixture of compliers (C_i = c) and always takers (C_i = a). Such explicit stratification on C_i dates back at least to the work of Imbens and Rubin (1997) on randomized trials with noncompliance, was generalized and taxonomized by Frangakis and Rubin (2002) to posttreatment variables in partially controlled studies, and is termed a "principal stratification" based on the posttreatment variables.

Table 3. Notation for the ith Subject

Notation        Specifics                                         General description
Z_i             1 if i assigned treatment;                        Binary indicator of treatment assignment
                0 if i assigned control
D_i(z)          1 if i receives treatment under assignment z;     Potential outcome formulation of treatment receipt
                0 if i receives control under assignment z
D_i             D_i(Z_i)                                          Binary indicator of treatment receipt
C_i             c if D_i(0) = 0 and D_i(1) = 1;                   Compliance principal stratum: c = complier,
                n if D_i(0) = 0 and D_i(1) = 0;                   n = never taker, a = always taker, d = defier
                a if D_i(0) = 1 and D_i(1) = 1;
                d if D_i(0) = 1 and D_i(1) = 0
Y_i(z)          (Y_i^(math)(z), Y_i^(read)(z))                    Potential outcomes for math and reading
Y_i             (Y_i^(math)(Z_i), Y_i^(read)(Z_i))                Math and reading outcomes under observed assignment
Ry_i^(math)(z)  1 if Y_i^(math)(z) would be observed;             Response indicator for Y_i^(math)(z) under assignment z;
                0 if it would not be observed                     similarly for Ry_i^(read)(z)
Ry_i(z)         (Ry_i^(math)(z), Ry_i^(read)(z))                  Vector of response indicators for Y_i(z)
Ry_i            (Ry_i^(math)(Z_i), Ry_i^(read)(Z_i))              Vector of response indicators for Y_i
W_i             (W_i1, ..., W_iK)                                 Fully observed background and design variables
X_i             (X_i1, ..., X_iQ)                                 Partially observed background variables
Rx_i            (Rx_i1, ..., Rx_iQ)                               Vector of response indicators for X_i

The principal strata C_i have two important properties. First, they are not affected by the assigned treatment. Second, comparisons of potential outcomes under different assignments within principal strata, called principal effects, are well-defined causal effects (Frangakis and Rubin 2002). These properties make principal stratification a powerful framework for evaluation, because it allows us to explicitly define estimands that better represent the effect of attendance in the presence of noncompliance, and to explore richer and explicit sets of assumptions that allow estimation of these effects under conditions more plausible than the standard ones. Sections 6 and 7 discuss such a set of more flexible assumptions and estimands.
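To make the strata concrete, the following minimal Python sketch (ours, not part of the original article) maps the joint potential receipts (D_i(0), D_i(1)) to principal strata and lists the strata compatible with each observed (Z_i, D_i) cell; only one potential receipt is ever observed, so the observed cells identify mixtures.

```python
# Minimal sketch (not from the article): principal strata as a function of
# the joint potential treatment receipt (D_i(0), D_i(1)).

STRATA = {(0, 1): "complier", (0, 0): "never taker",
          (1, 1): "always taker", (1, 0): "defier"}

def principal_stratum(d0: int, d1: int) -> str:
    """Map potential receipts under control (d0) and treatment (d1) to a stratum."""
    return STRATA[(d0, d1)]

def compatible_strata(z: int, d: int, monotonicity: bool = True) -> set:
    """Strata consistent with observed assignment z and receipt d.
    Because only D_i(z) is observed, strata are only partially identified."""
    strata = {s for (d0, d1), s in STRATA.items() if (d1 if z == 1 else d0) == d}
    if monotonicity:          # rule out defiers, as in Section 6.3
        strata.discard("defier")
    return strata

# Example: lottery winners observed in private school are a mixture of
# compliers and always takers, exactly as the text notes.
print(compatible_strata(z=1, d=1))  # {'complier', 'always taker'}
print(compatible_strata(z=0, d=0))  # {'complier', 'never taker'}
```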

6. STRUCTURAL ASSUMPTIONS

First we state explicitly our structural assumptions about the data with regard to the causal process, the missing-data mechanism, and the noncompliance structure. These assumptions are expressed without reference to a particular parametric distribution, and are the ones that make the estimands of interest identifiable, as also discussed in Section 6.4.

6.1 Stable Unit Treatment Value Assumption

A standard assumption made in causal analyses is the stable unit treatment value assumption (SUTVA), formalized with potential outcomes by Rubin (1978a, 1980, 1990). SUTVA combines the no-interference assumption (Cox 1958), that one unit's treatment assignment does not affect another unit's outcomes, with the assumption that there are "no versions of treatments." For the no-interference assumption to hold, whether or not one family won a scholarship should not affect another family's outcomes, such as their choice to attend private school or their children's test scores. We expect our results to be robust to the types and degree of deviations from no interference that might be anticipated in this study. To satisfy the "no versions of treatments" assumption, we need to limit the definition of private and public schools to those participating in the experiment. Generalizability of the results to other schools must be judged separately.

6.2 Randomization

We assume that scholarships have been randomly assigned. This implies that

p(Z | Y(1), Y(0), X, W, C, Ry(0), Ry(1), Rx, \theta) = p(Z | W^*, \theta) = p(Z | W^*),

where W^* represents the design variables in W and \theta is generic notation for the parameters governing the distribution of all the variables. There is no dependence on \theta, because there are no unknown parameters governing the treatment-assignment mechanism: MPR assigned the scholarships by lottery, and the randomization probabilities within strata of applicant's school (low/high) and application period are known.

6.3 Noncompliance Process Assumptions: Monotonicity and Compound Exclusion

We assume monotonicity: that there are no "defiers," that is, for all i, D_i(1) \geq D_i(0) (Imbens and Angrist 1994). In the SCSF program, defiers would be families who would not use a scholarship if they won one, but would pay to go to private school if they did not win a scholarship. It seems implausible that such a group of people exists; therefore, the monotonicity assumption appears to be reasonable for our study.

By definition, the never takers and always takers will participate in the same treatment (control or treatment) regardless of which treatment they are randomly assigned. For this reason, and to facilitate estimation, we assume compound exclusion: the outcomes and missingness of outcomes for never takers and always takers are not affected by treatment assignment. This compound exclusion restriction (Frangakis and Rubin 1999) generalizes the standard exclusion restriction (Angrist et al. 1996; Imbens and Rubin 1997) and can be expressed formally for the distributions as

p(Y(1), Ry(1) | X^{obs}, Rx, W, C = n) = p(Y(0), Ry(0) | X^{obs}, Rx, W, C = n)   for never takers

and

p(Y(1), Ry(1) | X^{obs}, Rx, W, C = a) = p(Y(0), Ry(0) | X^{obs}, Rx, W, C = a)   for always takers.

Compound exclusion seems more plausible for never takers than for always takers in this study. Never takers stayed in the public school system whether they won a scholarship or not. Although disappointment about winning a scholarship but still not being able to take advantage of it can exist, it is unlikely to cause notable differences in subsequent test scores or response behaviors.

Always takers, on the other hand, might have been in one private school had they won a scholarship or in another if they had not, particularly because those who won scholarships had access to resources to help them find an appropriate private school and had more money (up to $1,400) to use toward tuition. In addition, even if the child had attended the same private school under either treatment assignment, the extra $1,400 in family resources for those who won the scholarship could have had an effect on student outcomes. However, because in our application the estimated percentage of always takers is so small (approximately 9%), an estimate that is robust, due to the randomization, to relaxing compound exclusion, there is reason to believe that this assumption will not have a substantial impact on the results.

Under the compound exclusion restriction, the ITT comparison of all outcomes Y_i(0) versus Y_i(1) includes the null comparison among the subgroups of never takers and always takers. Moreover, by monotonicity, the compliers (C_i = c) are the only group of children who would attend private school if and only if offered the scholarship. For this reason, we take the CACE (Imbens and Rubin 1997), defined as the comparison of outcomes Y_i(0) versus Y_i(1) among the principal stratum of compliers, to represent the effect of attending public versus private school.

6.4 Latent Ignorability

We assume that potential outcomes are independent of missingness given observed covariates, conditional on the compliance strata; that is,

p(Ry(0), Ry(1) | Rx, Y(1), Y(0), X^{obs}, W, C, \theta) = p(Ry(0), Ry(1) | Rx, X^{obs}, W, C, \theta).

This assumption represents a form of latent ignorability (LI) (Frangakis and Rubin 1999), in that it conditions on variables that are (at least partially) unobserved, or latent: here, the compliance principal strata C. We make it here, first, because it is more plausible than the assumption of standard ignorability (SI) (Rubin 1978a; Little and Rubin 1987) and, second, because making it leads to different likelihood inferences.

LI is more plausible than SI to the extent that it provides a closer approximation to the missing-data mechanism. The intuition behind this assumption in our study is that, for a subgroup of people with the same covariate missing-data patterns Rx, similar values for covariates observed in that pattern X^{obs}, and the same compliance stratum C, a flip of a coin could determine which of these individuals shows up for a posttest. This is a more reasonable assumption than SI, because it seems quite likely that, for example, compliers would exhibit different attendance behavior for posttests than, say, never takers (even conditional on other background variables). Explorations of raw data from this study across individuals with known compliance status provide empirical support that C is a strong factor in outcome missingness, even when other covariates are included in the model. This fact is also supported by the literature for noncompliance (see, e.g., The Coronary Drug Project Research Group 1980).

Regarding improved estimation, when LI (and the preceding structural assumptions) hold but the likelihood is constructed assuming SI, the underlying causal effects are identifiable (alternatively, the posterior distribution with increasing sample size converges to the truth) only if the additional assumption is made that, within subclasses of subjects with similar observed variables, the partially missing compliance principal stratum C is not associated with potential outcomes. However, as noted earlier, this assumption is not plausible.

Theoretically, only the structural assumptions described earlier are needed to identify the underlying ITT and CACE causal effects (Frangakis and Rubin 1999). Estimation based solely on those identifiability relations in principle requires very large sample size and explicit conditioning (i.e., stratification) on the subclasses defined by the observed part of the covariates, X^{obs}, and the pattern of missingness of the covariates, Rx, as well as implicit conditioning (i.e., deconvolution of mixtures) on the sometimes missing compliance principal stratum, C. This works in large samples because covariate missingness and compliance principal stratum are also covariates (i.e., defined pretreatment), so samples within subclasses defined by X^{obs}, Rx, and C themselves represent randomized experiments. With our experiment's sample size, however, performing completely separate analyses on all of these strata is not necessarily desirable or feasible. Therefore, we consider more parsimonious modeling approaches, which have the role of assisting, not creating, inference, in the sense that results should be robust to different parametric specifications (Frangakis and Rubin 2001, rejoinder).

7. PARAMETRIC PATTERN MIXTURE MODEL

Generally speaking, constrained estimation of separate analyses within missing-data patterns is the motivation behind pattern mixture modeling. Various authors have taken pattern mixture model approaches to missing data, including Little (1993, 1996), Rubin (1978b), and Glynn, Laird, and Rubin (1993). Typically, pattern mixture models partition the data with respect to the missingness of the variables. Here we partition the data with respect to the covariate missing-data patterns Rx, as well as compliance principal strata C, design variables W, and the main covariates X^{obs}. This represents a partial pattern mixture approach. One argument in favor of this approach is that it focuses the model on the quantities of interest in such a way that parametric specifications for the marginal distributions of Rx, W, and X^{obs} can be ignored. To capitalize on the structural assumptions, consider the factorization of the joint distribution for the potential outcomes and compliance strata, conditional on the covariates and their missing-data patterns:

p(Y_i(0), Y_i(1), Ry_i(0), Ry_i(1), C_i | W_i, X_i^{obs}, Rx_i, \theta)
    = p(C_i | W_i, X_i^{obs}, Rx_i, \theta^C)
      \times p(Ry_i(0), Ry_i(1) | W_i, X_i^{obs}, Rx_i, C_i, \theta^R)
      \times p(Y_i(0), Y_i(1) | W_i, X_i^{obs}, Rx_i, C_i, \theta^Y),

where the product in the last line follows by latent ignorability and \theta = (\theta^C, \theta^R, \theta^Y)'. Note that the response pattern of covariates for each individual is itself a covariate. The parametric specifications for each of these components are described next.

7.1 Compliance Principal Stratum Submodel

The compliance status model contains two conditional probit models, defined using indicator variables C_{ic} and C_{in} for whether individual i is a complier or a never taker:

C_{in} = 1   if   C_{in}^* \equiv g_1(W_i, X_i^{obs}, Rx_i)' \beta^{(C,1)} + V_i \leq 0

and

C_{ic} = 1   if   C_{in}^* > 0   and   C_{ic}^* \equiv g_0(W_i, X_i^{obs}, Rx_i)' \beta^{(C,2)} + U_i \leq 0,

where V_i \sim N(0, 1) and U_i \sim N(0, 1) independently.

The link functions g_0 and g_1 attempt to strike a balance between, on the one hand, including all of the design variables as well as the variables regarded as most important either in predicting compliance or in having interactions with the treatment effect and, on the other hand, maintaining parsimony. The results discussed in Section 8 use a compliance component model whose link function g_1 is linear in, and fits distinct parameters for: an intercept; applicant's school (low/high); indicators for application period; propensity scores for subjects applying in


the PMPD period and propensity scores for the other periods; indicators for grade of the student; ethnicity (1 if the child or guardian identifies herself as African-American, 0 otherwise); an indicator for whether or not the pretest scores of reading and math were available; and the pretest scores (reading and math) for the subjects with available scores. A propensity score for the students not in the PMPD period is not necessary from the design standpoint and is in fact constant. However, to increase efficiency and to reduce bias due to missing outcomes, here we include an "estimated propensity score value" for these periods, calculated as the function derived for the propensity score for students in the PMPD period and evaluated at the covariate values for the students in the other periods as well. Also, the foregoing link function g_0 is the same as g_1, except that it excludes the indicators for application period as well as the propensity scores for applicants who did not apply in the PMPD period (i.e., those for whom propensity score was not a design variable). This link function, a more parsimonious version of one we used in earlier models, was more appropriate to fit the relatively small proportion of always takers in this study. Finally, because the pretests were either jointly observed or jointly missing, one indicator for missingness of pretest scores is sufficient.

The prior distributions for the compliance submodel are

\beta^{(C,1)} \sim N(\beta_0^{(C,1)}, \{\sigma^{(C,1)}\}^2 I)   and   \beta^{(C,2)} \sim N(0, \{\sigma^{(C,2)}\}^2 I),

independently, where \{\sigma^{(C,1)}\}^2 and \{\sigma^{(C,2)}\}^2 are hyperparameters set at 10, and \beta_0^{(C,1)} is a vector of 0s with the exception of the first element, which is set equal to

-\Phi^{-1}(1/3) \{ \{\sigma^{(C,1)}\}^2 (1/n) \sum_{i=1}^{n} g_{1,i}' g_{1,i} + 1 \}^{1/2},

where g_{1,i} = g_1(W_i, X_i^{obs}, Rx_i) and n is our sample size. These priors reflect our a priori ignorance about the probability that any individual belongs to each compliance status, by approximately setting each of their prior probabilities at 1/3.
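As an illustration of this submodel, here is a minimal simulation sketch (our own, with toy design matrices and coefficient values, not the authors' code): a subject is a never taker when the first latent probit index falls at or below 0, and otherwise is a complier or always taker according to the second index.

```python
# Sketch (assumptions ours): simulate compliance strata from the two
# conditional probit models of Section 7.1.
import numpy as np

rng = np.random.default_rng(0)

def draw_strata(G1, G0, beta1, beta2):
    """G1, G0: design matrices for link functions g1, g0 (rows = subjects).
    Returns an array of 'n' (never taker), 'c' (complier), 'a' (always taker)."""
    n = G1.shape[0]
    lat_n = G1 @ beta1 + rng.standard_normal(n)   # C*_in: never-taker index
    lat_c = G0 @ beta2 + rng.standard_normal(n)   # C*_ic: complier index
    # C_in = 1 iff lat_n <= 0; else complier iff lat_c <= 0, always taker otherwise.
    return np.where(lat_n <= 0, "n", np.where(lat_c <= 0, "c", "a"))

# Toy usage with intercept-only links for 10,000 subjects:
G = np.ones((10_000, 1))
s = draw_strata(G, G, np.array([0.5]), np.array([-1.0]))
print({k: float(np.mean(s == k)) for k in "nca"})
```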

7.2 Outcome Submodel

We first describe the marginal distribution for the math outcome, Y^{(math)}. To address the pile-up of many scores of 0, we posit the censored model

Y_i^{(math)}(z) = 0 if Y_i^{(math)*}(z) \leq 0;   100 if Y_i^{(math)*}(z) \geq 100;   Y_i^{(math)*}(z) otherwise,

where

Y_i^{(math)*}(z) | W_i, X_i^{obs}, Rx_i, C_i, \theta^{(math)}
    \sim N( g_2(W_i, X_i^{obs}, Rx_i, C_i, z)' \beta^{(math)}, \exp[ g_3(X_i^{obs}, Rx_i, C_i, z)' \zeta^{(math)} ] )

for z = 0, 1, and \theta^{(math)} = (\beta^{(math)}, \zeta^{(math)}). Here Y_i^{(math)*}(0) and Y_i^{(math)*}(1) are assumed conditionally independent, an assumption that has no effect on inference for our superpopulation parameters of interest (Rubin 1978a).
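The censored specification can be illustrated with a short sketch (ours; the link-function rows and parameter values below are toy assumptions): a latent normal draw with log-linear variance is censored to the percentile-rank range, producing the pile-up at 0.

```python
# Sketch (ours, not the authors' code): one draw of a censored percentile-rank
# outcome as in Section 7.2.
import numpy as np

rng = np.random.default_rng(1)

def draw_outcome(g2_row, beta, g3_row, zeta):
    """One draw of Y^(math)(z): latent normal, censored to [0, 100]."""
    mean = g2_row @ beta
    sd = np.sqrt(np.exp(g3_row @ zeta))   # heteroscedastic variance via g3
    y_star = rng.normal(mean, sd)
    return min(max(y_star, 0.0), 100.0)   # censoring creates mass at 0 and 100

# Toy usage: intercept-only links, mean 25, variance 400.
print(draw_outcome(np.array([1.0]), np.array([25.0]),
                   np.array([1.0]), np.array([np.log(400.0)])))
```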

The results reported in Section 8 use an outcome component model whose outcome mean link function g_2 is linear in distinct parameters for the following:

1. For the students of the PMPD design: an intercept, applicant's school (low/high), ethnicity, indicators for grade, an indicator for whether or not the pretest scores were available, pretest score values for the subjects with available scores, and propensity score.
2. For the students of the other periods: the variables in item 1, along with indicators for application period.
3. An indicator for whether or not a person is an always taker.
4. An indicator for whether or not a person is a complier.
5. For compliers assigned treatment: an intercept, applicant's school (low/high), ethnicity, and indicators for the first three grades (with the variable for the fourth-grade treatment effect as a function of the already-included variables).

For the variance of the outcome component, the link function g_3 includes indicators that saturate the cross-classification of whether or not a person applied in the PMPD period and whether or not the pretest scores were available. This dependence is needed because each pattern conditions on a different set of covariates; that is, X^{obs} varies from pattern to pattern.

The prior distributions for the outcome submodel are

\beta^{(math)} | \zeta^{(math)} \sim N(0, F(\zeta^{(math)}) \xi I),   where   F(\zeta^{(math)}) = (1/K) \sum_k \exp(\zeta_k^{(math)}),

\zeta^{(math)} = (\zeta_1^{(math)}, ..., \zeta_K^{(math)}), with one component for each of the K (in our case, K = 4) subgroups defined by cross-classifying the PMPD/non-PMPD classification and the missing-data indicator for pretest scores; \xi is a hyperparameter set at 10; and \exp(\zeta_k^{(math)}) are iid inv-\chi^2(\nu, \sigma^2), where inv-\chi^2(\nu, \sigma^2) refers to the distribution of the inverse of a chi-squared random variable with degrees of freedom \nu (set at 3) and scale parameter \sigma^2 (set at 400, based on preliminary estimates of variances). The sets of values for these and the hyperparameters of the other model components were chosen for satisfying two criteria, giving posterior standard deviations for the estimands in the neighborhood of the respective standard errors of approximate maximum likelihood fits of similar preliminary likelihood models and giving satisfactory model checks (Sec. 9.2), while producing quick enough mixing of the Bayesian analysis algorithm.

We specify the marginal distribution for the reading outcome, Y^{(read)}, in the same way as for the math outcome, with separate mean and variance regression parameters \beta^{(read)} and \zeta^{(read)}. Finally, we allow that, conditionally on W_i, X_i^{obs}, Rx_i, C_i, the math and reading outcomes at a given assignment, Y_i^{(math)*}(z) and Y_i^{(read)*}(z), may be dependent, with an unknown correlation \rho. We set the prior distribution for \rho to be uniform in (-1, 1), independently of the remaining parameters in their prior distributions.

7.3 Outcome Response Submodel

As with the pretests, the outcomes on mathematics and reading were either jointly observed or jointly missing; thus one indicator, Ry_i(z), for missingness of outcomes is sufficient for each assignment z = 0, 1. For the submodel on this indicator, we also use a probit specification:

Ry_i(z) = 1   if   Ry_i^*(z) \equiv g_2(W_i, X_i^{obs}, Rx_i, C_i, z)' \beta^R + E_i(z) \geq 0,


where Ry_i(0) and Ry_i(1) are assumed conditionally independent (using the same justification as for the potential outcomes), and where E_i(z) \sim N(0, 1). The link function of the probit model on the outcome response, g_2, is the same as the link function for the mean of the outcome component. The prior distribution for the outcome response submodel is \beta^R \sim N(0, \{\sigma^R\}^2 I), where \{\sigma^R\}^2 is a hyperparameter set at 10.

8. RESULTS

All results herein were obtained from the same Bayesian analysis. We report results for the ITT and CACE estimands, for proportions of compliance principal strata, and for outcome response rates. The reported estimands are not parameters of the model, but rather are functions of parameters and data. The results are reported by applicant's school (low/high) and grade. Both of these variables represent characteristics of children that potentially could be targeted differentially by government policies. Moreover, each was thought to have possible interaction effects with treatment assignment. Except when otherwise stated, the plain numbers are posterior means, and the numbers in parentheses are 2.5 and 97.5 percentiles of the posterior distribution.

8.1 Test Score Results

Here we address two questions:

1. What is the impact of being offered a scholarship on student outcomes, namely the ITT estimand?
2. What is the impact of attending a private school on student outcomes, namely the CACE estimand?

The math and reading posttest score outcomes represent the national percentile rankings within grade. They have been adjusted to correct for the fact that some children were kept behind while others skipped a grade, because students transferring to private schools are hypothesized to be more likely to have been kept behind by those schools. The individual-level causal estimates have also been weighted, so that the subgroup causal estimates correspond to the effects for all eligible children belonging to that subgroup who attended a screening session.

8.1.1 Effect of Offering the Scholarship on Mathematics and Reading. We examine the impact of being offered a scholarship (the ITT effect) on posttest scores. The corresponding estimand for individual i is defined as E(Y_i(1) - Y_i(0) | W_i^p, \theta), where W_i^p denotes the policy variables, grade level and applicant's school (low/high), for that individual. The simulated posterior distribution of the ITT effect is summarized in Table 4. To draw from this distribution, we take the following steps:

Table 4. ITT Effect of Winning the Lottery on Math and Reading Test Scores, 1 Year Postrandomization

Grade at       Applicant's school: Low             Applicant's school: High
application    Reading           Math              Reading           Math
1              2.3 (-1.3, 5.8)   5.2 (2.0, 8.3)    1.4 (-4.8, 7.2)   5.1 (0.1, 10.3)
2              0.5 (-2.6, 3.5)   1.3 (-1.7, 4.3)   -0.6 (-6.2, 4.9)  1.0 (-4.3, 6.1)
3              0.7 (-2.7, 4.0)   3.3 (-0.5, 7.0)   -0.5 (-6.0, 5.0)  2.5 (-3.2, 8.0)
4              3.0 (-1.1, 7.4)   3.1 (-1.2, 7.2)   1.8 (-4.1, 7.6)   2.3 (-3.3, 7.8)
Overall        1.5 (-0.6, 3.6)   3.2 (1.0, 5.4)    0.4 (-4.6, 5.2)   2.8 (-1.8, 7.2)

NOTE: Plain numbers are means and numbers in parentheses are central 95% intervals of the posterior distribution of the effects on percentile rank.

1. Draw \theta and C_i from the posterior distribution (see App. A).
2. Calculate the expected causal effect E(Y_i(1) - Y_i(0) | W_i^p, X_i^{obs}, C_i, Rx_i, \theta) for each subject, based on the model in Section 7.2.
3. Average the values of step 2 over the subjects in the subclasses of W^p, with weights reflecting the known sampling weights of the study population for the target population; a computational sketch follows this list.
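A minimal sketch of this computation (ours; array shapes and names are assumptions) averages the per-subject expected effects within W^p cells using the sampling weights, then summarizes across posterior draws:

```python
# Sketch (ours): posterior summaries of ITT effects from per-draw,
# per-subject expected effects (steps 1-3 above).
import numpy as np

def itt_posterior(effects, groups, weights, probs=(0.025, 0.975)):
    """effects: (n_draws, n_subjects) expected causal effects (step 2);
    groups: length-n_subjects subgroup labels (cells of W^p);
    weights: length-n_subjects sampling weights (step 3)."""
    out = {}
    for g in np.unique(groups):
        m = groups == g
        w = weights[m] / weights[m].sum()
        per_draw = effects[:, m] @ w          # weighted subgroup mean, each draw
        out[g] = (per_draw.mean(), np.quantile(per_draw, probs))
    return out

# Toy usage: 2,000 draws, 100 subjects in two cells.
rng = np.random.default_rng(2)
eff = rng.normal(3.0, 1.0, size=(2000, 100))
grp = np.repeat(np.array(["low", "high"]), 50)
print(itt_posterior(eff, grp, np.ones(100)))
```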

These results indicate posterior distributions with mass primarily (>97.5%) to the right of 0 for the treatment effect on math scores overall for children from low-applicant schools, and also for the subgroup of first graders. Each effect indicates an average gain of greater than three percentile points for children who won a scholarship. All of the remaining intervals cover 0. As a more general pattern, estimates of effects are larger for mathematics than for reading, and larger for children from low-applicant schools than for children from high-applicant schools.

These results for the ITT estimand, using the method of this article, can also be contrasted with the results using a simpler method, reported in Table 5. The method used to obtain these results is the same as that used in the initial MPR study (Peterson, Myers, Howell, and Mayer 1999), but limited to the subset of single-child families and separated out by applicant's school (low/high). This method is based on a linear regression of the posttest scores on the indicator of treatment assignment, ignoring compliance data. In addition, the analysis includes the design variables and the pretest scores, and is limited to individuals for whom the pretest and posttest scores are known. Separate analyses are run for math and reading posttest scores, and the missingness of such scores is adjusted by inverse probability weighting. Finally, weights are used to make the results for the study participants generalizable to the population of all eligible single-child families who were screened. (For more details of this method, see the appendix of Peterson et al. 1999.)
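For contrast, a rough sketch of the flavor of this simpler approach (our simplification; not MPR's actual code, weights, or covariate list) is weighted least squares among complete cases, with each case weighted by sampling weight divided by an estimated response probability:

```python
# Sketch (ours): ITT via OLS of posttest on assignment and covariates among
# complete cases, with inverse-probability-of-response weights.
import numpy as np

def ipw_itt(y, z, x, samp_w, resp_prob):
    """y: posttest; z: assignment indicator; x: covariates (incl. pretest);
    all arrays restricted to complete cases. Returns the coefficient on z."""
    W = np.diag(samp_w / resp_prob)             # inverse probability weights
    X = np.column_stack([np.ones_like(z), z, x])
    coef = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return coef[1]                              # ITT estimate

# Toy usage with simulated data:
rng = np.random.default_rng(3)
n = 500
z = rng.integers(0, 2, n); x = rng.normal(size=(n, 1))
y = 20 + 2 * z + 3 * x[:, 0] + rng.normal(0, 10, n)
print(ipw_itt(y, z, x, np.ones(n), np.full(n, 0.8)))
```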

Table 5. ITT Effect of Winning the Lottery on Test Scores, Estimated Using the Original MPR Method

Grade at       Applicant's school: Low              Applicant's school: High
application    Reading           Math               Reading             Math
1              -1.0 (-7.1, 5.2)  2.1 (-2.6, 6.7)    4.8 (-10.0, 19.6)   2.6 (-15.5, 20.7)
2              -0.8 (-4.9, 3.2)  2.0 (-4.0, 8.0)    -3.4 (-16.5, 9.7)   2.7 (-10.3, 15.7)
3              3.2 (-1.7, 8.1)   5.0 (-0.8, 10.7)   -8.0 (-25.4, 9.3)   4.0 (-17.7, 25.6)
4              2.7 (-3.5, 8.8)   0.3 (-7.3, 7.9)    27.9 (8.0, 47.8)    22.7 (-1.5, 46.8)
Overall        0.6 (-2.0, 3.2)   2.0 (-0.8, 4.8)    1.1 (-7.0, 9.1)     0.3 (-9.6, 10.1)

NOTE: Plain numbers are point estimates and parentheses are 95% confidence intervals for the mean effects on percentile rank.


Table 6. Effect of Private School Attendance on Test Scores

Grade at       Applicant's school: Low              Applicant's school: High
application    Reading           Math               Reading            Math
1              3.4 (-2.0, 8.7)   7.7 (3.0, 12.4)    1.9 (-7.3, 10.3)   7.4 (0.2, 14.6)
2              0.7 (-3.7, 5.0)   1.9 (-2.4, 6.2)    -0.9 (-9.4, 7.3)   1.5 (-6.2, 9.3)
3              1.0 (-4.1, 6.1)   5.0 (-0.8, 10.7)   -0.8 (-9.5, 7.7)   4.0 (-4.9, 12.5)
4              4.2 (-1.5, 10.1)  4.3 (-1.6, 10.1)   2.7 (-6.3, 11.3)   3.5 (-4.7, 11.9)
Overall        2.2 (-0.9, 5.3)   4.7 (1.4, 7.9)     0.6 (-7.1, 7.7)    4.2 (-2.6, 10.9)

NOTE: Plain numbers are means and parentheses are central 95% intervals of the posterior distribution of the effects on percentile rank.

The results of the new method (Table 4) are generally more stable than those of the original method (Table 5), which in some cases are not even credible. In the most extreme cases, the original method estimates a 22.7-point gain [95% confidence interval (-1.5, 46.8)] in mathematics and a 27.9-point gain [95% confidence interval (8.0, 47.8)] in reading for fourth-grade children from high-applicant schools. More generally, the new method's results display a more plausible pattern in comparing effects in high-applicant versus low-applicant schools and mathematics versus reading.

8.1.2 Effect of Private School Attendance on Mathematics and Reading. We also examine the effect of offering the scholarship when focusing only on the compliers (the CACE). The corresponding estimand for individual i is defined as

E(Y_i(1) - Y_i(0) | W_i^p, C_i = c, \theta).

This analysis defines the treatment as private school attendance (Sec. 6.3). The simulated posterior distribution of the CACE is summarized in Table 6. A draw from this distribution is obtained using steps 1–3 described in Section 8.1.1 for the ITT estimand, with the exception that at step 3 the averaging is restricted to the subjects whose current draw of C_i is "complier."

The effects of private school attendance follow a pattern similar to that of the ITT effects, but the posterior means are slightly bigger in absolute value than ITT. The intervals have also grown, reflecting that these effects are for only subgroups of all children, the "compliers" in each cell. As a result, the associated uncertainty for some of these effects (e.g., for fourth-graders applying from high-applicant schools) is large.

8.2 Compliance Principal Strata and Missing Outcomes

Table 7 summarizes the posterior distribution of the estimands of the probability of being in stratum C as a function of an applicant's school and grade, p(C_i = t | W_i^p, \theta). To draw from this distribution, we use step 1 described in Section 8.1.1, then calculate p(C_i = t | W_i^p, X_i^{obs}, Rx_i, \theta) for each subject based on the model of Section 7.1, and average these values as in step 3 of Section 8.1.1. The clearest pattern revealed by Table 7 is that, for three out of four grades, children who applied from low-applicant schools are more likely to be compliers or always takers than are children who applied from high-applicant schools.

As stated earlier, theoretically, under our structural assumptions, standard ITT analyses or standard IV analyses that use ad hoc approaches to missing data are generally not appropriate for estimating the ITT or CACE estimands when the compliance principal strata have differential response (i.e., outcome missing-data) behaviors. To evaluate this here, we simulated the posterior distributions of

p(R_i(z) | C_i = t, W_i^p, \theta),   z = 0, 1.   (1)

To draw from the distributions of the estimands in (1), we used step 1 from Section 8.1.1, and then, for z = 0, 1, for each subject calculated p(R_i(z) | W_i^p, X_i^{obs}, Rx_i, C_i, \theta). We then averaged these values over subjects within subclasses defined by the different combinations of W_i^p and C_i.

For each draw (across individuals) from the distribution in (1), we calculate the odds ratios (a) between compliers attending public schools and never takers, (b) between compliers attending private schools and always takers, and (c) between compliers attending private schools and compliers attending public schools. The results (omitted) showed that response was increasing in the following order: never takers, compliers attending public schools, compliers attending private schools, and always takers. These results confirm that the compliance principal stratum is an important factor in response, both alone and in interaction with assignment.
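A sketch of this odds-ratio computation (ours; the beta draws below are placeholders for posterior draws of the response probabilities in (1)):

```python
# Sketch (ours): per-draw odds ratios of response between principal strata.
import numpy as np

def odds(p):
    return p / (1 - p)

def response_odds_ratio(p_num, p_den):
    """p_num, p_den: posterior draws of response probabilities for two groups,
    e.g., compliers attending private school vs. always takers."""
    return odds(np.asarray(p_num)) / odds(np.asarray(p_den))

# Toy usage: summarize the posterior of one odds ratio.
rng = np.random.default_rng(6)
or_draws = response_odds_ratio(rng.beta(80, 20, 5000), rng.beta(70, 30, 5000))
print(np.quantile(or_draws, [0.025, 0.5, 0.975]))
```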

Table 7. Proportions of Compliance Principal Strata Across Grade and Applicant's School

               Applicant's school: Low                      Applicant's school: High
Grade at
application    Never taker   Complier     Always taker      Never taker   Complier     Always taker
1              24.5 (2.9)    67.1 (3.8)   8.4 (2.4)         25.0 (5.0)    69.3 (6.1)   5.7 (3.3)
2              20.5 (2.7)    69.4 (3.7)   10.1 (2.5)        25.3 (5.1)    67.2 (6.2)   7.5 (3.4)
3              24.5 (3.2)    65.9 (4.0)   9.6 (2.5)         28.8 (5.6)    64.1 (6.7)   7.1 (3.5)
4              18.4 (3.3)    72.8 (4.6)   8.8 (3.0)         27.0 (5.5)    66.7 (6.7)   6.3 (3.6)

NOTE: Plain numbers are means and parentheses are standard deviations of the posterior distribution of the estimands.

9. MODEL BUILDING AND CHECKING

This model was built through a process of fitting and checking a succession of models. Of note in this model are the following features: a censored normal model that accommodates the pile-up of 0s in the outcomes, incorporation of heteroscedasticity, and a multivariate model to accommodate both outcomes simultaneously. It is reassuring, from the perspective of model robustness, that the results from the last few models fit, including those in the conference proceedings report of Barnard et al. (2002), are consistent with the results from this final model.

9.1 Convergence Checks

Because posterior distributions were simulated from a Markov chain Monte Carlo algorithm (App. A), it is important to assess its convergence. To do this, we ran three chains from different starting values. To initiate each chain, we set any unknown compliance stratum equal to a Bernoulli draw, with probabilities obtained from moment estimates of the probabilities of being a complier given observed attendance and randomization data (D, Z) alone. Using the initialized compliance strata for each chain, parameters were initialized to values based on generalized linear model estimates of the model components.

Each chain was run for 15,000 iterations. At 5,000 iterations, and based on the three chains for each model, we calculated the potential scale-reduction statistic (Gelman and Rubin 1992) for the 250 estimands (parameters and functions of parameters) that serve as building blocks for all other estimands. The results suggested good mixing of the chains (with the maximum potential scale-reduction statistic across parameters 1.04) and provided no evidence against convergence. Inference is based on the remaining 30,000 iterations, combining the three chains.
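For reference, a minimal sketch (ours) of the potential scale-reduction statistic for a single estimand, following Gelman and Rubin (1992):

```python
# Sketch (ours): potential scale-reduction statistic from several chains
# of post-burn-in draws of one estimand.
import numpy as np

def psr(chains):
    """chains: (m, n) array of m chains with n draws each; returns sqrt(R-hat)."""
    m, n = chains.shape
    means = chains.mean(axis=1)
    B = n * means.var(ddof=1)                 # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()     # within-chain variance
    var_plus = (n - 1) / n * W + B / n        # pooled posterior variance estimate
    return np.sqrt(var_plus / W)

# Toy usage: three well-mixed chains give a value near 1.
rng = np.random.default_rng(4)
print(psr(rng.normal(size=(3, 10_000))))
```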

9.2 Model Checks

We evaluate the influence of the model presented in Section 7 with six posterior predictive checks, three checks for each of the two outcomes. A posterior predictive check generally involves (a) choosing a discrepancy measure that is a function of observed data and possibly of missing data and the parameter vector \theta, and (b) computing a posterior predictive p value (PPPV), which is the probability, over the posterior predictive distribution of the missing data and \theta, that the discrepancy measure in a new study, drawn with the same \theta as in our study, would be as or more extreme than in our study (Rubin 1984; Meng 1996; Gelman, Meng, and Stern 1996).

Posterior predictive checks in general, and PPPVs in particular, demonstrate whether the model can adequately preserve features of the data reflected in the discrepancy measure, where the model here includes the prior distribution as well as the likelihood (Meng 1996). As a result, properties of PPPVs are not exactly the same as properties of classical p values under frequency evaluations conditional on the unknown \theta, just as they are not exactly the same for frequency evaluations over both levels of uncertainty, that is, the drawing of \theta from the prior distribution and the drawing of data from the likelihood given \theta. For example, over frequency evaluations of the latter type, a PPPV is stochastically less variable than, but has the same mean as, the uniform distribution, and so tends to be more conservative than a classical p value, although the reverse can be true over frequency evaluations of the first type (Meng 1996). (For more details on the interpretation and properties of PPPVs, see also Rubin 1984 and Gelman et al. 1996.)

Table 8. Posterior Predictive Checks: p Values

        Signal   Noise   Signal to Noise
Math    .34      .79     .32
Read    .40      .88     .39

The posterior predictive discrepancy measures that we choose here are functions of

A_p^{rep}(z) = { Y_{ip}^{rep} : I(Ry_{ip}^{rep} = 1) I(Y_{ip}^{rep} \neq 0) I(C_i^{rep} = c) I(Z_i = z) = 1 }

for the measures that are functions of data Y_{ip}^{rep}, Ry_{ip}^{rep}, and C_i^{rep} from a replicated study, and

A_p^{study}(z) = { Y_{ip} : I(Ry_{ip} = 1) I(Y_{ip} \neq 0) I(C_i = c) I(Z_i = z) = 1 }

for the measures that are functions of this study's data. Here Ry_{ip}^{rep} is defined so that it equals 1 if Y_{ip}^{rep} is observed and 0 otherwise, and p equals 1 for math outcomes and 2 for reading outcomes. The discrepancy measures, "rep" and "study," that we used for each outcome (p = 1, 2) were (a) the absolute value of the difference between the mean of A_p(1) and the mean of A_p(0) ("signal"); (b) the standard error based on a simple two-sample comparison for this difference ("noise"); and (c) the ratio of (a) to (b) ("signal to noise"). Although these measures are not treatment effects, we chose them here to assess whether the model can preserve broad features of signal, noise, and signal-to-noise ratio in the continuous part of the compliers' outcome distributions, which we think can be very influential in estimating the treatment effects of Section 8. More preferable measures might have been the posterior mean and standard deviation for the actual estimands in Section 8 for each replicated dataset, but this would have required a prohibitive amount of computer memory because of the nested structure of that algorithm. In settings such as these, additional future work on choices of discrepancy measures is of interest.

PPPVs for the discrepancy measures that we chose were calculated as the percentage of draws in which the replicated discrepancy measure exceeded the value of the study's discrepancy measure. Extreme values (close to 0 or 1) of a PPPV would indicate a failure of the prior distribution and likelihood to replicate the corresponding measure of location, dispersion, or their relative magnitude, and would indicate an undesirable influence of the model in estimation of our estimands. Results from these checks, displayed in Table 8, provide no special evidence for such influences of the model.
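Computationally, a PPPV of this kind reduces to one comparison per posterior draw, as in the following sketch (ours; inputs are assumed precomputed discrepancy values):

```python
# Sketch (ours): a posterior predictive p value as the fraction of draws in
# which the replicated discrepancy is at least the study's value.
import numpy as np

def pppv(disc_rep, disc_study):
    """disc_rep: discrepancy per replicated dataset (one per posterior draw);
    disc_study: the study's discrepancy per draw (a scalar works too, when
    the measure does not depend on theta)."""
    return float(np.mean(np.asarray(disc_rep) >= np.asarray(disc_study)))

# Toy usage: a well-calibrated check gives a PPPV away from 0 and 1.
rng = np.random.default_rng(5)
print(pppv(rng.normal(1.0, 0.3, 30_000), 1.05))
```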

10. DISCUSSION

In this article we have defined the framework for principal stratification in broken randomized experiments, to accommodate noncompliance, missing covariate information, missing outcome data, and multivariate outcomes. We make explicit a set of structural assumptions that can identify the causal effects of interest, and we also provide a parametric model that is appropriate for practical implementation of the framework in settings such as ours.

Results from our model in the school choice study do not indicate strong treatment effects for most of the subgroups examined. But we do estimate positive effects (on the order of 3 percentile points for ITT and 5 percentile points for the effect of attendance) on math scores overall for children who applied to the program from low-applicant schools, particularly for first graders. Also, the effects were larger for African-American children than for the remaining children (App. B).

Posterior distributions for the CACE estimand, which measures the effect of attendance in private school versus public school, are generally larger than the corresponding ITT effects, but are also associated with greater uncertainty. Of importance, because of the missing outcomes, a model like ours is needed for valid estimation even of the ITT effect.

The results from this randomized study are not subject to selection bias in the way that nonrandomized studies in school choice have been. Nevertheless, although we use the CACE, a well-defined causal effect, to represent the effect of attendance of private versus public schools, it is important to remember that the CACE is defined on a subset of the study children (those who would have complied with either assignment), and that for the other children there is no information on such an effect of attendance in this study. Therefore, as with any randomized trial based on a subpopulation, external information, such as background variables, also must be used when generalizing the CACE from compliers to other target groups of children. Also, it is possible that a broader implementation of a voucher program can have a collective effect on the public schools, if a large shift of the children who might use vouchers were to have an impact on the quality of learning for children who would stay in public schools (a violation of the no-interference assumption). Because our study contains a small fraction of participants relative to all school children, it cannot provide direct information about any such collective effect, and additional external judgment would need to be used to address this issue.

The larger effects that we estimated for children applying from schools with low versus high past scores are also not, in principle, subject to the usual regression-to-the-mean bias, in contrast to a simple before-after comparison. This is because, in our study, both types of children are randomized to be offered the scholarship or not, and both types in both treatment arms are evaluated at the same time after randomization. Instead, the differential effect for children from schools with different past scores is evidence supporting the claim that the school voucher program has greater potential benefit for children in lower-scoring schools.

Our results also reveal differences in compliance and missing-data patterns across groups. Differences in compliance indicate that children applying from low-applicant schools are generally more likely to comply with their treatment assignment; this could provide incentives for policy makers to target this subgroup of children. However, this group also exhibits higher levels of always takers, indicating that parents of children attending these poorly performing schools are more likely to get their child into a private school even in the absence of scholarship availability. These considerations would of course have to be balanced with (and indeed might be dwarfed by) concerns about equity. Missing-data patterns reveal that perhaps relatively greater effort is needed in future interventions of this nature to retain study participants who stay in public schools, particularly the public-school students who actually won but did not use a scholarship.

The approach presented here also has some limitations. First, we have presented principal stratification only on a binary controlled factor Z and a binary uncontrolled factor D, and it is of interest to develop principal stratification for more levels of such factors. Approaches that extend our framework in that direction, including for time-dependent data, have been proposed (e.g., Frangakis et al. 2002b).

Also, the approach of explicitly conditioning on the patterns of missing covariates that we adopted in the parametric component of Section 7 is not as applicable when there are many patterns of missing covariates. In such cases, it would be more appropriate to use models related to those of D'Agostino and Rubin (2000) and integrate them with principal stratification for noncompliance. Moreover, it would also be interesting to investigate results for models fit to these data that allow deviations from the structural assumptions, such as weakened exclusion restrictions (e.g., Hirano et al. 2000; Frangakis, Rubin, and Zhou 2002a), although to ensure robustness of such models, it would be important to first investigate and rely on additional alternative assumptions that would be plausible.

APPENDIX A: COMPUTATIONS

Details are available at http://biosun01.biostat.jhsph.edu/~cfrangak/papers/sc.

Table B1. ITT Estimand

                              Applicant school: Low                   Applicant school: High
Grade at
application   Ethnicity   Reading            Math               Reading            Math

1             AA          2.8 (-1.5, 7.2)    6.7 (3.0, 10.4)    1.8 (-5.2, 8.4)    6.3 (0.8, 11.9)
              Other       1.7 (-1.9, 5.4)    3.9 (0.5, 7.2)     0.8 (-4.5, 6.1)    3.5 (-1.2, 8.2)
2             AA          0.8 (-2.8, 4.4)    2.4 (-1.0, 5.8)    -0.3 (-6.6, 6.0)   2.2 (-3.8, 8.0)
              Other       0.2 (-3.1, 3.4)    0.5 (-2.8, 3.7)    -0.9 (-6.1, 4.3)   0.0 (-5.2, 4.8)
3             AA          1.1 (-3.0, 5.2)    4.6 (0.3, 8.9)     -0.1 (-6.7, 6.4)   4.2 (-2.5, 10.6)
              Other       0.3 (-3.2, 3.8)    2.0 (-2.0, 5.8)    -0.7 (-5.8, 4.4)   1.4 (-3.9, 6.5)
4             AA          3.7 (-1.2, 8.7)    4.4 (-0.5, 9.3)    2.4 (-4.7, 9.4)    3.8 (-2.9, 10.5)
              Other       2.4 (-1.7, 6.7)    1.8 (-2.6, 6.0)    1.4 (-4.3, 7.0)    1.1 (-4.4, 6.4)
Overall       AA          2.0 (-0.9, 4.8)    4.5 (1.8, 7.2)     0.8 (-5.1, 6.5)    4.2 (-1.1, 9.3)
              Other       1.1 (-1.4, 3.6)    2.1 (-0.7, 4.7)    0.1 (-4.6, 4.6)    1.6 (-2.9, 5.8)

NOTE: Plain numbers are means, and numbers in parentheses are central 95% intervals, of the posterior distribution of the effects on percentile rank.


Table B2. CACE Estimand

                              Applicant school: Low                   Applicant school: High
Grade at
application   Ethnicity   Reading            Math               Reading             Math

1             AA          3.8 (-2.0, 9.6)    9.0 (4.1, 14.0)    2.3 (-7.3, 10.9)    8.3 (1.1, 15.5)
              Other       2.9 (-3.1, 8.8)    6.3 (0.8, 11.8)    1.3 (-8.0, 10.2)    5.9 (-2.2, 13.9)
2             AA          1.1 (-3.6, 5.7)    3.1 (-1.3, 7.5)    -0.4 (-9.0, 7.9)    2.9 (-5.0, 10.7)
              Other       0.3 (-4.9, 5.4)    0.8 (-4.4, 5.9)    -1.5 (-10.5, 7.3)   0.0 (-8.5, 8.5)
3             AA          1.5 (-4.1, 7.0)    6.3 (0.3, 12.2)    -0.2 (-9.1, 8.5)    5.6 (-3.2, 14.1)
              Other       0.5 (-5.4, 6.4)    3.5 (-3.4, 10.2)   -1.3 (-10.5, 7.9)   2.7 (-7.0, 12.0)
4             AA          4.6 (-1.6, 10.8)   5.5 (-0.7, 11.6)   3.0 (-6.0, 11.6)    4.8 (-3.6, 13.1)
              Other       3.7 (-2.6, 10.1)   2.8 (-3.8, 9.3)    2.3 (-7.4, 11.7)    2.0 (-7.0, 11.0)
Overall       AA          2.6 (-1.2, 6.3)    6.0 (2.4, 9.5)     1.0 (-6.9, 8.4)     5.5 (-1.5, 12.2)
              Other       1.7 (-2.3, 5.8)    3.3 (-1.2, 7.7)    0.1 (-8.1, 7.9)     2.7 (-5.0, 10.3)

NOTE: Plain numbers are means, and numbers in parentheses are central 95% intervals, of the posterior distribution of the effects on percentile rank.

APPENDIX B: ETHNIC BREAKDOWN

The model of Section 7 can be used to estimate the effect of the program on finer strata that may be of interest. For example, to estimate effects stratified by ethnicity (AA for African-American), we obtain the posterior distribution of the causal effect of interest (ITT or CACE) using the same steps described in Section 8.1.1 or 8.1.2, but allowing ethnicity to be part of the definition of W^p. The results for this stratification are reported in Table B1 for the estimands of ITT and in Table B2 for the estimands of CACE.
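In outline, the post-processing that produces each cell of Tables B1 and B2 computes, draw by draw, the average difference between the two imputed potential outcomes within the subgroup, and then summarizes across draws by the mean and the central 95% interval. The following minimal sketch illustrates the idea; the array names and the simulated inputs are hypothetical stand-ins for the actual posterior draws, whose computation is described in Appendix A:

```python
import numpy as np

# y1_draws, y0_draws: (n_draws, n_children) arrays of imputed potential
# outcomes (percentile ranks) under assignment to treatment and control.
# The fake inputs below stand in for the output of the actual sampler.
rng = np.random.default_rng(0)
n_draws, n = 1000, 500
y0_draws = rng.normal(25, 10, (n_draws, n))
y1_draws = y0_draws + rng.normal(3, 2, (n_draws, n))
is_aa = rng.random(n) < 0.5          # African-American indicator (hypothetical)
low_school = rng.random(n) < 0.85    # applicant school "low" (hypothetical)

def subgroup_itt(mask):
    """Posterior mean and central 95% interval of a subgroup ITT effect."""
    effect = (y1_draws[:, mask] - y0_draws[:, mask]).mean(axis=1)  # one value per draw
    lo, hi = np.percentile(effect, [2.5, 97.5])
    return effect.mean(), (lo, hi)

print(subgroup_itt(is_aa & low_school))   # e.g., AA children from low schools
```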

The results follow similar patterns to those in Section 8. For each grade and applicant's school (low/high) combination, however, the effects are more positive on average for the subgroup of African-American children. For both estimands, this leads to 95% intervals that are entirely above 0 for math scores for the following subgroups: African-American first- and third-graders, and overall, from low-applicant schools; African-American first-graders from high-applicant schools; and non-African-American first-graders from low-applicant schools. All intervals for reading covered the null value. This suggests that the positive effects reported on math scores in Section 8 for children originating from low-applicant schools are primarily attributable to gains among the African-American children in this subgroup.

[Received October 2002. Revised October 2002.]

REFERENCES

Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996), "Identification of Causal Effects Using Instrumental Variables" (with discussion), Journal of the American Statistical Association, 91, 444-472.

Barnard, J., Du, J., Hill, J. L., and Rubin, D. B. (1998), "A Broader Template for Analyzing Broken Randomized Experiments," Sociological Methods and Research, 27, 285-317.

Barnard, J., Frangakis, C., Hill, J. L., and Rubin, D. B. (2002), "School Choice in NY City: A Bayesian Analysis of an Imperfect Randomized Experiment," in Case Studies in Bayesian Statistics, Vol. V, eds. C. Gatsonis, R. E. Kass, B. Carlin, A. Carriquiry, A. Gelman, I. Verdinelli, and M. West, New York: Springer-Verlag, pp. 3-97.

Brandl, J. E. (1998), Money and Good Intentions Are Not Enough, or Why Liberal Democrats Think States Need Both Competition and Community, Washington, DC: Brookings Institute Press.

Carnegie Foundation for the Advancement of Teaching (1992), School Choice: A Special Report, San Francisco: Jossey-Bass.

Chubb, J. E., and Moe, T. M. (1990), Politics, Markets, and America's Schools, Washington, DC: Brookings Institute Press.

Cobb, C. W. (1992), Responsive Schools, Renewed Communities, San Francisco: Institute for Contemporary Studies.

Coleman, J. S., Hoffer, T., and Kilgore, S. (1982), High School Achievement, New York: Basic Books.

Cookson, P. W. (1994), School Choice: The Struggle for the Soul of American Education, New Haven, CT: Yale University Press.

Coulson, A. J. (1999), Market Education: The Unknown History, Bowling Green, OH: Social Philosophy & Policy Center.

Cox, D. R. (1958), Planning of Experiments, New York: Wiley.

D'Agostino, R. B., Jr., and Rubin, D. B. (2000), "Estimating and Using Propensity Scores With Incomplete Data," Journal of the American Statistical Association, 95, 749-759.

Education Week (1998), Quality Counts '98: The Urban Challenge, Public Education in the 50 States, Bethesda, MD: Editorial Projects in Education.

Frangakis, C. E., and Rubin, D. B. (1999), "Addressing Complications of Intention-to-Treat Analysis in the Combined Presence of All-or-None Treatment-Noncompliance and Subsequent Missing Outcomes," Biometrika, 86, 365-380.

——— (2001), "Addressing the Idiosyncrasy in Estimating Survival Curves Using Double-Sampling in the Presence of Self-Selected Right Censoring" (with discussion), Biometrics, 57, 333-353.

——— (2002), "Principal Stratification in Causal Inference," Biometrics, 58, 20-29.

Frangakis, C. E., Rubin, D. B., and Zhou, X. H. (2002a), "Clustered Encouragement Design With Individual Noncompliance: Bayesian Inference and Application to Advance Directive Forms" (with discussion), Biostatistics, 3, 147-164.

Frangakis, C. E., Brookmeyer, R. S., Varadhan, R., Mahboobeh, S., Vlahov, D., and Strathdee, S. A. (2002b), "Methodology for Evaluating a Partially Controlled Longitudinal Treatment Using Principal Stratification, With Application to a Needle Exchange Program," Technical Report NEP-06-02, Johns Hopkins University, Dept. of Biostatistics.

Fuller, B., and Elmore, R. F. (1996), Who Chooses, Who Loses? Culture, Institutions, and the Unequal Effects of School Choice, New York: Teachers College Press.

Gelfand, A. E., and Smith, A. F. M. (1990), "Sampling-Based Approaches to Calculating Marginal Densities," Journal of the American Statistical Association, 85, 398-409.

Gelman, A., Meng, X.-L., and Stern, H. (1996), "Posterior Predictive Assessment of Model Fitness via Realized Discrepancies" (with discussion), Statistica Sinica, 6, 733-760.

Gelman, A., and Rubin, D. B. (1992), "Inference from Iterative Simulation Using Multiple Sequences" (with discussion), Statistical Science, 7, 457-472.

Glynn, R. J., Laird, N. M., and Rubin, D. B. (1993), "Multiple Imputation in Mixture Models for Nonignorable Nonresponse With Follow-Ups," Journal of the American Statistical Association, 88, 984-993.

Goldberger, A. S., and Cain, G. G. (1982), "The Causal Analysis of Cognitive Outcomes in the Coleman, Hoffer and Kilgore Report," Sociology of Education, 55, 103-122.

Haavelmo, T. (1943), "The Statistical Implications of a System of Simultaneous Equations," Econometrica, 11, 1-12.

Hastings, W. (1970), "Monte Carlo Sampling Methods Using Markov Chains and Their Applications," Biometrika, 57, 97-109.

Hill, J. L., Rubin, D. B., and Thomas, N. (2000), "The Design of the New York School Choice Scholarship Program Evaluation," in Donald Campbell's Legacy, ed. L. Bickman, Newbury Park, CA: Sage Publications.

Hirano, K., Imbens, G. W., Rubin, D. B., and Zhou, A. (2000), "Assessing the Effect of an Influenza Vaccine in an Encouragement Design," Biostatistics, 1, 69-88.

Holland, P. (1986), "Statistics and Causal Inference," Journal of the American Statistical Association, 81, 945-970.

Imbens, G. W., and Angrist, J. D. (1994), "Identification and Estimation of Local Average Treatment Effects," Econometrica, 62, 467-476.

Imbens, G. W., and Rubin, D. B. (1997), "Bayesian Inference for Causal Effects in Randomized Experiments With Noncompliance," The Annals of Statistics, 25, 305-327.

Levin, H. M. (1998), "Educational Vouchers: Effectiveness, Choice, and Costs," Journal of Policy Analysis and Management, 17, 373-392.

Little, R. J. A. (1993), "Pattern-Mixture Models for Multivariate Incomplete Data," Journal of the American Statistical Association, 88, 125-134.

——— (1996), "Pattern-Mixture Models for Multivariate Incomplete Data With Covariates," Biometrics, 52, 98-111.

Little, R. J. A., and Rubin, D. B. (1987), Statistical Analysis With Missing Data, New York: Wiley.

Meng, X. L. (1994), "Posterior Predictive p Values," The Annals of Statistics, 22, 1142-1160.

Neal, D. (1997), "The Effects of Catholic Secondary Schooling on Educational Achievement," Journal of Labor Economics, 15, 98-123.

Neyman, J. (1923), "On the Application of Probability Theory to Agricultural Experiments: Essay on Principles, Section 9," translated in Statistical Science, 5, 465-480.

Peterson, P. E., and Hassel, B. C. (eds.) (1998), Learning from School Choice, Washington, DC: Brookings Institute Press.

Peterson, P. E., Myers, D. E., Howell, W. G., and Mayer, D. P. (1999), "The Effects of School Choice in New York City," in Earning and Learning: How Schools Matter, eds. S. E. Mayer and P. E. Peterson, Washington, DC: Brookings Institute Press.

Robins, J. M., Greenland, S., and Hu, F.-C. (1999), "Estimation of the Causal Effect of a Time-Varying Exposure on the Marginal Mean of a Repeated Binary Outcome" (with discussion), Journal of the American Statistical Association, 94, 687-712.

Rosenbaum, P. R., and Rubin, D. B. (1983), "The Central Role of the Propensity Score in Observational Studies for Causal Effects," Biometrika, 70, 41-55.

Rubin, D. B. (1974), "Estimating Causal Effects of Treatments in Randomized and Non-Randomized Studies," Journal of Educational Psychology, 66, 688-701.

——— (1977), "Assignment to Treatment Groups on the Basis of a Covariate," Journal of Educational Statistics, 2, 1-26.

——— (1978a), "Bayesian Inference for Causal Effects: The Role of Randomization," The Annals of Statistics, 6, 34-58.

——— (1978b), "Multiple Imputations in Sample Surveys: A Phenomenological Bayesian Approach to Nonresponse," in Proceedings of the Survey Research Methods Section, American Statistical Association, pp. 20-28.

——— (1979), "Using Multivariate Matched Sampling and Regression Adjustment to Control Bias in Observational Studies," Journal of the American Statistical Association, 74, 318-328.

——— (1980), Comment on "Randomization Analysis of Experimental Data: The Fisher Randomization Test," Journal of the American Statistical Association, 75, 591-593.

——— (1984), "Bayesianly Justifiable and Relevant Frequency Calculations for the Applied Statistician," The Annals of Statistics, 12, 1151-1172.

——— (1990), "Comment: Neyman (1923) and Causal Inference in Experiments and Observational Studies," Statistical Science, 5, 472-480.

The Coronary Drug Project Research Group (1980), "Influence of Adherence to Treatment and Response of Cholesterol on Mortality in the Coronary Drug Project," New England Journal of Medicine, 303, 1038-1041.

Tinbergen, J. (1930), "Determination and Interpretation of Supply Curves: An Example," reprinted in The Foundations of Econometric Analysis, eds. D. Hendry and M. Morgan, Cambridge, U.K.: Cambridge University Press.

Wilms, D. J. (1985), "Catholic School Effect on Academic Achievement: New Evidence From the High School and Beyond Follow-Up Study," Sociology of Education, 58, 98-114.

Comment

Bengt MUTHÉN, Booil JO, and C. Hendricks BROWN

1. INTRODUCTION

The article by Barnard, Frangakis, Hill, and Rubin (BFHR) is timely in that the Department of Education is calling for more randomized studies in educational program evaluation. (See the discussion of the "No Child Left Behind" initiative in, e.g., Slavin 2002.) BFHR can serve as a valuable pedagogical example of a successful, sophisticated statistical analysis of a randomized study. Our commentary is intended to provide additional pedagogical value to benefit the planning and analysis of future studies, drawing on experiences and research within our research group. [The Prevention Science Methodology Group (PSMG; www.psmg.hsc.usf.edu), co-PIs Brown and Muthén, has collaborated over the last 15 years with support from the National Institute of Mental Health and the National Institute on Drug Abuse.]

BFHR provides an exemplary analysis of the data from an imperfect randomized trial that suffers from several complications simultaneously: noncompliance, missing data in outcomes, and missing data in covariates. We are very pleased to

Bengt Muthén is Professor and Booil Jo is postdoc, Graduate School of Education and Information Studies, University of California, Los Angeles, Los Angeles, CA 90095 (E-mail: bmuthen@ucla.edu). C. Hendricks Brown is Professor, Department of Epidemiology and Biostatistics, University of South Florida, Tampa, FL 33620. The research of the first author was supported by National Institute on Alcohol Abuse and Alcoholism grant K02 AA 00230. The research of the second and third authors was supported by National Institute on Drug Abuse and National Institute of Mental Health grant MH40859. The authors thank Chen-Pin Wang for research assistance, Joyce Chappell for graphical assistance, and the members of the Prevention Science Methodology Group and the Fall ED299A class for helpful comments.

see their application of cutting-edge Bayesian methods for dealing with these complexities. In addition, we believe the methodological issues and the results of the study have important implications for the design and analysis of randomized trials in education and for related policy decisions.

BFHR provides results of the New York City school choice experiment based on 1-year achievement outcomes. With the planned addition of yearly follow-up data, growth models can provide an enhanced examination of causal impact. We discuss how such growth modeling can be incorporated and provide a caution that applies to BFHR's use of only one posttest occasion. We also consider the sensitivity of the latent class ignorability assumption in combination with the assumption of compound exclusion.

2. LONGITUDINAL MODELING ISSUES

BFHR focuses on variation in treatment effect across compliance classes. This part of the commentary considers variation in treatment effect across a different type of class, based on the notion that the private school treatment effect might very well be quite different for children with different achievement development. (Also of interest is potential variation in treatment effects across schools, with respect to both the public school the child originated in and the private school the child was


moved to, but this multilevel aspect of the data is left aside here for lack of space.) To study such a "treatment-baseline interaction" (or "treatment-trajectory interaction"), we will switch from BFHR's pretest-posttest analysis framework (essentially a very advanced ANCOVA-type analysis) to the growth mixture modeling framework of Muthén et al. (2002). An underlying rationale for this modeling is that individuals at different initial status levels, and on different trajectories, may benefit differently from a given treatment. ANCOVA controls for initial status as measured by the observed pretest score. Unlike the observed pretest score, the latent variable of initial status is free of time-specific variation and measurement error.

The focus on longitudinal aspects of the New York School Choice Study (NYSCS) is both substantively and statistically motivated. First, treatment effects may not have gained full strength after only a 1-year stay in a private school. The NYSCS currently has data from three follow-ups, that is, providing repeated-measures data from four grades. Although BFHR used percentile scores that do not lend themselves to growth modeling, a conversion to "scale scores" (i.e., IRT-based equated scores) should be possible, enabling growth modeling. Unfortunately, educational research traditionally uses scales that are unsuitable for growth modeling, such as percentile scores, normal curve equivalents, and grade equivalents (for a comparison in a growth context, see Seltzer, Frank, and Bryk 1994). Hopefully, this tradition can be changed. Second, the use of information from more time points than pretest and posttest makes it possible to identify and estimate models that give a richer description of the normative development in the control group and how the treatment changes this development.

Consider three types of latent variables for individual i. The first type, C_i, refers to BFHR's compliance principal strata. The next two relate to the achievement development as expressed by a growth mixture model: T_i refers to trajectory class, and η_i refers to random effects within trajectory class (the within-class model is a regular mixed-effects model). Unlike the latent class variable C_i, the latent class variable T_i is a fully unobserved variable, as is common in latent variable modeling (see, e.g., Muthén 2002a). Consider the likelihood expression for individual i, using the [·] notation to denote probabilities/densities:

[C_i, T_i | X_i] [η_i | C_i, T_i, X_i] [Y_i | η_i, C_i, T_i, X_i]
    × [U_i | η_i, C_i, T_i, X_i] [R_i | Y_i, η_i, C_i, T_i, X_i],        (1)

where X_i denotes covariates, U_i denotes a compliance stratum indicator (with C_i perfectly measured by U_i in the treatment group for never-takers and perfectly measured in the control group for always-takers, other group-class combinations having missing data), and R_i denotes indicators for missing data on the repeated-measures outcomes Y_i (pretreatment and posttreatment achievement scores). This type of model can be fitted into the latent variable modeling framework of the Mplus program (Muthén and Muthén 1998-2002, tech. app. 8), which has implemented an EM-based maximum likelihood estimator. (For related references, see the Mplus website, www.statmodel.com.) As a special case of (1), conventional random-effects growth modeling includes η_i but excludes C_i and T_i and assumes missingness at random, so that the last term in (1) is ignored. Growth mixture modeling (Muthén and Shedden 1999; Muthén

et al. 2002; Muthén and Muthén 1998-2002) includes η_i and T_i. BFHR includes C_i, but not T_i (or η_i), and includes the last term in (1), drawing on the latent ignorability of Frangakis and Rubin (1999). Muthén and Brown (2001) studied latent ignorability related to T_i in the last term of (1). In randomized studies, it would be of interest to study C_i and T_i classes jointly, because individuals in different trajectory classes may show different compliance, and missingness may be determined by these classes jointly.

If data have been generated by a growth mixture model with treatment effects varying across trajectory classes, what would a pretest-posttest analysis such as that in BFHR reveal? To judge the possibility of such treatment-trajectory interaction in the NYSCS, we considered several recent applications of growth mixture modeling that have used T_i to represent qualitatively different types of trajectories for behavior and achievement scores on children in school settings. Drawing on these real-data studies, two growth mixture scenarios were investigated. (A detailed description of these real-data studies and scenarios, and their parameter values, are given in Mplus Web Note 5 at www.statmodel.com/mplus/examples/webnote.html.) For simplicity, no missing data on the outcome or pretest is assumed, and C_i classes are not present. In a three-class scenario, the treatment effect is noteworthy only for a 70% middle class, assuming that low-class membership (10%) hinders individuals from benefiting from the treatment and that high-class membership (20%) does not really need the treatment. The achievement development in the three-class scenario is shown in Figure 1(a), and the corresponding posttest (y2)-pretest (y1) regressions are shown in Figure 1(b). The lines denoted ANCOVA show a regular ANCOVA analysis allowing for an interaction between treatment and pretest (different slopes). In the three-class scenario, the ANCOVA interaction is not significant at n = 2,000, and the treatment effect in the middle class is underestimated by 20%, but overestimated in the other two classes. In a two-class scenario (not shown here), where the treatment is noteworthy only for individuals in the low class (50%), ANCOVA detects an interaction that is significant at the NYSCS sample size of n = 2,000, but underestimates the treatment effect for most children in the low class. (At the low-class average pretest value of 0, the treatment effect is underestimated by 32%.)
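To see the attenuation mechanism concretely, one can simulate data in the flavor of the two-class scenario and fit an ANCOVA with a treatment-by-pretest interaction. All parameter values below are invented for illustration; they are not the values used in the Mplus Web Note:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 2000
treat = rng.integers(0, 2, n)                   # randomized assignment
low = rng.random(n) < 0.5                       # latent low-trajectory class (50%)
init = np.where(low, 0.0, 1.0) + rng.normal(0, 0.5, n)  # latent initial status
y1 = init + rng.normal(0, 0.5, n)               # pretest: initial status + error
effect = np.where(low, 0.4, 0.0)                # treatment helps only the low class
y2 = init + effect * treat + rng.normal(0, 0.5, n)

# Regular ANCOVA with a treatment-by-pretest interaction, ignoring the latent class
X = sm.add_constant(np.column_stack([treat, y1, treat * y1]))
fit = sm.OLS(y2, X).fit()
print("ANCOVA effect at pretest = 0:", fit.params[1])   # attenuated
print("true low-class effect:", 0.4)
```

Because the pretest is only an error-prone proxy for the latent class, the fitted effect at a pretest value of 0 mixes the low and high classes and falls short of the true low-class effect, in the spirit of the 32% underestimate noted above.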

Although the NYSCS children are selected from low-performing schools (the average NYSCS math and reading percentile rankings are around 23-28), there may still be sufficient heterogeneity among children in their achievement growth to make a treatment-trajectory interaction plausible. The three-class scenario is possible, perhaps with more children in the low class relative to the other two classes. If this is the case, the ANCOVA analysis shown in Figure 1 suggests a possible reason for BFHR's finding of low treatment effects. The empirical studies and the results in Figure 1 suggest that future program evaluations may benefit from exploring variation in treatment effects across children characterized by different development. Using data from at least two posttreatment time points (three time points total), the class-specific treatment effects generated in these data can be well recovered by growth mixture modeling. (Monte Carlo simulation results are given in Mplus Web Note 5 at www.statmodel.com/mplus/examples/webnote.html.)


Figure 1. Growth Mixture Modeling Versus Pretest-Posttest Analysis.

A more flexible analysis is obtained with more posttreatment time points. An improved design for the determination of the latent trajectory classes is to use more than one pretreatment time point, so that the trajectory class membership is better determined before the treatment starts.

3. COMPOUND EXCLUSION AND LATENT IGNORABILITY

Based on the ideas of principal stratification (Frangakis and Rubin 2002) and latent ignorability (Frangakis and Rubin 1999), BFHR successfully demonstrates that the complexities of educational studies can be better handled under more explicit and flexible sets of assumptions. Although we think that their structural assumptions are reasonable in the NYSCS, we would like to add some thoughts on the plausibility of two other assumptions, considering more general situations.

Compound exclusion (CE) is one of the key structural assumptions in identifying principal effects under latent ignorability. However, the plausibility of this assumption can be questioned in practice (Frangakis et al. 2002; Hirano et al. 2000; Jo 2002a, 2002c; Shadish, Cook, and Campbell 2002; West and Sagarin 2000). In the NYSCS, it seems realistic to assume that winning a lottery has some positive impact on always-takers; however, it is less clear how winning a lottery will affect never-takers. One possibility is that winning a lottery has a negative impact on parents, because they fail to benefit from it. Discouraged parents may have a negative influence on a child's test scores or response behaviors. This negative effect may become more evident if noncompliance is due to parents' low expectation of, or lack of interest in, their child's education. Another possibility is that winning a lottery has a positive impact on a child's test scores or response behaviors. For example, parents who are discouraged by being unable to send their child to private schools even with vouchers may try harder to improve the quality of existing resources (e.g., in the public school their child attends) and be more motivated to support their child to

improve his or her academic performance. Given these competing possibilities, it is not easy to predict whether and how CE is violated.

Depending on the situation, causal effect estimates can be quite sensitive to violation of the exclusion restriction in outcome missingness (Jo 2002b), which is less known than the impact of violating the exclusion restriction in observed outcomes (Angrist et al. 1996; Jo 2002a). The implication of possible violation of CE, and of its impact, is that the relative benefit of models assuming latent ignorability (LI) and standard ignorability (SI) depends on the degrees of deviation from CE and SI. Identification of causal effects under LI relies on the generalized (compound) exclusion restriction (i.e., both on the outcomes and on missingness of outcomes), whereas identification of causal effects under SI relies on the standard exclusion restriction (i.e., only on the outcomes). Therefore, in some situations, the impact of deviation from CE may outweigh the impact of deviation from SI, resulting in more biased causal effect estimates in models assuming LI than in models assuming SI (Jo 2002b). For example, if SI holds but CE is seriously violated (say, a 20% increase in the response rate due to treatment assignment for compliers and a 15% increase for never-takers), causal effect estimates and the coverage probability assuming LI and CE can drastically deviate from the true value and the nominal level. This type of violation does not affect models assuming SI and the standard exclusion restriction. To empirically examine the plausibility of SI, LI, and CE, it will be useful to do sensitivity analyses of models imposing different combinations of these assumptions. As BFHR points out, this investigation can be conducted by relaxing compound exclusion (e.g., Frangakis et al. 2002; Hirano et al. 2000) or by using alternative structural assumptions (e.g., Jo 2002c). More research is needed to examine the efficiency of these alternative models and to explore factors associated with insensitivity of LI models to violation of compound exclusion.

4. CONCLUSION

Causal inferences of the type BFHR provides are a dramatic improvement over the existing literature now available on the


question of whether school choice will produce better achievement outcomes for children in an urban public school system. The randomized lottery provides an exceptionally powerful tool for examining the impact of a program, far more useful than observational studies that have causal change intertwined hopelessly with self-selection factors. Statisticians are just now investigating variations in such principal strata analyses, that is, those involving latent classes formed as a function of randomized trials involving intervention invitations (such as vouchers), encouragement designs, and field trial designs involving more than one randomization (Brown and Liao 1999). The latent categories in this article, which BFHR labels "complier," "never-taker," "always-taker," and "defier," represent only one type of design. Other terms may be more relevant to the scientific questions underlying trials in which subjects are randomly assigned to different levels of invitation (e.g., Angrist and Imbens 1995) or different levels of implementation. Such trials not only have great potential for examining questions of effectiveness, sustainability, and scalability, but also require terms more consistent with adherence than compliance. Again, we congratulate the authors on an important addition to the methodological literature that we predict will have lasting impact.

ADDITIONAL REFERENCES

Angrist, J. D., and Imbens, G. W. (1995), "Two-Stage Least Squares Estimation of Average Causal Effects in Models With Variable Treatment Intensity," Journal of the American Statistical Association, 90, 431-442.

Brown, C. H., and Liao, J. (1999), "Principles for Designing Randomized Preventive Trials in Mental Health: An Emerging Developmental Epidemiologic Perspective," American Journal of Community Psychology, 27, 673-709.

Jo, B. (2002a), "Model Misspecification Sensitivity Analysis in Estimating Causal Effects of Interventions With Noncompliance," Statistics in Medicine, 21, 3161-3181.

——— (2002b), "Sensitivity of Causal Effects Under Ignorable and Latent Ignorable Missing-Data Mechanisms," available at www.statmodel.com/mplus/examples/jo.

——— (2002c), "Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications" (with discussion), Journal of Educational and Behavioral Statistics, 27, 385-420.

Muthén, B. (2002a), "Beyond SEM: General Latent Variable Modeling," Behaviormetrika, 29, 81-117.

Muthén, B., and Brown, C. H. (2001), "Non-Ignorable Missing Data in a General Latent Variable Modeling Framework," presented at the annual meeting of the Society for Prevention Research, Washington, DC, June 2001.

Muthén, B., Brown, C. H., Masyn, K., Jo, B., Khoo, S. T., Yang, C. C., Wang, C. P., Kellam, S., Carlin, J., and Liao, J. (2002), "General Growth Mixture Modeling for Randomized Preventive Interventions," Biostatistics, 3, 459-475.

Muthén, L., and Muthén, B. (1998-2002), Mplus User's Guide, Los Angeles: authors.

Muthén, B., and Shedden, K. (1999), "Finite Mixture Modeling With Mixture Outcomes Using the EM Algorithm," Biometrics, 55, 463-469.

Seltzer, M. H., Frank, K. A., and Bryk, A. S. (1994), "The Metric Matters: The Sensitivity of Conclusions About Growth in Student Achievement to Choice of Metric," Educational Evaluation and Policy Analysis, 16, 41-49.

Shadish, W. R., Cook, T. D., and Campbell, D. T. (2002), Experimental and Quasi-Experimental Designs for Generalized Causal Inference, Boston: Houghton Mifflin.

Slavin, R. E. (2002), "Evidence-Based Education Policies: Transforming Educational Practice and Research," Educational Researcher, 31, 15-21.

West, S. G., and Sagarin, B. J. (2000), "Participant Selection and Loss in Randomized Experiments," in Research Design, ed. L. Bickman, Thousand Oaks, CA: Sage, pp. 117-154.

Comment

Alan B. KRUEGER and Pei ZHU

The New York City school voucher experiment provides some of the strongest evidence yet available on the effect of private school vouchers on student test achievement. Barnard et al. provide a useful reevaluation of the experiment, and some of the authors were integrally involved in the design of the experiment. We will leave it to other discussants to comment on the Bayesian methodological advances in their paper, and instead comment on the substantive lessons that can be learned from the experiment and the practical lessons that emerge from the novel design of the experiment.

We have the advantage of having access to more complete data than Barnard et al. used in preparing their paper. This includes more complete baseline test information, data on multichild families as well as single-child families, and three years of follow-up test data instead of just one year of follow-up data.

Alan B. Krueger is Professor and Pei Zhu is graduate student, Economics Department, Princeton University, Princeton, NJ 08544 (E-mail: akrueger@princeton.edu). In part, this comment extends and summarizes results of Krueger and Zhu (2003). Readers are invited to read that article for a more in-depth analysis of many of the issues raised in this comment. Some of the results presented here differ slightly from those in our earlier article, however, because the definition of family size in this comment corresponds to the one used by Barnard et al., rather than to the definition in our earlier article.

In our comment, we use the more comprehensive sample because results for this sample are more informative, but we highlight where differences arise from using the sample analyzed by Barnard et al.

Three themes emerge from our analysis. First, simplicity and transparency are underappreciated virtues in statistical analysis. Second, it is desirable to use the most recent, comprehensive data for the widest sample possible. Third, there is no substitute for probing the definitions and concepts that underlie the data.

1. RANDOM ASSIGNMENT

As Barnard et al. explain, the voucher experiment entailed a complicated block design, and different random assignment procedures were used in the first and second set of blocks. In the first block, a propensity matched-pairs design (PMPD) method was used. In this block, far more potential control families were available than there was money to follow them up. Rather than select a random sample of controls for follow-up after randomly


Table 1. Efficiency Comparison of Two Methods of Random Assignment: Estimated Treatment Effect and Standard Error, by Method of Random Assignment

                                    Propensity score match      Randomized blocks           Relative            Sample-size-adjusted
                                    subsample                   subsample                   sampling variance   relative sampling
Year     Model                      Coefficient    No. obs.     Coefficient    No. obs.     (RB/PMPD)           variance (RB/PMPD)

All students
Year 1   Control for baseline       .33 (1.40)     721          2.10 (1.38)    734          .972                .989
         Omit baseline              .32 (1.40)     1,048        -1.02 (1.58)   1,032        1.274               1.254
Year 3   Control for baseline       1.02 (1.56)    613          .67 (1.74)     637          1.244               1.293
         Omit baseline              -.31 (1.57)    900          -.45 (1.78)    901          1.285               1.287

African-American students
Year 1   Control for baseline       .32 (1.88)     302          8.10 (1.71)    321          .827                .879
         Omit baseline              2.10 (1.99)    436          3.19 (2.07)    447          1.082               1.109
Year 3   Control for baseline       4.21 (2.51)    247          6.57 (2.09)    272          .693                .764
         Omit baseline              2.26 (2.47)    347          3.05 (2.31)    386          .875                .973

NOTE: Standard errors are in parentheses. The treatment-effect coefficient is from a regression of test scores on a dummy indicating assignment to receive a voucher (1 = yes), lottery randomization strata dummies, and, in the indicated models, baseline test scores. Bootstrap standard errors account for within-family correlation in residuals. Year 1 or 3 refers to the follow-up year. The last column adjusts the relative sampling variances of the treatment effects for differences in relative sample sizes. The sample of African-American students consists of those whose mother's race/ethnicity is identified as non-Hispanic Black/African-American.

selecting the treatments, the researchers estimated a propensity score model to align controls with the treatments, and then used a nearest-available-neighbor Mahalanobis match to select specific paired control families for follow-up. In principle, the PMPD design should improve the precision of the estimates by reducing the chance imbalances that can occur in a simple randomized block design. A standard randomized block design was used in the remaining four blocks.
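In outline, a propensity matched-pairs selection of this kind might look as follows. This is only a rough sketch of the general idea: as noted below, the actual PMPD model was not described in sufficient detail to replicate, so the propensity model, the penalty weight, and the greedy matching order here are all our own assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pmpd_pairs(X_treat, X_pool):
    """Pair each treated family with one control from a large pool: nearest
    available Mahalanobis neighbor, penalized by distance in estimated
    propensity score (a crude stand-in for the actual design)."""
    X = np.vstack([X_treat, X_pool])
    z = np.r_[np.ones(len(X_treat)), np.zeros(len(X_pool))]
    ps = LogisticRegression(max_iter=1000).fit(X, z).predict_proba(X)[:, 1]
    ps_pool = ps[len(X_treat):]
    VI = np.linalg.inv(np.cov(X_pool.T))       # Mahalanobis metric from the pool
    taken, pairs = set(), []
    for i, x in enumerate(X_treat):            # greedy order; the real design may differ
        d = X_pool - x
        maha = np.einsum('ij,jk,ik->i', d, VI, d)
        score = maha + 100.0 * (ps_pool - ps[i]) ** 2   # arbitrary penalty weight
        for j in np.argsort(score):
            if int(j) not in taken:
                taken.add(int(j))
                pairs.append((i, int(j)))
                break
    return pairs

# Illustrative use with fake covariates
rng = np.random.default_rng(0)
pairs = pmpd_pairs(rng.normal(0, 1, (50, 5)), rng.normal(0.2, 1, (500, 5)))
```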

Barnard et al. emphasize that the PMPD block was more balanced for 15 of the 21 baseline variables that they examined. But the key question is the extent to which the precision of the estimates was improved, not the covariate balance, because both blocks yield unbiased estimates. If the covariates explain relatively little of the outcome variable, then the PMPD will add very little. In Table 1, we provide standard intent-to-treat (ITT) estimates (that is, the difference in mean test scores between treatments and controls, conditional on the strata used for random assignment: family size by block by high/low-achieving school) and, more importantly, their standard errors, separately for the PMPD subsample and the randomized block subsample. The outcome variable is the average of the national percentile rank on the math and reading segments of the Iowa Test of Basic Skills, taken either 1 year or 3 years after random assignment. Two sets of results are presented: one set controlling for baseline test scores, for the subsample with baseline scores, and the other without controlling for scores, for the larger sample. The last column adjusts the relative sampling variances for differences in sample size between the blocks.
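The last two columns of Table 1 appear to be simple functions of the reported standard errors and sample sizes. As an illustrative check, for the first row (all students, year 1, controlling for baseline):

```python
se_pmpd, n_pmpd = 1.40, 721    # propensity score match subsample
se_rb,   n_rb   = 1.38, 734    # randomized blocks subsample

rel_var = (se_rb / se_pmpd) ** 2          # relative sampling variance, RB/PMPD
adj = rel_var * (n_rb / n_pmpd)           # adjusted for unequal sample sizes
print(round(rel_var, 3), round(adj, 3))   # 0.972 0.989, matching Table 1
```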

For all students (panel A), the standard errors are about 10% smaller in the PMPD sample. For the Black students (panel B), however, the standard errors are essentially the same or slightly larger in the PMPD sample. This finding is unexpected, because Hill, Rubin, and Thomas (2000) predicted that analyses of subgroups would have more power in the PMPD design, because

matching should lead to a more nearly equal representation of subgroups in the treatment and control groups. In addition, the covariate balance is less good for the Black students in the PMPD than in the randomized blocks if we compute Z scores for the differences between treatments and controls for the baseline covariates. One possibility is that, because Black students could have been matched to non-Black students in the PMPD design, this subsample was actually less balanced than the subsample of Blacks from the more conventional randomized block design.

The increase in power from the PMPD design is relatively modest, even in the full sample. Of much more practical importance is the fact that the complexity of the PMPD design caused Mathematica to initially calculate incorrect baseline sample weights. The baseline weights were designed to make the follow-up sample representative of all eligible applicants for vouchers. The initial weights assigned much too little importance to the control sample in the PMPD block. Because these families represented many unselected controls, they should have been weighted heavily. The revised weights increased the weight on the PMPD controls by 620%. In contrast, the weight increased by just 18% for the rest of the sample.

This error had grave consequences for the initial inferences and analyses of the data. First, using the faulty weights, Peterson, Myers, and Howell (1998) reported highly statistically significant differences in baseline test scores between the treatment and control groups, with control group members scoring higher on both the math and reading exams. After the mistake was discovered and the weights were revised, however, the baseline differences became small and statistically insignificant. (We note, however, that if we limit attention to the students from single-child families that Barnard et al. use, and use the more complete baseline test score data, then there is


a statistically significant difference in baseline reading scores between treatments and controls, even with the revised baseline weights. If we pool together students from single-child and multichild families, then the baseline difference is insignificant. This is another reason why we think it is more appropriate to use the broader sample that includes children from all families.)

Second, because of the inaccurate inference that there was a difference in baseline ability between treatments and controls, Mathematica's researchers were discouraged from analyzing (or at least from presenting) results that did not condition on baseline scores. This is unfortunate, because conditioning on baseline scores caused the researchers to drop from the sample all the students who were initially in kindergarten (because these students were not given baseline tests) and 11% of students initially in grades 1-4. Including students with missing baseline scores increases the sample by more than 40% and expands the population to which the results can be generalized. As explained later, and in an article by Krueger and Zhu (2003), qualitatively different results are found if students with missing baseline scores are included in the sample. Because assignment to receive a voucher was random within lottery strata, estimates that do not condition on baseline scores are nonetheless unbiased. Moreover, it is inefficient to exclude students with missing baseline scores (most of whom had follow-up test scores), and such a sample restriction can potentially cause sample selection bias.

Of lesser importance is the fact that the PMPD design also complicates the calculation of standard errors. The matching of treatments and controls on selected covariates creates a dependence between paired observations. Moreover, if the propensity score model is misspecified, then the equation error also creates a dependence between observations. Researchers have not taken this dependence into account in the calculation of standard errors. We suspect, however, that this is likely to cause only a small understatement of the standard errors, because the covariates do not account for much of the residual variance of test scores and because the PMPD sample is only about half of the overall sample. (We tried to compute the propensity score to adjust the standard errors ourselves, but were unable to replicate the PMPD model because it was not described in sufficient detail and because the computer programs are proprietary. This also prevented us from replicating estimates in Table 4 of Barnard et al.)

The problems created by the complicated experimental design, particularly concerning the weights, lead us to reiterate the advice of Cochran and Cox (1957): "A good working rule is to use the simplest experimental design that meets the needs of the occasion." It seems to us that, in this case, the PMPD design introduced unnecessary complexity that inadvertently led to a consequential mistake in the computation of weights, with very little improvement in efficiency. This was not the fault of the architects of the PMPD design, who did not compute the weights, but it was related to the complexity of the design. Simplicity and transparency, in designs, methods, and procedures, can help avoid mistakes down the road, which are almost inevitable in large-scale empirical studies. Indeed, Mathematica recently informed us that they still do not have the baseline weights exactly correct, 5 years after random assignment.

2. INTENT-TO-TREAT ESTIMATION RESULTS

Barnard et al. devote much attention to addressing potential problems caused by missing data for the sample of students in grades 1-4 at baseline; this follows the practice of earlier reports by Mathematica, which restricted the analysis of student outcomes (but not parental responses, such as satisfaction) to those enrolled in grades 1-4 at baseline. In our opinion, a much more important substantive problem results from the exclusion of students in the kindergarten cohort, who were categorically dropped because they were not given baseline tests. Because assignment to treatment status was random (within strata), a simple comparison of means between treatments and controls, without conditioning on baseline scores, provides an unbiased estimate of the average treatment effect. Moreover, as Barnard et al. and others show, treatments and controls were well balanced in terms of baseline characteristics, so there is no reason to suspect that random assignment was somehow corrupted.

Table 2 presents regression estimates of the ITT effect using various samples. Because we cannot replicate the Bayesian estimates without knowing the propensity score, we present conventional ITT estimates. For each entry in the first column, we regressed the test score percentile rank on a voucher offer dummy, 30 dummies indicating lottery strata (block × family size × high/low school), and baseline test scores. The sample is limited to those with baseline test data. Our results differ from the ITT estimates in Table 5 of Barnard et al. because we use the revised weights and a more comprehensive sample that also includes students from multichild families. Barnard et al. report that their Bayesian estimates are "generally more stable" than the conventional ITT estimates, "which in some cases are not even credible," but the extreme ITT estimates that they refer to, for 4th graders from high-achieving schools, are based on only 15 students. For the samples that pool students across grades, the results of the conventional ITT estimates are a priori credible and probably not statistically different from their Bayesian estimates.
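A schematic version of the estimator behind each entry follows; the column names, weights, and replication count are placeholders of ours, because the data and revised weights are not reproduced here:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def itt_estimate(df):
    """WLS of the follow-up percentile rank on a voucher-offer dummy and
    the 30 lottery-strata dummies (baseline scores added for column 1)."""
    fit = smf.wls("score ~ voucher + C(stratum)", data=df,
                  weights=df["weight"]).fit()
    return fit.params["voucher"]

def family_bootstrap_se(df, n_boot=400, seed=0):
    """Bootstrap that resamples whole families, so the standard error
    reflects within-family correlation in residuals."""
    rng = np.random.default_rng(seed)
    fams = df["family_id"].unique()
    reps = []
    for _ in range(n_boot):
        draw = rng.choice(fams, size=len(fams), replace=True)
        boot = pd.concat([df[df["family_id"] == f] for f in draw],
                         ignore_index=True)
        reps.append(itt_estimate(boot))
    return float(np.std(reps, ddof=1))
```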

In the second column, we use the same sample as in column 1 but omit the baseline test score from the model. In the third column, we expand the sample to include those with missing baseline scores. In the fourth column, we continue to include students with missing baseline scores, but restrict the sample to those initially in low-achieving public schools.

We focus mainly on the results for Black students and students from low-achieving public schools, because the effect of offering a voucher on either reading or math scores for the overall sample is always statistically insignificant and because most public policy attention has focused on these two groups.

Although it has not been made explicit in previous studies of these data, Black students have been defined as children with a non-Hispanic Black/African-American mother. We use this definition, and a broader one, to probe the sensitivity of the results.

Omitting baseline scores has little qualitative effect on the estimates when the same sample is used (compare columns 1 and 2); the coefficient typically changes by no more than half a standard error. For year 3 scores for Black students, for example, the treatment effect on the composite score is 5.2 points (t = 3.2) when controlling for baseline scores and 5.0 points (t = 2.6) when not controlling. This stability is expected in a


Table 2. Estimated Treatment Effects With and Without Controlling for Baseline Scores: Coefficient on Voucher Dummy From Test Score Regressions for Various Samples

                                                   Subsample with      Subsample with      Full sample;       Subsample, applicant
                                                   baseline scores;    baseline scores;    omits baseline     school Low; omits
Test        Sample                                 controls for        omits baseline      scores             baseline scores
                                                   baseline scores     scores

First follow-up test
Composite   Overall                                1.27 (.96)          .42 (1.28)          -.33 (1.06)        -.60 (1.06)
            Mother Black, non-Hispanic             4.31 (1.28)         3.64 (1.73)         2.66 (1.42)        1.75 (1.57)
            Either parent Black, non-Hispanic      3.55 (1.24)         2.52 (1.73)         1.38 (1.42)        .53 (1.55)
Reading     Overall                                1.11 (1.04)         .03 (1.36)          -1.13 (1.17)       -1.17 (1.18)
            Mother Black, non-Hispanic             3.42 (1.59)         2.60 (1.98)         1.38 (1.74)        .93 (1.84)
            Either parent Black, non-Hispanic      2.68 (1.50)         1.49 (1.97)         -.09 (1.71)        -.53 (1.81)
Math        Overall                                1.44 (1.17)         .80 (1.45)          .47 (1.17)         -.03 (1.15)
            Mother Black, non-Hispanic             5.28 (1.52)         4.81 (1.93)         4.01 (1.54)        2.68 (1.63)
            Either parent Black, non-Hispanic      4.42 (1.48)         3.54 (1.90)         2.86 (1.51)        1.59 (1.63)

Third follow-up test
Composite   Overall                                .90 (1.14)          .36 (1.40)          -.38 (1.19)        .10 (1.16)
            Mother Black, non-Hispanic             5.24 (1.63)         5.03 (1.92)         2.65 (1.68)        3.09 (1.67)
            Either parent Black, non-Hispanic      4.70 (1.60)         4.27 (1.92)         1.87 (1.68)        1.90 (1.67)
Reading     Overall                                .25 (1.26)          -.42 (1.51)         -1.00 (1.25)       -.08 (1.24)
            Mother Black, non-Hispanic             3.64 (1.87)         3.22 (2.19)         1.25 (1.83)        2.06 (1.83)
            Either parent Black, non-Hispanic      3.37 (1.86)         2.75 (2.18)         .76 (1.83)         1.23 (1.82)
Math        Overall                                1.54 (1.30)         1.15 (1.53)         .24 (1.33)         .29 (1.28)
            Mother Black, non-Hispanic             6.84 (1.86)         6.84 (2.09)         4.05 (1.86)        4.13 (1.85)
            Either parent Black, non-Hispanic      6.04 (1.83)         5.78 (2.09)         2.97 (1.85)        2.58 (1.82)

NOTE: Standard errors are in parentheses. The dependent variable is the test score NPR. The reported coefficient is the coefficient on the voucher-offer dummy. All regressions control for 30 randomization strata dummies; models in the first column also control for baseline math and reading test scores. Bootstrap standard errors are robust to correlation in residuals among students in the same family. Bold font (in the published table) indicates that the absolute t ratio exceeds 1.96.

Sample sizes in year 1 for the subsamples with/without baseline scores: overall, 1,455/2,080; mother Black, 623/883; either parent Black, 684/968.

Sample sizes in year 3 for the subsamples with/without baseline scores: overall, 1,250/1,801; mother Black, 519/733; either parent Black, 572/807.

Sample sizes in year 1 for the subsample from low schools, without baseline scores: overall, 1,802; mother Black, 770; either parent Black, 843.

Sample sizes in year 3 for the subsample from low schools, without baseline scores: overall, 1,569; mother Black, 647; either parent Black, 712.

randomized experiment. When the sample is expanded to include those with missing baseline scores in column 3, however, the ITT effect falls almost in half in year 3 for Black students, to 2.65 points, with a t ratio of 1.58. The effect on the math score is statistically significant, as Barnard et al. emphasize. Qualitatively similar results are found if baseline covariates, such as mother's education and family income, are included as regressors (see Krueger and Zhu 2003).

The classification of Black students used in the study is somewhat idiosyncratic. Race and ethnicity were collected in a single question in the parental survey (contrary to the OMB guidelines for government surveys), so it is impossible to identify Blacks of Hispanic origin (of which there are a large number in New York). In addition, the survey did not directly ask about the child's race or ethnicity, so children's race must be

inferred from parents' race/ethnicity. A broader definition of Black students, and one that probably comes closer to the definition used in government surveys and most other educational research, would also include those who have a Black father in the sample of Black students. According to the 1990 Census, race was reported as Black for 85% of the children with a Black father and a Hispanic mother (the overwhelming non-Black portion of the sample) in the New York metropolitan region.

Hence, we also present results for a sample in which either parent is listed as "Black, non-Hispanic." This increases the sample of Black students by about 10%, and the results are even weaker for this sample. For example, the ITT estimate using the more comprehensive sample of Black students enrolled in grades K-4 at baseline is 1.87 points (t = 1.11) on the third-year follow-up composite test, although even with this sample


the effect on math is significant at the 10% level in year 1, as Barnard et al. have found, and almost significant in year 3. So we conclude that the effect of the opportunity to use a private school voucher on the composite score for the most comprehensive sample of Black students is insignificantly different from 0, although it is possible that there was initially a small beneficial effect on the math score for Black students.

In the final column, we present results for the subsample of students originally enrolled in schools with average test scores below the median score in New York City. In each case, the results for this subsample are quite similar to those for the full sample, and a formal test of the null hypothesis that students from low- and high-achieving schools benefit equally from vouchers is never close to rejecting. It also appears very unlikely that the differences between the treatment effects for applicants from low- and high-achieving schools in Barnard et al.'s Table 4 are statistically significant, either.

3. WHAT WAS BROKEN?

We agree with Barnard et al. that the experiment was broken, in the sense that attrition and missing data were common. Previous analyses were also strained, if not broken, by their neglect of the cohort of students originally in kindergarten, who were in third grade at the end of the experiment and whose follow-up test scores were ignored. Including these students qualitatively alters the results for Black students. The experiment was also broken in the sense that years passed before correct baseline weights were computed.

We disagree, however, with the interpretation that the experiment was broken because compliance was less than 100%. This

depends on the question that one is interested in answering. Because most interest in the experiment for policy purposes centers on the impact of offering vouchers on achievement, not on compelling students to use vouchers, we think the ITT estimates, which reflect inevitable partial usage of vouchers, are most relevant (see also Rouse 1998; Angrist et al. 2003). Moreover, New York had a higher voucher take-up rate than the experiments in Dayton and the District of Columbia, so one could argue that the New York experiment provides an upper-bound estimate of the effect of offering vouchers. On the other hand, if the goal is to use the experiment to estimate the effect of attending private school on achievement, then such methods as instrumental variables, or those used by Barnard et al., are necessary to estimate the parameter of interest. We consider this of secondary interest in this case, however.

ADDITIONAL REFERENCES

Angrist, J., Bettinger, E., Bloom, E., King, E., and Kremer, M. (2003), "Vouchers for Private Schooling in Colombia: Evidence from a Randomized Natural Experiment," American Economic Review, 92, 1535-1558.

Cochran, W., and Cox, G. (1957), Experimental Designs (2nd ed.), New York: Wiley.

Krueger, A., and Zhu, P. (2003), "Another Look at the New York City School Voucher Experiment," American Behavioral Scientist, forthcoming. Also available from Industrial Relations Section, Princeton University, at http://www.irs.princeton.edu.

Peterson, P., Myers, D., and Howell, W. (1998), "An Evaluation of the New York City Scholarships Program: The First Year," Mathematica Policy Research, Inc., available at http://www.ksg.harvard.edu/pepg/pdf/ny1rpt.pdf.

Rouse, C. (1998), "Private School Vouchers and Student Achievement: An Evaluation of the Milwaukee Parental Choice Program," The Quarterly Journal of Economics, 113, 553-602.

Comment

Richard A. BERK and Hongquan XU

1. INTRODUCTION

The article by Barnard, Frangakis, Hill, and Rubin (hereafter BFHR) is a virtuoso performance. By applying the Neyman-Rubin model of causal effects and building on a series of important works on the estimation and interpretation of causal effects (e.g., Angrist, Imbens, and Rubin 1996), BFHR manage to extract a reasonable set of findings for a "broken" randomized experiment. The article more generally underscores the important difference between estimating relationships in a consistent manner and interpreting those estimates in causal terms; obtaining consistent estimates is only part of the enterprise. Finally, the article emphasizes that assumptions made so that proper estimates may be obtained are not just technical moves of convenience. Rather, they are statements about how the empirical world is supposed to work. As such, they need to be examined

Richard A. Berk is Professor and Hongquan Xu is Assistant Professor, Department of Statistics, University of California, Los Angeles, CA 90095 (E-mail: berk@stat.ucla.edu).

concretely with reference to the phenomena being studied, evaluated empirically whenever possible, and, at a minimum, subjected to sensitivity tests.

As battle-tested academics, we can certainly quibble here and there about some aspects of the article. For example, might it not have made sense to construct priors from the educators and economists who were responsible for designing the intervention? Moreover, might it not have made sense to interpret the results using a yardstick of treatment effects that are large enough to matter in practical terms? Nonetheless, we doubt that, overall, we could have done as well as BFHR did. Moreover, most of our concerns depend on features of the data that cannot be known from a distance. We would have had to carefully examine the data ourselves. As a result, we focus on why a virtuoso performance was needed to begin with. Why was this trip necessary?


2. NONCOMPLIANCE

The implementation of randomized field experiments, as desirable as they are, invites any number of well-known problems (Berk 1990). One of these problems is that human subjects often do not do what they are told, even when they say they will. There is recent scholarly literature on noncompliance (Metry and Meyer 1999; Johnson 2000) and a citation trail leading back to evaluations of the social programs of the War on Poverty (Haveman 1977). Some noncompliance can be expected except when there is unusual control over the study subjects. Our recent randomized trial of the inmate classification system used by the California Department of Corrections is one example (Berk, Ladd, Graziano, and Baek 2003).

An obvious question follows: Why does it seem that the individuals who designed this experiment were caught by surprise? As best we can tell, little was done to reduce noncompliance, and, more important for the analyses of BFHR, apparently almost no data were collected on factors that might be explicitly related to noncompliance (Hill, Rubin, and Thomas 2000).

Without knowing far more about the design stages of the experiment, we find it difficult to be very specific. But from the literatures on compliance and school choice, the following are probably illustrative indicators relevant for parents who would not be inclined to use a school voucher; for other forms of noncompliance, other measures would be needed:

1. Whether the adults in the household are employed outside the home during the day

2. Whether there is any child care at the private schools before classes begin and after they end, for children of employed parents

3. A household's ability to pay for private school beyond the value of the voucher

4. Travel time to the nearest private schools

5. Availability of transportation to those schools

6. Religious preference of the household and religious affiliation of the private school

From these examples, and others that individuals closer to the study would know about, one can imagine a variety of measures to reduce noncompliance. For instance, key problems for any working parent are child care before and after school and round-trip school transportation. These might be addressed directly with supplemental support. If no such resources could be obtained, then the study design might be altered to screen out, before random assignment, households for whom compliance was likely to be problematic.

3. MISSING DATA

No doubt BFHR know more about the subtleties of the missing data than we do. Nevertheless, the same issues arise for missing data that arise for noncompliance. Surely problems with missing data could have been anticipated; factors related to nonresponse to particular items could have been explicitly measured, and prevention strategies perhaps could have been deployed.

4. DESIGN EFFICIENCY

A carefully designed experiment not only provides a solid basis for statistical inference, but also gives credible evidence for causation. The New York School Choice Scholarships Program (NYSCSP) was apparently the first good example of a randomized experiment to examine the potential benefits of vouchers for private schools. Offering a scholarship (treatment) randomly through a lottery can (1) provide a socially acceptable way to ration scarce resources, (2) protect against undesirable selection bias, and (3) produce balance between the observed and unobserved variables that may affect the response. In addition to the randomization and blocking, NYSCSP implemented a PMPD (propensity matched pairs design) to choose a control group from a large candidate pool and to balance many covariates explicitly. Table 2 of BFHR shows that various background variables were balanced properly between the treatment group and the control group.

But might it have been possible to do a bit better? BFHR did not mention that the sample size of the applicant's school was not balanced, although winners from high and low schools were balanced between the treatments and the controls. An ideal design would have had equal winners from "bad" (or low) and "good" (or high) public schools. In the NYSCSP, 85% of the winners were from bad schools and 15% were from good schools, to achieve the social goals of the School Choice Scholarships Foundation (Hill et al. 2000, p. 158). Bad schools are defined as "those for which the average test scores were below the median test scores for the city" (Hill et al. 2000, p. 158). A direct consequence is that the estimates from the good schools have much larger variation than those from the bad schools, which was evident in BFHR's Tables 4–6: the confidence intervals for good schools are wider.

Instead of classifying schools into two classes (bad and good), it might have been helpful to classify them into four classes: very bad, bad, good, and very good. To estimate the effect of the applicant's school (assuming linear effects), an optimal design would assign half of the winners to the very bad schools, the other half to the very good schools, and none to the two middle classes. Avoiding the middle classes enlarges the distance between bad and good schools, reduces the source variation within the bad and good schools, and thus reduces the variation in the estimates. To be more consistent with the goals of the School Choice Scholarships Foundation, an alternative design might take 75% of winners from very bad schools, 10% from bad schools, 5% from good schools, and 10% from very good schools. This design would have a higher design efficiency than the original design, especially because the applicant's school was the most important design variable after family size (Hill et al. 2000, p. 160). Of course, people on the ground at the time may have considered these options and rejected them. Our point is that with better planning, it might have been possible to squeeze more efficiency out of the study design. Such an outcome would have made the job undertaken by BFHR a bit easier.
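The gain can be quantified with a small calculation. The following sketch (ours, purely illustrative; it assumes a linear school-quality effect on equally spaced scores, homoskedastic errors, and a hypothetical split of the original 85%/15% allocation across four classes) compares the variance factor of the estimated linear effect under the three allocations just discussed:

    import numpy as np

    def slope_var_factor(weights, scores):
        # The variance of the OLS slope for a linear effect is proportional
        # to 1 / [n * (weighted variance of the school-quality score)].
        w = np.asarray(weights, dtype=float)
        x = np.asarray(scores, dtype=float)
        xbar = np.sum(w * x)
        return 1.0 / np.sum(w * (x - xbar) ** 2)

    scores = [0, 1, 2, 3]  # very bad, bad, good, very good (equally spaced)
    designs = {
        "extremes 50/0/0/50": [0.50, 0.00, 0.00, 0.50],
        "alternative 75/10/5/10": [0.75, 0.10, 0.05, 0.10],
        "two-class-like 85/15 (hypothetical split)": [0.425, 0.425, 0.075, 0.075],
    }
    for name, w in designs.items():
        print(f"{name}: relative Var(slope) = {slope_var_factor(w, scores):.3f}")

Running this gives relative variances of roughly .44, 1.05, and 1.32, so the extreme allocation is the most efficient, and the 75/10/5/10 compromise still improves on the two-class design.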

5. CONCLUSIONS

Scientific problems often provide a fertile ground for important statistical developments; BFHR's article, and the literature on which it depends, are excellent illustrations. However, from


the perspective of public policy, the goal of program evaluation is to obtain usefully precise estimates of program impacts while using the most robust means of analysis available. Simplicity is also important, both for data analysts who must do the work and for policy makers who must interpret the results. It follows that large investments at the front end of an evaluation (in research design, measurement, and study integrity) are essential. Virtuoso statistical analyses can certainly help when these investments are insufficient or when they fail, but they must not be seen as an inexpensive alternative to doing the study well to begin with.

ADDITIONAL REFERENCES

Angrist, J., Imbens, G. W., and Rubin, D. B. (1996), "Identification of Causal Effects Using Instrumental Variables" (with discussion), Journal of the American Statistical Association, 91, 441–472.

Berk, R. A. (1990), "What Your Mother Never Told You About Randomized Field Experiments," in Community-Based Care of People With AIDS: Developing a Research Agenda, AHCPR Conference Proceedings, Washington, DC: U.S. Department of Health and Human Services.

Berk, R. A., Ladd, H., Graziano, H., and Baek, J. (2003), "A Randomized Experiment Testing Inmate Classification Systems," Journal of Criminology and Public Policy, 2, 215–242.

Haveman, R. H. (ed.) (1977), A Decade of Federal Antipoverty Programs: Achievements, Failures, and Lessons, New York: Academic Press.

Hill, J. L., Rubin, D. B., and Thomas, N. (2000), "The Design of the New York School Choice Scholarships Program Evaluation," in Research Design: Donald Campbell's Legacy (Vol. II), ed. L. Bickman, Thousand Oaks, CA: Sage.

Johnson, S. B. (2000), "Compliance Behavior in Clinical Trials: Error or Opportunity?," in Promoting Adherence to Medical Treatment in Chronic Childhood Illness: Concepts, Methods, and Interventions, ed. D. Drotar, Mahwah, NJ: Lawrence Erlbaum.

Metry, J., and Meyer, U. A. (eds.) (1999), Drug Regimen Compliance: Issues in Clinical Trials and Patient Management, New York: Wiley.

Rejoinder
John BARNARD, Constantine E. FRANGAKIS, Jennifer L. HILL, and Donald B. RUBIN

We thank the editorial board and the thoughtful group of discussants that they arranged to comment on our article. The discussants have raised many salient issues and offered useful extensions of our work. We use this rejoinder primarily to address a few points of contention.

Muthén, Jo, and Brown

Muthén, Jo, and Brown (henceforth MJB) provide an enlightening discussion of an alternate, and potentially complementary, approach to treatment effect estimation. The growth models that they discuss are for data more extensive than what we analyzed, and so are beyond the scope of our article. Nevertheless, MJB propose an interesting idea for looking for interactions in longitudinal experimental data, and we look forward to future applications of the method to appropriate datasets.

Although models that incorporate both our compliance principal strata and latent growth trajectories are theoretically possible (assuming multiple years of outcome data) and could potentially yield an extremely rich description of the types of treatment effects manifested, the more time points that exist, the more complicated the compliance patterns can become (at least without further assumptions), and it may be difficult to find a dataset rich enough to support all of the interactions. Given a choice, we feel that it is probably more beneficial to handle real observed issues than to search for interactions with latent trajectories, but certainly it would always be nice to be able to do both. Of course, such a search for interactions in trajectories is more beneficial when there is enough prior knowledge about the scientific background to include additional structural assumptions that can support robust estimation.

MJB make an interesting point about the exploration of potential violations of exclusion for never takers. Such analyses have been done in past work by some of the authors (Hirano et al. 2000; Frangakis, Rubin, and Zhou 2002), but were not pursued for our study because the robustness of such methods is still an issue in examples as complex as this one.

Regarding the trade-offs between the joint latent ignorability and compound exclusion assumptions versus the joint assumptions of standard ignorability and the standard exclusion restriction, this topic is up for debate. We would like to point out, however, that these standard structural assumptions come with their own level of arbitrariness. These issues have been explored by Jo (2002) and Mealli and Rubin (2002, 2003).

We thank MJB for providing a thought-provoking discussion of our model and assumptions, as well as ample fodder for additional directions that can be explored in this arena. Clearly, this general category of work has stimulated much new and exciting modeling.

Krueger and Zhu

Krueger and Zhu (henceforth KZ) focus primarily on the underlying study and on analyses that either we did not do or that used data we did not have. Clearly, they are interested in the substantive issues, which is admirable, but we cannot be responsible for analyses that we neither did nor were consulted on, and we are not prepared to comment on what analyses we might have undertaken with data that we did not have.

KZ state that "there is no substitute for probing the definition and concepts that underlie the data." Although we are not entirely sure what this refers to in particular, if it means that good science is more important than fancy techniques, then we certainly agree.

Another comment by KZ is that "it is desirable to use the most recent, comprehensive data for the widest sample possible." We agree that it is desirable to have access to the most comprehensive data available, but it is not necessary to use all of these data in any given analysis. Here we used only first-year data because these were the only data available to us at the


time that we were building the model. We used only single-child families, as explained in the article, because, regrettably, no compliance data were collected for individual children in the multichild families.

The exclusion of kindergarten children could be considered more of a judgment call regarding the quality of the pretest scores, made not only for efficiency, but also for the reasonableness of our assumptions about noncompliance and missing data. Nothing in theory would prevent us from including these children and omitting these extra covariates.

These types of trade-offs (more data vs. more appropriate data) are common in statistical analyses, and the benefits are well known, for example, for avoiding model extrapolation or when some subsets of data do not mean the same thing as others. In fact, Krueger "ignored data" in his article on the minimum wage with David Card (Card and Krueger 1994) by limiting analyses to restaurants from a similar geographic region, in the hope that employees and employers across the two comparison states would face similar economic conditions, rather than looking at all restaurants in these states. This is a form of implicit matching, which we believe represented a good choice. Surely there are times when focusing on subsets of data produces more reliable results.

Furthermore, we would like to point out that KZ's assertion that "because assignment to treatment status was random (within strata), a simple comparison of means between treatments and controls, without conditioning on baseline scores, provides an unbiased estimate of the average treatment effect" is simply false, because there are missing outcomes. Even if there is perfect compliance, a simple comparison of means across treatment groups when there are missing outcomes will not, in general, lead to an unbiased estimate of the treatment effect. Assumptions need to be made about the missing-data process.
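A small simulation illustrates the point under one hypothetical missing-data process (all numbers invented): assignment is perfectly randomized and compliance is perfect, but response depends on both assignment and a latent characteristic that also drives the outcome, so the complete-case difference in means is biased:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200_000
    z = rng.integers(0, 2, size=n)        # randomized assignment
    a = rng.normal(size=n)                # latent characteristic
    y = 50 + 3 * z + 10 * a               # true average treatment effect is 3
    # Able students respond more often under control, less often under treatment.
    p_respond = 1 / (1 + np.exp(-(0.5 * a - 1.0 * z * a)))
    r = rng.random(n) < p_respond
    naive = y[r & (z == 1)].mean() - y[r & (z == 0)].mean()
    print(f"complete-case estimate: {naive:.2f} (truth: 3.00)")

The complete-case estimate lands well below 3 even though assignment is random, because the responders in the two arms are no longer comparable.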

When noncompliance exists in addition to the missing outcome data, Frangakis and Rubin (1999) have shown that estimation even of the ITT effect requires additional assumptions. Moreover, the ITT estimate, generally considered a conservative estimate of the treatment effect, can actually be anticonservative in this scenario. Finally, contrary to what KZ seem to imply in their final paragraph, instrumental variables methods alone are not sufficient to address the problems of both noncompliance and missing data.

KZ advocate "simplicity and transparency." We wholeheartedly agree, but disagree with the examples they use to defend their stance. Their first argument is that the PMPD did not yield strong efficiency gains and that we did not account for the PMPD in the calculation of standard errors. It may well be that precision gains were relatively modest; however, the analyses that they present are by no means conclusive in this regard. First, as with KZ's other ITT analyses, we have no clue as to how missing outcome data are handled. In addition, the PMPD was also used to balance compliance and missingness. It is unclear, however, how the gains in precision due to matching the missingness and noncompliance are displayed, if at all, in KZ's results. Finally, KZ did not account for the experimental design when estimating their results. Failure to account properly for the design typically results in estimates that overstate the variances of treatment effect estimates. In this design, it is the correlation between matched-pair units that leads the sampling variances to be smaller than their estimates.

Even if precision gains were truly modest, they of course were unknown to us at the time we debated the use of the PMPD design. KZ's argument is like the complaint of a cross-country motorist who experiences no car problems during his long drive and then bemoans the money wasted on a spare tire purchased before the trip began. We prefer to avoid potential problems by using careful design.

Importantly, KZ's criticism that we ignored this matching in the analysis is incorrect. The matching was done based on the estimated propensity score, so it follows from Rubin (1978) that assignment is ignorable conditionally given the covariates that were used to estimate it. Thus the Bayesian analysis that we conduct, which conditions on the estimated propensity score, is an approximately valid Bayesian analysis with respect to the matched assignment, assuming that our modeled relationship between the outcomes and the estimated propensity score captures the major features of the relationship to the covariates.

KZ's second argument is that use of the PMPD "led to grave consequences," because sampling weights were miscalculated by MPR. It seems odd to us to maintain, as KZ do, that the PMPD "caused" this error; nor is it clear that use of another design would have avoided the error. In any case, we have rerun our analyses, this time generalizing to the population of which the children in our analyses are representative (by setting constant weights in step 3 of Sec. 8.1.1). For this population, all results are very similar to our original results. These results are posted as Appendix C at http://biosun01.biostat.jhsph.edu/~cfrangak/papers/sc.

A third argument revolves around the protracted debate between use of ITT versus treatment-targeted measures such as CACE. We addressed both ITT and CACE, which allows the reader to focus on one or the other or both, depending on which effect is expected to generalize to another environment: the ITT, or "sociological," effect of accepting or not accepting an offered voucher, or the CACE, or "structural" or "scientific," effect of using a voucher. One could argue that the effect of attending (CACE) is the more generalizable, whereas the sociological effect (ITT) is more easily influenced, for instance, by "hype" about how well the program works. That is, the ITT estimate will not reflect the "effect of offering" if, the next time the program is offered, the compliance rates change (which seems likely). We prefer to examine both, to provide a more comprehensive picture.

In sum, we agree that simplicity and transparency are important; this is the major motivation for our detailed description of our assumptions in terms of readily understandable concepts, rather than commonly invoked and less intuitive conditions involving correlated error terms. However, this desire for simplicity must be balanced with the prudence of guarding against potential problems through careful design.

Although we agree in theory with much of what KZ espouse, some of their criticisms seem misguided. Moreover, their presentation of analyses with incomplete exposition regarding missing-data complications makes these analyses difficult for us to evaluate properly.


Berk and Xu

Berk and Xu (henceforth BX) raise several interesting questions about the way in which the evaluation was designed and implemented. Many of these are actually addressed in a preceding and lengthier article by us on this topic (Barnard et al. 2002). That article describes, for instance, MPR's efforts to reduce missing data, without which the missingness rates would surely have been much higher. It also describes special challenges faced in this study. But in any study with human subjects, it is difficult to completely eliminate missing data, particularly if one desires more interesting information than can be obtained from administrative data. One of the challenges for the collection of outcome test scores was the fact that children needed to be tested outside of the public school setting, which allowed the use of a standardized test not used by the public school system, thereby greatly reducing the possibility of bias resulting from teachers "teaching to the test." Consequently, parents had to bring their children to testing areas on weekends, a formidable hurdle for some, even when monetary incentives were used.

An important consideration with this study was that the evaluators took the initiative to investigate a program that was sponsored by an organization (SCSF) whose primary goal was to provide scholarships, not to do research. We feel that MPR did an admirable job convincing the SCSF to allow the study to be evaluated and raising money for this purpose, but this also meant that the evaluators could not control every aspect of the program. For example, the SCSF originally wanted to make the program available only to children from the lower-test-score schools, and the evaluators had to convince them to also allow a small percentage from those in higher-test-score schools, to increase the generalizability of the results. The designs that BX suggest, even though they have certain desirable statistical properties under appropriate assumptions, were not an option with this study.

Related to this issue of the alignment of the evaluators' and program administrators' goals is the issue of the potential for reducing noncompliance rates. Some effort was made, in that help was provided to scholarship winners in finding an appropriate private school. Arguably, however, 100% compliance should not be the goal in such a study, given that we would not expect 100% compliance if the program were expanded and offered, say, by the government on a larger scale. Thus it may not be wise to use extreme efforts to increase compliance if such efforts would not be offered in the large-scale public case. With the current study, the opportunity exists to examine the noncompliance process for this program in this setting: take-up rates, predictors of noncompliance, and so on. To this end, some sort of subexperiment that randomizes a reasonable set of compliance incentives might be useful in similar studies in the future, if sample sizes are sufficient.

We are not convinced that we should have solicited actual prior distributions from educators and economists. In our experience, this is a confusing, difficult, and often unrewarding task. Moreover, school choice is such a politically charged and contentious issue that it seems wiser to try to use relatively diffuse and objective prior distributions.

We agree that results should be interpreted in terms of a yardstick of practical effects. We are unaware of the most appropriate such yardstick to use in this context. One side could claim that even tiny average effects matter, not only because of the cumulative effects on parts of society, but also because of possible nonadditivity of such effects, which could imply huge effects for some rare individuals and minute effects for all others. Others could argue that seemingly quite large effects have no long-run practical impact on quality of life. The results that we present provide a basic empirical starting point for these debates between people whose expertise is in this area.

We thank BX for highlighting our emphasis on the role of the assumptions in our analyses. Our hope is that we have presented the assumptions in such a way as to enable readers to make their own judgment about whether or not they are likely to be satisfied, rather than only stating them technically, which would implicitly make those important decisions for the interested but nontechnical readers.

BX point out a dearth of certain types of covariates. Part of the problem here is that we presented only a small portion of the data collected, because our model in its current form did not handle many covariates. However, there were indeed discussions between some members of our team and MPR staff, during the development of the questionnaire, about what types of variables could be included that would be predictive of missing data or compliance, as well as outcomes. Some of the variables proposed were discarded because of the surveyors' experience with them as subject to problems such as high measurement error. Also, not every variable discussed was included, because of a desire for a survey that would not be overly burdensome for the study families.

We raise a final, small, semantic question. We wonder about the propriety of the label "Neyman–Rubin model" to describe a framework for causal inference that views observational studies and randomized experiments on the same continuum and, moreover, encourages Bayesian analyses of both. Neyman (1923) certainly seems to have been the first to use the formal notation of potential outcomes for randomization-based inference in randomized experiments, which was a truly major advance, but neither he nor anyone else, to the best of our knowledge, used this formal notation to define causal effects in observational studies until Rubin (1974, 1975), nor did he nor anyone else ever advocate Bayesian inference for the unobserved potential outcomes until Rubin (1975, 1978), which is the perspective that we have implemented here.

Overall, we agree with BX about the paramount importance of study design, and we tried to design this study carefully, certainly more so than most social science analyses of secondary datasets like the PSID or CPS. We thank BX for their thoughtful commentary; they raise some important points.

We thank all of the discussants for their stimulating comments, which we believe have added emphasis to the importance of both the substantive topic of school vouchers and the development of appropriate methods to evaluate this and other such broken randomized experiments.

ADDITIONAL REFERENCES

Card, D., and Krueger, A. (1994), "Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania," American Economic Review, 84, 772–793.

Jo, B. (2002), "Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications" (with discussion), Journal of Educational and Behavioral Statistics, 27, 385–420.

Mealli, F., and Rubin, D. B. (2002), Discussion of "Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications," by B. Jo, Journal of Educational and Behavioral Statistics, 27, 411–415.

Mealli, F., and Rubin, D. B. (2003), "Assumptions Allowing the Estimation of Direct Causal Effects: Discussion of 'Healthy, Wealthy, and Wise? Tests for Direct Causal Paths Between Health and Socioeconomic Status,' by Adams et al.," Journal of Econometrics, 112, 79–87.

Rubin, D. B. (1975), "Bayesian Inference for Causality: The Importance of Randomization," in Proceedings of the Social Statistics Section, American Statistical Association, pp. 233–239.


In encouragement designs with compliance as the only partially uncontrolled factor and where there are full outcome data, Imbens and Rubin (1997) extended the Bayesian approach to causal inference of Rubin (1978a) to handle simple randomized experiments with noncompliance, and Hirano, Imbens, Rubin, and Zhou (2000) further extended the approach to handle fully observed covariates.

In encouragement designs with more than one partially uncontrolled factor, as with noncompliance and missing outcomes in our study, defining and estimating treatment effects of interest becomes more challenging. Existing methods (e.g., Robins, Greenland, and Hu 1999) are designed for studies that differ from our study in the goals and the degree of control of the aforementioned factors. (For a detailed comparison of such frameworks, see Frangakis and Rubin 2002.) Frangakis and Rubin (1999) studied a more flexible framework for encouragement designs with both noncompliance to treatment and missing outcomes, and showed that for estimation of either the ITT effect or the CACE, one cannot in general obtain valid estimates using standard ITT analyses (i.e., analyses that ignore data on compliance behavior) or standard IV analyses (i.e., those that ignore the interaction between compliance behavior and outcome missingness). They also provided consistent moment-based estimators that can estimate both ITT and CACE under assumptions more plausible than those underlying more standard methods.

Barnard et al. (1998) extended that template to allow for missing covariate and multivariate outcome values; they stopped short, however, of introducing specific methodology for this framework. We present a solution to a still more challenging situation, in which we have a more complicated form of noncompliance: some children attend private school without receiving the monetary encouragement (thus receiving treatment without having been assigned to it). Under assumptions similar to those of Barnard et al. (1998), we next fully develop a Bayesian framework that yields valid estimates of quantities of interest and also properly accounts for our uncertainty about these quantities.

5. PRINCIPAL STRATIFICATION IN SCHOOL CHOICE AND ROLE FOR CAUSAL INFERENCE

To make explicit the assumptions necessary for valid causal inference in this study, we first introduce "potential outcomes" (see Rubin 1979; Holland 1986) for all of the posttreatment variables. Potential outcomes for any given variable represent the observable manifestations of this variable under each possible treatment assignment. In particular, if child i in our study (i = 1, ..., n) is to be assigned to treatment z (1 for private school and 0 for public school), we denote the following: D_i(z) for the indicator equal to 1 if the child attends private school and 0 if the child attends public school; Y_i(z) for the potential outcomes if the child were to take the tests; and R^y_i(z) for the indicators equal to 1 if the child takes the tests. We denote by Z_i the indicator equal to 1 for the observed assignment to private school or not, and let D_i = D_i(Z_i), Y_i = Y_i(Z_i), and R^y_i = R^y_i(Z_i) designate the actual type of school, the outcome to be recorded by the test, and whether or not the child takes the test under the observed assignment.

The notation for these and the remaining definitions of relevant variables in this study is summarized in Table 3. In our study the outcomes are reading and math test scores, so the dimension of Y_i(z) equals two, although more generally our template allows for repeated measurements over time.

Our design variables are application period, low/high indicators for the applicant's school, grade level, and propensity score. In addition, corresponding to each set of these individual-specific random variables is a variable (vector or matrix) without subscript i that refers to the set of these variables across all study participants. For example, Z is the vector of treatment assignments for all study participants, with ith element Z_i, and X is the matrix of partially observed background variables, with ith row X_i.

The variable C_i (see Table 3), the joint vector of treatment receipt under both treatment assignments, is particularly important. Specifically, C_i defines four strata of people: compliers, who take the treatment if so assigned and take control if so assigned; never takers, who never take the treatment no matter

Table 3. Notation for the ith Subject

Notation         Specifics                                          General description
Z_i              1 if i assigned treatment;                         Binary indicator of treatment assignment
                 0 if i assigned control
D_i(z)           1 if i receives treatment under assignment z;      Potential outcome formulation of treatment receipt
                 0 if i receives control under assignment z
D_i              D_i(Z_i)                                           Binary indicator of treatment receipt
C_i              c if D_i(0) = 0 and D_i(1) = 1;                    Compliance principal stratum: c = complier,
                 n if D_i(0) = 0 and D_i(1) = 0;                    n = never taker, a = always taker, d = defier
                 a if D_i(0) = 1 and D_i(1) = 1;
                 d if D_i(0) = 1 and D_i(1) = 0
Y_i(z)           (Y^(math)_i(z), Y^(read)_i(z))                     Potential outcomes for math and reading
Y_i              (Y^(math)_i(Z_i), Y^(read)_i(Z_i))                 Math and reading outcomes under observed assignment
R^y(math)_i(z)   1 if Y^(math)_i(z) would be observed;              Response indicator for Y^(math)_i(z) under assignment z;
                 0 if it would not be observed                      similarly for R^y(read)_i(z)
R^y_i(z)         (R^y(math)_i(z), R^y(read)_i(z))                   Vector of response indicators for Y_i(z)
R^y_i            (R^y(math)_i(Z_i), R^y(read)_i(Z_i))               Vector of response indicators for Y_i
W_i              (W_i1, ..., W_iK)                                  Fully observed background and design variables
X_i              (X_i1, ..., X_iQ)                                  Partially observed background variables
R^x_i            (R^x_i1, ..., R^x_iQ)                              Vector of response indicators for X_i


the assignment; always takers, who always take the treatment no matter the assignment; and defiers, who do the opposite of the assignment, no matter its value. These strata are not fully observed, in contrast to the observed strata of actual assignment and attendance (Z_i, D_i). For example, children who are observed to attend private school when winning the lottery are a mixture of compliers (C_i = c) and always takers (C_i = a). Such explicit stratification on C_i dates back at least to the work of Imbens and Rubin (1997) on randomized trials with noncompliance; it was generalized and taxonomized by Frangakis and Rubin (2002) to posttreatment variables in partially controlled studies, and is termed a "principal stratification" based on the posttreatment variables.

The principal strata C_i have two important properties. First, they are not affected by the assigned treatment. Second, comparisons of potential outcomes under different assignments within principal strata, called principal effects, are well-defined causal effects (Frangakis and Rubin 2002). These properties make principal stratification a powerful framework for evaluation, because it allows us to explicitly define estimands that better represent the effect of attendance in the presence of noncompliance, and to explore richer, explicit sets of assumptions that allow estimation of these effects under conditions more plausible than the standard ones. Sections 6 and 7 discuss such a set of more flexible assumptions and estimands.

6. STRUCTURAL ASSUMPTIONS

First we state explicitly our structural assumptions about the data with regard to the causal process, the missing-data mechanism, and the noncompliance structure. These assumptions are expressed without reference to a particular parametric distribution, and are the ones that make the estimands of interest identifiable, as also discussed in Section 6.4.

6.1 Stable Unit Treatment Value Assumption

A standard assumption made in causal analyses is the stable unit treatment value assumption (SUTVA), formalized with potential outcomes by Rubin (1978a, 1980, 1990). SUTVA combines the no-interference assumption (Cox 1958), that one unit's treatment assignment does not affect another unit's outcomes, with the assumption that there are "no versions of treatments." For the no-interference assumption to hold, whether or not one family won a scholarship should not affect another family's outcomes, such as their choice to attend private school or their children's test scores. We expect our results to be robust to the types and degree of deviations from no interference that might be anticipated in this study. To satisfy the "no versions of treatments" assumption, we need to limit the definition of private and public schools to those participating in the experiment. Generalizability of the results to other schools must be judged separately.

6.2 Randomization

We assume that scholarships have been randomly assigned. This implies that

p(Z | Y(1), Y(0), X, W, C, R^y(0), R^y(1), R^x, θ) = p(Z | W*, θ) = p(Z | W*),

where W* represents the design variables in W, and θ is generic notation for the parameters governing the distribution of all the variables. There is no dependence on θ, because there are no unknown parameters governing the treatment-assignment mechanism: MPR assigned the scholarships by lottery, and the randomization probabilities within strata of applicant's school (low/high) and application period are known.

6.3 Noncompliance Process Assumptions: Monotonicity and Compound Exclusion

We assume monotonicity: that there are no "defiers," that is, for all i, D_i(1) ≥ D_i(0) (Imbens and Angrist 1994). In the SCSF program, defiers would be families who would not use a scholarship if they won one, but would pay to go to private school if they did not win a scholarship. It seems implausible that such a group of people exists; therefore, the monotonicity assumption appears to be reasonable for our study.

By definition, the never takers and always takers will participate in the same treatment (control or treatment) regardless of which treatment they are randomly assigned. For this reason, and to facilitate estimation, we assume compound exclusion: the outcomes and missingness of outcomes for never takers and always takers are not affected by treatment assignment. This compound exclusion restriction (Frangakis and Rubin 1999) generalizes the standard exclusion restriction (Angrist et al. 1996; Imbens and Rubin 1997) and can be expressed formally for the distributions as

p(Y(1), R^y(1) | X^obs, R^x, W, C = n) = p(Y(0), R^y(0) | X^obs, R^x, W, C = n)   for never takers

and

p(Y(1), R^y(1) | X^obs, R^x, W, C = a) = p(Y(0), R^y(0) | X^obs, R^x, W, C = a)   for always takers.

Compound exclusion seems more plausible for never takers than for always takers in this study. Never takers stayed in the public school system whether they won a scholarship or not. Although disappointment about winning a scholarship but still not being able to take advantage of it could exist, it is unlikely to cause notable differences in subsequent test scores or response behaviors.

Always takers, on the other hand, might have been in one private school had they won a scholarship, or in another had they not, particularly because those who won scholarships had access to resources to help them find an appropriate private school and had more money (up to $1,400) to use toward tuition. In addition, even if the child had attended the same private school under either treatment assignment, the extra $1,400 in family resources for those who won the scholarship could have had an effect on student outcomes. However, because in our application the estimated percentage of always takers is so small (approximately 9%), an estimate that, thanks to the randomization, is robust to relaxing compound exclusion, there is reason to believe that this assumption will not have a substantial impact on the results.

Under the compound exclusion restriction, the ITT comparison of all outcomes, Y_i(0) versus Y_i(1), includes the null comparison among the subgroups of never takers and always takers. Moreover, by monotonicity, the compliers (C_i = c) are the only group of children who would attend private school if and only if


offered the scholarship. For this reason, we take the CACE (Imbens and Rubin 1997), defined as the comparison of outcomes Y_i(0) versus Y_i(1) among the principal stratum of compliers, to represent the effect of attending private versus public school.
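For intuition, note that with fully observed outcomes, monotonicity, and the standard exclusion restriction, the CACE could be estimated by the familiar moment-based ratio of ITT effects (Angrist et al. 1996). The sketch below shows that simple estimator on invented data; it is not the Bayesian analysis developed here, and it ignores exactly the missing-data complications that motivate our framework:

    import numpy as np

    def wald_cace(y, d, z):
        # ITT effect on the outcome divided by the ITT effect on receipt;
        # under monotonicity and exclusion this ratio estimates the CACE.
        y, d, z = (np.asarray(v, dtype=float) for v in (y, d, z))
        itt_y = y[z == 1].mean() - y[z == 0].mean()
        itt_d = d[z == 1].mean() - d[z == 0].mean()
        return itt_y / itt_d

    rng = np.random.default_rng(1)
    z = rng.integers(0, 2, size=10_000)
    u = rng.random(10_000)                              # latent stratum draw
    d = np.where(u < 0.1, 1, np.where(u < 0.3, 0, z))   # 10% always takers, 20% never takers
    y = 50.0 + 5.0 * d + rng.normal(0, 10, size=10_000) # effect operates only through attendance
    print(wald_cace(y, d, z))                           # approximately 5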

6.4 Latent Ignorability

We assume that potential outcomes are independent of missingness, given observed covariates, conditional on the compliance strata; that is,

p(R^y(0), R^y(1) | R^x, Y(1), Y(0), X^obs, W, C, θ) = p(R^y(0), R^y(1) | R^x, X^obs, W, C, θ).

This assumption represents a form of latent ignorability (LI) (Frangakis and Rubin 1999), in that it conditions on variables that are (at least partially) unobserved, or latent: here, the compliance principal strata C. We make it, first, because it is more plausible than the assumption of standard ignorability (SI) (Rubin 1978a; Little and Rubin 1987) and, second, because making it leads to different likelihood inferences.

LI is more plausible than SI to the extent that it provides a closer approximation to the missing-data mechanism. The intuition behind this assumption in our study is that, for a subgroup of people with the same covariate missing-data patterns R^x, similar values for the covariates observed in that pattern X^obs, and the same compliance stratum C, a flip of a coin could determine which of these individuals shows up for a posttest. This is a more reasonable assumption than SI, because it seems quite likely that, for example, compliers would exhibit different attendance behavior for posttests than, say, never takers (even conditional on other background variables). Explorations of raw data from this study across individuals with known compliance status provide empirical support that C is a strong factor in outcome missingness, even when other covariates are included in the model. This fact is also supported by the literature on noncompliance (see, e.g., The Coronary Drug Project Research Group 1980).

Regarding improved estimation: when LI (and the preceding structural assumptions) hold but the likelihood is constructed assuming SI, the underlying causal effects are identifiable (alternatively, the posterior distribution with increasing sample size converges to the truth) only if the additional assumption is made that, within subclasses of subjects with similar observed variables, the partially missing compliance principal stratum C is not associated with the potential outcomes. However, as noted earlier, this assumption is not plausible.

Theoretically, only the structural assumptions described earlier are needed to identify the underlying ITT and CACE causal effects (Frangakis and Rubin 1999). Estimation based solely on those identifiability relations in principle requires a very large sample size and explicit conditioning (i.e., stratification) on the subclasses defined by the observed part of the covariates, X^obs, and the pattern of missingness of the covariates, R^x, as well as implicit conditioning (i.e., deconvolution of mixtures) on the sometimes missing compliance principal stratum C. This works in large samples because covariate missingness and compliance principal stratum are also covariates (i.e., defined pretreatment), so samples within subclasses defined by X^obs, R^x, and C themselves represent randomized experiments. With our experiment's sample size, however, performing completely separate analyses on all of these strata is not necessarily desirable or feasible. Therefore, we consider more parsimonious modeling approaches, which have the role of assisting, not creating, inference, in the sense that results should be robust to different parametric specifications (Frangakis and Rubin 2001, rejoinder).

7. PARAMETRIC PATTERN MIXTURE MODEL

Generally speaking, constrained estimation of separate analyses within missing-data patterns is the motivation behind pattern mixture modeling. Various authors have taken pattern mixture model approaches to missing data, including Little (1993, 1996), Rubin (1978b), and Glynn, Laird, and Rubin (1993). Typically, pattern mixture models partition the data with respect to the missingness of the variables. Here we partition the data with respect to the covariate missing-data patterns R^x, as well as the compliance principal strata C, design variables W, and the main covariates X^obs. This represents a partial pattern mixture approach. One argument in favor of this approach is that it focuses the model on the quantities of interest in such a way that parametric specifications for the marginal distributions of R^x, W, and X^obs can be ignored. To capitalize on the structural assumptions, consider the factorization of the joint distribution for the potential outcomes and compliance strata, conditional on the covariates and their missing-data patterns:

p(Y_i(0), Y_i(1), R^y_i(0), R^y_i(1), C_i | W_i, X^obs_i, R^x_i, θ)
  = p(C_i | W_i, X^obs_i, R^x_i, θ^C)
  × p(R^y_i(0), R^y_i(1) | W_i, X^obs_i, R^x_i, C_i, θ^R)
  × p(Y_i(0), Y_i(1) | W_i, X^obs_i, R^x_i, C_i, θ^Y),

where the product in the last line follows by latent ignorability, and θ = (θ^C, θ^R, θ^Y)′. Note that the response pattern of covariates for each individual is itself a covariate. The parametric specifications for each of these components are described next.

7.1 Compliance Principal Stratum Submodel

The compliance status model consists of two conditional probit models, defined using indicator variables C_i,c and C_i,n for whether individual i is a complier or a never taker:

C_i,n = 1   if C*_i,n ≡ g_1(W_i, X^obs_i, R^x_i)′ β^(C,1) + V_i ≤ 0

and

C_i,c = 1   if C*_i,n > 0 and C*_i,c ≡ g_0(W_i, X^obs_i, R^x_i)′ β^(C,2) + U_i ≤ 0,

where V_i ∼ N(0, 1) and U_i ∼ N(0, 1), independently. The link functions g_1 and g_0 attempt to strike a balance between, on the one hand, including all of the design variables, as well as the variables regarded as most important either in predicting compliance or in having interactions with the treatment effect, and, on the other hand, maintaining parsimony. The results discussed in Section 8 use a compliance component model whose link function g_1 is linear in, and fits distinct parameters for, an intercept; applicant's school (low/high); indicators for application period; propensity scores for subjects applying in


the PMPD period, and propensity scores for the other periods; indicators for grade of the student; ethnicity (1 if the child or guardian identifies herself as African-American, 0 otherwise); an indicator for whether or not the pretest scores of reading and math were available; and the pretest scores (reading and math) for the subjects with available scores. A propensity score for the students not in the PMPD period is not necessary from the design standpoint and is in fact constant. However, to increase efficiency and to reduce bias due to missing outcomes, here we include an "estimated propensity score value" for these periods, calculated as the function derived for the propensity score for students in the PMPD period, evaluated at the covariate values of the students in the other periods as well. Also, the foregoing link function g_0 is the same as g_1, except that it excludes the indicators for application period as well as the propensity scores for applicants who did not apply in the PMPD period (i.e., those for whom the propensity score was not a design variable). This link function, a more parsimonious version of one we used in earlier models, was more appropriate for fitting the relatively small proportion of always takers in this study. Finally, because the pretests were either jointly observed or jointly missing, one indicator for missingness of pretest scores is sufficient.
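To make the nested-probit structure concrete, the following sketch computes the never-taker, complier, and always-taker probabilities implied by the two latent-normal thresholds; the two linear predictors stand in for g_1(W_i, X^obs_i, R^x_i)′β^(C,1) and g_0(W_i, X^obs_i, R^x_i)′β^(C,2), and the numeric values are invented:

    from scipy.stats import norm

    def stratum_probs(eta1, eta0):
        # Never taker if eta1 + V <= 0, V ~ N(0, 1); otherwise complier if
        # eta0 + U <= 0, U ~ N(0, 1); otherwise always taker.
        return {
            "never taker": norm.cdf(-eta1),
            "complier": norm.cdf(eta1) * norm.cdf(-eta0),
            "always taker": norm.cdf(eta1) * norm.cdf(eta0),
        }

    probs = stratum_probs(eta1=0.8, eta0=-1.3)  # hypothetical predictor values
    print(probs, sum(probs.values()))           # the three probabilities sum to 1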

The prior distributions for the compliance submodel are

β^(C,1) ∼ N(β^(C,1)_0, {σ^(C,1)}² I)   and   β^(C,2) ∼ N(0, {σ^(C,2)}² I),

independently, where the hyperparameters {σ^(C,1)}² and {σ^(C,2)}² are set at 10, and β^(C,1)_0 is a vector of 0s, with the exception of the first element, which is set equal to

−Φ^{−1}(1/3) [{σ^(C,1)}² (1/n) Σ_i g′_{1,i} g_{1,i} + 1]^{1/2},

where g_{1,i} = g_1(W_i, X^obs_i, R^x_i) and n is our sample size. These priors reflect our a priori ignorance about the probability that any individual belongs to each compliance status, by approximately setting each of their prior probabilities at 1/3.

7.2 Outcome Submodel

We first describe the marginal distribution for the math outcome, Y^(math). To address the pile-up of many scores of 0, we posit the censored model

Y^(math)_i(z) = 0 if Y^(math)*_i(z) ≤ 0;
Y^(math)_i(z) = 100 if Y^(math)*_i(z) ≥ 100;
Y^(math)_i(z) = Y^(math)*_i(z) otherwise,

where

Y^(math)*_i(z) | W_i, X^obs_i, R^x_i, C_i, θ^math ∼ N( g_2(W_i, X^obs_i, R^x_i, C_i, z)′ β^math, exp[ g_3(X^obs_i, R^x_i, C_i, z)′ ζ^math ] )

for z = 0, 1, and θ^math = (β^math, ζ^math). Here Y^(math)*_i(0) and Y^(math)*_i(1) are assumed conditionally independent, an assumption that has no effect on inference for our superpopulation parameters of interest (Rubin 1978a).
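As an illustration of the likelihood implied by this two-sided censoring, the sketch below evaluates the log-likelihood of a few scores under a fixed mean and standard deviation; the actual model replaces these constants with the regression structures g_2 and g_3, and all numbers here are invented:

    import numpy as np
    from scipy.stats import norm

    def censored_normal_loglik(y, mu, sigma, lo=0.0, hi=100.0):
        # Scores at the bounds contribute tail probabilities of the latent
        # normal; interior scores contribute ordinary normal densities.
        y = np.asarray(y, dtype=float)
        mu = np.broadcast_to(np.asarray(mu, dtype=float), y.shape)
        ll = np.where(y <= lo, norm.logcdf(lo, mu, sigma),
             np.where(y >= hi, norm.logsf(hi, mu, sigma),
                      norm.logpdf(y, mu, sigma)))
        return float(ll.sum())

    # three children: one piled up at 0, two with interior scores
    print(censored_normal_loglik([0.0, 42.5, 88.0], mu=[10.0, 45.0, 80.0], sigma=15.0))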

The results reported in Section 8 use an outcome component model whose outcome mean link function g_2 is linear in distinct parameters for the following:

1. For the students of the PMPD design: an intercept, applicant's school (low/high), ethnicity, indicators for grade, an indicator for whether or not the pretest scores were available, pretest score values for the subjects with available scores, and propensity score.

2. For the students of the other periods: the variables in item 1, along with indicators for application period.

3. An indicator for whether or not a person is an always taker.

4. An indicator for whether or not a person is a complier.

5. For compliers assigned treatment: an intercept, applicant's school (low/high), ethnicity, and indicators for the first three grades (with the variable for the fourth-grade treatment effect as a function of the already-included variables).

For the variance of the outcome component, the link function g_3 includes indicators that saturate the cross-classification of whether or not a person applied in the PMPD period and whether or not the pretest scores were available. This dependence is needed because each pattern conditions on a different set of covariates; that is, X^obs varies from pattern to pattern.

The prior distributions for the outcome submodel are

β^math | ζ^math ∼ N(0, F(ζ^math) λ I),   where F(ζ^math) = (1/K) Σ_k exp(ζ^math_k),

ζ^math = (ζ^math_1, ..., ζ^math_K), with one component for each of the K (in our case, K = 4) subgroups defined by cross-classifying the PMPD/non-PMPD classification and the missing-data indicator for pretest scores; λ is a hyperparameter set at 10; and exp(ζ^math_k) iid∼ inv-χ²(ν, σ²), where inv-χ²(ν, σ²) refers to the distribution of the inverse of a chi-squared random variable with degrees of freedom ν (set at 3) and scale parameter σ² (set at 400, based on preliminary estimates of variances). The values for these and the hyperparameters of the other model components were chosen to satisfy two criteria: giving posterior standard deviations for the estimands in the neighborhood of the respective standard errors of approximate maximum likelihood fits of similar preliminary likelihood models, and giving satisfactory model checks (Sec. 9.2) while producing quick enough mixing of the Bayesian analysis algorithm.

We specify the marginal distribution for the reading outcome, Y^(read), in the same way as for the math outcome, with separate mean and variance regression parameters β^read and ζ^read. Finally, we allow that, conditionally on W_i, X^obs_i, R^x_i, C_i, the math and reading outcomes at a given assignment, Y^(math)*_i(z) and Y^(read)*_i(z), may be dependent, with an unknown correlation ρ. We set the prior distribution for ρ to be uniform on (−1, 1), independently of the remaining parameters in their prior distributions.

7.3 Outcome Response Submodel

As with the pretests, the outcomes on mathematics and reading were either jointly observed or jointly missing; thus one indicator, R^y_i(z), for missingness of outcomes is sufficient for each assignment z = 0, 1. For the submodel on this indicator, we also use a probit specification:

R^y_i(z) = 1   if R^y_i(z)* ≡ g_2(W_i, X^obs_i, R^x_i, C_i, z)′ β^R + E_i(z) ≥ 0,


where R^y_i(0) and R^y_i(1) are assumed conditionally independent (using the same justification as for the potential outcomes), and where E_i(z) ∼ N(0, 1). The link function of the probit model on the outcome response, g_2, is the same as the link function for the mean of the outcome component. The prior distribution for the outcome response submodel is β^R ∼ N(0, {σ^R}² I), where {σ^R}² is a hyperparameter set at 10.

8. RESULTS

All results herein were obtained from the same Bayesian analysis. We report results for the ITT and CACE estimands, for proportions of compliance principal strata, and for outcome response rates. The reported estimands are not parameters of the model, but rather are functions of parameters and data. The results are reported by applicant's school (low/high) and grade. Both of these variables represent characteristics of children that potentially could be targeted differentially by government policies. Moreover, each was thought to have possible interaction effects with treatment assignment. Except when otherwise stated, the plain numbers are posterior means, and the numbers in parentheses are 2.5 and 97.5 percentiles of the posterior distribution.

8.1 Test Score Results

Here we address two questions:

1. What is the impact of being offered a scholarship on student outcomes, namely the ITT estimand?

2. What is the impact of attending a private school on student outcomes, namely the CACE estimand?

The math and reading posttest score outcomes represent the national percentile rankings within grade. They have been adjusted to correct for the fact that some children were kept behind while others skipped a grade, because students transferring to private schools are hypothesized to be more likely to have been kept behind by those schools. The individual-level causal estimates have also been weighted, so that the subgroup causal estimates correspond to the effects for all eligible children belonging to that subgroup who attended a screening session.

8.1.1 Effect of Offering the Scholarship on Mathematics and Reading. We examine the impact of being offered a scholarship (the ITT effect) on posttest scores. The corresponding estimand for individual i is defined as E(Y_i(1) − Y_i(0) | W^p_i, θ), where W^p_i denotes the policy variables, grade level and applicant's school (low/high), for that individual. The simulated posterior distribution of the ITT effect is summarized in Table 4. To draw from this distribution, we take the following steps:

Table 4. ITT Effect of Winning the Lottery on Math and Reading Test Scores, 1 Year Postrandomization

                   Applicant's school: Low             Applicant's school: High
Grade at
application    Reading            Math                 Reading            Math
1              2.3 (−1.3, 5.8)    5.2 (2.0, 8.3)       1.4 (−4.8, 7.2)    5.1 (.1, 10.3)
2               .5 (−2.6, 3.5)    1.3 (−1.7, 4.3)      −.6 (−6.2, 4.9)    1.0 (−4.3, 6.1)
3               .7 (−2.7, 4.0)    3.3 (−.5, 7.0)       −.5 (−6.0, 5.0)    2.5 (−3.2, 8.0)
4              3.0 (−1.1, 7.4)    3.1 (−1.2, 7.2)      1.8 (−4.1, 7.6)    2.3 (−3.3, 7.8)
Overall        1.5 (−.6, 3.6)     3.2 (1.0, 5.4)        .4 (−4.6, 5.2)    2.8 (−1.8, 7.2)

NOTE: Plain numbers are means, and numbers in parentheses are central 95% intervals, of the posterior distribution of the effects on percentile rank.

1. Draw θ and C_i from the posterior distribution (see App. A).

2. Calculate the expected causal effect E(Y_i(1) − Y_i(0) | W^p_i, X^obs_i, C_i, R^x_i, θ) for each subject, based on the model in Section 7.2.

3. Average the values of step 2 over the subjects in the subclasses of W^p, with weights reflecting the known sampling weights of the study population for the target population.
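A schematic version of these three steps, assuming posterior draws of the step-2 per-subject expected effects are already in hand (all names and dimensions are illustrative, not from our actual implementation):

    import numpy as np

    def itt_posterior_summary(effect_draws, w_policy, samp_weights):
        # effect_draws: (n_draws, n_subjects) expected causal effects, one row
        # per posterior draw of (theta, C) from step 1; w_policy: policy-variable
        # subclass of each subject; samp_weights: known sampling weights (step 3).
        summaries = {}
        for cell in np.unique(w_policy):
            m = w_policy == cell
            per_draw = effect_draws[:, m] @ samp_weights[m] / samp_weights[m].sum()
            summaries[cell] = (per_draw.mean(), np.percentile(per_draw, [2.5, 97.5]))
        return summaries

    rng = np.random.default_rng(2)
    effect_draws = rng.normal(3.0, 1.0, size=(200, 500))  # stand-in for steps 1-2
    w_policy = rng.integers(0, 2, size=500)               # e.g., low/high applicant's school
    samp_weights = rng.uniform(0.5, 2.0, size=500)
    print(itt_posterior_summary(effect_draws, w_policy, samp_weights))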

These results indicate posterior distributions with mass primarily (>97.5%) to the right of 0 for the treatment effect on math scores overall for children from low applicant schools, and also for the subgroup of first graders. Each effect indicates an average gain of greater than three percentile points for children who won a scholarship. All of the remaining intervals cover 0. As a more general pattern, estimates of effects are larger for mathematics than for reading, and larger for children from low-applicant schools than for children from high-applicant schools.

These results for the ITT estimand using the method of this article can also be contrasted with the results using a simpler method, reported in Table 5. The method used to obtain these results is the same as that used in the initial MPR study (Peterson, Myers, Howell, and Mayer 1999), but limited to the subset of single-child families and separated out by applicant's school (low/high). This method is based on a linear regression of the posttest scores on the indicator of treatment assignment, ignoring compliance data. In addition, the analysis includes the design variables and the pretest scores, and is limited to individuals for whom the pretest and posttest scores are known. Separate analyses are run for math and reading posttest scores, and the missingness of such scores is adjusted by inverse probability weighting. Finally, weights are used to make the results for the study participants generalizable to the population of all eligible single-child families who were screened. (For more details of this method, see the appendix of Peterson et al. 1999.)

Table 5. ITT Effect of Winning the Lottery on Test Scores, Estimated Using the Original MPR Method

                   Applicant's school: Low             Applicant's school: High
Grade at
application    Reading            Math                 Reading              Math
1              −1.0 (−7.1, 5.2)   2.1 (−2.6, 6.7)      4.8 (−10.0, 19.6)    2.6 (−15.5, 20.7)
2               −.8 (−4.9, 3.2)   2.0 (−4.0, 8.0)      −3.4 (−16.5, 9.7)    2.7 (−10.3, 15.7)
3              3.2 (−1.7, 8.1)    5.0 (−.8, 10.7)      −8.0 (−25.4, 9.3)    4.0 (−17.7, 25.6)
4              2.7 (−3.5, 8.8)     .3 (−7.3, 7.9)      27.9 (8.0, 47.8)     22.7 (−1.5, 46.8)
Overall         .6 (−2.0, 3.2)    2.0 (−.8, 4.8)       1.1 (−7.0, 9.1)       .3 (−9.6, 10.1)

NOTE: Plain numbers are point estimates, and numbers in parentheses are 95% confidence intervals, for the mean effects on percentile rank.


Table 6. Effect of Private School Attendance on Test Scores

                   Applicant's school: Low             Applicant's school: High
Grade at
application    Reading            Math                 Reading            Math
1              3.4 (−2.0, 8.7)    7.7 (3.0, 12.4)      1.9 (−7.3, 10.3)   7.4 (.2, 14.6)
2               .7 (−3.7, 5.0)    1.9 (−2.4, 6.2)      −.9 (−9.4, 7.3)    1.5 (−6.2, 9.3)
3              1.0 (−4.1, 6.1)    5.0 (−.8, 10.7)      −.8 (−9.5, 7.7)    4.0 (−4.9, 12.5)
4              4.2 (−1.5, 10.1)   4.3 (−1.6, 10.1)     2.7 (−6.3, 11.3)   3.5 (−4.7, 11.9)
Overall        2.2 (−.9, 5.3)     4.7 (1.4, 7.9)        .6 (−7.1, 7.7)    4.2 (−2.6, 10.9)

NOTE: Plain numbers are means, and numbers in parentheses are central 95% intervals, of the posterior distribution of the effects on percentile rank.

The results of the new method (Table 4) are generally more stable than those of the original method (Table 5), which in some cases are not even credible. In the most extreme cases, the original method estimates a 22.7-point gain [95% confidence interval (−1.5, 46.8)] in mathematics and a 27.9-point gain [95% confidence interval (8.0, 47.8)] in reading for fourth-grade children from high-applicant schools. More generally, the new method's results display a more plausible pattern in comparing effects in high-applicant versus low-applicant schools, and mathematics versus reading.

8.1.2 Effect of Private School Attendance on Mathematics and Reading. We also examine the effect of offering the scholarship when focusing only on the compliers (the CACE). The corresponding estimand for individual i is defined as

$E\{Y_i(1) - Y_i(0) \mid W^p_i, C_i = c;\, \theta\}.$

This analysis defines the treatment as private school attendance (Sec. 6.3). The simulated posterior distribution of the CACE is summarized in Table 6. A draw from this distribution is obtained using steps 1-3 described in Section 8.1.1 for the ITT estimand, with the exception that at step 3 the averaging is restricted to the subjects whose current draw of C_i is "complier."
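As an illustration of how such a draw-by-draw summary can be computed, the sketch below assumes hypothetical arrays of imputed potential outcomes and compliance labels from an MCMC run; it is a minimal rendering of steps 1-3 with the averaging restricted to current compliers, not the authors' actual code.

import numpy as np

def cace_summary(y1_draws, y0_draws, c_draws, subgroup):
    """y1_draws, y0_draws: (n_draws, n_subjects) imputed potential outcomes;
    c_draws: (n_draws, n_subjects) current compliance labels;
    subgroup: boolean mask (e.g., first graders from low-applicant schools)."""
    effects = []
    for y1, y0, c in zip(y1_draws, y0_draws, c_draws):
        keep = subgroup & (c == "complier")  # step 3, compliers only
        effects.append(np.mean(y1[keep] - y0[keep]))
    effects = np.asarray(effects)
    # Posterior mean and central 95% interval, as reported in Table 6.
    return effects.mean(), np.percentile(effects, [2.5, 97.5])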

The effects of private school attendance follow a pattern similar to that of the ITT effects, but the posterior means are slightly bigger in absolute value than ITT. The intervals have also grown, reflecting that these effects are for only subgroups of all children, the "compliers" in each cell. As a result, the associated uncertainty for some of these effects (e.g., for fourth-graders applying from high-applicant schools) is large.

8.2 Compliance Principal Strata and Missing Outcomes

Table 7 summarizes the posterior distribution of the estimands of the probability of being in stratum C as a function of an applicant's school and grade, Pr(C_i = t | W_i^p; θ). To draw from this distribution, we use step 1 described in Section 8.1.1, then calculate Pr(C_i = t | W_i^p, X_i^obs, R_i^x; θ) for each subject based on the model of Section 7.1, and average these values as in step 3 of Section 8.1.1. The clearest pattern revealed by Table 7 is that, for three out of four grades, children who applied from low-applicant schools are more likely to be compliers or always-takers than are children who applied from high-applicant schools.

As stated earlier, theoretically, under our structural assumptions, standard ITT analyses or standard IV analyses that use ad hoc approaches to missing data are generally not appropriate for estimating the ITT or CACE estimands when the compliance principal strata have differential response (i.e., outcome missing-data) behaviors. To evaluate this here, we simulated the posterior distributions of

$\Pr\{R_i(z) = 1 \mid C_i = t, W^p_i;\, \theta\}, \quad z = 0, 1. \qquad (1)$

To draw from the distributions of the estimands in (1), we used step 1 from Section 8.1.1, and then, for z = 0, 1, for each subject calculated Pr{R_i(z) = 1 | W_i^p, X_i^obs, R_i^x, C_i; θ}. We then averaged these values over subjects within subclasses defined by the different combinations of W_i^p and C_i.

For each draw (across individuals) from the distribution in (1), we calculate the odds ratios (a) between compliers attending public schools and never-takers, (b) between compliers attending private schools and always-takers, and (c) between compliers attending private schools and compliers attending public schools. The results (omitted) showed that response was increasing in the following order: never-takers, compliers attending public schools, compliers attending private schools, and always-takers. These results confirm that the compliance principal stratum is an important factor in response, both alone and in interaction with assignment.
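A minimal sketch of these odds-ratio comparisons follows, assuming a dictionary p of response probabilities indexed by (stratum, z) for one posterior draw; the names and the mapping of "compliers attending public schools" to the (complier, z = 0) cell are our reading of the text, not code from the paper.

def odds(p):
    """Convert a response probability to the odds scale."""
    return p / (1.0 - p)

def response_odds_ratios(p):
    """p: dict mapping (stratum, z) to a drawn value of Pr{R_i(z) = 1 | C_i}."""
    return {
        "compliers(public) vs never-takers":
            odds(p[("complier", 0)]) / odds(p[("never-taker", 0)]),
        "compliers(private) vs always-takers":
            odds(p[("complier", 1)]) / odds(p[("always-taker", 1)]),
        "compliers(private) vs compliers(public)":
            odds(p[("complier", 1)]) / odds(p[("complier", 0)]),
    }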

9. MODEL BUILDING AND CHECKING

This model was built through a process of fitting and checking a succession of models. Of note in this model are the following features: a censored normal model that accommodates the pile-up of 0s in the outcomes, incorporation of heteroscedasticity, and a multivariate model to accommodate both outcomes simultaneously. It is reassuring, from the perspective of model robustness, that the results from the last few models fit, including those in the conference proceedings report of Barnard et al. (2002), are consistent with the results from this final model.

Table 7. Proportions (%) of Compliance Principal Strata, Across Grade and Applicant's School

                        Applicant's school: Low                         Applicant's school: High
Grade at application    Never-taker   Complier     Always-taker         Never-taker   Complier     Always-taker

1                       24.5 (2.9)    67.1 (3.8)   8.4 (2.4)            25.0 (5.0)    69.3 (6.1)   5.7 (3.3)
2                       20.5 (2.7)    69.4 (3.7)   10.1 (2.5)           25.3 (5.1)    67.2 (6.2)   7.5 (3.4)
3                       24.5 (3.2)    65.9 (4.0)   9.6 (2.5)            28.8 (5.6)    64.1 (6.7)   7.1 (3.5)
4                       18.4 (3.3)    72.8 (4.6)   8.8 (3.0)            27.0 (5.5)    66.7 (6.7)   6.3 (3.6)

NOTE: Plain numbers are means, and parentheses are standard deviations of the posterior distribution of the estimands.



9.1 Convergence Checks

Because posterior distributions were simulated from a Markov chain Monte Carlo algorithm (App. A), it is important to assess its convergence. To do this, we ran three chains from different starting values. To initiate each chain, we set any unknown compliance stratum equal to a Bernoulli draw, with probabilities obtained from moment estimates of the probabilities of being a complier given observed attendance and randomization data (D, Z) alone. Using the initialized compliance strata, for each chain, parameters were initialized to values based on generalized linear model estimates of the model components.

Each chain was run for 15,000 iterations. At 5,000 iterations, and based on the three chains for each model, we calculated the potential scale-reduction statistic (Gelman and Rubin 1992) for the 250 estimands (parameters and functions of parameters) that serve as building blocks for all other estimands. The results suggested good mixing of the chains (with the maximum potential scale-reduction statistic across parameters 1.04) and provided no evidence against convergence. Inference is based on the remaining 30,000 iterations, combining the three chains.
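For concreteness, a minimal sketch of the potential scale-reduction computation for one scalar estimand follows. This is the standard form of the Gelman-Rubin statistic; the authors' exact implementation may differ in detail.

import numpy as np

def potential_scale_reduction(chains):
    """chains: array of shape (m, n) -- m parallel chains, n retained draws."""
    m, n = chains.shape
    between = n * chains.mean(axis=1).var(ddof=1)    # between-chain variance B
    within = chains.var(axis=1, ddof=1).mean()       # within-chain variance W
    var_plus = (n - 1) / n * within + between / n    # pooled variance estimate
    return np.sqrt(var_plus / within)                # values near 1: good mixing

# e.g., compute this for each of the 250 building-block estimands and
# inspect the maximum (here, 1.04).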

9.2 Model Checks

We evaluate the influence of the model presented in Section 7 with six posterior predictive checks: three checks for each of the two outcomes. A posterior predictive check generally involves (a) choosing a discrepancy measure that is a function of observed data, and possibly of missing data and the parameter vector θ, and (b) computing a posterior predictive p value (PPPV), which is the probability, over the posterior predictive distribution of the missing data and θ, that the discrepancy measure in a new study, drawn with the same θ as in our study, would be as or more extreme than in our study (Rubin 1984; Meng 1996; Gelman, Meng, and Stern 1996).

Posterior predictive checks in general, and PPPVs in particular, demonstrate whether the model can adequately preserve features of the data reflected in the discrepancy measure, where the model here includes the prior distribution as well as the likelihood (Meng 1996). As a result, properties of PPPVs are not exactly the same as properties of classical p values under frequency evaluations conditional on the unknown θ, just as they are not exactly the same for frequency evaluations over both levels of uncertainty, that is, the drawing of θ from the prior distribution and the drawing of data from the likelihood given θ. For example, over frequency evaluations of the latter type, a PPPV is stochastically less variable than, but has the same mean as, the uniform distribution, and so tends to be more conservative than a classical p value, although the reverse can be true over frequency evaluations of the first type (Meng 1996). (For more details on the interpretation and properties of PPPVs, see also Rubin 1984 and Gelman et al. 1996.)

Table 8. Posterior Predictive Checks: p Values

        Signal   Noise   Signal to Noise

Math    .34      .79     .32
Read    .40      .88     .39

The posterior predictive discrepancy measures that we choose here are functions of

$A^{\mathrm{rep}}_{p,z} = \{\, Y^{\mathrm{rep}}_{ip} : I(R^{y,\mathrm{rep}}_{ip} = 1)\, I(Y^{\mathrm{rep}}_{ip} \neq 0)\, I(C^{\mathrm{rep}}_i = c)\, I(Z_i = z) = 1 \,\}$

for the measures that are functions of data $Y^{\mathrm{rep}}_{ip}$, $R^{y,\mathrm{rep}}_{ip}$, and $C^{\mathrm{rep}}_i$ from a replicated study, and

$A^{\mathrm{study}}_{p,z} = \{\, Y_{ip} : I(R^y_{ip} = 1)\, I(Y_{ip} \neq 0)\, I(C_i = c)\, I(Z_i = z) = 1 \,\}$

for the measures that are functions of this study's data. Here $R^{y,\mathrm{rep}}_{ip}$ is defined so that it equals 1 if $Y^{\mathrm{rep}}_{ip}$ is observed and 0 otherwise, and p equals 1 for math outcomes and 2 for reading outcomes. The discrepancy measures "rep" and "study" that we used for each outcome (p = 1, 2) were (a) the absolute value of the difference between the mean of $A_{p,1}$ and the mean of $A_{p,0}$ ("signal"); (b) the standard error, based on a simple two-sample comparison, for this difference ("noise"); and (c) the ratio of (a) to (b) ("signal to noise"). Although these measures are not treatment effects, we chose them here to assess whether the model can preserve broad features of signal, noise, and signal-to-noise ratio in the continuous part of the compliers' outcome distributions, which we think can be very influential in estimating the treatment effects of Section 8. More preferable measures might have been the posterior mean and standard deviation for the actual estimands in Section 8 for each replicated dataset, but this would have required a prohibitive amount of computer memory because of the nested structure of that algorithm. In settings such as these, additional future work on choices of discrepancy measures is of interest.

PPPVs for the discrepancy measures that we chose were calculated as the percentage of draws in which the replicated discrepancy measures exceeded the value of the study's discrepancy measure. Extreme values (close to 0 or 1) of a PPPV would indicate a failure of the prior distribution and likelihood to replicate the corresponding measure of location, dispersion, or their relative magnitude, and would indicate an undesirable influence of the model in estimation of our estimands. Results from these checks, displayed in Table 8, provide no special evidence for such influences of the model.
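The calculation just described can be sketched as follows, assuming hypothetical per-draw inputs for the replicated and study discrepancies; shown for the "signal" measure only.

import numpy as np

def pppv_signal(draws):
    """draws: iterable of (rep_1, rep_0, study_1, study_0) tuples, one per
    posterior draw; each entry is the set A of complier outcomes for z = 1, 0."""
    exceed = []
    for rep_1, rep_0, study_1, study_0 in draws:
        d_rep = abs(np.mean(rep_1) - np.mean(rep_0))        # replicated "signal"
        d_study = abs(np.mean(study_1) - np.mean(study_0))  # study "signal"
        exceed.append(d_rep >= d_study)
    return np.mean(exceed)  # values near 0 or 1 would indicate model failure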

10. DISCUSSION

In this article, we have defined the framework for principal stratification in broken randomized experiments to accommodate noncompliance, missing covariate information, missing outcome data, and multivariate outcomes. We make explicit a set of structural assumptions that can identify the causal effects of interest, and we also provide a parametric model that is appropriate for practical implementation of the framework in settings such as ours.

Results from our model in the school choice study do not indicate strong treatment effects for most of the subgroups examined. But we do estimate positive effects (on the order of 3 percentile points for ITT and 5 percentile points for the effect of attendance) on math scores overall for children who applied to the program from low-applicant schools, particularly for first graders. Also, the effects were larger for African-American children than for the remaining children (App. B).

Posterior distributions for the CACE estimand, which measures the effect of attendance in private school versus public school, are generally larger than the corresponding ITT effects, but are also associated with greater uncertainty. Of importance, because of the missing outcomes, a model like ours is needed for valid estimation even of the ITT effect.

The results from this randomized study are not subject to selection bias in the way that nonrandomized studies in school choice have been. Nevertheless, although we use the CACE, a well-defined causal effect, to represent the effect of attendance of private versus public schools, it is important to remember that the CACE is defined on a subset of the study children (those who would have complied with either assignment), and that for the other children there is no information on such an effect of attendance in this study. Therefore, as with any randomized trial based on a subpopulation, external information, such as background variables, also must be used when generalizing the CACE from compliers to other target groups of children. Also, it is possible that a broader implementation of a voucher program can have a collective effect on the public schools if a large shift of the children who might use vouchers were to have an impact on the quality of learning for children who would stay in public schools (a violation of the no-interference assumption). Because our study contains a small fraction of participants relative to all school children, it cannot provide direct information about any such collective effect, and additional external judgment would need to be used to address this issue.

The larger effects that we estimated for children applying from schools with low versus high past scores are also not, in principle, subject to the usual regression to the mean bias, in contrast to a simple before-after comparison. This is because, in our study, both types of children are randomized to be offered the scholarship or not, and both types in both treatment arms are evaluated at the same time after randomization. Instead, the differential effect for children from schools with different past scores is evidence supporting the claim that the school voucher program has greater potential benefit for children in lower-scoring schools.

Our results also reveal differences in compliance and missing-data patterns across groups. Differences in compliance indicate that children applying from low-applicant schools are generally more likely to comply with their treatment assignment; this could provide incentives for policy makers to target this subgroup of children. However, this group also exhibits higher levels of always-takers, indicating that parents of children attending these poorly performing schools are more likely to get their child into a private school even in the absence of scholarship availability. These considerations would of course have to be balanced with (and indeed might be dwarfed by) concerns about equity. Missing-data patterns reveal that perhaps relatively greater effort is needed in future interventions of this nature to retain study participants who stay in public schools, particularly the public-school students who actually won but did not use a scholarship.

The approach presented here also has some limitations. First, we have presented principal stratification only on a binary controlled factor Z and a binary uncontrolled factor D, and it is of interest to develop principal stratification for more levels of such factors. Approaches that extend our framework in that direction, including for time-dependent data, have been proposed (e.g., Frangakis et al. 2002b).

Also, the approach of explicitly conditioning on the patterns of missing covariates that we adopted in the parametric component of Section 7 is not as applicable when there are many patterns of missing covariates. In such cases, it would be more appropriate to use models related to those of D'Agostino and Rubin (2000) and integrate them with principal stratification for noncompliance. Moreover, it would also be interesting to investigate results for models fit to these data that allow deviations from the structural assumptions, such as weakened exclusion restrictions (e.g., Hirano et al. 2000; Frangakis, Rubin, and Zhou 2002a), although to ensure robustness of such models, it would be important to first investigate and rely on additional alternative assumptions that would be plausible.

APPENDIX A: COMPUTATIONS

Details are available at http://biosun01.biostat.jhsph.edu/~cfrangak/papers/sc.

Table B.1. ITT Estimand

                                   Applicant school: Low                 Applicant school: High
Grade at application   Ethnicity   Reading           Math                Reading            Math

1                      AA          2.8 (-1.5, 7.2)   6.7 (3.0, 10.4)     1.8 (-5.2, 8.4)    6.3 (.8, 11.9)
                       Other       1.7 (-1.9, 5.4)   3.9 (.5, 7.2)       .8 (-4.5, 6.1)     3.5 (-1.2, 8.2)
2                      AA          .8 (-2.8, 4.4)    2.4 (-1.0, 5.8)     -.3 (-6.6, 6.0)    2.2 (-3.8, 8.0)
                       Other       .2 (-3.1, 3.4)    .5 (-2.8, 3.7)      -.9 (-6.1, 4.3)    .0 (-5.2, 4.8)
3                      AA          1.1 (-3.0, 5.2)   4.6 (.3, 8.9)       -.1 (-6.7, 6.4)    4.2 (-2.5, 10.6)
                       Other       .3 (-3.2, 3.8)    2.0 (-2.0, 5.8)     -.7 (-5.8, 4.4)    1.4 (-3.9, 6.5)
4                      AA          3.7 (-1.2, 8.7)   4.4 (-.5, 9.3)      2.4 (-4.7, 9.4)    3.8 (-2.9, 10.5)
                       Other       2.4 (-1.7, 6.7)   1.8 (-2.6, 6.0)     1.4 (-4.3, 7.0)    1.1 (-4.4, 6.4)
Overall                AA          2.0 (-.9, 4.8)    4.5 (1.8, 7.2)      .8 (-5.1, 6.5)     4.2 (-1.1, 9.3)
                       Other       1.1 (-1.4, 3.6)   2.1 (-.7, 4.7)      .1 (-4.6, 4.6)     1.6 (-2.9, 5.8)

NOTE: Plain numbers are means, and parentheses are central 95% intervals of the posterior distribution of the effects on percentile rank.


Table B.2. CACE Estimand

                                   Applicant school: Low                  Applicant school: High
Grade at application   Ethnicity   Reading            Math                Reading             Math

1                      AA          3.8 (-2.0, 9.6)    9.0 (4.1, 14.0)     2.3 (-7.3, 10.9)    8.3 (1.1, 15.5)
                       Other       2.9 (-3.1, 8.8)    6.3 (.8, 11.8)      1.3 (-8.0, 10.2)    5.9 (-2.2, 13.9)
2                      AA          1.1 (-3.6, 5.7)    3.1 (-1.3, 7.5)     -.4 (-9.0, 7.9)     2.9 (-5.0, 10.7)
                       Other       .3 (-4.9, 5.4)     .8 (-4.4, 5.9)      -1.5 (-10.5, 7.3)   .0 (-8.5, 8.5)
3                      AA          1.5 (-4.1, 7.0)    6.3 (.3, 12.2)      -.2 (-9.1, 8.5)     5.6 (-3.2, 14.1)
                       Other       .5 (-5.4, 6.4)     3.5 (-3.4, 10.2)    -1.3 (-10.5, 7.9)   2.7 (-7.0, 12.0)
4                      AA          4.6 (-1.6, 10.8)   5.5 (-.7, 11.6)     3.0 (-6.0, 11.6)    4.8 (-3.6, 13.1)
                       Other       3.7 (-2.6, 10.1)   2.8 (-3.8, 9.3)     2.3 (-7.4, 11.7)    2.0 (-7.0, 11.0)
Overall                AA          2.6 (-1.2, 6.3)    6.0 (2.4, 9.5)      1.0 (-6.9, 8.4)     5.5 (-1.5, 12.2)
                       Other       1.7 (-2.3, 5.8)    3.3 (-1.2, 7.7)     .1 (-8.1, 7.9)      2.7 (-5.0, 10.3)

NOTE: Plain numbers are means, and parentheses are central 95% intervals of the posterior distribution of the effects on percentile rank.

APPENDIX B: ETHNIC BREAKDOWN

The model of Section 7 can be used to estimate the effect of the program on finer strata that may be of interest. For example, to estimate effects stratified by ethnicity (AA for African-American), we obtain the posterior distribution of the causal effect of interest (ITT or CACE) using the same steps described in Section 8.1.1 or 8.1.2, but allowing ethnicity to be part of the definition of W^p. The results for this stratification are reported in Table B.1 for the estimands of ITT and Table B.2 for the estimands of CACE.

The results follow similar patterns to those in Section 8. For each grade and applicant's school (low/high) combination, however, the effects are more positive on average for the subgroup of African-American children. For both estimands, this leads to 95% intervals that are entirely above 0 for math scores for the following subgroups: African-American first- and third-graders, and overall, from low-applicant schools; African-American first-graders from high-applicant schools; and non-African-American first-graders from low-applicant schools. All intervals for reading covered the null value. This suggests that the positive effects reported on math scores in Section 8 for children originating from low-applicant schools are primarily attributable to gains among the African-American children in this subgroup.

[Received October 2002. Revised October 2002.]

REFERENCES

Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996), "Identification of Causal Effects Using Instrumental Variables" (with discussion), Journal of the American Statistical Association, 91, 444-472.
Barnard, J., Du, J., Hill, J. L., and Rubin, D. B. (1998), "A Broader Template for Analyzing Broken Randomized Experiments," Sociological Methods and Research, 27, 285-317.
Barnard, J., Frangakis, C., Hill, J. L., and Rubin, D. B. (2002), "School Choice in NY City: A Bayesian Analysis of an Imperfect Randomized Experiment," in Case Studies in Bayesian Statistics, Vol. V, eds. C. Gatsonis, R. E. Kass, B. Carlin, A. Carriquiry, A. Gelman, I. Verdinelli, and M. West, New York: Springer-Verlag, pp. 3-97.
Brandl, J. E. (1998), Money and Good Intentions Are Not Enough, or Why Liberal Democrats Think States Need Both Competition and Community, Washington, DC: Brookings Institute Press.
Carnegie Foundation for the Advancement of Teaching (1992), School Choice: A Special Report, San Francisco: Jossey-Bass.
Chubb, J. E., and Moe, T. M. (1990), Politics, Markets, and America's Schools, Washington, DC: Brookings Institute Press.
Cobb, C. W. (1992), Responsive Schools, Renewed Communities, San Francisco: Institute for Contemporary Studies.
Coleman, J. S., Hoffer, T., and Kilgore, S. (1982), High School Achievement, New York: Basic Books.
Cookson, P. W. (1994), School Choice: The Struggle for the Soul of American Education, New Haven, CT: Yale University Press.
Coulson, A. J. (1999), Market Education: The Unknown History, Bowling Green, OH: Social Philosophy & Policy Center.
Cox, D. R. (1958), Planning of Experiments, New York: Wiley.
D'Agostino, Ralph B., Jr., and Rubin, D. B. (2000), "Estimating and Using Propensity Scores With Incomplete Data," Journal of the American Statistical Association, 95, 749-759.
Education Week (1998), Quality Counts '98: The Urban Challenge, Public Education in the 50 States, Bethesda, MD: Editorial Projects in Education.
Frangakis, C. E., and Rubin, D. B. (1999), "Addressing Complications of Intention-to-Treat Analysis in the Combined Presence of All-or-None Treatment-Noncompliance and Subsequent Missing Outcomes," Biometrika, 86, 365-380.
--- (2001), "Addressing the Idiosyncrasy in Estimating Survival Curves Using Double-Sampling in the Presence of Self-Selected Right Censoring" (with discussion), Biometrics, 57, 333-353.
--- (2002), "Principal Stratification in Causal Inference," Biometrics, 58, 20-29.
Frangakis, C. E., Rubin, D. B., and Zhou, X. H. (2002a), "Clustered Encouragement Design With Individual Noncompliance: Bayesian Inference and Application to Advance Directive Forms" (with discussion), Biostatistics, 3, 147-164.
Frangakis, C. E., Brookmeyer, R. S., Varadhan, R., Mahboobeh, S., Vlahov, D., and Strathdee, S. A. (2002b), "Methodology for Evaluating a Partially Controlled Longitudinal Treatment Using Principal Stratification, With Application to a Needle Exchange Program," Technical Report NEP-06-02, Johns Hopkins University, Dept. of Biostatistics.
Fuller, B., and Elmore, R. F. (1996), Who Chooses? Who Loses? Culture, Institutions, and the Unequal Effects of School Choice, New York: Teachers College Press.
Gelfand, A. E., and Smith, A. F. M. (1990), "Sampling-Based Approaches to Calculating Marginal Densities," Journal of the American Statistical Association, 85, 398-409.
Gelman, A., Meng, X.-L., and Stern, H. (1996), "Posterior Predictive Assessment of Model Fitness via Realized Discrepancies" (with discussion), Statistica Sinica, 6, 733-760.
Gelman, A., and Rubin, D. B. (1992), "Inference from Iterative Simulation Using Multiple Sequences" (with discussion), Statistical Science, 7, 457-472.
Glynn, R. J., Laird, N. M., and Rubin, D. B. (1993), "Multiple Imputation in Mixture Models for Nonignorable Nonresponse With Follow-Ups," Journal of the American Statistical Association, 88, 984-993.
Goldberger, A. S., and Cain, G. G. (1982), "The Causal Analysis of Cognitive Outcomes in the Coleman, Hoffer and Kilgore Report," Sociology of Education, 55, 103-122.
Haavelmo, T. (1943), "The Statistical Implications of a System of Simultaneous Equations," Econometrica, 11, 1-12.
Hastings, W. (1970), "Monte Carlo Sampling Methods Using Markov Chains and Their Applications," Biometrika, 57, 97-109.
Hill, J. L., Rubin, D. B., and Thomas, N. (2000), "The Design of the New York School Choice Scholarship Program Evaluation," in Donald Campbell's Legacy, ed. L. Bickman, Newbury Park, CA: Sage Publications.
Hirano, K., Imbens, G. W., Rubin, D. B., and Zhou, A. (2000), "Assessing the Effect of an Influenza Vaccine in an Encouragement Design," Biostatistics, 1, 69-88.
Holland, P. (1986), "Statistics and Causal Inference," Journal of the American Statistical Association, 81, 945-970.
Imbens, G. W., and Angrist, J. D. (1994), "Identification and Estimation of Local Average Treatment Effects," Econometrica, 62, 467-476.
Imbens, G. W., and Rubin, D. B. (1997), "Bayesian Inference for Causal Effects in Randomized Experiments With Noncompliance," The Annals of Statistics, 25, 305-327.
Levin, H. M. (1998), "Educational Vouchers: Effectiveness, Choice, and Costs," Journal of Policy Analysis and Management, 17, 373-392.
Little, R. J. A. (1993), "Pattern-Mixture Models for Multivariate Incomplete Data," Journal of the American Statistical Association, 88, 125-134.
--- (1996), "Pattern-Mixture Models for Multivariate Incomplete Data With Covariates," Biometrics, 52, 98-111.
Little, R. J. A., and Rubin, D. B. (1987), Statistical Analysis With Missing Data, New York: Wiley.
Meng, X. L. (1996), "Posterior Predictive p Values," The Annals of Statistics, 22, 1142-1160.
Neal, D. (1997), "The Effects of Catholic Secondary Schooling on Educational Achievement," Journal of Labor Economics, 15, 98-123.
Neyman, J. (1923), "On the Application of Probability Theory to Agricultural Experiments: Essay on Principles, Section 9," translated in Statistical Science, 5, 465-480.
Peterson, P. E., and Hassel, B. C. (eds.) (1998), Learning from School Choice, Washington, DC: Brookings Institute Press.
Peterson, P. E., Myers, D. E., Howell, W. G., and Mayer, D. P. (1999), "The Effects of School Choice in New York City," in Earning and Learning: How Schools Matter, eds. S. E. Mayer and P. E. Peterson, Washington, DC: Brookings Institute Press.
Robins, J. M., Greenland, S., and Hu, F.-C. (1999), "Estimation of the Causal Effect of a Time-Varying Exposure on the Marginal Mean of a Repeated Binary Outcome" (with discussion), Journal of the American Statistical Association, 94, 687-712.
Rosenbaum, P. R., and Rubin, D. B. (1983), "The Central Role of the Propensity Score in Observational Studies for Causal Effects," Biometrika, 70, 41-55.
Rubin, D. B. (1974), "Estimating Causal Effects of Treatments in Randomized and Non-Randomized Studies," Journal of Educational Psychology, 66, 688-701.
--- (1977), "Assignment to Treatment Groups on the Basis of a Covariate," Journal of Educational Statistics, 2, 1-26.
--- (1978a), "Bayesian Inference for Causal Effects: The Role of Randomization," The Annals of Statistics, 6, 34-58.
--- (1978b), "Multiple Imputations in Sample Surveys: A Phenomenological Bayesian Approach to Nonresponse," in Proceedings of the Survey Research Methods Section, American Statistical Association, pp. 20-28.
--- (1979), "Using Multivariate Matched Sampling and Regression Adjustment to Control Bias in Observational Studies," Journal of the American Statistical Association, 74, 318-328.
--- (1980), Comment on "Randomization Analysis of Experimental Data: The Fisher Randomization Test," Journal of the American Statistical Association, 75, 591-593.
--- (1984), "Bayesianly Justifiable and Relevant Frequency Calculations for the Applied Statistician," The Annals of Statistics, 12, 1151-1172.
--- (1990), "Comment: Neyman (1923) and Causal Inference in Experiments and Observational Studies," Statistical Science, 5, 472-480.
The Coronary Drug Project Research Group (1980), "Influence of Adherence to Treatment and Response of Cholesterol on Mortality in the Coronary Drug Project," New England Journal of Medicine, 303, 1038-1041.
Tinbergen, J. (1930), "Determination and Interpretation of Supply Curves: An Example," in The Foundations of Econometric Analysis, eds. D. Hendry and M. Morgan, Cambridge, U.K.: Cambridge University Press.
Wilms, D. J. (1985), "Catholic School Effect on Academic Achievement: New Evidence From the High School and Beyond Follow-Up Study," Sociology of Education, 58, 98-114.

Comment

Bengt MUTHÉN, Booil JO, and C. Hendricks BROWN

1. INTRODUCTION

The article by Barnard, Frangakis, Hill, and Rubin (BFHR) is timely in that the Department of Education is calling for more randomized studies in educational program evaluation (see the discussion of the "No Child Left Behind" initiative in, e.g., Slavin 2002). BFHR can serve as a valuable pedagogical example of a successful, sophisticated statistical analysis of a randomized study. Our commentary is intended to provide additional pedagogical value to benefit the planning and analysis of future studies, drawing on experiences and research within our research group. [The Prevention Science Methodology Group (PSMG; www.psmg.hsc.usf.edu), co-PIs Brown and Muthén, has collaborated over the last 15 years with support from the National Institute of Mental Health and the National Institute on Drug Abuse.]

BFHR provides an exemplary analysis of the data from an imperfect randomized trial that suffers from several complications simultaneously: noncompliance, missing data in outcomes, and missing data in covariates. We are very pleased to

Bengt Muthén is Professor and Booil Jo is postdoc, Graduate School of Education and Information Studies, University of California, Los Angeles, Los Angeles, CA 90095 (E-mail: [email protected]). C. Hendricks Brown is Professor, Department of Epidemiology and Biostatistics, University of South Florida, Tampa, FL 33620. The research of the first author was supported by National Institute on Alcohol Abuse and Alcoholism grant K02 AA 00230. The research of the second and third authors was supported by National Institute on Drug Abuse and National Institute of Mental Health grant MH40859. The authors thank Chen-Pin Wang for research assistance, Joyce Chappell for graphical assistance, and the members of the Prevention Science Methodology Group and the Fall ED299A class for helpful comments.

see their application of cutting-edge Bayesian methods for dealing with these complexities. In addition, we believe the methodological issues and the results of the study have important implications for the design and analysis of randomized trials in education and for related policy decisions.

BFHR provides results of the New York City school choice experiment based on 1-year achievement outcomes. With the planned addition of yearly follow-up data, growth models can provide an enhanced examination of causal impact. We discuss how such growth modeling can be incorporated and provide a caution that applies to BFHR's use of only one posttest occasion. We also consider the sensitivity of the latent class ignorability assumption in combination with the assumption of compound exclusion.

2. LONGITUDINAL MODELING ISSUES

BFHR focuses on variation in treatment effect across compliance classes. This part of the commentary considers variation in treatment effect across a different type of class, based on the notion that the private school treatment effect might very well be quite different for children with different achievement development. (Also of interest is potential variation in treatment effects across schools, with respect to both the public school the child originated in and the private school the child was moved to, but this multilevel aspect of the data is left aside here for lack of space.) To study such a "treatment-baseline interaction" (or "treatment-trajectory interaction"), we will switch from BFHR's pretest-posttest analysis framework (essentially a very advanced ANCOVA-type analysis) to the growth mixture modeling framework of Muthén et al. (2002). An underlying rationale for this modeling is that individuals at different initial status levels, and on different trajectories, may benefit differently from a given treatment. ANCOVA controls for initial status as measured by the observed pretest score. Unlike the observed pretest score, the latent variable of initial status is free of time-specific variation and measurement error.

The focus on longitudinal aspects of the New York School Choice Study (NYSCS) is both substantively and statistically motivated. First, treatment effects may not have gained full strength after only a 1-year stay in a private school. The NYSCS currently has data from three follow-ups, that is, providing repeated-measures data from four grades. Although BFHR used percentile scores that do not lend themselves to growth modeling, a conversion to "scale scores" (i.e., IRT-based equated scores) should be possible, enabling growth modeling. Unfortunately, educational research traditionally uses scales that are unsuitable for growth modeling, such as percentile scores, normal curve equivalents, and grade equivalents (for a comparison in a growth context, see Seltzer, Frank, and Bryk 1994). Hopefully, this tradition can be changed. Second, the use of information from more time points than pretest and posttest makes it possible to identify and estimate models that give a richer description of the normative development in the control group and how the treatment changes this development.

Consider three types of latent variables for individual i. The first type, C_i, refers to BFHR's compliance principal strata. The next two relate to the achievement development, as expressed by a growth mixture model: T_i refers to trajectory class, and η_i refers to random effects within trajectory class (the within-class model is a regular mixed-effects model). Unlike the latent class variable C_i, the latent class variable T_i is a fully unobserved variable, as is common in latent variable modeling (see, e.g., Muthén 2002a). Consider the likelihood expression for individual i, using the [·] notation to denote probabilities/densities,

$[C_i, T_i \mid X_i]\,[\eta_i \mid C_i, T_i, X_i]\,[Y_i \mid \eta_i, C_i, T_i, X_i] \times [U_i \mid \eta_i, C_i, T_i, X_i]\,[R_i \mid Y_i, \eta_i, C_i, T_i, X_i], \qquad (1)$

where X_i denotes covariates; U_i denotes a compliance stratum indicator (with C_i perfectly measured by U_i in the treatment group for never-takers, and perfectly measured in the control group for always-takers, other group-class combinations having missing data); and R_i denotes indicators for missing data on the repeated-measures outcomes Y_i (pretreatment and posttreatment achievement scores). This type of model can be fitted in the latent variable modeling framework of the Mplus program (Muthén and Muthén 1998-2002, tech. app. 8), which has implemented an EM-based maximum likelihood estimator. (For related references, see the Mplus website, www.statmodel.com.) As a special case of (1), conventional random-effects growth modeling includes η_i but excludes C_i and T_i, and assumes missingness at random, so that the last term in (1) is ignored. Growth mixture modeling (Muthén and Shedden 1999; Muthén et al. 2002; Muthén and Muthén 1998-2002) includes η_i and T_i. BFHR includes C_i but not T_i (or η_i), and includes the last term in (1), drawing on the latent ignorability of Frangakis and Rubin (1999). Muthén and Brown (2001) studied latent ignorability related to T_i in the last term of (1). In randomized studies, it would be of interest to study C_i and T_i classes jointly, because individuals in different trajectory classes may show different compliance, and missingness may be determined by these classes jointly.

If data have been generated by a growth mixture model with treatment effects varying across trajectory classes, what would pretest-posttest analysis such as that in BFHR reveal? To judge the possibility of such treatment-trajectory interaction in the NYSCS, we considered several recent applications of growth mixture modeling that have used T_i to represent qualitatively different types of trajectories for behavior and achievement scores on children in school settings. Drawing on these real-data studies, two growth mixture scenarios were investigated. (A detailed description of these real-data studies and scenarios and their parameter values is given in Mplus Web Note 5 at www.statmodel.com/mplus/examples/webnote.html.) For simplicity, no missing data on the outcome or pretest is assumed, and C_i classes are not present. In a three-class scenario, the treatment effect is noteworthy only for a 70% middle class, assuming that the low-class membership (10%) hinders individuals from benefiting from the treatment and that the high-class membership (20%) does not really need the treatment. The achievement development in the three-class scenario is shown in Figure 1(a), and the corresponding posttest (y2)-pretest (y1) regressions are shown in Figure 1(b). The lines denoted ANCOVA show a regular ANCOVA analysis, allowing for an interaction between treatment and pretest (different slopes). In the three-class scenario, the ANCOVA interaction is not significant at n = 2,000, and the treatment effect in the middle class is underestimated by 20%, but overestimated in the other two classes. In a two-class scenario (not shown here), where the treatment is noteworthy only for individuals in the low class (50%), ANCOVA detects an interaction that is significant at the NYSCS sample size of n = 2,000, but underestimates the treatment effect for most children in the low class. (At the low-class average pretest value of 0, the treatment effect is underestimated by 32%.)

Although the NYSCS children are selected from low-performing schools (the average NYSCS math and reading percentile rankings are around 23-28), there may still be sufficient heterogeneity among children in their achievement growth to make a treatment-trajectory interaction plausible. The three-class scenario is possible, perhaps with more children in the low class relative to the other two classes. If this is the case, the ANCOVA analysis shown in Figure 1 suggests a possible reason for BFHR's finding of low treatment effects. The empirical studies and the results in Figure 1 suggest that future program evaluations may benefit from exploring variation in treatment effects across children characterized by different development. Using data from at least two posttreatment time points (three time points total), the class-specific treatment effects generated in these data can be well recovered by growth mixture modeling. (Monte Carlo simulation results are given in Mplus Web Note 5 at www.statmodel.com/mplus/examples/webnote.html.)
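To see the flavor of such a scenario, the following hypothetical simulation (all parameter values are illustrative only, not those of Mplus Web Note 5) generates a three-class population with a treatment effect confined to the middle class and fits an ANCOVA with a treatment-by-pretest interaction.

import numpy as np

rng = np.random.default_rng(0)
n = 2000
cls = rng.choice(3, size=n, p=[.1, .7, .2])      # low / middle / high class
mu = np.array([-1.0, 0.0, 1.0])[cls]             # latent initial status by class
y1 = mu + rng.normal(0, .5, n)                   # pretest
z = rng.integers(0, 2, n)                        # randomized treatment
effect = np.where(cls == 1, .4, 0.0)             # only the middle class benefits
y2 = mu + effect * z + rng.normal(0, .5, n)      # posttest

# ANCOVA with a treatment-by-pretest interaction: y2 ~ 1 + z + y1 + z:y1
X = np.column_stack([np.ones(n), z, y1, z * y1])
beta, *_ = np.linalg.lstsq(X, y2, rcond=None)
for c in range(3):
    pre = y1[cls == c].mean()                    # class-average pretest value
    implied = beta[1] + beta[3] * pre            # ANCOVA-implied effect there
    true = .4 if c == 1 else 0.0
    print(f"class {c}: ANCOVA-implied effect {implied:.2f}, true {true:.2f}")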


Figure 1. Growth Mixture Modeling Versus Pretest-Posttest Analysis.

A more flexible analysis is obtained with more posttreatment time points. An improved design for the determination of the latent trajectory classes is to use more than one pretreatment time point, so that the trajectory class membership is better determined before the treatment starts.

3. COMPOUND EXCLUSION AND LATENT IGNORABILITY

Based on the ideas of principal stratification (Frangakis and Rubin 2002) and latent ignorability (Frangakis and Rubin 1999), BFHR successfully demonstrates that the complexities of educational studies can be better handled under more explicit and flexible sets of assumptions. Although we think that their structural assumptions are reasonable in the NYSCS, we would like to add some thoughts on the plausibility of two other assumptions, considering more general situations.

Compound exclusion (CE) is one of the key structural assumptions in identifying principal effects under latent ignorability. However, the plausibility of this assumption can be questioned in practice (Frangakis et al. 2002; Hirano et al. 2000; Jo 2002a, 2002c; Shadish, Cook, and Campbell 2002; West and Sagarin 2000). In the NYSCS, it seems realistic to assume that winning a lottery has some positive impact on always-takers; however, it is less clear how winning a lottery will affect never-takers. One possibility is that winning a lottery has a negative impact on parents, because they fail to benefit from it. Discouraged parents may have a negative influence on a child's test scores or response behaviors. This negative effect may become more evident if noncompliance is due to parents' low expectation of, or lack of interest in, their child's education. Another possibility is that winning a lottery has a positive impact on a child's test scores or response behaviors. For example, parents who are discouraged by being unable to send their child to private schools even with vouchers may try harder to improve the quality of existing resources (e.g., in the public school their child attends) and be more motivated to support their child to improve his or her academic performance. Given these competing possibilities, it is not easy to predict whether and how CE is violated.

Depending on the situation, causal effect estimates can be quite sensitive to violation of the exclusion restriction in outcome missingness (Jo 2002b), which is less known than the impact of violating the exclusion restriction in observed outcomes (Angrist et al. 1996; Jo 2002a). The implication of possible violation of CE and its impact is that the relative benefit of models assuming latent ignorability (LI) and standard ignorability (SI) depends on degrees of deviation from CE and SI. Identification of causal effects under LI relies on the generalized (compound) exclusion restriction (i.e., both on the outcomes and missingness of outcomes), whereas identification of causal effects under SI relies on the standard exclusion restriction (i.e., only on the outcomes). Therefore, in some situations, the impact of deviation from CE may outweigh the impact of deviation from SI, resulting in more biased causal effect estimates in models assuming LI than in models assuming SI (Jo 2002b). For example, if SI holds but CE is seriously violated (say, a 20% increase in the response rate due to treatment assignment for compliers and a 15% increase for never-takers), causal effect estimates and the coverage probability assuming LI and CE can drastically deviate from the true value and the nominal level. This type of violation does not affect models assuming SI and the standard exclusion restriction. To empirically examine the plausibility of SI, LI, and CE, it will be useful to do sensitivity analyses of models imposing different combinations of these assumptions. As BFHR points out, this investigation can be conducted by relaxing compound exclusion (e.g., Frangakis et al. 2002; Hirano et al. 2000) or by using alternative structural assumptions (e.g., Jo 2002c). More research is needed to examine the efficiency of these alternative models and to explore factors associated with insensitivity of LI models to violation of compound exclusion.

4. CONCLUSION

Causal inferences of the type BFHR provides are a dramatic improvement over the existing literature now available on the question of whether school choice will produce better achievement outcomes for children in an urban public school system. The randomized lottery provides an exceptionally powerful tool for examining the impact of a program, far more useful than observational studies that have causal change intertwined hopelessly with self-selection factors. Statisticians are just now investigating variations in such principal strata analyses, that is, those involving latent classes formed as a function of randomized trials involving intervention invitations (such as vouchers), encouragement designs, and field trial designs involving more than one randomization (Brown and Liao 1999). The latent categories in this article, which BFHR labels "complier," "never-taker," "always-taker," and "defier," represent only one type of design. Other terms may be more relevant to the scientific questions underlying trials in which subjects are randomly assigned to different levels of invitation (e.g., Angrist and Imbens 1995) or different levels of implementation. Such trials not only have great potential for examining questions of effectiveness, sustainability, and scalability, but also require terms more consistent with adherence than compliance. Again, we congratulate the authors on an important addition to the methodological literature that we predict will have lasting impact.

ADDITIONAL REFERENCES

Angrist, J. D., and Imbens, G. W. (1995), "Two-Stage Least Squares Estimation of Average Causal Effects in Models With Variable Treatment Intensity," Journal of the American Statistical Association, 90, 431-442.
Brown, C. H., and Liao, J. (1999), "Principles for Designing Randomized Preventive Trials in Mental Health: An Emerging Developmental Epidemiologic Perspective," American Journal of Community Psychology, 27, 673-709.
Jo, B. (2002a), "Model Misspecification Sensitivity Analysis in Estimating Causal Effects of Interventions With Noncompliance," Statistics in Medicine, 21, 3161-3181.
--- (2002b), "Sensitivity of Causal Effects Under Ignorable and Latent Ignorable Missing-Data Mechanisms," available at www.statmodel.com/mplus/examples/jo.
--- (2002c), "Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications" (with discussion), Journal of Educational and Behavioral Statistics, 27, 385-420.
Muthén, B. (2002a), "Beyond SEM: General Latent Variable Modeling," Behaviormetrika, 29, 81-117.
Muthén, B., and Brown, C. H. (2001), "Non-Ignorable Missing Data in a General Latent Variable Modeling Framework," presented at the annual meeting of the Society for Prevention Research, Washington, DC, June 2001.
Muthén, B., Brown, C. H., Masyn, K., Jo, B., Khoo, S. T., Yang, C. C., Wang, C. P., Kellam, S., Carlin, J., and Liao, J. (2002), "General Growth Mixture Modeling for Randomized Preventive Interventions," Biostatistics, 3, 459-475.
Muthén, L., and Muthén, B. (1998-2002), Mplus User's Guide, Los Angeles: authors.
Muthén, B., and Shedden, K. (1999), "Finite Mixture Modeling With Mixture Outcomes Using the EM Algorithm," Biometrics, 55, 463-469.
Seltzer, M. H., Frank, K. A., and Bryk, A. S. (1994), "The Metric Matters: The Sensitivity of Conclusions About Growth in Student Achievement to Choice of Metric," Educational Evaluation and Policy Analysis, 16, 41-49.
Shadish, W. R., Cook, T. D., and Campbell, D. T. (2002), Experimental and Quasi-Experimental Designs for Generalized Causal Inference, Boston: Houghton Mifflin.
Slavin, R. E. (2002), "Evidence-Based Education Policies: Transforming Educational Practice and Research," Educational Researcher, 31, 15-21.
West, S. G., and Sagarin, B. J. (2000), "Participant Selection and Loss in Randomized Experiments," in Research Design, ed. L. Bickman, Thousand Oaks, CA: Sage, pp. 117-154.

Comment

Alan B. KRUEGER and Pei ZHU

The New York City school voucher experiment provides some of the strongest evidence yet available on the effect of private school vouchers on student test achievement. Barnard et al. provide a useful reevaluation of the experiment, and some of the authors were integrally involved in the design of the experiment. We will leave it to other discussants to comment on the Bayesian methodological advances in their paper, and instead comment on the substantive lessons that can be learned from the experiment and the practical lessons that emerge from the novel design of the experiment.

We have the advantage of having access to more complete data than Barnard et al. used in preparing their paper. This includes more complete baseline test information, data on multichild families as well as single-child families, and three years of follow-up test data instead of just one year of follow-up data.

Alan B. Krueger is Professor and Pei Zhu is graduate student, Economics Department, Princeton University, Princeton, NJ 08544 (E-mail: [email protected]). In part, this comment extends and summarizes results of Krueger and Zhu (2003). Readers are invited to read that article for a more in-depth analysis of many of the issues raised in this comment. Some of the results presented here differ slightly from those in our earlier article, however, because the definition of family size in this comment corresponds to the one used by Barnard et al., rather than to the definition in our earlier article.

In our comment, we use the more comprehensive sample because results for this sample are more informative, but we highlight where differences arise from using the sample analyzed by Barnard et al.

Three themes emerge from our analysis. First, simplicity and transparency are underappreciated virtues in statistical analysis. Second, it is desirable to use the most recent, comprehensive data for the widest sample possible. Third, there is no substitute for probing the definitions and concepts that underlie the data.

1. RANDOM ASSIGNMENT

As Barnard et al. explain, the voucher experiment entailed a complicated block design, and different random assignment procedures were used in the first and second set of blocks. In the first block, a propensity matched-pairs design (PMPD) method was used. In this block, far more potential control families were available than was money to follow them up. Rather than select a random sample of controls for follow-up after randomly selecting the treatments, the researchers estimated a propensity score model to align controls with the treatments, and then used a nearest-available-neighbor Mahalanobis match to select specific paired control families for follow-up. In principle, the PMPD design should improve the precision of the estimates by reducing the chance imbalances that can occur in a simple randomized block design. A standard randomized block design was used in the remaining four blocks.



Table 1. Efficiency Comparison of Two Methods of Random Assignment: Estimated Treatment Effect and Standard Error, by Method of Random Assignment

                                 Propensity score            Randomized blocks           Relative sampling     Sample-size-adjusted
                                 match subsample             subsample                   variance              relative sampling
Year     Model                   Coefficient     N           Coefficient     N           (RB/PMPD)             variance (RB/PMPD)

All students
Year 1   Control for baseline    .33 (1.40)      721         2.10 (1.38)     734         .972                  .989
         Omit baseline           .32 (1.40)      1,048       -1.02 (1.58)    1,032       1.274                 1.254
Year 3   Control for baseline    1.02 (1.56)     613         .67 (1.74)      637         1.244                 1.293
         Omit baseline           -.31 (1.57)     900         -.45 (1.78)     901         1.285                 1.287

African-American students
Year 1   Control for baseline    .32 (1.88)      302         8.10 (1.71)     321         .827                  .879
         Omit baseline           2.10 (1.99)     436         3.19 (2.07)     447         1.082                 1.109
Year 3   Control for baseline    4.21 (2.51)     247         6.57 (2.09)     272         .693                  .764
         Omit baseline           2.26 (2.47)     347         3.05 (2.31)     386         .875                  .973

NOTE: Standard errors are in parentheses. The treatment effect coefficient is from a regression of test scores on a dummy indicating assignment to receive a voucher (1 = yes), lottery randomization strata dummies, and, in indicated models, baseline test scores. Bootstrap standard errors account for within-family correlation in residuals. Year 1 or 3 refers to follow-up year. The last column adjusts the relative sampling variances of the treatment effects for differences in relative sample sizes. The sample of African-American students consists of those whose mothers' race/ethnicity is identified as non-Hispanic Black/African-American.


Barnard et al. emphasize that the PMPD block was more balanced for 15 of the 21 baseline variables that they examined. But the key question is the extent to which the precision of the estimates was improved, not the covariate balance, because both blocks yield unbiased estimates. If the covariates explain relatively little of the outcome variable, then the PMPD will add very little. In Table 1, we provide standard intent-to-treat (ITT) estimates, that is, the difference in mean test scores between treatments and controls, conditional on the strata used for random assignment (family size by block by high/low-achieving school), and, more importantly, their standard errors, separately for the PMPD subsample and the randomized block subsample. The outcome variable is the average of the national percentile rank on the math and reading segments of the Iowa Test of Basic Skills, taken either 1 year or 3 years after random assignment. Two sets of results are presented: one set controlling for baseline test scores, for the subsample with baseline scores, and the other without controlling for scores, for the larger sample. The last column adjusts the relative sampling variances for differences in sample size between the blocks.
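A sketch of this conventional ITT regression, with hypothetical column names (score, voucher, stratum, weight, family_id) and a family-level bootstrap for the within-family correlation noted in Table 1, might look as follows.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def itt_with_family_bootstrap(df, n_boot=500, seed=0):
    """ITT regression of score on voucher offer plus strata dummies, with a
    bootstrap over families to respect within-family correlation."""
    formula = "score ~ voucher + C(stratum)"  # add baseline score where used
    point = smf.wls(formula, data=df, weights=df["weight"]).fit().params["voucher"]
    rng = np.random.default_rng(seed)
    families = df["family_id"].unique()
    boot = []
    for _ in range(n_boot):
        resampled = rng.choice(families, size=len(families), replace=True)
        bdf = pd.concat([df[df["family_id"] == f] for f in resampled])
        boot.append(smf.wls(formula, data=bdf,
                            weights=bdf["weight"]).fit().params["voucher"])
    return point, np.std(boot, ddof=1)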

For all students (panel A), the standard errors are about 10% smaller in the PMPD sample. For the Black students (panel B), however, the standard errors are essentially the same or slightly larger in the PMPD sample. This finding is unexpected, because Hill, Rubin, and Thomas (2000) predicted that analyses of subgroups would have more power in the PMPD design, because matching should lead to a more nearly equal representation of subgroups in the treatment and control groups. In addition, the covariate balance is less good for the Black students in the PMPD than in the randomized blocks if we compute Z scores for the differences between treatments and controls for the baseline covariates. One possibility is that, because Black students could have been matched to non-Black students in the PMPD design, this subsample was actually less balanced than the subsample of Blacks from the more conventional randomized block design.

The increase in power from the PMPD design is relatively modest, even in the full sample. Of much more practical importance is the fact that the complexity of the PMPD design caused Mathematica to initially calculate incorrect baseline sample weights. The baseline weights were designed to make the follow-up sample representative of all eligible applicants for vouchers. The initial weights assigned much too little importance to the control sample in the PMPD block. Because these families represented many unselected controls, they should have been weighted heavily. The revised weights increased the weight on the PMPD controls by 620%. In contrast, the weight increased by just 18% for the rest of the sample.

This error had grave consequences for the initial inferences and analyses of the data. First, using the faulty weights, Peterson, Myers, and Howell (1998) reported highly statistically significant differences in baseline test scores between the treatment and control groups, with control group members scoring higher on both the math and reading exams. After the mistake was discovered and the weights were revised, however, the baseline differences became small and statistically insignificant. (We note, however, that if we limit attention to the students from single-children families that Barnard et al. use, and use the more complete baseline test score data, then there is a statistically significant difference in baseline reading scores between treatments and controls, even with the revised baseline weights. If we pool together students from single-child and multichild families, then the baseline difference is insignificant. This is another reason why we think it is more appropriate to use the broader sample that includes children from all families.)

Second, because of the inaccurate inference that there was a difference in baseline ability between treatments and controls, Mathematica's researchers were discouraged from analyzing (or at least from presenting) results that did not condition on baseline scores. This is unfortunate, because conditioning on baseline scores caused the researchers to drop from the sample all the students who were initially in kindergarten (because these students were not given baseline tests) and 11% of students initially in grades 1-4. Including students with missing baseline scores increases the sample by more than 40% and expands the population to which the results can be generalized. As explained later, and in an article by Krueger and Zhu (2003), qualitatively different results are found if students with missing baseline scores are included in the sample. Because assignment to receive a voucher was random within lottery strata, estimates that do not condition on baseline scores are nonetheless unbiased. Moreover, it is inefficient to exclude students with missing baseline scores (most of whom had follow-up test scores), and such a sample restriction can potentially cause sample selection bias.

Of lesser importance is the fact that the PMPD design also complicates the calculation of standard errors. The matching of treatments and controls on selected covariates creates a dependence between paired observations. Moreover, if the propensity score model is misspecified, then the equation error also creates a dependence between observations. Researchers have not taken this dependence into account in the calculation of standard errors. We suspect, however, that this is likely to cause only a small understatement of the standard errors, because the covariates do not account for much of the residual variance of test scores and because the PMPD sample is only about half of the overall sample. (We tried to compute the propensity score to adjust the standard errors ourselves, but were unable to replicate the PMPD model, because it was not described in sufficient detail and because the computer programs are proprietary. This also prevented us from replicating estimates in Table 4 of Barnard et al.)

The problems created by the complicated experimental design, particularly concerning the weights, lead us to reiterate the advice of Cochran and Cox (1957): "A good working rule is to use the simplest experimental design that meets the needs of the occasion." It seems to us that in this case, the PMPD design introduced unnecessary complexity that inadvertently led to a consequential mistake in the computation of weights, with very little improvement in efficiency. This was not the fault of the architects of the PMPD design, who did not compute the weights, but it was related to the complexity of the design. Simplicity and transparency, in designs, methods, and procedures, can help avoid mistakes down the road, which are almost inevitable in large-scale empirical studies. Indeed, Mathematica recently informed us that they still do not have the baseline weights exactly correct, 5 years after random assignment.

2. INTENT-TO-TREAT ESTIMATION RESULTS

Barnard et al. devote much attention to addressing potential problems caused by missing data for the sample of students in grades 1-4 at baseline; this follows the practice of earlier reports by Mathematica, which restricted the analysis of student outcomes (but not parental responses, such as satisfaction) to those enrolled in grades 1-4 at baseline. In our opinion, a much more important substantive problem results from the exclusion of students in the kindergarten cohort, who were categorically dropped because they were not given baseline tests. Because assignment to treatment status was random (within strata), a simple comparison of means between treatments and controls, without conditioning on baseline scores, provides an unbiased estimate of the average treatment effect. Moreover, as Barnard et al. and others show, treatments and controls were well balanced in terms of baseline characteristics, so there is no reason to suspect that random assignment was somehow corrupted.

Table 2 presents regression estimates of the ITT effect using various samples. Because we cannot replicate the Bayesian estimates without knowing the propensity score, we present conventional ITT estimates. For each entry in the first column, we regressed the test score percentile rank on a voucher offer dummy, 30 dummies indicating lottery strata (block × family size × high/low school), and baseline test scores. The sample is limited to those with baseline test data. Our results differ from the ITT estimates in Table 5 of Barnard et al. because we use the revised weights and a more comprehensive sample that also includes students from multichild families. Barnard et al. report that their Bayesian estimates are "generally more stable" than the conventional ITT estimates, "which in some cases are not even credible," but the extreme ITT estimates that they refer to, for 4th graders from high-achieving schools, are based on only 15 students. For the samples that pool students across grades, the results of the conventional ITT estimates are a priori credible and probably not statistically different from their Bayesian estimates.

In the second column we use the same sample as in column 1, but omit the baseline test score from the model. In the third column we expand the sample to include those with missing baseline scores. In the fourth column we continue to include students with missing baseline scores, but restrict the sample to those initially in low-achieving public schools.
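To make the column 1 specification concrete, here is a minimal sketch in Python, assuming a pandas DataFrame df with hypothetical column names (post_rank, voucher, stratum, base_math, base_read, weight, family_id). The comment used bootstrap standard errors that allow for within-family correlation; the cluster-robust variance below is only a rough stand-in for that choice.

```python
# Sketch of the conventional ITT regression (column 1): percentile rank on a
# voucher-offer dummy, 30 lottery-stratum dummies, and baseline scores.
import statsmodels.formula.api as smf

sub = df.dropna(subset=["base_math", "base_read"])  # students with baseline data

fit = smf.wls(
    "post_rank ~ voucher + C(stratum) + base_math + base_read",
    data=sub,
    weights=sub["weight"],  # revised baseline weights
).fit(cov_type="cluster", cov_kwds={"groups": sub["family_id"]})

print(fit.params["voucher"], fit.bse["voucher"])  # ITT estimate and its SE
```

Columns 2–4 amount to dropping the baseline-score terms and widening the sample, as described above.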

We focus mainly on the results for Black students and students from low-achieving public schools, because the effect of offering a voucher on either reading or math scores for the overall sample is always statistically insignificant, and because most public policy attention has focused on these two groups.

Although it has not been made explicit in previous studies of these data, Black students have been defined as children with a non-Hispanic Black/African-American mother. We use this definition and a broader one to probe the sensitivity of the results.

Omitting baseline scores has little qualitative effect on the estimates when the same sample is used (compare columns 1 and 2); the coefficient typically changes by no more than half a standard error. For year 3 scores for Black students, for example, the treatment effect on the composite score is 5.2 points (t = 3.2) when controlling for baseline scores and 5.0 points (t = 2.6) when not controlling. This stability is expected in a randomized experiment.


Table 2. Estimated Treatment Effects With and Without Controlling for Baseline Scores: Coefficient on Voucher Dummy From Test Score Regressions for Various Samples

Columns: (1) subsample with baseline scores, controls for baseline scores; (2) subsample with baseline scores, omits baseline scores; (3) full sample, omits baseline scores; (4) low applicant school subsample, omits baseline scores.

Test        Sample                              (1)            (2)            (3)            (4)

First follow-up test
Composite   Overall                          1.27 (0.96)    0.42 (1.28)   -0.33 (1.06)   -0.60 (1.06)
            Mother Black, non-Hispanic       4.31 (1.28)*   3.64 (1.73)*   2.66 (1.42)    1.75 (1.57)
            Either parent Black, non-Hisp.   3.55 (1.24)*   2.52 (1.73)    1.38 (1.42)    0.53 (1.55)
Reading     Overall                          1.11 (1.04)    0.03 (1.36)   -1.13 (1.17)   -1.17 (1.18)
            Mother Black, non-Hispanic       3.42 (1.59)*   2.60 (1.98)    1.38 (1.74)    0.93 (1.84)
            Either parent Black, non-Hisp.   2.68 (1.50)    1.49 (1.97)   -0.09 (1.71)   -0.53 (1.81)
Math        Overall                          1.44 (1.17)    0.80 (1.45)    0.47 (1.17)   -0.03 (1.15)
            Mother Black, non-Hispanic       5.28 (1.52)*   4.81 (1.93)*   4.01 (1.54)*   2.68 (1.63)
            Either parent Black, non-Hisp.   4.42 (1.48)*   3.54 (1.90)    2.86 (1.51)    1.59 (1.63)

Third follow-up test
Composite   Overall                          0.90 (1.14)    0.36 (1.40)   -0.38 (1.19)    0.10 (1.16)
            Mother Black, non-Hispanic       5.24 (1.63)*   5.03 (1.92)*   2.65 (1.68)    3.09 (1.67)
            Either parent Black, non-Hisp.   4.70 (1.60)*   4.27 (1.92)*   1.87 (1.68)    1.90 (1.67)
Reading     Overall                          0.25 (1.26)   -0.42 (1.51)   -1.00 (1.25)   -0.08 (1.24)
            Mother Black, non-Hispanic       3.64 (1.87)    3.22 (2.19)    1.25 (1.83)    2.06 (1.83)
            Either parent Black, non-Hisp.   3.37 (1.86)    2.75 (2.18)    0.76 (1.83)    1.23 (1.82)
Math        Overall                          1.54 (1.30)    1.15 (1.53)    0.24 (1.33)    0.29 (1.28)
            Mother Black, non-Hispanic       6.84 (1.86)*   6.84 (2.09)*   4.05 (1.86)*   4.13 (1.85)*
            Either parent Black, non-Hisp.   6.04 (1.83)*   5.78 (2.09)*   2.97 (1.85)    2.58 (1.82)

NOTE: Standard errors are in parentheses. The dependent variable is the test score national percentile rank (NPR). The reported coefficient is the coefficient on the voucher offer dummy. All regressions control for 30 randomization strata dummies; models in column (1) also control for baseline math and reading test scores. Bootstrap standard errors are robust to correlation in residuals among students in the same family. An asterisk indicates that the absolute t ratio exceeds 1.96.

Sample sizes in year 1 for the subsamples with/without baseline scores: overall, 1,455/2,080; mother Black, 623/883; either parent Black, 684/968.

Sample sizes in year 3 for the subsamples with/without baseline scores: overall, 1,250/1,801; mother Black, 519/733; either parent Black, 572/807.

Sample sizes in year 1 for the low-school subsample without baseline scores: overall, 1,802; mother Black, 770; either parent Black, 843.

Sample sizes in year 3 for the low-school subsample without baseline scores: overall, 1,569; mother Black, 647; either parent Black, 712.

When the sample is expanded to include those with missing baseline scores in column 3, however, the ITT effect falls almost in half in year 3 for Black students, to 2.65 points, with a t ratio of 1.58. The effect on the math score is statistically significant, as Barnard et al. emphasize. Qualitatively similar results are found if baseline covariates, such as mother's education and family income, are included as regressors (see Krueger and Zhu 2003).

The classification of Black students used in the study is somewhat idiosyncratic. Race and ethnicity were collected in a single question in the parental survey, contrary to the OMB guidelines for government surveys, so it is impossible to identify Blacks of Hispanic origin (of whom there are a large number in New York). In addition, the survey did not directly ask about the child's race or ethnicity, so children's race must be inferred from parents' race/ethnicity. A broader definition of Black students, and one that probably comes closer to the definition used in government surveys and most other educational research, would also include those who have a Black father in the sample of Black students. According to the 1990 Census, race was reported as Black for 85% of the children with a Black father and a Hispanic mother (the overwhelming non-Black portion of the sample) in the New York metropolitan region.

Hence we also present results for a sample in which either parent is listed as "Black, non-Hispanic." This increases the sample of Black students by about 10%, and the results are even weaker for this sample. For example, the ITT estimate using the more comprehensive sample of Black students enrolled in grades K–4 at baseline is 1.87 points (t = 1.11) on the third-year follow-up composite test, although even with this sample the effect on math is significant at the 10% level in year 1, as Barnard et al. have found, and almost significant in year 3. So we conclude that the effect of the opportunity to use a private school voucher on the composite score for the most comprehensive sample of Black students is insignificantly different from 0, although it is possible that there was initially a small beneficial effect on the math score for Black students.

In the final column we present results for the subsample of students originally enrolled in schools with average test scores below the median score in New York City. In each case the results for this subsample are quite similar to those for the full sample, and a formal test of the null hypothesis that students from low- and high-achieving schools benefit equally from vouchers is never close to rejecting. It also appears very unlikely that the differences between the treatment effects for applicants from low- and high-achieving schools in Barnard et al.'s Table 4 are statistically significant either.

3. WHAT WAS BROKEN

We agree with Barnard et al. that the experiment was broken in the sense that attrition and missing data were common. Previous analyses were also strained, if not broken, for their neglect of the cohort of students originally in kindergarten, who were in third grade at the end of the experiment and whose follow-up test scores were ignored. Including these students qualitatively alters the results for Black students. The experiment was also broken in the sense that years passed before correct baseline weights were computed.

We disagree, however, with the interpretation that the experiment was broken because compliance was less than 100%. This depends on the question that one is interested in answering. Because most interest in the experiment for policy purposes centers on the impact of offering vouchers on achievement, not on compelling students to use vouchers, we think the ITT estimates, which reflect inevitable partial usage of vouchers, are most relevant (see also Rouse 1998; Angrist et al. 2003). Moreover, New York had a higher voucher take-up rate than the experiments in Dayton and the District of Columbia, so one could argue that the New York experiment provides an upper bound estimate of the effect of offering vouchers. On the other hand, if the goal is to use the experiment to estimate the effect of attending private school on achievement, then such methods as instrumental variables or those used by Barnard et al. are necessary to estimate the parameter of interest. We consider this of secondary interest in this case, however.
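To spell out the distinction between the two estimands, the following is a minimal sketch (our illustration, not code from any of the studies) of the simple Wald instrumental variables estimator that converts the ITT effect into an estimate of the effect of attending private school, with y the outcome, z the randomized offer, and d actual private school attendance.

```python
# Minimal sketch of the Wald/IV estimator: the ITT effect on scores divided
# by the offer's effect on attendance gives the effect among compliers.
import numpy as np

def wald_cace(y: np.ndarray, z: np.ndarray, d: np.ndarray) -> float:
    itt = y[z == 1].mean() - y[z == 0].mean()        # offer's effect on scores
    take_up = d[z == 1].mean() - d[z == 0].mean()    # offer's effect on attendance
    return itt / take_up
```

Note that, as the rejoinder below emphasizes, this simple form presumes complete outcome data.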

ADDITIONAL REFERENCES

Angrist, J., Bettinger, E., Bloom, E., King, E., and Kremer, M. (2003), "Vouchers for Private Schooling in Colombia: Evidence From a Randomized Natural Experiment," American Economic Review, 92, 1535–1558.

Cochran, W., and Cox, G. (1957), Experimental Designs (2nd ed.), New York: Wiley.

Krueger, A., and Zhu, P. (2003), "Another Look at the New York City School Voucher Experiment," American Behavioral Scientist, forthcoming. Also available from Industrial Relations Section, Princeton University, at http://www.irs.princeton.edu.

Peterson, P., Myers, D., and Howell, W. (1998), "An Evaluation of the New York City Scholarships Program: The First Year," Mathematica Policy Research, Inc., available at http://www.ksg.harvard.edu/pepg/pdf/ny1rpt.pdf.

Rouse, C. (1998), "Private School Vouchers and Student Achievement: An Evaluation of the Milwaukee Parental Choice Program," The Quarterly Journal of Economics, 113, 553–602.

Comment

Richard A. BERK and Hongquan XU

1. INTRODUCTION

The article by Barnard, Frangakis, Hill, and Rubin (hereafter BFHR) is a virtuoso performance. By applying the Neyman–Rubin model of causal effects and building on a series of important works on the estimation and interpretation of causal effects (e.g., Angrist, Imbens, and Rubin 1996), BFHR manage to extract a reasonable set of findings for a "broken" randomized experiment. The article more generally underscores the important difference between estimating relationships in a consistent manner and interpreting those estimates in causal terms; obtaining consistent estimates is only part of the enterprise. Finally, the article emphasizes that assumptions made so that proper estimates may be obtained are not just technical moves of convenience. Rather, they are statements about how the empirical world is supposed to work. As such, they need to be examined concretely, with reference to the phenomena being studied, evaluated empirically whenever possible, and at a minimum subjected to sensitivity tests.

Richard A. Berk is Professor and Hongquan Xu is Assistant Professor, Department of Statistics, University of California, Los Angeles, CA 90095 (E-mail: berk@stat.ucla.edu).

As battle-tested academics, we can certainly quibble here and there about some aspects of the article. For example, might it not have made sense to construct priors from the educators and economists who were responsible for designing the intervention? Moreover, might it not have made sense to interpret the results using a yardstick of treatment effects that are large enough to matter in practical terms? Nonetheless, we doubt that overall we could have done as well as BFHR did. Moreover, most of our concerns depend on features of the data that cannot be known from a distance. We would have had to carefully examine the data ourselves. As a result, we focus on why a virtuoso performance was needed to begin with. Why was this trip necessary?


2. NONCOMPLIANCE

The implementation of randomized field experiments, as desirable as they are, invites any number of well-known problems (Berk 1990). One of these problems is that human subjects often do not do what they are told, even when they say they will. There is a recent scholarly literature on noncompliance (Metry and Meyer 1999; Johnson 2000) and a citation trail leading back to evaluations of the social programs of the War on Poverty (Haveman 1977). Some noncompliance can be expected except when there is unusual control over the study subjects. Our recent randomized trial of the inmate classification system used by the California Department of Corrections is one example (Berk, Ladd, Graziano, and Baek 2003).

An obvious question follows: Why does it seem that the individuals who designed this experiment were caught by surprise? As best we can tell, little was done to reduce noncompliance, and, more important for the analyses of BFHR, apparently almost no data were collected on factors that might be explicitly related to noncompliance (Hill, Rubin, and Thomas 2000).

Without knowing far more about the design stages of the experiment, we find it difficult to be very specific. But from the literatures on compliance and school choice, the following are probably illustrative indicators relevant for parents who would not be inclined to use a school voucher; for other forms of noncompliance, other measures would be needed:

1. Whether the adults in the household are employed outside the home during the day.

2. Whether there is any child care at the private schools before classes begin and after they end, for children of employed parents.

3. A household's ability to pay for private school beyond the value of the voucher.

4. Travel time to the nearest private schools.

5. Availability of transportation to those schools.

6. Religious preference of the household and religious affiliation of the private school.

From these examples, and others that individuals closer to the study would know about, one can imagine a variety of measures to reduce noncompliance. For instance, key problems for any working parent are child care before and after school and round-trip school transportation. These might be addressed directly with supplemental support. If no such resources could be obtained, then the study design might be altered to screen out, before random assignment, households for whom compliance was likely to be problematic.

3. MISSING DATA

No doubt BFHR know more about the subtleties of the missing data than we do. Nevertheless, the same issues arise for missing data that arise for noncompliance. Surely problems with missing data could have been anticipated. Factors related to nonresponse to particular items could have been explicitly measured, and prevention strategies perhaps could have been deployed.

4. DESIGN EFFICIENCY

A carefully designed experiment not only provides a solid basis for statistical inference, but also gives credible evidence for causation. The New York School Choice Scholarships Program (NYSCSP) was apparently the first good example of a randomized experiment to examine the potential benefits of vouchers for private schools. Offering a scholarship (treatment) randomly through a lottery can (1) provide a socially acceptable way to ration scarce resources, (2) protect against undesirable selection bias, and (3) produce balance between the observed and unobserved variables that may affect the response. In addition to the randomization and blocking, NYSCSP implemented a PMPD (propensity matched pairs design) to choose a control group from a large candidate pool and to balance many covariates explicitly. Table 2 of BFHR shows that various background variables were balanced properly between the treatment group and the control group.

But might it have been possible to do a bit better? BFHR did not mention that the sample size of the applicant's school was not balanced, although winners from high and low schools were balanced between the treatments and the controls. An ideal design would have had equal winners from "bad" (or low) and "good" (or high) public schools. In the NYSCSP, 85% of the winners were from bad schools and 15% were from good schools, to achieve the social goals of the School Choice Scholarships Foundation (Hill et al. 2000, p. 158). Bad schools are defined as "those for which the average test scores were below the median test scores for the city" (Hill et al. 2000, p. 158). A direct consequence is that the estimates from the good schools have much larger variation than those from the bad schools, which was evident in BFHR's Tables 4–6. The confidence intervals for good schools are wider.

Instead of classifying schools into two classes (bad and good), it might have been helpful to classify them into four classes: very bad, bad, good, and very good. To estimate the effect of applicant's school (assuming linear effects), an optimal design would select half of the winners from the very bad schools, the other half from the very good schools, and none from the two middle classes. Avoiding the middle schools enlarges the distance between bad and good schools and also reduces the source variation in the bad and good schools, and thus reduces the variation in the estimates. To be more consistent with the goals of the School Choice Scholarships Foundation, an alternative design might take 75% of winners from very bad schools, 10% from bad schools, 5% from good schools, and 10% from very good schools. This design would have a higher design efficiency than the original design, especially because the applicant's school was the most important design variable after family size (Hill et al. 2000, p. 160). Of course, people on the ground at the time may have considered these options and rejected them. Our point is that with better planning, it might have been possible to squeeze more efficiency out of the study design. Such an outcome would have made the job undertaken by BFHR a bit easier.
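The variance argument can be made concrete with a back-of-the-envelope calculation (our construction, using stylized school-quality scores 1–4 and unit error variance): for an assumed linear effect, the variance of the OLS slope estimate is sigma^2 / sum((x - xbar)^2), so concentrating winners at the two extremes shrinks the variance.

```python
# Back-of-the-envelope check (our construction): Var(beta_hat) for a linear
# school-quality effect under different allocations of 1,000 winners.
import numpy as np

def slope_var(levels, shares, n=1000, sigma2=1.0):
    # OLS slope variance: sigma^2 / sum((x - x.mean())**2)
    x = np.repeat(levels, (np.asarray(shares) * n).astype(int))
    return sigma2 / np.sum((x - x.mean()) ** 2)

print(slope_var([1, 4], [0.5, 0.5]))                       # half at each extreme
print(slope_var([1, 2, 3, 4], [0.25, 0.25, 0.25, 0.25]))   # spread over the middle too
```

The first allocation yields roughly half the variance of the second, which is the sense in which avoiding the middle classes raises design efficiency.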

5. CONCLUSIONS

Scientific problems often provide a fertile ground for important statistical developments; BFHR's article, and the literature on which it depends, are excellent illustrations. However, from the perspective of public policy, the goal of program evaluation is to obtain usefully precise estimates of program impacts while using the most robust means of analysis available. Simplicity is also important, both for data analysts who must do the work and for policy makers who must interpret the results. It follows that large investments at the front end of an evaluation (in research design, measurement, and study integrity) are essential. Virtuoso statistical analyses can certainly help when these investments are insufficient or when they fail, but they must not be seen as an inexpensive alternative to doing the study well to begin with.

ADDITIONAL REFERENCES

Angrist, J., Imbens, G. W., and Rubin, D. B. (1996), "Identification of Causal Effects Using Instrumental Variables" (with discussion), Journal of the American Statistical Association, 91, 441–472.

Berk, R. A. (1990), "What Your Mother Never Told You About Randomized Field Experiments," in Community-Based Care of People With AIDS: Developing a Research Agenda, AHCPR Conference Proceedings, Washington, DC: U.S. Department of Health and Human Services.

Berk, R. A., Ladd, H., Graziano, H., and Baek, J. (2003), "A Randomized Experiment Testing Inmate Classification Systems," Journal of Criminology and Public Policy, 2, 215–242.

Haveman, R. H. (ed.) (1977), A Decade of Federal Antipoverty Programs: Achievements, Failures, and Lessons, New York: Academic Press.

Hill, J. L., Rubin, D. B., and Thomas, N. (2000), "The Design of the New York School Choice Scholarships Program Evaluation," in Research Design: Donald Campbell's Legacy (Vol. II), ed. L. Bickman, Thousand Oaks, CA: Sage.

Johnson, S. B. (2000), "Compliance Behavior in Clinical Trials: Error or Opportunity?," in Promoting Adherence to Medical Treatment in Chronic Childhood Illness: Concepts, Methods, and Interventions, ed. D. Drotar, Mahwah, NJ: Lawrence Erlbaum.

Metry, J., and Meyer, U. A. (eds.) (1999), Drug Regimen Compliance: Issues in Clinical Trials and Patient Management, New York: Wiley.

Rejoinder

John BARNARD, Constantine E. FRANGAKIS, Jennifer L. HILL, and Donald B. RUBIN

We thank the editorial board and the thoughtful group of discussants that they arranged to comment on our article. The discussants have raised many salient issues and offered useful extensions of our work. We use this rejoinder primarily to address a few points of contention.

Muthén, Jo, and Brown

Muthén, Jo, and Brown (henceforth MJB) provide an enlightening discussion of an alternate and potentially complementary approach to treatment effect estimation. The growth models that they discuss are for data more extensive than what we analyzed, and so are beyond the scope of our article. Nevertheless, MJB propose an interesting idea for looking for interactions in longitudinal experimental data, and we look forward to future applications of the method to appropriate datasets.

Although models that incorporate both our compliance principal strata and latent growth trajectories are theoretically possible (assuming multiple years of outcome data), and could potentially yield an extremely rich description of the types of treatment effects manifested, the more time points that exist, the more complicated the compliance patterns can become (at least without further assumptions), and it may be difficult to find a dataset rich enough to support all of the interactions. Given a choice, we feel that it is probably more beneficial to handle real observed issues than to search for interactions with latent trajectories, but certainly it would always be nice to be able to do both. Of course, such a search for interactions in trajectories is more beneficial when there is enough prior knowledge about the scientific background to include additional structural assumptions that can support robust estimation.

MJB make an interesting point about the exploration of potential violations of exclusion for never takers. Such analyses have been done in past work by some of the authors (Hirano et al. 2000; Frangakis, Rubin, and Zhou 2002), but were not pursued for our study because the robustness of such methods is still an issue in examples as complex as this one.

Regarding the trade-offs between the joint latent ignorability and compound exclusion assumptions versus the joint assumptions of standard ignorability and the standard exclusion restriction: this topic is up for debate. We would like to point out, however, that these standard structural assumptions come with their own level of arbitrariness. These issues have been explored by Jo (2002) and Mealli and Rubin (2002, 2003).

We thank MJB for providing a thought-provoking discussion of our model and assumptions, as well as ample fodder for additional directions that can be explored in this arena. Clearly, this general category of work has stimulated much new and exciting modeling.

Krueger and Zhu

Krueger and Zhu (henceforth KZ) focus primarily on the underlying study and on analyses that either we did not do or that used data we did not have. Clearly, they are interested in the substantive issues, which is admirable, but we cannot be responsible for analyses that we neither did nor were consulted on, and we are not prepared to comment on what analyses we might have undertaken with data that we did not have.

KZ state that "there is no substitute for probing the definition and concepts that underlie that data." Although we are not entirely sure what this refers to in particular, if it means that good science is more important than fancy techniques, then we certainly agree.

Another comment by KZ is that "it is desirable to use the most recent comprehensive data for the widest sample possible." We agree that it is desirable to have access to the most comprehensive data available, but it is not necessary to use all of these data in any given analysis. Here we use only first-year data, because these were the only data available to us at the time that we were building the model. We used only single-child families, as explained in the article, because, regrettably, no compliance data were collected for individual children in the multichild families.

The exclusion of kindergarten children could be considered more of a judgment call regarding the quality of the pretest scores, not only for efficiency, but also for the reasonableness of our assumptions about noncompliance and missing data. Nothing in theory would prevent us from including these children and omitting these extra covariates.

These types of trade-offs (more data vs. more appropriate data) are common in statistical analyses, and the benefits are well known, for example, for avoiding model extrapolation or when some subsets of data do not mean the same thing as others. In fact, Krueger "ignored data" in his article on the minimum wage with David Card (Card and Krueger 1994) by limiting analyses to restaurants from a similar geographic region, in the hope that employees and employers across the two comparison states would be faced with similar economic conditions, rather than looking at all restaurants in these states. This is a form of implicit matching, which we believe represented a good choice. Surely there are times when focusing on subsets of data produces more reliable results.

Furthermore, we would like to point out that KZ's assertion that "because assignment to treatment status was random (within strata), a simple comparison of means between treatments and controls without conditioning on baseline scores provides an unbiased estimate of the average treatment effect" is simply false, because there are missing outcomes. Even if there is perfect compliance, a simple comparison of means across treatment groups when there are missing outcomes will not lead, in general, to an unbiased estimate of the treatment effect. Assumptions need to be made about the missing-data process.
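The point can be checked with a minimal simulation (our construction, not from the article): even with a randomized offer, perfect compliance, and a constant treatment effect, the complete-case difference in means is biased once showing up for the posttest depends on the score itself.

```python
# Outcome-dependent missingness biases the naive complete-case comparison,
# even under perfect randomization and perfect compliance.
import numpy as np

rng = np.random.default_rng(0)
n = 500_000
z = rng.integers(0, 2, n)                  # randomized voucher offer
y = rng.normal(50.0, 10.0, n) + 2.0 * z    # true ITT effect: 2 percentile points

# Lower-scoring students are less likely to appear for the posttest.
p_show = 1.0 / (1.0 + np.exp(-(y - 45.0) / 3.0))
observed = rng.random(n) < p_show

naive = y[observed & (z == 1)].mean() - y[observed & (z == 0)].mean()
print(naive)  # about 1.2 here, well below the true effect of 2
```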

When noncompliance exists in addition to the missing outcome data, Frangakis and Rubin (1999) have shown that estimation even of the ITT effect requires additional assumptions. Moreover, the ITT estimate, generally considered a conservative estimate of the treatment effect, can actually be anticonservative in this scenario. Finally, contrary to what KZ seem to imply in their final paragraph, instrumental variables methods alone are not sufficient to address the problems of both noncompliance and missing data.

KZ advocate "simplicity and transparency." We wholeheartedly agree, but disagree with the examples they use to defend their stance. Their first argument is that the PMPD did not yield strong efficiency gains and that we did not account for the PMPD in the calculation of standard errors. It may well be that precision gains were relatively modest; however, the analyses that they present are by no means conclusive in this regard. First, as with KZ's other ITT analyses, we have no clue as to how missing outcome data are handled. In addition, the PMPD was also used to balance compliance and missingness. It is unclear, however, how the gains in precision due to matching the missingness and noncompliance are displayed, if at all, in KZ's results. Finally, KZ did not account for the experimental design when estimating their results. Failure to account properly for the design typically results in estimates that overstate the variances of treatment effect estimates. In this design, it is the correlation between matched-pair units that leads the sampling variances to be smaller than their estimates.
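The design point is elementary but can be stated concretely (our numbers): with m matched pairs and within-pair correlation rho, the variance of the mean treated-minus-control difference is 2 * sigma^2 * (1 - rho) / m, so ignoring a positive rho overstates the sampling variance.

```python
# Variance of the mean matched-pair difference shrinks as the within-pair
# correlation rho grows; ignoring rho therefore overstates standard errors.
m, sigma2 = 500, 1.0
for rho in (0.0, 0.3, 0.6):
    print(rho, 2 * sigma2 * (1 - rho) / m)
```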

Even if precision gains were truly modest, they of course were unknown to us at the time we debated the use of the PMPD design. KZ's argument is like the complaint of a cross-country motorist who experiences no car problems during his long drive and then bemoans the money wasted on a spare tire purchased before the trip began. We prefer to avoid potential problems by using careful design.

Importantly, KZ's criticism that we ignored this matching in the analysis is incorrect. The matching was done based on the estimated propensity score, so it follows from Rubin (1978) that assignment is ignorable conditionally given the covariates that were used to estimate it. Thus the Bayesian analysis that we conduct, which conditions on the estimated propensity score, is an approximately valid Bayesian analysis with respect to the matched assignment, assuming that our modeled relationship between the outcomes and the estimated propensity score captures the major features of the relationship to the covariates.

KZ's second argument is that use of the PMPD "led to grave consequences" because sampling weights were miscalculated by MPR. It seems to us odd to maintain, as KZ do, that the PMPD "caused" this error; nor is it clear that use of another design would have avoided the error. In any case, we have rerun our analyses, this time generalizing to the population of which the children in our analyses are representative (by setting constant weights in step 3 of Sec. 8.1.1). For this population, all results are very similar to our original results. These results are posted as Appendix C at http://biosun01.biostat.jhsph.edu/~cfrangak/papers/sc.

A third argument revolves around the protracted debate between use of ITT versus treatment-targeted measures such as CACE. We addressed both ITT and CACE, which allows the reader to focus on one or the other or both, depending on which effect is expected to generalize to another environment: the ITT, or "sociological," effect of accepting or not accepting an offered voucher, or the CACE, or "structural" or "scientific," effect of using a voucher. One could argue that the effect of attending (CACE) is the more generalizable, whereas the sociological effect (ITT) is more easily influenced, for instance by "hype" about how well the program works. That is, the ITT estimate will not reflect the "effect of offering" if the next time the program is offered the compliance rates change (which seems likely). We prefer to examine both, to provide a more comprehensive picture.

In sum, we agree that simplicity and transparency are important; this is the major motivation for our detailed description of our assumptions in terms of readily understandable concepts, rather than commonly invoked and less intuitive conditions involving correlated error terms. However, this desire for simplicity must be balanced with the prudence of guarding against potential problems through careful design.

Although we agree in theory with much of what KZ espouse, some of their criticisms seem misguided. Moreover, their presentation of analyses with incomplete exposition regarding missing-data complications makes these analyses difficult for us to evaluate properly.


Berk and Xu

Berk and Xu (henceforth BX) raise several interesting questions about the way in which the evaluation was designed and implemented. Many of these are actually addressed in a preceding and lengthier article by us on this topic (Barnard et al. 2002). That article describes, for instance, MPR's efforts to reduce missing data, without which the missingness rates would surely have been much higher. It also describes special challenges faced in this study. But in any study with human subjects, it is difficult to completely eliminate missing data, particularly if one desires more interesting information than can be obtained from administrative data. One of the challenges for the collection of outcome test scores was the fact that children needed to be tested outside of the public school setting, which allowed the use of a standardized test not used by the public school system, thereby greatly reducing the possibility of bias resulting from teachers "teaching to the test." Consequently, parents had to bring their children to testing areas on weekends, a formidable hurdle for some, even when monetary incentives were used.

An important consideration with this study was that the evaluators took the initiative to investigate a program that was sponsored by an organization (SCSF) whose primary goal was to provide scholarships, not to do research. We feel that MPR did an admirable job convincing the SCSF to allow the study to be evaluated and raising money for this purpose, but this also meant that the evaluators could not control every aspect of the program. For example, the SCSF originally only wanted to make the program available to children from the lower-test-score schools, and the evaluators had to convince them to also allow a small percentage from those in higher-test-score schools, to increase the generalizability of the results. The designs that BX suggest, even though they have certain desirable statistical properties under appropriate assumptions, were not an option with this study.

Related to this issue of the alignment of the evaluators' and program administrators' goals is the issue of the potential for reducing noncompliance rates. Some effort was made, in that help was provided to scholarship winners in finding an appropriate private school. Arguably, however, 100% compliance should not be the goal in such a study, given that we would not expect 100% compliance if the program were expanded and offered, say, by the government on a larger scale. Thus it may not be wise to use extreme efforts to increase compliance if such efforts would not be offered in the large-scale public case. With the current study, the opportunity exists to examine the noncompliance process for this program in this setting: take-up rates, predictors of noncompliance, and so on. To this end, some sort of subexperiment that randomizes a reasonable set of compliance incentives might be useful in similar studies in the future, if sample sizes are sufficient.

We are not convinced that we should have solicited actual prior distributions from educators and economists. In our experience, this is a confusing, difficult, and often unrewarding task. Moreover, school choice is such a politically charged and contentious issue that it seems wiser to try to use relatively diffuse and objective prior distributions.

We agree that results should be interpreted in terms of a yardstick of practical effects. We are unaware of the most appropriate such yardstick to use in this context. One side could claim that even tiny average effects matter, not only because of the cumulative effects on parts of society, but also because of possible nonadditivity of such effects, which could imply huge effects for some rare individuals and minute effects for all others. Others could argue that seemingly quite large effects have no long-run practical impact on quality of life. The results that we present provide a basic empirical starting point for these debates between people whose expertise is in this area.

We thank BX for highlighting our emphasis on the role of the assumptions in our analyses. Our hope is that we have presented the assumptions in such a way as to enable readers to make their own judgment about whether or not they were likely to be satisfied, rather than only stating them technically, which would implicitly make those important decisions for the interested but nontechnical readers.

BX point out a dearth of certain types of covariates. Part of the problem here is that we presented only a small portion of the data collected, due to the fact that our model in its current form did not handle many covariates. However, there were indeed discussions between some members of our team and MPR staff, during the development of the questionnaire, about what types of variables could be included that would be predictive of missing data or compliance, as well as outcomes. Some of the variables proposed were discarded due to the surveyors' experience with them as subject to problems such as high measurement error. Also, not every variable discussed was included, due to a desire for a survey that would not be overly burdensome for the study families.

We raise a final, small, semantic question. We wonder about the propriety of the label "Neyman–Rubin model" to describe a framework for causal inference that views observational studies and randomized experiments on the same continuum and, moreover, encourages Bayesian analyses of both. Neyman (1923) certainly seems to have been the first to use the formal notation of potential outcomes for randomization-based inference in randomized experiments, which was a truly major advance, but neither he nor anyone else, to the best of our knowledge, used this formal notation to define causal effects in observational studies until Rubin (1974, 1975), nor did he nor anyone else ever advocate Bayesian inference for the unobserved potential outcomes until Rubin (1975, 1978), which is the perspective that we have implemented here.

Overall, we agree with BX about the paramount importance of study design, and we tried to design this study carefully, certainly more so than most social science analyses of secondary datasets like the PSID or CPS. We thank BX for their thoughtful commentary; they raise some important points.

We thank all of the discussants for stimulating comments, which we believe have added emphasis to the importance of both the substantive topic of school vouchers and the development of appropriate methods to evaluate this and other such broken randomized experiments.

ADDITIONAL REFERENCES

Card, D., and Krueger, A. (1994), "Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania," American Economic Review, 84, 772–793.


Jo, B. (2002), "Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications" (with discussion), Journal of Educational and Behavioral Statistics, 27, 385–420.

Mealli, F., and Rubin, D. B. (2002), Discussion of "Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications," by B. Jo, Journal of Educational and Behavioral Statistics, 27, 411–415.

Mealli, F., and Rubin, D. B. (2003), "Assumptions Allowing the Estimation of Direct Causal Effects: Discussion of 'Healthy, Wealthy, and Wise? Tests for Direct Causal Paths Between Health and Socioeconomic Status,' by Adams et al.," Journal of Econometrics, 112, 79–87.

Rubin, D. B. (1975), "Bayesian Inference for Causality: The Importance of Randomization," in Proceedings of the Social Statistics Section, American Statistical Association, pp. 233–239.


the assignment; always takers, who always take the treatment no matter the assignment; and defiers, who do the opposite of the assignment, no matter its value. These strata are not fully observed, in contrast to the observed strata of actual assignment and attendance $(Z_i, D_i)$. For example, children who are observed to attend private school when winning the lottery are a mixture of compliers ($C_i = c$) and always takers ($C_i = a$). Such explicit stratification on $C_i$ dates back at least to the work of Imbens and Rubin (1997) on randomized trials with noncompliance; it was generalized and taxonomized by Frangakis and Rubin (2002) to posttreatment variables in partially controlled studies, and is termed a "principal stratification" based on the posttreatment variables.

The principal strata $C_i$ have two important properties. First, they are not affected by the assigned treatment. Second, comparisons of potential outcomes under different assignments within principal strata, called principal effects, are well-defined causal effects (Frangakis and Rubin 2002). These properties make principal stratification a powerful framework for evaluation, because it allows us to explicitly define estimands that better represent the effect of attendance in the presence of noncompliance, and to explore richer and explicit sets of assumptions that allow estimation of these effects under conditions more plausible than the standard ones. Sections 6 and 7 discuss such a set of more flexible assumptions and estimands.

6. STRUCTURAL ASSUMPTIONS

First, we state explicitly our structural assumptions about the data with regard to the causal process, the missing-data mechanism, and the noncompliance structure. These assumptions are expressed without reference to a particular parametric distribution, and they are the ones that make the estimands of interest identifiable, as also discussed in Section 6.4.

6.1 Stable Unit Treatment Value Assumption

A standard assumption made in causal analyses is the stable unit treatment value assumption (SUTVA), formalized with potential outcomes by Rubin (1978a, 1980, 1990). SUTVA combines the no-interference assumption (Cox 1958), that one unit's treatment assignment does not affect another unit's outcomes, with the assumption that there are "no versions of treatments." For the no-interference assumption to hold, whether or not one family won a scholarship should not affect another family's outcomes, such as their choice to attend private school or their children's test scores. We expect our results to be robust to the types and degrees of deviations from no interference that might be anticipated in this study. To satisfy the "no versions of treatments" assumption, we need to limit the definition of private and public schools to those participating in the experiment. Generalizability of the results to other schools must be judged separately.

6.2 Randomization

We assume that scholarships have been randomly assigned. This implies that

$$p(Z \mid Y(1), Y(0), X, W, C, R_y(0), R_y(1), R_x, \theta) = p(Z \mid W^*, \theta) = p(Z \mid W^*),$$

where $W^*$ represents the design variables in W, and $\theta$ is generic notation for the parameters governing the distribution of all the variables. There is no dependence on $\theta$ because there are no unknown parameters governing the treatment-assignment mechanism: MPR assigned the scholarships by lottery, and the randomization probabilities within strata defined by applicant's school (low/high) and application period are known.

6.3 Noncompliance Process Assumptions: Monotonicity and Compound Exclusion

We assume monotonicity: that there are no "defiers," that is, for all i, $D_i(1) \geq D_i(0)$ (Imbens and Angrist 1994). In the SCSF program, defiers would be families who would not use a scholarship if they won one, but would pay to go to private school if they did not win a scholarship. It seems implausible that such a group of people exists; therefore, the monotonicity assumption appears to be reasonable for our study.

By definition, the never takers and always takers will participate in the same treatment (control or treatment) regardless of which treatment they are randomly assigned. For this reason, and to facilitate estimation, we assume compound exclusion: the outcomes and missingness of outcomes for never takers and always takers are not affected by treatment assignment. This compound exclusion restriction (Frangakis and Rubin 1999) generalizes the standard exclusion restriction (Angrist et al. 1996; Imbens and Rubin 1997) and can be expressed formally for the distributions as

$$p(Y(1), R_y(1) \mid X^{obs}, R_x, W, C = n) = p(Y(0), R_y(0) \mid X^{obs}, R_x, W, C = n) \quad \text{for never takers}$$

and

$$p(Y(1), R_y(1) \mid X^{obs}, R_x, W, C = a) = p(Y(0), R_y(0) \mid X^{obs}, R_x, W, C = a) \quad \text{for always takers.}$$

Compound exclusion seems more plausible for never takers than for always takers in this study. Never takers stayed in the public school system whether they won a scholarship or not. Although disappointment about winning a scholarship but still not being able to take advantage of it can exist, it is unlikely to cause notable differences in subsequent test scores or response behaviors.

Always takers, on the other hand, might have been in one private school had they won a scholarship, or in another if they had not, particularly because those who won scholarships had access to resources to help them find an appropriate private school and had more money (up to $1,400) to use toward tuition. In addition, even if the child had attended the same private school under either treatment assignment, the extra $1,400 in family resources for those who won the scholarship could have had an effect on student outcomes. However, because in our application the estimated percentage of always takers is so small (approximately 9%), an estimate that is robust, due to the randomization, to relaxing compound exclusion, there is reason to believe that this assumption will not have a substantial impact on the results.

Under the compound exclusion restriction, the ITT comparison of all outcomes $Y_i(0)$ versus $Y_i(1)$ includes the null comparison among the subgroups of never takers and always takers. Moreover, by monotonicity, the compliers ($C_i = c$) are the only group of children who would attend private school if and only if offered the scholarship. For this reason, we take the CACE (Imbens and Rubin 1997), defined as the comparison of outcomes $Y_i(0)$ versus $Y_i(1)$ among the principal stratum of compliers, to represent the effect of attending public versus private school.

6.4 Latent Ignorability

We assume that potential outcomes are independent of missingness, given observed covariates, conditional on the compliance strata; that is,

$$p(R_y(0), R_y(1) \mid R_x, Y(1), Y(0), X^{obs}, W, C, \theta) = p(R_y(0), R_y(1) \mid R_x, X^{obs}, W, C, \theta).$$

This assumption represents a form of latent ignorability (LI) (Frangakis and Rubin 1999), in that it conditions on variables that are (at least partially) unobserved, or latent (here, the compliance principal strata C). We make it here, first, because it is more plausible than the assumption of standard ignorability (SI) (Rubin 1978a; Little and Rubin 1987) and, second, because making it leads to different likelihood inferences.

LI is more plausible than SI to the extent that it provides a closer approximation to the missing-data mechanism. The intuition behind this assumption in our study is that for a subgroup of people with the same covariate missing-data patterns $R_x$, similar values for covariates observed in that pattern $X^{obs}$, and the same compliance stratum C, a flip of a coin could determine which of these individuals shows up for a posttest. This is a more reasonable assumption than SI, because it seems quite likely that, for example, compliers would exhibit different attendance behavior for posttests than, say, never takers (even conditional on other background variables). Explorations of raw data from this study across individuals with known compliance status provide empirical support that C is a strong factor in outcome missingness, even when other covariates are included in the model. This fact is also supported by the literature on noncompliance (see, e.g., The Coronary Drug Project Research Group 1980).

Regarding improved estimation when LI (and the preceding structural assumptions) hold but the likelihood is constructed assuming SI: the underlying causal effects are identifiable (alternatively, the posterior distribution with increasing sample size converges to the truth) only if the additional assumption is made that, within subclasses of subjects with similar observed variables, the partially missing compliance principal stratum C is not associated with potential outcomes. However, as noted earlier, this assumption is not plausible.

Theoretically, only the structural assumptions described earlier are needed to identify the underlying ITT and CACE causal effects (Frangakis and Rubin 1999). Estimation based solely on those identifiability relations in principle requires a very large sample size and explicit conditioning (i.e., stratification) on the subclasses defined by the observed part of the covariates, $X^{obs}$, and the pattern of missingness of the covariates, $R_x$, as well as implicit conditioning (i.e., deconvolution of mixtures) on the sometimes-missing compliance principal stratum C. This works in large samples because covariate missingness and compliance principal stratum are also covariates (i.e., defined pretreatment), so samples within subclasses defined by $X^{obs}$, $R_x$, and C themselves represent randomized experiments. With our experiment's sample size, however, performing completely separate analyses on all of these strata is not necessarily desirable or feasible. Therefore, we consider more parsimonious modeling approaches, which have the role of assisting, not creating, inference, in the sense that results should be robust to different parametric specifications (Frangakis and Rubin 2001, rejoinder).

7. PARAMETRIC PATTERN MIXTURE MODEL

Generally speaking, constrained estimation of separate analyses within missing-data patterns is the motivation behind pattern mixture modeling. Various authors have taken pattern mixture model approaches to missing data, including Little (1993, 1996), Rubin (1978b), and Glynn, Laird, and Rubin (1993). Typically, pattern mixture models partition the data with respect to the missingness of the variables. Here we partition the data with respect to the covariate missing-data patterns $R_x$, as well as compliance principal strata C, design variables W, and the main covariates $X^{obs}$. This represents a partial pattern mixture approach. One argument in favor of this approach is that it focuses the model on the quantities of interest, in such a way that parametric specifications for the marginal distributions of $R_x$, W, and $X^{obs}$ can be ignored. To capitalize on the structural assumptions, consider the factorization of the joint distribution for the potential outcomes and compliance strata, conditional on the covariates and their missing-data patterns:

$$
\begin{aligned}
p\bigl(Y_i(0), Y_i(1), R_{y,i}(0), R_{y,i}(1), C_i \mid W_i, X_i^{obs}, R_{x,i}, \theta\bigr)
&= p\bigl(C_i \mid W_i, X_i^{obs}, R_{x,i}, \theta^{(C)}\bigr) \\
&\quad \times p\bigl(R_{y,i}(0), R_{y,i}(1) \mid W_i, X_i^{obs}, R_{x,i}, C_i, \theta^{(R)}\bigr) \\
&\quad \times p\bigl(Y_i(0), Y_i(1) \mid W_i, X_i^{obs}, R_{x,i}, C_i, \theta^{(Y)}\bigr),
\end{aligned}
$$

where the product in the last line follows by latent ignorability, and $\theta = (\theta^{(C)}, \theta^{(R)}, \theta^{(Y)})$. Note that the response pattern of covariates for each individual is itself a covariate. The parametric specifications for each of these components are described next.

7.1 Compliance Principal Stratum Submodel

The compliance status model contains two conditional probit models, defined using indicator variables $C_{i,c}$ and $C_{i,n}$ for whether individual i is a complier or a never taker:

$$C_{i,n} = 1 \quad \text{if } C_{i,n}^* \equiv g_1(W_i, X_i^{obs}, R_{x,i})' \beta^{(C,1)} + V_i \leq 0$$

and

$$C_{i,c} = 1 \quad \text{if } C_{i,n}^* > 0 \text{ and } C_{i,c}^* \equiv g_0(W_i, X_i^{obs}, R_{x,i})' \beta^{(C,2)} + U_i \leq 0,$$

where $V_i \sim N(0, 1)$ and $U_i \sim N(0, 1)$, independently.

The link functions $g_0$ and $g_1$ attempt to strike a balance between, on the one hand, including all of the design variables as well as the variables regarded as most important either in predicting compliance or in having interactions with the treatment effect and, on the other hand, maintaining parsimony. The results discussed in Section 8 use a compliance component model whose link function $g_1$ is linear in, and fits distinct parameters for: an intercept; applicant's school (low/high); indicators for application period; propensity scores for subjects applying in the PMPD period, and propensity scores for the other periods; indicators for grade of the student; ethnicity (1 if the child or guardian identifies herself as African-American, 0 otherwise); an indicator for whether or not the pretest scores of reading and math were available; and the pretest scores (reading and math) for the subjects with available scores. A propensity score for the students not in the PMPD period is not necessary from the design standpoint and is in fact constant. However, to increase efficiency and to reduce bias due to missing outcomes, here we include an "estimated propensity score value" for these periods, calculated as the function derived for the propensity score for students in the PMPD period, evaluated at the covariate values for the students in the other periods as well. Also, the foregoing link function $g_0$ is the same as $g_1$, except that it excludes the indicators for application period as well as the propensity scores for applicants who did not apply in the PMPD period (i.e., those for whom the propensity score was not a design variable). This link function, a more parsimonious version of one we used in earlier models, was more appropriate for fitting the relatively small proportion of always takers in this study. Finally, because the pretests were either jointly observed or jointly missing, one indicator for missingness of pretest scores is sufficient.
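To fix ideas, here is a small sketch (hypothetical names, not the authors' code) of how the two nested probits partition units into principal strata, given design matrices G1 and G0 built from the link functions $g_1$ and $g_0$.

```python
# The two nested probits of the compliance submodel: a unit is a never taker
# if C*_n <= 0; otherwise a complier if C*_c <= 0; otherwise an always taker.
import numpy as np

def draw_strata(G1, G0, beta_c1, beta_c2, rng):
    v = rng.standard_normal(G1.shape[0])   # V_i ~ N(0, 1)
    u = rng.standard_normal(G0.shape[0])   # U_i ~ N(0, 1)
    never = (G1 @ beta_c1 + v) <= 0
    complier = ~never & ((G0 @ beta_c2 + u) <= 0)
    return np.where(never, "n", np.where(complier, "c", "a"))
```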

The prior distributions for the compliance submodel are $\beta^{(C,1)} \sim N(\beta_0^{(C,1)}, \{\sigma^{(C,1)}\}^2 I)$ and $\beta^{(C,2)} \sim N(0, \{\sigma^{(C,2)}\}^2 I)$, independently, where $\{\sigma^{(C,1)}\}^2$ and $\{\sigma^{(C,2)}\}^2$ are hyperparameters set at 10, and $\beta_0^{(C,1)}$ is a vector of 0s with the exception of the first element, which is set equal to

$$-\Phi^{-1}(1/3) \cdot \Bigl\{\sigma^{(C,1)} \Bigl[\tfrac{1}{n} \sum_i g_{1,i}' g_{1,i} + 1\Bigr]\Bigr\}^{1/2},$$

where $g_{1,i} = g_1(W_i, X_i^{obs}, R_{x,i})$ and n is our sample size. These priors reflect our a priori ignorance about the probability that any individual belongs to each compliance status, by approximately setting each of their prior probabilities at 1/3.

7.2 Outcome Submodel

We first describe the marginal distribution for the math outcome, $Y^{math}$. To address the pile-up of many scores of 0, we posit the censored model

$$Y_i^{math}(z) =
\begin{cases}
0 & \text{if } Y_i^{math,*}(z) \leq 0, \\
100 & \text{if } Y_i^{math,*}(z) \geq 100, \\
Y_i^{math,*}(z) & \text{otherwise},
\end{cases}$$

where

$$Y_i^{math,*}(z) \mid W_i, X_i^{obs}, R_{x,i}, C_i, \theta^{math} \sim N\Bigl(g_2(W_i, X_i^{obs}, R_{x,i}, C_i, z)' \beta^{math},\; \exp\bigl[g_3(X_i^{obs}, R_{x,i}, C_i, z)' \zeta^{math}\bigr]\Bigr)$$

for z = 0, 1, and $\theta^{math} = (\beta^{math}, \zeta^{math})$. Here $Y_i^{math,*}(0)$ and $Y_i^{math,*}(1)$ are assumed conditionally independent, an assumption that has no effect on inference for our superpopulation parameters of interest (Rubin 1978a).
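As a small illustration (hypothetical names, not the authors' code), a draw from this censored submodel clips a latent normal score to the 0–100 percentile range, reproducing the pile-up at the boundaries:

```python
# One draw from the censored outcome submodel: a latent normal score with
# mean g2'beta and variance exp(g3'zeta), censored below at 0 and above at 100.
import numpy as np

def draw_score(mean_link, logvar_link, rng):
    latent = rng.normal(mean_link, np.exp(0.5 * logvar_link))
    return np.clip(latent, 0.0, 100.0)
```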

The results reported in Section 8 use an outcome component model whose outcome mean link function $g_2$ is linear in distinct parameters for the following:

1. For the students of the PMPD design: an intercept, applicant's school (low/high), ethnicity, indicators for grade, an indicator for whether or not the pretest scores were available, pretest score values for the subjects with available scores, and the propensity score.

2. For the students of the other periods: the variables in item 1, along with indicators for application period.

3. An indicator for whether or not a person is an always taker.

4. An indicator for whether or not a person is a complier.

5. For compliers assigned treatment: an intercept, applicant's school (low/high), ethnicity, and indicators for the first three grades (with the variable for the fourth-grade treatment effect as a function of the already-included variables).

For the variance of the outcome component, the link function $g_3$ includes indicators that saturate the cross-classification of whether or not a person applied in the PMPD period and whether or not the pretest scores were available. This dependence is needed because each pattern conditions on a different set of covariates; that is, $X^{obs}$ varies from pattern to pattern.

The prior distributions for the outcome submodel are

$$\beta^{math} \mid \zeta^{math} \sim N\bigl(0,\; F(\zeta^{math})\, \xi\, I\bigr), \qquad \text{where } F(\zeta^{math}) = \frac{1}{K} \sum_k \exp\bigl(\zeta_k^{math}\bigr)$$

and $\zeta^{math} = (\zeta_1^{math}, \ldots, \zeta_K^{math})$, with one component for each of the K (in our case K = 4) subgroups defined by cross-classifying the PMPD/non-PMPD classification and the missing-data indicator for pretest scores; $\xi$ is a hyperparameter set at 10; and $\exp(\zeta_k^{math}) \stackrel{iid}{\sim} \text{inv-}\chi^2(\nu, \sigma^2)$, where inv-$\chi^2(\nu, \sigma^2)$ refers to the distribution of the inverse of a chi-squared random variable with degrees of freedom $\nu$ (set at 3) and scale parameter $\sigma^2$ (set at 400, based on preliminary estimates of variances). The sets of values for these and the hyperparameters of the other model components were chosen to satisfy two criteria: giving posterior standard deviations for the estimands in the neighborhood of the respective standard errors of approximate maximum likelihood fits of similar preliminary likelihood models, and giving satisfactory model checks (Sec. 9.2) while producing quick enough mixing of the Bayesian analysis algorithm.

We specify the marginal distribution for the reading outcome, $Y^{read}$, in the same way as for the math outcome, with separate mean and variance regression parameters $\beta^{read}$ and $\zeta^{read}$. Finally, we allow that, conditionally on $(W_i, X_i^{obs}, R_{x,i}, C_i)$, the math and reading outcomes at a given assignment, $Y_i^{math,*}(z)$ and $Y_i^{read,*}(z)$, may be dependent, with an unknown correlation $\rho$. We set the prior distribution for $\rho$ to be uniform on $(-1, 1)$, independently of the remaining parameters in their prior distributions.

7.3 Outcome Response Submodel

As with the pretests, the outcomes on mathematics and reading were either jointly observed or jointly missing; thus one indicator, $R_{yi}(z)$, for missingness of outcomes is sufficient for each assignment $z = 0, 1$. For the submodel on this indicator, we also use a probit specification:

$$
R_{yi}(z) = 1 \quad \text{if } R_{yi}^{*}(z) \equiv g_2(W_i, X_i^{\mathrm{obs}}, R_{xi}, C_i, z)'\beta^{R} + \epsilon_i(z) \ge 0,
$$

where $R_{yi}(0)$ and $R_{yi}(1)$ are assumed conditionally independent (using the same justification as for the potential outcomes) and where $\epsilon_i(z) \sim N(0, 1)$. The link function of the probit model on the outcome response, $g_2$, is the same as the link function for the mean of the outcome component. The prior distribution for the outcome response submodel is $\beta^{R} \sim N(0, (\sigma^{R})^2 I)$, where $(\sigma^{R})^2$ is a hyperparameter set at 10.
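Under this probit submodel, a response probability is simply the normal cdf of the linear predictor; a minimal sketch, with a hypothetical design vector x:

```python
import numpy as np
from scipy.stats import norm

def response_probability(x, beta_R):
    """P(R_y(z) = 1 | .) = Phi(g2(...)' beta^R) under the probit submodel."""
    return norm.cdf(np.asarray(x) @ beta_R)
```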

8. RESULTS

All results herein were obtained from the same Bayesian analysis. We report results for the ITT and CACE estimands, for proportions of compliance principal strata, and for outcome response rates. The reported estimands are not parameters of the model, but rather are functions of parameters and data. The results are reported by applicant's school (low/high) and grade. Both of these variables represent characteristics of children that potentially could be targeted differentially by government policies. Moreover, each was thought to have possible interaction effects with treatment assignment. Except when otherwise stated, the plain numbers are posterior means, and the numbers in parentheses are 2.5 and 97.5 percentiles of the posterior distribution.

8.1 Test Score Results

Here we address two questions:

1. What is the impact of being offered a scholarship on student outcomes, namely the ITT estimand?
2. What is the impact of attending a private school on student outcomes, namely the CACE estimand?

The math and reading posttest score outcomes represent the national percentile rankings within grade. They have been adjusted to correct for the fact that some children were kept behind while others skipped a grade, because students transferring to private schools are hypothesized to be more likely to have been kept behind by those schools. The individual-level causal estimates have also been weighted, so that the subgroup causal estimates correspond to the effects for all eligible children belonging to that subgroup who attended a screening session.

8.1.1 Effect of Offering the Scholarship on Mathematics and Reading. We examine the impact of being offered a scholarship (the ITT effect) on posttest scores. The corresponding estimand for individual $i$ is defined as $E(Y_i(1) - Y_i(0) \mid W_i^p; \theta)$, where $W_i^p$ denotes the policy variables, grade level and applicant's school (low/high), for that individual. The simulated posterior distribution of the ITT effect is summarized in Table 4. To draw from this distribution, we take the following steps:

Table 4. ITT Effect of Winning the Lottery on Math and Reading Test Scores

                      Applicant's school: Low          Applicant's school: High
Grade at application  Reading          Math            Reading          Math

1                     2.3 (-1.3, 5.8)  5.2 (2.0, 8.3)  1.4 (-4.8, 7.2)  5.1 (.1, 10.3)
2                      .5 (-2.6, 3.5)  1.3 (-1.7, 4.3)  -.6 (-6.2, 4.9) 1.0 (-4.3, 6.1)
3                      .7 (-2.7, 4.0)  3.3 (-.5, 7.0)   -.5 (-6.0, 5.0) 2.5 (-3.2, 8.0)
4                     3.0 (-1.1, 7.4)  3.1 (-1.2, 7.2)  1.8 (-4.1, 7.6) 2.3 (-3.3, 7.8)
Overall               1.5 (-.6, 3.6)   3.2 (1.0, 5.4)    .4 (-4.6, 5.2) 2.8 (-1.8, 7.2)

NOTE: Year 1 postrandomization. Plain numbers are means, and numbers in parentheses are central 95% intervals, of the posterior distribution of the effects on percentile rank.

1. Draw $\theta$ and $C_i$ from the posterior distribution (see App. A).
2. Calculate the expected causal effect $E(Y_i(1) - Y_i(0) \mid W_i^p, X_i^{\mathrm{obs}}, C_i, R_{xi}; \theta)$ for each subject, based on the model in Section 7.2.
3. Average the values of step 2 over the subjects in the subclasses of $W^p$, with weights reflecting the known sampling weights of the study population for the target population.

These results indicate posterior distributions with mass primarily (> 97.5%) to the right of 0 for the treatment effect on math scores overall for children from low applicant schools, and also for the subgroup of first-graders. Each effect indicates an average gain of greater than three percentile points for children who won a scholarship. All of the remaining intervals cover 0. As a more general pattern, estimates of effects are larger for mathematics than for reading, and larger for children from low-applicant schools than for children from high-applicant schools.

These results for the ITT estimand using the method of this article can also be contrasted with the results using a simpler method, reported in Table 5. The method used to obtain these results is the same as that used in the initial MPR study (Peterson, Myers, Howell, and Mayer 1999), but limited to the subset of single-child families and separated out by applicant's school (low/high). This method is based on a linear regression of the posttest scores on the indicator of treatment assignment, ignoring compliance data. In addition, the analysis includes the design variables and the pretest scores, and is limited to individuals for whom the pretest and posttest scores are known. Separate analyses are run for math and reading posttest scores, and the missingness of such scores is adjusted for by inverse probability weighting. Finally, weights are used to make the results for the study participants generalizable to the population of all eligible single-child families who were screened. (For more details of this method, see the appendix of Peterson et al. 1999.)
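For contrast with the model-based approach, a rough sketch of this simpler estimator (complete cases, inverse probability weighting) might look as follows; all column names and the response-probability input are hypothetical, and the weighting and bootstrap details of the original MPR analysis are omitted.

```python
import statsmodels.api as sm

def mpr_style_itt(df, p_respond):
    """ITT estimate: posttest regressed on assignment among complete cases,
    weighted by population weights times inverse response probabilities."""
    X = sm.add_constant(df[["treat", "pretest", "design_dummy"]])
    w = df["sample_weight"] / p_respond
    fit = sm.WLS(df["posttest"], X, weights=w).fit()
    return fit.params["treat"], fit.bse["treat"]
```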

Table 5. ITT Effect of Winning the Lottery on Test Scores, Estimated Using the Original MPR Method

                      Applicant's school: Low           Applicant's school: High
Grade at application  Reading           Math            Reading            Math

1                     -1.0 (-7.1, 5.2)  2.1 (-2.6, 6.7)   4.8 (-10.0, 19.6)  2.6 (-15.5, 20.7)
2                      -.8 (-4.9, 3.2)  2.0 (-4.0, 8.0)  -3.4 (-16.5, 9.7)   2.7 (-10.3, 15.7)
3                      3.2 (-1.7, 8.1)  5.0 (-.8, 10.7)  -8.0 (-25.4, 9.3)   4.0 (-17.7, 25.6)
4                      2.7 (-3.5, 8.8)   .3 (-7.3, 7.9)  27.9 (8.0, 47.8)   22.7 (-1.5, 46.8)
Overall                 .6 (-2.0, 3.2)  2.0 (-.8, 4.8)    1.1 (-7.0, 9.1)     .3 (-9.6, 10.1)

NOTE: Plain numbers are point estimates, and parentheses are 95% confidence intervals, for the mean effects on percentile rank.


Table 6. Effect of Private School Attendance on Test Scores

                      Applicant's school: Low           Applicant's school: High
Grade at application  Reading           Math            Reading           Math

1                     3.4 (-2.0, 8.7)   7.7 (3.0, 12.4)  1.9 (-7.3, 10.3)  7.4 (.2, 14.6)
2                      .7 (-3.7, 5.0)   1.9 (-2.4, 6.2)   -.9 (-9.4, 7.3)  1.5 (-6.2, 9.3)
3                     1.0 (-4.1, 6.1)   5.0 (-.8, 10.7)   -.8 (-9.5, 7.7)  4.0 (-4.9, 12.5)
4                     4.2 (-1.5, 10.1)  4.3 (-1.6, 10.1)  2.7 (-6.3, 11.3) 3.5 (-4.7, 11.9)
Overall               2.2 (-.9, 5.3)    4.7 (1.4, 7.9)     .6 (-7.1, 7.7)  4.2 (-2.6, 10.9)

NOTE: Plain numbers are means, and parentheses are central 95% intervals, of the posterior distribution of the effects on percentile rank.

The results of the new method (Table 4) are generally more stable than those of the original method (Table 5), which in some cases are not even credible. In the most extreme cases, the original method estimates a 22.7-point gain [95% confidence interval (-1.5, 46.8)] in mathematics and a 27.9-point gain [95% confidence interval (8.0, 47.8)] in reading for fourth-grade children from high-applicant schools. More generally, the new method's results display a more plausible pattern in comparing effects in high-applicant versus low-applicant schools and mathematics versus reading.

8.1.2 Effect of Private School Attendance on Mathematics and Reading. We also examine the effect of offering the scholarship when focusing only on the compliers (the CACE). The corresponding estimand for individual $i$ is defined as

$$E(Y_i(1) - Y_i(0) \mid W_i^p, C_i = c; \theta).$$

This analysis defines the treatment as private school attendance (Sec. 6.3). The simulated posterior distribution of the CACE is summarized in Table 6. A draw from this distribution is obtained using steps 1-3 described in Section 8.1.1 for the ITT estimand, with the exception that at step 3 the averaging is restricted to the subjects whose current draw of $C_i$ is "complier."

The effects of private school attendance follow a pattern similar to that of the ITT effects, but the posterior means are slightly bigger in absolute value than ITT. The intervals have also grown, reflecting that these effects are for only subgroups of all children, the "compliers" in each cell. As a result, the associated uncertainty for some of these effects (e.g., for fourth-graders applying from high-applicant schools) is large.

8.2 Compliance Principal Strata and Missing Outcomes

Table 7 summarizes the posterior distribution of the estimands of the probability of being in stratum $C$ as a function of an applicant's school and grade, $p(C_i = t \mid W_i^p; \theta)$. To draw from this distribution, we use step 1 described in Section 8.1.1, and then calculate $p(C_i = t \mid W_i^p, X_i^{\mathrm{obs}}, R_{xi}; \theta)$ for each subject based on the model of Section 7.1 and average these values as in step 3 of Section 8.1.1. The clearest pattern revealed by Table 7 is that, for three out of four grades, children who applied from low-applicant schools are more likely to be compliers or always-takers than are children who applied from high-applicant schools.

As stated earlier, theoretically, under our structural assumptions, standard ITT analyses or standard IV analyses that use ad hoc approaches to missing data are generally not appropriate for estimating the ITT or CACE estimands when the compliance principal strata have differential response (i.e., outcome missing-data) behaviors. To evaluate this here, we simulated the posterior distributions of

$$p(R_i(z) \mid C_i = t, W_i^p; \theta), \qquad z = 0, 1. \tag{1}$$

To draw from the distributions of the estimands in (1), we used step 1 from Section 8.1.1, and then, for $z = 0, 1$, for each subject calculated $p(R_i(z) \mid W_i^p, X_i^{\mathrm{obs}}, R_{xi}, C_i; \theta)$. We then averaged these values over subjects within subclasses defined by the different combinations of $W_i^p$ and $C_i$.

For each draw (across individuals) from the distribution in (1), we calculate the odds ratios (a) between compliers attending public schools and never-takers, (b) between compliers attending private schools and always-takers, and (c) between compliers attending private schools and compliers attending public schools. The results (omitted) showed that response was increasing in the following order: never-takers, compliers attending public schools, compliers attending private schools, and always-takers. These results confirm that the compliance principal stratum is an important factor in response, both alone and in interaction with assignment.

9. MODEL BUILDING AND CHECKING

This model was built through a process of fitting and checking a succession of models. Of note in this model are the following features: a censored normal model that accommodates the

Table 7. Proportions of Compliance Principal Strata Across Grade and Applicant's School

                      Applicant's school: Low                   Applicant's school: High
Grade at application  Never-taker  Complier    Always-taker     Never-taker  Complier    Always-taker

1                     24.5 (2.9)   67.1 (3.8)   8.4 (2.4)       25.0 (5.0)   69.3 (6.1)  5.7 (3.3)
2                     20.5 (2.7)   69.4 (3.7)  10.1 (2.5)       25.3 (5.1)   67.2 (6.2)  7.5 (3.4)
3                     24.5 (3.2)   65.9 (4.0)   9.6 (2.5)       28.8 (5.6)   64.1 (6.7)  7.1 (3.5)
4                     18.4 (3.3)   72.8 (4.6)   8.8 (3.0)       27.0 (5.5)   66.7 (6.7)  6.3 (3.6)

NOTE: Entries are percentages. Plain numbers are means, and parentheses are standard deviations, of the posterior distribution of the estimands.


pile-up of 0s in the outcomes, incorporation of heteroscedasticity, and a multivariate model to accommodate both outcomes simultaneously. It is reassuring, from the perspective of model robustness, that the results from the last few models fit, including those in the conference proceedings report of Barnard et al. (2002), are consistent with the results from this final model.

9.1 Convergence Checks

Because posterior distributions were simulated from a Markov chain Monte Carlo algorithm (App. A), it is important to assess its convergence. To do this, we ran three chains from different starting values. To initiate each chain, we set any unknown compliance stratum equal to a Bernoulli draw, with probabilities obtained from moment estimates of the probabilities of being a complier given observed attendance and randomization data (D, Z) alone. Using the initialized compliance strata, for each chain, parameters were initialized to values based on generalized linear model estimates of the model components.

Each chain was run for 15,000 iterations. At 5,000 iterations, and based on the three chains for each model, we calculated the potential scale-reduction statistic (Gelman and Rubin 1992) for the 250 estimands (parameters and functions of parameters) that serve as building blocks for all other estimands. The results suggested good mixing of the chains (with the maximum potential scale-reduction statistic across parameters 1.04) and provided no evidence against convergence. Inference is based on the remaining 30,000 iterations, combining the three chains.
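For reference, the potential scale-reduction statistic for a scalar estimand can be computed from multiple chains as in this sketch (the standard Gelman-Rubin formula, not necessarily the authors' exact implementation):

```python
import numpy as np

def potential_scale_reduction(chains):
    """Gelman-Rubin statistic; chains has shape (m chains, n draws)."""
    m, n = chains.shape
    B = n * chains.mean(axis=1).var(ddof=1)   # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()     # within-chain variance
    var_plus = (n - 1) / n * W + B / n        # pooled variance estimate
    return np.sqrt(var_plus / W)
```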

9.2 Model Checks

We evaluate the influence of the model presented in Section 7 with six posterior predictive checks: three checks for each of the two outcomes. A posterior predictive check generally involves (a) choosing a discrepancy measure that is a function of observed data, and possibly of missing data and the parameter vector $\theta$, and (b) computing a posterior predictive p value (PPPV), which is the probability, over the posterior predictive distribution of the missing data and $\theta$, that the discrepancy measure in a new study, drawn with the same $\theta$ as in our study, would be as or more extreme than in our study (Rubin 1984; Meng 1996; Gelman, Meng, and Stern 1996).

Posterior predictive checks in general, and PPPVs in particular, demonstrate whether the model can adequately preserve features of the data reflected in the discrepancy measure, where the model here includes the prior distribution as well as the likelihood (Meng 1996). As a result, properties of PPPVs are not exactly the same as properties of classical p values under frequency evaluations conditional on the unknown $\theta$, just as they are not exactly the same for frequency evaluations over both levels of uncertainty, that is, the drawing of $\theta$ from the prior distribution and the drawing of data from the likelihood given $\theta$. For example, over frequency evaluations of the latter type, a PPPV is stochastically less variable than, but has the same mean as, the uniform distribution, and so tends to be more conservative than a classical p value, although the reverse can be true over frequency evaluations of the first type (Meng 1996). (For more details on the interpretation and properties of PPPVs, see also Rubin 1984 and Gelman et al. 1996.)

Table 8. Posterior Predictive Checks: p Values

        Signal   Noise   Signal to Noise

Math     .34      .79         .32
Read     .40      .88         .39

The posterior predictive discrepancy measures that we choose here are functions of

$$
A_p^{\mathrm{rep}}(z) = \bigl\{ Y_{ip}^{\mathrm{rep}} : I(R_{y,ip}^{\mathrm{rep}} = 1)\, I(Y_{ip}^{\mathrm{rep}} \ne 0)\, I(C_i^{\mathrm{rep}} = c)\, I(Z_i = z) = 1 \bigr\}
$$

for the measures that are functions of data $Y_{ip}^{\mathrm{rep}}$, $R_{y,ip}^{\mathrm{rep}}$, and $C_i^{\mathrm{rep}}$ from a replicated study, and

$$
A_p^{\mathrm{study}}(z) = \bigl\{ Y_{ip} : I(R_{y,ip} = 1)\, I(Y_{ip} \ne 0)\, I(C_i = c)\, I(Z_i = z) = 1 \bigr\}
$$

for the measures that are functions of this study's data. Here $R_{y,ip}^{\mathrm{rep}}$ is defined so that it equals 1 if $Y_{ip}^{\mathrm{rep}}$ is observed and 0 otherwise, and $p$ equals 1 for math outcomes and 2 for reading outcomes. The discrepancy measures, "rep" and "study," that we used for each outcome ($p = 1, 2$) were (a) the absolute value of the difference between the mean of $A_p(1)$ and the mean of $A_p(0)$ ("signal"); (b) the standard error, based on a simple two-sample comparison, for this difference ("noise"); and (c) the ratio of (a) to (b) ("signal to noise"). Although these measures are not treatment effects, we chose them here to assess whether the model can preserve broad features of signal, noise, and signal-to-noise ratio in the continuous part of the compliers' outcome distributions, which we think can be very influential in estimating the treatment effects of Section 8. More preferable measures might have been the posterior mean and standard deviation for the actual estimands in Section 8 for each replicated dataset, but this would have required a prohibitive amount of computer memory because of the nested structure of that algorithm. In settings such as these, additional future work on choices of discrepancy measures is of interest.

PPPVs for the discrepancy measures that we chose were calculated as the percentage of draws in which the replicated discrepancy measures exceeded the value of the study's discrepancy measure. Extreme values (close to 0 or 1) of a PPPV would indicate a failure of the prior distribution and likelihood to replicate the corresponding measure of location, dispersion, or their relative magnitude, and would indicate an undesirable influence of the model in estimation of our estimands. Results from these checks, displayed in Table 8, provide no special evidence for such influences of the model.
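Given one replicated and one realized discrepancy value per posterior draw, the PPPV computation reduces to a single comparison, as in this sketch:

```python
import numpy as np

def pppv(rep_discrepancies, study_discrepancies):
    """Share of posterior draws in which the replicated discrepancy is at
    least as extreme as the study's (compared draw by draw)."""
    rep = np.asarray(rep_discrepancies)
    study = np.asarray(study_discrepancies)
    return float(np.mean(rep >= study))
```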

10. DISCUSSION

In this article we have defined the framework for principal stratification in broken randomized experiments, to accommodate noncompliance, missing covariate information, missing outcome data, and multivariate outcomes. We make explicit a set of structural assumptions that can identify the causal effects of interest, and we also provide a parametric model that is appropriate for practical implementation of the framework in settings such as ours.

Results from our model in the school choice study do not indicate strong treatment effects for most of the subgroups examined. But we do estimate positive effects (on the order of 3 percentile points for ITT and 5 percentile points for the effect of


attendance) on math scores overall for children who applied to the program from low-applicant schools, particularly for first-graders. Also, the effects were larger for African-American children than for the remaining children (App. B).

Posterior distributions for the CACE estimand, which measures the effect of attendance in private school versus public school, are generally larger than the corresponding ITT effects, but are also associated with greater uncertainty. Of importance, because of the missing outcomes, a model like ours is needed for valid estimation even of the ITT effect.

The results from this randomized study are not subject to selection bias in the way that nonrandomized studies in school choice have been. Nevertheless, although we use the CACE, a well-defined causal effect, to represent the effect of attendance of private versus public schools, it is important to remember that the CACE is defined on a subset of the study children (those who would have complied with either assignment), and that for the other children there is no information on such an effect of attendance in this study. Therefore, as with any randomized trial based on a subpopulation, external information, such as background variables, also must be used when generalizing the CACE from compliers to other target groups of children. Also, it is possible that a broader implementation of a voucher program can have a collective effect on the public schools, if a large shift of the children who might use vouchers were to have an impact on the quality of learning for children who would stay in public schools (a violation of the no-interference assumption). Because our study contains a small fraction of participants relative to all school children, it cannot provide direct information about any such collective effect, and additional external judgment would need to be used to address this issue.

The larger effects that we estimated for children applying from schools with low versus high past scores are also not, in principle, subject to the usual regression to the mean bias, in contrast to a simple before-after comparison. This is because, in our study, both types of children are randomized to be offered the scholarship or not, and both types in both treatment arms are evaluated at the same time after randomization. Instead, the differential effect for children from schools with different past scores is evidence supporting the claim that the school voucher program has greater potential benefit for children in lower-scoring schools.

Our results also reveal differences in compliance and missing-data patterns across groups. Differences in compliance indicate that children applying from low-applicant schools are generally more likely to comply with their treatment assignment; this could provide incentives for policy makers to target this subgroup of children. However, this group also exhibits higher levels of always-takers, indicating that parents of children attending these poorly performing schools are more likely to get their child into a private school even in the absence of scholarship availability. These considerations would of course have to be balanced with (and indeed might be dwarfed by) concerns about equity. Missing-data patterns reveal that perhaps relatively greater effort is needed in future interventions of this nature to retain study participants who stay in public schools, particularly the public-school students who actually won but did not use a scholarship.

The approach presented here also has some limitations. First, we have presented principal stratification only on a binary controlled factor $Z$ and a binary uncontrolled factor $D$, and it is of interest to develop principal stratification for more levels of such factors. Approaches that extend our framework in that direction, including for time-dependent data, have been proposed (e.g., Frangakis et al. 2002b).

Also, the approach of explicitly conditioning on the patterns of missing covariates that we adopted in the parametric component of Section 7 is not as applicable when there are many patterns of missing covariates. In such cases it would be more appropriate to use models related to those of D'Agostino and Rubin (2000) and integrate them with principal stratification for noncompliance. Moreover, it would also be interesting to investigate results for models fit to these data that allow deviations from the structural assumptions, such as weakened exclusion restrictions (e.g., Hirano et al. 2000; Frangakis, Rubin, and Zhou 2002a), although to ensure robustness of such models, it would be important to first investigate and rely on additional alternative assumptions that would be plausible.

APPENDIX A: COMPUTATIONS

Details are available at http://biosun01.biostat.jhsph.edu/~cfrangak/papers/sc.

Table B.1. ITT Estimand

                                 Applicant's school: Low           Applicant's school: High
Grade at application  Ethnicity  Reading          Math             Reading          Math

1                     AA         2.8 (-1.5, 7.2)  6.7 (3.0, 10.4)  1.8 (-5.2, 8.4)  6.3 (.8, 11.9)
                      Other      1.7 (-1.9, 5.4)  3.9 (.5, 7.2)     .8 (-4.5, 6.1)  3.5 (-1.2, 8.2)
2                     AA          .8 (-2.8, 4.4)  2.4 (-1.0, 5.8)  -.3 (-6.6, 6.0)  2.2 (-3.8, 8.0)
                      Other       .2 (-3.1, 3.4)   .5 (-2.8, 3.7)  -.9 (-6.1, 4.3)   .0 (-5.2, 4.8)
3                     AA         1.1 (-3.0, 5.2)  4.6 (.3, 8.9)    -.1 (-6.7, 6.4)  4.2 (-2.5, 10.6)
                      Other       .3 (-3.2, 3.8)  2.0 (-2.0, 5.8)  -.7 (-5.8, 4.4)  1.4 (-3.9, 6.5)
4                     AA         3.7 (-1.2, 8.7)  4.4 (-.5, 9.3)   2.4 (-4.7, 9.4)  3.8 (-2.9, 10.5)
                      Other      2.4 (-1.7, 6.7)  1.8 (-2.6, 6.0)  1.4 (-4.3, 7.0)  1.1 (-4.4, 6.4)
Overall               AA         2.0 (-.9, 4.8)   4.5 (1.8, 7.2)    .8 (-5.1, 6.5)  4.2 (-1.1, 9.3)
                      Other      1.1 (-1.4, 3.6)  2.1 (-.7, 4.7)    .1 (-4.6, 4.6)  1.6 (-2.9, 5.8)

NOTE: Plain numbers are means, and parentheses are central 95% intervals, of the posterior distribution of the effects on percentile rank.


Table B.2. CACE Estimand

                                 Applicant's school: Low            Applicant's school: High
Grade at application  Ethnicity  Reading           Math             Reading            Math

1                     AA         3.8 (-2.0, 9.6)   9.0 (4.1, 14.0)   2.3 (-7.3, 10.9)  8.3 (1.1, 15.5)
                      Other      2.9 (-3.1, 8.8)   6.3 (.8, 11.8)    1.3 (-8.0, 10.2)  5.9 (-2.2, 13.9)
2                     AA         1.1 (-3.6, 5.7)   3.1 (-1.3, 7.5)   -.4 (-9.0, 7.9)   2.9 (-5.0, 10.7)
                      Other       .3 (-4.9, 5.4)    .8 (-4.4, 5.9)  -1.5 (-10.5, 7.3)   .0 (-8.5, 8.5)
3                     AA         1.5 (-4.1, 7.0)   6.3 (.3, 12.2)    -.2 (-9.1, 8.5)   5.6 (-3.2, 14.1)
                      Other       .5 (-5.4, 6.4)   3.5 (-3.4, 10.2) -1.3 (-10.5, 7.9)  2.7 (-7.0, 12.0)
4                     AA         4.6 (-1.6, 10.8)  5.5 (-.7, 11.6)   3.0 (-6.0, 11.6)  4.8 (-3.6, 13.1)
                      Other      3.7 (-2.6, 10.1)  2.8 (-3.8, 9.3)   2.3 (-7.4, 11.7)  2.0 (-7.0, 11.0)
Overall               AA         2.6 (-1.2, 6.3)   6.0 (2.4, 9.5)    1.0 (-6.9, 8.4)   5.5 (-1.5, 12.2)
                      Other      1.7 (-2.3, 5.8)   3.3 (-1.2, 7.7)    .1 (-8.1, 7.9)   2.7 (-5.0, 10.3)

NOTE: Plain numbers are means, and parentheses are central 95% intervals, of the posterior distribution of the effects on percentile rank.

APPENDIX B: ETHNIC BREAKDOWN

The model of Section 7 can be used to estimate the effect of the program on finer strata that may be of interest. For example, to estimate effects stratified by ethnicity (AA for African-American), we obtain the posterior distribution of the causal effect of interest (ITT or CACE) using the same steps described in Section 8.1.1 or 8.1.2, but allowing ethnicity to be part of the definition of $W^p$. The results for this stratification are reported in Table B.1 for the estimands of ITT and Table B.2 for the estimands of CACE.

The results follow patterns similar to those in Section 8. For each grade and applicant's school (low/high) combination, however, the effects are more positive, on average, for the subgroup of African-American children. For both estimands, this leads to 95% intervals that are entirely above 0 for math scores for the following subgroups: African-American first- and third-graders, and overall, from low-applicant schools; African-American first-graders from high-applicant schools; and non-African-American first-graders from low-applicant schools. All intervals for reading covered the null value. This suggests that the positive effects reported on math scores in Section 8, for children originating from low-applicant schools, are primarily attributable to gains among the African-American children in this subgroup.

[Received October 2002. Revised October 2002.]

REFERENCES

Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996), "Identification of Causal Effects Using Instrumental Variables" (with discussion), Journal of the American Statistical Association, 91, 444–472.
Barnard, J., Du, J., Hill, J. L., and Rubin, D. B. (1998), "A Broader Template for Analyzing Broken Randomized Experiments," Sociological Methods and Research, 27, 285–317.
Barnard, J., Frangakis, C., Hill, J. L., and Rubin, D. B. (2002), "School Choice in NY City: A Bayesian Analysis of an Imperfect Randomized Experiment," in Case Studies in Bayesian Statistics, Vol. V, eds. C. Gatsonis, R. E. Kass, B. Carlin, A. Carriquiry, A. Gelman, I. Verdinelli, and M. West, New York: Springer-Verlag, pp. 3–97.
Brandl, J. E. (1998), Money and Good Intentions Are Not Enough, or Why Liberal Democrats Think States Need Both Competition and Community, Washington, DC: Brookings Institute Press.
Carnegie Foundation for the Advancement of Teaching (1992), School Choice: A Special Report, San Francisco: Jossey-Bass.
Chubb, J. E., and Moe, T. M. (1990), Politics, Markets, and America's Schools, Washington, DC: Brookings Institute Press.
Cobb, C. W. (1992), Responsive Schools, Renewed Communities, San Francisco: Institute for Contemporary Studies.
Coleman, J. S., Hoffer, T., and Kilgore, S. (1982), High School Achievement, New York: Basic Books.
Cookson, P. W. (1994), School Choice: The Struggle for the Soul of American Education, New Haven, CT: Yale University Press.
Coulson, A. J. (1999), Market Education: The Unknown History, Bowling Green, OH: Social Philosophy & Policy Center.
Cox, D. R. (1958), Planning of Experiments, New York: Wiley.
D'Agostino, Ralph B., Jr., and Rubin, D. B. (2000), "Estimating and Using Propensity Scores With Incomplete Data," Journal of the American Statistical Association, 95, 749–759.
Education Week (1998), Quality Counts '98: The Urban Challenge, Public Education in the 50 States, Bethesda, MD: Editorial Projects in Education.
Frangakis, C. E., and Rubin, D. B. (1999), "Addressing Complications of Intention-to-Treat Analysis in the Combined Presence of All-or-None Treatment-Noncompliance and Subsequent Missing Outcomes," Biometrika, 86, 365–380.
——— (2001), "Addressing the Idiosyncrasy in Estimating Survival Curves Using Double-Sampling in the Presence of Self-Selected Right Censoring" (with discussion), Biometrics, 57, 333–353.
——— (2002), "Principal Stratification in Causal Inference," Biometrics, 58, 20–29.
Frangakis, C. E., Rubin, D. B., and Zhou, X. H. (2002a), "Clustered Encouragement Design With Individual Noncompliance: Bayesian Inference and Application to Advance Directive Forms" (with discussion), Biostatistics, 3, 147–164.
Frangakis, C. E., Brookmeyer, R. S., Varadhan, R., Mahboobeh, S., Vlahov, D., and Strathdee, S. A. (2002b), "Methodology for Evaluating a Partially Controlled Longitudinal Treatment Using Principal Stratification, With Application to a Needle Exchange Program," Technical Report NEP-06-02, Johns Hopkins University, Dept. of Biostatistics.
Fuller, B., and Elmore, R. F. (1996), Who Chooses? Who Loses? Culture, Institutions, and the Unequal Effects of School Choice, New York: Teachers College Press.
Gelfand, A. E., and Smith, A. F. M. (1990), "Sampling-Based Approaches to Calculating Marginal Densities," Journal of the American Statistical Association, 85, 398–409.
Gelman, A., Meng, X.-L., and Stern, H. (1996), "Posterior Predictive Assessment of Model Fitness via Realized Discrepancies" (with discussion), Statistica Sinica, 6, 733–760.
Gelman, A., and Rubin, D. B. (1992), "Inference from Iterative Simulation Using Multiple Sequences" (with discussion), Statistical Science, 7, 457–472.
Glynn, R. J., Laird, N. M., and Rubin, D. B. (1993), "Multiple Imputation in Mixture Models for Nonignorable Nonresponse With Follow-Ups," Journal of the American Statistical Association, 88, 984–993.
Goldberger, A. S., and Cain, G. G. (1982), "The Causal Analysis of Cognitive Outcomes in the Coleman, Hoffer and Kilgore Report," Sociology of Education, 55, 103–122.
Haavelmo, T. (1943), "The Statistical Implications of a System of Simultaneous Equations," Econometrica, 11, 1–12.
Hastings, W. (1970), "Monte Carlo Sampling Methods Using Markov Chains and Their Applications," Biometrika, 57, 97–109.
Hill, J. L., Rubin, D. B., and Thomas, N. (2000), "The Design of the New York School Choice Scholarship Program Evaluation," in Donald Campbell's Legacy, ed. L. Bickman, Newbury Park, CA: Sage Publications.
Hirano, K., Imbens, G. W., Rubin, D. B., and Zhou, A. (2000), "Assessing the Effect of an Influenza Vaccine in an Encouragement Design," Biostatistics, 1, 69–88.
Holland, P. (1986), "Statistics and Causal Inference," Journal of the American Statistical Association, 81, 945–970.
Imbens, G. W., and Angrist, J. D. (1994), "Identification and Estimation of Local Average Treatment Effects," Econometrica, 62, 467–476.
Imbens, G. W., and Rubin, D. B. (1997), "Bayesian Inference for Causal Effects in Randomized Experiments With Noncompliance," The Annals of Statistics, 25, 305–327.
Levin, H. M. (1998), "Educational Vouchers: Effectiveness, Choice, and Costs," Journal of Policy Analysis and Management, 17, 373–392.
Little, R. J. A. (1993), "Pattern-Mixture Models for Multivariate Incomplete Data," Journal of the American Statistical Association, 88, 125–134.
——— (1996), "Pattern-Mixture Models for Multivariate Incomplete Data With Covariates," Biometrics, 52, 98–111.
Little, R. J. A., and Rubin, D. B. (1987), Statistical Analysis With Missing Data, New York: Wiley.
Meng, X. L. (1996), "Posterior Predictive p Values," The Annals of Statistics, 22, 1142–1160.
Neal, D. (1997), "The Effects of Catholic Secondary Schooling on Educational Achievement," Journal of Labor Economics, 15, 98–123.
Neyman, J. (1923), "On the Application of Probability Theory to Agricultural Experiments: Essay on Principles, Section 9," translated in Statistical Science, 5, 465–480.
Peterson, P. E., and Hassel, B. C. (eds.) (1998), Learning from School Choice, Washington, DC: Brookings Institute Press.
Peterson, P. E., Myers, D. E., Howell, W. G., and Mayer, D. P. (1999), "The Effects of School Choice in New York City," in Earning and Learning: How Schools Matter, eds. S. E. Mayer and P. E. Peterson, Washington, DC: Brookings Institute Press.
Robins, J. M., Greenland, S., and Hu, F.-C. (1999), "Estimation of the Causal Effect of a Time-Varying Exposure on the Marginal Mean of a Repeated Binary Outcome" (with discussion), Journal of the American Statistical Association, 94, 687–712.
Rosenbaum, P. R., and Rubin, D. B. (1983), "The Central Role of the Propensity Score in Observational Studies for Causal Effects," Biometrika, 70, 41–55.
Rubin, D. B. (1974), "Estimating Causal Effects of Treatments in Randomized and Non-Randomized Studies," Journal of Educational Psychology, 66, 688–701.
——— (1977), "Assignment to Treatment Groups on the Basis of a Covariate," Journal of Educational Statistics, 2, 1–26.
——— (1978a), "Bayesian Inference for Causal Effects: The Role of Randomization," The Annals of Statistics, 6, 34–58.
——— (1978b), "Multiple Imputations in Sample Surveys: A Phenomenological Bayesian Approach to Nonresponse," in Proceedings of the Survey Research Methods Section, American Statistical Association, pp. 20–28.
——— (1979), "Using Multivariate Matched Sampling and Regression Adjustment to Control Bias in Observational Studies," Journal of the American Statistical Association, 74, 318–328.
——— (1980), Comment on "Randomization Analysis of Experimental Data: The Fisher Randomization Test," Journal of the American Statistical Association, 75, 591–593.
——— (1984), "Bayesianly Justifiable and Relevant Frequency Calculations for the Applied Statistician," The Annals of Statistics, 12, 1151–1172.
——— (1990), "Comment: Neyman (1923) and Causal Inference in Experiments and Observational Studies," Statistical Science, 5, 472–480.
The Coronary Drug Project Research Group (1980), "Influence of Adherence to Treatment and Response of Cholesterol on Mortality in the Coronary Drug Project," New England Journal of Medicine, 303, 1038–1041.
Tinbergen, J. (1930), "Determination and Interpretation of Supply Curves: An Example," in The Foundations of Econometric Analysis, eds. D. Hendry and M. Morgan, Cambridge, U.K.: Cambridge University Press.
Wilms, D. J. (1985), "Catholic School Effect on Academic Achievement: New Evidence From the High School and Beyond Follow-Up Study," Sociology of Education, 58, 98–114.

Comment

Bengt MUTHÉN, Booil JO, and C. Hendricks BROWN

1. INTRODUCTION

The article by Barnard, Frangakis, Hill, and Rubin (BFHR) is timely in that the Department of Education is calling for more randomized studies in educational program evaluation. (See the discussion of the "No Child Left Behind" initiative in, e.g., Slavin 2002.) BFHR can serve as a valuable pedagogical example of a successful, sophisticated statistical analysis of a randomized study. Our commentary is intended to provide additional pedagogical value to benefit the planning and analysis of future studies, drawing on experiences and research within our research group. [The Prevention Science Methodology Group (PSMG; www.psmg.hsc.usf.edu), co-PIs Brown and Muthén, has collaborated over the last 15 years with support from the National Institute of Mental Health and the National Institute on Drug Abuse.]

BFHR provides an exemplary analysis of the data from an imperfect randomized trial that suffers from several complications simultaneously: noncompliance, missing data in outcomes, and missing data in covariates. We are very pleased to

Bengt Muthén is Professor and Booil Jo is postdoc, Graduate School of Education and Information Studies, University of California, Los Angeles, Los Angeles, CA 90095 (E-mail: bmuthen@ucla.edu). C. Hendricks Brown is Professor, Department of Epidemiology and Biostatistics, University of South Florida, Tampa, FL 33620. The research of the first author was supported by National Institute on Alcohol Abuse and Alcoholism grant K02 AA 00230. The research of the second and third authors was supported by National Institute on Drug Abuse and National Institute of Mental Health grant MH40859. The authors thank Chen-Pin Wang for research assistance, Joyce Chappell for graphical assistance, and the members of the Prevention Science Methodology Group and the Fall ED299A class for helpful comments.

see their application of cutting-edge Bayesian methods for dealing with these complexities. In addition, we believe the methodological issues and the results of the study have important implications for the design and analysis of randomized trials in education and for related policy decisions.

BFHR provides results of the New York City school choice experiment based on 1-year achievement outcomes. With the planned addition of yearly follow-up data, growth models can provide an enhanced examination of causal impact. We discuss how such growth modeling can be incorporated, and we provide a caution that applies to BFHR's use of only one posttest occasion. We also consider the sensitivity of the latent class ignorability assumption, in combination with the assumption of compound exclusion.

2. LONGITUDINAL MODELING ISSUES

BFHR focuses on variation in treatment effect across compliance classes. This part of the commentary considers variation in treatment effect across a different type of class, based on the notion that the private school treatment effect might very well be quite different for children with different achievement development. (Also of interest is potential variation in treatment effects across schools, with respect to both the public school the child originated in and the private school the child was



moved to; but this multilevel aspect of the data is left aside here for lack of space.) To study such a "treatment-baseline interaction" (or "treatment-trajectory interaction"), we will switch from BFHR's pretest-posttest analysis framework (essentially a very advanced ANCOVA-type analysis) to the growth mixture modeling framework of Muthén et al. (2002). An underlying rationale for this modeling is that individuals at different initial status levels, and on different trajectories, may benefit differently from a given treatment. ANCOVA controls for initial status as measured by the observed pretest score. Unlike the observed pretest score, the latent variable of initial status is free of time-specific variation and measurement error.

The focus on longitudinal aspects of the New York School Choice Study (NYSCS) is both substantively and statistically motivated. First, treatment effects may not have gained full strength after only a 1-year stay in a private school. The NYSCS currently has data from three follow-ups, that is, providing repeated-measures data from four grades. Although BFHR used percentile scores that do not lend themselves to growth modeling, a conversion to "scale scores" (i.e., IRT-based, equated scores) should be possible, enabling growth modeling. Unfortunately, educational research traditionally uses scales that are unsuitable for growth modeling, such as percentile scores, normal curve equivalents, and grade equivalents (for a comparison in a growth context, see Seltzer, Frank, and Bryk 1994). Hopefully, this tradition can be changed. Second, the use of information from more time points than pretest and posttest makes it possible to identify and estimate models that give a richer description of the normative development in the control group and how the treatment changes this development.

Consider three types of latent variables for individual $i$. The first type, $C_i$, refers to BFHR's compliance principal strata. The next two relate to the achievement development, as expressed by a growth mixture model: $T_i$ refers to trajectory class, and $\eta_i$ refers to random effects within trajectory class (the within-class model is a regular mixed-effects model). Unlike the latent class variable $C_i$, the latent class variable $T_i$ is a fully unobserved variable, as is common in latent variable modeling (see, e.g., Muthén 2002a). Consider the likelihood expression for individual $i$, using the $[\,\cdot\,]$ notation to denote probabilities/densities:

$$
[C_i, T_i \mid X_i]\,[\eta_i \mid C_i, T_i, X_i]\,[Y_i \mid \eta_i, C_i, T_i, X_i]
\times [U_i \mid \eta_i, C_i, T_i, X_i]\,[R_i \mid Y_i, \eta_i, C_i, T_i, X_i], \tag{1}
$$

where $X_i$ denotes covariates; $U_i$ denotes a compliance stratum indicator (with $C_i$ perfectly measured by $U_i$ in the treatment group for never-takers and perfectly measured in the control group for always-takers, other group-class combinations having missing data); and $R_i$ denotes indicators for missing data on the repeated-measures outcomes $Y_i$ (pretreatment and posttreatment achievement scores). This type of model can be fitted in the latent variable modeling framework of the Mplus program (Muthén and Muthén 1998-2002, tech. app. 8), which has implemented an EM-based maximum likelihood estimator. (For related references, see the Mplus website, www.statmodel.com.) As a special case of (1), conventional random-effects growth modeling includes $\eta_i$ but excludes $C_i$ and $T_i$, and assumes missingness at random, so that the last term in (1) is ignored. Growth mixture modeling (Muthén and Shedden 1999; Muthén et al. 2002; Muthén and Muthén 1998-2002) includes $\eta_i$ and $T_i$. BFHR includes $C_i$, but not $T_i$ (or $\eta_i$), and includes the last term in (1), drawing on the latent ignorability of Frangakis and Rubin (1999). Muthén and Brown (2001) studied latent ignorability related to $T_i$ in the last term of (1). In randomized studies, it would be of interest to study $C_i$ and $T_i$ classes jointly, because individuals in different trajectory classes may show different compliance, and missingness may be determined by these classes jointly.
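To fix ideas, the marginal likelihood of one subject's repeated measures under a simple growth mixture model (trajectory classes $T_i$ only, with random effects integrated out analytically, and without the $C_i$, $U_i$, and $R_i$ terms of (1)) can be sketched as follows; all inputs are hypothetical, and this is not the Mplus implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_loglik(y, Lambda, class_means, Psi, sigma2, class_probs):
    """Log-likelihood of outcome vector y: within trajectory class k,
    y ~ N(Lambda @ mu_k, Lambda Psi Lambda' + sigma2 I)."""
    cov = Lambda @ Psi @ Lambda.T + sigma2 * np.eye(len(y))
    parts = [np.log(pk) + multivariate_normal.logpdf(y, Lambda @ mu_k, cov)
             for pk, mu_k in zip(class_probs, class_means)]
    return logsumexp(parts)  # sum over latent classes
```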

If data have been generated by a growth mixture model with treatment effects varying across trajectory classes, what would pretest-posttest analysis such as that in BFHR reveal? To judge the possibility of such treatment-trajectory interaction in the NYSCS, we considered several recent applications of growth mixture modeling that have used $T_i$ to represent qualitatively different types of trajectories for behavior and achievement scores on children in school settings. Drawing on these real-data studies, two growth mixture scenarios were investigated. (A detailed description of these real-data studies and scenarios, and their parameter values, is given in Mplus Web Note 5, at www.statmodel.com/mplus/examples/webnote.html.) For simplicity, no missing data on the outcome or pretest are assumed, and $C_i$ classes are not present. In a three-class scenario, the treatment effect is noteworthy only for a 70% middle class, assuming that low-class membership (10%) hinders individuals from benefiting from the treatment and that high-class membership (20%) does not really need the treatment. The achievement development in the three-class scenario is shown in Figure 1(a), and the corresponding posttest ($y_2$)-pretest ($y_1$) regressions are shown in Figure 1(b). The lines denoted ANCOVA show a regular ANCOVA analysis, allowing for an interaction between treatment and pretest (different slopes). In the three-class scenario, the ANCOVA interaction is not significant at $n = 2{,}000$, and the treatment effect in the middle class is underestimated by 20%, but overestimated in the other two classes. In a two-class scenario (not shown here), where the treatment is noteworthy only for individuals in the low class (50%), ANCOVA detects an interaction that is significant at the NYSCS sample size of $n = 2{,}000$, but underestimates the treatment effect for most children in the low class. (At the low-class average pretest value of 0, the treatment effect is underestimated by 32%.)


Figure 1. Growth Mixture Modeling Versus Pretest-Posttest Analysis.

A more flexible analysis is obtained with more posttreatment time points. An improved design for the determination of the latent trajectory classes is to use more than one pretreatment time point, so that the trajectory class membership is better determined before the treatment starts.

3. COMPOUND EXCLUSION AND LATENT IGNORABILITY

Based on the ideas of principal stratification (Frangakis and Rubin 2002) and latent ignorability (Frangakis and Rubin 1999), BFHR successfully demonstrates that the complexities of educational studies can be better handled under more explicit and flexible sets of assumptions. Although we think that their structural assumptions are reasonable in the NYSCS, we would like to add some thoughts on the plausibility of two other assumptions, considering more general situations.

Compound exclusion (CE) is one of the key structural assumptions in identifying principal effects under latent ignorability. However, the plausibility of this assumption can be questioned in practice (Frangakis et al. 2002; Hirano et al. 2000; Jo 2002a, 2002c; Shadish, Cook, and Campbell 2002; West and Sagarin 2000). In the NYSCS, it seems realistic to assume that winning a lottery has some positive impact on always-takers; however, it is less clear how winning a lottery will affect never-takers. One possibility is that winning a lottery has a negative impact on parents, because they fail to benefit from it. Discouraged parents may have a negative influence on a child's test scores or response behaviors. This negative effect may become more evident if noncompliance is due to parents' low expectation of, or lack of interest in, their child's education. Another possibility is that winning a lottery has a positive impact on a child's test scores or response behaviors. For example, parents who are discouraged by being unable to send their child to private schools even with vouchers may try harder to improve the quality of existing resources (e.g., in the public school their child attends) and be more motivated to support their child to

improve his or her academic performance. Given these competing possibilities, it is not easy to predict whether and how CE is violated.

Depending on the situation, causal effect estimates can be quite sensitive to violation of the exclusion restriction in outcome missingness (Jo 2002b), which is less known than the impact of violating the exclusion restriction in observed outcomes (Angrist et al. 1996; Jo 2002a). The implication of possible violation of CE, and of its impact, is that the relative benefit of models assuming latent ignorability (LI) and standard ignorability (SI) depends on degrees of deviation from CE and SI. Identification of causal effects under LI relies on the generalized (compound) exclusion restriction (i.e., both on the outcomes and on missingness of outcomes), whereas identification of causal effects under SI relies on the standard exclusion restriction (i.e., only on the outcomes). Therefore, in some situations, the impact of deviation from CE may outweigh the impact of deviation from SI, resulting in more biased causal effect estimates in models assuming LI than in models assuming SI (Jo 2002b). For example, if SI holds but CE is seriously violated (say, a 20% increase in the response rate due to treatment assignment for compliers and a 15% increase for never-takers), causal effect estimates and the coverage probability assuming LI and CE can drastically deviate from the true value and the nominal level. This type of violation does not affect models assuming SI and the standard exclusion restriction. To empirically examine the plausibility of SI, LI, and CE, it will be useful to do sensitivity analyses of models imposing different combinations of these assumptions. As BFHR points out, this investigation can be conducted by relaxing compound exclusion (e.g., Frangakis et al. 2002; Hirano et al. 2000) or by using alternative structural assumptions (e.g., Jo 2002c). More research is needed to examine the efficiency of these alternative models and to explore factors associated with insensitivity of LI models to violation of compound exclusion.

4. CONCLUSION

Causal inferences of the type BFHR provides are a dramatic improvement over the existing literature now available on the


question of whether school choice will produce better achievement outcomes for children in an urban public school system. The randomized lottery provides an exceptionally powerful tool for examining the impact of a program, far more useful than observational studies that have causal change intertwined hopelessly with self-selection factors. Statisticians are just now investigating variations in such principal strata analyses, that is, those involving latent classes formed as a function of randomized trials involving intervention invitations (such as vouchers), encouragement designs, and field trial designs involving more than one randomization (Brown and Liao 1999). The latent categories in this article, which BFHR labels "complier," "never-taker," "always-taker," and "defier," represent only one type of design. Other terms may be more relevant to the scientific questions underlying trials in which subjects are randomly assigned to different levels of invitation (e.g., Angrist and Imbens 1995) or different levels of implementation. Such trials not only have great potential for examining questions of effectiveness, sustainability, and scalability, but also require terms more consistent with adherence than compliance. Again, we congratulate the authors on an important addition to the methodological literature that we predict will have lasting impact.

ADDITIONAL REFERENCES

Angrist, J. D., and Imbens, G. W. (1995), "Two-Stage Least Squares Estimation of Average Causal Effects in Models With Variable Treatment Intensity," Journal of the American Statistical Association, 90, 431–442.
Brown, C. H., and Liao, J. (1999), "Principles for Designing Randomized Preventive Trials in Mental Health: An Emerging Developmental Epidemiologic Perspective," American Journal of Community Psychology, 27, 673–709.
Jo, B. (2002a), "Model Misspecification Sensitivity Analysis in Estimating Causal Effects of Interventions With Noncompliance," Statistics in Medicine, 21, 3161–3181.
——— (2002b), "Sensitivity of Causal Effects Under Ignorable and Latent Ignorable Missing-Data Mechanisms," available at www.statmodel.com/mplus/examples/jo.
——— (2002c), "Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications" (with discussion), Journal of Educational and Behavioral Statistics, 27, 385–420.
Muthén, B. (2002a), "Beyond SEM: General Latent Variable Modeling," Behaviormetrika, 29, 81–117.
Muthén, B., and Brown, C. H. (2001), "Non-Ignorable Missing Data in a General Latent Variable Modeling Framework," presented at the annual meeting of the Society for Prevention Research, Washington, DC, June 2001.
Muthén, B., Brown, C. H., Masyn, K., Jo, B., Khoo, S. T., Yang, C. C., Wang, C. P., Kellam, S., Carlin, J., and Liao, J. (2002), "General Growth Mixture Modeling for Randomized Preventive Interventions," Biostatistics, 3, 459–475.
Muthén, L., and Muthén, B. (1998–2002), Mplus User's Guide, Los Angeles: authors.
Muthén, B., and Shedden, K. (1999), "Finite Mixture Modeling With Mixture Outcomes Using the EM Algorithm," Biometrics, 55, 463–469.
Seltzer, M. H., Frank, K. A., and Bryk, A. S. (1994), "The Metric Matters: The Sensitivity of Conclusions About Growth in Student Achievement to Choice of Metric," Educational Evaluation and Policy Analysis, 16, 41–49.
Shadish, W. R., Cook, T. D., and Campbell, D. T. (2002), Experimental and Quasi-Experimental Designs for Generalized Causal Inference, Boston: Houghton Mifflin.
Slavin, R. E. (2002), "Evidence-Based Education Policies: Transforming Educational Practice and Research," Educational Researcher, 31, 15–21.
West, S. G., and Sagarin, B. J. (2000), "Participant Selection and Loss in Randomized Experiments," in Research Design, ed. L. Bickman, Thousand Oaks, CA: Sage, pp. 117–154.

Comment

Alan B. KRUEGER and Pei ZHU

The New York City school voucher experiment provides some of the strongest evidence yet available on the effect of private school vouchers on student test achievement. Barnard et al. provide a useful reevaluation of the experiment, and some of the authors were integrally involved in the design of the experiment. We will leave it to other discussants to comment on the Bayesian methodological advances in their paper, and instead comment on the substantive lessons that can be learned from the experiment and the practical lessons that emerge from the novel design of the experiment.

We have the advantage of having access to more complete data than Barnard et al. used in preparing their paper. This includes more complete baseline test information, data on multichild families as well as single-child families, and three years of follow-up test data instead of just one year of follow-up data.

Alan B. Krueger is Professor and Pei Zhu is graduate student, Economics Department, Princeton University, Princeton, NJ 08544 (E-mail: akrueger@princeton.edu). In part, this comment extends and summarizes results of Krueger and Zhu (2003). Readers are invited to read that article for a more in-depth analysis of many of the issues raised in this comment. Some of the results presented here differ slightly from those in our earlier article, however, because the definition of family size in this comment corresponds to the one used by Barnard et al. rather than to the definition in our earlier article.

In our comment, we use the more comprehensive sample because results for this sample are more informative, but we highlight where differences arise from using the sample analyzed by Barnard et al.

Three themes emerge from our analysis. First, simplicity and transparency are underappreciated virtues in statistical analysis. Second, it is desirable to use the most recent, comprehensive data for the widest sample possible. Third, there is no substitute for probing the definitions and concepts that underlie the data.

1. RANDOM ASSIGNMENT

As Barnard et al. explain, the voucher experiment entailed a complicated block design, and different random assignment procedures were used in the first and second set of blocks. In the first block, a propensity matched-pairs design (PMPD) method was used. In this block, far more potential control families were available than there was money to follow them up. Rather than select a random sample of controls for follow-up after randomly



Table 1. Efficiency Comparison of Two Methods of Random Assignment: Estimated Treatment Effect and Standard Error, by Method of Random Assignment

                              Propensity score      Randomized blocks     Relative sampling   Sample-size-adjusted
                              match subsample       subsample             variance            relative sampling variance
Year    Model                 Coefficient    N      Coefficient    N      (RB/PMPD)           (RB/PMPD)

All students
Year 1  Control for baseline   .33 (1.40)    721    2.10 (1.38)    734     .972                .989
        Omit baseline          .32 (1.40)  1,048   -1.02 (1.58)  1,032    1.274               1.254
Year 3  Control for baseline  1.02 (1.56)    613     .67 (1.74)    637    1.244               1.293
        Omit baseline         -.31 (1.57)    900    -.45 (1.78)    901    1.285               1.287

African-American students
Year 1  Control for baseline   .32 (1.88)    302    8.10 (1.71)    321     .827                .879
        Omit baseline         2.10 (1.99)    436    3.19 (2.07)    447    1.082               1.109
Year 3  Control for baseline  4.21 (2.51)    247    6.57 (2.09)    272     .693                .764
        Omit baseline         2.26 (2.47)    347    3.05 (2.31)    386     .875                .973

NOTE: Standard errors are in parentheses. The treatment effect coefficient is from a regression of test scores on a dummy indicating assignment to receive a voucher (1 = yes), lottery randomization strata dummies, and, in indicated models, baseline test scores. Bootstrap standard errors account for within-family correlation in residuals. Year 1 or 3 refers to follow-up year. The last column adjusts the relative sampling variances of the treatment effects for differences in relative sample sizes. The sample of African-American students consists of those whose mothers' race/ethnicity is identified as non-Hispanic Black/African-American.

selecting the treatments, the researchers estimated a propensity score model to align controls with the treatments and then used a nearest-available-neighbor Mahalanobis match to select specific paired control families for follow-up. In principle, the PMPD design should improve the precision of the estimates by reducing the chance imbalances that can occur in a simple randomized block design. A standard randomized block design was used in the remaining four blocks.

Barnard et al. emphasize that the PMPD block was more balanced for 15 of the 21 baseline variables that they examined. But the key question is the extent to which the precision of the estimates was improved, not the covariate balance, because both blocks yield unbiased estimates. If the covariates explain relatively little of the outcome variable, then the PMPD will add very little. In Table 1 we provide standard intent-to-treat (ITT) estimates—that is, the difference in mean test scores between treatments and controls, conditional on the strata used for random assignment (family size by block by high/low-achieving school)—and, more importantly, their standard errors, separately for the PMPD subsample and the randomized block subsample. The outcome variable is the average of the national percentile rank on the math and reading segments of the Iowa Test of Basic Skills, taken either 1 year or 3 years after random assignment. Two sets of results are presented: one set controlling for baseline test scores, for the subsample with baseline scores, and the other without controlling for scores, for the larger sample. The last column adjusts the relative sampling variances for differences in sample size between the blocks.
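Concretely, each entry in Table 1 is the voucher coefficient from a strata-dummy regression. The sketch below shows the computation described in the table note, with placeholder column names (score, voucher, stratum, family_id, weight) and a bootstrap that resamples whole families to allow for within-family correlation; it is an illustration of the estimator, not our exact code.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def itt_effect(df, n_boot=500, seed=0):
    """ITT estimate: test score on a voucher-offer dummy plus lottery-strata
    dummies, with a family-level bootstrap standard error."""
    formula = 'score ~ voucher + C(stratum)'
    point = smf.wls(formula, data=df, weights=df['weight']).fit().params['voucher']
    rng = np.random.default_rng(seed)
    groups = dict(tuple(df.groupby('family_id')))
    fams = np.array(list(groups))
    draws = []
    for _ in range(n_boot):
        # resample families with replacement, keeping siblings together
        samp = pd.concat([groups[f] for f in rng.choice(fams, size=len(fams))])
        draws.append(smf.wls(formula, data=samp,
                             weights=samp['weight']).fit().params['voucher'])
    return point, float(np.std(draws, ddof=1))
```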

For all students (panel A), the standard errors are about 10% smaller in the PMPD sample. For the Black students (panel B), however, the standard errors are essentially the same or slightly larger in the PMPD sample. This finding is unexpected, because Hill, Rubin, and Thomas (2000) predicted that analyses of subgroups would have more power in the PMPD design, because

matching should lead to a more nearly equal representation of subgroups in the treatment and control groups. In addition, the covariate balance is less good for the Black students in the PMPD than in the randomized blocks if we compute Z scores for the differences between treatments and controls for the baseline covariates. One possibility is that because Black students could have been matched to non-Black students in the PMPD design, this subsample was actually less balanced than the subsample of Blacks from the more conventional randomized block design.
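The balance Z scores just mentioned are ordinary standardized differences in means; a minimal sketch (the covariate vectors are assumed inputs):

```python
import numpy as np

def balance_z(x_treat, x_ctrl):
    """Z score for a treatment-control difference in a baseline covariate:
    difference in means over its estimated standard error."""
    se = np.sqrt(np.var(x_treat, ddof=1) / len(x_treat)
                 + np.var(x_ctrl, ddof=1) / len(x_ctrl))
    return (np.mean(x_treat) - np.mean(x_ctrl)) / se
```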

The increase in power from the PMPD design is relatively modest, even in the full sample. Of much more practical importance is the fact that the complexity of the PMPD design caused Mathematica to initially calculate incorrect baseline sample weights. The baseline weights were designed to make the follow-up sample representative of all eligible applicants for vouchers. The initial weights assigned much too little importance to the control sample in the PMPD block. Because these families represented many unselected controls, they should have been weighted heavily. The revised weights increased the weight on the PMPD controls by 620%. In contrast, the weight increased by just 18% for the rest of the sample.

This error had grave consequences for the initial inferences and analyses of the data. First, using the faulty weights, Peterson, Myers, and Howell (1998) reported highly statistically significant differences in baseline test scores between the treatment and control groups, with control group members scoring higher on both the math and reading exams. After the mistake was discovered and the weights were revised, however, the baseline differences became small and statistically insignificant. (We note, however, that if we limit attention to the students from single-child families that Barnard et al. use, and use the more complete baseline test score data, then there is


a statistically significant difference in baseline reading scores between treatments and controls, even with the revised baseline weights. If we pool together students from single-child and multichild families, then the baseline difference is insignificant. This is another reason why we think it is more appropriate to use the broader sample that includes children from all families.)

Second, because of the inaccurate inference that there was a difference in baseline ability between treatments and controls, Mathematica's researchers were discouraged from analyzing (or at least from presenting) results that did not condition on baseline scores. This is unfortunate, because conditioning on baseline scores caused the researchers to drop from the sample all the students who were initially in kindergarten (because these students were not given baseline tests) and 11% of students initially in grades 1–4. Including students with missing baseline scores increases the sample by more than 40% and expands the population to which the results can be generalized. As explained later, and in an article by Krueger and Zhu (2003), qualitatively different results are found if students with missing baseline scores are included in the sample. Because assignment to receive a voucher was random within lottery strata, estimates that do not condition on baseline scores are nonetheless unbiased. Moreover, it is inefficient to exclude students with missing baseline scores (most of whom had follow-up test scores), and such a sample restriction can potentially cause sample selection bias.

Of lesser importance is the fact that the PMPD design also complicates the calculation of standard errors. The matching of treatments and controls on selected covariates creates a dependence between paired observations. Moreover, if the propensity score model is misspecified, then the equation error also creates a dependence between observations. Researchers have not taken this dependence into account in the calculation of standard errors. We suspect, however, that this is likely to cause only a small understatement of the standard errors, because the covariates do not account for much of the residual variance of test scores and because the PMPD sample is only about half of the overall sample. (We tried to compute the propensity score to adjust the standard errors ourselves, but were unable to replicate the PMPD model because it was not described in sufficient detail and because the computer programs are proprietary. This also prevented us from replicating estimates in Table 4 of Barnard et al.)

The problems created by the complicated experimental design, particularly concerning the weights, lead us to reiterate the advice of Cochran and Cox (1957): "A good working rule is to use the simplest experimental design that meets the needs of the occasion." It seems to us that in this case, the PMPD design introduced unnecessary complexity that inadvertently led to a consequential mistake in the computation of weights, with very little improvement in efficiency. This was not the fault of the architects of the PMPD design, who did not compute the weights, but it was related to the complexity of the design. Simplicity and transparency—in designs, methods, and procedures—can help avoid mistakes down the road, which are almost inevitable in large-scale empirical studies. Indeed, Mathematica recently informed us that they still do not have the baseline weights exactly correct, 5 years after random assignment.

2. INTENT-TO-TREAT ESTIMATION RESULTS

Barnard et al. devote much attention to addressing potential problems caused by missing data for the sample of students in grades 1–4 at baseline; this follows the practice of earlier reports by Mathematica, which restricted the analysis of student outcomes (but not parental responses, such as satisfaction) to those enrolled in grades 1–4 at baseline. In our opinion, a much more important substantive problem results from the exclusion of students in the kindergarten cohort, who were categorically dropped because they were not given baseline tests. Because assignment to treatment status was random (within strata), a simple comparison of means between treatments and controls, without conditioning on baseline scores, provides an unbiased estimate of the average treatment effect. Moreover, as Barnard et al. and others show, treatments and controls were well balanced in terms of baseline characteristics, so there is no reason to suspect that random assignment was somehow corrupted.

Table 2 presents regression estimates of the ITT effect using various samples. Because we cannot replicate the Bayesian estimates without knowing the propensity score, we present conventional ITT estimates. For each entry in the first column, we regressed the test score percentile rank on a voucher offer dummy, 30 dummies indicating lottery strata (block × family size × high/low school), and baseline test scores. The sample is limited to those with baseline test data. Our results differ from the ITT estimates in Table 5 of Barnard et al. because we use the revised weights and a more comprehensive sample that also includes students from multichild families. Barnard et al. report that their Bayesian estimates are "generally more stable" than the conventional ITT estimates, "which in some cases are not even credible," but the extreme ITT estimates that they refer to, for 4th graders from high-achieving schools, are based on only 15 students. For the samples that pool students across grades, the results of the conventional ITT estimates are a priori credible and probably not statistically different from their Bayesian estimates.

In the second column, we use the same sample as in column 1 but omit the baseline test score from the model. In the third column, we expand the sample to include those with missing baseline scores. In the fourth column, we continue to include students with missing baseline scores, but restrict the sample to those initially in low-achieving public schools.

We focus mainly on the results for Black students and students from low-achieving public schools, because the effect of offering a voucher on either reading or math scores for the overall sample is always statistically insignificant and because most public policy attention has focused on these two groups.

Although it has not been made explicit in previous studies of these data, Black students have been defined as children with a non-Hispanic Black/African-American mother. We use this definition and a broader one to probe the sensitivity of the results.

Omitting baseline scores has little qualitative effect on the estimates when the same sample is used (compare columns 1 and 2); the coefficient typically changes by no more than half a standard error. For year 3 scores for Black students, for example, the treatment effect on the composite score is 5.2 points (t = 3.2) when controlling for baseline scores and 5.0 points (t = 2.6) when not controlling. This stability is expected in a


Table 2. Estimated Treatment Effects With and Without Controlling for Baseline Scores: Coefficient on Voucher Dummy From Test Score Regression, for Various Samples

                                                   Subsample with     Subsample with     Full sample,      Applicant school
                                                   baseline scores,   baseline scores,   omits baseline    Low subsample,
Test        Sample                                 controls for       omits baseline     scores            omits baseline
                                                   baseline scores    scores                               scores

First follow-up test
Composite   Overall                                1.27 (.96)          .42 (1.28)        −.33 (1.06)       −.60 (1.06)
            Mother Black, non-Hispanic             4.31 (1.28)        3.64 (1.73)        2.66 (1.42)       1.75 (1.57)
            Either parent Black, non-Hispanic      3.55 (1.24)        2.52 (1.73)        1.38 (1.42)        .53 (1.55)
Reading     Overall                                1.11 (1.04)         .03 (1.36)       −1.13 (1.17)      −1.17 (1.18)
            Mother Black, non-Hispanic             3.42 (1.59)        2.60 (1.98)        1.38 (1.74)        .93 (1.84)
            Either parent Black, non-Hispanic      2.68 (1.50)        1.49 (1.97)        −.09 (1.71)       −.53 (1.81)
Math        Overall                                1.44 (1.17)         .80 (1.45)         .47 (1.17)       −.03 (1.15)
            Mother Black, non-Hispanic             5.28 (1.52)        4.81 (1.93)        4.01 (1.54)       2.68 (1.63)
            Either parent Black, non-Hispanic      4.42 (1.48)        3.54 (1.90)        2.86 (1.51)       1.59 (1.63)

Third follow-up test
Composite   Overall                                 .90 (1.14)         .36 (1.40)        −.38 (1.19)        .10 (1.16)
            Mother Black, non-Hispanic             5.24 (1.63)        5.03 (1.92)        2.65 (1.68)       3.09 (1.67)
            Either parent Black, non-Hispanic      4.70 (1.60)        4.27 (1.92)        1.87 (1.68)       1.90 (1.67)
Reading     Overall                                 .25 (1.26)        −.42 (1.51)       −1.00 (1.25)       −.08 (1.24)
            Mother Black, non-Hispanic             3.64 (1.87)        3.22 (2.19)        1.25 (1.83)       2.06 (1.83)
            Either parent Black, non-Hispanic      3.37 (1.86)        2.75 (2.18)         .76 (1.83)       1.23 (1.82)
Math        Overall                                1.54 (1.30)        1.15 (1.53)         .24 (1.33)        .29 (1.28)
            Mother Black, non-Hispanic             6.84 (1.86)        6.84 (2.09)        4.05 (1.86)       4.13 (1.85)
            Either parent Black, non-Hispanic      6.04 (1.83)        5.78 (2.09)        2.97 (1.85)       2.58 (1.82)

NOTE: Standard errors are in parentheses. The dependent variable is the test score NPR. The reported coefficient is the coefficient on the voucher offer dummy. All regressions control for 30 randomization strata dummies; models in the first column also control for baseline math and reading test scores. Bootstrap standard errors are robust to correlation in residuals among students in the same family. Bold font indicates that the absolute t ratio exceeds 1.96.

Sample sizes in year 1 for the subsample with baseline scores/without are: overall, 1455/2080; mother Black, 623/883; either parent Black, 684/968.

Sample sizes in year 3 for the subsample with baseline scores/without are: overall, 1250/1801; mother Black, 519/733; either parent Black, 572/807.

Sample sizes in year 1 for the subsample from low schools without baseline scores are: overall, 1802; mother Black, 770; either parent Black, 843.

Sample sizes in year 3 for the subsample from low schools without baseline scores are: overall, 1569; mother Black, 647; either parent Black, 712.

randomized experiment. When the sample is expanded to include those with missing baseline scores in column 3, however, the ITT effect falls almost in half in year 3 for Black students, to 2.65 points, with a t ratio of 1.58. The effect on the math score is statistically significant, as Barnard et al. emphasize. Qualitatively similar results are found if baseline covariates, such as mother's education and family income, are included as regressors (see Krueger and Zhu 2003).

The classification of Black students used in the study is somewhat idiosyncratic. Race and ethnicity were collected in a single question in the parental survey—contrary to the OMB guidelines for government surveys—so it is impossible to identify Blacks of Hispanic origin (of which there are a large number in New York). In addition, the survey did not directly ask about the child's race or ethnicity, so children's race must be

inferred from parents' race/ethnicity. A broader definition of Black students—and one that probably comes closer to the definition used in government surveys and most other educational research—would also include those who have a Black father in the sample of Black students. According to the 1990 Census, race was reported as Black for 85% of the children with a Black father and a Hispanic mother (the overwhelming non-Black portion of the sample) in the New York metropolitan region.

Hence we also present results for a sample in which either parent is listed as "Black, non-Hispanic." This increases the sample of Black students by about 10%, and the results are even weaker for this sample. For example, the ITT estimate using the more comprehensive sample of Black students enrolled in grades K–4 at baseline is 1.87 points (t = 1.11) on the third-year follow-up composite test, although even with this sample


the effect on math is significant at the 10% level in year 1, as Barnard et al. have found, and almost significant in year 3. So we conclude that the effect of the opportunity to use a private school voucher on the composite score for the most comprehensive sample of Black students is insignificantly different from 0, although it is possible that there was initially a small beneficial effect on the math score for Black students.

In the final column, we present results for the subsample of students originally enrolled in schools with average test scores below the median score in New York City. In each case, the results for this subsample are quite similar to those for the full sample, and a formal test of the null hypothesis that students from low- and high-achieving schools benefit equally from vouchers is never close to rejecting. It also appears very unlikely that the differences between the treatment effects for applicants from low- and high-achieving schools in Barnard et al.'s Table 4 are statistically significant either.

3. WHAT WAS BROKEN

We agree with Barnard et al. that the experiment was broken, in the sense that attrition and missing data were common. Previous analyses were also strained, if not broken, for their neglect of the cohort of students originally in kindergarten, who were in third grade at the end of the experiment and whose follow-up test scores were ignored. Including these students qualitatively alters the results for Black students. The experiment was also broken in the sense that years passed before correct baseline weights were computed.

We disagree, however, with the interpretation that the experiment was broken because compliance was less than 100%. This depends on the question that one is interested in answering. Because most interest in the experiment for policy purposes centers on the impact of offering vouchers on achievement—not on compelling students to use vouchers—we think the ITT estimates, which reflect inevitable partial usage of vouchers, are most relevant (see also Rouse 1998; Angrist et al. 2003). Moreover, New York had a higher voucher take-up rate than the experiments in Dayton and the District of Columbia, so one could argue that the New York experiment provides an upper-bound estimate of the effect of offering vouchers. On the other hand, if the goal is to use the experiment to estimate the effect of attending private school on achievement, then such methods as instrumental variables or those used by Barnard et al. are necessary to estimate the parameter of interest. We consider this of secondary interest in this case, however.

ADDITIONAL REFERENCES

Angrist, J., Bettinger, E., Bloom, E., King, E., and Kremer, M. (2003), "Vouchers for Private Schooling in Colombia: Evidence from a Randomized Natural Experiment," American Economic Review, 92, 1535–1558.

Cochran, W., and Cox, G. (1957), Experimental Designs (2nd ed.), New York: Wiley.

Krueger, A., and Zhu, P. (2003), "Another Look at the New York City School Voucher Experiment," American Behavioral Scientist, forthcoming. Also available from Industrial Relations Section, Princeton University, at http://www.irs.princeton.edu.

Peterson, P., Myers, D., and Howell, W. (1998), "An Evaluation of the New York City Scholarships Program: The First Year," Mathematica Policy Research, Inc., available at http://www.ksg.harvard.edu/pepg/pdf/ny1rpt.pdf.

Rouse, C. (1998), "Private School Vouchers and Student Achievement: An Evaluation of the Milwaukee Parental Choice Program," The Quarterly Journal of Economics, 113, 553–602.

Comment

Richard A. BERK and Hongquan XU

1. INTRODUCTION

The article by Barnard, Frangakis, Hill, and Rubin (hereafter BFHR) is a virtuoso performance. By applying the Neyman–Rubin model of causal effects and building on a series of important works on the estimation and interpretation of causal effects (e.g., Angrist, Imbens, and Rubin 1996), BFHR manage to extract a reasonable set of findings for a "broken" randomized experiment. The article more generally underscores the important difference between estimating relationships in a consistent manner and interpreting those estimates in causal terms; obtaining consistent estimates is only part of the enterprise. Finally, the article emphasizes that assumptions made so that proper estimates may be obtained are not just technical moves of convenience. Rather, they are statements about how the empirical world is supposed to work. As such, they need to be examined

Richard A. Berk is Professor and Hongquan Xu is Assistant Professor, Department of Statistics, University of California, Los Angeles, CA 90095 (E-mail: berk@stat.ucla.edu).

concretely with reference to the phenomena being studied, evaluated empirically whenever possible, and, at a minimum, subjected to sensitivity tests.

As battle-tested academics, we can certainly quibble here and there about some aspects of the article. For example, might it not have made sense to construct priors from the educators and economists who were responsible for designing the intervention? Moreover, might it not have made sense to interpret the results using a yardstick of treatment effects that are large enough to matter in practical terms? Nonetheless, we doubt that overall we could have done as well as BFHR did. Moreover, most of our concerns depend on features of the data that cannot be known from a distance. We would have had to carefully examine the data ourselves. As a result, we focus on why a virtuoso performance was needed to begin with. Why was this trip necessary?


2. NONCOMPLIANCE

The implementation of randomized field experiments, as desirable as they are, invites any number of well-known problems (Berk 1990). One of these problems is that human subjects often do not do what they are told, even when they say they will. There is recent scholarly literature on noncompliance (Metry and Meyer 1999; Johnson 2000) and a citation trail leading back to evaluations of the social programs of the War on Poverty (Havemen 1977). Some noncompliance can be expected except when there is unusual control over the study subjects. Our recent randomized trial of the inmate classification system used by the California Department of Corrections is one example (Berk, Ladd, Graziano, and Baek 2003).

An obvious question follows: Why does it seem that the individuals who designed this experiment were caught by surprise? As best we can tell, little was done to reduce noncompliance and, more important for the analyses of BFHR, apparently almost no data were collected on factors that might be explicitly related to noncompliance (Hill, Rubin, and Thomas 2000).

Without knowing far more about the design stages of the experiment, we find it difficult to be very specific. But from the literatures on compliance and school choice, the following are probably illustrative indicators relevant for parents who would not be inclined to use a school voucher; for other forms of noncompliance, other measures would be needed:

1. Whether the adults in the household are employed outside the home during the day

2. Whether there is any child care at the private schools before classes begin and after they end, for children of employed parents

3. A household's ability to pay for private school beyond the value of the voucher

4. Travel time to the nearest private schools

5. Availability of transportation to those schools

6. Religious preference of the household and religious affiliation of the private school

From these examples, and others that individuals closer to the study would know about, one can imagine a variety of measures to reduce noncompliance. For instance, key problems for any working parent are child care before and after school and round-trip school transportation. These might be addressed directly with supplemental support. If no such resources could be obtained, then the study design might be altered to screen out, before random assignment, households for whom compliance was likely to be problematic.

3. MISSING DATA

No doubt BFHR know more about the subtleties of the missing data than we do. Nevertheless, the same issues arise for missing data that arise for noncompliance. Surely problems with missing data could have been anticipated. Factors related to nonresponse to particular items could have been explicitly measured, and prevention strategies perhaps could have been deployed.

4. DESIGN EFFICIENCY

A carefully designed experiment not only provides a solid basis for statistical inference, but also gives credible evidence for causation. The New York School Choice Scholarships Program (NYSCSP) was apparently the first good example of a randomized experiment to examine the potential benefits of vouchers for private schools. Offering a scholarship (treatment) randomly through a lottery can (1) provide a socially acceptable way to ration scarce resources, (2) protect against undesirable selection bias, and (3) produce balance between the observed and unobserved variables that may affect the response. In addition to the randomization and blocking, NYSCSP implemented a PMPD (propensity matched-pairs design) to choose a control group from a large candidate pool and to balance many covariates explicitly. Table 2 of BFHR shows that various background variables were balanced properly between the treatment group and the control group.

But might it have been possible to do a bit better? BFHR did not mention that the sample size of the applicant's school was not balanced, although winners from high and low schools were balanced between the treatments and the controls. An ideal design would have had equal winners from "bad" (or low) and "good" (or high) public schools. In the NYSCSP, 85% of the winners were from bad schools and 15% were from good schools, to achieve the social goals of the School Choice Scholarships Foundation (Hill et al. 2000, p. 158). Bad schools are defined as "those for which the average test scores were below the median test scores for the city" (Hill et al. 2000, p. 158). A direct consequence is that the estimates from the good schools have much larger variation than those from the bad schools, which was evident in BFHR's Tables 4–6. The confidence intervals for good schools are wider.

Instead of classifying schools into two classes (bad and good), it might have been helpful to classify them into four classes: very bad, bad, good, and very good. To estimate the effect of applicant's school (assuming linear effects), an optimal design would assign half of the winners to the very bad schools, the other half to the very good schools, and none to the two middle classes. Avoiding the middle schools enlarges the distance between bad and good schools and also reduces the source variation in the bad and good schools, and thus reduces the variation in the estimates. To be more consistent with the goals of the School Choice Scholarships Foundation, an alternative design might take 75% of winners from very bad schools, 10% from bad schools, 5% from good schools, and 10% from very good schools. This design would have a higher design efficiency than the original design, especially because the applicant's school was the most important design variable after family size (Hill et al. 2000, p. 160). Of course, people on the ground at the time may have considered these options and rejected them. Our point is that with better planning, it might have been possible to squeeze more efficiency out of the study design. Such an outcome would have made the job undertaken by BFHR a bit easier.
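The logic is the usual one for the sampling variance of an OLS slope, $\sigma^2 / \sum_i (x_i - \bar{x})^2$: pushing design mass toward the extremes increases $\sum_i (x_i - \bar{x})^2$. A toy calculation, with hypothetical equally spaced codes for the school classes:

```python
import numpy as np

def slope_var(x, sigma2=1.0):
    """Sampling variance of the OLS slope for a given design allocation x."""
    x = np.asarray(x, dtype=float)
    return sigma2 / np.sum((x - x.mean()) ** 2)

# 100 winners; classes coded -3 (very bad), -1, 1, 3 (very good) -- assumed codes.
two_class = [-1] * 85 + [1] * 15          # 85%/15% split over two classes
extremes  = [-3] * 50 + [3] * 50          # half very bad, half very good
print(slope_var(extremes) / slope_var(two_class))   # well below 1
```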

5. CONCLUSIONS

Scientific problems often provide a fertile ground for important statistical developments; BFHR's article and the literature on which they depend are excellent illustrations. However, from


the perspective of public policy, the goal of program evaluation is to obtain usefully precise estimates of program impacts while using the most robust means of analysis available. Simplicity is also important, both for data analysts, who must do the work, and for policy makers, who must interpret the results. It follows that large investments at the front end of an evaluation—in research design, measurement, and study integrity—are essential. Virtuoso statistical analyses can certainly help when these investments are insufficient or when they fail, but they must not be seen as an inexpensive alternative to doing the study well to begin with.

ADDITIONAL REFERENCES

Angrist, J., Imbens, G. W., and Rubin, D. B. (1996), "Identification of Causal Effects Using Instrumental Variables" (with discussion), Journal of the American Statistical Association, 91, 444–472.

Berk, R. A. (1990), "What Your Mother Never Told You About Randomized Field Experiments," in Community-Based Care of People With AIDS: Developing a Research Agenda, AHCPR Conference Proceedings, Washington, DC: U.S. Department of Health and Human Services.

Berk, R. A., Ladd, H., Graziano, H., and Baek, J. (2003), "A Randomized Experiment Testing Inmate Classification Systems," Journal of Criminology and Public Policy, 2, 215–242.

Havemen, R. H. (ed.) (1977), A Decade of Federal Antipoverty Programs: Achievements, Failures, and Lessons, New York: Academic Press.

Hill, J. L., Rubin, D. B., and Thomas, N. (2000), "The Design of the New York School Choice Scholarships Program Evaluation," in Research Design: Donald Campbell's Legacy (Vol. II), ed. L. Bickman, Thousand Oaks, CA: Sage.

Johnson, S. B. (2000), "Compliance Behavior in Clinical Trials: Error or Opportunity?," in Promoting Adherence to Medical Treatment in Chronic Childhood Illness: Concepts, Methods, and Interventions, ed. D. Drotar, Mahwah, NJ: Lawrence Erlbaum.

Metry, J., and Meyer, U. A. (eds.) (1999), Drug Regimen Compliance: Issues in Clinical Trials and Patient Management, New York: Wiley.

Rejoinder

John BARNARD, Constantine E. FRANGAKIS, Jennifer L. HILL, and Donald B. RUBIN

We thank the editorial board and the thoughtful group of discussants that they arranged to comment on our article. The discussants have raised many salient issues and offered useful extensions of our work. We use this rejoinder primarily to address a few points of contention.

Muthén, Jo, and Brown

Muthén, Jo, and Brown (henceforth MJB) provide an enlightening discussion of an alternate and potentially complementary approach to treatment effect estimation. The growth models that they discuss are for data more extensive than what we analyzed, and so are beyond the scope of our article. Nevertheless, MJB propose an interesting idea for looking for interactions in longitudinal experimental data, and we look forward to future applications of the method to appropriate datasets.

Although models that incorporate both our compliance principal strata as well as latent growth trajectories are theoretically possible (assuming multiple years of outcome data), and could potentially yield an extremely rich description of the types of treatment effects manifested, the more time points that exist, the more complicated the compliance patterns can become (at least without further assumptions), and it may be difficult to find a dataset rich enough to support all of the interactions. Given a choice, we feel that it is probably more beneficial to handle real, observed issues than to search for interactions with latent trajectories, but certainly it would always be nice to be able to do both. Of course, such a search for interactions in trajectories is more beneficial when there is enough prior knowledge about the scientific background to include additional structural assumptions that can support robust estimation.

MJB make an interesting point about the exploration of potential violations of exclusion for never-takers. Such analyses have been done in past work by some of the authors (Hirano et al. 2000; Frangakis, Rubin, and Zhou 2002), but were not pursued for our study because the robustness of such methods is still an issue in examples as complex as this one.

Regarding the trade-offs between the joint latent ignorability and compound exclusion assumptions versus the joint assumptions of standard ignorability and the standard exclusion restriction, this topic is up for debate. We would like to point out, however, that these standard structural assumptions come with their own level of arbitrariness. These issues have been explored by Jo (2002) and Mealli and Rubin (2002, 2003).

We thank MJB for providing a thought-provoking discussion of our model and assumptions, as well as ample fodder for additional directions that can be explored in this arena. Clearly, the general category of work has stimulated much new and exciting modeling.

Krueger and Zhu

Krueger and Zhu (henceforth KZ) focus primarily on the underlying study and on analyses that either we did not do or that used data we did not have. Clearly, they are interested in the substantive issues, which is admirable, but we cannot be responsible for analyses that we neither did nor were consulted on, and we are not prepared to comment on what analyses we might have undertaken with data that we did not have.

KZ state that "there is no substitute for probing the definitions and concepts that underlie the data." Although we are not entirely sure what this refers to in particular, if it means that good science is more important than fancy techniques, then we certainly agree.

Another comment by KZ is that "it is desirable to use the most recent comprehensive data for the widest sample possible." We agree that it is desirable to have access to the most comprehensive data available, but it is not necessary to use all of these data in any given analysis. Here we use only first-year data, because these were the only data available to us at the


time that we were building the model. We used only single-child families, as explained in the article, because regrettably no compliance data were collected for individual children in the multichild families.

The exclusion of kindergarten children could be considered more of a judgment call regarding the quality of the pretest scores, not only for efficiency, but also for the reasonableness of our assumptions about noncompliance and missing data. Nothing in theory would prevent us from including these children and omitting these extra covariates.

These types of trade-offs (more data vs. more appropriate data) are common in statistical analyses, and the benefits are well known, for example, for avoiding model extrapolation or when some subsets of data do not mean the same thing as others. In fact, Krueger "ignored data" in his article on the minimum wage with David Card (Card and Krueger 1994) by limiting analyses to restaurants from a similar geographic region, in the hope that employees and employers across the two comparison states would be faced with similar economic conditions, rather than looking at all restaurants in these states. This is a form of implicit matching, which we believe represented a good choice. Surely there are times when focusing on subsets of data produces more reliable results.

Furthermore, we would like to point out that KZ's assertion that "because assignment to treatment status was random (within strata), a simple comparison of means between treatments and controls, without conditioning on baseline scores, provides an unbiased estimate of the average treatment effect" is simply false, because there are missing outcomes. Even if there is perfect compliance, a simple comparison of means across treatment groups when there are missing outcomes will not lead, in general, to an unbiased estimate of the treatment effect. Assumptions need to be made about the missing-data process.
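A toy simulation illustrates the point; the outcome-dependent response mechanism below is entirely hypothetical and is chosen only to show that, with missing outcomes, randomization alone does not make the complete-case difference in means unbiased.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
z = rng.integers(0, 2, n)                    # randomized offer
y = 50 + 2.0 * z + rng.normal(0, 20, n)      # true ITT effect = 2.0
# Posttest response depends on the outcome itself (and on assignment):
p = 1.0 / (1.0 + np.exp(-(0.02 * (y - 50) + 0.5 * z)))
observed = rng.random(n) < p
itt_hat = y[observed & (z == 1)].mean() - y[observed & (z == 0)].mean()
print(itt_hat)                               # systematically differs from 2.0
```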

When noncompliance exists in addition to the missing outcome data, Frangakis and Rubin (1999) have shown that estimation even of the ITT effect requires additional assumptions. Moreover, the ITT estimate, generally considered a conservative estimate of the treatment effect, can actually be anticonservative in this scenario. Finally, contrary to what KZ seem to imply in their final paragraph, instrumental variables methods alone are not sufficient to address the problems of both noncompliance and missing data.

KZ advocate "simplicity and transparency." We wholeheartedly agree, but disagree with the examples they use to defend their stance. Their first argument is that the PMPD did not yield strong efficiency gains and that we did not account for the PMPD in the calculation of standard errors. It may well be that precision gains were relatively modest; however, the analyses that they present are by no means conclusive in this regard. First, as with KZ's other ITT analyses, we have no clue as to how missing outcome data are handled. In addition, the PMPD was also used to balance compliance and missingness. It is unclear, however, how the gains in precision due to matching the missingness and noncompliance are displayed, if at all, in KZ's results. Finally, KZ did not account for the experimental design when estimating their results. Failure to account properly for the design typically results in estimates that overstate the variances of treatment effect estimates. In this design, it is the correlation between matched-pair units that leads the sampling variances to be smaller than their estimates.
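For intuition, with $n$ matched pairs the usual paired-comparison result is

$$\operatorname{Var}(\bar{y}_T - \bar{y}_C) = \frac{\sigma_T^2 + \sigma_C^2 - 2\rho\,\sigma_T\sigma_C}{n},$$

so a positive within-pair correlation $\rho$ makes the true sampling variance smaller than an estimator that implicitly sets $\rho = 0$.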

Even if precision gains were truly modest, they of course were unknown to us at the time we debated the use of the PMPD design. KZ's argument is like the complaint of a cross-country motorist who experiences no car problems during his long drive and then bemoans the money wasted on a spare tire purchased before the trip began. We prefer to avoid potential problems by using careful design.

Importantly, KZ's criticism that we ignored this matching in the analysis is incorrect. The matching was done based on the estimated propensity score, so it follows from Rubin (1978) that assignment is ignorable conditionally given the covariates that were used to estimate it. Thus the Bayesian analysis that we conduct, which conditions on the estimated propensity score, is an approximately valid Bayesian analysis with respect to the matched assignment, assuming that our modeled relationship between the outcomes and the estimated propensity score captures the major features of the relationship to the covariates.

KZ's second argument is that use of the PMPD "led to grave consequences," because sampling weights were miscalculated by MPR. It seems to us odd to maintain, as KZ do, that the PMPD "caused" this error; nor is it clear that use of another design would have avoided the error. In any case, we have rerun our analyses, this time generalizing to the population of which the children in our analyses are representative (by setting constant weights in step 3 of Sec. 8.1.1). For this population, all results are very similar to our original results. These results are posted as Appendix C at http://biosun01.biostat.jhsph.edu/~cfrangak/papers/sc.

A third argument revolves around the protracted debate between use of ITT versus treatment-targeted measures, such as CACE. We addressed both ITT and CACE, which allows the reader to focus on one or the other or both, depending on which effect is expected to generalize to another environment: the ITT or "sociological" effect of accepting or not accepting an offered voucher, or the CACE or "structural" or "scientific" effect of using a voucher. One could argue that the effect of attending (CACE) is the more generalizable, whereas the sociological effect (ITT) is more easily influenced, for instance, by "hype" about how well the program works. That is, the ITT estimate will not reflect the "effect of offering" if, the next time the program is offered, the compliance rates change (which seems likely). We prefer to examine both, to provide a more comprehensive picture.

In sum, we agree that simplicity and transparency are important; this is the major motivation for our detailed description of our assumptions in terms of readily understandable concepts, rather than commonly invoked and less intuitive conditions involving correlated error terms. However, this desire for simplicity must be balanced with the prudence of guarding against potential problems through careful design.

Although we agree in theory with much of what KZ espouse, some of their criticisms seem misguided. Moreover, their presentation of analyses with incomplete exposition regarding missing-data complications makes these analyses difficult for us to evaluate properly.


Berk and Xu

Berk and Xu (henceforth BX) raise several interesting questions about the way in which the evaluation was designed and implemented. Many of these are actually addressed in a preceding and lengthier article by us on this topic (Barnard et al. 2002). That article describes, for instance, MPR's efforts to reduce missing data, without which the missingness rates would surely have been much higher. It also describes special challenges faced in this study. But in any study with human subjects, it is difficult to completely eliminate missing data, particularly if one desires more interesting information than can be obtained from administrative data. One of the challenges for the collection of outcome test scores was the fact that children needed to be tested outside of the public school setting, which allowed the use of a standardized test not used by the public school system, thereby greatly reducing the possibility of bias resulting from teachers "teaching to the test." Consequently, parents had to bring their children to testing areas on weekends—a formidable hurdle for some, even when monetary incentives were used.

An important consideration with this study was that the evaluators took the initiative to investigate a program that was sponsored by an organization (SCSF) whose primary goal was to provide scholarships, not to do research. We feel that MPR did an admirable job convincing the SCSF to allow the study to be evaluated and raising money for this purpose, but this also meant that the evaluators could not control every aspect of the program. For example, the SCSF originally only wanted to make the program available to children from the lower-test-score schools, and the evaluators had to convince them to also allow a small percentage from those in higher-test-score schools, to increase the generalizability of the results. The designs that BX suggest, even though they have certain desirable statistical properties under appropriate assumptions, were not an option with this study.

Related to this issue of the alignment of the evaluators' and program administrators' goals is the issue of the potential for reducing noncompliance rates. Some effort was made, in that help was provided to scholarship winners in finding an appropriate private school. Arguably, however, 100% compliance should not be the goal in such a study, given that we would not expect 100% compliance if the program were expanded and offered, say, by the government on a larger scale. Thus it may not be wise to use extreme efforts to increase compliance if such efforts would not be offered in the large-scale public case. With the current study, the opportunity exists to examine the noncompliance process for this program in this setting: take-up rates, predictors of noncompliance, and so on. To this end, some sort of subexperiment that randomizes a reasonable set of compliance incentives might be useful in similar studies in the future, if sample sizes are sufficient.

We are not convinced that we should have solicited actual prior distributions from educators and economists. In our experience, this is a confusing, difficult, and often unrewarding task. Moreover, school choice is such a politically charged and contentious issue that it seems wiser to try to use relatively diffuse and objective prior distributions.

We agree that results should be interpreted in terms of a yardstick of practical effects. We are unaware of the most appropriate such yardstick to use in this context. One side could claim that even tiny average effects matter, not only because of the cumulative effects on parts of society, but also because of possible nonadditivity of such effects, which could imply huge effects for some rare individuals and minute effects for all others. Others could argue that seemingly quite large effects have no long-run practical impact on quality of life. The results that we present provide a basic empirical starting point for these debates between people whose expertise is in this area.

We thank BX for highlighting our emphasis on the role of the assumptions in our analyses. Our hope is that we have presented the assumptions in such a way as to enable readers to make their own judgment about whether or not they were likely to be satisfied, rather than only stating them technically, which would implicitly make those important decisions for the interested but nontechnical readers.

BX point out a dearth of certain types of covariates. Part of the problem here is that we presented only a small portion of the data collected, due to the fact that our model in its current form did not handle many covariates. However, there were indeed discussions between some members of our team and MPR staff, during the development of the questionnaire, about what types of variables could be included that would be predictive of missing data or compliance, as well as outcomes. Some of the variables proposed were discarded due to the surveyors' experience with them as subject to problems such as high measurement error. Also, not every variable discussed was included, due to a desire for a survey that would not be overly burdensome for the study families.

We raise a final, small, semantic question. We wonder about the propriety of the label "Neyman–Rubin model" to describe a framework for causal inference that views observational studies and randomized experiments on the same continuum and, moreover, encourages Bayesian analyses of both. Neyman (1923) certainly seems to have been the first to use the formal notation of potential outcomes for randomization-based inference in randomized experiments, which was a truly major advance, but neither he nor anyone else, to the best of our knowledge, used this formal notation to define causal effects in observational studies until Rubin (1974, 1975), nor did he nor anyone else ever advocate Bayesian inference for the unobserved potential outcomes until Rubin (1975, 1978), which is the perspective that we have implemented here.

Overall, we agree with BX about the paramount importance of study design, and we tried to design this study carefully, certainly more so than most social science analyses of secondary datasets like the PSID or CPS. We thank BX for their thoughtful commentary; they raise some important points.

We thank all of the discussants for stimulating comments, which we believe have added emphasis to the importance of both the substantive topic of school vouchers and the development of appropriate methods to evaluate this and other such broken randomized experiments.

ADDITIONAL REFERENCES

Card, D., and Krueger, A. (1994), "Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania," American Economic Review, 84, 772–793.


Jo, B. (2002), "Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications" (with discussion), Journal of Educational and Behavioral Statistics, 27, 385–420.

Mealli, F., and Rubin, D. B. (2002), Discussion of "Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications," by B. Jo, Journal of Educational and Behavioral Statistics, 27, 411–415.

Mealli, F., and Rubin, D. B. (2003), "Assumptions Allowing the Estimation of Direct Causal Effects: Discussion of 'Healthy, Wealthy, and Wise? Tests for Direct Causal Paths Between Health and Socioeconomic Status,' by Adams et al.," Journal of Econometrics, 112, 79–87.

Rubin, D. B. (1975), "Bayesian Inference for Causality: The Importance of Randomization," in Proceedings of the Social Statistics Section, American Statistical Association, pp. 233–239.


offered the scholarship. For this reason, we take the CACE (Imbens and Rubin 1997), defined as the comparison of outcomes $Y_i(0)$ versus $Y_i(1)$ among the principal stratum of compliers, to represent the effect of attending public versus private school.

6.4 Latent Ignorability

We assume that potential outcomes are independent of missingness given observed covariates, conditional on the compliance strata; that is,

$$p\bigl(R_y(0), R_y(1) \mid R_x, Y(1), Y(0), X^{\mathrm{obs}}, W, C, \theta\bigr) = p\bigl(R_y(0), R_y(1) \mid R_x, X^{\mathrm{obs}}, W, C, \theta\bigr).$$

This assumption represents a form of latent ignorability (LI) (Frangakis and Rubin 1999), in that it conditions on variables that are (at least partially) unobserved, or latent—here, the compliance principal strata C. We make it here, first, because it is more plausible than the assumption of standard ignorability (SI) (Rubin 1978a; Little and Rubin 1987), and second, because making it leads to different likelihood inferences.

LI is more plausible than SI to the extent that it provides a closer approximation to the missing-data mechanism. The intuition behind this assumption in our study is that for a subgroup of people with the same covariate missing-data patterns Rx, similar values for covariates observed in that pattern Xobs, and the same compliance stratum C, a flip of a coin could determine which of these individuals shows up for a posttest. This is a more reasonable assumption than SI, because it seems quite likely that, for example, compliers would exhibit different attendance behavior for posttests than, say, never-takers (even conditional on other background variables). Explorations of raw data from this study across individuals with known compliance status provide empirical support that C is a strong factor in outcome missingness, even when other covariates are included in the model. This fact is also supported by the literature for noncompliance (see, e.g., The Coronary Drug Project Research Group 1980).

Regarding improved estimation when LI (and the preceding structural assumptions) hold but the likelihood is constructed assuming SI, the underlying causal effects are identifiable (alternatively, the posterior distribution with increasing sample size converges to the truth) only if the additional assumption is made that, within subclasses of subjects with similar observed variables, the partially missing compliance principal stratum C is not associated with the potential outcomes. However, as noted earlier, this assumption is not plausible.

Theoretically, only the structural assumptions described earlier are needed to identify the underlying ITT and CACE causal effects (Frangakis and Rubin 1999). Estimation based solely on those identifiability relations in principle requires very large sample size and explicit conditioning (i.e., stratification) on the subclasses defined by the observed part of the covariates Xobs and the pattern of missingness of the covariates Rx, as well as implicit conditioning (i.e., deconvolution of mixtures) on the sometimes missing compliance principal stratum C. This works in large samples because covariate missingness and compliance principal stratum are also covariates (i.e., defined pretreatment), so samples within subclasses defined by Xobs, Rx, and C themselves represent randomized experiments. With our experiment's sample size, however, performing completely separate analyses on all of these strata is not necessarily desirable or feasible. Therefore, we consider more parsimonious modeling approaches, which have the role of assisting, not creating, inference, in the sense that results should be robust to different parametric specifications (Frangakis and Rubin 2001, rejoinder).

7. PARAMETRIC PATTERN MIXTURE MODEL

Generally speaking, constrained estimation of separate analyses within missing-data patterns is the motivation behind pattern mixture modeling. Various authors have taken pattern mixture model approaches to missing data, including Little (1993, 1996), Rubin (1978b), and Glynn, Laird, and Rubin (1993). Typically, pattern mixture models partition the data with respect to the missingness of the variables. Here we partition the data with respect to the covariate missing-data patterns Rx, as well as compliance principal strata C, design variables W, and the main covariates Xobs. This represents a partial pattern mixture approach. One argument in favor of this approach is that it focuses the model on the quantities of interest, in such a way that parametric specifications for the marginal distributions of Rx, W, and Xobs can be ignored. To capitalize on the structural assumptions, consider the factorization of the joint distribution for the potential outcomes and compliance strata, conditional on the covariates and their missing-data patterns:

$$
\begin{aligned}
p\bigl(Y_i(0), Y_i(1), R_{y,i}(0), R_{y,i}(1), C_i \mid W_i, X_i^{\mathrm{obs}}, R_{x,i}, \theta\bigr)
&= p\bigl(C_i \mid W_i, X_i^{\mathrm{obs}}, R_{x,i}, \theta^{C}\bigr) \\
&\quad \times p\bigl(R_{y,i}(0), R_{y,i}(1) \mid W_i, X_i^{\mathrm{obs}}, R_{x,i}, C_i, \theta^{R}\bigr) \\
&\quad \times p\bigl(Y_i(0), Y_i(1) \mid W_i, X_i^{\mathrm{obs}}, R_{x,i}, C_i, \theta^{Y}\bigr),
\end{aligned}
$$

where the product in the last line follows by latent ignorability and $\theta = (\theta^{C}, \theta^{R}, \theta^{Y})$. Note that the response pattern of covariates for each individual is itself a covariate. The parametric specifications for each of these components are described next.
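As a schematic of how this factorization composes into a complete-data log-density, consider the following self-contained sketch for a single subject with one scalar covariate; the link functions and the names in theta are simplified placeholders, not the specifications of Sections 7.1–7.3.

```python
import numpy as np
from scipy.stats import norm

def log_joint(y, ry, c, x, z, theta):
    """One subject's contribution under the factorization above:
    p(C | X) * p(R_y | X, C) * p(Y | X, C), with probit compliance and
    response pieces and a normal outcome piece (toy link functions)."""
    # p(C | X): sequential probits; c = 0 never-taker, 1 complier, 2 always-taker
    p_n = norm.cdf(-(theta['a0'] + theta['a1'] * x))
    p_c = (1 - p_n) * norm.cdf(-(theta['b0'] + theta['b1'] * x))
    log_pC = np.log([p_n, p_c, 1 - p_n - p_c][c])
    # p(R_y | X, C, z): probit outcome-response model
    eta = theta['r0'] + theta['r1'] * x + theta['rC'][c] + theta['rz'] * z
    log_pR = norm.logcdf(eta) if ry else norm.logcdf(-eta)
    # p(Y | X, C, z): normal outcome, entering only when the outcome is observed
    mu = theta['m0'] + theta['m1'] * x + theta['mC'][c] + theta['mz'] * z
    log_pY = norm.logpdf(y, mu, theta['sd']) if ry else 0.0
    return log_pC + log_pR + log_pY
```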

7.1 Compliance Principal Stratum Submodel

The compliance status model contains two conditional probit models, defined using indicator variables $C_{i,c}$ and $C_{i,n}$ for whether individual i is a complier or a never-taker:

$$C_{i,n} = 1 \quad \text{if } C_{i,n}^{*} \equiv g_1(W_i, X_i^{\mathrm{obs}}, R_{x,i})'\beta^{C,1} + V_i \le 0$$

and

$$C_{i,c} = 1 \quad \text{if } C_{i,n}^{*} > 0 \text{ and } C_{i,c}^{*} \equiv g_0(W_i, X_i^{\mathrm{obs}}, R_{x,i})'\beta^{C,2} + U_i \le 0,$$

where $V_i \sim N(0,1)$ and $U_i \sim N(0,1)$ independently.

The link functions g0 and g1 attempt to strike a balance between, on the one hand, including all of the design variables as well as the variables regarded as most important either in predicting compliance or in having interactions with the treatment effect, and, on the other hand, maintaining parsimony. The results discussed in Section 8 use a compliance component model whose link function g1 is linear in, and fits distinct parameters for, an intercept; applicant's school (low/high); indicators for application period; propensity scores for subjects applying in


the PMPD period, and propensity scores for the other periods; indicators for grade of the student; ethnicity (1 if the child or guardian identifies herself as African-American, 0 otherwise); an indicator for whether or not the pretest scores of reading and math were available; and the pretest scores (reading and math) for the subjects with available scores. A propensity score for the students not in the PMPD period is not necessary from the design standpoint and is in fact constant. However, to increase efficiency and to reduce bias due to missing outcomes, here we include an "estimated propensity score value" for these periods, calculated as the function derived for the propensity score for students in the PMPD period and evaluated at the covariate values for the students in the other periods as well. Also, the foregoing link function g0 is the same as g1, except that it excludes the indicators for application period, as well as the propensity scores for applicants who did not apply in the PMPD period (i.e., those for whom the propensity score was not a design variable). This link function, a more parsimonious version of one we used in earlier models, was more appropriate to fit the relatively small proportion of always-takers in this study. Finally, because the pretests were either jointly observed or jointly missing, one indicator for missingness of pretest scores is sufficient.

The prior distributions for the compliance submodel are

$$\beta^{C,1} \sim N\bigl(\beta_0^{C,1}, \{\sigma^{C,1}\}^2 I\bigr) \quad \text{and} \quad \beta^{C,2} \sim N\bigl(0, \{\sigma^{C,2}\}^2 I\bigr),$$

independently, where $\{\sigma^{C,1}\}^2$ and $\{\sigma^{C,2}\}^2$ are hyperparameters set at 10, and $\beta_0^{C,1}$ is a vector of 0s, with the exception of the first element, which is set equal to

$$-\Phi^{-1}(1/3)\,\Bigl\{ \{\sigma^{C,1}\}^2 \Bigl( n^{-1} \textstyle\sum_{i}^{n} g_{1,i}' g_{1,i} \Bigr) + 1 \Bigr\}^{1/2},$$

where $g_{1,i} = g_1(W_i, X_i^{\mathrm{obs}}, R_{x,i})$ and n is our sample size. These priors reflect our a priori ignorance about the probability that any individual belongs to each compliance status, by approximately setting each of their prior probabilities at 1/3.
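As a quick numeric check of this construction, with a hypothetical average value of $g_{1,i}'g_{1,i}$ (here taken to be 5), the implied prior probability of the never-taker stratum is indeed about 1/3:

```python
import numpy as np
from scipy.stats import norm

var_beta, avg_gg = 10.0, 5.0          # {sigma_C1}^2 = 10; average g'g is assumed
scale = np.sqrt(var_beta * avg_gg + 1.0)
mu0 = -norm.ppf(1.0 / 3.0) * scale    # prior mean of the intercept
print(norm.cdf(-mu0 / scale))         # P(C_in* <= 0) under the prior ~ 0.3333
```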

7.2 Outcome Submodel

We first describe the marginal distribution for the math outcome, $Y^{\mathrm{math}}$. To address the pile-up of many scores of 0, we posit the censored model

$$
Y_i^{\mathrm{math}}(z) =
\begin{cases}
0 & \text{if } Y_i^{\mathrm{math},*}(z) \le 0, \\
100 & \text{if } Y_i^{\mathrm{math},*}(z) \ge 100, \\
Y_i^{\mathrm{math},*}(z) & \text{otherwise,}
\end{cases}
$$

where

$$
Y_i^{\mathrm{math},*}(z) \mid W_i, X_i^{\mathrm{obs}}, R_{x,i}, C_i, \theta^{\mathrm{math}}
\sim N\Bigl( g_2(W_i, X_i^{\mathrm{obs}}, R_{x,i}, C_i, z)'\beta^{\mathrm{math}},\;
\exp\bigl\{ g_3(X_i^{\mathrm{obs}}, R_{x,i}, C_i, z)'\zeta^{\mathrm{math}} \bigr\} \Bigr)
$$

for $z = 0, 1$, and $\theta^{\mathrm{math}} = (\beta^{\mathrm{math}}, \zeta^{\mathrm{math}})$. Here $Y_i^{\mathrm{math},*}(0)$ and $Y_i^{\mathrm{math},*}(1)$ are assumed conditionally independent, an assumption that has no effect on inference for our superpopulation parameters of interest (Rubin 1978a).
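For intuition, a minimal sketch of simulating from this censored specification (the mean and variance values are arbitrary illustrations) shows how the pile-up of 0 scores arises:

```python
import numpy as np

def draw_censored_score(mean, var, size, rng=np.random.default_rng(0)):
    """Latent normal percentile score, censored to the observable [0, 100]."""
    return np.clip(rng.normal(mean, np.sqrt(var), size), 0.0, 100.0)

scores = draw_censored_score(mean=15.0, var=400.0, size=10_000)
print((scores == 0.0).mean())   # roughly Phi(-0.75), about 0.23, piles up at 0
```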

The results reported in Section 8 use an outcome component model whose outcome mean link function g2 is linear in distinct parameters for the following:

1. For the students of the PMPD design: an intercept, applicant's school (low/high), ethnicity, indicators for grade, an indicator for whether or not the pretest scores were available, pretest score values for the subjects with available scores, and propensity score.

2. For the students of the other periods: the variables in item 1, along with indicators for application period.

3. An indicator for whether or not a person is an always-taker.

4. An indicator for whether or not a person is a complier.

5. For compliers assigned treatment: an intercept, applicant's school (low/high), ethnicity, and indicators for the first three grades (with the variable for the fourth-grade treatment effect as a function of the already-included variables).

For the variance of the outcome component, the link function g3 includes indicators that saturate the cross-classification of whether or not a person applied in the PMPD period and whether or not the pretest scores were available. This dependence is needed because each pattern conditions on a different set of covariates; that is, Xobs varies from pattern to pattern.

The prior distributions for the outcome submodel are

    β^math | ζ^math ~ N(0, F(ζ^math) λ I),

where

    F(ζ^math) = (1/K) Σ_k exp(ζ_k^math),

ζ^math = (ζ_1^math, ..., ζ_K^math), with one component for each of the K (in our case K = 4) subgroups defined by cross-classifying the PMPD/non-PMPD classification and the missing-data indicator for pretest scores; where λ is a hyperparameter set at 10; and exp(ζ_k^math) ~iid inv-χ²(ν, σ²), where inv-χ²(ν, σ²) refers to the distribution of the inverse of a chi-squared random variable with degrees of freedom ν (set at 3) and scale parameter σ² (set at 400, based on preliminary estimates of variances). The sets of values for these and the hyperparameters of the other model components were chosen to satisfy two criteria, namely giving posterior standard deviations for the estimands in the neighborhood of the respective standard errors of approximate maximum likelihood fits of similar preliminary likelihood models and giving satisfactory model checks (Sec. 9.2), while also producing quick enough mixing of the Bayesian analysis algorithm.

We specify the marginal distribution for the reading outcome, Y^read, in the same way as for the math outcome, with separate mean and variance regression parameters β^read and ζ^read. Finally, we allow that, conditionally on W_i, X_i^obs, R_{x,i}, C_i, the math and reading outcomes at a given assignment, Y_i^math*(z) and Y_i^read*(z), may be dependent, with an unknown correlation ρ. We set the prior distribution for ρ to be uniform on (−1, 1), independently of the remaining parameters in their prior distributions.

7.3 Outcome Response Submodel

As with the pretests, the outcomes on mathematics and reading were either jointly observed or jointly missing; thus one indicator, R_{y,i}(z), for missingness of outcomes is sufficient for each assignment z = 0, 1. For the submodel on this indicator, we also use a probit specification:

    R_{y,i}(z) = 1   if R_{y,i}*(z) ≡ g_2(W_i, X_i^obs, R_{x,i}, C_i, z)′ β^R + ε_i(z) ≥ 0,

where R_{y,i}(0) and R_{y,i}(1) are assumed conditionally independent (using the same justification as for the potential outcomes) and where ε_i(z) ~ N(0, 1). The link function of the probit model on the outcome response, g_2, is the same as the link function for the mean of the outcome component. The prior distribution for the outcome response submodel is β^R ~ N(0, {σ^R}² I), where {σ^R}² is a hyperparameter set at 10.
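A minimal sketch of this response submodel (our own Python; the design vector and coefficients are invented stand-ins for g_2(...) and β^R) generates the missingness indicator from the latent-utility representation above.

    import numpy as np

    rng = np.random.default_rng(0)

    def draw_response_indicator(g2_i, beta_R):
        """R_y(z) = 1 exactly when g2' beta_R + eps >= 0, with eps ~ N(0, 1)."""
        eps = rng.standard_normal()
        return int(g2_i @ beta_R + eps >= 0.0)

    g2_i = np.array([1.0, 0.3, -0.8])   # intercept plus two illustrative covariates
    beta_R = np.array([0.5, 1.0, 0.2])  # would carry the N(0, 10 I) prior above
    print(draw_response_indicator(g2_i, beta_R))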

8. RESULTS

All results herein were obtained from the same Bayesian analysis. We report results for the ITT and CACE estimands, for proportions of compliance principal strata, and for outcome response rates. The reported estimands are not parameters of the model, but rather are functions of parameters and data. The results are reported by applicant's school (low/high) and grade. Both of these variables represent characteristics of children that potentially could be targeted differentially by government policies. Moreover, each was thought to have possible interaction effects with treatment assignment. Except when otherwise stated, the plain numbers are posterior means, and the numbers in parentheses are 2.5 and 97.5 percentiles of the posterior distribution.

8.1 Test Score Results

Here we address two questions:

1. What is the impact of being offered a scholarship on student outcomes, namely the ITT estimand?
2. What is the impact of attending a private school on student outcomes, namely the CACE estimand?

The math and reading posttest score outcomes represent the national percentile rankings within grade. They have been adjusted to correct for the fact that some children were kept behind while others skipped a grade, because students transferring to private schools are hypothesized to be more likely to have been kept behind by those schools. The individual-level causal estimates have also been weighted so that the subgroup causal estimates correspond to the effects for all eligible children belonging to that subgroup who attended a screening session.

8.1.1 Effect of Offering the Scholarship on Mathematics and Reading. We examine the impact of being offered a scholarship (the ITT effect) on posttest scores. The corresponding estimand for individual i is defined as E(Y_i(1) − Y_i(0) | W_i^p, θ), where W_i^p denotes the policy variables, grade level and applicant's school (low/high), for that individual. The simulated posterior distribution of the ITT effect is summarized in Table 4. To draw from this distribution, we take the following steps:

1. Draw θ and C_i from the posterior distribution (see App. A).
2. Calculate the expected causal effect E(Y_i(1) − Y_i(0) | W_i^p, X_i^obs, C_i, R_{x,i}, θ) for each subject, based on the model in Section 7.2.
3. Average the values of step 2 over the subjects in the subclasses of W^p, with weights reflecting the known sampling weights of the study population for the target population (a code sketch of these steps follows Table 4).

Table 4. ITT Effect of Winning the Lottery on Math and Reading Test Scores

Grade at        Applicant's school: Low             Applicant's school: High
application     Reading           Math              Reading           Math
1               2.3 (-1.3, 5.8)   5.2 (2.0, 8.3)    1.4 (-4.8, 7.2)   5.1 (.1, 10.3)
2                .5 (-2.6, 3.5)   1.3 (-1.7, 4.3)   -.6 (-6.2, 4.9)   1.0 (-4.3, 6.1)
3                .7 (-2.7, 4.0)   3.3 (-.5, 7.0)    -.5 (-6.0, 5.0)   2.5 (-3.2, 8.0)
4               3.0 (-1.1, 7.4)   3.1 (-1.2, 7.2)   1.8 (-4.1, 7.6)   2.3 (-3.3, 7.8)
Overall         1.5 (-.6, 3.6)    3.2 (1.0, 5.4)     .4 (-4.6, 5.2)   2.8 (-1.8, 7.2)

NOTE: Year 1 postrandomization. Plain numbers are means, and numbers in parentheses are central 95% intervals of the posterior distribution of the effects on percentile rank.
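In outline, one posterior draw of the ITT effect for a subclass can be computed as below (a Python sketch under our own naming; draw_posterior and expected_effect stand in for the actual sampler of Appendix A and the mean formula of Section 7.2). Restricting the average to subjects whose current draw of C_i is "complier" gives the CACE draws of Section 8.1.2 instead.

    import numpy as np

    def itt_draw(data, weights, draw_posterior, expected_effect):
        """One draw from the posterior of the ITT effect within a W^p subclass."""
        theta, C = draw_posterior(data)   # step 1: parameters and compliance strata
        effects = [expected_effect(i, theta, C) for i in range(len(weights))]  # step 2
        return np.average(effects, weights=weights)  # step 3: weighted subclass average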

These results indicate posterior distributions with mass primarily (>97.5%) to the right of 0 for the treatment effect on math scores overall for children from low-applicant schools, and also for the subgroup of first graders. Each effect indicates an average gain of greater than three percentile points for children who won a scholarship. All of the remaining intervals cover 0. As a more general pattern, estimates of effects are larger for mathematics than for reading and larger for children from low-applicant schools than for children from high-applicant schools.

These results for the ITT estimand using the method of this article can also be contrasted with the results using a simpler method, reported in Table 5. The method used to obtain these results is the same as that used in the initial MPR study (Peterson, Myers, Howell, and Mayer 1999), but limited to the subset of single-child families and separated out by applicant's school (low/high). This method is based on a linear regression of the posttest scores on the indicator of treatment assignment, ignoring compliance data. In addition, the analysis includes the design variables and the pretest scores, and is limited to individuals for whom the pretest and posttest scores are known. Separate analyses are run for math and reading posttest scores, and the missingness of such scores is adjusted by inverse probability weighting. Finally, weights are used to make the results for the study participants generalizable to the population of all eligible single-child families who were screened. (For more details of this method, see the appendix of Peterson et al. 1999.)

Table 5. ITT Effect of Winning the Lottery on Test Scores, Estimated Using the Original MPR Method

Grade at        Applicant's school: Low              Applicant's school: High
application     Reading            Math              Reading             Math
1               -1.0 (-7.1, 5.2)   2.1 (-2.6, 6.7)    4.8 (-10.0, 19.6)   2.6 (-15.5, 20.7)
2                -.8 (-4.9, 3.2)   2.0 (-4.0, 8.0)   -3.4 (-16.5, 9.7)    2.7 (-10.3, 15.7)
3               3.2 (-1.7, 8.1)    5.0 (-.8, 10.7)   -8.0 (-25.4, 9.3)    4.0 (-17.7, 25.6)
4               2.7 (-3.5, 8.8)     .3 (-7.3, 7.9)   27.9 (8.0, 47.8)    22.7 (-1.5, 46.8)
Overall          .6 (-2.0, 3.2)    2.0 (-.8, 4.8)     1.1 (-7.0, 9.1)      .3 (-9.6, 10.1)

NOTE: Plain numbers are point estimates, and parentheses are 95% confidence intervals for the mean effects on percentile rank.


Table 6. Effect of Private School Attendance on Test Scores

Grade at        Applicant's school: Low              Applicant's school: High
application     Reading            Math              Reading            Math
1               3.4 (-2.0, 8.7)    7.7 (3.0, 12.4)   1.9 (-7.3, 10.3)   7.4 (.2, 14.6)
2                .7 (-3.7, 5.0)    1.9 (-2.4, 6.2)    -.9 (-9.4, 7.3)   1.5 (-6.2, 9.3)
3               1.0 (-4.1, 6.1)    5.0 (-.8, 10.7)    -.8 (-9.5, 7.7)   4.0 (-4.9, 12.5)
4               4.2 (-1.5, 10.1)   4.3 (-1.6, 10.1)   2.7 (-6.3, 11.3)  3.5 (-4.7, 11.9)
Overall         2.2 (-.9, 5.3)     4.7 (1.4, 7.9)      .6 (-7.1, 7.7)   4.2 (-2.6, 10.9)

NOTE: Plain numbers are means, and parentheses are central 95% intervals of the posterior distribution of the effects on percentile rank.

The results of the new method (Table 4) are generally more stable than those of the original method (Table 5), which in some cases are not even credible. In the most extreme cases, the original method estimates a 22.7-point gain [95% confidence interval (−1.5, 46.8)] in mathematics and a 27.9-point gain [95% confidence interval (8.0, 47.8)] in reading for fourth-grade children from high-applicant schools. More generally, the new method's results display a more plausible pattern in comparing effects in high-applicant versus low-applicant schools and mathematics versus reading.

8.1.2 Effect of Private School Attendance on Mathematics and Reading. We also examine the effect of offering the scholarship when focusing only on the compliers (the CACE). The corresponding estimand for individual i is defined as

    E(Y_i(1) − Y_i(0) | W_i^p, C_i = c, θ).

This analysis defines the treatment as private school attendance (Sec. 6.3). The simulated posterior distribution of the CACE is summarized in Table 6. A draw from this distribution is obtained using steps 1-3 described in Section 8.1.1 for the ITT estimand, with the exception that at step 3 the averaging is restricted to the subjects whose current draw of C_i is "complier."

The effects of private school attendance follow a pattern similar to that of the ITT effects, but the posterior means are slightly bigger in absolute value than ITT. The intervals have also grown, reflecting that these effects are for only subgroups of all children, the "compliers" in each cell. As a result, the associated uncertainty for some of these effects (e.g., for fourth-graders applying from high-applicant schools) is large.

8.2 Compliance Principal Strata and Missing Outcomes

Table 7 summarizes the posterior distribution of the estimands of the probability of being in stratum C as a function of applicant's school and grade, p(C_i = t | W_i^p, θ). To draw from this distribution, we use step 1 described in Section 8.1.1, then calculate p(C_i = t | W_i^p, X_i^obs, R_{x,i}, θ) for each subject based on the model of Section 7.1, and average these values as in step 3 of Section 8.1.1. The clearest pattern revealed by Table 7 is that, for three out of four grades, children who applied from low-applicant schools are more likely to be compliers or always-takers than are children who applied from high-applicant schools.

Table 7. Proportions (%) of Compliance Principal Strata, Across Grade and Applicant's School

Grade at       Applicant's school: Low                    Applicant's school: High
application    Never taker  Complier    Always taker     Never taker  Complier    Always taker
1              24.5 (2.9)   67.1 (3.8)   8.4 (2.4)       25.0 (5.0)   69.3 (6.1)   5.7 (3.3)
2              20.5 (2.7)   69.4 (3.7)  10.1 (2.5)       25.3 (5.1)   67.2 (6.2)   7.5 (3.4)
3              24.5 (3.2)   65.9 (4.0)   9.6 (2.5)       28.8 (5.6)   64.1 (6.7)   7.1 (3.5)
4              18.4 (3.3)   72.8 (4.6)   8.8 (3.0)       27.0 (5.5)   66.7 (6.7)   6.3 (3.6)

NOTE: Plain numbers are means, and parentheses are standard deviations of the posterior distribution of the estimands.

As stated earlier, theoretically, under our structural assumptions, standard ITT analyses or standard IV analyses that use ad hoc approaches to missing data are generally not appropriate for estimating the ITT or CACE estimands when the compliance principal strata have differential response (i.e., outcome missing-data) behaviors. To evaluate this here, we simulated the posterior distributions of

    p(R_i(z) | C_i = t, W_i^p, θ),  z = 0, 1.    (1)

To draw from the distributions of the estimands in (1), we used step 1 from Section 8.1.1 and then, for z = 0, 1, for each subject, calculated p(R_i(z) | W_i^p, X_i^obs, R_{x,i}, C_i, θ). We then averaged these values over subjects within subclasses defined by the different combinations of W_i^p and C_i.

For each draw (across individuals) from the distribution in (1), we calculate the odds ratios (a) between compliers attending public schools and never-takers; (b) between compliers attending private schools and always-takers; and (c) between compliers attending private schools and compliers attending public schools. The results (omitted) showed that response was increasing in the following order: never-takers, compliers attending public schools, compliers attending private schools, and always-takers. These results confirm that the compliance principal stratum is an important factor in response, both alone and in interaction with assignment.
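For a single posterior draw of the response probabilities in (1), the three odds ratios are computed as below (a sketch; the probability values are invented for illustration).

    def odds(p):
        return p / (1.0 - p)

    # Invented response probabilities for one posterior draw:
    p_never, p_comp_pub, p_comp_priv, p_always = .55, .62, .74, .80
    print(odds(p_comp_pub) / odds(p_never),      # (a) compliers in public vs. never-takers
          odds(p_comp_priv) / odds(p_always),    # (b) compliers in private vs. always-takers
          odds(p_comp_priv) / odds(p_comp_pub))  # (c) compliers: private vs. public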

9. MODEL BUILDING AND CHECKING

This model was built through a process of fitting and checking a succession of models. Of note in this model are the following features: a censored normal model that accommodates the pile-up of 0s in the outcomes, incorporation of heteroscedasticity, and a multivariate model to accommodate both outcomes simultaneously. It is reassuring, from the perspective of model robustness, that the results from the last few models fit, including those in the conference proceedings report of Barnard et al. (2002), are consistent with the results from this final model.

9.1 Convergence Checks

Because posterior distributions were simulated from a Markov chain Monte Carlo algorithm (App. A), it is important to assess its convergence. To do this, we ran three chains from different starting values. To initiate each chain, we set any unknown compliance stratum equal to a Bernoulli draw, with probabilities obtained from moment estimates of the probabilities of being a complier given observed attendance and randomization data (D, Z) alone. Using the initialized compliance strata for each chain, parameters were initialized to values based on generalized linear model estimates of the model components.

Each chain was run for 15,000 iterations. At 5,000 iterations, and based on the three chains for each model, we calculated the potential scale-reduction statistic (Gelman and Rubin 1992) for the 250 estimands (parameters and functions of parameters) that serve as building blocks for all other estimands. The results suggested good mixing of the chains (with the maximum potential scale-reduction statistic across parameters 1.04) and provided no evidence against convergence. Inference is based on the remaining 30,000 iterations, combining the three chains.
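For one estimand, the potential scale-reduction statistic compares between-chain and within-chain variability over the retained iterations; a minimal version of the Gelman and Rubin (1992) computation is sketched below (our own Python, not the authors' code).

    import numpy as np

    def potential_scale_reduction(chains):
        """chains: (m, n) array for one estimand, m chains by n retained iterations."""
        m, n = chains.shape
        B = n * chains.mean(axis=1).var(ddof=1)  # between-chain variance
        W = chains.var(axis=1, ddof=1).mean()    # average within-chain variance
        var_plus = (n - 1) / n * W + B / n       # pooled estimate of posterior variance
        return np.sqrt(var_plus / W)             # values near 1 suggest convergence

    rng = np.random.default_rng(1)
    print(potential_scale_reduction(rng.standard_normal((3, 10000))))  # ~1 by design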

9.2 Model Checks

We evaluate the influence of the model presented in Section 7 with six posterior predictive checks: three checks for each of the two outcomes. A posterior predictive check generally involves (a) choosing a discrepancy measure that is a function of observed data, and possibly of missing data and the parameter vector θ, and (b) computing a posterior predictive p value (PPPV), which is the probability, over the posterior predictive distribution of the missing data and θ, that the discrepancy measure in a new study drawn with the same θ as in our study would be as or more extreme than in our study (Rubin 1984; Meng 1996; Gelman, Meng, and Stern 1996).

Posterior predictive checks in general, and PPPVs in particular, demonstrate whether the model can adequately preserve features of the data reflected in the discrepancy measure, where the model here includes the prior distribution as well as the likelihood (Meng 1996). As a result, properties of PPPVs are not exactly the same as properties of classical p values under frequency evaluations conditional on the unknown θ, just as they are not exactly the same for frequency evaluations over both levels of uncertainty, that is, the drawing of θ from the prior distribution and the drawing of data from the likelihood given θ. For example, over frequency evaluations of the latter type, a PPPV is stochastically less variable than, but has the same mean as, the uniform distribution, and so tends to be more conservative than a classical p value, although the reverse can be true over frequency evaluations of the first type (Meng 1996). (For more details on the interpretation and properties of PPPVs, see also Rubin 1984 and Gelman et al. 1996.)

Table 8. Posterior Predictive Checks: p Values

        Signal   Noise   Signal to noise
Math    .34      .79     .32
Read    .40      .88     .39

The posterior predictive discrepancy measures that we choose here are functions of

    A_p^rep(z) = { Y_{i,p}^rep : I(R_{y,i,p}^rep = 1) I(Y_{i,p}^rep ≠ 0) I(C_i^rep = c) I(Z_i = z) = 1 }

for the measures that are functions of data Y_{i,p}^rep, R_{y,i,p}^rep, and C_i^rep from a replicated study, and

    A_p^study(z) = { Y_{i,p} : I(R_{y,i,p} = 1) I(Y_{i,p} ≠ 0) I(C_i = c) I(Z_i = z) = 1 }

for the measures that are functions of this study's data. Here R_{y,i,p}^rep is defined so that it equals 1 if Y_{i,p}^rep is observed and 0 otherwise, and p equals 1 for math outcomes and 2 for reading outcomes. The discrepancy measures, "rep" and "study," that we used for each outcome (p = 1, 2) were (a) the absolute value of the difference between the mean of A_p(1) and the mean of A_p(0) ("signal"); (b) the standard error, based on a simple two-sample comparison, for this difference ("noise"); and (c) the ratio of (a) to (b) ("signal to noise"). Although these measures are not treatment effects, we chose them here to assess whether the model can preserve broad features of signal, noise, and signal-to-noise ratio in the continuous part of the compliers' outcome distributions, which we think can be very influential in estimating the treatment effects of Section 8. More preferable measures might have been the posterior mean and standard deviation for the actual estimands in Section 8 for each replicated dataset, but this would have required a prohibitive amount of computer memory because of the nested structure of that algorithm. In settings such as these, additional future work on choices of discrepancy measures is of interest.

PPPVs for the discrepancy measures that we chose were calculated as the percentage of draws in which the replicated discrepancy measures exceeded the value of the study's discrepancy measure. Extreme values (close to 0 or 1) of a PPPV would indicate a failure of the prior distribution and likelihood to replicate the corresponding measure of location, dispersion, or their relative magnitude, and would indicate an undesirable influence of the model in estimation of our estimands. Results from these checks, displayed in Table 8, provide no special evidence for such influences of the model.
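Schematically, the calculation looks as follows (our own Python sketch; replicate and discrepancy stand in for the model's replication step and the signal/noise measures above).

    import numpy as np

    def pppv(study_data, posterior_draws, replicate, discrepancy):
        """Posterior predictive p value for one discrepancy measure."""
        exceed = [discrepancy(replicate(theta), theta) >= discrepancy(study_data, theta)
                  for theta in posterior_draws]   # one replicated study per draw
        return float(np.mean(exceed))             # values near 0 or 1 signal misfit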

10. DISCUSSION

In this article, we have defined the framework for principal stratification in broken randomized experiments, to accommodate noncompliance, missing covariate information, missing outcome data, and multivariate outcomes. We make explicit a set of structural assumptions that can identify the causal effects of interest, and we also provide a parametric model that is appropriate for practical implementation of the framework in settings such as ours.

Results from our model in the school choice study do not indicate strong treatment effects for most of the subgroups examined. But we do estimate positive effects (on the order of 3 percentile points for ITT and 5 percentile points for the effect of attendance) on math scores overall for children who applied to the program from low-applicant schools, particularly for first graders. Also, the effects were larger for African-American children than for the remaining children (App. B).

Posterior distributions for the CACE estimand, which measures the effect of attendance in private school versus public school, are generally larger than the corresponding ITT effects, but are also associated with greater uncertainty. Of importance, because of the missing outcomes, a model like ours is needed for valid estimation even of the ITT effect.

The results from this randomized study are not subject to selection bias in the way that nonrandomized studies in school choice have been. Nevertheless, although we use the CACE, a well-defined causal effect, to represent the effect of attendance at private versus public schools, it is important to remember that the CACE is defined on a subset of the study children (those who would have complied with either assignment), and that for the other children there is no information on such an effect of attendance in this study. Therefore, as with any randomized trial based on a subpopulation, external information, such as background variables, also must be used when generalizing the CACE from compliers to other target groups of children. Also, it is possible that a broader implementation of a voucher program can have a collective effect on the public schools, if a large shift of the children who might use vouchers were to have an impact on the quality of learning for children who would stay in public schools (a violation of the no-interference assumption). Because our study contains a small fraction of participants relative to all school children, it cannot provide direct information about any such collective effect, and additional external judgment would need to be used to address this issue.

The larger effects that we estimated for children applying from schools with low versus high past scores are also not, in principle, subject to the usual regression to the mean bias, in contrast to a simple before-after comparison. This is because, in our study, both types of children are randomized to be offered the scholarship or not, and both types in both treatment arms are evaluated at the same time after randomization. Instead, the differential effect for children from schools with different past scores is evidence supporting the claim that the school voucher program has greater potential benefit for children in lower-scoring schools.

Our results also reveal differences in compliance and missing-data patterns across groups. Differences in compliance indicate that children applying from low-applicant schools are generally more likely to comply with their treatment assignment; this could provide incentives for policy makers to target this subgroup of children. However, this group also exhibits higher levels of always-takers, indicating that parents of children attending these poorly performing schools are more likely to get their child into a private school even in the absence of scholarship availability. These considerations would, of course, have to be balanced with (and indeed might be dwarfed by) concerns about equity. Missing-data patterns reveal that perhaps relatively greater effort is needed in future interventions of this nature to retain study participants who stay in public schools, particularly the public-school students who actually won but did not use a scholarship.

The approach presented here also has some limitations. First, we have presented principal stratification only on a binary controlled factor Z and a binary uncontrolled factor D, and it is of interest to develop principal stratification for more levels of such factors. Approaches that extend our framework in that direction, including for time-dependent data, have been proposed (e.g., Frangakis et al. 2002b).

Also, the approach of explicitly conditioning on the patterns of missing covariates that we adopted in the parametric component of Section 7 is not as applicable when there are many patterns of missing covariates. In such cases, it would be more appropriate to use models related to those of D'Agostino and Rubin (2000) and integrate them with principal stratification for noncompliance. Moreover, it would also be interesting to investigate results for models fit to these data that allow deviations from the structural assumptions, such as weakened exclusion restrictions (e.g., Hirano et al. 2000; Frangakis, Rubin, and Zhou 2002a), although to ensure robustness of such models, it would be important to first investigate and rely on additional alternative assumptions that would be plausible.

APPENDIX A: COMPUTATIONS

Details are available at http://biosun01.biostat.jhsph.edu/~cfrangak/papers/sc.

Table B1. ITT Estimand

Grade at                   Applicant's school: Low             Applicant's school: High
application  Ethnicity     Reading           Math              Reading           Math
1            AA            2.8 (-1.5, 7.2)   6.7 (3.0, 10.4)   1.8 (-5.2, 8.4)   6.3 (.8, 11.9)
             Other         1.7 (-1.9, 5.4)   3.9 (.5, 7.2)      .8 (-4.5, 6.1)   3.5 (-1.2, 8.2)
2            AA             .8 (-2.8, 4.4)   2.4 (-1.0, 5.8)   -.3 (-6.6, 6.0)   2.2 (-3.8, 8.0)
             Other          .2 (-3.1, 3.4)    .5 (-2.8, 3.7)   -.9 (-6.1, 4.3)    .0 (-5.2, 4.8)
3            AA            1.1 (-3.0, 5.2)   4.6 (.3, 8.9)     -.1 (-6.7, 6.4)   4.2 (-2.5, 10.6)
             Other          .3 (-3.2, 3.8)   2.0 (-2.0, 5.8)   -.7 (-5.8, 4.4)   1.4 (-3.9, 6.5)
4            AA            3.7 (-1.2, 8.7)   4.4 (-.5, 9.3)    2.4 (-4.7, 9.4)   3.8 (-2.9, 10.5)
             Other         2.4 (-1.7, 6.7)   1.8 (-2.6, 6.0)   1.4 (-4.3, 7.0)   1.1 (-4.4, 6.4)
Overall      AA            2.0 (-.9, 4.8)    4.5 (1.8, 7.2)     .8 (-5.1, 6.5)   4.2 (-1.1, 9.3)
             Other         1.1 (-1.4, 3.6)   2.1 (-.7, 4.7)     .1 (-4.6, 4.6)   1.6 (-2.9, 5.8)

NOTE: Plain numbers are means, and parentheses are central 95% intervals of the posterior distribution of the effects on percentile rank.


Table B2. CACE Estimand

Grade at                   Applicant's school: Low              Applicant's school: High
application  Ethnicity     Reading           Math               Reading            Math
1            AA            3.8 (-2.0, 9.6)   9.0 (4.1, 14.0)    2.3 (-7.3, 10.9)   8.3 (1.1, 15.5)
             Other         2.9 (-3.1, 8.8)   6.3 (.8, 11.8)     1.3 (-8.0, 10.2)   5.9 (-2.2, 13.9)
2            AA            1.1 (-3.6, 5.7)   3.1 (-1.3, 7.5)    -.4 (-9.0, 7.9)    2.9 (-5.0, 10.7)
             Other          .3 (-4.9, 5.4)    .8 (-4.4, 5.9)   -1.5 (-10.5, 7.3)    .0 (-8.5, 8.5)
3            AA            1.5 (-4.1, 7.0)   6.3 (.3, 12.2)     -.2 (-9.1, 8.5)    5.6 (-3.2, 14.1)
             Other          .5 (-5.4, 6.4)   3.5 (-3.4, 10.2)  -1.3 (-10.5, 7.9)   2.7 (-7.0, 12.0)
4            AA            4.6 (-1.6, 10.8)  5.5 (-.7, 11.6)    3.0 (-6.0, 11.6)   4.8 (-3.6, 13.1)
             Other         3.7 (-2.6, 10.1)  2.8 (-3.8, 9.3)    2.3 (-7.4, 11.7)   2.0 (-7.0, 11.0)
Overall      AA            2.6 (-1.2, 6.3)   6.0 (2.4, 9.5)     1.0 (-6.9, 8.4)    5.5 (-1.5, 12.2)
             Other         1.7 (-2.3, 5.8)   3.3 (-1.2, 7.7)     .1 (-8.1, 7.9)    2.7 (-5.0, 10.3)

NOTE: Plain numbers are means, and parentheses are central 95% intervals of the posterior distribution of the effects on percentile rank.

APPENDIX B: ETHNIC BREAKDOWN

The model of Section 7 can be used to estimate the effect of the program on finer strata that may be of interest. For example, to estimate effects stratified by ethnicity (AA for African-American), we obtain the posterior distribution of the causal effect of interest (ITT or CACE) using the same steps described in Section 8.1.1 or 8.1.2, but allowing ethnicity to be part of the definition of W^p. The results for this stratification are reported in Table B1 for the estimands of ITT and Table B2 for the estimands of CACE.

The results follow similar patterns to those in Section 8. For each grade and applicant's school (low/high) combination, however, the effects are more positive on average for the subgroup of African-American children. For both estimands, this leads to 95% intervals that are entirely above 0 for math scores for the following subgroups: African-American first- and third-graders, and overall, from low-applicant schools; African-American first-graders from high-applicant schools; and non-African-American first-graders from low-applicant schools. All intervals for reading covered the null value. This suggests that the positive effects reported on math scores in Section 8 for children originating from low-applicant schools are primarily attributable to gains among the African-American children in this subgroup.

[Received October 2002. Revised October 2002.]

REFERENCES

Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996), "Identification of Causal Effects Using Instrumental Variables" (with discussion), Journal of the American Statistical Association, 91, 444–472.
Barnard, J., Du, J., Hill, J. L., and Rubin, D. B. (1998), "A Broader Template for Analyzing Broken Randomized Experiments," Sociological Methods and Research, 27, 285–317.
Barnard, J., Frangakis, C., Hill, J. L., and Rubin, D. B. (2002), "School Choice in NY City: A Bayesian Analysis of an Imperfect Randomized Experiment," in Case Studies in Bayesian Statistics, Vol. V, eds. C. Gatsonis, R. E. Kass, B. Carlin, A. Carriquiry, A. Gelman, I. Verdinelli, and M. West, New York: Springer-Verlag, pp. 3–97.
Brandl, J. E. (1998), Money and Good Intentions Are Not Enough, or Why Liberal Democrats Think States Need Both Competition and Community, Washington, DC: Brookings Institute Press.
Carnegie Foundation for the Advancement of Teaching (1992), School Choice: A Special Report, San Francisco: Jossey-Bass.
Chubb, J. E., and Moe, T. M. (1990), Politics, Markets, and America's Schools, Washington, DC: Brookings Institute Press.
Cobb, C. W. (1992), Responsive Schools, Renewed Communities, San Francisco: Institute for Contemporary Studies.
Coleman, J. S., Hoffer, T., and Kilgore, S. (1982), High School Achievement, New York: Basic Books.
Cookson, P. W. (1994), School Choice: The Struggle for the Soul of American Education, New Haven, CT: Yale University Press.
Coulson, A. J. (1999), Market Education: The Unknown History, Bowling Green, OH: Social Philosophy & Policy Center.
Cox, D. R. (1958), Planning of Experiments, New York: Wiley.
D'Agostino, Ralph B., Jr., and Rubin, D. B. (2000), "Estimating and Using Propensity Scores With Incomplete Data," Journal of the American Statistical Association, 95, 749–759.
Education Week (1998), Quality Counts '98: The Urban Challenge, Public Education in the 50 States, Bethesda, MD: Editorial Projects in Education.
Frangakis, C. E., and Rubin, D. B. (1999), "Addressing Complications of Intention-to-Treat Analysis in the Combined Presence of All-or-None Treatment-Noncompliance and Subsequent Missing Outcomes," Biometrika, 86, 365–380.
——— (2001), "Addressing the Idiosyncrasy in Estimating Survival Curves Using Double-Sampling in the Presence of Self-Selected Right Censoring" (with discussion), Biometrics, 57, 333–353.
——— (2002), "Principal Stratification in Causal Inference," Biometrics, 58, 20–29.
Frangakis, C. E., Rubin, D. B., and Zhou, X. H. (2002a), "Clustered Encouragement Design With Individual Noncompliance: Bayesian Inference and Application to Advance Directive Forms" (with discussion), Biostatistics, 3, 147–164.
Frangakis, C. E., Brookmeyer, R. S., Varadhan, R., Mahboobeh, S., Vlahov, D., and Strathdee, S. A. (2002b), "Methodology for Evaluating a Partially Controlled Longitudinal Treatment Using Principal Stratification, With Application to a Needle Exchange Program," Technical Report NEP-06-02, Johns Hopkins University, Dept. of Biostatistics.
Fuller, B., and Elmore, R. F. (1996), Who Chooses? Who Loses? Culture, Institutions, and the Unequal Effects of School Choice, New York: Teachers College Press.
Gelfand, A. E., and Smith, A. F. M. (1990), "Sampling-Based Approaches to Calculating Marginal Densities," Journal of the American Statistical Association, 85, 398–409.
Gelman, A., Meng, X.-L., and Stern, H. (1996), "Posterior Predictive Assessment of Model Fitness via Realized Discrepancies" (with discussion), Statistica Sinica, 6, 733–760.
Gelman, A., and Rubin, D. B. (1992), "Inference from Iterative Simulation Using Multiple Sequences" (with discussion), Statistical Science, 7, 457–472.
Glynn, R. J., Laird, N. M., and Rubin, D. B. (1993), "Multiple Imputation in Mixture Models for Nonignorable Nonresponse With Follow-Ups," Journal of the American Statistical Association, 88, 984–993.
Goldberger, A. S., and Cain, G. G. (1982), "The Causal Analysis of Cognitive Outcomes in the Coleman, Hoffer and Kilgore Report," Sociology of Education, 55, 103–122.
Haavelmo, T. (1943), "The Statistical Implications of a System of Simultaneous Equations," Econometrica, 11, 1–12.
Hastings, W. (1970), "Monte Carlo Sampling Methods Using Markov Chains and Their Applications," Biometrika, 57, 97–109.
Hill, J. L., Rubin, D. B., and Thomas, N. (2000), "The Design of the New York School Choice Scholarship Program Evaluation," in Donald Campbell's Legacy, ed. L. Bickman, Newbury Park, CA: Sage Publications.
Hirano, K., Imbens, G. W., Rubin, D. B., and Zhou, A. (2000), "Assessing the Effect of an Influenza Vaccine in an Encouragement Design," Biostatistics, 1, 69–88.
Holland, P. (1986), "Statistics and Causal Inference," Journal of the American Statistical Association, 81, 945–970.
Imbens, G. W., and Angrist, J. D. (1994), "Identification and Estimation of Local Average Treatment Effects," Econometrica, 62, 467–476.
Imbens, G. W., and Rubin, D. B. (1997), "Bayesian Inference for Causal Effects in Randomized Experiments With Noncompliance," The Annals of Statistics, 25, 305–327.
Levin, H. M. (1998), "Educational Vouchers: Effectiveness, Choice, and Costs," Journal of Policy Analysis and Management, 17, 373–392.
Little, R. J. A. (1993), "Pattern-Mixture Models for Multivariate Incomplete Data," Journal of the American Statistical Association, 88, 125–134.
——— (1996), "Pattern-Mixture Models for Multivariate Incomplete Data With Covariates," Biometrics, 52, 98–111.
Little, R. J. A., and Rubin, D. B. (1987), Statistical Analysis With Missing Data, New York: Wiley.
Meng, X. L. (1996), "Posterior Predictive p Values," The Annals of Statistics, 22, 1142–1160.
Neal, D. (1997), "The Effects of Catholic Secondary Schooling on Educational Achievement," Journal of Labor Economics, 15, 98–123.
Neyman, J. (1923), "On the Application of Probability Theory to Agricultural Experiments: Essay on Principles, Section 9," translated in Statistical Science, 5, 465–480.
Peterson, P. E., and Hassel, B. C. (eds.) (1998), Learning from School Choice, Washington, DC: Brookings Institute Press.
Peterson, P. E., Myers, D. E., Howell, W. G., and Mayer, D. P. (1999), "The Effects of School Choice in New York City," in Earning and Learning: How Schools Matter, eds. S. E. Mayer and P. E. Peterson, Washington, DC: Brookings Institute Press.
Robins, J. M., Greenland, S., and Hu, F.-C. (1999), "Estimation of the Causal Effect of a Time-Varying Exposure on the Marginal Mean of a Repeated Binary Outcome" (with discussion), Journal of the American Statistical Association, 94, 687–712.
Rosenbaum, P. R., and Rubin, D. B. (1983), "The Central Role of the Propensity Score in Observational Studies for Causal Effects," Biometrika, 70, 41–55.
Rubin, D. B. (1974), "Estimating Causal Effects of Treatments in Randomized and Non-Randomized Studies," Journal of Educational Psychology, 66, 688–701.
——— (1977), "Assignment to Treatment Groups on the Basis of a Covariate," Journal of Educational Statistics, 2, 1–26.
——— (1978a), "Bayesian Inference for Causal Effects: The Role of Randomization," The Annals of Statistics, 6, 34–58.
——— (1978b), "Multiple Imputations in Sample Surveys: A Phenomenological Bayesian Approach to Nonresponse," in Proceedings of the Survey Research Methods Section, American Statistical Association, pp. 20–28.
——— (1979), "Using Multivariate Matched Sampling and Regression Adjustment to Control Bias in Observational Studies," Journal of the American Statistical Association, 74, 318–328.
——— (1980), Comment on "Randomization Analysis of Experimental Data: The Fisher Randomization Test," Journal of the American Statistical Association, 75, 591–593.
——— (1984), "Bayesianly Justifiable and Relevant Frequency Calculations for the Applied Statistician," The Annals of Statistics, 12, 1151–1172.
——— (1990), "Comment: Neyman (1923) and Causal Inference in Experiments and Observational Studies," Statistical Science, 5, 472–480.
The Coronary Drug Project Research Group (1980), "Influence of Adherence to Treatment and Response of Cholesterol on Mortality in the Coronary Drug Project," New England Journal of Medicine, 303, 1038–1041.
Tinbergen, J. (1930), "Determination and Interpretation of Supply Curves: An Example," in The Foundations of Econometric Analysis, eds. D. Hendry and M. Morgan, Cambridge, U.K.: Cambridge University Press.
Wilms, D. J. (1985), "Catholic School Effect on Academic Achievement: New Evidence From the High School and Beyond Follow-Up Study," Sociology of Education, 58, 98–114.

Comment

Bengt MUTHÉN, Booil JO, and C. Hendricks BROWN

1. INTRODUCTION

The article by Barnard, Frangakis, Hill, and Rubin (BFHR) is timely in that the Department of Education is calling for more randomized studies in educational program evaluation. (See the discussion of the "No Child Left Behind" initiative in, e.g., Slavin 2002.) BFHR can serve as a valuable pedagogical example of a successful, sophisticated statistical analysis of a randomized study. Our commentary is intended to provide additional pedagogical value to benefit the planning and analysis of future studies, drawing on experiences and research within our research group. [The Prevention Science Methodology Group (PSMG; www.psmg.hsc.usf.edu), co-PIs Brown and Muthén, has collaborated over the last 15 years with support from the National Institute of Mental Health and the National Institute on Drug Abuse.]

BFHR provides an exemplary analysis of the data from an imperfect randomized trial that suffers from several complications simultaneously: noncompliance, missing data in outcomes, and missing data in covariates. We are very pleased to

Bengt Muthén is Professor and Booil Jo is postdoc, Graduate School of Education and Information Studies, University of California, Los Angeles, Los Angeles, CA 90095 (E-mail: bmuthen@ucla.edu). C. Hendricks Brown is Professor, Department of Epidemiology and Biostatistics, University of South Florida, Tampa, FL 33620. The research of the first author was supported by National Institute on Alcohol Abuse and Alcoholism grant K02 AA 00230. The research of the second and third authors was supported by National Institute on Drug Abuse and National Institute of Mental Health grant MH40859. The authors thank Chen-Pin Wang for research assistance, Joyce Chappell for graphical assistance, and the members of the Prevention Science Methodology Group and the Fall ED299A class for helpful comments.

see their application of cutting-edge Bayesian methods for dealing with these complexities. In addition, we believe the methodological issues and the results of the study have important implications for the design and analysis of randomized trials in education and for related policy decisions.

BFHR provides results of the New York City school choice experiment based on 1-year achievement outcomes. With the planned addition of yearly follow-up data, growth models can provide an enhanced examination of causal impact. We discuss how such growth modeling can be incorporated, and provide a caution that applies to BFHR's use of only one posttest occasion. We also consider the sensitivity of the latent class ignorability assumption in combination with the assumption of compound exclusion.

2. LONGITUDINAL MODELING ISSUES

BFHR focuses on variation in treatment effect across compliance classes. This part of the commentary considers variation in treatment effect across a different type of class, based on the notion that the private school treatment effect might very well be quite different for children with different achievement development. (Also of interest is potential variation in treatment effects across schools, with respect to both the public school the child originated in and the private school the child was moved to, but this multilevel aspect of the data is left aside here for lack of space.) To study such a "treatment-baseline interaction" (or "treatment-trajectory interaction"), we will switch from BFHR's pretest-posttest analysis framework (essentially a very advanced ANCOVA-type analysis) to the growth mixture modeling framework of Muthén et al. (2002). An underlying rationale for this modeling is that individuals at different initial status levels and on different trajectories may benefit differently from a given treatment. ANCOVA controls for initial status as measured by the observed pretest score. Unlike the observed pretest score, the latent variable of initial status is free of time-specific variation and measurement error.

The focus on longitudinal aspects of the New York School Choice Study (NYSCS) is both substantively and statistically motivated. First, treatment effects may not have gained full strength after only a 1-year stay in a private school. The NYSCS currently has data from three follow-ups, that is, providing repeated-measures data from four grades. Although BFHR used percentile scores that do not lend themselves to growth modeling, a conversion to "scale scores" (i.e., IRT-based equated scores) should be possible, enabling growth modeling. Unfortunately, educational research traditionally uses scales that are unsuitable for growth modeling, such as percentile scores, normal curve equivalents, and grade equivalents (for a comparison in a growth context, see Seltzer, Frank, and Bryk 1994). Hopefully, this tradition can be changed. Second, the use of information from more time points than pretest and posttest makes it possible to identify and estimate models that give a richer description of the normative development in the control group and how the treatment changes this development.

Consider three types of latent variables for individual i. The first type, C_i, refers to BFHR's compliance principal strata. The next two relate to the achievement development, as expressed by a growth mixture model: T_i refers to trajectory class, and η_i refers to random effects within trajectory class (the within-class model is a regular mixed-effects model). Unlike the latent class variable C_i, the latent class variable T_i is a fully unobserved variable, as is common in latent variable modeling (see, e.g., Muthén 2002a). Consider the likelihood expression for individual i, using the [·] notation to denote probabilities/densities:

    [C_i, T_i | X_i] [η_i | C_i, T_i, X_i] [Y_i | η_i, C_i, T_i, X_i]
        × [U_i | η_i, C_i, T_i, X_i] [R_i | Y_i, η_i, C_i, T_i, X_i],    (1)

where X_i denotes covariates, U_i denotes a compliance stratum indicator (with C_i perfectly measured by U_i in the treatment group for never-takers and perfectly measured in the control group for always-takers, other group-class combinations having missing data), and R_i denotes indicators for missing data on the repeated-measures outcomes Y_i (pretreatment and posttreatment achievement scores). This type of model can be fitted in the latent variable modeling framework of the Mplus program (Muthén and Muthén 1998-2002, tech. app. 8), which has implemented an EM-based maximum likelihood estimator. (For related references, see the Mplus website, www.statmodel.com.) As a special case of (1), conventional random-effects growth modeling includes η_i but excludes C_i and T_i, and assumes missingness at random, so that the last term in (1) is ignored. Growth mixture modeling (Muthén and Shedden 1999; Muthén et al. 2002; Muthén and Muthén 1998-2002) includes η_i and T_i. BFHR includes C_i but not T_i (or η_i) and includes the last term in (1), drawing on the latent ignorability of Frangakis and Rubin (1999). Muthén and Brown (2001) studied latent ignorability related to T_i in the last term of (1). In randomized studies, it would be of interest to study C_i and T_i classes jointly, because individuals in different trajectory classes may show different compliance, and missingness may be determined by these classes jointly.

If data have been generated by a growth mixture model with treatment effects varying across trajectory classes, what would a pretest-posttest analysis such as that in BFHR reveal? To judge the possibility of such treatment-trajectory interaction in the NYSCS, we considered several recent applications of growth mixture modeling that have used T_i to represent qualitatively different types of trajectories for behavior and achievement scores on children in school settings. Drawing on these real-data studies, two growth mixture scenarios were investigated. (A detailed description of these real-data studies and scenarios and their parameter values is given in Mplus Web Note 5 at www.statmodel.com/mplus/examples/webnote.html.) For simplicity, no missing data on the outcome or pretest is assumed, and C_i classes are not present. In a three-class scenario, the treatment effect is noteworthy only for a 70% middle class, assuming that the low-class membership (10%) hinders individuals from benefiting from the treatment and that the high-class membership (20%) does not really need the treatment. The achievement development in the three-class scenario is shown in Figure 1(a), and the corresponding posttest (y2)-pretest (y1) regressions are shown in Figure 1(b). The lines denoted ANCOVA show a regular ANCOVA analysis, allowing for an interaction between treatment and pretest (different slopes). In the three-class scenario, the ANCOVA interaction is not significant at n = 2,000, and the treatment effect in the middle class is underestimated by 20% but overestimated in the other two classes. In a two-class scenario (not shown here), where the treatment is noteworthy only for individuals in the low class (50%), ANCOVA detects an interaction that is significant at the NYSCS sample size of n = 2,000, but underestimates the treatment effect for most children in the low class. (At the low-class average pretest value of 0, the treatment effect is underestimated by 32%.)
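To make the scenario concrete, the following simulation sketch (our own Python; the class proportions, means, and effect sizes are invented in the spirit of the three-class scenario, not taken from Mplus Web Note 5) generates pretest-posttest data in which only the middle class benefits, then fits the ANCOVA with a treatment-by-pretest interaction by least squares.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 2000
    cls = rng.choice([0, 1, 2], size=n, p=[.1, .7, .2])  # low / middle / high classes
    z = rng.integers(0, 2, size=n)                       # randomized treatment
    level = np.array([-1.0, 0.0, 1.0])[cls]              # class-specific latent status
    y1 = level + .5 * rng.standard_normal(n)             # pretest
    effect = np.where(cls == 1, .4, 0.0)                 # only the middle class gains
    y2 = level + effect * z + .5 * rng.standard_normal(n)  # posttest

    # ANCOVA with treatment-by-pretest interaction; a single pair of slopes
    # blurs the class-specific treatment effects.
    X = np.column_stack([np.ones(n), y1, z, z * y1])
    coef, *_ = np.linalg.lstsq(X, y2, rcond=None)
    print(coef)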

Although the NYSCS children are selected from low-performing schools (the average NYSCS math and reading percentile rankings are around 23-28), there may still be sufficient heterogeneity among children in their achievement growth to make a treatment-trajectory interaction plausible. The three-class scenario is possible, perhaps with more children in the low class relative to the other two classes. If this is the case, the ANCOVA analysis shown in Figure 1 suggests a possible reason for BFHR's finding of low treatment effects. The empirical studies and the results in Figure 1 suggest that future program evaluations may benefit from exploring variation in treatment effects across children characterized by different development. Using data from at least two posttreatment time points (three time points total), the class-specific treatment effects generated in these data can be well recovered by growth mixture modeling. (Monte Carlo simulation results are given in Mplus Web Note 5 at www.statmodel.com/mplus/examples/webnote.html.)


Figure 1. Growth Mixture Modeling Versus Pretest-Posttest Analysis.

A more flexible analysis is obtained with more posttreatment time points. An improved design for the determination of the latent trajectory classes is to use more than one pretreatment time point, so that the trajectory class membership is better determined before the treatment starts.

3. COMPOUND EXCLUSION AND LATENT IGNORABILITY

Based on the ideas of principal stratification (Frangakis and Rubin 2002) and latent ignorability (Frangakis and Rubin 1999), BFHR successfully demonstrates that the complexities of educational studies can be better handled under more explicit and flexible sets of assumptions. Although we think that their structural assumptions are reasonable in the NYSCS, we would like to add some thoughts on the plausibility of two other assumptions, considering more general situations.

Compound exclusion (CE) is one of the key structural assumptions in identifying principal effects under latent ignorability. However, the plausibility of this assumption can be questioned in practice (Frangakis et al. 2002; Hirano et al. 2000; Jo 2002a, 2002c; Shadish, Cook, and Campbell 2002; West and Sagarin 2000). In the NYSCS, it seems realistic to assume that winning a lottery has some positive impact on always-takers; however, it is less clear how winning a lottery will affect never-takers. One possibility is that winning a lottery has a negative impact on parents, because they fail to benefit from it. Discouraged parents may have a negative influence on a child's test scores or response behaviors. This negative effect may become more evident if noncompliance is due to parents' low expectation of, or lack of interest in, their child's education. Another possibility is that winning a lottery has a positive impact on a child's test scores or response behaviors. For example, parents who are discouraged by being unable to send their child to private schools even with vouchers may try harder to improve the quality of existing resources (e.g., in the public school their child attends) and be more motivated to support their child to improve his or her academic performance. Given these competing possibilities, it is not easy to predict whether and how CE is violated.

Depending on the situation, causal effect estimates can be quite sensitive to violation of the exclusion restriction in outcome missingness (Jo 2002b), which is less known than the impact of violating the exclusion restriction in observed outcomes (Angrist et al. 1996; Jo 2002a). The implication of possible violation of CE and its impact is that the relative benefit of models assuming latent ignorability (LI) and standard ignorability (SI) depends on degrees of deviation from CE and SI. Identification of causal effects under LI relies on the generalized (compound) exclusion restriction (i.e., both on the outcomes and on the missingness of outcomes), whereas identification of causal effects under SI relies on the standard exclusion restriction (i.e., only on the outcomes). Therefore, in some situations, the impact of deviation from CE may outweigh the impact of deviation from SI, resulting in more biased causal effect estimates in models assuming LI than in models assuming SI (Jo 2002b). For example, if SI holds but CE is seriously violated (say, a 20% increase in the response rate due to treatment assignment for compliers and a 15% increase for never-takers), causal effect estimates and the coverage probability assuming LI and CE can drastically deviate from the true value and the nominal level. This type of violation does not affect models assuming SI and the standard exclusion restriction. To empirically examine the plausibility of SI, LI, and CE, it will be useful to do sensitivity analyses of models imposing different combinations of these assumptions. As BFHR points out, this investigation can be conducted by relaxing compound exclusion (e.g., Frangakis et al. 2002; Hirano et al. 2000) or by using alternative structural assumptions (e.g., Jo 2002c). More research is needed to examine the efficiency of these alternative models and to explore factors associated with insensitivity of LI models to violation of compound exclusion.

4. CONCLUSION

Causal inferences of the type BFHR provides are a dramatic improvement over the existing literature now available on the question of whether school choice will produce better achievement outcomes for children in an urban public school system. The randomized lottery provides an exceptionally powerful tool for examining the impact of a program, far more useful than observational studies that have causal change intertwined hopelessly with self-selection factors. Statisticians are just now investigating variations in such principal strata analyses, that is, those involving latent classes formed as a function of randomized trials involving intervention invitations (such as vouchers), encouragement designs, and field trial designs involving more than one randomization (Brown and Liao 1999). The latent categories in this article, which BFHR labels "complier," "never-taker," "always-taker," and "defier," represent only one type of design. Other terms may be more relevant to the scientific questions underlying trials in which subjects are randomly assigned to different levels of invitation (e.g., Angrist and Imbens 1995) or different levels of implementation. Such trials not only have great potential for examining questions of effectiveness, sustainability, and scalability, but also require terms more consistent with adherence than compliance. Again, we congratulate the authors on an important addition to the methodological literature that we predict will have lasting impact.

ADDITIONAL REFERENCES

Angrist, J. D., and Imbens, G. W. (1995), "Two-Stage Least Squares Estimation of Average Causal Effects in Models With Variable Treatment Intensity," Journal of the American Statistical Association, 90, 431–442.
Brown, C. H., and Liao, J. (1999), "Principles for Designing Randomized Preventive Trials in Mental Health: An Emerging Developmental Epidemiologic Perspective," American Journal of Community Psychology, 27, 673–709.
Jo, B. (2002a), "Model Misspecification Sensitivity Analysis in Estimating Causal Effects of Interventions With Noncompliance," Statistics in Medicine, 21, 3161–3181.
——— (2002b), "Sensitivity of Causal Effects Under Ignorable and Latent Ignorable Missing-Data Mechanisms," available at www.statmodel.com/mplus/examples/jo.
——— (2002c), "Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications" (with discussion), Journal of Educational and Behavioral Statistics, 27, 385–420.
Muthén, B. (2002a), "Beyond SEM: General Latent Variable Modeling," Behaviormetrika, 29, 81–117.
Muthén, B., and Brown, C. H. (2001), "Non-Ignorable Missing Data in a General Latent Variable Modeling Framework," presented at the annual meeting of the Society for Prevention Research, Washington, DC, June 2001.
Muthén, B., Brown, C. H., Masyn, K., Jo, B., Khoo, S. T., Yang, C. C., Wang, C. P., Kellam, S., Carlin, J., and Liao, J. (2002), "General Growth Mixture Modeling for Randomized Preventive Interventions," Biostatistics, 3, 459–475.
Muthén, L., and Muthén, B. (1998–2002), Mplus User's Guide, Los Angeles: authors.
Muthén, B., and Shedden, K. (1999), "Finite Mixture Modeling With Mixture Outcomes Using the EM Algorithm," Biometrics, 55, 463–469.
Seltzer, M. H., Frank, K. A., and Bryk, A. S. (1994), "The Metric Matters: The Sensitivity of Conclusions About Growth in Student Achievement to Choice of Metric," Educational Evaluation and Policy Analysis, 16, 41–49.
Shadish, W. R., Cook, T. D., and Campbell, D. T. (2002), Experimental and Quasi-Experimental Designs for Generalized Causal Inference, Boston: Houghton Mifflin.
Slavin, R. E. (2002), "Evidence-Based Education Policies: Transforming Educational Practice and Research," Educational Researcher, 31, 15–21.
West, S. G., and Sagarin, B. J. (2000), "Participant Selection and Loss in Randomized Experiments," in Research Design, ed. L. Bickman, Thousand Oaks, CA: Sage, pp. 117–154.

Comment

Alan B. KRUEGER and Pei ZHU

The New York City school voucher experiment provides some of the strongest evidence on the effect of private school vouchers on student test achievement yet available. Barnard et al. provide a useful reevaluation of the experiment, and some of the authors were integrally involved in the design of the experiment. We will leave it to other discussants to comment on the Bayesian methodological advances in their paper, and instead comment on the substantive lessons that can be learned from the experiment and the practical lessons that emerge from the novel design of the experiment.

We have the advantage of having access to more complete data than Barnard et al. used in preparing their paper. This includes more complete baseline test information, data on multichild families as well as single-child families, and three years of follow-up test data instead of just one year of follow-up data.

Alan B. Krueger is Professor and Pei Zhu is graduate student, Economics Department, Princeton University, Princeton, NJ 08544 (E-mail: akrueger@princeton.edu). In part, this comment extends and summarizes results of Krueger and Zhu (2003). Readers are invited to read that article for a more in-depth analysis of many of the issues raised in this comment. Some of the results presented here differ slightly from those in our earlier article, however, because the definition of family size in this comment corresponds to the one used by Barnard et al. rather than to the definition in our earlier article.

In our comment, we use the more comprehensive sample because results for this sample are more informative, but we highlight where differences arise from using the sample analyzed by Barnard et al.

Three themes emerge from our analysis. First, simplicity and transparency are underappreciated virtues in statistical analysis. Second, it is desirable to use the most recent, comprehensive data for the widest sample possible. Third, there is no substitute for probing the definitions and concepts that underlie the data.

1. RANDOM ASSIGNMENT

As Barnard et al. explain, the voucher experiment entailed a complicated block design, and different random assignment procedures were used in the first and second set of blocks. In the first block, a propensity matched-pairs design (PMPD) method was used. In this block, far more potential control families were available than there was money to follow them up. Rather than select a random sample of controls for follow-up after randomly selecting the treatments, the researchers estimated a propensity score model to align controls with the treatments and then used a nearest-available-neighbor Mahalanobis match to select specific paired control families for follow-up. In principle, the PMPD design should improve the precision of the estimates by reducing the chance imbalances that can occur in a simple randomized block design. A standard randomized block design was used in the remaining four blocks.

Table 1. Efficiency Comparison of Two Methods of Random Assignment: Estimated Treatment Effect and Standard Error, by Method of Random Assignment

                                Propensity score match      Randomized blocks          Relative sampling    Sample-size-adjusted
                                subsample                   subsample                  variance             relative sampling variance
Year    Model                   Coefficient (SE)    N       Coefficient (SE)    N      (RB/PMPD)            (RB/PMPD)

All students
Year 1  Control for baseline      .33 (1.40)        721      2.10 (1.38)        734      .972                 .989
        Omit baseline             .32 (1.40)      1,048     -1.02 (1.58)      1,032     1.274                1.254
Year 3  Control for baseline     1.02 (1.56)        613       .67 (1.74)        637     1.244                1.293
        Omit baseline            -.31 (1.57)        900      -.45 (1.78)        901     1.285                1.287

African-American students
Year 1  Control for baseline      .32 (1.88)        302      8.10 (1.71)        321      .827                 .879
        Omit baseline            2.10 (1.99)        436      3.19 (2.07)        447     1.082                1.109
Year 3  Control for baseline     4.21 (2.51)        247      6.57 (2.09)        272      .693                 .764
        Omit baseline            2.26 (2.47)        347      3.05 (2.31)        386      .875                 .973

NOTE: Standard errors are in parentheses. The treatment effect coefficient is from a regression of test scores on a dummy indicating assignment to receive a voucher (1 = yes), lottery randomization strata dummies, and, in indicated models, baseline test scores. Bootstrap standard errors account for within-family correlation in residuals. Year 1 or 3 refers to follow-up year. The last column adjusts the relative sampling variances of the treatment effects for differences in relative sample sizes. The sample of African-American students consists of those whose mothers' race/ethnicity is identified as non-Hispanic Black/African-American.
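The selection step just described (a propensity score model followed by nearest-available-neighbor Mahalanobis matching) can be sketched in a few lines of Python. This is a minimal illustration under assumed inputs, not the proprietary MPR implementation discussed later in this comment; the covariate matrices X_t and X_c, the logistic propensity model, and the greedy match order are all assumptions of the sketch, and a real analysis would also match within randomization strata.

    # Sketch of propensity-matched-pairs selection (illustrative only).
    # X_t: covariates of treated families; X_c: covariates of the (larger)
    # pool of candidate control families.
    import numpy as np
    from scipy.spatial.distance import mahalanobis
    from sklearn.linear_model import LogisticRegression

    def pmpd_match(X_t, X_c):
        """Pair each treated unit with the nearest available control:
        closeness on the estimated propensity score first, ties resolved
        by Mahalanobis distance on the covariates."""
        X = np.vstack([X_t, X_c])
        z = np.r_[np.ones(len(X_t)), np.zeros(len(X_c))]
        e = LogisticRegression(max_iter=1000).fit(X, z).predict_proba(X)[:, 1]
        e_t, e_c = e[:len(X_t)], e[len(X_t):]
        VI = np.linalg.pinv(np.cov(X.T))   # inverse covariance for Mahalanobis
        available = set(range(len(X_c)))
        pairs = []
        for i in np.argsort(-e_t):         # greedy, highest propensity first
            cands = sorted(available, key=lambda j: abs(e_c[j] - e_t[i]))[:10]
            j = min(cands, key=lambda j: mahalanobis(X_t[i], X_c[j], VI))
            pairs.append((i, j))
            available.remove(j)
        return pairs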

Barnard et al. emphasize that the PMPD block was more balanced for 15 of the 21 baseline variables that they examined. But the key question is the extent to which the precision of the estimates was improved, not the covariate balance, because both blocks yield unbiased estimates. If the covariates explain relatively little of the outcome variable, then the PMPD will add very little. In Table 1 we provide standard intent-to-treat (ITT) estimates (that is, the difference in mean test scores between treatments and controls, conditional on the strata used for random assignment: family size by block by high/low-achieving school) and, more importantly, their standard errors, separately for the PMPD subsample and the randomized-block subsample. The outcome variable is the average of the national percentile rank on the math and reading segments of the Iowa Test of Basic Skills, taken either 1 year or 3 years after random assignment. Two sets of results are presented: one set controlling for baseline test scores, for the subsample with baseline scores, and the other without controlling for scores, for the larger sample. The last column adjusts the relative sampling variances for differences in sample size between the blocks.
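In code, the conventional ITT estimator just described reduces to an ordinary least squares regression of the score on the voucher-offer dummy plus strata dummies, with a bootstrap that resamples whole families for the standard errors. The sketch below is hedged: the column names (score, voucher, strata, famid) are hypothetical, and the sampling weights used in the published analyses are omitted.

    # Sketch of the conventional ITT estimate and its family-clustered
    # bootstrap standard error (column names are illustrative only).
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    def itt_estimate(df):
        # test score on offer dummy plus lottery-strata fixed effects
        fit = smf.ols("score ~ voucher + C(strata)", data=df).fit()
        return fit.params["voucher"]

    def family_bootstrap_se(df, n_boot=500, seed=0):
        # resample families with replacement to respect within-family
        # correlation in the residuals
        rng = np.random.default_rng(seed)
        families = df["famid"].unique()
        draws = []
        for _ in range(n_boot):
            pick = rng.choice(families, size=len(families), replace=True)
            sample = pd.concat([df[df["famid"] == f] for f in pick])
            draws.append(itt_estimate(sample))
        return np.std(draws, ddof=1)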

For all students (panel A), the standard errors are about 10% smaller in the PMPD sample. For the Black students (panel B), however, the standard errors are essentially the same or slightly larger in the PMPD sample. This finding is unexpected, because Hill, Rubin, and Thomas (2000) predicted that analyses of subgroups would have more power in the PMPD design, because matching should lead to a more nearly equal representation of subgroups in the treatment and control groups. In addition, the covariate balance is less good for the Black students in the PMPD than in the randomized blocks if we compute Z scores for the differences between treatments and controls for the baseline covariates. One possibility is that because Black students could have been matched to non-Black students in the PMPD design, this subsample was actually less balanced than the subsample of Blacks from the more conventional randomized block design.

The increase in power from the PMPD design is relatively modest, even in the full sample. Of much more practical importance is the fact that the complexity of the PMPD design caused Mathematica to initially calculate incorrect baseline sample weights. The baseline weights were designed to make the follow-up sample representative of all eligible applicants for vouchers. The initial weights assigned much too little importance to the control sample in the PMPD block. Because these families represented many unselected controls, they should have been weighted heavily. The revised weights increased the weight on the PMPD controls by 620%. In contrast, the weight increased by just 18% for the rest of the sample.

This error had grave consequences for the initial inferences and analyses of the data. First, using the faulty weights, Peterson, Myers, and Howell (1998) reported highly statistically significant differences in baseline test scores between the treatment and control groups, with control group members scoring higher on both the math and reading exams. After the mistake was discovered and the weights were revised, however, the baseline differences became small and statistically insignificant. (We note, however, that if we limit attention to the students from single-child families that Barnard et al. use and use the more complete baseline test score data, then there is a statistically significant difference in baseline reading scores between treatments and controls, even with the revised baseline weights. If we pool together students from single-child and multichild families, then the baseline difference is insignificant. This is another reason why we think it is more appropriate to use the broader sample that includes children from all families.)

Second, because of the inaccurate inference that there was a difference in baseline ability between treatments and controls, Mathematica's researchers were discouraged from analyzing (or at least from presenting) results that did not condition on baseline scores. This is unfortunate, because conditioning on baseline scores caused the researchers to drop from the sample all the students who were initially in kindergarten (because these students were not given baseline tests) and 11% of students initially in grades 1–4. Including students with missing baseline scores increases the sample by more than 40% and expands the population to which the results can be generalized. As explained later, and in an article by Krueger and Zhu (2003), qualitatively different results are found if students with missing baseline scores are included in the sample. Because assignment to receive a voucher was random within lottery strata, estimates that do not condition on baseline scores are nonetheless unbiased. Moreover, it is inefficient to exclude students with missing baseline scores (most of whom had follow-up test scores), and such a sample restriction can potentially cause sample selection bias.

Of lesser importance is the fact that the PMPD design also complicates the calculation of standard errors. The matching of treatments and controls on selected covariates creates a dependence between paired observations. Moreover, if the propensity score model is misspecified, then the equation error also creates a dependence between observations. Researchers have not taken this dependence into account in the calculation of standard errors. We suspect, however, that this is likely to cause only a small understatement of the standard errors, because the covariates do not account for much of the residual variance of test scores and because the PMPD sample is only about half of the overall sample. (We tried to compute the propensity score to adjust the standard errors ourselves, but were unable to replicate the PMPD model, because it was not described in sufficient detail and because the computer programs are proprietary. This also prevented us from replicating estimates in Table 4 of Barnard et al.)

The problems created by the complicated experimental design, particularly concerning the weights, lead us to reiterate the advice of Cochran and Cox (1957): "A good working rule is to use the simplest experimental design that meets the needs of the occasion." It seems to us that in this case the PMPD design introduced unnecessary complexity that inadvertently led to a consequential mistake in the computation of weights, with very little improvement in efficiency. This was not the fault of the architects of the PMPD design, who did not compute the weights, but it was related to the complexity of the design. Simplicity and transparency, in designs, methods, and procedures, can help avoid mistakes down the road, which are almost inevitable in large-scale empirical studies. Indeed, Mathematica recently informed us that they still do not have the baseline weights exactly correct, 5 years after random assignment.

2. INTENT-TO-TREAT ESTIMATION RESULTS

Barnard et al. devote much attention to addressing potential problems caused by missing data for the sample of students in grades 1–4 at baseline; this follows the practice of earlier reports by Mathematica, which restricted the analysis of student outcomes (but not parental responses, such as satisfaction) to those enrolled in grades 1–4 at baseline. In our opinion, a much more important substantive problem results from the exclusion of students in the kindergarten cohort, who were categorically dropped because they were not given baseline tests. Because assignment to treatment status was random (within strata), a simple comparison of means between treatments and controls, without conditioning on baseline scores, provides an unbiased estimate of the average treatment effect. Moreover, as Barnard et al. and others show, treatments and controls were well balanced in terms of baseline characteristics, so there is no reason to suspect that random assignment was somehow corrupted.

Table 2 presents regression estimates of the ITT effect using various samples. Because we cannot replicate the Bayesian estimates without knowing the propensity score, we present conventional ITT estimates. For each entry in the first column, we regressed the test score percentile rank on a voucher offer dummy, 30 dummies indicating lottery strata (block × family size × high/low school), and baseline test scores. The sample is limited to those with baseline test data. Our results differ from the ITT estimates in Table 5 of Barnard et al. because we use the revised weights and a more comprehensive sample that also includes students from multichild families. Barnard et al. report that their Bayesian estimates are "generally more stable" than the conventional ITT estimates, "which in some cases are not even credible," but the extreme ITT estimates that they refer to, for 4th graders from high-achieving schools, are based on only 15 students. For the samples that pool students across grades, the results of the conventional ITT estimates are a priori credible and probably not statistically different from their Bayesian estimates.

In the second column, we use the same sample as in column 1 but omit the baseline test score from the model. In the third column, we expand the sample to include those with missing baseline scores. In the fourth column, we continue to include students with missing baseline scores, but restrict the sample to those initially in low-achieving public schools.

We focus mainly on the results for Black students and students from low-achieving public schools, because the effect of offering a voucher on either reading or math scores for the overall sample is always statistically insignificant and because most public policy attention has focused on these two groups.

Although it has not been made explicit in previous studies of these data, Black students have been defined as children with a non-Hispanic Black/African-American mother. We use this definition and a broader one to probe the sensitivity of the results.

Omitting baseline scores has little qualitative effect on the estimates when the same sample is used (compare columns 1 and 2); the coefficient typically changes by no more than half a standard error. For year 3 scores for Black students, for example, the treatment effect on the composite score is 5.2 points (t = 3.2) when controlling for baseline scores and 5.0 points (t = 2.6) when not controlling. This stability is expected in a randomized experiment.

When the sample is expanded to include those with missing baseline scores in column 3, however, the ITT effect falls almost in half in year 3 for Black students, to 2.65 points, with a t ratio of 1.58. The effect on the math score is statistically significant, as Barnard et al. emphasize. Qualitatively similar results are found if baseline covariates such as mother's education and family income are included as regressors (see Krueger and Zhu 2003).

Table 2. Estimated Treatment Effects With and Without Controlling for Baseline Scores: Coefficient on Voucher Dummy From Test Score Regressions for Various Samples

Columns: (1) subsample with baseline scores, controls for baseline scores; (2) subsample with baseline scores, omits baseline scores; (3) full sample, omits baseline scores; (4) applicant school Low subsample, omits baseline scores.

Test        Sample                                  (1)           (2)           (3)           (4)

First follow-up test
Composite   Overall                                1.27 (.96)     .42 (1.28)   -.33 (1.06)   -.60 (1.06)
            Mother Black, non-Hispanic             4.31 (1.28)   3.64 (1.73)   2.66 (1.42)   1.75 (1.57)
            Either parent Black, non-Hispanic      3.55 (1.24)   2.52 (1.73)   1.38 (1.42)    .53 (1.55)
Reading     Overall                                1.11 (1.04)    .03 (1.36)  -1.13 (1.17)  -1.17 (1.18)
            Mother Black, non-Hispanic             3.42 (1.59)   2.60 (1.98)   1.38 (1.74)    .93 (1.84)
            Either parent Black, non-Hispanic      2.68 (1.50)   1.49 (1.97)   -.09 (1.71)   -.53 (1.81)
Math        Overall                                1.44 (1.17)    .80 (1.45)    .47 (1.17)   -.03 (1.15)
            Mother Black, non-Hispanic             5.28 (1.52)   4.81 (1.93)   4.01 (1.54)   2.68 (1.63)
            Either parent Black, non-Hispanic      4.42 (1.48)   3.54 (1.90)   2.86 (1.51)   1.59 (1.63)

Third follow-up test
Composite   Overall                                 .90 (1.14)    .36 (1.40)   -.38 (1.19)    .10 (1.16)
            Mother Black, non-Hispanic             5.24 (1.63)   5.03 (1.92)   2.65 (1.68)   3.09 (1.67)
            Either parent Black, non-Hispanic      4.70 (1.60)   4.27 (1.92)   1.87 (1.68)   1.90 (1.67)
Reading     Overall                                 .25 (1.26)   -.42 (1.51)  -1.00 (1.25)   -.08 (1.24)
            Mother Black, non-Hispanic             3.64 (1.87)   3.22 (2.19)   1.25 (1.83)   2.06 (1.83)
            Either parent Black, non-Hispanic      3.37 (1.86)   2.75 (2.18)    .76 (1.83)   1.23 (1.82)
Math        Overall                                1.54 (1.30)   1.15 (1.53)    .24 (1.33)    .29 (1.28)
            Mother Black, non-Hispanic             6.84 (1.86)   6.84 (2.09)   4.05 (1.86)   4.13 (1.85)
            Either parent Black, non-Hispanic      6.04 (1.83)   5.78 (2.09)   2.97 (1.85)   2.58 (1.82)

NOTE: Standard errors are in parentheses. The dependent variable is the test score national percentile rank (NPR). The reported coefficient is the coefficient on the voucher offer dummy. All regressions control for 30 randomization strata dummies; models in the first column also control for baseline math and reading test scores. Bootstrap standard errors are robust to correlation in residuals among students in the same family. Bold font in the original indicates that the absolute t ratio exceeds 1.96.

Sample sizes in year 1 for the subsample with baseline scores/without are: overall, 1,455/2,080; mother Black, 623/883; either parent Black, 684/968.

Sample sizes in year 3 for the subsample with baseline scores/without are: overall, 1,250/1,801; mother Black, 519/733; either parent Black, 572/807.

Sample sizes in year 1 for the subsample from low schools, without baseline scores, are: overall, 1,802; mother Black, 770; either parent Black, 843.

Sample sizes in year 3 for the subsample from low schools, without baseline scores, are: overall, 1,569; mother Black, 647; either parent Black, 712.

The classification of Black students used in the study is somewhat idiosyncratic. Race and ethnicity were collected in a single question in the parental survey (contrary to the OMB guidelines for government surveys), so it is impossible to identify Blacks of Hispanic origin (of which there are a large number in New York). In addition, the survey did not directly ask about the child's race or ethnicity, so children's race must be inferred from parents' race/ethnicity. A broader definition of Black students, and one that probably comes closer to the definition used in government surveys and most other educational research, would also include those who have a Black father in the sample of Black students. According to the 1990 Census, race was reported as Black for 85% of the children with a Black father and a Hispanic mother (the overwhelming non-Black portion of the sample) in the New York metropolitan region.

Hence we also present results for a sample in which either parent is listed as "Black, non-Hispanic." This increases the sample of Black students by about 10%, and the results are even weaker for this sample. For example, the ITT estimate using the more comprehensive sample of Black students enrolled in grades K–4 at baseline is 1.87 points (t = 1.11) on the third-year follow-up composite test, although even with this sample the effect on math is significant at the 10% level in year 1, as Barnard et al. have found, and almost significant in year 3. So we conclude that the effect of the opportunity to use a private school voucher on the composite score for the most comprehensive sample of Black students is insignificantly different from 0, although it is possible that there was initially a small beneficial effect on the math score for Black students.

In the final column, we present results for the subsample of students originally enrolled in schools with average test scores below the median score in New York City. In each case, the results for this subsample are quite similar to those for the full sample, and a formal test of the null hypothesis that students from low- and high-achieving schools benefit equally from vouchers is never close to rejecting. It also appears very unlikely that the differences between the treatment effects for applicants from low- and high-achieving schools in Barnard et al.'s Table 4 are statistically significant either.

3. WHAT WAS BROKEN

We agree with Barnard et al. that the experiment was broken, in the sense that attrition and missing data were common. Previous analyses were also strained, if not broken, for their neglect of the cohort of students originally in kindergarten, who were in third grade at the end of the experiment and whose follow-up test scores were ignored. Including these students qualitatively alters the results for Black students. The experiment was also broken in the sense that years passed before correct baseline weights were computed.

We disagree, however, with the interpretation that the experiment was broken because compliance was less than 100%. This depends on the question that one is interested in answering. Because most interest in the experiment for policy purposes centers on the impact of offering vouchers on achievement, not on compelling students to use vouchers, we think the ITT estimates, which reflect inevitable partial usage of vouchers, are most relevant (see also Rouse 1998; Angrist et al. 2003). Moreover, New York had a higher voucher take-up rate than the experiments in Dayton and the District of Columbia, so one could argue that the New York experiment provides an upper-bound estimate of the effect of offering vouchers. On the other hand, if the goal is to use the experiment to estimate the effect of attending private school on achievement, then such methods as instrumental variables or those used by Barnard et al. are necessary to estimate the parameter of interest. We consider this of secondary interest in this case, however.

ADDITIONAL REFERENCES

Angrist, J., Bettinger, E., Bloom, E., King, E., and Kremer, M. (2003), "Vouchers for Private Schooling in Colombia: Evidence From a Randomized Natural Experiment," American Economic Review, 92, 1535–1558.

Cochran, W., and Cox, G. (1957), Experimental Designs (2nd ed.), New York: Wiley.

Krueger, A., and Zhu, P. (2003), "Another Look at the New York City School Voucher Experiment," American Behavioral Scientist, forthcoming. Also available from Industrial Relations Section, Princeton University, at http://www.irs.princeton.edu.

Peterson, P., Myers, D., and Howell, W. (1998), "An Evaluation of the New York City Scholarships Program: The First Year," Mathematica Policy Research, Inc., available at http://www.ksg.harvard.edu/pepg/pdf/ny1rpt.pdf.

Rouse, C. (1998), "Private School Vouchers and Student Achievement: An Evaluation of the Milwaukee Parental Choice Program," The Quarterly Journal of Economics, 113, 553–602.

Comment
Richard A. BERK and Hongquan XU

1. INTRODUCTION

The article by Barnard, Frangakis, Hill, and Rubin (hereafter BFHR) is a virtuoso performance. By applying the Neyman–Rubin model of causal effects and building on a series of important works on the estimation and interpretation of causal effects (e.g., Angrist, Imbens, and Rubin 1996), BFHR manage to extract a reasonable set of findings for a "broken" randomized experiment. The article more generally underscores the important difference between estimating relationships in a consistent manner and interpreting those estimates in causal terms; obtaining consistent estimates is only part of the enterprise. Finally, the article emphasizes that assumptions made so that proper estimates may be obtained are not just technical moves of convenience. Rather, they are statements about how the empirical world is supposed to work. As such, they need to be examined concretely with reference to the phenomena being studied, evaluated empirically whenever possible, and, at a minimum, subjected to sensitivity tests.

Richard A. Berk is Professor and Hongquan Xu is Assistant Professor, Department of Statistics, University of California, Los Angeles, CA 90095 (E-mail: [email protected]).

As battle-tested academics, we can certainly quibble here and there about some aspects of the article. For example, might it not have made sense to construct priors from the educators and economists who were responsible for designing the intervention? Moreover, might it not have made sense to interpret the results using a yardstick of treatment effects that are large enough to matter in practical terms? Nonetheless, we doubt that overall we could have done as well as BFHR did. Moreover, most of our concerns depend on features of the data that cannot be known from a distance. We would have had to carefully examine the data ourselves. As a result, we focus on why a virtuoso performance was needed to begin with. Why was this trip necessary?


2. NONCOMPLIANCE

The implementation of randomized field experiments, as desirable as they are, invites any number of well-known problems (Berk 1990). One of these problems is that human subjects often do not do what they are told, even when they say they will. There is recent scholarly literature on noncompliance (Metry and Meyer 1999; Johnson 2000) and a citation trail leading back to evaluations of the social programs of the War on Poverty (Haveman 1977). Some noncompliance can be expected except when there is unusual control over the study subjects. Our recent randomized trial of the inmate classification system used by the California Department of Corrections is one example (Berk, Ladd, Graziano, and Baek 2003).

An obvious question follows: Why does it seem that the individuals who designed this experiment were caught by surprise? As best we can tell, little was done to reduce noncompliance, and, more important for the analyses of BFHR, apparently almost no data were collected on factors that might be explicitly related to noncompliance (Hill, Rubin, and Thomas 2000).

Without knowing far more about the design stages of the experiment, we find it difficult to be very specific. But from the literatures on compliance and school choice, the following are probably illustrative indicators relevant for parents who would not be inclined to use a school voucher; for other forms of noncompliance, other measures would be needed:

1. Whether the adults in the household are employed outside the home during the day
2. Whether there is any child care at the private schools before classes begin and after they end, for children of employed parents
3. A household's ability to pay for private school beyond the value of the voucher
4. Travel time to the nearest private schools
5. Availability of transportation to those schools
6. Religious preference of the household and religious affiliation of the private school

From these examples, and others that individuals closer to the study would know about, one can imagine a variety of measures to reduce noncompliance. For instance, key problems for any working parent are child care before and after school and round-trip school transportation. These might be addressed directly with supplemental support. If no such resources could be obtained, then the study design might be altered to screen out, before random assignment, households for whom compliance was likely to be problematic.

3. MISSING DATA

No doubt BFHR know more about the subtleties of the missing data than we do. Nevertheless, the same issues arise for missing data that arise for noncompliance. Surely problems with missing data could have been anticipated. Factors related to nonresponse to particular items could have been explicitly measured, and prevention strategies perhaps could have been deployed.

4. DESIGN EFFICIENCY

A carefully designed experiment not only provides a solid basis for statistical inference, but also gives credible evidence for causation. The New York School Choice Scholarships Program (NYSCSP) was apparently the first good example of a randomized experiment to examine the potential benefits of vouchers for private schools. Offering a scholarship (treatment) randomly through a lottery can (1) provide a socially acceptable way to ration scarce resources, (2) protect against undesirable selection bias, and (3) produce balance between the observed and unobserved variables that may affect the response. In addition to the randomization and blocking, NYSCSP implemented a PMPD (propensity matched-pairs design) to choose a control group from a large candidate pool and to balance many covariates explicitly. Table 2 of BFHR shows that various background variables were balanced properly between the treatment group and the control group.

But might it have been possible to do a bit better? BFHR did not mention that the sample size of the applicant's school was not balanced, although winners from high and low schools were balanced between the treatments and the controls. An ideal design would have had equal winners from "bad" (or low) and "good" (or high) public schools. In the NYSCSP, 85% of the winners were from bad schools and 15% were from good schools, to achieve the social goals of the School Choice Scholarships Foundation (Hill et al. 2000, p. 158). Bad schools are defined as "those for which the average test scores were below the median test scores for the city" (Hill et al. 2000, p. 158). A direct consequence is that the estimates from the good schools have much larger variation than those from the bad schools, which was evident in BFHR's Tables 4–6; the confidence intervals for good schools are wider.

Instead of classifying schools into two classes (bad and good), it might have been helpful to classify them into four classes: very bad, bad, good, and very good. To estimate the effect of applicant's school (assuming linear effects), an optimal design would select half of the winners from the very bad schools, the other half from the very good schools, and none from the two middle classes. Avoiding the middle schools enlarges the distance between bad and good schools, and also reduces the source variation in the bad and good schools, thus reducing the variation in the estimates. To be more consistent with the goals of the School Choice Scholarships Foundation, an alternative design might take 75% of winners from very bad schools, 10% from bad schools, 5% from good schools, and 10% from very good schools. This design would have a higher design efficiency than the original design, especially because the applicant's school was the most important design variable after family size (Hill et al. 2000, p. 160). Of course, people on the ground at the time may have considered these options and rejected them. Our point is that with better planning, it might have been possible to squeeze more efficiency out of the study design. Such an outcome would have made the job undertaken by BFHR a bit easier.
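The allocation argument can be motivated by the standard variance calculation for an estimated slope. As a sketch, under an assumed linear, homoscedastic model for the effect of applicant-school quality $x_i$,

    \[
      y_i = a + b x_i + \varepsilon_i, \qquad
      \operatorname{Var}(\hat b \mid x_1, \dots, x_n) = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
    \]

so for fixed $n$ the variance is smallest when the design points $x_i$ are pushed toward the extreme classes, which maximizes $\sum_i (x_i - \bar{x})^2$; allocations concentrated in the middle classes shrink that sum and inflate the variance of the estimated school effect.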

5. CONCLUSIONS

Scientific problems often provide a fertile ground for important statistical developments; BFHR's article and the literature on which they depend are excellent illustrations. However, from the perspective of public policy, the goal of program evaluation is to obtain usefully precise estimates of program impacts while using the most robust means of analysis available. Simplicity is also important, both for data analysts, who must do the work, and policy makers, who must interpret the results. It follows that large investments at the front end of an evaluation (in research design, measurement, and study integrity) are essential. Virtuoso statistical analyses can certainly help when these investments are insufficient or when they fail, but they must not be seen as an inexpensive alternative to doing the study well to begin with.

ADDITIONAL REFERENCES

Angrist, J., Imbens, G. W., and Rubin, D. B. (1996), "Identification of Causal Effects Using Instrumental Variables" (with discussion), Journal of the American Statistical Association, 91, 444–472.

Berk, R. A. (1990), "What Your Mother Never Told You About Randomized Field Experiments," in Community-Based Care of People With AIDS: Developing a Research Agenda, AHCPR Conference Proceedings, Washington, DC: U.S. Department of Health and Human Services.

Berk, R. A., Ladd, H., Graziano, H., and Baek, J. (2003), "A Randomized Experiment Testing Inmate Classification Systems," Journal of Criminology and Public Policy, 2, 215–242.

Haveman, R. H. (ed.) (1977), A Decade of Federal Antipoverty Programs: Achievements, Failures, and Lessons, New York: Academic Press.

Hill, J. L., Rubin, D. B., and Thomas, N. (2000), "The Design of the New York School Choice Scholarships Program Evaluation," in Research Design: Donald Campbell's Legacy (Vol. II), ed. L. Bickman, Thousand Oaks, CA: Sage.

Johnson, S. B. (2000), "Compliance Behavior in Clinical Trials: Error or Opportunity?," in Promoting Adherence to Medical Treatment in Chronic Childhood Illness: Concepts, Methods, and Interventions, ed. D. Drotar, Mahwah, NJ: Lawrence Erlbaum.

Metry, J., and Meyer, U. A. (eds.) (1999), Drug Regimen Compliance: Issues in Clinical Trials and Patient Management, New York: Wiley.

Rejoinder
John BARNARD, Constantine E. FRANGAKIS, Jennifer L. HILL, and Donald B. RUBIN

We thank the editorial board and the thoughtful group of discussants that they arranged to comment on our article. The discussants have raised many salient issues and offered useful extensions of our work. We use this rejoinder primarily to address a few points of contention.

Muthén, Jo, and Brown

Muthén, Jo, and Brown (henceforth MJB) provide an enlightening discussion of an alternate and potentially complementary approach to treatment effect estimation. The growth models that they discuss are for data more extensive than what we analyzed, and so are beyond the scope of our article. Nevertheless, MJB propose an interesting idea for looking for interactions in longitudinal experimental data, and we look forward to future applications of the method to appropriate datasets.

Although models that incorporate both our compliance principal strata as well as latent growth trajectories are theoretically possible (assuming multiple years of outcome data) and could potentially yield an extremely rich description of the types of treatment effects manifested, the more time points that exist, the more complicated the compliance patterns can become (at least without further assumptions), and it may be difficult to find a dataset rich enough to support all of the interactions. Given a choice, we feel that it is probably more beneficial to handle real observed issues than to search for interactions with latent trajectories, but certainly it would always be nice to be able to do both. Of course, such a search for interactions in trajectories is more beneficial when there is enough prior knowledge about the scientific background to include additional structural assumptions that can support robust estimation.

MJB make an interesting point about the exploration of potential violations of exclusion for never-takers. Such analyses have been done in past work by some of the authors (Hirano et al. 2000; Frangakis, Rubin, and Zhou 2002), but were not pursued for our study because the robustness of such methods is still an issue in examples as complex as this one.

Regarding the trade-offs between the joint latent ignorability and compound exclusion assumptions versus the joint assumptions of standard ignorability and the standard exclusion restriction, this topic is up for debate. We would like to point out, however, that these standard structural assumptions come with their own level of arbitrariness. These issues have been explored by Jo (2002) and Mealli and Rubin (2002, 2003).

We thank MJB for providing a thought-provoking discussion of our model and assumptions, as well as ample fodder for additional directions that can be explored in this arena. Clearly, the general category of work has stimulated much new and exciting modeling.

Krueger and Zhu

Krueger and Zhu (henceforth KZ) focus primarily on the underlying study and analyses that either we did not do or that used data we did not have. Clearly, they are interested in the substantive issues, which is admirable, but we cannot be responsible for analyses that we neither did nor were consulted on, and we are not prepared to comment on what analyses we might have undertaken with data that we did not have.

KZ state that "there is no substitute for probing the definitions and concepts that underlie the data." Although we are not entirely sure what this refers to in particular, if it means that good science is more important than fancy techniques, then we certainly agree.

Another comment by KZ is that "it is desirable to use the most recent, comprehensive data for the widest sample possible." We agree that it is desirable to have access to the most comprehensive data available, but it is not necessary to use all of these data in any given analysis. Here we use only first-year data, because these were the only data available to us at the time that we were building the model. We used only single-child families, as explained in the article, because, regrettably, no compliance data were collected for individual children in the multichild families.

The exclusion of kindergarten children could be considered more of a judgment call regarding the quality of the pretest scores, not only for efficiency, but also for the reasonableness of our assumptions about noncompliance and missing data. Nothing in theory would prevent us from including these children and omitting these extra covariates.

These types of trade-offs (more data vs. more appropriate data) are common in statistical analyses, and the benefits are well known, for example, for avoiding model extrapolation or when some subsets of data do not mean the same thing as others. In fact, Krueger "ignored data" in his article on minimum wage with David Card (Card and Krueger 1994) by limiting analyses to restaurants from a similar geographic region, in the hope that employees and employers across the two comparison states would be faced with similar economic conditions, rather than looking at all restaurants in these states. This is a form of implicit matching, which we believe represented a good choice. Surely there are times when focusing on subsets of data produces more reliable results.

Furthermore, we would like to point out that KZ's assertion that "because assignment to treatment status was random (within strata), a simple comparison of means between treatments and controls, without conditioning on baseline scores, provides an unbiased estimate of the average treatment effect" is simply false, because there are missing outcomes. Even if there is perfect compliance, a simple comparison of means across treatment groups when there are missing outcomes will not lead, in general, to an unbiased estimate of the treatment effect. Assumptions need to be made about the missing-data process.
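The point is easy to verify by simulation. In the sketch below, all numbers are invented purely for illustration: the true offer effect is exactly 0, outcomes are missing more often for low-scoring controls, and the complete-case comparison of means is visibly biased.

    # Toy simulation: with outcome-dependent missingness, the complete-case
    # difference in means is biased even under randomization and perfect
    # compliance. All numbers are invented for illustration.
    import numpy as np

    rng = np.random.default_rng(1)
    n = 200_000
    z = rng.integers(0, 2, n)     # randomized voucher offer
    y = rng.normal(50, 15, n)     # percentile-like score; true effect of z is 0
    # low-scoring controls respond half the time, everyone else 90% of the time
    p_respond = np.where((z == 0) & (y < 50), 0.5, 0.9)
    observed = rng.random(n) < p_respond

    naive = y[observed & (z == 1)].mean() - y[observed & (z == 0)].mean()
    print(f"complete-case ITT estimate: {naive:.2f} (truth is 0)")
    # prints roughly -3.4: without assumptions about the missing-data
    # process, the naive contrast is not unbiased.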

When noncompliance exists in addition to the missing outcome data, Frangakis and Rubin (1999) have shown that estimation of even the ITT effect requires additional assumptions. Moreover, the ITT estimate, generally considered a conservative estimate of the treatment effect, can actually be anticonservative in this scenario. Finally, contrary to what KZ seem to imply in their final paragraph, instrumental variables methods alone are not sufficient to address the problems of both noncompliance and missing data.

KZ advocate "simplicity and transparency." We wholeheartedly agree, but disagree with the examples they use to defend their stance. Their first argument is that the PMPD did not yield strong efficiency gains and that we did not account for the PMPD in the calculation of standard errors. It may well be that precision gains were relatively modest; however, the analyses that they present are by no means conclusive in this regard. First, as with KZ's other ITT analyses, we have no clue as to how missing outcome data are handled. In addition, the PMPD was also used to balance compliance and missingness. It is unclear, however, how the gains in precision due to matching the missingness and noncompliance are displayed, if at all, in KZ's results. Finally, KZ did not account for the experimental design when estimating their results. Failure to account properly for the design typically results in estimates that overstate the variances of treatment effect estimates. In this design, it is the correlation between matched-pair units that leads the sampling variances to be smaller than their estimates.

Even if precision gains were truly modest, they of course were unknown to us at the time we debated the use of the PMPD design. KZ's argument is like the complaint of a cross-country motorist who experiences no car problems during his long drive and then bemoans the money wasted on a spare tire purchased before the trip began. We prefer to avoid potential problems by using careful design.

Importantly, KZ's criticism that we ignored this matching in the analysis is incorrect. The matching was done based on the estimated propensity score, so it follows from Rubin (1978) that assignment is ignorable conditionally given the covariates that were used to estimate it. Thus the Bayesian analysis that we conduct, which conditions on the estimated propensity score, is an approximately valid Bayesian analysis with respect to the matched assignment, assuming that our modeled relationship between the outcomes and the estimated propensity score captures the major features of the relationship to the covariates.

KZ's second argument is that use of the PMPD "led to grave consequences," because sampling weights were miscalculated by MPR. It seems to us odd to maintain, as KZ do, that the PMPD "caused" this error; nor is it clear that use of another design would have avoided the error. In any case, we have rerun our analyses, this time generalizing to the population of which the children in our analyses are representative (by setting constant weights in step 3 of Sec. 8.1.1). For this population, all results are very similar to our original results. These results are posted as Appendix C at http://biosun01.biostat.jhsph.edu/~cfrangak/papers/sc.

A third argument revolves around the protracted debate between use of ITT versus treatment-targeted measures such as the CACE. We addressed both ITT and CACE, which allows the reader to focus on one or the other or both, depending on which effect is expected to generalize to another environment: the ITT, or "sociological," effect of accepting or not accepting an offered voucher, or the CACE, or "structural" or "scientific," effect of using a voucher. One could argue that the effect of attending (CACE) is the more generalizable, whereas the sociological effect (ITT) is more easily influenced, for instance, by "hype" about how well the program works. That is, the ITT estimate will not reflect the "effect of offering" if, the next time the program is offered, the compliance rates change (which seems likely). We prefer to examine both, to provide a more comprehensive picture.

In sum, we agree that simplicity and transparency are important; this is the major motivation for our detailed description of our assumptions in terms of readily understandable concepts, rather than commonly invoked and less intuitive conditions involving correlated error terms. However, this desire for simplicity must be balanced with the prudence of guarding against potential problems through careful design.

Although we agree in theory with much of what KZ espouse, some of their criticisms seem misguided. Moreover, their presentation of analyses with incomplete exposition regarding missing-data complications makes these analyses difficult for us to evaluate properly.


Berk and Xu

Berk and Xu (henceforth BX) raise several interesting questions about the way in which the evaluation was designed and implemented. Many of these are actually addressed in a preceding and lengthier article by us on this topic (Barnard et al. 2002). That article describes, for instance, MPR's efforts to reduce missing data, without which the missingness rates would surely have been much higher. It also describes special challenges faced in this study. But in any study with human subjects, it is difficult to completely eliminate missing data, particularly if one desires more interesting information than can be obtained from administrative data. One of the challenges for the collection of outcome test scores was the fact that children needed to be tested outside of the public school setting, which allowed the use of a standardized test not used by the public school system, thereby greatly reducing the possibility of bias resulting from teachers "teaching to the test." Consequently, parents had to bring their children to testing areas on weekends, a formidable hurdle for some, even when monetary incentives were used.

An important consideration with this study was that the evaluators took the initiative to investigate a program that was sponsored by an organization (SCSF) whose primary goal was to provide scholarships, not to do research. We feel that MPR did an admirable job convincing the SCSF to allow the study to be evaluated and raising money for this purpose, but this also meant that the evaluators could not control every aspect of the program. For example, the SCSF originally only wanted to make the program available to children from the lower-test-score schools, and the evaluators had to convince them to also allow a small percentage from those in higher-test-score schools, to increase the generalizability of the results. The designs that BX suggest, even though they have certain desirable statistical properties under appropriate assumptions, were not an option with this study.

Related to this issue of the alignment of the evaluators' and program administrators' goals is the issue of the potential for reducing noncompliance rates. Some effort was made, in that help was provided to scholarship winners in finding an appropriate private school. Arguably, however, 100% compliance should not be the goal in such a study, given that we would not expect 100% compliance if the program were expanded and offered, say, by the government on a larger scale. Thus it may not be wise to use extreme efforts to increase compliance if such efforts would not be offered in the large-scale public case. With the current study, the opportunity exists to examine the noncompliance process for this program in this setting: take-up rates, predictors of noncompliance, and so on. To this end, some sort of subexperiment that randomizes a reasonable set of compliance incentives might be useful in similar studies in the future, if sample sizes are sufficient.

We are not convinced that we should have solicited actual prior distributions from educators and economists. In our experience, this is a confusing, difficult, and often unrewarding task. Moreover, school choice is such a politically charged and contentious issue that it seems wiser to try to use relatively diffuse and objective prior distributions.

We agree that results should be interpreted in terms of a yardstick of practical effects. We are unaware of the most appropriate such yardstick to use in this context. One side could claim that even tiny average effects matter, not only because of the cumulative effects on parts of society, but also because of possible nonadditivity of such effects, which could imply huge effects for some rare individuals and minute effects for all others. Others could argue that seemingly quite large effects have no long-run practical impact on quality of life. The results that we present provide a basic empirical starting point for these debates between people whose expertise is in this area.

We thank BX for highlighting our emphasis on the role of the assumptions in our analyses. Our hope is that we have presented the assumptions in such a way as to enable readers to make their own judgment about whether or not they were likely to be satisfied, rather than only stating them technically, which would implicitly make those important decisions for the interested but nontechnical readers.

BX point out a dearth of certain types of covariates. Part of the problem here is that we presented only a small portion of the data collected, due to the fact that our model in its current form did not handle many covariates. However, there were indeed discussions between some members of our team and MPR staff, during the development of the questionnaire, about what types of variables could be included that would be predictive of missing data or compliance, as well as outcomes. Some of the variables proposed were discarded due to the surveyors' experience with them as subject to problems such as high measurement error. Also, not every variable discussed was included, due to a desire for a survey that would not be overly burdensome for the study families.

We raise a final, small semantic question. We wonder about the propriety of the label "Neyman–Rubin model" to describe a framework for causal inference that views observational studies and randomized experiments on the same continuum and, moreover, encourages Bayesian analyses of both. Neyman (1923) certainly seems to have been the first to use the formal notation of potential outcomes for randomization-based inference in randomized experiments, which was a truly major advance, but neither he nor anyone else, to the best of our knowledge, used this formal notation to define causal effects in observational studies until Rubin (1974, 1975), nor did he nor anyone else ever advocate Bayesian inference for the unobserved potential outcomes until Rubin (1975, 1978), which is the perspective that we have implemented here.

Overall, we agree with BX about the paramount importance of study design, and we tried to design this study carefully, certainly more so than most social science analyses of secondary datasets like the PSID or CPS. We thank BX for their thoughtful commentary; they raise some important points.

We thank all of the discussants for stimulating comments, which we believe have added emphasis to the importance of both the substantive topic of school vouchers and the development of appropriate methods to evaluate this and other such broken randomized experiments.

ADDITIONAL REFERENCES

Card, D., and Krueger, A. (1994), "Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania," American Economic Review, 84, 772–793.

Jo, B. (2002), "Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications" (with discussion), Journal of Educational and Behavioral Statistics, 27, 385–420.

Mealli, F., and Rubin, D. B. (2002), Discussion of "Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications," by B. Jo, Journal of Educational and Behavioral Statistics, 27, 411–415.

Mealli, F., and Rubin, D. B. (2003), "Assumptions Allowing the Estimation of Direct Causal Effects," discussion of "Healthy, Wealthy, and Wise? Tests for Direct Causal Paths Between Health and Socioeconomic Status," by Adams et al., Journal of Econometrics, 112, 79–87.

Rubin, D. B. (1975), "Bayesian Inference for Causality: The Importance of Randomization," in Proceedings of the Social Statistics Section, American Statistical Association, pp. 233–239.

Barnard et al.: Principal Stratification Approach to Broken Randomized Experiments

the PMPD period, and propensity scores for the other periods; indicators for grade of the student; ethnicity (1 if the child or guardian identifies herself as African-American, 0 otherwise); an indicator for whether or not the pretest scores of reading and math were available; and the pretest scores (reading and math) for the subjects with available scores. A propensity score for the students not in the PMPD period is not necessary from the design standpoint and is, in fact, constant. However, to increase efficiency and to reduce bias due to missing outcomes, here we include an "estimated propensity score value" for these periods, calculated as the function derived for the propensity score for students in the PMPD period and evaluated at the covariate values for the students in the other periods as well. Also, the foregoing link function $g_0$ is the same as $g_1$, except that it excludes the indicators for application period as well as the propensity scores for applicants who did not apply in the PMPD period (i.e., those for whom the propensity score was not a design variable). This link function, a more parsimonious version of one we used in earlier models, was more appropriate to fit the relatively small proportion of always takers in this study. Finally, because the pretests were either jointly observed or jointly missing, one indicator for missingness of pretest scores is sufficient.

The prior distributions for the compliance submodel are

$$ \beta^{C,1} \sim N\big(\beta^{C,1}_0, \{\sigma^{C,1}\}^2 I\big) \qquad\text{and}\qquad \beta^{C,2} \sim N\big(0, \{\sigma^{C,2}\}^2 I\big), $$

independently, where $\{\sigma^{C,1}\}^2$ and $\{\sigma^{C,2}\}^2$ are hyperparameters set at 10, and $\beta^{C,1}_0$ is a vector of 0s, with the exception of the first element, which is set equal to

$$ -\Phi^{-1}(1/3) \Big\{ \{\sigma^{C,1}\}^2 \tfrac{1}{n} \sum_{i}^{n} g_1(i)' g_1(i) + 1 \Big\}^{1/2}, $$

where $g_1(i) = g_1(W_i, X^{obs}_i, R_{x,i})$ and $n$ is our sample size. These priors reflect our a priori ignorance about the probability that any individual belongs to each compliance status by approximately setting each of their prior probabilities at 1/3.
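For concreteness, the centering of this prior can be computed as in the following sketch, where G1, the matrix whose rows are the link-function values $g_1(i)'$, is a placeholder input of this illustration:

    # Sketch: center the compliance-model intercept so that each compliance
    # stratum receives prior probability of about 1/3 under the probit link.
    import numpy as np
    from scipy.stats import norm

    def prior_intercept(G1, sigma2=10.0):
        """First element of beta_0^{C,1}:
        -Phi^{-1}(1/3) * {sigma2 * (1/n) sum_i g_1(i)'g_1(i) + 1}^{1/2}."""
        quad = np.mean(np.sum(G1 * G1, axis=1))   # (1/n) sum_i g_1(i)'g_1(i)
        return -norm.ppf(1.0 / 3.0) * np.sqrt(sigma2 * quad + 1.0)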

7.2 Outcome Submodel

We first describe the marginal distribution for the math outcome, $Y^{math}$. To address the pile-up of many scores of 0, we posit the censored model

$$ Y^{math}_i(z) = \begin{cases} 0 & \text{if } Y^{math*}_i(z) \le 0, \\ 100 & \text{if } Y^{math*}_i(z) \ge 100, \\ Y^{math*}_i(z) & \text{otherwise,} \end{cases} $$

where

$$ Y^{math*}_i(z) \mid W_i, X^{obs}_i, R_{x,i}, C_i; \theta^{math} \sim N\Big( g_2(W_i, X^{obs}_i, R_{x,i}, C_i, z)' \beta^{math},\; \exp\big\{ g_3(X^{obs}_i, R_{x,i}, C_i, z)' \zeta^{math} \big\} \Big) $$

for $z = 0, 1$, and $\theta^{math} = (\beta^{math}, \zeta^{math})$. Here $Y^{math*}_i(0)$ and $Y^{math*}_i(1)$ are assumed conditionally independent, an assumption that has no effect on inference for our superpopulation parameters of interest (Rubin 1978a).
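In generative form, the censored specification can be sketched as follows; the design vectors g2_i and g3_i and the parameter values are placeholders, not the fitted model:

    # Sketch of the censored-normal outcome model: a latent score is drawn
    # on the real line and then clipped to the observed 0-100 percentile
    # scale, reproducing the pile-up of scores at 0.
    import numpy as np

    def draw_math_score(g2_i, beta, g3_i, zeta, rng):
        mean = g2_i @ beta                  # g_2(...)' beta^math
        sd = np.exp(g3_i @ zeta) ** 0.5     # sd from variance exp{g_3(...)' zeta^math}
        y_star = rng.normal(mean, sd)       # latent Y*^math(z)
        return float(np.clip(y_star, 0.0, 100.0))  # observed Y^math(z)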

The results reported in Section 8 use an outcome component model whose outcome mean link function $g_2$ is linear in distinct parameters for the following:

1. For the students of the PMPD design: an intercept, applicant's school (low/high), ethnicity, indicators for grade, an indicator for whether or not the pretest scores were available, pretest score values for the subjects with available scores, and the propensity score.
2. For the students of the other periods: the variables in item 1, along with indicators for application period.
3. An indicator for whether or not a person is an always taker.
4. An indicator for whether or not a person is a complier.
5. For compliers assigned treatment: an intercept, applicant's school (low/high), ethnicity, and indicators for the first three grades (with the variable for the fourth-grade treatment effect as a function of the already-included variables).

For the variance of the outcome component, the link function $g_3$ includes indicators that saturate the cross-classification of whether or not a person applied in the PMPD period and whether or not the pretest scores were available. This dependence is needed because each pattern conditions on a different set of covariates; that is, $X^{obs}$ varies from pattern to pattern.

The prior distributions for the outcome submodel are

$$ \beta^{math} \mid \zeta^{math} \sim N\big(0,\; F(\zeta^{math})\,\xi\, I\big), \qquad\text{where}\qquad F(\zeta^{math}) = \frac{1}{K} \sum_{k} \exp\big(\zeta^{math}_k\big) $$

and $\zeta^{math} = (\zeta^{math}_1, \dots, \zeta^{math}_K)$, with one component for each of the $K$ (in our case $K = 4$) subgroups defined by cross-classifying the PMPD/non-PMPD classification and the missing-data indicator for pretest scores, and where $\xi$ is a hyperparameter set at 10 and $\exp(\zeta^{math}_k) \overset{iid}{\sim} \text{inv-}\chi^2(\nu, \tau^2)$, where $\text{inv-}\chi^2(\nu, \tau^2)$ refers to the distribution of the inverse of a chi-squared random variable with degrees of freedom $\nu$ (set at 3) and scale parameter $\tau^2$ (set at 400, based on preliminary estimates of variances). The sets of values for these and the hyperparameters of the other model components were chosen to satisfy two criteria: giving posterior standard deviations for the estimands in the neighborhood of the respective standard errors of approximate maximum likelihood estimate fits of similar preliminary likelihood models, and giving satisfactory model checks (Sec. 9.2) while producing quick enough mixing of the Bayesian analysis algorithm.

We specify the marginal distribution for the reading outcome, $Y^{read}$, in the same way as for the math outcome, with separate mean and variance regression parameters $\beta^{read}$ and $\zeta^{read}$. Finally, we allow that, conditionally on $(W_i, X^{obs}_i, R_{x,i}, C_i)$, the math and reading outcomes at a given assignment, $Y^{math*}_i(z)$ and $Y^{read*}_i(z)$, may be dependent, with an unknown correlation $\rho$. We set the prior distribution for $\rho$ to be uniform on $(-1, 1)$, independently of the remaining parameters in their prior distributions.

7.3 Outcome Response Submodel

As with the pretests, the outcomes on mathematics and reading were either jointly observed or jointly missing; thus one indicator, $R_{y,i}(z)$, for missingness of outcomes is sufficient for each assignment $z = 0, 1$. For the submodel on this indicator, we also use a probit specification:

$$ R_{y,i}(z) = 1 \quad\text{if}\quad R^*_{y,i}(z) \equiv g_2(W_i, X^{obs}_i, R_{x,i}, C_i, z)' \beta^R + \epsilon_i(z) \ge 0, $$

where $R_{y,i}(0)$ and $R_{y,i}(1)$ are assumed conditionally independent (using the same justification as for the potential outcomes) and where $\epsilon_i(z) \sim N(0, 1)$. The link function of the probit model on the outcome response, $g_2$, is the same as the link function for the mean of the outcome component. The prior distribution for the outcome response submodel is $\beta^R \sim N(0, \{\sigma^R\}^2 I)$, where $\{\sigma^R\}^2$ is a hyperparameter set at 10.
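A corresponding generative sketch of the response indicator, again with placeholder inputs rather than the fitted model:

    # Sketch of the probit outcome-response submodel: the outcome pair is
    # observed exactly when a latent normal index crosses zero.
    import numpy as np

    def draw_response_indicator(g2_i, beta_R, rng):
        latent = g2_i @ beta_R + rng.normal()   # R*_y(z) with N(0,1) error
        return int(latent >= 0)                 # 1 = outcomes observed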

8. RESULTS

All results herein were obtained from the same Bayesian analysis. We report results for the ITT and CACE estimands, for proportions of compliance principal strata, and for outcome response rates. The reported estimands are not parameters of the model, but rather are functions of parameters and data. The results are reported by applicant's school (low/high) and grade. Both of these variables represent characteristics of children that potentially could be targeted differentially by government policies. Moreover, each was thought to have possible interaction effects with treatment assignment. Except when otherwise stated, the plain numbers are posterior means, and the numbers in parentheses are 2.5 and 97.5 percentiles of the posterior distribution.

8.1 Test Score Results

Here we address two questions:

1. What is the impact of being offered a scholarship on student outcomes, namely the ITT estimand?
2. What is the impact of attending a private school on student outcomes, namely the CACE estimand?

The math and reading posttest score outcomes represent the national percentile rankings within grade. They have been adjusted to correct for the fact that some children were kept behind while others skipped a grade, because students transferring to private schools are hypothesized to be more likely to have been kept behind by those schools. The individual-level causal estimates have also been weighted so that the subgroup causal estimates correspond to the effects for all eligible children belonging to that subgroup who attended a screening session.

8.1.1 Effect of Offering the Scholarship on Mathematics and Reading. We examine the impact of being offered a scholarship (the ITT effect) on posttest scores. The corresponding estimand for individual i is defined as E(Y_i(1) − Y_i(0) | W_i^p; θ), where W_i^p denotes the policy variables, grade level and applicant's school (low/high), for that individual. The simulated posterior distribution of the ITT effect is summarized in Table 4. To draw from this distribution, we take the following steps:

Table 4. ITT Effect of Winning the Lottery on Math and Reading Test Scores, 1 Year Postrandomization

Grade at        Applicant's school: Low              Applicant's school: High
application     Reading           Math               Reading            Math

1               2.3 (−1.3, 5.8)   5.2 (2.0, 8.3)     1.4 (−4.8, 7.2)    5.1 (0.1, 10.3)
2               0.5 (−2.6, 3.5)   1.3 (−1.7, 4.3)    −0.6 (−6.2, 4.9)   1.0 (−4.3, 6.1)
3               0.7 (−2.7, 4.0)   3.3 (−0.5, 7.0)    −0.5 (−6.0, 5.0)   2.5 (−3.2, 8.0)
4               3.0 (−1.1, 7.4)   3.1 (−1.2, 7.2)    1.8 (−4.1, 7.6)    2.3 (−3.3, 7.8)
Overall         1.5 (−0.6, 3.6)   3.2 (1.0, 5.4)     0.4 (−4.6, 5.2)    2.8 (−1.8, 7.2)

NOTE: Plain numbers are means, and numbers in parentheses are central 95% intervals, of the posterior distribution of the effects on percentile rank.

1. Draw θ and C_i from the posterior distribution (see App. A).

2. Calculate the expected causal effect E(Y_i(1) − Y_i(0) | W_i^p, X_i^obs, C_i, R_xi; θ) for each subject, based on the model in Section 7.2.

3. Average the values of step 2 over the subjects in the subclasses of W^p, with weights reflecting the known sampling weights of the study population for the target population.
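In code, steps 1–3 amount to the following minimal sketch (Python; the function and array names are hypothetical, and the posterior draws are assumed to come from the algorithm of App. A, with the model-based expectation of step 2 abstracted as a callback):

    import numpy as np

    def itt_draws(posterior_draws, expected_effect, subclass, weights):
        """posterior_draws: iterable of (theta, C) draws (step 1).
        expected_effect(theta, C): length-n array of E(Y_i(1) - Y_i(0) | ...) (step 2).
        subclass: length-n array of W^p cells; weights: sampling weights (step 3)."""
        out = {}
        for theta, C in posterior_draws:
            eff = expected_effect(theta, C)
            for cell in np.unique(subclass):
                m = subclass == cell
                out.setdefault(cell, []).append(np.average(eff[m], weights=weights[m]))
        return out  # per-cell draws; e.g., np.percentile(out[cell], [2.5, 97.5])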

These results indicate posterior distributions with mass primarily (> .975) to the right of 0 for the treatment effect on math scores, overall for children from low applicant schools and also for the subgroup of first graders. Each effect indicates an average gain of greater than three percentile points for children who won a scholarship. All of the remaining intervals cover 0. As a more general pattern, estimates of effects are larger for mathematics than for reading, and larger for children from low-applicant schools than for children from high-applicant schools.

These results for the ITT estimand using the method of this article can also be contrasted with the results using a simpler method, reported in Table 5. The method used to obtain these results is the same as that used in the initial MPR study (Peterson, Myers, Howell, and Mayer 1999), but limited to the subset of single-child families and separated out by applicant's school (low/high). This method is based on a linear regression of the posttest scores on the indicator of treatment assignment, ignoring compliance data. In addition, the analysis includes the design variables and the pretest scores, and is limited to individuals for whom the pretest and posttest scores are known. Separate analyses are run for math and reading posttest scores, and the missingness of such scores is adjusted by inverse probability weighting. Finally, weights are used to make the results for the study participants generalizable to the population of all eligible single-child families who were screened. (For more details of this method, see the appendix of Peterson et al. 1999.)

Table 5. ITT Effect of Winning the Lottery on Test Scores, Estimated Using the Original MPR Method

Grade at        Applicant's school: Low              Applicant's school: High
application     Reading           Math               Reading             Math

1               −1.0 (−7.1, 5.2)  2.1 (−2.6, 6.7)    4.8 (−10.0, 19.6)   2.6 (−15.5, 20.7)
2               −0.8 (−4.9, 3.2)  2.0 (−4.0, 8.0)    −3.4 (−16.5, 9.7)   2.7 (−10.3, 15.7)
3               3.2 (−1.7, 8.1)   5.0 (−0.8, 10.7)   −8.0 (−25.4, 9.3)   4.0 (−17.7, 25.6)
4               2.7 (−3.5, 8.8)   0.3 (−7.3, 7.9)    27.9 (8.0, 47.8)    22.7 (−1.5, 46.8)
Overall         0.6 (−2.0, 3.2)   2.0 (−0.8, 4.8)    1.1 (−7.0, 9.1)     0.3 (−9.6, 10.1)

NOTE: Plain numbers are point estimates, and parentheses are 95% confidence intervals, for the mean effects on percentile rank.


Table 6. Effect of Private School Attendance on Test Scores

Grade at        Applicant's school: Low              Applicant's school: High
application     Reading           Math               Reading            Math

1               3.4 (−2.0, 8.7)   7.7 (3.0, 12.4)    1.9 (−7.3, 10.3)   7.4 (0.2, 14.6)
2               0.7 (−3.7, 5.0)   1.9 (−2.4, 6.2)    −0.9 (−9.4, 7.3)   1.5 (−6.2, 9.3)
3               1.0 (−4.1, 6.1)   5.0 (−0.8, 10.7)   −0.8 (−9.5, 7.7)   4.0 (−4.9, 12.5)
4               4.2 (−1.5, 10.1)  4.3 (−1.6, 10.1)   2.7 (−6.3, 11.3)   3.5 (−4.7, 11.9)
Overall         2.2 (−0.9, 5.3)   4.7 (1.4, 7.9)     0.6 (−7.1, 7.7)    4.2 (−2.6, 10.9)

NOTE: Plain numbers are means, and parentheses are central 95% intervals, of the posterior distribution of the effects on percentile rank.

The results of the new method (Table 4) are generally more stable than those of the original method (Table 5), which in some cases are not even credible. In the most extreme cases, the original method estimates a 22.7-point gain [95% confidence interval (−1.5, 46.8)] in mathematics and a 27.9-point gain [95% confidence interval (8.0, 47.8)] in reading for fourth-grade children from high-applicant schools. More generally, the new method's results display a more plausible pattern in comparing effects in high-applicant versus low-applicant schools and mathematics versus reading.

8.1.2 Effect of Private School Attendance on Mathematics and Reading. We also examine the effect of offering the scholarship when focusing only on the compliers (the CACE). The corresponding estimand for individual i is defined as

E(Y_i(1) − Y_i(0) | W_i^p, C_i = c; θ).

This analysis defines the treatment as private school attendance (Sec. 6.3). The simulated posterior distribution of the CACE is summarized in Table 6. A draw from this distribution is obtained using steps 1–3 described in Section 8.1.1 for the ITT estimand, with the exception that at step 3 the averaging is restricted to the subjects whose current draw of C_i is "complier."
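Reusing the hypothetical names of the sketch in Section 8.1.1, the only change for a CACE draw is the restriction of the averaging mask to the current draw's compliers:

    import numpy as np

    def cace_draw(eff, C, subclass, weights, cell):
        # average expected effects only over subjects whose current draw of
        # C_i is "complier" (step 3, restricted)
        m = (subclass == cell) & (C == "complier")
        return np.average(eff[m], weights=weights[m])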

The effects of private school attendance follow a pattern similar to that of the ITT effects, but the posterior means are slightly bigger in absolute value than ITT. The intervals have also grown, reflecting that these effects are for only subgroups of all children, the "compliers" in each cell. As a result, the associated uncertainty for some of these effects (e.g., for fourth-graders applying from high-applicant schools) is large.

8.2 Compliance Principal Strata and Missing Outcomes

Table 7 summarizes the posterior distribution of the estimands of the probability of being in stratum C as a function of an applicant's school and grade, p(C_i = t | W_i^p; θ). To draw from this distribution, we use step 1 described in Section 8.1.1, then calculate p(C_i = t | W_i^p, X_i^obs, R_xi; θ) for each subject based on the model of Section 7.1, and average these values as in step 3 of Section 8.1.1. The clearest pattern revealed by Table 7 is that, for three out of four grades, children who applied from low-applicant schools are more likely to be compliers or always takers than are children who applied from high-applicant schools.

As stated earlier, theoretically, under our structural assumptions, standard ITT analyses or standard IV analyses that use ad hoc approaches to missing data are generally not appropriate for estimating the ITT or CACE estimands when the compliance principal strata have differential response (i.e., outcome missing-data) behaviors. To evaluate this here, we simulated the posterior distributions of

p(R_i(z) | C_i = t, W_i^p; θ),  z = 0, 1.  (1)

To draw from the distributions of the estimands in (1), we used step 1 from Section 8.1.1 and then, for z = 0, 1, for each subject, calculated p(R_i(z) | W_i^p, X_i^obs, R_xi, C_i; θ). We then averaged these values over subjects within subclasses defined by the different combinations of W_i^p and C_i.

For each draw (across individuals) from the distribution in (1), we calculate the odds ratios (a) between compliers attending public schools and never takers; (b) between compliers attending private schools and always takers; and (c) between compliers attending private schools and compliers attending public schools. The results (omitted) showed that response was increasing in the following order: never takers, compliers attending public schools, compliers attending private schools, and always takers. These results confirm that the compliance principal stratum is an important factor in response, both alone and in interaction with assignment.

9. MODEL BUILDING AND CHECKING

This model was built through a process of fitting and checking a succession of models. Of note in this model are the following features: a censored normal model that accommodates the

Table 7. Proportions of Compliance Principal Strata, Across Grade and Applicant's School

Grade at        Applicant's school: Low                       Applicant's school: High
application     Never taker   Complier     Always taker       Never taker   Complier     Always taker

1               24.5 (2.9)    67.1 (3.8)   8.4 (2.4)          25.0 (5.0)    69.3 (6.1)   5.7 (3.3)
2               20.5 (2.7)    69.4 (3.7)   10.1 (2.5)         25.3 (5.1)    67.2 (6.2)   7.5 (3.4)
3               24.5 (3.2)    65.9 (4.0)   9.6 (2.5)          28.8 (5.6)    64.1 (6.7)   7.1 (3.5)
4               18.4 (3.3)    72.8 (4.6)   8.8 (3.0)          27.0 (5.5)    66.7 (6.7)   6.3 (3.6)

NOTE: Plain numbers are means (in percent), and parentheses are standard deviations, of the posterior distribution of the estimands.


pile-up of 0s in the outcomes; incorporation of heteroscedasticity; and a multivariate model to accommodate both outcomes simultaneously. It is reassuring, from the perspective of model robustness, that the results from the last few models fit, including those in the conference proceedings report of Barnard et al. (2002), are consistent with the results from this final model.

9.1 Convergence Checks

Because posterior distributions were simulated from a Markov chain Monte Carlo algorithm (App. A), it is important to assess its convergence. To do this, we ran three chains from different starting values. To initiate each chain, we set any unknown compliance stratum equal to a Bernoulli draw, with probabilities obtained from moment estimates of the probabilities of being a complier given observed attendance and randomization data (D, Z) alone. Using the initialized compliance strata for each chain, parameters were initialized to values based on generalized linear model estimates of the model components.

Each chain was run for 15,000 iterations. At 5,000 iterations, and based on the three chains for each model, we calculated the potential scale-reduction statistic (Gelman and Rubin 1992) for the 250 estimands (parameters and functions of parameters) that serve as building blocks for all other estimands. The results suggested good mixing of the chains (with the maximum potential scale-reduction statistic across parameters 1.04) and provided no evidence against convergence. Inference is based on the remaining 30,000 iterations, combining the three chains.
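For one scalar estimand, the statistic can be computed as in this minimal sketch (Python; "chains" holds the three chains' retained draws, an array of shape (3, 10000); this illustrates the Gelman–Rubin formula, not the authors' code):

    import numpy as np

    def potential_scale_reduction(chains):
        # chains: array of shape (m, n), m parallel chains of n retained draws
        m, n = chains.shape
        B = n * chains.mean(axis=1).var(ddof=1)   # between-chain variance
        W = chains.var(axis=1, ddof=1).mean()     # within-chain variance
        var_plus = (n - 1) / n * W + B / n        # pooled variance estimate
        return np.sqrt(var_plus / W)              # values near 1 indicate good mixing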

9.2 Model Checks

We evaluate the influence of the model presented in Section 7 with six posterior predictive checks: three checks for each of the two outcomes. A posterior predictive check generally involves (a) choosing a discrepancy measure that is a function of observed data, and possibly of missing data and the parameter vector θ; and (b) computing a posterior predictive p value (PPPV), which is the probability, over the posterior predictive distribution of the missing data and θ, that the discrepancy measure in a new study, drawn with the same θ as in our study, would be as or more extreme than in our study (Rubin 1984; Meng 1996; Gelman, Meng, and Stern 1996).

Posterior predictive checks in general, and PPPVs in particular, demonstrate whether the model can adequately preserve features of the data reflected in the discrepancy measure, where the model here includes the prior distribution as well as the likelihood (Meng 1996). As a result, properties of PPPVs are not exactly the same as properties of classical p values under frequency evaluations conditional on the unknown θ, just as they are not exactly the same for frequency evaluations over both levels of uncertainty, that is, the drawing of θ from the prior distribution and the drawing of data from the likelihood given θ. For example, over frequency evaluations of the latter type, a PPPV is stochastically less variable than, but has the same mean as, the uniform distribution, and so tends to be more conservative than a classical p value, although the reverse can be true over frequency evaluations of the first type (Meng 1996). (For more details on the interpretation and properties of PPPVs, see also Rubin 1984 and Gelman et al. 1996.)

Table 8. Posterior Predictive Checks: p Values

        Signal   Noise   Signal to noise
Math    .34      .79     .32
Read    .40      .88     .39

The posterior predictive discrepancy measures that we choose here are functions of

A_p^rep(z) = {Y_ip^rep : I(R_y,ip^rep = 1) I(Y_ip^rep ≠ 0) I(C_i^rep = c) I(Z_i = z) = 1}

for the measures that are functions of data Y_ip^rep, R_y,ip^rep, and C_i^rep from a replicated study, and

A_p^study(z) = {Y_ip : I(R_y,ip = 1) I(Y_ip ≠ 0) I(C_i = c) I(Z_i = z) = 1}

for the measures that are functions of this study's data. Here R_y,ip^rep is defined so that it equals 1 if Y_ip^rep is observed and 0 otherwise, and p equals 1 for math outcomes and 2 for reading outcomes. The discrepancy measures "rep" and "study" that we used for each outcome (p = 1, 2) were (a) the absolute value of the difference between the mean of A_p(1) and the mean of A_p(0)

("signal"); (b) the standard error based on a simple two-sample comparison for this difference ("noise"); and (c) the ratio of (a) to (b) ("signal to noise"). Although these measures are not treatment effects, we chose them here to assess whether the model can preserve broad features of signal, noise, and signal-to-noise ratio in the continuous part of the compliers' outcome distributions, which we think can be very influential in estimating the treatment effects of Section 8. More preferable measures might have been the posterior mean and standard deviation for the actual estimands in Section 8 for each replicated dataset, but this would have required a prohibitive amount of computer memory because of the nested structure of that algorithm. In settings such as these, additional future work on choices of discrepancy measures is of interest.

PPPVs for the discrepancy measures that we chose were calculated as the percentage of draws in which the replicated discrepancy measures exceeded the value of the study's discrepancy measure. Extreme values (close to 0 or 1) of a PPPV would indicate a failure of the prior distribution and likelihood to replicate the corresponding measure of location, dispersion, or their relative magnitude, and would indicate an undesirable influence of the model in estimation of our estimands. Results from these checks, displayed in Table 8, provide no special evidence for such influences of the model.
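These checks can be summarized in a minimal sketch (Python; the array names are hypothetical): for each posterior predictive draw, form the complier sets A_p(z), compute the three discrepancy measures, and report the PPPV as the fraction of replicated values at least as large as the study's value:

    import numpy as np

    def discrepancies(y, ry, c, z):
        # A_p(z): observed (ry = 1), nonzero outcomes of compliers assigned z
        a1 = y[(ry == 1) & (y != 0) & (c == "complier") & (z == 1)]
        a0 = y[(ry == 1) & (y != 0) & (c == "complier") & (z == 0)]
        signal = abs(a1.mean() - a0.mean())
        noise = np.sqrt(a1.var(ddof=1) / a1.size + a0.var(ddof=1) / a0.size)
        return signal, noise, signal / noise

    def pppv(rep_values, study_value):
        # fraction of draws whose replicated measure is as or more extreme
        return np.mean(np.asarray(rep_values) >= study_value)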

10. DISCUSSION

In this article we have defined the framework for principal stratification in broken randomized experiments, to accommodate noncompliance, missing covariate information, missing outcome data, and multivariate outcomes. We make explicit a set of structural assumptions that can identify the causal effects of interest, and we also provide a parametric model that is appropriate for practical implementation of the framework in settings such as ours.

Results from our model in the school choice study do not indicate strong treatment effects for most of the subgroups examined. But we do estimate positive effects (on the order of 3 percentile points for ITT and 5 percentile points for the effect of attendance) on math scores, overall, for children who applied to the program from low-applicant schools, particularly for first graders. Also, the effects were larger for African-American children than for the remaining children (App. B).

Posterior distributions for the CACE estimand, which measures the effect of attendance in private school versus public school, are generally larger than the corresponding ITT effects, but are also associated with greater uncertainty. Of importance, because of the missing outcomes, a model like ours is needed for valid estimation even of the ITT effect.

The results from this randomized study are not subject to selection bias in the way that nonrandomized studies in school choice have been. Nevertheless, although we use the CACE, a well-defined causal effect, to represent the effect of attendance of private versus public schools, it is important to remember that the CACE is defined on a subset of the study children (those who would have complied with either assignment), and that for the other children there is no information on such an effect of attendance in this study. Therefore, as with any randomized trial based on a subpopulation, external information, such as background variables, also must be used when generalizing the CACE from compliers to other target groups of children. Also, it is possible that a broader implementation of a voucher program can have a collective effect on the public schools, if a large shift of the children who might use vouchers were to have an impact on the quality of learning for children who would stay in public schools (a violation of the no-interference assumption). Because our study contains a small fraction of participants relative to all school children, it cannot provide direct information about any such collective effect, and additional external judgment would need to be used to address this issue.

The larger effects that we estimated for children applying from schools with low versus high past scores are also not, in principle, subject to the usual regression to the mean bias, in contrast to a simple before–after comparison. This is because in our study both types of children are randomized to be offered the scholarship or not, and both types in both treatment arms are evaluated at the same time after randomization. Instead, the differential effect for children from schools with different past scores is evidence supporting the claim that the school voucher program has greater potential benefit for children in lower-scoring schools.

Our results also reveal differences in compliance and missing-data patterns across groups. Differences in compliance indicate that children applying from low-applicant schools are generally more likely to comply with their treatment assignment; this could provide incentives for policy makers to target this subgroup of children. However, this group also exhibits higher levels of always takers, indicating that parents of children attending these poorly performing schools are more likely to get their child into a private school even in the absence of scholarship availability. These considerations would of course have to be balanced with (and indeed might be dwarfed by) concerns about equity. Missing-data patterns reveal that perhaps relatively greater effort is needed in future interventions of this nature to retain study participants who stay in public schools, particularly the public-school students who actually won but did not use a scholarship.

The approach presented here also has some limitations. First, we have presented principal stratification only on a binary controlled factor Z and a binary uncontrolled factor D, and it is of interest to develop principal stratification for more levels of such factors. Approaches that extend our framework in that direction, including for time-dependent data, have been proposed (e.g., Frangakis et al. 2002b).

Also, the approach of explicitly conditioning on the patterns of missing covariates that we adopted in the parametric component of Section 7 is not as applicable when there are many patterns of missing covariates. In such cases, it would be more appropriate to use models related to those of D'Agostino and Rubin (2000) and integrate them with principal stratification for noncompliance. Moreover, it would also be interesting to investigate results for models fit to these data that allow deviations from the structural assumptions, such as weakened exclusion restrictions (e.g., Hirano et al. 2000; Frangakis, Rubin, and Zhou 2002a), although to ensure robustness of such models, it would be important to first investigate and rely on additional alternative assumptions that would be plausible.

APPENDIX A: COMPUTATIONS

Details are available at http://biosun01.biostat.jhsph.edu/~cfrangak/papers/sc.

Table B.1. ITT Estimand

Grade at                   Applicant's school: Low             Applicant's school: High
application  Ethnicity  Reading           Math               Reading            Math

1            AA         2.8 (−1.5, 7.2)   6.7 (3.0, 10.4)    1.8 (−5.2, 8.4)    6.3 (0.8, 11.9)
             Other      1.7 (−1.9, 5.4)   3.9 (0.5, 7.2)     0.8 (−4.5, 6.1)    3.5 (−1.2, 8.2)
2            AA         0.8 (−2.8, 4.4)   2.4 (−1.0, 5.8)    −0.3 (−6.6, 6.0)   2.2 (−3.8, 8.0)
             Other      0.2 (−3.1, 3.4)   0.5 (−2.8, 3.7)    −0.9 (−6.1, 4.3)   0.0 (−5.2, 4.8)
3            AA         1.1 (−3.0, 5.2)   4.6 (0.3, 8.9)     −0.1 (−6.7, 6.4)   4.2 (−2.5, 10.6)
             Other      0.3 (−3.2, 3.8)   2.0 (−2.0, 5.8)    −0.7 (−5.8, 4.4)   1.4 (−3.9, 6.5)
4            AA         3.7 (−1.2, 8.7)   4.4 (−0.5, 9.3)    2.4 (−4.7, 9.4)    3.8 (−2.9, 10.5)
             Other      2.4 (−1.7, 6.7)   1.8 (−2.6, 6.0)    1.4 (−4.3, 7.0)    1.1 (−4.4, 6.4)
Overall      AA         2.0 (−0.9, 4.8)   4.5 (1.8, 7.2)     0.8 (−5.1, 6.5)    4.2 (−1.1, 9.3)
             Other      1.1 (−1.4, 3.6)   2.1 (−0.7, 4.7)    0.1 (−4.6, 4.6)    1.6 (−2.9, 5.8)

NOTE: Plain numbers are means, and parentheses are central 95% intervals, of the posterior distribution of the effects on percentile rank.


Table B.2. CACE Estimand

Grade at                   Applicant's school: Low             Applicant's school: High
application  Ethnicity  Reading           Math               Reading             Math

1            AA         3.8 (−2.0, 9.6)   9.0 (4.1, 14.0)    2.3 (−7.3, 10.9)    8.3 (1.1, 15.5)
             Other      2.9 (−3.1, 8.8)   6.3 (0.8, 11.8)    1.3 (−8.0, 10.2)    5.9 (−2.2, 13.9)
2            AA         1.1 (−3.6, 5.7)   3.1 (−1.3, 7.5)    −0.4 (−9.0, 7.9)    2.9 (−5.0, 10.7)
             Other      0.3 (−4.9, 5.4)   0.8 (−4.4, 5.9)    −1.5 (−10.5, 7.3)   0.0 (−8.5, 8.5)
3            AA         1.5 (−4.1, 7.0)   6.3 (0.3, 12.2)    −0.2 (−9.1, 8.5)    5.6 (−3.2, 14.1)
             Other      0.5 (−5.4, 6.4)   3.5 (−3.4, 10.2)   −1.3 (−10.5, 7.9)   2.7 (−7.0, 12.0)
4            AA         4.6 (−1.6, 10.8)  5.5 (−0.7, 11.6)   3.0 (−6.0, 11.6)    4.8 (−3.6, 13.1)
             Other      3.7 (−2.6, 10.1)  2.8 (−3.8, 9.3)    2.3 (−7.4, 11.7)    2.0 (−7.0, 11.0)
Overall      AA         2.6 (−1.2, 6.3)   6.0 (2.4, 9.5)     1.0 (−6.9, 8.4)     5.5 (−1.5, 12.2)
             Other      1.7 (−2.3, 5.8)   3.3 (−1.2, 7.7)    0.1 (−8.1, 7.9)     2.7 (−5.0, 10.3)

NOTE: Plain numbers are means, and parentheses are central 95% intervals, of the posterior distribution of the effects on percentile rank.

APPENDIX B: ETHNIC BREAKDOWN

The model of Section 7 can be used to estimate the effect of the program on finer strata that may be of interest. For example, to estimate effects stratified by ethnicity (AA for African-American), we obtain the posterior distribution of the causal effect of interest (ITT or CACE) using the same steps described in Section 8.1.1 or 8.1.2, but allowing ethnicity to be part of the definition of W^p. The results for this stratification are reported in Table B.1 for the estimands of ITT and Table B.2 for the estimands of CACE.

The results follow similar patterns to those in Section 8. For each grade and applicant's school (low/high) combination, however, the effects are more positive on average for the subgroup of African-American children. For both estimands, this leads to 95% intervals that are entirely above 0 for math scores for the following subgroups: African-American first- and third-graders, and overall, from low-applicant schools; African-American first-graders from high-applicant schools; and non-African-American first-graders from low-applicant schools. All intervals for reading covered the null value. This suggests that the positive effects reported on math scores in Section 8 for children originating from low-applicant schools are primarily attributable to gains among the African-American children in this subgroup.

[Received October 2002. Revised October 2002.]

REFERENCES

Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996), "Identification of Causal Effects Using Instrumental Variables" (with discussion), Journal of the American Statistical Association, 91, 444–472.
Barnard, J., Du, J., Hill, J. L., and Rubin, D. B. (1998), "A Broader Template for Analyzing Broken Randomized Experiments," Sociological Methods and Research, 27, 285–317.
Barnard, J., Frangakis, C., Hill, J. L., and Rubin, D. B. (2002), "School Choice in NY City: A Bayesian Analysis of an Imperfect Randomized Experiment," in Case Studies in Bayesian Statistics, Vol. V, eds. C. Gatsonis, R. E. Kass, B. Carlin, A. Carriquiry, A. Gelman, I. Verdinelli, and M. West, New York: Springer-Verlag, pp. 3–97.
Brandl, J. E. (1998), Money and Good Intentions Are Not Enough, or Why Liberal Democrats Think States Need Both Competition and Community, Washington, DC: Brookings Institute Press.
Carnegie Foundation for the Advancement of Teaching (1992), School Choice: A Special Report, San Francisco: Jossey-Bass.
Chubb, J. E., and Moe, T. M. (1990), Politics, Markets, and America's Schools, Washington, DC: Brookings Institute Press.
Cobb, C. W. (1992), Responsive Schools, Renewed Communities, San Francisco: Institute for Contemporary Studies.
Coleman, J. S., Hoffer, T., and Kilgore, S. (1982), High School Achievement, New York: Basic Books.
Cookson, P. W. (1994), School Choice: The Struggle for the Soul of American Education, New Haven, CT: Yale University Press.
Coulson, A. J. (1999), Market Education: The Unknown History, Bowling Green, OH: Social Philosophy & Policy Center.
Cox, D. R. (1958), Planning of Experiments, New York: Wiley.
D'Agostino, Ralph B., Jr., and Rubin, D. B. (2000), "Estimating and Using Propensity Scores With Incomplete Data," Journal of the American Statistical Association, 95, 749–759.
Education Week (1998), Quality Counts '98: The Urban Challenge, Public Education in the 50 States, Bethesda, MD: Editorial Projects in Education.
Frangakis, C. E., and Rubin, D. B. (1999), "Addressing Complications of Intention-to-Treat Analysis in the Combined Presence of All-or-None Treatment-Noncompliance and Subsequent Missing Outcomes," Biometrika, 86, 365–380.
(2001), "Addressing the Idiosyncrasy in Estimating Survival Curves Using Double-Sampling in the Presence of Self-Selected Right Censoring" (with discussion), Biometrics, 57, 333–353.
(2002), "Principal Stratification in Causal Inference," Biometrics, 58, 20–29.
Frangakis, C. E., Rubin, D. B., and Zhou, X. H. (2002a), "Clustered Encouragement Design With Individual Noncompliance: Bayesian Inference and Application to Advance Directive Forms" (with discussion), Biostatistics, 3, 147–164.
Frangakis, C. E., Brookmeyer, R. S., Varadhan, R., Mahboobeh, S., Vlahov, D., and Strathdee, S. A. (2002b), "Methodology for Evaluating a Partially Controlled Longitudinal Treatment Using Principal Stratification, With Application to a Needle Exchange Program," Technical Report NEP-06-02, Johns Hopkins University, Dept. of Biostatistics.
Fuller, B., and Elmore, R. F. (1996), Who Chooses? Who Loses? Culture, Institutions, and the Unequal Effects of School Choice, New York: Teachers College Press.
Gelfand, A. E., and Smith, A. F. M. (1990), "Sampling-Based Approaches to Calculating Marginal Densities," Journal of the American Statistical Association, 85, 398–409.
Gelman, A., Meng, X.-L., and Stern, H. (1996), "Posterior Predictive Assessment of Model Fitness via Realized Discrepancies" (with discussion), Statistica Sinica, 6, 733–760.
Gelman, A., and Rubin, D. B. (1992), "Inference from Iterative Simulation Using Multiple Sequences" (with discussion), Statistical Science, 7, 457–472.
Glynn, R. J., Laird, N. M., and Rubin, D. B. (1993), "Multiple Imputation in Mixture Models for Nonignorable Nonresponse With Follow-Ups," Journal of the American Statistical Association, 88, 984–993.
Goldberger, A. S., and Cain, G. G. (1982), "The Causal Analysis of Cognitive Outcomes in the Coleman, Hoffer and Kilgore Report," Sociology of Education, 55, 103–122.
Haavelmo, T. (1943), "The Statistical Implications of a System of Simultaneous Equations," Econometrica, 11, 1–12.
Hastings, W. (1970), "Monte Carlo Sampling Methods Using Markov Chains and Their Applications," Biometrika, 57, 97–109.
Hill, J. L., Rubin, D. B., and Thomas, N. (2000), "The Design of the New York School Choice Scholarship Program Evaluation," in Donald Campbell's Legacy, ed. L. Bickman, Newbury Park, CA: Sage Publications.
Hirano, K., Imbens, G. W., Rubin, D. B., and Zhou, A. (2000), "Assessing the Effect of an Influenza Vaccine in an Encouragement Design," Biostatistics, 1, 69–88.
Holland, P. (1986), "Statistics and Causal Inference," Journal of the American Statistical Association, 81, 945–970.
Imbens, G. W., and Angrist, J. D. (1994), "Identification and Estimation of Local Average Treatment Effects," Econometrica, 62, 467–476.


Imbens, G. W., and Rubin, D. B. (1997), "Bayesian Inference for Causal Effects in Randomized Experiments With Noncompliance," The Annals of Statistics, 25, 305–327.
Levin, H. M. (1998), "Educational Vouchers: Effectiveness, Choice, and Costs," Journal of Policy Analysis and Management, 17, 373–392.
Little, R. J. A. (1993), "Pattern-Mixture Models for Multivariate Incomplete Data," Journal of the American Statistical Association, 88, 125–134.
(1996), "Pattern-Mixture Models for Multivariate Incomplete Data With Covariates," Biometrics, 52, 98–111.
Little, R. J. A., and Rubin, D. B. (1987), Statistical Analysis With Missing Data, New York: Wiley.
Meng, X. L. (1996), "Posterior Predictive p Values," The Annals of Statistics, 22, 1142–1160.
Neal, D. (1997), "The Effects of Catholic Secondary Schooling on Educational Achievement," Journal of Labor Economics, 15, 98–123.
Neyman, J. (1923), "On the Application of Probability Theory to Agricultural Experiments: Essay on Principles, Section 9," translated in Statistical Science, 5, 465–480.
Peterson, P. E., and Hassel, B. C. (eds.) (1998), Learning from School Choice, Washington, DC: Brookings Institute Press.
Peterson, P. E., Myers, D. E., Howell, W. G., and Mayer, D. P. (1999), "The Effects of School Choice in New York City," in Earning and Learning: How Schools Matter, eds. S. E. Mayer and P. E. Peterson, Washington, DC: Brookings Institute Press.
Robins, J. M., Greenland, S., and Hu, F.-C. (1999), "Estimation of the Causal Effect of a Time-Varying Exposure on the Marginal Mean of a Repeated Binary Outcome" (with discussion), Journal of the American Statistical Association, 94, 687–712.
Rosenbaum, P. R., and Rubin, D. B. (1983), "The Central Role of the Propensity Score in Observational Studies for Causal Effects," Biometrika, 70, 41–55.
Rubin, D. B. (1974), "Estimating Causal Effects of Treatments in Randomized and Non-Randomized Studies," Journal of Educational Psychology, 66, 688–701.
(1977), "Assignment to Treatment Groups on the Basis of a Covariate," Journal of Educational Statistics, 2, 1–26.
(1978a), "Bayesian Inference for Causal Effects: The Role of Randomization," The Annals of Statistics, 6, 34–58.
(1978b), "Multiple Imputations in Sample Surveys: A Phenomenological Bayesian Approach to Nonresponse," in Proceedings of the Survey Research Methods Section, American Statistical Association, pp. 20–28.
(1979), "Using Multivariate Matched Sampling and Regression Adjustment to Control Bias in Observational Studies," Journal of the American Statistical Association, 74, 318–328.
(1980), Comment on "Randomization Analysis of Experimental Data: The Fisher Randomization Test," Journal of the American Statistical Association, 75, 591–593.
(1984), "Bayesianly Justifiable and Relevant Frequency Calculations for the Applied Statistician," The Annals of Statistics, 12, 1151–1172.
(1990), "Comment: Neyman (1923) and Causal Inference in Experiments and Observational Studies," Statistical Science, 5, 472–480.
The Coronary Drug Project Research Group (1980), "Influence of Adherence to Treatment and Response of Cholesterol on Mortality in the Coronary Drug Project," New England Journal of Medicine, 303, 1038–1041.
Tinbergen, J. (1930), "Determination and Interpretation of Supply Curves: An Example," in The Foundations of Econometric Analysis, eds. D. Hendry and M. Morgan, Cambridge, U.K.: Cambridge University Press.
Wilms, D. J. (1985), "Catholic School Effect on Academic Achievement: New Evidence From the High School and Beyond Follow-Up Study," Sociology of Education, 58, 98–114.

Comment

Bengt MUTHÉN, Booil JO, and C. Hendricks BROWN

1. INTRODUCTION

The article by Barnard, Frangakis, Hill, and Rubin (BFHR) is timely in that the Department of Education is calling for more randomized studies in educational program evaluation. (See the discussion of the "No Child Left Behind" initiative in, e.g., Slavin 2002.) BFHR can serve as a valuable pedagogical example of a successful, sophisticated statistical analysis of a randomized study. Our commentary is intended to provide additional pedagogical value to benefit the planning and analysis of future studies, drawing on experiences and research within our research group. [The Prevention Science Methodology Group (PSMG; www.psmg.hsc.usf.edu), co-PIs Brown and Muthén, has collaborated over the last 15 years with support from the National Institute of Mental Health and the National Institute on Drug Abuse.]

BFHR provides an exemplary analysis of the data from an imperfect randomized trial that suffers from several complications simultaneously: noncompliance, missing data in outcomes, and missing data in covariates. We are very pleased to

Bengt Muthén is Professor and Booil Jo is postdoc, Graduate School of Education and Information Studies, University of California, Los Angeles, Los Angeles, CA 90095 (E-mail: bmuthen@ucla.edu). C. Hendricks Brown is Professor, Department of Epidemiology and Biostatistics, University of South Florida, Tampa, FL 33620. The research of the first author was supported by National Institute on Alcohol Abuse and Alcoholism grant K02 AA 00230. The research of the second and third authors was supported by National Institute on Drug Abuse and National Institute of Mental Health grant MH40859. The authors thank Chen-Pin Wang for research assistance, Joyce Chappell for graphical assistance, and the members of the Prevention Science Methodology Group and the Fall ED299A class for helpful comments.

see their application of cutting-edge Bayesian methods for dealing with these complexities. In addition, we believe the methodological issues and the results of the study have important implications for the design and analysis of randomized trials in education and for related policy decisions.

BFHR provides results of the New York City school choice experiment based on 1-year achievement outcomes. With the planned addition of yearly follow-up data, growth models can provide an enhanced examination of causal impact. We discuss how such growth modeling can be incorporated, and provide a caution that applies to BFHR's use of only one posttest occasion. We also consider the sensitivity of the latent class ignorability assumption in combination with the assumption of compound exclusion.

2. LONGITUDINAL MODELING ISSUES

BFHR focuses on variation in treatment effect across compliance classes. This part of the commentary considers variation in treatment effect across a different type of class, based on the notion that the private school treatment effect might very well be quite different for children with different achievement development. (Also of interest is potential variation in treatment effects across schools, with respect to both the public school the child originated in and the private school the child was moved to, but this multilevel aspect of the data is left aside here for lack of space.) To study such a "treatment–baseline interaction" (or "treatment–trajectory interaction"), we will switch from BFHR's pretest–posttest analysis framework (essentially a very advanced ANCOVA-type analysis) to the growth mixture modeling framework of Muthén et al. (2002). An underlying rationale for this modeling is that individuals at different initial status levels and on different trajectories may benefit differently from a given treatment. ANCOVA controls for initial status as measured by the observed pretest score. Unlike the observed pretest score, the latent variable of initial status is free of time-specific variation and measurement error.

The focus on longitudinal aspects of the New York School Choice Study (NYSCS) is both substantively and statistically motivated. First, treatment effects may not have gained full strength after only a 1-year stay in a private school. The NYSCS currently has data from three follow-ups, that is, providing repeated-measures data from four grades. Although BFHR used percentile scores that do not lend themselves to growth modeling, a conversion to "scale scores" (i.e., IRT-based equated scores) should be possible, enabling growth modeling. Unfortunately, educational research traditionally uses scales that are unsuitable for growth modeling, such as percentile scores, normal curve equivalents, and grade equivalents (for a comparison in a growth context, see Seltzer, Frank, and Bryk 1994). Hopefully, this tradition can be changed. Second, the use of information from more time points than pretest and posttest makes it possible to identify and estimate models that give a richer description of the normative development in the control group and how the treatment changes this development.

Consider three types of latent variables for individual i. The first type, C_i, refers to BFHR's compliance principal strata. The next two relate to the achievement development as expressed by a growth mixture model: T_i refers to trajectory class, and η_i refers to random effects within trajectory class (the within-class model is a regular mixed-effects model). Unlike the latent class variable C_i, the latent class variable T_i is a fully unobserved variable, as is common in latent variable modeling (see, e.g., Muthén 2002a). Consider the likelihood expression for individual i, using the [·] notation to denote probabilities/densities:

[C_i, T_i | X_i] [η_i | C_i, T_i, X_i] [Y_i | η_i, C_i, T_i, X_i] × [U_i | η_i, C_i, T_i, X_i] [R_i | Y_i, η_i, C_i, T_i, X_i],  (1)

where X_i denotes covariates, U_i denotes a compliance stratum indicator (with C_i perfectly measured by U_i in the treatment group for never-takers and perfectly measured in the control group for always-takers, other group–class combinations having missing data), and R_i denotes indicators for missing data on the repeated-measures outcomes Y_i (pretreatment and posttreatment achievement scores). This type of model can be fitted in the latent variable modeling framework of the Mplus program (Muthén and Muthén 1998–2002, tech. app. 8), which has implemented an EM-based maximum likelihood estimator. (For related references, see the Mplus website, www.statmodel.com.) As a special case of (1), conventional random-effects growth modeling includes η_i but excludes C_i and T_i and assumes missingness at random, so that the last term in (1) is ignored. Growth mixture modeling (Muthén and Shedden 1999; Muthén et al. 2002; Muthén and Muthén 1998–2002) includes η_i and T_i. BFHR includes C_i but not T_i (or η_i), and includes the last term in (1), drawing on the latent ignorability of Frangakis and Rubin (1999). Muthén and Brown (2001) studied latent ignorability related to T_i in the last term of (1). In randomized studies, it would be of interest to study C_i and T_i classes jointly, because individuals in different trajectory classes may show different compliance, and missingness may be determined by these classes jointly.

If data have been generated by a growth mixture model with treatment effects varying across trajectory classes, what would pretest–posttest analysis such as that in BFHR reveal? To judge the possibility of such treatment–trajectory interaction in the NYSCS, we considered several recent applications of growth mixture modeling that have used T_i to represent qualitatively different types of trajectories for behavior and achievement scores on children in school settings. Drawing on these real-data studies, two growth mixture scenarios were investigated. (A detailed description of these real-data studies and scenarios and their parameter values are given in Mplus Web Note 5 at www.statmodel.com/mplus/examples/webnote.html.) For simplicity, no missing data on the outcome or pretest is assumed, and C_i classes are not present. In a three-class scenario, the treatment effect is noteworthy only for a 70% middle class, assuming that the low-class membership (10%) hinders individuals from benefiting from the treatment, and assuming that the high-class membership (20%) does not really need the treatment. The achievement development in the three-class scenario is shown in Figure 1(a), and the corresponding posttest (y2)–pretest (y1) regressions are shown in Figure 1(b). The lines denoted ANCOVA show a regular ANCOVA analysis, allowing for an interaction between treatment and pretest (different slopes). In the three-class scenario, the ANCOVA interaction is not significant at n = 2,000, and the treatment effect in the middle class is underestimated by 20% but overestimated in the other two classes. In a two-class scenario (not shown here), where the treatment is noteworthy only for individuals in the low class (50%), ANCOVA detects an interaction that is significant at the NYSCS sample size of n = 2,000, but underestimates the treatment effect for most children in the low class. (At the low-class average pretest value of 0, the treatment effect is underestimated by 32%.)
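The flavor of this comparison can be reproduced with a minimal simulation sketch (Python; the parameter values below are illustrative stand-ins, not those of the web note): only the middle class benefits, yet the pretest–posttest ANCOVA returns one blended effect and interaction.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 2000
    cls = rng.choice([0, 1, 2], size=n, p=[0.1, 0.7, 0.2])   # low/middle/high class
    tx = rng.integers(0, 2, size=n)                          # randomized treatment
    intercepts = np.array([-1.0, 0.0, 1.0])[cls]
    y1 = intercepts + rng.normal(0, 0.5, n)                  # pretest
    effect = np.where(cls == 1, 0.4, 0.0)                    # middle class benefits only
    y2 = intercepts + effect * tx + rng.normal(0, 0.5, n)    # posttest

    X = np.column_stack([np.ones(n), tx, y1, tx * y1])       # ANCOVA w/ interaction
    beta = np.linalg.lstsq(X, y2, rcond=None)[0]
    print(beta)  # the single fitted "effect" blends the class-specific effects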

Although the NYSCS children are selected from low-performing schools (the average NYSCS math and reading percentile rankings are around 23–28), there may still be sufficient heterogeneity among children in their achievement growth to make a treatment–trajectory interaction plausible. The three-class scenario is possible, perhaps with more children in the low class relative to the other two classes. If this is the case, the ANCOVA analysis shown in Figure 1 suggests a possible reason for BFHR's finding of low treatment effects. The empirical studies and the results in Figure 1 suggest that future program evaluations may benefit from exploring variation in treatment effects across children characterized by different development. Using data from at least two posttreatment time points (three time points total), the class-specific treatment effects generated in these data can be well recovered by growth mixture modeling. (Monte Carlo simulation results are given in Mplus Web Note 5 at www.statmodel.com/mplus/examples/webnote.html.)


Figure 1. Growth Mixture Modeling Versus Pretest–Posttest Analysis.

A more flexible analysis is obtained with more posttreatment time points. An improved design for the determination of the latent trajectory classes is to use more than one pretreatment time point, so that the trajectory class membership is better determined before the treatment starts.

3. COMPOUND EXCLUSION AND LATENT IGNORABILITY

Based on the ideas of principal stratification (Frangakis and Rubin 2002) and latent ignorability (Frangakis and Rubin 1999), BFHR successfully demonstrates that the complexities of educational studies can be better handled under more explicit and flexible sets of assumptions. Although we think that their structural assumptions are reasonable in the NYSCS, we would like to add some thoughts on the plausibility of two other assumptions, considering more general situations.

Compound exclusion (CE) is one of the key structural assumptions in identifying principal effects under latent ignorability. However, the plausibility of this assumption can be questioned in practice (Frangakis et al. 2002; Hirano et al. 2000; Jo 2002a, 2002c; Shadish, Cook, and Campbell 2002; West and Sagarin 2000). In the NYSCS, it seems realistic to assume that winning a lottery has some positive impact on always-takers; however, it is less clear how winning a lottery will affect never-takers. One possibility is that winning a lottery has a negative impact on parents, because they fail to benefit from it. Discouraged parents may have a negative influence on a child's test scores or response behaviors. This negative effect may become more evident if noncompliance is due to parents' low expectation of, or lack of interest in, their child's education. Another possibility is that winning a lottery has a positive impact on a child's test scores or response behaviors. For example, parents who are discouraged by being unable to send their child to private schools even with vouchers may try harder to improve the quality of existing resources (e.g., in the public school their child attends) and be more motivated to support their child to

improve his or her academic performance. Given these competing possibilities, it is not easy to predict whether and how CE is violated.

Depending on the situation, causal effect estimates can be quite sensitive to violation of the exclusion restriction in outcome missingness (Jo 2002b), which is less known than the impact of violating the exclusion restriction in observed outcomes (Angrist et al. 1996; Jo 2002a). The implication of possible violation of CE, and its impact, is that the relative benefit of models assuming latent ignorability (LI) and standard ignorability (SI) depends on degrees of deviation from CE and SI. Identification of causal effects under LI relies on the generalized (compound) exclusion restriction (i.e., both on the outcomes and on missingness of outcomes), whereas identification of causal effects under SI relies on the standard exclusion restriction (i.e., only on the outcomes). Therefore, in some situations, the impact of deviation from CE may outweigh the impact of deviation from SI, resulting in more biased causal effect estimates in models assuming LI than in models assuming SI (Jo 2002b). For example, if SI holds but CE is seriously violated (say, a 20% increase in the response rate due to treatment assignment for compliers and a 15% increase for never-takers), causal effect estimates and the coverage probability assuming LI and CE can drastically deviate from the true value and the nominal level. This type of violation does not affect models assuming SI and the standard exclusion restriction. To empirically examine the plausibility of SI, LI, and CE, it will be useful to do sensitivity analyses of models imposing different combinations of these assumptions. As BFHR points out, this investigation can be conducted by relaxing compound exclusion (e.g., Frangakis et al. 2002; Hirano et al. 2000) or by using alternative structural assumptions (e.g., Jo 2002c). More research is needed to examine the efficiency of these alternative models and to explore factors associated with insensitivity of LI models to violation of compound exclusion.

4. CONCLUSION

Causal inferences of the type BFHR provides are a dramatic improvement over the existing literature now available on the question of whether school choice will produce better achievement outcomes for children in an urban public school system. The randomized lottery provides an exceptionally powerful tool for examining the impact of a program, far more useful than observational studies that have causal change intertwined hopelessly with self-selection factors. Statisticians are just now investigating variations in such principal strata analyses, that is, those involving latent classes formed as a function of randomized trials involving intervention invitations (such as vouchers), encouragement designs, and field trial designs involving more than one randomization (Brown and Liao 1999). The latent categories in this article, which BFHR labels "complier," "never-taker," "always-taker," and "defier," represent only one type of design. Other terms may be more relevant to the scientific questions underlying trials in which subjects are randomly assigned to different levels of invitation (e.g., Angrist and Imbens 1995) or different levels of implementation. Such trials not only have great potential for examining questions of effectiveness, sustainability, and scalability, but also require terms more consistent with adherence than compliance. Again, we congratulate the authors on an important addition to the methodological literature that we predict will have lasting impact.

ADDITIONAL REFERENCES

Angrist, J. D., and Imbens, G. W. (1995), "Two-Stage Least Squares Estimation of Average Causal Effects in Models With Variable Treatment Intensity," Journal of the American Statistical Association, 90, 431–442.
Brown, C. H., and Liao, J. (1999), "Principles for Designing Randomized Preventive Trials in Mental Health: An Emerging Developmental Epidemiologic Perspective," American Journal of Community Psychology, 27, 673–709.
Jo, B. (2002a), "Model Misspecification Sensitivity Analysis in Estimating Causal Effects of Interventions With Noncompliance," Statistics in Medicine, 21, 3161–3181.
(2002b), "Sensitivity of Causal Effects Under Ignorable and Latent Ignorable Missing-Data Mechanisms," available at www.statmodel.com/mplus/examples/jo.
(2002c), "Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications" (with discussion), Journal of Educational and Behavioral Statistics, 27, 385–420.
Muthén, B. (2002a), "Beyond SEM: General Latent Variable Modeling," Behaviormetrika, 29, 81–117.
Muthén, B., and Brown, C. H. (2001), "Non-Ignorable Missing Data in a General Latent Variable Modeling Framework," presented at the annual meeting of the Society for Prevention Research, Washington, DC, June 2001.
Muthén, B., Brown, C. H., Masyn, K., Jo, B., Khoo, S. T., Yang, C. C., Wang, C. P., Kellam, S., Carlin, J., and Liao, J. (2002), "General Growth Mixture Modeling for Randomized Preventive Interventions," Biostatistics, 3, 459–475.
Muthén, L., and Muthén, B. (1998–2002), Mplus User's Guide, Los Angeles: authors.
Muthén, B., and Shedden, K. (1999), "Finite Mixture Modeling With Mixture Outcomes Using the EM Algorithm," Biometrics, 55, 463–469.
Seltzer, M. H., Frank, K. A., and Bryk, A. S. (1994), "The Metric Matters: The Sensitivity of Conclusions About Growth in Student Achievement to Choice of Metric," Educational Evaluation and Policy Analysis, 16, 41–49.
Shadish, W. R., Cook, T. D., and Campbell, D. T. (2002), Experimental and Quasi-Experimental Designs for Generalized Causal Inference, Boston: Houghton Mifflin.
Slavin, R. E. (2002), "Evidence-Based Education Policies: Transforming Educational Practice and Research," Educational Researcher, 31, 15–21.
West, S. G., and Sagarin, B. J. (2000), "Participant Selection and Loss in Randomized Experiments," in Research Design, ed. L. Bickman, Thousand Oaks, CA: Sage, pp. 117–154.

Comment

Alan B. KRUEGER and Pei ZHU

The New York City school voucher experiment provides some of the strongest evidence on the effect of private school vouchers on student test achievement yet available. Barnard et al. provide a useful reevaluation of the experiment, and some of the authors were integrally involved in the design of the experiment. We will leave it to other discussants to comment on the Bayesian methodological advances in their paper, and instead comment on the substantive lessons that can be learned from the experiment and the practical lessons that emerge from the novel design of the experiment.

We have the advantage of having access to more complete data than Barnard et al. used in preparing their paper. This includes more complete baseline test information, data on multichild families as well as single-child families, and three years of follow-up test data instead of just one year of follow-up data.

Alan B. Krueger is Professor and Pei Zhu is graduate student, Economics Department, Princeton University, Princeton, NJ 08544 (E-mail: akrueger@princeton.edu). In part, this comment extends and summarizes results of Krueger and Zhu (2003). Readers are invited to read that article for a more in-depth analysis of many of the issues raised in this comment. Some of the results presented here differ slightly from those in our earlier article, however, because the definition of family size in this comment corresponds to the one used by Barnard et al., rather than to the definition in our earlier article.

In our comment, we use the more comprehensive sample because results for this sample are more informative, but we highlight where differences arise from using the sample analyzed by Barnard et al.

Three themes emerge from our analysis. First, simplicity and transparency are underappreciated virtues in statistical analysis. Second, it is desirable to use the most recent, comprehensive data for the widest sample possible. Third, there is no substitute for probing the definitions and concepts that underlie the data.

1. RANDOM ASSIGNMENT

As Barnard et al. explain, the voucher experiment entailed a complicated block design, and different random assignment procedures were used in the first and second set of blocks. In the first block, a propensity matched-pairs design (PMPD) method was used. In this block, far more potential control families were available than was money to follow them up. Rather than select a random sample of controls for follow-up after randomly



Table 1. Efficiency Comparison of Two Methods of Random Assignment: Estimated Treatment Effect and Standard Error, by Method of Random Assignment

                               Propensity score        Randomized blocks       Relative        Sample-size-adjusted
                               match subsample         subsample               sampling        relative sampling
Year    Model                  Coefficient     N       Coefficient     N       variance        variance
                                                                               (RB/PMPD)       (RB/PMPD)

All students
Year 1  Control for baseline   .33 (1.40)      721     2.10 (1.38)     734     .972            .989
        Omit baseline          .32 (1.40)      1,048   −1.02 (1.58)    1,032   1.274           1.254
Year 3  Control for baseline   1.02 (1.56)     613     .67 (1.74)      637     1.244           1.293
        Omit baseline          −.31 (1.57)     900     −.45 (1.78)     901     1.285           1.287

African-American students
Year 1  Control for baseline   .32 (1.88)      302     8.10 (1.71)     321     .827            .879
        Omit baseline          2.10 (1.99)     436     3.19 (2.07)     447     1.082           1.109
Year 3  Control for baseline   4.21 (2.51)     247     6.57 (2.09)     272     .693            .764
        Omit baseline          2.26 (2.47)     347     3.05 (2.31)     386     .875            .973

NOTE: Standard errors are in parentheses. The treatment effect coefficient is from a regression of test scores on a dummy indicating assignment to receive a voucher (1 = yes), lottery randomization strata dummies, and, in indicated models, baseline test scores. Bootstrap standard errors account for within-family correlation in residuals. Year 1 or 3 refers to follow-up year. The last column adjusts the relative sampling variances of the treatment effects for differences in relative sample sizes. The sample of African-American students consists of those whose mothers' race/ethnicity is identified as non-Hispanic Black/African-American.

selecting the treatments, the researchers estimated a propensity score model to align controls with the treatments, and then used a nearest-available-neighbor Mahalanobis match to select specific paired control families for follow-up. In principle, the PMPD design should improve the precision of the estimates by reducing the chance imbalances that can occur in a simple randomized block design. A standard randomized block design was used in the remaining four blocks.

Barnard et al. emphasize that the PMPD block was more balanced for 15 of the 21 baseline variables that they examined. But the key question is the extent to which the precision of the estimates was improved, not the covariate balance, because both blocks yield unbiased estimates. If the covariates explain relatively little of the outcome variable, then the PMPD will add very little. In Table 1 we provide standard intent-to-treat (ITT) estimates, that is, the difference in mean test scores between treatments and controls, conditional on the strata used for random assignment (family size by block by high/low-achieving school), and, more importantly, their standard errors, separately for the PMPD subsample and the randomized block subsample. The outcome variable is the average of the national percentile rank on the math and reading segments of the Iowa Test of Basic Skills, taken either 1 year or 3 years after random assignment. Two sets of results are presented: one set controlling for baseline test scores, for the subsample with baseline scores, and the other without controlling for scores, for the larger sample. The last column adjusts the relative sampling variances for differences in sample size between the blocks.
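For reference, the last two columns of Table 1 are computed as in this small sketch (the numbers are the Year 1, control-for-baseline row for all students):

    # relative sampling variance of the two designs and its sample-size adjustment
    se_pmpd, n_pmpd = 1.40, 721     # propensity-matched block
    se_rb, n_rb = 1.38, 734         # randomized blocks
    rel_var = (se_rb / se_pmpd) ** 2            # ~ .972
    adj_rel_var = rel_var * (n_rb / n_pmpd)     # ~ .989, per-observation basis
    print(rel_var, adj_rel_var)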

For all students (panel A), the standard errors are about 10% smaller in the PMPD sample. For the Black students (panel B), however, the standard errors are essentially the same or slightly larger in the PMPD sample. This finding is unexpected, because Hill, Rubin, and Thomas (2000) predicted that analyses of subgroups would have more power in the PMPD design, because

matching should lead to a more nearly equal representation of subgroups in the treatment and control groups. In addition, the covariate balance is less good for the Black students in the PMPD than in the randomized blocks, if we compute Z scores for the differences between treatments and controls for the baseline covariates. One possibility is that because Black students could have been matched to non-Black students in the PMPD design, this subsample was actually less balanced than the subsample of Blacks from the more conventional randomized block design.

The increase in power from the PMPD design is relatively modest, even in the full sample. Of much more practical importance is the fact that the complexity of the PMPD design caused Mathematica to initially calculate incorrect baseline sample weights. The baseline weights were designed to make the follow-up sample representative of all eligible applicants for vouchers. The initial weights assigned much too little importance to the control sample in the PMPD block. Because these families represented many unselected controls, they should have been weighted heavily. The revised weights increased the weight on the PMPD controls by 620%; in contrast, the weight increased by just 18% for the rest of the sample.

This error had grave consequences for the initial inferences and analyses of the data. First, using the faulty weights, Peterson, Myers, and Howell (1998) reported highly statistically significant differences in baseline test scores between the treatment and control groups, with control group members scoring higher on both the math and reading exams. After the mistake was discovered and the weights were revised, however, the baseline differences became small and statistically insignificant. (We note, however, that if we limit attention to the students from single-child families that Barnard et al. use, and use the more complete baseline test score data, then there is a statistically significant difference in baseline reading scores between treatments and controls, even with the revised baseline weights. If we pool together students from single-child and multichild families, then the baseline difference is insignificant. This is another reason why we think it is more appropriate to use the broader sample that includes children from all families.)

Second, because of the inaccurate inference that there was a difference in baseline ability between treatments and controls, Mathematica's researchers were discouraged from analyzing (or at least from presenting) results that did not condition on baseline scores. This is unfortunate, because conditioning on baseline scores caused the researchers to drop from the sample all the students who were initially in kindergarten (because these students were not given baseline tests) and 11% of the students initially in grades 1-4. Including students with missing baseline scores increases the sample by more than 40% and expands the population to which the results can be generalized. As explained later, and in an article by Krueger and Zhu (2003), qualitatively different results are found if students with missing baseline scores are included in the sample. Because assignment to receive a voucher was random within lottery strata, estimates that do not condition on baseline scores are nonetheless unbiased. Moreover, it is inefficient to exclude students with missing baseline scores (most of whom had follow-up test scores), and such a sample restriction can potentially cause sample selection bias.

Of lesser importance is the fact that the PMPD design also complicates the calculation of standard errors. The matching of treatments and controls on selected covariates creates a dependence between paired observations. Moreover, if the propensity score model is misspecified, then the equation error also creates a dependence between observations. Researchers have not taken this dependence into account in the calculation of standard errors. We suspect, however, that this is likely to cause only a small understatement of the standard errors, because the covariates do not account for much of the residual variance of test scores and because the PMPD sample is only about half of the overall sample. (We tried to compute the propensity score to adjust the standard errors ourselves, but were unable to replicate the PMPD model, because it was not described in sufficient detail and because the computer programs are proprietary. This also prevented us from replicating estimates in Table 4 of Barnard et al.)

The problems created by the complicated experimental design, particularly concerning the weights, lead us to reiterate the advice of Cochran and Cox (1957): "A good working rule is to use the simplest experimental design that meets the needs of the occasion." It seems to us that in this case the PMPD design introduced unnecessary complexity that inadvertently led to a consequential mistake in the computation of weights, with very little improvement in efficiency. This was not the fault of the architects of the PMPD design, who did not compute the weights, but it was related to the complexity of the design. Simplicity and transparency, in designs, methods, and procedures, can help avoid mistakes down the road, which are almost inevitable in large-scale empirical studies. Indeed, Mathematica recently informed us that they still do not have the baseline weights exactly correct, 5 years after random assignment.

2. INTENT-TO-TREAT ESTIMATION RESULTS

Barnard et al. devote much attention to addressing potential problems caused by missing data for the sample of students in grades 1-4 at baseline; this follows the practice of earlier reports by Mathematica, which restricted the analysis of student outcomes (but not parental responses, such as satisfaction) to those enrolled in grades 1-4 at baseline. In our opinion, a much more important substantive problem results from the exclusion of students in the kindergarten cohort, who were categorically dropped because they were not given baseline tests. Because assignment to treatment status was random (within strata), a simple comparison of means between treatments and controls, without conditioning on baseline scores, provides an unbiased estimate of the average treatment effect. Moreover, as Barnard et al. and others show, treatments and controls were well balanced in terms of baseline characteristics, so there is no reason to suspect that random assignment was somehow corrupted.

Table 2 presents regression estimates of the ITT effect using various samples. Because we cannot replicate the Bayesian estimates without knowing the propensity score, we present conventional ITT estimates. For each entry in the first column, we regressed the test score percentile rank on a voucher offer dummy, 30 dummies indicating lottery strata (block x family size x high/low school), and baseline test scores. The sample is limited to those with baseline test data. Our results differ from the ITT estimates in Table 5 of Barnard et al. because we use the revised weights and a more comprehensive sample that also includes students from multichild families. Barnard et al. report that their Bayesian estimates are "generally more stable" than the conventional ITT estimates, "which in some cases are not even credible," but the extreme ITT estimates that they refer to, for 4th graders from high-achieving schools, are based on only 15 students. For the samples that pool students across grades, the results of the conventional ITT estimates are a priori credible and probably not statistically different from their Bayesian estimates.
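For concreteness, a minimal sketch of this estimator and its family-clustered bootstrap standard error follows; the variable names are placeholders, and the survey weighting described in the note to Table 1 is omitted.

    import numpy as np

    def itt_estimate(y, voucher, strata):
        """Coefficient on the voucher-offer dummy from an OLS regression of
        test scores on the offer and a full set of lottery-strata dummies."""
        dummies = (strata[:, None] == np.unique(strata)[None, :]).astype(float)
        X = np.column_stack([voucher, dummies])  # dummies absorb the intercept
        beta = np.linalg.lstsq(X, y, rcond=None)[0]
        return beta[0]

    def family_clustered_se(y, voucher, strata, family, B=1000, seed=0):
        """Bootstrap SE that resamples whole families, preserving the
        within-family correlation in residuals noted under Table 1."""
        rng = np.random.default_rng(seed)
        fams = np.unique(family)
        reps = np.empty(B)
        for b in range(B):
            draw = rng.choice(fams, size=len(fams), replace=True)
            idx = np.concatenate([np.flatnonzero(family == f) for f in draw])
            reps[b] = itt_estimate(y[idx], voucher[idx], strata[idx])
        return reps.std(ddof=1)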

In the second column, we use the same sample as in column 1 but omit the baseline test score from the model. In the third column, we expand the sample to include those with missing baseline scores. In the fourth column, we continue to include students with missing baseline scores, but restrict the sample to those initially in low-achieving public schools.

We focus mainly on the results for Black students and students from low-achieving public schools, because the effect of offering a voucher on either reading or math scores for the overall sample is always statistically insignificant and because most public policy attention has focused on these two groups.

Although it has not been made explicit in previous studies of these data, Black students have been defined as children with a non-Hispanic Black/African-American mother. We use this definition, and a broader one, to probe the sensitivity of the results.

Omitting baseline scores has little qualitative effect on the estimates when the same sample is used (compare columns 1 and 2); the coefficient typically changes by no more than half a standard error. For year 3 scores for Black students, for example, the treatment effect on the composite score is 5.2 points (t = 3.2) when controlling for baseline scores and 5.0 points (t = 2.6) when not controlling. This stability is expected in a randomized experiment.


Table 2. Estimated Treatment Effects With and Without Controlling for Baseline Scores: Coefficient on Voucher Dummy From Test Score Regressions for Various Samples

Column (1): subsample with baseline scores, controls for baseline scores. Column (2): subsample with baseline scores, omits baseline scores. Column (3): full sample, omits baseline scores. Column (4): low applicant school subsample, omits baseline scores.

    Test        Sample                                   (1)           (2)           (3)           (4)

    First follow-up test
    Composite   Overall                             1.27  (.96)    .42 (1.28)   -.33 (1.06)   -.60 (1.06)
                Mother Black, non-Hispanic          4.31 (1.28)   3.64 (1.73)   2.66 (1.42)   1.75 (1.57)
                Either parent Black, non-Hispanic   3.55 (1.24)   2.52 (1.73)   1.38 (1.42)    .53 (1.55)
    Reading     Overall                             1.11 (1.04)    .03 (1.36)  -1.13 (1.17)  -1.17 (1.18)
                Mother Black, non-Hispanic          3.42 (1.59)   2.60 (1.98)   1.38 (1.74)    .93 (1.84)
                Either parent Black, non-Hispanic   2.68 (1.50)   1.49 (1.97)   -.09 (1.71)   -.53 (1.81)
    Math        Overall                             1.44 (1.17)    .80 (1.45)    .47 (1.17)   -.03 (1.15)
                Mother Black, non-Hispanic          5.28 (1.52)   4.81 (1.93)   4.01 (1.54)   2.68 (1.63)
                Either parent Black, non-Hispanic   4.42 (1.48)   3.54 (1.90)   2.86 (1.51)   1.59 (1.63)

    Third follow-up test
    Composite   Overall                              .90 (1.14)    .36 (1.40)   -.38 (1.19)    .10 (1.16)
                Mother Black, non-Hispanic          5.24 (1.63)   5.03 (1.92)   2.65 (1.68)   3.09 (1.67)
                Either parent Black, non-Hispanic   4.70 (1.60)   4.27 (1.92)   1.87 (1.68)   1.90 (1.67)
    Reading     Overall                              .25 (1.26)   -.42 (1.51)  -1.00 (1.25)   -.08 (1.24)
                Mother Black, non-Hispanic          3.64 (1.87)   3.22 (2.19)   1.25 (1.83)   2.06 (1.83)
                Either parent Black, non-Hispanic   3.37 (1.86)   2.75 (2.18)    .76 (1.83)   1.23 (1.82)
    Math        Overall                             1.54 (1.30)   1.15 (1.53)    .24 (1.33)    .29 (1.28)
                Mother Black, non-Hispanic          6.84 (1.86)   6.84 (2.09)   4.05 (1.86)   4.13 (1.85)
                Either parent Black, non-Hispanic   6.04 (1.83)   5.78 (2.09)   2.97 (1.85)   2.58 (1.82)

NOTE: Standard errors are in parentheses. The dependent variable is the test score national percentile rank (NPR). The reported coefficient is the coefficient on the voucher offer dummy. All regressions control for 30 randomization strata dummies; models in the first column also control for baseline math and reading test scores. Bootstrap standard errors are robust to correlation in residuals among students in the same family. An absolute t ratio exceeding 1.96 indicates statistical significance at the 5% level.

Sample sizes in year 1 for the subsample with baseline scores / without: overall 1,455/2,080; mother Black 623/883; either parent Black 684/968.

Sample sizes in year 3 for the subsample with baseline scores / without: overall 1,250/1,801; mother Black 519/733; either parent Black 572/807.

Sample sizes in year 1 for the subsample from low schools without baseline scores: overall 1,802; mother Black 770; either parent Black 843.

Sample sizes in year 3 for the subsample from low schools without baseline scores: overall 1,569; mother Black 647; either parent Black 712.

When the sample is expanded to include those with missing baseline scores, in column 3, however, the ITT effect falls almost in half in year 3 for Black students, to 2.65 points, with a t ratio of 1.58. The effect on the math score is statistically significant, as Barnard et al. emphasize. Qualitatively similar results are found if baseline covariates, such as mother's education and family income, are included as regressors (see Krueger and Zhu 2003).

The classification of Black students used in the study is somewhat idiosyncratic. Race and ethnicity were collected in a single question in the parental survey, contrary to the OMB guidelines for government surveys, so it is impossible to identify Blacks of Hispanic origin (of which there are a large number in New York). In addition, the survey did not directly ask about the child's race or ethnicity, so children's race must be inferred from parents' race/ethnicity. A broader definition of Black students, and one that probably comes closer to the definition used in government surveys and most other educational research, would also include those who have a Black father in the sample of Black students. According to the 1990 Census, race was reported as Black for 85% of the children with a Black father and a Hispanic mother (the overwhelming non-Black portion of the sample) in the New York metropolitan region.

Hence we also present results for a sample in which either parent is listed as "Black, non-Hispanic." This increases the sample of Black students by about 10%, and the results are even weaker for this sample. For example, the ITT estimate using the more comprehensive sample of Black students enrolled in grades K-4 at baseline is 1.87 points (t = 1.11) on the third-year follow-up composite test, although even with this sample the effect on math is significant at the 10% level in year 1, as Barnard et al. have found, and almost significant in year 3. So we conclude that the effect of the opportunity to use a private school voucher on the composite score for the most comprehensive sample of Black students is insignificantly different from 0, although it is possible that there was initially a small beneficial effect on the math score for Black students.

In the final column, we present results for the subsample of students originally enrolled in schools with average test scores below the median score in New York City. In each case, the results for this subsample are quite similar to those for the full sample, and a formal test of the null hypothesis that students from low- and high-achieving schools benefit equally from vouchers is never close to rejecting. It also appears very unlikely that the differences between the treatment effects for applicants from low- and high-achieving schools in Barnard et al.'s Table 4 are statistically significant, either.

3. WHAT WAS BROKEN

We agree with Barnard et al. that the experiment was broken, in the sense that attrition and missing data were common. Previous analyses were also strained, if not broken, by their neglect of the cohort of students originally in kindergarten, who were in third grade at the end of the experiment and whose follow-up test scores were ignored. Including these students qualitatively alters the results for Black students. The experiment was also broken in the sense that years passed before correct baseline weights were computed.

We disagree, however, with the interpretation that the experiment was broken because compliance was less than 100%. This depends on the question that one is interested in answering. Because most interest in the experiment for policy purposes centers on the impact of offering vouchers on achievement, not on compelling students to use vouchers, we think the ITT estimates, which reflect inevitable partial usage of vouchers, are most relevant (see also Rouse 1998; Angrist et al. 2003). Moreover, New York had a higher voucher take-up rate than the experiments in Dayton and the District of Columbia, so one could argue that the New York experiment provides an upper-bound estimate of the effect of offering vouchers. On the other hand, if the goal is to use the experiment to estimate the effect of attending private school on achievement, then such methods as instrumental variables or those used by Barnard et al. are necessary to estimate the parameter of interest. We consider this of secondary interest in this case, however.

ADDITIONAL REFERENCES

Angrist, J., Bettinger, E., Bloom, E., King, E., and Kremer, M. (2003), "Vouchers for Private Schooling in Colombia: Evidence From a Randomized Natural Experiment," American Economic Review, 92, 1535–1558.

Cochran, W., and Cox, G. (1957), Experimental Designs (2nd ed.), New York: Wiley.

Krueger, A., and Zhu, P. (2003), "Another Look at the New York City School Voucher Experiment," American Behavioral Scientist, forthcoming; also available from the Industrial Relations Section, Princeton University, at http://www.irs.princeton.edu.

Peterson, P., Myers, D., and Howell, W. (1998), "An Evaluation of the New York City Scholarships Program: The First Year," Mathematica Policy Research, Inc., available at http://www.ksg.harvard.edu/pepg/pdf/ny1rpt.pdf.

Rouse, C. (1998), "Private School Vouchers and Student Achievement: An Evaluation of the Milwaukee Parental Choice Program," The Quarterly Journal of Economics, 113, 553–602.

Comment

Richard A. BERK and Hongquan XU

1. INTRODUCTION

The article by Barnard, Frangakis, Hill, and Rubin (hereafter BFHR) is a virtuoso performance. By applying the Neyman-Rubin model of causal effects and building on a series of important works on the estimation and interpretation of causal effects (e.g., Angrist, Imbens, and Rubin 1996), BFHR manage to extract a reasonable set of findings for a "broken" randomized experiment. The article more generally underscores the important difference between estimating relationships in a consistent manner and interpreting those estimates in causal terms; obtaining consistent estimates is only part of the enterprise. Finally, the article emphasizes that assumptions made so that proper estimates may be obtained are not just technical moves of convenience. Rather, they are statements about how the empirical world is supposed to work. As such, they need to be examined

Richard A. Berk is Professor and Hongquan Xu is Assistant Professor, Department of Statistics, University of California, Los Angeles, CA 90095 (E-mail: berk@stat.ucla.edu).

concretely, with reference to the phenomena being studied; evaluated empirically whenever possible; and, at a minimum, subjected to sensitivity tests.

As battle-tested academics, we can certainly quibble here and there about some aspects of the article. For example, might it not have made sense to construct priors from the educators and economists who were responsible for designing the intervention? Moreover, might it not have made sense to interpret the results using a yardstick of treatment effects that are large enough to matter in practical terms? Nonetheless, we doubt that, overall, we could have done as well as BFHR did. Moreover, most of our concerns depend on features of the data that cannot be known from a distance; we would have had to carefully examine the data ourselves. As a result, we focus on why a virtuoso performance was needed to begin with. Why was this trip necessary?


2. NONCOMPLIANCE

The implementation of randomized field experiments, as desirable as they are, invites any number of well-known problems (Berk 1990). One of these problems is that human subjects often do not do what they are told, even when they say they will. There is a recent scholarly literature on noncompliance (Metry and Meyer 1999; Johnson 2000) and a citation trail leading back to evaluations of the social programs of the War on Poverty (Haveman 1977). Some noncompliance can be expected except when there is unusual control over the study subjects. Our recent randomized trial of the inmate classification system used by the California Department of Corrections is one example (Berk, Ladd, Graziano, and Baek 2003).

An obvious question follows: Why does it seem that the individuals who designed this experiment were caught by surprise? As best we can tell, little was done to reduce noncompliance, and, more important for the analyses of BFHR, apparently almost no data were collected on factors that might be explicitly related to noncompliance (Hill, Rubin, and Thomas 2000).

Without knowing far more about the design stages of the experiment, we find it difficult to be very specific. But from the literatures on compliance and school choice, the following are probably illustrative indicators relevant for parents who would not be inclined to use a school voucher; for other forms of noncompliance, other measures would be needed:

1. Whether the adults in the household are employed outside the home during the day

2. Whether there is any child care at the private schools before classes begin and after they end, for children of employed parents

3. A household's ability to pay for private school beyond the value of the voucher

4. Travel time to the nearest private schools

5. Availability of transportation to those schools

6. Religious preference of the household and religious affiliation of the private school

From these examples, and others that individuals closer to the study would know about, one can imagine a variety of measures to reduce noncompliance. For instance, key problems for any working parent are child care before and after school and round-trip school transportation. These might be addressed directly with supplemental support. If no such resources could be obtained, then the study design might be altered to screen out, before random assignment, households for whom compliance was likely to be problematic.

3. MISSING DATA

No doubt BFHR know more about the subtleties of the missing data than we do. Nevertheless, the same issues arise for missing data that arise for noncompliance. Surely problems with missing data could have been anticipated. Factors related to nonresponse to particular items could have been explicitly measured, and prevention strategies perhaps could have been deployed.

4. DESIGN EFFICIENCY

A carefully designed experiment not only provides a solid basis for statistical inference, but also gives credible evidence for causation. The New York School Choice Scholarships Program (NYSCSP) was apparently the first good example of a randomized experiment to examine the potential benefits of vouchers for private schools. Offering a scholarship (treatment) randomly through a lottery can (1) provide a socially acceptable way to ration scarce resources, (2) protect against undesirable selection bias, and (3) produce balance between the observed and unobserved variables that may affect the response. In addition to the randomization and blocking, NYSCSP implemented a PMPD (propensity matched pairs design) to choose a control group from a large candidate pool and to balance many covariates explicitly. Table 2 of BFHR shows that various background variables were balanced properly between the treatment group and the control group.

But might it have been possible to do a bit better? BFHR did not mention that the sample size of the applicant's school was not balanced, although winners from high and low schools were balanced between the treatments and the controls. An ideal design would have had equal winners from "bad" (or low) and "good" (or high) public schools. In the NYSCSP, 85% of the winners were from bad schools and 15% were from good schools, to achieve the social goals of the School Choice Scholarships Foundation (Hill et al. 2000, p. 158). Bad schools are defined as "those for which the average test scores were below the median test scores for the city" (Hill et al. 2000, p. 158). A direct consequence is that the estimates from the good schools have much larger variation than those from the bad schools, as is evident in BFHR's Tables 4-6: the confidence intervals for good schools are wider.

Instead of classifying schools into two classes (bad and good), it might have been helpful to classify them into four classes: very bad, bad, good, and very good. To estimate the effect of applicant's school (assuming linear effects), an optimal design would select half of the winners from the very bad schools and the other half from the very good schools, and none from the two middle classes. Avoiding the middle schools enlarges the distance between bad and good schools, reduces the source variation within the bad and good schools, and thus reduces the variation in the estimates. To be more consistent with the goals of the School Choice Scholarships Foundation, an alternative design might take 75% of winners from very bad schools, 10% from bad schools, 5% from good schools, and 10% from very good schools. This design would have a higher design efficiency than the original design, especially because the applicant's school was the most important design variable after family size (Hill et al. 2000, p. 160). Of course, people on the ground at the time may have considered these options and rejected them. Our point is that with better planning, it might have been possible to squeeze more efficiency out of the study design. Such an outcome would have made the job undertaken by BFHR a bit easier.
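This reasoning follows from the textbook formula for a simple linear regression slope, Var(slope) = sigma^2 / sum_i (x_i - xbar)^2: pushing design points toward the extremes maximizes the spread term in the denominator. A toy calculation, with a hypothetical linear coding 0-3 of the four school classes (the mapping of the actual 85/15 two-class split onto this scale is our own illustrative assumption), compares the allocations discussed above.

    import numpy as np

    def slope_var_factor(levels, shares, n=1000):
        """Relative variance of the estimated linear school effect,
        proportional to 1 / sum_i (x_i - xbar)^2 for n lottery winners."""
        counts = np.rint(np.asarray(shares) * n).astype(int)
        x = np.repeat(levels, counts)
        return 1.0 / np.sum((x - x.mean()) ** 2)

    levels = np.array([0, 1, 2, 3])  # very bad, bad, good, very good (toy coding)
    print(slope_var_factor(levels, [0, .85, .15, 0]))      # two inner classes, 85/15
    print(slope_var_factor(levels, [.5, 0, 0, .5]))        # endpoint-only: smallest variance
    print(slope_var_factor(levels, [.75, .10, .05, .10]))  # the 75/10/5/10 alternative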

5. CONCLUSIONS

Scientific problems often provide a fertile ground for important statistical developments; BFHR's article, and the literature on which it depends, are excellent illustrations. However, from the perspective of public policy, the goal of program evaluation is to obtain usefully precise estimates of program impacts while using the most robust means of analysis available. Simplicity is also important, both for the data analysts who must do the work and for the policy makers who must interpret the results. It follows that large investments at the front end of an evaluation (in research design, measurement, and study integrity) are essential. Virtuoso statistical analyses can certainly help when these investments are insufficient or when they fail, but they must not be seen as an inexpensive alternative to doing the study well to begin with.

ADDITIONAL REFERENCES

Angrist, J., Imbens, G. W., and Rubin, D. B. (1996), "Identification of Causal Effects Using Instrumental Variables" (with discussion), Journal of the American Statistical Association, 91, 441–472.

Berk, R. A. (1990), "What Your Mother Never Told You About Randomized Field Experiments," in Community-Based Care of People With AIDS: Developing a Research Agenda, AHCPR Conference Proceedings, Washington, DC: U.S. Department of Health and Human Services.

Berk, R. A., Ladd, H., Graziano, H., and Baek, J. (2003), "A Randomized Experiment Testing Inmate Classification Systems," Journal of Criminology and Public Policy, 2, 215–242.

Haveman, R. H. (ed.) (1977), A Decade of Federal Antipoverty Programs: Achievements, Failures, and Lessons, New York: Academic Press.

Hill, J. L., Rubin, D. B., and Thomas, N. (2000), "The Design of the New York School Choice Scholarships Program Evaluation," in Research Design: Donald Campbell's Legacy (Vol. II), ed. L. Bickman, Thousand Oaks, CA: Sage.

Johnson, S. B. (2000), "Compliance Behavior in Clinical Trials: Error or Opportunity?" in Promoting Adherence to Medical Treatment in Chronic Childhood Illness: Concepts, Methods, and Interventions, ed. D. Drotar, Mahwah, NJ: Lawrence Erlbaum.

Metry, J., and Meyer, U. A. (eds.) (1999), Drug Regimen Compliance: Issues in Clinical Trials and Patient Management, New York: Wiley.

Rejoinder

John BARNARD, Constantine E. FRANGAKIS, Jennifer L. HILL, and Donald B. RUBIN

We thank the editorial board and the thoughtful group of discussants that they arranged to comment on our article. The discussants have raised many salient issues and offered useful extensions of our work. We use this rejoinder primarily to address a few points of contention.

Muthén, Jo, and Brown

Muthén, Jo, and Brown (henceforth MJB) provide an enlightening discussion of an alternate and potentially complementary approach to treatment effect estimation. The growth models that they discuss are for data more extensive than what we analyzed, and so are beyond the scope of our article. Nevertheless, MJB propose an interesting idea for looking for interactions in longitudinal experimental data, and we look forward to future applications of the method to appropriate datasets.

Although models that incorporate both our compliance principal strata and latent growth trajectories are theoretically possible (assuming multiple years of outcome data), and could potentially yield an extremely rich description of the types of treatment effects manifested, the more time points that exist, the more complicated the compliance patterns can become (at least without further assumptions), and it may be difficult to find a dataset rich enough to support all of the interactions. Given a choice, we feel that it is probably more beneficial to handle real, observed issues than to search for interactions with latent trajectories, but certainly it would always be nice to be able to do both. Of course, such a search for interactions in trajectories is more beneficial when there is enough prior knowledge about the scientific background to include additional structural assumptions that can support robust estimation.

MJB make an interesting point about the exploration of potential violations of exclusion for never-takers. Such analyses have been done in past work by some of the authors (Hirano et al. 2000; Frangakis, Rubin, and Zhou 2002), but were not pursued for our study, because the robustness of such methods is still an issue in examples as complex as this one.

Regarding the trade-offs between the joint latent ignorability and compound exclusion assumptions versus the joint assumptions of standard ignorability and the standard exclusion restriction, this topic is up for debate. We would like to point out, however, that these standard structural assumptions come with their own level of arbitrariness. These issues have been explored by Jo (2002) and Mealli and Rubin (2002, 2003).

We thank MJB for providing a thought-provoking discussion of our model and assumptions, as well as ample fodder for additional directions that can be explored in this arena. Clearly, this general category of work has stimulated much new and exciting modeling.

Krueger and Zhu

Krueger and Zhu (henceforth KZ) focus primarily on the underlying study and on analyses that either we did not do or that used data we did not have. Clearly, they are interested in the substantive issues, which is admirable, but we cannot be responsible for analyses that we neither did nor were consulted on, and we are not prepared to comment on what analyses we might have undertaken with data that we did not have.

KZ state that "there is no substitute for probing the definition and concepts that underlie that data." Although we are not entirely sure what this refers to in particular, if it means that good science is more important than fancy techniques, then we certainly agree.

Another comment by KZ is that "it is desirable to use the most recent, comprehensive data for the widest sample possible." We agree that it is desirable to have access to the most comprehensive data available, but it is not necessary to use all of these data in any given analysis. Here we used only first-year data, because these were the only data available to us at the time that we were building the model. We used only single-child families, as explained in the article, because, regrettably, no compliance data were collected for individual children in the multichild families.

The exclusion of kindergarten children could be considered more of a judgment call regarding the quality of the pretest scores, not only for efficiency, but also for the reasonableness of our assumptions about noncompliance and missing data. Nothing in theory would prevent us from including these children and omitting these extra covariates.

These types of trade-offs (more data vs. more appropriate data) are common in statistical analyses, and the benefits are well known, for example, for avoiding model extrapolation or when some subsets of data do not mean the same thing as others. In fact, Krueger "ignored data" in his article on the minimum wage with David Card (Card and Krueger 1994) by limiting analyses to restaurants from a similar geographic region, in the hope that employees and employers across the two comparison states would be faced with similar economic conditions, rather than looking at all restaurants in these states. This is a form of implicit matching, which we believe represented a good choice. Surely there are times when focusing on subsets of data produces more reliable results.

Furthermore, we would like to point out that KZ's assertion that "because assignment to treatment status was random (within strata), a simple comparison of means between treatments and controls, without conditioning on baseline scores, provides an unbiased estimate of the average treatment effect" is simply false, because there are missing outcomes. Even if there is perfect compliance, a simple comparison of means across treatment groups when there are missing outcomes will not, in general, lead to an unbiased estimate of the treatment effect. Assumptions need to be made about the missing-data process.
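A small simulation with invented numbers illustrates the point: when follow-up response depends on both assignment and the (latent) outcome, the complete-case difference in means drifts away from the true ITT effect even under perfect compliance.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 200_000
    z = rng.integers(0, 2, n)                  # randomized voucher offer
    y = 50 + 2 * z + rng.normal(0, 10, n)      # true ITT effect: 2 percentile points
    # Follow-up testing depends on assignment AND on the latent score itself
    # (e.g., low scorers in the control group are the least likely to return).
    p_respond = np.clip(0.5 + 0.2 * z + 0.01 * (y - 50), 0, 1)
    observed = rng.random(n) < p_respond
    naive = y[observed & (z == 1)].mean() - y[observed & (z == 0)].mean()
    print(naive)  # noticeably below 2 in this setup, despite randomization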

When noncompliance exists in addition to the missing outcome data, Frangakis and Rubin (1999) have shown that estimation even of the ITT effect requires additional assumptions. Moreover, the ITT estimate, generally considered a conservative estimate of the treatment effect, can actually be anticonservative in this scenario. Finally, contrary to what KZ seem to imply in their final paragraph, instrumental variables methods alone are not sufficient to address the problems of both noncompliance and missing data.

KZ advocate "simplicity and transparency." We wholeheartedly agree, but disagree with the examples they use to defend their stance. Their first argument is that the PMPD did not yield strong efficiency gains and that we did not account for the PMPD in the calculation of standard errors. It may well be that precision gains were relatively modest; however, the analyses that they present are by no means conclusive in this regard. First, as with KZ's other ITT analyses, we have no clue as to how missing outcome data are handled. In addition, the PMPD was also used to balance compliance and missingness. It is unclear, however, how the gains in precision due to matching the missingness and noncompliance are displayed, if at all, in KZ's results. Finally, KZ did not account for the experimental design when estimating their results. Failure to account properly for the design typically results in estimates that overstate the variances of treatment effect estimates. In this design, it is the correlation between matched pair units that leads the sampling variances to be smaller than their estimates.

Even if precision gains were truly modest, they of course were unknown to us at the time we debated the use of the PMPD design. KZ's argument is like the complaint of a cross-country motorist who experiences no car problems during his long drive and then bemoans the money wasted on a spare tire purchased before the trip began. We prefer to avoid potential problems by using careful design.

Importantly, KZ's criticism that we ignored this matching in the analysis is incorrect. The matching was done based on the estimated propensity score, so it follows from Rubin (1978) that assignment is ignorable conditionally, given the covariates that were used to estimate it. Thus the Bayesian analysis that we conduct, which conditions on the estimated propensity score, is an approximately valid Bayesian analysis with respect to the matched assignment, assuming that our modeled relationship between the outcomes and the estimated propensity score captures the major features of the relationship to the covariates.

KZ's second argument is that use of the PMPD "led to grave consequences" because sampling weights were miscalculated by MPR. It seems odd to us to maintain, as KZ do, that the PMPD "caused" this error; nor is it clear that use of another design would have avoided the error. In any case, we have rerun our analyses, this time generalizing to the population of which the children in our analyses are representative (by setting constant weights in step 3 of Sec. 8.1.1). For this population, all results are very similar to our original results. These results are posted as Appendix C at http://biosun01.biostat.jhsph.edu/~cfrangak/papers/sc.

A third argument revolves around the protracted debate between the use of ITT versus treatment-targeted measures such as the CACE. We addressed both ITT and CACE, which allows the reader to focus on one or the other or both, depending on which effect is expected to generalize to another environment: the ITT, or "sociological," effect of accepting or not accepting an offered voucher, or the CACE, or "structural" or "scientific," effect of using a voucher. One could argue that the effect of attending (CACE) is the more generalizable, whereas the sociological effect (ITT) is more easily influenced, for instance, by "hype" about how well the program works. That is, the ITT estimate will not reflect the "effect of offering" if, the next time the program is offered, the compliance rates change (which seems likely). We prefer to examine both, to provide a more comprehensive picture.

In sum, we agree that simplicity and transparency are important; this is the major motivation for our detailed description of our assumptions in terms of readily understandable concepts, rather than commonly invoked and less intuitive conditions involving correlated error terms. However, this desire for simplicity must be balanced with the prudence of guarding against potential problems through careful design.

Although we agree in theory with much of what KZ espouse, some of their criticisms seem misguided. Moreover, their presentation of analyses with incomplete exposition regarding missing-data complications makes these analyses difficult for us to evaluate properly.


Berk and Xu

Berk and Xu (henceforth BX) raise several interesting questions about the way in which the evaluation was designed and implemented. Many of these are actually addressed in a preceding and lengthier article by us on this topic (Barnard et al. 2002). That article describes, for instance, MPR's efforts to reduce missing data, without which the missingness rates would surely have been much higher. It also describes special challenges faced in this study. But in any study with human subjects, it is difficult to completely eliminate missing data, particularly if one desires more interesting information than can be obtained from administrative data. One of the challenges for the collection of outcome test scores was the fact that children needed to be tested outside of the public school setting, which allowed the use of a standardized test not used by the public school system, thereby greatly reducing the possibility of bias resulting from teachers "teaching to the test." Consequently, parents had to bring their children to testing areas on weekends, a formidable hurdle for some, even when monetary incentives were used.

An important consideration with this study was that the evaluators took the initiative to investigate a program that was sponsored by an organization (SCSF) whose primary goal was to provide scholarships, not to do research. We feel that MPR did an admirable job convincing the SCSF to allow the study to be evaluated, and raising money for this purpose, but this also meant that the evaluators could not control every aspect of the program. For example, the SCSF originally wanted to make the program available only to children from the lower-test-score schools, and the evaluators had to convince them to also allow a small percentage from those in higher-test-score schools, to increase the generalizability of the results. The designs that BX suggest, even though they have certain desirable statistical properties under appropriate assumptions, were not an option with this study.

Related to this issue of the alignment of the evaluators' and program administrators' goals is the issue of the potential for reducing noncompliance rates. Some effort was made, in that help was provided to scholarship winners in finding an appropriate private school. Arguably, however, 100% compliance should not be the goal in such a study, given that we would not expect 100% compliance if the program were expanded and offered, say, by the government on a larger scale. Thus it may not be wise to use extreme efforts to increase compliance if such efforts would not be offered in the large-scale public case. With the current study, the opportunity exists to examine the noncompliance process for this program in this setting: take-up rates, predictors of noncompliance, and so on. To this end, some sort of subexperiment that randomizes a reasonable set of compliance incentives might be useful in similar studies in the future, if sample sizes are sufficient.

We are not convinced that we should have solicited actual prior distributions from educators and economists. In our experience, this is a confusing, difficult, and often unrewarding task. Moreover, school choice is such a politically charged and contentious issue that it seems wiser to use relatively diffuse and objective prior distributions.

We agree that results should be interpreted in terms of a yardstick of practical effects. We are unaware of the most appropriate such yardstick to use in this context. One side could claim that even tiny average effects matter, not only because of the cumulative effects on parts of society, but also because of possible nonadditivity of such effects, which could imply huge effects for some rare individuals and minute effects for all others. Others could argue that seemingly quite large effects have no long-run practical impact on quality of life. The results that we present provide a basic empirical starting point for these debates between people whose expertise is in this area.

We thank BX for highlighting our emphasis on the role of the assumptions in our analyses. Our hope is that we have presented the assumptions in such a way as to enable readers to make their own judgment about whether or not they are likely to be satisfied, rather than only stating them technically, which would implicitly make those important decisions for the interested but nontechnical readers.

BX point out a dearth of certain types of covariates. Part of the problem here is that we presented only a small portion of the data collected, because our model in its current form did not handle many covariates. However, there were indeed discussions between some members of our team and MPR staff, during the development of the questionnaire, about what types of variables could be included that would be predictive of missing data or compliance, as well as of outcomes. Some of the variables proposed were discarded because of the surveyors' experience with them as subject to problems such as high measurement error. Also, not every variable discussed was included, because of a desire for a survey that would not be overly burdensome for the study families.

We raise a final, small, semantic question: we wonder about the propriety of the label "Neyman-Rubin model" to describe a framework for causal inference that views observational studies and randomized experiments on the same continuum and, moreover, encourages Bayesian analyses of both. Neyman (1923) certainly seems to have been the first to use the formal notation of potential outcomes for randomization-based inference in randomized experiments, which was a truly major advance, but neither he nor anyone else, to the best of our knowledge, used this formal notation to define causal effects in observational studies until Rubin (1974, 1975), nor did he nor anyone else ever advocate Bayesian inference for the unobserved potential outcomes until Rubin (1975, 1978), which is the perspective that we have implemented here.

Overall, we agree with BX about the paramount importance of study design, and we tried to design this study carefully, certainly more so than most social science analyses of secondary datasets like the PSID or CPS. We thank BX for their thoughtful commentary; they raise some important points.

We thank all of the discussants for their stimulating comments, which we believe have added emphasis to the importance of both the substantive topic of school vouchers and the development of appropriate methods to evaluate this and other such broken randomized experiments.

ADDITIONAL REFERENCES

Card, D., and Krueger, A. (1994), "Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania," American Economic Review, 84, 772–793.

Jo, B. (2002), "Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications" (with discussion), Journal of Educational and Behavioral Statistics, 27, 385–420.

Mealli, F., and Rubin, D. B. (2002), Discussion of "Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications," by B. Jo, Journal of Educational and Behavioral Statistics, 27, 411–415.

Mealli, F., and Rubin, D. B. (2003), "Assumptions Allowing the Estimation of Direct Causal Effects," Discussion of "Healthy, Wealthy, and Wise? Tests for Direct Causal Paths Between Health and Socioeconomic Status," by Adams et al., Journal of Econometrics, 112, 79–87.

Rubin, D. B. (1975), "Bayesian Inference for Causality: The Importance of Randomization," in Proceedings of the Social Statistics Section, American Statistical Association, pp. 233–239.


where R_{y,i}(0) and R_{y,i}(1) are assumed conditionally independent (using the same justification as for the potential outcomes), and where E_{iz} ~ N(0, 1). The link function of the probit model on the outcome response, g_2, is the same as the link function for the mean of the outcome component. The prior distribution for the coefficients of the outcome response submodel is N(0, (sigma_R)^2 I), where (sigma_R)^2 is a hyperparameter set at 10.

8. RESULTS

All results herein were obtained from the same Bayesian analysis. We report results for the ITT and CACE estimands, for proportions of compliance principal strata, and for outcome response rates. The reported estimands are not parameters of the model, but rather are functions of parameters and data. The results are reported by applicant's school (low/high) and grade. Both of these variables represent characteristics of children that potentially could be targeted differentially by government policies. Moreover, each was thought to have possible interaction effects with treatment assignment. Except when otherwise stated, the plain numbers are posterior means, and the numbers in parentheses are 2.5 and 97.5 percentiles of the posterior distribution.

8.1 Test Score Results

Here we address two questions:

1. What is the impact of being offered a scholarship on student outcomes, namely the ITT estimand?

2. What is the impact of attending a private school on student outcomes, namely the CACE estimand?

The math and reading posttest score outcomes represent the national percentile rankings within grade. They have been adjusted to correct for the fact that some children were kept behind while others skipped a grade, because students transferring to private schools are hypothesized to be more likely to have been kept behind by those schools. The individual-level causal estimates have also been weighted, so that the subgroup causal estimates correspond to the effects for all eligible children belonging to that subgroup who attended a screening session.

8.1.1 Effect of Offering the Scholarship on Mathematics and Reading. We examine the impact of being offered a scholarship (the ITT effect) on posttest scores. The corresponding estimand for individual i is defined as

E(Y_i(1) - Y_i(0) | W_i^p, theta),

where W_i^p denotes the policy variables, grade level and applicant's school (low/high), for that individual. The simulated posterior distribution of the ITT effect is summarized in Table 4. To draw from this distribution, we take the following steps:

Table 4. ITT Effect of Winning the Lottery on Math and Reading Test Scores, Year 1 Postrandomization

                   Applicant's school: Low            Applicant's school: High
    Grade at
    application    Reading          Math              Reading          Math
    1           2.3 (-1.3, 5.8)  5.2 (2.0, 8.3)    1.4 (-4.8, 7.2)  5.1 (.1, 10.3)
    2            .5 (-2.6, 3.5)  1.3 (-1.7, 4.3)   -.6 (-6.2, 4.9)  1.0 (-4.3, 6.1)
    3            .7 (-2.7, 4.0)  3.3 (-.5, 7.0)    -.5 (-6.0, 5.0)  2.5 (-3.2, 8.0)
    4           3.0 (-1.1, 7.4)  3.1 (-1.2, 7.2)   1.8 (-4.1, 7.6)  2.3 (-3.3, 7.8)
    Overall     1.5 (-.6, 3.6)   3.2 (1.0, 5.4)     .4 (-4.6, 5.2)  2.8 (-1.8, 7.2)

NOTE: Plain numbers are means, and numbers in parentheses are central 95% intervals, of the posterior distribution of the effects on percentile rank.

1. Draw theta and C_i from the posterior distribution (see App. A).

2. Calculate the expected causal effect E(Y_i(1) - Y_i(0) | W_i^p, X_i^obs, C_i, R_{x,i}, theta) for each subject, based on the model in Section 7.2.

3. Average the values of step 2 over the subjects in the subclasses of W^p, with weights reflecting the known sampling weights of the study population for the target population.
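A minimal sketch of steps 2 and 3 in code, assuming the MCMC draws from step 1 are already in hand, follows; the array names and shapes are hypothetical, not our actual implementation.

    import numpy as np

    def itt_by_subgroup(effect_draws, subgroup, weights):
        """effect_draws: (S, N) array; row s holds the step-2 expected causal
        effects E(Y_i(1) - Y_i(0) | ...) for all N subjects at posterior draw s.
        subgroup: length-N labels for the cells of W^p (grade x school type).
        weights: known sampling weights for the target population (step 3)."""
        summaries = {}
        for cell in np.unique(subgroup):
            m = subgroup == cell
            # Weighted average over subjects, separately for each draw.
            per_draw = effect_draws[:, m] @ weights[m] / weights[m].sum()
            summaries[cell] = (per_draw.mean(),
                               np.percentile(per_draw, [2.5, 97.5]))
        return summaries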

These results indicate posterior distributions with mass primarily (>97.5%) to the right of 0 for the treatment effect on math scores overall for children from low applicant schools, and also for the subgroup of first graders. Each effect indicates an average gain of greater than three percentile points for children who won a scholarship. All of the remaining intervals cover 0. As a more general pattern, estimates of effects are larger for mathematics than for reading, and larger for children from low-applicant schools than for children from high-applicant schools.

These results for the ITT estimand using the method of this article can also be contrasted with the results using a simpler method, reported in Table 5. The method used to obtain these results is the same as that used in the initial MPR study (Peterson, Myers, Howell, and Mayer 1999), but limited to the subset of single-child families and separated out by applicant's school (low/high). This method is based on a linear regression of the posttest scores on the indicator of treatment assignment, ignoring compliance data. In addition, the analysis includes the design variables and the pretest scores, and is limited to individuals for whom the pretest and posttest scores are known. Separate analyses are run for math and reading posttest scores, and the missingness of such scores is adjusted for by inverse probability weighting. Finally, weights are used to make the results for the study participants generalizable to the population of all eligible single-child families who were screened. (For more details of this method, see the appendix of Peterson et al. 1999.)
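In outline, such an estimator can be sketched as weighted least squares on the complete cases, with weight equal to the survey weight divided by an estimated response probability. This is only a generic sketch; the response-model covariates X_resp and the other inputs are placeholders, not MPR's actual specification.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def ipw_ols(y, X, observed, X_resp, survey_w):
        """OLS on complete cases, weighted by survey weight divided by the
        estimated probability that the posttest score is observed."""
        p = (LogisticRegression(max_iter=1000)
             .fit(X_resp, observed).predict_proba(X_resp)[:, 1])
        w = survey_w[observed] / p[observed]
        sw = np.sqrt(w)[:, None]          # weighted LS via rescaling
        beta = np.linalg.lstsq(X[observed] * sw, y[observed] * sw.ravel(),
                               rcond=None)[0]
        return beta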

Table 5. ITT Effect of Winning the Lottery on Test Scores, Estimated Using the Original MPR Method

                   Applicant's school: Low             Applicant's school: High
    Grade at
    application    Reading          Math               Reading             Math
    1          -1.0 (-7.1, 5.2)  2.1 (-2.6, 6.7)    4.8 (-10.0, 19.6)   2.6 (-15.5, 20.7)
    2           -.8 (-4.9, 3.2)  2.0 (-4.0, 8.0)   -3.4 (-16.5, 9.7)   2.7 (-10.3, 15.7)
    3           3.2 (-1.7, 8.1)  5.0 (-.8, 10.7)   -8.0 (-25.4, 9.3)   4.0 (-17.7, 25.6)
    4           2.7 (-3.5, 8.8)   .3 (-7.3, 7.9)   27.9 (8.0, 47.8)   22.7 (-1.5, 46.8)
    Overall      .6 (-2.0, 3.2)  2.0 (-.8, 4.8)     1.1 (-7.0, 9.1)     .3 (-9.6, 10.1)

NOTE: Plain numbers are point estimates, and numbers in parentheses are 95% confidence intervals, for the mean effects on percentile rank.


Table 6. Effect of Private School Attendance on Test Scores

                   Applicant's school: Low             Applicant's school: High
    Grade at
    application    Reading           Math              Reading           Math
    1           3.4 (-2.0, 8.7)   7.7 (3.0, 12.4)   1.9 (-7.3, 10.3)  7.4 (.2, 14.6)
    2            .7 (-3.7, 5.0)   1.9 (-2.4, 6.2)   -.9 (-9.4, 7.3)   1.5 (-6.2, 9.3)
    3           1.0 (-4.1, 6.1)   5.0 (-.8, 10.7)   -.8 (-9.5, 7.7)   4.0 (-4.9, 12.5)
    4           4.2 (-1.5, 10.1)  4.3 (-1.6, 10.1)  2.7 (-6.3, 11.3)  3.5 (-4.7, 11.9)
    Overall     2.2 (-.9, 5.3)    4.7 (1.4, 7.9)     .6 (-7.1, 7.7)   4.2 (-2.6, 10.9)

NOTE: Plain numbers are means, and numbers in parentheses are central 95% intervals, of the posterior distribution of the effects on percentile rank.

The results of the new method (Table 4) are generally more stable than those of the original method (Table 5), which in some cases are not even credible. In the most extreme cases, the original method estimates a 22.7-point gain [95% confidence interval (-1.5, 46.8)] in mathematics and a 27.9-point gain [95% confidence interval (8.0, 47.8)] in reading for fourth-grade children from high-applicant schools. More generally, the new method's results display a more plausible pattern in comparing effects in high-applicant versus low-applicant schools and mathematics versus reading.

8.1.2 Effect of Private School Attendance on Mathematics and Reading. We also examine the effect of offering the scholarship when focusing only on the compliers (the CACE). The corresponding estimand for individual i is defined as

E(Y_i(1) - Y_i(0) | W_i^p, C_i = c, theta).

This analysis defines the treatment as private school attendance (Sec. 6.3). The simulated posterior distribution of the CACE is summarized in Table 6. A draw from this distribution is obtained using steps 1-3 described in Section 8.1.1 for the ITT estimand, with the exception that at step 3, the averaging is restricted to the subjects whose current draw of C_i is "complier."

The effects of private school attendance follow a pattern similar to that of the ITT effects, but the posterior means are slightly bigger in absolute value than the ITT. The intervals have also grown, reflecting that these effects are for only subgroups of all children, the "compliers," in each cell. As a result, the associated uncertainty for some of these effects (e.g., for fourth-graders applying from high-applicant schools) is large.

8.2 Compliance Principal Strata and Missing Outcomes

Table 7 summarizes the posterior distribution of the estimands of the probability of being in stratum C as a function of an applicant's school and grade, p(C_i = t | W_i^p, theta). To draw from this distribution, we use step 1 described in Section 8.1.1, then calculate p(C_i = t | W_i^p, X_i^obs, R_{x,i}, theta) for each subject, based on the model of Section 7.1, and average these values as in step 3 of Section 8.1.1. The clearest pattern revealed by Table 7 is that, for three out of four grades, children who applied from low-applicant schools are more likely to be compliers or always takers than are children who applied from high-applicant schools.

As stated earlier, theoretically, under our structural assumptions, standard ITT analyses or standard IV analyses that use ad hoc approaches to missing data are generally not appropriate for estimating the ITT or CACE estimands when the compliance principal strata have differential response (i.e., outcome missing-data) behaviors. To evaluate this here, we simulated the posterior distributions of

p(R_i(z) | C_i = t, W_i^p, theta),  z = 0, 1.    (1)

To draw from the distributions of the estimands in (1), we used step 1 from Section 8.1.1, and then, for z = 0, 1, for each subject, calculated p(R_i(z) | W_i^p, X_i^obs, R_{x,i}, C_i, theta). We then averaged these values over subjects within subclasses defined by the different combinations of W_i^p and C_i.

For each draw (across individuals) from the distribution in (1), we calculate the odds ratios (a) between compliers attending public schools and never takers, (b) between compliers attending private schools and always takers, and (c) between compliers attending private schools and compliers attending public schools. The results (omitted) showed that response was increasing in the following order: never takers, compliers attending public schools, compliers attending private schools, and always takers. These results confirm that the compliance principal stratum is an important factor in response, both alone and in interaction with assignment.

9. MODEL BUILDING AND CHECKING

This model was built through a process of fitting and checking a succession of models. Of note in this model are the following features: a censored normal model that accommodates the pile-up of 0s in the outcomes, incorporation of heteroscedasticity, and a multivariate model to accommodate both outcomes simultaneously.

Table 7. Proportions (%) of Compliance Principal Strata Across Grade and Applicant's School

                        Applicant's school: Low                  Applicant's school: High
    Grade at
    application   Never taker   Complier    Always taker   Never taker   Complier    Always taker
    1             24.5 (2.9)   67.1 (3.8)     8.4 (2.4)    25.0 (5.0)   69.3 (6.1)     5.7 (3.3)
    2             20.5 (2.7)   69.4 (3.7)    10.1 (2.5)    25.3 (5.1)   67.2 (6.2)     7.5 (3.4)
    3             24.5 (3.2)   65.9 (4.0)     9.6 (2.5)    28.8 (5.6)   64.1 (6.7)     7.1 (3.5)
    4             18.4 (3.3)   72.8 (4.6)     8.8 (3.0)    27.0 (5.5)   66.7 (6.7)     6.3 (3.6)

NOTE: Plain numbers are means, and numbers in parentheses are standard deviations, of the posterior distribution of the estimands.


It is reassuring, from the perspective of model robustness, that the results from the last few models fit, including those in the conference proceedings report of Barnard et al. (2002), are consistent with the results from this final model.

9.1 Convergence Checks

Because posterior distributions were simulated from a Markov chain Monte Carlo algorithm (App. A), it is important to assess its convergence. To do this, we ran three chains from different starting values. To initiate each chain, we set any unknown compliance stratum equal to a Bernoulli draw, with probabilities obtained from moment estimates of the probabilities of being a complier given observed attendance and randomization data (D, Z) alone. Using the initialized compliance strata for each chain, parameters were initialized to values based on generalized linear model estimates of the model components.

Each chain was run for 15,000 iterations. At 5,000 iterations, and based on the three chains for each model, we calculated the potential scale-reduction statistic (Gelman and Rubin 1992) for the 250 estimands (parameters and functions of parameters) that serve as building blocks for all other estimands. The results suggested good mixing of the chains (with the maximum potential scale-reduction statistic across parameters 1.04) and provided no evidence against convergence. Inference is based on the remaining 30,000 iterations, combining the three chains.
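For reference, a minimal version of the statistic for a single scalar estimand, computed from several parallel chains, is sketched below (refinements such as Gelman and Rubin's degrees-of-freedom correction and chain splitting are omitted).

    import numpy as np

    def potential_scale_reduction(chains):
        """Gelman-Rubin statistic for one scalar estimand.
        chains: (m, n) array of m parallel chains, n retained draws each."""
        m, n = chains.shape
        B = n * chains.mean(axis=1).var(ddof=1)   # between-chain variance
        W = chains.var(axis=1, ddof=1).mean()     # within-chain variance
        # Values near 1 suggest the chains have mixed well.
        return np.sqrt(((n - 1) / n * W + B / n) / W)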

9.2 Model Checks

We evaluate the influence of the model presented in Section 7 with six posterior predictive checks: three checks for each of the two outcomes. A posterior predictive check generally involves (a) choosing a discrepancy measure that is a function of observed data, and possibly of missing data and the parameter vector θ, and (b) computing a posterior predictive p value (PPPV), which is the probability, over the posterior predictive distribution of the missing data and θ, that the discrepancy measure in a new study, drawn with the same θ as in our study, would be as or more extreme than in our study (Rubin 1984; Meng 1996; Gelman, Meng, and Stern 1996).

Posterior predictive checks in general, and PPPVs in particular, demonstrate whether the model can adequately preserve features of the data reflected in the discrepancy measure, where the model here includes the prior distribution as well as the likelihood (Meng 1996). As a result, properties of PPPVs are not exactly the same as properties of classical p values under frequency evaluations conditional on the unknown θ, just as they are not exactly the same for frequency evaluations over both levels of uncertainty, that is, the drawing of θ from the prior distribution and the drawing of data from the likelihood given θ. For example, over frequency evaluations of the latter type, a PPPV is stochastically less variable than, but has the same mean as, the uniform distribution, and so tends to be more conservative than a classical p value, although the reverse can be true over frequency evaluations of the first type (Meng 1996). (For more details on the interpretation and properties of PPPVs, see also Rubin 1984 and Gelman et al. 1996.)

Table 8. Posterior Predictive Checks: p Values

        Signal   Noise   Signal to Noise
Math     .34      .79         .32
Read     .40      .88         .39

The posterior predictive discrepancy measures that we choose here are functions of

$$A^{\mathrm{rep}}_p(z) = \bigl\{\, Y^{\mathrm{rep}}_{ip} : I(R^{\mathrm{rep}}_{y,ip} = 1)\, I(Y^{\mathrm{rep}}_{ip} \neq 0)\, I(C^{\mathrm{rep}}_i = c)\, I(Z_i = z) = 1 \,\bigr\}$$

for the measures that are functions of data $Y^{\mathrm{rep}}_{ip}$, $R^{\mathrm{rep}}_{y,ip}$, and $C^{\mathrm{rep}}_i$ from a replicated study, and

$$A^{\mathrm{study}}_p(z) = \bigl\{\, Y_{ip} : I(R_{y,ip} = 1)\, I(Y_{ip} \neq 0)\, I(C_i = c)\, I(Z_i = z) = 1 \,\bigr\}$$

for the measures that are functions of this study's data. Here $R^{\mathrm{rep}}_{y,ip}$ is defined so that it equals 1 if $Y^{\mathrm{rep}}_{ip}$ is observed and 0 otherwise, and p equals 1 for math outcomes and 2 for reading outcomes. The discrepancy measures, "rep" and "study," that we used for each outcome (p = 1, 2) were (a) the absolute value of the difference between the mean of $A_p(1)$ and the mean of $A_p(0)$ ("signal"); (b) the standard error, based on a simple two-sample comparison, for this difference ("noise"); and (c) the ratio of (a) to (b) ("signal to noise"). Although these measures are not treatment effects, we chose them here to assess whether the model can preserve broad features of signal, noise, and signal-to-noise ratio in the continuous part of the compliers' outcome distributions, which we think can be very influential in estimating the treatment effects of Section 8. More preferable measures might have been the posterior mean and standard deviation for the actual estimands in Section 8 for each replicated dataset, but this would have required a prohibitive amount of computer memory because of the nested structure of that algorithm. In settings such as these, additional future work on choices of discrepancy measures is of interest.

PPPVs for the discrepancy measures that we chose were calculated as the percentage of draws in which the replicated discrepancy measures exceeded the value of the study's discrepancy measure. Extreme values (close to 0 or 1) of a PPPV would indicate a failure of the prior distribution and likelihood to replicate the corresponding measure of location, dispersion, or their relative magnitude, and would indicate an undesirable influence of the model in estimation of our estimands. Results from these checks, displayed in Table 8, provide no special evidence for such influences of the model.
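A minimal sketch of this calculation follows, with hypothetical array names standing in for the model's replicated draws; the discrepancy functions implement the signal, noise, and signal-to-noise measures defined above:

```python
import numpy as np

def discrepancies(y, z):
    """Signal, noise, and signal-to-noise for one outcome: y holds the observed,
    nonzero scores of compliers, z the matching treatment assignments."""
    y1, y0 = y[z == 1], y[z == 0]
    signal = abs(y1.mean() - y0.mean())
    noise = np.sqrt(y1.var(ddof=1) / len(y1) + y0.var(ddof=1) / len(y0))
    return signal, noise, signal / noise

def pppv(rep_values, study_value):
    """Fraction of draws in which the replicated discrepancy exceeds
    the study's discrepancy (computed once from the real data)."""
    return (np.asarray(rep_values) > study_value).mean()

# Hypothetical usage: for each retained MCMC draw, generate a replicated
# dataset (y_rep, z) from the posterior predictive distribution, collect
# its discrepancies, and compare with the study's values, e.g.:
#   study_signal, study_noise, study_ratio = discrepancies(y_study, z_study)
#   print(pppv(rep_signals, study_signal))
```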

10. DISCUSSION

In this article, we have defined the framework for principal stratification in broken randomized experiments to accommodate noncompliance, missing covariate information, missing outcome data, and multivariate outcomes. We make explicit a set of structural assumptions that can identify the causal effects of interest, and we also provide a parametric model that is appropriate for practical implementation of the framework in settings such as ours.

Results from our model in the school choice study do not indicate strong treatment effects for most of the subgroups examined. But we do estimate positive effects (on the order of 3 percentile points for ITT and 5 percentile points for the effect of attendance) on math scores overall for children who applied to the program from low-applicant schools, particularly for first graders. Also, the effects were larger for African-American children than for the remaining children (App. B).

Posterior distributions for the CACE estimand, which measures the effect of attendance in private school versus public school, are generally larger than the corresponding ITT effects, but are also associated with greater uncertainty. Of importance, because of the missing outcomes, a model like ours is needed for valid estimation even of the ITT effect.

The results from this randomized study are not subject to selection bias in the way that nonrandomized studies in school choice have been. Nevertheless, although we use the CACE, a well-defined causal effect, to represent the effect of attendance of private versus public schools, it is important to remember that the CACE is defined on a subset of the study children (those who would have complied with either assignment), and that for the other children there is no information on such an effect of attendance in this study. Therefore, as with any randomized trial based on a subpopulation, external information, such as background variables, also must be used when generalizing the CACE from compliers to other target groups of children. Also, it is possible that a broader implementation of a voucher program can have a collective effect on the public schools if a large shift of the children who might use vouchers were to have an impact on the quality of learning for children who would stay in public schools (a violation of the no-interference assumption). Because our study contains a small fraction of participants relative to all school children, it cannot provide direct information about any such collective effect, and additional external judgment would need to be used to address this issue.

The larger effects that we estimated for children applying from schools with low versus high past scores are also not, in principle, subject to the usual regression to the mean bias, in contrast to a simple before-after comparison. This is because, in our study, both types of children are randomized to be offered the scholarship or not, and both types in both treatment arms are evaluated at the same time after randomization. Instead, the differential effect for children from schools with different past scores is evidence supporting the claim that the school voucher program has greater potential benefit for children in lower-scoring schools.

Our results also reveal differences in compliance and missing-data patterns across groups. Differences in compliance indicate that children applying from low-applicant schools are generally more likely to comply with their treatment assignment; this could provide incentives for policy makers to target this subgroup of children. However, this group also exhibits higher levels of always takers, indicating that parents of children attending these poorly performing schools are more likely to get their child into a private school even in the absence of scholarship availability. These considerations would, of course, have to be balanced with (and indeed might be dwarfed by) concerns about equity. Missing-data patterns reveal that perhaps relatively greater effort is needed in future interventions of this nature to retain study participants who stay in public schools, particularly the public-school students who actually won but did not use a scholarship.

The approach presented here also has some limitations. First, we have presented principal stratification only on a binary controlled factor Z and a binary uncontrolled factor D, and it is of interest to develop principal stratification for more levels of such factors. Approaches that extend our framework in that direction, including for time-dependent data, have been proposed (e.g., Frangakis et al. 2002b).

Also, the approach of explicitly conditioning on the patterns of missing covariates that we adopted in the parametric component of Section 7 is not as applicable when there are many patterns of missing covariates. In such cases, it would be more appropriate to use models related to those of D'Agostino and Rubin (2000) and integrate them with principal stratification for noncompliance. Moreover, it would be also interesting to investigate results for models fit to these data that allow deviations from the structural assumptions, such as weakened exclusion restrictions (e.g., Hirano et al. 2000; Frangakis, Rubin, and Zhou 2002a), although to ensure robustness of such models, it would be important to first investigate and rely on additional alternative assumptions that would be plausible.

APPENDIX A: COMPUTATIONS

Details are available at http://biosun01.biostat.jhsph.edu/~cfrangak/papers/sc.

Table B1. ITT Estimand

                             Applicant school: Low                    Applicant school: High
Grade at
application   Ethnicity   Reading            Math               Reading            Math
1             AA          2.8 (−1.5, 7.2)    6.7 (3.0, 10.4)    1.8 (−5.2, 8.4)    6.3 (0.8, 11.9)
              Other       1.7 (−1.9, 5.4)    3.9 (0.5, 7.2)     0.8 (−4.5, 6.1)    3.5 (−1.2, 8.2)
2             AA          0.8 (−2.8, 4.4)    2.4 (−1.0, 5.8)    −0.3 (−6.6, 6.0)   2.2 (−3.8, 8.0)
              Other       0.2 (−3.1, 3.4)    0.5 (−2.8, 3.7)    −0.9 (−6.1, 4.3)   0.0 (−5.2, 4.8)
3             AA          1.1 (−3.0, 5.2)    4.6 (0.3, 8.9)     −0.1 (−6.7, 6.4)   4.2 (−2.5, 10.6)
              Other       0.3 (−3.2, 3.8)    2.0 (−2.0, 5.8)    −0.7 (−5.8, 4.4)   1.4 (−3.9, 6.5)
4             AA          3.7 (−1.2, 8.7)    4.4 (−0.5, 9.3)    2.4 (−4.7, 9.4)    3.8 (−2.9, 10.5)
              Other       2.4 (−1.7, 6.7)    1.8 (−2.6, 6.0)    1.4 (−4.3, 7.0)    1.1 (−4.4, 6.4)
Overall       AA          2.0 (−0.9, 4.8)    4.5 (1.8, 7.2)     0.8 (−5.1, 6.5)    4.2 (−1.1, 9.3)
              Other       1.1 (−1.4, 3.6)    2.1 (−0.7, 4.7)    0.1 (−4.6, 4.6)    1.6 (−2.9, 5.8)

NOTE: Plain numbers are means and parentheses are central 95% intervals of the posterior distribution of the effects on percentile rank.


Table B2. CACE Estimand

                             Applicant school: Low                    Applicant school: High
Grade at
application   Ethnicity   Reading            Math               Reading             Math
1             AA          3.8 (−2.0, 9.6)    9.0 (4.1, 14.0)    2.3 (−7.3, 10.9)    8.3 (1.1, 15.5)
              Other       2.9 (−3.1, 8.8)    6.3 (0.8, 11.8)    1.3 (−8.0, 10.2)    5.9 (−2.2, 13.9)
2             AA          1.1 (−3.6, 5.7)    3.1 (−1.3, 7.5)    −0.4 (−9.0, 7.9)    2.9 (−5.0, 10.7)
              Other       0.3 (−4.9, 5.4)    0.8 (−4.4, 5.9)    −1.5 (−10.5, 7.3)   0.0 (−8.5, 8.5)
3             AA          1.5 (−4.1, 7.0)    6.3 (0.3, 12.2)    −0.2 (−9.1, 8.5)    5.6 (−3.2, 14.1)
              Other       0.5 (−5.4, 6.4)    3.5 (−3.4, 10.2)   −1.3 (−10.5, 7.9)   2.7 (−7.0, 12.0)
4             AA          4.6 (−1.6, 10.8)   5.5 (−0.7, 11.6)   3.0 (−6.0, 11.6)    4.8 (−3.6, 13.1)
              Other       3.7 (−2.6, 10.1)   2.8 (−3.8, 9.3)    2.3 (−7.4, 11.7)    2.0 (−7.0, 11.0)
Overall       AA          2.6 (−1.2, 6.3)    6.0 (2.4, 9.5)     1.0 (−6.9, 8.4)     5.5 (−1.5, 12.2)
              Other       1.7 (−2.3, 5.8)    3.3 (−1.2, 7.7)    0.1 (−8.1, 7.9)     2.7 (−5.0, 10.3)

NOTE: Plain numbers are means and parentheses are central 95% intervals of the posterior distribution of the effects on percentile rank.

APPENDIX B: ETHNIC BREAKDOWN

The model of Section 7 can be used to estimate the effect of the program on finer strata that may be of interest. For example, to estimate effects stratified by ethnicity (AA for African-American), we obtain the posterior distribution of the causal effect of interest (ITT or CACE) using the same steps described in Section 8.1.1 or 8.1.2, but allowing ethnicity to be part of the definition of W^p. The results for this stratification are reported in Table B1 for the estimands of ITT and in Table B2 for the estimands of CACE.

The results follow similar patterns to those in Section 8. For each grade and applicant's school (low/high) combination, however, the effects are more positive on average for the subgroup of African-American children. For both estimands, this leads to 95% intervals that are entirely above 0 for math scores for the following subgroups: African-American first- and third-graders, and overall, from low-applicant schools; African-American first-graders from high-applicant schools; and non-African-American first-graders from low-applicant schools. All intervals for reading covered the null value. This suggests that the positive effects reported on math scores in Section 8 for children originating from low-applicant schools are primarily attributable to gains among the African-American children in this subgroup.

[Received October 2002. Revised October 2002.]

REFERENCES

Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996), "Identification of Causal Effects Using Instrumental Variables" (with discussion), Journal of the American Statistical Association, 91, 444–472.
Barnard, J., Du, J., Hill, J. L., and Rubin, D. B. (1998), "A Broader Template for Analyzing Broken Randomized Experiments," Sociological Methods and Research, 27, 285–317.
Barnard, J., Frangakis, C., Hill, J. L., and Rubin, D. B. (2002), "School Choice in NY City: A Bayesian Analysis of an Imperfect Randomized Experiment," in Case Studies in Bayesian Statistics, Vol. V, eds. C. Gatsonis, R. E. Kass, B. Carlin, A. Carriquiry, A. Gelman, I. Verdinelli, and M. West, New York: Springer-Verlag, pp. 3–97.
Brandl, J. E. (1998), Money and Good Intentions Are Not Enough, or Why Liberal Democrats Think States Need Both Competition and Community, Washington, DC: Brookings Institute Press.
Carnegie Foundation for the Advancement of Teaching (1992), School Choice: A Special Report, San Francisco: Jossey-Bass.
Chubb, J. E., and Moe, T. M. (1990), Politics, Markets, and America's Schools, Washington, DC: Brookings Institute Press.
Cobb, C. W. (1992), Responsive Schools, Renewed Communities, San Francisco: Institute for Contemporary Studies.
Coleman, J. S., Hoffer, T., and Kilgore, S. (1982), High School Achievement, New York: Basic Books.
Cookson, P. W. (1994), School Choice: The Struggle for the Soul of American Education, New Haven, CT: Yale University Press.
Coulson, A. J. (1999), Market Education: The Unknown History, Bowling Green, OH: Social Philosophy & Policy Center.
Cox, D. R. (1958), Planning of Experiments, New York: Wiley.
D'Agostino, Ralph B., Jr., and Rubin, D. B. (2000), "Estimating and Using Propensity Scores With Incomplete Data," Journal of the American Statistical Association, 95, 749–759.
Education Week (1998), Quality Counts '98: The Urban Challenge, Public Education in the 50 States, Bethesda, MD: Editorial Projects in Education.
Frangakis, C. E., and Rubin, D. B. (1999), "Addressing Complications of Intention-to-Treat Analysis in the Combined Presence of All-or-None Treatment-Noncompliance and Subsequent Missing Outcomes," Biometrika, 86, 365–380.
——— (2001), "Addressing the Idiosyncrasy in Estimating Survival Curves Using Double-Sampling in the Presence of Self-Selected Right Censoring" (with discussion), Biometrics, 57, 333–353.
——— (2002), "Principal Stratification in Causal Inference," Biometrics, 58, 20–29.
Frangakis, C. E., Rubin, D. B., and Zhou, X. H. (2002a), "Clustered Encouragement Design With Individual Noncompliance: Bayesian Inference and Application to Advance Directive Forms" (with discussion), Biostatistics, 3, 147–164.
Frangakis, C. E., Brookmeyer, R. S., Varadhan, R., Mahboobeh, S., Vlahov, D., and Strathdee, S. A. (2002b), "Methodology for Evaluating a Partially Controlled Longitudinal Treatment Using Principal Stratification, With Application to a Needle Exchange Program," Technical Report NEP-06-02, Johns Hopkins University, Dept. of Biostatistics.
Fuller, B., and Elmore, R. F. (1996), Who Chooses? Who Loses? Culture, Institutions, and the Unequal Effects of School Choice, New York: Teachers College Press.
Gelfand, A. E., and Smith, A. F. M. (1990), "Sampling-Based Approaches to Calculating Marginal Densities," Journal of the American Statistical Association, 85, 398–409.
Gelman, A., Meng, X.-L., and Stern, H. (1996), "Posterior Predictive Assessment of Model Fitness via Realized Discrepancies" (with discussion), Statistica Sinica, 6, 733–760.
Gelman, A., and Rubin, D. B. (1992), "Inference From Iterative Simulation Using Multiple Sequences" (with discussion), Statistical Science, 7, 457–472.
Glynn, R. J., Laird, N. M., and Rubin, D. B. (1993), "Multiple Imputation in Mixture Models for Nonignorable Nonresponse With Follow-Ups," Journal of the American Statistical Association, 88, 984–993.
Goldberger, A. S., and Cain, G. G. (1982), "The Causal Analysis of Cognitive Outcomes in the Coleman, Hoffer and Kilgore Report," Sociology of Education, 55, 103–122.
Haavelmo, T. (1943), "The Statistical Implications of a System of Simultaneous Equations," Econometrica, 11, 1–12.
Hastings, W. (1970), "Monte Carlo Sampling Methods Using Markov Chains and Their Applications," Biometrika, 57, 97–109.
Hill, J. L., Rubin, D. B., and Thomas, N. (2000), "The Design of the New York School Choice Scholarship Program Evaluation," in Donald Campbell's Legacy, ed. L. Bickman, Newbury Park, CA: Sage Publications.
Hirano, K., Imbens, G. W., Rubin, D. B., and Zhou, A. (2000), "Assessing the Effect of an Influenza Vaccine in an Encouragement Design," Biostatistics, 1, 69–88.
Holland, P. (1986), "Statistics and Causal Inference," Journal of the American Statistical Association, 81, 945–970.
Imbens, G. W., and Angrist, J. D. (1994), "Identification and Estimation of Local Average Treatment Effects," Econometrica, 62, 467–476.
Imbens, G. W., and Rubin, D. B. (1997), "Bayesian Inference for Causal Effects in Randomized Experiments With Noncompliance," The Annals of Statistics, 25, 305–327.
Levin, H. M. (1998), "Educational Vouchers: Effectiveness, Choice, and Costs," Journal of Policy Analysis and Management, 17, 373–392.
Little, R. J. A. (1993), "Pattern-Mixture Models for Multivariate Incomplete Data," Journal of the American Statistical Association, 88, 125–134.
——— (1996), "Pattern-Mixture Models for Multivariate Incomplete Data With Covariates," Biometrics, 52, 98–111.
Little, R. J. A., and Rubin, D. B. (1987), Statistical Analysis With Missing Data, New York: Wiley.
Meng, X. L. (1996), "Posterior Predictive p Values," The Annals of Statistics, 22, 1142–1160.
Neal, D. (1997), "The Effects of Catholic Secondary Schooling on Educational Achievement," Journal of Labor Economics, 15, 98–123.
Neyman, J. (1923), "On the Application of Probability Theory to Agricultural Experiments: Essay on Principles, Section 9," translated in Statistical Science, 5, 465–480.
Peterson, P. E., and Hassel, B. C. (eds.) (1998), Learning From School Choice, Washington, DC: Brookings Institute Press.
Peterson, P. E., Myers, D. E., Howell, W. G., and Mayer, D. P. (1999), "The Effects of School Choice in New York City," in Earning and Learning: How Schools Matter, eds. S. E. Mayer and P. E. Peterson, Washington, DC: Brookings Institute Press.
Robins, J. M., Greenland, S., and Hu, F.-C. (1999), "Estimation of the Causal Effect of a Time-Varying Exposure on the Marginal Mean of a Repeated Binary Outcome" (with discussion), Journal of the American Statistical Association, 94, 687–712.
Rosenbaum, P. R., and Rubin, D. B. (1983), "The Central Role of the Propensity Score in Observational Studies for Causal Effects," Biometrika, 70, 41–55.
Rubin, D. B. (1974), "Estimating Causal Effects of Treatments in Randomized and Non-Randomized Studies," Journal of Educational Psychology, 66, 688–701.
——— (1977), "Assignment to Treatment Groups on the Basis of a Covariate," Journal of Educational Statistics, 2, 1–26.
——— (1978a), "Bayesian Inference for Causal Effects: The Role of Randomization," The Annals of Statistics, 6, 34–58.
——— (1978b), "Multiple Imputations in Sample Surveys: A Phenomenological Bayesian Approach to Nonresponse," in Proceedings of the Survey Research Methods Section, American Statistical Association, pp. 20–28.
——— (1979), "Using Multivariate Matched Sampling and Regression Adjustment to Control Bias in Observational Studies," Journal of the American Statistical Association, 74, 318–328.
——— (1980), Comment on "Randomization Analysis of Experimental Data: The Fisher Randomization Test," Journal of the American Statistical Association, 75, 591–593.
——— (1984), "Bayesianly Justifiable and Relevant Frequency Calculations for the Applied Statistician," The Annals of Statistics, 12, 1151–1172.
——— (1990), "Comment: Neyman (1923) and Causal Inference in Experiments and Observational Studies," Statistical Science, 5, 472–480.
The Coronary Drug Project Research Group (1980), "Influence of Adherence to Treatment and Response of Cholesterol on Mortality in the Coronary Drug Project," New England Journal of Medicine, 303, 1038–1041.
Tinbergen, J. (1930), "Determination and Interpretation of Supply Curves: An Example," in The Foundations of Econometric Analysis, eds. D. Hendry and M. Morgan, Cambridge, U.K.: Cambridge University Press.
Wilms, D. J. (1985), "Catholic School Effect on Academic Achievement: New Evidence From the High School and Beyond Follow-Up Study," Sociology of Education, 58, 98–114.

Comment

Bengt MUTHÉN, Booil JO, and C. Hendricks BROWN

1. INTRODUCTION

The article by Barnard, Frangakis, Hill, and Rubin (BFHR) is timely in that the Department of Education is calling for more randomized studies in educational program evaluation. (See the discussion of the "No Child Left Behind" initiative in, e.g., Slavin 2002.) BFHR can serve as a valuable pedagogical example of a successful, sophisticated statistical analysis of a randomized study. Our commentary is intended to provide additional pedagogical value to benefit the planning and analysis of future studies, drawing on experiences and research within our research group. [The Prevention Science Methodology Group (PSMG; www.psmg.hsc.usf.edu), co-PIs Brown and Muthén, has collaborated over the last 15 years with support from the National Institute of Mental Health and the National Institute on Drug Abuse.]

BFHR provides an exemplary analysis of the data from an imperfect randomized trial that suffers from several complications simultaneously: noncompliance, missing data in outcomes, and missing data in covariates. We are very pleased to see their application of cutting-edge Bayesian methods for dealing with these complexities. In addition, we believe the methodological issues and the results of the study have important implications for the design and analysis of randomized trials in education and for related policy decisions.

Bengt Muthén is Professor and Booil Jo is postdoc, Graduate School of Education and Information Studies, University of California, Los Angeles, Los Angeles, CA 90095 (E-mail: bmuthen@ucla.edu). C. Hendricks Brown is Professor, Department of Epidemiology and Biostatistics, University of South Florida, Tampa, FL 33620. The research of the first author was supported by National Institute on Alcohol Abuse and Alcoholism grant K02 AA 00230. The research of the second and third authors was supported by National Institute on Drug Abuse and National Institute of Mental Health grant MH40859. The authors thank Chen-Pin Wang for research assistance, Joyce Chappell for graphical assistance, and the members of the Prevention Science Methodology Group and the Fall ED299A class for helpful comments.

BFHR provides results of the New York City school choice experiment based on 1-year achievement outcomes. With the planned addition of yearly follow-up data, growth models can provide an enhanced examination of causal impact. We discuss how such growth modeling can be incorporated and provide a caution that applies to BFHR's use of only one posttest occasion. We also consider the sensitivity of the latent class ignorability assumption in combination with the assumption of compound exclusion.

2. LONGITUDINAL MODELING ISSUES

BFHR focuses on variation in treatment effect across compliance classes. This part of the commentary considers variation in treatment effect across a different type of class, based on the notion that the private school treatment effect might very well be quite different for children with different achievement development. (Also of interest is potential variation in treatment effects across schools, with respect to both the public school the child originated in and the private school the child was moved to, but this multilevel aspect of the data is left aside here for lack of space.) To study such a "treatment–baseline interaction" (or "treatment–trajectory interaction"), we will switch from BFHR's pretest–posttest analysis framework (essentially a very advanced ANCOVA-type analysis) to the growth mixture modeling framework of Muthén et al. (2002). An underlying rationale for this modeling is that individuals at different initial status levels and on different trajectories may benefit differently from a given treatment. ANCOVA controls for initial status as measured by the observed pretest score. Unlike the observed pretest score, the latent variable of initial status is free of time-specific variation and measurement error.

The focus on longitudinal aspects of the New York School Choice Study (NYSCS) is both substantively and statistically motivated. First, treatment effects may not have gained full strength after only a 1-year stay in a private school. The NYSCS currently has data from three follow-ups, that is, providing repeated-measures data from four grades. Although BFHR used percentile scores that do not lend themselves to growth modeling, a conversion to "scale scores" (i.e., IRT-based equated scores) should be possible, enabling growth modeling. Unfortunately, educational research traditionally uses scales that are unsuitable for growth modeling, such as percentile scores, normal curve equivalents, and grade equivalents (for a comparison in a growth context, see Seltzer, Frank, and Bryk 1994). Hopefully, this tradition can be changed. Second, the use of information from more time points than pretest and posttest makes it possible to identify and estimate models that give a richer description of the normative development in the control group and how the treatment changes this development.

Consider three types of latent variables for individual i. The first type, C_i, refers to BFHR's compliance principal strata. The next two relate to the achievement development, as expressed by a growth mixture model: T_i refers to trajectory class, and η_i refers to random effects within trajectory class (the within-class model is a regular mixed-effects model). Unlike the latent class variable C_i, the latent class variable T_i is a fully unobserved variable, as is common in latent variable modeling (see, e.g., Muthén 2002a). Consider the likelihood expression for individual i, using the [·] notation to denote probabilities/densities,

$$[C_i, T_i \mid X_i]\,[\eta_i \mid C_i, T_i, X_i]\,[Y_i \mid \eta_i, C_i, T_i, X_i] \times [U_i \mid \eta_i, C_i, T_i, X_i]\,[R_i \mid Y_i, \eta_i, C_i, T_i, X_i], \tag{1}$$

where X_i denotes covariates, U_i denotes a compliance stratum indicator (with C_i perfectly measured by U_i in the treatment group for never-takers and perfectly measured in the control group for always-takers, other group–class combinations having missing data), and R_i denotes indicators for missing data on the repeated-measures outcomes Y_i (pretreatment and posttreatment achievement scores). This type of model can be fitted into the latent variable modeling framework of the Mplus program (Muthén and Muthén 1998–2002, tech. app. 8), which has implemented an EM-based maximum likelihood estimator. (For related references, see the Mplus website, www.statmodel.com.) As a special case of (1), conventional random-effects growth modeling includes η_i but excludes C_i and T_i, and assumes missingness at random, so that the last term in (1) is ignored. Growth mixture modeling (Muthén and Shedden 1999; Muthén et al. 2002; Muthén and Muthén 1998–2002) includes η_i and T_i. BFHR includes C_i but not T_i (or η_i) and includes the last term in (1), drawing on the latent ignorability of Frangakis and Rubin (1999). Muthén and Brown (2001) studied latent ignorability related to T_i in the last term of (1). In randomized studies, it would be of interest to study C_i and T_i classes jointly, because individuals in different trajectory classes may show different compliance, and missingness may be determined by these classes jointly.

If data have been generated by a growth mixture model with treatment effects varying across trajectory classes, what would pretest–posttest analysis such as that in BFHR reveal? To judge the possibility of such treatment–trajectory interaction in the NYSCS, we considered several recent applications of growth mixture modeling that have used T_i to represent qualitatively different types of trajectories for behavior and achievement scores on children in school settings. Drawing on these real-data studies, two growth mixture scenarios were investigated. (A detailed description of these real-data studies and scenarios and their parameter values is given in Mplus Web Note 5, at www.statmodel.com/mplus/examples/webnote.html.) For simplicity, no missing data on the outcome or pretest is assumed, and C_i classes are not present. In a three-class scenario, the treatment effect is noteworthy only for a 70% middle class, assuming that the low-class membership (10%) hinders individuals from benefiting from the treatment and that the high-class membership (20%) does not really need the treatment. The achievement development in the three-class scenario is shown in Figure 1(a), and the corresponding posttest (y2)–pretest (y1) regressions are shown in Figure 1(b). The lines denoted ANCOVA show a regular ANCOVA analysis, allowing for an interaction between treatment and pretest (different slopes). In the three-class scenario, the ANCOVA interaction is not significant at n = 2,000, and the treatment effect in the middle class is underestimated by 20%, but overestimated in the other two classes. In a two-class scenario (not shown here), where the treatment is noteworthy only for individuals in the low class (50%), ANCOVA detects an interaction that is significant at the NYSCS sample size of n = 2,000, but underestimates the treatment effect for most children in the low class. (At the low-class average pretest value of 0, the treatment effect is underestimated by 32%.)
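To make the flavor of such a scenario concrete, a minimal simulation sketch follows. The class proportions match the three-class scenario above, but every other number is an illustrative stand-in, not a value from Mplus Web Note 5:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 2000
# Three latent trajectory classes: low (10%), middle (70%), high (20%).
cls = rng.choice([0, 1, 2], size=n, p=[0.1, 0.7, 0.2])
means = np.array([10.0, 25.0, 45.0])   # hypothetical class means on the test scale
effect = np.array([0.0, 5.0, 0.0])     # treatment helps only the middle class
z = rng.integers(0, 2, size=n)         # randomized treatment assignment
y1 = means[cls] + rng.normal(0, 5, n)                    # pretest
y2 = means[cls] + effect[cls] * z + rng.normal(0, 5, n)  # posttest

# Regular ANCOVA with a treatment-by-pretest interaction, as in Figure 1(b);
# the fitted effect at a given pretest value can then be compared with the
# class-specific truth to see the attenuation discussed above.
X = sm.add_constant(np.column_stack([z, y1, z * y1]))
fit = sm.OLS(y2, X).fit()
print(fit.params, fit.bse)  # [intercept, treatment, pretest, interaction]
```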

Although the NYSCS children are selected from low-performing schools (the average NYSCS math and reading percentile rankings are around 23–28), there may still be sufficient heterogeneity among children in their achievement growth to make a treatment–trajectory interaction plausible. The three-class scenario is possible, perhaps with more children in the low class relative to the other two classes. If this is the case, the ANCOVA analysis shown in Figure 1 suggests a possible reason for BFHR's finding of low treatment effects. The empirical studies and the results in Figure 1 suggest that future program evaluations may benefit from exploring variation in treatment effects across children characterized by different development. Using data from at least two posttreatment time points (three time points total), the class-specific treatment effects generated in these data can be well recovered by growth mixture modeling. (Monte Carlo simulation results are given in Mplus Web Note 5, at www.statmodel.com/mplus/examples/webnote.html.)


Figure 1. Growth Mixture Modeling Versus Pretest–Posttest Analysis.

A more flexible analysis is obtained with more posttreatment time points. An improved design for the determination of the latent trajectory classes is to use more than one pretreatment time point, so that the trajectory class membership is better determined before the treatment starts.

3. COMPOUND EXCLUSION AND LATENT IGNORABILITY

Based on the ideas of principal stratification (Frangakis and Rubin 2002) and latent ignorability (Frangakis and Rubin 1999), BFHR successfully demonstrates that the complexities of educational studies can be better handled under more explicit and flexible sets of assumptions. Although we think that their structural assumptions are reasonable in the NYSCS, we would like to add some thoughts on the plausibility of two other assumptions, considering more general situations.

Compound exclusion (CE) is one of the key structural assumptions in identifying principal effects under latent ignorability. However, the plausibility of this assumption can be questioned in practice (Frangakis et al. 2002; Hirano et al. 2000; Jo 2002a, 2002c; Shadish, Cook, and Campbell 2002; West and Sagarin 2000). In the NYSCS, it seems realistic to assume that winning a lottery has some positive impact on always-takers; however, it is less clear how winning a lottery will affect never-takers. One possibility is that winning a lottery has a negative impact on parents, because they fail to benefit from it. Discouraged parents may have a negative influence on a child's test scores or response behaviors. This negative effect may become more evident if noncompliance is due to parents' low expectation of or lack of interest in their child's education. Another possibility is that winning a lottery has a positive impact on a child's test scores or response behaviors. For example, parents who are discouraged by being unable to send their child to private schools even with vouchers may try harder to improve the quality of existing resources (e.g., in the public school their child attends) and be more motivated to support their child to improve his or her academic performance. Given these competing possibilities, it is not easy to predict whether and how CE is violated.

Depending on the situation, causal effect estimates can be quite sensitive to violation of the exclusion restriction in outcome missingness (Jo 2002b), which is less known than the impact of violating the exclusion restriction in observed outcomes (Angrist et al. 1996; Jo 2002a). The implication of possible violation of CE and its impact is that the relative benefit of models assuming latent ignorability (LI) and standard ignorability (SI) depends on degrees of deviation from CE and SI. Identification of causal effects under LI relies on the generalized (compound) exclusion restriction (i.e., both on the outcomes and missingness of outcomes), whereas identification of causal effects under SI relies on the standard exclusion restriction (i.e., only on the outcomes). Therefore, in some situations, the impact of deviation from CE may outweigh the impact of deviation from SI, resulting in more biased causal effect estimates in models assuming LI than in models assuming SI (Jo 2002b). For example, if SI holds but CE is seriously violated (say, a 20% increase in the response rate due to treatment assignment for compliers and a 15% increase for never-takers), causal effect estimates and the coverage probability assuming LI and CE can drastically deviate from the true value and the nominal level. This type of violation does not affect models assuming SI and the standard exclusion restriction. To empirically examine the plausibility of SI, LI, and CE, it will be useful to do sensitivity analyses of models imposing different combinations of these assumptions. As BFHR points out, this investigation can be conducted by relaxing compound exclusion (e.g., Frangakis et al. 2002; Hirano et al. 2000) or by using alternative structural assumptions (e.g., Jo 2002c). More research is needed to examine the efficiency of these alternative models and to explore factors associated with insensitivity of LI models to violation of compound exclusion.

4. CONCLUSION

Causal inferences of the type BFHR provides are a dramatic improvement over the existing literature now available on the question of whether school choice will produce better achievement outcomes for children in an urban public school system. The randomized lottery provides an exceptionally powerful tool for examining the impact of a program, far more useful than observational studies that have causal change intertwined hopelessly with self-selection factors. Statisticians are just now investigating variations in such principal strata analyses, that is, those involving latent classes formed as a function of randomized trials involving intervention invitations (such as vouchers), encouragement designs, and field trial designs involving more than one randomization (Brown and Liao 1999). The latent categories in this article, which BFHR labels "complier," "never-taker," "always-taker," and "defier," represent only one type of design. Other terms may be more relevant to the scientific questions underlying trials in which subjects are randomly assigned to different levels of invitation (e.g., Angrist and Imbens 1995) or different levels of implementation. Such trials not only have great potential for examining questions of effectiveness, sustainability, and scalability, but also require terms more consistent with adherence than compliance. Again, we congratulate the authors on an important addition to the methodological literature that we predict will have lasting impact.

ADDITIONAL REFERENCES

Angrist, J. D., and Imbens, G. W. (1995), "Two-Stage Least Squares Estimation of Average Causal Effects in Models With Variable Treatment Intensity," Journal of the American Statistical Association, 90, 431–442.
Brown, C. H., and Liao, J. (1999), "Principles for Designing Randomized Preventive Trials in Mental Health: An Emerging Developmental Epidemiologic Perspective," American Journal of Community Psychology, 27, 673–709.
Jo, B. (2002a), "Model Misspecification Sensitivity Analysis in Estimating Causal Effects of Interventions With Noncompliance," Statistics in Medicine, 21, 3161–3181.
——— (2002b), "Sensitivity of Causal Effects Under Ignorable and Latent Ignorable Missing-Data Mechanisms," available at www.statmodel.com/mplus/examples/jo.
——— (2002c), "Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications" (with discussion), Journal of Educational and Behavioral Statistics, 27, 385–420.
Muthén, B. (2002a), "Beyond SEM: General Latent Variable Modeling," Behaviormetrika, 29, 81–117.
Muthén, B., and Brown, C. H. (2001), "Non-Ignorable Missing Data in a General Latent Variable Modeling Framework," presented at the annual meeting of the Society for Prevention Research, Washington, DC, June 2001.
Muthén, B., Brown, C. H., Masyn, K., Jo, B., Khoo, S. T., Yang, C. C., Wang, C. P., Kellam, S., Carlin, J., and Liao, J. (2002), "General Growth Mixture Modeling for Randomized Preventive Interventions," Biostatistics, 3, 459–475.
Muthén, L., and Muthén, B. (1998–2002), Mplus User's Guide, Los Angeles: authors.
Muthén, B., and Shedden, K. (1999), "Finite Mixture Modeling With Mixture Outcomes Using the EM Algorithm," Biometrics, 55, 463–469.
Seltzer, M. H., Frank, K. A., and Bryk, A. S. (1994), "The Metric Matters: The Sensitivity of Conclusions About Growth in Student Achievement to Choice of Metric," Educational Evaluation and Policy Analysis, 16, 41–49.
Shadish, W. R., Cook, T. D., and Campbell, D. T. (2002), Experimental and Quasi-Experimental Designs for Generalized Causal Inference, Boston: Houghton Mifflin.
Slavin, R. E. (2002), "Evidence-Based Education Policies: Transforming Educational Practice and Research," Educational Researcher, 31, 15–21.
West, S. G., and Sagarin, B. J. (2000), "Participant Selection and Loss in Randomized Experiments," in Research Design, ed. L. Bickman, Thousand Oaks, CA: Sage, pp. 117–154.

Comment

Alan B. KRUEGER and Pei ZHU

The New York City school voucher experiment provides some of the strongest evidence on the effect of private school vouchers on student test achievement yet available. Barnard et al. provide a useful reevaluation of the experiment, and some of the authors were integrally involved in the design of the experiment. We will leave it to other discussants to comment on the Bayesian methodological advances in their paper, and instead comment on the substantive lessons that can be learned from the experiment and the practical lessons that emerge from the novel design of the experiment.

We have the advantage of having access to more complete data than Barnard et al. used in preparing their paper. This includes more complete baseline test information, data on multichild families as well as single-child families, and three years of follow-up test data instead of just one year of follow-up data.

Alan B. Krueger is Professor and Pei Zhu is graduate student, Economics Department, Princeton University, Princeton, NJ 08544 (E-mail: akrueger@princeton.edu). In part, this comment extends and summarizes results of Krueger and Zhu (2003). Readers are invited to read that article for a more in-depth analysis of many of the issues raised in this comment. Some of the results presented here differ slightly from those in our earlier article, however, because the definition of family size in this comment corresponds to the one used by Barnard et al. rather than to the definition in our earlier article.

In our comment, we use the more comprehensive sample because results for this sample are more informative, but we highlight where differences arise from using the sample analyzed by Barnard et al.

Three themes emerge from our analysis. First, simplicity and transparency are underappreciated virtues in statistical analysis. Second, it is desirable to use the most recent, comprehensive data for the widest sample possible. Third, there is no substitute for probing the definitions and concepts that underlie the data.

1. RANDOM ASSIGNMENT

As Barnard et al. explain, the voucher experiment entailed a complicated block design, and different random assignment procedures were used in the first and second set of blocks. In the first block, a propensity matched-pairs design (PMPD) method was used. In this block, far more potential control families were available than was money to follow them up. Rather than select a random sample of controls for follow-up after randomly selecting the treatments, the researchers estimated a propensity score model to align controls with the treatments, and then used a nearest-available-neighbor Mahalanobis match to select specific paired control families for follow-up. In principle, the PMPD design should improve the precision of the estimates by reducing the chance imbalances that can occur in a simple randomized block design. A standard randomized block design was used in the remaining four blocks.
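For intuition only, a nearest-available-neighbor Mahalanobis match can be sketched as the simple greedy procedure below. This is our own illustrative variant, not the (proprietary, and more elaborate) matching algorithm actually used in the evaluation:

```python
import numpy as np
from scipy.spatial.distance import cdist

def nearest_available_match(x_treated, x_control):
    """Greedy nearest-available-neighbor matching on Mahalanobis distance:
    each treated unit, taken in turn, is paired with the closest control
    covariate vector that has not yet been used."""
    pooled = np.vstack([x_treated, x_control])
    vi = np.linalg.inv(np.cov(pooled, rowvar=False))   # inverse pooled covariance
    dist = cdist(x_treated, x_control, metric="mahalanobis", VI=vi)
    available = np.ones(len(x_control), dtype=bool)
    pairs = {}
    for i in range(len(x_treated)):
        d = np.where(available, dist[i], np.inf)       # mask already-used controls
        j = int(np.argmin(d))
        pairs[i] = j
        available[j] = False
    return pairs

# Hypothetical usage with simulated covariates (5 matching variables).
rng = np.random.default_rng(4)
pairs = nearest_available_match(rng.normal(size=(50, 5)), rng.normal(size=(200, 5)))
```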

Table 1. Efficiency Comparison of Two Methods of Random Assignment: Estimated Treatment Effect and Standard Error, by Method of Random Assignment

                                Propensity score           Randomized blocks          Relative        Sample-size-adjusted
                                match subsample            subsample                  sampling        relative sampling
Year     Model                  Coefficient   Obs.         Coefficient   Obs.         variance        variance
                                                                                      (RB/PMPD)       (RB/PMPD)

All students
Year 1   Control for baseline   0.33 (1.40)   721          2.10 (1.38)   734          .972            .989
         Omit baseline          0.32 (1.40)   1,048        −1.02 (1.58)  1,032        1.274           1.254
Year 3   Control for baseline   1.02 (1.56)   613          0.67 (1.74)   637          1.244           1.293
         Omit baseline          −0.31 (1.57)  900          −0.45 (1.78)  901          1.285           1.287

African-American students
Year 1   Control for baseline   0.32 (1.88)   302          8.10 (1.71)   321          .827            .879
         Omit baseline          2.10 (1.99)   436          3.19 (2.07)   447          1.082           1.109
Year 3   Control for baseline   4.21 (2.51)   247          6.57 (2.09)   272          .693            .764
         Omit baseline          2.26 (2.47)   347          3.05 (2.31)   386          .875            .973

NOTE: Standard errors are in parentheses. The treatment effect coefficient is from a regression of test scores on a dummy indicating assignment to receive a voucher (1 = yes), lottery randomization strata dummies, and, in indicated models, baseline test scores. Bootstrap standard errors account for within-family correlation in residuals. Year 1 or 3 refers to follow-up year. The last column adjusts the relative sampling variances of the treatment effects for differences in relative sample sizes. The sample of African-American students consists of those whose mothers' race/ethnicity is identified as non-Hispanic Black/African-American.


Barnard et al. emphasize that the PMPD block was more balanced for 15 of the 21 baseline variables that they examined. But the key question is the extent to which the precision of the estimates was improved, not the covariate balance, because both blocks yield unbiased estimates. If the covariates explain relatively little of the outcome variable, then the PMPD will add very little. In Table 1 we provide standard intent-to-treat (ITT) estimates (that is, the difference in mean test scores between treatments and controls, conditional on the strata used for random assignment: family size by block by high/low-achieving school) and, more importantly, their standard errors, separately for the PMPD subsample and the randomized block subsample. The outcome variable is the average of the national percentile rank on the math and reading segments of the Iowa Test of Basic Skills, taken either 1 year or 3 years after random assignment. Two sets of results are presented: one set controlling for baseline test scores for the subsample with baseline scores, and the other without controlling for scores for the larger sample. The last column adjusts the relative sampling variances for differences in sample size between the blocks.
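A conventional ITT regression of this form can be sketched as follows. This is a minimal illustration with simulated stand-in data and hypothetical column names; Krueger and Zhu report bootstrap standard errors that account for within-family correlation, for which standard errors clustered by family are a simple stand-in:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data; in the real analysis these would be NYSCS records.
rng = np.random.default_rng(3)
n = 500
df = pd.DataFrame({
    "score": rng.normal(25, 15, n),    # follow-up test score percentile rank
    "voucher": rng.integers(0, 2, n),  # voucher-offer dummy (1 = offered)
    "stratum": rng.integers(0, 30, n), # lottery randomization stratum
    "family": rng.integers(0, 300, n), # family identifier for clustering
})

# Regress the score on the voucher-offer dummy and strata dummies;
# cluster the standard errors on family.
itt = smf.ols("score ~ voucher + C(stratum)", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["family"]}
)
print(itt.params["voucher"], itt.bse["voucher"])
```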

For all students (panel A), the standard errors are about 10% smaller in the PMPD sample. For the Black students (panel B), however, the standard errors are essentially the same or slightly larger in the PMPD sample. This finding is unexpected, because Hill, Rubin, and Thomas (2000) predicted that analyses of subgroups would have more power in the PMPD design, because matching should lead to a more nearly equal representation of subgroups in the treatment and control groups. In addition, the covariate balance is less good for the Black students in the PMPD than in the randomized blocks if we compute Z scores for the differences between treatments and controls for the baseline covariates. One possibility is that because Black students could have been matched to non-Black students in the PMPD design, this subsample was actually less balanced than the subsample of Blacks from the more conventional randomized block design.

The increase in power from the PMPD design is relatively modest, even in the full sample. Of much more practical importance is the fact that the complexity of the PMPD design caused Mathematica to initially calculate incorrect baseline sample weights. The baseline weights were designed to make the follow-up sample representative of all eligible applicants for vouchers. The initial weights assigned much too little importance to the control sample in the PMPD block. Because these families represented many unselected controls, they should have been weighted heavily. The revised weights increased the weight on the PMPD controls by 620%. In contrast, the weight increased by just 18% for the rest of the sample.

This error had grave consequences for the initial inferences and analyses of the data. First, using the faulty weights, Peterson, Myers, and Howell (1998) reported highly statistically significant differences in baseline test scores between the treatment and control groups, with control group members scoring higher on both the math and reading exams. After the mistake was discovered and the weights were revised, however, the baseline differences became small and statistically insignificant. (We note, however, that if we limit attention to the students from single-child families that Barnard et al. use, and use the more complete baseline test score data, then there is a statistically significant difference in baseline reading scores between treatments and controls, even with the revised baseline weights. If we pool together students from single-child and multichild families, then the baseline difference is insignificant. This is another reason why we think it is more appropriate to use the broader sample that includes children from all families.)

Second, because of the inaccurate inference that there was a difference in baseline ability between treatments and controls, Mathematica's researchers were discouraged from analyzing (or at least from presenting) results that did not condition on baseline scores. This is unfortunate, because conditioning on baseline scores caused the researchers to drop from the sample all the students who were initially in kindergarten (because these students were not given baseline tests) and 11% of students initially in grades 1–4. Including students with missing baseline scores increases the sample by more than 40% and expands the population to which the results can be generalized. As explained later, and in an article by Krueger and Zhu (2003), qualitatively different results are found if students with missing baseline scores are included in the sample. Because assignment to receive a voucher was random within lottery strata, estimates that do not condition on baseline scores are nonetheless unbiased. Moreover, it is inefficient to exclude students with missing baseline scores (most of whom had follow-up test scores), and such a sample restriction can potentially cause sample selection bias.

Of lesser importance is the fact that the PMPD design also complicates the calculation of standard errors. The matching of treatments and controls on selected covariates creates a dependence between paired observations. Moreover, if the propensity score model is misspecified, then the equation error also creates a dependence between observations. Researchers have not taken this dependence into account in the calculation of standard errors. We suspect, however, that this is likely to cause only a small understatement of the standard errors, because the covariates do not account for much of the residual variance of test scores and because the PMPD sample is only about half of the overall sample. (We tried to compute the propensity score to adjust the standard errors ourselves, but were unable to replicate the PMPD model because it was not described in sufficient detail and because the computer programs are proprietary. This also prevented us from replicating estimates in Table 4 of Barnard et al.)

The problems created by the complicated experimental design, particularly concerning the weights, lead us to reiterate the advice of Cochran and Cox (1957): "A good working rule is to use the simplest experimental design that meets the needs of the occasion." It seems to us that in this case, the PMPD design introduced unnecessary complexity that inadvertently led to a consequential mistake in the computation of weights, with very little improvement in efficiency. This was not the fault of the architects of the PMPD design, who did not compute the weights, but it was related to the complexity of the design. Simplicity and transparency, in designs, methods, and procedures, can help avoid mistakes down the road, which are almost inevitable in large-scale empirical studies. Indeed, Mathematica recently informed us that they still do not have the baseline weights exactly correct, 5 years after random assignment.

2. INTENT-TO-TREAT ESTIMATION RESULTS

Barnard et al. devote much attention to addressing potential problems caused by missing data for the sample of students in grades 1–4 at baseline; this follows the practice of earlier reports by Mathematica, which restricted the analysis of student outcomes (but not parental responses, such as satisfaction) to those enrolled in grades 1–4 at baseline. In our opinion, a much more important substantive problem results from the exclusion of students in the kindergarten cohort, who were categorically dropped because they were not given baseline tests. Because assignment to treatment status was random (within strata), a simple comparison of means between treatments and controls, without conditioning on baseline scores, provides an unbiased estimate of the average treatment effect. Moreover, as Barnard et al. and others show, treatments and controls were well balanced in terms of baseline characteristics, so there is no reason to suspect that random assignment was somehow corrupted.

Table 2 presents regression estimates of the ITT effect using various samples. Because we cannot replicate the Bayesian estimates without knowing the propensity score, we present conventional ITT estimates. For each entry in the first column, we regressed the test score percentile rank on a voucher offer dummy, 30 dummies indicating lottery strata (block × family size × high/low school), and baseline test scores. The sample is limited to those with baseline test data. Our results differ from the ITT estimates in Table 5 of Barnard et al. because we use the revised weights and a more comprehensive sample that also includes students from multichild families. Barnard et al. report that their Bayesian estimates are "generally more stable" than the conventional ITT estimates, "which in some cases are not even credible," but the extreme ITT estimates that they refer to, for 4th graders from high-achieving schools, are based on only 15 students. For the samples that pool students across grades, the results of the conventional ITT estimates are a priori credible and probably not statistically different from their Bayesian estimates.

In the second column, we use the same sample as in column 1 but omit the baseline test score from the model. In the third column, we expand the sample to include those with missing baseline scores. In the fourth column, we continue to include students with missing baseline scores, but restrict the sample to those initially in low-achieving public schools.

We focus mainly on the results for Black students and students from low-achieving public schools, because the effect of offering a voucher on either reading or math scores for the overall sample is always statistically insignificant, and because most public policy attention has focused on these two groups.

Although it has not been made explicit in previous studies of these data, Black students have been defined as children with a non-Hispanic Black/African-American mother. We use this definition and a broader one to probe the sensitivity of the results.

Omitting baseline scores has little qualitative effect on theestimates when the same sample is used (compare columns 1and 2) the coef cient typically changes by no more than halfa standard error For year 3 scores for Black students for ex-ample the treatment effect on the composite score is 52 pointst D 32 when controlling for baseline scores and 50 pointst D 26 when not controlling This stability is expected in a

Krueger and Zhu Comment 317

Table 2. Estimated Treatment Effects With and Without Controlling for Baseline Scores: Coefficient on Voucher Dummy From Test Score Regressions for Various Samples

Columns: (1) subsample with baseline scores, controls for baseline scores; (2) subsample with baseline scores, omits baseline scores; (3) full sample, omits baseline scores; (4) low applicant school subsample, omits baseline scores.

Test        Sample                                  (1)             (2)             (3)             (4)
First follow-up test
Composite   Overall                                 1.27 (0.96)     0.42 (1.28)     −0.33 (1.06)    −0.60 (1.06)
            Mother Black, non-Hispanic              4.31* (1.28)    3.64* (1.73)    2.66 (1.42)     1.75 (1.57)
            Either parent Black, non-Hispanic       3.55* (1.24)    2.52 (1.73)     1.38 (1.42)     0.53 (1.55)
Reading     Overall                                 1.11 (1.04)     0.03 (1.36)     −1.13 (1.17)    −1.17 (1.18)
            Mother Black, non-Hispanic              3.42* (1.59)    2.60 (1.98)     1.38 (1.74)     0.93 (1.84)
            Either parent Black, non-Hispanic       2.68 (1.50)     1.49 (1.97)     −0.09 (1.71)    −0.53 (1.81)
Math        Overall                                 1.44 (1.17)     0.80 (1.45)     0.47 (1.17)     −0.03 (1.15)
            Mother Black, non-Hispanic              5.28* (1.52)    4.81* (1.93)    4.01* (1.54)    2.68 (1.63)
            Either parent Black, non-Hispanic       4.42* (1.48)    3.54 (1.90)     2.86 (1.51)     1.59 (1.63)
Third follow-up test
Composite   Overall                                 0.90 (1.14)     0.36 (1.40)     −0.38 (1.19)    0.10 (1.16)
            Mother Black, non-Hispanic              5.24* (1.63)    5.03* (1.92)    2.65 (1.68)     3.09 (1.67)
            Either parent Black, non-Hispanic       4.70* (1.60)    4.27* (1.92)    1.87 (1.68)     1.90 (1.67)
Reading     Overall                                 0.25 (1.26)     −0.42 (1.51)    −1.00 (1.25)    −0.08 (1.24)
            Mother Black, non-Hispanic              3.64 (1.87)     3.22 (2.19)     1.25 (1.83)     2.06 (1.83)
            Either parent Black, non-Hispanic       3.37 (1.86)     2.75 (2.18)     0.76 (1.83)     1.23 (1.82)
Math        Overall                                 1.54 (1.30)     1.15 (1.53)     0.24 (1.33)     0.29 (1.28)
            Mother Black, non-Hispanic              6.84* (1.86)    6.84* (2.09)    4.05* (1.86)    4.13* (1.85)
            Either parent Black, non-Hispanic       6.04* (1.83)    5.78* (2.09)    2.97 (1.85)     2.58 (1.82)

NOTE: Standard errors are in parentheses. The dependent variable is the test score national percentile rank (NPR); the reported coefficient is the coefficient on the voucher offer dummy. All regressions control for 30 randomization-strata dummies; models in column (1) also control for baseline math and reading test scores. Bootstrap standard errors are robust to correlation in residuals among students in the same family. An asterisk (bold font in the original) indicates that the absolute t ratio exceeds 1.96.

Sample sizes in year 1 for the subsample with baseline scores / without: overall, 1,455/2,080; mother Black, 623/883; either parent Black, 684/968.

Sample sizes in year 3 for the subsample with baseline scores / without: overall, 1,250/1,801; mother Black, 519/733; either parent Black, 572/807.

Sample sizes in year 1 for the subsample from low schools, without baseline scores: overall, 1,802; mother Black, 770; either parent Black, 843.

Sample sizes in year 3 for the subsample from low schools, without baseline scores: overall, 1,569; mother Black, 647; either parent Black, 712.
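The table note on bootstrap standard errors can be made concrete with a small sketch: resample whole families (rather than individual students) so that the correlation of residuals among siblings is preserved. The data frame, the family_id column, and the itt_coefficient function below are all hypothetical.

```python
import numpy as np
import pandas as pd

def family_bootstrap_se(df, itt_coefficient, n_boot=1000, seed=0):
    """Resample whole families with replacement; itt_coefficient(sample)
    should refit the regression and return the voucher-offer coefficient."""
    rng = np.random.default_rng(seed)
    groups = {fam: g for fam, g in df.groupby("family_id")}
    families = np.array(list(groups))
    draws = []
    for _ in range(n_boot):
        picked = rng.choice(families, size=len(families), replace=True)
        sample = pd.concat([groups[f] for f in picked], ignore_index=True)
        draws.append(itt_coefficient(sample))
    return float(np.std(draws, ddof=1))  # bootstrap SE of the coefficient
```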

When the sample is expanded to include those with missing baseline scores in column 3, however, the ITT effect in year 3 for Black students falls almost in half, to 2.65 points, with a t ratio of 1.58. The effect on the math score is statistically significant, as Barnard et al. emphasize. Qualitatively similar results are found if baseline covariates such as mother's education and family income are included as regressors (see Krueger and Zhu 2003).

The classification of Black students used in the study is somewhat idiosyncratic. Race and ethnicity were collected in a single question in the parental survey, contrary to the OMB guidelines for government surveys, so it is impossible to identify Blacks of Hispanic origin (of which there are a large number in New York). In addition, the survey did not directly ask about the child's race or ethnicity, so children's race must be inferred from parents' race/ethnicity. A broader definition of Black students, and one that probably comes closer to the definition used in government surveys and most other educational research, would also include in the sample of Black students those who have a Black father. According to the 1990 Census, race was reported as Black for 85% of the children with a Black father and a Hispanic mother (the overwhelming non-Black portion of the sample) in the New York metropolitan region.

Hence we also present results for a sample in which either parent is listed as "Black, non-Hispanic." This increases the sample of Black students by about 10%, and the results are even weaker for this sample. For example, the ITT estimate using the more comprehensive sample of Black students enrolled in grades K–4 at baseline is 1.87 points (t = 1.11) on the third-year follow-up composite test, although even with this sample


the effect on math is significant at the 10% level in year 1, as Barnard et al. have found, and almost significant in year 3. So we conclude that the effect of the opportunity to use a private school voucher on the composite score for the most comprehensive sample of Black students is insignificantly different from 0, although it is possible that there was initially a small beneficial effect on the math score for Black students.

In the final column we present results for the subsample of students originally enrolled in schools with average test scores below the median score in New York City. In each case the results for this subsample are quite similar to those for the full sample, and a formal test of the null hypothesis that students from low- and high-achieving schools benefit equally from vouchers is never close to rejecting. It also appears very unlikely that the differences between the treatment effects for applicants from low- and high-achieving schools in Barnard et al.'s Table 4 are statistically significant either.

3. WHAT WAS BROKEN

We agree with Barnard et al. that the experiment was broken, in the sense that attrition and missing data were common. Previous analyses were also strained, if not broken, by their neglect of the cohort of students originally in kindergarten, who were in third grade at the end of the experiment and whose follow-up test scores were ignored. Including these students qualitatively alters the results for Black students. The experiment was also broken in the sense that years passed before correct baseline weights were computed.

We disagree, however, with the interpretation that the experiment was broken because compliance was less than 100%. This depends on the question that one is interested in answering. Because most interest in the experiment for policy purposes centers on the impact of offering vouchers on achievement, not on compelling students to use vouchers, we think the ITT estimates, which reflect inevitable partial usage of vouchers, are most relevant (see also Rouse 1998; Angrist et al. 2003). Moreover, New York had a higher voucher take-up rate than the experiments in Dayton and the District of Columbia, so one could argue that the New York experiment provides an upper bound estimate of the effect of offering vouchers. On the other hand, if the goal is to use the experiment to estimate the effect of attending private school on achievement, then such methods as instrumental variables or those used by Barnard et al. are necessary to estimate the parameter of interest. We consider this of secondary interest in this case, however.

ADDITIONAL REFERENCES

Angrist, J., Bettinger, E., Bloom, E., King, E., and Kremer, M. (2003), "Vouchers for Private Schooling in Colombia: Evidence From a Randomized Natural Experiment," American Economic Review, 92, 1535–1558.

Cochran, W., and Cox, G. (1957), Experimental Designs (2nd ed.), New York: Wiley.

Krueger, A., and Zhu, P. (2003), "Another Look at the New York City School Voucher Experiment," American Behavioral Scientist, forthcoming. Also available from Industrial Relations Section, Princeton University, at http://www.irs.princeton.edu.

Peterson, P., Myers, D., and Howell, W. (1998), "An Evaluation of the New York City Scholarships Program: The First Year," Mathematica Policy Research, Inc., available at http://www.ksg.harvard.edu/pepg/pdf/ny1rpt.pdf.

Rouse, C. (1998), "Private School Vouchers and Student Achievement: An Evaluation of the Milwaukee Parental Choice Program," The Quarterly Journal of Economics, 113, 553–602.

Comment

Richard A. BERK and Hongquan XU

1. INTRODUCTION

The article by Barnard, Frangakis, Hill, and Rubin (hereafter BFHR) is a virtuoso performance. By applying the Neyman–Rubin model of causal effects and building on a series of important works on the estimation and interpretation of causal effects (e.g., Angrist, Imbens, and Rubin 1996), BFHR manage to extract a reasonable set of findings from a "broken" randomized experiment. The article more generally underscores the important difference between estimating relationships in a consistent manner and interpreting those estimates in causal terms; obtaining consistent estimates is only part of the enterprise. Finally, the article emphasizes that assumptions made so that proper estimates may be obtained are not just technical moves of convenience. Rather, they are statements about how the empirical world is supposed to work. As such, they need to be examined concretely with reference to the phenomena being studied, evaluated empirically whenever possible, and at a minimum subjected to sensitivity tests.

Richard A. Berk is Professor and Hongquan Xu is Assistant Professor, Department of Statistics, University of California, Los Angeles, CA 90095 (E-mail: berk@stat.ucla.edu).

As battle-tested academics, we can certainly quibble here and there about some aspects of the article. For example, might it not have made sense to construct priors from the educators and economists who were responsible for designing the intervention? Moreover, might it not have made sense to interpret the results using a yardstick of treatment effects that are large enough to matter in practical terms? Nonetheless, we doubt that overall we could have done as well as BFHR did. Moreover, most of our concerns depend on features of the data that cannot be known from a distance. We would have had to carefully examine the data ourselves. As a result, we focus on why a virtuoso performance was needed to begin with. Why was this trip necessary?



2. NONCOMPLIANCE

The implementation of randomized field experiments, as desirable as they are, invites any number of well-known problems (Berk 1990). One of these problems is that human subjects often do not do what they are told, even when they say they will. There is a recent scholarly literature on noncompliance (Metry and Meyer 1999; Johnson 2000) and a citation trail leading back to evaluations of the social programs of the War on Poverty (Haveman 1977). Some noncompliance can be expected except when there is unusual control over the study subjects. Our recent randomized trial of the inmate classification system used by the California Department of Corrections is one example (Berk, Ladd, Graziano, and Baek 2003).

An obvious question follows: Why does it seem that the individuals who designed this experiment were caught by surprise? As best we can tell, little was done to reduce noncompliance and, more important for the analyses of BFHR, apparently almost no data were collected on factors that might be explicitly related to noncompliance (Hill, Rubin, and Thomas 2000).

Without knowing far more about the design stages of the experiment, we find it difficult to be very specific. But from the literatures on compliance and school choice, the following are probably illustrative indicators relevant for parents who would not be inclined to use a school voucher; for other forms of noncompliance, other measures would be needed:

1. Whether the adults in the household are employed outside the home during the day

2. Whether there is any child care at the private schools before classes begin and after they end for children of employed parents

3. A household's ability to pay for private school beyond the value of the voucher

4. Travel time to the nearest private schools

5. Availability of transportation to those schools

6. Religious preference of the household and religious affiliation of the private school

From these examples, and others that individuals closer to the study would know about, one can imagine a variety of measures to reduce noncompliance. For instance, key problems for any working parent are child care before and after school and round-trip school transportation. These might be addressed directly with supplemental support. If no such resources could be obtained, then the study design might be altered to screen out, before random assignment, households for whom compliance was likely to be problematic.

3. MISSING DATA

No doubt BFHR know more about the subtleties of the missing data than we do. Nevertheless, the same issues arise for missing data that arise for noncompliance. Surely problems with missing data could have been anticipated. Factors related to nonresponse to particular items could have been explicitly measured, and prevention strategies perhaps could have been deployed.

4. DESIGN EFFICIENCY

A carefully designed experiment not only provides a solid basis for statistical inference, but also gives credible evidence for causation. The New York School Choice Scholarships Program (NYSCSP) was apparently the first good example of a randomized experiment to examine the potential benefits of vouchers for private schools. Offering a scholarship (treatment) randomly through a lottery can (1) provide a socially acceptable way to ration scarce resources, (2) protect against undesirable selection bias, and (3) produce balance between the observed and unobserved variables that may affect the response. In addition to the randomization and blocking, NYSCSP implemented a PMPD (propensity matched pairs design) to choose a control group from a large candidate pool and to balance many covariates explicitly. Table 2 of BFHR shows that various background variables were balanced properly between the treatment group and the control group.
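As a rough illustration of the matching step in a propensity matched pairs design (our own simplifying sketch, not MPR's actual algorithm), one can fit a propensity score and greedily pair each treated unit with the nearest unused control; the inputs X and z are hypothetical, and the control pool is assumed at least as large as the treated group.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def propensity_matched_pairs(X, z):
    """X: (n, k) covariate matrix; z: length-n 0/1 treatment indicator.
    Returns a list of (treated_index, control_index) pairs."""
    ps = LogisticRegression(max_iter=1000).fit(X, z).predict_proba(X)[:, 1]
    remaining = set(np.flatnonzero(z == 0))  # candidate control pool
    pairs = []
    for i in np.flatnonzero(z == 1):
        j = min(remaining, key=lambda c: abs(ps[c] - ps[i]))  # nearest unused control
        remaining.remove(j)
        pairs.append((i, j))
    return pairs
```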

But might it have been possible to do a bit better? BFHR did not mention that the sample size of the applicant's school was not balanced, although winners from high and low schools were balanced between the treatments and the controls. An ideal design would have had equal winners from "bad" (or low) and "good" (or high) public schools. In the NYSCSP, 85% of the winners were from bad schools and 15% were from good schools, to achieve the social goals of the School Choice Scholarships Foundation (Hill et al. 2000, p. 158). Bad schools are defined as "those for which the average test scores were below the median test scores for the city" (Hill et al. 2000, p. 158). A direct consequence is that the estimates from the good schools have much larger variation than those from the bad schools, which was evident in BFHR's Tables 4–6: the confidence intervals for good schools are wider.

Instead of classifying schools into two classes (bad and good), it might have been helpful to classify them into four classes: very bad, bad, good, and very good. To estimate the effect of applicant's school (assuming linear effects), an optimal design would assign half of the winners to the very bad schools, the other half to the very good schools, and none to the two middle classes. Avoiding the middle schools enlarges the distance between bad and good schools and also reduces the source variation in the bad and good schools, and thus reduces the variation in the estimates. To be more consistent with the goals of the School Choice Scholarships Foundation, an alternative design might take 75% of winners from very bad schools, 10% from bad schools, 5% from good schools, and 10% from very good schools. This design would have a higher design efficiency than the original design, especially because the applicant's school was the most important design variable after family size (Hill et al. 2000, p. 160). Of course, people on the ground at the time may have considered these options and rejected them. Our point is that with better planning, it might have been possible to squeeze more efficiency out of the study design. Such an outcome would have made the job undertaken by BFHR a bit easier.
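The efficiency claim can be checked with a back-of-the-envelope calculation: for a linear effect of school quality x, the variance of the estimated slope is proportional to 1/Σ(x − x̄)², so allocations concentrated at the extremes of x shrink the variance. The four quality levels and the mapping of the actual 85/15 split onto them below are our hypothetical stand-ins.

```python
import numpy as np

def relative_slope_variance(levels, shares, n=10000):
    """Variance of the fitted linear school-quality effect, up to a common
    sigma^2 factor, for a given allocation of n winners across levels."""
    counts = (np.asarray(shares) * n).astype(int)
    x = np.repeat(np.asarray(levels, dtype=float), counts)
    return 1.0 / np.sum((x - x.mean()) ** 2)

levels = [0, 1, 2, 3]  # very bad, bad, good, very good
for shares in ([0.50, 0.0, 0.0, 0.50],    # optimal two-point design
               [0.75, 0.10, 0.05, 0.10],  # alternative design in the text
               [0.85, 0.0, 0.15, 0.0]):   # crude stand-in for the actual 85/15 split
    print(shares, relative_slope_variance(levels, shares))
```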

5. CONCLUSIONS

Scientific problems often provide a fertile ground for important statistical developments. BFHR's article, and the literature on which it depends, are excellent illustrations. However, from the perspective of public policy, the goal of program evaluation is to obtain usefully precise estimates of program impacts while using the most robust means of analysis available. Simplicity is also important, both for data analysts who must do the work and for policy makers who must interpret the results. It follows that large investments at the front end of an evaluation (in research design, measurement, and study integrity) are essential. Virtuoso statistical analyses can certainly help when these investments are insufficient or when they fail, but they must not be seen as an inexpensive alternative to doing the study well to begin with.

ADDITIONAL REFERENCES

Angrist, J., Imbens, G. W., and Rubin, D. B. (1996), "Identification of Causal Effects Using Instrumental Variables" (with discussion), Journal of the American Statistical Association, 91, 444–472.

Berk, R. A. (1990), "What Your Mother Never Told You About Randomized Field Experiments," in Community-Based Care of People With AIDS: Developing a Research Agenda, AHCPR Conference Proceedings, Washington, DC: U.S. Department of Health and Human Services.

Berk, R. A., Ladd, H., Graziano, H., and Baek, J. (2003), "A Randomized Experiment Testing Inmate Classification Systems," Journal of Criminology and Public Policy, 2, 215–242.

Haveman, R. H. (ed.) (1977), A Decade of Federal Antipoverty Programs: Achievements, Failures, and Lessons, New York: Academic Press.

Hill, J. L., Rubin, D. B., and Thomas, N. (2000), "The Design of the New York School Choice Scholarships Program Evaluation," in Research Design: Donald Campbell's Legacy (Vol. II), ed. L. Bickman, Thousand Oaks, CA: Sage.

Johnson, S. B. (2000), "Compliance Behavior in Clinical Trials: Error or Opportunity?," in Promoting Adherence to Medical Treatment in Chronic Childhood Illness: Concepts, Methods, and Interventions, ed. D. Drotar, Mahwah, NJ: Lawrence Erlbaum.

Metry, J., and Meyer, U. A. (eds.) (1999), Drug Regimen Compliance: Issues in Clinical Trials and Patient Management, New York: Wiley.

Rejoinder

John BARNARD, Constantine E. FRANGAKIS, Jennifer L. HILL, and Donald B. RUBIN

We thank the editorial board and the thoughtful group of discussants that they arranged to comment on our article. The discussants have raised many salient issues and offered useful extensions of our work. We use this rejoinder primarily to address a few points of contention.

Muthén, Jo, and Brown

Muthén, Jo, and Brown (henceforth MJB) provide an enlightening discussion of an alternate, and potentially complementary, approach to treatment effect estimation. The growth models that they discuss are for data more extensive than what we analyzed, and so are beyond the scope of our article. Nevertheless, MJB propose an interesting idea for looking for interactions in longitudinal experimental data, and we look forward to future applications of the method to appropriate datasets.

Although models that incorporate both our compliance principal strata and latent growth trajectories are theoretically possible (assuming multiple years of outcome data), and could potentially yield an extremely rich description of the types of treatment effects manifested, the more time points that exist, the more complicated the compliance patterns can become (at least without further assumptions), and it may be difficult to find a dataset rich enough to support all of the interactions. Given a choice, we feel that it is probably more beneficial to handle real observed issues than to search for interactions with latent trajectories, but certainly it would always be nice to be able to do both. Of course, such a search for interactions in trajectories is more beneficial when there is enough prior knowledge about the scientific background to include additional structural assumptions that can support robust estimation.

MJB make an interesting point about the exploration of potential violations of exclusion for never-takers. Such analyses have been done in past work by some of the authors (Hirano et al. 2000; Frangakis, Rubin, and Zhou 2002), but were not pursued for our study because the robustness of such methods is still an issue in examples as complex as this one.

Regarding the trade-offs between the joint latent ignorability and compound exclusion assumptions versus the joint assumptions of standard ignorability and the standard exclusion restriction, this topic is up for debate. We would like to point out, however, that these standard structural assumptions come with their own level of arbitrariness. These issues have been explored by Jo (2002) and Mealli and Rubin (2002, 2003).

We thank MJB for providing a thought-provoking discussion of our model and assumptions, as well as ample fodder for additional directions that can be explored in this arena. Clearly, this general category of work has stimulated much new and exciting modeling.

Krueger and Zhu

Krueger and Zhu (henceforth KZ) focus primarily on the underlying study and on analyses that either we did not do or that used data we did not have. Clearly, they are interested in the substantive issues, which is admirable, but we cannot be responsible for analyses that we neither did nor were consulted on, and we are not prepared to comment on what analyses we might have undertaken with data that we did not have.

KZ state that "there is no substitute for probing the definition and concepts that underlie that data." Although we are not entirely sure what this refers to in particular, if it means that good science is more important than fancy techniques, then we certainly agree.

Another comment by KZ is that "it is desirable to use the most recent, comprehensive data for the widest sample possible." We agree that it is desirable to have access to the most comprehensive data available, but it is not necessary to use all of these data in any given analysis. Here we used only first-year data, because these were the only data available to us at the time that we were building the model. We used only single-child families, as explained in the article, because regrettably no compliance data were collected for individual children in the multichild families.

The exclusion of kindergarten children could be considered more of a judgment call regarding the quality of the pretest scores, not only for efficiency, but also for the reasonableness of our assumptions about noncompliance and missing data. Nothing in theory would prevent us from including these children and omitting these extra covariates.

These types of trade-offs (more data vs. more appropriate data) are common in statistical analyses, and the benefits are well known, for example, for avoiding model extrapolation or when some subsets of data do not mean the same thing as others. In fact, Krueger "ignored data" in his article on the minimum wage with David Card (Card and Krueger 1994) by limiting analyses to restaurants from a similar geographic region, in the hope that employees and employers across the two comparison states would be faced with similar economic conditions, rather than looking at all restaurants in those states. This is a form of implicit matching, which we believe represented a good choice. Surely there are times when focusing on subsets of data produces more reliable results.

Furthermore, we would like to point out that KZ's assertion that "because assignment to treatment status was random (within strata), a simple comparison of means between treatments and controls, without conditioning on baseline scores, provides an unbiased estimate of the average treatment effect" is simply false, because there are missing outcomes. Even if there is perfect compliance, a simple comparison of means across treatment groups when there are missing outcomes will not, in general, lead to an unbiased estimate of the treatment effect. Assumptions need to be made about the missing-data process.

When noncompliance exists in addition to the missing outcome data, Frangakis and Rubin (1999) have shown that estimation even of the ITT effect requires additional assumptions. Moreover, the ITT estimate, generally considered a conservative estimate of the treatment effect, can actually be anticonservative in this scenario. Finally, contrary to what KZ seem to imply in their final paragraph, instrumental variables methods alone are not sufficient to address the problems of both noncompliance and missing data.

KZ advocate "simplicity and transparency." We wholeheartedly agree, but disagree with the examples that they use to defend their stance. Their first argument is that the PMPD did not yield strong efficiency gains and that we did not account for the PMPD in the calculation of standard errors. It may well be that precision gains were relatively modest; however, the analyses that they present are by no means conclusive in this regard. First, as with KZ's other ITT analyses, we have no clue as to how missing outcome data are handled. In addition, the PMPD was also used to balance compliance and missingness. It is unclear, however, how the gains in precision due to matching the missingness and noncompliance are displayed, if at all, in KZ's results. Finally, KZ did not account for the experimental design when estimating their results. Failure to account properly for the design typically results in estimates that overstate the variances of treatment effect estimates. In this design, it is the correlation between matched-pair units that leads the sampling variances to be smaller than their estimates.

Even if precision gains were truly modest, they of course were unknown to us at the time we debated the use of the PMPD design. KZ's argument is like the complaint of a cross-country motorist who experiences no car problems during his long drive and then bemoans the money wasted on a spare tire purchased before the trip began. We prefer to avoid potential problems by using careful design.

Importantly, KZ's criticism that we ignored this matching in the analysis is incorrect. The matching was done based on the estimated propensity score, so it follows from Rubin (1978) that assignment is ignorable conditionally, given the covariates that were used to estimate it. Thus the Bayesian analysis that we conduct, which conditions on the estimated propensity score, is an approximately valid Bayesian analysis with respect to the matched assignment, assuming that our modeled relationship between the outcomes and the estimated propensity score captures the major features of the relationship to the covariates.

KZ's second argument is that use of the PMPD "led to grave consequences" because sampling weights were miscalculated by MPR. It seems odd to us to maintain, as KZ do, that the PMPD "caused" this error; nor is it clear that use of another design would have avoided the error. In any case, we have rerun our analyses, this time generalizing to the population of which the children in our analyses are representative (by setting constant weights in step 3 of Sec. 8.1.1). For this population, all results are very similar to our original results. These results are posted as Appendix C at http://biosun01.biostat.jhsph.edu/~cfrangak/papers/sc.

A third argument revolves around the protracted debate between use of ITT versus treatment-targeted measures such as the CACE. We addressed both ITT and CACE, which allows the reader to focus on one or the other or both, depending on which effect is expected to generalize to another environment: the ITT, or "sociological," effect of accepting or not accepting an offered voucher, or the CACE, or "structural" or "scientific," effect of using a voucher. One could argue that the effect of attending (CACE) is the more generalizable, whereas the sociological effect (ITT) is more easily influenced, for instance, by "hype" about how well the program works. That is, the ITT estimate will not reflect the "effect of offering" if, the next time the program is offered, the compliance rates change (which seems likely). We prefer to examine both, to provide a more comprehensive picture.

In sum, we agree that simplicity and transparency are important; this is the major motivation for our detailed description of our assumptions in terms of readily understandable concepts, rather than commonly invoked and less intuitive conditions involving correlated error terms. However, this desire for simplicity must be balanced with the prudence of guarding against potential problems through careful design.

Although we agree in theory with much of what KZ espouse, some of their criticisms seem misguided. Moreover, their presentation of analyses with incomplete exposition regarding missing-data complications makes those analyses difficult for us to evaluate properly.


Berk and Xu

Berk and Xu (henceforth BX) raise several interesting questions about the way in which the evaluation was designed and implemented. Many of these are actually addressed in a preceding and lengthier article by us on this topic (Barnard et al. 2002). That article describes, for instance, MPR's efforts to reduce missing data, without which the missingness rates would surely have been much higher. It also describes special challenges faced in this study. But in any study with human subjects, it is difficult to completely eliminate missing data, particularly if one desires more interesting information than can be obtained from administrative data. One of the challenges for the collection of outcome test scores was the fact that children needed to be tested outside of the public school setting, which allowed the use of a standardized test not used by the public school system, thereby greatly reducing the possibility of bias resulting from teachers "teaching to the test." Consequently, parents had to bring their children to testing areas on weekends: a formidable hurdle for some, even when monetary incentives were used.

An important consideration with this study was that the evaluators took the initiative to investigate a program that was sponsored by an organization (SCSF) whose primary goal was to provide scholarships, not to do research. We feel that MPR did an admirable job convincing the SCSF to allow the study to be evaluated and raising money for this purpose, but this also meant that the evaluators could not control every aspect of the program. For example, the SCSF originally wanted to make the program available only to children from the lower-test-score schools, and the evaluators had to convince them to also allow a small percentage from those in higher-test-score schools, to increase the generalizability of the results. The designs that BX suggest, even though they have certain desirable statistical properties under appropriate assumptions, were not an option with this study.

Related to this issue of the alignment of the evaluators' and program administrators' goals is the issue of the potential for reducing noncompliance rates. Some effort was made, in that help was provided to scholarship winners in finding an appropriate private school. Arguably, however, 100% compliance should not be the goal in such a study, given that we would not expect 100% compliance if the program were expanded and offered, say, by the government on a larger scale. Thus it may not be wise to use extreme efforts to increase compliance if such efforts would not be offered in the large-scale public case. With the current study, the opportunity exists to examine the noncompliance process for this program in this setting: take-up rates, predictors of noncompliance, and so on. To this end, some sort of subexperiment that randomizes a reasonable set of compliance incentives might be useful in similar studies in the future, if sample sizes are sufficient.

We are not convinced that we should have solicited actual prior distributions from educators and economists. In our experience, this is a confusing, difficult, and often unrewarding task. Moreover, school choice is such a politically charged and contentious issue that it seems wiser to try to use relatively diffuse and objective prior distributions.

We agree that results should be interpreted in terms of a yardstick of practical effects. We are unaware of the most appropriate such yardstick to use in this context. One side could claim that even tiny average effects matter, not only because of the cumulative effects on parts of society, but also because of possible nonadditivity of such effects, which could imply huge effects for some rare individuals and minute effects for all others. Others could argue that seemingly quite large effects have no long-run practical impact on quality of life. The results that we present provide a basic empirical starting point for these debates between people whose expertise is in this area.

We thank BX for highlighting our emphasis on the role of the assumptions in our analyses. Our hope is that we have presented the assumptions in such a way as to enable readers to make their own judgment about whether or not they are likely to be satisfied, rather than only stating them technically, which would implicitly make those important decisions for the interested but nontechnical readers.

BX point out a dearth of certain types of covariates. Part of the problem here is that we presented only a small portion of the data collected, because our model in its current form did not handle many covariates. However, there were indeed discussions between some members of our team and MPR staff, during the development of the questionnaire, about what types of variables could be included that would be predictive of missing data or compliance, as well as of outcomes. Some of the variables proposed were discarded because of the surveyors' experience with them as subject to problems such as high measurement error. Also, not every variable discussed was included, due to a desire for a survey that would not be overly burdensome for the study families.

We raise a final, small, semantic question. We wonder about the propriety of the label "Neyman–Rubin model" to describe a framework for causal inference that views observational studies and randomized experiments on the same continuum and, moreover, encourages Bayesian analyses of both. Neyman (1923) certainly seems to have been the first to use the formal notation of potential outcomes for randomization-based inference in randomized experiments, which was a truly major advance, but neither he nor anyone else, to the best of our knowledge, used this formal notation to define causal effects in observational studies until Rubin (1974, 1975), nor did he nor anyone else ever advocate Bayesian inference for the unobserved potential outcomes until Rubin (1975, 1978), which is the perspective that we have implemented here.

Overall, we agree with BX about the paramount importance of study design, and we tried to design this study carefully, certainly more so than most social science analyses of secondary datasets like the PSID or CPS. We thank BX for their thoughtful commentary; they raise some important points.

We thank all of the discussants for their stimulating comments, which we believe have added emphasis to the importance of both the substantive topic of school vouchers and the development of appropriate methods to evaluate this and other such broken randomized experiments.

ADDITIONAL REFERENCES

Card, D., and Krueger, A. (1994), "Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania," American Economic Review, 84, 772–793.


Jo, B. (2002), "Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications" (with discussion), Journal of Educational and Behavioral Statistics, 27, 385–420.

Mealli, F., and Rubin, D. B. (2002), Discussion of "Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications," by B. Jo, Journal of Educational and Behavioral Statistics, 27, 411–415.

——— (2003), "Assumptions Allowing the Estimation of Direct Causal Effects: Discussion of 'Healthy, Wealthy, and Wise? Tests for Direct Causal Paths Between Health and Socioeconomic Status,' by Adams et al.," Journal of Econometrics, 112, 79–87.

Rubin, D. B. (1975), "Bayesian Inference for Causality: The Importance of Randomization," in Proceedings of the Social Statistics Section, American Statistical Association, pp. 233–239.


Table 6. Effect of Private School Attendance on Test Scores

                        Applicant's school: Low                 Applicant's school: High
Grade at application    Reading            Math                 Reading            Math
1                       3.4 (−2.0, 8.7)    7.7 (3.0, 12.4)      1.9 (−7.3, 10.3)   7.4 (0.2, 14.6)
2                       0.7 (−3.7, 5.0)    1.9 (−2.4, 6.2)      −0.9 (−9.4, 7.3)   1.5 (−6.2, 9.3)
3                       1.0 (−4.1, 6.1)    5.0 (−0.8, 10.7)     −0.8 (−9.5, 7.7)   4.0 (−4.9, 12.5)
4                       4.2 (−1.5, 10.1)   4.3 (−1.6, 10.1)     2.7 (−6.3, 11.3)   3.5 (−4.7, 11.9)
Overall                 2.2 (−0.9, 5.3)    4.7 (1.4, 7.9)       0.6 (−7.1, 7.7)    4.2 (−2.6, 10.9)

NOTE: Plain numbers are means, and parentheses are central 95% intervals, of the posterior distribution of the effects on percentile rank.

The results of the new method (Table 4) are generally more stable than those of the original method (Table 5), which in some cases are not even credible. In the most extreme cases, the original method estimates a 22.7-point gain [95% confidence interval (−1.5, 46.8)] in mathematics and a 27.9-point gain [95% confidence interval (8.0, 47.8)] in reading for fourth-grade children from high-applicant schools. More generally, the new method's results display a more plausible pattern in comparing effects in high-applicant versus low-applicant schools and mathematics versus reading.

8.1.2 Effect of Private School Attendance on Mathematics and Reading. We also examine the effect of offering the scholarship when focusing only on the compliers (the CACE). The corresponding estimand for individual i is defined as

E{Y_i(1) − Y_i(0) | W_i^p, C_i = c; θ}.

This analysis defines the treatment as private school attendance (Sec. 6.3). The simulated posterior distribution of the CACE is summarized in Table 6. A draw from this distribution is obtained using steps 1–3 described in Section 8.1.1 for the ITT estimand, with the exception that at step 3, the averaging is restricted to the subjects whose current draw of C_i is "complier."
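A schematic version of this step (our sketch, with hypothetical sampler output, not the actual implementation): for each posterior draw, average the imputed individual effects over the subjects currently classified as compliers, then summarize across draws.

```python
import numpy as np

def cace_summary(y1_draws, y0_draws, stratum_draws):
    """y1_draws, y0_draws: imputed potential outcomes; stratum_draws:
    compliance-stratum labels. All have shape (n_draws, n_subjects)."""
    draws = []
    for y1, y0, c in zip(y1_draws, y0_draws, stratum_draws):
        compliers = (c == "complier")                 # step 3 restricted to compliers
        draws.append((y1[compliers] - y0[compliers]).mean())
    draws = np.asarray(draws)
    return draws.mean(), np.percentile(draws, [2.5, 97.5])  # mean, central 95% interval
```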

The effects of private school attendance follow a pattern similar to that of the ITT effects, but the posterior means are slightly bigger in absolute value than for the ITT. The intervals have also grown, reflecting that these effects are for only subgroups of all children, the "compliers" in each cell. As a result, the associated uncertainty for some of these effects (e.g., for fourth-graders applying from high-applicant schools) is large.

8.2 Compliance Principal Strata and Missing Outcomes

Table 7 summarizes the posterior distribution of the estimands of the probability of being in stratum C as a function of an applicant's school and grade, p(C_i = t | W_i^p; θ). To draw from this distribution, we use step 1 described in Section 8.1.1 and then calculate p(C_i = t | W_i^p, X_i^obs, R_i^x; θ) for each subject, based on the model of Section 7.1, and average these values as in step 3 of Section 8.1.1. The clearest pattern revealed by Table 7 is that for three out of four grades, children who applied from low-applicant schools are more likely to be compliers or always takers than are children who applied from high-applicant schools.

As stated earlier, theoretically, under our structural assumptions, standard ITT analyses or standard IV analyses that use ad hoc approaches to missing data are generally not appropriate for estimating the ITT or CACE estimands when the compliance principal strata have differential response (i.e., outcome missing-data) behaviors. To evaluate this here, we simulated the posterior distributions of

p{R_i(z) | C_i = t, W_i^p; θ}, z = 0, 1.  (1)

To draw from the distributions of the estimands in (1), we used step 1 from Section 8.1.1 and then, for z = 0, 1, for each subject calculated p{R_i(z) | W_i^p, X_i^obs, R_i^x, C_i; θ}. We then averaged these values over subjects within subclasses defined by the different combinations of W_i^p and C_i.

For each draw (across individuals) from the distribution in (1), we calculated the odds ratios (a) between compliers attending public schools and never takers, (b) between compliers attending private schools and always takers, and (c) between compliers attending private schools and compliers attending public schools. The results (omitted) showed that response was increasing in the following order: never takers, compliers attending public schools, compliers attending private schools, and always takers. These results confirm that the compliance principal stratum is an important factor in response, both alone and in interaction with assignment.
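For concreteness, a tiny sketch of the odds-ratio comparison, computed within each posterior draw from the response probabilities of two groups (the inputs are hypothetical arrays, one entry per draw):

```python
import numpy as np

def response_odds_ratio(p_a, p_b):
    """Odds ratio of response between two groups, e.g., compliers attending
    private schools (p_a) vs. always takers (p_b), per posterior draw."""
    p_a, p_b = np.asarray(p_a), np.asarray(p_b)
    return (p_a / (1 - p_a)) / (p_b / (1 - p_b))
```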

9. MODEL BUILDING AND CHECKING

This model was built through a process of fitting and checking a succession of models. Of note in this model are the following features: a censored normal model that accommodates the pile-up of 0s in the outcomes; incorporation of heteroscedasticity; and a multivariate model to accommodate both outcomes simultaneously.

Table 7. Proportions of Compliance Principal Strata Across Grade and Applicant's School

                      Applicant's school: Low                        Applicant's school: High
Grade at
application     Never taker    Complier       Always taker     Never taker    Complier       Always taker
1               .245 (.029)    .671 (.038)    .084 (.024)      .250 (.050)    .693 (.061)    .057 (.033)
2               .205 (.027)    .694 (.037)    .101 (.025)      .253 (.051)    .672 (.062)    .075 (.034)
3               .245 (.032)    .659 (.040)    .096 (.025)      .288 (.056)    .641 (.067)    .071 (.035)
4               .184 (.033)    .728 (.046)    .088 (.030)      .270 (.055)    .667 (.067)    .063 (.036)

NOTE: Plain numbers are means, and parentheses are standard deviations, of the posterior distribution of the estimands.


It is reassuring, from the perspective of model robustness, that the results from the last few models fit, including those in the conference proceedings report of Barnard et al. (2002), are consistent with the results from this final model.

9.1 Convergence Checks

Because posterior distributions were simulated from a Markov chain Monte Carlo algorithm (App. A), it is important to assess its convergence. To do this, we ran three chains from different starting values. To initiate each chain, we set any unknown compliance stratum equal to a Bernoulli draw, with probabilities obtained from moment estimates of the probabilities of being a complier given observed attendance and randomization data (D, Z) alone. Using the initialized compliance strata for each chain, parameters were initialized to values based on generalized linear model estimates of the model components.

Each chain was run for 15,000 iterations. At 5,000 iterations, and based on the three chains for each model, we calculated the potential scale-reduction statistic (Gelman and Rubin 1992) for the 250 estimands (parameters and functions of parameters) that serve as building blocks for all other estimands. The results suggested good mixing of the chains (with the maximum potential scale-reduction statistic across parameters 1.04) and provided no evidence against convergence. Inference is based on the remaining 30,000 iterations, combining the three chains.
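For reference, here is a minimal sketch of the potential scale-reduction statistic for a single scalar estimand, using the usual between/within-chain variance decomposition (the standard formula, not the authors' code):

```python
import numpy as np

def potential_scale_reduction(chains):
    """chains: (m, n) array of m parallel chains after discarding burn-in."""
    m, n = chains.shape
    between = n * chains.mean(axis=1).var(ddof=1)   # between-chain variance B
    within = chains.var(axis=1, ddof=1).mean()      # within-chain variance W
    var_plus = (n - 1) / n * within + between / n   # pooled variance estimate
    return np.sqrt(var_plus / within)               # values near 1 suggest convergence
```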

9.2 Model Checks

We evaluate the influence of the model presented in Section 7 with six posterior predictive checks: three checks for each of the two outcomes. A posterior predictive check generally involves (a) choosing a discrepancy measure that is a function of observed data, and possibly of missing data and the parameter vector θ, and (b) computing a posterior predictive p value (PPPV), which is the probability, over the posterior predictive distribution of the missing data and θ, that the discrepancy measure in a new study drawn with the same θ as in our study would be as or more extreme than in our study (Rubin 1984; Meng 1996; Gelman, Meng, and Stern 1996).

Posterior predictive checks in general, and PPPVs in particular, demonstrate whether the model can adequately preserve features of the data reflected in the discrepancy measure, where the model here includes the prior distribution as well as the likelihood (Meng 1996). As a result, properties of PPPVs are not exactly the same as properties of classical p values under frequency evaluations conditional on the unknown θ, just as they are not exactly the same for frequency evaluations over both levels of uncertainty, that is, the drawing of θ from the prior distribution and the drawing of data from the likelihood given θ. For example, over frequency evaluations of the latter type, a PPPV is stochastically less variable than, but has the same mean as, the uniform distribution, and so tends to be more conservative than a classical p value, although the reverse can be true over frequency evaluations of the first type (Meng 1996). (For more details on the interpretation and properties of PPPVs, see also Rubin 1984 and Gelman et al. 1996.)

Table 8. Posterior Predictive Checks: p Values

        Signal    Noise    Signal to noise
Math    .34       .79      .32
Read    .40       .88      .39

The posterior predictive discrepancy measures that we chose here are functions of

A_p^rep(z) = {Y_ip^rep : I(R_y,ip^rep = 1) I(Y_ip^rep ≠ 0) I(C_i^rep = c) I(Z_i = z) = 1}

for the measures that are functions of data Y_ip^rep, R_y,ip^rep, and C_i^rep from a replicated study, and

A_p^study(z) = {Y_ip : I(R_y,ip = 1) I(Y_ip ≠ 0) I(C_i = c) I(Z_i = z) = 1}

for the measures that are functions of this study's data. Here R_y,ip^rep is defined so that it equals 1 if Y_ip^rep is observed and 0 otherwise, and p equals 1 for math outcomes and 2 for reading outcomes. The discrepancy measures, "rep" and "study," that we used for each outcome (p = 1, 2) were (a) the absolute value of the difference between the mean of A_p(1) and the mean of A_p(0) ("signal"); (b) the standard error, based on a simple two-sample comparison, for this difference ("noise"); and (c) the ratio of (a) to (b) ("signal to noise"). Although these measures are not treatment effects, we chose them here to assess whether the model can preserve broad features of signal, noise, and signal-to-noise ratio in the continuous part of the compliers' outcome distributions, which we think can be very influential in estimating the treatment effects of Section 8. More preferable measures might have been the posterior mean and standard deviation for the actual estimands in Section 8 for each replicated dataset, but this would have required a prohibitive amount of computer memory because of the nested structure of that algorithm. In settings such as these, additional future work on choices of discrepancy measures is of interest.

PPPVs for the discrepancy measures that we chose were calculated as the percentage of draws in which the replicated discrepancy measure exceeded the value of the study's discrepancy measure. Extreme values (close to 0 or 1) of a PPPV would indicate a failure of the prior distribution and likelihood to replicate the corresponding measure of location, dispersion, or their relative magnitude, and would indicate an undesirable influence of the model in estimation of our estimands. Results from these checks, displayed in Table 8, provide no special evidence for such influences of the model.
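The PPPV computation itself is essentially one line; here is a sketch with hypothetical arrays of the replicated and realized discrepancy measures, one entry per posterior draw (the study-side measure can also vary across draws when it depends on imputed missing data):

```python
import numpy as np

def pppv(rep_measure, study_measure):
    """Share of draws in which the replicated discrepancy is at least as
    extreme as the study's discrepancy."""
    return float(np.mean(np.asarray(rep_measure) >= np.asarray(study_measure)))
```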

10. DISCUSSION

In this article we have defined the framework of principal stratification for broken randomized experiments, to accommodate noncompliance, missing covariate information, missing outcome data, and multivariate outcomes. We make explicit a set of structural assumptions that can identify the causal effects of interest, and we also provide a parametric model that is appropriate for practical implementation of the framework in settings such as ours.

Results from our model in the school choice study do not indicate strong treatment effects for most of the subgroups examined. But we do estimate positive effects (on the order of 3 percentile points for ITT and 5 percentile points for the effect of attendance) on math scores overall for children who applied to the program from low-applicant schools, particularly for first-graders. Also, the effects were larger for African-American children than for the remaining children (App. B).

Posterior distributions for the CACE estimand, which measures the effect of attendance in private school versus public school, are generally larger than the corresponding ITT effects, but are also associated with greater uncertainty. Of importance, because of the missing outcomes, a model like ours is needed for valid estimation even of the ITT effect.

The results from this randomized study are not subject to selection bias in the way that nonrandomized studies in school choice have been. Nevertheless, although we use the CACE, a well-defined causal effect, to represent the effect of attendance of private versus public schools, it is important to remember that the CACE is defined on a subset of the study children (those who would have complied with either assignment) and that for the other children, there is no information on such an effect of attendance in this study. Therefore, as with any randomized trial based on a subpopulation, external information, such as background variables, also must be used when generalizing the CACE from compliers to other target groups of children. Also, it is possible that a broader implementation of a voucher program can have a collective effect on the public schools, if a large shift of the children who might use vouchers were to have an impact on the quality of learning for children who would stay in public schools (a violation of the no-interference assumption). Because our study contains a small fraction of participants relative to all school children, it cannot provide direct information about any such collective effect, and additional external judgment would need to be used to address this issue.

The larger effects that we estimated for children applying from schools with low versus high past scores are also not, in principle, subject to the usual regression-to-the-mean bias, in contrast to a simple before-after comparison. This is because in our study, both types of children are randomized to be offered the scholarship or not, and both types in both treatment arms are evaluated at the same time after randomization. Instead, the differential effect for children from schools with different past scores is evidence supporting the claim that the school voucher program has greater potential benefit for children in lower-scoring schools.

Our results also reveal differences in compliance and missing-data patterns across groups. Differences in compliance indicate that children applying from low-applicant schools are generally more likely to comply with their treatment assignment; this could provide incentives for policy makers to target this subgroup of children. However, this group also exhibits higher levels of always takers, indicating that parents of children attending these poorly performing schools are more likely to get their child into a private school even in the absence of scholarship availability. These considerations would, of course, have to be balanced with (and indeed might be dwarfed by) concerns about equity. Missing-data patterns reveal that perhaps relatively greater effort is needed in future interventions of this nature to retain study participants who stay in public schools, particularly the public-school students who actually won but did not use a scholarship.

The approach presented here also has some limitations. First, we have presented principal stratification only on a binary controlled factor Z and a binary uncontrolled factor D, and it is of interest to develop principal stratification for more levels of such factors. Approaches that extend our framework in that direction, including for time-dependent data, have been proposed (e.g., Frangakis et al. 2002b).

Also, the approach of explicitly conditioning on the patterns of missing covariates that we adopted in the parametric component of Section 7 is not as applicable when there are many patterns of missing covariates. In such cases, it would be more appropriate to use models related to those of D'Agostino and Rubin (2000) and integrate them with principal stratification for noncompliance. Moreover, it would also be interesting to investigate results for models fit to these data that allow deviations from the structural assumptions, such as weakened exclusion restrictions (e.g., Hirano et al. 2000; Frangakis, Rubin, and Zhou 2002a), although to ensure robustness of such models, it would be important to first investigate and rely on additional alternative assumptions that would be plausible.

APPENDIX A: COMPUTATIONS

Details are available at http://biosun01.biostat.jhsph.edu/~cfrangak/papers/sc.

Table B.1. ITT Estimand

Grade at                     Applicant's school: Low                Applicant's school: High
application   Ethnicity      Reading           Math                 Reading            Math
1             AA             2.8 (−1.5, 7.2)   6.7 (3.0, 10.4)      1.8 (−5.2, 8.4)    6.3 (0.8, 11.9)
              Other          1.7 (−1.9, 5.4)   3.9 (0.5, 7.2)       0.8 (−4.5, 6.1)    3.5 (−1.2, 8.2)
2             AA             0.8 (−2.8, 4.4)   2.4 (−1.0, 5.8)      −0.3 (−6.6, 6.0)   2.2 (−3.8, 8.0)
              Other          0.2 (−3.1, 3.4)   0.5 (−2.8, 3.7)      −0.9 (−6.1, 4.3)   0.0 (−5.2, 4.8)
3             AA             1.1 (−3.0, 5.2)   4.6 (0.3, 8.9)       −0.1 (−6.7, 6.4)   4.2 (−2.5, 10.6)
              Other          0.3 (−3.2, 3.8)   2.0 (−2.0, 5.8)      −0.7 (−5.8, 4.4)   1.4 (−3.9, 6.5)
4             AA             3.7 (−1.2, 8.7)   4.4 (−0.5, 9.3)      2.4 (−4.7, 9.4)    3.8 (−2.9, 10.5)
              Other          2.4 (−1.7, 6.7)   1.8 (−2.6, 6.0)      1.4 (−4.3, 7.0)    1.1 (−4.4, 6.4)
Overall       AA             2.0 (−0.9, 4.8)   4.5 (1.8, 7.2)       0.8 (−5.1, 6.5)    4.2 (−1.1, 9.3)
              Other          1.1 (−1.4, 3.6)   2.1 (−0.7, 4.7)      0.1 (−4.6, 4.6)    1.6 (−2.9, 5.8)

NOTE: Plain numbers are means, and parentheses are central 95% intervals, of the posterior distribution of the effects on percentile rank.


Table B.2. CACE Estimand

Grade at                     Applicant's school: Low                 Applicant's school: High
application   Ethnicity      Reading            Math                 Reading             Math
1             AA             3.8 (−2.0, 9.6)    9.0 (4.1, 14.0)      2.3 (−7.3, 10.9)    8.3 (1.1, 15.5)
              Other          2.9 (−3.1, 8.8)    6.3 (0.8, 11.8)      1.3 (−8.0, 10.2)    5.9 (−2.2, 13.9)
2             AA             1.1 (−3.6, 5.7)    3.1 (−1.3, 7.5)      −0.4 (−9.0, 7.9)    2.9 (−5.0, 10.7)
              Other          0.3 (−4.9, 5.4)    0.8 (−4.4, 5.9)      −1.5 (−10.5, 7.3)   0.0 (−8.5, 8.5)
3             AA             1.5 (−4.1, 7.0)    6.3 (0.3, 12.2)      −0.2 (−9.1, 8.5)    5.6 (−3.2, 14.1)
              Other          0.5 (−5.4, 6.4)    3.5 (−3.4, 10.2)     −1.3 (−10.5, 7.9)   2.7 (−7.0, 12.0)
4             AA             4.6 (−1.6, 10.8)   5.5 (−0.7, 11.6)     3.0 (−6.0, 11.6)    4.8 (−3.6, 13.1)
              Other          3.7 (−2.6, 10.1)   2.8 (−3.8, 9.3)      2.3 (−7.4, 11.7)    2.0 (−7.0, 11.0)
Overall       AA             2.6 (−1.2, 6.3)    6.0 (2.4, 9.5)       1.0 (−6.9, 8.4)     5.5 (−1.5, 12.2)
              Other          1.7 (−2.3, 5.8)    3.3 (−1.2, 7.7)      0.1 (−8.1, 7.9)     2.7 (−5.0, 10.3)

NOTE: Plain numbers are means, and parentheses are central 95% intervals, of the posterior distribution of the effects on percentile rank.

APPENDIX B: ETHNIC BREAKDOWN

The model of Section 7 can be used to estimate the effect of the program on finer strata that may be of interest. For example, to estimate effects stratified by ethnicity (AA for African-American), we obtain the posterior distribution of the causal effect of interest (ITT or CACE) using the same steps described in Section 8.1.1 or 8.1.2, but allowing ethnicity to be part of the definition of W^p. The results for this stratification are reported in Table B.1 for the estimands of ITT and Table B.2 for the estimands of CACE.

The results follow patterns similar to those in Section 8. For each grade and applicant's school (low/high) combination, however, the effects are more positive on average for the subgroup of African-American children. For both estimands, this leads to 95% intervals that are entirely above 0 for math scores for the following subgroups: African-American first- and third-graders, and overall, from low-applicant schools; African-American first-graders from high-applicant schools; and non-African-American first-graders from low-applicant schools. All intervals for reading covered the null value. This suggests that the positive effects reported on math scores in Section 8 for children originating from low-applicant schools are primarily attributable to gains among the African-American children in this subgroup.

[Received October 2002. Revised October 2002.]

REFERENCES

Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996), "Identification of Causal Effects Using Instrumental Variables" (with discussion), Journal of the American Statistical Association, 91, 444–472.

Barnard, J., Du, J., Hill, J. L., and Rubin, D. B. (1998), "A Broader Template for Analyzing Broken Randomized Experiments," Sociological Methods and Research, 27, 285–317.

Barnard, J., Frangakis, C., Hill, J. L., and Rubin, D. B. (2002), "School Choice in NY City: A Bayesian Analysis of an Imperfect Randomized Experiment," in Case Studies in Bayesian Statistics (Vol. V), eds. C. Gatsonis, R. E. Kass, B. Carlin, A. Carriquiry, A. Gelman, I. Verdinelli, and M. West, New York: Springer-Verlag, pp. 3–97.

Brandl, J. E. (1998), Money and Good Intentions Are Not Enough, or Why Liberal Democrats Think States Need Both Competition and Community, Washington, DC: Brookings Institute Press.

Carnegie Foundation for the Advancement of Teaching (1992), School Choice: A Special Report, San Francisco: Jossey-Bass.

Chubb, J. E., and Moe, T. M. (1990), Politics, Markets, and America's Schools, Washington, DC: Brookings Institute Press.

Cobb, C. W. (1992), Responsive Schools, Renewed Communities, San Francisco: Institute for Contemporary Studies.

Coleman, J. S., Hoffer, T., and Kilgore, S. (1982), High School Achievement, New York: Basic Books.

Cookson, P. W. (1994), School Choice: The Struggle for the Soul of American Education, New Haven, CT: Yale University Press.

Coulson, A. J. (1999), Market Education: The Unknown History, Bowling Green, OH: Social Philosophy & Policy Center.

Cox, D. R. (1958), Planning of Experiments, New York: Wiley.

D'Agostino, Ralph B., Jr., and Rubin, D. B. (2000), "Estimating and Using Propensity Scores With Incomplete Data," Journal of the American Statistical Association, 95, 749–759.

Education Week (1998), Quality Counts '98: The Urban Challenge, Public Education in the 50 States, Bethesda, MD: Editorial Projects in Education.

Frangakis, C. E., and Rubin, D. B. (1999), "Addressing Complications of Intention-to-Treat Analysis in the Combined Presence of All-or-None Treatment-Noncompliance and Subsequent Missing Outcomes," Biometrika, 86, 365–380.

——— (2001), "Addressing the Idiosyncrasy in Estimating Survival Curves Using Double-Sampling in the Presence of Self-Selected Right Censoring" (with discussion), Biometrics, 57, 333–353.

——— (2002), "Principal Stratification in Causal Inference," Biometrics, 58, 20–29.

Frangakis, C. E., Rubin, D. B., and Zhou, X. H. (2002a), "Clustered Encouragement Design With Individual Noncompliance: Bayesian Inference and Application to Advance Directive Forms" (with discussion), Biostatistics, 3, 147–164.

Frangakis, C. E., Brookmeyer, R. S., Varadhan, R., Mahboobeh, S., Vlahov, D., and Strathdee, S. A. (2002b), "Methodology for Evaluating a Partially Controlled Longitudinal Treatment Using Principal Stratification, With Application to a Needle Exchange Program," Technical Report NEP-06-02, Johns Hopkins University, Dept. of Biostatistics.

Fuller, B., and Elmore, R. F. (1996), Who Chooses? Who Loses? Culture, Institutions, and the Unequal Effects of School Choice, New York: Teachers College Press.

Gelfand, A. E., and Smith, A. F. M. (1990), "Sampling-Based Approaches to Calculating Marginal Densities," Journal of the American Statistical Association, 85, 398–409.

Gelman, A., Meng, X.-L., and Stern, H. (1996), "Posterior Predictive Assessment of Model Fitness via Realized Discrepancies" (with discussion), Statistica Sinica, 6, 733–760.

Gelman, A., and Rubin, D. B. (1992), "Inference From Iterative Simulation Using Multiple Sequences" (with discussion), Statistical Science, 7, 457–472.

Glynn, R. J., Laird, N. M., and Rubin, D. B. (1993), "Multiple Imputation in Mixture Models for Nonignorable Nonresponse With Follow-Ups," Journal of the American Statistical Association, 88, 984–993.

Goldberger, A. S., and Cain, G. G. (1982), "The Causal Analysis of Cognitive Outcomes in the Coleman, Hoffer, and Kilgore Report," Sociology of Education, 55, 103–122.

Haavelmo, T. (1943), "The Statistical Implications of a System of Simultaneous Equations," Econometrica, 11, 1–12.

Hastings, W. (1970), "Monte Carlo Sampling Methods Using Markov Chains and Their Applications," Biometrika, 57, 97–109.

Hill, J. L., Rubin, D. B., and Thomas, N. (2000), "The Design of the New York School Choice Scholarship Program Evaluation," in Donald Campbell's Legacy, ed. L. Bickman, Newbury Park, CA: Sage Publications.

Hirano, K., Imbens, G. W., Rubin, D. B., and Zhou, A. (2000), "Assessing the Effect of an Influenza Vaccine in an Encouragement Design," Biostatistics, 1, 69–88.

Holland, P. (1986), "Statistics and Causal Inference," Journal of the American Statistical Association, 81, 945–970.

Imbens, G. W., and Angrist, J. D. (1994), "Identification and Estimation of Local Average Treatment Effects," Econometrica, 62, 467–476.

Mutheacuten Jo and Brown Comment 311

Imbens G W and Rubin D B (1997) ldquoBayesian Inference for Causal Effectsin Randomized Experiments With Noncompliancerdquo The Annals of Statistics25 305ndash327

Levin H M (1998) ldquoEducational Vouchers Effectiveness Choice and CostsrdquoJournal of Policy Analysis and Management 17 373ndash392

Little R J A (1993) ldquoPattern-Mixture Models for Multivariate IncompleteDatardquo Journal of the American Statistical Association 88 125ndash134

(1996) ldquoPattern-Mixture Models for Multivariate Incomplete DataWith Covariatesrdquo Biometrics 52 98ndash111

Little R J A and Rubin D B (1987) Statistical Analysis With Missing DataNew York Wiley

Meng X L (1996) ldquoPosterior Predictive p Valuesrdquo The Annals of Statistics22 1142ndash1160

Neal D (1997) ldquoThe Effects of Catholic Secondary Schooling on EducationalAchievementrdquo Journal of Labor Economics 15 98ndash123

Neyman J (1923) ldquoOn the Application of Probablity Theory to AgriculturalExperiments Essay on Principles Section 9rdquo translated in Statistical Science5 465ndash480

Peterson P E and Hassel B C (Eds) (1998) Learning from School ChoiceWashington DC Brookings Institute Press

Peterson P E Myers D E Howell W G and Mayer D P (1999) ldquoTheEffects of School Choice in New York Cityrdquo in Earning and Learning HowSchools Matter eds S E Mayer and P E Peterson Washington DC Brook-ings Institute Press

Robins J M Greenland S and Hu F-C (1999) ldquoEstimation of the CausalEffect of a Time-Varying Exposure on the Marginal Mean of a Repeated Bi-nary Outcomerdquo (with discussion) Journal of the American Statistical Associ-ation 94 687ndash712

Rosenbaum P R and Rubin D B (1983) ldquoThe Central Role of the PropensityScore in Observational Studies for Causal Effectsrdquo Biometrika 70 41ndash55

Rubin D B (1974) ldquoEstimating Causal Effects of Treatments in Randomizedand Non-Randomized Studiesrdquo Journal of Educational Psychology 66 688ndash701

(1977) ldquoAssignment to Treatment Groups on the Basis of a CovariaterdquoJournal of Educational Statistics 2 1ndash26

(1978a) ldquoBayesian Inference for Causal Effects The Role of Random-izationrdquo The Annals of Statistics 6 34ndash58

(1978b) ldquoMultiple Imputations in Sample Surveys A Phenomenologi-cal Bayesian Approach to Nonresponse (CR P29-34)rdquo in Proceedings of theSurvey Research Methods Section American Statistical Association pp 20ndash28

(1979) ldquoUsing Multivariate Matched Sampling and Regression Ad-justment to Control Bias in Observational Studiesrdquo Journal of the AmericanStatistical Association 74 318ndash328

(1980) Comments on ldquoRandomization Analysis of ExperimentalData The Fisher Randomization Testrdquo Journal of the American StatisticalAssociation 75 591ndash593

(1984) ldquoBayesianly Justi able and Relevant Frequency Calculationsfor the Applied Statisticianrdquo The Annals of Statistics 12 1151ndash1172

(1990) ldquoComment Neyman (1923) and Causal Inference in Experi-ments and Observational Studiesrdquo Statistical Science 5 472ndash480

The Coronary Drug Project Research Group (1980) ldquoIn uence of Adherenceto Treatment and Response of Cholesterol on Mortality in the Coronary DrugProjectrdquo New England Journal of Medicine 303 1038ndash1041

Tinbergen J (1930) ldquoDetermination and Interpretation of Supply Curves AnExamplerdquo in The Foundations of Econometric Analysis eds D Hendry andM Morgan Cambridge UK Cambridge University Press

Wilms D J (1985) ldquoCatholic School Effect on Academic Achievement NewEvidence From the High School and Beyond Follow-Up Studyrdquo Sociology ofEducation 58 98ndash114

Comment
Bengt MUTHÉN, Booil JO, and C. Hendricks BROWN

1. INTRODUCTION

The article by Barnard, Frangakis, Hill, and Rubin (BFHR) is timely in that the Department of Education is calling for more randomized studies in educational program evaluation (see the discussion of the "No Child Left Behind" initiative in, e.g., Slavin 2002). BFHR can serve as a valuable pedagogical example of a successful, sophisticated statistical analysis of a randomized study. Our commentary is intended to provide additional pedagogical value to benefit the planning and analysis of future studies, drawing on experiences and research within our research group. [The Prevention Science Methodology Group (PSMG; www.psmg.hsc.usf.edu), co-PIs Brown and Muthén, has collaborated over the last 15 years with support from the National Institute of Mental Health and the National Institute on Drug Abuse.]

BFHR provides an exemplary analysis of the data from an imperfect randomized trial that suffers from several complications simultaneously: noncompliance, missing data in outcomes, and missing data in covariates. We are very pleased to see their application of cutting-edge Bayesian methods for dealing with these complexities. In addition, we believe the methodological issues and the results of the study have important implications for the design and analysis of randomized trials in education and for related policy decisions.

Bengt Muthén is Professor and Booil Jo is postdoc, Graduate School of Education and Information Studies, University of California, Los Angeles, Los Angeles, CA 90095 (E-mail: bmuthen@ucla.edu). C. Hendricks Brown is Professor, Department of Epidemiology and Biostatistics, University of South Florida, Tampa, FL 33620. The research of the first author was supported by National Institute on Alcohol Abuse and Alcoholism grant K02 AA 00230. The research of the second and third authors was supported by National Institute on Drug Abuse and National Institute of Mental Health grant MH40859. The authors thank Chen-Pin Wang for research assistance, Joyce Chappell for graphical assistance, and the members of the Prevention Science Methodology Group and the Fall ED299A class for helpful comments.


BFHR provides results of the New York City school choice experiment based on 1-year achievement outcomes. With the planned addition of yearly follow-up data, growth models can provide an enhanced examination of causal impact. We discuss how such growth modeling can be incorporated, and provide a caution that applies to BFHR's use of only one posttest occasion. We also consider the sensitivity of the latent class ignorability assumption in combination with the assumption of compound exclusion.

2. LONGITUDINAL MODELING ISSUES

BFHR focuses on variation in treatment effect across compliance classes. This part of the commentary considers variation in treatment effect across a different type of class, based on the notion that the private school treatment effect might very well be quite different for children with different achievement development. (Also of interest is potential variation in treatment effects across schools, with respect to both the public school the child originated in and the private school the child was moved to, but this multilevel aspect of the data is left aside here for lack of space.) To study such a "treatment–baseline interaction" (or "treatment–trajectory interaction"), we will switch from BFHR's pretest–posttest analysis framework (essentially a very advanced ANCOVA-type analysis) to the growth mixture modeling framework of Muthén et al. (2002). An underlying rationale for this modeling is that individuals at different initial status levels, and on different trajectories, may benefit differently from a given treatment. ANCOVA controls for initial status as measured by the observed pretest score. Unlike the observed pretest score, the latent variable of initial status is free of time-specific variation and measurement error.

The focus on longitudinal aspects of the New York School Choice Study (NYSCS) is both substantively and statistically motivated. First, treatment effects may not have gained full strength after only a 1-year stay in a private school. The NYSCS currently has data from three follow-ups, that is, providing repeated-measures data from four grades. Although BFHR used percentile scores that do not lend themselves to growth modeling, a conversion to "scale scores" (i.e., IRT-based equated scores) should be possible, enabling growth modeling. Unfortunately, educational research traditionally uses scales that are unsuitable for growth modeling, such as percentile scores, normal curve equivalents, and grade equivalents (for a comparison in a growth context, see Seltzer, Frank, and Bryk 1994). Hopefully this tradition can be changed. Second, the use of information from more time points than pretest and posttest makes it possible to identify and estimate models that give a richer description of the normative development in the control group and of how the treatment changes this development.

Consider three types of latent variables for individual i. The first type, C_i, refers to BFHR's compliance principal strata. The next two relate to the achievement development as expressed by a growth mixture model: T_i refers to trajectory class, and η_i refers to random effects within trajectory class (the within-class model is a regular mixed-effects model). Unlike the latent class variable C_i, the latent class variable T_i is a fully unobserved variable, as is common in latent variable modeling (see, e.g., Muthén 2002a). Consider the likelihood expression for individual i, using the [·] notation to denote probabilities/densities:

[C_i, T_i | X_i] [η_i | C_i, T_i, X_i] [Y_i | η_i, C_i, T_i, X_i] × [U_i | η_i, C_i, T_i, X_i] [R_i | Y_i, η_i, C_i, T_i, X_i],    (1)

where X_i denotes covariates, U_i denotes a compliance stratum indicator (with C_i perfectly measured by U_i in the treatment group for never-takers and perfectly measured in the control group for always-takers, other group–class combinations having missing data), and R_i denotes indicators for missing data on the repeated-measures outcomes Y_i (pretreatment and posttreatment achievement scores). This type of model can be fitted in the latent variable modeling framework of the Mplus program (Muthén and Muthén 1998–2002, technical appendix 8), which has implemented an EM-based maximum likelihood estimator. (For related references, see the Mplus website, www.statmodel.com.) As a special case of (1), conventional random-effects growth modeling includes η_i but excludes C_i and T_i and assumes missingness at random, so that the last term in (1) is ignored. Growth mixture modeling (Muthén and Shedden 1999; Muthén et al. 2002; Muthén and Muthén 1998–2002) includes η_i and T_i.

BFHR includes C_i but not T_i (or η_i), and includes the last term in (1), drawing on the latent ignorability of Frangakis and Rubin (1999). Muthén and Brown (2001) studied latent ignorability related to T_i in the last term of (1). In randomized studies, it would be of interest to study the C_i and T_i classes jointly, because individuals in different trajectory classes may show different compliance, and missingness may be determined by these classes jointly.
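To make the factorization in (1) concrete, here is a minimal numerical sketch of one individual's likelihood contribution, summing over the two latent class variables with the random effect η integrated out as a random intercept. Everything in it (the parameter dictionary, the Bernoulli models for U and R, the suppression of covariate dependence) is our own illustrative assumption, not the Mplus implementation or BFHR's model.

```python
import numpy as np
from scipy.stats import multivariate_normal

def individual_likelihood(y, u, r, params):
    """Likelihood of one individual under factorization (1): sum over the
    latent compliance stratum c and trajectory class t of
    [c,t] * [y_obs | c,t] * [u | c] * [r | c,t].
    The random intercept (variance tau2) is integrated out analytically,
    giving a compound-symmetric covariance for the repeated outcomes y.
    Letting [r | c,t] be free of the missing y values is the latent
    ignorability assumption."""
    obs = r.astype(bool)                        # r[k] = 1 if y[k] observed
    total = 0.0
    for c in range(params["pi"].shape[0]):      # compliance strata
        for t in range(params["pi"].shape[1]):  # trajectory classes
            p_ct = params["pi"][c, t]           # joint class probability
            mu = params["mu"][c, t]             # class-specific growth means
            cov = params["sigma2"] * np.eye(len(y)) + params["tau2"]
            f_y = (multivariate_normal.pdf(y[obs], mu[obs], cov[np.ix_(obs, obs)])
                   if obs.any() else 1.0)
            f_u = params["p_u"][c] if u == 1 else 1.0 - params["p_u"][c]
            f_r = np.prod(np.where(obs, params["p_r"][c, t],
                                   1.0 - params["p_r"][c, t]))
            total += p_ct * f_y * f_u * f_r
    return total
```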

If data have been generated by a growth mixture model with treatment effects varying across trajectory classes, what would a pretest–posttest analysis such as that in BFHR reveal? To judge the possibility of such treatment–trajectory interaction in the NYSCS, we considered several recent applications of growth mixture modeling that have used T_i to represent qualitatively different types of trajectories for behavior and achievement scores on children in school settings. Drawing on these real-data studies, two growth mixture scenarios were investigated. (A detailed description of these real-data studies and scenarios and their parameter values is given in Mplus Web Note 5 at www.statmodel.com/mplus/examples/webnote.html.) For simplicity, no missing data on the outcome or pretest are assumed, and C_i classes are not present. In a three-class scenario, the treatment effect is noteworthy only for a 70% middle class, assuming that low-class membership (10%) hinders individuals from benefiting from the treatment and that individuals in the high class (20%) do not really need the treatment. The achievement development in the three-class scenario is shown in Figure 1(a), and the corresponding posttest (y2)–pretest (y1) regressions are shown in Figure 1(b). The lines denoted ANCOVA show a regular ANCOVA analysis allowing for an interaction between treatment and pretest (different slopes). In the three-class scenario, the ANCOVA interaction is not significant at n = 2000, and the treatment effect in the middle class is underestimated by 20% but overestimated in the other two classes. In a two-class scenario (not shown here), where the treatment is noteworthy only for individuals in the low class (50%), ANCOVA detects an interaction that is significant at the NYSCS sample size of n = 2000, but underestimates the treatment effect for most children in the low class. (At the low-class average pretest value of 0, the treatment effect is underestimated by 32%.)

Although the NYSCS children are selected from low-performing schools (the average NYSCS math and reading percentile rankings are around 23–28), there may still be sufficient heterogeneity among children in their achievement growth to make a treatment–trajectory interaction plausible. The three-class scenario is possible, perhaps with more children in the low class relative to the other two classes. If this is the case, the ANCOVA analysis shown in Figure 1 suggests a possible reason for BFHR's finding of low treatment effects. The empirical studies and the results in Figure 1 suggest that future program evaluations may benefit from exploring variation in treatment effects across children characterized by different development. Using data from at least two posttreatment time points (three time points total), the class-specific treatment effects generated in these data can be well recovered by growth mixture modeling. (Monte Carlo simulation results are given in Mplus Web Note 5 at www.statmodel.com/mplus/examples/webnote.html.)
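A rough illustration of the mechanism described above is sketched below: data are generated from a three-class mixture in which only the middle class benefits, and an ANCOVA with a treatment-by-pretest interaction is fitted. The class proportions and effect sizes are invented for illustration and are not the values used in Mplus Web Note 5.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 2000
# Latent trajectory class: low (10%), middle (70%), high (20%)
cls = rng.choice(3, size=n, p=[0.10, 0.70, 0.20])
z = rng.integers(0, 2, size=n)                    # randomized treatment offer
init = np.array([-1.0, 0.0, 1.0])[cls]            # class-specific initial status
y1 = init + rng.normal(0.0, 0.5, n)               # pretest: initial status + noise
effect = np.array([0.0, 0.5, 0.0])[cls]           # only the middle class benefits
y2 = init + z * effect + rng.normal(0.0, 0.5, n)  # posttest

# ANCOVA with a treatment-by-pretest interaction, as in Figure 1(b)
X = sm.add_constant(np.column_stack([z, y1, z * y1]))
fit = sm.OLS(y2, X).fit()
print(fit.params, fit.bse)
# The single slope and interaction smooth over the three classes, so the
# class-specific effects (0, 0.5, 0) are blurred: the middle-class effect
# is understated and the low/high-class effects are overstated.
```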


Figure 1. Growth Mixture Modeling Versus Pretest–Posttest Analysis.

A more flexible analysis is obtained with more posttreatment time points. An improved design for the determination of the latent trajectory classes is to use more than one pretreatment time point, so that the trajectory class membership is better determined before the treatment starts.

3. COMPOUND EXCLUSION AND LATENT IGNORABILITY

Based on the ideas of principal stratification (Frangakis and Rubin 2002) and latent ignorability (Frangakis and Rubin 1999), BFHR successfully demonstrates that the complexities of educational studies can be better handled under more explicit and flexible sets of assumptions. Although we think that their structural assumptions are reasonable in the NYSCS, we would like to add some thoughts on the plausibility of two other assumptions, considering more general situations.

Compound exclusion (CE) is one of the key structural assumptions in identifying principal effects under latent ignorability. However, the plausibility of this assumption can be questioned in practice (Frangakis et al. 2002; Hirano et al. 2000; Jo 2002a, 2002c; Shadish, Cook, and Campbell 2002; West and Sagarin 2000). In the NYSCS, it seems realistic to assume that winning a lottery has some positive impact on always-takers; however, it is less clear how winning a lottery will affect never-takers. One possibility is that winning a lottery has a negative impact on parents, because they fail to benefit from it. Discouraged parents may have a negative influence on a child's test scores or response behaviors. This negative effect may become more evident if noncompliance is due to parents' low expectations of, or lack of interest in, their child's education. Another possibility is that winning a lottery has a positive impact on a child's test scores or response behaviors. For example, parents who are discouraged by being unable to send their child to private schools even with vouchers may try harder to improve the quality of existing resources (e.g., in the public school their child attends) and be more motivated to support their child to improve his or her academic performance. Given these competing possibilities, it is not easy to predict whether and how CE is violated.

Depending on the situation, causal effect estimates can be quite sensitive to violation of the exclusion restriction in outcome missingness (Jo 2002b), which is less well understood than the impact of violating the exclusion restriction in observed outcomes (Angrist et al. 1996; Jo 2002a). The implication is that the relative benefit of models assuming latent ignorability (LI) and standard ignorability (SI) depends on the degrees of deviation from CE and SI. Identification of causal effects under LI relies on the generalized (compound) exclusion restriction (i.e., both on the outcomes and on the missingness of outcomes), whereas identification under SI relies on the standard exclusion restriction (i.e., only on the outcomes). Therefore, in some situations the impact of deviation from CE may outweigh the impact of deviation from SI, resulting in more biased causal effect estimates in models assuming LI than in models assuming SI (Jo 2002b). For example, if SI holds but CE is seriously violated (say, a 20% increase in the response rate due to treatment assignment for compliers and a 15% increase for never-takers), then causal effect estimates and the coverage probability assuming LI and CE can drastically deviate from the true value and the nominal level. This type of violation does not affect models assuming SI and the standard exclusion restriction. To empirically examine the plausibility of SI, LI, and CE, it will be useful to do sensitivity analyses of models imposing different combinations of these assumptions. As BFHR points out, this investigation can be conducted by relaxing compound exclusion (e.g., Frangakis et al. 2002; Hirano et al. 2000) or by using alternative structural assumptions (e.g., Jo 2002c). More research is needed to examine the efficiency of these alternative models and to explore factors associated with the insensitivity of LI models to violation of compound exclusion.

4. CONCLUSION

Causal inferences of the type BFHR provides are a dramatic improvement over the existing literature now available on the question of whether school choice will produce better achievement outcomes for children in an urban public school system. The randomized lottery provides an exceptionally powerful tool for examining the impact of a program, far more useful than observational studies that have causal change intertwined hopelessly with self-selection factors. Statisticians are just now investigating variations in such principal strata analyses, that is, those involving latent classes formed as a function of randomized trials involving intervention invitations (such as vouchers), encouragement designs, and field trial designs involving more than one randomization (Brown and Liao 1999). The latent categories in this article, which BFHR labels "complier," "never-taker," "always-taker," and "defier," represent only one type of design. Other terms may be more relevant to the scientific questions underlying trials in which subjects are randomly assigned to different levels of invitation (e.g., Angrist and Imbens 1995) or different levels of implementation. Such trials not only have great potential for examining questions of effectiveness, sustainability, and scalability, but also require terms more consistent with adherence than compliance. Again, we congratulate the authors on an important addition to the methodological literature that we predict will have lasting impact.

ADDITIONAL REFERENCES

Angrist, J. D., and Imbens, G. W. (1995), "Two-Stage Least Squares Estimation of Average Causal Effects in Models With Variable Treatment Intensity," Journal of the American Statistical Association, 90, 431–442.

Brown, C. H., and Liao, J. (1999), "Principles for Designing Randomized Preventive Trials in Mental Health: An Emerging Developmental Epidemiologic Perspective," American Journal of Community Psychology, 27, 673–709.

Jo, B. (2002a), "Model Misspecification Sensitivity Analysis in Estimating Causal Effects of Interventions With Noncompliance," Statistics in Medicine, 21, 3161–3181.

——— (2002b), "Sensitivity of Causal Effects Under Ignorable and Latent Ignorable Missing-Data Mechanisms," available at www.statmodel.com/mplus/examples/jo.

——— (2002c), "Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications" (with discussion), Journal of Educational and Behavioral Statistics, 27, 385–420.

Muthén, B. (2002a), "Beyond SEM: General Latent Variable Modeling," Behaviormetrika, 29, 81–117.

Muthén, B., and Brown, C. H. (2001), "Non-Ignorable Missing Data in a General Latent Variable Modeling Framework," presented at the annual meeting of the Society for Prevention Research, Washington, DC, June 2001.

Muthén, B., Brown, C. H., Masyn, K., Jo, B., Khoo, S. T., Yang, C. C., Wang, C. P., Kellam, S., Carlin, J., and Liao, J. (2002), "General Growth Mixture Modeling for Randomized Preventive Interventions," Biostatistics, 3, 459–475.

Muthén, L., and Muthén, B. (1998–2002), Mplus User's Guide, Los Angeles: authors.

Muthén, B., and Shedden, K. (1999), "Finite Mixture Modeling With Mixture Outcomes Using the EM Algorithm," Biometrics, 55, 463–469.

Seltzer, M. H., Frank, K. A., and Bryk, A. S. (1994), "The Metric Matters: The Sensitivity of Conclusions About Growth in Student Achievement to Choice of Metric," Educational Evaluation and Policy Analysis, 16, 41–49.

Shadish, W. R., Cook, T. D., and Campbell, D. T. (2002), Experimental and Quasi-Experimental Designs for Generalized Causal Inference, Boston: Houghton Mifflin.

Slavin, R. E. (2002), "Evidence-Based Education Policies: Transforming Educational Practice and Research," Educational Researcher, 31, 15–21.

West, S. G., and Sagarin, B. J. (2000), "Participant Selection and Loss in Randomized Experiments," in Research Design: Donald Campbell's Legacy, ed. L. Bickman, Thousand Oaks, CA: Sage, pp. 117–154.

Comment
Alan B. KRUEGER and Pei ZHU

The New York City school voucher experiment provides some of the strongest evidence yet available on the effect of private school vouchers on student test achievement. Barnard et al. provide a useful reevaluation of the experiment, and some of the authors were integrally involved in the design of the experiment. We will leave it to other discussants to comment on the Bayesian methodological advances in their paper, and instead comment on the substantive lessons that can be learned from the experiment and the practical lessons that emerge from the novel design of the experiment.

We have the advantage of having access to more complete data than Barnard et al. used in preparing their paper. This includes more complete baseline test information, data on multichild families as well as single-child families, and three years of follow-up test data instead of just one year of follow-up data.

Alan B. Krueger is Professor and Pei Zhu is graduate student, Economics Department, Princeton University, Princeton, NJ 08544 (E-mail: akrueger@princeton.edu). In part, this comment extends and summarizes results of Krueger and Zhu (2003). Readers are invited to read that article for a more in-depth analysis of many of the issues raised in this comment. Some of the results presented here differ slightly from those in our earlier article, however, because the definition of family size in this comment corresponds to the one used by Barnard et al. rather than to the definition in our earlier article.

In our comment, we use the more comprehensive sample because results for this sample are more informative, but we highlight where differences arise from using the sample analyzed by Barnard et al.

Three themes emerge from our analysis. First, simplicity and transparency are underappreciated virtues in statistical analysis. Second, it is desirable to use the most recent, comprehensive data for the widest sample possible. Third, there is no substitute for probing the definitions and concepts that underlie the data.

1. RANDOM ASSIGNMENT

As Barnard et al. explain, the voucher experiment entailed a complicated block design, and different random assignment procedures were used in the first and second set of blocks. In the first block, a propensity matched-pairs design (PMPD) method was used. In this block, far more potential control families were available than there was money to follow them up. Rather than select a random sample of controls for follow-up after randomly


Table 1. Efficiency Comparison of Two Methods of Random Assignment: Estimated Treatment Effect and Standard Error, by Method of Random Assignment

                                    Propensity score match     Randomized blocks        Relative sampling     Sample-size-adjusted relative
Year    Model                       subsample                  subsample                variance (RB/PMPD)    sampling variance (RB/PMPD)
                                    Coef. (SE)    No. obs.     Coef. (SE)    No. obs.

A. All students
Year 1  Control for baseline        0.33 (1.40)     721        2.10 (1.38)     734          .972                  .989
        Omit baseline               0.32 (1.40)    1048       -1.02 (1.58)    1032         1.274                 1.254
Year 3  Control for baseline        1.02 (1.56)     613        0.67 (1.74)     637         1.244                 1.293
        Omit baseline              -0.31 (1.57)     900       -0.45 (1.78)     901         1.285                 1.287

B. African-American students
Year 1  Control for baseline        0.32 (1.88)     302        8.10 (1.71)     321          .827                  .879
        Omit baseline               2.10 (1.99)     436        3.19 (2.07)     447         1.082                 1.109
Year 3  Control for baseline        4.21 (2.51)     247        6.57 (2.09)     272          .693                  .764
        Omit baseline               2.26 (2.47)     347        3.05 (2.31)     386          .875                  .973

NOTE: Standard errors are in parentheses. The treatment effect coefficient is from a regression of test scores on a dummy indicating assignment to receive a voucher (1 = yes), lottery randomization strata dummies, and, in the indicated models, baseline test scores. Bootstrap standard errors account for within-family correlation in residuals. "Year 1" or "Year 3" refers to the follow-up year. The last column adjusts the relative sampling variances of the treatment effects for differences in relative sample sizes. The sample of African-American students consists of those whose mother's race/ethnicity is identified as non-Hispanic Black/African-American.
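The table note does not spell out the sample-size adjustment; a formula consistent with the printed values (for example, .972 × 734/721 ≈ .989 in the first row) rescales the variance ratio by the ratio of the two subsample sizes:

$$
\text{adjusted ratio} \;=\; \frac{\widehat{\operatorname{Var}}_{\mathrm{RB}}(\hat{\tau})}{\widehat{\operatorname{Var}}_{\mathrm{PMPD}}(\hat{\tau})} \times \frac{n_{\mathrm{RB}}}{n_{\mathrm{PMPD}}}.
$$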

selecting the treatments, the researchers estimated a propensity score model to align controls with the treatments, and then used a nearest-available-neighbor Mahalanobis match to select specific paired control families for follow-up. In principle, the PMPD design should improve the precision of the estimates by reducing the chance imbalances that can occur in a simple randomized block design. A standard randomized block design was used in the remaining four blocks.
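A minimal sketch of this two-step selection is given below, assuming a logistic propensity model and greedy nearest-available matching; the actual PMPD implementation was proprietary and, as noted later in this comment, is not described in enough detail to replicate, so every modeling choice here is our own assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pmpd_match(X_treat, X_control, shortlist=20):
    """Step 1: propensity score from a logistic model of treatment status.
    Step 2: for each treated family, greedy nearest-available-neighbor
    match on Mahalanobis distance among the controls whose propensity
    scores are closest."""
    X = np.vstack([X_treat, X_control])
    z = np.r_[np.ones(len(X_treat)), np.zeros(len(X_control))]
    ps = LogisticRegression(max_iter=1000).fit(X, z).predict_proba(X)[:, 1]
    ps_t, ps_c = ps[: len(X_treat)], ps[len(X_treat):]
    V_inv = np.linalg.inv(np.cov(X_control, rowvar=False))  # Mahalanobis metric
    available, pairs = set(range(len(X_control))), []
    for i in np.argsort(-ps_t):          # match high-propensity treated first
        cand = sorted(available, key=lambda j: abs(ps_c[j] - ps_t[i]))[:shortlist]
        diffs = [X_treat[i] - X_control[j] for j in cand]
        d = [v @ V_inv @ v for v in diffs]
        best = cand[int(np.argmin(d))]
        pairs.append((i, best))
        available.remove(best)           # each control used at most once
    return pairs
```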

Barnard et al. emphasize that the PMPD block was more balanced for 15 of the 21 baseline variables that they examined. But the key question is the extent to which the precision of the estimates was improved, not the covariate balance, because both blocks yield unbiased estimates. If the covariates explain relatively little of the outcome variable, then the PMPD will add very little. In Table 1 we provide standard intent-to-treat (ITT) estimates, that is, the difference in mean test scores between treatments and controls conditional on the strata used for random assignment (family size by block by high/low-achieving school), and, more importantly, their standard errors, separately for the PMPD subsample and the randomized block subsample. The outcome variable is the average of the national percentile rank on the math and reading segments of the Iowa Tests of Basic Skills, taken either 1 year or 3 years after random assignment. Two sets of results are presented: one set controlling for baseline test scores, for the subsample with baseline scores, and the other without controlling for scores, for the larger sample. The last column adjusts the relative sampling variances for differences in sample size between the blocks.
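A sketch of this ITT regression, with a family-clustered bootstrap for the standard error as described in the table note, is given below; the variable names and weighting scheme are our assumptions (the published analyses used Mathematica's revised baseline weights).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def itt_estimate(df):
    """ITT effect: regress the follow-up score on a voucher-offer dummy and
    randomization-stratum dummies, weighted by baseline sample weights."""
    fit = smf.wls("score ~ voucher + C(stratum)", data=df,
                  weights=df["weight"]).fit()
    return fit.params["voucher"]

def family_bootstrap_se(df, n_boot=500, seed=0):
    """Resample whole families, so within-family correlation in residuals
    is reflected in the standard error."""
    rng = np.random.default_rng(seed)
    fams = df["family_id"].unique()
    stats = []
    for _ in range(n_boot):
        pick = rng.choice(fams, size=len(fams), replace=True)
        boot = pd.concat([df[df["family_id"] == f] for f in pick])
        stats.append(itt_estimate(boot))
    return np.std(stats, ddof=1)
```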

For all students (panel A), the standard errors are about 10% smaller in the PMPD sample. For the Black students (panel B), however, the standard errors are essentially the same or slightly larger in the PMPD sample. This finding is unexpected, because Hill, Rubin, and Thomas (2000) predicted that analyses of subgroups would have more power in the PMPD design, because matching should lead to a more nearly equal representation of subgroups in the treatment and control groups. In addition, the covariate balance is less good for the Black students in the PMPD than in the randomized blocks if we compute Z scores for the differences between treatments and controls for the baseline covariates. One possibility is that because Black students could have been matched to non-Black students in the PMPD design, this subsample was actually less balanced than the subsample of Blacks from the more conventional randomized block design.

The increase in power from the PMPD design is relatively modest, even in the full sample. Of much more practical importance is the fact that the complexity of the PMPD design caused Mathematica to initially calculate incorrect baseline sample weights. The baseline weights were designed to make the follow-up sample representative of all eligible applicants for vouchers. The initial weights assigned much too little importance to the control sample in the PMPD block. Because these families represented many unselected controls, they should have been weighted heavily. The revised weights increased the weight on the PMPD controls by 620%. In contrast, the weight increased by just 18% for the rest of the sample.

This error had grave consequences for the initial inferences and analyses of the data. First, using the faulty weights, Peterson, Myers, and Howell (1998) reported highly statistically significant differences in baseline test scores between the treatment and control groups, with control group members scoring higher on both the math and reading exams. After the mistake was discovered and the weights were revised, however, the baseline differences became small and statistically insignificant. (We note, however, that if we limit attention to the students from single-child families that Barnard et al. use, and use the more complete baseline test score data, then there is a statistically significant difference in baseline reading scores between treatments and controls, even with the revised baseline weights. If we pool together students from single-child and multichild families, then the baseline difference is insignificant. This is another reason why we think it is more appropriate to use the broader sample that includes children from all families.)

Second, because of the inaccurate inference that there was a difference in baseline ability between treatments and controls, Mathematica's researchers were discouraged from analyzing (or at least from presenting) results that did not condition on baseline scores. This is unfortunate, because conditioning on baseline scores caused the researchers to drop from the sample all the students who were initially in kindergarten (because these students were not given baseline tests) and 11% of students initially in grades 1–4. Including students with missing baseline scores increases the sample by more than 40% and expands the population to which the results can be generalized. As explained later and in an article by Krueger and Zhu (2003), qualitatively different results are found if students with missing baseline scores are included in the sample. Because assignment to receive a voucher was random within lottery strata, estimates that do not condition on baseline scores are nonetheless unbiased. Moreover, it is inefficient to exclude students with missing baseline scores (most of whom had follow-up test scores), and such a sample restriction can potentially cause sample selection bias.

Of lesser importance is the fact that the PMPD design also complicates the calculation of standard errors. The matching of treatments and controls on selected covariates creates a dependence between paired observations. Moreover, if the propensity score model is misspecified, then the equation error also creates a dependence between observations. Researchers have not taken this dependence into account in the calculation of standard errors. We suspect, however, that this is likely to cause only a small understatement of the standard errors, because the covariates do not account for much of the residual variance of test scores and because the PMPD sample is only about half of the overall sample. (We tried to compute the propensity score to adjust the standard errors ourselves, but were unable to replicate the PMPD model, because it was not described in sufficient detail and because the computer programs are proprietary. This also prevented us from replicating estimates in Table 4 of Barnard et al.)

The problems created by the complicated experimental design, particularly concerning the weights, lead us to reiterate the advice of Cochran and Cox (1957): "A good working rule is to use the simplest experimental design that meets the needs of the occasion." It seems to us that in this case the PMPD design introduced unnecessary complexity that inadvertently led to a consequential mistake in the computation of weights, with very little improvement in efficiency. This was not the fault of the architects of the PMPD design, who did not compute the weights, but it was related to the complexity of the design. Simplicity and transparency, in designs, methods, and procedures, can help avoid mistakes down the road, which are almost inevitable in large-scale empirical studies. Indeed, Mathematica recently informed us that they still do not have the baseline weights exactly correct, 5 years after random assignment.

2. INTENT-TO-TREAT ESTIMATION RESULTS

Barnard et al. devote much attention to addressing potential problems caused by missing data for the sample of students in grades 1–4 at baseline; this follows the practice of earlier reports by Mathematica, which restricted the analysis of student outcomes (but not parental responses, such as satisfaction) to those enrolled in grades 1–4 at baseline. In our opinion, a much more important substantive problem results from the exclusion of students in the kindergarten cohort, who were categorically dropped because they were not given baseline tests. Because assignment to treatment status was random (within strata), a simple comparison of means between treatments and controls, without conditioning on baseline scores, provides an unbiased estimate of the average treatment effect. Moreover, as Barnard et al. and others show, treatments and controls were well balanced in terms of baseline characteristics, so there is no reason to suspect that random assignment was somehow corrupted.

Table 2 presents regression estimates of the ITT effect using various samples. Because we cannot replicate the Bayesian estimates without knowing the propensity score, we present conventional ITT estimates. For each entry in the first column, we regressed the test score percentile rank on a voucher offer dummy, 30 dummies indicating lottery strata (block × family size × high/low school), and baseline test scores. The sample is limited to those with baseline test data. Our results differ from the ITT estimates in Table 5 of Barnard et al. because we use the revised weights and a more comprehensive sample that also includes students from multichild families. Barnard et al. report that their Bayesian estimates are "generally more stable" than the conventional ITT estimates, "which in some cases are not even credible," but the extreme ITT estimates that they refer to, for 4th graders from high-achieving schools, are based on only 15 students. For the samples that pool students across grades, the results of the conventional ITT estimates are a priori credible and probably not statistically different from their Bayesian estimates.

In the second column, we use the same sample as in column 1 but omit the baseline test score from the model. In the third column, we expand the sample to include those with missing baseline scores. In the fourth column, we continue to include students with missing baseline scores, but restrict the sample to those initially in low-achieving public schools.

We focus mainly on the results for Black students and students from low-achieving public schools, because the effect of offering a voucher on either reading or math scores for the overall sample is always statistically insignificant and because most public policy attention has focused on these two groups.

Although it has not been made explicit in previous studies of these data, Black students have been defined as children with a non-Hispanic Black/African-American mother. We use this definition and a broader one to probe the sensitivity of the results.

Omitting baseline scores has little qualitative effect on the estimates when the same sample is used (compare columns 1 and 2); the coefficient typically changes by no more than half a standard error. For year 3 scores for Black students, for example, the treatment effect on the composite score is 5.2 points (t = 3.2) when controlling for baseline scores and 5.0 points (t = 2.6) when not controlling. This stability is expected in a


Table 2. Estimated Treatment Effects With and Without Controlling for Baseline Scores: Coefficient on Voucher Dummy From Test Score Regressions for Various Samples

                                                   (1) Subsample with      (2) Subsample with      (3) Full sample,    (4) Subsample, low
                                                   baseline scores,        baseline scores,        omits baseline      applicant school,
Test        Sample                                 controls for baseline   omits baseline scores   scores              omits baseline scores

First follow-up test
Composite   Overall                                 1.27 (0.96)             0.42 (1.28)            -0.33 (1.06)        -0.60 (1.06)
            Mother Black, non-Hispanic              4.31 (1.28)             3.64 (1.73)             2.66 (1.42)         1.75 (1.57)
            Either parent Black, non-Hispanic       3.55 (1.24)             2.52 (1.73)             1.38 (1.42)         0.53 (1.55)
Reading     Overall                                 1.11 (1.04)             0.03 (1.36)            -1.13 (1.17)        -1.17 (1.18)
            Mother Black, non-Hispanic              3.42 (1.59)             2.60 (1.98)             1.38 (1.74)         0.93 (1.84)
            Either parent Black, non-Hispanic       2.68 (1.50)             1.49 (1.97)            -0.09 (1.71)        -0.53 (1.81)
Math        Overall                                 1.44 (1.17)             0.80 (1.45)             0.47 (1.17)        -0.03 (1.15)
            Mother Black, non-Hispanic              5.28 (1.52)             4.81 (1.93)             4.01 (1.54)         2.68 (1.63)
            Either parent Black, non-Hispanic       4.42 (1.48)             3.54 (1.90)             2.86 (1.51)         1.59 (1.63)

Third follow-up test
Composite   Overall                                 0.90 (1.14)             0.36 (1.40)            -0.38 (1.19)         0.10 (1.16)
            Mother Black, non-Hispanic              5.24 (1.63)             5.03 (1.92)             2.65 (1.68)         3.09 (1.67)
            Either parent Black, non-Hispanic       4.70 (1.60)             4.27 (1.92)             1.87 (1.68)         1.90 (1.67)
Reading     Overall                                 0.25 (1.26)            -0.42 (1.51)            -1.00 (1.25)        -0.08 (1.24)
            Mother Black, non-Hispanic              3.64 (1.87)             3.22 (2.19)             1.25 (1.83)         2.06 (1.83)
            Either parent Black, non-Hispanic       3.37 (1.86)             2.75 (2.18)             0.76 (1.83)         1.23 (1.82)
Math        Overall                                 1.54 (1.30)             1.15 (1.53)             0.24 (1.33)         0.29 (1.28)
            Mother Black, non-Hispanic              6.84 (1.86)             6.84 (2.09)             4.05 (1.86)         4.13 (1.85)
            Either parent Black, non-Hispanic       6.04 (1.83)             5.78 (2.09)             2.97 (1.85)         2.58 (1.82)

NOTE: Standard errors are in parentheses. The dependent variable is the test score NPR. The reported coefficient is the coefficient on the voucher offer dummy. All regressions control for 30 randomization strata dummies; models in the first column also control for baseline math and reading test scores. Bootstrap standard errors are robust to correlation in residuals among students in the same family. In the original, bold font indicated that the absolute t ratio exceeds 1.96.
Sample sizes in year 1 for the subsample with baseline scores / without are: overall, 1455/2080; mother Black, 623/883; either parent Black, 684/968.
Sample sizes in year 3 for the subsample with baseline scores / without are: overall, 1250/1801; mother Black, 519/733; either parent Black, 572/807.
Sample sizes in year 1 for the subsample from low schools without baseline scores are: overall, 1802; mother Black, 770; either parent Black, 843.
Sample sizes in year 3 for the subsample from low schools without baseline scores are: overall, 1569; mother Black, 647; either parent Black, 712.

randomized experiment. When the sample is expanded to include those with missing baseline scores in column 3, however, the ITT effect falls almost in half in year 3 for Black students, to 2.65 points, with a t ratio of 1.58. The effect on the math score is statistically significant, as Barnard et al. emphasize. Qualitatively similar results are found if baseline covariates, such as mother's education and family income, are included as regressors (see Krueger and Zhu 2003).

The classification of Black students used in the study is somewhat idiosyncratic. Race and ethnicity were collected in a single question in the parental survey, contrary to the OMB guidelines for government surveys, so it is impossible to identify Blacks of Hispanic origin (of which there are a large number in New York). In addition, the survey did not directly ask about the child's race or ethnicity, so children's race must be inferred from the parents' race/ethnicity. A broader definition of Black students, and one that probably comes closer to the definition used in government surveys and most other educational research, would also include those who have a Black father in the sample of Black students. According to the 1990 Census, race was reported as Black for 85% of the children with a Black father and a Hispanic mother (the overwhelming non-Black portion of the sample) in the New York metropolitan region.

Hence, we also present results for a sample in which either parent is listed as "Black, non-Hispanic." This increases the sample of Black students by about 10%, and the results are even weaker for this sample. For example, the ITT estimate using the more comprehensive sample of Black students enrolled in grades K–4 at baseline is 1.87 points (t = 1.11) on the third-year follow-up composite test, although even with this sample the effect on math is significant at the 10% level in year 1, as Barnard et al. have found, and almost significant in year 3. So we conclude that the effect of the opportunity to use a private school voucher on the composite score for the most comprehensive sample of Black students is insignificantly different from 0, although it is possible that there was initially a small beneficial effect on the math score for Black students.

In the final column, we present results for the subsample of students originally enrolled in schools with average test scores below the median score in New York City. In each case, the results for this subsample are quite similar to those for the full sample, and a formal test of the null hypothesis that students from low- and high-achieving schools benefit equally from vouchers is never close to rejecting. It also appears very unlikely that the differences between the treatment effects for applicants from low- and high-achieving schools in Barnard et al.'s Table 4 are statistically significant, either.

3. WHAT WAS BROKEN

We agree with Barnard et al. that the experiment was broken, in the sense that attrition and missing data were common. Previous analyses were also strained, if not broken, for their neglect of the cohort of students originally in kindergarten, who were in third grade at the end of the experiment and whose follow-up test scores were ignored. Including these students qualitatively alters the results for Black students. The experiment was also broken in the sense that years passed before correct baseline weights were computed.

We disagree, however, with the interpretation that the experiment was broken because compliance was less than 100%. This depends on the question that one is interested in answering. Because most interest in the experiment for policy purposes centers on the impact of offering vouchers on achievement, not on compelling students to use vouchers, we think the ITT estimates, which reflect inevitable partial usage of vouchers, are most relevant (see also Rouse 1998; Angrist et al. 2002). Moreover, New York had a higher voucher take-up rate than the experiments in Dayton and the District of Columbia, so one could argue that the New York experiment provides an upper-bound estimate of the effect of offering vouchers. On the other hand, if the goal is to use the experiment to estimate the effect of attending private school on achievement, then such methods as instrumental variables, or those used by Barnard et al., are necessary to estimate the parameter of interest. We consider this of secondary interest in this case, however.
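The distinction between the two estimands can be made concrete: under the standard instrumental variables assumptions, the effect of private school attendance for compliers is the ITT effect on scores divided by the ITT effect on attendance. A minimal sketch, with invented variable names (y = score, d = attended private school, z = voucher offer):

```python
import numpy as np

def wald_iv(y, d, z):
    """Effect of attendance d on score y, using the randomized offer z as
    an instrument: the ITT effect on the outcome divided by the ITT effect
    on take-up (the complier share)."""
    itt_y = y[z == 1].mean() - y[z == 0].mean()
    itt_d = d[z == 1].mean() - d[z == 0].mean()
    return itt_y / itt_d
```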

ADDITIONAL REFERENCES

Angrist, J., Bettinger, E., Bloom, E., King, E., and Kremer, M. (2002), "Vouchers for Private Schooling in Colombia: Evidence from a Randomized Natural Experiment," American Economic Review, 92, 1535–1558.

Cochran, W., and Cox, G. (1957), Experimental Designs (2nd ed.), New York: Wiley.

Krueger, A., and Zhu, P. (2003), "Another Look at the New York City School Voucher Experiment," American Behavioral Scientist, forthcoming. Also available from Industrial Relations Section, Princeton University, at http://www.irs.princeton.edu.

Peterson, P., Myers, D., and Howell, W. (1998), "An Evaluation of the New York City Scholarships Program: The First Year," Mathematica Policy Research, Inc., available at http://www.ksg.harvard.edu/pepg/pdf/ny1rpt.pdf.

Rouse, C. (1998), "Private School Vouchers and Student Achievement: An Evaluation of the Milwaukee Parental Choice Program," The Quarterly Journal of Economics, 113, 553–602.

Comment
Richard A. BERK and Hongquan XU

1. INTRODUCTION

The article by Barnard, Frangakis, Hill, and Rubin (hereafter BFHR) is a virtuoso performance. By applying the Neyman–Rubin model of causal effects and building on a series of important works on the estimation and interpretation of causal effects (e.g., Angrist, Imbens, and Rubin 1996), BFHR manage to extract a reasonable set of findings from a "broken" randomized experiment. The article more generally underscores the important difference between estimating relationships in a consistent manner and interpreting those estimates in causal terms; obtaining consistent estimates is only part of the enterprise. Finally, the article emphasizes that assumptions made so that proper estimates may be obtained are not just technical moves of convenience. Rather, they are statements about how the empirical world is supposed to work. As such, they need to be examined concretely, with reference to the phenomena being studied, evaluated empirically whenever possible, and, at a minimum, subjected to sensitivity tests.

Richard A. Berk is Professor and Hongquan Xu is Assistant Professor, Department of Statistics, University of California, Los Angeles, CA 90095 (E-mail: berk@stat.ucla.edu).


As battle-tested academics, we can certainly quibble here and there about some aspects of the article. For example, might it not have made sense to construct priors from the educators and economists who were responsible for designing the intervention? Moreover, might it not have made sense to interpret the results using a yardstick of treatment effects that are large enough to matter in practical terms? Nonetheless, we doubt that overall we could have done as well as BFHR did. Moreover, most of our concerns depend on features of the data that cannot be known from a distance; we would have had to carefully examine the data ourselves. As a result, we focus on why a virtuoso performance was needed to begin with. Why was this trip necessary?


2. NONCOMPLIANCE

The implementation of randomized field experiments, as desirable as they are, invites any number of well-known problems (Berk 1990). One of these problems is that human subjects often do not do what they are told, even when they say they will. There is a recent scholarly literature on noncompliance (Metry and Meyer 1999; Johnson 2000) and a citation trail leading back to evaluations of the social programs of the War on Poverty (Haveman 1977). Some noncompliance can be expected, except when there is unusual control over the study subjects. Our recent randomized trial of the inmate classification system used by the California Department of Corrections is one example (Berk, Ladd, Graziano, and Baek 2003).

An obvious question follows: Why does it seem that the individuals who designed this experiment were caught by surprise? As best we can tell, little was done to reduce noncompliance, and, more important for the analyses of BFHR, apparently almost no data were collected on factors that might be explicitly related to noncompliance (Hill, Rubin, and Thomas 2000).

Without knowing far more about the design stages of the experiment, we find it difficult to be very specific. But from the literatures on compliance and school choice, the following are probably illustrative indicators relevant for parents who would not be inclined to use a school voucher; for other forms of noncompliance, other measures would be needed:

1. Whether the adults in the household are employed outside the home during the day

2. Whether there is any child care at the private schools before classes begin and after they end, for children of employed parents

3. A household's ability to pay for private school beyond the value of the voucher

4. Travel time to the nearest private schools

5. Availability of transportation to those schools

6. Religious preference of the household and religious affiliation of the private school

From these examples, and others that individuals closer to the study would know about, one can imagine a variety of measures to reduce noncompliance. For instance, key problems for any working parent are child care before and after school and round-trip school transportation. These might be addressed directly with supplemental support. If no such resources could be obtained, then the study design might be altered to screen out, before random assignment, households for whom compliance was likely to be problematic.

3. MISSING DATA

No doubt BFHR know more about the subtleties of the missing data than we do. Nevertheless, the same issues arise for missing data that arise for noncompliance. Surely problems with missing data could have been anticipated. Factors related to nonresponse to particular items could have been explicitly measured, and prevention strategies perhaps could have been deployed.

4. DESIGN EFFICIENCY

A carefully designed experiment not only provides a solid basis for statistical inference, but also gives credible evidence for causation. The New York School Choice Scholarships Program (NYSCSP) was apparently the first good example of a randomized experiment to examine the potential benefits of vouchers for private schools. Offering a scholarship (treatment) randomly through a lottery can (1) provide a socially acceptable way to ration scarce resources, (2) protect against undesirable selection bias, and (3) produce balance between the observed and unobserved variables that may affect the response. In addition to the randomization and blocking, NYSCSP implemented a PMPD (propensity matched-pairs design) to choose a control group from a large candidate pool and to balance many covariates explicitly. Table 2 of BFHR shows that various background variables were balanced properly between the treatment group and the control group.

But might it have been possible to do a bit better? BFHR did not mention that the sample size of the applicant's school was not balanced, although winners from high and low schools were balanced between the treatments and the controls. An ideal design would have had equal winners from "bad" (or low) and "good" (or high) public schools. In the NYSCSP, 85% of the winners were from bad schools and 15% were from good schools, to achieve the social goals of the School Choice Scholarships Foundation (Hill et al. 2000, p. 158). Bad schools are defined as "those for which the average test scores were below the median test scores for the city" (Hill et al. 2000, p. 158). A direct consequence is that the estimates from the good schools have much larger variation than those from the bad schools, which was evident in BFHR's Tables 4–6. The confidence intervals for good schools are wider.

Instead of classifying schools into two classes (bad and good), it might have been helpful to classify them into four classes: very bad, bad, good, and very good. To estimate the effect of applicant's school (assuming linear effects), an optimal design would select half of the winners from the very bad schools, the other half from the very good schools, and none from the two middle classes. Avoiding the middle classes enlarges the distance between bad and good schools and also reduces the source variation in the bad and good schools, and thus reduces the variation in the estimates. To be more consistent with the goals of the School Choice Scholarships Foundation, an alternative design might take 75% of winners from very bad schools, 10% from bad schools, 5% from good schools, and 10% from very good schools. This design would have a higher design efficiency than the original design, especially because the applicant's school was the most important design variable after family size (Hill et al. 2000, p. 160). Of course, people on the ground at the time may have considered these options and rejected them. Our point is that with better planning, it might have been possible to squeeze more efficiency out of the study design. Such an outcome would have made the job undertaken by BFHR a bit easier.
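The efficiency argument can be checked quickly: with school quality coded at equally spaced levels, the sampling variance of the estimated linear slope is inversely proportional to the spread of the allocation around its mean. The coding, sample size, and error variance below are our own illustrative assumptions.

```python
import numpy as np

def slope_var(levels, shares, n=1000, sigma2=1.0):
    """Sampling variance of the OLS slope of outcome on school-quality
    level when the given shares of n winners are assigned to each level:
    Var(b) = sigma^2 / (n * Var(x))."""
    x, w = np.asarray(levels, float), np.asarray(shares, float)
    var_x = np.sum(w * x**2) - np.sum(w * x) ** 2
    return sigma2 / (n * var_x)

levels = [0, 1, 2, 3]  # very bad, bad, good, very good
print(slope_var(levels, [0.00, 0.85, 0.15, 0.00]))  # two-class 85/15 split
print(slope_var(levels, [0.50, 0.00, 0.00, 0.50]))  # optimal endpoint design
print(slope_var(levels, [0.75, 0.10, 0.05, 0.10]))  # Berk and Xu's alternative
```

Running the sketch shows the endpoint design has much the smallest slope variance, with the 75/10/5/10 alternative in between, which is the ranking argued for above.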

5. CONCLUSIONS

Scientific problems often provide a fertile ground for important statistical developments; BFHR's article and the literature on which it depends are excellent illustrations. However, from the perspective of public policy, the goal of program evaluation is to obtain usefully precise estimates of program impacts while using the most robust means of analysis available. Simplicity is also important, both for data analysts, who must do the work, and for policy makers, who must interpret the results. It follows that large investments at the front end of an evaluation, in research design, measurement, and study integrity, are essential. Virtuoso statistical analyses can certainly help when these investments are insufficient or when they fail, but they must not be seen as an inexpensive alternative to doing the study well to begin with.

ADDITIONAL REFERENCES

Angrist, J., Imbens, G. W., and Rubin, D. B. (1996), "Identification of Causal Effects Using Instrumental Variables" (with discussion), Journal of the American Statistical Association, 91, 444–472.

Berk, R. A. (1990), "What Your Mother Never Told You About Randomized Field Experiments," in Community-Based Care of People With AIDS: Developing a Research Agenda, AHCPR Conference Proceedings, Washington, DC: U.S. Department of Health and Human Services.

Berk, R. A., Ladd, H., Graziano, H., and Baek, J. (2003), "A Randomized Experiment Testing Inmate Classification Systems," Journal of Criminology and Public Policy, 2, 215–242.

Haveman, R. H. (ed.) (1977), A Decade of Federal Antipoverty Programs: Achievements, Failures, and Lessons, New York: Academic Press.

Hill, J. L., Rubin, D. B., and Thomas, N. (2000), "The Design of the New York School Choice Scholarships Program Evaluation," in Research Design: Donald Campbell's Legacy (Vol. II), ed. L. Bickman, Thousand Oaks, CA: Sage.

Johnson, S. B. (2000), "Compliance Behavior in Clinical Trials: Error or Opportunity?" in Promoting Adherence to Medical Treatment in Chronic Childhood Illness: Concepts, Methods, and Interventions, ed. D. Drotar, Mahwah, NJ: Lawrence Erlbaum.

Metry, J., and Meyer, U. A. (eds.) (1999), Drug Regimen Compliance: Issues in Clinical Trials and Patient Management, New York: Wiley.

Rejoinder
John BARNARD, Constantine E. FRANGAKIS, Jennifer L. HILL, and Donald B. RUBIN

We thank the editorial board and the thoughtful group of discussants that they arranged to comment on our article. The discussants have raised many salient issues and offered useful extensions of our work. We use this rejoinder primarily to address a few points of contention.

Muthén, Jo, and Brown

Muthén, Jo, and Brown (henceforth MJB) provide an enlightening discussion of an alternate, and potentially complementary, approach to treatment effect estimation. The growth models that they discuss are for data more extensive than what we analyzed, and so are beyond the scope of our article. Nevertheless, MJB propose an interesting idea for looking for interactions in longitudinal experimental data, and we look forward to future applications of the method to appropriate datasets.

Although models that incorporate both our compliance principal strata and latent growth trajectories are theoretically possible (assuming multiple years of outcome data) and could potentially yield an extremely rich description of the types of treatment effects manifested, the more time points that exist, the more complicated the compliance patterns can become (at least without further assumptions), and it may be difficult to find a dataset rich enough to support all of the interactions. Given a choice, we feel that it is probably more beneficial to handle real observed issues than to search for interactions with latent trajectories, but certainly it would always be nice to be able to do both. Of course, such a search for interactions in trajectories is more beneficial when there is enough prior knowledge about the scientific background to include additional structural assumptions that can support robust estimation.

MJB make an interesting point about the exploration of potential violations of exclusion for never-takers. Such analyses have been done in past work by some of the authors (Hirano et al. 2000; Frangakis, Rubin, and Zhou 2002) but were not pursued for our study, because the robustness of such methods is still an issue in examples as complex as this one.

Regarding the trade-offs between the joint latent ignorability and compound exclusion assumptions versus the joint assumptions of standard ignorability and the standard exclusion restriction, this topic is up for debate. We would like to point out, however, that these standard structural assumptions come with their own level of arbitrariness. These issues have been explored by Jo (2002) and Mealli and Rubin (2002, 2003).

We thank MJB for providing a thought-provoking discussion of our model and assumptions, as well as ample fodder for additional directions that can be explored in this arena. Clearly, this general category of work has stimulated much new and exciting modeling.

Krueger and Zhu

Krueger and Zhu (henceforth KZ) focus primarily on the underlying study and on analyses that either we did not do or that used data we did not have. Clearly, they are interested in the substantive issues, which is admirable, but we cannot be responsible for analyses that we neither did nor were consulted on, and we are not prepared to comment on what analyses we might have undertaken with data that we did not have.

KZ state that "there is no substitute for probing the definition and concepts that underlie the data." Although we are not entirely sure what this refers to in particular, if it means that good science is more important than fancy techniques, then we certainly agree.

Another comment by KZ is that "it is desirable to use the most recent, comprehensive data for the widest sample possible." We agree that it is desirable to have access to the most comprehensive data available, but it is not necessary to use all of these data in any given analysis. Here we use only first-year data, because these were the only data available to us at the time that we were building the model. We used only single-child families, as explained in the article, because, regrettably, no compliance data were collected for individual children in the multichild families.

The exclusion of kindergarten children could be considered more of a judgment call regarding the quality of the pretest scores, which matter not only for efficiency, but also for the reasonableness of our assumptions about noncompliance and missing data. Nothing in theory would prevent us from including these children and omitting these extra covariates.

These types of trade-offs (more data vs. more appropriate data) are common in statistical analyses, and the benefits are well known, for example, for avoiding model extrapolation or when some subsets of data do not mean the same thing as others. In fact, Krueger "ignored data" in his article on the minimum wage with David Card (Card and Krueger 1994) by limiting analyses to restaurants from a similar geographic region, in the hope that employees and employers across the two comparison states would be faced with similar economic conditions, rather than looking at all restaurants in these states. This is a form of implicit matching, which we believe represented a good choice. Surely there are times when focusing on subsets of data produces more reliable results.

Furthermore, we would like to point out that KZ's assertion that "because assignment to treatment status was random (within strata), a simple comparison of means between treatments and controls, without conditioning on baseline scores, provides an unbiased estimate of the average treatment effect" is simply false, because there are missing outcomes. Even if there is perfect compliance, a simple comparison of means across treatment groups when there are missing outcomes will not lead, in general, to an unbiased estimate of the treatment effect. Assumptions need to be made about the missing-data process.

When noncompliance exists in addition to the missing outcome data, Frangakis and Rubin (1999) have shown that estimation even of the ITT effect requires additional assumptions. Moreover, the ITT estimate, generally considered a conservative estimate of the treatment effect, can actually be anticonservative in this scenario. Finally, contrary to what KZ seem to imply in their final paragraph, instrumental variables methods alone are not sufficient to address the problems of both noncompliance and missing data.
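The point about missing outcomes can be illustrated with a small simulation; this is a sketch under assumed parameter values (the missingness mechanism, effect size, and variable names are ours), not a reanalysis of the NYSCS data. When response depends on both assignment and the outcome itself, the complete-case difference in means is badly biased even with perfect compliance:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

z = rng.integers(0, 2, n)                         # randomized assignment
y = 50 + 2 * z + 10 * rng.standard_normal(n)      # true ITT effect = 2

# Missingness depends on assignment and on the outcome itself
# (e.g., low scorers in the control group are less likely to be tested):
p_obs = np.where(z == 1, 0.8,
                 np.clip(0.3 + 0.01 * (y - 50), 0.05, 0.95))
obs = rng.random(n) < p_obs

naive = y[obs & (z == 1)].mean() - y[obs & (z == 0)].mean()
print(f"complete-case ITT estimate: {naive:.2f} (truth: 2.00)")
```

Under these assumed numbers the naive estimate is not merely attenuated; its sign can flip, because the observed controls are a select, higher-scoring group.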

KZ advocate "simplicity and transparency." We wholeheartedly agree, but disagree with the examples they use to defend their stance. Their first argument is that the PMPD did not yield strong efficiency gains and that we did not account for the PMPD in the calculation of standard errors. It may well be that precision gains were relatively modest; however, the analyses that they present are by no means conclusive in this regard. First, as with KZ's other ITT analyses, we have no clue as to how missing outcome data are handled. In addition, the PMPD was also used to balance compliance and missingness. It is unclear, however, how the gains in precision due to matching the missingness and noncompliance are displayed, if at all, in KZ's results. Finally, KZ did not account for the experimental design when estimating their results. Failure to account properly for the design typically results in estimates that overstate the variances of treatment effect estimates. In this design, it is the correlation between matched pair units that leads the sampling variances to be smaller than their estimates.

Even if precision gains were truly modest, they of course were unknown to us at the time we debated the use of the PMPD design. KZ's argument is like the complaint of a cross-country motorist who experiences no car problems during his long drive and then bemoans the money wasted on a spare tire purchased before the trip began. We prefer to avoid potential problems by using careful design.

Importantly, KZ's criticism that we ignored this matching in the analysis is incorrect. The matching was done based on the estimated propensity score, so it follows from Rubin (1978) that assignment is ignorable conditionally given the covariates that were used to estimate it. Thus the Bayesian analysis that we conduct, which conditions on the estimated propensity score, is an approximately valid Bayesian analysis with respect to the matched assignment, assuming that our modeled relationship between the outcomes and the estimated propensity score captures the major features of the relationship to the covariates.

KZ's second argument is that use of the PMPD "led to grave consequences" because sampling weights were miscalculated by MPR. It seems odd to us to maintain, as KZ do, that the PMPD "caused" this error; nor is it clear that use of another design would have avoided the error. In any case, we have rerun our analyses, this time generalizing to the population of which the children in our analyses are representative (by setting constant weights in step 3 of Sec. 8.1.1). For this population, all results are very similar to our original results. These results are posted as Appendix C at http://biosun01.biostat.jhsph.edu/~cfrangak/papers/sc.

A third argument revolves around the protracted debate between use of ITT versus treatment-targeted measures such as CACE. We addressed both ITT and CACE, which allows the reader to focus on one or the other or both, depending on which effect is expected to generalize to another environment: the ITT, or "sociological," effect of accepting or not accepting an offered voucher, or the CACE, or "structural" or "scientific," effect of using a voucher. One could argue that the effect of attending (CACE) is the more generalizable, whereas the sociological effect (ITT) is more easily influenced, for instance, by "hype" about how well the program works. That is, the ITT estimate will not reflect the "effect of offering" if, the next time the program is offered, the compliance rates change (which seems likely). We prefer to examine both, to provide a more comprehensive picture.

In sum, we agree that simplicity and transparency are important; this is the major motivation for our detailed description of our assumptions in terms of readily understandable concepts, rather than commonly invoked and less intuitive conditions involving correlated error terms. However, this desire for simplicity must be balanced with the prudence of guarding against potential problems through careful design.

Although we agree in theory with much of what KZ espouse, some of their criticisms seem misguided. Moreover, their presentation of analyses with incomplete exposition regarding missing-data complications makes these analyses difficult for us to evaluate properly.


Berk and Xu

Berk and Xu (henceforth BX) raise several interesting questions about the way in which the evaluation was designed and implemented. Many of these are actually addressed in a preceding and lengthier article by us on this topic (Barnard et al. 2002). That article describes, for instance, MPR's efforts to reduce missing data, without which the missingness rates would surely have been much higher. It also describes special challenges faced in this study. But in any study with human subjects, it is difficult to completely eliminate missing data, particularly if one desires more interesting information than can be obtained from administrative data. One of the challenges for the collection of outcome test scores was the fact that children needed to be tested outside of the public school setting, which allowed the use of a standardized test not used by the public school system, thereby greatly reducing the possibility of bias resulting from teachers "teaching to the test." Consequently, parents had to bring their children to testing areas on weekends, a formidable hurdle for some, even when monetary incentives were used.

An important consideration with this study was that the evaluators took the initiative to investigate a program that was sponsored by an organization (SCSF) whose primary goal was to provide scholarships, not to do research. We feel that MPR did an admirable job convincing the SCSF to allow the study to be evaluated and raising money for this purpose, but this also meant that the evaluators could not control every aspect of the program. For example, the SCSF originally wanted to make the program available only to children from the lower-test-score schools, and the evaluators had to convince them to also allow a small percentage from those in higher-test-score schools, to increase the generalizability of the results. The designs that BX suggest, even though they have certain desirable statistical properties under appropriate assumptions, were not an option with this study.

Related to this issue of the alignment of the evaluators' and program administrators' goals is the issue of the potential for reducing noncompliance rates. Some effort was made, in that help was provided to scholarship winners in finding an appropriate private school. Arguably, however, 100% compliance should not be the goal in such a study, given that we would not expect 100% compliance if the program were expanded and offered, say, by the government on a larger scale. Thus it may not be wise to use extreme efforts to increase compliance if such efforts would not be offered in the large-scale public case. With the current study, the opportunity exists to examine the noncompliance process for this program in this setting: take-up rates, predictors of noncompliance, and so on. To this end, some sort of subexperiment that randomizes a reasonable set of compliance incentives might be useful in similar studies in the future, if sample sizes are sufficient.

We are not convinced that we should have solicited actual prior distributions from educators and economists. In our experience, this is a confusing, difficult, and often unrewarding task. Moreover, school choice is such a politically charged and contentious issue that it seems wiser to try to use relatively diffuse and objective prior distributions.

We agree that results should be interpreted in terms of a yardstick of practical effects. We are unaware of the most appropriate such yardstick to use in this context. One side could claim that even tiny average effects matter, not only because of the cumulative effects on parts of society, but also because of possible nonadditivity of such effects, which could imply huge effects for some rare individuals and minute effects for all others. Others could argue that seemingly quite large effects have no long-run practical impact on quality of life. The results that we present provide a basic empirical starting point for these debates between people whose expertise is in this area.

We thank BX for highlighting our emphasis on the role of the assumptions in our analyses. Our hope is that we have presented the assumptions in such a way as to enable readers to make their own judgment about whether or not they are likely to be satisfied, rather than only stating them technically, which would implicitly make those important decisions for the interested but nontechnical readers.

BX point out a dearth of certain types of covariates. Part of the problem here is that we presented only a small portion of the data collected, due to the fact that our model in its current form did not handle many covariates. However, there were indeed discussions between some members of our team and MPR staff, during the development of the questionnaire, about what types of variables could be included that would be predictive of missing data or compliance, as well as outcomes. Some of the variables proposed were discarded because of the surveyors' experience with them as subject to problems such as high measurement error. Also, not every variable discussed was included, due to a desire for a survey that would not be overly burdensome for the study families.

We raise a final, small semantic question. We wonder about the propriety of the label "Neyman-Rubin model" to describe a framework for causal inference that views observational studies and randomized experiments on the same continuum and, moreover, encourages Bayesian analyses of both. Neyman (1923) certainly seems to have been the first to use the formal notation of potential outcomes for randomization-based inference in randomized experiments, which was a truly major advance, but neither he nor anyone else, to the best of our knowledge, used this formal notation to define causal effects in observational studies until Rubin (1974, 1975), nor did he or anyone else advocate Bayesian inference for the unobserved potential outcomes until Rubin (1975, 1978), which is the perspective that we have implemented here.

Overall, we agree with BX about the paramount importance of study design, and we tried to design this study carefully, certainly more so than most social science analyses of secondary datasets like the PSID or CPS. We thank BX for their thoughtful commentary; they raise some important points.

We thank all of the discussants for stimulating comments, which we believe have added emphasis to the importance of both the substantive topic of school vouchers and the development of appropriate methods to evaluate this and other such broken randomized experiments.

ADDITIONAL REFERENCES

Card, D., and Krueger, A. (1994), "Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania," American Economic Review, 84, 772-793.


Jo, B. (2002), "Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications" (with discussion), Journal of Educational and Behavioral Statistics, 27, 385-420.
Mealli, F., and Rubin, D. B. (2002), Discussion of "Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications," by B. Jo, Journal of Educational and Behavioral Statistics, 27, 411-415.
(2003), "Assumptions Allowing the Estimation of Direct Causal Effects: Discussion of 'Healthy, Wealthy, and Wise? Tests for Direct Causal Paths Between Health and Socioeconomic Status,' by Adams et al.," Journal of Econometrics, 112, 79-87.
Rubin, D. B. (1975), "Bayesian Inference for Causality: The Importance of Randomization," in Proceedings of the Social Statistics Section, American Statistical Association, pp. 233-239.


pile-up of 0s in the outcomes, incorporation of heteroscedasticity, and a multivariate model to accommodate both outcomes simultaneously. It is reassuring, from the perspective of model robustness, that the results from the last few models fit, including those in the conference proceedings report of Barnard et al. (2002), are consistent with the results from this final model.

9.1 Convergence Checks

Because posterior distributions were simulated from a Markov chain Monte Carlo algorithm (App. A), it is important to assess its convergence. To do this, we ran three chains from different starting values. To initiate each chain, we set any unknown compliance stratum equal to a Bernoulli draw, with probabilities obtained from moment estimates of the probabilities of being a complier given observed attendance and randomization data (D, Z) alone. Using the initialized compliance strata for each chain, parameters were initialized to values based on generalized linear model estimates of the model components.

Each chain was run for 15,000 iterations. At 5,000 iterations, and based on the three chains for each model, we calculated the potential scale-reduction statistic (Gelman and Rubin 1992) for the 250 estimands (parameters and functions of parameters) that serve as building blocks for all other estimands. The results suggested good mixing of the chains (with the maximum potential scale-reduction statistic across parameters 1.04) and provided no evidence against convergence. Inference is based on the remaining 30,000 iterations, combining the three chains.
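For readers unfamiliar with the potential scale-reduction statistic, a minimal sketch of the computation for one scalar estimand follows; the function name and the synthetic draws are our own illustrative assumptions, and the authors' actual computation covers 250 estimands across three chains.

```python
import numpy as np

def potential_scale_reduction(chains):
    """Gelman-Rubin R-hat for one scalar estimand; `chains` is an
    (m, n) array of m parallel chains with n retained draws each."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    B = n * chains.mean(axis=1).var(ddof=1)     # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()       # within-chain variance
    var_plus = (n - 1) / n * W + B / n          # pooled variance estimate
    return np.sqrt(var_plus / W)

# e.g., three chains of 10,000 retained draws for one estimand;
# values near 1 (such as the reported maximum of 1.04) suggest good mixing:
draws = np.random.default_rng(1).standard_normal((3, 10_000))
print(potential_scale_reduction(draws))
```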

9.2 Model Checks

We evaluate the influence of the model presented in Section 7 with six posterior predictive checks: three checks for each of the two outcomes. A posterior predictive check generally involves (a) choosing a discrepancy measure that is a function of observed data, and possibly of missing data and the parameter vector $\theta$; and (b) computing a posterior predictive p value (PPPV), which is the probability, over the posterior predictive distribution of the missing data and $\theta$, that the discrepancy measure in a new study drawn with the same $\theta$ as in our study would be as or more extreme than in our study (Rubin 1984; Meng 1996; Gelman, Meng, and Stern 1996).

Posterior predictive checks in general, and PPPVs in particular, demonstrate whether the model can adequately preserve features of the data reflected in the discrepancy measure, where the model here includes the prior distribution as well as the likelihood (Meng 1996). As a result, properties of PPPVs are not exactly the same as properties of classical p values under frequency evaluations conditional on the unknown $\theta$, just as they are not exactly the same for frequency evaluations over both levels of uncertainty, that is, the drawing of $\theta$ from the prior distribution and the drawing of data from the likelihood given $\theta$. For example, over frequency evaluations of the latter type, a PPPV is stochastically less variable than, but has the same mean as, the uniform distribution, and so tends to be more conservative than a classical p value, although the reverse can be true over frequency evaluations of the first type (Meng 1996). (For more details on the interpretation and properties of PPPVs, see also Rubin 1984 and Gelman et al. 1996.)

Table 8. Posterior Predictive Checks: p Values

         Signal    Noise    Signal to Noise
Math       .34      .79          .32
Read       .40      .88          .39

The posterior predictive discrepancy measures that we choose here are functions of

$$A^{\mathrm{rep}}_{p,z} = \left\{ Y^{\mathrm{rep}}_{ip} : I\!\left(R^{\mathrm{rep}}_{y,ip} = 1\right) I\!\left(Y^{\mathrm{rep}}_{ip} \neq 0\right) I\!\left(C^{\mathrm{rep}}_{i} = c\right) I\!\left(Z_i = z\right) = 1 \right\}$$

for the measures that are functions of data $Y^{\mathrm{rep}}_{ip}$, $R^{\mathrm{rep}}_{y,ip}$, and $C^{\mathrm{rep}}_{i}$ from a replicated study, and

$$A^{\mathrm{study}}_{p,z} = \left\{ Y_{ip} : I\!\left(R_{y,ip} = 1\right) I\!\left(Y_{ip} \neq 0\right) I\!\left(C_i = c\right) I\!\left(Z_i = z\right) = 1 \right\}$$

for the measures that are functions of this study's data. Here $R^{\mathrm{rep}}_{y,ip}$ is defined so that it equals 1 if $Y^{\mathrm{rep}}_{ip}$ is observed and 0 otherwise, and $p$ equals 1 for math outcomes and 2 for reading outcomes. The discrepancy measures, "rep" and "study," that we used for each outcome ($p = 1, 2$) were (a) the absolute value of the difference between the mean of $A_{p,1}$ and the mean of $A_{p,0}$ ("signal"); (b) the standard error based on a simple two-sample comparison for this difference ("noise"); and (c) the ratio of (a) to (b) ("signal to noise"). Although these measures are not treatment effects, we chose them here to assess whether the model can preserve broad features of signal, noise, and signal-to-noise ratio in the continuous part of the compliers' outcome distributions, which we think can be very influential in estimating the treatment effects of Section 8. More preferable measures might have been the posterior mean and standard deviation for the actual estimands in Section 8 for each replicated dataset, but this would have required a prohibitive amount of computer memory because of the nested structure of that algorithm. In settings such as these, additional future work on choices of discrepancy measures is of interest.

PPPVs for the discrepancy measures that we chose were calculated as the percentage of draws in which the replicated discrepancy measure exceeded the value of the study's discrepancy measure. Extreme values (close to 0 or 1) of a PPPV would indicate a failure of the prior distribution and likelihood to replicate the corresponding measure of location, dispersion, or their relative magnitude, and would indicate an undesirable influence of the model in estimation of our estimands. Results from these checks, displayed in Table 8, provide no special evidence for such influences of the model.
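As a concrete illustration of the mechanics (not the authors' implementation, which draws replicated datasets from the full model of Section 7), a PPPV for the "signal" discrepancy might be computed along these lines; all names and the toy replication scheme below are our own assumptions.

```python
import numpy as np

def signal(a1, a0):
    """|difference in means| between the A_{p,1} and A_{p,0} sets."""
    return abs(np.mean(a1) - np.mean(a0))

def pppv(study_value, rep_values):
    """Share of replications whose discrepancy is >= the study's."""
    return (np.asarray(rep_values, dtype=float) >= study_value).mean()

# Toy stand-in for the real loop: each posterior draw would yield a
# replicated dataset, from which the discrepancy is recomputed.
rng = np.random.default_rng(2)
rep_signals = [signal(rng.normal(3, 10, 500), rng.normal(0, 10, 500))
               for _ in range(1000)]
print(pppv(3.0, rep_signals))   # values near 0 or 1 would signal misfit
```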

10. DISCUSSION

In this article, we have defined the framework for principal stratification in broken randomized experiments to accommodate noncompliance, missing covariate information, missing outcome data, and multivariate outcomes. We make explicit a set of structural assumptions that can identify the causal effects of interest, and we also provide a parametric model that is appropriate for practical implementation of the framework in settings such as ours.

Results from our model in the school choice study do not indicate strong treatment effects for most of the subgroups examined. But we do estimate positive effects (on the order of 3 percentile points for ITT and 5 percentile points for the effect of attendance) on math scores overall for children who applied to the program from low-applicant schools, particularly for first graders. Also, the effects were larger for African-American children than for the remaining children (App. B).

Posterior distributions for the CACE estimand, which measures the effect of attendance in private school versus public school, are generally larger than the corresponding ITT effects, but are also associated with greater uncertainty. Of importance, because of the missing outcomes, a model like ours is needed for valid estimation even of the ITT effect.

The results from this randomized study are not subject to selection bias in the way that nonrandomized studies in school choice have been. Nevertheless, although we use the CACE, a well-defined causal effect, to represent the effect of attendance of private versus public schools, it is important to remember that the CACE is defined on a subset of the study children (those who would have complied with either assignment) and that, for the other children, there is no information on such an effect of attendance in this study. Therefore, as with any randomized trial based on a subpopulation, external information, such as background variables, also must be used when generalizing the CACE from compliers to other target groups of children. Also, it is possible that a broader implementation of a voucher program could have a collective effect on the public schools, if a large shift of the children who might use vouchers were to have an impact on the quality of learning for children who would stay in public schools (a violation of the no-interference assumption). Because our study contains a small fraction of participants relative to all school children, it cannot provide direct information about any such collective effect, and additional external judgment would need to be used to address this issue.

The larger effects that we estimated for children applying from schools with low versus high past scores are also not, in principle, subject to the usual regression-to-the-mean bias, in contrast to a simple before-after comparison. This is because, in our study, both types of children are randomized to be offered the scholarship or not, and both types in both treatment arms are evaluated at the same time after randomization. Instead, the differential effect for children from schools with different past scores is evidence supporting the claim that the school voucher program has greater potential benefit for children in lower-scoring schools.

Our results also reveal differences in compliance and missing-data patterns across groups. Differences in compliance indicate that children applying from low-applicant schools are generally more likely to comply with their treatment assignment; this could provide incentives for policy makers to target this subgroup of children. However, this group also exhibits higher levels of always-takers, indicating that parents of children attending these poorly performing schools are more likely to get their child into a private school even in the absence of scholarship availability. These considerations would, of course, have to be balanced with (and indeed might be dwarfed by) concerns about equity. Missing-data patterns reveal that perhaps relatively greater effort is needed in future interventions of this nature to retain study participants who stay in public schools, particularly the public-school students who actually won but did not use a scholarship.

The approach presented here also has some limitations. First, we have presented principal stratification only on a binary controlled factor Z and a binary uncontrolled factor D, and it is of interest to develop principal stratification for more levels of such factors. Approaches that extend our framework in that direction, including for time-dependent data, have been proposed (e.g., Frangakis et al. 2002b).

Also, the approach of explicitly conditioning on the patterns of missing covariates that we adopted in the parametric component of Section 7 is not as applicable when there are many patterns of missing covariates. In such cases, it would be more appropriate to use models related to those of D'Agostino and Rubin (2000) and integrate them with principal stratification for noncompliance. Moreover, it would also be interesting to investigate results for models fit to these data that allow deviations from the structural assumptions, such as weakened exclusion restrictions (e.g., Hirano et al. 2000; Frangakis, Rubin, and Zhou 2002a), although, to ensure robustness of such models, it would be important to first investigate and rely on additional alternative assumptions that would be plausible.

APPENDIX A: COMPUTATIONS

Details are available at http://biosun01.biostat.jhsph.edu/~cfrangak/papers/sc.

Table B1. ITT Estimand

Grade at                      Applicant school: Low                  Applicant school: High
application   Ethnicity   Reading           Math                 Reading            Math
1             AA          2.8 (-1.5, 7.2)   6.7 (3.0, 10.4)      1.8 (-5.2, 8.4)    6.3 (0.8, 11.9)
              Other       1.7 (-1.9, 5.4)   3.9 (0.5, 7.2)       0.8 (-4.5, 6.1)    3.5 (-1.2, 8.2)
2             AA          0.8 (-2.8, 4.4)   2.4 (-1.0, 5.8)     -0.3 (-6.6, 6.0)    2.2 (-3.8, 8.0)
              Other       0.2 (-3.1, 3.4)   0.5 (-2.8, 3.7)     -0.9 (-6.1, 4.3)    0.0 (-5.2, 4.8)
3             AA          1.1 (-3.0, 5.2)   4.6 (0.3, 8.9)      -0.1 (-6.7, 6.4)    4.2 (-2.5, 10.6)
              Other       0.3 (-3.2, 3.8)   2.0 (-2.0, 5.8)     -0.7 (-5.8, 4.4)    1.4 (-3.9, 6.5)
4             AA          3.7 (-1.2, 8.7)   4.4 (-0.5, 9.3)      2.4 (-4.7, 9.4)    3.8 (-2.9, 10.5)
              Other       2.4 (-1.7, 6.7)   1.8 (-2.6, 6.0)      1.4 (-4.3, 7.0)    1.1 (-4.4, 6.4)
Overall       AA          2.0 (-0.9, 4.8)   4.5 (1.8, 7.2)       0.8 (-5.1, 6.5)    4.2 (-1.1, 9.3)
              Other       1.1 (-1.4, 3.6)   2.1 (-0.7, 4.7)      0.1 (-4.6, 4.6)    1.6 (-2.9, 5.8)

NOTE: Plain numbers are means, and parentheses are central 95% intervals, of the posterior distribution of the effects on percentile rank.


Table B2. CACE Estimand

Grade at                      Applicant school: Low                  Applicant school: High
application   Ethnicity   Reading            Math                 Reading             Math
1             AA          3.8 (-2.0, 9.6)    9.0 (4.1, 14.0)      2.3 (-7.3, 10.9)    8.3 (1.1, 15.5)
              Other       2.9 (-3.1, 8.8)    6.3 (0.8, 11.8)      1.3 (-8.0, 10.2)    5.9 (-2.2, 13.9)
2             AA          1.1 (-3.6, 5.7)    3.1 (-1.3, 7.5)     -0.4 (-9.0, 7.9)     2.9 (-5.0, 10.7)
              Other       0.3 (-4.9, 5.4)    0.8 (-4.4, 5.9)     -1.5 (-10.5, 7.3)    0.0 (-8.5, 8.5)
3             AA          1.5 (-4.1, 7.0)    6.3 (0.3, 12.2)     -0.2 (-9.1, 8.5)     5.6 (-3.2, 14.1)
              Other       0.5 (-5.4, 6.4)    3.5 (-3.4, 10.2)    -1.3 (-10.5, 7.9)    2.7 (-7.0, 12.0)
4             AA          4.6 (-1.6, 10.8)   5.5 (-0.7, 11.6)     3.0 (-6.0, 11.6)    4.8 (-3.6, 13.1)
              Other       3.7 (-2.6, 10.1)   2.8 (-3.8, 9.3)      2.3 (-7.4, 11.7)    2.0 (-7.0, 11.0)
Overall       AA          2.6 (-1.2, 6.3)    6.0 (2.4, 9.5)       1.0 (-6.9, 8.4)     5.5 (-1.5, 12.2)
              Other       1.7 (-2.3, 5.8)    3.3 (-1.2, 7.7)      0.1 (-8.1, 7.9)     2.7 (-5.0, 10.3)

NOTE: Plain numbers are means, and parentheses are central 95% intervals, of the posterior distribution of the effects on percentile rank.

APPENDIX B: ETHNIC BREAKDOWN

The model of Section 7 can be used to estimate the effect of the program on finer strata that may be of interest. For example, to estimate effects stratified by ethnicity (AA for African-American), we obtain the posterior distribution of the causal effect of interest (ITT or CACE) using the same steps described in Section 8.1.1 or 8.1.2, but allowing ethnicity to be part of the definition of $W_p$. The results for this stratification are reported in Table B1 for the estimands of ITT and Table B2 for the estimands of CACE.

The results follow similar patterns to those in Section 8. For each grade and applicant's school (low/high) combination, however, the effects are more positive on average for the subgroup of African-American children. For both estimands, this leads to 95% intervals that are entirely above 0 for math scores for the following subgroups: African-American first- and third-graders and overall from low-applicant schools; African-American first-graders from high-applicant schools; and non-African-American first-graders from low-applicant schools. All intervals for reading covered the null value. This suggests that the positive effects reported on math scores in Section 8 for children originating from low-applicant schools are primarily attributable to gains among the African-American children in this subgroup.

[Received October 2002. Revised October 2002.]

REFERENCES

Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996), "Identification of Causal Effects Using Instrumental Variables" (with discussion), Journal of the American Statistical Association, 91, 444-472.
Barnard, J., Du, J., Hill, J. L., and Rubin, D. B. (1998), "A Broader Template for Analyzing Broken Randomized Experiments," Sociological Methods and Research, 27, 285-317.
Barnard, J., Frangakis, C., Hill, J. L., and Rubin, D. B. (2002), "School Choice in NY City: A Bayesian Analysis of an Imperfect Randomized Experiment," in Case Studies in Bayesian Statistics, Vol. V, eds. C. Gatsonis, R. E. Kass, B. Carlin, A. Carriquiry, A. Gelman, I. Verdinelli, and M. West, New York: Springer-Verlag, pp. 3-97.
Brandl, J. E. (1998), Money and Good Intentions Are Not Enough, or Why Liberal Democrats Think States Need Both Competition and Community, Washington, DC: Brookings Institute Press.
Carnegie Foundation for the Advancement of Teaching (1992), School Choice: A Special Report, San Francisco: Jossey-Bass.
Chubb, J. E., and Moe, T. M. (1990), Politics, Markets, and America's Schools, Washington, DC: Brookings Institute Press.
Cobb, C. W. (1992), Responsive Schools, Renewed Communities, San Francisco: Institute for Contemporary Studies.
Coleman, J. S., Hoffer, T., and Kilgore, S. (1982), High School Achievement, New York: Basic Books.
Cookson, P. W. (1994), School Choice: The Struggle for the Soul of American Education, New Haven, CT: Yale University Press.
Coulson, A. J. (1999), Market Education: The Unknown History, Bowling Green, OH: Social Philosophy & Policy Center.
Cox, D. R. (1958), Planning of Experiments, New York: Wiley.
D'Agostino, R. B., Jr., and Rubin, D. B. (2000), "Estimating and Using Propensity Scores With Incomplete Data," Journal of the American Statistical Association, 95, 749-759.
Education Week (1998), Quality Counts '98: The Urban Challenge: Public Education in the 50 States, Bethesda, MD: Editorial Projects in Education.
Frangakis, C. E., and Rubin, D. B. (1999), "Addressing Complications of Intention-to-Treat Analysis in the Combined Presence of All-or-None Treatment-Noncompliance and Subsequent Missing Outcomes," Biometrika, 86, 365-380.
(2001), "Addressing the Idiosyncrasy in Estimating Survival Curves Using Double-Sampling in the Presence of Self-Selected Right Censoring" (with discussion), Biometrics, 57, 333-353.
(2002), "Principal Stratification in Causal Inference," Biometrics, 58, 20-29.
Frangakis, C. E., Rubin, D. B., and Zhou, X. H. (2002a), "Clustered Encouragement Design With Individual Noncompliance: Bayesian Inference and Application to Advance Directive Forms" (with discussion), Biostatistics, 3, 147-164.
Frangakis, C. E., Brookmeyer, R. S., Varadhan, R., Mahboobeh, S., Vlahov, D., and Strathdee, S. A. (2002b), "Methodology for Evaluating a Partially Controlled Longitudinal Treatment Using Principal Stratification, With Application to a Needle Exchange Program," Technical Report NEP-06-02, Johns Hopkins University, Dept. of Biostatistics.
Fuller, B., and Elmore, R. F. (1996), Who Chooses? Who Loses? Culture, Institutions, and the Unequal Effects of School Choice, New York: Teachers College Press.
Gelfand, A. E., and Smith, A. F. M. (1990), "Sampling-Based Approaches to Calculating Marginal Densities," Journal of the American Statistical Association, 85, 398-409.
Gelman, A., Meng, X.-L., and Stern, H. (1996), "Posterior Predictive Assessment of Model Fitness via Realized Discrepancies" (with discussion), Statistica Sinica, 6, 733-760.
Gelman, A., and Rubin, D. B. (1992), "Inference from Iterative Simulation Using Multiple Sequences" (with discussion), Statistical Science, 7, 457-472.
Glynn, R. J., Laird, N. M., and Rubin, D. B. (1993), "Multiple Imputation in Mixture Models for Nonignorable Nonresponse With Follow-Ups," Journal of the American Statistical Association, 88, 984-993.
Goldberger, A. S., and Cain, G. G. (1982), "The Causal Analysis of Cognitive Outcomes in the Coleman, Hoffer and Kilgore Report," Sociology of Education, 55, 103-122.
Haavelmo, T. (1943), "The Statistical Implications of a System of Simultaneous Equations," Econometrica, 11, 1-12.
Hastings, W. (1970), "Monte Carlo Sampling Methods Using Markov Chains and Their Applications," Biometrika, 57, 97-109.
Hill, J. L., Rubin, D. B., and Thomas, N. (2000), "The Design of the New York School Choice Scholarship Program Evaluation," in Donald Campbell's Legacy, ed. L. Bickman, Newbury Park, CA: Sage Publications.
Hirano, K., Imbens, G. W., Rubin, D. B., and Zhou, A. (2000), "Assessing the Effect of an Influenza Vaccine in an Encouragement Design," Biostatistics, 1, 69-88.
Holland, P. (1986), "Statistics and Causal Inference," Journal of the American Statistical Association, 81, 945-970.
Imbens, G. W., and Angrist, J. D. (1994), "Identification and Estimation of Local Average Treatment Effects," Econometrica, 62, 467-476.


Imbens, G. W., and Rubin, D. B. (1997), "Bayesian Inference for Causal Effects in Randomized Experiments With Noncompliance," The Annals of Statistics, 25, 305-327.
Levin, H. M. (1998), "Educational Vouchers: Effectiveness, Choice, and Costs," Journal of Policy Analysis and Management, 17, 373-392.
Little, R. J. A. (1993), "Pattern-Mixture Models for Multivariate Incomplete Data," Journal of the American Statistical Association, 88, 125-134.
(1996), "Pattern-Mixture Models for Multivariate Incomplete Data With Covariates," Biometrics, 52, 98-111.
Little, R. J. A., and Rubin, D. B. (1987), Statistical Analysis With Missing Data, New York: Wiley.
Meng, X. L. (1996), "Posterior Predictive p Values," The Annals of Statistics, 22, 1142-1160.
Neal, D. (1997), "The Effects of Catholic Secondary Schooling on Educational Achievement," Journal of Labor Economics, 15, 98-123.
Neyman, J. (1923), "On the Application of Probability Theory to Agricultural Experiments: Essay on Principles, Section 9," translated in Statistical Science, 5, 465-480.
Peterson, P. E., and Hassel, B. C. (eds.) (1998), Learning from School Choice, Washington, DC: Brookings Institute Press.
Peterson, P. E., Myers, D. E., Howell, W. G., and Mayer, D. P. (1999), "The Effects of School Choice in New York City," in Earning and Learning: How Schools Matter, eds. S. E. Mayer and P. E. Peterson, Washington, DC: Brookings Institute Press.
Robins, J. M., Greenland, S., and Hu, F.-C. (1999), "Estimation of the Causal Effect of a Time-Varying Exposure on the Marginal Mean of a Repeated Binary Outcome" (with discussion), Journal of the American Statistical Association, 94, 687-712.
Rosenbaum, P. R., and Rubin, D. B. (1983), "The Central Role of the Propensity Score in Observational Studies for Causal Effects," Biometrika, 70, 41-55.
Rubin, D. B. (1974), "Estimating Causal Effects of Treatments in Randomized and Non-Randomized Studies," Journal of Educational Psychology, 66, 688-701.
(1977), "Assignment to Treatment Groups on the Basis of a Covariate," Journal of Educational Statistics, 2, 1-26.
(1978a), "Bayesian Inference for Causal Effects: The Role of Randomization," The Annals of Statistics, 6, 34-58.
(1978b), "Multiple Imputations in Sample Surveys: A Phenomenological Bayesian Approach to Nonresponse," in Proceedings of the Survey Research Methods Section, American Statistical Association, pp. 20-28.
(1979), "Using Multivariate Matched Sampling and Regression Adjustment to Control Bias in Observational Studies," Journal of the American Statistical Association, 74, 318-328.
(1980), Comment on "Randomization Analysis of Experimental Data: The Fisher Randomization Test," Journal of the American Statistical Association, 75, 591-593.
(1984), "Bayesianly Justifiable and Relevant Frequency Calculations for the Applied Statistician," The Annals of Statistics, 12, 1151-1172.
(1990), "Comment: Neyman (1923) and Causal Inference in Experiments and Observational Studies," Statistical Science, 5, 472-480.
The Coronary Drug Project Research Group (1980), "Influence of Adherence to Treatment and Response of Cholesterol on Mortality in the Coronary Drug Project," New England Journal of Medicine, 303, 1038-1041.
Tinbergen, J. (1930), "Determination and Interpretation of Supply Curves: An Example," in The Foundations of Econometric Analysis, eds. D. Hendry and M. Morgan, Cambridge, U.K.: Cambridge University Press.
Wilms, D. J. (1985), "Catholic School Effect on Academic Achievement: New Evidence From the High School and Beyond Follow-Up Study," Sociology of Education, 58, 98-114.

Comment
Bengt MUTHÉN, Booil JO, and C. Hendricks BROWN

1. INTRODUCTION

The article by Barnard, Frangakis, Hill, and Rubin (BFHR) is timely in that the Department of Education is calling for more randomized studies in educational program evaluation. (See the discussion of the "No Child Left Behind" initiative in, e.g., Slavin 2002.) BFHR can serve as a valuable pedagogical example of a successful, sophisticated statistical analysis of a randomized study. Our commentary is intended to provide additional pedagogical value to benefit the planning and analysis of future studies, drawing on experiences and research within our research group. [The Prevention Science Methodology Group (PSMG; www.psmg.hsc.usf.edu), co-PIs Brown and Muthén, has collaborated over the last 15 years with support from the National Institute of Mental Health and the National Institute on Drug Abuse.]

BFHR provides an exemplary analysis of the data from an imperfect randomized trial that suffers from several complications simultaneously: noncompliance, missing data in outcomes, and missing data in covariates. We are very pleased to see their application of cutting-edge Bayesian methods for dealing with these complexities. In addition, we believe the methodological issues and the results of the study have important implications for the design and analysis of randomized trials in education and for related policy decisions.

Bengt Muthén is Professor and Booil Jo is postdoc, Graduate School of Education and Information Studies, University of California, Los Angeles, Los Angeles, CA 90095 (E-mail: bmuthen@ucla.edu). C. Hendricks Brown is Professor, Department of Epidemiology and Biostatistics, University of South Florida, Tampa, FL 33620. The research of the first author was supported by National Institute on Alcohol Abuse and Alcoholism grant K02 AA 00230. The research of the second and third authors was supported by National Institute on Drug Abuse and National Institute of Mental Health grant MH40859. The authors thank Chen-Pin Wang for research assistance, Joyce Chappell for graphical assistance, and the members of the Prevention Science Methodology Group and the Fall ED299A class for helpful comments.

BFHR provides results of the New York City school choice experiment based on 1-year achievement outcomes. With the planned addition of yearly follow-up data, growth models can provide an enhanced examination of causal impact. We discuss how such growth modeling can be incorporated and provide a caution that applies to BFHR's use of only one posttest occasion. We also consider the sensitivity of the latent class ignorability assumption in combination with the assumption of compound exclusion.

2. LONGITUDINAL MODELING ISSUES

BFHR focuses on variation in treatment effect across compliance classes. This part of the commentary considers variation in treatment effect across a different type of class, based on the notion that the private school treatment effect might very well be quite different for children with different achievement development. (Also of interest is potential variation in treatment effects across schools, with respect to both the public school the child originated in and the private school the child was moved to, but this multilevel aspect of the data is left aside here for lack of space.) To study such a "treatment-baseline interaction" (or "treatment-trajectory interaction"), we will switch from BFHR's pretest-posttest analysis framework (essentially a very advanced ANCOVA-type analysis) to the growth mixture modeling framework of Muthén et al. (2002). An underlying rationale for this modeling is that individuals at different initial status levels and on different trajectories may benefit differently from a given treatment. ANCOVA controls for initial status as measured by the observed pretest score. Unlike the observed pretest score, the latent variable of initial status is free of time-specific variation and measurement error.

The focus on longitudinal aspects of the New York School Choice Study (NYSCS) is both substantively and statistically motivated. First, treatment effects may not have gained full strength after only a 1-year stay in a private school. The NYSCS currently has data from three follow-ups, that is, providing repeated-measures data from four grades. Although BFHR used percentile scores that do not lend themselves to growth modeling, a conversion to "scale scores" (i.e., IRT-based equated scores) should be possible, enabling growth modeling. Unfortunately, educational research traditionally uses scales that are unsuitable for growth modeling, such as percentile scores, normal curve equivalents, and grade equivalents (for a comparison in a growth context, see Seltzer, Frank, and Bryk 1994). Hopefully, this tradition can be changed. Second, the use of information from more time points than pretest and posttest makes it possible to identify and estimate models that give a richer description of the normative development in the control group and of how the treatment changes this development.

Consider three types of latent variables for individual $i$. The first type, $C_i$, refers to BFHR's compliance principal strata. The next two relate to the achievement development as expressed by a growth mixture model: $T_i$ refers to trajectory class, and $\eta_i$ refers to random effects within trajectory class (the within-class model is a regular mixed-effects model). Unlike the latent class variable $C_i$, the latent class variable $T_i$ is a fully unobserved variable, as is common in latent variable modeling (see, e.g., Muthén 2002a). Consider the likelihood expression for individual $i$, using the $[\,\cdot\,]$ notation to denote probabilities/densities:

$$[C_i, T_i \mid X_i]\,[\eta_i \mid C_i, T_i, X_i]\,[Y_i \mid \eta_i, C_i, T_i, X_i] \times [U_i \mid \eta_i, C_i, T_i, X_i]\,[R_i \mid Y_i, \eta_i, C_i, T_i, X_i], \qquad (1)$$

where $X_i$ denotes covariates, $U_i$ denotes a compliance stratum indicator (with $C_i$ perfectly measured by $U_i$ in the treatment group for never-takers and perfectly measured in the control group for always-takers, other group-class combinations having missing data), and $R_i$ denotes indicators for missing data on the repeated-measures outcomes $Y_i$ (pretreatment and posttreatment achievement scores). This type of model can be fitted in the latent variable modeling framework of the Mplus program (Muthén and Muthén 1998-2002, tech. app. 8), which has implemented an EM-based maximum likelihood estimator. (For related references, see the Mplus website, www.statmodel.com.) As a special case of (1), conventional random-effects growth modeling includes $\eta_i$ but excludes $C_i$ and $T_i$, and assumes missingness at random, so that the last term in (1) is ignored. Growth mixture modeling (Muthén and Shedden 1999; Muthén et al. 2002; Muthén and Muthén 1998-2002) includes $\eta_i$ and $T_i$. BFHR includes $C_i$, but not $T_i$ (or $\eta_i$), and includes the last term in (1), drawing on the latent ignorability of Frangakis and Rubin (1999). Muthén and Brown (2001) studied latent ignorability related to $T_i$ in the last term of (1). In randomized studies, it would be of interest to study $C_i$ and $T_i$ classes jointly, because individuals in different trajectory classes may show different compliance, and missingness may be determined by these classes jointly.

If data have been generated by a growth mixture model with treatment effects varying across trajectory classes, what would a pretest-posttest analysis such as that in BFHR reveal? To judge the possibility of such treatment-trajectory interaction in the NYSCS, we considered several recent applications of growth mixture modeling that have used $T_i$ to represent qualitatively different types of trajectories for behavior and achievement scores on children in school settings. Drawing on these real-data studies, two growth mixture scenarios were investigated. (A detailed description of these real-data studies and scenarios and their parameter values is given in Mplus Web Note 5 at www.statmodel.com/mplus/examples/webnote.html.) For simplicity, no missing data on the outcome or pretest are assumed, and $C_i$ classes are not present. In a three-class scenario, the treatment effect is noteworthy only for a 70% middle class, assuming that low-class membership (10%) hinders individuals from benefiting from the treatment and that high-class membership (20%) means individuals do not really need the treatment. The achievement development in the three-class scenario is shown in Figure 1(a), and the corresponding posttest ($y_2$)-pretest ($y_1$) regressions are shown in Figure 1(b). The lines denoted ANCOVA show a regular ANCOVA analysis, allowing for an interaction between treatment and pretest (different slopes). In the three-class scenario, the ANCOVA interaction is not significant at $n = 2000$, and the treatment effect in the middle class is underestimated by 20%, but overestimated in the other two classes. In a two-class scenario (not shown here), where the treatment is noteworthy only for individuals in the low class (50%), ANCOVA detects an interaction that is significant at the NYSCS sample size of $n = 2000$, but underestimates the treatment effect for most children in the low class. (At the low-class average pretest value of 0, the treatment effect is underestimated by 32%.)
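A toy simulation in the spirit of the three-class scenario may help fix ideas; the class proportions follow the text, but the pretest means, effect sizes, and noise levels are our own illustrative choices, not those of Mplus Web Note 5.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000

# Three latent trajectory classes: low (10%), middle (70%), high (20%);
# only the middle class benefits from treatment.
cls = rng.choice(3, size=n, p=[0.10, 0.70, 0.20])
mu_pre = np.array([-1.0, 0.0, 1.0])[cls]
effect = np.array([0.0, 0.4, 0.0])[cls]
tx = rng.integers(0, 2, n)

pre = mu_pre + 0.5 * rng.standard_normal(n)
post = mu_pre + effect * tx + 0.5 * rng.standard_normal(n)

# ANCOVA with a treatment-by-pretest interaction, via least squares:
X = np.column_stack([np.ones(n), pre, tx, tx * pre])
beta, *_ = np.linalg.lstsq(X, post, rcond=None)
print("ANCOVA treatment effect at pretest 0:", round(beta[2], 2))
print("true middle-class effect: 0.4")   # the linear fit blurs the classes
```

Under these assumed values, the ANCOVA estimate at the middle-class pretest mean lands well below 0.4, mirroring the underestimation described above, because the single linear regression averages over latent classes with different treatment effects.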

Although the NYSCS children are selected from low-performing schools (the average NYSCS math and reading percentile rankings are around 23-28), there may still be sufficient heterogeneity among children in their achievement growth to make a treatment-trajectory interaction plausible. The three-class scenario is possible, perhaps with more children in the low class relative to the other two classes. If this is the case, the ANCOVA analysis shown in Figure 1 suggests a possible reason for BFHR's finding of low treatment effects. The empirical studies and the results in Figure 1 suggest that future program evaluations may benefit from exploring variation in treatment effects across children characterized by different development. Using data from at least two posttreatment time points (three time points total), the class-specific treatment effects generated in these data can be well recovered by growth mixture modeling. (Monte Carlo simulation results are given in Mplus Web Note 5 at www.statmodel.com/mplus/examples/webnote.html.)


Figure 1. Growth Mixture Modeling Versus Pretest-Posttest Analysis.

A more flexible analysis is obtained with more posttreatment time points. An improved design for the determination of the latent trajectory classes is to use more than one pretreatment time point, so that the trajectory class membership is better determined before the treatment starts.

3. COMPOUND EXCLUSION AND LATENT IGNORABILITY

Based on the ideas of principal stratification (Frangakis and Rubin 2002) and latent ignorability (Frangakis and Rubin 1999), BFHR successfully demonstrates that the complexities of educational studies can be better handled under more explicit and flexible sets of assumptions. Although we think that their structural assumptions are reasonable in the NYSCS, we would like to add some thoughts on the plausibility of two other assumptions, considering more general situations.

Compound exclusion (CE) is one of the key structural assumptions in identifying principal effects under latent ignorability. However, the plausibility of this assumption can be questioned in practice (Frangakis et al. 2002; Hirano et al. 2000; Jo 2002a, 2002c; Shadish, Cook, and Campbell 2002; West and Sagarin 2000). In the NYSCS, it seems realistic to assume that winning a lottery has some positive impact on always-takers; however, it is less clear how winning a lottery will affect never-takers. One possibility is that winning a lottery has a negative impact on parents, because they fail to benefit from it. Discouraged parents may have a negative influence on a child's test scores or response behaviors. This negative effect may become more evident if noncompliance is due to parents' low expectations of, or lack of interest in, their child's education. Another possibility is that winning a lottery has a positive impact on a child's test scores or response behaviors. For example, parents who are discouraged by being unable to send their child to private schools even with vouchers may try harder to improve the quality of existing resources (e.g., in the public school their child attends) and be more motivated to support their child to improve his or her academic performance. Given these competing possibilities, it is not easy to predict whether and how CE is violated.

Depending on the situation, causal effect estimates can be quite sensitive to violation of the exclusion restriction in outcome missingness (Jo 2002b), which is less well known than the impact of violating the exclusion restriction in observed outcomes (Angrist et al. 1996; Jo 2002a). The implication of possible violation of CE, and of its impact, is that the relative benefit of models assuming latent ignorability (LI) and standard ignorability (SI) depends on the degrees of deviation from CE and SI. Identification of causal effects under LI relies on the generalized (compound) exclusion restriction (i.e., both on the outcomes and on the missingness of outcomes), whereas identification of causal effects under SI relies on the standard exclusion restriction (i.e., only on the outcomes). Therefore, in some situations, the impact of deviation from CE may outweigh the impact of deviation from SI, resulting in more biased causal effect estimates in models assuming LI than in models assuming SI (Jo 2002b). For example, if SI holds but CE is seriously violated (say, a 20% increase in the response rate due to treatment assignment for compliers and a 15% increase for never-takers), causal effect estimates and the coverage probability assuming LI and CE can drastically deviate from the true value and the nominal level. This type of violation does not affect models assuming SI and the standard exclusion restriction. To empirically examine the plausibility of SI, LI, and CE, it will be useful to do sensitivity analyses of models imposing different combinations of these assumptions. As BFHR points out, this investigation can be conducted by relaxing compound exclusion (e.g., Frangakis et al. 2002; Hirano et al. 2000) or by using alternative structural assumptions (e.g., Jo 2002c). More research is needed to examine the efficiency of these alternative models and to explore factors associated with insensitivity of LI models to violation of compound exclusion.

4. CONCLUSION

Causal inferences of the type BFHR provides are a dramatic improvement over the existing literature now available on the question of whether school choice will produce better achievement outcomes for children in an urban public school system. The randomized lottery provides an exceptionally powerful tool for examining the impact of a program, far more useful than observational studies that have causal change intertwined hopelessly with self-selection factors. Statisticians are just now investigating variations in such principal strata analyses, that is, those involving latent classes formed as a function of randomized trials involving intervention invitations (such as vouchers), encouragement designs, and field trial designs involving more than one randomization (Brown and Liao 1999). The latent categories in this article, which BFHR labels "complier," "never-taker," "always-taker," and "defier," represent only one type of design. Other terms may be more relevant to the scientific questions underlying trials in which subjects are randomly assigned to different levels of invitation (e.g., Angrist and Imbens 1995) or different levels of implementation. Such trials not only have great potential for examining questions of effectiveness, sustainability, and scalability, but also require terms more consistent with adherence than compliance. Again, we congratulate the authors on an important addition to the methodological literature that we predict will have lasting impact.

ADDITIONAL REFERENCES

Angrist, J. D., and Imbens, G. W. (1995), "Two-Stage Least Squares Estimation of Average Causal Effects in Models With Variable Treatment Intensity," Journal of the American Statistical Association, 90, 431–442.

Brown, C. H., and Liao, J. (1999), "Principles for Designing Randomized Preventive Trials in Mental Health: An Emerging Developmental Epidemiologic Perspective," American Journal of Community Psychology, 27, 673–709.

Jo, B. (2002a), "Model Misspecification Sensitivity Analysis in Estimating Causal Effects of Interventions With Noncompliance," Statistics in Medicine, 21, 3161–3181.

(2002b), "Sensitivity of Causal Effects Under Ignorable and Latent Ignorable Missing-Data Mechanisms," available at www.statmodel.com/mplus/examples/jo.

(2002c), "Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications" (with discussion), Journal of Educational and Behavioral Statistics, 27, 385–420.

Muthén, B. (2002a), "Beyond SEM: General Latent Variable Modeling," Behaviormetrika, 29, 81–117.

Muthén, B., and Brown, C. H. (2001), "Non-Ignorable Missing Data in a General Latent Variable Modeling Framework," presented at the annual meeting of the Society for Prevention Research, Washington, DC, June 2001.

Muthén, B., Brown, C. H., Masyn, K., Jo, B., Khoo, S. T., Yang, C. C., Wang, C. P., Kellam, S., Carlin, J., and Liao, J. (2002), "General Growth Mixture Modeling for Randomized Preventive Interventions," Biostatistics, 3, 459–475.

Muthén, L., and Muthén, B. (1998–2002), Mplus User's Guide, Los Angeles: authors.

Muthén, B., and Shedden, K. (1999), "Finite Mixture Modeling With Mixture Outcomes Using the EM Algorithm," Biometrics, 55, 463–469.

Seltzer, M. H., Frank, K. A., and Bryk, A. S. (1993), "The Metric Matters: The Sensitivity of Conclusions About Growth in Student Achievement to Choice of Metric," Educational Evaluation and Policy Analysis, 16, 41–49.

Shadish, W. R., Cook, T. D., and Campbell, D. T. (2002), Experimental and Quasi-Experimental Designs for Generalized Causal Inference, Boston: Houghton Mifflin.

Slavin, R. E. (2002), "Evidence-Based Education Policies: Transforming Educational Practice and Research," Educational Researcher, 31, 15–21.

West, S. G., and Sagarin, B. J. (2000), "Participant Selection and Loss in Randomized Experiments," in Research Design, ed. L. Bickman, Thousand Oaks, CA: Sage, pp. 117–154.

Comment

Alan B. KRUEGER and Pei ZHU

The New York City school voucher experiment provides some of the strongest evidence on the effect of private school vouchers on student test achievement yet available. Barnard et al. provide a useful reevaluation of the experiment, and some of the authors were integrally involved in the design of the experiment. We will leave it to other discussants to comment on the Bayesian methodological advances in their paper, and instead comment on the substantive lessons that can be learned from the experiment and the practical lessons that emerge from the novel design of the experiment.

We have the advantage of having access to more complete data than Barnard et al. used in preparing their paper. This includes more complete baseline test information, data on multichild families as well as single-child families, and three years of follow-up test data instead of just one year of follow-up data.

Alan B. Krueger is Professor, and Pei Zhu is a graduate student, Economics Department, Princeton University, Princeton, NJ 08544 (E-mail: [email protected]). In part, this comment extends and summarizes results of Krueger and Zhu (2003); readers are invited to read that article for a more in-depth analysis of many of the issues raised in this comment. Some of the results presented here differ slightly from those in our earlier article, however, because the definition of family size in this comment corresponds to the one used by Barnard et al. rather than to the definition in our earlier article.

In our comment, we use the more comprehensive sample because results for this sample are more informative, but we highlight where differences arise from using the sample analyzed by Barnard et al.

Three themes emerge from our analysis. First, simplicity and transparency are underappreciated virtues in statistical analysis. Second, it is desirable to use the most recent, comprehensive data for the widest sample possible. Third, there is no substitute for probing the definitions and concepts that underlie the data.

1. RANDOM ASSIGNMENT

As Barnard et al. explain, the voucher experiment entailed a complicated block design, and different random assignment procedures were used in the first and second set of blocks. In the first block, a propensity matched-pairs design (PMPD) method was used. In this block, far more potential control families were available than there was money to follow them up. Rather than select a random sample of controls for follow-up after randomly selecting the treatments, the researchers estimated a propensity score model to align controls with the treatments, and then used a nearest-available-neighbor Mahalanobis match to select specific paired control families for follow-up. In principle, the PMPD design should improve the precision of the estimates by reducing the chance imbalances that can occur in a simple randomized block design. A standard randomized block design was used in the remaining four blocks.
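As a rough illustration of the mechanics, a two-stage match of this kind can be sketched as follows. This is not MPR's actual implementation (which, as we note below, was proprietary and not described in sufficient detail to replicate); the covariates, caliper, and matching order are all illustrative assumptions.

import numpy as np
from sklearn.linear_model import LogisticRegression

def pmpd_match(X_treat, X_ctrl, caliper=0.1):
    # Stage 1: estimate each family's propensity of being in the
    # treatment group from baseline covariates.
    X = np.vstack([X_treat, X_ctrl])
    z = np.r_[np.ones(len(X_treat)), np.zeros(len(X_ctrl))]
    ps = LogisticRegression(max_iter=1000).fit(X, z).predict_proba(X)[:, 1]
    ps_t, ps_c = ps[:len(X_treat)], ps[len(X_treat):]

    # Stage 2: nearest-available-neighbor Mahalanobis matching among
    # controls whose propensity score lies within the caliper.
    S_inv = np.linalg.pinv(np.cov(X.T))
    available = set(range(len(X_ctrl)))
    matches = {}
    for i in np.argsort(-ps_t):           # match hardest cases first
        cands = [j for j in available if abs(ps_c[j] - ps_t[i]) < caliper]
        cands = cands or list(available)  # fall back if the caliper is empty
        d = X_ctrl[cands] - X_treat[i]
        dist = np.einsum('ij,jk,ik->i', d, S_inv, d)
        j = cands[int(np.argmin(dist))]
        matches[i] = j                    # treated unit i paired with control j
        available.remove(j)
    return matches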


Table 1. Efficiency Comparison of Two Methods of Random Assignment: Estimated Treatment Effect and Standard Error, by Method of Random Assignment

                               Propensity score         Randomized blocks       Relative         Sample-size-adjusted
                               match subsample          subsample               sampling         relative sampling
Year    Model                  Coefficient     N        Coefficient     N       variance         variance
                                                                                (RB/PMPD)        (RB/PMPD)

All students
Year 1  Control for baseline   .33 (1.40)      721      2.10 (1.38)     734     .972             .989
        Omit baseline          .32 (1.40)      1,048    -1.02 (1.58)    1,032   1.274            1.254
Year 3  Control for baseline   1.02 (1.56)     613      .67 (1.74)      637     1.244            1.293
        Omit baseline          -.31 (1.57)     900      -.45 (1.78)     901     1.285            1.287

African-American students
Year 1  Control for baseline   .32 (1.88)      302      8.10 (1.71)     321     .827             .879
        Omit baseline          2.10 (1.99)     436      3.19 (2.07)     447     1.082            1.109
Year 3  Control for baseline   4.21 (2.51)     247      6.57 (2.09)     272     .693             .764
        Omit baseline          2.26 (2.47)     347      3.05 (2.31)     386     .875             .973

NOTE: Standard errors are in parentheses. The treatment effect coefficient is from a regression of test scores on a dummy indicating assignment to receive a voucher (1 = yes), lottery randomization strata dummies, and, in indicated models, baseline test scores. Bootstrap standard errors account for within-family correlation in residuals. Year 1 or 3 refers to the follow-up year. The last column adjusts the relative sampling variances of the treatment effects for differences in relative sample sizes. The sample of African-American students consists of those whose mothers' race/ethnicity is identified as non-Hispanic Black/African-American.


Barnard et al. emphasize that the PMPD block was more balanced for 15 of the 21 baseline variables that they examined. But the key question is the extent to which the precision of the estimates was improved, not the covariate balance, because both blocks yield unbiased estimates. If the covariates explain relatively little of the outcome variable, then the PMPD will add very little. In Table 1 we provide standard intent-to-treat (ITT) estimates, that is, the difference in mean test scores between treatments and controls, conditional on the strata used for random assignment (family size by block by high/low-achieving school), and, more importantly, their standard errors, separately for the PMPD subsample and the randomized block subsample. The outcome variable is the average of the national percentile rank on the math and reading segments of the Iowa Test of Basic Skills, taken either 1 year or 3 years after random assignment. Two sets of results are presented: one set controlling for baseline test scores, for the subsample with baseline scores, and the other without controlling for scores, for the larger sample. The last column adjusts the relative sampling variances for differences in sample size between the blocks.
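The estimates in Table 1 are conventional regressions of the following form (a minimal sketch: the column names, weights, and bootstrap size are placeholders rather than the actual data layout):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def itt_estimate(df, n_boot=500, seed=0):
    # ITT effect: coefficient on the voucher-offer dummy in a weighted
    # regression of the percentile-rank score on lottery strata dummies.
    fit = smf.wls("score ~ voucher + C(stratum)", data=df,
                  weights=df["weight"]).fit()
    itt = fit.params["voucher"]

    # Block bootstrap over whole families, to respect within-family
    # correlation in residuals (as in the notes to Tables 1 and 2).
    rng = np.random.default_rng(seed)
    fams = df["family_id"].unique()
    boot = []
    for _ in range(n_boot):
        draw = rng.choice(fams, size=len(fams), replace=True)
        sample = pd.concat([df[df["family_id"] == f] for f in draw])
        boot.append(smf.wls("score ~ voucher + C(stratum)", data=sample,
                            weights=sample["weight"]).fit().params["voucher"])
    return itt, np.std(boot, ddof=1)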

For all students (panel A), the standard errors are about 10% smaller in the PMPD sample. For the Black students (panel B), however, the standard errors are essentially the same or slightly larger in the PMPD sample. This finding is unexpected, because Hill, Rubin, and Thomas (2000) predicted that analyses of subgroups would have more power in the PMPD design, because matching should lead to a more nearly equal representation of subgroups in the treatment and control groups. In addition, the covariate balance is less good for the Black students in the PMPD than in the randomized blocks if we compute Z scores for the differences between treatments and controls for the baseline covariates. One possibility is that because Black students could have been matched to non-Black students in the PMPD design, this subsample was actually less balanced than the subsample of Blacks from the more conventional randomized block design.

The increase in power from the PMPD design is relatively modest, even in the full sample. Of much more practical importance is the fact that the complexity of the PMPD design caused Mathematica to initially calculate incorrect baseline sample weights. The baseline weights were designed to make the follow-up sample representative of all eligible applicants for vouchers. The initial weights assigned much too little importance to the control sample in the PMPD block. Because these families represented many unselected controls, they should have been weighted heavily. The revised weights increased the weight on the PMPD controls by 620%; in contrast, the weight increased by just 18% for the rest of the sample.

This error had grave consequences for the initial inferences and analyses of the data. First, using the faulty weights, Peterson, Myers, and Howell (1998) reported highly statistically significant differences in baseline test scores between the treatment and control groups, with control group members scoring higher on both the math and reading exams. After the mistake was discovered and the weights were revised, however, the baseline differences became small and statistically insignificant. (We note, however, that if we limit attention to the students from single-child families that Barnard et al. use, and use the more complete baseline test score data, then there is a statistically significant difference in baseline reading scores between treatments and controls, even with the revised baseline weights. If we pool together students from single-child and multichild families, then the baseline difference is insignificant. This is another reason why we think it is more appropriate to use the broader sample that includes children from all families.)

Second, because of the inaccurate inference that there was a difference in baseline ability between treatments and controls, Mathematica's researchers were discouraged from analyzing (or at least from presenting) results that did not condition on baseline scores. This is unfortunate, because conditioning on baseline scores caused the researchers to drop from the sample all the students who were initially in kindergarten (because these students were not given baseline tests) and 11% of the students initially in grades 1–4. Including students with missing baseline scores increases the sample by more than 40% and expands the population to which the results can be generalized. As explained later, and in an article by Krueger and Zhu (2003), qualitatively different results are found if students with missing baseline scores are included in the sample. Because assignment to receive a voucher was random within lottery strata, estimates that do not condition on baseline scores are nonetheless unbiased. Moreover, it is inefficient to exclude students with missing baseline scores (most of whom had follow-up test scores), and such a sample restriction can potentially cause sample selection bias.

Of lesser importance is the fact that the PMPD design also complicates the calculation of standard errors. The matching of treatments and controls on selected covariates creates a dependence between paired observations. Moreover, if the propensity score model is misspecified, then the equation error also creates a dependence between observations. Researchers have not taken this dependence into account in the calculation of standard errors. We suspect, however, that this is likely to cause only a small understatement of the standard errors, because the covariates do not account for much of the residual variance of test scores and because the PMPD sample is only about half of the overall sample. (We tried to compute the propensity score to adjust the standard errors ourselves, but were unable to replicate the PMPD model, because it was not described in sufficient detail and because the computer programs are proprietary. This also prevented us from replicating estimates in Table 4 of Barnard et al.)

The problems created by the complicated experimental design, particularly concerning the weights, lead us to reiterate the advice of Cochran and Cox (1957): "A good working rule is to use the simplest experimental design that meets the needs of the occasion." It seems to us that in this case the PMPD design introduced unnecessary complexity that inadvertently led to a consequential mistake in the computation of weights, with very little improvement in efficiency. This was not the fault of the architects of the PMPD design, who did not compute the weights, but it was related to the complexity of the design. Simplicity and transparency, in designs, methods, and procedures, can help avoid mistakes down the road, which are almost inevitable in large-scale empirical studies. Indeed, Mathematica recently informed us that they still do not have the baseline weights exactly correct, 5 years after random assignment.

2. INTENT-TO-TREAT ESTIMATION RESULTS

Barnard et al. devote much attention to addressing potential problems caused by missing data for the sample of students in grades 1–4 at baseline; this follows the practice of earlier reports by Mathematica, which restricted the analysis of student outcomes (but not parental responses, such as satisfaction) to those enrolled in grades 1–4 at baseline. In our opinion, a much more important substantive problem results from the exclusion of students in the kindergarten cohort, who were categorically dropped because they were not given baseline tests. Because assignment to treatment status was random (within strata), a simple comparison of means between treatments and controls, without conditioning on baseline scores, provides an unbiased estimate of the average treatment effect. Moreover, as Barnard et al. and others show, treatments and controls were well balanced in terms of baseline characteristics, so there is no reason to suspect that random assignment was somehow corrupted.

Table 2 presents regression estimates of the ITT effect using various samples. Because we cannot replicate the Bayesian estimates without knowing the propensity score, we present conventional ITT estimates. For each entry in the first column, we regressed the test score percentile rank on a voucher offer dummy, 30 dummies indicating lottery strata (block × family size × high/low school), and baseline test scores. The sample is limited to those with baseline test data. Our results differ from the ITT estimates in Table 5 of Barnard et al. because we use the revised weights and a more comprehensive sample that also includes students from multichild families. Barnard et al. report that their Bayesian estimates are "generally more stable" than the conventional ITT estimates, "which in some cases are not even credible," but the extreme ITT estimates that they refer to, for 4th graders from high-achieving schools, are based on only 15 students. For the samples that pool students across grades, the results of the conventional ITT estimates are a priori credible and probably not statistically different from their Bayesian estimates.

In the second column, we use the same sample as in column 1 but omit the baseline test score from the model. In the third column, we expand the sample to include those with missing baseline scores. In the fourth column, we continue to include students with missing baseline scores, but restrict the sample to those initially in low-achieving public schools.

We focus mainly on the results for Black students and students from low-achieving public schools, because the effect of offering a voucher on either reading or math scores for the overall sample is always statistically insignificant and because most public policy attention has focused on these two groups.

Although it has not been made explicit in previous studies of these data, Black students have been defined as children with a non-Hispanic Black/African-American mother. We use this definition and a broader one to probe the sensitivity of the results.

Omitting baseline scores has little qualitative effect on the estimates when the same sample is used (compare columns 1 and 2); the coefficient typically changes by no more than half a standard error. For year 3 scores for Black students, for example, the treatment effect on the composite score is 5.2 points (t = 3.2) when controlling for baseline scores and 5.0 points (t = 2.6) when not controlling. This stability is expected in a randomized experiment.


Table 2. Estimated Treatment Effects With and Without Controlling for Baseline Scores: Coefficient on Voucher Dummy From Test Score Regression, for Various Samples

                                                Subsample with      Subsample with
                                                baseline scores,    baseline scores,    Full sample,       Applicant school Low,
Test       Sample                               controls for        omits baseline      omits baseline     omits baseline
                                                baseline scores     scores              scores             scores

First follow-up test
Composite  Overall                              1.27 (.96)          .42 (1.28)          -.33 (1.06)        -.60 (1.06)
           Mother Black, non-Hispanic           4.31* (1.28)        3.64* (1.73)        2.66 (1.42)        1.75 (1.57)
           Either parent Black, non-Hispanic    3.55* (1.24)        2.52 (1.73)         1.38 (1.42)        .53 (1.55)
Reading    Overall                              1.11 (1.04)         .03 (1.36)          -1.13 (1.17)       -1.17 (1.18)
           Mother Black, non-Hispanic           3.42* (1.59)        2.60 (1.98)         1.38 (1.74)        .93 (1.84)
           Either parent Black, non-Hispanic    2.68 (1.50)         1.49 (1.97)         -.09 (1.71)        -.53 (1.81)
Math       Overall                              1.44 (1.17)         .80 (1.45)          .47 (1.17)         -.03 (1.15)
           Mother Black, non-Hispanic           5.28* (1.52)        4.81* (1.93)        4.01* (1.54)       2.68 (1.63)
           Either parent Black, non-Hispanic    4.42* (1.48)        3.54 (1.90)         2.86 (1.51)        1.59 (1.63)

Third follow-up test
Composite  Overall                              .90 (1.14)          .36 (1.40)          -.38 (1.19)        .10 (1.16)
           Mother Black, non-Hispanic           5.24* (1.63)        5.03* (1.92)        2.65 (1.68)        3.09 (1.67)
           Either parent Black, non-Hispanic    4.70* (1.60)        4.27* (1.92)        1.87 (1.68)        1.90 (1.67)
Reading    Overall                              .25 (1.26)          -.42 (1.51)         -1.00 (1.25)       -.08 (1.24)
           Mother Black, non-Hispanic           3.64 (1.87)         3.22 (2.19)         1.25 (1.83)        2.06 (1.83)
           Either parent Black, non-Hispanic    3.37 (1.86)         2.75 (2.18)         .76 (1.83)         1.23 (1.82)
Math       Overall                              1.54 (1.30)         1.15 (1.53)         .24 (1.33)         .29 (1.28)
           Mother Black, non-Hispanic           6.84* (1.86)        6.84* (2.09)        4.05* (1.86)       4.13* (1.85)
           Either parent Black, non-Hispanic    6.04* (1.83)        5.78* (2.09)        2.97 (1.85)        2.58 (1.82)

NOTE: Standard errors are in parentheses. The dependent variable is the test score national percentile rank (NPR); the reported coefficient is the coefficient on the voucher offer dummy. All regressions control for 30 randomization strata dummies; models in the first column also control for baseline math and reading test scores. Bootstrap standard errors are robust to correlation in residuals among students in the same family. An asterisk indicates that the absolute t ratio exceeds 1.96.

Sample sizes in year 1 for the subsample with/without baseline scores are: overall, 1,455/2,080; mother Black, 623/883; either parent Black, 684/968.

Sample sizes in year 3 for the subsample with/without baseline scores are: overall, 1,250/1,801; mother Black, 519/733; either parent Black, 572/807.

Sample sizes in year 1 for the subsample from low schools, without baseline scores, are: overall, 1,802; mother Black, 770; either parent Black, 843.

Sample sizes in year 3 for the subsample from low schools, without baseline scores, are: overall, 1,569; mother Black, 647; either parent Black, 712.

When the sample is expanded to include those with missing baseline scores in column 3, however, the ITT effect in year 3 for Black students falls almost by half, to 2.65 points, with a t ratio of 1.58. The effect on the math score is statistically significant, as Barnard et al. emphasize. Qualitatively similar results are found if baseline covariates, such as mother's education and family income, are included as regressors (see Krueger and Zhu 2003).

The classification of Black students used in the study is somewhat idiosyncratic. Race and ethnicity were collected in a single question in the parental survey, contrary to the OMB guidelines for government surveys, so it is impossible to identify Blacks of Hispanic origin (of whom there are a large number in New York). In addition, the survey did not directly ask about the child's race or ethnicity, so children's race must be inferred from parents' race/ethnicity. A broader definition of Black students, and one that probably comes closer to the definition used in government surveys and most other educational research, would also include those who have a Black father in the sample of Black students. According to the 1990 Census, race was reported as Black for 85% of the children with a Black father and a Hispanic mother (the overwhelming non-Black portion of the sample) in the New York metropolitan region.

Hence we also present results for a sample in which either parent is listed as "Black, non-Hispanic." This increases the sample of Black students by about 10%, and the results are even weaker for this sample. For example, the ITT estimate using the more comprehensive sample of Black students enrolled in grades K–4 at baseline is 1.87 points (t = 1.11) on the third-year follow-up composite test, although even with this sample the effect on math is significant at the 10% level in year 1, as Barnard et al. have found, and almost significant in year 3. So we conclude that the effect of the opportunity to use a private school voucher on the composite score for the most comprehensive sample of Black students is insignificantly different from 0, although it is possible that there was initially a small beneficial effect on the math score for Black students.

In the final column, we present results for the subsample of students originally enrolled in schools with average test scores below the median score in New York City. In each case, the results for this subsample are quite similar to those for the full sample, and a formal test of the null hypothesis that students from low- and high-achieving schools benefit equally from vouchers is never close to rejecting. It also appears very unlikely that the differences between the treatment effects for applicants from low- and high-achieving schools in Barnard et al.'s Table 4 are statistically significant, either.

3. WHAT WAS BROKEN?

We agree with Barnard et al. that the experiment was broken in the sense that attrition and missing data were common. Previous analyses were also strained, if not broken, for their neglect of the cohort of students originally in kindergarten, who were in third grade at the end of the experiment and whose follow-up test scores were ignored. Including these students qualitatively alters the results for Black students. The experiment was also broken in the sense that years passed before correct baseline weights were computed.

We disagree, however, with the interpretation that the experiment was broken because compliance was less than 100%. This depends on the question that one is interested in answering. Because most interest in the experiment for policy purposes centers on the impact of offering vouchers on achievement, not on compelling students to use vouchers, we think the ITT estimates, which reflect inevitable partial usage of vouchers, are most relevant (see also Rouse 1998; Angrist et al. 2003). Moreover, New York had a higher voucher take-up rate than the experiments in Dayton and the District of Columbia, so one could argue that the New York experiment provides an upper bound estimate of the effect of offering vouchers. On the other hand, if the goal is to use the experiment to estimate the effect of attending private school on achievement, then such methods as instrumental variables or those used by Barnard et al. are necessary to estimate the parameter of interest. We consider this of secondary interest in this case, however.

ADDITIONAL REFERENCES

Angrist, J., Bettinger, E., Bloom, E., King, E., and Kremer, M. (2003), "Vouchers for Private Schooling in Colombia: Evidence From a Randomized Natural Experiment," American Economic Review, 92, 1535–1558.

Cochran, W., and Cox, G. (1957), Experimental Designs (2nd ed.), New York: Wiley.

Krueger, A., and Zhu, P. (2003), "Another Look at the New York City School Voucher Experiment," American Behavioral Scientist, forthcoming; also available from Industrial Relations Section, Princeton University, at http://www.irs.princeton.edu.

Peterson, P., Myers, D., and Howell, W. (1998), "An Evaluation of the New York City Scholarships Program: The First Year," Mathematica Policy Research, Inc., available at http://www.ksg.harvard.edu/pepg/pdf/ny1rpt.pdf.

Rouse, C. (1998), "Private School Vouchers and Student Achievement: An Evaluation of the Milwaukee Parental Choice Program," The Quarterly Journal of Economics, 113, 553–602.

Comment

Richard A. BERK and Hongquan XU

1. INTRODUCTION

The article by Barnard, Frangakis, Hill, and Rubin (hereafter BFHR) is a virtuoso performance. By applying the Neyman–Rubin model of causal effects and building on a series of important works on the estimation and interpretation of causal effects (e.g., Angrist, Imbens, and Rubin 1996), BFHR manage to extract a reasonable set of findings for a "broken" randomized experiment. The article more generally underscores the important difference between estimating relationships in a consistent manner and interpreting those estimates in causal terms; obtaining consistent estimates is only part of the enterprise. Finally, the article emphasizes that assumptions made so that proper estimates may be obtained are not just technical moves of convenience. Rather, they are statements about how the empirical world is supposed to work. As such, they need to be examined concretely, with reference to the phenomena being studied; evaluated empirically whenever possible; and, at a minimum, subjected to sensitivity tests.

Richard A. Berk is Professor and Hongquan Xu is Assistant Professor, Department of Statistics, University of California, Los Angeles, CA 90095 (E-mail: [email protected]).


As battle-tested academics, we can certainly quibble here and there about some aspects of the article. For example, might it not have made sense to construct priors from the educators and economists who were responsible for designing the intervention? Moreover, might it not have made sense to interpret the results using a yardstick of treatment effects that are large enough to matter in practical terms? Nonetheless, we doubt that overall we could have done as well as BFHR did. Moreover, most of our concerns depend on features of the data that cannot be known from a distance; we would have had to carefully examine the data ourselves. As a result, we focus on why a virtuoso performance was needed to begin with. Why was this trip necessary?


2. NONCOMPLIANCE

The implementation of randomized field experiments, as desirable as they are, invites any number of well-known problems (Berk 1990). One of these problems is that human subjects often do not do what they are told, even when they say they will. There is a recent scholarly literature on noncompliance (Metry and Meyer 1999; Johnson 2000) and a citation trail leading back to evaluations of the social programs of the War on Poverty (Haveman 1977). Some noncompliance can be expected except when there is unusual control over the study subjects. Our recent randomized trial of the inmate classification system used by the California Department of Corrections is one example (Berk, Ladd, Graziano, and Baek 2003).

An obvious question follows: why does it seem that the individuals who designed this experiment were caught by surprise? As best we can tell, little was done to reduce noncompliance, and, more important for the analyses of BFHR, apparently almost no data were collected on factors that might be explicitly related to noncompliance (Hill, Rubin, and Thomas 2000).

Without knowing far more about the design stages of the experiment, we find it difficult to be very specific. But from the literatures on compliance and school choice, the following are probably illustrative indicators relevant for parents who would not be inclined to use a school voucher; for other forms of noncompliance, other measures would be needed:

1. Whether the adults in the household are employed outside the home during the day
2. Whether there is any child care at the private schools before classes begin and after they end, for children of employed parents
3. A household's ability to pay for private school beyond the value of the voucher
4. Travel time to the nearest private schools
5. Availability of transportation to those schools
6. Religious preference of the household and religious affiliation of the private school

From these examples, and others that individuals closer to the study would know about, one can imagine a variety of measures to reduce noncompliance. For instance, key problems for any working parent are child care before and after school and round-trip school transportation; these might be addressed directly with supplemental support. If no such resources could be obtained, then the study design might be altered to screen out, before random assignment, households for whom compliance was likely to be problematic.

3. MISSING DATA

No doubt BFHR know more about the subtleties of the missing data than we do. Nevertheless, the same issues arise for missing data that arise for noncompliance. Surely problems with missing data could have been anticipated. Factors related to nonresponse to particular items could have been explicitly measured, and prevention strategies perhaps could have been deployed.

4. DESIGN EFFICIENCY

A carefully designed experiment not only provides a solid basis for statistical inference, but also gives credible evidence for causation. The New York School Choice Scholarships Program (NYSCSP) was apparently the first good example of a randomized experiment to examine the potential benefits of vouchers for private schools. Offering a scholarship (treatment) randomly through a lottery can (1) provide a socially acceptable way to ration scarce resources, (2) protect against undesirable selection bias, and (3) produce balance between the observed and unobserved variables that may affect the response. In addition to the randomization and blocking, the NYSCSP implemented a PMPD (propensity matched pairs design) to choose a control group from a large candidate pool and to balance many covariates explicitly. Table 2 of BFHR shows that various background variables were balanced properly between the treatment group and the control group.

But might it have been possible to do a bit better? BFHR did not mention that the sample size of the applicant's school was not balanced, although winners from high and low schools were balanced between the treatments and the controls. An ideal design would have had equal numbers of winners from "bad" (or low) and "good" (or high) public schools. In the NYSCSP, 85% of the winners were from bad schools and 15% were from good schools, to achieve the social goals of the School Choice Scholarships Foundation (Hill et al. 2000, p. 158). Bad schools are defined as "those for which the average test scores were below the median test scores for the city" (Hill et al. 2000, p. 158). A direct consequence is that the estimates from the good schools have much larger variation than those from the bad schools, which was evident in BFHR's Tables 4–6: the confidence intervals for good schools are wider.

Instead of classifying schools into two classes (bad and good), it might have been helpful to classify them into four classes: very bad, bad, good, and very good. To estimate the effect of the applicant's school (assuming linear effects), an optimal design would select half of the winners from very bad schools and the other half from very good schools, and none from the two middle classes. Avoiding the middle schools enlarges the distance between bad and good schools and also reduces the source variation within the bad and good schools, and thus reduces the variation in the estimates. To be more consistent with the goals of the School Choice Scholarships Foundation, an alternative design might take 75% of winners from very bad schools, 10% from bad schools, 5% from good schools, and 10% from very good schools. This design would have a higher design efficiency than the original design, especially because the applicant's school was the most important design variable after family size (Hill et al. 2000, p. 160). Of course, people on the ground at the time may have considered these options and rejected them. Our point is that with better planning, it might have been possible to squeeze more efficiency out of the study design. Such an outcome would have made the job undertaken by BFHR a bit easier.
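The variance argument can be made concrete with a small numeric sketch. Suppose, purely for illustration, that school quality is a score uniformly distributed within each class and that its effect is linear; the variance of the least-squares slope is then proportional to 1 / sum((x - xbar)^2), so allocations that push winners toward the extremes are more efficient.

import numpy as np

rng = np.random.default_rng(1)
n = 2000

def design_factor(x):
    # Var(slope) in simple linear regression is sigma^2 / sum((x - xbar)^2);
    # return the design-dependent factor 1 / sum((x - xbar)^2).
    return 1.0 / np.sum((x - x.mean()) ** 2)

# Actual allocation: 85% of winners from below-median ("bad") schools,
# 15% from above-median ("good") schools, spread within each half.
actual = np.r_[rng.uniform(0.0, 0.5, int(.85 * n)),
               rng.uniform(0.5, 1.0, int(.15 * n))]

# Extreme allocation: half from the lowest quartile ("very bad"),
# half from the highest quartile ("very good"), none from the middle.
extreme = np.r_[rng.uniform(0.0, 0.25, n // 2),
                rng.uniform(0.75, 1.0, n // 2)]

print(design_factor(actual) / design_factor(extreme))   # roughly 2.5-3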

5. CONCLUSIONS

Scientific problems often provide a fertile ground for important statistical developments, and BFHR's article and the literature on which it depends are excellent illustrations. However, from the perspective of public policy, the goal of program evaluation is to obtain usefully precise estimates of program impacts while using the most robust means of analysis available. Simplicity is also important, both for data analysts, who must do the work, and for policy makers, who must interpret the results. It follows that large investments at the front end of an evaluation (in research design, measurement, and study integrity) are essential. Virtuoso statistical analyses can certainly help when these investments are insufficient or when they fail, but they must not be seen as an inexpensive alternative to doing the study well to begin with.

ADDITIONAL REFERENCES

Angrist, J., Imbens, G. W., and Rubin, D. B. (1996), "Identification of Causal Effects Using Instrumental Variables" (with discussion), Journal of the American Statistical Association, 91, 441–472.

Berk, R. A. (1990), "What Your Mother Never Told You About Randomized Field Experiments," in Community-Based Care of People With AIDS: Developing a Research Agenda, AHCPR Conference Proceedings, Washington, DC: U.S. Department of Health and Human Services.

Berk, R. A., Ladd, H., Graziano, H., and Baek, J. (2003), "A Randomized Experiment Testing Inmate Classification Systems," Journal of Criminology and Public Policy, 2, 215–242.

Haveman, R. H. (ed.) (1977), A Decade of Federal Antipoverty Programs: Achievements, Failures, and Lessons, New York: Academic Press.

Hill, J. L., Rubin, D. B., and Thomas, N. (2000), "The Design of the New York School Choice Scholarships Program Evaluation," in Research Design: Donald Campbell's Legacy (Vol. II), ed. L. Bickman, Thousand Oaks, CA: Sage.

Johnson, S. B. (2000), "Compliance Behavior in Clinical Trials: Error or Opportunity?," in Promoting Adherence to Medical Treatment in Chronic Childhood Illness: Concepts, Methods, and Interventions, ed. D. Drotar, Mahwah, NJ: Lawrence Erlbaum.

Metry, J., and Meyer, U. A. (eds.) (1999), Drug Regimen Compliance: Issues in Clinical Trials and Patient Management, New York: Wiley.

Rejoinder

John BARNARD, Constantine E. FRANGAKIS, Jennifer L. HILL, and Donald B. RUBIN

We thank the editorial board and the thoughtful group of discussants that they arranged to comment on our article. The discussants have raised many salient issues and offered useful extensions of our work. We use this rejoinder primarily to address a few points of contention.

Muthén, Jo, and Brown

Muthén, Jo, and Brown (henceforth MJB) provide an enlightening discussion of an alternate and potentially complementary approach to treatment effect estimation. The growth models that they discuss are for data more extensive than what we analyzed, and so are beyond the scope of our article. Nevertheless, MJB propose an interesting idea for looking for interactions in longitudinal experimental data, and we look forward to future applications of the method to appropriate datasets.

Although models that incorporate both our compliance principal strata and latent growth trajectories are theoretically possible (assuming multiple years of outcome data) and could potentially yield an extremely rich description of the types of treatment effects manifested, the more time points that exist, the more complicated the compliance patterns can become (at least without further assumptions), and it may be difficult to find a dataset rich enough to support all of the interactions. Given a choice, we feel that it is probably more beneficial to handle real, observed issues than to search for interactions with latent trajectories, but certainly it would always be nice to be able to do both. Of course, such a search for interactions in trajectories is more beneficial when there is enough prior knowledge about the scientific background to include additional structural assumptions that can support robust estimation.

MJB make an interesting point about the exploration of potential violations of exclusion for never-takers. Such analyses have been done in past work by some of the authors (Hirano et al. 2000; Frangakis, Rubin, and Zhou 2002), but were not pursued for our study because the robustness of such methods is still an issue in examples as complex as this one.

Regarding the trade-offs between the joint latent ignorability and compound exclusion assumptions versus the joint assumptions of standard ignorability and the standard exclusion restriction, this topic is up for debate. We would like to point out, however, that these standard structural assumptions come with their own level of arbitrariness. These issues have been explored by Jo (2002) and Mealli and Rubin (2002, 2003).

We thank MJB for providing a thought-provoking discussion of our model and assumptions, as well as ample fodder for additional directions that can be explored in this arena. Clearly, this general category of work has stimulated much new and exciting modeling.

Krueger and Zhu

Krueger and Zhu (henceforth KZ) focus primarily on the underlying study and on analyses that either we did not do or that used data we did not have. Clearly, they are interested in the substantive issues, which is admirable, but we cannot be responsible for analyses that we neither did nor were consulted on, and we are not prepared to comment on what analyses we might have undertaken with data that we did not have.

KZ state that "there is no substitute for probing the definitions and concepts that underlie the data." Although we are not entirely sure what this refers to in particular, if it means that good science is more important than fancy techniques, then we certainly agree.

Another comment by KZ is that "it is desirable to use the most recent, comprehensive data for the widest sample possible." We agree that it is desirable to have access to the most comprehensive data available, but it is not necessary to use all of these data in any given analysis. Here we used only first-year data, because these were the only data available to us at the time that we were building the model. We used only single-child families, as explained in the article, because, regrettably, no compliance data were collected for individual children in the multichild families.

The exclusion of kindergarten children could be considered more of a judgment call regarding the quality of the pretest scores, not only for efficiency, but also for the reasonableness of our assumptions about noncompliance and missing data. Nothing in theory would prevent us from including these children and omitting these extra covariates.

These types of trade-offs (more data vs. more appropriate data) are common in statistical analyses, and the benefits are well known, for example, for avoiding model extrapolation or when some subsets of data do not mean the same thing as others. In fact, Krueger "ignored data" in his article on the minimum wage with David Card (Card and Krueger 1994) by limiting analyses to restaurants from a similar geographic region, in the hope that employees and employers across the two comparison states would face similar economic conditions, rather than looking at all restaurants in those states. This is a form of implicit matching, which we believe represented a good choice. Surely there are times when focusing on subsets of data produces more reliable results.

Furthermore, we would like to point out that KZ's assertion that "because assignment to treatment status was random (within strata), a simple comparison of means between treatments and controls, without conditioning on baseline scores, provides an unbiased estimate of the average treatment effect" is simply false, because there are missing outcomes. Even if there is perfect compliance, a simple comparison of means across treatment groups when there are missing outcomes will not, in general, lead to an unbiased estimate of the treatment effect. Assumptions need to be made about the missing-data process.
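A tiny simulation (with entirely made-up numbers) makes the point: even with a zero true effect, perfect compliance, and randomized assignment, outcome-dependent nonresponse biases the complete-case difference in means.

import numpy as np

rng = np.random.default_rng(0)
n = 200_000
z = rng.integers(0, 2, n)          # randomized offer
y = rng.normal(50, 10, n)          # test scores; the true effect of z is 0

# Hypothetical follow-up mechanism: low scorers skip follow-up testing
# more often when assigned to treatment than when assigned to control.
p_respond = np.where(z == 1, .9 - .4 * (y < 45), .9 - .1 * (y < 45))
respond = rng.random(n) < p_respond

naive = y[respond & (z == 1)].mean() - y[respond & (z == 0)].mean()
print(round(naive, 2))             # about +1.4 points, despite a zero effect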

When noncompliance exists in addition to the missing outcome data, Frangakis and Rubin (1999) have shown that estimation of even the ITT effect requires additional assumptions. Moreover, the ITT estimate, generally considered a conservative estimate of the treatment effect, can actually be anticonservative in this scenario. Finally, contrary to what KZ seem to imply in their final paragraph, instrumental variables methods alone are not sufficient to address the problems of both noncompliance and missing data.

KZ advocate "simplicity and transparency." We wholeheartedly agree, but disagree with the examples they use to defend their stance. Their first argument is that the PMPD did not yield strong efficiency gains and that we did not account for the PMPD in the calculation of standard errors. It may well be that precision gains were relatively modest; however, the analyses that they present are by no means conclusive in this regard. First, as with KZ's other ITT analyses, we have no clue as to how missing outcome data are handled. In addition, the PMPD was also used to balance compliance and missingness. It is unclear, however, how the gains in precision due to matching the missingness and noncompliance are displayed, if at all, in KZ's results. Finally, KZ did not account for the experimental design when estimating their results. Failure to account properly for the design typically results in estimates that overstate the variances of treatment effect estimates. In this design, it is the correlation between matched-pair units that leads the sampling variances to be smaller than their estimates.

Even if precision gains were truly modest, they of course were unknown to us at the time we debated the use of the PMPD design. KZ's argument is like the complaint of a cross-country motorist who experiences no car problems during his long drive and then bemoans the money wasted on a spare tire purchased before the trip began. We prefer to avoid potential problems by using careful design.

Importantly, KZ's criticism that we ignored this matching in the analysis is incorrect. The matching was done based on the estimated propensity score, so it follows from Rubin (1978) that assignment is ignorable conditionally, given the covariates that were used to estimate it. Thus the Bayesian analysis that we conduct, which conditions on the estimated propensity score, is an approximately valid Bayesian analysis with respect to the matched assignment, assuming that our modeled relationship between the outcomes and the estimated propensity score captures the major features of the relationship to the covariates.

KZ's second argument is that use of the PMPD "led to grave consequences," because sampling weights were miscalculated by MPR. It seems odd to us to maintain, as KZ do, that the PMPD "caused" this error; nor is it clear that use of another design would have avoided the error. In any case, we have rerun our analyses, this time generalizing to the population of which the children in our analyses are representative (by setting constant weights in step 3 of Sec. 8.1.1). For this population, all results are very similar to our original results. These results are posted as Appendix C at http://biosun01.biostat.jhsph.edu/~cfrangak/papers/sc.

A third argument revolves around the protracted debate between the use of ITT versus treatment-targeted measures such as the CACE. We addressed both ITT and CACE, which allows the reader to focus on one or the other or both, depending on which effect is expected to generalize to another environment: the ITT, or "sociological," effect of accepting or not accepting an offered voucher, or the CACE, or "structural" or "scientific," effect of using a voucher. One could argue that the effect of attending (CACE) is the more generalizable, whereas the sociological effect (ITT) is more easily influenced, for instance, by "hype" about how well the program works. That is, the ITT estimate will not reflect the "effect of offering" if, the next time the program is offered, the compliance rates change (which seems likely). We prefer to examine both, to provide a more comprehensive picture.

In sum, we agree that simplicity and transparency are important; this is the major motivation for our detailed description of our assumptions in terms of readily understandable concepts, rather than commonly invoked and less intuitive conditions involving correlated error terms. However, this desire for simplicity must be balanced with the prudence of guarding against potential problems through careful design.

Although we agree in theory with much of what KZ espouse, some of their criticisms seem misguided. Moreover, their presentation of analyses with incomplete exposition regarding missing-data complications makes these analyses difficult for us to evaluate properly.


Berk and Xu

Berk and Xu (henceforth BX) raise several interesting questions about the way in which the evaluation was designed and implemented. Many of these are actually addressed in a preceding and lengthier article by us on this topic (Barnard et al. 2002). That article describes, for instance, MPR's efforts to reduce missing data, without which the missingness rates would surely have been much higher. It also describes special challenges faced in this study. But in any study with human subjects, it is difficult to completely eliminate missing data, particularly if one desires more interesting information than can be obtained from administrative data. One of the challenges for the collection of outcome test scores was the fact that children needed to be tested outside of the public school setting, which allowed the use of a standardized test not used by the public school system, thereby greatly reducing the possibility of bias resulting from teachers "teaching to the test." Consequently, parents had to bring their children to testing areas on weekends, a formidable hurdle for some, even when monetary incentives were used.

An important consideration with this study was that the evaluators took the initiative to investigate a program that was sponsored by an organization (the SCSF) whose primary goal was to provide scholarships, not to do research. We feel that MPR did an admirable job convincing the SCSF to allow the study to be evaluated and raising money for this purpose, but this also meant that the evaluators could not control every aspect of the program. For example, the SCSF originally wanted to make the program available only to children from the lower-test-score schools, and the evaluators had to convince them to also allow a small percentage from those in higher-test-score schools, to increase the generalizability of the results. The designs that BX suggest, even though they have certain desirable statistical properties under appropriate assumptions, were not an option with this study.

Related to this issue of the alignment of the evaluators' and program administrators' goals is the issue of the potential for reducing noncompliance rates. Some effort was made, in that help was provided to scholarship winners in finding an appropriate private school. Arguably, however, 100% compliance should not be the goal in such a study, given that we would not expect 100% compliance if the program were expanded and offered, say, by the government on a larger scale. Thus it may not be wise to use extreme efforts to increase compliance if such efforts would not be offered in the large-scale public case. With the current study, the opportunity exists to examine the noncompliance process for this program in this setting: take-up rates, predictors of noncompliance, and so on. To this end, some sort of subexperiment that randomizes a reasonable set of compliance incentives might be useful in similar studies in the future, if sample sizes are sufficient.

We are not convinced that we should have solicited actual prior distributions from educators and economists. In our experience, this is a confusing, difficult, and often unrewarding task. Moreover, school choice is such a politically charged and contentious issue that it seems wiser to try to use relatively diffuse and objective prior distributions.

We agree that results should be interpreted in terms of a yardstick of practical effects. We are unaware of the most appropriate such yardstick to use in this context. One side could claim that even tiny average effects matter, not only because of the cumulative effects on parts of society, but also because of possible nonadditivity of such effects, which could imply huge effects for some rare individuals and minute effects for all others. Others could argue that seemingly quite large effects have no long-run practical impact on quality of life. The results that we present provide a basic empirical starting point for these debates between people whose expertise is in this area.

We thank BX for highlighting our emphasis on the role of the assumptions in our analyses. Our hope is that we have presented the assumptions in such a way as to enable readers to make their own judgment about whether or not they were likely to be satisfied, rather than only stating them technically, which would implicitly make those important decisions for the interested but nontechnical readers.

BX point out a dearth of certain types of covariates. Part of the problem here is that we presented only a small portion of the data collected, because our model in its current form did not handle many covariates. However, there were indeed discussions between some members of our team and MPR staff, during the development of the questionnaire, about what types of variables could be included that would be predictive of missing data or compliance, as well as outcomes. Some of the variables proposed were discarded because of the surveyors' experience with them as subject to problems such as high measurement error. Also, not every variable discussed was included, because of a desire for a survey that would not be overly burdensome for the study families.

We raise a final, small, semantic question. We wonder about the propriety of the label "Neyman–Rubin model" to describe a framework for causal inference that views observational studies and randomized experiments on the same continuum and, moreover, encourages Bayesian analyses of both. Neyman (1923) certainly seems to have been the first to use the formal notation of potential outcomes for randomization-based inference in randomized experiments, which was a truly major advance, but neither he nor anyone else, to the best of our knowledge, used this formal notation to define causal effects in observational studies until Rubin (1974, 1975), nor did he or anyone else ever advocate Bayesian inference for the unobserved potential outcomes until Rubin (1975, 1978), which is the perspective that we have implemented here.

Overall, we agree with BX about the paramount importance of study design, and we tried to design this study carefully, certainly more so than most social science analyses of secondary datasets such as the PSID or CPS. We thank BX for their thoughtful commentary; they raise some important points.

We thank all of the discussants for stimulating comments, which we believe have added emphasis to the importance of both the substantive topic of school vouchers and the development of appropriate methods to evaluate this and other such broken randomized experiments.

ADDITIONAL REFERENCES

Card, D., and Krueger, A. (1994), "Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania," American Economic Review, 84, 772–793.


Jo, B. (2002), "Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications" (with discussion), Journal of Educational and Behavioral Statistics, 27, 385–420.

Mealli, F., and Rubin, D. B. (2002), Discussion of "Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications," by B. Jo, Journal of Educational and Behavioral Statistics, 27, 411–415.

(2003), "Assumptions Allowing the Estimation of Direct Causal Effects," discussion of "Healthy, Wealthy, and Wise? Tests for Direct Causal Paths Between Health and Socioeconomic Status," by Adams et al., Journal of Econometrics, 112, 79–87.

Rubin, D. B. (1975), "Bayesian Inference for Causality: The Importance of Randomization," in Proceedings of the Social Statistics Section, American Statistical Association, pp. 233–239.

Barnard et al.: Principal Stratification Approach to Broken Randomized Experiments

attendance) on math scores overall for children who applied to the program from low applicant schools, particularly for first graders. Also, the effects were larger for African-American children than for the remaining children (Appendix B).

Posterior distributions for the CACE estimand, which measures the effect of attendance in private school versus public school, are generally centered at larger values than the corresponding ITT effects, but are also associated with greater uncertainty. Of importance, because of the missing outcomes, a model like ours is needed for valid estimation of even the ITT effect.

The results from this randomized study are not subject to selection bias in the way that nonrandomized studies in school choice have been. Nevertheless, although we use the CACE, a well-defined causal effect, to represent the effect of attendance of private versus public school, it is important to remember that the CACE is defined on a subset of the study children (those who would have complied with either assignment) and that, for the other children, this study provides no information on such an effect of attendance. Therefore, as with any randomized trial based on a subpopulation, external information, such as background variables, also must be used when generalizing the CACE from compliers to other target groups of children. Also, it is possible that a broader implementation of a voucher program could have a collective effect on the public schools, if a large shift of the children who might use vouchers were to have an impact on the quality of learning for children who would stay in public schools (a violation of the no-interference assumption). Because our study contains a small fraction of participants relative to all school children, it cannot provide direct information about any such collective effect, and additional external judgment would need to be used to address this issue.

The larger effects that we estimated for children applying from schools with low versus high past scores are also not, in principle, subject to the usual regression-to-the-mean bias, in contrast to a simple before-after comparison. This is because, in our study, both types of children are randomized to be offered the scholarship or not, and both types in both treatment arms are evaluated at the same time after randomization. Instead, the differential effect for children from schools with different past scores is evidence supporting the claim that the school voucher program has greater potential benefit for children in lower-scoring schools.

Our results also reveal differences in compliance and missing-data patterns across groups. Differences in compliance indicate that children applying from low applicant schools are generally more likely to comply with their treatment assignment; this could provide incentives for policy makers to target this subgroup of children. However, this group also exhibits higher levels of always-takers, indicating that parents of children attending these poorly performing schools are more likely to get their child into a private school even in the absence of scholarship availability. These considerations would, of course, have to be balanced with (and indeed might be dwarfed by) concerns about equity. Missing-data patterns reveal that perhaps relatively greater effort is needed in future interventions of this nature to retain study participants who stay in public schools, particularly the public-school students who actually won but did not use a scholarship.

The approach presented here also has some limitations. First, we have presented principal stratification only on a binary controlled factor Z and a binary uncontrolled factor D, and it is of interest to develop principal stratification for more levels of such factors. Approaches that extend our framework in that direction, including for time-dependent data, have been proposed (e.g., Frangakis et al. 2002b).

Also, the approach of explicitly conditioning on the patterns of missing covariates that we adopted in the parametric component of Section 7 is not as applicable when there are many patterns of missing covariates. In such cases, it would be more appropriate to use models related to those of D'Agostino and Rubin (2000) and integrate them with principal stratification for noncompliance. Moreover, it would also be interesting to investigate results for models fit to these data that allow deviations from the structural assumptions, such as weakened exclusion restrictions (e.g., Hirano et al. 2000; Frangakis, Rubin, and Zhou 2002a), although to ensure robustness of such models, it would be important to first investigate and rely on additional alternative assumptions that would be plausible.

APPENDIX A: COMPUTATIONS

Details are available at http://biosun01.biostat.jhsph.edu/~cfrangak/papers/sc.

Table B1. ITT Estimand

Grade at                       Applicant school: Low                 Applicant school: High
application  Ethnicity   Reading            Math                Reading             Math

1            AA          2.8 (−1.5, 7.2)    6.7 (3.0, 10.4)     1.8 (−5.2, 8.4)     6.3 (0.8, 11.9)
             Other       1.7 (−1.9, 5.4)    3.9 (0.5, 7.2)      0.8 (−4.5, 6.1)     3.5 (−1.2, 8.2)
2            AA          0.8 (−2.8, 4.4)    2.4 (−1.0, 5.8)     −0.3 (−6.6, 6.0)    2.2 (−3.8, 8.0)
             Other       0.2 (−3.1, 3.4)    0.5 (−2.8, 3.7)     −0.9 (−6.1, 4.3)    0.0 (−5.2, 4.8)
3            AA          1.1 (−3.0, 5.2)    4.6 (0.3, 8.9)      −0.1 (−6.7, 6.4)    4.2 (−2.5, 10.6)
             Other       0.3 (−3.2, 3.8)    2.0 (−2.0, 5.8)     −0.7 (−5.8, 4.4)    1.4 (−3.9, 6.5)
4            AA          3.7 (−1.2, 8.7)    4.4 (−0.5, 9.3)     2.4 (−4.7, 9.4)     3.8 (−2.9, 10.5)
             Other       2.4 (−1.7, 6.7)    1.8 (−2.6, 6.0)     1.4 (−4.3, 7.0)     1.1 (−4.4, 6.4)
Overall      AA          2.0 (−0.9, 4.8)    4.5 (1.8, 7.2)      0.8 (−5.1, 6.5)     4.2 (−1.1, 9.3)
             Other       1.1 (−1.4, 3.6)    2.1 (−0.7, 4.7)     0.1 (−4.6, 4.6)     1.6 (−2.9, 5.8)

NOTE: Plain numbers are means, and numbers in parentheses are central 95% intervals, of the posterior distribution of the effects on percentile rank.


Table B2. CACE Estimand

Grade at                       Applicant school: Low                 Applicant school: High
application  Ethnicity   Reading            Math                 Reading              Math

1            AA          3.8 (−2.0, 9.6)    9.0 (4.1, 14.0)      2.3 (−7.3, 10.9)     8.3 (1.1, 15.5)
             Other       2.9 (−3.1, 8.8)    6.3 (0.8, 11.8)      1.3 (−8.0, 10.2)     5.9 (−2.2, 13.9)
2            AA          1.1 (−3.6, 5.7)    3.1 (−1.3, 7.5)      −0.4 (−9.0, 7.9)     2.9 (−5.0, 10.7)
             Other       0.3 (−4.9, 5.4)    0.8 (−4.4, 5.9)      −1.5 (−10.5, 7.3)    0.0 (−8.5, 8.5)
3            AA          1.5 (−4.1, 7.0)    6.3 (0.3, 12.2)      −0.2 (−9.1, 8.5)     5.6 (−3.2, 14.1)
             Other       0.5 (−5.4, 6.4)    3.5 (−3.4, 10.2)     −1.3 (−10.5, 7.9)    2.7 (−7.0, 12.0)
4            AA          4.6 (−1.6, 10.8)   5.5 (−0.7, 11.6)     3.0 (−6.0, 11.6)     4.8 (−3.6, 13.1)
             Other       3.7 (−2.6, 10.1)   2.8 (−3.8, 9.3)      2.3 (−7.4, 11.7)     2.0 (−7.0, 11.0)
Overall      AA          2.6 (−1.2, 6.3)    6.0 (2.4, 9.5)       1.0 (−6.9, 8.4)      5.5 (−1.5, 12.2)
             Other       1.7 (−2.3, 5.8)    3.3 (−1.2, 7.7)      0.1 (−8.1, 7.9)      2.7 (−5.0, 10.3)

NOTE: Plain numbers are means, and numbers in parentheses are central 95% intervals, of the posterior distribution of the effects on percentile rank.

APPENDIX B: ETHNIC BREAKDOWN

The model of Section 7 can be used to estimate the effect of the program on finer strata that may be of interest. For example, to estimate effects stratified by ethnicity (AA for African-American), we obtain the posterior distribution of the causal effect of interest (ITT or CACE) using the same steps described in Section 8.1.1 or 8.1.2, but allowing ethnicity to be part of the definition of W^p. The results for this stratification are reported in Table B1 for the estimands of ITT and in Table B2 for the estimands of CACE.
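As a computational aside, each entry in Tables B1 and B2 is a simple functional of posterior draws. A minimal sketch in Python, with a hypothetical stand-in for the draws (the real draws would come from the posterior simulation described in Section 8, which is not reproduced here):

    import numpy as np

    def table_entry(draws):
        """Posterior mean and central 95% interval, the format of Tables B1-B2."""
        lower, upper = np.percentile(draws, [2.5, 97.5])
        return round(float(np.mean(draws)), 1), (round(float(lower), 1),
                                                 round(float(upper), 1))

    # Hypothetical stand-in for MCMC draws of one stratum's ITT effect on
    # math percentile rank; illustrative numbers only.
    rng = np.random.default_rng(4)
    draws = rng.normal(4.5, 1.4, size=5000)
    print(table_entry(draws))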

The results follow similar patterns to those in Section 8. For each grade and applicant's school (low/high) combination, however, the effects are more positive on average for the subgroup of African-American children. For both estimands, this leads to 95% intervals that are entirely above 0 for math scores for the following subgroups: African-American first- and third-graders, and overall, from low applicant schools; African-American first-graders from high applicant schools; and non-African-American first-graders from low applicant schools. All intervals for reading covered the null value. This suggests that the positive effects reported on math scores in Section 8 for children originating from low applicant schools are primarily attributable to gains among the African-American children in this subgroup.

[Received October 2002. Revised October 2002.]

REFERENCES

Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996), "Identification of Causal Effects Using Instrumental Variables" (with discussion), Journal of the American Statistical Association, 91, 444–472.

Barnard, J., Du, J., Hill, J. L., and Rubin, D. B. (1998), "A Broader Template for Analyzing Broken Randomized Experiments," Sociological Methods and Research, 27, 285–317.

Barnard, J., Frangakis, C., Hill, J. L., and Rubin, D. B. (2002), "School Choice in NY City: A Bayesian Analysis of an Imperfect Randomized Experiment," in Case Studies in Bayesian Statistics, Vol. V, eds. C. Gatsonis, R. E. Kass, B. Carlin, A. Carriquiry, A. Gelman, I. Verdinelli, and M. West, New York: Springer-Verlag, pp. 3–97.

Brandl, J. E. (1998), Money and Good Intentions Are Not Enough, or Why Liberal Democrats Think States Need Both Competition and Community, Washington, DC: Brookings Institution Press.

Carnegie Foundation for the Advancement of Teaching (1992), School Choice: A Special Report, San Francisco: Jossey-Bass.

Chubb, J. E., and Moe, T. M. (1990), Politics, Markets, and America's Schools, Washington, DC: Brookings Institution Press.

Cobb, C. W. (1992), Responsive Schools, Renewed Communities, San Francisco: Institute for Contemporary Studies.

Coleman, J. S., Hoffer, T., and Kilgore, S. (1982), High School Achievement, New York: Basic Books.

Cookson, P. W. (1994), School Choice: The Struggle for the Soul of American Education, New Haven, CT: Yale University Press.

Coulson, A. J. (1999), Market Education: The Unknown History, Bowling Green, OH: Social Philosophy & Policy Center.

Cox, D. R. (1958), Planning of Experiments, New York: Wiley.

D'Agostino, R. B., Jr., and Rubin, D. B. (2000), "Estimating and Using Propensity Scores With Incomplete Data," Journal of the American Statistical Association, 95, 749–759.

Education Week (1998), Quality Counts '98: The Urban Challenge: Public Education in the 50 States, Bethesda, MD: Editorial Projects in Education.

Frangakis, C. E., and Rubin, D. B. (1999), "Addressing Complications of Intention-to-Treat Analysis in the Combined Presence of All-or-None Treatment-Noncompliance and Subsequent Missing Outcomes," Biometrika, 86, 365–380.

——— (2001), "Addressing the Idiosyncrasy in Estimating Survival Curves Using Double-Sampling in the Presence of Self-Selected Right Censoring" (with discussion), Biometrics, 57, 333–353.

——— (2002), "Principal Stratification in Causal Inference," Biometrics, 58, 20–29.

Frangakis, C. E., Rubin, D. B., and Zhou, X.-H. (2002a), "Clustered Encouragement Design With Individual Noncompliance: Bayesian Inference and Application to Advance Directive Forms" (with discussion), Biostatistics, 3, 147–164.

Frangakis, C. E., Brookmeyer, R. S., Varadhan, R., Mahboobeh, S., Vlahov, D., and Strathdee, S. A. (2002b), "Methodology for Evaluating a Partially Controlled Longitudinal Treatment Using Principal Stratification, With Application to a Needle Exchange Program," Technical Report NEP-06-02, Johns Hopkins University, Dept. of Biostatistics.

Fuller, B., and Elmore, R. F. (1996), Who Chooses? Who Loses? Culture, Institutions, and the Unequal Effects of School Choice, New York: Teachers College Press.

Gelfand, A. E., and Smith, A. F. M. (1990), "Sampling-Based Approaches to Calculating Marginal Densities," Journal of the American Statistical Association, 85, 398–409.

Gelman, A., Meng, X.-L., and Stern, H. (1996), "Posterior Predictive Assessment of Model Fitness via Realized Discrepancies" (with discussion), Statistica Sinica, 6, 733–807.

Gelman, A., and Rubin, D. B. (1992), "Inference from Iterative Simulation Using Multiple Sequences" (with discussion), Statistical Science, 7, 457–472.

Glynn, R. J., Laird, N. M., and Rubin, D. B. (1993), "Multiple Imputation in Mixture Models for Nonignorable Nonresponse With Follow-Ups," Journal of the American Statistical Association, 88, 984–993.

Goldberger, A. S., and Cain, G. G. (1982), "The Causal Analysis of Cognitive Outcomes in the Coleman, Hoffer and Kilgore Report," Sociology of Education, 55, 103–122.

Haavelmo, T. (1943), "The Statistical Implications of a System of Simultaneous Equations," Econometrica, 11, 1–12.

Hastings, W. K. (1970), "Monte Carlo Sampling Methods Using Markov Chains and Their Applications," Biometrika, 57, 97–109.

Hill, J. L., Rubin, D. B., and Thomas, N. (2000), "The Design of the New York School Choice Scholarship Program Evaluation," in Donald Campbell's Legacy, ed. L. Bickman, Newbury Park, CA: Sage Publications.

Hirano, K., Imbens, G. W., Rubin, D. B., and Zhou, X.-H. (2000), "Assessing the Effect of an Influenza Vaccine in an Encouragement Design," Biostatistics, 1, 69–88.

Holland, P. (1986), "Statistics and Causal Inference," Journal of the American Statistical Association, 81, 945–970.

Imbens, G. W., and Angrist, J. D. (1994), "Identification and Estimation of Local Average Treatment Effects," Econometrica, 62, 467–476.


Imbens, G. W., and Rubin, D. B. (1997), "Bayesian Inference for Causal Effects in Randomized Experiments With Noncompliance," The Annals of Statistics, 25, 305–327.

Levin, H. M. (1998), "Educational Vouchers: Effectiveness, Choice, and Costs," Journal of Policy Analysis and Management, 17, 373–392.

Little, R. J. A. (1993), "Pattern-Mixture Models for Multivariate Incomplete Data," Journal of the American Statistical Association, 88, 125–134.

——— (1996), "Pattern-Mixture Models for Multivariate Incomplete Data With Covariates," Biometrics, 52, 98–111.

Little, R. J. A., and Rubin, D. B. (1987), Statistical Analysis With Missing Data, New York: Wiley.

Meng, X.-L. (1994), "Posterior Predictive p Values," The Annals of Statistics, 22, 1142–1160.

Neal, D. (1997), "The Effects of Catholic Secondary Schooling on Educational Achievement," Journal of Labor Economics, 15, 98–123.

Neyman, J. (1923), "On the Application of Probability Theory to Agricultural Experiments: Essay on Principles, Section 9," translated in Statistical Science, 5, 465–480.

Peterson, P. E., and Hassel, B. C. (eds.) (1998), Learning from School Choice, Washington, DC: Brookings Institution Press.

Peterson, P. E., Myers, D. E., Howell, W. G., and Mayer, D. P. (1999), "The Effects of School Choice in New York City," in Earning and Learning: How Schools Matter, eds. S. E. Mayer and P. E. Peterson, Washington, DC: Brookings Institution Press.

Robins, J. M., Greenland, S., and Hu, F.-C. (1999), "Estimation of the Causal Effect of a Time-Varying Exposure on the Marginal Mean of a Repeated Binary Outcome" (with discussion), Journal of the American Statistical Association, 94, 687–712.

Rosenbaum, P. R., and Rubin, D. B. (1983), "The Central Role of the Propensity Score in Observational Studies for Causal Effects," Biometrika, 70, 41–55.

Rubin, D. B. (1974), "Estimating Causal Effects of Treatments in Randomized and Non-Randomized Studies," Journal of Educational Psychology, 66, 688–701.

——— (1975), "Bayesian Inference for Causality: The Importance of Randomization," in Proceedings of the Social Statistics Section, American Statistical Association, pp. 233–239.

——— (1977), "Assignment to Treatment Groups on the Basis of a Covariate," Journal of Educational Statistics, 2, 1–26.

——— (1978a), "Bayesian Inference for Causal Effects: The Role of Randomization," The Annals of Statistics, 6, 34–58.

——— (1978b), "Multiple Imputations in Sample Surveys: A Phenomenological Bayesian Approach to Nonresponse," in Proceedings of the Survey Research Methods Section, American Statistical Association, pp. 20–28.

——— (1979), "Using Multivariate Matched Sampling and Regression Adjustment to Control Bias in Observational Studies," Journal of the American Statistical Association, 74, 318–328.

——— (1980), Comment on "Randomization Analysis of Experimental Data: The Fisher Randomization Test," Journal of the American Statistical Association, 75, 591–593.

——— (1984), "Bayesianly Justifiable and Relevant Frequency Calculations for the Applied Statistician," The Annals of Statistics, 12, 1151–1172.

——— (1990), "Comment: Neyman (1923) and Causal Inference in Experiments and Observational Studies," Statistical Science, 5, 472–480.

The Coronary Drug Project Research Group (1980), "Influence of Adherence to Treatment and Response of Cholesterol on Mortality in the Coronary Drug Project," New England Journal of Medicine, 303, 1038–1041.

Tinbergen, J. (1930), "Determination and Interpretation of Supply Curves: An Example," reprinted in The Foundations of Econometric Analysis, eds. D. Hendry and M. Morgan, Cambridge, U.K.: Cambridge University Press.

Willms, J. D. (1985), "Catholic School Effect on Academic Achievement: New Evidence From the High School and Beyond Follow-Up Study," Sociology of Education, 58, 98–114.

Comment
Bengt MUTHÉN, Booil JO, and C. Hendricks BROWN

1. INTRODUCTION

The article by Barnard, Frangakis, Hill, and Rubin (BFHR) is timely in that the Department of Education is calling for more randomized studies in educational program evaluation (see the discussion of the "No Child Left Behind" initiative in, e.g., Slavin 2002). BFHR can serve as a valuable pedagogical example of a successful, sophisticated statistical analysis of a randomized study. Our commentary is intended to provide additional pedagogical value to benefit the planning and analysis of future studies, drawing on experiences and research within our research group. [The Prevention Science Methodology Group (PSMG; www.psmg.hsc.usf.edu), co-PIs Brown and Muthén, has collaborated over the last 15 years with support from the National Institute of Mental Health and the National Institute on Drug Abuse.]

BFHR provides an exemplary analysis of the data from an imperfect randomized trial that suffers from several complications simultaneously: noncompliance, missing data in outcomes, and missing data in covariates. We are very pleased to

Bengt Muthén is Professor and Booil Jo is a postdoctoral fellow, Graduate School of Education and Information Studies, University of California, Los Angeles, Los Angeles, CA 90095 (E-mail: bmuthen@ucla.edu). C. Hendricks Brown is Professor, Department of Epidemiology and Biostatistics, University of South Florida, Tampa, FL 33620. The research of the first author was supported by National Institute on Alcohol Abuse and Alcoholism grant K02 AA 00230. The research of the second and third authors was supported by National Institute on Drug Abuse and National Institute of Mental Health grant MH40859. The authors thank Chen-Pin Wang for research assistance, Joyce Chappell for graphical assistance, and the members of the Prevention Science Methodology Group and the Fall ED299A class for helpful comments.

see their application of cutting-edge Bayesian methods for dealing with these complexities. In addition, we believe the methodological issues and the results of the study have important implications for the design and analysis of randomized trials in education and for related policy decisions.

BFHR provides results of the New York City school choice experiment based on 1-year achievement outcomes. With the planned addition of yearly follow-up data, growth models can provide an enhanced examination of causal impact. We discuss how such growth modeling can be incorporated and provide a caution that applies to BFHR's use of only one posttest occasion. We also consider the sensitivity of the latent class ignorability assumption in combination with the assumption of compound exclusion.

2. LONGITUDINAL MODELING ISSUES

BFHR focuses on variation in treatment effect across compliance classes. This part of the commentary considers variation in treatment effect across a different type of class, based on the notion that the private school treatment effect might very well be quite different for children with different achievement development. (Also of interest is potential variation in treatment effects across schools, with respect to both the public school the child originated in and the private school the child was



moved to, but this multilevel aspect of the data is left aside here for lack of space.) To study such a "treatment–baseline interaction" (or "treatment–trajectory interaction"), we will switch from BFHR's pretest–posttest analysis framework (essentially a very advanced ANCOVA-type analysis) to the growth mixture modeling framework of Muthén et al. (2002). An underlying rationale for this modeling is that individuals at different initial status levels and on different trajectories may benefit differently from a given treatment. ANCOVA controls for initial status as measured by the observed pretest score. Unlike the observed pretest score, the latent variable of initial status is free of time-specific variation and measurement error.

The focus on longitudinal aspects of the New York School Choice Study (NYSCS) is both substantively and statistically motivated. First, treatment effects may not have gained full strength after only a 1-year stay in a private school. The NYSCS currently has data from three follow-ups, that is, providing repeated-measures data from four grades. Although BFHR used percentile scores that do not lend themselves to growth modeling, a conversion to "scale scores" (i.e., IRT-based equated scores) should be possible, enabling growth modeling. Unfortunately, educational research traditionally uses scales that are unsuitable for growth modeling, such as percentile scores, normal curve equivalents, and grade equivalents (for a comparison in a growth context, see Seltzer, Frank, and Bryk 1994). Hopefully, this tradition can be changed. Second, the use of information from more time points than pretest and posttest makes it possible to identify and estimate models that give a richer description of the normative development in the control group and how the treatment changes this development.

Consider three types of latent variables for individual i. The first type, C_i, refers to BFHR's compliance principal strata. The next two relate to the achievement development as expressed by a growth mixture model: T_i refers to trajectory class, and η_i refers to random effects within trajectory class (the within-class model is a regular mixed-effects model). Unlike the latent class variable C_i, the latent class variable T_i is a fully unobserved variable, as is common in latent variable modeling (see, e.g., Muthén 2002a). Consider the likelihood expression for individual i, using the [ · ] notation to denote probabilities/densities:

    [C_i, T_i | X_i] [η_i | C_i, T_i, X_i] [Y_i | η_i, C_i, T_i, X_i]
        × [U_i | η_i, C_i, T_i, X_i] [R_i | Y_i, η_i, C_i, T_i, X_i],        (1)

where X_i denotes covariates, U_i denotes a compliance stratum indicator (with C_i perfectly measured by U_i in the treatment group for never-takers, and perfectly measured in the control group for always-takers, other group–class combinations having missing data), and R_i denotes indicators for missing data on the repeated-measures outcomes Y_i (pretreatment and posttreatment achievement scores). This type of model can be fitted in the latent variable modeling framework of the Mplus program (Muthén and Muthén 1998–2002, technical appendix 8), which has implemented an EM-based maximum likelihood estimator. (For related references, see the Mplus website, www.statmodel.com.) As a special case of (1), conventional random-effects growth modeling includes η_i but excludes C_i and T_i, and assumes missingness at random, so that the last term in (1) is ignored. Growth mixture modeling (Muthén and Shedden 1999; Muthén

et al. 2002; Muthén and Muthén 1998–2002) includes η_i and T_i. BFHR includes C_i, but not T_i (or η_i), and includes the last term in (1), drawing on the latent ignorability of Frangakis and Rubin (1999). Muthén and Brown (2001) studied latent ignorability related to T_i in the last term of (1). In randomized studies, it would be of interest to study C_i and T_i classes jointly, because individuals in different trajectory classes may show different compliance, and missingness may be determined by these classes jointly.
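To make the factorization in (1) concrete, here is a minimal numerical sketch, assuming two latent compliance classes and two latent trajectory classes, a single normal outcome with η_i already integrated out analytically, and missingness that depends only on the latent classes (latent ignorability); the compliance indicator U_i is omitted for brevity, and all parameter values are invented for illustration:

    import numpy as np
    from scipy.stats import norm

    P_CLASS = {(c, t): 0.25 for c in (0, 1) for t in (0, 1)}          # [C_i, T_i | X_i]
    MU = {(0, 0): 20.0, (0, 1): 30.0, (1, 0): 25.0, (1, 1): 40.0}     # outcome means
    SIGMA = 10.0                                                      # [Y_i | C_i, T_i, X_i]
    P_RESPOND = {(0, 0): 0.6, (0, 1): 0.7, (1, 0): 0.8, (1, 1): 0.9}  # [R_i | C_i, T_i]

    def individual_likelihood(y, r):
        """Marginal likelihood for one person: sum over the latent (c, t)."""
        total = 0.0
        for (c, t), p_ct in P_CLASS.items():
            f_y = norm.pdf(y, MU[(c, t)], SIGMA) if r == 1 else 1.0   # Y enters only if observed
            p_r = P_RESPOND[(c, t)] if r == 1 else 1.0 - P_RESPOND[(c, t)]
            total += p_ct * f_y * p_r
        return total

    print(individual_likelihood(y=28.0, r=1))    # observed outcome
    print(individual_likelihood(y=np.nan, r=0))  # missing outcome integrates out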

If data have been generated by a growth mixture model with treatment effects varying across trajectory classes, what would a pretest–posttest analysis such as that in BFHR reveal? To judge the possibility of such treatment–trajectory interaction in the NYSCS, we considered several recent applications of growth mixture modeling that have used T_i to represent qualitatively different types of trajectories for behavior and achievement scores on children in school settings. Drawing on these real-data studies, two growth mixture scenarios were investigated. (A detailed description of these real-data studies and scenarios and their parameter values is given in Mplus Web Note 5 at www.statmodel.com/mplus/examples/webnote.html.) For simplicity, no missing data on the outcome or pretest is assumed, and C_i classes are not present. In a three-class scenario, the treatment effect is noteworthy only for a 70% middle class, assuming that the low-class membership (10%) hinders individuals from benefiting from the treatment and that the high-class membership (20%) does not really need the treatment. The achievement development in the three-class scenario is shown in Figure 1(a), and the corresponding posttest (y2)–pretest (y1) regressions are shown in Figure 1(b). The lines denoted ANCOVA show a regular ANCOVA analysis allowing for an interaction between treatment and pretest (different slopes). In the three-class scenario, the ANCOVA interaction is not significant at n = 2,000, and the treatment effect in the middle class is underestimated by 20%, but overestimated in the other two classes. In a two-class scenario (not shown here), where the treatment is noteworthy only for individuals in the low class (50%), ANCOVA detects an interaction that is significant at the NYSCS sample size of n = 2,000, but underestimates the treatment effect for most children in the low class. (At the low-class average pretest value of 0, the treatment effect is underestimated by 32%.)
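The flavor of these scenarios can be reproduced with a small simulation; the sketch below uses invented class proportions and effects in the spirit of the three-class scenario (not the actual parameter values of Mplus Web Note 5) and fits the ANCOVA with a treatment-by-pretest interaction by least squares:

    import numpy as np

    rng = np.random.default_rng(3)
    n = 2000
    means = np.array([15.0, 25.0, 40.0])     # latent initial status by class
    effects = np.array([0.0, 5.0, 0.0])      # treatment effect only in middle class
    cls = rng.choice(3, size=n, p=[0.10, 0.70, 0.20])
    z = rng.integers(0, 2, size=n)           # randomized treatment
    y1 = means[cls] + rng.normal(0, 5, n)    # pretest = status + error
    y2 = means[cls] + effects[cls] * z + rng.normal(0, 5, n)  # posttest

    # Regular ANCOVA with a treatment-by-pretest interaction, as in Figure 1(b).
    X = np.column_stack([np.ones(n), z, y1, z * y1])
    b, *_ = np.linalg.lstsq(X, y2, rcond=None)
    for k in range(3):
        print(f"class {k}: true effect {effects[k]:.1f}, "
              f"ANCOVA estimate at class-mean pretest {b[1] + b[3] * means[k]:.1f}")

Because the true treatment effect is a bell-shaped function of the pretest under this mixture rather than a linear one, the fitted interaction line flattens the class differences, understating the middle-class effect and overstating the others.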

Although the NYSCS children are selected from low-performing schools (the average NYSCS math and reading percentile rankings are around 23–28), there may still be sufficient heterogeneity among children in their achievement growth to make a treatment–trajectory interaction plausible. The three-class scenario is possible, perhaps with more children in the low class relative to the other two classes. If this is the case, the ANCOVA analysis shown in Figure 1 suggests a possible reason for BFHR's finding of low treatment effects. The empirical studies and the results in Figure 1 suggest that future program evaluations may benefit from exploring variation in treatment effects across children characterized by different development. Using data from at least two posttreatment time points (three time points total), the class-specific treatment effects generated in these data can be well recovered by growth mixture modeling. (Monte Carlo simulation results are given in Mplus Web Note 5 at www.statmodel.com/mplus/examples/webnote.html.)


Figure 1. Growth Mixture Modeling Versus Pretest–Posttest Analysis.

A more flexible analysis is obtained with more posttreatment time points. An improved design for the determination of the latent trajectory classes is to use more than one pretreatment time point, so that the trajectory class membership is better determined before the treatment starts.

3. COMPOUND EXCLUSION AND LATENT IGNORABILITY

Based on the ideas of principal stratification (Frangakis and Rubin 2002) and latent ignorability (Frangakis and Rubin 1999), BFHR successfully demonstrates that the complexities of educational studies can be better handled under more explicit and flexible sets of assumptions. Although we think that their structural assumptions are reasonable in the NYSCS, we would like to add some thoughts on the plausibility of two other assumptions, considering more general situations.

Compound exclusion (CE) is one of the key structural assumptions in identifying principal effects under latent ignorability. However, the plausibility of this assumption can be questioned in practice (Frangakis et al. 2002; Hirano et al. 2000; Jo 2002a, 2002c; Shadish, Cook, and Campbell 2002; West and Sagarin 2000). In the NYSCS, it seems realistic to assume that winning a lottery has some positive impact on always-takers; however, it is less clear how winning a lottery will affect never-takers. One possibility is that winning a lottery has a negative impact on parents, because they fail to benefit from it. Discouraged parents may have a negative influence on a child's test scores or response behaviors. This negative effect may become more evident if noncompliance is due to parents' low expectation of, or lack of interest in, their child's education. Another possibility is that winning a lottery has a positive impact on a child's test scores or response behaviors. For example, parents who are discouraged by being unable to send their child to private schools even with vouchers may try harder to improve the quality of existing resources (e.g., in the public school their child attends) and be more motivated to support their child to

improve his or her academic performance. Given these competing possibilities, it is not easy to predict whether and how CE is violated.

Depending on the situation, causal effect estimates can be quite sensitive to violation of the exclusion restriction in outcome missingness (Jo 2002b), which is less known than the impact of violating the exclusion restriction in observed outcomes (Angrist et al. 1996; Jo 2002a). The implication of possible violation of CE and its impact is that the relative benefit of models assuming latent ignorability (LI) and standard ignorability (SI) depends on the degrees of deviation from CE and SI. Identification of causal effects under LI relies on the generalized (compound) exclusion restriction (i.e., both on the outcomes and on the missingness of outcomes), whereas identification of causal effects under SI relies on the standard exclusion restriction (i.e., only on the outcomes). Therefore, in some situations, the impact of deviation from CE may outweigh the impact of deviation from SI, resulting in more biased causal effect estimates in models assuming LI than in models assuming SI (Jo 2002b). For example, if SI holds but CE is seriously violated (say, a 20% increase in the response rate due to treatment assignment for compliers and a 15% increase for never-takers), causal effect estimates and the coverage probability assuming LI and CE can drastically deviate from the true value and the nominal level. This type of violation does not affect models assuming SI and the standard exclusion restriction. To empirically examine the plausibility of SI, LI, and CE, it will be useful to do sensitivity analyses of models imposing different combinations of these assumptions. As BFHR points out, this investigation can be conducted by relaxing compound exclusion (e.g., Frangakis et al. 2002; Hirano et al. 2000) or by using alternative structural assumptions (e.g., Jo 2002c). More research is needed to examine the efficiency of these alternative models and to explore factors associated with insensitivity of LI models to violation of compound exclusion.

4. CONCLUSION

Causal inferences of the type BFHR provides are a dramaticimprovement over the existing literature now available on the


question of whether school choice will produce better achievement outcomes for children in an urban public school system. The randomized lottery provides an exceptionally powerful tool for examining the impact of a program; it is far more useful than observational studies, which have causal change intertwined hopelessly with self-selection factors. Statisticians are just now investigating variations in such principal strata analyses, that is, those involving latent classes formed as a function of randomized trials involving intervention invitations (such as vouchers), encouragement designs, and field trial designs involving more than one randomization (Brown and Liao 1999). The latent categories in this article, which BFHR labels "complier," "never-taker," "always-taker," and "defier," represent only one type of design. Other terms may be more relevant to the scientific questions underlying trials in which subjects are randomly assigned to different levels of invitation (e.g., Angrist and Imbens 1995) or different levels of implementation. Such trials not only have great potential for examining questions of effectiveness, sustainability, and scalability, but also require terms more consistent with adherence than compliance. Again, we congratulate the authors on an important addition to the methodological literature that we predict will have lasting impact.

ADDITIONAL REFERENCES

Angrist, J. D., and Imbens, G. W. (1995), "Two-Stage Least Squares Estimation of Average Causal Effects in Models With Variable Treatment Intensity," Journal of the American Statistical Association, 90, 431–442.

Brown, C. H., and Liao, J. (1999), "Principles for Designing Randomized Preventive Trials in Mental Health: An Emerging Developmental Epidemiologic Perspective," American Journal of Community Psychology, 27, 673–709.

Jo, B. (2002a), "Model Misspecification Sensitivity Analysis in Estimating Causal Effects of Interventions With Noncompliance," Statistics in Medicine, 21, 3161–3181.

——— (2002b), "Sensitivity of Causal Effects Under Ignorable and Latent Ignorable Missing-Data Mechanisms," available at www.statmodel.com/mplus/examples/jo.

——— (2002c), "Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications" (with discussion), Journal of Educational and Behavioral Statistics, 27, 385–420.

Muthén, B. (2002a), "Beyond SEM: General Latent Variable Modeling," Behaviormetrika, 29, 81–117.

Muthén, B., and Brown, C. H. (2001), "Non-Ignorable Missing Data in a General Latent Variable Modeling Framework," presented at the annual meeting of the Society for Prevention Research, Washington, DC, June 2001.

Muthén, B., Brown, C. H., Masyn, K., Jo, B., Khoo, S. T., Yang, C. C., Wang, C. P., Kellam, S., Carlin, J., and Liao, J. (2002), "General Growth Mixture Modeling for Randomized Preventive Interventions," Biostatistics, 3, 459–475.

Muthén, L., and Muthén, B. (1998–2002), Mplus User's Guide, Los Angeles: authors.

Muthén, B., and Shedden, K. (1999), "Finite Mixture Modeling With Mixture Outcomes Using the EM Algorithm," Biometrics, 55, 463–469.

Seltzer, M. H., Frank, K. A., and Bryk, A. S. (1994), "The Metric Matters: The Sensitivity of Conclusions About Growth in Student Achievement to Choice of Metric," Educational Evaluation and Policy Analysis, 16, 41–49.

Shadish, W. R., Cook, T. D., and Campbell, D. T. (2002), Experimental and Quasi-Experimental Designs for Generalized Causal Inference, Boston: Houghton Mifflin.

Slavin, R. E. (2002), "Evidence-Based Education Policies: Transforming Educational Practice and Research," Educational Researcher, 31, 15–21.

West, S. G., and Sagarin, B. J. (2000), "Participant Selection and Loss in Randomized Experiments," in Research Design, ed. L. Bickman, Thousand Oaks, CA: Sage, pp. 117–154.

Comment
Alan B. KRUEGER and Pei ZHU

The New York City school voucher experiment provides some of the strongest evidence on the effect of private school vouchers on student test achievement yet available. Barnard et al. provide a useful reevaluation of the experiment, and some of the authors were integrally involved in the design of the experiment. We will leave it to other discussants to comment on the Bayesian methodological advances in their paper, and instead comment on the substantive lessons that can be learned from the experiment and the practical lessons that emerge from the novel design of the experiment.

We have the advantage of having access to more complete data than Barnard et al. used in preparing their paper. This includes more complete baseline test information, data on multichild families as well as single-child families, and three years of follow-up test data instead of just one year of follow-up data.

Alan B. Krueger is Professor and Pei Zhu is a graduate student, Economics Department, Princeton University, Princeton, NJ 08544 (E-mail: akrueger@princeton.edu). In part, this comment extends and summarizes results of Krueger and Zhu (2003). Readers are invited to read that article for a more in-depth analysis of many of the issues raised in this comment. Some of the results presented here differ slightly from those in our earlier article, however, because the definition of family size in this comment corresponds to the one used by Barnard et al., rather than to the definition in our earlier article.

In our comment, we use the more comprehensive sample because results for this sample are more informative, but we highlight where differences arise from using the sample analyzed by Barnard et al.

Three themes emerge from our analysis. First, simplicity and transparency are underappreciated virtues in statistical analysis. Second, it is desirable to use the most recent, comprehensive data for the widest sample possible. Third, there is no substitute for probing the definitions and concepts that underlie the data.

1. RANDOM ASSIGNMENT

As Barnard et al. explain, the voucher experiment entailed a complicated block design, and different random assignment procedures were used in the first and second set of blocks. In the first block, a propensity matched-pairs design (PMPD) method was used. In this block, far more potential control families were available than was money to follow them up. Rather than select a random sample of controls for follow-up after randomly



Table 1. Efficiency Comparison of Two Methods of Random Assignment: Estimated Treatment Effect and Standard Error, by Method of Random Assignment

                                Propensity score match subsample   Randomized blocks subsample    Relative sampling     Sample-size-adjusted relative
Year    Model                   Coefficient     No. of obs.        Coefficient     No. of obs.    variance (RB/PMPD)    sampling variance (RB/PMPD)

All students
Year 1  Control for baseline    0.33 (1.40)     721                2.10 (1.38)     734            0.972                 0.989
        Omit baseline           0.32 (1.40)     1,048              −1.02 (1.58)    1,032          1.274                 1.254
Year 3  Control for baseline    1.02 (1.56)     613                0.67 (1.74)     637            1.244                 1.293
        Omit baseline           −0.31 (1.57)    900                −0.45 (1.78)    901            1.285                 1.287

African-American students
Year 1  Control for baseline    0.32 (1.88)     302                8.10 (1.71)     321            0.827                 0.879
        Omit baseline           2.10 (1.99)     436                3.19 (2.07)     447            1.082                 1.109
Year 3  Control for baseline    4.21 (2.51)     247                6.57 (2.09)     272            0.693                 0.764
        Omit baseline           2.26 (2.47)     347                3.05 (2.31)     386            0.875                 0.973

NOTE: Standard errors are in parentheses. The treatment-effect coefficient is from a regression of test scores on a dummy indicating assignment to receive a voucher (1 = yes), lottery randomization strata dummies, and, in indicated models, baseline test scores. Bootstrap standard errors account for within-family correlation in residuals. Year 1 or 3 refers to follow-up year. The last column adjusts the relative sampling variances of the treatment effects for differences in relative sample sizes. The sample of African-American students consists of those whose mothers' race/ethnicity is identified as non-Hispanic Black/African-American.

selecting the treatments, the researchers estimated a propensity score model to align controls with the treatments, and then used a nearest-available-neighbor Mahalanobis match to select specific paired control families for follow-up. In principle, the PMPD design should improve the precision of the estimates by reducing the chance imbalances that can occur in a simple randomized block design. A standard randomized block design was used in the remaining four blocks.
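The matching step just described can be sketched generically (this is not the proprietary PMPD code, which, as noted below, was not described in sufficient detail to replicate; the data here are invented):

    import numpy as np

    def nearest_available_match(treated, controls):
        """Greedy nearest-available-neighbor matching on Mahalanobis
        distance: each treated unit, in turn, takes the closest control
        not yet used. Rows are units, columns are covariates."""
        pooled = np.vstack([treated, controls])
        vinv = np.linalg.inv(np.cov(pooled, rowvar=False))
        available = list(range(len(controls)))
        pairs = []
        for i, x in enumerate(treated):
            dists = [(x - controls[j]) @ vinv @ (x - controls[j])
                     for j in available]
            pairs.append((i, available.pop(int(np.argmin(dists)))))
        return pairs

    # Invented example: 5 treated families, a pool of 20 controls, 3 covariates.
    rng = np.random.default_rng(2)
    print(nearest_available_match(rng.normal(size=(5, 3)),
                                  rng.normal(size=(20, 3))))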

Barnard et al. emphasize that the PMPD block was more balanced for 15 of the 21 baseline variables that they examined. But the key question is the extent to which the precision of the estimates was improved, not the covariate balance, because both blocks yield unbiased estimates. If the covariates explain relatively little of the outcome variable, then the PMPD will add very little. In Table 1, we provide standard intent-to-treat (ITT) estimates (that is, the difference in mean test scores between treatments and controls, conditional on the strata used for random assignment: family size by block by high/low-achieving school) and, more importantly, their standard errors, separately for the PMPD subsample and the randomized-block subsample. The outcome variable is the average of the national percentile rank on the math and reading segments of the Iowa Test of Basic Skills, taken either 1 year or 3 years after random assignment. Two sets of results are presented: one set controlling for baseline test scores, for the subsample with baseline scores, and the other without controlling for scores, for the larger sample. The last column adjusts the relative sampling variances for differences in sample size between the blocks.
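In sketch form, the estimates in Table 1 (and in Table 2 below) amount to the following, with hypothetical array names, invented data, and a family-level bootstrap for the standard errors, as in the table notes:

    import numpy as np

    def itt_effect(y, voucher, strata, family, n_boot=200, seed=0):
        """OLS of score on a voucher-offer dummy plus lottery-strata
        dummies (the dummies absorb the intercept); standard errors from
        a bootstrap over families, allowing siblings' residuals to be
        correlated. A resample may leave a stratum all-treated or
        all-control; lstsq still returns a least-squares fit, which is
        acceptable for a sketch."""
        def coef(idx):
            s = strata[idx]
            dummies = (s[:, None] == np.unique(s)[None, :]).astype(float)
            X = np.column_stack([voucher[idx].astype(float), dummies])
            b, *_ = np.linalg.lstsq(X, y[idx], rcond=None)
            return b[0]

        estimate = coef(np.arange(y.size))
        rng = np.random.default_rng(seed)
        families = np.unique(family)
        draws = []
        for _ in range(n_boot):
            resampled = rng.choice(families, size=families.size, replace=True)
            idx = np.concatenate([np.flatnonzero(family == f) for f in resampled])
            draws.append(coef(idx))
        return estimate, float(np.std(draws))

    # Invented data with a built-in ITT effect of 2 percentile points.
    rng = np.random.default_rng(1)
    n = 400
    family = rng.integers(0, 250, n)     # family IDs (siblings share one)
    strata = rng.integers(0, 30, n)      # 30 lottery strata
    voucher = rng.integers(0, 2, n)      # voucher-offer dummy
    y = 25 + 2 * voucher + rng.normal(0, 15, n)
    print(itt_effect(y, voucher, strata, family))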

For all students (panel A), the standard errors are about 10% smaller in the PMPD sample. For the Black students (panel B), however, the standard errors are essentially the same or slightly larger in the PMPD sample. This finding is unexpected, because Hill, Rubin, and Thomas (2000) predicted that analyses of subgroups would have more power in the PMPD design, because

matching should lead to a more nearly equal representation of subgroups in the treatment and control groups. In addition, the covariate balance is less good for the Black students in the PMPD than in the randomized blocks, if we compute Z scores for the differences between treatments and controls for the baseline covariates. One possibility is that, because Black students could have been matched to non-Black students in the PMPD design, this subsample was actually less balanced than the subsample of Blacks from the more conventional randomized block design.

The increase in power from the PMPD design is relatively modest, even in the full sample. Of much more practical importance is the fact that the complexity of the PMPD design caused Mathematica to initially calculate incorrect baseline sample weights. The baseline weights were designed to make the follow-up sample representative of all eligible applicants for vouchers. The initial weights assigned much too little importance to the control sample in the PMPD block. Because these families represented many unselected controls, they should have been weighted heavily. The revised weights increased the weight on the PMPD controls by 620%. In contrast, the weight increased by just 18% for the rest of the sample.

This error had grave consequences for the initial inferences and analyses of the data. First, using the faulty weights, Peterson, Myers, and Howell (1998) reported highly statistically significant differences in baseline test scores between the treatment and control groups, with control group members scoring higher on both the math and reading exams. After the mistake was discovered and the weights were revised, however, the baseline differences became small and statistically insignificant. (We note, however, that if we limit attention to the students from single-child families that Barnard et al. use, and


use the more complete baseline test score data, then there is a statistically significant difference in baseline reading scores between treatments and controls, even with the revised baseline weights. If we pool together students from single-child and multichild families, then the baseline difference is insignificant. This is another reason why we think it is more appropriate to use the broader sample that includes children from all families.)

Second, because of the inaccurate inference that there was a difference in baseline ability between treatments and controls, Mathematica's researchers were discouraged from analyzing (or at least from presenting) results that did not condition on baseline scores. This is unfortunate, because conditioning on baseline scores caused the researchers to drop from the sample all the students who were initially in kindergarten (because these students were not given baseline tests) and 11% of students initially in grades 1–4. Including students with missing baseline scores increases the sample by more than 40% and expands the population to which the results can be generalized. As explained later, and in an article by Krueger and Zhu (2003), qualitatively different results are found if students with missing baseline scores are included in the sample. Because assignment to receive a voucher was random within lottery strata, estimates that do not condition on baseline scores are nonetheless unbiased. Moreover, it is inefficient to exclude students with missing baseline scores (most of whom had follow-up test scores), and such a sample restriction can potentially cause sample selection bias.

Of lesser importance is the fact that the PMPD design also complicates the calculation of standard errors. The matching of treatments and controls on selected covariates creates a dependence between paired observations. Moreover, if the propensity score model is misspecified, then the equation error also creates a dependence between observations. Researchers have not taken this dependence into account in the calculation of standard errors. We suspect, however, that this is likely to cause only a small understatement of the standard errors, because the covariates do not account for much of the residual variance of test scores and because the PMPD sample is only about half of the overall sample. (We tried to compute the propensity score to adjust the standard errors ourselves, but were unable to replicate the PMPD model, because it was not described in sufficient detail and because the computer programs are proprietary. This also prevented us from replicating estimates in Table 4 of Barnard et al.)

The problems created by the complicated experimental design, particularly concerning the weights, lead us to reiterate the advice of Cochran and Cox (1957): "A good working rule is to use the simplest experimental design that meets the needs of the occasion." It seems to us that in this case, the PMPD design introduced unnecessary complexity that inadvertently led to a consequential mistake in the computation of weights, with very little improvement in efficiency. This was not the fault of the architects of the PMPD design, who did not compute the weights, but it was related to the complexity of the design. Simplicity and transparency, in designs, methods, and procedures, can help avoid mistakes down the road, which are almost inevitable in large-scale empirical studies. Indeed, Mathematica recently informed us that they still do not have the baseline weights exactly correct, 5 years after random assignment.

2. INTENT-TO-TREAT ESTIMATION RESULTS

Barnard et al. devote much attention to addressing potential problems caused by missing data for the sample of students in grades 1–4 at baseline; this follows the practice of earlier reports by Mathematica, which restricted the analysis of student outcomes (but not parental responses, such as satisfaction) to those enrolled in grades 1–4 at baseline. In our opinion, a much more important substantive problem results from the exclusion of students in the kindergarten cohort, who were categorically dropped because they were not given baseline tests. Because assignment to treatment status was random (within strata), a simple comparison of means between treatments and controls, without conditioning on baseline scores, provides an unbiased estimate of the average treatment effect. Moreover, as Barnard et al. and others show, treatments and controls were well balanced in terms of baseline characteristics, so there is no reason to suspect that random assignment was somehow corrupted.

Table 2 presents regression estimates of the ITT effect using various samples. Because we cannot replicate the Bayesian estimates without knowing the propensity score, we present conventional ITT estimates. For each entry in the first column, we regressed the test score percentile rank on a voucher offer dummy, 30 dummies indicating lottery strata (block × family size × high/low school), and baseline test scores. The sample is limited to those with baseline test data. Our results differ from the ITT estimates in Table 5 of Barnard et al. because we use the revised weights and a more comprehensive sample that also includes students from multichild families. Barnard et al. report that their Bayesian estimates are "generally more stable" than the conventional ITT estimates, "which in some cases are not even credible," but the extreme ITT estimates that they refer to, for 4th graders from high-achieving schools, are based on only 15 students. For the samples that pool students across grades, the results of the conventional ITT estimates are a priori credible and probably not statistically different from their Bayesian estimates.

In the second column, we use the same sample as in column 1, but omit the baseline test score from the model. In the third column, we expand the sample to include those with missing baseline scores. In the fourth column, we continue to include students with missing baseline scores, but restrict the sample to those initially in low-achieving public schools.

We focus mainly on the results for Black students and students from low-achieving public schools, because the effect of offering a voucher on either reading or math scores for the overall sample is always statistically insignificant, and because most public policy attention has focused on these two groups.

Although it has not been made explicit in previous studies of these data, Black students have been defined as children with a non-Hispanic Black/African-American mother. We use this definition and a broader one to probe the sensitivity of the results.

Omitting baseline scores has little qualitative effect on the estimates when the same sample is used (compare columns 1 and 2); the coefficient typically changes by no more than half a standard error. For year 3 scores for Black students, for example, the treatment effect on the composite score is 5.2 points (t = 3.2) when controlling for baseline scores and 5.0 points (t = 2.6) when not controlling. This stability is expected in a


Table 2. Estimated Treatment Effects With and Without Controlling for Baseline Scores: Coefficient on Voucher Dummy From Test Score Regressions for Various Samples

                                                  Subsample with      Subsample with                          Applicant school
                                                  baseline scores,    baseline scores,    Full sample,        Low, omits
Test        Sample                                controls for        omits baseline      omits baseline      baseline scores
                                                  baseline scores     scores              scores

First follow-up test
Composite   Overall                               1.27 (0.96)         0.42 (1.28)         −0.33 (1.06)        −0.60 (1.06)
            Mother Black, non-Hispanic            4.31 (1.28)         3.64 (1.73)         2.66 (1.42)         1.75 (1.57)
            Either parent Black, non-Hispanic     3.55 (1.24)         2.52 (1.73)         1.38 (1.42)         0.53 (1.55)
Reading     Overall                               1.11 (1.04)         0.03 (1.36)         −1.13 (1.17)        −1.17 (1.18)
            Mother Black, non-Hispanic            3.42 (1.59)         2.60 (1.98)         1.38 (1.74)         0.93 (1.84)
            Either parent Black, non-Hispanic     2.68 (1.50)         1.49 (1.97)         −0.09 (1.71)        −0.53 (1.81)
Math        Overall                               1.44 (1.17)         0.80 (1.45)         0.47 (1.17)         −0.03 (1.15)
            Mother Black, non-Hispanic            5.28 (1.52)         4.81 (1.93)         4.01 (1.54)         2.68 (1.63)
            Either parent Black, non-Hispanic     4.42 (1.48)         3.54 (1.90)         2.86 (1.51)         1.59 (1.63)

Third follow-up test
Composite   Overall                               0.90 (1.14)         0.36 (1.40)         −0.38 (1.19)        0.10 (1.16)
            Mother Black, non-Hispanic            5.24 (1.63)         5.03 (1.92)         2.65 (1.68)         3.09 (1.67)
            Either parent Black, non-Hispanic     4.70 (1.60)         4.27 (1.92)         1.87 (1.68)         1.90 (1.67)
Reading     Overall                               0.25 (1.26)         −0.42 (1.51)        −1.00 (1.25)        −0.08 (1.24)
            Mother Black, non-Hispanic            3.64 (1.87)         3.22 (2.19)         1.25 (1.83)         2.06 (1.83)
            Either parent Black, non-Hispanic     3.37 (1.86)         2.75 (2.18)         0.76 (1.83)         1.23 (1.82)
Math        Overall                               1.54 (1.30)         1.15 (1.53)         0.24 (1.33)         0.29 (1.28)
            Mother Black, non-Hispanic            6.84 (1.86)         6.84 (2.09)         4.05 (1.86)         4.13 (1.85)
            Either parent Black, non-Hispanic     6.04 (1.83)         5.78 (2.09)         2.97 (1.85)         2.58 (1.82)

NOTE: Standard errors are in parentheses. The dependent variable is the test score national percentile rank (NPR). The reported coefficient is the coefficient on the voucher offer dummy. All regressions control for 30 randomization strata dummies; models in the first column also control for baseline math and reading test scores. Bootstrap standard errors are robust to correlation in residuals among students in the same family. Bold font indicates that the absolute t ratio exceeds 1.96.
Sample sizes in year 1 for the subsample with baseline scores/without are: overall, 1,455/2,080; mother Black, 623/883; either parent Black, 684/968.
Sample sizes in year 3 for the subsample with baseline scores/without are: overall, 1,250/1,801; mother Black, 519/733; either parent Black, 572/807.
Sample sizes in year 1 for the subsample from low schools without baseline scores are: overall, 1,802; mother Black, 770; either parent Black, 843.
Sample sizes in year 3 for the subsample from low schools without baseline scores are: overall, 1,569; mother Black, 647; either parent Black, 712.

randomized experiment. When the sample is expanded to include those with missing baseline scores in column 3, however, the ITT effect falls almost in half in year 3 for Black students, to 2.65 points, with a t ratio of 1.58. The effect on the math score is statistically significant, as Barnard et al. emphasize. Qualitatively similar results are found if baseline covariates, such as mother's education and family income, are included as regressors (see Krueger and Zhu 2003).

The classification of Black students used in the study is somewhat idiosyncratic. Race and ethnicity were collected in a single question in the parental survey, contrary to the OMB guidelines for government surveys, so it is impossible to identify Blacks of Hispanic origin (of which there are a large number in New York). In addition, the survey did not directly ask about the child's race or ethnicity, so children's race must be

inferred from parents' race/ethnicity. A broader definition of Black students, and one that probably comes closer to the definition used in government surveys and most other educational research, would also include those who have a Black father in the sample of Black students. According to the 1990 Census, race was reported as Black for 85% of the children with a Black father and a Hispanic mother (the overwhelming non-Black portion of the sample) in the New York metropolitan region.

Hence we also present results for a sample in which either parent is listed as "Black, non-Hispanic." This increases the sample of Black students by about 10%, and the results are even weaker for this sample. For example, the ITT estimate using the more comprehensive sample of Black students enrolled in grades K–4 at baseline is 1.87 points (t = 1.11) on the third-year follow-up composite test, although even with this sample


the effect on math is significant at the 10% level in year 1, as Barnard et al. have found, and almost significant in year 3. So we conclude that the effect of the opportunity to use a private school voucher on the composite score for the most comprehensive sample of Black students is insignificantly different from 0, although it is possible that there was initially a small beneficial effect on the math score for Black students.

In the final column, we present results for the subsample of students originally enrolled in schools with average test scores below the median score in New York City. In each case, the results for this subsample are quite similar to those for the full sample, and a formal test of the null hypothesis that students from low- and high-achieving schools benefit equally from vouchers is never close to rejecting. It also appears very unlikely that the differences between the treatment effects for applicants from low- and high-achieving schools in Barnard et al.'s Table 4 are statistically significant, either.

3. WHAT WAS BROKEN

We agree with Barnard et al. that the experiment was broken, in the sense that attrition and missing data were common. Previous analyses were also strained, if not broken, for their neglect of the cohort of students originally in kindergarten, who were in third grade at the end of the experiment and whose follow-up test scores were ignored. Including these students qualitatively alters the results for Black students. The experiment was also broken in the sense that years passed before correct baseline weights were computed.

We disagree, however, with the interpretation that the experiment was broken because compliance was less than 100%. This

depends on the question that one is interested in answering. Because most interest in the experiment for policy purposes centers on the impact of offering vouchers on achievement, not on compelling students to use vouchers, we think the ITT estimates, which reflect inevitable partial usage of vouchers, are most relevant (see also Rouse 1998; Angrist et al. 2003). Moreover, New York had a higher voucher take-up rate than the experiments in Dayton and the District of Columbia, so one could argue that the New York experiment provides an upper-bound estimate of the effect of offering vouchers. On the other hand, if the goal is to use the experiment to estimate the effect of attending private school on achievement, then such methods as instrumental variables or those used by Barnard et al. are necessary to estimate the parameter of interest. We consider this of secondary interest in this case, however.

ADDITIONAL REFERENCES

Angrist, J., Bettinger, E., Bloom, E., King, E., and Kremer, M. (2003), "Vouchers for Private Schooling in Colombia: Evidence From a Randomized Natural Experiment," American Economic Review, 92, 1535–1558.

Cochran, W., and Cox, G. (1957), Experimental Designs (2nd ed.), New York: Wiley.

Krueger, A., and Zhu, P. (2003), "Another Look at the New York City School Voucher Experiment," American Behavioral Scientist, forthcoming. Also available from Industrial Relations Section, Princeton University, at http://www.irs.princeton.edu.

Peterson, P., Myers, D., and Howell, W. (1998), "An Evaluation of the New York City Scholarships Program: The First Year," Mathematica Policy Research, Inc., available at http://www.ksg.harvard.edu/pepg/pdf/ny1rpt.pdf.

Rouse, C. (1998), "Private School Vouchers and Student Achievement: An Evaluation of the Milwaukee Parental Choice Program," The Quarterly Journal of Economics, 113, 553–602.

Comment
Richard A. BERK and Hongquan XU

1. INTRODUCTION

The article by Barnard, Frangakis, Hill, and Rubin (hereafter BFHR) is a virtuoso performance. By applying the Neyman–Rubin model of causal effects and building on a series of important works on the estimation and interpretation of causal effects (e.g., Angrist, Imbens, and Rubin 1996), BFHR manage to extract a reasonable set of findings from a "broken" randomized experiment. The article more generally underscores the important difference between estimating relationships in a consistent manner and interpreting those estimates in causal terms; obtaining consistent estimates is only part of the enterprise. Finally, the article emphasizes that assumptions made so that proper estimates may be obtained are not just technical moves of convenience. Rather, they are statements about how the empirical world is supposed to work. As such, they need to be examined

Richard A. Berk is Professor and Hongquan Xu is Assistant Professor, Department of Statistics, University of California, Los Angeles, CA 90095 (E-mail: berk@stat.ucla.edu).

concretely with reference to the phenomena being studied, evaluated empirically whenever possible, and, at a minimum, subjected to sensitivity tests.

As battle-tested academics, we can certainly quibble here and there about some aspects of the article. For example, might it not have made sense to construct priors from the educators and economists who were responsible for designing the intervention? Moreover, might it not have made sense to interpret the results using a yardstick of treatment effects that are large enough to matter in practical terms? Nonetheless, we doubt that, overall, we could have done as well as BFHR did. Moreover, most of our concerns depend on features of the data that cannot be known from a distance. We would have had to carefully examine the data ourselves. As a result, we focus on why a virtuoso performance was needed to begin with. Why was this trip necessary?



2. NONCOMPLIANCE

The implementation of randomized field experiments, as desirable as they are, invites any number of well-known problems (Berk 1990). One of these problems is that human subjects often do not do what they are told, even when they say they will. There is a recent scholarly literature on noncompliance (Metry and Meyer 1999; Johnson 2000) and a citation trail leading back to evaluations of the social programs of the War on Poverty (Haveman 1977). Some noncompliance can be expected except when there is unusual control over the study subjects. Our recent randomized trial of the inmate classification system used by the California Department of Corrections is one example (Berk, Ladd, Graziano, and Baek 2003).

An obvious question follows: Why does it seem that the individuals who designed this experiment were caught by surprise? As best we can tell, little was done to reduce noncompliance, and, more important for the analyses of BFHR, apparently almost no data were collected on factors that might be explicitly related to noncompliance (Hill, Rubin, and Thomas 2000).

Without knowing far more about the design stages of the experiment, we find it difficult to be very specific. But from the literatures on compliance and school choice, the following are probably illustrative indicators relevant for parents who would not be inclined to use a school voucher; for other forms of noncompliance, other measures would be needed:

1. Whether the adults in the household are employed outside the home during the day

2. Whether there is any child care at the private schools before classes begin and after they end for children of employed parents

3. A household's ability to pay for private school beyond the value of the voucher

4. Travel time to the nearest private schools

5. Availability of transportation to those schools

6. Religious preference of the household and religious affiliation of the private school

From these examples, and others that individuals closer to the study would know about, one can imagine a variety of measures to reduce noncompliance. For instance, key problems for any working parent are child care before and after school and round-trip school transportation. These might be addressed directly with supplemental support. If no such resources could be obtained, then the study design might be altered to screen out, before random assignment, households for whom compliance was likely to be problematic.

3. MISSING DATA

No doubt BFHR know more about the subtleties of the missing data than we do. Nevertheless, the same issues arise for missing data that arise for noncompliance. Surely problems with missing data could have been anticipated. Factors related to nonresponse to particular items could have been explicitly measured, and prevention strategies perhaps could have been deployed.

4. DESIGN EFFICIENCY

A carefully designed experiment not only provides a solid basis for statistical inference, but also gives credible evidence for causation. The New York School Choice Scholarships Program (NYSCSP) was apparently the first good example of a randomized experiment to examine the potential benefits of vouchers for private schools. Offering a scholarship (treatment) randomly through a lottery can (1) provide a socially acceptable way to ration scarce resources, (2) protect against undesirable selection bias, and (3) produce balance between the observed and unobserved variables that may affect the response. In addition to the randomization and blocking, the NYSCSP implemented a PMPD (propensity matched pairs design) to choose a control group from a large candidate pool and to balance many covariates explicitly. Table 2 of BFHR shows that various background variables were balanced properly between the treatment group and the control group.

But might it have been possible to do a bit better? BFHR did not mention that the sample size of the applicant's school was not balanced, although winners from high and low schools were balanced between the treatments and the controls. An ideal design would have had equal numbers of winners from "bad" (or low) and "good" (or high) public schools. In the NYSCSP, 85% of the winners were from bad schools and 15% were from good schools, to achieve the social goals of the School Choice Scholarships Foundation (Hill et al. 2000, p. 158). Bad schools are defined as "those for which the average test scores were below the median test scores for the city" (Hill et al. 2000, p. 158). A direct consequence is that the estimates from the good schools have much larger variation than those from the bad schools, which is evident in BFHR's Tables 4–6: the confidence intervals for good schools are wider.

Instead of classifying schools into two classes (bad and good), it might have been helpful to classify them into four classes: very bad, bad, good, and very good. To estimate the effect of the applicant's school (assuming linear effects), an optimal design would assign half of the winners to the very bad schools, the other half to the very good schools, and none to the two middle classes. Avoiding the middle schools enlarges the distance between bad and good schools and also reduces the source variation in the bad and good schools, and thus reduces the variation in the estimates. To be more consistent with the goals of the School Choice Scholarships Foundation, an alternative design might take 75% of winners from very bad schools, 10% from bad schools, 5% from good schools, and 10% from very good schools. This design would have a higher design efficiency than the original design, especially because the applicant's school was the most important design variable after family size (Hill et al. 2000, p. 160). Of course, people on the ground at the time may have considered these options and rejected them. Our point is that, with better planning, it might have been possible to squeeze more efficiency out of the study design. Such an outcome would have made the job undertaken by BFHR a bit easier.
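The intuition can be checked with a back-of-the-envelope calculation. The following minimal Python sketch is our own illustration, not part of the comment: it scales school quality to [0, 1], treats quality as uniform within each class, and assumes a linear effect, so that the sampling variance of the estimated slope is proportional to 1/sum((x - xbar)^2); the allocation percentages are those discussed above.

    import numpy as np

    rng = np.random.default_rng(0)

    def slope_var(bounds, weights, n=100_000, sigma2=1.0):
        # Variance of the OLS slope of outcome on applicant-school score,
        # proportional to 1 / sum((x - xbar)^2); each winner's school score
        # x is drawn uniformly within his or her school class.
        k = rng.choice(len(weights), size=n, p=weights)
        lo, hi = np.array(bounds).T
        x = rng.uniform(lo[k], hi[k])
        return sigma2 / np.sum((x - x.mean()) ** 2)

    # Actual design: 85% from below-median ("bad") schools, 15% above.
    v_actual = slope_var([(0.0, 0.5), (0.5, 1.0)], [0.85, 0.15])
    # The four-class compromise: 75/10/5/10 from very bad to very good.
    v_comp = slope_var([(0, .25), (.25, .5), (.5, .75), (.75, 1)],
                       [0.75, 0.10, 0.05, 0.10])
    # Optimal for a linear effect: half at each extreme, none in the middle.
    v_opt = slope_var([(0, .25), (.75, 1)], [0.5, 0.5])

    print(v_actual / v_comp)  # about 1.7: the compromise beats the 85/15 split
    print(v_actual / v_opt)   # about 3.9: extremes-only is best under linearity

Under these stylized assumptions, the 75/10/5/10 allocation cuts the slope variance by roughly 40% relative to the actual 85/15 split, and the all-extremes design cuts it by roughly 75%.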

5. CONCLUSIONS

Scientific problems often provide a fertile ground for important statistical developments; BFHR's article and the literature on which they depend are excellent illustrations. However, from the perspective of public policy, the goal of program evaluation is to obtain usefully precise estimates of program impacts while using the most robust means of analysis available. Simplicity is also important, both for data analysts who must do the work and policy makers who must interpret the results. It follows that large investments at the front end of an evaluation, in research design, measurement, and study integrity, are essential. Virtuoso statistical analyses can certainly help when these investments are insufficient or when they fail, but they must not be seen as an inexpensive alternative to doing the study well to begin with.

ADDITIONAL REFERENCES

Angrist, J., Imbens, G. W., and Rubin, D. B. (1996), "Identification of Causal Effects Using Instrumental Variables" (with discussion), Journal of the American Statistical Association, 91, 444–472.

Berk, R. A. (1990), "What Your Mother Never Told You About Randomized Field Experiments," in Community-Based Care of People With AIDS: Developing a Research Agenda, AHCPR Conference Proceedings, Washington, DC: U.S. Department of Health and Human Services.

Berk, R. A., Ladd, H., Graziano, H., and Baek, J. (2003), "A Randomized Experiment Testing Inmate Classification Systems," Journal of Criminology and Public Policy, 2, 215–242.

Haveman, R. H. (ed.) (1977), A Decade of Federal Antipoverty Programs: Achievements, Failures, and Lessons, New York: Academic Press.

Hill, J. L., Rubin, D. B., and Thomas, N. (2000), "The Design of the New York School Choice Scholarships Program Evaluation," in Research Design: Donald Campbell's Legacy (Vol. II), ed. L. Bickman, Thousand Oaks, CA: Sage.

Johnson, S. B. (2000), "Compliance Behavior in Clinical Trials: Error or Opportunity?" in Promoting Adherence to Medical Treatment in Chronic Childhood Illness: Concepts, Methods, and Interventions, ed. D. Drotar, Mahwah, NJ: Lawrence Erlbaum.

Metry, J., and Meyer, U. A. (eds.) (1999), Drug Regimen Compliance: Issues in Clinical Trials and Patient Management, New York: Wiley.

Rejoinder

John BARNARD, Constantine E. FRANGAKIS, Jennifer L. HILL, and Donald B. RUBIN

We thank the editorial board and the thoughtful group of discussants that they arranged to comment on our article. The discussants have raised many salient issues and offered useful extensions of our work. We use this rejoinder primarily to address a few points of contention.

Muthén, Jo, and Brown

Muthén, Jo, and Brown (henceforth MJB) provide an enlightening discussion of an alternate and potentially complementary approach to treatment effect estimation. The growth models that they discuss are for data more extensive than what we analyzed, and so are beyond the scope of our article. Nevertheless, MJB propose an interesting idea for looking for interactions in longitudinal experimental data, and we look forward to future applications of the method to appropriate datasets.

Although models that incorporate both our compliance principal strata and latent growth trajectories are theoretically possible (assuming multiple years of outcome data) and could potentially yield an extremely rich description of the types of treatment effects manifested, the more time points that exist, the more complicated the compliance patterns can become (at least without further assumptions), and it may be difficult to find a dataset rich enough to support all of the interactions. Given a choice, we feel that it is probably more beneficial to handle real observed issues than to search for interactions with latent trajectories, but certainly it would always be nice to be able to do both. Of course, such a search for interactions in trajectories is more beneficial when there is enough prior knowledge about the scientific background to include additional structural assumptions that can support robust estimation.

MJB make an interesting point about the exploration of potential violations of exclusion for never-takers. Such analyses have been done in past work by some of the authors (Hirano et al. 2000; Frangakis, Rubin, and Zhou 2002) but were not pursued for our study, because the robustness of such methods is still an issue in examples as complex as this one.

Regarding the trade-offs between the joint latent ignorability and compound exclusion assumptions versus the joint assumptions of standard ignorability and the standard exclusion restriction: this topic is open to debate. We would like to point out, however, that these standard structural assumptions come with their own level of arbitrariness. These issues have been explored by Jo (2002) and Mealli and Rubin (2002, 2003).

We thank MJB for providing a thought-provoking discussion of our model and assumptions, as well as ample fodder for additional directions that can be explored in this arena. Clearly, this general category of work has stimulated much new and exciting modeling.

Krueger and Zhu

Krueger and Zhu (henceforth KZ) focus primarily on the underlying study and on analyses that either we did not do or that used data we did not have. Clearly, they are interested in the substantive issues, which is admirable, but we cannot be responsible for analyses that we neither did nor were consulted on, and we are not prepared to comment on what analyses we might have undertaken with data that we did not have.

KZ state that "there is no substitute for probing the definition and concepts that underlie the data." Although we are not entirely sure what this refers to in particular, if it means that good science is more important than fancy techniques, then we certainly agree.

Another comment by KZ is that "it is desirable to use the most recent comprehensive data for the widest sample possible." We agree that it is desirable to have access to the most comprehensive data available, but it is not necessary to use all of these data in any given analysis. Here we used only first-year data, because these were the only data available to us at the time that we were building the model. We used only single-child families, as explained in the article, because, regrettably, no compliance data were collected for individual children in the multichild families.

The exclusion of kindergarten children could be considered more of a judgment call; it reflects the quality of the pretest scores, which matter not only for efficiency, but also for the reasonableness of our assumptions about noncompliance and missing data. Nothing in theory would prevent us from including these children and omitting these extra covariates.

These types of trade-offs (more data vs. more appropriate data) are common in statistical analyses, and the benefits are well known, for example, for avoiding model extrapolation or when some subsets of data do not mean the same thing as others. In fact, Krueger "ignored data" in his article on the minimum wage with David Card (Card and Krueger 1994) by limiting analyses to restaurants from a similar geographic region, in the hope that employees and employers across the two comparison states would be faced with similar economic conditions, rather than looking at all restaurants in these states. This is a form of implicit matching, which we believe represented a good choice. Surely there are times when focusing on subsets of data produces more reliable results.

Furthermore, we would like to point out that KZ's assertion that "because assignment to treatment status was random (within strata), a simple comparison of means between treatments and controls, without conditioning on baseline scores, provides an unbiased estimate of the average treatment effect" is simply false, because there are missing outcomes. Even if there is perfect compliance, a simple comparison of means across treatment groups when there are missing outcomes will not, in general, lead to an unbiased estimate of the treatment effect. Assumptions need to be made about the missing-data process.
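The following minimal simulation (ours; the outcome scale, effect size, and response model are invented for illustration) makes the point concrete: with perfect randomization and perfect compliance, a complete-case comparison of means is still biased once follow-up response depends on the outcome.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 100_000

    z = rng.integers(0, 2, n)                   # randomized offer (1 = yes)
    y = 20 + 2.0 * z + rng.normal(0, 10, n)     # true average effect = 2.0
    # Response depends on the outcome itself: higher-scoring children
    # are more likely to show up for follow-up testing.
    r = rng.random(n) < 1 / (1 + np.exp(-(y - 20) / 10))

    naive = y[r & (z == 1)].mean() - y[r & (z == 0)].mean()
    print(naive)  # noticeably below 2.0, despite perfect randomization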

When noncompliance exists in addition to the missing outcome data, Frangakis and Rubin (1999) have shown that estimation even of the ITT effect requires additional assumptions. Moreover, the ITT estimate, generally considered a conservative estimate of the treatment effect, can actually be anticonservative in this scenario. Finally, contrary to what KZ seem to imply in their final paragraph, instrumental variables methods alone are not sufficient to address the problems of both noncompliance and missing data.

KZ advocate "simplicity and transparency." We wholeheartedly agree, but disagree with the examples they use to defend their stance. Their first argument is that the PMPD did not yield strong efficiency gains and that we did not account for the PMPD in the calculation of standard errors. It may well be that precision gains were relatively modest; however, the analyses that they present are by no means conclusive in this regard. First, as with KZ's other ITT analyses, we have no clue as to how missing outcome data are handled. In addition, the PMPD was also used to balance compliance and missingness. It is unclear, however, how the gains in precision due to matching the missingness and noncompliance are displayed, if at all, in KZ's results. Finally, KZ did not account for the experimental design when estimating their results. Failure to account properly for the design typically results in estimates that overstate the variances of treatment effect estimates. In this design, it is the correlation between matched-pair units that leads the sampling variances to be smaller than their estimates.

Even if precision gains were truly modest, they of course were unknown to us at the time we debated the use of the PMPD design. KZ's argument is like the complaint of a cross-country motorist who experiences no car problems during his long drive and then bemoans the money wasted on a spare tire purchased before the trip began. We prefer to avoid potential problems by using careful design.

Importantly, KZ's criticism that we ignored this matching in the analysis is incorrect. The matching was done based on the estimated propensity score, so it follows from Rubin (1978) that assignment is ignorable conditionally given the covariates that were used to estimate it. Thus the Bayesian analysis that we conduct, which conditions on the estimated propensity score, is an approximately valid Bayesian analysis with respect to the matched assignment, assuming that our modeled relationship between the outcomes and the estimated propensity score captures the major features of the relationship to the covariates.
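For readers unfamiliar with this argument, here is a schematic Python sketch (our own, with invented covariates and coefficients; quintile subclassification stands in for the actual PMPD machinery and the model-based conditioning used in the article). Because the estimated propensity score is a balancing score, conditioning on it removes most of the covariate-induced bias in a treatment–control contrast.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(2)
    n = 20_000
    X = rng.normal(size=(n, 3))                       # matching covariates
    z = rng.random(n) < 1 / (1 + np.exp(-X @ [0.8, -0.5, 0.3]))
    y = X @ [3.0, 1.0, -2.0] + 1.5 * z + rng.normal(size=n)  # true effect 1.5

    # Estimate the propensity score from the covariates used for matching,
    # then condition on it (here via quintile subclassification):
    e_hat = LogisticRegression(C=1e6).fit(X, z).predict_proba(X)[:, 1]
    strata = np.digitize(e_hat, np.quantile(e_hat, [.2, .4, .6, .8]))
    effects = [y[(strata == s) & z].mean() - y[(strata == s) & ~z].mean()
               for s in range(5)]
    adjusted = np.average(effects, weights=np.bincount(strata) / n)

    naive = y[z].mean() - y[~z].mean()                # confounded contrast
    print(naive, adjusted)  # adjusted is far closer to 1.5 than naive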

KZ's second argument is that use of the PMPD "led to grave consequences," because sampling weights were miscalculated by MPR. It seems to us odd to maintain, as KZ do, that the PMPD "caused" this error; nor is it clear that use of another design would have avoided the error. In any case, we have rerun our analyses, this time generalizing to the population of which the children in our analyses are representative (by setting constant weights in step 3 of Section 8.1.1). For this population, all results are very similar to our original results. These results are posted as Appendix C at http://biosun01.biostat.jhsph.edu/~cfrangak/papers/sc.

A third argument revolves around the protracted debate between use of ITT versus treatment-targeted measures such as CACE. We addressed both ITT and CACE, which allows the reader to focus on one or the other or both, depending on which effect is expected to generalize to another environment: the ITT, or "sociological," effect of accepting or not accepting an offered voucher, or the CACE, or "structural" or "scientific," effect of using a voucher. One could argue that the effect of attending (CACE) is the more generalizable, whereas the sociological effect (ITT) is more easily influenced, for instance, by "hype" about how well the program works. That is, the ITT estimate will not reflect the "effect of offering" if, the next time the program is offered, the compliance rates change (which seems likely). We prefer to examine both, to provide a more comprehensive picture.

In sum, we agree that simplicity and transparency are important; this is the major motivation for our detailed description of our assumptions in terms of readily understandable concepts, rather than commonly invoked and less intuitive conditions involving correlated error terms. However, this desire for simplicity must be balanced with the prudence of guarding against potential problems through careful design.

Although we agree in theory with much of what KZ espouse, some of their criticisms seem misguided. Moreover, their presentation of analyses with incomplete exposition regarding missing-data complications makes these analyses difficult for us to evaluate properly.


Berk and Xu

Berk and Xu (henceforth BX) raise several interesting questions about the way in which the evaluation was designed and implemented. Many of these are actually addressed in a preceding and lengthier article by us on this topic (Barnard et al. 2002). That article describes, for instance, MPR's efforts to reduce missing data, without which the missingness rates would surely have been much higher. It also describes special challenges faced in this study. But in any study with human subjects, it is difficult to completely eliminate missing data, particularly if one desires more interesting information than can be obtained from administrative data. One of the challenges for the collection of outcome test scores was the fact that children needed to be tested outside of the public school setting, which allowed the use of a standardized test not used by the public school system, thereby greatly reducing the possibility of bias resulting from teachers "teaching to the test." Consequently, parents had to bring their children to testing areas on weekends, a formidable hurdle for some, even when monetary incentives were used.

An important consideration with this study was that the evaluators took the initiative to investigate a program that was sponsored by an organization (SCSF) whose primary goal was to provide scholarships, not to do research. We feel that MPR did an admirable job of convincing the SCSF to allow the study to be evaluated and of raising money for this purpose, but this also meant that the evaluators could not control every aspect of the program. For example, the SCSF originally wanted to make the program available only to children from the lower-test-score schools, and the evaluators had to convince them also to allow a small percentage from those in higher-test-score schools, to increase the generalizability of the results. The designs that BX suggest, even though they have certain desirable statistical properties under appropriate assumptions, were not an option with this study.

Related to this issue of the alignment of the evaluators' and program administrators' goals is the issue of the potential for reducing noncompliance rates. Some effort was made, in that help was provided to scholarship winners in finding an appropriate private school. Arguably, however, 100% compliance should not be the goal in such a study, given that we would not expect 100% compliance if the program were expanded and offered, say, by the government on a larger scale. Thus it may not be wise to use extreme efforts to increase compliance if such efforts would not be offered in the large-scale public case. With the current study, the opportunity exists to examine the noncompliance process for this program in this setting: take-up rates, predictors of noncompliance, and so on. To this end, some sort of subexperiment that randomizes a reasonable set of compliance incentives might be useful in similar studies in the future, if sample sizes are sufficient.

We are not convinced that we should have solicited actual prior distributions from educators and economists. In our experience, this is a confusing, difficult, and often unrewarding task. Moreover, school choice is such a politically charged and contentious issue that it seems wiser to try to use relatively diffuse and objective prior distributions.

We agree that results should be interpreted in terms of a yardstick of practical effects. We are unaware of the most appropriate such yardstick to use in this context. One side could claim that even tiny average effects matter, not only because of the cumulative effects on parts of society, but also because of possible nonadditivity of such effects, which could imply huge effects for some rare individuals and minute effects for all others. Others could argue that seemingly quite large effects have no long-run practical impact on quality of life. The results that we present provide a basic empirical starting point for these debates between people whose expertise is in this area.

We thank BX for highlighting our emphasis on the role of the assumptions in our analyses. Our hope is that we have presented the assumptions in such a way as to enable readers to make their own judgment about whether or not they were likely to be satisfied, rather than only stating them technically, which would implicitly make those important decisions for the interested but nontechnical readers.

BX point out a dearth of certain types of covariates. Part of the problem here is that we presented only a small portion of the data collected, due to the fact that our model in its current form did not handle many covariates. However, there were indeed discussions between some members of our team and MPR staff, during the development of the questionnaire, about what types of variables could be included that would be predictive of missing data or compliance, as well as outcomes. Some of the variables proposed were discarded due to the surveyors' experience with them as subject to problems such as high measurement error. Also, not every variable discussed was included, due to a desire for a survey that would not be overly burdensome for the study families.

We raise a final, small, semantic question. We wonder about the propriety of the label "Neyman–Rubin model" to describe a framework for causal inference that views observational studies and randomized experiments on the same continuum, and moreover encourages Bayesian analyses of both. Neyman (1923) certainly seems to have been the first to use the formal notation of potential outcomes for randomization-based inference in randomized experiments, which was a truly major advance, but neither he nor anyone else, to the best of our knowledge, used this formal notation to define causal effects in observational studies until Rubin (1974, 1975), nor did he nor anyone else ever advocate Bayesian inference for the unobserved potential outcomes until Rubin (1975, 1978), which is the perspective that we have implemented here.

Overall, we agree with BX about the paramount importance of study design, and we tried to design this study carefully, certainly more so than most social science analyses of secondary datasets like the PSID or CPS. We thank BX for their thoughtful commentary; they raise some important points.

We thank all of the discussants for stimulating comments, which we believe have added emphasis to the importance of both the substantive topic of school vouchers and the development of appropriate methods to evaluate this and other such broken randomized experiments.

ADDITIONAL REFERENCES

Card, D., and Krueger, A. (1994), "Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania," American Economic Review, 84, 772–793.

Jo, B. (2002), "Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications" (with discussion), Journal of Educational and Behavioral Statistics, 27, 385–420.

Mealli, F., and Rubin, D. B. (2002), Discussion of "Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications," by B. Jo, Journal of Educational and Behavioral Statistics, 27, 411–415.

——— (2003), "Assumptions Allowing the Estimation of Direct Causal Effects: Discussion of 'Healthy, Wealthy, and Wise? Tests for Direct Causal Paths Between Health and Socioeconomic Status,' by Adams et al.," Journal of Econometrics, 112, 79–87.

Rubin, D. B. (1975), "Bayesian Inference for Causality: The Importance of Randomization," in Proceedings of the Social Statistics Section, American Statistical Association, pp. 233–239.


Table B2. CACE Estimand

                                  Applicant school: Low                 Applicant school: High
Grade at
application   Ethnicity    Reading            Math                Reading             Math

1             AA           3.8 (-2.0, 9.6)    9.0 (4.1, 14.0)     2.3 (-7.3, 10.9)    8.3 (1.1, 15.5)
              Other        2.9 (-3.1, 8.8)    6.3 (0.8, 11.8)     1.3 (-8.0, 10.2)    5.9 (-2.2, 13.9)
2             AA           1.1 (-3.6, 5.7)    3.1 (-1.3, 7.5)    -0.4 (-9.0, 7.9)     2.9 (-5.0, 10.7)
              Other        0.3 (-4.9, 5.4)    0.8 (-4.4, 5.9)    -1.5 (-10.5, 7.3)    0.0 (-8.5, 8.5)
3             AA           1.5 (-4.1, 7.0)    6.3 (0.3, 12.2)    -0.2 (-9.1, 8.5)     5.6 (-3.2, 14.1)
              Other        0.5 (-5.4, 6.4)    3.5 (-3.4, 10.2)   -1.3 (-10.5, 7.9)    2.7 (-7.0, 12.0)
4             AA           4.6 (-1.6, 10.8)   5.5 (-0.7, 11.6)    3.0 (-6.0, 11.6)    4.8 (-3.6, 13.1)
              Other        3.7 (-2.6, 10.1)   2.8 (-3.8, 9.3)     2.3 (-7.4, 11.7)    2.0 (-7.0, 11.0)
Overall       AA           2.6 (-1.2, 6.3)    6.0 (2.4, 9.5)      1.0 (-6.9, 8.4)     5.5 (-1.5, 12.2)
              Other        1.7 (-2.3, 5.8)    3.3 (-1.2, 7.7)     0.1 (-8.1, 7.9)     2.7 (-5.0, 10.3)

NOTE: Plain numbers are means, and parentheses are central 95% intervals, of the posterior distribution of the effects on percentile rank.

APPENDIX B: ETHNIC BREAKDOWN

The model of Section 7 can be used to estimate the effect of the program on finer strata that may be of interest. For example, to estimate effects stratified by ethnicity (AA for African-American), we obtain the posterior distribution of the causal effect of interest (ITT or CACE) using the same steps described in Section 8.1.1 or 8.1.2, but allowing ethnicity to be part of the definition of W^p. The results for this stratification are reported in Table B1 for the estimands of ITT and in Table B2 for the estimands of CACE.

The results follow patterns similar to those in Section 8. For each grade and applicant's school (low/high) combination, however, the effects are more positive on average for the subgroup of African-American children. For both estimands, this leads to 95% intervals that are entirely above 0 for math scores for the following subgroups: African-American first- and third-graders, and overall, from low-applicant schools; African-American first-graders from high-applicant schools; and non-African-American first-graders from low-applicant schools. All intervals for reading covered the null value. This suggests that the positive effects reported on math scores in Section 8 for children originating from low-applicant schools are primarily attributable to gains among the African-American children in this subgroup.

[Received October 2002. Revised October 2002.]

REFERENCES

Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996), "Identification of Causal Effects Using Instrumental Variables" (with discussion), Journal of the American Statistical Association, 91, 444–472.

Barnard, J., Du, J., Hill, J. L., and Rubin, D. B. (1998), "A Broader Template for Analyzing Broken Randomized Experiments," Sociological Methods and Research, 27, 285–317.

Barnard, J., Frangakis, C., Hill, J. L., and Rubin, D. B. (2002), "School Choice in NY City: A Bayesian Analysis of an Imperfect Randomized Experiment," in Case Studies in Bayesian Statistics, Vol. V, eds. C. Gatsonis, R. E. Kass, B. Carlin, A. Carriquiry, A. Gelman, I. Verdinelli, and M. West, New York: Springer-Verlag, pp. 3–97.

Brandl, J. E. (1998), Money and Good Intentions Are Not Enough, or Why Liberal Democrats Think States Need Both Competition and Community, Washington, DC: Brookings Institution Press.

Carnegie Foundation for the Advancement of Teaching (1992), School Choice: A Special Report, San Francisco: Jossey-Bass.

Chubb, J. E., and Moe, T. M. (1990), Politics, Markets, and America's Schools, Washington, DC: Brookings Institution Press.

Cobb, C. W. (1992), Responsive Schools, Renewed Communities, San Francisco: Institute for Contemporary Studies.

Coleman, J. S., Hoffer, T., and Kilgore, S. (1982), High School Achievement, New York: Basic Books.

Cookson, P. W. (1994), School Choice: The Struggle for the Soul of American Education, New Haven, CT: Yale University Press.

Coulson, A. J. (1999), Market Education: The Unknown History, Bowling Green, OH: Social Philosophy & Policy Center.

Cox, D. R. (1958), Planning of Experiments, New York: Wiley.

D'Agostino, Ralph B., Jr., and Rubin, D. B. (2000), "Estimating and Using Propensity Scores With Incomplete Data," Journal of the American Statistical Association, 95, 749–759.

Education Week (1998), Quality Counts '98: The Urban Challenge, Public Education in the 50 States, Bethesda, MD: Editorial Projects in Education.

Frangakis, C. E., and Rubin, D. B. (1999), "Addressing Complications of Intention-to-Treat Analysis in the Combined Presence of All-or-None Treatment-Noncompliance and Subsequent Missing Outcomes," Biometrika, 86, 365–380.

——— (2001), "Addressing the Idiosyncrasy in Estimating Survival Curves Using Double-Sampling in the Presence of Self-Selected Right Censoring" (with discussion), Biometrics, 57, 333–353.

——— (2002), "Principal Stratification in Causal Inference," Biometrics, 58, 20–29.

Frangakis, C. E., Rubin, D. B., and Zhou, X. H. (2002a), "Clustered Encouragement Design With Individual Noncompliance: Bayesian Inference and Application to Advance Directive Forms" (with discussion), Biostatistics, 3, 147–164.

Frangakis, C. E., Brookmeyer, R. S., Varadhan, R., Mahboobeh, S., Vlahov, D., and Strathdee, S. A. (2002b), "Methodology for Evaluating a Partially Controlled Longitudinal Treatment Using Principal Stratification, With Application to a Needle Exchange Program," Technical Report NEP-06-02, Johns Hopkins University, Dept. of Biostatistics.

Fuller, B., and Elmore, R. F. (1996), Who Chooses? Who Loses? Culture, Institutions, and the Unequal Effects of School Choice, New York: Teachers College Press.

Gelfand, A. E., and Smith, A. F. M. (1990), "Sampling-Based Approaches to Calculating Marginal Densities," Journal of the American Statistical Association, 85, 398–409.

Gelman, A., Meng, X.-L., and Stern, H. (1996), "Posterior Predictive Assessment of Model Fitness via Realized Discrepancies" (with discussion), Statistica Sinica, 6, 733–760.

Gelman, A., and Rubin, D. B. (1992), "Inference from Iterative Simulation Using Multiple Sequences" (with discussion), Statistical Science, 7, 457–472.

Glynn, R. J., Laird, N. M., and Rubin, D. B. (1993), "Multiple Imputation in Mixture Models for Nonignorable Nonresponse With Follow-Ups," Journal of the American Statistical Association, 88, 984–993.

Goldberger, A. S., and Cain, G. G. (1982), "The Causal Analysis of Cognitive Outcomes in the Coleman, Hoffer and Kilgore Report," Sociology of Education, 55, 103–122.

Haavelmo, T. (1943), "The Statistical Implications of a System of Simultaneous Equations," Econometrica, 11, 1–12.

Hastings, W. (1970), "Monte Carlo Sampling Methods Using Markov Chains and Their Applications," Biometrika, 57, 97–109.

Hill, J. L., Rubin, D. B., and Thomas, N. (2000), "The Design of the New York School Choice Scholarship Program Evaluation," in Donald Campbell's Legacy, ed. L. Bickman, Newbury Park, CA: Sage Publications.

Hirano, K., Imbens, G. W., Rubin, D. B., and Zhou, A. (2000), "Assessing the Effect of an Influenza Vaccine in an Encouragement Design," Biostatistics, 1, 69–88.

Holland, P. (1986), "Statistics and Causal Inference," Journal of the American Statistical Association, 81, 945–970.

Imbens, G. W., and Angrist, J. D. (1994), "Identification and Estimation of Local Average Treatment Effects," Econometrica, 62, 467–476.


Imbens, G. W., and Rubin, D. B. (1997), "Bayesian Inference for Causal Effects in Randomized Experiments With Noncompliance," The Annals of Statistics, 25, 305–327.

Levin, H. M. (1998), "Educational Vouchers: Effectiveness, Choice, and Costs," Journal of Policy Analysis and Management, 17, 373–392.

Little, R. J. A. (1993), "Pattern-Mixture Models for Multivariate Incomplete Data," Journal of the American Statistical Association, 88, 125–134.

——— (1996), "Pattern-Mixture Models for Multivariate Incomplete Data With Covariates," Biometrics, 52, 98–111.

Little, R. J. A., and Rubin, D. B. (1987), Statistical Analysis With Missing Data, New York: Wiley.

Meng, X. L. (1994), "Posterior Predictive p Values," The Annals of Statistics, 22, 1142–1160.

Neal, D. (1997), "The Effects of Catholic Secondary Schooling on Educational Achievement," Journal of Labor Economics, 15, 98–123.

Neyman, J. (1923), "On the Application of Probability Theory to Agricultural Experiments: Essay on Principles, Section 9," translated in Statistical Science, 5, 465–480.

Peterson, P. E., and Hassel, B. C. (eds.) (1998), Learning from School Choice, Washington, DC: Brookings Institution Press.

Peterson, P. E., Myers, D. E., Howell, W. G., and Mayer, D. P. (1999), "The Effects of School Choice in New York City," in Earning and Learning: How Schools Matter, eds. S. E. Mayer and P. E. Peterson, Washington, DC: Brookings Institution Press.

Robins, J. M., Greenland, S., and Hu, F.-C. (1999), "Estimation of the Causal Effect of a Time-Varying Exposure on the Marginal Mean of a Repeated Binary Outcome" (with discussion), Journal of the American Statistical Association, 94, 687–712.

Rosenbaum, P. R., and Rubin, D. B. (1983), "The Central Role of the Propensity Score in Observational Studies for Causal Effects," Biometrika, 70, 41–55.

Rubin, D. B. (1974), "Estimating Causal Effects of Treatments in Randomized and Non-Randomized Studies," Journal of Educational Psychology, 66, 688–701.

——— (1977), "Assignment to Treatment Groups on the Basis of a Covariate," Journal of Educational Statistics, 2, 1–26.

——— (1978a), "Bayesian Inference for Causal Effects: The Role of Randomization," The Annals of Statistics, 6, 34–58.

——— (1978b), "Multiple Imputations in Sample Surveys: A Phenomenological Bayesian Approach to Nonresponse," in Proceedings of the Survey Research Methods Section, American Statistical Association, pp. 20–28.

——— (1979), "Using Multivariate Matched Sampling and Regression Adjustment to Control Bias in Observational Studies," Journal of the American Statistical Association, 74, 318–328.

——— (1980), Comment on "Randomization Analysis of Experimental Data: The Fisher Randomization Test," Journal of the American Statistical Association, 75, 591–593.

——— (1984), "Bayesianly Justifiable and Relevant Frequency Calculations for the Applied Statistician," The Annals of Statistics, 12, 1151–1172.

——— (1990), "Comment: Neyman (1923) and Causal Inference in Experiments and Observational Studies," Statistical Science, 5, 472–480.

The Coronary Drug Project Research Group (1980), "Influence of Adherence to Treatment and Response of Cholesterol on Mortality in the Coronary Drug Project," New England Journal of Medicine, 303, 1038–1041.

Tinbergen, J. (1930), "Determination and Interpretation of Supply Curves: An Example," reprinted in The Foundations of Econometric Analysis, eds. D. Hendry and M. Morgan, Cambridge, U.K.: Cambridge University Press.

Wilms, D. J. (1985), "Catholic School Effect on Academic Achievement: New Evidence From the High School and Beyond Follow-Up Study," Sociology of Education, 58, 98–114.

Comment

Bengt MUTHÉN, Booil JO, and C. Hendricks BROWN

1. INTRODUCTION

The article by Barnard, Frangakis, Hill, and Rubin (BFHR) is timely in that the Department of Education is calling for more randomized studies in educational program evaluation. (See the discussion of the "No Child Left Behind" initiative in, e.g., Slavin 2002.) BFHR can serve as a valuable pedagogical example of a successful, sophisticated statistical analysis of a randomized study. Our commentary is intended to provide additional pedagogical value to benefit the planning and analysis of future studies, drawing on experiences and research within our research group. [The Prevention Science Methodology Group (PSMG; www.psmg.hsc.usf.edu), co-PIs Brown and Muthén, has collaborated over the last 15 years with support from the National Institute of Mental Health and the National Institute on Drug Abuse.]

BFHR provides an exemplary analysis of the data from an imperfect randomized trial that suffers from several complications simultaneously: noncompliance, missing data in outcomes, and missing data in covariates. We are very pleased to see their application of cutting-edge Bayesian methods for dealing with these complexities. In addition, we believe the methodological issues and the results of the study have important implications for the design and analysis of randomized trials in education and for related policy decisions.

Bengt Muthén is Professor and Booil Jo is a postdoc, Graduate School of Education and Information Studies, University of California, Los Angeles, Los Angeles, CA 90095 (E-mail: bmuthen@ucla.edu). C. Hendricks Brown is Professor, Department of Epidemiology and Biostatistics, University of South Florida, Tampa, FL 33620. The research of the first author was supported by National Institute on Alcohol Abuse and Alcoholism grant K02 AA 00230. The research of the second and third authors was supported by National Institute on Drug Abuse and National Institute of Mental Health grant MH40859. The authors thank Chen-Pin Wang for research assistance, Joyce Chappell for graphical assistance, and the members of the Prevention Science Methodology Group and the Fall ED299A class for helpful comments.


BFHR provides results of the New York City school choice experiment based on 1-year achievement outcomes. With the planned addition of yearly follow-up data, growth models can provide an enhanced examination of causal impact. We discuss how such growth modeling can be incorporated and provide a caution that applies to BFHR's use of only one posttest occasion. We also consider the sensitivity of the latent class ignorability assumption in combination with the assumption of compound exclusion.

2. LONGITUDINAL MODELING ISSUES

BFHR focuses on variation in treatment effect across compliance classes. This part of the commentary considers variation in treatment effect across a different type of class, based on the notion that the private school treatment effect might very well be quite different for children with different achievement development. (Also of interest is potential variation in treatment effects across schools, with respect to both the public school the child originated in and the private school the child was moved to, but this multilevel aspect of the data is left aside here for lack of space.) To study such a "treatment–baseline interaction" (or "treatment–trajectory interaction"), we will switch from BFHR's pretest–posttest analysis framework (essentially a very advanced ANCOVA-type analysis) to the growth mixture modeling framework of Muthén et al. (2002). An underlying rationale for this modeling is that individuals at different initial status levels and on different trajectories may benefit differently from a given treatment. ANCOVA controls for initial status as measured by the observed pretest score. Unlike the observed pretest score, the latent variable of initial status is free of time-specific variation and measurement error.

The focus on longitudinal aspects of the New York School Choice Study (NYSCS) is both substantively and statistically motivated. First, treatment effects may not have gained full strength after only a 1-year stay in a private school. The NYSCS currently has data from three follow-ups, that is, providing repeated-measures data from four grades. Although BFHR used percentile scores that do not lend themselves to growth modeling, a conversion to "scale scores" (i.e., IRT-based equated scores) should be possible, enabling growth modeling. Unfortunately, educational research traditionally uses scales that are unsuitable for growth modeling, such as percentile scores, normal curve equivalents, and grade equivalents (for a comparison in a growth context, see Seltzer, Frank, and Bryk 1994). Hopefully, this tradition can be changed. Second, the use of information from more time points than pretest and posttest makes it possible to identify and estimate models that give a richer description of the normative development in the control group and how the treatment changes this development.

Consider three types of latent variables for individual i. The first type, C_i, refers to BFHR's compliance principal strata. The next two relate to the achievement development as expressed by a growth mixture model: T_i refers to trajectory class, and eta_i refers to random effects within trajectory class (the within-class model is a regular mixed-effects model). Unlike the latent class variable C_i, the latent class variable T_i is a fully unobserved variable, as is common in latent variable modeling (see, e.g., Muthén 2002a). Consider the likelihood expression for individual i, using the [·] notation to denote probabilities/densities:

[C_i, T_i | X_i] [eta_i | C_i, T_i, X_i] [Y_i | eta_i, C_i, T_i, X_i] × [U_i | eta_i, C_i, T_i, X_i] [R_i | Y_i, eta_i, C_i, T_i, X_i],   (1)

where X_i denotes covariates, U_i denotes a compliance stratum indicator (with C_i perfectly measured by U_i in the treatment group for never-takers, and perfectly measured in the control group for always-takers, other group–class combinations having missing data), and R_i denotes indicators for missing data on the repeated-measures outcomes Y_i (pretreatment and posttreatment achievement scores). This type of model can be fitted in the latent variable modeling framework of the Mplus program (Muthén and Muthén 1998–2002, tech. app. 8), which has implemented an EM-based maximum likelihood estimator. (For related references, see the Mplus website, www.statmodel.com.) As a special case of (1), conventional random-effects growth modeling includes eta_i but excludes C_i and T_i, and assumes missingness at random, so that the last term in (1) is ignored. Growth mixture modeling (Muthén and Shedden 1999; Muthén et al. 2002; Muthén and Muthén 1998–2002) includes eta_i and T_i. BFHR includes C_i but not T_i (or eta_i) and includes the last term in (1), drawing on the latent ignorability of Frangakis and Rubin (1999). Muthén and Brown (2001) studied latent ignorability related to T_i in the last term of (1). In randomized studies, it would be of interest to study C_i and T_i classes jointly, because individuals in different trajectory classes may show different compliance, and missingness may be determined by these classes jointly.
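To fix ideas, the structure of factorization (1) can be mimicked by a small generative simulation. The Python sketch below is ours; every distribution and parameter value is an invented placeholder rather than BFHR's or MJB's model, and, in the spirit of latent ignorability, the response indicator depends on the latent variables but not on the missing outcome itself.

    import numpy as np

    rng = np.random.default_rng(3)
    n = 2_000

    x = rng.normal(size=n)                                  # X_i: covariate
    t = (rng.random(n) < 1 / (1 + np.exp(-x))).astype(int)  # T_i | X_i
    c = rng.choice(3, size=n, p=[.6, .3, .1])               # C_i: 0 = complier,
                                                            #  1 = never, 2 = always
    eta = rng.normal(np.where(t == 1, 1.0, -1.0), 0.5)      # eta_i | C_i, T_i, X_i
    z = rng.integers(0, 2, n)                               # randomized arm
    y = (50 + 5 * x + 4 * eta                               # Y_i: outcome, moved
         + 2.0 * ((c == 0) & (z == 1))                      #  only for treated
         + rng.normal(0, 5, n))                             #  compliers
    u = np.where(z == 1, c == 1, c == 2)                    # U_i: only partly
                                                            #  reveals C_i
    p_r = 1 / (1 + np.exp(-(eta + 0.5 * (c == 0))))         # R_i | eta_i, C_i:
    r = rng.random(n) < p_r                                 #  latent ignorability
    y_obs = np.where(r, y, np.nan)                          # recorded outcome

Fitting such a model then amounts to integrating over the latent C_i, T_i, and eta_i, for example, by an EM-based maximum likelihood estimator or by Bayesian methods.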

If data have been generated by a growth mixture model with treatment effects varying across trajectory classes, what would a pretest–posttest analysis such as that in BFHR reveal? To judge the possibility of such treatment–trajectory interaction in the NYSCS, we considered several recent applications of growth mixture modeling that have used T_i to represent qualitatively different types of trajectories for behavior and achievement scores on children in school settings. Drawing on these real-data studies, two growth mixture scenarios were investigated. (A detailed description of these real-data studies and scenarios and their parameter values is given in Mplus Web Note 5 at www.statmodel.com/mplus/examples/webnote.html.) For simplicity, no missing data on the outcome or pretest are assumed, and C_i classes are not present. In a three-class scenario, the treatment effect is noteworthy only for a 70% middle class, assuming that low-class membership (10%) hinders individuals from benefiting from the treatment and that high-class membership (20%) does not really need the treatment. The achievement development in the three-class scenario is shown in Figure 1(a), and the corresponding posttest (y2)–pretest (y1) regressions are shown in Figure 1(b). The lines denoted ANCOVA show a regular ANCOVA analysis allowing for an interaction between treatment and pretest (different slopes). In the three-class scenario, the ANCOVA interaction is not significant at n = 2,000, and the treatment effect in the middle class is underestimated by 20% but overestimated in the other two classes. In a two-class scenario (not shown here), where the treatment is noteworthy only for individuals in the low class (50%), ANCOVA detects an interaction that is significant at the NYSCS sample size of n = 2,000 but underestimates the treatment effect for most children in the low class. (At the low-class average pretest value of 0, the treatment effect is underestimated by 32%.)
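A rough sense of the two-class phenomenon can be had from a few lines of Python (our own stylized rendition; the class proportions, separations, and effect size are invented and are not the parameter values of Mplus Web Note 5). Because the pretest is only a noisy measure of latent initial status, an ANCOVA with a treatment-by-pretest interaction attenuates the class-specific effect.

    import numpy as np

    rng = np.random.default_rng(4)
    n = 2_000

    low = rng.random(n) < 0.5                  # latent low trajectory class (50%)
    init = np.where(low, -1.0, 1.0) + rng.normal(0, 0.5, n)  # latent initial status
    z = rng.integers(0, 2, n)                  # randomized treatment
    y1 = init + rng.normal(0, 0.7, n)          # pretest: noisy measure of status
    y2 = (init + 1.0 * (low & (z == 1))        # treatment helps only the low class
          + rng.normal(0, 0.7, n))             # posttest

    # ANCOVA with a treatment-by-pretest interaction:
    X = np.column_stack([np.ones(n), y1, z, z * y1])
    b = np.linalg.lstsq(X, y2, rcond=None)[0]
    print(b[2] + b[3] * (-1.0))  # implied effect at the low-class mean pretest:
                                 # well below the true low-class effect of 1.0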

Although the NYSCS children are selected from low-performing schools (the average NYSCS math and reading percentile rankings are around 23–28), there may still be sufficient heterogeneity among children in their achievement growth to make a treatment–trajectory interaction plausible. The three-class scenario is possible, perhaps with more children in the low class relative to the other two classes. If this is the case, the ANCOVA analysis shown in Figure 1 suggests a possible reason for BFHR's finding of low treatment effects. The empirical studies and the results in Figure 1 suggest that future program evaluations may benefit from exploring variation in treatment effects across children characterized by different development. Using data from at least two posttreatment time points (three time points total), the class-specific treatment effects generated in these data can be well recovered by growth mixture modeling. (Monte Carlo simulation results are given in Mplus Web Note 5 at www.statmodel.com/mplus/examples/webnote.html.)


Figure 1. Growth Mixture Modeling Versus Pretest–Posttest Analysis.

A more flexible analysis is obtained with more posttreatment time points. An improved design for the determination of the latent trajectory classes is to use more than one pretreatment time point, so that the trajectory class membership is better determined before the treatment starts.

3. COMPOUND EXCLUSION AND LATENT IGNORABILITY

Based on the ideas of principal stratification (Frangakis and Rubin 2002) and latent ignorability (Frangakis and Rubin 1999), BFHR successfully demonstrates that the complexities of educational studies can be better handled under more explicit and flexible sets of assumptions. Although we think that their structural assumptions are reasonable in the NYSCS, we would like to add some thoughts on the plausibility of two other assumptions, considering more general situations.

Compound exclusion (CE) is one of the key structural assumptions in identifying principal effects under latent ignorability. However, the plausibility of this assumption can be questioned in practice (Frangakis et al. 2002; Hirano et al. 2000; Jo 2002a, 2002c; Shadish, Cook, and Campbell 2002; West and Sagarin 2000). In the NYSCS, it seems realistic to assume that winning a lottery has some positive impact on always-takers; however, it is less clear how winning a lottery will affect never-takers. One possibility is that winning a lottery has a negative impact on parents, because they fail to benefit from it. Discouraged parents may have a negative influence on a child's test scores or response behaviors. This negative effect may become more evident if noncompliance is due to parents' low expectations of, or lack of interest in, their child's education. Another possibility is that winning a lottery has a positive impact on a child's test scores or response behaviors. For example, parents who are discouraged by being unable to send their child to private schools even with vouchers may try harder to improve the quality of existing resources (e.g., in the public school their child attends) and be more motivated to support their child to improve his or her academic performance. Given these competing possibilities, it is not easy to predict whether and how CE is violated.

Depending on the situation, causal effect estimates can be quite sensitive to violation of the exclusion restriction in outcome missingness (Jo 2002b), which is less well known than the impact of violating the exclusion restriction on observed outcomes (Angrist et al. 1996; Jo 2002a). The implication of possible violation of CE and its impact is that the relative benefit of models assuming latent ignorability (LI) versus standard ignorability (SI) depends on the degrees of deviation from CE and SI. Identification of causal effects under LI relies on the generalized (compound) exclusion restriction (i.e., both on the outcomes and on the missingness of outcomes), whereas identification of causal effects under SI relies on the standard exclusion restriction (i.e., only on the outcomes). Therefore, in some situations, the impact of deviation from CE may outweigh the impact of deviation from SI, resulting in more biased causal effect estimates in models assuming LI than in models assuming SI (Jo 2002b). For example, if SI holds but CE is seriously violated (say, a 20% increase in the response rate due to treatment assignment for compliers and a 15% increase for never-takers), then causal effect estimates and the coverage probability assuming LI and CE can drastically deviate from the true value and the nominal level. This type of violation does not affect models assuming SI and the standard exclusion restriction. To empirically examine the plausibility of SI, LI, and CE, it will be useful to do sensitivity analyses of models imposing different combinations of these assumptions. As BFHR points out, this investigation can be conducted by relaxing compound exclusion (e.g., Frangakis et al. 2002; Hirano et al. 2000) or by using alternative structural assumptions (e.g., Jo 2002c). More research is needed to examine the efficiency of these alternative models and to explore factors associated with insensitivity of LI models to violation of compound exclusion.

4. CONCLUSION

Causal inferences of the type BFHR provides are a dramatic improvement over the existing literature now available on the question of whether school choice will produce better achievement outcomes for children in an urban public school system. The randomized lottery provides an exceptionally powerful tool for examining the impact of a program, far more useful than observational studies that have causal change intertwined hopelessly with self-selection factors. Statisticians are just now investigating variations in such principal strata analyses, that is, those involving latent classes formed as a function of randomized trials involving intervention invitations (such as vouchers), encouragement designs, and field trial designs involving more than one randomization (Brown and Liao 1999). The latent categories in this article, which BFHR labels "complier," "never-taker," "always-taker," and "defier," represent only one type of design. Other terms may be more relevant to the scientific questions underlying trials in which subjects are randomly assigned to different levels of invitation (e.g., Angrist and Imbens 1995) or different levels of implementation. Such trials not only have great potential for examining questions of effectiveness, sustainability, and scalability, but also require terms more consistent with adherence than compliance. Again, we congratulate the authors on an important addition to the methodological literature that we predict will have lasting impact.

ADDITIONAL REFERENCES

Angrist, J. D., and Imbens, G. W. (1995), "Two-Stage Least Squares Estimation of Average Causal Effects in Models With Variable Treatment Intensity," Journal of the American Statistical Association, 90, 431–442.

Brown, C. H., and Liao, J. (1999), "Principles for Designing Randomized Preventive Trials in Mental Health: An Emerging Developmental Epidemiologic Perspective," American Journal of Community Psychology, 27, 673–709.

Jo, B. (2002a), "Model Misspecification Sensitivity Analysis in Estimating Causal Effects of Interventions With Noncompliance," Statistics in Medicine, 21, 3161–3181.

——— (2002b), "Sensitivity of Causal Effects Under Ignorable and Latent Ignorable Missing-Data Mechanisms," available at www.statmodel.com/mplus/examples/jo.

——— (2002c), "Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications" (with discussion), Journal of Educational and Behavioral Statistics, 27, 385–420.

Muthén, B. (2002a), "Beyond SEM: General Latent Variable Modeling," Behaviormetrika, 29, 81–117.

Muthén, B., and Brown, C. H. (2001), "Non-Ignorable Missing Data in a General Latent Variable Modeling Framework," presented at the annual meeting of the Society for Prevention Research, Washington, DC, June 2001.

Muthén, B., Brown, C. H., Masyn, K., Jo, B., Khoo, S. T., Yang, C. C., Wang, C. P., Kellam, S., Carlin, J., and Liao, J. (2002), "General Growth Mixture Modeling for Randomized Preventive Interventions," Biostatistics, 3, 459–475.

Muthén, L., and Muthén, B. (1998–2002), Mplus User's Guide, Los Angeles: authors.

Muthén, B., and Shedden, K. (1999), "Finite Mixture Modeling With Mixture Outcomes Using the EM Algorithm," Biometrics, 55, 463–469.

Seltzer, M. H., Frank, K. A., and Bryk, A. S. (1994), "The Metric Matters: The Sensitivity of Conclusions About Growth in Student Achievement to Choice of Metric," Educational Evaluation and Policy Analysis, 16, 41–49.

Shadish, W. R., Cook, T. D., and Campbell, D. T. (2002), Experimental and Quasi-Experimental Designs for Generalized Causal Inference, Boston: Houghton Mifflin.

Slavin, R. E. (2002), "Evidence-Based Education Policies: Transforming Educational Practice and Research," Educational Researcher, 31, 15–21.

West, S. G., and Sagarin, B. J. (2000), "Participant Selection and Loss in Randomized Experiments," in Research Design: Donald Campbell's Legacy, ed. L. Bickman, Thousand Oaks, CA: Sage, pp. 117–154.

Comment

Alan B. KRUEGER and Pei ZHU

The New York City school voucher experiment provides some of the strongest evidence yet available on the effect of private school vouchers on student test achievement. Barnard et al. provide a useful reevaluation of the experiment, and some of the authors were integrally involved in the design of the experiment. We will leave it to other discussants to comment on the Bayesian methodological advances in their paper, and instead comment on the substantive lessons that can be learned from the experiment and the practical lessons that emerge from the novel design of the experiment.

We have the advantage of having access to more complete data than Barnard et al. used in preparing their paper. This includes more complete baseline test information, data on multichild families as well as single-child families, and three years of follow-up test data instead of just one year of follow-up data.

Alan B. Krueger is Professor and Pei Zhu is a graduate student, Economics Department, Princeton University, Princeton, NJ 08544 (E-mail: akrueger@princeton.edu). In part, this comment extends and summarizes results of Krueger and Zhu (2003). Readers are invited to read that article for a more in-depth analysis of many of the issues raised in this comment. Some of the results presented here differ slightly from those in our earlier article, however, because the definition of family size in this comment corresponds to the one used by Barnard et al. rather than to the definition in our earlier article.

In our comment, we use the more comprehensive sample because results for this sample are more informative, but we highlight where differences arise from using the sample analyzed by Barnard et al.

Three themes emerge from our analysis. First, simplicity and transparency are underappreciated virtues in statistical analysis. Second, it is desirable to use the most recent, comprehensive data for the widest sample possible. Third, there is no substitute for probing the definitions and concepts that underlie the data.

1. RANDOM ASSIGNMENT

As Barnard et al. explain, the voucher experiment entailed a complicated block design, and different random assignment procedures were used in the first and second set of blocks. In the first block, a propensity matched-pairs design (PMPD) method was used. In this block, far more potential control families were available than there was money to follow them up. Rather than select a random sample of controls for follow-up after randomly


Table 1. Efficiency Comparison of Two Methods of Random Assignment: Estimated Treatment Effect and Standard Error, by Method of Random Assignment

                                 Propensity score            Randomized blocks           Relative      Sample-size-adjusted
                                 match subsample             subsample                   sampling      relative sampling
Year    Model                    Coefficient    No. obs.     Coefficient    No. obs.     variance      variance
                                                                                         (RB/PMPD)     (RB/PMPD)

A. All students
Year 1  Control for baseline     .33  (1.40)    721          2.10  (1.38)   734          .972          .989
        Omit baseline            .32  (1.40)    1,048        -1.02 (1.58)   1,032        1.274         1.254
Year 3  Control for baseline     1.02 (1.56)    613          .67   (1.74)   637          1.244         1.293
        Omit baseline            -.31 (1.57)    900          -.45  (1.78)   901          1.285         1.287

B. African-American students
Year 1  Control for baseline     .32  (1.88)    302          8.10  (1.71)   321          .827          .879
        Omit baseline            2.10 (1.99)    436          3.19  (2.07)   447          1.082         1.109
Year 3  Control for baseline     4.21 (2.51)    247          6.57  (2.09)   272          .693          .764
        Omit baseline            2.26 (2.47)    347          3.05  (2.31)   386          .875          .973

NOTE: Standard errors are in parentheses. The treatment effect coefficient is from a regression of test scores on a dummy indicating assignment to receive a voucher (1 = yes), lottery randomization strata dummies, and, in the indicated models, baseline test scores. Bootstrap standard errors account for within-family correlation in residuals. Year 1 or 3 refers to the follow-up year. The last column adjusts the relative sampling variances of the treatment effects for differences in relative sample sizes. The sample of African-American students consists of those whose mothers' race/ethnicity is identified as non-Hispanic Black/African-American.

selecting the treatments, the researchers estimated a propensity score model to align controls with the treatments and then used a nearest-available-neighbor Mahalanobis match to select specific paired control families for follow-up. In principle, the PMPD design should improve the precision of the estimates by reducing the chance imbalances that can occur in a simple randomized block design. A standard randomized block design was used in the remaining four blocks.
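To make the two-stage procedure concrete, here is a minimal Python sketch of a propensity-matched-pairs selection of this general kind. It is emphatically not MPR's proprietary PMPD model (which, as noted below, was never described in enough detail to replicate); the logistic specification, the 0.05 caliper, and the greedy nearest-available ordering are all our own illustrative assumptions, and the sketch presumes more controls than treatments.

import numpy as np
from sklearn.linear_model import LogisticRegression

def pmpd_pairs(X, z, caliper=0.05):
    """X: (n, p) baseline covariates; z: (n,) with 1 = treated, 0 = control.
    Returns (treated_index, control_index) pairs; each control is used once."""
    X = np.asarray(X, float)
    z = np.asarray(z)
    # Stage 1: estimated propensity score to align controls with treatments.
    pscore = LogisticRegression(max_iter=1000).fit(X, z).predict_proba(X)[:, 1]
    S_inv = np.linalg.pinv(np.cov(X, rowvar=False))       # Mahalanobis metric
    available = list(np.flatnonzero(z == 0))
    pairs = []
    for t in sorted(np.flatnonzero(z == 1), key=lambda i: -pscore[i]):
        # Stage 2: among controls with a close propensity score (if any),
        # take the nearest available neighbor in Mahalanobis distance.
        close = [c for c in available if abs(pscore[c] - pscore[t]) < caliper]
        cands = close or available
        dists = [(X[c] - X[t]) @ S_inv @ (X[c] - X[t]) for c in cands]
        best = cands[int(np.argmin(dists))]
        pairs.append((int(t), int(best)))
        available.remove(best)
    return pairs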

Barnard et al. emphasize that the PMPD block was more balanced for 15 of the 21 baseline variables that they examined. But the key question is the extent to which the precision of the estimates was improved, not the covariate balance, because both blocks yield unbiased estimates. If the covariates explain relatively little of the outcome variable, then the PMPD will add very little. In Table 1 we provide standard intent-to-treat (ITT) estimates, that is, the difference in mean test scores between treatments and controls, conditional on the strata used for random assignment (family size by block by high/low-achieving school), and, more importantly, their standard errors, separately for the PMPD subsample and the randomized block subsample. The outcome variable is the average of the national percentile rank on the math and reading segments of the Iowa Test of Basic Skills, taken either 1 year or 3 years after random assignment. Two sets of results are presented: one set controlling for baseline test scores, for the subsample with baseline scores, and the other without controlling for scores, for the larger sample. The last column adjusts the relative sampling variances for differences in sample size between the blocks.
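For readers who want to reproduce the logic of Table 1, the following sketch shows the two ingredients: the regression behind each coefficient and the sample-size adjustment in the last column. The DataFrame and column names are our own assumptions, the function reports plain OLS standard errors where the table uses a family-clustered bootstrap, and the numeric example is the year 1, omit-baseline row.

import statsmodels.formula.api as smf

def itt_estimate(df):
    """df is assumed to hold 'score', 'voucher' (0/1 offer), 'stratum', and
    'baseline' columns; returns the ITT coefficient on the offer and its SE."""
    fit = smf.ols("score ~ voucher + C(stratum) + baseline", data=df).fit()
    return fit.params["voucher"], fit.bse["voucher"]

# Last column of Table 1: relative sampling variance of the two designs,
# rescaled to a per-observation footing.
se_pmpd, n_pmpd = 1.40, 1048     # PMPD subsample, year 1, omit baseline
se_rb, n_rb = 1.58, 1032         # randomized-blocks subsample, same row
rel_var = (se_rb / se_pmpd) ** 2         # 1.274, as printed
adjusted = rel_var * (n_rb / n_pmpd)     # about 1.254, matching the table

On this reading, values above 1 in the last column mean that the PMPD subsample delivers the smaller per-observation sampling variance.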

For all students (panel A), the standard errors are about 10% smaller in the PMPD sample. For the Black students (panel B), however, the standard errors are essentially the same or slightly larger in the PMPD sample. This finding is unexpected, because Hill, Rubin, and Thomas (2000) predicted that analyses of subgroups would have more power in the PMPD design, because

matching should lead to a more nearly equal representation of subgroups in the treatment and control groups. In addition, the covariate balance is less good for the Black students in the PMPD than in the randomized blocks if we compute Z scores for the differences between treatments and controls for the baseline covariates. One possibility is that because Black students could have been matched to non-Black students in the PMPD design, this subsample was actually less balanced than the subsample of Blacks from the more conventional randomized block design.
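The discussants do not give their formula, but a conventional way to compute such a balance Z score, sketched here under that assumption, is the two-sample difference in covariate means divided by its estimated standard error:

import numpy as np

def balance_z(x_treat, x_ctrl):
    """Z score for treatment-control imbalance in one baseline covariate."""
    x_t, x_c = np.asarray(x_treat, float), np.asarray(x_ctrl, float)
    se = np.sqrt(x_t.var(ddof=1) / x_t.size + x_c.var(ddof=1) / x_c.size)
    return (x_t.mean() - x_c.mean()) / se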

The increase in power from the PMPD design is relatively modest, even in the full sample. Of much more practical importance is the fact that the complexity of the PMPD design caused Mathematica to initially calculate incorrect baseline sample weights. The baseline weights were designed to make the follow-up sample representative of all eligible applicants for vouchers. The initial weights assigned much too little importance to the control sample in the PMPD block. Because these families represented many unselected controls, they should have been weighted heavily. The revised weights increased the weight on the PMPD controls by 620%. In contrast, the weight increased by just 18% for the rest of the sample.

This error had grave consequences for the initial inferences and analyses of the data. First, using the faulty weights, Peterson, Myers, and Howell (1998) reported highly statistically significant differences in baseline test scores between the treatment and control groups, with control group members scoring higher on both the math and reading exams. After the mistake was discovered and the weights were revised, however, the baseline differences became small and statistically insignificant. (We note, however, that if we limit attention to the students from single-child families that Barnard et al. use, and use the more complete baseline test score data, then there is


a statistically significant difference in baseline reading scores between treatments and controls, even with the revised baseline weights. If we pool together students from single-child and multichild families, then the baseline difference is insignificant. This is another reason why we think it is more appropriate to use the broader sample that includes children from all families.)

Second, because of the inaccurate inference that there was a difference in baseline ability between treatments and controls, Mathematica's researchers were discouraged from analyzing (or at least from presenting) results that did not condition on baseline scores. This is unfortunate, because conditioning on baseline scores caused the researchers to drop from the sample all the students who were initially in kindergarten (because these students were not given baseline tests) and 11% of students initially in grades 1–4. Including students with missing baseline scores increases the sample by more than 40% and expands the population to which the results can be generalized. As explained later, and in an article by Krueger and Zhu (2003), qualitatively different results are found if students with missing baseline scores are included in the sample. Because assignment to receive a voucher was random within lottery strata, estimates that do not condition on baseline scores are nonetheless unbiased. Moreover, it is inefficient to exclude students with missing baseline scores (most of whom had follow-up test scores), and such a sample restriction can potentially cause sample selection bias.

Of lesser importance is the fact that the PMPD design also complicates the calculation of standard errors. The matching of treatments and controls on selected covariates creates a dependence between paired observations. Moreover, if the propensity score model is misspecified, then the equation error also creates a dependence between observations. Researchers have not taken this dependence into account in the calculation of standard errors. We suspect, however, that this is likely to cause only a small understatement of the standard errors, because the covariates do not account for much of the residual variance of test scores and because the PMPD sample is only about half of the overall sample. (We tried to compute the propensity score to adjust the standard errors ourselves, but were unable to replicate the PMPD model, because it was not described in sufficient detail and because the computer programs are proprietary. This also prevented us from replicating estimates in Table 4 of Barnard et al.)

The problems created by the complicated experimental design, particularly concerning the weights, lead us to reiterate the advice of Cochran and Cox (1957): "A good working rule is to use the simplest experimental design that meets the needs of the occasion." It seems to us that in this case the PMPD design introduced unnecessary complexity that inadvertently led to a consequential mistake in the computation of weights, with very little improvement in efficiency. This was not the fault of the architects of the PMPD design, who did not compute the weights, but it was related to the complexity of the design. Simplicity and transparency, in designs, methods, and procedures, can help avoid mistakes down the road, which are almost inevitable in large-scale empirical studies. Indeed, Mathematica recently informed us that they still do not have the baseline weights exactly correct, 5 years after random assignment.

2. INTENT-TO-TREAT ESTIMATION RESULTS

Barnard et al. devote much attention to addressing potential problems caused by missing data for the sample of students in grades 1–4 at baseline; this follows the practice of earlier reports by Mathematica, which restricted the analysis of student outcomes (but not parental responses, such as satisfaction) to those enrolled in grades 1–4 at baseline. In our opinion, a much more important substantive problem results from the exclusion of students in the kindergarten cohort, who were categorically dropped because they were not given baseline tests. Because assignment to treatment status was random (within strata), a simple comparison of means between treatments and controls, without conditioning on baseline scores, provides an unbiased estimate of the average treatment effect. Moreover, as Barnard et al. and others show, treatments and controls were well balanced in terms of baseline characteristics, so there is no reason to suspect that random assignment was somehow corrupted.

Table 2 presents regression estimates of the ITT effect using various samples. Because we cannot replicate the Bayesian estimates without knowing the propensity score, we present conventional ITT estimates. For each entry in the first column, we regressed the test score percentile rank on a voucher offer dummy, 30 dummies indicating lottery strata (block × family size × high/low school), and baseline test scores. The sample is limited to those with baseline test data. Our results differ from the ITT estimates in Table 5 of Barnard et al. because we use the revised weights and a more comprehensive sample that also includes students from multichild families. Barnard et al. report that their Bayesian estimates are "generally more stable" than the conventional ITT estimates, "which in some cases are not even credible," but the extreme ITT estimates that they refer to, for 4th graders from high-achieving schools, are based on only 15 students. For the samples that pool students across grades, the results of the conventional ITT estimates are a priori credible and probably not statistically different from their Bayesian estimates.

In the second column, we use the same sample as in column 1 but omit the baseline test score from the model. In the third column, we expand the sample to include those with missing baseline scores. In the fourth column, we continue to include students with missing baseline scores but restrict the sample to those initially in low-achieving public schools.

We focus mainly on the results for Black students and students from low-achieving public schools, because the effect of offering a voucher on either reading or math scores for the overall sample is always statistically insignificant and because most public policy attention has focused on these two groups.

Although it has not been made explicit in previous studies of these data, Black students have been defined as children with a non-Hispanic Black/African-American mother. We use this definition and a broader one to probe the sensitivity of the results.

Omitting baseline scores has little qualitative effect on the estimates when the same sample is used (compare columns 1 and 2); the coefficient typically changes by no more than half a standard error. For year 3 scores for Black students, for example, the treatment effect on the composite score is 5.2 points (t = 3.2) when controlling for baseline scores and 5.0 points (t = 2.6) when not controlling. This stability is expected in a


Table 2. Estimated Treatment Effects With and Without Controlling for Baseline Scores: Coefficient on Voucher Dummy From Test Score Regression for Various Samples

                                               Subsample with        Subsample with        Full sample,         Applicant school
                                               baseline scores,      baseline scores,      omits baseline       low, omits
Test        Sample                             controls for          omits baseline        scores               baseline scores
                                               baseline scores       scores

First follow-up test
Composite   Overall                            1.27  (.96)           .42  (1.28)           -.33  (1.06)         -.60  (1.06)
            Mother Black, non-Hispanic         4.31  (1.28)          3.64 (1.73)           2.66  (1.42)         1.75  (1.57)
            Either parent Black, non-Hispanic  3.55  (1.24)          2.52 (1.73)           1.38  (1.42)         .53   (1.55)
Reading     Overall                            1.11  (1.04)          .03  (1.36)           -1.13 (1.17)         -1.17 (1.18)
            Mother Black, non-Hispanic         3.42  (1.59)          2.60 (1.98)           1.38  (1.74)         .93   (1.84)
            Either parent Black, non-Hispanic  2.68  (1.50)          1.49 (1.97)           -.09  (1.71)         -.53  (1.81)
Math        Overall                            1.44  (1.17)          .80  (1.45)           .47   (1.17)         -.03  (1.15)
            Mother Black, non-Hispanic         5.28  (1.52)          4.81 (1.93)           4.01  (1.54)         2.68  (1.63)
            Either parent Black, non-Hispanic  4.42  (1.48)          3.54 (1.90)           2.86  (1.51)         1.59  (1.63)

Third follow-up test
Composite   Overall                            .90   (1.14)          .36  (1.40)           -.38  (1.19)         .10   (1.16)
            Mother Black, non-Hispanic         5.24  (1.63)          5.03 (1.92)           2.65  (1.68)         3.09  (1.67)
            Either parent Black, non-Hispanic  4.70  (1.60)          4.27 (1.92)           1.87  (1.68)         1.90  (1.67)
Reading     Overall                            .25   (1.26)          -.42 (1.51)           -1.00 (1.25)         -.08  (1.24)
            Mother Black, non-Hispanic         3.64  (1.87)          3.22 (2.19)           1.25  (1.83)         2.06  (1.83)
            Either parent Black, non-Hispanic  3.37  (1.86)          2.75 (2.18)           .76   (1.83)         1.23  (1.82)
Math        Overall                            1.54  (1.30)          1.15 (1.53)           .24   (1.33)         .29   (1.28)
            Mother Black, non-Hispanic         6.84  (1.86)          6.84 (2.09)           4.05  (1.86)         4.13  (1.85)
            Either parent Black, non-Hispanic  6.04  (1.83)          5.78 (2.09)           2.97  (1.85)         2.58  (1.82)

NOTE: Standard errors are in parentheses. The dependent variable is the test score national percentile rank (NPR). The reported coefficient is the coefficient on the voucher offer dummy. All regressions control for 30 randomization strata dummies; models in the first column also control for baseline math and reading test scores. Bootstrap standard errors are robust to correlation in residuals among students in the same family. In the original table, bold font indicates that the absolute t ratio exceeds 1.96.

Sample sizes in year 1 for the subsample with baseline scores/without are: overall, 1,455/2,080; mother Black, 623/883; either parent Black, 684/968.

Sample sizes in year 3 for the subsample with baseline scores/without are: overall, 1,250/1,801; mother Black, 519/733; either parent Black, 572/807.

Sample sizes in year 1 for the subsample from low schools without baseline scores are: overall, 1,802; mother Black, 770; either parent Black, 843.

Sample sizes in year 3 for the subsample from low schools without baseline scores are: overall, 1,569; mother Black, 647; either parent Black, 712.

randomized experiment. When the sample is expanded to include those with missing baseline scores in column 3, however, the ITT effect falls almost in half in year 3 for Black students, to 2.65 points, with a t ratio of 1.58. The effect on the math score is statistically significant, as Barnard et al. emphasize. Qualitatively similar results are found if baseline covariates, such as mother's education and family income, are included as regressors (see Krueger and Zhu 2003).

The classification of Black students used in the study is somewhat idiosyncratic. Race and ethnicity were collected in a single question in the parental survey, contrary to the OMB guidelines for government surveys, so it is impossible to identify Blacks of Hispanic origin (of whom there are a large number in New York). In addition, the survey did not directly ask about the child's race or ethnicity, so children's race must be

inferred from parents' race/ethnicity. A broader definition of Black students, and one that probably comes closer to the definition used in government surveys and most other educational research, would also include those who have a Black father in the sample of Black students. According to the 1990 Census, race was reported as Black for 85% of the children with a Black father and a Hispanic mother (the overwhelming non-Black portion of the sample) in the New York metropolitan region.

Hence we also present results for a sample in which either parent is listed as "Black, non-Hispanic." This increases the sample of Black students by about 10%, and the results are even weaker for this sample. For example, the ITT estimate using the more comprehensive sample of Black students enrolled in grades K–4 at baseline is 1.87 points (t = 1.11) on the third-year follow-up composite test, although even with this sample


the effect on math is significant at the 10% level in year 1, as Barnard et al. have found, and almost significant in year 3. So we conclude that the effect of the opportunity to use a private school voucher on the composite score for the most comprehensive sample of Black students is insignificantly different from 0, although it is possible that there was initially a small beneficial effect on the math score for Black students.

In the final column, we present results for the subsample of students originally enrolled in schools with average test scores below the median score in New York City. In each case, the results for this subsample are quite similar to those for the full sample, and a formal test of the null hypothesis that students from low- and high-achieving schools benefit equally from vouchers is never close to rejecting. It also appears very unlikely that the differences between the treatment effects for applicants from low- and high-achieving schools in Barnard et al.'s Table 4 are statistically significant either.

3. WHAT WAS BROKEN

We agree with Barnard et al. that the experiment was broken in the sense that attrition and missing data were common. Previous analyses were also strained, if not broken, for their neglect of the cohort of students originally in kindergarten, who were in third grade at the end of the experiment and whose follow-up test scores were ignored. Including these students qualitatively alters the results for Black students. The experiment was also broken in the sense that years passed before correct baseline weights were computed.

We disagree, however, with the interpretation that the experiment was broken because compliance was less than 100%. This

depends on the question that one is interested in answering. Because most interest in the experiment for policy purposes centers on the impact of offering vouchers on achievement, not on compelling students to use vouchers, we think the ITT estimates, which reflect inevitable partial usage of vouchers, are most relevant (see also Rouse 1998; Angrist et al. 2003). Moreover, New York had a higher voucher take-up rate than the experiments in Dayton and the District of Columbia, so one could argue that the New York experiment provides an upper bound estimate of the effect of offering vouchers. On the other hand, if the goal is to use the experiment to estimate the effect of attending private school on achievement, then such methods as instrumental variables or those used by Barnard et al. are necessary to estimate the parameter of interest. We consider this of secondary interest in this case, however.
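For concreteness, the simplest instrumental-variables version of the "effect of attending" estimate alluded to in the paragraph above is the Wald ratio: the ITT effect on scores divided by the offer's effect on attendance. The sketch below is a bare-bones illustration (no strata, weights, or missing-data handling, all of which matter in this application), with variable names of our own choosing.

import numpy as np

def wald_iv(y, d, z):
    """y: outcome; d: 1 = attended private school; z: 1 = offered a voucher.
    Returns the IV (Wald) estimate of the effect of attendance."""
    y, d, z = (np.asarray(a, float) for a in (y, d, z))
    itt_outcome = y[z == 1].mean() - y[z == 0].mean()   # effect of the offer
    take_up = d[z == 1].mean() - d[z == 0].mean()       # offer's effect on attendance
    return itt_outcome / take_up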

ADDITIONAL REFERENCES

Angrist, J., Bettinger, E., Bloom, E., King, E., and Kremer, M. (2003), "Vouchers for Private Schooling in Colombia: Evidence From a Randomized Natural Experiment," American Economic Review, 92, 1535–1558.

Cochran, W., and Cox, G. (1957), Experimental Designs (2nd ed.), New York: Wiley.

Krueger, A., and Zhu, P. (2003), "Another Look at the New York City School Voucher Experiment," American Behavioral Scientist, forthcoming. Also available from Industrial Relations Section, Princeton University, at http://www.irs.princeton.edu.

Peterson, P., Myers, D., and Howell, W. (1998), "An Evaluation of the New York City Scholarships Program: The First Year," Mathematica Policy Research, Inc., available at http://www.ksg.harvard.edu/pepg/pdf/ny1rpt.pdf.

Rouse, C. (1998), "Private School Vouchers and Student Achievement: An Evaluation of the Milwaukee Parental Choice Program," The Quarterly Journal of Economics, 113, 553–602.

Comment
Richard A. BERK and Hongquan XU

1. INTRODUCTION

The article by Barnard, Frangakis, Hill, and Rubin (hereafter BFHR) is a virtuoso performance. By applying the Neyman–Rubin model of causal effects and building on a series of important works on the estimation and interpretation of causal effects (e.g., Angrist, Imbens, and Rubin 1996), BFHR manage to extract a reasonable set of findings from a "broken" randomized experiment. The article more generally underscores the important difference between estimating relationships in a consistent manner and interpreting those estimates in causal terms; obtaining consistent estimates is only part of the enterprise. Finally, the article emphasizes that assumptions made so that proper estimates may be obtained are not just technical moves of convenience. Rather, they are statements about how the empirical world is supposed to work. As such, they need to be examined

Richard A. Berk is Professor and Hongquan Xu is Assistant Professor, Department of Statistics, University of California, Los Angeles, CA 90095 (E-mail: [email protected]).

concretely, with reference to the phenomena being studied, evaluated empirically whenever possible, and, at a minimum, subjected to sensitivity tests.

As battle-tested academics, we can certainly quibble here and there about some aspects of the article. For example, might it not have made sense to construct priors from the educators and economists who were responsible for designing the intervention? Moreover, might it not have made sense to interpret the results using a yardstick of treatment effects that are large enough to matter in practical terms? Nonetheless, we doubt that, overall, we could have done as well as BFHR did. Moreover, most of our concerns depend on features of the data that cannot be known from a distance. We would have had to carefully examine the data ourselves. As a result, we focus on why a virtuoso performance was needed to begin with. Why was this trip necessary?


2. NONCOMPLIANCE

The implementation of randomized field experiments, as desirable as they are, invites any number of well-known problems (Berk 1990). One of these problems is that human subjects often do not do what they are told, even when they say they will. There is a recent scholarly literature on noncompliance (Metry and Meyer 1999; Johnson 2000) and a citation trail leading back to evaluations of the social programs of the War on Poverty (Haveman 1977). Some noncompliance can be expected except when there is unusual control over the study subjects. Our recent randomized trial of the inmate classification system used by the California Department of Corrections is one example (Berk, Ladd, Graziano, and Baek 2003).

An obvious question follows: Why does it seem that the individuals who designed this experiment were caught by surprise? As best we can tell, little was done to reduce noncompliance, and, more important for the analyses of BFHR, apparently almost no data were collected on factors that might be explicitly related to noncompliance (Hill, Rubin, and Thomas 2000).

Without knowing far more about the design stages of the experiment, we find it difficult to be very specific. But from the literatures on compliance and school choice, the following are probably illustrative indicators relevant for parents who would not be inclined to use a school voucher; for other forms of noncompliance, other measures would be needed:

1. Whether the adults in the household are employed outside the home during the day.

2. Whether there is any child care at the private schools before classes begin and after they end, for children of employed parents.

3. A household's ability to pay for private school beyond the value of the voucher.

4. Travel time to the nearest private schools.

5. Availability of transportation to those schools.

6. Religious preference of the household and religious affiliation of the private school.

From these examples, and others that individuals closer to the study would know about, one can imagine a variety of measures to reduce noncompliance. For instance, key problems for any working parent are child care before and after school and round-trip school transportation. These might be addressed directly with supplemental support. If no such resources could be obtained, then the study design might be altered to screen out, before random assignment, households for whom compliance was likely to be problematic.

3. MISSING DATA

No doubt BFHR know more about the subtleties of the missing data than we do. Nevertheless, the same issues arise for missing data that arise for noncompliance. Surely problems with missing data could have been anticipated. Factors related to nonresponse to particular items could have been explicitly measured, and prevention strategies perhaps could have been deployed.

4. DESIGN EFFICIENCY

A carefully designed experiment not only provides a solid basis for statistical inference, but also gives credible evidence for causation. The New York School Choice Scholarships Program (NYSCSP) was apparently the first good example of a randomized experiment to examine the potential benefits of vouchers for private schools. Offering a scholarship (treatment) randomly through a lottery can (1) provide a socially acceptable way to ration scarce resources, (2) protect against undesirable selection bias, and (3) produce balance between the observed and unobserved variables that may affect the response. In addition to the randomization and blocking, NYSCSP implemented a PMPD (propensity matched-pairs design) to choose a control group from a large candidate pool and to balance many covariates explicitly. Table 2 of BFHR shows that various background variables were balanced properly between the treatment group and the control group.

But might it have been possible to do a bit better? BFHR did not mention that the sample size of the applicant's school was not balanced, although winners from high and low schools were balanced between the treatments and the controls. An ideal design would have had equal winners from "bad" (or low) and "good" (or high) public schools. In the NYSCSP, 85% of the winners were from bad schools and 15% were from good schools, to achieve the social goals of the School Choice Scholarships Foundation (Hill et al. 2000, p. 158). Bad schools are defined as "those for which the average test scores were below the median test scores for the city" (Hill et al. 2000, p. 158). A direct consequence is that the estimates from the good schools have much larger variation than those from the bad schools, which was evident in BFHR's Tables 4–6: the confidence intervals for good schools are wider.

Instead of classifying schools into two classes (bad and good), it might have been helpful to classify them into four classes: very bad, bad, good, and very good. To estimate the effect of applicant's school (assuming linear effects), an optimal design would select half of the winners from the very bad schools, the other half from the very good schools, and none from the two middle classes. Avoiding the middle schools enlarges the distance between bad and good schools and also reduces the source variation in the bad and good schools, and thus reduces the variation in the estimates. To be more consistent with the goals of the School Choice Scholarships Foundation, an alternative design might take 75% of the winners from very bad schools, 10% from bad schools, 5% from good schools, and 10% from very good schools. This design would have a higher design efficiency than the original design, especially because the applicant's school was the most important design variable after family size (Hill et al. 2000, p. 160). Of course, people on the ground at the time may have considered these options and rejected them. Our point is that with better planning, it might have been possible to squeeze more efficiency out of the study design. Such an outcome would have made the job undertaken by BFHR a bit easier.
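A quick way to see the efficiency claim is to note that, for a linear school effect with homoscedastic errors, the variance of the estimated slope is proportional to 1 / sum((x - xbar)^2). The toy calculation below, under our own coding of the four classes as equally spaced scores, compares the extremes-only allocation, the policy-constrained 75/10/5/10 allocation, and a uniform allocation:

import numpy as np

def slope_var_factor(shares, levels, n=1000):
    """Var(slope) up to sigma^2 for n winners allocated across school levels."""
    counts = (np.asarray(shares) * n).astype(int)
    x = np.repeat(np.asarray(levels, float), counts)
    return 1.0 / np.sum((x - x.mean()) ** 2)

levels = [0, 1, 2, 3]   # very bad, bad, good, very good (assumed equal spacing)
print(slope_var_factor([0.50, 0.00, 0.00, 0.50], levels))   # extremes only: ~4.4e-4
print(slope_var_factor([0.75, 0.10, 0.05, 0.10], levels))   # constrained design: ~1.1e-3
print(slope_var_factor([0.25, 0.25, 0.25, 0.25], levels))   # uniform: ~8.0e-4

Pushing mass toward the extremes of the scale shrinks the variance factor, which is the point of the argument; the constrained design trades some of that precision for the Foundation's social goals.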

5. CONCLUSIONS

Scientific problems often provide a fertile ground for important statistical developments. BFHR's article, and the literature on which it depends, are excellent illustrations. However, from


the perspective of public policy, the goal of program evaluation is to obtain usefully precise estimates of program impacts while using the most robust means of analysis available. Simplicity is also important, both for data analysts, who must do the work, and for policy makers, who must interpret the results. It follows that large investments at the front end of an evaluation (in research design, measurement, and study integrity) are essential. Virtuoso statistical analyses can certainly help when these investments are insufficient or when they fail, but they must not be seen as an inexpensive alternative to doing the study well to begin with.

ADDITIONAL REFERENCES

Angrist, J., Imbens, G. W., and Rubin, D. B. (1996), "Identification of Causal Effects Using Instrumental Variables" (with discussion), Journal of the American Statistical Association, 91, 444–472.

Berk, R. A. (1990), "What Your Mother Never Told You About Randomized Field Experiments," in Community-Based Care of People With AIDS: Developing a Research Agenda, AHCPR Conference Proceedings, Washington, DC: U.S. Department of Health and Human Services.

Berk, R. A., Ladd, H., Graziano, H., and Baek, J. (2003), "A Randomized Experiment Testing Inmate Classification Systems," Journal of Criminology and Public Policy, 2, 215–242.

Haveman, R. H. (ed.) (1977), A Decade of Federal Antipoverty Programs: Achievements, Failures, and Lessons, New York: Academic Press.

Hill, J. L., Rubin, D. B., and Thomas, N. (2000), "The Design of the New York School Choice Scholarships Program Evaluation," in Research Design: Donald Campbell's Legacy (Vol. II), ed. L. Bickman, Thousand Oaks, CA: Sage.

Johnson, S. B. (2000), "Compliance Behavior in Clinical Trials: Error or Opportunity?," in Promoting Adherence to Medical Treatment in Chronic Childhood Illness: Concepts, Methods, and Interventions, ed. D. Drotar, Mahwah, NJ: Lawrence Erlbaum.

Metry, J., and Meyer, U. A. (eds.) (1999), Drug Regimen Compliance: Issues in Clinical Trials and Patient Management, New York: Wiley.

Rejoinder
John BARNARD, Constantine E. FRANGAKIS, Jennifer L. HILL, and Donald B. RUBIN

We thank the editorial board and the thoughtful group of discussants that they arranged to comment on our article. The discussants have raised many salient issues and offered useful extensions of our work. We use this rejoinder primarily to address a few points of contention.

Muthén, Jo, and Brown

Muthén, Jo, and Brown (henceforth MJB) provide an enlightening discussion of an alternate and potentially complementary approach to treatment effect estimation. The growth models that they discuss are for data more extensive than what we analyzed, and so are beyond the scope of our article. Nevertheless, MJB propose an interesting idea for looking for interactions in longitudinal experimental data, and we look forward to future applications of the method to appropriate datasets.

Although models that incorporate both our compliance principal strata as well as latent growth trajectories are theoretically possible (assuming multiple years of outcome data) and could potentially yield an extremely rich description of the types of treatment effects manifested, the more time points that exist, the more complicated the compliance patterns can become (at least without further assumptions), and it may be difficult to find a dataset rich enough to support all of the interactions. Given a choice, we feel that it is probably more beneficial to handle real observed issues than to search for interactions with latent trajectories, but certainly it would always be nice to be able to do both. Of course, such a search for interactions in trajectories is more beneficial when there is enough prior knowledge about the scientific background to include additional structural assumptions that can support robust estimation.

MJB make an interesting point about the exploration of potential violations of exclusion for never-takers. Such analyses have been done in past work by some of the authors (Hirano et al. 2000; Frangakis, Rubin, and Zhou 2002), but this was not pursued for our study, because the robustness of such methods is still an issue in examples as complex as this one.

Regarding the trade-offs between the joint latent ignorability and compound exclusion assumptions versus the joint assumptions of standard ignorability and the standard exclusion restriction: this topic is up for debate. We would like to point out, however, that these standard structural assumptions come with their own level of arbitrariness. These issues have been explored by Jo (2002) and Mealli and Rubin (2002, 2003).

We thank MJB for providing a thought-provoking discussion of our model and assumptions, as well as ample fodder for additional directions that can be explored in this arena. Clearly, this general category of work has stimulated much new and exciting modeling.

Krueger and Zhu

Krueger and Zhu (henceforth KZ) focus primarily on the underlying study and on analyses that either we did not do or that used data we did not have. Clearly, they are interested in the substantive issues, which is admirable, but we cannot be responsible for analyses that we neither did nor were consulted on, and we are not prepared to comment on what analyses we might have undertaken with data that we did not have.

KZ state that "there is no substitute for probing the definitions and concepts that underlie the data." Although we are not entirely sure what this refers to in particular, if it means that good science is more important than fancy techniques, then we certainly agree.

Another comment by KZ is that "it is desirable to use the most recent, comprehensive data for the widest sample possible." We agree that it is desirable to have access to the most comprehensive data available, but it is not necessary to use all of these data in any given analysis. Here we used only first-year data, because these were the only data available to us at the


time that we were building the model. We used only single-child families, as explained in the article, because, regrettably, no compliance data were collected for individual children in the multichild families.

The exclusion of kindergarten children could be considered more of a judgment call regarding the quality of the pretest scores, not only for efficiency, but also for the reasonableness of our assumptions about noncompliance and missing data. Nothing in theory would prevent us from including these children and omitting these extra covariates.

These types of trade-offs (more data vs. more appropriate data) are common in statistical analyses, and the benefits are well known, for example, for avoiding model extrapolation or when some subsets of data do not mean the same thing as others. In fact, Krueger "ignored data" in his article on the minimum wage with David Card (Card and Krueger 1994) by limiting analyses to restaurants from a similar geographic region, in the hope that employees and employers across the two comparison states would be faced with similar economic conditions, rather than looking at all restaurants in these states. This is a form of implicit matching, which we believe represented a good choice. Surely there are times when focusing on subsets of data produces more reliable results.

Furthermore, we would like to point out that KZ's assertion that "because assignment to treatment status was random (within strata), a simple comparison of means between treatments and controls, without conditioning on baseline scores, provides an unbiased estimate of the average treatment effect" is simply false, because there are missing outcomes. Even if there is perfect compliance, a simple comparison of means across treatment groups when there are missing outcomes will not lead, in general, to an unbiased estimate of the treatment effect. Assumptions need to be made about the missing-data process.
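A tiny simulation makes this point vivid. All numbers here are invented for illustration: the offer is randomized and compliance is perfect, but follow-up scores are more likely to be observed for higher-scoring controls, and the complete-case difference in means then misses the true ITT effect badly.

import numpy as np

rng = np.random.default_rng(0)
n = 200_000
z = rng.integers(0, 2, n)                          # randomized voucher offer
y = 50 + 2.0 * z + rng.normal(0, 19, n)            # true ITT effect = 2.0 points

# Treatments respond at a flat 85%; controls respond more when y is high.
p_respond = np.where(z == 1, 0.85,
                     0.55 + 0.004 * np.clip(y - 50, -40, 40))
observed = rng.random(n) < p_respond

naive = y[observed & (z == 1)].mean() - y[observed & (z == 0)].mean()
print(naive)   # far from 2.0; with these settings it even comes out negative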

When noncompliance exists in addition to the missing outcome data, Frangakis and Rubin (1999) have shown that estimation even of the ITT effect requires additional assumptions. Moreover, the ITT estimate, generally considered a conservative estimate of the treatment effect, can actually be anticonservative in this scenario. Finally, contrary to what KZ seem to imply in their final paragraph, instrumental variables methods alone are not sufficient to address the problems of both noncompliance and missing data.

KZ advocate "simplicity and transparency." We wholeheartedly agree, but disagree with the examples they use to defend their stance. Their first argument is that the PMPD did not yield strong efficiency gains and that we did not account for the PMPD in the calculation of standard errors. It may well be that precision gains were relatively modest; however, the analyses that they present are by no means conclusive in this regard. First, as with KZ's other ITT analyses, we have no clue as to how missing outcome data are handled. In addition, the PMPD was also used to balance compliance and missingness. It is unclear, however, how the gains in precision due to matching on the missingness and noncompliance are displayed, if at all, in KZ's results. Finally, KZ did not account for the experimental design when estimating their results. Failure to account properly for the design typically results in estimates that overstate the variances of treatment effect estimates. In this design, it is the

correlation between matched-pair units that leads the sampling variances to be smaller than their estimates.

Even if precision gains were truly modest, they of course were unknown to us at the time we debated the use of the PMPD design. KZ's argument is like the complaint of a cross-country motorist who experiences no car problems during his long drive and then bemoans the money wasted on a spare tire purchased before the trip began. We prefer to avoid potential problems by using careful design.

Importantly, KZ's criticism that we ignored this matching in the analysis is incorrect. The matching was done based on the estimated propensity score, so it follows from Rubin (1978) that assignment is ignorable conditionally given the covariates that were used to estimate it. Thus the Bayesian analysis that we conduct, which conditions on the estimated propensity score, is an approximately valid Bayesian analysis with respect to the matched assignment, assuming that our modeled relationship between the outcomes and the estimated propensity score captures the major features of the relationship to the covariates.

KZ's second argument is that use of the PMPD "led to grave consequences" because sampling weights were miscalculated by MPR. It seems odd to us to maintain, as KZ do, that the PMPD "caused" this error; nor is it clear that use of another design would have avoided the error. In any case, we have rerun our analyses, this time generalizing to the population of which the children in our analyses are representative (by setting constant weights in step 3 of Sec. 8.1.1). For this population, all results are very similar to our original results. These results are posted as Appendix C at http://biosun01.biostat.jhsph.edu/~cfrangak/papers/sc.

A third argument revolves around the protracted debate over the use of ITT versus treatment-targeted measures such as CACE. We addressed both ITT and CACE, which allows the reader to focus on one or the other or both, depending on which effect is expected to generalize to another environment: the ITT, or "sociological," effect of accepting or not accepting an offered voucher, or the CACE, or "structural" or "scientific," effect of using a voucher. One could argue that the effect of attending (CACE) is the more generalizable, whereas the sociological effect (ITT) is more easily influenced, for instance, by "hype" about how well the program works. That is, the ITT estimate will not reflect the "effect of offering" if, the next time the program is offered, the compliance rates change (which seems likely). We prefer to examine both, to provide a more comprehensive picture.

In sum, we agree that simplicity and transparency are important; this is the major motivation for our detailed description of our assumptions in terms of readily understandable concepts, rather than commonly invoked and less intuitive conditions involving correlated error terms. However, this desire for simplicity must be balanced with the prudence of guarding against potential problems through careful design.

Although we agree in theory with much of what KZ espouse, some of their criticisms seem misguided. Moreover, their presentation of analyses with incomplete exposition regarding missing-data complications makes these analyses difficult for us to evaluate properly.


Berk and Xu

Berk and Xu (henceforth BX) raise several interesting questions about the way in which the evaluation was designed and implemented. Many of these are actually addressed in a preceding and lengthier article by us on this topic (Barnard et al. 2002). That article describes, for instance, MPR's efforts to reduce missing data, without which the missingness rates would surely have been much higher. It also describes special challenges faced in this study. But in any study with human subjects, it is difficult to completely eliminate missing data, particularly if one desires more interesting information than can be obtained from administrative data. One of the challenges for the collection of outcome test scores was the fact that children needed to be tested outside of the public school setting, which allowed the use of a standardized test not used by the public school system, thereby greatly reducing the possibility of bias resulting from teachers "teaching to the test." Consequently, parents had to bring their children to testing areas on weekends, a formidable hurdle for some, even when monetary incentives were used.

An important consideration with this study was that the evaluators took the initiative to investigate a program that was sponsored by an organization (SCSF) whose primary goal was to provide scholarships, not to do research. We feel that MPR did an admirable job convincing the SCSF to allow the study to be evaluated and raising money for this purpose, but this also meant that the evaluators could not control every aspect of the program. For example, the SCSF originally wanted to make the program available only to children from the lower-test-score schools, and the evaluators had to convince them to also allow a small percentage from those in higher-test-score schools, to increase the generalizability of the results. The designs that BX suggest, even though they have certain desirable statistical properties under appropriate assumptions, were not an option with this study.

Related to this issue of the alignment of the evaluators' and program administrators' goals is the issue of the potential for reducing noncompliance rates. Some effort was made, in that help was provided to scholarship winners in finding an appropriate private school. Arguably, however, 100% compliance should not be the goal in such a study, given that we would not expect 100% compliance if the program were expanded and offered, say, by the government on a larger scale. Thus it may not be wise to use extreme efforts to increase compliance if such efforts would not be offered in the large-scale public case. With the current study, the opportunity exists to examine the noncompliance process for this program in this setting: take-up rates, predictors of noncompliance, and so on. To this end, some sort of subexperiment that randomizes a reasonable set of compliance incentives might be useful in similar studies in the future, if sample sizes are sufficient.

We are not convinced that we should have solicited actual prior distributions from educators and economists. In our experience, this is a confusing, difficult, and often unrewarding task. Moreover, school choice is such a politically charged and contentious issue that it seems wiser to try to use relatively diffuse and objective prior distributions.

We agree that results should be interpreted in terms of a yardstick of practical effects. We are unaware of the most appropriate

such yardstick to use in this context. One side could claim that even tiny average effects matter, not only because of the cumulative effects on parts of society, but also because of possible nonadditivity of such effects, which could imply huge effects for some rare individuals and minute effects for all others. Others could argue that seemingly quite large effects have no long-run practical impact on quality of life. The results that we present provide a basic empirical starting point for these debates between people whose expertise is in this area.

We thank BX for highlighting our emphasis on the role of the assumptions in our analyses. Our hope is that we have presented the assumptions in such a way as to enable readers to make their own judgment about whether or not they were likely to be satisfied, rather than only stating them technically, which would implicitly make those important decisions for the interested but nontechnical readers.

BX point out a dearth of certain types of covariates. Part of the problem here is that we presented only a small portion of the data collected, due to the fact that our model in its current form did not handle many covariates. However, there were indeed discussions between some members of our team and MPR staff, during the development of the questionnaire, about what types of variables could be included that would be predictive of missing data or compliance, as well as of outcomes. Some of the variables proposed were discarded due to the surveyors' experience with them as subject to problems such as high measurement error. Also, not every variable discussed was included, due to a desire for a survey that would not be overly burdensome for the study families.

We raise a final, small, semantic question. We wonder about the propriety of the label "Neyman–Rubin model" to describe a framework for causal inference that views observational studies and randomized experiments on the same continuum and, moreover, encourages Bayesian analyses of both. Neyman (1923) certainly seems to have been the first to use the formal notation of potential outcomes for randomization-based inference in randomized experiments, which was a truly major advance, but neither he nor anyone else, to the best of our knowledge, used this formal notation to define causal effects in observational studies until Rubin (1974, 1975), nor did he nor anyone else ever advocate Bayesian inference for the unobserved potential outcomes until Rubin (1975, 1978), which is the perspective that we have implemented here.

Overall, we agree with BX about the paramount importance of study design, and we tried to design this study carefully, certainly more so than most social science analyses of secondary datasets like the PSID or CPS. We thank BX for their thoughtful commentary; they raise some important points.

We thank all of the discussants for their stimulating comments, which we believe have added emphasis to the importance of both the substantive topic of school vouchers and the development of appropriate methods to evaluate this and other such broken randomized experiments.

ADDITIONAL REFERENCES

Card, D., and Krueger, A. (1994), "Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania," American Economic Review, 84, 772–793.


Jo, B. (2002), "Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications" (with discussion), Journal of Educational and Behavioral Statistics, 27, 385–420.

Mealli, F., and Rubin, D. B. (2002), Discussion of "Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications," by B. Jo, Journal of Educational and Behavioral Statistics, 27, 411–415.

(2003), "Assumptions Allowing the Estimation of Direct Causal Effects: Discussion of 'Healthy, Wealthy, and Wise? Tests for Direct Causal Paths Between Health and Socioeconomic Status,' by Adams et al.," Journal of Econometrics, 112, 79–87.

Rubin, D. B. (1975), "Bayesian Inference for Causality: The Importance of Randomization," in Proceedings of the Social Statistics Section, American Statistical Association, pp. 233–239.


Imbens, G. W., and Rubin, D. B. (1997), "Bayesian Inference for Causal Effects in Randomized Experiments With Noncompliance," The Annals of Statistics, 25, 305–327.

Levin, H. M. (1998), "Educational Vouchers: Effectiveness, Choice, and Costs," Journal of Policy Analysis and Management, 17, 373–392.

Little, R. J. A. (1993), "Pattern-Mixture Models for Multivariate Incomplete Data," Journal of the American Statistical Association, 88, 125–134.

(1996), "Pattern-Mixture Models for Multivariate Incomplete Data With Covariates," Biometrics, 52, 98–111.

Little, R. J. A., and Rubin, D. B. (1987), Statistical Analysis With Missing Data, New York: Wiley.

Meng, X. L. (1994), "Posterior Predictive p Values," The Annals of Statistics, 22, 1142–1160.

Neal, D. (1997), "The Effects of Catholic Secondary Schooling on Educational Achievement," Journal of Labor Economics, 15, 98–123.

Neyman, J. (1923), "On the Application of Probability Theory to Agricultural Experiments: Essay on Principles, Section 9," translated in Statistical Science, 5, 465–480.

Peterson, P. E., and Hassel, B. C. (eds.) (1998), Learning From School Choice, Washington, DC: Brookings Institution Press.

Peterson, P. E., Myers, D. E., Howell, W. G., and Mayer, D. P. (1999), "The Effects of School Choice in New York City," in Earning and Learning: How Schools Matter, eds. S. E. Mayer and P. E. Peterson, Washington, DC: Brookings Institution Press.

Robins, J. M., Greenland, S., and Hu, F.-C. (1999), "Estimation of the Causal Effect of a Time-Varying Exposure on the Marginal Mean of a Repeated Binary Outcome" (with discussion), Journal of the American Statistical Association, 94, 687–712.

Rosenbaum, P. R., and Rubin, D. B. (1983), "The Central Role of the Propensity Score in Observational Studies for Causal Effects," Biometrika, 70, 41–55.

Rubin, D. B. (1974), "Estimating Causal Effects of Treatments in Randomized and Non-Randomized Studies," Journal of Educational Psychology, 66, 688–701.

(1977), "Assignment to Treatment Groups on the Basis of a Covariate," Journal of Educational Statistics, 2, 1–26.

(1978a), "Bayesian Inference for Causal Effects: The Role of Randomization," The Annals of Statistics, 6, 34–58.

(1978b), "Multiple Imputations in Sample Surveys: A Phenomenological Bayesian Approach to Nonresponse," in Proceedings of the Survey Research Methods Section, American Statistical Association, pp. 20–28.

(1979), "Using Multivariate Matched Sampling and Regression Adjustment to Control Bias in Observational Studies," Journal of the American Statistical Association, 74, 318–328.

(1980), Comment on "Randomization Analysis of Experimental Data: The Fisher Randomization Test," Journal of the American Statistical Association, 75, 591–593.

(1984), "Bayesianly Justifiable and Relevant Frequency Calculations for the Applied Statistician," The Annals of Statistics, 12, 1151–1172.

(1990), "Comment: Neyman (1923) and Causal Inference in Experiments and Observational Studies," Statistical Science, 5, 472–480.

The Coronary Drug Project Research Group (1980), "Influence of Adherence to Treatment and Response of Cholesterol on Mortality in the Coronary Drug Project," New England Journal of Medicine, 303, 1038–1041.

Tinbergen, J. (1930), "Determination and Interpretation of Supply Curves: An Example," in The Foundations of Econometric Analysis, eds. D. Hendry and M. Morgan, Cambridge, U.K.: Cambridge University Press.

Wilms, D. J. (1985), "Catholic School Effect on Academic Achievement: New Evidence From the High School and Beyond Follow-Up Study," Sociology of Education, 58, 98–114.

Comment
Bengt MUTHÉN, Booil JO, and C. Hendricks BROWN

1. INTRODUCTION

The article by Barnard, Frangakis, Hill, and Rubin (BFHR) is timely in that the Department of Education is calling for more randomized studies in educational program evaluation. (See the discussion of the "No Child Left Behind" initiative in, e.g., Slavin 2002.) BFHR can serve as a valuable pedagogical example of a successful, sophisticated statistical analysis of a randomized study. Our commentary is intended to provide additional pedagogical value to benefit the planning and analysis of future studies, drawing on experiences and research within our research group. [The Prevention Science Methodology Group (PSMG; www.psmg.hsc.usf.edu), co-PIs Brown and Muthén, has collaborated over the last 15 years with support from the National Institute of Mental Health and the National Institute on Drug Abuse.]

BFHR provides an exemplary analysis of the data from an imperfect randomized trial that suffers from several complications simultaneously: noncompliance, missing data in outcomes, and missing data in covariates. We are very pleased to

Bengt Muthén is Professor and Booil Jo is a postdoc, Graduate School of Education and Information Studies, University of California, Los Angeles, Los Angeles, CA 90095 (E-mail: [email protected]). C. Hendricks Brown is Professor, Department of Epidemiology and Biostatistics, University of South Florida, Tampa, FL 33620. The research of the first author was supported by National Institute on Alcohol Abuse and Alcoholism grant K02 AA 00230. The research of the second and third authors was supported by National Institute on Drug Abuse and National Institute of Mental Health grant MH40859. The authors thank Chen-Pin Wang for research assistance, Joyce Chappell for graphical assistance, and the members of the Prevention Science Methodology Group and the Fall ED299A class for helpful comments.

see their application of cutting-edge Bayesian methods for dealing with these complexities. In addition, we believe the methodological issues and the results of the study have important implications for the design and analysis of randomized trials in education and for related policy decisions.

BFHR provides results of the New York City school choice experiment based on 1-year achievement outcomes. With the planned addition of yearly follow-up data, growth models can provide an enhanced examination of causal impact. We discuss how such growth modeling can be incorporated and provide a caution that applies to BFHR's use of only one posttest occasion. We also consider the sensitivity of the latent class ignorability assumption in combination with the assumption of compound exclusion.

2. LONGITUDINAL MODELING ISSUES

BFHR focuses on variation in treatment effect across compliance classes. This part of the commentary considers variation in treatment effect across a different type of class, based on the notion that the private school treatment effect might very well be quite different for children with different achievement development. (Also of interest is potential variation in treatment effects across schools, with respect to both the public school the child originated in and the private school the child was


moved to, but this multilevel aspect of the data is left aside here for lack of space.) To study such a "treatment–baseline interaction" (or "treatment–trajectory interaction"), we will switch from BFHR's pretest–posttest analysis framework (essentially a very advanced ANCOVA-type analysis) to the growth mixture modeling framework of Muthén et al. (2002). An underlying rationale for this modeling is that individuals at different initial status levels, and on different trajectories, may benefit differently from a given treatment. ANCOVA controls for initial status as measured by the observed pretest score. Unlike the observed pretest score, the latent variable of initial status is free of time-specific variation and measurement error.

The focus on longitudinal aspects of the New York School Choice Study (NYSCS) is both substantively and statistically motivated. First, treatment effects may not have gained full strength after only a 1-year stay in a private school. The NYSCS currently has data from three follow-ups, that is, providing repeated-measures data from four grades. Although BFHR used percentile scores that do not lend themselves to growth modeling, a conversion to "scale scores" (i.e., IRT-based equated scores) should be possible, enabling growth modeling. Unfortunately, educational research traditionally uses scales that are unsuitable for growth modeling, such as percentile scores, normal curve equivalents, and grade equivalents (for a comparison in a growth context, see Seltzer, Frank, and Bryk 1994). Hopefully, this tradition can be changed. Second, the use of information from more time points than pretest and posttest makes it possible to identify and estimate models that give a richer description of the normative development in the control group and of how the treatment changes this development.

Consider three types of latent variables for individual i. The first type, Ci, refers to BFHR's compliance principal strata. The next two relate to the achievement development as expressed by a growth mixture model: Ti refers to trajectory class, and ηi refers to random effects within trajectory class (the within-class model is a regular mixed-effects model). Unlike the latent class variable Ci, the latent class variable Ti is a fully unobserved variable, as is common in latent variable modeling (see, e.g., Muthén 2002a). Consider the likelihood expression for individual i, using the [·] notation to denote probabilities/densities:

[Ci, Ti | Xi] [ηi | Ci, Ti, Xi] [Yi | ηi, Ci, Ti, Xi] × [Ui | ηi, Ci, Ti, Xi] [Ri | Yi, ηi, Ci, Ti, Xi],   (1)

where Xi denotes covariates; Ui denotes a compliance stratum indicator (with Ci perfectly measured by Ui in the treatment group for never-takers and perfectly measured in the control group for always-takers, other group-class combinations having missing data); and Ri denotes indicators for missing data on the repeated-measures outcomes Yi (pretreatment and posttreatment achievement scores). This type of model can be fitted in the latent variable modeling framework of the Mplus program (Muthén and Muthén 1998-2002, tech. app. 8), which has implemented an EM-based maximum likelihood estimator. (For related references, see the Mplus website, www.statmodel.com.) As a special case of (1), conventional random-effects growth modeling includes ηi but excludes Ci and Ti and assumes missingness at random, so that the last term in (1) is ignored. Growth mixture modeling (Muthén and Shedden 1999; Muthén et al. 2002; Muthén and Muthén 1998-2002) includes ηi and Ti. BFHR includes Ci, but not Ti (or ηi), and includes the last term in (1), drawing on the latent ignorability of Frangakis and Rubin (1999). Muthén and Brown (2001) studied latent ignorability related to Ti in the last term of (1). In randomized studies, it would be of interest to study Ci and Ti classes jointly, because individuals in different trajectory classes may show different compliance, and missingness may be determined by these classes jointly.
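To make the factorization in (1) concrete, the following minimal sketch computes one individual's observed-data likelihood contribution by summing over the latent classes, with a linear growth curve within trajectory class, a random intercept integrated out analytically, and response indicators that depend on the latent classes (latent ignorability). All parameter values are invented, and the Ui factor and the treatment arm are suppressed for brevity; this is an illustration of the structure, not BFHR's or Mplus's implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

times = np.array([0.0, 1.0, 2.0])        # pretest plus two follow-ups

p_class = np.array([[0.35, 0.15],        # P(C = c, T = t); constant in X
                    [0.30, 0.20]])       # for simplicity
alpha = np.array([20.0, 40.0])           # trajectory-class intercepts
beta = np.array([1.0, 4.0])              # trajectory-class slopes
psi, sigma2 = 25.0, 16.0                 # random-intercept and residual variances
p_resp = np.array([[0.9, 0.8],           # P(R_ij = 1 | C, T): response rates
                   [0.6, 0.7]])          # differ by latent class, not by Y

def likelihood_i(y, r):
    """Observed-data likelihood for one individual: sum over (C, T)."""
    total = 0.0
    obs = r.astype(bool)
    for c in range(2):                   # compliance strata
        for t in range(2):               # trajectory classes
            mu = alpha[t] + beta[t] * times
            cov = psi + sigma2 * np.eye(len(times))   # eta integrated out
            if obs.any():
                f_y = multivariate_normal.pdf(y[obs], mu[obs],
                                              cov[np.ix_(obs, obs)])
            else:
                f_y = 1.0
            f_r = np.prod(np.where(r, p_resp[c, t], 1.0 - p_resp[c, t]))
            total += p_class[c, t] * f_y * f_r
    return total

y = np.array([22.0, 25.0, np.nan])       # third occasion missing
r = np.array([1, 1, 0])
print(likelihood_i(y, r))
```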

If data have been generated by a growth mixture model with treatment effects varying across trajectory classes, what would a pretest-posttest analysis such as that in BFHR reveal? To judge the possibility of such treatment-trajectory interaction in the NYSCS, we considered several recent applications of growth mixture modeling that have used Ti to represent qualitatively different types of trajectories for behavior and achievement scores on children in school settings. Drawing on these real-data studies, two growth mixture scenarios were investigated. (A detailed description of these real-data studies and scenarios and their parameter values are given in Mplus Web Note 5, at www.statmodel.com/mplus/examples/webnote.html.) For simplicity, no missing data on the outcome or pretest is assumed, and Ci classes are not present. In a three-class scenario, the treatment effect is noteworthy only for a 70% middle class, assuming that the low-class membership (10%) hinders individuals from benefiting from the treatment and that the high-class membership (20%) does not really need the treatment. The achievement development in the three-class scenario is shown in Figure 1(a), and the corresponding posttest (y2)-pretest (y1) regressions are shown in Figure 1(b). The lines denoted ANCOVA show a regular ANCOVA analysis, allowing for an interaction between treatment and pretest (different slopes). In the three-class scenario, the ANCOVA interaction is not significant at n = 2000, and the treatment effect in the middle class is underestimated by 20% but overestimated in the other two classes. In a two-class scenario (not shown here), where the treatment is noteworthy only for individuals in the low class (50%), ANCOVA detects an interaction that is significant at the NYSCS sample size of n = 2000 but underestimates the treatment effect for most children in the low class. (At the low-class average pretest value of 0, the treatment effect is underestimated by 32%.)
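The scenario comparison can be emulated in a few lines. The sketch below (parameter values invented, not those of Mplus Web Note 5) generates data from a three-class model in which only the 70% middle class benefits, fits an ANCOVA with a treatment-by-pretest interaction, and evaluates the fitted effect at each class's mean pretest; the middle-class effect is attenuated and the outer-class effects are inflated, as described above.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 2000
true_eff = np.array([0.0, 8.0, 0.0])       # only the middle class benefits
class_mean = np.array([15.0, 30.0, 55.0])  # latent initial status by class
cls = rng.choice(3, size=n, p=[0.10, 0.70, 0.20])
z = rng.integers(0, 2, size=n)             # randomized treatment indicator
y1 = class_mean[cls] + rng.normal(0, 5, n)             # noisy pretest
y2 = class_mean[cls] + 2.0 + true_eff[cls] * z + rng.normal(0, 5, n)

# ANCOVA with a treatment-by-pretest interaction (different slopes)
X = sm.add_constant(np.column_stack([z, y1, z * y1]))
b = sm.OLS(y2, X).fit().params
for c in range(3):
    print(f"class {c}: true {true_eff[c]:4.1f}, "
          f"ANCOVA at class mean {b[1] + b[3] * class_mean[c]:4.1f}")
```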

Although the NYSCS children are selected from low-performing schools (the average NYSCS math and reading percentile rankings are around 23-28), there may still be sufficient heterogeneity among children in their achievement growth to make a treatment-trajectory interaction plausible. The three-class scenario is possible, perhaps with more children in the low class relative to the other two classes. If this is the case, the ANCOVA analysis shown in Figure 1 suggests a possible reason for BFHR's finding of low treatment effects. The empirical studies and the results in Figure 1 suggest that future program evaluations may benefit from exploring variation in treatment effects across children characterized by different development. Using data from at least two posttreatment time points (three time points total), the class-specific treatment effects generated in these data can be well recovered by growth mixture modeling. (Monte Carlo simulation results are given in Mplus Web Note 5, at www.statmodel.com/mplus/examples/webnote.html.)


Figure 1. Growth Mixture Modeling Versus Pretest-Posttest Analysis.

A more flexible analysis is obtained with more posttreatment time points. An improved design for the determination of the latent trajectory classes is to use more than one pretreatment time point, so that the trajectory class membership is better determined before the treatment starts.

3. COMPOUND EXCLUSION AND LATENT IGNORABILITY

Based on the ideas of principal stratification (Frangakis and Rubin 2002) and latent ignorability (Frangakis and Rubin 1999), BFHR successfully demonstrates that the complexities of educational studies can be better handled under more explicit and flexible sets of assumptions. Although we think that their structural assumptions are reasonable in the NYSCS, we would like to add some thoughts on the plausibility of two of these assumptions, considering more general situations.

Compound exclusion (CE) is one of the key structural assumptions in identifying principal effects under latent ignorability. However, the plausibility of this assumption can be questioned in practice (Frangakis et al. 2002; Hirano et al. 2000; Jo 2002a, 2002c; Shadish, Cook, and Campbell 2002; West and Sagarin 2000). In the NYSCS, it seems realistic to assume that winning a lottery has some positive impact on always-takers; however, it is less clear how winning a lottery will affect never-takers. One possibility is that winning a lottery has a negative impact on parents, because they fail to benefit from it. Discouraged parents may have a negative influence on a child's test scores or response behaviors. This negative effect may become more evident if noncompliance is due to parents' low expectation of, or lack of interest in, their child's education. Another possibility is that winning a lottery has a positive impact on a child's test scores or response behaviors. For example, parents who are discouraged by being unable to send their child to private schools even with vouchers may try harder to improve the quality of existing resources (e.g., in the public school their child attends) and be more motivated to support their child to improve his or her academic performance. Given these competing possibilities, it is not easy to predict whether and how CE is violated.

Depending on the situation, causal effect estimates can be quite sensitive to violation of the exclusion restriction in outcome missingness (Jo 2002b), which is less well known than the impact of violating the exclusion restriction on observed outcomes (Angrist et al. 1996; Jo 2002a). The implication of possible violation of CE and its impact is that the relative benefit of models assuming latent ignorability (LI) and standard ignorability (SI) depends on the degrees of deviation from CE and SI. Identification of causal effects under LI relies on the generalized (compound) exclusion restriction (i.e., both on the outcomes and on missingness of outcomes), whereas identification of causal effects under SI relies on the standard exclusion restriction (i.e., only on the outcomes). Therefore, in some situations, the impact of deviation from CE may outweigh the impact of deviation from SI, resulting in more biased causal effect estimates in models assuming LI than in models assuming SI (Jo 2002b). For example, if SI holds but CE is seriously violated (say, a 20% increase in the response rate due to treatment assignment for compliers and a 15% increase for never-takers), causal effect estimates and the coverage probability assuming LI and CE can drastically deviate from the true value and the nominal level. This type of violation does not affect models assuming SI and the standard exclusion restriction. To empirically examine the plausibility of SI, LI, and CE, it will be useful to do sensitivity analyses of models imposing different combinations of these assumptions. As BFHR points out, this investigation can be conducted by relaxing compound exclusion (e.g., Frangakis et al. 2002; Hirano et al. 2000) or by using alternative structural assumptions (e.g., Jo 2002c). More research is needed to examine the efficiency of these alternative models and to explore factors associated with insensitivity of LI models to violation of compound exclusion.
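As a concrete version of this sensitivity scenario, the sketch below (all numbers invented) builds a population in which assignment raises the response rate by 20 percentage points for compliers and by 15 points for never-takers. The printed rates make the compound exclusion violation visible: CE requires never-takers' response behavior to be unaffected by assignment, which fails here by construction, so an analysis imposing CE must mis-attribute that shift elsewhere.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
stratum = rng.choice(["complier", "never-taker"], size=n, p=[0.6, 0.4])
z = rng.integers(0, 2, size=n)                  # randomized assignment
base = {"complier": 0.55, "never-taker": 0.60}  # control-arm response rates
bump = {"complier": 0.20, "never-taker": 0.15}  # assignment effect on response
p_resp = (np.array([base[s] for s in stratum])
          + z * np.array([bump[s] for s in stratum]))
r = rng.random(n) < p_resp                      # observed response indicator

# Never-takers' response rates differ across arms: the CE violation.
for s in ["complier", "never-taker"]:
    for arm in (0, 1):
        m = (stratum == s) & (z == arm)
        print(f"{s}, arm {arm}: response rate {r[m].mean():.3f}")
```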

4. CONCLUSION

Causal inferences of the type BFHR provides are a dramatic improvement over the existing literature now available on the question of whether school choice will produce better achievement outcomes for children in an urban public school system. The randomized lottery provides an exceptionally powerful tool for examining the impact of a program, far more useful than observational studies that have causal change intertwined hopelessly with self-selection factors. Statisticians are just now investigating variations in such principal strata analyses, that is, those involving latent classes formed as a function of randomized trials involving intervention invitations (such as vouchers), encouragement designs, and field trial designs involving more than one randomization (Brown and Liao 1999). The latent categories in this article, which BFHR labels "complier," "never-taker," "always-taker," and "defier," represent only one type of design. Other terms may be more relevant to the scientific questions underlying trials in which subjects are randomly assigned to different levels of invitation (e.g., Angrist and Imbens 1995) or different levels of implementation. Such trials not only have great potential for examining questions of effectiveness, sustainability, and scalability, but also require terms more consistent with adherence than compliance. Again, we congratulate the authors on an important addition to the methodological literature that we predict will have lasting impact.

ADDITIONAL REFERENCES

Angrist, J. D., and Imbens, G. W. (1995), "Two-Stage Least Squares Estimation of Average Causal Effects in Models With Variable Treatment Intensity," Journal of the American Statistical Association, 90, 431-442.

Brown, C. H., and Liao, J. (1999), "Principles for Designing Randomized Preventive Trials in Mental Health: An Emerging Developmental Epidemiologic Perspective," American Journal of Community Psychology, 27, 673-709.

Jo, B. (2002a), "Model Misspecification Sensitivity Analysis in Estimating Causal Effects of Interventions With Noncompliance," Statistics in Medicine, 21, 3161-3181.

(2002b), "Sensitivity of Causal Effects Under Ignorable and Latent Ignorable Missing-Data Mechanisms," available at www.statmodel.com/mplus/examples/jo.

(2002c), "Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications" (with discussion), Journal of Educational and Behavioral Statistics, 27, 385-420.

Muthén, B. (2002a), "Beyond SEM: General Latent Variable Modeling," Behaviormetrika, 29, 81-117.

Muthén, B., and Brown, C. H. (2001), "Non-Ignorable Missing Data in a General Latent Variable Modeling Framework," presented at the annual meeting of the Society for Prevention Research, Washington, DC, June 2001.

Muthén, B., Brown, C. H., Masyn, K., Jo, B., Khoo, S. T., Yang, C. C., Wang, C. P., Kellam, S., Carlin, J., and Liao, J. (2002), "General Growth Mixture Modeling for Randomized Preventive Interventions," Biostatistics, 3, 459-475.

Muthén, L., and Muthén, B. (1998-2002), Mplus User's Guide, Los Angeles: authors.

Muthén, B., and Shedden, K. (1999), "Finite Mixture Modeling With Mixture Outcomes Using the EM Algorithm," Biometrics, 55, 463-469.

Seltzer, M. H., Frank, K. A., and Bryk, A. S. (1994), "The Metric Matters: The Sensitivity of Conclusions About Growth in Student Achievement to Choice of Metric," Educational Evaluation and Policy Analysis, 16, 41-49.

Shadish, W. R., Cook, T. D., and Campbell, D. T. (2002), Experimental and Quasi-Experimental Designs for Generalized Causal Inference, Boston: Houghton Mifflin.

Slavin, R. E. (2002), "Evidence-Based Education Policies: Transforming Educational Practice and Research," Educational Researcher, 31, 15-21.

West, S. G., and Sagarin, B. J. (2000), "Participant Selection and Loss in Randomized Experiments," in Research Design: Donald Campbell's Legacy (Vol. II), ed. L. Bickman, Thousand Oaks, CA: Sage, pp. 117-154.

Comment

Alan B. KRUEGER and Pei ZHU

The New York City school voucher experiment provides some of the strongest evidence on the effect of private school vouchers on student test achievement yet available. Barnard et al. provide a useful reevaluation of the experiment, and some of the authors were integrally involved in the design of the experiment. We will leave it to other discussants to comment on the Bayesian methodological advances in their paper, and instead comment on the substantive lessons that can be learned from the experiment and the practical lessons that emerge from the novel design of the experiment.

We have the advantage of having access to more complete data than Barnard et al. used in preparing their paper. This includes more complete baseline test information, data on multichild families as well as single-child families, and three years of follow-up test data instead of just one year of follow-up data.

Alan B. Krueger is Professor and Pei Zhu is a graduate student, Economics Department, Princeton University, Princeton, NJ 08544 (E-mail: akrueger@princeton.edu). In part, this comment extends and summarizes results of Krueger and Zhu (2003). Readers are invited to read that article for a more in-depth analysis of many of the issues raised in this comment. Some of the results presented here differ slightly from those in our earlier article, however, because the definition of family size in this comment corresponds to the one used by Barnard et al., rather than to the definition in our earlier article.

In our comment, we use the more comprehensive sample because results for this sample are more informative, but we highlight where differences arise from using the sample analyzed by Barnard et al.

Three themes emerge from our analysis. First, simplicity and transparency are underappreciated virtues in statistical analysis. Second, it is desirable to use the most recent, comprehensive data for the widest sample possible. Third, there is no substitute for probing the definitions and concepts that underlie the data.

1. RANDOM ASSIGNMENT

As Barnard et al. explain, the voucher experiment entailed a complicated block design, and different random assignment procedures were used in the first and second set of blocks. In the first block, a propensity matched-pairs design (PMPD) method was used. In this block, far more potential control families were available than there was money to follow them up. Rather than select a random sample of controls for follow-up after randomly selecting the treatments, the researchers estimated a propensity score model to align controls with the treatments and then used a nearest-available-neighbor Mahalanobis match to select specific paired control families for follow-up. In principle, the PMPD design should improve the precision of the estimates by reducing the chance imbalances that can occur in a simple randomized block design. A standard randomized block design was used in the remaining four blocks.
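For readers who want the mechanics, here is a minimal sketch of the PMPD idea just described: a logistic propensity score is estimated, and each treated unit is then paired with its nearest available control by Mahalanobis distance within a propensity caliper. The pool sizes, covariates, caliper, and match order are our own illustrative assumptions; the actual model used by the evaluators was proprietary, a point we return to below.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n_t, n_c = 100, 500
Xt = rng.normal(0.3, 1.0, size=(n_t, 3))   # treated-family covariates
Xc = rng.normal(0.0, 1.0, size=(n_c, 3))   # large candidate control pool
X = np.vstack([Xt, Xc])
z = np.r_[np.ones(n_t), np.zeros(n_c)]

ps = LogisticRegression().fit(X, z).predict_proba(X)[:, 1]  # propensity score
S_inv = np.linalg.inv(np.cov(X, rowvar=False))              # for Mahalanobis

available = np.ones(n_c, dtype=bool)
pairs = []
for i in np.argsort(-ps[:n_t]):            # match hardest-to-match treated first
    ok = available & (np.abs(ps[n_t:] - ps[i]) < 0.1)       # propensity caliper
    if not ok.any():
        ok = available.copy()              # fall back: drop the caliper
    d = Xc[ok] - Xt[i]
    maha = np.einsum("ij,jk,ik->i", d, S_inv, d)            # squared distances
    j = np.flatnonzero(ok)[np.argmin(maha)]
    pairs.append((i, j))
    available[j] = False                   # nearest *available* neighbor

print(f"matched {len(pairs)} treated units to distinct controls")
```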


Table 1. Efficiency Comparison of Two Methods of Random Assignment: Estimated Treatment Effect and Standard Error, by Method of Random Assignment

Columns: (1) propensity score match subsample, coefficient and number of observations; (2) randomized blocks subsample, coefficient and number of observations; (3) relative sampling variance (RB/PMPD); (4) sample-size-adjusted relative sampling variance (RB/PMPD).

All students
Year 1  Control for baseline   0.33 (1.40), n = 721     2.10 (1.38), n = 734     0.972   0.989
Year 1  Omit baseline          0.32 (1.40), n = 1,048   -1.02 (1.58), n = 1,032  1.274   1.254
Year 3  Control for baseline   1.02 (1.56), n = 613     0.67 (1.74), n = 637     1.244   1.293
Year 3  Omit baseline          -0.31 (1.57), n = 900    -0.45 (1.78), n = 901    1.285   1.287

African-American students
Year 1  Control for baseline   0.32 (1.88), n = 302     8.10 (1.71), n = 321     0.827   0.879
Year 1  Omit baseline          2.10 (1.99), n = 436     3.19 (2.07), n = 447     1.082   1.109
Year 3  Control for baseline   4.21 (2.51), n = 247     6.57 (2.09), n = 272     0.693   0.764
Year 3  Omit baseline          2.26 (2.47), n = 347     3.05 (2.31), n = 386     0.875   0.973

NOTE: Standard errors are in parentheses. The treatment effect coefficient is from a regression of test scores on a dummy indicating assignment to receive a voucher (1 = yes), lottery randomization strata dummies, and, in indicated models, baseline test scores. Bootstrap standard errors account for within-family correlation in residuals. Year 1 or 3 refers to the follow-up year. The last column adjusts the relative sampling variances of the treatment effects for differences in relative sample sizes. The sample of African-American students consists of those whose mothers' race/ethnicity is identified as non-Hispanic Black/African-American.


Barnard et al. emphasize that the PMPD block was more balanced for 15 of the 21 baseline variables that they examined. But the key question is the extent to which the precision of the estimates was improved, not the covariate balance, because both blocks yield unbiased estimates. If the covariates explain relatively little of the outcome variable, then the PMPD will add very little. In Table 1, we provide standard intent-to-treat (ITT) estimates, that is, the difference in mean test scores between treatments and controls, conditional on the strata used for random assignment (family size by block by high/low-achieving school), and, more importantly, their standard errors, separately for the PMPD subsample and the randomized block subsample. The outcome variable is the average of the national percentile rank on the math and reading segments of the Iowa Test of Basic Skills, taken either 1 year or 3 years after random assignment. Two sets of results are presented: one set controlling for baseline test scores, for the subsample with baseline scores, and the other without controlling for scores, for the larger sample. The last column adjusts the relative sampling variances for differences in sample size between the blocks.
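The computation behind the last two columns can be made explicit. Using the year 1, all-students, control-for-baseline row, and assuming the sample-size adjustment simply rescales the variance ratio by the ratio of observations (an assumption that reproduces the printed values):

```python
# Relative sampling variance is the squared ratio of standard errors;
# the last column rescales it by the ratio of sample sizes.
se_pmpd, n_pmpd = 1.40, 721     # propensity-matched block
se_rb, n_rb = 1.38, 734         # randomized block
rel_var = (se_rb / se_pmpd) ** 2
print(round(rel_var, 3))                  # 0.972
print(round(rel_var * n_rb / n_pmpd, 3))  # 0.989
```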

For all students (panel A), the standard errors are about 10% smaller in the PMPD sample. For the Black students (panel B), however, the standard errors are essentially the same or slightly larger in the PMPD sample. This finding is unexpected, because Hill, Rubin, and Thomas (2000) predicted that analyses of subgroups would have more power in the PMPD design, because matching should lead to a more nearly equal representation of subgroups in the treatment and control groups. In addition, the covariate balance is less good for the Black students in the PMPD than in the randomized blocks if we compute Z scores for the differences between treatments and controls for the baseline covariates. One possibility is that because Black students could have been matched to non-Black students in the PMPD design, this subsample was actually less balanced than the subsample of Blacks from the more conventional randomized block design.

The increase in power from the PMPD design is relatively modest, even in the full sample. Of much more practical importance is the fact that the complexity of the PMPD design caused Mathematica to initially calculate incorrect baseline sample weights. The baseline weights were designed to make the follow-up sample representative of all eligible applicants for vouchers. The initial weights assigned much too little importance to the control sample in the PMPD block. Because these families represented many unselected controls, they should have been weighted heavily. The revised weights increased the weight on the PMPD controls by 620%. In contrast, the weight increased by just 18% for the rest of the sample.

This error had grave consequences for the initial inferences and analyses of the data. First, using the faulty weights, Peterson, Myers, and Howell (1998) reported highly statistically significant differences in baseline test scores between the treatment and control groups, with control group members scoring higher on both the math and reading exams. After the mistake was discovered and the weights were revised, however, the baseline differences became small and statistically insignificant. (We note, however, that if we limit attention to the students from single-child families that Barnard et al. use, and use the more complete baseline test score data, then there is a statistically significant difference in baseline reading scores between treatments and controls, even with the revised baseline weights. If we pool together students from single-child and multichild families, then the baseline difference is insignificant. This is another reason why we think it is more appropriate to use the broader sample that includes children from all families.)

Second, because of the inaccurate inference that there was a difference in baseline ability between treatments and controls, Mathematica's researchers were discouraged from analyzing (or at least from presenting) results that did not condition on baseline scores. This is unfortunate, because conditioning on baseline scores caused the researchers to drop from the sample all the students who were initially in kindergarten (because these students were not given baseline tests) and 11% of students initially in grades 1-4. Including students with missing baseline scores increases the sample by more than 40% and expands the population to which the results can be generalized. As explained later, and in an article by Krueger and Zhu (2003), qualitatively different results are found if students with missing baseline scores are included in the sample. Because assignment to receive a voucher was random within lottery strata, estimates that do not condition on baseline scores are nonetheless unbiased. Moreover, it is inefficient to exclude students with missing baseline scores (most of whom had follow-up test scores), and such a sample restriction can potentially cause sample selection bias.

Of lesser importance is the fact that the PMPD design also complicates the calculation of standard errors. The matching of treatments and controls on selected covariates creates a dependence between paired observations. Moreover, if the propensity score model is misspecified, then the equation error also creates a dependence between observations. Researchers have not taken this dependence into account in the calculation of standard errors. We suspect, however, that this is likely to cause only a small understatement of the standard errors, because the covariates do not account for much of the residual variance of test scores and because the PMPD sample is only about half of the overall sample. (We tried to compute the propensity score to adjust the standard errors ourselves, but were unable to replicate the PMPD model because it was not described in sufficient detail and because the computer programs are proprietary. This also prevented us from replicating estimates in Table 4 of Barnard et al.)

The problems created by the complicated experimental design, particularly concerning the weights, lead us to reiterate the advice of Cochran and Cox (1957): "A good working rule is to use the simplest experimental design that meets the needs of the occasion." It seems to us that in this case the PMPD design introduced unnecessary complexity that inadvertently led to a consequential mistake in the computation of weights, with very little improvement in efficiency. This was not the fault of the architects of the PMPD design, who did not compute the weights, but it was related to the complexity of the design. Simplicity and transparency, in designs, methods, and procedures, can help avoid mistakes down the road, which are almost inevitable in large-scale empirical studies. Indeed, Mathematica recently informed us that they still do not have the baseline weights exactly correct, 5 years after random assignment.

2. INTENT-TO-TREAT ESTIMATION RESULTS

Barnard et al. devote much attention to addressing potential problems caused by missing data for the sample of students in grades 1-4 at baseline; this follows the practice of earlier reports by Mathematica, which restricted the analysis of student outcomes (but not parental responses, such as satisfaction) to those enrolled in grades 1-4 at baseline. In our opinion, a much more important substantive problem results from the exclusion of students in the kindergarten cohort, who were categorically dropped because they were not given baseline tests. Because assignment to treatment status was random (within strata), a simple comparison of means between treatments and controls, without conditioning on baseline scores, provides an unbiased estimate of the average treatment effect. Moreover, as Barnard et al. and others show, treatments and controls were well balanced in terms of baseline characteristics, so there is no reason to suspect that random assignment was somehow corrupted.

Table 2 presents regression estimates of the ITT effect using various samples. Because we cannot replicate the Bayesian estimates without knowing the propensity score, we present conventional ITT estimates. For each entry in the first column, we regressed the test score percentile rank on a voucher offer dummy; 30 dummies indicating lottery strata (block × family size × high/low school); and baseline test scores. The sample is limited to those with baseline test data. Our results differ from the ITT estimates in Table 5 of Barnard et al. because we use the revised weights and a more comprehensive sample that also includes students from multichild families. Barnard et al. report that their Bayesian estimates are "generally more stable" than the conventional ITT estimates, "which in some cases are not even credible," but the extreme ITT estimates that they refer to, for 4th graders from high-achieving schools, are based on only 15 students. For the samples that pool students across grades, the results of the conventional ITT estimates are a priori credible and probably not statistically different from their Bayesian estimates.
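In code, the regression just described might look like the following sketch, with the bootstrap over families mirroring the note to Table 1. The data frame and its column names (score, voucher, stratum, base_read, base_math, weight, family_id) are hypothetical stand-ins, not the actual study files.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def itt_estimate(df):
    """Weighted ITT regression: score on voucher offer, strata dummies,
    and baseline reading/math scores."""
    fit = smf.wls("score ~ voucher + C(stratum) + base_read + base_math",
                  data=df, weights=df["weight"]).fit()
    return fit.params["voucher"]

def family_bootstrap_se(df, n_boot=500, seed=0):
    """Resample whole families to respect within-family correlation."""
    rng = np.random.default_rng(seed)
    fams = df["family_id"].unique()
    stats = []
    for _ in range(n_boot):
        draw = rng.choice(fams, size=len(fams), replace=True)
        boot = pd.concat([df[df["family_id"] == f] for f in draw])
        stats.append(itt_estimate(boot))
    return np.std(stats, ddof=1)
```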

In the second column, we use the same sample as in column 1 but omit the baseline test score from the model. In the third column, we expand the sample to include those with missing baseline scores. In the fourth column, we continue to include students with missing baseline scores, but restrict the sample to those initially in low-achieving public schools.

We focus mainly on the results for Black students and students from low-achieving public schools, because the effect of offering a voucher on either reading or math scores for the overall sample is always statistically insignificant and because most public policy attention has focused on these two groups.

Although it has not been made explicit in previous studies of these data, Black students have been defined as children with a non-Hispanic Black/African-American mother. We use this definition and a broader one to probe the sensitivity of the results.

Omitting baseline scores has little qualitative effect on the estimates when the same sample is used (compare columns 1 and 2); the coefficient typically changes by no more than half a standard error. For year 3 scores for Black students, for example, the treatment effect on the composite score is 5.2 points (t = 3.2) when controlling for baseline scores and 5.0 points (t = 2.6) when not controlling. This stability is expected in a randomized experiment.


Table 2. Estimated Treatment Effects With and Without Controlling for Baseline Scores: Coefficient on Voucher Dummy From Test Score Regressions for Various Samples

Columns: (1) subsample with baseline scores, controlling for baseline scores; (2) subsample with baseline scores, omitting baseline scores; (3) full sample, omitting baseline scores; (4) low applicant-school subsample, omitting baseline scores.

First follow-up test
Composite  Overall                             1.27 (0.96)    0.42 (1.28)    -0.33 (1.06)   -0.60 (1.06)
           Mother Black, non-Hispanic          4.31 (1.28)    3.64 (1.73)    2.66 (1.42)    1.75 (1.57)
           Either parent Black, non-Hispanic   3.55 (1.24)    2.52 (1.73)    1.38 (1.42)    0.53 (1.55)
Reading    Overall                             1.11 (1.04)    0.03 (1.36)    -1.13 (1.17)   -1.17 (1.18)
           Mother Black, non-Hispanic          3.42 (1.59)    2.60 (1.98)    1.38 (1.74)    0.93 (1.84)
           Either parent Black, non-Hispanic   2.68 (1.50)    1.49 (1.97)    -0.09 (1.71)   -0.53 (1.81)
Math       Overall                             1.44 (1.17)    0.80 (1.45)    0.47 (1.17)    -0.03 (1.15)
           Mother Black, non-Hispanic          5.28 (1.52)    4.81 (1.93)    4.01 (1.54)    2.68 (1.63)
           Either parent Black, non-Hispanic   4.42 (1.48)    3.54 (1.90)    2.86 (1.51)    1.59 (1.63)

Third follow-up test
Composite  Overall                             0.90 (1.14)    0.36 (1.40)    -0.38 (1.19)   0.10 (1.16)
           Mother Black, non-Hispanic          5.24 (1.63)    5.03 (1.92)    2.65 (1.68)    3.09 (1.67)
           Either parent Black, non-Hispanic   4.70 (1.60)    4.27 (1.92)    1.87 (1.68)    1.90 (1.67)
Reading    Overall                             0.25 (1.26)    -0.42 (1.51)   -1.00 (1.25)   -0.08 (1.24)
           Mother Black, non-Hispanic          3.64 (1.87)    3.22 (2.19)    1.25 (1.83)    2.06 (1.83)
           Either parent Black, non-Hispanic   3.37 (1.86)    2.75 (2.18)    0.76 (1.83)    1.23 (1.82)
Math       Overall                             1.54 (1.30)    1.15 (1.53)    0.24 (1.33)    0.29 (1.28)
           Mother Black, non-Hispanic          6.84 (1.86)    6.84 (2.09)    4.05 (1.86)    4.13 (1.85)
           Either parent Black, non-Hispanic   6.04 (1.83)    5.78 (2.09)    2.97 (1.85)    2.58 (1.82)

NOTE: Standard errors are in parentheses. The dependent variable is the test score NPR. The reported coefficient is the coefficient on the voucher offer dummy. All regressions control for 30 randomization strata dummies; models in the first column also control for baseline math and reading test scores. Bootstrap standard errors are robust to correlation in residuals among students in the same family. Bold font in the original indicates that the absolute t ratio exceeds 1.96.

Sample sizes in year 1, for the subsample with baseline scores / without: overall, 1,455/2,080; mother Black, 623/883; either parent Black, 684/968.

Sample sizes in year 3, for the subsample with baseline scores / without: overall, 1,250/1,801; mother Black, 519/733; either parent Black, 572/807.

Sample sizes in year 1 for the subsample from low schools, without baseline scores: overall, 1,802; mother Black, 770; either parent Black, 843.

Sample sizes in year 3 for the subsample from low schools, without baseline scores: overall, 1,569; mother Black, 647; either parent Black, 712.

When the sample is expanded to include those with missing baseline scores in column 3, however, the ITT effect falls almost in half in year 3 for Black students, to 2.65 points, with a t ratio of 1.58. The effect on the math score is statistically significant, as Barnard et al. emphasize. Qualitatively similar results are found if baseline covariates, such as mother's education and family income, are included as regressors (see Krueger and Zhu 2003).

The classification of Black students used in the study is somewhat idiosyncratic. Race and ethnicity were collected in a single question in the parental survey, contrary to the OMB guidelines for government surveys, so it is impossible to identify Blacks of Hispanic origin (of which there are a large number in New York). In addition, the survey did not directly ask about the child's race or ethnicity, so children's race must be inferred from parents' race/ethnicity. A broader definition of Black students, and one that probably comes closer to the definition used in government surveys and most other educational research, would also include those who have a Black father in the sample of Black students. According to the 1990 Census, race was reported as Black for 85% of the children with a Black father and a Hispanic mother (the overwhelming non-Black portion of the sample) in the New York metropolitan region.

Hence we also present results for a sample in which either parent is listed as "Black, non-Hispanic." This increases the sample of Black students by about 10%, and the results are even weaker for this sample. For example, the ITT estimate using the more comprehensive sample of Black students enrolled in grades K-4 at baseline is 1.87 points (t = 1.11) on the third-year follow-up composite test, although even with this sample the effect on math is significant at the 10% level in year 1, as Barnard et al. have found, and almost significant in year 3. So we conclude that the effect of the opportunity to use a private school voucher on the composite score for the most comprehensive sample of Black students is insignificantly different from 0, although it is possible that there was initially a small beneficial effect on the math score for Black students.

In the final column, we present results for the subsample of students originally enrolled in schools with average test scores below the median score in New York City. In each case, the results for this subsample are quite similar to those for the full sample, and a formal test of the null hypothesis that students from low- and high-achieving schools benefit equally from vouchers is never close to rejecting. It also appears very unlikely that the differences between the treatment effects for applicants from low- and high-achieving schools in Barnard et al.'s Table 4 are statistically significant, either.

3. WHAT WAS BROKEN

We agree with Barnard et al. that the experiment was broken, in the sense that attrition and missing data were common. Previous analyses were also strained, if not broken, by their neglect of the cohort of students originally in kindergarten, who were in third grade at the end of the experiment and whose follow-up test scores were ignored. Including these students qualitatively alters the results for Black students. The experiment was also broken in the sense that years passed before correct baseline weights were computed.

We disagree, however, with the interpretation that the experiment was broken because compliance was less than 100%. This depends on the question that one is interested in answering. Because most interest in the experiment for policy purposes centers on the impact of offering vouchers on achievement, not on compelling students to use vouchers, we think the ITT estimates, which reflect inevitable partial usage of vouchers, are most relevant (see also Rouse 1998; Angrist et al. 2002). Moreover, New York had a higher voucher take-up rate than the experiments in Dayton and the District of Columbia, so one could argue that the New York experiment provides an upper-bound estimate of the effect of offering vouchers. On the other hand, if the goal is to use the experiment to estimate the effect of attending private school on achievement, then methods such as instrumental variables or those used by Barnard et al. are necessary to estimate the parameter of interest. We consider this of secondary interest in this case, however.

ADDITIONAL REFERENCES

Angrist, J., Bettinger, E., Bloom, E., King, E., and Kremer, M. (2002), "Vouchers for Private Schooling in Colombia: Evidence From a Randomized Natural Experiment," American Economic Review, 92, 1535-1558.

Cochran, W., and Cox, G. (1957), Experimental Designs (2nd ed.), New York: Wiley.

Krueger, A., and Zhu, P. (2003), "Another Look at the New York City School Voucher Experiment," American Behavioral Scientist, forthcoming. Also available from Industrial Relations Section, Princeton University, at http://www.irs.princeton.edu.

Peterson, P., Myers, D., and Howell, W. (1998), "An Evaluation of the New York City Scholarships Program: The First Year," Mathematica Policy Research, Inc., available at http://www.ksg.harvard.edu/pepg/pdf/ny1rpt.pdf.

Rouse, C. (1998), "Private School Vouchers and Student Achievement: An Evaluation of the Milwaukee Parental Choice Program," The Quarterly Journal of Economics, 113, 553-602.

Comment

Richard A. BERK and Hongquan XU

1. INTRODUCTION

The article by Barnard, Frangakis, Hill, and Rubin (hereafter BFHR) is a virtuoso performance. By applying the Neyman-Rubin model of causal effects and building on a series of important works on the estimation and interpretation of causal effects (e.g., Angrist, Imbens, and Rubin 1996), BFHR manage to extract a reasonable set of findings for a "broken" randomized experiment. The article more generally underscores the important difference between estimating relationships in a consistent manner and interpreting those estimates in causal terms; obtaining consistent estimates is only part of the enterprise. Finally, the article emphasizes that assumptions made so that proper estimates may be obtained are not just technical moves of convenience. Rather, they are statements about how the empirical world is supposed to work. As such, they need to be examined concretely with reference to the phenomena being studied, evaluated empirically whenever possible, and, at a minimum, subjected to sensitivity tests.

Richard A. Berk is Professor and Hongquan Xu is Assistant Professor, Department of Statistics, University of California, Los Angeles, CA 90095 (E-mail: berk@stat.ucla.edu).

As battle-tested academics, we can certainly quibble here and there about some aspects of the article. For example, might it not have made sense to construct priors from the educators and economists who were responsible for designing the intervention? Moreover, might it not have made sense to interpret the results using a yardstick of treatment effects that are large enough to matter in practical terms? Nonetheless, we doubt that overall we could have done as well as BFHR did. Moreover, most of our concerns depend on features of the data that cannot be known from a distance; we would have had to carefully examine the data ourselves. As a result, we focus on why a virtuoso performance was needed to begin with. Why was this trip necessary?


2. NONCOMPLIANCE

The implementation of randomized field experiments, as desirable as they are, invites any number of well-known problems (Berk 1990). One of these problems is that human subjects often do not do what they are told, even when they say they will. There is a recent scholarly literature on noncompliance (Metry and Meyer 1999; Johnson 2000) and a citation trail leading back to evaluations of the social programs of the War on Poverty (Haveman 1977). Some noncompliance can be expected except when there is unusual control over the study subjects. Our recent randomized trial of the inmate classification system used by the California Department of Corrections is one example (Berk, Ladd, Graziano, and Baek 2003).

An obvious question follows: Why does it seem that the individuals who designed this experiment were caught by surprise? As best we can tell, little was done to reduce noncompliance, and, more important for the analyses of BFHR, apparently almost no data were collected on factors that might be explicitly related to noncompliance (Hill, Rubin, and Thomas 2000).

Without knowing far more about the design stages of the experiment, we find it difficult to be very specific. But from the literatures on compliance and school choice, the following are probably illustrative indicators relevant for parents who would not be inclined to use a school voucher; for other forms of noncompliance, other measures would be needed:

1. Whether the adults in the household are employed outside the home during the day
2. Whether there is any child care at the private schools before classes begin and after they end, for children of employed parents
3. A household's ability to pay for private school beyond the value of the voucher
4. Travel time to the nearest private schools
5. Availability of transportation to those schools
6. Religious preference of the household and religious affiliation of the private school

From these examples, and others that individuals closer to the study would know about, one can imagine a variety of measures to reduce noncompliance. For instance, key problems for any working parent are child care before and after school and round-trip school transportation. These might be addressed directly with supplemental support. If no such resources could be obtained, then the study design might be altered to screen out, before random assignment, households for whom compliance was likely to be problematic.

3. MISSING DATA

No doubt BFHR know more about the subtleties of the missing data than we do. Nevertheless, the same issues arise for missing data that arise for noncompliance. Surely problems with missing data could have been anticipated. Factors related to nonresponse to particular items could have been explicitly measured, and prevention strategies perhaps could have been deployed.

4. DESIGN EFFICIENCY

A carefully designed experiment not only provides a solid basis for statistical inference, but also gives credible evidence for causation. The New York School Choice Scholarships Program (NYSCSP) was apparently the first good example of a randomized experiment to examine the potential benefits of vouchers for private schools. Offering a scholarship (treatment) randomly through a lottery can (1) provide a socially acceptable way to ration scarce resources, (2) protect against undesirable selection bias, and (3) produce balance between the observed and unobserved variables that may affect the response. In addition to the randomization and blocking, NYSCSP implemented a PMPD (propensity matched-pairs design) to choose a control group from a large candidate pool and to balance many covariates explicitly. Table 2 of BFHR shows that various background variables were balanced properly between the treatment group and the control group.

But might it have been possible to do a bit better? BFHR did not mention that the sample size of the applicant's school was not balanced, although winners from high and low schools were balanced between the treatments and the controls. An ideal design would have had equal winners from "bad" (or low) and "good" (or high) public schools. In the NYSCSP, 85% of the winners were from bad schools and 15% were from good schools, to achieve the social goals of the School Choice Scholarships Foundation (Hill et al. 2000, p. 158). Bad schools are defined as "those for which the average test scores were below the median test scores for the city" (Hill et al. 2000, p. 158). A direct consequence is that the estimates from the good schools have much larger variation than those from the bad schools, which is evident in BFHR's Tables 4-6: the confidence intervals for good schools are wider.

Instead of classifying schools into two classes (bad and good), it might have been helpful to classify them into four classes: very bad, bad, good, and very good. To estimate the effect of the applicant's school (assuming linear effects), an optimal design would assign half of the winners to the very bad schools, the other half to the very good schools, and none to the two middle classes. Avoiding the middle schools enlarges the distance between bad and good schools and also reduces the source variation in the bad and good schools, and thus reduces the variation in the estimates. To be more consistent with the goals of the School Choice Scholarships Foundation, an alternative design might take 75% of winners from very bad schools, 10% from bad schools, 5% from good schools, and 10% from very good schools. This design would have a higher design efficiency than the original design, especially because the applicant's school was the most important design variable after family size (Hill et al. 2000, p. 160). Of course, people on the ground at the time may have considered these options and rejected them. Our point is that with better planning, it might have been possible to squeeze more efficiency out of the study design. Such an outcome would have made the job undertaken by BFHR a bit easier.
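The efficiency comparison can be made concrete with a back-of-the-envelope calculation: under a linear model, the sampling variance of the fitted school-quality slope is proportional to 1/(n Var(x)), so alternative allocations can be ranked through Var(x) alone. The quality codings and the third, "actual-like" allocation below are our own illustrative assumptions.

```python
import numpy as np

# Linear effect of school quality: Var(slope) is proportional to 1/(n Var(x)),
# so allocations of winners across quality classes compare through Var(x).
x = np.array([1.0, 2.0, 3.0, 4.0])   # very bad, bad, good, very good
allocations = {
    "optimal (50/0/0/50)": np.array([0.50, 0.00, 0.00, 0.50]),
    "alternative (75/10/5/10)": np.array([0.75, 0.10, 0.05, 0.10]),
    "actual-like (42.5/42.5/7.5/7.5)": np.array([0.425, 0.425, 0.075, 0.075]),
}
for name, w in allocations.items():
    var_x = np.sum(w * x**2) - np.sum(w * x) ** 2
    # slope variance relative to the optimal two-extreme allocation
    print(f"{name}: Var(x) = {var_x:.2f}, relative slope variance = {2.25 / var_x:.2f}")
```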

5. CONCLUSIONS

Scientific problems often provide a fertile ground for important statistical developments; BFHR's article and the literature on which they depend are excellent illustrations. However, from the perspective of public policy, the goal of program evaluation is to obtain usefully precise estimates of program impacts while using the most robust means of analysis available. Simplicity is also important, both for data analysts, who must do the work, and for policy makers, who must interpret the results. It follows that large investments at the front end of an evaluation (in research design, measurement, and study integrity) are essential. Virtuoso statistical analyses can certainly help when these investments are insufficient or when they fail, but they must not be seen as an inexpensive alternative to doing the study well to begin with.

ADDITIONAL REFERENCES

Angrist, J., Imbens, G. W., and Rubin, D. B. (1996), "Identification of Causal Effects Using Instrumental Variables" (with discussion), Journal of the American Statistical Association, 91, 444-472.

Berk, R. A. (1990), "What Your Mother Never Told You About Randomized Field Experiments," in Community-Based Care of People With AIDS: Developing a Research Agenda, AHCPR Conference Proceedings, Washington, DC: U.S. Department of Health and Human Services.

Berk, R. A., Ladd, H., Graziano, H., and Baek, J. (2003), "A Randomized Experiment Testing Inmate Classification Systems," Journal of Criminology and Public Policy, 2, 215-242.

Haveman, R. H. (ed.) (1977), A Decade of Federal Antipoverty Programs: Achievements, Failures, and Lessons, New York: Academic Press.

Hill, J. L., Rubin, D. B., and Thomas, N. (2000), "The Design of the New York School Choice Scholarships Program Evaluation," in Research Design: Donald Campbell's Legacy (Vol. II), ed. L. Bickman, Thousand Oaks, CA: Sage.

Johnson, S. B. (2000), "Compliance Behavior in Clinical Trials: Error or Opportunity?," in Promoting Adherence to Medical Treatment in Chronic Childhood Illness: Concepts, Methods, and Interventions, ed. D. Drotar, Mahwah, NJ: Lawrence Erlbaum.

Metry, J., and Meyer, U. A. (eds.) (1999), Drug Regimen Compliance: Issues in Clinical Trials and Patient Management, New York: Wiley.

Rejoinder

John BARNARD, Constantine E. FRANGAKIS, Jennifer L. HILL, and Donald B. RUBIN

We thank the editorial board and the thoughtful group of discussants that they arranged to comment on our article. The discussants have raised many salient issues and offered useful extensions of our work. We use this rejoinder primarily to address a few points of contention.

Muthén, Jo, and Brown

Muthén, Jo, and Brown (henceforth MJB) provide an enlightening discussion of an alternate and potentially complementary approach to treatment effect estimation. The growth models that they discuss are for data more extensive than what we analyzed, and so are beyond the scope of our article. Nevertheless, MJB propose an interesting idea for looking for interactions in longitudinal experimental data, and we look forward to future applications of the method to appropriate datasets.

Although models that incorporate both our compliance principal strata as well as latent growth trajectories are theoretically possible (assuming multiple years of outcome data) and could potentially yield an extremely rich description of the types of treatment effects manifested, the more time points that exist, the more complicated the compliance patterns can become (at least without further assumptions), and it may be difficult to find a dataset rich enough to support all of the interactions. Given a choice, we feel that it is probably more beneficial to handle real observed issues than to search for interactions with latent trajectories, but certainly it would always be nice to be able to do both. Of course, such a search for interactions in trajectories is more beneficial when there is enough prior knowledge about the scientific background to include additional structural assumptions that can support robust estimation.

MJB make an interesting point about the exploration of potential violations of exclusion for never-takers. Such analyses have been done in past work by some of the authors (Hirano et al. 2000; Frangakis, Rubin, and Zhou 2002), but this was not pursued for our study because the robustness of such methods is still an issue in examples as complex as this one.

Regarding the trade-offs between the joint latent ignorability and compound exclusion assumptions versus the joint assumptions of standard ignorability and the standard exclusion restriction, this topic is up for debate. We would like to point out, however, that these standard structural assumptions come with their own level of arbitrariness. These issues have been explored by Jo (2002) and Mealli and Rubin (2002, 2003).

We thank MJB for providing a thought-provoking discussion of our model and assumptions, as well as ample fodder for additional directions that can be explored in this arena. Clearly, this general category of work has stimulated much new and exciting modeling.

Krueger and Zhu

Krueger and Zhu (henceforth KZ) focus primarily on the underlying study and on analyses that either we did not do or that used data we did not have. Clearly, they are interested in the substantive issues, which is admirable, but we cannot be responsible for analyses that we neither did nor were consulted on, and we are not prepared to comment on what analyses we might have undertaken with data that we did not have.

KZ state that "there is no substitute for probing the definition and concepts that underlie the data." Although we are not entirely sure what this refers to in particular, if it means that good science is more important than fancy techniques, then we certainly agree.

Another comment by KZ is that "it is desirable to use the most recent comprehensive data for the widest sample possible." We agree that it is desirable to have access to the most comprehensive data available, but it is not necessary to use all of these data in any given analysis. Here we used only first-year data because these were the only data available to us at the time that we were building the model. We used only single-child families, as explained in the article, because, regrettably, no compliance data were collected for individual children in the multichild families.

The exclusion of kindergarten children could be considered more of a judgment call regarding the quality of the pretest scores, not only for efficiency, but also for the reasonableness of our assumptions about noncompliance and missing data. Nothing in theory would prevent us from including these children and omitting these extra covariates.

These types of trade-offs (more data vs. more appropriate data) are common in statistical analyses, and the benefits are well known, for example, for avoiding model extrapolation or when some subsets of data do not mean the same thing as others. In fact, Krueger "ignored data" in his article on the minimum wage with David Card (Card and Krueger 1994) by limiting analyses to restaurants from a similar geographic region, in the hope that employees and employers across the two comparison states would be faced with similar economic conditions, rather than looking at all restaurants in these states. This is a form of implicit matching, which we believe represented a good choice. Surely there are times when focusing on subsets of data produces more reliable results.

Furthermore, we would like to point out that KZ's assertion that "because assignment to treatment status was random (within strata), a simple comparison of means between treatments and controls without conditioning on baseline scores provides an unbiased estimate of the average treatment effect" is simply false, because there are missing outcomes. Even if there is perfect compliance, a simple comparison of means across treatment groups when there are missing outcomes will not, in general, lead to an unbiased estimate of the treatment effect. Assumptions need to be made about the missing-data process.
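A tiny simulation (with invented numbers) illustrates this point: even with perfect compliance and a true ITT effect of 2 percentile points, outcome-dependent response in one arm drives the complete-case contrast far from the truth.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
z = rng.integers(0, 2, size=n)             # randomized assignment
y = 50 + 2.0 * z + rng.normal(0, 10, n)    # true ITT effect = 2.0
# Treated respond at 90%; controls respond more when their score is high.
p_obs = np.where(z == 1, 0.9, np.clip(0.4 + 0.008 * (y - 50), 0.0, 1.0))
r = rng.random(n) < p_obs
cc = y[r & (z == 1)].mean() - y[r & (z == 0)].mean()
print(f"complete-case estimate: {cc:.2f} (true ITT effect: 2.00)")
```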

When noncompliance exists in addition to the missing outcome data, Frangakis and Rubin (1999) have shown that estimation of even the ITT effect requires additional assumptions. Moreover, the ITT estimate, generally considered a conservative estimate of the treatment effect, can actually be anticonservative in this scenario. Finally, contrary to what KZ seem to imply in their final paragraph, instrumental variables methods alone are not sufficient to address the problems of both noncompliance and missing data.

KZ advocate "simplicity and transparency." We wholeheartedly agree, but disagree with the examples they use to defend their stance. Their first argument is that the PMPD did not yield strong efficiency gains and that we did not account for the PMPD in the calculation of standard errors. It may well be that precision gains were relatively modest; however, the analyses that they present are by no means conclusive in this regard. First, as with KZ's other ITT analyses, we have no clue as to how missing outcome data are handled. In addition, the PMPD was also used to balance compliance and missingness. It is unclear, however, how the gains in precision due to matching on the missingness and noncompliance are displayed, if at all, in KZ's results. Finally, KZ did not account for the experimental design when estimating their results. Failure to account properly for the design typically results in estimates that overstate the variances of treatment effect estimates. In this design, it is the correlation between matched pair units that leads the sampling variances to be smaller than their estimates.

Even if precision gains were truly modest, they of course were unknown to us at the time we debated the use of the PMPD design. KZ's argument is like the complaint of a cross-country motorist who experiences no car problems during his long drive and then bemoans the money wasted on a spare tire purchased before the trip began. We prefer to avoid potential problems by using careful design.

Importantly, KZ's criticism that we ignored this matching in the analysis is incorrect. The matching was done based on the estimated propensity score, so it follows from Rubin (1978) that assignment is ignorable conditionally given the covariates that were used to estimate it. Thus the Bayesian analysis that we conduct, which conditions on the estimated propensity score, is an approximately valid Bayesian analysis with respect to the matched assignment, assuming that our modeled relationship between the outcomes and the estimated propensity score captures the major features of the relationship to the covariates.
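The logic here can be illustrated in miniature. In the toy simulation below (our own construction, not BFHR's model), assignment depends on covariates only through a propensity score; the raw mean contrast is confounded, whereas subclassifying on the estimated score approximately recovers the true effect, consistent with Rubin (1978).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n = 5000
X = rng.normal(size=(n, 3))
# Assignment depends on X only through a logistic propensity score.
z = rng.random(n) < 1 / (1 + np.exp(-(X @ np.array([0.8, -0.5, 0.3]))))
y = 2.0 * z + X @ np.array([1.0, 1.0, 1.0]) + rng.normal(size=n)

e_hat = LogisticRegression().fit(X, z).predict_proba(X)[:, 1]

naive = y[z].mean() - y[~z].mean()
cuts = np.quantile(e_hat, [0.2, 0.4, 0.6, 0.8])    # score quintiles
s = np.digitize(e_hat, cuts)
adj = sum((s == k).mean() * (y[(s == k) & z].mean() - y[(s == k) & ~z].mean())
          for k in range(5))
print(f"naive: {naive:.2f}, propensity-subclassified: {adj:.2f} (true 2.00)")
```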

KZ's second argument is that use of the PMPD "led to grave consequences," because sampling weights were miscalculated by MPR. It seems odd to us to maintain, as KZ do, that the PMPD "caused" this error; nor is it clear that use of another design would have avoided the error. In any case, we have rerun our analyses, this time generalizing to the population of which the children in our analyses are representative (by setting constant weights in step 3 of Sec. 8.1.1). For this population, all results are very similar to our original results. These results are posted as Appendix C at http://biosun01.biostat.jhsph.edu/~cfrangak/papers/sc.

A third argument revolves around the protracted debate between use of ITT versus treatment-targeted measures such as the CACE. We addressed both ITT and CACE, which allows the reader to focus on one or the other or both, depending on which effect is expected to generalize to another environment: the ITT, or "sociological," effect of accepting or not accepting an offered voucher, or the CACE, or "structural" or "scientific," effect of using a voucher. One could argue that the effect of attending (CACE) is the more generalizable, whereas the sociological effect (ITT) is more easily influenced, for instance, by "hype" about how well the program works. That is, the ITT estimate will not reflect the "effect of offering" if the next time the program is offered the compliance rates change (which seems likely). We prefer to examine both, to provide a more comprehensive picture.

In sum, we agree that simplicity and transparency are important; this is the major motivation for our detailed description of our assumptions in terms of readily understandable concepts, rather than commonly invoked and less intuitive conditions involving correlated error terms. However, this desire for simplicity must be balanced with the prudence of guarding against potential problems through careful design.

Although we agree in theory with much of what KZ espouse, some of their criticisms seem misguided. Moreover, their presentation of analyses with incomplete exposition regarding missing-data complications makes these analyses difficult for us to evaluate properly.


Berk and Xu

Berk and Xu (henceforth BX) raise several interesting questions about the way in which the evaluation was designed and implemented. Many of these are actually addressed in a preceding and lengthier article by us on this topic (Barnard et al. 2002). That article describes, for instance, MPR’s efforts to reduce missing data, without which the missingness rates would surely have been much higher. It also describes special challenges faced in this study. But in any study with human subjects, it is difficult to completely eliminate missing data, particularly if one desires more interesting information than can be obtained from administrative data. One of the challenges for the collection of outcome test scores was the fact that children needed to be tested outside of the public school setting, which allowed the use of a standardized test not used by the public school system, thereby greatly reducing the possibility of bias resulting from teachers “teaching to the test.” Consequently, parents had to bring their children to testing areas on weekends—a formidable hurdle for some, even when monetary incentives were used.

An important consideration with this study was that the evaluators took the initiative to investigate a program that was sponsored by an organization (SCSF) whose primary goal was to provide scholarships, not to do research. We feel that MPR did an admirable job convincing the SCSF to allow the study to be evaluated and raising money for this purpose, but this also meant that the evaluators could not control every aspect of the program. For example, the SCSF originally wanted to make the program available only to children from the lower-test-score schools, and the evaluators had to convince them to also allow a small percentage from those in higher-test-score schools, to increase the generalizability of the results. The designs that BX suggest, even though they have certain desirable statistical properties under appropriate assumptions, were not an option with this study.

Related to this issue of the alignment of the evaluators’ and program administrators’ goals is the issue of the potential for reducing noncompliance rates. Some effort was made, in that help was provided to scholarship winners in finding an appropriate private school. Arguably, however, 100% compliance should not be the goal in such a study, given that we would not expect 100% compliance if the program were expanded and offered, say, by the government on a larger scale. Thus it may not be wise to use extreme efforts to increase compliance if such efforts would not be offered in the large-scale public case. With the current study, the opportunity exists to examine the noncompliance process for this program in this setting: take-up rates, predictors of noncompliance, and so on. To this end, some sort of subexperiment that randomizes a reasonable set of compliance incentives might be useful in similar studies in the future, if sample sizes are sufficient.

We are not convinced that we should have solicited actual prior distributions from educators and economists. In our experience, this is a confusing, difficult, and often unrewarding task. Moreover, school choice is such a politically charged and contentious issue that it seems wiser to try to use relatively diffuse and objective prior distributions.

We agree that results should be interpreted in terms of a yardstick of practical effects. We are unaware of the most appropriate such yardstick to use in this context. One side could claim that even tiny average effects matter, not only because of the cumulative effects on parts of society, but also because of possible nonadditivity of such effects, which could imply huge effects for some rare individuals and minute effects for all others. Others could argue that seemingly quite large effects have no long-run practical impact on quality of life. The results that we present provide a basic empirical starting point for these debates between people whose expertise is in this area.

We thank BX for highlighting our emphasis on the role of the assumptions in our analyses. Our hope is that we have presented the assumptions in such a way as to enable readers to make their own judgment about whether or not they were likely to be satisfied, rather than only stating them technically, which would implicitly make those important decisions for the interested but nontechnical readers.

BX point out a dearth of certain types of covariates. Part of the problem here is that we presented only a small portion of the data collected, due to the fact that our model in its current form did not handle many covariates. However, there were indeed discussions between some members of our team and MPR staff during the development of the questionnaire about what types of variables could be included that would be predictive of missing data or compliance, as well as outcomes. Some of the variables proposed were discarded due to the surveyors’ experience with them as subject to problems such as high measurement error. Also, not every variable discussed was included, due to a desire for a survey that would not be overly burdensome for the study families.

We raise a final, small, semantic question. We wonder about the propriety of the label “Neyman–Rubin model” to describe a framework for causal inference that views observational studies and randomized experiments on the same continuum, and moreover encourages Bayesian analyses of both. Neyman (1923) certainly seems to have been the first to use the formal notation of potential outcomes for randomization-based inference in randomized experiments, which was a truly major advance, but neither he nor anyone else, to the best of our knowledge, used this formal notation to define causal effects in observational studies until Rubin (1974, 1975), nor did he nor anyone else ever advocate Bayesian inference for the unobserved potential outcomes until Rubin (1975, 1978), which is the perspective that we have implemented here.

Overall, we agree with BX about the paramount importance of study design, and we tried to design this study carefully, certainly more so than most social science analyses of secondary datasets like the PSID or CPS. We thank BX for their thoughtful commentary; they raise some important points.

We thank all of the discussants for stimulating comments, which we believe have added emphasis to the importance of both the substantive topic of school vouchers and the development of appropriate methods to evaluate this and other such broken randomized experiments.

ADDITIONAL REFERENCES

Card, D., and Krueger, A. (1994), “Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania,” American Economic Review, 84, 772–793.


Jo, B. (2002), “Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications” (with discussion), Journal of Educational and Behavioral Statistics, 27, 385–420.

Mealli, F., and Rubin, D. B. (2002), Discussion of “Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications,” by B. Jo, Journal of Educational and Behavioral Statistics, 27, 411–415.

——— (2003), “Assumptions Allowing the Estimation of Direct Causal Effects: Discussion of ‘Healthy, Wealthy, and Wise? Tests for Direct Causal Paths Between Health and Socioeconomic Status,’ by Adams et al.,” Journal of Econometrics, 112, 79–87.

Rubin, D. B. (1975), “Bayesian Inference for Causality: The Importance of Randomization,” in Proceedings of the Social Statistics Section, American Statistical Association, pp. 233–239.


moved to, but this multilevel aspect of the data is left aside here for lack of space). To study such a “treatment–baseline interaction” (or “treatment–trajectory interaction”), we will switch from BFHR’s pretest–posttest analysis framework (essentially a very advanced ANCOVA-type analysis) to the growth mixture modeling framework of Muthén et al. (2002). An underlying rationale for this modeling is that individuals at different initial status levels and on different trajectories may benefit differently from a given treatment. ANCOVA controls for initial status as measured by the observed pretest score. Unlike the observed pretest score, the latent variable of initial status is free of time-specific variation and measurement error.

The focus on longitudinal aspects of the New York School Choice Study (NYSCS) is both substantively and statistically motivated. First, treatment effects may not have gained full strength after only a 1-year stay in a private school. The NYSCS currently has data from three follow-ups, that is, providing repeated-measures data from four grades. Although BFHR used percentile scores that do not lend themselves to growth modeling, a conversion to “scale scores” (i.e., IRT-based equated scores) should be possible, enabling growth modeling. Unfortunately, educational research traditionally uses scales that are unsuitable for growth modeling, such as percentile scores, normal curve equivalents, and grade equivalents (for a comparison in a growth context, see Seltzer, Frank, and Bryk 1994). Hopefully, this tradition can be changed. Second, the use of information from more time points than pretest and posttest makes it possible to identify and estimate models that give a richer description of the normative development in the control group and how the treatment changes this development.

Consider three types of latent variables for individual i. The first type, C_i, refers to BFHR’s compliance principal strata. The next two relate to the achievement development as expressed by a growth mixture model: T_i refers to trajectory class, and η_i refers to random effects within trajectory class (the within-class model is a regular mixed-effects model). Unlike the latent class variable C_i, the latent class variable T_i is a fully unobserved variable, as is common in latent variable modeling (see, e.g., Muthén 2002a). Consider the likelihood expression for individual i, using the [·] notation to denote probabilities/densities:

[C_i, T_i | X_i] [η_i | C_i, T_i, X_i] [Y_i | η_i, C_i, T_i, X_i]
    × [U_i | η_i, C_i, T_i, X_i] [R_i | Y_i, η_i, C_i, T_i, X_i],    (1)

where X_i denotes covariates, U_i denotes a compliance stratum indicator (with C_i perfectly measured by U_i in the treatment group for never-takers and perfectly measured in the control group for always-takers, other group–class combinations having missing data), and R_i denotes indicators for missing data on the repeated-measures outcomes Y_i (pretreatment and posttreatment achievement scores). This type of model can be fitted into the latent variable modeling framework of the Mplus program (Muthén and Muthén 1998–2002, tech. app. 8), which has implemented an EM-based maximum likelihood estimator. (For related references, see the Mplus website, www.statmodel.com.) As a special case of (1), conventional random-effects growth modeling includes η_i, but excludes C_i and T_i, and assumes missingness at random, so that the last term in (1) is ignored. Growth mixture modeling (Muthén and Shedden 1999; Muthén et al. 2002; Muthén and Muthén 1998–2002) includes η_i and T_i. BFHR includes C_i, but not T_i (or η_i), and includes the last term in (1), drawing on the latent ignorability of Frangakis and Rubin (1999). Muthén and Brown (2001) studied latent ignorability related to T_i in the last term of (1). In randomized studies, it would be of interest to study C_i and T_i classes jointly, because individuals in different trajectory classes may show different compliance, and missingness may be determined by these classes jointly.
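To illustrate how such a likelihood is built up in a simpler special case, the sketch below (our own construction with invented argument names; it is not Mplus code) evaluates the marginal log-likelihood of a K-class linear growth mixture model, summing over the trajectory class T_i with the random effects η_i integrated out analytically under normality. The compliance strata C_i and the U_i and R_i factors of (1) are omitted for brevity.

```python
# Marginal log-likelihood of a linear growth mixture model: the [T_i]
# and [Y_i | eta_i, T_i] pieces of (1), with eta_i integrated out.
import numpy as np
from scipy.stats import multivariate_normal

def growth_mixture_loglik(Y, times, class_probs, betas, Psi, sigma2):
    """Y: (n, T) repeated measures; times: (T,) assessment times;
    class_probs: (K,) P(T_i = k); betas: (K, 2) class-specific
    intercept/slope means; Psi: (2, 2) random-effect covariance;
    sigma2: residual variance."""
    times = np.asarray(times, float)
    Z = np.column_stack([np.ones_like(times), times])   # growth design matrix
    cov = Z @ Psi @ Z.T + sigma2 * np.eye(len(times))   # marginal covariance
    ll = 0.0
    for y in Y:
        # Mixture over latent trajectory classes for one individual
        mix = sum(p * multivariate_normal.pdf(y, mean=Z @ b, cov=cov)
                  for p, b in zip(class_probs, betas))
        ll += np.log(mix)
    return ll
```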

If data have been generated by a growth mixture model with treatment effects varying across trajectory classes, what would a pretest–posttest analysis such as that in BFHR reveal? To judge the possibility of such treatment–trajectory interaction in the NYSCS, we considered several recent applications of growth mixture modeling that have used T_i to represent qualitatively different types of trajectories for behavior and achievement scores on children in school settings. Drawing on these real-data studies, two growth mixture scenarios were investigated. (A detailed description of these real-data studies and scenarios and their parameter values are given in Mplus Web Note 5, at www.statmodel.com/mplus/examples/webnote.html.) For simplicity, no missing data on the outcome or pretest is assumed, and C_i classes are not present. In a three-class scenario, the treatment effect is noteworthy only for a 70% middle class, assuming that the low-class membership (10%) hinders individuals from benefiting from the treatment and assuming that the high-class membership (20%) does not really need the treatment. The achievement development in the three-class scenario is shown in Figure 1(a), and the corresponding posttest (y2)–pretest (y1) regressions are shown in Figure 1(b). The lines denoted ANCOVA show a regular ANCOVA analysis, allowing for an interaction between treatment and pretest (different slopes). In the three-class scenario, the ANCOVA interaction is not significant at n = 2,000, and the treatment effect in the middle class is underestimated by 20%, but overestimated in the other two classes. In a two-class scenario (not shown here), where the treatment is noteworthy only for individuals in the low class (50%), ANCOVA detects an interaction that is significant at the NYSCS sample size of n = 2,000, but underestimates the treatment effect for most children in the low class. (At the low-class average pretest value of 0, the treatment effect is underestimated by 32%.)
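The following small simulation (our own sketch; the class means and the middle-class effect size are invented for illustration and are not the Web Note 5 parameter values) shows how the three-class scenario just described can be generated and what the regular ANCOVA with a treatment-by-pretest interaction fits to it.

```python
# Simulate a three-class scenario (10% low / 70% middle / 20% high, with a
# treatment effect only in the middle class) and fit the usual ANCOVA.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000
cls = rng.choice(3, size=n, p=[0.10, 0.70, 0.20])   # latent trajectory class
treat = rng.integers(0, 2, size=n)                  # randomized assignment
level = np.array([-1.0, 0.0, 1.0])[cls]             # class-specific status
effect = np.array([0.0, 0.5, 0.0])[cls] * treat     # effect in middle class only
y1 = level + rng.normal(0, 0.5, n)                  # pretest
y2 = level + effect + rng.normal(0, 0.5, n)         # posttest

ancova = smf.ols("y2 ~ y1 * treat",
                 data=pd.DataFrame({"y1": y1, "y2": y2, "treat": treat})).fit()
print(ancova.params)  # the fitted effect blends the three class-specific effects
```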

Although the NYSCS children are selected from low-performing schools (the average NYSCS math and reading percentile rankings are around 23–28), there may still be sufficient heterogeneity among children in their achievement growth to make a treatment–trajectory interaction plausible. The three-class scenario is possible, perhaps with more children in the low class relative to the other two classes. If this is the case, the ANCOVA analysis shown in Figure 1 suggests a possible reason for BFHR’s finding of low treatment effects. The empirical studies and the results in Figure 1 suggest that future program evaluations may benefit from exploring variation in treatment effects across children characterized by different development. Using data from at least two posttreatment time points (three time points total), the class-specific treatment effects generated in these data can be well recovered by growth mixture modeling. (Monte Carlo simulation results are given in Mplus Web Note 5, at www.statmodel.com/mplus/examples/webnote.html.)


Figure 1. Growth Mixture Modeling Versus Pretest–Posttest Analysis.

A more flexible analysis is obtained with more posttreatment time points. An improved design for the determination of the latent trajectory classes is to use more than one pretreatment time point, so that the trajectory class membership is better determined before the treatment starts.

3. COMPOUND EXCLUSION AND LATENT IGNORABILITY

Based on the ideas of principal stratification (Frangakis and Rubin 2002) and latent ignorability (Frangakis and Rubin 1999), BFHR successfully demonstrates that the complexities of educational studies can be better handled under more explicit and flexible sets of assumptions. Although we think that their structural assumptions are reasonable in the NYSCS, we would like to add some thoughts on the plausibility of two other assumptions, considering more general situations.

Compound exclusion (CE) is one of the key structural assumptions in identifying principal effects under latent ignorability. However, the plausibility of this assumption can be questioned in practice (Frangakis et al. 2002; Hirano et al. 2000; Jo 2002b, 2002c; Shadish, Cook, and Campbell 2002; West and Sagarin 2000). In the NYSCS, it seems realistic to assume that winning a lottery has some positive impact on always-takers; however, it is less clear how winning a lottery will affect never-takers. One possibility is that winning a lottery has a negative impact on parents, because they fail to benefit from it. Discouraged parents may have a negative influence on a child’s test scores or response behaviors. This negative effect may become more evident if noncompliance is due to parents’ low expectation of, or lack of interest in, their child’s education. Another possibility is that winning a lottery has a positive impact on a child’s test scores or response behaviors. For example, parents who are discouraged by being unable to send their child to private schools even with vouchers may try harder to improve the quality of existing resources (e.g., in the public school their child attends) and be more motivated to support their child to improve his or her academic performance. Given these competing possibilities, it is not easy to predict whether and how CE is violated.

Depending on the situation, causal effect estimates can be quite sensitive to violation of the exclusion restriction in outcome missingness (Jo 2002b), which is less known than the impact of violating the exclusion restriction in observed outcomes (Angrist et al. 1996; Jo 2002a). The implication of possible violation of CE and its impact is that the relative benefit of models assuming latent ignorability (LI) and standard ignorability (SI) depends on degrees of deviation from CE and SI. Identification of causal effects under LI relies on the generalized (compound) exclusion restriction (i.e., both on the outcomes and on missingness of outcomes), whereas identification of causal effects under SI relies on the standard exclusion restriction (i.e., only on the outcomes). Therefore, in some situations, the impact of deviation from CE may outweigh the impact of deviation from SI, resulting in more biased causal effect estimates in models assuming LI than in models assuming SI (Jo 2002b). For example, if SI holds but CE is seriously violated (say, a 20% increase in the response rate due to treatment assignment for compliers and a 15% increase for never-takers), causal effect estimates and the coverage probability assuming LI and CE can drastically deviate from the true value and the nominal level. This type of violation does not affect models assuming SI and the standard exclusion restriction. To empirically examine the plausibility of SI, LI, and CE, it will be useful to do sensitivity analyses of models imposing different combinations of these assumptions. As BFHR points out, this investigation can be conducted by relaxing compound exclusion (e.g., Frangakis et al. 2002; Hirano et al. 2000) or by using alternative structural assumptions (e.g., Jo 2002c). More research is needed to examine the efficiency of these alternative models and to explore factors associated with insensitivity of LI models to violation of compound exclusion.

4. CONCLUSION

Causal inferences of the type BFHR provides are a dramatic improvement over the existing literature now available on the question of whether school choice will produce better achievement outcomes for children in an urban public school system. The randomized lottery provides an exceptionally powerful tool for examining the impact of a program—far more useful than observational studies that have causal change intertwined hopelessly with self-selection factors. Statisticians are just now investigating variations in such principal strata analyses, that is, those involving latent classes formed as a function of randomized trials involving intervention invitations (such as vouchers), encouragement designs, and field trial designs involving more than one randomization (Brown and Liao 1999). The latent categories in this article, which BFHR labels “complier,” “never-taker,” “always-taker,” and “defier,” represent only one type of design. Other terms may be more relevant to the scientific questions underlying trials in which subjects are randomly assigned to different levels of invitation (e.g., Angrist and Imbens 1995) or different levels of implementation. Such trials not only have great potential for examining questions of effectiveness, sustainability, and scalability, but also require terms more consistent with adherence than compliance. Again, we congratulate the authors on an important addition to the methodological literature that we predict will have lasting impact.

ADDITIONAL REFERENCES

Angrist, J. D., and Imbens, G. W. (1995), “Two-Stage Least Squares Estimation of Average Causal Effects in Models With Variable Treatment Intensity,” Journal of the American Statistical Association, 90, 431–442.

Brown, C. H., and Liao, J. (1999), “Principles for Designing Randomized Preventive Trials in Mental Health: An Emerging Developmental Epidemiologic Perspective,” American Journal of Community Psychology, 27, 673–709.

Jo, B. (2002a), “Model Misspecification Sensitivity Analysis in Estimating Causal Effects of Interventions With Noncompliance,” Statistics in Medicine, 21, 3161–3181.

——— (2002b), “Sensitivity of Causal Effects Under Ignorable and Latent Ignorable Missing-Data Mechanisms,” available at www.statmodel.com/mplus/examples/jo.

——— (2002c), “Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications” (with discussion), Journal of Educational and Behavioral Statistics, 27, 385–420.

Muthén, B. (2002a), “Beyond SEM: General Latent Variable Modeling,” Behaviormetrika, 29, 81–117.

Muthén, B., and Brown, C. H. (2001), “Non-Ignorable Missing Data in a General Latent Variable Modeling Framework,” presented at the annual meeting of the Society for Prevention Research, Washington, DC, June 2001.

Muthén, B., Brown, C. H., Masyn, K., Jo, B., Khoo, S. T., Yang, C. C., Wang, C. P., Kellam, S., Carlin, J., and Liao, J. (2002), “General Growth Mixture Modeling for Randomized Preventive Interventions,” Biostatistics, 3, 459–475.

Muthén, L., and Muthén, B. (1998–2002), Mplus User’s Guide, Los Angeles: authors.

Muthén, B., and Shedden, K. (1999), “Finite Mixture Modeling With Mixture Outcomes Using the EM Algorithm,” Biometrics, 55, 463–469.

Seltzer, M. H., Frank, K. A., and Bryk, A. S. (1994), “The Metric Matters: The Sensitivity of Conclusions About Growth in Student Achievement to Choice of Metric,” Educational Evaluation and Policy Analysis, 16, 41–49.

Shadish, W. R., Cook, T. D., and Campbell, D. T. (2002), Experimental and Quasi-Experimental Designs for Generalized Causal Inference, Boston: Houghton Mifflin.

Slavin, R. E. (2002), “Evidence-Based Education Policies: Transforming Educational Practice and Research,” Educational Researcher, 31, 15–21.

West, S. G., and Sagarin, B. J. (2000), “Participant Selection and Loss in Randomized Experiments,” in Research Design: Donald Campbell’s Legacy (Vol. II), ed. L. Bickman, Thousand Oaks, CA: Sage, pp. 117–154.

Comment

Alan B. KRUEGER and Pei ZHU

The New York City school voucher experiment provides some of the strongest evidence on the effect of private school vouchers on student test achievement yet available. Barnard et al. provide a useful reevaluation of the experiment, and some of the authors were integrally involved in the design of the experiment. We will leave it to other discussants to comment on the Bayesian methodological advances in their paper, and instead comment on the substantive lessons that can be learned from the experiment and the practical lessons that emerge from the novel design of the experiment.

We have the advantage of having access to more complete data than Barnard et al. used in preparing their paper. This includes more complete baseline test information, data on multichild families as well as single-child families, and three years of follow-up test data instead of just one year of follow-up data.

Alan B. Krueger is Professor and Pei Zhu is graduate student, Economics Department, Princeton University, Princeton, NJ 08544 (E-mail: akrueger@princeton.edu). In part this comment extends and summarizes results of Krueger and Zhu (2003). Readers are invited to read that article for a more in-depth analysis of many of the issues raised in this comment. Some of the results presented here differ slightly from those in our earlier article, however, because the definition of family size in this comment corresponds to the one used by Barnard et al., rather than to the definition in our earlier article.

In our comment, we use the more comprehensive sample, because results for this sample are more informative, but we highlight where differences arise from using the sample analyzed by Barnard et al.

Three themes emerge from our analysis. First, simplicity and transparency are underappreciated virtues in statistical analysis. Second, it is desirable to use the most recent, comprehensive data for the widest sample possible. Third, there is no substitute for probing the definitions and concepts that underlie the data.

1. RANDOM ASSIGNMENT

As Barnard et al. explain, the voucher experiment entailed a complicated block design, and different random assignment procedures were used in the first and second set of blocks. In the first block, a propensity matched-pairs design (PMPD) method was used. In this block, far more potential control families were available than there was money to follow them up.



Table 1. Efficiency Comparison of Two Methods of Random Assignment: Estimated Treatment Effect and Standard Error, by Method of Random Assignment

                                  Propensity-score-match       Randomized-blocks         Relative sampling    Sample-size-adjusted
                                  (PMPD) subsample             subsample                 variance             relative sampling variance
Year     Model                    Coefficient         N        Coefficient        N      (RB/PMPD)            (RB/PMPD)

All students
Year 1   Control for baseline      0.33 (1.40)        721       2.10 (1.38)       734      .972                 .989
         Omit baseline             0.32 (1.40)      1,048      −1.02 (1.58)     1,032     1.274                1.254
Year 3   Control for baseline      1.02 (1.56)        613       0.67 (1.74)       637     1.244                1.293
         Omit baseline            −0.31 (1.57)        900      −0.45 (1.78)       901     1.285                1.287

African-American students
Year 1   Control for baseline      0.32 (1.88)        302       8.10 (1.71)       321      .827                 .879
         Omit baseline             2.10 (1.99)        436       3.19 (2.07)       447     1.082                1.109
Year 3   Control for baseline      4.21 (2.51)        247       6.57 (2.09)       272      .693                 .764
         Omit baseline             2.26 (2.47)        347       3.05 (2.31)       386      .875                 .973

NOTE: Standard errors are in parentheses. The treatment effect coefficient is from a regression of test scores on a dummy indicating assignment to receive a voucher (1 = yes), lottery randomization strata dummies, and, in the indicated models, baseline test scores. Bootstrap standard errors account for within-family correlation in residuals. “Year 1” or “Year 3” refers to the follow-up year. The last column adjusts the relative sampling variances of the treatment effects for differences in relative sample sizes. The sample of African-American students consists of those whose mother’s race/ethnicity is identified as non-Hispanic Black/African-American.

Rather than select a random sample of controls for follow-up after randomly selecting the treatments, the researchers estimated a propensity score model to align controls with the treatments, and then used a nearest-available-neighbor Mahalanobis match to select specific paired control families for follow-up. In principle, the PMPD design should improve the precision of the estimates by reducing the chance imbalances that can occur in a simple randomized block design. A standard randomized block design was used in the remaining four blocks.
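For readers unfamiliar with the mechanics, here is a rough sketch of the nearest-available-neighbor Mahalanobis step (our illustration; MPR’s actual implementation is proprietary, and in the real design candidate controls were first aligned on an estimated propensity score, a step omitted here).

```python
# Nearest-available-neighbor Mahalanobis matching: each treated family is
# paired, without replacement, with the closest remaining control.
import numpy as np
from scipy.spatial.distance import mahalanobis

def nearest_available_match(X_treat, X_ctrl):
    """X_treat (n_t, p), X_ctrl (n_c, p) with n_c >= n_t; returns a dict
    mapping each treated row to its matched control row."""
    VI = np.linalg.inv(np.cov(np.vstack([X_treat, X_ctrl]).T))  # pooled inverse covariance
    available = set(range(len(X_ctrl)))
    matches = {}
    for i, x in enumerate(X_treat):
        j = min(available, key=lambda k: mahalanobis(x, X_ctrl[k], VI))
        matches[i] = j
        available.remove(j)   # "available": each control is used only once
    return matches
```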

Barnard et al. emphasize that the PMPD block was more balanced for 15 of the 21 baseline variables that they examined. But the key question is the extent to which the precision of the estimates was improved, not the covariate balance, because both blocks yield unbiased estimates. If the covariates explain relatively little of the outcome variable, then the PMPD will add very little. In Table 1, we provide standard intent-to-treat (ITT) estimates—that is, the difference in mean test scores between treatments and controls, conditional on the strata used for random assignment (family size by block by high/low-achieving school)—and, more importantly, their standard errors, separately for the PMPD subsample and the randomized block subsample. The outcome variable is the average of the national percentile rank on the math and reading segments of the Iowa Test of Basic Skills, taken either 1 year or 3 years after random assignment. Two sets of results are presented: one set controlling for baseline test scores, for the subsample with baseline scores, and the other without controlling for scores, for the larger sample. The last column adjusts the relative sampling variances for differences in sample size between the blocks.
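A minimal version of the regression behind Table 1 might look like the sketch below (our illustration, with hypothetical column names; we use cluster-robust standard errors by family as a stand-in for the bootstrap described in the table note).

```python
# ITT regression: score on a voucher-offer dummy plus randomization-strata
# dummies, optionally controlling for baseline scores.
import statsmodels.formula.api as smf

def itt_estimate(df, control_baseline=True):
    formula = "score ~ voucher + C(stratum)"
    if control_baseline:
        formula += " + baseline_score"  # only for the subsample with baselines
    fit = smf.ols(formula, data=df).fit(
        cov_type="cluster", cov_kwds={"groups": df["family_id"]})
    return fit.params["voucher"], fit.bse["voucher"]
```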

For all students (panel A), the standard errors are about 10% smaller in the PMPD sample. For the Black students (panel B), however, the standard errors are essentially the same or slightly larger in the PMPD sample. This finding is unexpected, because Hill, Rubin, and Thomas (2000) predicted that analyses of subgroups would have more power in the PMPD design, because matching should lead to a more nearly equal representation of subgroups in the treatment and control groups. In addition, the covariate balance is less good for the Black students in the PMPD than in the randomized blocks, if we compute Z scores for the differences between treatments and controls for the baseline covariates. One possibility is that because Black students could have been matched to non-Black students in the PMPD design, this subsample was actually less balanced than the subsample of Blacks from the more conventional randomized block design.
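The balance check mentioned here is simple to compute; a sketch (assuming arrays holding one baseline covariate for treatments and controls):

```python
# Z score for the treatment-control difference in means of one covariate.
import numpy as np

def balance_z(x_treat, x_ctrl):
    diff = x_treat.mean() - x_ctrl.mean()
    se = np.sqrt(x_treat.var(ddof=1) / len(x_treat) +
                 x_ctrl.var(ddof=1) / len(x_ctrl))
    return diff / se
```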

The increase in power from the PMPD design is relatively modest, even in the full sample. Of much more practical importance is the fact that the complexity of the PMPD design caused Mathematica to initially calculate incorrect baseline sample weights. The baseline weights were designed to make the follow-up sample representative of all eligible applicants for vouchers. The initial weights assigned much too little importance to the control sample in the PMPD block. Because these families represented many unselected controls, they should have been weighted heavily. The revised weights increased the weight on the PMPD controls by 620%. In contrast, the weight increased by just 18% for the rest of the sample.

This error had grave consequences for the initial inferences and analyses of the data. First, using the faulty weights, Peterson, Myers, and Howell (1998) reported highly statistically significant differences in baseline test scores between the treatment and control groups, with control group members scoring higher on both the math and reading exams. After the mistake was discovered and the weights were revised, however, the baseline differences became small and statistically insignificant. (We note, however, that if we limit attention to the students from single-child families that Barnard et al. use, and use the more complete baseline test score data, then there is a statistically significant difference in baseline reading scores between treatments and controls, even with the revised baseline weights. If we pool together students from single-child and multichild families, then the baseline difference is insignificant. This is another reason why we think it is more appropriate to use the broader sample that includes children from all families.)

Second, because of the inaccurate inference that there was a difference in baseline ability between treatments and controls, Mathematica’s researchers were discouraged from analyzing (or at least from presenting) results that did not condition on baseline scores. This is unfortunate, because conditioning on baseline scores caused the researchers to drop from the sample all the students who were initially in kindergarten (because these students were not given baseline tests) and 11% of students initially in grades 1–4. Including students with missing baseline scores increases the sample by more than 40% and expands the population to which the results can be generalized. As explained later, and in an article by Krueger and Zhu (2003), qualitatively different results are found if students with missing baseline scores are included in the sample. Because assignment to receive a voucher was random within lottery strata, estimates that do not condition on baseline scores are nonetheless unbiased. Moreover, it is inefficient to exclude students with missing baseline scores (most of whom had follow-up test scores), and such a sample restriction can potentially cause sample selection bias.

Of lesser importance is the fact that the PMPD design also complicates the calculation of standard errors. The matching of treatments and controls on selected covariates creates a dependence between paired observations. Moreover, if the propensity score model is misspecified, then the equation error also creates a dependence between observations. Researchers have not taken this dependence into account in the calculation of standard errors. We suspect, however, that this is likely to cause only a small understatement of the standard errors, because the covariates do not account for much of the residual variance of test scores, and because the PMPD sample is only about half of the overall sample. (We tried to compute the propensity score to adjust the standard errors ourselves, but were unable to replicate the PMPD model, because it was not described in sufficient detail and because the computer programs are proprietary. This also prevented us from replicating estimates in Table 4 of Barnard et al.)

The problems created by the complicated experimental design, particularly concerning the weights, lead us to reiterate the advice of Cochran and Cox (1957): “A good working rule is to use the simplest experimental design that meets the needs of the occasion.” It seems to us that in this case, the PMPD design introduced unnecessary complexity that inadvertently led to a consequential mistake in the computation of weights, with very little improvement in efficiency. This was not the fault of the architects of the PMPD design, who did not compute the weights, but it was related to the complexity of the design. Simplicity and transparency—in designs, methods, and procedures—can help avoid mistakes down the road, which are almost inevitable in large-scale empirical studies. Indeed, Mathematica recently informed us that they still do not have the baseline weights exactly correct, 5 years after random assignment.

2. INTENT-TO-TREAT ESTIMATION RESULTS

Barnard et al. devote much attention to addressing potential problems caused by missing data for the sample of students in grades 1–4 at baseline; this follows the practice of earlier reports by Mathematica, which restricted the analysis of student outcomes (but not parental responses, such as satisfaction) to those enrolled in grades 1–4 at baseline. In our opinion, a much more important substantive problem results from the exclusion of students in the kindergarten cohort, who were categorically dropped because they were not given baseline tests. Because assignment to treatment status was random (within strata), a simple comparison of means between treatments and controls, without conditioning on baseline scores, provides an unbiased estimate of the average treatment effect. Moreover, as Barnard et al. and others show, treatments and controls were well balanced in terms of baseline characteristics, so there is no reason to suspect that random assignment was somehow corrupted.

Table 2 presents regression estimates of the ITT effect using various samples. Because we cannot replicate the Bayesian estimates without knowing the propensity score, we present conventional ITT estimates. For each entry in the first column, we regressed the test score percentile rank on a voucher offer dummy, 30 dummies indicating lottery strata (block × family size × high/low school), and baseline test scores. The sample is limited to those with baseline test data. Our results differ from the ITT estimates in Table 5 of Barnard et al. because we use the revised weights and a more comprehensive sample that also includes students from multichild families. Barnard et al. report that their Bayesian estimates are “generally more stable” than the conventional ITT estimates, “which in some cases are not even credible,” but the extreme ITT estimates that they refer to, for 4th graders from high-achieving schools, are based on only 15 students. For the samples that pool students across grades, the results of the conventional ITT estimates are a priori credible, and probably not statistically different from their Bayesian estimates.

In the second column, we use the same sample as in column 1, but omit the baseline test score from the model. In the third column, we expand the sample to include those with missing baseline scores. In the fourth column, we continue to include students with missing baseline scores, but restrict the sample to those initially in low-achieving public schools.

We focus mainly on the results for Black students and students from low-achieving public schools, because the effect of offering a voucher on either reading or math scores for the overall sample is always statistically insignificant, and because most public policy attention has focused on these two groups.

Although it has not been made explicit in previous studies of these data, Black students have been defined as children with a non-Hispanic Black/African-American mother. We use this definition and a broader one to probe the sensitivity of the results.

Omitting baseline scores has little qualitative effect on the estimates when the same sample is used (compare columns 1 and 2); the coefficient typically changes by no more than half a standard error. For year 3 scores for Black students, for example, the treatment effect on the composite score is 5.2 points (t = 3.2) when controlling for baseline scores and 5.0 points (t = 2.6) when not controlling.


Table 2. Estimated Treatment Effects With and Without Controlling for Baseline Scores: Coefficient on Voucher Dummy From Test Score Regressions for Various Samples

                                                      Subsample with         Subsample with        Full sample,        Subsample from low
                                                      baseline scores,       baseline scores,      omits baseline      applicant schools,
Test        Sample                                    controls for baseline  omits baseline        scores              omits baseline scores

First follow-up test
Composite   Overall                                    1.27 (0.96)            0.42 (1.28)          −0.33 (1.06)        −0.60 (1.06)
            Mother Black, non-Hispanic                 4.31 (1.28)            3.64 (1.73)           2.66 (1.42)         1.75 (1.57)
            Either parent Black, non-Hispanic          3.55 (1.24)            2.52 (1.73)           1.38 (1.42)         0.53 (1.55)
Reading     Overall                                    1.11 (1.04)            0.03 (1.36)          −1.13 (1.17)        −1.17 (1.18)
            Mother Black, non-Hispanic                 3.42 (1.59)            2.60 (1.98)           1.38 (1.74)         0.93 (1.84)
            Either parent Black, non-Hispanic          2.68 (1.50)            1.49 (1.97)          −0.09 (1.71)        −0.53 (1.81)
Math        Overall                                    1.44 (1.17)            0.80 (1.45)           0.47 (1.17)        −0.03 (1.15)
            Mother Black, non-Hispanic                 5.28 (1.52)            4.81 (1.93)           4.01 (1.54)         2.68 (1.63)
            Either parent Black, non-Hispanic          4.42 (1.48)            3.54 (1.90)           2.86 (1.51)         1.59 (1.63)

Third follow-up test
Composite   Overall                                    0.90 (1.14)            0.36 (1.40)          −0.38 (1.19)         0.10 (1.16)
            Mother Black, non-Hispanic                 5.24 (1.63)            5.03 (1.92)           2.65 (1.68)         3.09 (1.67)
            Either parent Black, non-Hispanic          4.70 (1.60)            4.27 (1.92)           1.87 (1.68)         1.90 (1.67)
Reading     Overall                                    0.25 (1.26)           −0.42 (1.51)          −1.00 (1.25)        −0.08 (1.24)
            Mother Black, non-Hispanic                 3.64 (1.87)            3.22 (2.19)           1.25 (1.83)         2.06 (1.83)
            Either parent Black, non-Hispanic          3.37 (1.86)            2.75 (2.18)           0.76 (1.83)         1.23 (1.82)
Math        Overall                                    1.54 (1.30)            1.15 (1.53)           0.24 (1.33)         0.29 (1.28)
            Mother Black, non-Hispanic                 6.84 (1.86)            6.84 (2.09)           4.05 (1.86)         4.13 (1.85)
            Either parent Black, non-Hispanic          6.04 (1.83)            5.78 (2.09)           2.97 (1.85)         2.58 (1.82)

NOTE: Standard errors are in parentheses. The dependent variable is the test score national percentile rank (NPR). The reported coefficient is the coefficient on the voucher offer dummy. All regressions control for 30 randomization strata dummies; models in the first column also control for baseline math and reading test scores. Bootstrap standard errors are robust to correlation in residuals among students in the same family. Bold font (in the original table) indicates that the absolute t ratio exceeds 1.96.

Sample sizes in year 1 for the subsamples with/without baseline scores: overall, 1,455/2,080; mother Black, 623/883; either parent Black, 684/968.

Sample sizes in year 3 for the subsamples with/without baseline scores: overall, 1,250/1,801; mother Black, 519/733; either parent Black, 572/807.

Sample sizes in year 1 for the subsample from low applicant schools (without baseline scores): overall, 1,802; mother Black, 770; either parent Black, 843.

Sample sizes in year 3 for the subsample from low applicant schools (without baseline scores): overall, 1,569; mother Black, 647; either parent Black, 712.

This stability is expected in a randomized experiment. When the sample is expanded to include those with missing baseline scores in column 3, however, the ITT effect falls almost in half in year 3 for Black students, to 2.65 points, with a t ratio of 1.58. The effect on the math score is statistically significant, as Barnard et al. emphasize. Qualitatively similar results are found if baseline covariates, such as mother’s education and family income, are included as regressors (see Krueger and Zhu 2003).

The classification of Black students used in the study is somewhat idiosyncratic. Race and ethnicity were collected in a single question in the parental survey—contrary to the OMB guidelines for government surveys—so it is impossible to identify Blacks of Hispanic origin (of which there are a large number in New York). In addition, the survey did not directly ask about the child’s race or ethnicity, so children’s race must be inferred from parents’ race/ethnicity. A broader definition of Black students—and one that probably comes closer to the definition used in government surveys and most other educational research—would also include those who have a Black father in the sample of Black students. According to the 1990 Census, race was reported as Black for 85% of the children with a Black father and a Hispanic mother (the overwhelming non-Black portion of the sample) in the New York metropolitan region.

Hence we also present results for a sample in which either parent is listed as “Black, non-Hispanic.” This increases the sample of Black students by about 10%, and the results are even weaker for this sample. For example, the ITT estimate using the more comprehensive sample of Black students enrolled in grades K–4 at baseline is 1.87 points (t = 1.11) on the third-year follow-up composite test, although even with this sample the effect on math is significant at the 10% level in year 1, as Barnard et al. have found, and almost significant in year 3. So we conclude that the effect of the opportunity to use a private school voucher on the composite score for the most comprehensive sample of Black students is insignificantly different from 0, although it is possible that there was initially a small beneficial effect on the math score for Black students.

In the final column, we present results for the subsample of students originally enrolled in schools with average test scores below the median score in New York City. In each case, the results for this subsample are quite similar to those for the full sample, and a formal test of the null hypothesis that students from low- and high-achieving schools benefit equally from vouchers is never close to rejecting. It also appears very unlikely that the differences between the treatment effects for applicants from low- and high-achieving schools in Barnard et al.’s Table 4 are statistically significant, either.

3. WHAT WAS BROKEN?

We agree with Barnard et al. that the experiment was broken, in the sense that attrition and missing data were common. Previous analyses were also strained, if not broken, for their neglect of the cohort of students originally in kindergarten, who were in third grade at the end of the experiment and whose follow-up test scores were ignored. Including these students qualitatively alters the results for Black students. The experiment was also broken in the sense that years passed before correct baseline weights were computed.

We disagree, however, with the interpretation that the experiment was broken because compliance was less than 100%. This depends on the question that one is interested in answering. Because most interest in the experiment for policy purposes centers on the impact of offering vouchers on achievement—not on compelling students to use vouchers—we think the ITT estimates, which reflect inevitable partial usage of vouchers, are most relevant (see also Rouse 1998; Angrist et al. 2003). Moreover, New York had a higher voucher take-up rate than the experiments in Dayton and the District of Columbia, so one could argue that the New York experiment provides an upper-bound estimate of the effect of offering vouchers. On the other hand, if the goal is to use the experiment to estimate the effect of attending private school on achievement, then such methods as instrumental variables or those used by Barnard et al. are necessary to estimate the parameter of interest. We consider this of secondary interest in this case, however.

ADDITIONAL REFERENCES

Angrist, J., Bettinger, E., Bloom, E., King, E., and Kremer, M. (2003), “Vouchers for Private Schooling in Colombia: Evidence From a Randomized Natural Experiment,” American Economic Review, 92, 1535–1558.

Cochran, W., and Cox, G. (1957), Experimental Designs (2nd ed.), New York: Wiley.

Krueger, A., and Zhu, P. (2003), “Another Look at the New York City School Voucher Experiment,” American Behavioral Scientist, forthcoming. Also available from Industrial Relations Section, Princeton University, at http://www.irs.princeton.edu.

Peterson, P., Myers, D., and Howell, W. (1998), “An Evaluation of the New York City Scholarships Program: The First Year,” Mathematica Policy Research, Inc., available at http://www.ksg.harvard.edu/pepg/pdf/ny1rpt.pdf.

Rouse, C. (1998), “Private School Vouchers and Student Achievement: An Evaluation of the Milwaukee Parental Choice Program,” The Quarterly Journal of Economics, 113, 553–602.

Comment

Richard A. BERK and Hongquan XU

1. INTRODUCTION

The article by Barnard, Frangakis, Hill, and Rubin (hereafter BFHR) is a virtuoso performance. By applying the Neyman–Rubin model of causal effects and building on a series of important works on the estimation and interpretation of causal effects (e.g., Angrist, Imbens, and Rubin 1996), BFHR manage to extract a reasonable set of findings for a “broken” randomized experiment. The article more generally underscores the important difference between estimating relationships in a consistent manner and interpreting those estimates in causal terms; obtaining consistent estimates is only part of the enterprise. Finally, the article emphasizes that assumptions made so that proper estimates may be obtained are not just technical moves of convenience. Rather, they are statements about how the empirical world is supposed to work. As such, they need to be examined concretely, with reference to the phenomena being studied, evaluated empirically whenever possible, and, at a minimum, subjected to sensitivity tests.

Richard A. Berk is Professor and Hongquan Xu is Assistant Professor, Department of Statistics, University of California, Los Angeles, CA 90095 (E-mail: berk@stat.ucla.edu).

As battle-tested academics, we can certainly quibble here and there about some aspects of the article. For example, might it not have made sense to construct priors from the educators and economists who were responsible for designing the intervention? Moreover, might it not have made sense to interpret the results using a yardstick of treatment effects that are large enough to matter in practical terms? Nonetheless, we doubt that, overall, we could have done as well as BFHR did. Moreover, most of our concerns depend on features of the data that cannot be known from a distance. We would have had to carefully examine the data ourselves. As a result, we focus on why a virtuoso performance was needed to begin with. Why was this trip necessary?



2. NONCOMPLIANCE

The implementation of randomized field experiments, as desirable as they are, invites any number of well-known problems (Berk 1990). One of these problems is that human subjects often do not do what they are told, even when they say they will. There is a recent scholarly literature on noncompliance (Metry and Meyer 1999; Johnson 2000) and a citation trail leading back to evaluations of the social programs of the War on Poverty (Haveman 1977). Some noncompliance can be expected, except when there is unusual control over the study subjects. Our recent randomized trial of the inmate classification system used by the California Department of Corrections is one example (Berk, Ladd, Graziano, and Baek 2003).

An obvious question follows: Why does it seem that the individuals who designed this experiment were caught by surprise? As best we can tell, little was done to reduce noncompliance, and, more important for the analyses of BFHR, apparently almost no data were collected on factors that might be explicitly related to noncompliance (Hill, Rubin, and Thomas 2000).

Without knowing far more about the design stages of the experiment, we find it difficult to be very specific. But from the literatures on compliance and school choice, the following are probably illustrative indicators relevant for parents who would not be inclined to use a school voucher (for other forms of noncompliance, other measures would be needed):

1. Whether the adults in the household are employed outside the home during the day

2. Whether there is any child care at the private schools before classes begin and after they end, for children of employed parents

3. A household’s ability to pay for private school beyond the value of the voucher

4. Travel time to the nearest private schools

5. Availability of transportation to those schools

6. Religious preference of the household and religious affiliation of the private school

From these examples, and others that individuals closer to the study would know about, one can imagine a variety of measures to reduce noncompliance. For instance, key problems for any working parent are child care before and after school and round-trip school transportation. These might be addressed directly with supplemental support. If no such resources could be obtained, then the study design might be altered to screen out, before random assignment, households for whom compliance was likely to be problematic.

3. MISSING DATA

No doubt BFHR know more about the subtleties of the missing data than we do. Nevertheless, the same issues arise for missing data that arise for noncompliance. Surely problems with missing data could have been anticipated. Factors related to nonresponse to particular items could have been explicitly measured, and prevention strategies perhaps could have been deployed.

4. DESIGN EFFICIENCY

A carefully designed experiment not only provides a solid basis for statistical inference, but also gives credible evidence for causation. The New York School Choice Scholarships Program (NYSCSP) was apparently the first good example of a randomized experiment to examine the potential benefits of vouchers for private schools. Offering a scholarship (treatment) randomly through a lottery can (1) provide a socially acceptable way to ration scarce resources, (2) protect against undesirable selection bias, and (3) produce balance between the observed and unobserved variables that may affect the response. In addition to the randomization and blocking, NYSCSP implemented a PMPD (propensity matched-pairs design) to choose a control group from a large candidate pool and to balance many covariates explicitly. Table 2 of BFHR shows that various background variables were balanced properly between the treatment group and the control group.

But might it have been possible to do a bit better? BFHR did not mention that the sample size of the applicant’s school was not balanced, although winners from high and low schools were balanced between the treatments and the controls. An ideal design would have had equal winners from “bad” (or low) and “good” (or high) public schools. In the NYSCSP, 85% of the winners were from bad schools and 15% were from good schools, to achieve the social goals of the School Choice Scholarships Foundation (Hill et al. 2000, p. 158). Bad schools are defined as “those for which the average test scores were below the median test scores for the city” (Hill et al. 2000, p. 158). A direct consequence is that the estimates from the good schools have much larger variation than those from the bad schools, which was evident in BFHR’s Tables 4–6: the confidence intervals for good schools are wider.

Instead of classifying schools into two classes (bad and good), it might have been helpful to classify them into four classes: very bad, bad, good, and very good. To estimate the effect of the applicant’s school (assuming linear effects), an optimal design would select half of the winners from the very bad schools and the other half from the very good schools, and none from the two middle classes. Avoiding the middle schools enlarges the distance between bad and good schools, reduces the source variation in the bad and good schools, and thus reduces the variation in the estimates. To be more consistent with the goals of the School Choice Scholarships Foundation, an alternative design might take 75% of winners from very bad schools, 10% from bad schools, 5% from good schools, and 10% from very good schools. This design would have a higher design efficiency than the original design, especially because the applicant’s school was the most important design variable after family size (Hill et al. 2000, p. 160). Of course, people on the ground at the time may have considered these options and rejected them. Our point is that with better planning, it might have been possible to squeeze more efficiency out of the study design. Such an outcome would have made the job undertaken by BFHR a bit easier.
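A back-of-the-envelope calculation (ours, under an assumed equally spaced school-quality score and homoscedastic errors) illustrates the efficiency ranking claimed above: the sampling variance of an OLS slope is sigma^2 / (n Var(x)), so allocations that push mass toward the extreme classes shrink it.

```python
# Compare Var(slope) under three allocations of winners across four
# school-quality classes scored 0 (very bad) through 3 (very good).
import numpy as np

def slope_variance(levels, probs, n=1000, sigma2=1.0):
    levels, probs = np.asarray(levels, float), np.asarray(probs, float)
    var_x = probs @ (levels - probs @ levels) ** 2   # Var of the design variable
    return sigma2 / (n * var_x)

x = [0, 1, 2, 3]
print(slope_variance(x, [0.85, 0, 0.15, 0]))        # roughly the actual 85/15 split
print(slope_variance(x, [0.50, 0, 0, 0.50]))        # extreme half-and-half allocation
print(slope_variance(x, [0.75, 0.10, 0.05, 0.10]))  # the alternative 75/10/5/10 design
```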

5. CONCLUSIONS

Scientific problems often provide a fertile ground for important statistical developments. BFHR’s article, and the literature on which they depend, are excellent illustrations. However, from the perspective of public policy, the goal of program evaluation is to obtain usefully precise estimates of program impacts while using the most robust means of analysis available. Simplicity is also important, both for data analysts, who must do the work, and policy makers, who must interpret the results. It follows that large investments at the front end of an evaluation—in research design, measurement, and study integrity—are essential. Virtuoso statistical analyses can certainly help when these investments are insufficient or when they fail, but they must not be seen as an inexpensive alternative to doing the study well to begin with.

ADDITIONAL REFERENCES

Angrist, J., Imbens, G. W., and Rubin, D. B. (1996), “Identification of Causal Effects Using Instrumental Variables” (with discussion), Journal of the American Statistical Association, 91, 444–472.

Berk, R. A. (1990), “What Your Mother Never Told You About Randomized Field Experiments,” in Community-Based Care of People With AIDS: Developing a Research Agenda, AHCPR Conference Proceedings, Washington, DC: U.S. Department of Health and Human Services.

Berk, R. A., Ladd, H., Graziano, H., and Baek, J. (2003), “A Randomized Experiment Testing Inmate Classification Systems,” Journal of Criminology and Public Policy, 2, 215–242.

Haveman, R. H. (ed.) (1977), A Decade of Federal Antipoverty Programs: Achievements, Failures, and Lessons, New York: Academic Press.

Hill, J. L., Rubin, D. B., and Thomas, N. (2000), “The Design of the New York School Choice Scholarships Program Evaluation,” in Research Design: Donald Campbell’s Legacy (Vol. II), ed. L. Bickman, Thousand Oaks, CA: Sage.

Johnson, S. B. (2000), “Compliance Behavior in Clinical Trials: Error or Opportunity?,” in Promoting Adherence to Medical Treatment in Chronic Childhood Illness: Concepts, Methods, and Interventions, ed. D. Drotar, Mahwah, NJ: Lawrence Erlbaum.

Metry, J., and Meyer, U. A. (eds.) (1999), Drug Regimen Compliance: Issues in Clinical Trials and Patient Management, New York: Wiley.

Rejoinder

John BARNARD, Constantine E. FRANGAKIS, Jennifer L. HILL, and Donald B. RUBIN

We thank the editorial board and the thoughtful group of discussants that they arranged to comment on our article. The discussants have raised many salient issues and offered useful extensions of our work. We use this rejoinder primarily to address a few points of contention.

Muthen Jo and Brown

Muthen Jo and Brown (henceforthMJB) providean enlight-ening discussion of an alternate and potentially complementaryapproach to treatment effect estimationThe growth models thatthey discuss are for data more extensive than what we analyzedand so are beyond the scope of our article Nevertheless MJBpropose an interesting idea for looking for interactions in lon-gitudinal experimental data and we look forward to future ap-plications of the method to appropriate datasets

Although models that incorporate both our compliance prin-cipal strata as well as latent growth trajectories are theoreticallypossible (assuming multiple years of outcome data) and couldpotentially yield an extremely rich description of the types oftreatment effects manifested the more time points that existthe more complicated the compliance patterns can become (atleast without further assumptions) and it may be dif cult to nda dataset rich enough to support all of the interactions Given achoice we feel that it is probably more bene cial to handle realobserved issues than to search for interactions with latent tra-jectories but certainly it would always be nice to be able todo both Of course such a search for interactions in trajecto-ries is more bene cial when there is enough prior knowledgeabout the scienti c background to include additional structuralassumptions that can support robust estimation

MJB make an interesting point about the exploration of po-tential violations of exclusion for never-takers Such analyseshave been done in past work by some of the authors (Hiranoet al 2000 Frangakis Rubin and Zhou 2002) but was not pur-sued for our study because the robustness of such methods isstill an issue in examples as complex as this one

Regarding the trade-offs between the joint latent ignorabil-ity and compound exclusion assumptions versus the joint as-sumptions of standard ignorability and the standard exclusionrestriction this topic is up for debate We would like to pointout however that these standard structural assumptions comewith their own level of arbitrariness These issues have beenexplored by Jo (2002) and Mealli and Rubin (2002 2003)

We thank MJB for providinga thought-provoking discussionof our model and assumptions as well as ample fodder for ad-ditionaldirections that can be explored in this arena Clearly thegeneral category of work has stimulated much new and excitingmodeling

Krueger and Zhu

Krueger and Zhu (henceforth KZ) focus primarily on the un-derlying study and analyses that either we did not do or useddata we did not have Clearly they are interested in the substan-tive issues which is admirable but we cannot be responsiblefor analyses that we neither did nor were consulted on and arenot prepared to comment on what analyses we might have un-dertaken with data that we did not have

KZ state that ldquothere is no substitute for probing the de ni-tion and concepts that underlie that datardquo Although we are notentirely sure what this refers to in particular if it means thatgood science is more important than fancy techniques then wecertainly agree

Another comment by KZ is that "it is desirable to use the most recent, comprehensive data for the widest sample possible." We agree that it is desirable to have access to the most comprehensive data available, but it is not necessary to use all of these data in any given analysis. Here we use only first-year data, because these were the only data available to us at the time that we were building the model. We used only single-child families, as explained in the article, because regrettably no compliance data were collected for individual children in the multichild families.

The exclusion of kindergarten children could be considered more of a judgment call regarding the quality of the pretest scores, not only for efficiency but also for the reasonableness of our assumptions about noncompliance and missing data. Nothing in theory would prevent us from including these children and omitting these extra covariates.

These types of trade-offs (more data vs. more appropriate data) are common in statistical analyses, and the benefits are well known, for example, for avoiding model extrapolation or when some subsets of data do not mean the same thing as others. In fact, Krueger "ignored data" in his article on the minimum wage with David Card (Card and Krueger 1994) by limiting analyses to restaurants from a similar geographic region, in the hope that employees and employers across the two comparison states would face similar economic conditions, rather than looking at all restaurants in those states. This is a form of implicit matching, which we believe represented a good choice. Surely there are times when focusing on subsets of data produces more reliable results.

Furthermore, we would like to point out that KZ's assertion that "because assignment to treatment status was random (within strata), a simple comparison of means between treatments and controls without conditioning on baseline scores provides an unbiased estimate of the average treatment effect" is simply false, because there are missing outcomes. Even if there is perfect compliance, a simple comparison of means across treatment groups when there are missing outcomes will not, in general, lead to an unbiased estimate of the treatment effect. Assumptions need to be made about the missing-data process.
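A tiny simulation makes the mechanism visible. This sketch is ours (all names and numbers are hypothetical, not the NYSCS data): assignment is randomized and compliance is perfect, but control children are tested at a rate that depends on their score, so the complete-case difference in means no longer recovers the true ITT effect.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200_000

    z = rng.integers(0, 2, n)                  # randomized voucher offer
    y = 50 + 2.0 * z + rng.normal(0, 15, n)    # true ITT effect = 2.0 points

    # Treated children are tested at a flat 85% rate; control children are
    # more likely to be tested the higher their score (outcome-dependent
    # missingness), which is the kind of process assumptions must address.
    p_obs = np.where(z == 1, 0.85, 0.8 / (1 + np.exp(-(y - 50) / 10)))
    observed = rng.random(n) < p_obs

    naive = y[observed & (z == 1)].mean() - y[observed & (z == 0)].mean()
    print(f"complete-case estimate: {naive:.2f} (true ITT effect: 2.00)")
    # The complete-case estimate is far from 2.00 (it can even flip sign).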

When noncompliance exists in addition to the missing outcome data, Frangakis and Rubin (1999) have shown that estimation even of the ITT effect requires additional assumptions. Moreover, the ITT estimate, generally considered a conservative estimate of the treatment effect, can actually be anticonservative in this scenario. Finally, contrary to what KZ seem to imply in their final paragraph, instrumental variables methods alone are not sufficient to address the problems of both noncompliance and missing data.

KZ advocate "simplicity and transparency." We wholeheartedly agree, but disagree with the examples they use to defend their stance. Their first argument is that the PMPD did not yield strong efficiency gains and that we did not account for the PMPD in the calculation of standard errors. It may well be that precision gains were relatively modest; however, the analyses that they present are by no means conclusive in this regard. First, as with KZ's other ITT analyses, we have no clue as to how missing outcome data are handled. In addition, the PMPD was also used to balance compliance and missingness. It is unclear, however, how the gains in precision due to matching on the missingness and noncompliance are displayed, if at all, in KZ's results. Finally, KZ did not account for the experimental design when estimating their results. Failure to account properly for the design typically results in estimates that overstate the variances of treatment effect estimates. In this design, it is the correlation between matched-pair units that leads the sampling variances to be smaller than their estimates.
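A minimal numeric sketch of this last point (ours; stylized numbers, not the study data): when matched pairs share a common component, the correct variance of the estimated effect is that of the paired differences, and an analysis that ignores the pairing overstates it.

    import numpy as np

    rng = np.random.default_rng(1)
    pairs = 100_000

    u = rng.normal(0, 10, pairs)                 # shared pair-level component
    y_treat = 52 + u + rng.normal(0, 5, pairs)
    y_ctrl = 50 + u + rng.normal(0, 5, pairs)

    d = y_treat - y_ctrl
    var_paired = d.var(ddof=1) / pairs           # correct for a paired design
    var_unpaired = y_treat.var(ddof=1) / pairs + y_ctrl.var(ddof=1) / pairs

    print(f"paired-design variance: {var_paired:.5f}")
    print(f"ignoring the pairing:   {var_unpaired:.5f}  (overstated)")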

Even if precision gains were truly modest, they of course were unknown to us at the time we debated the use of the PMPD design. KZ's argument is like the complaint of a cross-country motorist who experiences no car problems during his long drive and then bemoans the money wasted on a spare tire purchased before the trip began. We prefer to avoid potential problems by using careful design.

Importantly, KZ's criticism that we ignored this matching in the analysis is incorrect. The matching was done based on the estimated propensity score, so it follows from Rubin (1978) that assignment is ignorable conditionally given the covariates that were used to estimate it. Thus the Bayesian analysis that we conduct, which conditions on the estimated propensity score, is an approximately valid Bayesian analysis with respect to the matched assignment, assuming that our modeled relationship between the outcomes and the estimated propensity score captures the major features of the relationship to the covariates.

KZ's second argument is that use of the PMPD "led to grave consequences" because sampling weights were miscalculated by MPR. It seems to us odd to maintain, as KZ do, that the PMPD "caused" this error; nor is it clear that use of another design would have avoided the error. In any case, we have rerun our analyses, this time generalizing to the population of which the children in our analyses are representative (by setting constant weights in step 3 of Sec. 8.1.1). For this population, all results are very similar to our original results. These results are posted as Appendix C at http://biosun01.biostat.jhsph.edu/~cfrangak/papers/sc.

A third argument revolves around the protracted debate between use of ITT versus treatment-targeted measures such as CACE. We addressed both ITT and CACE, which allows the reader to focus on one or the other or both, depending on which effect is expected to generalize to another environment: the ITT, or "sociological," effect of accepting or not accepting an offered voucher, or the CACE, or "structural" or "scientific," effect of using a voucher. One could argue that the effect of attending (CACE) is the more generalizable, whereas the sociological effect (ITT) is more easily influenced, for instance, by "hype" about how well the program works. That is, the ITT estimate will not reflect the "effect of offering" if the next time the program is offered the compliance rates change (which seems likely). We prefer to examine both, to provide a more comprehensive picture.

In sum, we agree that simplicity and transparency are important; this is the major motivation for our detailed description of our assumptions in terms of readily understandable concepts, rather than commonly invoked and less intuitive conditions involving correlated error terms. However, this desire for simplicity must be balanced with the prudence of guarding against potential problems through careful design.

Although we agree in theory with much of what KZ espouse, some of their criticisms seem misguided. Moreover, their presentation of analyses with incomplete exposition regarding missing-data complications makes these analyses difficult for us to evaluate properly.


Berk and Xu

Berk and Xu (henceforth BX) raise several interesting questions about the way in which the evaluation was designed and implemented. Many of these are actually addressed in a preceding and lengthier article by us on this topic (Barnard et al. 2002). That article describes, for instance, MPR's efforts to reduce missing data, without which the missingness rates would surely have been much higher. It also describes special challenges faced in this study. But in any study with human subjects, it is difficult to completely eliminate missing data, particularly if one desires more interesting information than can be obtained from administrative data. One of the challenges for the collection of outcome test scores was the fact that children needed to be tested outside of the public school setting, which allowed the use of a standardized test not used by the public school system, thereby greatly reducing the possibility of bias resulting from teachers "teaching to the test." Consequently, parents had to bring their children to testing areas on weekends, a formidable hurdle for some even when monetary incentives were used.

An important consideration with this study was that the evaluators took the initiative to investigate a program that was sponsored by an organization (SCSF) whose primary goal was to provide scholarships, not to do research. We feel that MPR did an admirable job convincing the SCSF to allow the study to be evaluated and raising money for this purpose, but this also meant that the evaluators could not control every aspect of the program. For example, the SCSF originally wanted to make the program available only to children from the lower-test-score schools, and the evaluators had to convince them to also allow a small percentage from those in higher-test-score schools, to increase the generalizability of the results. The designs that BX suggest, even though they have certain desirable statistical properties under appropriate assumptions, were not an option with this study.

Related to this issue of the alignment of the evaluators' and program administrators' goals is the issue of the potential for reducing noncompliance rates. Some effort was made, in that help was provided to scholarship winners in finding an appropriate private school. Arguably, however, 100% compliance should not be the goal in such a study, given that we would not expect 100% compliance if the program were expanded and offered, say, by the government on a larger scale. Thus it may not be wise to use extreme efforts to increase compliance if such efforts would not be offered in the large-scale public case. With the current study, the opportunity exists to examine the noncompliance process for this program in this setting: take-up rates, predictors of noncompliance, and so on. To this end, some sort of subexperiment that randomizes a reasonable set of compliance incentives might be useful in similar studies in the future, if sample sizes are sufficient.

We are not convinced that we should have solicited actual prior distributions from educators and economists. In our experience, this is a confusing, difficult, and often unrewarding task. Moreover, school choice is such a politically charged and contentious issue that it seems wiser to use relatively diffuse and objective prior distributions.

We agree that results should be interpreted in terms of a yardstick of practical effects. We are unaware of the most appropriate such yardstick to use in this context. One side could claim that even tiny average effects matter, not only because of the cumulative effects on parts of society, but also because of possible nonadditivity of such effects, which could imply huge effects for some rare individuals and minute effects for all others. Others could argue that seemingly quite large effects have no long-run practical impact on quality of life. The results that we present provide a basic empirical starting point for these debates between people whose expertise is in this area.

We thank BX for highlighting our emphasis on the role of the assumptions in our analyses. Our hope is that we have presented the assumptions in such a way as to enable readers to make their own judgment about whether or not they are likely to be satisfied, rather than only stating them technically, which would implicitly make those important decisions for the interested but nontechnical readers.

BX point out a dearth of certain types of covariates. Part of the problem here is that we presented only a small portion of the data collected, because our model in its current form did not handle many covariates. However, there were indeed discussions between some members of our team and MPR staff, during the development of the questionnaire, about what types of variables could be included that would be predictive of missing data or compliance, as well as outcomes. Some of the variables proposed were discarded because of the surveyors' experience with them as subject to problems such as high measurement error. Also, not every variable discussed was included, because of a desire for a survey that would not be overly burdensome for the study families.

We raise a final, small semantic question. We wonder about the propriety of the label "Neyman–Rubin model" to describe a framework for causal inference that views observational studies and randomized experiments on the same continuum and, moreover, encourages Bayesian analyses of both. Neyman (1923) certainly seems to have been the first to use the formal notation of potential outcomes for randomization-based inference in randomized experiments, which was a truly major advance; but neither he nor anyone else, to the best of our knowledge, used this formal notation to define causal effects in observational studies until Rubin (1974, 1975), nor did he nor anyone else ever advocate Bayesian inference for the unobserved potential outcomes until Rubin (1975, 1978), which is the perspective that we have implemented here.

Overall, we agree with BX about the paramount importance of study design, and we tried to design this study carefully, certainly more so than most social science analyses of secondary datasets like the PSID or CPS. We thank BX for their thoughtful commentary; they raise some important points.

We thank all of the discussants for stimulating comments, which we believe have added emphasis to the importance of both the substantive topic of school vouchers and the development of appropriate methods to evaluate this and other such broken randomized experiments.

ADDITIONAL REFERENCES

Card, D., and Krueger, A. (1994), "Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania," American Economic Review, 84, 772–793.


Jo, B. (2002), "Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications" (with discussion), Journal of Educational and Behavioral Statistics, 27, 385–420.

Mealli, F., and Rubin, D. B. (2002), Discussion of "Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications," by B. Jo, Journal of Educational and Behavioral Statistics, 27, 411–415.

(2003), "Assumptions Allowing the Estimation of Direct Causal Effects: Discussion of 'Healthy, Wealthy, and Wise? Tests for Direct Causal Paths Between Health and Socioeconomic Status' by Adams et al.," Journal of Econometrics, 112, 79–87.

Rubin, D. B. (1975), "Bayesian Inference for Causality: The Importance of Randomization," in Proceedings of the Social Statistics Section, American Statistical Association, pp. 233–239.

Comment
Bengt MUTHÉN, Booil JO, and C. Hendricks BROWN

[Figure 1. Growth Mixture Modeling Versus Pretest–Posttest Analysis.]

A more flexible analysis is obtained with more posttreatment time points. An improved design for the determination of the latent trajectory classes is to use more than one pretreatment time point, so that the trajectory class membership is better determined before the treatment starts.

3. COMPOUND EXCLUSION AND LATENT IGNORABILITY

Based on the ideas of principal stratification (Frangakis and Rubin 2002) and latent ignorability (Frangakis and Rubin 1999), BFHR successfully demonstrates that the complexities of educational studies can be better handled under more explicit and flexible sets of assumptions. Although we think that their structural assumptions are reasonable in the NYSCS, we would like to add some thoughts on the plausibility of two other assumptions, considering more general situations.

Compound exclusion (CE) is one of the key structural assumptions in identifying principal effects under latent ignorability. However, the plausibility of this assumption can be questioned in practice (Frangakis et al. 2002; Hirano et al. 2000; Jo 2002a, 2002c; Shadish, Cook, and Campbell 2002; West and Sagarin 2000). In the NYSCS, it seems realistic to assume that winning a lottery has some positive impact on always-takers; however, it is less clear how winning a lottery will affect never-takers. One possibility is that winning a lottery has a negative impact on parents, because they fail to benefit from it. Discouraged parents may have a negative influence on a child's test scores or response behaviors. This negative effect may become more evident if noncompliance is due to parents' low expectation of, or lack of interest in, their child's education. Another possibility is that winning a lottery has a positive impact on a child's test scores or response behaviors. For example, parents who are discouraged by being unable to send their child to private schools even with vouchers may try harder to improve the quality of existing resources (e.g., in the public school their child attends) and be more motivated to support their child to improve his or her academic performance. Given these competing possibilities, it is not easy to predict whether and how CE is violated.

Depending on the situation, causal effect estimates can be quite sensitive to violation of the exclusion restriction in outcome missingness (Jo 2002b), which is less well known than the impact of violating the exclusion restriction in observed outcomes (Angrist et al. 1996; Jo 2002a). The implication of possible violation of CE and its impact is that the relative benefit of models assuming latent ignorability (LI) and standard ignorability (SI) depends on the degrees of deviation from CE and SI. Identification of causal effects under LI relies on the generalized (compound) exclusion restriction (i.e., both on the outcomes and on the missingness of outcomes), whereas identification of causal effects under SI relies on the standard exclusion restriction (i.e., only on the outcomes). Therefore, in some situations, the impact of deviation from CE may outweigh the impact of deviation from SI, resulting in more biased causal effect estimates in models assuming LI than in models assuming SI (Jo 2002b). For example, if SI holds but CE is seriously violated (say, a 20% increase in the response rate due to treatment assignment for compliers and a 15% increase for never-takers), causal effect estimates and the coverage probability assuming LI and CE can drastically deviate from the true value and the nominal level. This type of violation does not affect models assuming SI and the standard exclusion restriction. To empirically examine the plausibility of SI, LI, and CE, it will be useful to do sensitivity analyses of models imposing different combinations of these assumptions. As BFHR point out, this investigation can be conducted by relaxing compound exclusion (e.g., Frangakis et al. 2002; Hirano et al. 2000) or by using alternative structural assumptions (e.g., Jo 2002c). More research is needed to examine the efficiency of these alternative models and to explore factors associated with insensitivity of LI models to violation of compound exclusion.
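The flavor of such sensitivity can be conveyed with a deliberately simple Monte Carlo. This sketch is ours, not the NYSCS data or the models discussed here; it uses the standard exclusion restriction on observed outcomes (the Angrist et al. 1996 setting) rather than full compound exclusion, and all effect sizes are hypothetical. When assignment itself affects never-takers, the IV/Wald estimate of the complier effect absorbs that violation.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 400_000

    # Principal strata: 60% compliers, 40% never-takers (no always-takers
    # or defiers in this stylized setup).
    complier = rng.random(n) < 0.6
    z = rng.integers(0, 2, n)              # randomized voucher offer
    d = (z == 1) & complier                # voucher use

    tau_c = 4.0    # true complier (CACE) effect of using the voucher
    tau_n = -1.5   # hypothesized violation: the offer itself affects never-takers

    y = (50.0 + 5.0 * complier
         + tau_c * d
         + tau_n * z * (~complier)         # exclusion-restriction violation
         + rng.normal(0, 15, n))

    # Wald/IV estimate: ITT effect on outcomes divided by ITT effect on uptake.
    itt_y = y[z == 1].mean() - y[z == 0].mean()
    itt_d = d[z == 1].mean() - d[z == 0].mean()
    print(f"IV estimate: {itt_y / itt_d:.2f} vs. true CACE {tau_c:.2f}")
    # Bias = tau_n * Pr(never-taker) / Pr(complier) = -1.5 * 0.4 / 0.6 = -1.0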

4. CONCLUSION

Causal inferences of the type BFHR provide are a dramatic improvement over the existing literature now available on the question of whether school choice will produce better achievement outcomes for children in an urban public school system. The randomized lottery provides an exceptionally powerful tool for examining the impact of a program, far more useful than observational studies that have causal change intertwined hopelessly with self-selection factors. Statisticians are just now investigating variations in such principal strata analyses, that is, those involving latent classes formed as a function of randomized trials involving intervention invitations (such as vouchers), encouragement designs, and field trial designs involving more than one randomization (Brown and Liao 1999). The latent categories in this article, which BFHR label "complier," "never-taker," "always-taker," and "defier," represent only one type of design. Other terms may be more relevant to the scientific questions underlying trials in which subjects are randomly assigned to different levels of invitation (e.g., Angrist and Imbens 1995) or different levels of implementation. Such trials not only have great potential for examining questions of effectiveness, sustainability, and scalability, but also require terms more consistent with adherence than compliance. Again, we congratulate the authors on an important addition to the methodological literature that we predict will have lasting impact.

ADDITIONAL REFERENCES

Angrist, J. D., and Imbens, G. W. (1995), "Two-Stage Least Squares Estimation of Average Causal Effects in Models With Variable Treatment Intensity," Journal of the American Statistical Association, 90, 431–442.

Brown, C. H., and Liao, J. (1999), "Principles for Designing Randomized Preventive Trials in Mental Health: An Emerging Developmental Epidemiologic Perspective," American Journal of Community Psychology, 27, 673–709.

Jo, B. (2002a), "Model Misspecification Sensitivity Analysis in Estimating Causal Effects of Interventions With Noncompliance," Statistics in Medicine, 21, 3161–3181.

(2002b), "Sensitivity of Causal Effects Under Ignorable and Latent Ignorable Missing-Data Mechanisms," available at www.statmodel.com/mplus/examples/jo.

(2002c), "Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications" (with discussion), Journal of Educational and Behavioral Statistics, 27, 385–420.

Muthén, B. (2002a), "Beyond SEM: General Latent Variable Modeling," Behaviormetrika, 29, 81–117.

Muthén, B., and Brown, C. H. (2001), "Non-Ignorable Missing Data in a General Latent Variable Modeling Framework," presented at the annual meeting of the Society for Prevention Research, Washington, DC, June 2001.

Muthén, B., Brown, C. H., Masyn, K., Jo, B., Khoo, S. T., Yang, C. C., Wang, C. P., Kellam, S., Carlin, J., and Liao, J. (2002), "General Growth Mixture Modeling for Randomized Preventive Interventions," Biostatistics, 3, 459–475.

Muthén, L., and Muthén, B. (1998–2002), Mplus User's Guide, Los Angeles: authors.

Muthén, B., and Shedden, K. (1999), "Finite Mixture Modeling With Mixture Outcomes Using the EM Algorithm," Biometrics, 55, 463–469.

Seltzer, M. H., Frank, K. A., and Bryk, A. S. (1993), "The Metric Matters: The Sensitivity of Conclusions About Growth in Student Achievement to Choice of Metric," Educational Evaluation and Policy Analysis, 16, 41–49.

Shadish, W. R., Cook, T. D., and Campbell, D. T. (2002), Experimental and Quasi-Experimental Designs for Generalized Causal Inference, Boston: Houghton Mifflin.

Slavin, R. E. (2002), "Evidence-Based Education Policies: Transforming Educational Practice and Research," Educational Researcher, 31, 15–21.

West, S. G., and Sagarin, B. J. (2000), "Participant Selection and Loss in Randomized Experiments," in Research Design: Donald Campbell's Legacy (Vol. II), ed. L. Bickman, Thousand Oaks, CA: Sage, pp. 117–154.

Comment
Alan B. KRUEGER and Pei ZHU

Alan B. Krueger is Professor and Pei Zhu is a graduate student, Economics Department, Princeton University, Princeton, NJ 08544 (E-mail: akrueger@princeton.edu). In part this comment extends and summarizes results of Krueger and Zhu (2003). Readers are invited to read that article for a more in-depth analysis of many of the issues raised in this comment. Some of the results presented here differ slightly from those in our earlier article, however, because the definition of family size in this comment corresponds to the one used by Barnard et al. rather than to the definition in our earlier article.

The New York City school voucher experiment provides some of the strongest evidence yet available on the effect of private school vouchers on student test achievement. Barnard et al. provide a useful reevaluation of the experiment, and some of the authors were integrally involved in its design. We will leave it to other discussants to comment on the Bayesian methodological advances in their paper, and instead comment on the substantive lessons that can be learned from the experiment and on the practical lessons that emerge from its novel design.

We have the advantage of having access to more complete data than Barnard et al. used in preparing their paper. This includes more complete baseline test information, data on multichild families as well as single-child families, and three years of follow-up test data instead of just one year.

In our comment, we use the more comprehensive sample because results for this sample are more informative, but we highlight where differences arise from using the sample analyzed by Barnard et al.

Three themes emerge from our analysis. First, simplicity and transparency are underappreciated virtues in statistical analysis. Second, it is desirable to use the most recent, comprehensive data for the widest sample possible. Third, there is no substitute for probing the definitions and concepts that underlie the data.

1. RANDOM ASSIGNMENT

As Barnard et al. explain, the voucher experiment entailed a complicated block design, and different random assignment procedures were used in the first and second set of blocks. In the first block, a propensity matched-pairs design (PMPD) method was used. In this block, far more potential control families were available than there was money to follow them up. Rather than select a random sample of controls for follow-up after randomly selecting the treatments, the researchers estimated a propensity score model to align controls with the treatments and then used a nearest-available-neighbor Mahalanobis match to select specific paired control families for follow-up. In principle, the PMPD design should improve the precision of the estimates by reducing the chance imbalances that can occur in a simple randomized block design. A standard randomized block design was used in the remaining four blocks.

Table 1. Efficiency Comparison of Two Methods of Random Assignment: Estimated Treatment Effect and Standard Error, by Method of Random Assignment

                                  PMPD subsample         Randomized-blocks        Relative      Sample-size-adjusted
                                                         subsample                sampling      relative sampling
Year    Model                     Coef. (SE)      N      Coef. (SE)      N        variance      variance
                                                                                  (RB/PMPD)     (RB/PMPD)

A. All students
Year 1  Control for baseline       0.33 (1.40)    721     2.10 (1.38)    734      0.972         0.989
        Omit baseline              0.32 (1.40)   1048    -1.02 (1.58)   1032      1.274         1.254
Year 3  Control for baseline       1.02 (1.56)    613     0.67 (1.74)    637      1.244         1.293
        Omit baseline             -0.31 (1.57)    900    -0.45 (1.78)    901      1.285         1.287

B. African-American students
Year 1  Control for baseline       0.32 (1.88)    302     8.10 (1.71)    321      0.827         0.879
        Omit baseline              2.10 (1.99)    436     3.19 (2.07)    447      1.082         1.109
Year 3  Control for baseline       4.21 (2.51)    247     6.57 (2.09)    272      0.693         0.764
        Omit baseline              2.26 (2.47)    347     3.05 (2.31)    386      0.875         0.973

NOTE: Standard errors are in parentheses. The treatment effect coefficient is from a regression of test scores on a dummy indicating assignment to receive a voucher (1 = yes), lottery randomization strata dummies, and, in the indicated models, baseline test scores. Bootstrap standard errors account for within-family correlation in residuals. "Year 1" or "Year 3" refers to the follow-up year. The last column adjusts the relative sampling variances of the treatment effects for differences in relative sample sizes. The sample of African-American students consists of those whose mother's race/ethnicity is identified as non-Hispanic Black/African-American.
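The two-step matching described above (a propensity score model to align the groups, then a nearest-available-neighbor Mahalanobis match) might be sketched as follows. This is our stylized illustration on simulated covariates, with an arbitrary caliper; it is not MPR's proprietary implementation, which, as noted below, was not described in sufficient detail to replicate.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(7)
    n_treat, n_pool, k = 200, 2000, 5

    X_t = rng.normal(0.2, 1, (n_treat, k))   # treated families' covariates
    X_c = rng.normal(0.0, 1, (n_pool, k))    # large pool of potential controls

    # Step 1: estimate propensity scores with a logistic model.
    X = np.vstack([X_t, X_c])
    z = np.r_[np.ones(n_treat), np.zeros(n_pool)]
    ps = LogisticRegression(max_iter=1000).fit(X, z).predict_proba(X)[:, 1]
    ps_t, ps_c = ps[:n_treat], ps[n_treat:]

    # Step 2: within a propensity-score caliper, take the nearest available
    # control by Mahalanobis distance on the covariates.
    VI = np.linalg.inv(np.cov(X_c, rowvar=False))
    available = np.ones(n_pool, dtype=bool)
    matches = {}
    for i in np.argsort(-ps_t):                   # hardest-to-match treated first
        ok = available & (np.abs(ps_c - ps_t[i]) < 0.05)  # caliper is our choice
        if not ok.any():
            ok = available.copy()                 # fall back: drop the caliper
        diff = X_c[ok] - X_t[i]
        maha = np.einsum("ij,jk,ik->i", diff, VI, diff)
        j = np.flatnonzero(ok)[np.argmin(maha)]
        matches[i] = j                            # pair treated i with control j
        available[j] = False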

Barnard et al. emphasize that the PMPD block was more balanced for 15 of the 21 baseline variables that they examined. But the key question is the extent to which the precision of the estimates was improved, not the covariate balance, because both blocks yield unbiased estimates. If the covariates explain relatively little of the outcome variable, then the PMPD will add very little. In Table 1 we provide standard intent-to-treat (ITT) estimates, that is, the difference in mean test scores between treatments and controls conditional on the strata used for random assignment (family size by block by high/low-achieving school), and, more importantly, their standard errors, separately for the PMPD subsample and the randomized-block subsample. The outcome variable is the average of the national percentile rank on the math and reading segments of the Iowa Test of Basic Skills, taken either 1 year or 3 years after random assignment. Two sets of results are presented: one controlling for baseline test scores, for the subsample with baseline scores, and the other without controlling for scores, for the larger sample. The last column adjusts the relative sampling variances for differences in sample size between the blocks.

For all students (panel A), the standard errors are about 10% smaller in the PMPD sample. For the Black students (panel B), however, the standard errors are essentially the same or slightly larger in the PMPD sample. This finding is unexpected, because Hill, Rubin, and Thomas (2000) predicted that analyses of subgroups would have more power in the PMPD design, because matching should lead to a more nearly equal representation of subgroups in the treatment and control groups. In addition, the covariate balance is less good for the Black students in the PMPD than in the randomized blocks if we compute Z scores for the differences between treatments and controls for the baseline covariates. One possibility is that, because Black students could have been matched to non-Black students in the PMPD design, this subsample was actually less balanced than the subsample of Blacks from the more conventional randomized block design.

The increase in power from the PMPD design is relatively modest, even in the full sample. Of much more practical importance is the fact that the complexity of the PMPD design caused Mathematica to initially calculate incorrect baseline sample weights. The baseline weights were designed to make the follow-up sample representative of all eligible applicants for vouchers. The initial weights assigned much too little importance to the control sample in the PMPD block. Because these families represented many unselected controls, they should have been weighted heavily. The revised weights increased the weight on the PMPD controls by 620%; in contrast, the weight increased by just 18% for the rest of the sample.

This error had grave consequences for the initial inferences and analyses of the data. First, using the faulty weights, Peterson, Myers, and Howell (1998) reported highly statistically significant differences in baseline test scores between the treatment and control groups, with control group members scoring higher on both the math and reading exams. After the mistake was discovered and the weights were revised, however, the baseline differences became small and statistically insignificant. (We note, however, that if we limit attention to the students from single-child families that Barnard et al. use, and use the more complete baseline test score data, then there is a statistically significant difference in baseline reading scores between treatments and controls, even with the revised baseline weights. If we pool together students from single-child and multichild families, then the baseline difference is insignificant. This is another reason why we think it is more appropriate to use the broader sample that includes children from all families.)

Second, because of the inaccurate inference that there was a difference in baseline ability between treatments and controls, Mathematica's researchers were discouraged from analyzing (or at least from presenting) results that did not condition on baseline scores. This is unfortunate, because conditioning on baseline scores caused the researchers to drop from the sample all the students who were initially in kindergarten (because these students were not given baseline tests) and 11% of the students initially in grades 1–4. Including students with missing baseline scores increases the sample by more than 40% and expands the population to which the results can be generalized. As explained later, and in an article by Krueger and Zhu (2003), qualitatively different results are found if students with missing baseline scores are included in the sample. Because assignment to receive a voucher was random within lottery strata, estimates that do not condition on baseline scores are nonetheless unbiased. Moreover, it is inefficient to exclude students with missing baseline scores (most of whom had follow-up test scores), and such a sample restriction can potentially cause sample selection bias.

Of lesser importance is the fact that the PMPD design also complicates the calculation of standard errors. The matching of treatments and controls on selected covariates creates a dependence between paired observations. Moreover, if the propensity score model is misspecified, then the equation error also creates a dependence between observations. Researchers have not taken this dependence into account in the calculation of standard errors. We suspect, however, that this is likely to cause only a small understatement of the standard errors, because the covariates do not account for much of the residual variance of test scores and because the PMPD sample is only about half of the overall sample. (We tried to compute the propensity score to adjust the standard errors ourselves, but were unable to replicate the PMPD model, because it was not described in sufficient detail and because the computer programs are proprietary. This also prevented us from replicating the estimates in Table 4 of Barnard et al.)

The problems created by the complicated experimental design, particularly concerning the weights, lead us to reiterate the advice of Cochran and Cox (1957): "A good working rule is to use the simplest experimental design that meets the needs of the occasion." It seems to us that in this case the PMPD design introduced unnecessary complexity that inadvertently led to a consequential mistake in the computation of weights, with very little improvement in efficiency. This was not the fault of the architects of the PMPD design, who did not compute the weights, but it was related to the complexity of the design. Simplicity and transparency, in designs, methods, and procedures, can help avoid mistakes down the road, which are almost inevitable in large-scale empirical studies. Indeed, Mathematica recently informed us that they still do not have the baseline weights exactly correct, 5 years after random assignment.

2. INTENT-TO-TREAT ESTIMATION RESULTS

Barnard et al. devote much attention to addressing potential problems caused by missing data for the sample of students in grades 1–4 at baseline; this follows the practice of earlier reports by Mathematica, which restricted the analysis of student outcomes (but not of parental responses, such as satisfaction) to those enrolled in grades 1–4 at baseline. In our opinion, a much more important substantive problem results from the exclusion of students in the kindergarten cohort, who were categorically dropped because they were not given baseline tests. Because assignment to treatment status was random (within strata), a simple comparison of means between treatments and controls without conditioning on baseline scores provides an unbiased estimate of the average treatment effect. Moreover, as Barnard et al. and others show, treatments and controls were well balanced in terms of baseline characteristics, so there is no reason to suspect that random assignment was somehow corrupted.

Table 2 presents regression estimates of the ITT effect using various samples. Because we cannot replicate the Bayesian estimates without knowing the propensity score, we present conventional ITT estimates. For each entry in the first column, we regressed the test score percentile rank on a voucher offer dummy, 30 dummies indicating lottery strata (block × family size × high/low school), and baseline test scores. The sample is limited to those with baseline test data. Our results differ from the ITT estimates in Table 5 of Barnard et al. because we use the revised weights and a more comprehensive sample that also includes students from multichild families. Barnard et al. report that their Bayesian estimates are "generally more stable" than the conventional ITT estimates, "which in some cases are not even credible," but the extreme ITT estimates that they refer to, for 4th graders from high-achieving schools, are based on only 15 students. For the samples that pool students across grades, the conventional ITT estimates are a priori credible and probably not statistically different from their Bayesian estimates.

In the second column, we use the same sample as in column 1 but omit the baseline test score from the model. In the third column, we expand the sample to include those with missing baseline scores. In the fourth column, we continue to include students with missing baseline scores, but restrict the sample to those initially in low-achieving public schools.
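The regression specifications just described can be sketched in a few lines. This is our illustration with fabricated stand-in data and hypothetical column names; it omits the survey weights and the family-clustered bootstrap standard errors used for the actual estimates.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(3)
    n = 1_000

    df = pd.DataFrame({
        "score": rng.normal(50, 20, n),       # follow-up test percentile rank
        "voucher": rng.integers(0, 2, n),     # offer dummy (1 = won the lottery)
        "stratum": rng.integers(0, 30, n),    # block x family size x high/low school
        "baseline": rng.normal(50, 20, n),    # baseline test score
    })

    # Column 1 specification: lottery strata dummies plus baseline scores.
    col1 = smf.ols("score ~ voucher + C(stratum) + baseline", data=df).fit()
    # Columns 2-4 specification: same strata dummies, baseline score omitted.
    col2 = smf.ols("score ~ voucher + C(stratum)", data=df).fit()

    print(col1.params["voucher"], col2.params["voucher"])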

We focus mainly on the results for Black students and for students from low-achieving public schools, because the effect of offering a voucher on either reading or math scores for the overall sample is always statistically insignificant and because most public policy attention has focused on these two groups.

Although it has not been made explicit in previous studies of these data, Black students have been defined as children with a non-Hispanic Black/African-American mother. We use this definition and a broader one to probe the sensitivity of the results.

Omitting baseline scores has little qualitative effect on the estimates when the same sample is used (compare columns 1 and 2); the coefficient typically changes by no more than half a standard error. For year 3 scores for Black students, for example, the treatment effect on the composite score is 5.2 points (t = 3.2) when controlling for baseline scores and 5.0 points (t = 2.6) when not controlling. This stability is expected in a randomized experiment.

Table 2. Estimated Treatment Effects With and Without Controlling for Baseline Scores: Coefficient on Voucher Dummy From Test Score Regressions for Various Samples

Columns: (1) subsample with baseline scores, controlling for baseline scores; (2) subsample with baseline scores, omitting baseline scores; (3) full sample, omitting baseline scores; (4) applicants from low-achieving schools, omitting baseline scores.

First follow-up test
  Composite  Overall                              1.27 (0.96)    0.42 (1.28)   -0.33 (1.06)   -0.60 (1.06)
             Mother Black, non-Hispanic           4.31* (1.28)   3.64* (1.73)   2.66 (1.42)    1.75 (1.57)
             Either parent Black, non-Hispanic    3.55* (1.24)   2.52 (1.73)    1.38 (1.42)    0.53 (1.55)
  Reading    Overall                              1.11 (1.04)    0.03 (1.36)   -1.13 (1.17)   -1.17 (1.18)
             Mother Black, non-Hispanic           3.42* (1.59)   2.60 (1.98)    1.38 (1.74)    0.93 (1.84)
             Either parent Black, non-Hispanic    2.68 (1.50)    1.49 (1.97)   -0.09 (1.71)   -0.53 (1.81)
  Math       Overall                              1.44 (1.17)    0.80 (1.45)    0.47 (1.17)   -0.03 (1.15)
             Mother Black, non-Hispanic           5.28* (1.52)   4.81* (1.93)   4.01* (1.54)   2.68 (1.63)
             Either parent Black, non-Hispanic    4.42* (1.48)   3.54 (1.90)    2.86 (1.51)    1.59 (1.63)

Third follow-up test
  Composite  Overall                              0.90 (1.14)    0.36 (1.40)   -0.38 (1.19)    0.10 (1.16)
             Mother Black, non-Hispanic           5.24* (1.63)   5.03* (1.92)   2.65 (1.68)    3.09 (1.67)
             Either parent Black, non-Hispanic    4.70* (1.60)   4.27* (1.92)   1.87 (1.68)    1.90 (1.67)
  Reading    Overall                              0.25 (1.26)   -0.42 (1.51)   -1.00 (1.25)   -0.08 (1.24)
             Mother Black, non-Hispanic           3.64 (1.87)    3.22 (2.19)    1.25 (1.83)    2.06 (1.83)
             Either parent Black, non-Hispanic    3.37 (1.86)    2.75 (2.18)    0.76 (1.83)    1.23 (1.82)
  Math       Overall                              1.54 (1.30)    1.15 (1.53)    0.24 (1.33)    0.29 (1.28)
             Mother Black, non-Hispanic           6.84* (1.86)   6.84* (2.09)   4.05* (1.86)   4.13* (1.85)
             Either parent Black, non-Hispanic    6.04* (1.83)   5.78* (2.09)   2.97 (1.85)    2.58 (1.82)

NOTE: Standard errors are in parentheses. The dependent variable is the test score NPR. The reported coefficient is the coefficient on the voucher offer dummy. All regressions control for 30 randomization strata dummies; models in column (1) also control for baseline math and reading test scores. Bootstrap standard errors are robust to correlation in residuals among students in the same family. An asterisk indicates that the absolute t ratio exceeds 1.96.

Sample sizes in year 1 for the subsample with baseline scores / without baseline scores are: overall, 1,455/2,080; mother Black, 623/883; either parent Black, 684/968.

Sample sizes in year 3 for the subsample with baseline scores / without baseline scores are: overall, 1,250/1,801; mother Black, 519/733; either parent Black, 572/807.

Sample sizes in year 1 for the subsample from low-achieving schools (without baseline scores) are: overall, 1,802; mother Black, 770; either parent Black, 843.

Sample sizes in year 3 for the subsample from low-achieving schools (without baseline scores) are: overall, 1,569; mother Black, 647; either parent Black, 712.
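The family-clustered bootstrap mentioned in the table note can be sketched as follows. This is our illustration, with hypothetical column names (family_id, score, voucher, stratum): resample whole families with replacement, refit the regression, and take the standard deviation of the replicated voucher coefficients.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    def cluster_bootstrap_se(df, formula, cluster_col, reps=500, seed=0):
        """Bootstrap SE for the voucher coefficient, resampling whole families
        so that within-family correlation in residuals is preserved."""
        rng = np.random.default_rng(seed)
        ids = df[cluster_col].unique()
        coefs = []
        for _ in range(reps):
            draw = rng.choice(ids, size=len(ids), replace=True)
            boot = pd.concat([df[df[cluster_col] == i] for i in draw],
                             ignore_index=True)
            coefs.append(smf.ols(formula, data=boot).fit().params["voucher"])
        return float(np.std(coefs, ddof=1))

    # Usage (hypothetical data frame df):
    # se = cluster_bootstrap_se(df, "score ~ voucher + C(stratum)", "family_id")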

When the sample is expanded to include those with missing baseline scores in column 3, however, the ITT effect falls almost in half in year 3 for Black students, to 2.65 points, with a t ratio of 1.58. The effect on the math score is statistically significant, as Barnard et al. emphasize. Qualitatively similar results are found if baseline covariates, such as mother's education and family income, are included as regressors (see Krueger and Zhu 2003).

The classification of Black students used in the study is somewhat idiosyncratic. Race and ethnicity were collected in a single question in the parental survey, contrary to the OMB guidelines for government surveys, so it is impossible to identify Blacks of Hispanic origin (of whom there are a large number in New York). In addition, the survey did not directly ask about the child's race or ethnicity, so children's race must be inferred from the parents' race/ethnicity. A broader definition of Black students, one that probably comes closer to the definition used in government surveys and most other educational research, would also include those who have a Black father in the sample of Black students. According to the 1990 Census, race was reported as Black for 85% of the children with a Black father and a Hispanic mother (the overwhelming non-Black portion of the sample) in the New York metropolitan region.

Hence we also present results for a sample in which either parent is listed as "Black, non-Hispanic." This increases the sample of Black students by about 10%, and the results are even weaker for this sample. For example, the ITT estimate using the more comprehensive sample of Black students enrolled in grades K–4 at baseline is 1.87 points (t = 1.11) on the third-year follow-up composite test, although even with this sample the effect on math is significant at the 10% level in year 1, as Barnard et al. have found, and almost significant in year 3. So we conclude that the effect of the opportunity to use a private school voucher on the composite score for the most comprehensive sample of Black students is insignificantly different from 0, although it is possible that there was initially a small beneficial effect on the math score for Black students.

In the final column, we present results for the subsample of students originally enrolled in schools with average test scores below the median score in New York City. In each case, the results for this subsample are quite similar to those for the full sample, and a formal test of the null hypothesis that students from low- and high-achieving schools benefit equally from vouchers is never close to rejecting. It also appears very unlikely that the differences between the treatment effects for applicants from low- and high-achieving schools in Barnard et al.'s Table 4 are statistically significant, either.

3. WHAT WAS BROKEN

We agree with Barnard et al. that the experiment was broken, in the sense that attrition and missing data were common. Previous analyses were also strained, if not broken, by their neglect of the cohort of students originally in kindergarten, who were in third grade at the end of the experiment and whose follow-up test scores were ignored. Including these students qualitatively alters the results for Black students. The experiment was also broken in the sense that years passed before correct baseline weights were computed.

We disagree, however, with the interpretation that the experiment was broken because compliance was less than 100%. This depends on the question that one is interested in answering. Because most interest in the experiment for policy purposes centers on the impact of offering vouchers on achievement, not on compelling students to use vouchers, we think the ITT estimates, which reflect inevitable partial usage of vouchers, are most relevant (see also Rouse 1998; Angrist et al. 2003). Moreover, New York had a higher voucher take-up rate than the experiments in Dayton and the District of Columbia, so one could argue that the New York experiment provides an upper-bound estimate of the effect of offering vouchers. On the other hand, if the goal is to use the experiment to estimate the effect of attending private school on achievement, then methods such as instrumental variables or those used by Barnard et al. are necessary to estimate the parameter of interest. We consider this of secondary interest in this case, however.
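The instrumental variables route just mentioned reduces, in the simplest case with no covariates, to the Wald estimator: the ITT effect on scores divided by the ITT effect on private school attendance, with the voucher offer as the instrument. A minimal sketch under assumed inputs:

    import numpy as np

    def wald_iv(y, d, z):
        """Wald/IV estimate of the effect of attending (d) on scores (y),
        using the randomized voucher offer z as the instrument. Valid only
        under the usual IV assumptions (randomized z, monotonicity, and the
        exclusion restriction)."""
        y, d, z = (np.asarray(a, dtype=float) for a in (y, d, z))
        itt_y = y[z == 1].mean() - y[z == 0].mean()   # ITT effect on scores
        itt_d = d[z == 1].mean() - d[z == 0].mean()   # ITT effect on attendance
        return itt_y / itt_d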

ADDITIONAL REFERENCES

Angrist, J., Bettinger, E., Bloom, E., King, E., and Kremer, M. (2003), "Vouchers for Private Schooling in Colombia: Evidence From a Randomized Natural Experiment," American Economic Review, 92, 1535–1558.

Cochran, W., and Cox, G. (1957), Experimental Designs (2nd ed.), New York: Wiley.

Krueger, A., and Zhu, P. (2003), "Another Look at the New York City School Voucher Experiment," American Behavioral Scientist, forthcoming. Also available from the Industrial Relations Section, Princeton University, at http://www.irs.princeton.edu.

Peterson, P., Myers, D., and Howell, W. (1998), "An Evaluation of the New York City Scholarships Program: The First Year," Mathematica Policy Research, Inc., available at http://www.ksg.harvard.edu/pepg/pdf/ny1rpt.pdf.

Rouse, C. (1998), "Private School Vouchers and Student Achievement: An Evaluation of the Milwaukee Parental Choice Program," The Quarterly Journal of Economics, 113, 553–602.

Comment
Richard A. BERK and Hongquan XU

Richard A. Berk is Professor and Hongquan Xu is Assistant Professor, Department of Statistics, University of California, Los Angeles, CA 90095 (E-mail: berk@stat.ucla.edu).

1. INTRODUCTION

The article by Barnard, Frangakis, Hill, and Rubin (hereafter BFHR) is a virtuoso performance. By applying the Neyman–Rubin model of causal effects and building on a series of important works on the estimation and interpretation of causal effects (e.g., Angrist, Imbens, and Rubin 1996), BFHR manage to extract a reasonable set of findings for a "broken" randomized experiment. The article more generally underscores the important difference between estimating relationships in a consistent manner and interpreting those estimates in causal terms; obtaining consistent estimates is only part of the enterprise. Finally, the article emphasizes that assumptions made so that proper estimates may be obtained are not just technical moves of convenience. Rather, they are statements about how the empirical world is supposed to work. As such, they need to be examined concretely with reference to the phenomena being studied, evaluated empirically whenever possible, and, at a minimum, subjected to sensitivity tests.

As battle-tested academics, we can certainly quibble here and there about some aspects of the article. For example, might it not have made sense to construct priors from the educators and economists who were responsible for designing the intervention? Moreover, might it not have made sense to interpret the results using a yardstick of treatment effects large enough to matter in practical terms? Nonetheless, we doubt that overall we could have done as well as BFHR did. Moreover, most of our concerns depend on features of the data that cannot be known from a distance. We would have had to carefully examine the data ourselves. As a result, we focus on why a virtuoso performance was needed to begin with. Why was this trip necessary?


2. NONCOMPLIANCE

The implementation of randomized field experiments, as desirable as they are, invites any number of well-known problems (Berk 1990). One of these problems is that human subjects often do not do what they are told, even when they say they will. There is a recent scholarly literature on noncompliance (Metry and Meyer 1999; Johnson 2000), and a citation trail leading back to evaluations of the social programs of the War on Poverty (Haveman 1977). Some noncompliance can be expected except when there is unusual control over the study subjects. Our recent randomized trial of the inmate classification system used by the California Department of Corrections is one example (Berk, Ladd, Graziano, and Baek 2003).

An obvious question follows: Why does it seem that the individuals who designed this experiment were caught by surprise? As best we can tell, little was done to reduce noncompliance, and, more important for the analyses of BFHR, apparently almost no data were collected on factors that might be explicitly related to noncompliance (Hill, Rubin, and Thomas 2000).

Without knowing far more about the design stages of the experiment, we find it difficult to be very specific. But from the literatures on compliance and school choice, the following are probably illustrative indicators relevant for parents who would not be inclined to use a school voucher; for other forms of noncompliance, other measures would be needed:

1. Whether the adults in the household are employed outside the home during the day
2. Whether there is any child care at the private schools before classes begin and after they end, for children of employed parents
3. A household's ability to pay for private school beyond the value of the voucher
4. Travel time to the nearest private schools
5. Availability of transportation to those schools
6. Religious preference of the household and religious affiliation of the private school

From these examples, and others that individuals closer to the study would know about, one can imagine a variety of measures to reduce noncompliance. For instance, key problems for any working parent are child care before and after school and round-trip school transportation. These might be addressed directly with supplemental support. If no such resources could be obtained, then the study design might be altered to screen out, before random assignment, households for whom compliance was likely to be problematic.

3. MISSING DATA

No doubt BFHR know more about the subtleties of the missing data than we do. Nevertheless, the same issues arise for missing data that arise for noncompliance. Surely problems with missing data could have been anticipated. Factors related to nonresponse to particular items could have been explicitly measured, and prevention strategies perhaps could have been deployed.

4. DESIGN EFFICIENCY

A carefully designed experiment not only provides a solid basis for statistical inference, but also gives credible evidence for causation. The New York School Choice Scholarships Program (NYSCSP) was apparently the first good example of a randomized experiment to examine the potential benefits of vouchers for private schools. Offering a scholarship (treatment) randomly through a lottery can (1) provide a socially acceptable way to ration scarce resources, (2) protect against undesirable selection bias, and (3) produce balance between the observed and unobserved variables that may affect the response. In addition to the randomization and blocking, NYSCSP implemented a PMPD (propensity matched-pairs design) to choose a control group from a large candidate pool and to balance many covariates explicitly. Table 2 of BFHR shows that various background variables were balanced properly between the treatment group and the control group.

But might it have been possible to do a bit better? BFHR did not mention that the sample size of the applicant's school was not balanced, although winners from high and low schools were balanced between the treatments and the controls. An ideal design would have had equal numbers of winners from "bad" (or low) and "good" (or high) public schools. In the NYSCSP, 85% of the winners were from bad schools and 15% were from good schools, to achieve the social goals of the School Choice Scholarships Foundation (Hill et al. 2000, p. 158). Bad schools are defined as "those for which the average test scores were below the median test scores for the city" (Hill et al. 2000, p. 158). A direct consequence is that the estimates from the good schools have much larger variation than those from the bad schools, which is evident in BFHR's Tables 4–6: the confidence intervals for good schools are wider.

Instead of classifying schools into two classes (bad and good), it might have been helpful to classify them into four classes: very bad, bad, good, and very good. To estimate the effect of the applicant's school (assuming linear effects), an optimal design would select half of the winners from the very bad schools, the other half from the very good schools, and none from the two middle schools. Avoiding the middle schools enlarges the distance between bad and good schools and also reduces the variation within the bad and good classes, and thus reduces the variation in the estimates. To be more consistent with the goals of the School Choice Scholarships Foundation, an alternative design might take 75% of the winners from very bad schools, 10% from bad schools, 5% from good schools, and 10% from very good schools. This design would have a higher design efficiency than the original design, especially because the applicant's school was the most important design variable after family size (Hill et al. 2000, p. 160). Of course, people on the ground at the time may have considered these options and rejected them. Our point is that with better planning, it might have been possible to squeeze more efficiency out of the study design. Such an outcome would have made the job undertaken by BFHR a bit easier.
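A back-of-the-envelope check of this efficiency argument, under the stated linear-effects assumption: with equal residual variance across strata and a school-quality score x coded 1-4, the variance of the estimated slope is proportional to 1 divided by the sum of n_i * (x_i - xbar)^2, so allocations concentrated at the extremes do better. The sketch below is ours; the last allocation is a stylized rendering of the original 85/15 two-class design, with each class spread over two adjacent scores.

    import numpy as np

    def slope_var(weights, x=np.array([1.0, 2.0, 3.0, 4.0]), n=1000):
        """Variance (up to sigma^2) of the OLS slope of outcome on school
        class x when n winners are allocated across classes by `weights`."""
        counts = np.asarray(weights) * n
        xbar = (counts * x).sum() / n
        return 1.0 / (counts * (x - xbar) ** 2).sum()

    print(slope_var([0.50, 0.00, 0.00, 0.50]))      # extremes only (optimal)
    print(slope_var([0.75, 0.10, 0.05, 0.10]))      # SCSF-friendly alternative
    print(slope_var([0.425, 0.425, 0.075, 0.075]))  # stylized 85/15 original
    # The printed variances decrease from the third design to the first,
    # matching the ordering argued in the text.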

5 CONCLUSIONS

Scienti c problems often provide a fertile ground for impor-tant statistical developments BFHRrsquos article and the literatureon which they dependare excellent illustrationsHowever from

320 Journal of the American Statistical Association June 2003

the perspective of public policy the goal of program evaluationis to obtain usefully precise estimates of program impacts whileusing the most robust means of analysis available Simplicity isalso important both for data analysts who must do the workand policy makers who must interpret the results It followsthat large investments at the front end of an evaluationmdashin re-search design measurement and study integritymdashare essentialVirtuoso statistical analyses can certainly help when these in-vestments are insuf cient or when they fail but they must notbe seen as an inexpensivealternative to doing the study well tobegin with

ADDITIONAL REFERENCESAngrist J Imbens G W and Rubin D B (1996) ldquoIdenti cation of Causal

Effects Using Instrumental Variablesrdquo (with discussion) Journal of the Amer-ican Statistical Association 91 441ndash472

Berk R A (1990) ldquoWhat Your Mother Never Told You About RandomizedField Experimentsrdquo in Community-Based Care of People With AIDS De-veloping a Research Agenda AHCPR Conference Proceedings WashingtonDC US Department of Health and Human Services

Berk R A Ladd H Graziano H and Baek J (2003) ldquoA Randomized Ex-periment Testing Inmate Classi cation Systemsrdquo Journal of Criminology andPublic Policy 2 215ndash242

Havemen R H (ed) (1977) A Decade of Federal Antipoverty ProgramsAchievements Failures and Lessons New York Academic Press

Hill J L Rubin D B and Thomas N (2000) ldquoThe Design of the NewYork School Choice Scholarships Program Evaluationrdquo in Research DesignDonald Campbellrsquos Legacy (Vol II) ed L Bickman Thousand Oaks CASage

Johnson S B (2000) ldquoCompliance Behavior in Clinical Trials Error or Op-portunityrdquo in Promoting Adherence to Medical Treatment in Chronic Child-hood Illness Concepts Methods and Interventions ed D Drotar MahwahNJ Lawrence Erlbaum

Metry J and Meyer U A (eds) (1999) Drug Regimen Compliance Issuesin Clinical Trials and Patient Management New York Wiley

RejoinderJohn BARNARD Constantine E FRANGAKIS Jennifer L HILL and Donald B RUBIN

We thank the editorial board and the thoughtful group of dis-cussants that they arranged to comment on our article The dis-cussants have raised many salient issues and offered useful ex-tensions of our work We use this rejoinder primarily to addressa few points of contention

Muthen Jo and Brown

Muthen Jo and Brown (henceforthMJB) providean enlight-ening discussion of an alternate and potentially complementaryapproach to treatment effect estimationThe growth models thatthey discuss are for data more extensive than what we analyzedand so are beyond the scope of our article Nevertheless MJBpropose an interesting idea for looking for interactions in lon-gitudinal experimental data and we look forward to future ap-plications of the method to appropriate datasets

Although models that incorporate both our compliance prin-cipal strata as well as latent growth trajectories are theoreticallypossible (assuming multiple years of outcome data) and couldpotentially yield an extremely rich description of the types oftreatment effects manifested the more time points that existthe more complicated the compliance patterns can become (atleast without further assumptions) and it may be dif cult to nda dataset rich enough to support all of the interactions Given achoice we feel that it is probably more bene cial to handle realobserved issues than to search for interactions with latent tra-jectories but certainly it would always be nice to be able todo both Of course such a search for interactions in trajecto-ries is more bene cial when there is enough prior knowledgeabout the scienti c background to include additional structuralassumptions that can support robust estimation

MJB make an interesting point about the exploration of po-tential violations of exclusion for never-takers Such analyseshave been done in past work by some of the authors (Hiranoet al 2000 Frangakis Rubin and Zhou 2002) but was not pur-sued for our study because the robustness of such methods isstill an issue in examples as complex as this one

Regarding the trade-offs between the joint latent ignorabil-ity and compound exclusion assumptions versus the joint as-sumptions of standard ignorability and the standard exclusionrestriction this topic is up for debate We would like to pointout however that these standard structural assumptions comewith their own level of arbitrariness These issues have beenexplored by Jo (2002) and Mealli and Rubin (2002 2003)

We thank MJB for providinga thought-provoking discussionof our model and assumptions as well as ample fodder for ad-ditionaldirections that can be explored in this arena Clearly thegeneral category of work has stimulated much new and excitingmodeling

Krueger and Zhu

Krueger and Zhu (henceforth KZ) focus primarily on the un-derlying study and analyses that either we did not do or useddata we did not have Clearly they are interested in the substan-tive issues which is admirable but we cannot be responsiblefor analyses that we neither did nor were consulted on and arenot prepared to comment on what analyses we might have un-dertaken with data that we did not have

KZ state that "there is no substitute for probing the definition and concepts that underlie the data." Although we are not entirely sure what this refers to in particular, if it means that good science is more important than fancy techniques, then we certainly agree.

Another comment by KZ is that "it is desirable to use the most recent comprehensive data for the widest sample possible." We agree that it is desirable to have access to the most comprehensive data available, but it is not necessary to use all of these data in any given analysis. Here we use only first-year data because these were the only data available to us at the


time that we were building the model. We used only single-child families, as explained in the article, because regrettably no compliance data were collected for individual children in the multichild families.

The exclusion of kindergarten children could be considered more of a judgment call regarding the quality of the pretest scores, not only for efficiency but also for the reasonableness of our assumptions about noncompliance and missing data. Nothing in theory would prevent us from including these children and omitting these extra covariates.

These types of trade-offs (more data vs. more appropriate data) are common in statistical analyses, and the benefits are well known, for example, for avoiding model extrapolation or when some subsets of data do not mean the same thing as others. In fact, Krueger "ignored data" in his article on minimum wage with David Card (Card and Krueger 1994) by limiting analyses to restaurants from a similar geographic region, in the hope that employees and employers across the two comparison states would be faced with similar economic conditions, rather than looking at all restaurants in these states. This is a form of implicit matching, which we believe represented a good choice. Surely there are times when focusing on subsets of data produces more reliable results.

Furthermore, we would like to point out that KZ's assertion that "because assignment to treatment status was random (within strata), a simple comparison of means between treatments and controls without conditioning on baseline scores provides an unbiased estimate of the average treatment effect" is simply false, because there are missing outcomes. Even if there is perfect compliance, a simple comparison of means across treatment groups when there are missing outcomes will not lead in general to an unbiased estimate of the treatment effect. Assumptions need to be made about the missing-data process.
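A small simulation makes this concrete; it is purely illustrative, with hypothetical parameter values, and is not an analysis of the NYSCSP data:

```python
# Perfect compliance, randomized offer, but follow-up response depends
# on the outcome itself. The complete-case difference in means is then
# biased for the true ITT effect. All parameter values are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
z = rng.integers(0, 2, size=n)                  # randomized offer
y = 50 + 2.0 * z + rng.normal(0, 19, size=n)    # true ITT effect = 2.0

# Controls respond at rates that rise with their (unobserved) score;
# treated families respond at a flat 80%.
p_respond = np.where(z == 1, 0.80,
                     np.clip(0.50 + 0.006 * (y - 50), 0.05, 0.95))
obs = rng.random(n) < p_respond

naive = y[obs & (z == 1)].mean() - y[obs & (z == 0)].mean()
print(f"true ITT = 2.00, complete-case estimate = {naive:.2f}")
# The estimate is badly biased downward here, because the responding
# controls are a selected, higher-scoring subsample of all controls.
```

Which way and how far the naive estimate moves depends entirely on the missing-data process, which is exactly why assumptions about that process must be stated.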

When noncompliance exists in addition to the missing outcome data, Frangakis and Rubin (1999) have shown that estimation even of the ITT effect requires additional assumptions. Moreover, the ITT estimate, generally considered a conservative estimate of the treatment effect, can actually be anticonservative in this scenario. Finally, contrary to what KZ seem to imply in their final paragraph, instrumental variables methods alone are not sufficient to address the problems of both noncompliance and missing data.

KZ advocate "simplicity and transparency." We wholeheartedly agree, but disagree with the examples they use to defend their stance. Their first argument is that the PMPD did not yield strong efficiency gains and that we did not account for the PMPD in the calculation of standard errors. It may well be that precision gains were relatively modest; however, the analyses that they present are by no means conclusive in this regard. First, as with KZ's other ITT analyses, we have no clue as to how missing outcome data are handled. In addition, the PMPD was also used to balance compliance and missingness. It is unclear, however, how the gains in precision due to matching the missingness and noncompliance are displayed, if at all, in KZ's results. Finally, KZ did not account for the experimental design when estimating their results. Failure to account properly for the design typically results in estimates that overstate the variances of treatment effect estimates. In this design, it is the

correlation between matched pair units that leads the sampling variances to be smaller than their estimates.
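The following sketch, with hypothetical numbers, illustrates the variance point: when matched-pair outcomes are positively correlated, the usual unpaired formula overstates the true sampling variance of the treatment-control contrast.

```python
# Simulate matched pairs whose control/treated outcomes share a positive
# correlation rho, then compare the true sampling sd of the mean
# difference with the unpaired formula sqrt(2*sigma^2/n_pairs).
import numpy as np

rng = np.random.default_rng(1)
n_pairs, rho, sigma = 500, 0.4, 1.0
cov = sigma ** 2 * np.array([[1.0, rho], [rho, 1.0]])

est = []
for _ in range(4000):
    yc, yt = rng.multivariate_normal([0.0, 0.0], cov, size=n_pairs).T
    est.append((yt - yc).mean())

print("simulated sd of estimate:", np.std(est))   # about sqrt(2*(1-rho)/n_pairs)
print("unpaired-formula sd     :", np.sqrt(2 * sigma ** 2 / n_pairs))
```

With rho = 0.4 the unpaired formula overstates the standard error by roughly 30%, in line with the point made above.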

Even if precision gains were truly modest, they of course were unknown to us at the time we debated the use of the PMPD design. KZ's argument is like the complaint of a cross-country motorist who experiences no car problems during his long drive and then bemoans the money wasted on a spare tire purchased before the trip began. We prefer to avoid potential problems by using careful design.

Importantly, KZ's criticism that we ignored this matching in the analysis is incorrect. The matching was done based on the estimated propensity score, so it follows from Rubin (1978) that assignment is ignorable conditionally given the covariates that were used to estimate it. Thus the Bayesian analysis that we conduct, which conditions on the estimated propensity score, is an approximately valid Bayesian analysis with respect to the matched assignment, assuming that our modeled relationship between the outcomes and the estimated propensity score captures the major features of the relationship to the covariates.

KZ's second argument is that use of the PMPD "led to grave consequences" because sampling weights were miscalculated by MPR. It seems to us odd to maintain, as KZ do, that the PMPD "caused" this error; nor is it clear that use of another design would have avoided the error. In any case, we have rerun our analyses, this time generalizing to the population of which the children in our analyses are representative (by setting constant weights in step 3 of Sec. 8.1.1). For this population, all results are very similar to our original results. These results are posted as Appendix C at http://biosun01.biostat.jhsph.edu/~cfrangak/papers/sc.

A third argument revolves around the protracted debate between use of ITT versus treatment-targeted measures such as CACE. We addressed both ITT and CACE, which allows the reader to focus on one or the other or both, depending on which effect is expected to generalize to another environment: the ITT or "sociological" effect of accepting or not accepting an offered voucher, or the CACE or "structural" or "scientific" effect of using a voucher. One could argue that the effect of attending (CACE) is the more generalizable, whereas the sociological effect (ITT) is more easily influenced, for instance, by "hype" about how well the program works. That is, the ITT estimate will not reflect the "effect of offering" if the next time the program is offered the compliance rates change (which seems likely). We prefer to examine both, to provide a more comprehensive picture.
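In potential-outcome notation (our gloss, consistent with the principal stratification framework; here D_i(z) is voucher use and Y_i(z) the outcome under offer z), the two estimands being contrasted are:

```latex
\mathrm{ITT} = E\bigl[\,Y_i(1) - Y_i(0)\,\bigr], \qquad
\mathrm{CACE} = E\bigl[\,Y_i(1) - Y_i(0) \mid D_i(1)=1,\ D_i(0)=0\,\bigr].
```

The CACE conditions on the latent stratum of compliers, which is why it behaves as a "structural" quantity, whereas the ITT averages over whatever take-up behavior a particular offer happens to induce.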

In sum, we agree that simplicity and transparency are important; this is the major motivation for our detailed description of our assumptions in terms of readily understandable concepts, rather than commonly invoked and less intuitive conditions involving correlated error terms. However, this desire for simplicity must be balanced with the prudence of guarding against potential problems through careful design.

Although we agree in theory with much of what KZ espouse, some of their criticisms seem misguided. Moreover, their presentation of analyses with incomplete exposition regarding missing-data complications makes these analyses difficult for us to evaluate properly.


Berk and Xu

Berk and Xu (henceforth BX) raise several interesting questions about the way in which the evaluation was designed and implemented. Many of these are actually addressed in a preceding and lengthier article by us on this topic (Barnard et al. 2002). That article describes, for instance, MPR's efforts to reduce missing data, without which the missingness rates would surely have been much higher. It also describes special challenges faced in this study. But in any study with human subjects, it is difficult to completely eliminate missing data, particularly if one desires more interesting information than can be obtained from administrative data. One of the challenges for the collection of outcome test scores was the fact that children needed to be tested outside of the public school setting, which allowed the use of a standardized test not used by the public school system, thereby greatly reducing the possibility of bias resulting from teachers "teaching to the test." Consequently, parents had to bring their children to testing areas on weekends, a formidable hurdle for some, even when monetary incentives were used.

An important consideration with this study was that the evaluators took the initiative to investigate a program that was sponsored by an organization (SCSF) whose primary goal was to provide scholarships, not to do research. We feel that MPR did an admirable job convincing the SCSF to allow the study to be evaluated and raising money for this purpose, but this also meant that the evaluators could not control every aspect of the program. For example, the SCSF originally only wanted to make the program available to children from the lower-test-score schools, and the evaluators had to convince them to also allow a small percentage from those in higher-test-score schools, to increase the generalizability of the results. The designs that BX suggest, even though they have certain desirable statistical properties under appropriate assumptions, were not an option with this study.

Related to this issue of the alignment of the evaluators' and program administrators' goals is the issue of the potential for reducing noncompliance rates. Some effort was made, in that help was provided to scholarship winners in finding an appropriate private school. Arguably, however, 100% compliance should not be the goal in such a study, given that we would not expect 100% compliance if the program were expanded and offered, say, by the government on a larger scale. Thus it may not be wise to use extreme efforts to increase compliance if such efforts would not be offered in the large-scale public case. With the current study, the opportunity exists to examine the noncompliance process for this program in this setting: take-up rates, predictors of noncompliance, and so on. To this end, some sort of subexperiment that randomizes a reasonable set of compliance incentives might be useful in similar studies in the future, if sample sizes are sufficient.

We are not convinced that we should have solicited actual prior distributions from educators and economists. In our experience, this is a confusing, difficult, and often unrewarding task. Moreover, school choice is such a politically charged and contentious issue that it seems wiser to try to use relatively diffuse and objective prior distributions.

We agree that results should be interpreted in terms of a yardstick of practical effects. We are unaware of the most appropriate such yardstick to use in this context. One side could claim that even tiny average effects matter, not only because of the cumulative effects on parts of society, but also because of possible nonadditivity of such effects, which could imply huge effects for some rare individuals and minute effects for all others. Others could argue that seemingly quite large effects have no long-run practical impact on quality of life. The results that we present provide a basic empirical starting point for these debates between people whose expertise is in this area.

We thank BX for highlighting our emphasis on the role of the assumptions in our analyses. Our hope is that we have presented the assumptions in such a way as to enable readers to make their own judgment about whether or not they were likely to be satisfied, rather than only stating them technically, which would implicitly make those important decisions for the interested but nontechnical readers.

BX point out a dearth of certain types of covariates. Part of the problem here is that we presented only a small portion of the data collected, due to the fact that our model in its current form did not handle many covariates. However, there were indeed discussions between some members of our team and MPR staff during the development of the questionnaire about what types of variables could be included that would be predictive of missing data or compliance, as well as outcomes. Some of the variables proposed were discarded due to the surveyors' experience with them as subject to problems such as high measurement error. Also, not every variable discussed was included, due to a desire for a survey that would not be overly burdensome for the study families.

We raise a final, small semantic question. We wonder about the propriety of the label "Neyman-Rubin model" to describe a framework for causal inference that views observational studies and randomized experiments on the same continuum and, moreover, encourages Bayesian analyses of both. Neyman (1923) certainly seems to have been the first to use the formal notation of potential outcomes for randomization-based inference in randomized experiments, which was a truly major advance, but neither he nor anyone else, to the best of our knowledge, used this formal notation to define causal effects in observational studies until Rubin (1974, 1975), nor did he nor anyone else ever advocate Bayesian inference for the unobserved potential outcomes until Rubin (1975, 1978), which is the perspective that we have implemented here.

Overall, we agree with BX about the paramount importance of study design, and we tried to design this study carefully, certainly more so than most social science analyses of secondary datasets like the PSID or CPS. We thank BX for their thoughtful commentary; they raise some important points.

We thank all of the discussants for stimulating comments, which we believe have added emphasis to the importance of both the substantive topic of school vouchers and the development of appropriate methods to evaluate this and other such broken randomized experiments.

ADDITIONAL REFERENCES

Card, D., and Krueger, A. (1994), "Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania," American Economic Review, 84, 772–793.


Jo, B. (2002), "Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications" (with discussion), Journal of Educational and Behavioral Statistics, 27, 385–420.

Mealli, F., and Rubin, D. B. (2002), Discussion of "Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications," by B. Jo, Journal of Educational and Behavioral Statistics, 27, 411–415.

(2003), "Assumptions Allowing the Estimation of Direct Causal Effects: Discussion of 'Healthy, Wealthy, and Wise? Tests for Direct Causal Paths Between Health and Socioeconomic Status,' by Adams et al.," Journal of Econometrics, 112, 79–87.

Rubin, D. B. (1975), "Bayesian Inference for Causality: The Importance of Randomization," in Proceedings of the Social Statistics Section, American Statistical Association, pp. 233–239.


question of whether school choice will produce better achievement outcomes for children in an urban public school system. The randomized lottery provides an exceptionally powerful tool for examining the impact of a program, far more useful than observational studies that have causal change intertwined hopelessly with self-selection factors. Statisticians are just now investigating variations in such principal strata analyses, that is, those involving latent classes formed as a function of randomized trials involving intervention invitations (such as vouchers), encouragement designs, and field trial designs involving more than one randomization (Brown and Liao 1999). The latent categories in this article, which BFHR label "complier," "never-taker," "always-taker," and "defier," represent only one type of design. Other terms may be more relevant to the scientific questions underlying trials in which subjects are randomly assigned to different levels of invitation (e.g., Angrist and Imbens 1995) or different levels of implementation. Such trials not only have great potential for examining questions of effectiveness, sustainability, and scalability, but also require terms more consistent with adherence than compliance. Again, we congratulate the authors on an important addition to the methodological literature that we predict will have lasting impact.

ADDITIONAL REFERENCES

Angrist, J. D., and Imbens, G. W. (1995), "Two-Stage Least Squares Estimation of Average Causal Effects in Models With Variable Treatment Intensity," Journal of the American Statistical Association, 90, 431–442.

Brown, C. H., and Liao, J. (1999), "Principles for Designing Randomized Preventive Trials in Mental Health: An Emerging Developmental Epidemiologic Perspective," American Journal of Community Psychology, 27, 673–709.

Jo, B. (2002a), "Model Misspecification Sensitivity Analysis in Estimating Causal Effects of Interventions With Noncompliance," Statistics in Medicine, 21, 3161–3181.

(2002b), "Sensitivity of Causal Effects Under Ignorable and Latent Ignorable Missing-Data Mechanisms," available at www.statmodel.com/mplus/examples/jo.

(2002c), "Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications" (with discussion), Journal of Educational and Behavioral Statistics, 27, 385–420.

Muthén, B. (2002a), "Beyond SEM: General Latent Variable Modeling," Behaviormetrika, 29, 81–117.

Muthén, B., and Brown, C. H. (2001), "Non-Ignorable Missing Data in a General Latent Variable Modeling Framework," presented at the annual meeting of the Society for Prevention Research, Washington, DC, June 2001.

Muthén, B., Brown, C. H., Masyn, K., Jo, B., Khoo, S. T., Yang, C. C., Wang, C. P., Kellam, S., Carlin, J., and Liao, J. (2002), "General Growth Mixture Modeling for Randomized Preventive Interventions," Biostatistics, 3, 459–475.

Muthén, L., and Muthén, B. (1998–2002), Mplus User's Guide, Los Angeles: authors.

Muthén, B., and Shedden, K. (1999), "Finite Mixture Modeling With Mixture Outcomes Using the EM Algorithm," Biometrics, 55, 463–469.

Seltzer, M. H., Frank, K. A., and Bryk, A. S. (1993), "The Metric Matters: The Sensitivity of Conclusions About Growth in Student Achievement to Choice of Metric," Educational Evaluation and Policy Analysis, 16, 41–49.

Shadish, W. R., Cook, T. D., and Campbell, D. T. (2002), Experimental and Quasi-Experimental Designs for Generalized Causal Inference, Boston: Houghton Mifflin.

Slavin, R. E. (2002), "Evidence-Based Education Policies: Transforming Educational Practice and Research," Educational Researcher, 31, 15–21.

West, S. G., and Sagarin, B. J. (2000), "Participant Selection and Loss in Randomized Experiments," in Research Design, ed. L. Bickman, Thousand Oaks, CA: Sage, pp. 117–154.

Comment
Alan B. KRUEGER and Pei ZHU

The New York City school voucher experiment provides some of the strongest evidence yet available on the effect of private school vouchers on student test achievement. Barnard et al. provide a useful reevaluation of the experiment, and some of the authors were integrally involved in the design of the experiment. We will leave it to other discussants to comment on the Bayesian methodological advances in their paper, and instead comment on the substantive lessons that can be learned from the experiment and the practical lessons that emerge from the novel design of the experiment.

We have the advantage of having access to more complete data than Barnard et al. used in preparing their paper. This includes more complete baseline test information, data on multichild families as well as single-child families, and three years of follow-up test data instead of just one year of follow-up data.

Alan B. Krueger is Professor and Pei Zhu is a graduate student, Economics Department, Princeton University, Princeton, NJ 08544 (E-mail: akrueger@princeton.edu). In part, this comment extends and summarizes results of Krueger and Zhu (2003). Readers are invited to read that article for a more in-depth analysis of many of the issues raised in this comment. Some of the results presented here differ slightly from those in our earlier article, however, because the definition of family size in this comment corresponds to the one used by Barnard et al. rather than to the definition in our earlier article.

In our comment, we use the more comprehensive sample because results for this sample are more informative, but we highlight where differences arise from using the sample analyzed by Barnard et al.

Three themes emerge from our analysis. First, simplicity and transparency are underappreciated virtues in statistical analysis. Second, it is desirable to use the most recent, comprehensive data for the widest sample possible. Third, there is no substitute for probing the definitions and concepts that underlie the data.

1. RANDOM ASSIGNMENT

As Barnard et al. explain, the voucher experiment entailed a complicated block design, and different random assignment procedures were used in the first and second set of blocks. In the first block, a propensity matched-pairs design (PMPD) method was used. In this block, far more potential control families were available than there was money to follow them up.


Table 1. Efficiency Comparison of Two Methods of Random Assignment: Estimated Treatment Effect and Standard Error, by Method of Random Assignment

                                  Propensity score match     Randomized blocks       Relative sampling   Sample-size-adjusted
                                       subsample                 subsample           variance            relative sampling
Year    Model                     Coefficient       N       Coefficient       N      (RB/PMPD)           variance (RB/PMPD)

All students
Year 1  Control for baseline      0.33 (1.40)     721       2.10 (1.38)     734        .972                  .989
        Omit baseline             0.32 (1.40)    1048      -1.02 (1.58)    1032       1.274                 1.254
Year 3  Control for baseline      1.02 (1.56)     613       0.67 (1.74)     637       1.244                 1.293
        Omit baseline            -0.31 (1.57)     900      -0.45 (1.78)     901       1.285                 1.287

African-American students
Year 1  Control for baseline      0.32 (1.88)     302       8.10 (1.71)     321        .827                  .879
        Omit baseline             2.10 (1.99)     436       3.19 (2.07)     447       1.082                 1.109
Year 3  Control for baseline      4.21 (2.51)     247       6.57 (2.09)     272        .693                  .764
        Omit baseline             2.26 (2.47)     347       3.05 (2.31)     386        .875                  .973

NOTE: Standard errors are in parentheses. The treatment effect coefficient is from a regression of test scores on a dummy indicating assignment to receive a voucher (1 = yes), lottery randomization strata dummies, and, in the indicated models, baseline test scores. Bootstrap standard errors account for within-family correlation in residuals. Year 1 or 3 refers to the follow-up year. The last column adjusts the relative sampling variances of the treatment effects for differences in relative sample sizes. The sample of African-American students consists of those whose mother's race/ethnicity is identified as non-Hispanic Black/African-American.

Rather than select a random sample of controls for follow-up after randomly selecting the treatments, the researchers estimated a propensity score model to align controls with the treatments and then used a nearest-available-neighbor Mahalanobis match to select specific paired control families for follow-up. In principle, the PMPD design should improve the precision of the estimates by reducing the chance imbalances that can occur in a simple randomized block design. A standard randomized block design was used in the remaining four blocks.
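As a rough sketch of this two-stage selection, the following is our generic reconstruction under simplified assumptions; it is not the actual MPR procedure, which (as noted below) was proprietary and not fully documented:

```python
# Generic propensity matched-pairs selection: fit a propensity score,
# then pick each treated unit's nearest available control by Mahalanobis
# distance on the covariates, within a caliper on the score. The helper
# and its defaults are hypothetical.
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.linear_model import LogisticRegression

def pmpd_pairs(X_treat, X_pool, caliper=0.10):
    X = np.vstack([X_treat, X_pool])
    t = np.r_[np.ones(len(X_treat)), np.zeros(len(X_pool))]
    e = LogisticRegression(max_iter=1000).fit(X, t).predict_proba(X)[:, 1]
    e_t, e_c = e[: len(X_treat)], e[len(X_treat):]

    VI = np.linalg.inv(np.cov(X_pool, rowvar=False))       # Mahalanobis metric
    d = cdist(X_treat, X_pool, metric="mahalanobis", VI=VI)
    d[np.abs(e_t[:, None] - e_c[None, :]) > caliper] = np.inf

    pairs, taken = [], set()
    for i in np.argsort(-e_t):                  # match extreme scores first
        j = next((j for j in np.argsort(d[i])
                  if j not in taken and np.isfinite(d[i, j])), None)
        if j is not None:
            taken.add(j)
            pairs.append((i, int(j)))
    return pairs
```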

Barnard et al. emphasize that the PMPD block was more balanced for 15 of the 21 baseline variables that they examined. But the key question is the extent to which the precision of the estimates was improved, not the covariate balance, because both blocks yield unbiased estimates. If the covariates explain relatively little of the outcome variable, then the PMPD will add very little. In Table 1 we provide standard intent-to-treat (ITT) estimates, that is, the difference in mean test scores between treatments and controls conditional on the strata used for random assignment (family size by block by high/low-achieving school), and, more importantly, their standard errors, separately for the PMPD subsample and the randomized block subsample. The outcome variable is the average of the national percentile rank on the math and reading segments of the Iowa Test of Basic Skills, taken either 1 year or 3 years after random assignment. Two sets of results are presented: one set controlling for baseline test scores, for the subsample with baseline scores, and the other without controlling for scores, for the larger sample. The last column adjusts the relative sampling variances for differences in sample size between the blocks.

For all students (panel A), the standard errors are about 10% smaller in the PMPD sample. For the Black students (panel B), however, the standard errors are essentially the same or slightly larger in the PMPD sample. This finding is unexpected, because Hill, Rubin, and Thomas (2000) predicted that analyses of subgroups would have more power in the PMPD design, because

matching should lead to a more nearly equal representation of subgroups in the treatment and control groups. In addition, the covariate balance is less good for the Black students in the PMPD than in the randomized blocks, if we compute Z scores for the differences between treatments and controls for the baseline covariates. One possibility is that because Black students could have been matched to non-Black students in the PMPD design, this subsample was actually less balanced than the subsample of Blacks from the more conventional randomized block design.

The increase in power from the PMPD design is relatively modest, even in the full sample. Of much more practical importance is the fact that the complexity of the PMPD design caused Mathematica to initially calculate incorrect baseline sample weights. The baseline weights were designed to make the follow-up sample representative of all eligible applicants for vouchers. The initial weights assigned much too little importance to the control sample in the PMPD block. Because these families represented many unselected controls, they should have been weighted heavily. The revised weights increased the weight on the PMPD controls by 620%. In contrast, the weight increased by just 18% for the rest of the sample.

This error had grave consequences for the initial inferences and analyses of the data. First, using the faulty weights, Peterson, Myers, and Howell (1998) reported highly statistically significant differences in baseline test scores between the treatment and control groups, with control group members scoring higher on both the math and reading exams. After the mistake was discovered and the weights were revised, however, the baseline differences became small and statistically insignificant. (We note, however, that if we limit attention to the students from single-child families that Barnard et al. use, and use the more complete baseline test score data, then there is


a statistically significant difference in baseline reading scores between treatments and controls, even with the revised baseline weights. If we pool together students from single-child and multichild families, then the baseline difference is insignificant. This is another reason why we think it is more appropriate to use the broader sample that includes children from all families.)

Second, because of the inaccurate inference that there was a difference in baseline ability between treatments and controls, Mathematica's researchers were discouraged from analyzing (or at least from presenting) results that did not condition on baseline scores. This is unfortunate, because conditioning on baseline scores caused the researchers to drop from the sample all the students who were initially in kindergarten (because these students were not given baseline tests) and 11% of students initially in grades 1–4. Including students with missing baseline scores increases the sample by more than 40% and expands the population to which the results can be generalized. As explained later, and in an article by Krueger and Zhu (2003), qualitatively different results are found if students with missing baseline scores are included in the sample. Because assignment to receive a voucher was random within lottery strata, estimates that do not condition on baseline scores are nonetheless unbiased. Moreover, it is inefficient to exclude students with missing baseline scores (most of whom had follow-up test scores), and such a sample restriction can potentially cause sample selection bias.

Of lesser importance is the fact that the PMPD design also complicates the calculation of standard errors. The matching of treatments and controls on selected covariates creates a dependence between paired observations. Moreover, if the propensity score model is misspecified, then the equation error also creates a dependence between observations. Researchers have not taken this dependence into account in the calculation of standard errors. We suspect, however, that this is likely to cause only a small understatement of the standard errors, because the covariates do not account for much of the residual variance of test scores and because the PMPD sample is only about half of the overall sample. (We tried to compute the propensity score to adjust the standard errors ourselves, but were unable to replicate the PMPD model because it was not described in sufficient detail and because the computer programs are proprietary. This also prevented us from replicating estimates in Table 4 of Barnard et al.)

The problems created by the complicated experimental design, particularly concerning the weights, lead us to reiterate the advice of Cochran and Cox (1957): "A good working rule is to use the simplest experimental design that meets the needs of the occasion." It seems to us that in this case, the PMPD design introduced unnecessary complexity that inadvertently led to a consequential mistake in the computation of weights, with very little improvement in efficiency. This was not the fault of the architects of the PMPD design, who did not compute the weights, but it was related to the complexity of the design. Simplicity and transparency, in designs, methods, and procedures, can help avoid mistakes down the road, which are almost inevitable in large-scale empirical studies. Indeed, Mathematica recently informed us that they still do not have the baseline weights exactly correct, 5 years after random assignment.

2. INTENT-TO-TREAT ESTIMATION RESULTS

Barnard et al. devote much attention to addressing potential problems caused by missing data for the sample of students in grades 1–4 at baseline; this follows the practice of earlier reports by Mathematica, which restricted the analysis of student outcomes (but not parental responses, such as satisfaction) to those enrolled in grades 1–4 at baseline. In our opinion, a much more important substantive problem results from the exclusion of students in the kindergarten cohort, who were categorically dropped because they were not given baseline tests. Because assignment to treatment status was random (within strata), a simple comparison of means between treatments and controls without conditioning on baseline scores provides an unbiased estimate of the average treatment effect. Moreover, as Barnard et al. and others show, treatments and controls were well balanced in terms of baseline characteristics, so there is no reason to suspect that random assignment was somehow corrupted.

Table 2 presents regression estimates of the ITT effect using various samples. Because we cannot replicate the Bayesian estimates without knowing the propensity score, we present conventional ITT estimates. For each entry in the first column, we regressed the test score percentile rank on a voucher offer dummy, 30 dummies indicating lottery strata (block × family size × high/low school), and baseline test scores. The sample is limited to those with baseline test data. Our results differ from the ITT estimates in Table 5 of Barnard et al. because we use the revised weights and a more comprehensive sample that also includes students from multichild families. Barnard et al. report that their Bayesian estimates are "generally more stable" than the conventional ITT estimates, "which in some cases are not even credible," but the extreme ITT estimates that they refer to, for 4th graders from high-achieving schools, are based on only 15 students. For the samples that pool students across grades, the results of the conventional ITT estimates are a priori credible and probably not statistically different from their Bayesian estimates.
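For concreteness, the conventional estimator described here could be computed roughly as follows. This is a sketch under hypothetical file and column names, and it substitutes family-clustered standard errors for the within-family bootstrap used in the tables:

```python
# Weighted least squares of the percentile-rank score on the voucher
# offer, the 30 lottery-strata dummies, and (for column 1) baseline
# scores. All data names are hypothetical stand-ins.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("nyscsp_followup.csv")      # hypothetical file
d = df.dropna(subset=["score_pct", "base_read", "base_math"])

fit = smf.wls("score_pct ~ voucher_offer + C(strata) + base_read + base_math",
              data=d, weights=d["weight"]).fit(
                  cov_type="cluster", cov_kwds={"groups": d["family_id"]})
print(fit.params["voucher_offer"], fit.bse["voucher_offer"])
```

Dropping the two baseline-score terms and the dropna() restriction gives the column 2 through column 4 specifications, which omit baseline scores and so can use the full sample.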

In the second column, we use the same sample as in column 1, but omit the baseline test score from the model. In the third column, we expand the sample to include those with missing baseline scores. In the fourth column, we continue to include students with missing baseline scores, but restrict the sample to those initially in low-achieving public schools.

We focus mainly on the results for Black students and students from low-achieving public schools, because the effect of offering a voucher on either reading or math scores for the overall sample is always statistically insignificant and because most public policy attention has focused on these two groups.

Although it has not been made explicit in previous studies of these data, Black students have been defined as children with a non-Hispanic Black/African-American mother. We use this definition and a broader one to probe the sensitivity of the results.

Omitting baseline scores has little qualitative effect on the estimates when the same sample is used (compare columns 1 and 2); the coefficient typically changes by no more than half a standard error. For year 3 scores for Black students, for example, the treatment effect on the composite score is 5.2 points (t = 3.2) when controlling for baseline scores and 5.0 points (t = 2.6) when not controlling. This stability is expected in a randomized experiment.


Table 2. Estimated Treatment Effects With and Without Controlling for Baseline Scores: Coefficient on Voucher Dummy From Test Score Regressions for Various Samples

Columns: (1) subsample with baseline scores, controls for baseline scores; (2) subsample with baseline scores, omits baseline scores; (3) full sample, omits baseline scores; (4) subsample from Low applicant schools, omits baseline scores.

Test       Sample                                   (1)            (2)            (3)            (4)

First follow-up test
Composite  Overall                              1.27 (0.96)    0.42 (1.28)   -0.33 (1.06)   -0.60 (1.06)
           Mother Black, non-Hispanic           4.31 (1.28)*   3.64 (1.73)*   2.66 (1.42)    1.75 (1.57)
           Either parent Black, non-Hispanic    3.55 (1.24)*   2.52 (1.73)    1.38 (1.42)    0.53 (1.55)
Reading    Overall                              1.11 (1.04)    0.03 (1.36)   -1.13 (1.17)   -1.17 (1.18)
           Mother Black, non-Hispanic           3.42 (1.59)*   2.60 (1.98)    1.38 (1.74)    0.93 (1.84)
           Either parent Black, non-Hispanic    2.68 (1.50)    1.49 (1.97)   -0.09 (1.71)   -0.53 (1.81)
Math       Overall                              1.44 (1.17)    0.80 (1.45)    0.47 (1.17)   -0.03 (1.15)
           Mother Black, non-Hispanic           5.28 (1.52)*   4.81 (1.93)*   4.01 (1.54)*   2.68 (1.63)
           Either parent Black, non-Hispanic    4.42 (1.48)*   3.54 (1.90)    2.86 (1.51)    1.59 (1.63)

Third follow-up test
Composite  Overall                              0.90 (1.14)    0.36 (1.40)   -0.38 (1.19)    0.10 (1.16)
           Mother Black, non-Hispanic           5.24 (1.63)*   5.03 (1.92)*   2.65 (1.68)    3.09 (1.67)
           Either parent Black, non-Hispanic    4.70 (1.60)*   4.27 (1.92)*   1.87 (1.68)    1.90 (1.67)
Reading    Overall                              0.25 (1.26)   -0.42 (1.51)   -1.00 (1.25)   -0.08 (1.24)
           Mother Black, non-Hispanic           3.64 (1.87)    3.22 (2.19)    1.25 (1.83)    2.06 (1.83)
           Either parent Black, non-Hispanic    3.37 (1.86)    2.75 (2.18)    0.76 (1.83)    1.23 (1.82)
Math       Overall                              1.54 (1.30)    1.15 (1.53)    0.24 (1.33)    0.29 (1.28)
           Mother Black, non-Hispanic           6.84 (1.86)*   6.84 (2.09)*   4.05 (1.86)*   4.13 (1.85)*
           Either parent Black, non-Hispanic    6.04 (1.83)*   5.78 (2.09)*   2.97 (1.85)    2.58 (1.82)

NOTE: Standard errors are in parentheses. The dependent variable is the test score NPR. The reported coefficient is the coefficient on the voucher offer dummy. All regressions control for 30 randomization strata dummies; models in the first column also control for baseline math and reading test scores. Bootstrap standard errors are robust to correlation in residuals among students in the same family. An asterisk indicates that the absolute t ratio exceeds 1.96.
Sample sizes in year 1 for the subsample with baseline scores / without: overall 1455/2080; mother Black 623/883; either parent Black 684/968.
Sample sizes in year 3 for the subsample with baseline scores / without: overall 1250/1801; mother Black 519/733; either parent Black 572/807.
Sample sizes in year 1 for the subsample from low schools, without baseline scores: overall 1802; mother Black 770; either parent Black 843.
Sample sizes in year 3 for the subsample from low schools, without baseline scores: overall 1569; mother Black 647; either parent Black 712.

When the sample is expanded to include those with missing baseline scores in column 3, however, the ITT effect falls almost by half in year 3 for Black students, to 2.65 points, with a t ratio of 1.58. The effect on the math score is statistically significant, as Barnard et al. emphasize. Qualitatively similar results are found if baseline covariates, such as mother's education and family income, are included as regressors (see Krueger and Zhu 2003).

The classification of Black students used in the study is somewhat idiosyncratic. Race and ethnicity were collected in a single question in the parental survey, contrary to the OMB guidelines for government surveys, so it is impossible to identify Blacks of Hispanic origin (of which there are a large number in New York). In addition, the survey did not directly ask about the child's race or ethnicity, so children's race must be

inferred from parents' race/ethnicity. A broader definition of Black students, and one that probably comes closer to the definition used in government surveys and most other educational research, would also include those who have a Black father in the sample of Black students. According to the 1990 Census, race was reported as Black for 85% of the children with a Black father and a Hispanic mother (the overwhelming non-Black portion of the sample) in the New York metropolitan region.

Hence we also present results for a sample in which either parent is listed as "Black, non-Hispanic." This increases the sample of Black students by about 10%, and the results are even weaker for this sample. For example, the ITT estimate using the more comprehensive sample of Black students enrolled in grades K–4 at baseline is 1.87 points (t = 1.11) on the third-year follow-up composite test, although even with this sample


the effect on math is significant at the 10% level in year 1, as Barnard et al. have found, and almost significant in year 3. So we conclude that the effect of the opportunity to use a private school voucher on the composite score for the most comprehensive sample of Black students is insignificantly different from 0, although it is possible that there was initially a small beneficial effect on the math score for Black students.

In the final column, we present results for the subsample of students originally enrolled in schools with average test scores below the median score in New York City. In each case, the results for this subsample are quite similar to those for the full sample, and a formal test of the null hypothesis that students from low- and high-achieving schools benefit equally from vouchers is never close to rejecting. It also appears very unlikely that the differences between the treatment effects for applicants from low- and high-achieving schools in Barnard et al.'s Table 4 are statistically significant, either.

3. WHAT WAS BROKEN?

We agree with Barnard et al. that the experiment was broken, in the sense that attrition and missing data were common. Previous analyses were also strained, if not broken, for their neglect of the cohort of students originally in kindergarten, who were in third grade at the end of the experiment and whose follow-up test scores were ignored. Including these students qualitatively alters the results for Black students. The experiment was also broken in the sense that years passed before correct baseline weights were computed.

We disagree, however, with the interpretation that the experiment was broken because compliance was less than 100%. This

depends on the question that one is interested in answering. Because most interest in the experiment for policy purposes centers on the impact of offering vouchers on achievement, not on compelling students to use vouchers, we think the ITT estimates, which reflect inevitable partial usage of vouchers, are most relevant (see also Rouse 1998; Angrist et al. 2003). Moreover, New York had a higher voucher take-up rate than the experiments in Dayton and the District of Columbia, so one could argue that the New York experiment provides an upper bound estimate of the effect of offering vouchers. On the other hand, if the goal is to use the experiment to estimate the effect of attending private school on achievement, then such methods as instrumental variables or those used by Barnard et al. are necessary to estimate the parameter of interest. We consider this of secondary interest in this case, however.
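For readers who do want the effect of attending, the simplest instrumental variables version is the Wald estimator, sketched below with hypothetical arrays. It presumes the usual IV assumptions (Angrist, Imbens, and Rubin 1996) and, notably, fully observed outcomes, which is precisely what the rejoinder disputes for these data:

```python
# Wald/IV estimate of the effect of attending (CACE): the ITT effect on
# the outcome divided by the offer's effect on take-up.
# y = test score, d = used voucher (0/1), z = offered voucher (0/1);
# all three arrays are hypothetical stand-ins.
import numpy as np

def wald_cace(y, d, z):
    itt_y = y[z == 1].mean() - y[z == 0].mean()   # effect of the offer
    itt_d = d[z == 1].mean() - d[z == 0].mean()   # take-up difference
    return itt_y / itt_d

# E.g., a 2-point ITT effect with 74% vs. 4% voucher use implies
# 2 / 0.70 = 2.86 points for compliers.
```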

ADDITIONAL REFERENCES

Angrist, J., Bettinger, E., Bloom, E., King, E., and Kremer, M. (2003), "Vouchers for Private Schooling in Colombia: Evidence From a Randomized Natural Experiment," American Economic Review, 92, 1535–1558.

Cochran, W., and Cox, G. (1957), Experimental Designs (2nd ed.), New York: Wiley.

Krueger, A., and Zhu, P. (2003), "Another Look at the New York City School Voucher Experiment," American Behavioral Scientist, forthcoming. Also available from the Industrial Relations Section, Princeton University, at http://www.irs.princeton.edu.

Peterson, P., Myers, D., and Howell, W. (1998), "An Evaluation of the New York City Scholarships Program: The First Year," Mathematica Policy Research, Inc., available at http://www.ksg.harvard.edu/pepg/pdf/ny1rpt.pdf.

Rouse, C. (1998), "Private School Vouchers and Student Achievement: An Evaluation of the Milwaukee Parental Choice Program," The Quarterly Journal of Economics, 113, 553–602.

Comment
Richard A. BERK and Hongquan XU

1. INTRODUCTION

The article by Barnard, Frangakis, Hill, and Rubin (hereafter BFHR) is a virtuoso performance. By applying the Neyman-Rubin model of causal effects and building on a series of important works on the estimation and interpretation of causal effects (e.g., Angrist, Imbens, and Rubin 1996), BFHR manage to extract a reasonable set of findings for a "broken" randomized experiment. The article more generally underscores the important difference between estimating relationships in a consistent manner and interpreting those estimates in causal terms; obtaining consistent estimates is only part of the enterprise. Finally, the article emphasizes that assumptions made so that proper estimates may be obtained are not just technical moves of convenience. Rather, they are statements about how the empirical world is supposed to work. As such, they need to be examined

Richard A. Berk is Professor and Hongquan Xu is Assistant Professor, Department of Statistics, University of California, Los Angeles, CA 90095 (E-mail: berk@stat.ucla.edu).

concretely, with reference to the phenomena being studied, evaluated empirically whenever possible, and, at a minimum, subjected to sensitivity tests.

As battle-tested academics, we can certainly quibble here and there about some aspects of the article. For example, might it not have made sense to construct priors from the educators and economists who were responsible for designing the intervention? Moreover, might it not have made sense to interpret the results using a yardstick of treatment effects that are large enough to matter in practical terms? Nonetheless, we doubt that overall we could have done as well as BFHR did. Moreover, most of our concerns depend on features of the data that cannot be known from a distance. We would have had to carefully examine the data ourselves. As a result, we focus on why a virtuoso performance was needed to begin with. Why was this trip necessary?


2. NONCOMPLIANCE

The implementation of randomized field experiments, as desirable as they are, invites any number of well-known problems (Berk 1990). One of these problems is that human subjects often do not do what they are told, even when they say they will. There is a recent scholarly literature on noncompliance (Metry and Meyer 1999; Johnson 2000) and a citation trail leading back to evaluations of the social programs of the War on Poverty (Havemen 1977). Some noncompliance can be expected except when there is unusual control over the study subjects. Our recent randomized trial of the inmate classification system used by the California Department of Corrections is one example (Berk, Ladd, Graziano, and Baek 2003).

An obvious question follows: Why does it seem that the individuals who designed this experiment were caught by surprise? As best we can tell, little was done to reduce noncompliance, and, more important for the analyses of BFHR, apparently almost no data were collected on factors that might be explicitly related to noncompliance (Hill, Rubin, and Thomas 2000).

Without knowing far more about the design stages of the experiment, we find it difficult to be very specific. But from the literatures on compliance and school choice, the following are probably illustrative indicators relevant for parents who would not be inclined to use a school voucher; for other forms of noncompliance, other measures would be needed:

1. Whether the adults in the household are employed outside the home during the day.
2. Whether there is any child care at the private schools before classes begin and after they end, for children of employed parents.
3. A household's ability to pay for private school beyond the value of the voucher.
4. Travel time to the nearest private schools.
5. Availability of transportation to those schools.
6. Religious preference of the household and religious affiliation of the private school.

From these examples, and others that individuals closer to the study would know about, one can imagine a variety of measures to reduce noncompliance. For instance, key problems for any working parent are child care before and after school and round-trip school transportation. These might be addressed directly with supplemental support. If no such resources could be obtained, then the study design might be altered to screen out, before random assignment, households for whom compliance was likely to be problematic.

3. MISSING DATA

No doubt BFHR know more about the subtleties of the missing data than we do. Nevertheless, the same issues arise for missing data that arise for noncompliance. Surely problems with missing data could have been anticipated. Factors related to nonresponse to particular items could have been explicitly measured, and prevention strategies perhaps could have been deployed.

4. DESIGN EFFICIENCY

A carefully designed experiment not only provides a solid basis for statistical inference, but also gives credible evidence for causation. The New York School Choice Scholarships Program (NYSCSP) was apparently the first good example of a randomized experiment to examine the potential benefits of vouchers for private schools. Offering a scholarship (treatment) randomly through a lottery can (1) provide a socially acceptable way to ration scarce resources, (2) protect against undesirable selection bias, and (3) produce balance between the observed and unobserved variables, which may affect the response. In addition to the randomization and blocking, NYSCSP implemented a PMPD (propensity matched pairs design) to choose a control group from a large candidate pool and to balance many covariates explicitly. Table 2 of BFHR shows that various background variables were balanced properly between the treatment group and the control group.

But might it have been possible to do a bit better? BFHR did not mention that the sample size of the applicant's school was not balanced, although winners from high and low schools were balanced between the treatments and the controls. An ideal design would have had equal winners from "bad" (or low) and "good" (or high) public schools. In the NYSCSP, 85% of the winners were from bad schools and 15% were from good schools, to achieve the social goals of the School Choice Scholarships Foundation (Hill et al. 2000, p. 158). Bad schools are defined as "those for which the average test scores were below the median test scores for the city" (Hill et al. 2000, p. 158). A direct consequence is that the estimates from the good schools have much larger variation than those from the bad schools, which was evident in BFHR's Tables 4–6: the confidence intervals for good schools are wider.

Instead of classifying schools into two classes (bad and good), it might have been helpful to classify them into four classes: very bad, bad, good, and very good. To estimate the effect of the applicant's school (assuming linear effects), an optimal design would select half of the winners from the very bad schools, the other half from the very good schools, and none from the two middle classes. Avoiding the middle classes enlarges the distance between bad and good schools and also reduces the source variation within the bad and good schools, and thus reduces the variation in the estimates. To be more consistent with the goals of the School Choice Scholarships Foundation, an alternative design might take 75% of winners from very bad schools, 10% from bad schools, 5% from good schools, and 10% from very good schools. This design would have a higher design efficiency than the original design, especially because the applicant's school was the most important design variable after family size (Hill et al. 2000, p. 160). Of course, people on the ground at the time may have considered these options and rejected them. Our point is that with better planning, it might have been possible to squeeze more efficiency out of the study design. Such an outcome would have made the job undertaken by BFHR a bit easier.
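This efficiency claim is easy to check numerically. The sketch below is ours, assuming equally spaced class scores x = 0, 1, 2, 3 (very bad through very good) and a linear effect, so that the per-observation variance of the estimated slope is proportional to 1 / sum of w(x - xbar)^2:

```python
# Compare the per-observation variance factor of the estimated linear
# school-quality effect under three allocations of winners across four
# hypothetical classes (very bad, bad, good, very good).
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])

def slope_var(weights):
    w = np.asarray(weights, dtype=float)
    w /= w.sum()
    return 1.0 / np.sum(w * (x - np.sum(w * x)) ** 2)

print(slope_var([0.50, 0.00, 0.00, 0.50]))      # extremes only: 0.44 (optimal)
print(slope_var([0.75, 0.10, 0.05, 0.10]))      # policy-constrained mix: 1.05
print(slope_var([0.425, 0.425, 0.075, 0.075]))  # one rendering of the 85/15 split: 1.32
```

Under these assumptions, the extremes-only design does best, and the 75/10/5/10 mix still improves noticeably on a rendering of the original two-class 85/15 split.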

5. CONCLUSIONS

Scientific problems often provide a fertile ground for important statistical developments. BFHR's article, and the literature on which they depend, are excellent illustrations. However, from


the perspective of public policy, the goal of program evaluation is to obtain usefully precise estimates of program impacts while using the most robust means of analysis available. Simplicity is also important, both for data analysts, who must do the work, and for policy makers, who must interpret the results. It follows that large investments at the front end of an evaluation, in research design, measurement, and study integrity, are essential. Virtuoso statistical analyses can certainly help when these investments are insufficient or when they fail, but they must not be seen as an inexpensive alternative to doing the study well to begin with.

ADDITIONAL REFERENCES

Angrist, J., Imbens, G. W., and Rubin, D. B. (1996), "Identification of Causal Effects Using Instrumental Variables" (with discussion), Journal of the American Statistical Association, 91, 441–472.

Berk, R. A. (1990), "What Your Mother Never Told You About Randomized Field Experiments," in Community-Based Care of People With AIDS: Developing a Research Agenda, AHCPR Conference Proceedings, Washington, DC: U.S. Department of Health and Human Services.

Berk, R. A., Ladd, H., Graziano, H., and Baek, J. (2003), "A Randomized Experiment Testing Inmate Classification Systems," Journal of Criminology and Public Policy, 2, 215–242.

Havemen R H (ed) (1977) A Decade of Federal Antipoverty ProgramsAchievements Failures and Lessons New York Academic Press

Hill J L Rubin D B and Thomas N (2000) ldquoThe Design of the NewYork School Choice Scholarships Program Evaluationrdquo in Research DesignDonald Campbellrsquos Legacy (Vol II) ed L Bickman Thousand Oaks CASage

Johnson S B (2000) ldquoCompliance Behavior in Clinical Trials Error or Op-portunityrdquo in Promoting Adherence to Medical Treatment in Chronic Child-hood Illness Concepts Methods and Interventions ed D Drotar MahwahNJ Lawrence Erlbaum

Metry J and Meyer U A (eds) (1999) Drug Regimen Compliance Issuesin Clinical Trials and Patient Management New York Wiley

Rejoinder

John BARNARD, Constantine E. FRANGAKIS, Jennifer L. HILL, and Donald B. RUBIN

We thank the editorial board and the thoughtful group of discussants that they arranged to comment on our article. The discussants have raised many salient issues and offered useful extensions of our work. We use this rejoinder primarily to address a few points of contention.

Muthén, Jo, and Brown

Muthén, Jo, and Brown (henceforth MJB) provide an enlightening discussion of an alternate and potentially complementary approach to treatment effect estimation. The growth models that they discuss are for data more extensive than what we analyzed and so are beyond the scope of our article. Nevertheless, MJB propose an interesting idea for looking for interactions in longitudinal experimental data, and we look forward to future applications of the method to appropriate datasets.

Although models that incorporate both our compliance principal strata and latent growth trajectories are theoretically possible (assuming multiple years of outcome data), and could potentially yield an extremely rich description of the types of treatment effects manifested, the more time points there are, the more complicated the compliance patterns can become (at least without further assumptions), and it may be difficult to find a dataset rich enough to support all of the interactions. Given a choice, we feel that it is probably more beneficial to handle real, observed issues than to search for interactions with latent trajectories, but certainly it would always be nice to be able to do both. Of course, such a search for interactions in trajectories is more beneficial when there is enough prior knowledge about the scientific background to include additional structural assumptions that can support robust estimation.

MJB make an interesting point about the exploration of potential violations of exclusion for never-takers. Such analyses have been done in past work by some of the authors (Hirano et al. 2000; Frangakis, Rubin, and Zhou 2002) but were not pursued for our study, because the robustness of such methods is still an issue in examples as complex as this one.

The trade-off between the joint latent ignorability and compound exclusion assumptions and the joint assumptions of standard ignorability and the standard exclusion restriction is open to debate. We would like to point out, however, that these standard structural assumptions come with their own level of arbitrariness. These issues have been explored by Jo (2002) and Mealli and Rubin (2002, 2003).

We thank MJB for providing a thought-provoking discussion of our model and assumptions, as well as ample fodder for additional directions that can be explored in this arena. Clearly, this general category of work has stimulated much new and exciting modeling.

Krueger and Zhu

Krueger and Zhu (henceforth KZ) focus primarily on the underlying study and on analyses that either we did not do or that used data we did not have. Clearly, they are interested in the substantive issues, which is admirable, but we cannot be responsible for analyses that we neither did nor were consulted on, and we are not prepared to comment on what analyses we might have undertaken with data that we did not have.

KZ state that "there is no substitute for probing the definition and concepts that underlie that data." Although we are not entirely sure what this refers to in particular, if it means that good science is more important than fancy techniques, then we certainly agree.

Another comment by KZ is that "it is desirable to use the most recent comprehensive data for the widest sample possible." We agree that it is desirable to have access to the most comprehensive data available, but it is not necessary to use all of these data in any given analysis. Here we used only first-year data because these were the only data available to us at the time that we were building the model. We used only single-child families, as explained in the article, because, regrettably, no compliance data were collected for individual children in the multichild families.

The exclusion of the kindergarten children could be considered more of a judgment call regarding the quality of the pretest scores, made not only for efficiency but also for the reasonableness of our assumptions about noncompliance and missing data. Nothing in theory would prevent us from including these children and omitting these extra covariates.

These types of trade-offs (more data vs. more appropriate data) are common in statistical analyses, and the benefits are well known, for example, for avoiding model extrapolation or when some subsets of data do not mean the same thing as others. In fact, Krueger "ignored data" in his article on the minimum wage with David Card (Card and Krueger 1994) by limiting analyses to restaurants from a similar geographic region, in the hope that employees and employers across the two comparison states would face similar economic conditions, rather than looking at all restaurants in those states. This is a form of implicit matching, which we believe represented a good choice. Surely there are times when focusing on subsets of data produces more reliable results.

Furthermore, we would like to point out that KZ's assertion that "because assignment to treatment status was random (within strata), a simple comparison of means between treatments and controls, without conditioning on baseline scores, provides an unbiased estimate of the average treatment effect" is simply false, because there are missing outcomes. Even if there is perfect compliance, a simple comparison of means across treatment groups when there are missing outcomes will not, in general, lead to an unbiased estimate of the treatment effect. Assumptions need to be made about the missing-data process.
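A minimal simulation sketch of this point (illustrative numbers only, not the NYSCSP data): with randomized assignment, perfect compliance, and a true effect of 2, the complete-case comparison of means is still biased when response depends on the outcome.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200_000
    z = rng.integers(0, 2, n)                  # randomized assignment
    y = 50 + 2.0 * z + rng.normal(0, 10, n)    # true average treatment effect = 2

    # Response depends on the outcome in the treatment group only:
    # low-scoring treated children are less likely to be tested.
    p_respond = np.where(z == 1, 0.5 + 0.006 * (y - 50), 0.8)
    observed = rng.random(n) < np.clip(p_respond, 0.0, 1.0)

    naive = y[observed & (z == 1)].mean() - y[observed & (z == 0)].mean()
    print(f"complete-case difference in means: {naive:.2f} (truth: 2.00)")

Under this missingness mechanism the complete-case estimate is roughly 3.2, not 2, even though assignment is perfectly randomized.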

When noncompliance exists in addition to the missing outcome data, Frangakis and Rubin (1999) have shown that estimation even of the ITT effect requires additional assumptions. Moreover, the ITT estimate, generally considered a conservative estimate of the treatment effect, can actually be anticonservative in this scenario. Finally, contrary to what KZ seem to imply in their final paragraph, instrumental variables methods alone are not sufficient to address the problems of both noncompliance and missing data.

KZ advocate "simplicity and transparency." We wholeheartedly agree, but disagree with the examples they use to defend their stance. Their first argument is that the PMPD did not yield strong efficiency gains and that we did not account for the PMPD in the calculation of standard errors. It may well be that precision gains were relatively modest; however, the analyses that they present are by no means conclusive in this regard. First, as with KZ's other ITT analyses, we have no clue as to how missing outcome data are handled. In addition, the PMPD was also used to balance compliance and missingness. It is unclear, however, how the gains in precision due to matching the missingness and noncompliance are displayed, if at all, in KZ's results. Finally, KZ did not account for the experimental design when estimating their results. Failure to account properly for the design typically results in estimates that overstate the variances of treatment effect estimates. In this design, it is the correlation between matched-pair units that leads the sampling variances to be smaller than their estimates.
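In symbols (a standard paired-design result, with notation introduced here for illustration): with n matched pairs, treated and control outcome variances \(\sigma_T^2\) and \(\sigma_C^2\), and within-pair correlation \(\rho\),

\[
\operatorname{Var}(\bar{y}_T - \bar{y}_C) = \frac{\sigma_T^2 + \sigma_C^2 - 2\rho\,\sigma_T\sigma_C}{n},
\]

so a positive \(\rho\), induced by matching on covariates that predict the outcome, makes the true sampling variance smaller than the independent-samples formula (the \(\rho = 0\) case) would suggest.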

Even if precision gains were truly modest, they of course were unknown to us at the time we debated the use of the PMPD design. KZ's argument is like the complaint of a cross-country motorist who experiences no car problems during his long drive and then bemoans the money wasted on a spare tire purchased before the trip began. We prefer to avoid potential problems by using careful design.

Importantly, KZ's criticism that we ignored this matching in the analysis is incorrect. The matching was done based on the estimated propensity score, so it follows from Rubin (1978) that assignment is ignorable conditionally given the covariates that were used to estimate it. Thus the Bayesian analysis that we conduct, which conditions on the estimated propensity score, is an approximately valid Bayesian analysis with respect to the matched assignment, assuming that our modeled relationship between the outcomes and the estimated propensity score captures the major features of the relationship to the covariates.
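A minimal numpy-only sketch of the general idea (our illustration, not BFHR's actual model; the data-generating process, in which the outcome depends on the covariates only through the true score, is an assumption chosen so that conditioning on the score suffices):

    import numpy as np

    rng = np.random.default_rng(1)
    n = 50_000
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
    beta_true = np.array([0.2, 0.8, -0.5, 0.3])
    e_true = 1.0 / (1.0 + np.exp(-X @ beta_true))    # true propensity score
    z = rng.random(n) < e_true                       # treatment assignment
    y = 1.0 * z + 3.0 * e_true + rng.normal(size=n)  # true treatment effect = 1

    # Estimate the propensity score by logistic regression (Newton iterations).
    beta = np.zeros(4)
    for _ in range(25):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        beta += np.linalg.solve(X.T @ (X * (p * (1 - p))[:, None]), X.T @ (z - p))
    e_hat = 1.0 / (1.0 + np.exp(-X @ beta))

    # The outcome analysis conditions on the estimated score, not the raw covariates.
    D = np.column_stack([np.ones(n), z, e_hat])
    coef, *_ = np.linalg.lstsq(D, y, rcond=None)
    print(f"estimated treatment effect: {coef[1]:.2f} (truth: 1.00)")

The design choice mirrored here is the one the rejoinder describes: once the score is estimated from the covariates, the outcome model can condition on that scalar summary rather than on the full covariate vector.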

KZ's second argument is that use of the PMPD "led to grave consequences" because sampling weights were miscalculated by MPR. It seems to us odd to maintain, as KZ do, that the PMPD "caused" this error, nor is it clear that use of another design would have avoided the error. In any case, we have rerun our analyses, this time generalizing to the population of which the children in our analyses are representative (by setting constant weights in step 3 of Section 8.1.1). For this population, all results are very similar to our original results. These results are posted as Appendix C at http://biosun01.biostat.jhsph.edu/~cfrangak/papers/sc.

A third argument revolves around the protracted debate between the use of ITT versus treatment-targeted measures such as CACE. We addressed both ITT and CACE, which allows the reader to focus on one or the other or both, depending on which effect is expected to generalize to another environment: the ITT, or "sociological," effect of accepting or not accepting an offered voucher, or the CACE, or "structural" or "scientific," effect of using a voucher. One could argue that the effect of attending (CACE) is the more generalizable, whereas the sociological effect (ITT) is more easily influenced, for instance, by "hype" about how well the program works. That is, the ITT estimate will not reflect the "effect of offering" if, the next time the program is offered, the compliance rates change (which seems likely). We prefer to examine both, to provide a more comprehensive picture.
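A toy illustration of the two estimands (hypothetical numbers; assumes the usual instrumental variables conditions of randomization, exclusion for never-takers, one-sided noncompliance, and no missing data): the ITT effect is the offer-group difference in means, and the CACE is that difference divided by the take-up rate.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 100_000
    complier = rng.random(n) < 0.7            # 70% would use an offered voucher
    z = rng.integers(0, 2, n)                 # randomized voucher offer
    d = (z == 1) & complier                   # one-sided: use requires an offer
    y = 50 + 3.0 * d + rng.normal(0, 10, n)   # effect of *using* a voucher = 3

    itt = y[z == 1].mean() - y[z == 0].mean()
    take_up = d[z == 1].mean() - d[z == 0].mean()
    print(f"ITT  estimate: {itt:.2f}")            # about 0.7 * 3 = 2.1
    print(f"CACE estimate: {itt / take_up:.2f}")  # Wald/IV ratio, about 3.0

The point about generalizability is visible directly: changing the assumed take-up rate (the 0.7) moves the ITT estimate, while the CACE stays near 3.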

In sum, we agree that simplicity and transparency are important; this is the major motivation for our detailed description of our assumptions in terms of readily understandable concepts, rather than commonly invoked and less intuitive conditions involving correlated error terms. However, this desire for simplicity must be balanced with the prudence of guarding against potential problems through careful design.

Although we agree in theory with much of what KZ espouse, some of their criticisms seem misguided. Moreover, their presentation of analyses with incomplete exposition regarding missing-data complications makes those analyses difficult for us to evaluate properly.


Berk and Xu

Berk and Xu (henceforth BX) raise several interesting questions about the way in which the evaluation was designed and implemented. Many of these are actually addressed in a preceding and lengthier article by us on this topic (Barnard et al. 2002). That article describes, for instance, MPR's efforts to reduce missing data, without which the missingness rates would surely have been much higher. It also describes special challenges faced in this study. But in any study with human subjects, it is difficult to completely eliminate missing data, particularly if one desires more interesting information than can be obtained from administrative data. One of the challenges for the collection of outcome test scores was the fact that children needed to be tested outside of the public school setting, which allowed the use of a standardized test not used by the public school system, thereby greatly reducing the possibility of bias resulting from teachers "teaching to the test." Consequently, parents had to bring their children to testing areas on weekends, a formidable hurdle for some, even when monetary incentives were used.

An important consideration with this study was that the evaluators took the initiative to investigate a program that was sponsored by an organization (SCSF) whose primary goal was to provide scholarships, not to do research. We feel that MPR did an admirable job convincing the SCSF to allow the study to be evaluated and raising money for this purpose, but this also meant that the evaluators could not control every aspect of the program. For example, the SCSF originally wanted to make the program available only to children from the lower-test-score schools, and the evaluators had to convince them to also allow a small percentage from the higher-test-score schools, to increase the generalizability of the results. The designs that BX suggest, even though they have certain desirable statistical properties under appropriate assumptions, were not an option with this study.

Related to this issue of the alignment of the evaluators' and program administrators' goals is the potential for reducing noncompliance rates. Some effort was made, in that help was provided to scholarship winners in finding an appropriate private school. Arguably, however, 100% compliance should not be the goal in such a study, given that we would not expect 100% compliance if the program were expanded and offered, say, by the government on a larger scale. Thus it may not be wise to use extreme efforts to increase compliance if such efforts would not be offered in the large-scale public case. With the current study, the opportunity exists to examine the noncompliance process for this program in this setting: take-up rates, predictors of noncompliance, and so on. To this end, some sort of subexperiment that randomizes a reasonable set of compliance incentives might be useful in similar studies in the future, if sample sizes are sufficient.

We are not convinced that we should have solicited actual prior distributions from educators and economists. In our experience, this is a confusing, difficult, and often unrewarding task. Moreover, school choice is such a politically charged and contentious issue that it seems wiser to try to use relatively diffuse and objective prior distributions.

We agree that results should be interpreted in terms of a yardstick of practical effects. We are unaware of the most appropriate such yardstick to use in this context. One side could claim that even tiny average effects matter, not only because of the cumulative effects on parts of society, but also because of possible nonadditivity of such effects, which could imply huge effects for some rare individuals and minute effects for all others. Others could argue that seemingly quite large effects have no long-run practical impact on quality of life. The results that we present provide a basic empirical starting point for these debates between people whose expertise is in this area.

We thank BX for highlighting our emphasis on the role of the assumptions in our analyses. Our hope is that we have presented the assumptions in such a way as to enable readers to make their own judgment about whether or not they are likely to be satisfied, rather than only stating them technically, which would implicitly make those important decisions for the interested but nontechnical readers.

BX point out a dearth of certain types of covariates. Part of the problem here is that we presented only a small portion of the data collected, because our model in its current form did not handle many covariates. However, there were indeed discussions between some members of our team and MPR staff, during the development of the questionnaire, about what types of variables could be included that would be predictive of missing data or compliance, as well as of outcomes. Some of the variables proposed were discarded because the surveyors' experience indicated they were subject to problems such as high measurement error. Also, not every variable discussed was included, because of a desire for a survey that would not be overly burdensome for the study families.

We raise a final, small semantic question: we wonder about the propriety of the label "Neyman–Rubin model" to describe a framework for causal inference that views observational studies and randomized experiments on the same continuum and, moreover, encourages Bayesian analyses of both. Neyman (1923) certainly seems to have been the first to use the formal notation of potential outcomes for randomization-based inference in randomized experiments, which was a truly major advance, but neither he nor anyone else, to the best of our knowledge, used this formal notation to define causal effects in observational studies until Rubin (1974, 1975), nor did he or anyone else advocate Bayesian inference for the unobserved potential outcomes until Rubin (1975, 1978), which is the perspective that we have implemented here.

Overall, we agree with BX about the paramount importance of study design, and we tried to design this study carefully, certainly more so than most social science analyses of secondary datasets like the PSID or CPS. We thank BX for their thoughtful commentary; they raise some important points.

We thank all of the discussants for their stimulating comments, which we believe have added emphasis to the importance of both the substantive topic of school vouchers and the development of appropriate methods to evaluate this and other such broken randomized experiments.

ADDITIONAL REFERENCES

Card, D., and Krueger, A. (1994), "Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania," American Economic Review, 84, 772–793.

Jo, B. (2002), "Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications" (with discussion), Journal of Educational and Behavioral Statistics, 27, 385–420.

Mealli, F., and Rubin, D. B. (2002), Discussion of "Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications," by B. Jo, Journal of Educational and Behavioral Statistics, 27, 411–415.

Mealli, F., and Rubin, D. B. (2003), "Assumptions Allowing the Estimation of Direct Causal Effects: Discussion of 'Healthy, Wealthy, and Wise? Tests for Direct Causal Paths Between Health and Socioeconomic Status,' by Adams et al.," Journal of Econometrics, 112, 79–87.

Rubin, D. B. (1975), "Bayesian Inference for Causality: The Importance of Randomization," in Proceedings of the Social Statistics Section, American Statistical Association, pp. 233–239.



An important consideration with this study was that the eval-uators took the initiativeto investigatea program that was spon-sored by an organization (SCSF) whose primary goal was toprovide scholarships not to do research We feel that MPRdid an admirable job convincing the SCSF to allow the studyto be evaluated and raising money for this purpose but thisalso meant that the evaluators could not control every aspectof the program For example the SCSF originally only wantedto make the program available to children from the lower-test-score schools and the evaluators had to convince them to allowalso a small percentage from those in higher-test-score schoolsto increase the generalizability of the results The designs thatBX suggest even though they have certain desirable statisticalproperties under appropriate assumptions were not an optionwith this study

Related to this issue of the alignment of the evaluatorsrsquo andprogram administratorsrsquogoals is the issue of the potential for re-ducing noncompliancerates Some effort was made in that helpwas provided to scholarship winners in nding an appropriateprivate school Arguably however 100 compliance shouldnot be the goal in such a study given that we would not ex-pect 100 compliance if the program were expanded and of-fered say by the government on a larger scale Thus it maynot be wise to use extreme efforts to increase compliance ifsuch efforts would not be offered in the large-scale public caseWith the current study the opportunity exists to examine thenoncompliance process for this program in this setting take-up rates predictors of noncompliance and so on To this endsome sort of subexperiment that randomizes a reasonable set ofcompliance incentives might be useful in similar studies in thefuture if sample sizes are suf cient

We are not convinced that we should have solicited actualprior distributions from educators and economists In our expe-rience this is a confusingdif cult and often unrewarding taskMoreover school choice is such a politically charged and con-tentious issue that it seems wiser to try to use relatively diffuseand objective prior distributions

We agree that results should be interpreted in terms of a yard-stick of practical effects We are unaware of the most appropri-

ate such yardstick to use in this context One side could claimthat even tiny average effects matter not only because of thecumulative effects on parts of society but also because of pos-sible nonadditivityof such effects which could imply huge ef-fects for some rare individualsand minute effects for all othersOthers could argue that seemingly quite large effects have nolong-run practical impact on quality of life The results that wepresent providea basic empirical starting point for these debatesbetween people whose expertise is in this area

We thank BX for highlighting our emphasis on the role ofthe assumptions in our analyses Our hope is that we have pre-sented the assumptions in such a way to enable readers to maketheir own judgment about whether or not they were likely to besatis ed rather than only stating them technicallywhich wouldimplicitly make those important decisions for the interested butnontechnical readers

BX point out a dearth of certain types of covariates Part ofthe problem here is that we presented only a small portion ofthe data collected due to the fact that our model in its currentform did not handle many covariates However there were in-deed discussions between some members of our team and MPRstaff during the development of the questionnaire about whattypes of variables could be included that would be predictive ofmissing data or compliance as well as outcomes Some of thevariables proposed were discarded due to the surveyorsrsquo expe-rience with them as subject to problems such as high measure-ment error Also not every variable discussed was included dueto a desire for a survey that would not be overly burdensome forthe study families

We raise a nal small semantic question We wonder aboutthe propriety of the label ldquoNeymanndashRubin modelrdquo to describe aframework for causal inference that views observationalstudiesand randomizedexperimentson the same continuumand more-over encourages Bayesian analyses of both Neyman (1923)certainly seems to have been the rst to use the formal nota-tion of outcomes for randomization-basedpotential inference inrandomized experiments which was a truly major advance butneitherhe nor anyoneelse to the best of our knowledgeused thisformal notation to de ne causal effects in observational studiesuntil Rubin (1974 1975) nor did he nor anyone else ever advo-cate Bayesian inference for the unobserved potential outcomesuntil Rubin (1975 1978) which is the perspective that we haveimplemented here

Overall we agree with BX about the paramount importanceof study design and we tried to design this study carefully cer-tainly more so than most social science analyses of secondarydatasets like the PSID or CPS We thank BX for their thoughtfulcommentary they raise some important points

We thank all of the discussants for stimulating commentswhich we believe have added emphasis to the importance ofboth the substantive topic of school vouchers and the devel-opment of appropriate methods to evaluate this and other suchbroken randomized experiments

ADDITIONAL REFERENCES

Card D and Krueger A (1994) ldquoMinimum Wages and Employment A CaseStudy of the Fast-Food Industry in New Jersey and Pennsylvaniardquo AmericanEconomic Review 84 772ndash793

Barnard et al Rejoinder 323

Jo B (2002) ldquoEstimation of Intervention Effects With Noncompliance Alter-native Model Speci cationsrdquo (with discussion) Journal of Educational andBehavioral Statistics 27 385ndash420

Mealli F and Rubin D B (2002) Discussion of ldquoEstimation of InterventionEffects With Noncompliance Alternative Model Speci cationsrdquo by B JoJournal of Educational and Behavioral Statistics 27 411ndash415

(2003) Assumptions Allowing the Estimation of Direct Causal Ef-fects Discussion of ldquoHealthy Wealthy and Wise Tests for Direct CausalPaths Between Health and Socioeconomic Statusrdquo by Adams et al Journalof Econometrics 112 79ndash87

Rubin D B (1975) ldquoBayesian Inference for Causality The Importance ofRandomizationrdquo in Proceedings of the Social Statistics Section AmericanStatistical Association pp 233ndash239

316 Journal of the American Statistical Association June 2003

a statistically significant difference in baseline reading scores between treatments and controls, even with the revised baseline weights. If we pool together students from single-child and multichild families, then the baseline difference is insignificant. This is another reason why we think it is more appropriate to use the broader sample that includes children from all families.)

Second, because of the inaccurate inference that there was a difference in baseline ability between treatments and controls, Mathematica's researchers were discouraged from analyzing (or at least from presenting) results that did not condition on baseline scores. This is unfortunate, because conditioning on baseline scores caused the researchers to drop from the sample all the students who were initially in kindergarten (because these students were not given baseline tests) and 11% of students initially in grades 1-4. Including students with missing baseline scores increases the sample by more than 40% and expands the population to which the results can be generalized. As explained later and in an article by Krueger and Zhu (2003), qualitatively different results are found if students with missing baseline scores are included in the sample. Because assignment to receive a voucher was random within lottery strata, estimates that do not condition on baseline scores are nonetheless unbiased. Moreover, it is inefficient to exclude students with missing baseline scores (most of whom had follow-up test scores), and such a sample restriction can potentially cause sample selection bias.
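A quick check against the sample sizes reported in the notes to Table 2 (1,455 year 1 students with baseline scores versus 2,080 without that restriction) confirms the magnitude:

\[
\frac{2080}{1455} \approx 1.43,
\]

that is, roughly 43% more students; the year 3 counts (1,250 versus 1,801) give a similar ratio.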

Of lesser importance is the fact that the PMPD design also complicates the calculation of standard errors. The matching of treatments and controls on selected covariates creates a dependence between paired observations. Moreover, if the propensity score model is misspecified, then the equation error also creates a dependence between observations. Researchers have not taken this dependence into account in the calculation of standard errors. We suspect, however, that this is likely to cause only a small understatement of the standard errors, because the covariates do not account for much of the residual variance of test scores, and because the PMPD sample is only about half of the overall sample. (We tried to compute the propensity score to adjust the standard errors ourselves, but were unable to replicate the PMPD model, because it was not described in sufficient detail and because the computer programs are proprietary. This also prevented us from replicating estimates in Table 4 of Barnard et al.)

The problems created by the complicated experimental design, particularly concerning the weights, lead us to reiterate the advice of Cochran and Cox (1957): "A good working rule is to use the simplest experimental design that meets the needs of the occasion." It seems to us that in this case the PMPD design introduced unnecessary complexity that inadvertently led to a consequential mistake in the computation of weights, with very little improvement in efficiency. This was not the fault of the architects of the PMPD design, who did not compute the weights, but it was related to the complexity of the design. Simplicity and transparency, in designs, methods, and procedures, can help avoid mistakes down the road, which are almost inevitable in large-scale empirical studies. Indeed, Mathematica recently informed us that they still do not have the baseline weights exactly correct, 5 years after random assignment.

2. INTENT-TO-TREAT ESTIMATION RESULTS

Barnard et al. devote much attention to addressing potential problems caused by missing data for the sample of students in grades 1-4 at baseline; this follows the practice of earlier reports by Mathematica, which restricted the analysis of student outcomes (but not parental responses, such as satisfaction) to those enrolled in grades 1-4 at baseline. In our opinion, a much more important substantive problem results from the exclusion of students in the kindergarten cohort, who were categorically dropped because they were not given baseline tests. Because assignment to treatment status was random (within strata), a simple comparison of means between treatments and controls, without conditioning on baseline scores, provides an unbiased estimate of the average treatment effect. Moreover, as Barnard et al. and others show, treatments and controls were well balanced in terms of baseline characteristics, so there is no reason to suspect that random assignment was somehow corrupted.

Table 2 presents regression estimates of the ITT effect using various samples. Because we cannot replicate the Bayesian estimates without knowing the propensity score, we present conventional ITT estimates. For each entry in the first column, we regressed the test score percentile rank on a voucher offer dummy, 30 dummies indicating lottery strata (block × family size × high/low school), and baseline test scores. The sample is limited to those with baseline test data. Our results differ from the ITT estimates in Table 5 of Barnard et al. because we use the revised weights and a more comprehensive sample that also includes students from multichild families. Barnard et al. report that their Bayesian estimates are "generally more stable" than the conventional ITT estimates, "which in some cases are not even credible," but the extreme ITT estimates that they refer to, for 4th graders from high-achieving schools, are based on only 15 students. For the samples that pool students across grades, the results of the conventional ITT estimates are a priori credible and probably not statistically different from their Bayesian estimates.
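To make the column-1 specification concrete, here is a minimal sketch in Python on synthetic data, with hypothetical variable names; the actual analysis also used the revised sampling weights and the family-clustered bootstrap standard errors described in the notes to Table 2, which are omitted here.

```python
# Sketch of the column-1 ITT regression: percentile-rank score on a voucher
# offer dummy, 30 lottery-strata dummies, and baseline test scores.
# Synthetic data and hypothetical names, not the actual study data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 1455                                 # year 1 subsample with baseline scores
df = pd.DataFrame({
    "stratum": rng.integers(0, 30, n),   # block x family size x high/low school
    "voucher": rng.integers(0, 2, n),    # randomized voucher offer
    "base_read": rng.uniform(1, 99, n),  # baseline percentile ranks
    "base_math": rng.uniform(1, 99, n),
})
df["score"] = (1.3 * df["voucher"] + 0.4 * df["base_read"]
               + 0.3 * df["base_math"] + rng.normal(0, 15, n))

fit = smf.ols("score ~ voucher + C(stratum) + base_read + base_math",
              data=df).fit()
print(fit.params["voucher"], fit.bse["voucher"])  # ITT estimate and its SE
```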

In the second column we use the same sample as in column 1, but omit the baseline test score from the model. In the third column we expand the sample to include those with missing baseline scores. In the fourth column we continue to include students with missing baseline scores, but restrict the sample to those initially in low-achieving public schools.

We focus mainly on the results for Black students and students from low-achieving public schools, because the effect of offering a voucher on either reading or math scores for the overall sample is always statistically insignificant, and because most public policy attention has focused on these two groups.

Although it has not been made explicit in previous studies of these data, Black students have been defined as children with a non-Hispanic Black/African-American mother. We use this definition and a broader one to probe the sensitivity of the results.

Omitting baseline scores has little qualitative effect on the estimates when the same sample is used (compare columns 1 and 2); the coefficient typically changes by no more than half a standard error. For year 3 scores for Black students, for example, the treatment effect on the composite score is 5.2 points (t = 3.2) when controlling for baseline scores and 5.0 points (t = 2.6) when not controlling. This stability is expected in a randomized experiment.


Table 2. Estimated Treatment Effects With and Without Controlling for Baseline Scores: Coefficient on Voucher Dummy From Test Score Regressions for Various Samples

Columns: (1) subsample with baseline scores, controls for baseline scores; (2) subsample with baseline scores, omits baseline scores; (3) full sample, omits baseline scores; (4) subsample with applicant school "Low," omits baseline scores.

                                                 (1)            (2)            (3)            (4)
First follow-up test
Composite  Overall                          1.27  (.96)    .42 (1.28)    -.33 (1.06)    -.60 (1.06)
           Mother Black, non-Hispanic       4.31* (1.28)   3.64* (1.73)   2.66 (1.42)    1.75 (1.57)
           Either parent Black, non-Hisp.   3.55* (1.24)   2.52 (1.73)    1.38 (1.42)     .53 (1.55)
Reading    Overall                          1.11 (1.04)    .03 (1.36)   -1.13 (1.17)   -1.17 (1.18)
           Mother Black, non-Hispanic       3.42* (1.59)   2.60 (1.98)    1.38 (1.74)     .93 (1.84)
           Either parent Black, non-Hisp.   2.68 (1.50)    1.49 (1.97)    -.09 (1.71)    -.53 (1.81)
Math       Overall                          1.44 (1.17)    .80 (1.45)     .47 (1.17)    -.03 (1.15)
           Mother Black, non-Hispanic       5.28* (1.52)   4.81* (1.93)   4.01* (1.54)   2.68 (1.63)
           Either parent Black, non-Hisp.   4.42* (1.48)   3.54 (1.90)    2.86 (1.51)    1.59 (1.63)

Third follow-up test
Composite  Overall                           .90 (1.14)    .36 (1.40)    -.38 (1.19)     .10 (1.16)
           Mother Black, non-Hispanic       5.24* (1.63)   5.03* (1.92)   2.65 (1.68)    3.09 (1.67)
           Either parent Black, non-Hisp.   4.70* (1.60)   4.27* (1.92)   1.87 (1.68)    1.90 (1.67)
Reading    Overall                           .25 (1.26)   -.42 (1.51)   -1.00 (1.25)    -.08 (1.24)
           Mother Black, non-Hispanic       3.64 (1.87)    3.22 (2.19)    1.25 (1.83)    2.06 (1.83)
           Either parent Black, non-Hisp.   3.37 (1.86)    2.75 (2.18)     .76 (1.83)    1.23 (1.82)
Math       Overall                          1.54 (1.30)   1.15 (1.53)     .24 (1.33)     .29 (1.28)
           Mother Black, non-Hispanic       6.84* (1.86)   6.84* (2.09)   4.05* (1.86)   4.13* (1.85)
           Either parent Black, non-Hisp.   6.04* (1.83)   5.78* (2.09)   2.97 (1.85)    2.58 (1.82)

NOTE: Standard errors are in parentheses. The dependent variable is the test score NPR (national percentile rank); the reported coefficient is the coefficient on the voucher offer dummy. All regressions control for 30 randomization-strata dummies; models in column (1) also control for baseline math and reading test scores. Bootstrap standard errors are robust to correlation in residuals among students in the same family. An asterisk indicates that the absolute t ratio exceeds 1.96.

Sample sizes in year 1 for the subsample with baseline scores/without are: overall, 1455/2080; mother Black, 623/883; either parent Black, 684/968.

Sample sizes in year 3 for the subsample with baseline scores/without are: overall, 1250/1801; mother Black, 519/733; either parent Black, 572/807.

Sample sizes in year 1 for the subsample from "Low" schools without baseline scores are: overall, 1802; mother Black, 770; either parent Black, 843.

Sample sizes in year 3 for the subsample from "Low" schools without baseline scores are: overall, 1569; mother Black, 647; either parent Black, 712.

When the sample is expanded to include those with missing baseline scores in column 3, however, the ITT effect falls almost in half in year 3 for Black students, to 2.65 points, with a t ratio of 1.58. The effect on the math score is statistically significant, as Barnard et al. emphasize. Qualitatively similar results are found if baseline covariates, such as mother's education and family income, are included as regressors (see Krueger and Zhu 2003).
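Each t ratio here is simply the Table 2 coefficient divided by its standard error; for the column 3, year 3 composite estimate for Black students,

\[
t = \frac{2.65}{1.68} \approx 1.58 .
\]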

The classification of Black students used in the study is somewhat idiosyncratic. Race and ethnicity were collected in a single question in the parental survey, contrary to the OMB guidelines for government surveys, so it is impossible to identify Blacks of Hispanic origin (of which there are a large number in New York). In addition, the survey did not directly ask about the child's race or ethnicity, so children's race must be

inferred from parents' race/ethnicity. A broader definition of Black students, and one that probably comes closer to the definition used in government surveys and most other educational research, would also include those who have a Black father in the sample of Black students. According to the 1990 Census, race was reported as Black for 85% of the children with a Black father and a Hispanic mother (the overwhelming non-Black portion of the sample) in the New York metropolitan region.

Hence we also present results for a sample in which either parent is listed as "Black, non-Hispanic." This increases the sample of Black students by about 10%, and the results are even weaker for this sample. For example, the ITT estimate using the more comprehensive sample of Black students enrolled in grades K-4 at baseline is 1.87 points (t = 1.11) on the third-year follow-up composite test, although even with this sample


the effect on math is significant at the 10% level in year 1, as Barnard et al. have found, and almost significant in year 3. So we conclude that the effect of the opportunity to use a private school voucher on the composite score for the most comprehensive sample of Black students is insignificantly different from 0, although it is possible that there was initially a small beneficial effect on the math score for Black students.

In the final column we present results for the subsample of students originally enrolled in schools with average test scores below the median score in New York City. In each case the results for this subsample are quite similar to those for the full sample, and a formal test of the null hypothesis that students from low- and high-achieving schools benefit equally from vouchers is never close to rejecting. It also appears very unlikely that the differences between the treatment effects for applicants from low- and high-achieving schools in Barnard et al.'s Table 4 are statistically significant either.

3. WHAT WAS BROKEN

We agree with Barnard et al. that the experiment was broken, in the sense that attrition and missing data were common. Previous analyses were also strained, if not broken, for their neglect of the cohort of students originally in kindergarten, who were in third grade at the end of the experiment and whose follow-up test scores were ignored. Including these students qualitatively alters the results for Black students. The experiment was also broken in the sense that years passed before correct baseline weights were computed.

We disagree, however, with the interpretation that the experiment was broken because compliance was less than 100%. This

depends on the question that one is interested in answering. Because most interest in the experiment for policy purposes centers on the impact of offering vouchers on achievement, not on compelling students to use vouchers, we think the ITT estimates, which reflect inevitable partial usage of vouchers, are most relevant (see also Rouse 1998; Angrist et al. 2003). Moreover, New York had a higher voucher take-up rate than the experiments in Dayton and the District of Columbia, so one could argue that the New York experiment provides an upper bound estimate of the effect of offering vouchers. On the other hand, if the goal is to use the experiment to estimate the effect of attending private school on achievement, then such methods as instrumental variables or those used by Barnard et al. are necessary to estimate the parameter of interest. We consider this of secondary interest in this case, however.

ADDITIONAL REFERENCES

Angrist, J., Bettinger, E., Bloom, E., King, E., and Kremer, M. (2003), "Vouchers for Private Schooling in Colombia: Evidence From a Randomized Natural Experiment," American Economic Review, 92, 1535-1558.

Cochran, W., and Cox, G. (1957), Experimental Designs (2nd ed.), New York: Wiley.

Krueger, A., and Zhu, P. (2003), "Another Look at the New York City School Voucher Experiment," American Behavioral Scientist, forthcoming; also available from the Industrial Relations Section, Princeton University, at http://www.irs.princeton.edu.

Peterson, P., Myers, D., and Howell, W. (1998), "An Evaluation of the New York City Scholarships Program: The First Year," Mathematica Policy Research, Inc.; available at http://www.ksg.harvard.edu/pepg/pdf/ny1rpt.pdf.

Rouse, C. (1998), "Private School Vouchers and Student Achievement: An Evaluation of the Milwaukee Parental Choice Program," The Quarterly Journal of Economics, 113, 553-602.

Comment

Richard A. BERK and Hongquan XU

1. INTRODUCTION

The article by Barnard, Frangakis, Hill, and Rubin (hereafter BFHR) is a virtuoso performance. By applying the Neyman-Rubin model of causal effects and building on a series of important works on the estimation and interpretation of causal effects (e.g., Angrist, Imbens, and Rubin 1996), BFHR manage to extract a reasonable set of findings for a "broken" randomized experiment. The article more generally underscores the important difference between estimating relationships in a consistent manner and interpreting those estimates in causal terms; obtaining consistent estimates is only part of the enterprise. Finally, the article emphasizes that assumptions made so that proper estimates may be obtained are not just technical moves of convenience. Rather, they are statements about how the empirical world is supposed to work. As such, they need to be examined

Richard A. Berk is Professor and Hongquan Xu is Assistant Professor, Department of Statistics, University of California, Los Angeles, CA 90095 (E-mail: berk@stat.ucla.edu).

concretely with reference to the phenomena being studied, evaluated empirically whenever possible, and at a minimum subjected to sensitivity tests.

As battle-tested academics, we can certainly quibble here and there about some aspects of the article. For example, might it not have made sense to construct priors from the educators and economists who were responsible for designing the intervention? Moreover, might it not have made sense to interpret the results using a yardstick of treatment effects that are large enough to matter in practical terms? Nonetheless, we doubt that overall we could have done as well as BFHR did. Moreover, most of our concerns depend on features of the data that cannot be known from a distance. We would have had to carefully examine the data ourselves. As a result, we focus on why a virtuoso performance was needed to begin with. Why was this trip necessary?


2. NONCOMPLIANCE

The implementation of randomized field experiments, as desirable as they are, invites any number of well-known problems (Berk 1990). One of these problems is that human subjects often do not do what they are told, even when they say they will. There is a recent scholarly literature on noncompliance (Metry and Meyer 1999; Johnson 2000) and a citation trail leading back to evaluations of the social programs of the War on Poverty (Haveman 1977). Some noncompliance can be expected except when there is unusual control over the study subjects. Our recent randomized trial of the inmate classification system used by the California Department of Corrections is one example (Berk, Ladd, Graziano, and Baek 2003).

An obvious question follows: Why does it seem that the individuals who designed this experiment were caught by surprise? As best we can tell, little was done to reduce noncompliance, and, more important for the analyses of BFHR, apparently almost no data were collected on factors that might be explicitly related to noncompliance (Hill, Rubin, and Thomas 2000).

Without knowing far more about the design stages of the experiment, we find it difficult to be very specific. But from the literatures on compliance and school choice, the following are probably illustrative indicators relevant for parents who would not be inclined to use a school voucher (for other forms of noncompliance, other measures would be needed):

1. Whether the adults in the household are employed outside the home during the day.

2. Whether there is any child care at the private schools before classes begin and after they end, for children of employed parents.

3. A household's ability to pay for private school beyond the value of the voucher.

4. Travel time to the nearest private schools.

5. Availability of transportation to those schools.

6. Religious preference of the household and religious affiliation of the private school.

From these examples, and others that individuals closer to the study would know about, one can imagine a variety of measures to reduce noncompliance. For instance, key problems for any working parent are child care before and after school and round-trip school transportation. These might be addressed directly with supplemental support. If no such resources could be obtained, then the study design might be altered to screen out, before random assignment, households for whom compliance was likely to be problematic.

3. MISSING DATA

No doubt BFHR know more about the subtleties of the missing data than we do. Nevertheless, the same issues arise for missing data that arise for noncompliance. Surely problems with missing data could have been anticipated. Factors related to nonresponse to particular items could have been explicitly measured, and prevention strategies perhaps could have been deployed.

4. DESIGN EFFICIENCY

A carefully designed experiment not only provides a solid basis for statistical inference, but also gives credible evidence for causation. The New York School Choice Scholarships Program (NYSCSP) was apparently the first good example of a randomized experiment to examine the potential benefits of vouchers for private schools. Offering a scholarship (treatment) randomly through a lottery can (1) provide a socially acceptable way to ration scarce resources, (2) protect against undesirable selection bias, and (3) produce balance between the observed and unobserved variables that may affect the response. In addition to the randomization and blocking, NYSCSP implemented a PMPD (propensity matched pairs design) to choose a control group from a large candidate pool and to balance many covariates explicitly. Table 2 of BFHR shows that various background variables were balanced properly between the treatment group and the control group.
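As a rough sketch of the core idea behind such a propensity matched pairs design (hypothetical data and variable names, not the actual NYSCSP pipeline, which also balanced many covariates and predictors of compliance and missingness):

```python
# Match each treated unit to the nearest unused control on an estimated
# propensity score; a greedy nearest-neighbor version for illustration.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
X = rng.normal(size=(n, 3))                      # covariates
z = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))  # treatment indicator

# Estimate propensity scores with a logistic regression
ps = sm.Logit(z, sm.add_constant(X)).fit(disp=0).predict(sm.add_constant(X))

treated = np.flatnonzero(z == 1)
controls = list(np.flatnonzero(z == 0))
pairs = []
for t in treated:
    if not controls:
        break
    j = min(controls, key=lambda c: abs(ps[c] - ps[t]))
    pairs.append((t, j))
    controls.remove(j)                           # match without replacement
```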

But might it have been possible to do a bit better? BFHR did not mention that the sample size of the applicant's school was not balanced, although winners from high and low schools were balanced between the treatments and the controls. An ideal design would have had equal winners from "bad" (or low) and "good" (or high) public schools. In the NYSCSP, 85% of the winners were from bad schools and 15% were from good schools, to achieve the social goals of the School Choice Scholarships Foundation (Hill et al. 2000, p. 158). Bad schools are defined as "those for which the average test scores were below the median test scores for the city" (Hill et al. 2000, p. 158). A direct consequence is that the estimates from the good schools have much larger variation than those from the bad schools, which was evident in BFHR's Tables 4-6. The confidence intervals for good schools are wider.

Instead of classifying schools into two classes (bad and good), it might have been helpful to classify them into four classes: very bad, bad, good, and very good. To estimate the effect of applicant's school (assuming linear effects), an optimal design would assign half of the winners to the very bad schools, the other half to the very good schools, and none to the two middle classes. Avoiding the middle schools enlarges the distance between bad and good schools and also reduces the source variation within the bad and good schools, and thus reduces the variation in the estimates. To be more consistent with the goals of the School Choice Scholarships Foundation, an alternative design might take 75% of winners from very bad schools, 10% from bad schools, 5% from good schools, and 10% from very good schools. This design would have a higher design efficiency than the original design, especially because the applicant's school was the most important design variable after family size (Hill et al. 2000, p. 160). Of course, people on the ground at the time may have considered these options and rejected them. Our point is that with better planning, it might have been possible to squeeze more efficiency out of the study design. Such an outcome would have made the job undertaken by BFHR a bit easier.
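The efficiency claim can be illustrated with a small calculation: for a linear effect, the variance of the estimated slope is proportional to 1/(n * Var_w(x)), so allocations that push winners toward the extreme classes shrink it. The helper function below and the 0-3 numeric coding of school quality are assumptions for display only.

```python
# Compare 1 / Var_w(x), the slope-variance factor (up to sigma^2 / n),
# across candidate allocations of winners to school-quality classes.
import numpy as np

def design_variance_factor(levels, weights):
    """Return 1 / Var_w(x) for design points `levels` with weights `weights`."""
    x, w = np.asarray(levels, float), np.asarray(weights, float)
    m = np.sum(w * x)
    return 1.0 / np.sum(w * (x - m) ** 2)

# Actual two-class split (85% "bad," 15% "good"), coded at the midpoints
# of the lower and upper halves of the 0-3 quality scale
print(design_variance_factor([0.5, 2.5], [0.85, 0.15]))                # ~1.96

# BX's four-class alternative: 75% / 10% / 5% / 10%
print(design_variance_factor([0, 1, 2, 3], [0.75, 0.10, 0.05, 0.10]))  # ~1.05

# Optimal for a linear effect: half at each extreme
print(design_variance_factor([0, 3], [0.5, 0.5]))                      # ~0.44
```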

5. CONCLUSIONS

Scientific problems often provide a fertile ground for important statistical developments. BFHR's article, and the literature on which they depend, are excellent illustrations. However, from


the perspective of public policy, the goal of program evaluation is to obtain usefully precise estimates of program impacts while using the most robust means of analysis available. Simplicity is also important, both for data analysts, who must do the work, and policy makers, who must interpret the results. It follows that large investments at the front end of an evaluation, in research design, measurement, and study integrity, are essential. Virtuoso statistical analyses can certainly help when these investments are insufficient or when they fail, but they must not be seen as an inexpensive alternative to doing the study well to begin with.

ADDITIONAL REFERENCES

Angrist, J., Imbens, G. W., and Rubin, D. B. (1996), "Identification of Causal Effects Using Instrumental Variables" (with discussion), Journal of the American Statistical Association, 91, 441-472.

Berk, R. A. (1990), "What Your Mother Never Told You About Randomized Field Experiments," in Community-Based Care of People With AIDS: Developing a Research Agenda, AHCPR Conference Proceedings, Washington, DC: U.S. Department of Health and Human Services.

Berk, R. A., Ladd, H., Graziano, H., and Baek, J. (2003), "A Randomized Experiment Testing Inmate Classification Systems," Journal of Criminology and Public Policy, 2, 215-242.

Haveman, R. H. (ed.) (1977), A Decade of Federal Antipoverty Programs: Achievements, Failures, and Lessons, New York: Academic Press.

Hill, J. L., Rubin, D. B., and Thomas, N. (2000), "The Design of the New York School Choice Scholarships Program Evaluation," in Research Design: Donald Campbell's Legacy (Vol. II), ed. L. Bickman, Thousand Oaks, CA: Sage.

Johnson, S. B. (2000), "Compliance Behavior in Clinical Trials: Error or Opportunity?" in Promoting Adherence to Medical Treatment in Chronic Childhood Illness: Concepts, Methods, and Interventions, ed. D. Drotar, Mahwah, NJ: Lawrence Erlbaum.

Metry, J., and Meyer, U. A. (eds.) (1999), Drug Regimen Compliance: Issues in Clinical Trials and Patient Management, New York: Wiley.

Rejoinder

John BARNARD, Constantine E. FRANGAKIS, Jennifer L. HILL, and Donald B. RUBIN

We thank the editorial board and the thoughtful group of discussants that they arranged to comment on our article. The discussants have raised many salient issues and offered useful extensions of our work. We use this rejoinder primarily to address a few points of contention.

Muthén, Jo, and Brown

Muthén, Jo, and Brown (henceforth MJB) provide an enlightening discussion of an alternate and potentially complementary approach to treatment effect estimation. The growth models that they discuss are for data more extensive than what we analyzed, and so are beyond the scope of our article. Nevertheless, MJB propose an interesting idea for looking for interactions in longitudinal experimental data, and we look forward to future applications of the method to appropriate datasets.

Although models that incorporate both our compliance principal strata and latent growth trajectories are theoretically possible (assuming multiple years of outcome data), and could potentially yield an extremely rich description of the types of treatment effects manifested, the more time points that exist, the more complicated the compliance patterns can become (at least without further assumptions), and it may be difficult to find a dataset rich enough to support all of the interactions. Given a choice, we feel that it is probably more beneficial to handle real observed issues than to search for interactions with latent trajectories, but certainly it would always be nice to be able to do both. Of course, such a search for interactions in trajectories is more beneficial when there is enough prior knowledge about the scientific background to include additional structural assumptions that can support robust estimation.

MJB make an interesting point about the exploration of potential violations of exclusion for never-takers. Such analyses have been done in past work by some of the authors (Hirano et al. 2000; Frangakis, Rubin, and Zhou 2002), but were not pursued for our study, because the robustness of such methods is still an issue in examples as complex as this one.

Regarding the trade-offs between the joint latent ignorability and compound exclusion assumptions versus the joint assumptions of standard ignorability and the standard exclusion restriction, this topic is up for debate. We would like to point out, however, that these standard structural assumptions come with their own level of arbitrariness. These issues have been explored by Jo (2002) and Mealli and Rubin (2002, 2003).

We thank MJB for providing a thought-provoking discussion of our model and assumptions, as well as ample fodder for additional directions that can be explored in this arena. Clearly, this general category of work has stimulated much new and exciting modeling.

Krueger and Zhu

Krueger and Zhu (henceforth KZ) focus primarily on the underlying study and on analyses that either we did not do or that used data we did not have. Clearly, they are interested in the substantive issues, which is admirable, but we cannot be responsible for analyses that we neither did nor were consulted on, and we are not prepared to comment on what analyses we might have undertaken with data that we did not have.

KZ state that "there is no substitute for probing the definition and concepts that underlie the data." Although we are not entirely sure what this refers to in particular, if it means that good science is more important than fancy techniques, then we certainly agree.

Another comment by KZ is that "it is desirable to use the most recent, comprehensive data for the widest sample possible." We agree that it is desirable to have access to the most comprehensive data available, but it is not necessary to use all of these data in any given analysis. Here we used only first-year data, because these were the only data available to us at the


time that we were building the model. We used only single-child families, as explained in the article, because regrettably no compliance data were collected for individual children in the multichild families.

The exclusion of kindergarten children could be considered more of a judgment call regarding the quality of the pretest scores, not only for efficiency, but also for the reasonableness of our assumptions about noncompliance and missing data. Nothing in theory would prevent us from including these children and omitting these extra covariates.

These types of trade-offs (more data vs. more appropriate data) are common in statistical analyses, and the benefits are well known, for example, for avoiding model extrapolation or when some subsets of data do not mean the same thing as others. In fact, Krueger "ignored data" in his article on minimum wage with David Card (Card and Krueger 1994) by limiting analyses to restaurants from a similar geographic region, in the hope that employees and employers across the two comparison states would be faced with similar economic conditions, rather than looking at all restaurants in these states. This is a form of implicit matching, which we believe represented a good choice. Surely there are times when focusing on subsets of data produces more reliable results.

Furthermore, we would like to point out that KZ's assertion that "because assignment to treatment status was random (within strata), a simple comparison of means between treatments and controls, without conditioning on baseline scores, provides an unbiased estimate of the average treatment effect" is simply false, because there are missing outcomes. Even if there is perfect compliance, a simple comparison of means across treatment groups when there are missing outcomes will not lead, in general, to an unbiased estimate of the treatment effect. Assumptions need to be made about the missing-data process.
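A small simulation makes the point concrete; the numbers and the missingness mechanism below are invented purely for illustration.

```python
# With outcome-dependent missingness, the complete-case difference in means
# is biased for the true effect even under perfect compliance.
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
z = rng.binomial(1, 0.5, n)              # randomized offer
y = 50 + 2.0 * z + rng.normal(0, 10, n)  # true treatment effect = 2.0

# Suppose low scorers in the control group respond less often
p_respond = np.where((z == 0) & (y < 50), 0.4, 0.8)
observed = rng.random(n) < p_respond

est = y[observed & (z == 1)].mean() - y[observed & (z == 0)].mean()
print(est)  # well below the true 2.0 (here even negative): biased
```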

When noncompliance exists in addition to the missing outcome data, Frangakis and Rubin (1999) have shown that estimation even of the ITT effect requires additional assumptions. Moreover, the ITT estimate, generally considered a conservative estimate of the treatment effect, can actually be anticonservative in this scenario. Finally, contrary to what KZ seem to imply in their final paragraph, instrumental variables methods alone are not sufficient to address the problems of both noncompliance and missing data.

KZ advocate "simplicity and transparency." We wholeheartedly agree, but disagree with the examples they use to defend their stance. Their first argument is that the PMPD did not yield strong efficiency gains and that we did not account for the PMPD in the calculation of standard errors. It may well be that precision gains were relatively modest; however, the analyses that they present are by no means conclusive in this regard. First, as with KZ's other ITT analyses, we have no clue as to how missing outcome data are handled. In addition, the PMPD was also used to balance compliance and missingness. It is unclear, however, how the gains in precision due to matching the missingness and noncompliance are displayed, if at all, in KZ's results. Finally, KZ did not account for the experimental design when estimating their results. Failure to account properly for the design typically results in estimates that overstate the variances of treatment effect estimates. In this design, it is the

correlation between matched pair units that leads the sampling variances to be smaller than their estimates.
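The matched-pairs variance identity makes the direction of this effect explicit. Writing \(\bar{Y}_T\) and \(\bar{Y}_C\) for the treated and control means over \(n\) matched pairs, and \(\rho\) for the within-pair correlation (notation introduced here for illustration),

\[
\operatorname{Var}(\bar{Y}_T - \bar{Y}_C)
  = \frac{\sigma_T^2 + \sigma_C^2 - 2\rho\,\sigma_T\sigma_C}{n},
\]

which for \(\rho > 0\) is smaller than the independent-samples value \((\sigma_T^2 + \sigma_C^2)/n\); an analysis that ignores the pairing therefore overstates the sampling variance.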

Even if precision gains were truly modest, they of course were unknown to us at the time we debated the use of the PMPD design. KZ's argument is like the complaint of a cross-country motorist who experiences no car problems during his long drive and then bemoans the money wasted on a spare tire purchased before the trip began. We prefer to avoid potential problems by using careful design.

Importantly, KZ's criticism that we ignored this matching in the analysis is incorrect. The matching was done based on the estimated propensity score, so it follows from Rubin (1978) that assignment is ignorable conditionally, given the covariates that were used to estimate it. Thus the Bayesian analysis that we conduct, which conditions on the estimated propensity score, is an approximately valid Bayesian analysis with respect to the matched assignment, assuming that our modeled relationship between the outcomes and the estimated propensity score captures the major features of the relationship to the covariates.
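In potential-outcomes notation, the ignorability condition being invoked can be sketched as

\[
p\bigl(Z \mid X, Y(0), Y(1)\bigr) = p(Z \mid X),
\]

for the covariates \(X\) used to estimate the propensity score; a Bayesian analysis that conditions on \(X\), or approximately on the estimated score \(e(X)\) when the outcome model captures the dependence on the covariates, thus treats the matched assignment as ignorable. (This restates the standard condition for clarity; it is not a new result.)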

KZ's second argument is that use of the PMPD "led to grave consequences," because sampling weights were miscalculated by MPR. It seems to us odd to maintain, as KZ do, that the PMPD "caused" this error; nor is it clear that use of another design would have avoided the error. In any case, we have rerun our analyses, this time generalizing to the population of which the children in our analyses are representative (by setting constant weights in step 3 of Sec. 8.1.1). For this population, all results are very similar to our original results. These results are posted as Appendix C at http://biosun01.biostat.jhsph.edu/~cfrangak/papers/sc.

A third argument revolves around the protracted debate between use of ITT versus treatment-targeted measures, such as CACE. We addressed both ITT and CACE, which allows the reader to focus on one or the other or both, depending on which effect is expected to generalize to another environment: the ITT, or "sociological," effect of accepting or not accepting an offered voucher, or the CACE, or "structural" or "scientific," effect of using a voucher. One could argue that the effect of attending (CACE) is the more generalizable, whereas the sociological effect (ITT) is more easily influenced, for instance, by "hype" about how well the program works. That is, the ITT estimate will not reflect the "effect of offering" if the next time the program is offered the compliance rates change (which seems likely). We prefer to examine both, to provide a more comprehensive picture.

In sum, we agree that simplicity and transparency are important; this is the major motivation for our detailed description of our assumptions in terms of readily understandable concepts, rather than commonly invoked and less intuitive conditions involving correlated error terms. However, this desire for simplicity must be balanced with the prudence of guarding against potential problems through careful design.

Although we agree in theory with much of what KZ espouse, some of their criticisms seem misguided. Moreover, their presentation of analyses with incomplete exposition regarding missing-data complications makes these analyses difficult for us to evaluate properly.


Berk and Xu

Berk and Xu (henceforth BX) raise several interesting questions about the way in which the evaluation was designed and implemented. Many of these are actually addressed in a preceding and lengthier article by us on this topic (Barnard et al. 2002). That article describes, for instance, MPR's efforts to reduce missing data, without which the missingness rates would surely have been much higher. It also describes special challenges faced in this study. But in any study with human subjects, it is difficult to completely eliminate missing data, particularly if one desires more interesting information than can be obtained from administrative data. One of the challenges for the collection of outcome test scores was the fact that children needed to be tested outside of the public school setting, which allowed the use of a standardized test not used by the public school system, thereby greatly reducing the possibility of bias resulting from teachers "teaching to the test." Consequently, parents had to bring their children to testing areas on weekends, a formidable hurdle for some, even when monetary incentives were used.

An important consideration with this study was that the evaluators took the initiative to investigate a program that was sponsored by an organization (SCSF) whose primary goal was to provide scholarships, not to do research. We feel that MPR did an admirable job convincing the SCSF to allow the study to be evaluated and raising money for this purpose, but this also meant that the evaluators could not control every aspect of the program. For example, the SCSF originally wanted to make the program available only to children from the lower-test-score schools, and the evaluators had to convince them to also allow a small percentage from those in higher-test-score schools, to increase the generalizability of the results. The designs that BX suggest, even though they have certain desirable statistical properties under appropriate assumptions, were not an option with this study.

Related to this issue of the alignment of the evaluators' and program administrators' goals is the issue of the potential for reducing noncompliance rates. Some effort was made, in that help was provided to scholarship winners in finding an appropriate private school. Arguably, however, 100% compliance should not be the goal in such a study, given that we would not expect 100% compliance if the program were expanded and offered, say, by the government on a larger scale. Thus it may not be wise to use extreme efforts to increase compliance if such efforts would not be offered in the large-scale public case. With the current study, the opportunity exists to examine the noncompliance process for this program in this setting: take-up rates, predictors of noncompliance, and so on. To this end, some sort of subexperiment that randomizes a reasonable set of compliance incentives might be useful in similar studies in the future, if sample sizes are sufficient.

We are not convinced that we should have solicited actual prior distributions from educators and economists. In our experience, this is a confusing, difficult, and often unrewarding task. Moreover, school choice is such a politically charged and contentious issue that it seems wiser to try to use relatively diffuse and objective prior distributions.

We agree that results should be interpreted in terms of a yardstick of practical effects. We are unaware of the most appropriate

such yardstick to use in this context. One side could claim that even tiny average effects matter, not only because of the cumulative effects on parts of society, but also because of possible nonadditivity of such effects, which could imply huge effects for some rare individuals and minute effects for all others. Others could argue that seemingly quite large effects have no long-run practical impact on quality of life. The results that we present provide a basic empirical starting point for these debates between people whose expertise is in this area.

We thank BX for highlighting our emphasis on the role of the assumptions in our analyses. Our hope is that we have presented the assumptions in such a way as to enable readers to make their own judgment about whether or not they were likely to be satisfied, rather than only stating them technically, which would implicitly make those important decisions for the interested but nontechnical readers.

BX point out a dearth of certain types of covariates. Part of the problem here is that we presented only a small portion of the data collected, due to the fact that our model in its current form did not handle many covariates. However, there were indeed discussions between some members of our team and MPR staff during the development of the questionnaire about what types of variables could be included that would be predictive of missing data or compliance, as well as outcomes. Some of the variables proposed were discarded due to the surveyors' experience with them as subject to problems such as high measurement error. Also, not every variable discussed was included, due to a desire for a survey that would not be overly burdensome for the study families.

We raise a final, small, semantic question. We wonder about the propriety of the label "Neyman-Rubin model" to describe a framework for causal inference that views observational studies and randomized experiments on the same continuum and, moreover, encourages Bayesian analyses of both. Neyman (1923) certainly seems to have been the first to use the formal notation of potential outcomes for randomization-based inference in randomized experiments, which was a truly major advance, but neither he nor anyone else, to the best of our knowledge, used this formal notation to define causal effects in observational studies until Rubin (1974, 1975), nor did he nor anyone else ever advocate Bayesian inference for the unobserved potential outcomes until Rubin (1975, 1978), which is the perspective that we have implemented here.

Overall, we agree with BX about the paramount importance of study design, and we tried to design this study carefully, certainly more so than most social science analyses of secondary datasets like the PSID or CPS. We thank BX for their thoughtful commentary; they raise some important points.

We thank all of the discussants for stimulating comments, which we believe have added emphasis to the importance of both the substantive topic of school vouchers and the development of appropriate methods to evaluate this and other such broken randomized experiments.

ADDITIONAL REFERENCES

Card, D., and Krueger, A. (1994), "Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania," American Economic Review, 84, 772-793.


Jo, B. (2002), "Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications" (with discussion), Journal of Educational and Behavioral Statistics, 27, 385-420.

Mealli, F., and Rubin, D. B. (2002), Discussion of "Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications," by B. Jo, Journal of Educational and Behavioral Statistics, 27, 411-415.

——— (2003), "Assumptions Allowing the Estimation of Direct Causal Effects," Discussion of "Healthy, Wealthy, and Wise? Tests for Direct Causal Paths Between Health and Socioeconomic Status," by Adams et al., Journal of Econometrics, 112, 79-87.

Rubin, D. B. (1975), "Bayesian Inference for Causality: The Importance of Randomization," in Proceedings of the Social Statistics Section, American Statistical Association, pp. 233-239.

Krueger and Zhu Comment 317

Table 2 Estimated Treatment Effects With and Without Controlling for Baseline Scores Coefcient on Voucher Dummy From Test ScoreRegression for Various Samples

Subsample Subsample Subsamplewith baseline with baseline applicant school

scores controls for scores omits Full sample Low omitsTest Sample baseline scores baseline scores omits baseline scores baseline scores

First follow-up testComposite Overall 127 42 iexcl33 iexcl60

(96) (128) (106) (106)Mother Black Non-Hispanic 431 364 266 175

(128) (173) (142) (157)Either parent Black Non-Hispanic 355 252 138 53

(124) (173) (142) (155)

Reading Overall 111 03 iexcl113 iexcl117(104) (136) (117) (118)

Mother Black Non-Hispanic 342 260 138 93(159) (198) (174) (184)

Either parent Black Non-Hispanic 268 149 iexcl09 iexcl53(150) (197) (171) (181)

Math Overall 144 80 47 iexcl03(117) (145) (117) (115)

Mother Black Non-Hispanic 528 481 401 268(152) (193) (154) (163)

Either parent Black Non-Hispanic 442 354 286 159(148) (190) (151) (163)

Third follow-up testComposite Overall 90 36 iexcl38 10

(114) (140) (119) (116)Mother Black Non-Hispanic 524 503 265 309

(163) (192) (168) (167)Either parent Black Non-Hispanic 470 427 187 190

(160) (192) (168) (167)

Reading Overall 25 iexcl42 iexcl100 iexcl08(126) (151) (125) (124)

Mother Black Non-Hispanic 364 322 125 206(187) (219) (183) (183)

Either parent Black Non-Hispanic 337 275 76 123(186) (218) (183) (182)

Math Overall 154 115 24 29(130) (153) (133) (128)

Mother Black Non-Hispanic 684 684 405 413(186) (209) (186) (185)

Either parent Black Non-Hispanic 604 578 297 258(183) (209) (185) (182)

NOTE Standard errors are in parentheses Dependent variable is test score NPR Reported coef cient is coef cient on voucher offer dummy All regressions control for 30 randomization stratadummies models in the rst column also control for baseline math and reading test scoresBootstrap standard errors are robust to correlation in residuals among students in the same family Bold font indicates that the absolute t ratio exceeds 196

Sample sizes in year 1 for subsample with baseline scoresor without are overall 14552080 mother black 623883 either parent black 684968

Sample sizes in year 3 for subsample with baseline scoresor without are overall 12501801 mother black 519733 either parent black 572807

Sample sizes in year 1 for subsample from low schools without baseline scores are overall 1802 mother black 770 either parent black 843

Sample sizes in year 3 for subsample from low schools without baseline scores are overall 1569 mother black 647 either parent black 712

randomized experiment. When the sample is expanded to include those with missing baseline scores in column 3, however, the ITT effect falls almost in half in year 3 for Black students, to 2.65 points, with a t ratio of 1.58. The effect on the math score is statistically significant, as Barnard et al. emphasize. Qualitatively similar results are found if baseline covariates such as mother's education and family income are included as regressors (see Krueger and Zhu 2003).
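For reference, the t ratio quoted here follows directly from the Table 2 entries (coefficient 2.65, standard error 1.68):

$$
t = \frac{\hat{\beta}}{\widehat{\mathrm{SE}}(\hat{\beta})} = \frac{2.65}{1.68} \approx 1.58 < 1.96,
$$

so the year-3 composite effect for Black students in the full sample does not reach significance at the 5% level.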

The classification of Black students used in the study is somewhat idiosyncratic. Race and ethnicity were collected in a single question in the parental survey (contrary to the OMB guidelines for government surveys), so it is impossible to identify Blacks of Hispanic origin (of whom there are a large number in New York). In addition, the survey did not directly ask about the child's race or ethnicity, so children's race must be inferred from parents' race/ethnicity. A broader definition of Black students, and one that probably comes closer to the definition used in government surveys and most other educational research, would also include in the sample of Black students those who have a Black father. According to the 1990 Census, race was reported as Black for 85% of the children with a Black father and a Hispanic mother (the overwhelming non-Black portion of the sample) in the New York metropolitan region.

Hence we also present results for a sample in which either parent is listed as "Black, non-Hispanic." This increases the sample of Black students by about 10%, and the results are even weaker for this sample. For example, the ITT estimate using the more comprehensive sample of Black students enrolled in grades K–4 at baseline is 1.87 points (t = 1.11) on the third-year follow-up composite test, although even with this sample the effect on math is significant at the 10% level in year 1, as Barnard et al. have found, and almost significant in year 3. So we conclude that the effect of the opportunity to use a private school voucher on the composite score for the most comprehensive sample of Black students is insignificantly different from 0, although it is possible that there was initially a small beneficial effect on the math score for Black students.

In the final column, we present results for the subsample of students originally enrolled in schools with average test scores below the median score in New York City. In each case, the results for this subsample are quite similar to those for the full sample, and a formal test of the null hypothesis that students from low- and high-achieving schools benefit equally from vouchers is never close to rejecting. It also appears very unlikely that the differences between the treatment effects for applicants from low- and high-achieving schools in Barnard et al.'s Table 4 are statistically significant either.
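One simple form of the test alluded to here (not necessarily the exact one KZ performed) exploits the fact that the low-school and high-school subsamples are disjoint, so the two subsample estimates are independent:

$$
z = \frac{\hat{\beta}_{\mathrm{low}} - \hat{\beta}_{\mathrm{high}}}{\sqrt{\mathrm{SE}_{\mathrm{low}}^{2} + \mathrm{SE}_{\mathrm{high}}^{2}}},
$$

with rejection at the 5% level requiring |z| > 1.96. Equivalently, one can pool the samples and test the coefficient on a voucher-offer by low-school interaction in the regression.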

3. WHAT WAS BROKEN

We agree with Barnard et al. that the experiment was broken in the sense that attrition and missing data were common. Previous analyses were also strained, if not broken, by their neglect of the cohort of students originally in kindergarten, who were in third grade at the end of the experiment and whose follow-up test scores were ignored. Including these students qualitatively alters the results for Black students. The experiment was also broken in the sense that years passed before correct baseline weights were computed.

We disagree, however, with the interpretation that the experiment was broken because compliance was less than 100%. This depends on the question that one is interested in answering. Because most interest in the experiment for policy purposes centers on the impact of offering vouchers on achievement, not on compelling students to use vouchers, we think the ITT estimates, which reflect inevitable partial usage of vouchers, are most relevant (see also Rouse 1998; Angrist et al. 2003). Moreover, New York had a higher voucher take-up rate than the experiments in Dayton and the District of Columbia, so one could argue that the New York experiment provides an upper-bound estimate of the effect of offering vouchers. On the other hand, if the goal is to use the experiment to estimate the effect of attending private school on achievement, then such methods as instrumental variables or those used by Barnard et al. are necessary to estimate the parameter of interest. We consider this of secondary interest in this case, however.

ADDITIONAL REFERENCES

Angrist, J., Bettinger, E., Bloom, E., King, E., and Kremer, M. (2003), "Vouchers for Private Schooling in Colombia: Evidence From a Randomized Natural Experiment," American Economic Review, 92, 1535–1558.

Cochran, W., and Cox, G. (1957), Experimental Designs (2nd ed.), New York: Wiley.

Krueger, A., and Zhu, P. (2003), "Another Look at the New York City School Voucher Experiment," American Behavioral Scientist, forthcoming. Also available from Industrial Relations Section, Princeton University, at http://www.irs.princeton.edu.

Peterson, P., Myers, D., and Howell, W. (1998), "An Evaluation of the New York City Scholarships Program: The First Year," Mathematica Policy Research, Inc., available at http://www.ksg.harvard.edu/pepg/pdf/ny1rpt.pdf.

Rouse, C. (1998), "Private School Vouchers and Student Achievement: An Evaluation of the Milwaukee Parental Choice Program," The Quarterly Journal of Economics, 113, 553–602.

Comment

Richard A. BERK and Hongquan XU

1. INTRODUCTION

The article by Barnard, Frangakis, Hill, and Rubin (hereafter BFHR) is a virtuoso performance. By applying the Neyman–Rubin model of causal effects and building on a series of important works on the estimation and interpretation of causal effects (e.g., Angrist, Imbens, and Rubin 1996), BFHR manage to extract a reasonable set of findings for a "broken" randomized experiment. The article more generally underscores the important difference between estimating relationships in a consistent manner and interpreting those estimates in causal terms; obtaining consistent estimates is only part of the enterprise. Finally, the article emphasizes that assumptions made so that proper estimates may be obtained are not just technical moves of convenience. Rather, they are statements about how the empirical world is supposed to work. As such, they need to be examined

Richard A. Berk is Professor and Hongquan Xu is Assistant Professor, Department of Statistics, University of California, Los Angeles, CA 90095 (E-mail: berk@stat.ucla.edu).

concretely, with reference to the phenomena being studied; evaluated empirically whenever possible; and, at a minimum, subjected to sensitivity tests.

As battle-tested academics, we can certainly quibble here and there about some aspects of the article. For example, might it not have made sense to construct priors from the educators and economists who were responsible for designing the intervention? Moreover, might it not have made sense to interpret the results using a yardstick of treatment effects that are large enough to matter in practical terms? Nonetheless, we doubt that, overall, we could have done as well as BFHR did. Moreover, most of our concerns depend on features of the data that cannot be known from a distance; we would have had to examine the data carefully ourselves. As a result, we focus on why a virtuoso performance was needed to begin with. Why was this trip necessary?


2. NONCOMPLIANCE

The implementation of randomized field experiments, as desirable as they are, invites any number of well-known problems (Berk 1990). One of these problems is that human subjects often do not do what they are told, even when they say they will. There is a recent scholarly literature on noncompliance (Metry and Meyer 1999; Johnson 2000) and a citation trail leading back to evaluations of the social programs of the War on Poverty (Haveman 1977). Some noncompliance can be expected except when there is unusual control over the study subjects. Our recent randomized trial of the inmate classification system used by the California Department of Corrections is one example (Berk, Ladd, Graziano, and Baek 2003).

An obvious question follows: Why does it seem that the individuals who designed this experiment were caught by surprise? As best we can tell, little was done to reduce noncompliance, and, more important for the analyses of BFHR, apparently almost no data were collected on factors that might be explicitly related to noncompliance (Hill, Rubin, and Thomas 2000).

Without knowing far more about the design stages of the experiment, we find it difficult to be very specific. But from the literatures on compliance and school choice, the following are probably illustrative indicators relevant for parents who would not be inclined to use a school voucher; for other forms of noncompliance, other measures would be needed:

1. Whether the adults in the household are employed outside the home during the day

2. Whether there is any child care at the private schools before classes begin and after they end, for children of employed parents

3. A household's ability to pay for private school beyond the value of the voucher

4. Travel time to the nearest private schools

5. Availability of transportation to those schools

6. Religious preference of the household and religious affiliation of the private school

From these examples, and others that individuals closer to the study would know about, one can imagine a variety of measures to reduce noncompliance. For instance, key problems for any working parent are child care before and after school and round-trip school transportation. These might be addressed directly with supplemental support. If no such resources could be obtained, then the study design might be altered to screen out, before random assignment, households for whom compliance was likely to be problematic.

3. MISSING DATA

No doubt BFHR know more about the subtleties of the missing data than we do. Nevertheless, the same issues arise for missing data that arise for noncompliance. Surely problems with missing data could have been anticipated. Factors related to nonresponse to particular items could have been explicitly measured, and prevention strategies perhaps could have been deployed.

4. DESIGN EFFICIENCY

A carefully designed experiment not only provides a solid basis for statistical inference, but also gives credible evidence for causation. The New York School Choice Scholarships Program (NYSCSP) was apparently the first good example of a randomized experiment to examine the potential benefits of vouchers for private schools. Offering a scholarship (treatment) randomly through a lottery can (1) provide a socially acceptable way to ration scarce resources, (2) protect against undesirable selection bias, and (3) produce balance between the observed and unobserved variables that may affect the response. In addition to the randomization and blocking, NYSCSP implemented a PMPD (propensity matched pairs design) to choose a control group from a large candidate pool and to balance many covariates explicitly. Table 2 of BFHR shows that various background variables were balanced properly between the treatment group and the control group.

But might it have been possible to do a bit better? BFHR did not mention that the sample size of the applicant's school was not balanced, although winners from high and low schools were balanced between the treatments and the controls. An ideal design would have had equal numbers of winners from "bad" (or low) and "good" (or high) public schools. In the NYSCSP, 85% of the winners were from bad schools and 15% were from good schools, to achieve the social goals of the School Choice Scholarships Foundation (Hill et al. 2000, p. 158). Bad schools are defined as "those for which the average test scores were below the median test scores for the city" (Hill et al. 2000, p. 158). A direct consequence is that the estimates from the good schools have much larger variation than those from the bad schools, which was evident in BFHR's Tables 4–6: the confidence intervals for good schools are wider.

Instead of classifying schools into two classes (bad and good), it might have been helpful to classify them into four classes: very bad, bad, good, and very good. To estimate the effect of applicant's school (assuming linear effects), an optimal design would assign half of the winners to the very bad schools, the other half to the very good schools, and none to the two middle classes. Avoiding the middle schools enlarges the distance between bad and good schools, reduces the source variation within the bad and good groups, and thus reduces the variation in the estimates. To be more consistent with the goals of the School Choice Scholarships Foundation, an alternative design might take 75% of winners from very bad schools, 10% from bad schools, 5% from good schools, and 10% from very good schools. This design would have a higher design efficiency than the original design, especially because the applicant's school was the most important design variable after family size (Hill et al. 2000, p. 160). Of course, people on the ground at the time may have considered these options and rejected them. Our point is that, with better planning, it might have been possible to squeeze more efficiency out of the study design. Such an outcome would have made the job undertaken by BFHR a bit easier.
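To make the efficiency comparison concrete, here is a small numerical sketch under stylized assumptions of our own (not BX's): applicant-school quality x is scaled to [0, 1] and taken as uniform within each city quartile, the outcome is linear in x, and the variance of the estimated slope is proportional to 1/(n Var(x)). Under these assumptions, the 75/10/5/10 allocation does spread x more than an 85/15 below-/above-median split, and the all-extremes design spreads it most.

```python
# Stylized design-efficiency comparison for estimating a linear effect of
# applicant-school quality x. Assumptions (ours, for illustration only):
# x is Uniform within each city quartile of school quality, on [0, 1];
# Var(slope estimate) is proportional to 1 / (n * Var(x)).
import numpy as np

QUARTILES = [(0.00, 0.25), (0.25, 0.50), (0.50, 0.75), (0.75, 1.00)]

def var_x(weights):
    """Var(x) when fraction weights[k] of winners is drawn uniformly
    from quartile k: within-quartile variance plus between-quartile variance."""
    mids = np.array([(a + b) / 2 for a, b in QUARTILES])
    within = np.array([(b - a) ** 2 / 12 for a, b in QUARTILES])
    w = np.asarray(weights)
    mean = w @ mids
    return w @ within + w @ (mids - mean) ** 2

designs = {
    "original 85/15 (below/above median)": [0.425, 0.425, 0.075, 0.075],
    "BX alternative 75/10/5/10":           [0.75, 0.10, 0.05, 0.10],
    "extreme halves (optimal for slope)":  [0.50, 0.00, 0.00, 0.50],
}

base = var_x(designs["original 85/15 (below/above median)"])
for name, w in designs.items():
    v = var_x(w)
    print(f"{name}: Var(x) = {v:.4f}; Var(slope) vs. original = {base / v:.2f}")
```

Under these assumptions, the slope variance falls to roughly 0.8 of the original under the 75/10/5/10 allocation and to roughly a third under the all-extremes design, which is the optimal-design logic BX invoke.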

5. CONCLUSIONS

Scientific problems often provide a fertile ground for important statistical developments. BFHR's article, and the literature on which it depends, are excellent illustrations. However, from the perspective of public policy, the goal of program evaluation is to obtain usefully precise estimates of program impacts while using the most robust means of analysis available. Simplicity is also important, both for the data analysts who must do the work and for the policy makers who must interpret the results. It follows that large investments at the front end of an evaluation (in research design, measurement, and study integrity) are essential. Virtuoso statistical analyses can certainly help when these investments are insufficient or when they fail, but they must not be seen as an inexpensive alternative to doing the study well to begin with.

ADDITIONAL REFERENCES

Angrist, J., Imbens, G. W., and Rubin, D. B. (1996), "Identification of Causal Effects Using Instrumental Variables" (with discussion), Journal of the American Statistical Association, 91, 441–472.

Berk, R. A. (1990), "What Your Mother Never Told You About Randomized Field Experiments," in Community-Based Care of People With AIDS: Developing a Research Agenda, AHCPR Conference Proceedings, Washington, DC: U.S. Department of Health and Human Services.

Berk, R. A., Ladd, H., Graziano, H., and Baek, J. (2003), "A Randomized Experiment Testing Inmate Classification Systems," Journal of Criminology and Public Policy, 2, 215–242.

Haveman, R. H. (ed.) (1977), A Decade of Federal Antipoverty Programs: Achievements, Failures, and Lessons, New York: Academic Press.

Hill, J. L., Rubin, D. B., and Thomas, N. (2000), "The Design of the New York School Choice Scholarships Program Evaluation," in Research Design: Donald Campbell's Legacy (Vol. II), ed. L. Bickman, Thousand Oaks, CA: Sage.

Johnson, S. B. (2000), "Compliance Behavior in Clinical Trials: Error or Opportunity?" in Promoting Adherence to Medical Treatment in Chronic Childhood Illness: Concepts, Methods, and Interventions, ed. D. Drotar, Mahwah, NJ: Lawrence Erlbaum.

Metry, J., and Meyer, U. A. (eds.) (1999), Drug Regimen Compliance: Issues in Clinical Trials and Patient Management, New York: Wiley.

Rejoinder

John BARNARD, Constantine E. FRANGAKIS, Jennifer L. HILL, and Donald B. RUBIN

We thank the editorial board and the thoughtful group of discussants that they arranged to comment on our article. The discussants have raised many salient issues and offered useful extensions of our work. We use this rejoinder primarily to address a few points of contention.

Muthén, Jo, and Brown

Muthén, Jo, and Brown (henceforth MJB) provide an enlightening discussion of an alternate and potentially complementary approach to treatment effect estimation. The growth models that they discuss are for data more extensive than what we analyzed, and so are beyond the scope of our article. Nevertheless, MJB propose an interesting idea for looking for interactions in longitudinal experimental data, and we look forward to future applications of the method to appropriate datasets.

Although models that incorporate both our compliance principal strata and latent growth trajectories are theoretically possible (assuming multiple years of outcome data), and could potentially yield an extremely rich description of the types of treatment effects manifested, the more time points that exist, the more complicated the compliance patterns can become (at least without further assumptions), and it may be difficult to find a dataset rich enough to support all of the interactions. Given a choice, we feel that it is probably more beneficial to handle real, observed issues than to search for interactions with latent trajectories, but certainly it would always be nice to be able to do both. Of course, such a search for interactions in trajectories is more beneficial when there is enough prior knowledge about the scientific background to include additional structural assumptions that can support robust estimation.

MJB make an interesting point about the exploration of potential violations of exclusion for never-takers. Such analyses have been done in past work by some of the authors (Hirano et al. 2000; Frangakis, Rubin, and Zhou 2002), but were not pursued for our study because the robustness of such methods is still an issue in examples as complex as this one.

Regarding the trade-offs between the joint latent ignorability and compound exclusion assumptions versus the joint assumptions of standard ignorability and the standard exclusion restriction: this topic is up for debate. We would like to point out, however, that these standard structural assumptions come with their own level of arbitrariness. These issues have been explored by Jo (2002) and Mealli and Rubin (2002, 2003).

We thank MJB for providing a thought-provoking discussion of our model and assumptions, as well as ample fodder for additional directions that can be explored in this arena. Clearly, this general category of work has stimulated much new and exciting modeling.

Krueger and Zhu

Krueger and Zhu (henceforth KZ) focus primarily on the underlying study and on analyses that we either did not do or that used data we did not have. Clearly, they are interested in the substantive issues, which is admirable, but we cannot be responsible for analyses that we neither did nor were consulted on, and we are not prepared to comment on what analyses we might have undertaken with data that we did not have.

KZ state that "there is no substitute for probing the definition and concepts that underlie that data." Although we are not entirely sure what this refers to in particular, if it means that good science is more important than fancy techniques, then we certainly agree.

Another comment by KZ is that "it is desirable to use the most recent comprehensive data for the widest sample possible." We agree that it is desirable to have access to the most comprehensive data available, but it is not necessary to use all of these data in any given analysis. Here we used only first-year data, because these were the only data available to us at the time that we were building the model. We used only single-child families, as explained in the article, because, regrettably, no compliance data were collected for individual children in the multichild families.

The exclusion of kindergarten children could be considered more of a judgment call regarding the quality of the pretest scores, not only for efficiency, but also for the reasonableness of our assumptions about noncompliance and missing data. Nothing in theory would prevent us from including these children and omitting these extra covariates.

These types of trade-offs (more data vs. more appropriate data) are common in statistical analyses, and the benefits are well known, for example, for avoiding model extrapolation or when some subsets of data do not mean the same thing as others. In fact, Krueger "ignored data" in his article on the minimum wage with David Card (Card and Krueger 1994) by limiting analyses to restaurants from a similar geographic region, in the hope that employees and employers across the two comparison states would face similar economic conditions, rather than looking at all restaurants in those states. This is a form of implicit matching, which we believe represented a good choice. Surely there are times when focusing on subsets of data produces more reliable results.

Furthermore, we would like to point out that KZ's assertion that "because assignment to treatment status was random (within strata), a simple comparison of means between treatments and controls, without conditioning on baseline scores, provides an unbiased estimate of the average treatment effect" is simply false, because there are missing outcomes. Even if there is perfect compliance, a simple comparison of means across treatment groups when there are missing outcomes will not, in general, lead to an unbiased estimate of the treatment effect. Assumptions need to be made about the missing-data process.
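The point is easy to verify by simulation. The following sketch is ours, not an analysis of the NYSCSP data: potential outcomes are generated with a known average treatment effect, every subject complies, and outcomes are then made missing at rates that depend on the realized score (e.g., low scorers are less likely to show up for follow-up testing). The complete-case difference in means is biased even though assignment is random.

```python
# Simulation: with randomized assignment and perfect compliance, a
# complete-case difference in means is biased when missingness depends
# on the outcome. Purely illustrative; all numbers are invented.
import numpy as np

rng = np.random.default_rng(1)
n, true_ate = 200_000, 2.0

z = rng.integers(0, 2, n)            # randomized offer
y0 = rng.normal(50, 10, n)           # control potential outcome
y1 = y0 + true_ate                   # constant treatment effect
y = np.where(z == 1, y1, y0)         # realized outcome

# Invented response model: low scorers are less likely to be tested.
p_respond = 1 / (1 + np.exp(-(y - 50) / 5))
observed = rng.random(n) < p_respond

cc_diff = y[observed & (z == 1)].mean() - y[observed & (z == 0)].mean()
print(f"true ATE = {true_ate:.2f}, complete-case estimate = {cc_diff:.2f}")
```

Here the outcome-dependent response inflates the observed control mean more than the observed treated mean, so the complete-case estimate lands noticeably below the true effect of 2.0, despite randomization and full compliance.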

When noncompliance exists in addition to the missing outcome data, Frangakis and Rubin (1999) have shown that estimation even of the ITT effect requires additional assumptions. Moreover, the ITT estimate, generally considered a conservative estimate of the treatment effect, can actually be anticonservative in this scenario. Finally, contrary to what KZ seem to imply in their final paragraph, instrumental variables methods alone are not sufficient to address the problems of both noncompliance and missing data.

KZ advocate "simplicity and transparency." We wholeheartedly agree, but disagree with the examples they use to defend their stance. Their first argument is that the PMPD did not yield strong efficiency gains and that we did not account for the PMPD in the calculation of standard errors. It may well be that precision gains were relatively modest; however, the analyses that they present are by no means conclusive in this regard. First, as with KZ's other ITT analyses, we have no clue as to how missing outcome data are handled. In addition, the PMPD was also used to balance compliance and missingness. It is unclear, however, how the gains in precision due to matching on the missingness and noncompliance are displayed, if at all, in KZ's results. Finally, KZ did not account for the experimental design when estimating their results. Failure to account properly for the design typically results in estimates that overstate the variances of treatment effect estimates. In this design, it is the correlation between matched-pair units that leads the sampling variances to be smaller than their estimates.

Even if precision gains were truly modest, they of course were unknown to us at the time we debated the use of the PMPD design. KZ's argument is like the complaint of a cross-country motorist who experiences no car problems during his long drive and then bemoans the money wasted on a spare tire purchased before the trip began. We prefer to avoid potential problems by using careful design.

Importantly, KZ's criticism that we ignored this matching in the analysis is incorrect. The matching was done based on the estimated propensity score, so it follows from Rubin (1978) that assignment is ignorable conditionally, given the covariates that were used to estimate it. Thus the Bayesian analysis that we conduct, which conditions on the estimated propensity score, is an approximately valid Bayesian analysis with respect to the matched assignment, assuming that our modeled relationship between the outcomes and the estimated propensity score captures the major features of the relationship to the covariates.
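A minimal sketch of the matching step being described, under our own simplified assumptions (a generic covariate matrix, logistic-regression propensity scores via scikit-learn, and greedy nearest-neighbor pairing without replacement); the actual PMPD used in the study involved additional design features not shown here.

```python
# Minimal sketch of a propensity matched pairs design (PMPD)-style step:
# estimate propensity scores by logistic regression, then pair each
# treated unit with the nearest unmatched control on the estimated score.
import numpy as np
from sklearn.linear_model import LogisticRegression

def matched_pairs(X, z):
    """X: (n, p) covariates; z: (n,) with 1 = treated, 0 = control pool.
    Returns a list of (treated_index, control_index) pairs."""
    score = LogisticRegression(max_iter=1000).fit(X, z).predict_proba(X)[:, 1]
    treated = np.where(z == 1)[0]
    controls = list(np.where(z == 0)[0])
    pairs = []
    # Match hardest-to-match treated units (highest scores) first.
    for t in treated[np.argsort(-score[treated])]:
        if not controls:                       # control pool exhausted
            break
        j = min(controls, key=lambda c: abs(score[c] - score[t]))
        pairs.append((int(t), int(j)))
        controls.remove(j)                     # matching without replacement
    return pairs

# Toy usage with simulated covariates (purely hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
z = (rng.random(500) < 1 / (1 + np.exp(-X[:, 0]))).astype(int)
print(matched_pairs(X, z)[:5])
```

Conditioning subsequent analyses on the estimated score (or on the matched-pair structure) is then what licenses the ignorability argument made above.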

KZ's second argument is that use of the PMPD "led to grave consequences" because sampling weights were miscalculated by MPR. It seems odd to us to maintain, as KZ do, that the PMPD "caused" this error; nor is it clear that use of another design would have avoided the error. In any case, we have rerun our analyses, this time generalizing to the population of which the children in our analyses are representative (by setting constant weights in step 3 of Sec. 8.1.1). For this population, all results are very similar to our original results. These results are posted as Appendix C at http://biosun01.biostat.jhsph.edu/~cfrangak/papers/sc.

A third argument revolves around the protracted debate between use of ITT versus treatment-targeted measures such as CACE. We addressed both ITT and CACE, which allows the reader to focus on one or the other or both, depending on which effect is expected to generalize to another environment: the ITT, or "sociological," effect of accepting or not accepting an offered voucher, or the CACE, or "structural" or "scientific," effect of using a voucher. One could argue that the effect of attending (CACE) is the more generalizable, whereas the sociological effect (ITT) is more easily influenced, for instance, by "hype" about how well the program works. That is, the ITT estimate will not reflect the "effect of offering" if, the next time the program is offered, the compliance rates change (which seems likely). We prefer to examine both, to provide a more comprehensive picture.

In sum, we agree that simplicity and transparency are important; this is the major motivation for our detailed description of our assumptions in terms of readily understandable concepts, rather than commonly invoked and less intuitive conditions involving correlated error terms. However, this desire for simplicity must be balanced with the prudence of guarding against potential problems through careful design.

Although we agree in theory with much of what KZ espouse, some of their criticisms seem misguided. Moreover, their presentation of analyses with incomplete exposition regarding missing-data complications makes these analyses difficult for us to evaluate properly.


Berk and Xu

Berk and Xu (henceforth BX) raise several interesting questions about the way in which the evaluation was designed and implemented. Many of these are actually addressed in a preceding and lengthier article by us on this topic (Barnard et al. 2002). That article describes, for instance, MPR's efforts to reduce missing data, without which the missingness rates would surely have been much higher. It also describes special challenges faced in this study. But in any study with human subjects, it is difficult to completely eliminate missing data, particularly if one desires more interesting information than can be obtained from administrative data. One of the challenges for the collection of outcome test scores was the fact that children needed to be tested outside of the public school setting, which allowed the use of a standardized test not used by the public school system, thereby greatly reducing the possibility of bias resulting from teachers "teaching to the test." Consequently, parents had to bring their children to testing areas on weekends, a formidable hurdle for some, even when monetary incentives were used.

An important consideration with this study was that the evaluators took the initiative to investigate a program that was sponsored by an organization (SCSF) whose primary goal was to provide scholarships, not to do research. We feel that MPR did an admirable job convincing the SCSF to allow the study to be evaluated and raising money for this purpose, but this also meant that the evaluators could not control every aspect of the program. For example, the SCSF originally wanted to make the program available only to children from the lower-test-score schools, and the evaluators had to convince them to also allow a small percentage from those in higher-test-score schools, to increase the generalizability of the results. The designs that BX suggest, even though they have certain desirable statistical properties under appropriate assumptions, were not an option with this study.

Related to this issue of the alignment of the evaluators' and program administrators' goals is the issue of the potential for reducing noncompliance rates. Some effort was made, in that help was provided to scholarship winners in finding an appropriate private school. Arguably, however, 100% compliance should not be the goal in such a study, given that we would not expect 100% compliance if the program were expanded and offered, say, by the government on a larger scale. Thus it may not be wise to use extreme efforts to increase compliance if such efforts would not be offered in the large-scale public case. With the current study, the opportunity exists to examine the noncompliance process for this program in this setting: take-up rates, predictors of noncompliance, and so on. To this end, some sort of subexperiment that randomizes a reasonable set of compliance incentives might be useful in similar studies in the future, if sample sizes are sufficient.

We are not convinced that we should have solicited actual prior distributions from educators and economists. In our experience, this is a confusing, difficult, and often unrewarding task. Moreover, school choice is such a politically charged and contentious issue that it seems wiser to try to use relatively diffuse and objective prior distributions.

We agree that results should be interpreted in terms of a yardstick of practical effects. We are unaware of the most appropriate such yardstick to use in this context. One side could claim that even tiny average effects matter, not only because of the cumulative effects on parts of society, but also because of possible nonadditivity of such effects, which could imply huge effects for some rare individuals and minute effects for all others. Others could argue that seemingly quite large effects have no long-run practical impact on quality of life. The results that we present provide a basic empirical starting point for these debates between people whose expertise is in this area.

We thank BX for highlighting our emphasis on the role of the assumptions in our analyses. Our hope is that we have presented the assumptions in such a way as to enable readers to make their own judgment about whether or not they are likely to be satisfied, rather than only stating them technically, which would implicitly make those important decisions for the interested but nontechnical readers.

BX point out a dearth of certain types of covariates. Part of the problem here is that we presented only a small portion of the data collected, because our model in its current form did not handle many covariates. However, there were indeed discussions between some members of our team and MPR staff, during the development of the questionnaire, about what types of variables could be included that would be predictive of missing data or compliance, as well as of outcomes. Some of the variables proposed were discarded because of the surveyors' experience with them as subject to problems such as high measurement error. Also, not every variable discussed was included, because of a desire for a survey that would not be overly burdensome for the study families.

We raise a final, small semantic question. We wonder about the propriety of the label "Neyman–Rubin model" to describe a framework for causal inference that views observational studies and randomized experiments on the same continuum and, moreover, encourages Bayesian analyses of both. Neyman (1923) certainly seems to have been the first to use the formal notation of potential outcomes for randomization-based inference in randomized experiments, which was a truly major advance, but neither he nor anyone else, to the best of our knowledge, used this formal notation to define causal effects in observational studies until Rubin (1974, 1975), nor did he nor anyone else ever advocate Bayesian inference for the unobserved potential outcomes until Rubin (1975, 1978), which is the perspective that we have implemented here.

Overall, we agree with BX about the paramount importance of study design, and we tried to design this study carefully, certainly more so than most social science analyses of secondary datasets like the PSID or CPS. We thank BX for their thoughtful commentary; they raise some important points.

We thank all of the discussants for their stimulating comments, which we believe have added emphasis to the importance of both the substantive topic of school vouchers and the development of appropriate methods to evaluate this and other such broken randomized experiments.

ADDITIONAL REFERENCES

Card, D., and Krueger, A. (1994), "Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania," American Economic Review, 84, 772–793.

Jo, B. (2002), "Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications" (with discussion), Journal of Educational and Behavioral Statistics, 27, 385–420.

Mealli, F., and Rubin, D. B. (2002), Discussion of "Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications," by B. Jo, Journal of Educational and Behavioral Statistics, 27, 411–415.

Mealli, F., and Rubin, D. B. (2003), "Assumptions Allowing the Estimation of Direct Causal Effects": Discussion of "Healthy, Wealthy, and Wise? Tests for Direct Causal Paths Between Health and Socioeconomic Status," by Adams et al., Journal of Econometrics, 112, 79–87.

Rubin, D. B. (1975), "Bayesian Inference for Causality: The Importance of Randomization," in Proceedings of the Social Statistics Section, American Statistical Association, pp. 233–239.


We are not convinced that we should have solicited actualprior distributions from educators and economists In our expe-rience this is a confusingdif cult and often unrewarding taskMoreover school choice is such a politically charged and con-tentious issue that it seems wiser to try to use relatively diffuseand objective prior distributions

We agree that results should be interpreted in terms of a yard-stick of practical effects We are unaware of the most appropri-

ate such yardstick to use in this context One side could claimthat even tiny average effects matter not only because of thecumulative effects on parts of society but also because of pos-sible nonadditivityof such effects which could imply huge ef-fects for some rare individualsand minute effects for all othersOthers could argue that seemingly quite large effects have nolong-run practical impact on quality of life The results that wepresent providea basic empirical starting point for these debatesbetween people whose expertise is in this area

We thank BX for highlighting our emphasis on the role ofthe assumptions in our analyses Our hope is that we have pre-sented the assumptions in such a way to enable readers to maketheir own judgment about whether or not they were likely to besatis ed rather than only stating them technicallywhich wouldimplicitly make those important decisions for the interested butnontechnical readers

BX point out a dearth of certain types of covariates Part ofthe problem here is that we presented only a small portion ofthe data collected due to the fact that our model in its currentform did not handle many covariates However there were in-deed discussions between some members of our team and MPRstaff during the development of the questionnaire about whattypes of variables could be included that would be predictive ofmissing data or compliance as well as outcomes Some of thevariables proposed were discarded due to the surveyorsrsquo expe-rience with them as subject to problems such as high measure-ment error Also not every variable discussed was included dueto a desire for a survey that would not be overly burdensome forthe study families

We raise a nal small semantic question We wonder aboutthe propriety of the label ldquoNeymanndashRubin modelrdquo to describe aframework for causal inference that views observationalstudiesand randomizedexperimentson the same continuumand more-over encourages Bayesian analyses of both Neyman (1923)certainly seems to have been the rst to use the formal nota-tion of outcomes for randomization-basedpotential inference inrandomized experiments which was a truly major advance butneitherhe nor anyoneelse to the best of our knowledgeused thisformal notation to de ne causal effects in observational studiesuntil Rubin (1974 1975) nor did he nor anyone else ever advo-cate Bayesian inference for the unobserved potential outcomesuntil Rubin (1975 1978) which is the perspective that we haveimplemented here

Overall we agree with BX about the paramount importanceof study design and we tried to design this study carefully cer-tainly more so than most social science analyses of secondarydatasets like the PSID or CPS We thank BX for their thoughtfulcommentary they raise some important points

We thank all of the discussants for stimulating commentswhich we believe have added emphasis to the importance ofboth the substantive topic of school vouchers and the devel-opment of appropriate methods to evaluate this and other suchbroken randomized experiments

ADDITIONAL REFERENCES

Card D and Krueger A (1994) ldquoMinimum Wages and Employment A CaseStudy of the Fast-Food Industry in New Jersey and Pennsylvaniardquo AmericanEconomic Review 84 772ndash793

Barnard et al Rejoinder 323

Jo B (2002) ldquoEstimation of Intervention Effects With Noncompliance Alter-native Model Speci cationsrdquo (with discussion) Journal of Educational andBehavioral Statistics 27 385ndash420

Mealli F and Rubin D B (2002) Discussion of ldquoEstimation of InterventionEffects With Noncompliance Alternative Model Speci cationsrdquo by B JoJournal of Educational and Behavioral Statistics 27 411ndash415

(2003) Assumptions Allowing the Estimation of Direct Causal Ef-fects Discussion of ldquoHealthy Wealthy and Wise Tests for Direct CausalPaths Between Health and Socioeconomic Statusrdquo by Adams et al Journalof Econometrics 112 79ndash87

Rubin D B (1975) ldquoBayesian Inference for Causality The Importance ofRandomizationrdquo in Proceedings of the Social Statistics Section AmericanStatistical Association pp 233ndash239

Berk and Xu: Comment

2. NONCOMPLIANCE

The implementation of randomized field experiments, as desirable as they are, invites any number of well-known problems (Berk 1990). One of these problems is that human subjects often do not do what they are told, even when they say they will. There is a recent scholarly literature on noncompliance (Metry and Meyer 1999; Johnson 2000) and a citation trail leading back to evaluations of the social programs of the War on Poverty (Haveman 1977). Some noncompliance can be expected except when there is unusual control over the study subjects. Our recent randomized trial of the inmate classification system used by the California Department of Corrections is one example (Berk, Ladd, Graziano, and Baek 2003).

An obvious question follows: Why does it seem that the individuals who designed this experiment were caught by surprise? As best we can tell, little was done to reduce noncompliance, and, more important for the analyses of BFHR, apparently almost no data were collected on factors that might be explicitly related to noncompliance (Hill, Rubin, and Thomas 2000).

Without knowing far more about the design stages of the experiment, we find it difficult to be very specific. But from the literatures on compliance and school choice, the following are probably illustrative indicators relevant for parents who would not be inclined to use a school voucher; for other forms of noncompliance, other measures would be needed:

1. Whether the adults in the household are employed outside the home during the day
2. Whether there is any child care at the private schools before classes begin and after they end, for children of employed parents
3. A household's ability to pay for private school beyond the value of the voucher
4. Travel time to the nearest private schools
5. Availability of transportation to those schools
6. Religious preference of the household and religious affiliation of the private school

From these examples, and others that individuals closer to the study would know about, one can imagine a variety of measures to reduce noncompliance. For instance, key problems for any working parent are child care before and after school and round-trip school transportation. These might be addressed directly with supplemental support. If no such resources could be obtained, then the study design might be altered to screen out, before random assignment, households for whom compliance was likely to be problematic.

3. MISSING DATA

No doubt BFHR know more about the subtleties of the missing data than we do. Nevertheless, the same issues arise for missing data that arise for noncompliance. Surely problems with missing data could have been anticipated. Factors related to nonresponse to particular items could have been explicitly measured, and prevention strategies perhaps could have been deployed.

4. DESIGN EFFICIENCY

A carefully designed experiment not only provides a solid basis for statistical inference, but also gives credible evidence for causation. The New York School Choice Scholarships Program (NYSCSP) was apparently the first good example of a randomized experiment to examine the potential benefits of vouchers for private schools. Offering a scholarship (treatment) randomly through a lottery can (1) provide a socially acceptable way to ration scarce resources, (2) protect against undesirable selection bias, and (3) produce balance between the observed and unobserved variables that may affect the response. In addition to the randomization and blocking, NYSCSP implemented a PMPD (propensity matched pairs design) to choose a control group from a large candidate pool and to balance many covariates explicitly. Table 2 of BFHR shows that various background variables were balanced properly between the treatment group and the control group.

But might it have been possible to do a bit better? BFHR did not mention that the sample size of the applicant's school was not balanced, although winners from high and low schools were balanced between the treatments and the controls. An ideal design would have had equal winners from "bad" (or low) and "good" (or high) public schools. In the NYSCSP, 85% of the winners were from bad schools and 15% were from good schools, to achieve the social goals of the School Choice Scholarships Foundation (Hill et al. 2000, p. 158). Bad schools are defined as "those for which the average test scores were below the median test scores for the city" (Hill et al. 2000, p. 158). A direct consequence is that the estimates from the good schools have much larger variation than those from the bad schools, which is evident in BFHR's Tables 4–6: the confidence intervals for good schools are wider.

Instead of classifying schools into two classes (bad and good), it might have been helpful to classify them into four classes: very bad, bad, good, and very good. To estimate the effect of applicant's school (assuming linear effects), an optimal design would assign half of the winners to the very bad schools, the other half to the very good schools, and none to the two middle classes. Avoiding the middle schools enlarges the distance between bad and good schools, and also reduces the source variation within the bad and good schools, and thus reduces the variation in the estimates. To be more consistent with the goals of the School Choice Scholarships Foundation, an alternative design might take 75% of winners from very bad schools, 10% from bad schools, 5% from good schools, and 10% from very good schools. This design would have a higher design efficiency than the original design, especially because the applicant's school was the most important design variable after family size (Hill et al. 2000, p. 160). Of course, people on the ground at the time may have considered these options and rejected them. Our point is that, with better planning, it might have been possible to squeeze more efficiency out of the study design. Such an outcome would have made the job undertaken by BFHR a bit easier.
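To make the efficiency comparison concrete, here is a minimal sketch; it is entirely ours, not BX's calculation. The 0–3 coding of school quality and the total of 1,000 winners are hypothetical, and the criterion is the textbook OLS slope variance, sigma^2 / sum_i (x_i - x_bar)^2, so allocations that push mass to the extremes shrink it.

```python
import numpy as np

def slope_var(x, sigma=1.0):
    # Sampling variance of the OLS slope in y = a + b*x + noise:
    # sigma^2 / sum((x_i - mean(x))^2). More spread => smaller variance.
    x = np.asarray(x, dtype=float)
    return sigma**2 / np.sum((x - x.mean()) ** 2)

# Hypothetical allocations of 1,000 winners over quality levels 0..3
designs = {
    "two classes, 85/15": [0] * 425 + [1] * 425 + [2] * 75 + [3] * 75,
    "BX's 75/10/5/10":    [0] * 750 + [1] * 100 + [2] * 50 + [3] * 100,
    "50/50 at extremes":  [0] * 500 + [3] * 500,
}
for name, x in designs.items():
    print(f"{name:20s}: slope variance = {slope_var(x):.2e}")
```

Under this stylized coding, the four-class allocation beats the two-class design (whose 85% is smeared across the two low levels), and the all-extremes allocation dominates both, which is the classical optimal-design result for a purely linear effect.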

5. CONCLUSIONS

Scientific problems often provide a fertile ground for important statistical developments; BFHR's article, and the literature on which it depends, are excellent illustrations. However, from the perspective of public policy, the goal of program evaluation is to obtain usefully precise estimates of program impacts while using the most robust means of analysis available. Simplicity is also important, both for data analysts who must do the work and for policy makers who must interpret the results. It follows that large investments at the front end of an evaluation (in research design, measurement, and study integrity) are essential. Virtuoso statistical analyses can certainly help when these investments are insufficient or when they fail, but they must not be seen as an inexpensive alternative to doing the study well to begin with.

ADDITIONAL REFERENCES

Angrist, J., Imbens, G. W., and Rubin, D. B. (1996), "Identification of Causal Effects Using Instrumental Variables" (with discussion), Journal of the American Statistical Association, 91, 441–472.

Berk, R. A. (1990), "What Your Mother Never Told You About Randomized Field Experiments," in Community-Based Care of People With AIDS: Developing a Research Agenda, AHCPR Conference Proceedings, Washington, DC: U.S. Department of Health and Human Services.

Berk, R. A., Ladd, H., Graziano, H., and Baek, J. (2003), "A Randomized Experiment Testing Inmate Classification Systems," Journal of Criminology and Public Policy, 2, 215–242.

Haveman, R. H. (ed.) (1977), A Decade of Federal Antipoverty Programs: Achievements, Failures, and Lessons, New York: Academic Press.

Hill, J. L., Rubin, D. B., and Thomas, N. (2000), "The Design of the New York School Choice Scholarships Program Evaluation," in Research Design: Donald Campbell's Legacy (Vol. II), ed. L. Bickman, Thousand Oaks, CA: Sage.

Johnson, S. B. (2000), "Compliance Behavior in Clinical Trials: Error or Opportunity?" in Promoting Adherence to Medical Treatment in Chronic Childhood Illness: Concepts, Methods, and Interventions, ed. D. Drotar, Mahwah, NJ: Lawrence Erlbaum.

Metry, J., and Meyer, U. A. (eds.) (1999), Drug Regimen Compliance: Issues in Clinical Trials and Patient Management, New York: Wiley.

Rejoinder

John BARNARD, Constantine E. FRANGAKIS, Jennifer L. HILL, and Donald B. RUBIN

We thank the editorial board and the thoughtful group of discussants that they arranged to comment on our article. The discussants have raised many salient issues and offered useful extensions of our work. We use this rejoinder primarily to address a few points of contention.

Muthén, Jo, and Brown

Muthén, Jo, and Brown (henceforth MJB) provide an enlightening discussion of an alternate and potentially complementary approach to treatment effect estimation. The growth models that they discuss are for data more extensive than what we analyzed, and so are beyond the scope of our article. Nevertheless, MJB propose an interesting idea for looking for interactions in longitudinal experimental data, and we look forward to future applications of the method to appropriate datasets.

Although models that incorporate both our compliance principal strata and latent growth trajectories are theoretically possible (assuming multiple years of outcome data), and could potentially yield an extremely rich description of the types of treatment effects manifested, the more time points that exist, the more complicated the compliance patterns can become (at least without further assumptions), and it may be difficult to find a dataset rich enough to support all of the interactions. Given a choice, we feel that it is probably more beneficial to handle real observed issues than to search for interactions with latent trajectories, but certainly it would always be nice to be able to do both. Of course, such a search for interactions in trajectories is more beneficial when there is enough prior knowledge about the scientific background to include additional structural assumptions that can support robust estimation.

MJB make an interesting point about the exploration of potential violations of exclusion for never-takers. Such analyses have been done in past work by some of the authors (Hirano et al. 2000; Frangakis, Rubin, and Zhou 2002), but were not pursued for our study because the robustness of such methods is still an issue in examples as complex as this one.

Regarding the trade-offs between the joint latent ignorability and compound exclusion assumptions versus the joint assumptions of standard ignorability and the standard exclusion restriction: this topic is up for debate. We would like to point out, however, that these standard structural assumptions come with their own level of arbitrariness. These issues have been explored by Jo (2002) and Mealli and Rubin (2002, 2003).

We thank MJB for providing a thought-provoking discussion of our model and assumptions, as well as ample fodder for additional directions that can be explored in this arena. Clearly the general category of work has stimulated much new and exciting modeling.

Krueger and Zhu

Krueger and Zhu (henceforth KZ) focus primarily on the underlying study and on analyses that either we did not do or that used data we did not have. Clearly they are interested in the substantive issues, which is admirable, but we cannot be responsible for analyses that we neither did nor were consulted on, and we are not prepared to comment on what analyses we might have undertaken with data that we did not have.

KZ state that "there is no substitute for probing the definition and concepts that underlie that data." Although we are not entirely sure what this refers to in particular, if it means that good science is more important than fancy techniques, then we certainly agree.

Another comment by KZ is that "it is desirable to use the most recent comprehensive data for the widest sample possible." We agree that it is desirable to have access to the most comprehensive data available, but it is not necessary to use all of these data in any given analysis. Here we use only first-year data, because these were the only data available to us at the time that we were building the model. We used only single-child families, as explained in the article, because, regrettably, no compliance data were collected for individual children in the multichild families.

The exclusion of kindergarten children could be considered more of a judgment call regarding the quality of the pretest scores, not only for efficiency, but also for the reasonableness of our assumptions about noncompliance and missing data. Nothing in theory would prevent us from including these children and omitting these extra covariates.

These types of trade-offs (more data vs. more appropriate data) are common in statistical analyses, and the benefits are well known, for example, for avoiding model extrapolation or when some subsets of data do not mean the same thing as others. In fact, Krueger "ignored data" in his article on minimum wage with David Card (Card and Krueger 1994) by limiting analyses to restaurants from a similar geographic region, in the hope that employees and employers across the two comparison states would be faced with similar economic conditions, rather than looking at all restaurants in these states. This is a form of implicit matching, which we believe represented a good choice. Surely there are times when focusing on subsets of data produces more reliable results.

Furthermore, we would like to point out that KZ's assertion that "because assignment to treatment status was random (within strata), a simple comparison of means between treatments and controls, without conditioning on baseline scores, provides an unbiased estimate of the average treatment effect" is simply false, because there are missing outcomes. Even if there is perfect compliance, a simple comparison of means across treatment groups when there are missing outcomes will not, in general, lead to an unbiased estimate of the treatment effect. Assumptions need to be made about the missing-data process.

When noncompliance exists in addition to the missing outcome data, Frangakis and Rubin (1999) have shown that estimation even of the ITT effect requires additional assumptions. Moreover, the ITT estimate, generally considered a conservative estimate of the treatment effect, can actually be anticonservative in this scenario. Finally, contrary to what KZ seem to imply in their final paragraph, instrumental variables methods alone are not sufficient to address the problems of both noncompliance and missing data.
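A toy simulation, with numbers entirely invented by us, illustrates the point of the last two paragraphs: with missing outcomes, randomization plus perfect compliance does not make the observed-case difference in means unbiased. Here low-scoring control children are less likely to show up for testing, and the naive estimate suffers accordingly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, tau = 200_000, 5.0                      # true effect of the offer: 5 points
z = rng.integers(0, 2, n).astype(bool)     # randomized offer, perfect compliance
y = 50 + tau * z + rng.normal(0, 10, n)    # test score for every child

# Outcome-dependent missingness: low-scoring controls test less often
p_obs = np.where(z, 0.8, np.clip(0.2 + 0.01 * (y - 30), 0.0, 1.0))
tested = rng.random(n) < p_obs

naive = y[tested & z].mean() - y[tested & ~z].mean()
print(f"true ITT = {tau:.1f}; observed-case difference in means = {naive:.2f}")
```

In this setup the observed-case comparison recovers only about half of the true effect; some assumption about the missing-data process (and, once noncompliance enters, about its relation to the compliance strata) is unavoidable.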

KZ advocate "simplicity and transparency." We wholeheartedly agree, but disagree with the examples they use to defend their stance. Their first argument is that the PMPD did not yield strong efficiency gains and that we did not account for the PMPD in the calculation of standard errors. It may well be that precision gains were relatively modest; however, the analyses that they present are by no means conclusive in this regard. First, as with KZ's other ITT analyses, we have no clue as to how missing outcome data are handled. In addition, the PMPD was also used to balance compliance and missingness. It is unclear, however, how the gains in precision due to matching the missingness and noncompliance are displayed, if at all, in KZ's results. Finally, KZ did not account for the experimental design when estimating their results. Failure to account properly for the design typically results in estimates that overstate the variances of treatment effect estimates. In this design, it is the correlation between matched-pair units that leads the sampling variances to be smaller than their estimates.
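The variance point is the familiar one for paired designs, and a small simulation of our own construction (the variance components are made up) shows its direction: when matched pairs share a common component, the two-independent-samples formula overstates the standard error of the treatment-control contrast.

```python
import numpy as np

rng = np.random.default_rng(1)
n_pairs, tau = 5_000, 2.0
pair = rng.normal(0, 3, n_pairs)           # component shared within a matched pair
y_t = 10 + tau + pair + rng.normal(0, 1, n_pairs)   # treated member of each pair
y_c = 10 + pair + rng.normal(0, 1, n_pairs)         # matched control member

# Two-independent-samples variance estimate (ignores the matching)
v_unpaired = y_t.var(ddof=1) / n_pairs + y_c.var(ddof=1) / n_pairs
# Paired analysis: variance of the mean within-pair difference
v_paired = (y_t - y_c).var(ddof=1) / n_pairs

print(f"SE ignoring the pairing: {v_unpaired ** 0.5:.4f}")
print(f"SE using the pairing:    {v_paired ** 0.5:.4f}")
```

With these made-up components, the unpaired formula inflates the standard error roughly threefold, which is exactly the sense in which analyses that ignore the PMPD overstate the sampling variances.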

Even if precision gains were truly modest, they of course were unknown to us at the time we debated the use of the PMPD design. KZ's argument is like the complaint of a cross-country motorist who experiences no car problems during his long drive and then bemoans the money wasted on a spare tire purchased before the trip began. We prefer to avoid potential problems by using careful design.

Importantly, KZ's criticism that we ignored this matching in the analysis is incorrect. The matching was done based on the estimated propensity score, so it follows from Rubin (1978) that assignment is ignorable conditionally given the covariates that were used to estimate it. Thus the Bayesian analysis that we conduct, which conditions on the estimated propensity score, is an approximately valid Bayesian analysis with respect to the matched assignment, assuming that our modeled relationship between the outcomes and the estimated propensity score captures the major features of the relationship to the covariates.
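As a minimal sketch of this logic only (it is not the Bayesian model of the article; the covariates, coefficients, and the subclassification estimator below are all invented stand-ins), conditioning on a well-estimated propensity score approximately removes the confounding that the raw comparison suffers:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 100_000
x = rng.normal(size=(n, 3))                                   # baseline covariates
p_true = 1 / (1 + np.exp(-(0.8 * x[:, 0] - 0.5 * x[:, 1])))   # assignment probability
z = rng.random(n) < p_true                   # assignment ignorable given covariates
y = 1.0 * z + x @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)  # true effect 1.0

# Estimate the propensity score from the covariates that drove assignment
e_hat = LogisticRegression().fit(x, z).predict_proba(x)[:, 1]

# Subclassify on e_hat: within strata, treated and controls are comparable
edges = np.quantile(e_hat, [0.2, 0.4, 0.6, 0.8])
s = np.digitize(e_hat, edges)
effects = [y[(s == k) & z].mean() - y[(s == k) & ~z].mean() for k in range(5)]
weights = [np.mean(s == k) for k in range(5)]

print(f"naive difference in means:      {y[z].mean() - y[~z].mean():.3f}")
print(f"propensity-stratified estimate: {np.dot(effects, weights):.3f} (truth 1.0)")
```

The naive contrast is confounded, whereas stratifying on the estimated score approximately recovers the effect; an outcome model that conditions on the estimated score plays the same card, with validity resting on that model capturing the dependence, as noted above.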

KZ's second argument is that use of the PMPD "led to grave consequences" because sampling weights were miscalculated by MPR. It seems odd to us to maintain, as KZ do, that the PMPD "caused" this error; nor is it clear that use of another design would have avoided the error. In any case, we have rerun our analyses, this time generalizing to the population of which the children in our analyses are representative (by setting constant weights in step 3 of Sec. 8.1.1). For this population, all results are very similar to our original results. These results are posted as Appendix C at http://biosun01.biostat.jhsph.edu/~cfrangak/papers/sc.

A third argument revolves around the protracted debate between use of ITT versus treatment-targeted measures such as CACE. We addressed both ITT and CACE, which allows the reader to focus on one or the other or both, depending on which effect is expected to generalize to another environment: the ITT, or "sociological," effect of accepting or not accepting an offered voucher, or the CACE, or "structural" or "scientific," effect of using a voucher. One could argue that the effect of attending (CACE) is the more generalizable, whereas the sociological effect (ITT) is more easily influenced, for instance, by "hype" about how well the program works. That is, the ITT estimate will not reflect the "effect of offering" if, the next time the program is offered, the compliance rates change (which seems likely). We prefer to examine both, to provide a more comprehensive picture.
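For readers who want the two estimands side by side, here is a minimal moment-based sketch with toy numbers of our own: one-sided noncompliance, exclusion for never-takers, and, unlike the real study, no missing outcomes (precisely the complication stressed earlier, under which this simple ratio estimator is no longer enough):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
complier = rng.random(n) < 0.7             # 70% would use a voucher if offered
z = rng.integers(0, 2, n).astype(bool)     # randomized voucher offer
d = z & complier                           # one-sided: controls cannot use one

# Using a voucher (d) adds 5 points; compliers start 2 points higher at baseline
y = 50 + 5.0 * d + 2.0 * complier + rng.normal(0, 10, n)

itt = y[z].mean() - y[~z].mean()             # "sociological" effect of the offer
cace = itt / (d[z].mean() - d[~z].mean())    # "structural" effect of actual use
print(f"ITT  = {itt:.2f} (population value 0.7 * 5 = 3.5)")
print(f"CACE = {cace:.2f} (population value 5.0)")
```

The ratio form makes the generalizability point visible: if take-up differs the next time the program is offered, the ITT moves with it, whereas the CACE need not.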

In sum, we agree that simplicity and transparency are important; this is the major motivation for our detailed description of our assumptions in terms of readily understandable concepts, rather than commonly invoked and less intuitive conditions involving correlated error terms. However, this desire for simplicity must be balanced with the prudence of guarding against potential problems through careful design.

Although we agree in theory with much of what KZ espouse, some of their criticisms seem misguided. Moreover, their presentation of analyses with incomplete exposition regarding missing-data complications makes these analyses difficult for us to evaluate properly.


Berk and Xu

Berk and Xu (henceforth BX) raise several interesting questions about the way in which the evaluation was designed and implemented. Many of these are actually addressed in a preceding and lengthier article by us on this topic (Barnard et al. 2002). That article describes, for instance, MPR's efforts to reduce missing data, without which the missingness rates would surely have been much higher. It also describes special challenges faced in this study. But in any study with human subjects, it is difficult to completely eliminate missing data, particularly if one desires more interesting information than can be obtained from administrative data. One of the challenges for the collection of outcome test scores was the fact that children needed to be tested outside of the public school setting, which allowed the use of a standardized test not used by the public school system, thereby greatly reducing the possibility of bias resulting from teachers "teaching to the test." Consequently, parents had to bring their children to testing areas on weekends, a formidable hurdle for some, even when monetary incentives were used.

An important consideration with this study was that the evaluators took the initiative to investigate a program that was sponsored by an organization (SCSF) whose primary goal was to provide scholarships, not to do research. We feel that MPR did an admirable job convincing the SCSF to allow the study to be evaluated and raising money for this purpose, but this also meant that the evaluators could not control every aspect of the program. For example, the SCSF originally wanted to make the program available only to children from the lower-test-score schools, and the evaluators had to convince them to allow also a small percentage from those in higher-test-score schools, to increase the generalizability of the results. The designs that BX suggest, even though they have certain desirable statistical properties under appropriate assumptions, were not an option with this study.

Related to this issue of the alignment of the evaluators' and program administrators' goals is the issue of the potential for reducing noncompliance rates. Some effort was made, in that help was provided to scholarship winners in finding an appropriate private school. Arguably, however, 100% compliance should not be the goal in such a study, given that we would not expect 100% compliance if the program were expanded and offered, say, by the government on a larger scale. Thus it may not be wise to use extreme efforts to increase compliance if such efforts would not be offered in the large-scale public case. With the current study, the opportunity exists to examine the noncompliance process for this program in this setting: take-up rates, predictors of noncompliance, and so on. To this end, some sort of subexperiment that randomizes a reasonable set of compliance incentives might be useful in similar studies in the future, if sample sizes are sufficient.

We are not convinced that we should have solicited actual prior distributions from educators and economists. In our experience, this is a confusing, difficult, and often unrewarding task. Moreover, school choice is such a politically charged and contentious issue that it seems wiser to try to use relatively diffuse and objective prior distributions.

We agree that results should be interpreted in terms of a yardstick of practical effects. We are unaware of the most appropriate such yardstick to use in this context. One side could claim that even tiny average effects matter, not only because of the cumulative effects on parts of society, but also because of possible nonadditivity of such effects, which could imply huge effects for some rare individuals and minute effects for all others. Others could argue that seemingly quite large effects have no long-run practical impact on quality of life. The results that we present provide a basic empirical starting point for these debates between people whose expertise is in this area.

We thank BX for highlighting our emphasis on the role of the assumptions in our analyses. Our hope is that we have presented the assumptions in such a way as to enable readers to make their own judgment about whether or not they were likely to be satisfied, rather than only stating them technically, which would implicitly make those important decisions for the interested but nontechnical readers.

BX point out a dearth of certain types of covariates. Part of the problem here is that we presented only a small portion of the data collected, due to the fact that our model in its current form did not handle many covariates. However, there were indeed discussions between some members of our team and MPR staff, during the development of the questionnaire, about what types of variables could be included that would be predictive of missing data or compliance as well as outcomes. Some of the variables proposed were discarded due to the surveyors' experience with them as subject to problems such as high measurement error. Also, not every variable discussed was included, due to a desire for a survey that would not be overly burdensome for the study families.

We raise a final, small, semantic question. We wonder about the propriety of the label "Neyman–Rubin model" to describe a framework for causal inference that views observational studies and randomized experiments on the same continuum and, moreover, encourages Bayesian analyses of both. Neyman (1923) certainly seems to have been the first to use the formal notation of potential outcomes for randomization-based inference in randomized experiments, which was a truly major advance; but neither he nor anyone else, to the best of our knowledge, used this formal notation to define causal effects in observational studies until Rubin (1974, 1975), nor did he or anyone else ever advocate Bayesian inference for the unobserved potential outcomes until Rubin (1975, 1978), which is the perspective that we have implemented here.
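In the notation at issue (our rendering of the standard formalism, not a quotation from either source):

```latex
% Each unit i has two potential outcomes, of which at most one is observed:
% Y_i(1) if offered the voucher (Z_i = 1) and Y_i(0) if not (Z_i = 0).
\[
  \tau_i = Y_i(1) - Y_i(0), \qquad
  \bar\tau = \frac{1}{N}\sum_{i=1}^{N}\bigl(Y_i(1) - Y_i(0)\bigr), \qquad
  Y_i^{\mathrm{obs}} = Z_i\,Y_i(1) + (1 - Z_i)\,Y_i(0).
\]
% The Bayesian perspective (Rubin 1975, 1978) treats the unobserved
% potential outcomes as unknowns with a posterior predictive distribution.
```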

Overall, we agree with BX about the paramount importance of study design, and we tried to design this study carefully, certainly more so than most social science analyses of secondary datasets such as the PSID or CPS. We thank BX for their thoughtful commentary; they raise some important points.

We thank all of the discussants for their stimulating comments, which we believe have added emphasis to the importance of both the substantive topic of school vouchers and the development of appropriate methods to evaluate this and other such broken randomized experiments.

ADDITIONAL REFERENCES

Card, D., and Krueger, A. (1994), "Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania," American Economic Review, 84, 772–793.

Jo, B. (2002), "Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications" (with discussion), Journal of Educational and Behavioral Statistics, 27, 385–420.

Mealli, F., and Rubin, D. B. (2002), Discussion of "Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications," by B. Jo, Journal of Educational and Behavioral Statistics, 27, 411–415.

Mealli, F., and Rubin, D. B. (2003), "Assumptions Allowing the Estimation of Direct Causal Effects," Discussion of "Healthy, Wealthy, and Wise? Tests for Direct Causal Paths Between Health and Socioeconomic Status," by Adams et al., Journal of Econometrics, 112, 79–87.

Rubin, D. B. (1975), "Bayesian Inference for Causality: The Importance of Randomization," in Proceedings of the Social Statistics Section, American Statistical Association, pp. 233–239.

320 Journal of the American Statistical Association June 2003

the perspective of public policy the goal of program evaluationis to obtain usefully precise estimates of program impacts whileusing the most robust means of analysis available Simplicity isalso important both for data analysts who must do the workand policy makers who must interpret the results It followsthat large investments at the front end of an evaluationmdashin re-search design measurement and study integritymdashare essentialVirtuoso statistical analyses can certainly help when these in-vestments are insuf cient or when they fail but they must notbe seen as an inexpensivealternative to doing the study well tobegin with

ADDITIONAL REFERENCESAngrist J Imbens G W and Rubin D B (1996) ldquoIdenti cation of Causal

Effects Using Instrumental Variablesrdquo (with discussion) Journal of the Amer-ican Statistical Association 91 441ndash472

Berk R A (1990) ldquoWhat Your Mother Never Told You About RandomizedField Experimentsrdquo in Community-Based Care of People With AIDS De-veloping a Research Agenda AHCPR Conference Proceedings WashingtonDC US Department of Health and Human Services

Berk R A Ladd H Graziano H and Baek J (2003) ldquoA Randomized Ex-periment Testing Inmate Classi cation Systemsrdquo Journal of Criminology andPublic Policy 2 215ndash242

Havemen R H (ed) (1977) A Decade of Federal Antipoverty ProgramsAchievements Failures and Lessons New York Academic Press

Hill J L Rubin D B and Thomas N (2000) ldquoThe Design of the NewYork School Choice Scholarships Program Evaluationrdquo in Research DesignDonald Campbellrsquos Legacy (Vol II) ed L Bickman Thousand Oaks CASage

Johnson S B (2000) ldquoCompliance Behavior in Clinical Trials Error or Op-portunityrdquo in Promoting Adherence to Medical Treatment in Chronic Child-hood Illness Concepts Methods and Interventions ed D Drotar MahwahNJ Lawrence Erlbaum

Metry J and Meyer U A (eds) (1999) Drug Regimen Compliance Issuesin Clinical Trials and Patient Management New York Wiley

RejoinderJohn BARNARD Constantine E FRANGAKIS Jennifer L HILL and Donald B RUBIN

We thank the editorial board and the thoughtful group of dis-cussants that they arranged to comment on our article The dis-cussants have raised many salient issues and offered useful ex-tensions of our work We use this rejoinder primarily to addressa few points of contention

Muthen Jo and Brown

Muthen Jo and Brown (henceforthMJB) providean enlight-ening discussion of an alternate and potentially complementaryapproach to treatment effect estimationThe growth models thatthey discuss are for data more extensive than what we analyzedand so are beyond the scope of our article Nevertheless MJBpropose an interesting idea for looking for interactions in lon-gitudinal experimental data and we look forward to future ap-plications of the method to appropriate datasets

Although models that incorporate both our compliance prin-cipal strata as well as latent growth trajectories are theoreticallypossible (assuming multiple years of outcome data) and couldpotentially yield an extremely rich description of the types oftreatment effects manifested the more time points that existthe more complicated the compliance patterns can become (atleast without further assumptions) and it may be dif cult to nda dataset rich enough to support all of the interactions Given achoice we feel that it is probably more bene cial to handle realobserved issues than to search for interactions with latent tra-jectories but certainly it would always be nice to be able todo both Of course such a search for interactions in trajecto-ries is more bene cial when there is enough prior knowledgeabout the scienti c background to include additional structuralassumptions that can support robust estimation

MJB make an interesting point about the exploration of po-tential violations of exclusion for never-takers Such analyseshave been done in past work by some of the authors (Hiranoet al 2000 Frangakis Rubin and Zhou 2002) but was not pur-sued for our study because the robustness of such methods isstill an issue in examples as complex as this one

Regarding the trade-offs between the joint latent ignorabil-ity and compound exclusion assumptions versus the joint as-sumptions of standard ignorability and the standard exclusionrestriction this topic is up for debate We would like to pointout however that these standard structural assumptions comewith their own level of arbitrariness These issues have beenexplored by Jo (2002) and Mealli and Rubin (2002 2003)

We thank MJB for providinga thought-provoking discussionof our model and assumptions as well as ample fodder for ad-ditionaldirections that can be explored in this arena Clearly thegeneral category of work has stimulated much new and excitingmodeling

Krueger and Zhu

Krueger and Zhu (henceforth KZ) focus primarily on the un-derlying study and analyses that either we did not do or useddata we did not have Clearly they are interested in the substan-tive issues which is admirable but we cannot be responsiblefor analyses that we neither did nor were consulted on and arenot prepared to comment on what analyses we might have un-dertaken with data that we did not have

KZ state that ldquothere is no substitute for probing the de ni-tion and concepts that underlie that datardquo Although we are notentirely sure what this refers to in particular if it means thatgood science is more important than fancy techniques then wecertainly agree

Another comment by KZ is that ldquoit is desirable to use themost recent comprehensive data for the widest sample possi-blerdquo We agree that it is desirable to have access to the mostcomprehensive data available but it is not necessary to use allof these data in any given analysis Here we use only rst-yeardata because these were the only data available to us at the

copy 2003 American Statistical AssociationJournal of the American Statistical Association

June 2003 Vol 98 No 462 Applications and Case StudiesDOI 101198016214503000116

Barnard et al Rejoinder 321

time that we were building the model We used only single-child families as explained in the article because regrettablyno compliance data were collected for individualchildren in themultichild families

The exclusion of kindergarten children could be consideredmore of a judgment call regarding the quality of the pretestscores not only for ef ciency but also for the reasonablenessofour assumptions about noncompliance and missing data Noth-ing in theory would prevent us from including these childrenand omitting these extra covariates

These types of trade-offs (more data vs more appropriatedata) are common in statistical analyses and the bene ts arewell known for example for avoiding model extrapolation orwhen some subsets of data do not mean the same thing as oth-ers In fact Krueger ldquoignored datardquo in his article on minimumwage with David Card (Card and Krueger 1994) by limitinganalyses to restaurants from a similar geographic region in thehope that employees and employers across the two comparisonstates would be faced with similar economic conditions ratherthan looking at all restaurants in these states This is a form ofimplicit matchingwhich we believe represented a good choiceSurely there are times when focusing on subsets of data pro-duces more reliable results

Furthermore we would like to point out that KZrsquos asser-tion that ldquobecause assignment to treatment status was random(within strata) a simple comparison of means between treat-ments and controls without conditioning on baseline scoresprovides an unbiased estimate of the average treatment effectrdquois simply false because there are missing outcomes Even ifthere is perfect compliance a simple comparison of meansacross treatment groups when there are missing outcomes willnot lead in general to an unbiased estimate of the treatmenteffect Assumptions need to be made about the missing-dataprocess

When noncompliance exists in addition to the missing out-come data Frangakis and Rubin (1999) have shown that esti-mation even of the ITT effect requires additional assumptionsMoreover the ITT estimate generally considered a conserva-tive estimate of the treatment effect can actually be anticon-servative in this scenario Finally contrary to what KZ seem toimply in their nal paragraph instrumental variables methodsalone are not suf cient to address the problems of both non-compliance and missing data

KZ advocate ldquosimplicity and transparencyrdquo We wholeheart-edly agree but disagree with the examples they use to defendtheir stance Their rst argument is that the PMPD did notyield strong ef ciency gains and that we did not account forthe PMPD in the calculation of standard errors It may well bethat precision gains were relatively modest however the analy-ses that they present are by no means conclusive in this regardFirst as with KZrsquos other ITT analyses we have no clue as tohow missing outcome data are handled In addition the PMPDwas also used to balance compliance and missingness It is un-clear however how the gains in precision due to matching themissingness and noncomplianceare displayed if at all in KZrsquosresults Finally KZ did not account for the experimental de-sign when estimating their results Failure to account properlyfor the design typically results in estimates that overstate thevariances of treatment effect estimates In this design it is the

correlation between matched pair units that leads the samplingvariances to be smaller than their estimates

Even if precision gains were truly modest they of coursewere unknown to us at the time we debated the use of the PMPDdesign KZrsquos argument is like the complaint of a cross-countrymotorist who experiencesno car problems during his long driveand then bemoans the money wasted on a spare tire purchasedbefore the trip began We prefer to avoid potential problems byusing careful design

Importantly KZrsquos criticism that we ignored this matchingin the analysis is incorrect The matching was done based onthe estimated propensity score so it follows from Rubin (1978)that assignment is ignorable conditionally given the covariatesthat were used to estimate it Thus the Bayesian analysis thatwe conduct that conditions on the estimated propensity scoreis an approximately valid Bayesian analysis with respect tothe matched assignment assuming that our modeled relation-ship between the outcomes and the estimated propensity scorecaptures the major features of the relationship to the covari-ates

KZrsquos second argument is that use of the PMPD ldquoled to graveconsequencesrdquo because sampling weights were miscalculatedby MPR It seems to us odd to maintain as KZ do that thePMPD ldquocausedrdquo this error nor is it clear that use of another de-sign would have avoided the error In any case we have rerunour analyses this time generalizing to the population of whichthe children in our analyses are representative (by setting con-stant weights in step 3 of Sec 811) For this population allresults are very similar to our original results These results areposted as Appendix C at httpbiosun01biostatjhspheduraquocfrangakpaperssc

A third argument revolves around the protracted debate be-tween use of ITT versus treatment-targeted measures such asCACE We addressed both ITT and CACE which allows thereader to focus on one or the other or both dependingon whicheffect is expected to generalize to another environment theITT or ldquosociologicalrdquo effect of accepting or not accepting anoffered voucher or the CACE or ldquostructuralrdquo or ldquoscienti crdquoeffect of using a voucher One could argue that the effect ofattending (CACE) is the more generalizable whereas the soci-ological effect (ITT) is more easily in uenced for instance byldquohyperdquo about how well the program works That is the ITT es-timate will not re ect the ldquoeffect of offeringrdquo if the next time theprogram is offered the compliance rates change (which seemslikely) We prefer to examine both to provide a more compre-hensive picture

In sum we agree that simplicity and transparency are impor-tant this is the major motivation for our detailed descriptionof our assumptions in terms of readily understandableconceptsrather than commonly invoked and less intuitive conditions in-volving correlated error terms However this desire for sim-plicity must be balanced with the prudence of guarding againstpotential problems through careful design

Although we agree in theory with much of what KZ es-pouse some of their criticisms seem misguidedMoreover theirpresentation of analyses with incomplete exposition regardingmissing-data complications makes these analyses dif cult forus to evaluate properly

322 Journal of the American Statistical Association June 2003

Berk and Xu

Berk and Xu (henceforth BX) raise several interesting ques-tions about the way in which the evaluation was designed andimplemented Many of these are actually addressed in a pre-ceding and lengthier article by us on this topic (Barnard et al2002) That article describes for instance MPRrsquos efforts to re-duce missing data without which the missingness rates wouldsurely have been much higher It also describes special chal-lenges faced in this study But in any study with human subjectsit is dif cult to completely eliminate missing data particularlyif one desires more interesting information than can be obtainedfrom administrative data One of the challenges for the collec-tion of outcome test scores was the fact that children neededto be tested outside of the public school setting which allowedthe use of a standardized test not used by the public school sys-tem thereby greatly reducing the possiblyof bias resulting fromteachers ldquoteaching to the testrdquo Consequently parents had tobring their children to testing areas on weekendsmdasha formidablehurdle for some even when monetary incentives were used

An important consideration with this study was that the eval-uators took the initiativeto investigatea program that was spon-sored by an organization (SCSF) whose primary goal was toprovide scholarships not to do research We feel that MPRdid an admirable job convincing the SCSF to allow the studyto be evaluated and raising money for this purpose but thisalso meant that the evaluators could not control every aspectof the program For example the SCSF originally only wantedto make the program available to children from the lower-test-score schools and the evaluators had to convince them to allowalso a small percentage from those in higher-test-score schoolsto increase the generalizability of the results The designs thatBX suggest even though they have certain desirable statisticalproperties under appropriate assumptions were not an optionwith this study

Related to this issue of the alignment of the evaluatorsrsquo andprogram administratorsrsquogoals is the issue of the potential for re-ducing noncompliancerates Some effort was made in that helpwas provided to scholarship winners in nding an appropriateprivate school Arguably however 100 compliance shouldnot be the goal in such a study given that we would not ex-pect 100 compliance if the program were expanded and of-fered say by the government on a larger scale Thus it maynot be wise to use extreme efforts to increase compliance ifsuch efforts would not be offered in the large-scale public caseWith the current study the opportunity exists to examine thenoncompliance process for this program in this setting take-up rates predictors of noncompliance and so on To this endsome sort of subexperiment that randomizes a reasonable set ofcompliance incentives might be useful in similar studies in thefuture if sample sizes are suf cient

We are not convinced that we should have solicited actualprior distributions from educators and economists In our expe-rience this is a confusingdif cult and often unrewarding taskMoreover school choice is such a politically charged and con-tentious issue that it seems wiser to try to use relatively diffuseand objective prior distributions

We agree that results should be interpreted in terms of a yard-stick of practical effects We are unaware of the most appropri-

ate such yardstick to use in this context One side could claimthat even tiny average effects matter not only because of thecumulative effects on parts of society but also because of pos-sible nonadditivityof such effects which could imply huge ef-fects for some rare individualsand minute effects for all othersOthers could argue that seemingly quite large effects have nolong-run practical impact on quality of life The results that wepresent providea basic empirical starting point for these debatesbetween people whose expertise is in this area

We thank BX for highlighting our emphasis on the role ofthe assumptions in our analyses Our hope is that we have pre-sented the assumptions in such a way to enable readers to maketheir own judgment about whether or not they were likely to besatis ed rather than only stating them technicallywhich wouldimplicitly make those important decisions for the interested butnontechnical readers

BX point out a dearth of certain types of covariates Part ofthe problem here is that we presented only a small portion ofthe data collected due to the fact that our model in its currentform did not handle many covariates However there were in-deed discussions between some members of our team and MPRstaff during the development of the questionnaire about whattypes of variables could be included that would be predictive ofmissing data or compliance as well as outcomes Some of thevariables proposed were discarded due to the surveyorsrsquo expe-rience with them as subject to problems such as high measure-ment error Also not every variable discussed was included dueto a desire for a survey that would not be overly burdensome forthe study families

We raise a nal small semantic question We wonder aboutthe propriety of the label ldquoNeymanndashRubin modelrdquo to describe aframework for causal inference that views observationalstudiesand randomizedexperimentson the same continuumand more-over encourages Bayesian analyses of both Neyman (1923)certainly seems to have been the rst to use the formal nota-tion of outcomes for randomization-basedpotential inference inrandomized experiments which was a truly major advance butneitherhe nor anyoneelse to the best of our knowledgeused thisformal notation to de ne causal effects in observational studiesuntil Rubin (1974 1975) nor did he nor anyone else ever advo-cate Bayesian inference for the unobserved potential outcomesuntil Rubin (1975 1978) which is the perspective that we haveimplemented here

Overall we agree with BX about the paramount importanceof study design and we tried to design this study carefully cer-tainly more so than most social science analyses of secondarydatasets like the PSID or CPS We thank BX for their thoughtfulcommentary they raise some important points

We thank all of the discussants for stimulating commentswhich we believe have added emphasis to the importance ofboth the substantive topic of school vouchers and the devel-opment of appropriate methods to evaluate this and other suchbroken randomized experiments

ADDITIONAL REFERENCES

Card D and Krueger A (1994) ldquoMinimum Wages and Employment A CaseStudy of the Fast-Food Industry in New Jersey and Pennsylvaniardquo AmericanEconomic Review 84 772ndash793

Barnard et al Rejoinder 323

Jo B (2002) ldquoEstimation of Intervention Effects With Noncompliance Alter-native Model Speci cationsrdquo (with discussion) Journal of Educational andBehavioral Statistics 27 385ndash420

Mealli F and Rubin D B (2002) Discussion of ldquoEstimation of InterventionEffects With Noncompliance Alternative Model Speci cationsrdquo by B JoJournal of Educational and Behavioral Statistics 27 411ndash415

(2003) Assumptions Allowing the Estimation of Direct Causal Ef-fects Discussion of ldquoHealthy Wealthy and Wise Tests for Direct CausalPaths Between Health and Socioeconomic Statusrdquo by Adams et al Journalof Econometrics 112 79ndash87

Rubin D B (1975) ldquoBayesian Inference for Causality The Importance ofRandomizationrdquo in Proceedings of the Social Statistics Section AmericanStatistical Association pp 233ndash239

Barnard et al Rejoinder 321

time that we were building the model We used only single-child families as explained in the article because regrettablyno compliance data were collected for individualchildren in themultichild families

The exclusion of kindergarten children could be consideredmore of a judgment call regarding the quality of the pretestscores not only for ef ciency but also for the reasonablenessofour assumptions about noncompliance and missing data Noth-ing in theory would prevent us from including these childrenand omitting these extra covariates

These types of trade-offs (more data vs more appropriatedata) are common in statistical analyses and the bene ts arewell known for example for avoiding model extrapolation orwhen some subsets of data do not mean the same thing as oth-ers In fact Krueger ldquoignored datardquo in his article on minimumwage with David Card (Card and Krueger 1994) by limitinganalyses to restaurants from a similar geographic region in thehope that employees and employers across the two comparisonstates would be faced with similar economic conditions ratherthan looking at all restaurants in these states This is a form ofimplicit matchingwhich we believe represented a good choiceSurely there are times when focusing on subsets of data pro-duces more reliable results

Furthermore we would like to point out that KZrsquos asser-tion that ldquobecause assignment to treatment status was random(within strata) a simple comparison of means between treat-ments and controls without conditioning on baseline scoresprovides an unbiased estimate of the average treatment effectrdquois simply false because there are missing outcomes Even ifthere is perfect compliance a simple comparison of meansacross treatment groups when there are missing outcomes willnot lead in general to an unbiased estimate of the treatmenteffect Assumptions need to be made about the missing-dataprocess

When noncompliance exists in addition to the missing outcome data, Frangakis and Rubin (1999) have shown that estimation even of the ITT effect requires additional assumptions. Moreover, the ITT estimate, generally considered a conservative estimate of the treatment effect, can actually be anticonservative in this scenario. Finally, contrary to what KZ seem to imply in their final paragraph, instrumental variables methods alone are not sufficient to address the problems of both noncompliance and missing data.
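The anticonservative case is easy to construct. In the hypothetical sketch below (all strata proportions, effects, and response rates are invented for illustration), never-takers both score lower and, when offered a voucher, show up for testing less often; the complete-case ITT then overstates the true ITT.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Principal strata: 60% compliers (effect 0.3), 40% never-takers (effect 0)
complier = rng.uniform(size=n) < 0.6
z = rng.integers(0, 2, n)
y0 = rng.normal(np.where(complier, 0.3, -0.3), 1.0)  # never-takers score lower
y = y0 + 0.3 * complier * z                          # effect only for treated compliers
true_itt = 0.6 * 0.3

# Never-takers offered a voucher are less likely to appear for testing
p_tested = np.where((z == 1) & ~complier, 0.4, 0.8)
tested = rng.uniform(size=n) < p_tested

# Complete-case ITT exceeds the true ITT: anticonservative, not conservative
naive = y[tested & (z == 1)].mean() - y[tested & (z == 0)].mean()
print(f"true ITT = {true_itt:.3f}, complete-case ITT = {naive:.3f}")
```

The treatment-arm respondents overrepresent the higher-scoring compliers, inflating the treatment-arm mean relative to the control arm.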

KZ advocate "simplicity and transparency." We wholeheartedly agree, but disagree with the examples they use to defend their stance. Their first argument is that the PMPD did not yield strong efficiency gains and that we did not account for the PMPD in the calculation of standard errors. It may well be that precision gains were relatively modest; however, the analyses that they present are by no means conclusive in this regard. First, as with KZ's other ITT analyses, we have no clue as to how missing outcome data are handled. In addition, the PMPD was also used to balance compliance and missingness. It is unclear, however, how the gains in precision due to matching the missingness and noncompliance are displayed, if at all, in KZ's results. Finally, KZ did not account for the experimental design when estimating their results. Failure to account properly for the design typically results in estimates that overstate the variances of treatment effect estimates. In this design, it is the correlation between matched-pair units that leads the sampling variances to be smaller than their estimates.
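A toy calculation (numbers invented for illustration) makes the direction of the error clear: when matched pairs share a common component, the correct paired standard error is far smaller than the one computed as if the two arms were independent samples.

```python
import numpy as np

rng = np.random.default_rng(2)
n_pairs = 5_000

# Matched pairs share a common component, inducing positive correlation
pair_effect = rng.normal(0.0, 1.0, n_pairs)
y_treat = pair_effect + rng.normal(0.2, 0.5, n_pairs)  # true effect 0.2
y_ctrl = pair_effect + rng.normal(0.0, 0.5, n_pairs)

# Correct SE uses within-pair differences; the naive SE ignores the pairing
diff = y_treat - y_ctrl
se_paired = diff.std(ddof=1) / np.sqrt(n_pairs)
se_unpaired = np.sqrt(y_treat.var(ddof=1) / n_pairs + y_ctrl.var(ddof=1) / n_pairs)
print(f"paired SE = {se_paired:.4f}, unpaired SE = {se_unpaired:.4f}")
```

Analyzing the pairs as independent samples roughly doubles the reported standard error in this sketch, which is exactly the sense in which ignoring the design overstates the variance.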

Even if precision gains were truly modest, they of course were unknown to us at the time we debated the use of the PMPD design. KZ's argument is like the complaint of a cross-country motorist who experiences no car problems during his long drive and then bemoans the money wasted on a spare tire purchased before the trip began. We prefer to avoid potential problems by using careful design.

Importantly, KZ's criticism that we ignored this matching in the analysis is incorrect. The matching was done based on the estimated propensity score, so it follows from Rubin (1978) that assignment is ignorable conditionally given the covariates that were used to estimate it. Thus the Bayesian analysis that we conduct, which conditions on the estimated propensity score, is an approximately valid Bayesian analysis with respect to the matched assignment, assuming that our modeled relationship between the outcomes and the estimated propensity score captures the major features of the relationship to the covariates.
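As a rough, purely frequentist stand-in for this logic (a sketch under assumed data; our actual analysis is Bayesian and model-based, and the regression below is only an analogy), one can estimate the propensity score from the covariates and then let the outcome model condition on the estimated score rather than on the full covariate vector.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 5_000
x = rng.normal(size=(n, 3))                     # baseline covariates
index = x @ np.array([0.5, -0.3, 0.2])          # index driving assignment and outcome

# Assignment probability depends on covariates, as after propensity matching
z = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-index))).astype(float)
y = 0.2 * z + 0.8 * index + rng.normal(size=n)  # true treatment effect 0.2

# Step 1: estimate the propensity score from the covariates
e_hat = sm.Logit(z, sm.add_constant(x)).fit(disp=0).predict(sm.add_constant(x))

# Step 2: condition the outcome model on the estimated score, not on all of x
fit = sm.OLS(y, sm.add_constant(np.column_stack([z, e_hat]))).fit()
print(f"treatment coefficient: {fit.params[1]:.3f}")  # approximately 0.2
```

The sketch works here because the outcome depends on the covariates through roughly the same index that the score captures; that is the analogue of the modeling assumption stated above.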

KZ's second argument is that use of the PMPD "led to grave consequences" because sampling weights were miscalculated by MPR. It seems to us odd to maintain, as KZ do, that the PMPD "caused" this error; nor is it clear that use of another design would have avoided the error. In any case, we have rerun our analyses, this time generalizing to the population of which the children in our analyses are representative (by setting constant weights in step 3 of Sec. 8.1.1). For this population, all results are very similar to our original results. These results are posted as Appendix C at http://biosun01.biostat.jhsph.edu/~cfrangak/papers/sc.

A third argument revolves around the protracted debate between use of ITT versus treatment-targeted measures such as CACE. We addressed both ITT and CACE, which allows the reader to focus on one or the other or both, depending on which effect is expected to generalize to another environment: the ITT, or "sociological," effect of accepting or not accepting an offered voucher, or the CACE, or "structural" or "scientific," effect of using a voucher. One could argue that the effect of attending (CACE) is the more generalizable, whereas the sociological effect (ITT) is more easily influenced, for instance, by "hype" about how well the program works. That is, the ITT estimate will not reflect the "effect of offering" if the next time the program is offered the compliance rates change (which seems likely). We prefer to examine both, to provide a more comprehensive picture.
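For orientation only: under the usual monotonicity and exclusion-restriction assumptions, and with complete outcome data, the two estimands are linked by the familiar moment identity

```latex
\mathrm{CACE}
  \;=\; \frac{\mathrm{ITT}_Y}{\Pr(\text{complier})}
  \;=\; \frac{E[Y \mid Z=1] - E[Y \mid Z=0]}
             {E[D \mid Z=1] - E[D \mid Z=0]},
```

where Z denotes the voucher offer and D its use. Our estimates come from the full Bayesian model rather than from this ratio, and, as noted earlier, missing outcomes invalidate the naive sample analogue of the right-hand side.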

In sum, we agree that simplicity and transparency are important; this is the major motivation for our detailed description of our assumptions in terms of readily understandable concepts, rather than commonly invoked and less intuitive conditions involving correlated error terms. However, this desire for simplicity must be balanced with the prudence of guarding against potential problems through careful design.

Although we agree in theory with much of what KZ espouse, some of their criticisms seem misguided. Moreover, their presentation of analyses with incomplete exposition regarding missing-data complications makes these analyses difficult for us to evaluate properly.


Berk and Xu

Berk and Xu (henceforth BX) raise several interesting questions about the way in which the evaluation was designed and implemented. Many of these are actually addressed in a preceding and lengthier article by us on this topic (Barnard et al. 2002). That article describes, for instance, MPR's efforts to reduce missing data, without which the missingness rates would surely have been much higher. It also describes special challenges faced in this study. But in any study with human subjects, it is difficult to completely eliminate missing data, particularly if one desires more interesting information than can be obtained from administrative data. One of the challenges for the collection of outcome test scores was the fact that children needed to be tested outside of the public school setting, which allowed the use of a standardized test not used by the public school system, thereby greatly reducing the possibility of bias resulting from teachers "teaching to the test." Consequently, parents had to bring their children to testing areas on weekends, a formidable hurdle for some, even when monetary incentives were used.

An important consideration with this study was that the evaluators took the initiative to investigate a program that was sponsored by an organization (SCSF) whose primary goal was to provide scholarships, not to do research. We feel that MPR did an admirable job convincing the SCSF to allow the study to be evaluated, and raising money for this purpose, but this also meant that the evaluators could not control every aspect of the program. For example, the SCSF originally wanted to make the program available only to children from the lower-test-score schools, and the evaluators had to convince them to also allow a small percentage from those in higher-test-score schools, to increase the generalizability of the results. The designs that BX suggest, even though they have certain desirable statistical properties under appropriate assumptions, were not an option with this study.

Related to this issue of the alignment of the evaluators' and program administrators' goals is the issue of the potential for reducing noncompliance rates. Some effort was made, in that help was provided to scholarship winners in finding an appropriate private school. Arguably, however, 100% compliance should not be the goal in such a study, given that we would not expect 100% compliance if the program were expanded and offered, say, by the government on a larger scale. Thus it may not be wise to use extreme efforts to increase compliance if such efforts would not be offered in the large-scale public case. With the current study, the opportunity exists to examine the noncompliance process for this program in this setting: take-up rates, predictors of noncompliance, and so on. To this end, some sort of subexperiment that randomizes a reasonable set of compliance incentives might be useful in similar studies in the future, if sample sizes are sufficient.

We are not convinced that we should have solicited actual prior distributions from educators and economists. In our experience, this is a confusing, difficult, and often unrewarding task. Moreover, school choice is such a politically charged and contentious issue that it seems wiser to try to use relatively diffuse and objective prior distributions.

We agree that results should be interpreted in terms of a yardstick of practical effects. We are unaware of the most appropriate such yardstick to use in this context. One side could claim that even tiny average effects matter, not only because of the cumulative effects on parts of society, but also because of possible nonadditivity of such effects, which could imply huge effects for some rare individuals and minute effects for all others. Others could argue that seemingly quite large effects have no long-run practical impact on quality of life. The results that we present provide a basic empirical starting point for these debates between people whose expertise is in this area.

We thank BX for highlighting our emphasis on the role of the assumptions in our analyses. Our hope is that we have presented the assumptions in such a way as to enable readers to make their own judgment about whether or not they were likely to be satisfied, rather than only stating them technically, which would implicitly make those important decisions for the interested but nontechnical readers.

BX point out a dearth of certain types of covariates. Part of the problem here is that we presented only a small portion of the data collected, due to the fact that our model in its current form did not handle many covariates. However, there were indeed discussions between some members of our team and MPR staff, during the development of the questionnaire, about what types of variables could be included that would be predictive of missing data or compliance, as well as outcomes. Some of the variables proposed were discarded due to the surveyors' experience with them as subject to problems such as high measurement error. Also, not every variable discussed was included, due to a desire for a survey that would not be overly burdensome for the study families.

We raise a final, small, semantic question. We wonder about the propriety of the label "Neyman–Rubin model" to describe a framework for causal inference that views observational studies and randomized experiments on the same continuum and, moreover, encourages Bayesian analyses of both. Neyman (1923) certainly seems to have been the first to use the formal notation of potential outcomes for randomization-based inference in randomized experiments, which was a truly major advance, but neither he nor anyone else, to the best of our knowledge, used this formal notation to define causal effects in observational studies until Rubin (1974, 1975), nor did he nor anyone else ever advocate Bayesian inference for the unobserved potential outcomes until Rubin (1975, 1978), which is the perspective that we have implemented here.

Overall, we agree with BX about the paramount importance of study design, and we tried to design this study carefully, certainly more so than most social science analyses of secondary datasets like the PSID or CPS. We thank BX for their thoughtful commentary; they raise some important points.

We thank all of the discussants for stimulating comments, which we believe have added emphasis to the importance of both the substantive topic of school vouchers and the development of appropriate methods to evaluate this and other such broken randomized experiments.

ADDITIONAL REFERENCES

Card, D., and Krueger, A. (1994), "Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania," American Economic Review, 84, 772–793.


Jo, B. (2002), "Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications" (with discussion), Journal of Educational and Behavioral Statistics, 27, 385–420.

Mealli, F., and Rubin, D. B. (2002), Discussion of "Estimation of Intervention Effects With Noncompliance: Alternative Model Specifications," by B. Jo, Journal of Educational and Behavioral Statistics, 27, 411–415.

Mealli, F., and Rubin, D. B. (2003), "Assumptions Allowing the Estimation of Direct Causal Effects," Discussion of "Healthy, Wealthy, and Wise? Tests for Direct Causal Paths Between Health and Socioeconomic Status," by Adams et al., Journal of Econometrics, 112, 79–87.

Rubin, D. B. (1975), "Bayesian Inference for Causality: The Importance of Randomization," in Proceedings of the Social Statistics Section, American Statistical Association, pp. 233–239.
