Using State Tests to Measure Student Achievement in Large-Scale Randomized Experiments

IES Research Conference, June 28th, 2010
Marie-Andrée Somers (Presenter), Pei Zhu, Edmond Wong
MDRC


An Empirical Assessment Based on Four Recent Evaluations

Page 1: Using State Tests to Measure Student Achievement in Large-Scale Randomized Experiments

Using State Tests to Measure Student Achievement in Large-Scale Randomized Experiments

IES Research Conference, June 28th, 2010

Marie-Andrée Somers (Presenter), Pei Zhu, Edmond Wong
MDRC


Two key concerns with using state tests in an evaluation…

They may not be suitable for the evaluation
  Validity concerns: They may not be aligned with outcomes of interest (do not provide a valid inference about program impacts)
  Reliability concerns: They may be too difficult for low-performing students (unreliable)

Variation in scale/content of state tests also complicates the task of combining impact findings across states and grades


About This Study

Funded by the Institute of Education Sciences (IES)

Purpose is to “bring data to bear” on several topics covered in the May et al. discussion paper:
  Are state tests suitable for evaluation purposes?
    As a measure of the outcome(s) of interest?
    As a measure of student achievement at baseline?
  How should impacts on state tests be pooled?
    Are impact findings sensitive to methods of rescaling and aggregating test scores across states and/or grades?


Overview of Analytical Approach

We identified 4 large-scale randomized experiments where achievement was measured using both (i) state tests AND (ii) a study test
  The study test provides a benchmark for gauging the suitability of state tests

Two types of analyses:
  Impact analyses: We compared estimated impacts on state tests and on the “benchmark” study test
  Descriptive analyses: We also examined published information on the characteristics/content of tests


Data and Samples

Studies represent diversity with respect to grade levels and outcomes

Analysis sample includes students with a state test score and a study test score

Study    Targeted Outcome             Level          Sample for Analysis
Study A  General Reading Achievement  Elementary     1,032 (9 states)
Study B  General Math Achievement     Elementary     944 (7 states)
Study C  Specific Reading Outcome     High School    1,065 (4 states)
Study D  Specific Math Outcome        Middle School  4,387 (9 states)


Approach for Estimating Impacts

Impact on state tests:
  Rescaling: Scores are z-scored by state and grade using the sample mean and standard deviation
  Pooling approach: Impacts by state and grade are aggregated using precision weighting

Impact on the study test:
  Rescaled/pooled using the same approach for comparability
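
The rescaling and pooling steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' estimation model: it assumes each impact is a simple treatment-minus-control difference in mean z-scores within a state-by-grade block, and it takes "precision weighting" to mean weights of 1 / SE². The record layout, the function names, and the scores are all hypothetical.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev


def zscore_by_block(records):
    """Standardize each score within its state-by-grade block (sample mean and SD)."""
    blocks = defaultdict(list)
    for r in records:
        blocks[(r["state"], r["grade"])].append(r["score"])
    stats = {key: (mean(vals), stdev(vals)) for key, vals in blocks.items()}
    scored = []
    for r in records:
        m, s = stats[(r["state"], r["grade"])]
        scored.append({**r, "z": (r["score"] - m) / s})
    return scored


def block_impacts(scored):
    """Per block: treatment-minus-control difference in mean z-scores and its SE."""
    arms = defaultdict(lambda: {0: [], 1: []})
    for r in scored:
        arms[(r["state"], r["grade"])][r["treat"]].append(r["z"])
    results = {}
    for key, arm in arms.items():
        t, c = arm[1], arm[0]
        est = mean(t) - mean(c)
        se = sqrt(stdev(t) ** 2 / len(t) + stdev(c) ** 2 / len(c))
        results[key] = (est, se)
    return results


def precision_weighted(impacts):
    """Pool (estimate, se) pairs using weights proportional to 1 / se^2."""
    weights = [1.0 / se ** 2 for _, se in impacts]
    pooled = sum(w * est for w, (est, _) in zip(weights, impacts)) / sum(weights)
    return pooled, sqrt(1.0 / sum(weights))


# Made-up scores on two different state scales (illustration only).
records = [
    {"state": "X", "grade": 4, "treat": 1, "score": 310},
    {"state": "X", "grade": 4, "treat": 1, "score": 330},
    {"state": "X", "grade": 4, "treat": 0, "score": 300},
    {"state": "X", "grade": 4, "treat": 0, "score": 320},
    {"state": "Y", "grade": 4, "treat": 1, "score": 55},
    {"state": "Y", "grade": 4, "treat": 1, "score": 61},
    {"state": "Y", "grade": 4, "treat": 0, "score": 50},
    {"state": "Y", "grade": 4, "treat": 0, "score": 54},
]
scored = zscore_by_block(records)
impacts = block_impacts(scored)
pooled, pooled_se = precision_weighted(list(impacts.values()))
print(pooled, pooled_se)
```

Standardizing within each block puts scores from different state scales into common standard deviation units before pooling, and the 1 / SE² weights give more precisely estimated blocks more influence on the pooled impact.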


Criteria for Assessing “Suitability”

Two dimensions of suitability
  Validity: Whether the content of state tests is aligned with the outcomes of interest in the evaluation
  Reliability: Whether state tests provide a reliable measure of achievement for the target population (in this case, low-performing students)

A key concern: State tests may have low reliability and may not yield valid inferences about program effectiveness


Criteria for Assessing “Suitability”

Implications for the impact findings:
  Poor Validity:
    Could fail to detect impacts on the outcome of interest (invalid inference about program effectiveness)
    Affects the magnitude of the estimated impact on state tests
  Low Reliability:
    Student achievement is estimated with greater error
    Affects the standard error of the estimated impact on state tests
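
The link between low reliability and a larger standard error can be illustrated with a small calculation. This is a sketch under stated assumptions, not the study's method: it uses the classical test theory definition of reliability (true-score variance divided by observed-score variance), a simple difference in means between two independent groups, and scores left in the units of the underlying true score. The function name and the input values are hypothetical.

```python
from math import sqrt


def impact_se(true_sd, reliability, n_treat, n_control):
    """SE of a treatment-control difference in mean observed scores.

    Classical test theory: reliability = true variance / observed variance,
    so observed variance = true variance / reliability.
    """
    if not 0 < reliability <= 1:
        raise ValueError("reliability must be in (0, 1]")
    observed_var = true_sd ** 2 / reliability
    return sqrt(observed_var / n_treat + observed_var / n_control)


# Hypothetical comparison: same sample, two tests with different reliability.
se_more_reliable = impact_se(true_sd=1.0, reliability=0.9, n_treat=500, n_control=500)
se_less_reliable = impact_se(true_sd=1.0, reliability=0.8, n_treat=500, n_control=500)
print(se_less_reliable / se_more_reliable)
```

Under these assumptions the ratio of the two standard errors depends only on the ratio of the reliabilities, not on the sample size.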


Criteria for Assessing “Suitability”

Reliability: Compare the standard error of the estimated impact on state tests vs. the study test
  Smaller standard error is better (more precision)

Validity: Compare the magnitude of the impact estimates, in light of estimation error…
  Compare the statistical significance of the impact findings (i.e., conclusions about program effectiveness based on p-value)
  If both estimates are statistically significant, then also compare their magnitudes
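
These two comparisons can be written as a short helper. It is a hedged sketch: the two-sided p-value uses a normal approximation, the 0.05 significance threshold is an assumption (the slides do not state one), and the function names and input numbers are hypothetical rather than taken from the four studies.

```python
from math import erfc, sqrt


def two_sided_p(estimate, se):
    """Two-sided p-value from a normal approximation to estimate / se."""
    return erfc(abs(estimate / se) / sqrt(2))


def compare_tests(state, study, alpha=0.05):
    """Compare (estimate, se) pairs for the state tests and the study test."""
    p_state, p_study = two_sided_p(*state), two_sided_p(*study)
    sig_state, sig_study = p_state < alpha, p_study < alpha
    return {
        # Reliability criterion: ratio above 1 means state tests are less precise.
        "se_ratio": state[1] / study[1],
        "p_state": p_state,
        "p_study": p_study,
        # Validity criterion, step 1: do both tests lead to the same conclusion?
        "same_conclusion": sig_state == sig_study,
        # Validity criterion, step 2: magnitudes are compared only if both are significant.
        "compare_magnitudes": sig_state and sig_study,
    }


# Hypothetical impact estimates in effect-size units.
result = compare_tests(state=(0.10, 0.06), study=(0.12, 0.05))
print(result)
```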


Criteria for Assessing Validity

The extent to which the magnitudes of the impact estimates are expected to differ depends on the outcome that state tests are intended to measure

Two types of intervention:

Targeted outcome is general achievement (Studies A and B)

The outcome of interest is “general achievement” in math or reading

Both state tests and the study test measure the targeted outcome (general achievement)

If state tests are valid, then the impact on the study test and state tests should be similar


Criteria for Assessing Validity

Two types of intervention (cont.):

Targeted outcome is a specific skill (Studies C and D)

There are two outcomes of interest: targeted skill (short-term) and general achievement (longer term)

Study test is used to measure the short-term outcome (specific skill), while state tests are used to measure the longer-term outcome (general achievement)

If state tests are valid, then the impact on state tests should be smaller than the impact on the study test


Benchmark: Impact on the Study Test


P-Value & Magnitude (Validity): Targeted Outcome is General Achievement

[Chart of estimated impacts; p-values shown: p = 0.055, p = 0.119, p = 0.229, p = 0.189]


P-Value & Magnitude (Validity): Targeted Outcome is a Specific Skill

[Chart of estimated impacts; p-values shown: p = 0.578, p = 0.002, p = 0.007, p = 0.219]


Standard Errors (Reliability)

State-Study Ratio: 1.20, 1.07, 1.04, 1.03


Conclusion

Findings suggest that state tests can be used as a complement to a study-administered test
  State tests are suitable (valid and reliable) in 3 of 4 studies

Whether state tests can be used as a substitute for a study test is an open question
  Limited availability in some grades and subjects
    Available for all states/grades in only 1 of 4 studies
  May not be able to use them to measure a specific targeted skill
  Possibly less reliable

Findings from descriptive analysis lead to the same conclusions as the impact analysis…


Questions?

Marie-Andrée Somers

Pei Zhu

Edmond Wong