
© 2012, Richard P. PHELPS. International Test Commission, 8th Conference, Amsterdam, July 2012


The effect of testing on student achievement: 1910-2010

Richard P. PHELPS


Meta-analysis

• A method for summarizing a large research literature with a single, comparable measure.
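As a simple illustration of the idea (not taken from the slides), combining many studies into one comparable measure often reduces to a weighted average of per-study effect sizes; the function and data below are hypothetical:

```python
def weighted_mean_effect(effect_sizes, weights):
    """Combine per-study effect sizes into one summary value,
    weighting each study (for example, by its sample size)."""
    total = sum(weights)
    return sum(d * w for d, w in zip(effect_sizes, weights)) / total

# Three hypothetical studies: effect sizes and sample-size weights
print(weighted_mean_effect([0.4, 0.7, 0.6], [120, 45, 300]))  # ≈ 0.56
```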


The effect of testing on student achievement

• a 12-year-long study

• analyzed close to 700 separate studies and more than 1,600 separate effects

• 2,000 other studies were reviewed but found incomplete or inappropriate

• for lack of time and money, hundreds of other studies will not be reviewed


Looking for studies to include in the meta-analyses

1. Included only those studies that found an effect of testing on student achievement or on teacher instruction…


Studies included in the meta-analyses

2. …when:

• a test is newly introduced, or newly removed
• the quantity of testing is increased or reduced
• test stakes are introduced or increased, or removed or reduced


Studies included in the meta-analyses

3. …plus previous research summaries (e.g.)

• Kulik, Kulik, Bangert-Drowns, & Schwalb (1983-1991) on:
  – mastery testing,
  – frequency of testing, and
  – programs for high-risk university students
• Basol & Johanson (2009) on testing frequency
• Jaekyung Lee (2007) on cross-state studies
• W.J. Haynie (2007) in career-tech ed


Number of studies of effects, by methodology type

Methodology type                                 Number of studies   Number of effects
Quantitative                                     177                 640
Surveys and public opinion polls (US & Canada)   247                 813
Qualitative                                      245                 245
TOTAL                                            669                 1,698


Effect size: Cohen’s d

d = (Y_E − Y_C) / S_pooled

Y_E = mean of the experimental group

Y_C = mean of the control group

S_pooled = pooled standard deviation
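A minimal sketch of this computation (my own example, not from the slides); the sample data are made up:

```python
import math

def cohens_d(experimental, control):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    n_e, n_c = len(experimental), len(control)
    mean_e = sum(experimental) / n_e
    mean_c = sum(control) / n_c
    # unbiased sample variances
    var_e = sum((x - mean_e) ** 2 for x in experimental) / (n_e - 1)
    var_c = sum((x - mean_c) ** 2 for x in control) / (n_c - 1)
    # pooled standard deviation, weighted by degrees of freedom
    s_pooled = math.sqrt(((n_e - 1) * var_e + (n_c - 1) * var_c) / (n_e + n_c - 2))
    return (mean_e - mean_c) / s_pooled

# hypothetical test-score samples for a tested group and an untested control group
print(cohens_d([78, 85, 90, 72, 88], [70, 75, 80, 68, 77]))
```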


Effect size: Other formulae

d = t * sqrt((n1 + n2) / (n1 * n2))

d = 2r / sqrt(1 − r²)

d = ((Y_E,post − Y_E,pre) − (Y_C,post − Y_C,pre)) / S_pooled,post
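The first two conversions, as a rough sketch (function and variable names are mine, not the presentation's):

```python
import math

def d_from_t(t, n1, n2):
    """Convert an independent-samples t statistic to Cohen's d."""
    return t * math.sqrt((n1 + n2) / (n1 * n2))

def d_from_r(r):
    """Convert a correlation coefficient r to Cohen's d."""
    return 2 * r / math.sqrt(1 - r ** 2)

print(d_from_t(2.5, 30, 28))  # e.g., t = 2.5 with group sizes 30 and 28
print(d_from_r(0.30))         # e.g., r = 0.30
```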


Effect size: Interpretation

• d between 0.25 and 0.50: weak effect

• d between 0.50 and 0.75: medium effect

• d greater than 0.75: strong effect
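A trivial helper encoding the thresholds above; how to label values below 0.25 or exactly on a boundary is my assumption, since the slide does not say:

```python
def interpret_d(d):
    """Map an effect size to the labels used in these slides."""
    d = abs(d)
    if d > 0.75:
        return "strong effect"
    if d >= 0.50:
        return "medium effect"
    if d >= 0.25:
        return "weak effect"
    return "negligible"  # below the slide's lowest threshold (my label)

print(interpret_d(0.55))  # "medium effect"
```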


Quantitative studies

(population coverage ≈ 7 million persons)


Quantitative studies: Effect size

• “Bare bones” calculation: d ≈ +0.55 …a medium effect

• Bare-bones effect size adjusted for measurement error: d ≈ +0.71 …a stronger effect

• Using same-study-author aggregation: d ≈ +0.88 …a strong effect
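The slides do not show how the measurement-error adjustment was made; a standard (Hunter–Schmidt-style) attenuation correction divides the observed d by the square root of the outcome measure's reliability, sketched below with a hypothetical reliability value:

```python
import math

def disattenuate_d(d_observed, reliability):
    """Correct an observed effect size for unreliability in the outcome measure:
    d_corrected = d_observed / sqrt(r_yy). Illustrative only; the exact
    adjustment used in this meta-analysis is not specified on the slide."""
    return d_observed / math.sqrt(reliability)

# e.g., an observed d of 0.55 with a hypothetical outcome reliability of 0.60
print(disattenuate_d(0.55, 0.60))  # ≈ 0.71
```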


Which predictors matter?

Treatment group…                                           Mean effect size
…is made aware of performance, and control group is not   +0.98
…receives targeted instruction (e.g., remediation)        +0.96
…is tested with higher stakes than control group          +0.87
…is tested more frequently than control group             +0.85


More Moderators – Source of Test

Source of test          Number of studies   Mean effect size
Researcher or teacher   87                  0.93
National                24                  0.87
Commercial              38                  0.82
State or district       11                  0.72
Total                   160


More Moderators – Sponsor of Test

Sponsor of test   Number of studies   Mean effect size
International     5                   1.02
Local             99                  0.93
National          45                  0.81
State             11                  0.64
Total             160


More Moderators – Study Design

Study design                   Number of studies   Mean effect size
Pre-post                       12                  0.97
Experiment, quasi-experiment   107                 0.94
Multivariate                   26                  0.80
Experiment, posttest only      7                   0.60
Pre-post (with shadow test)    8                   0.58
Total                          160


More Moderators – Scale of Analysis

Scale of analysis   Number of studies   Mean effect size
Aggregated          9                   1.60
Small-scale         118                 0.91
Large-scale         33                  0.57
Total               160


More Moderators – Scale of Administration

Scale of administration   Number of studies   Mean effect size
Classroom                 115                 0.95
Mid-scale                 6                   0.72
Large-scale               39                  0.71
Total                     160


Surveys and opinion polls


Percentage of survey items, by respondent group and type of survey

[Bar chart: percent of survey items from education providers vs. education consumers, shown separately for public opinion polls and program evaluation surveys; vertical axis: percent, 0–50.]


Number and percent of survey items, by test stakes and target group

Test stakes   Number   %     Target group   Number   %
High          507      62    Students       393      46
Medium        184      23    Schools        281      33
Low           33       4     Teachers       116      14
Unknown       89       11    No stakes      64       7
TOTAL         813            TOTAL          854


Opinion polls, by year

• 244 polls between 1958 and 2008, in the U.S. & Canada

• 813 unique question-response combinations

• close to 700,000 individual respondents

[Chart: distribution by year, 1960–2005 (counts on the vertical axis, 0–120).]


Surveys and opinion polls: Regular standardized tests, performance tests

Respondent opinion                         Regular tests    Performance tests
                                           (N ≈ 125), d     (N ≈ 50), d
Achievement is increased                   1.2              1.0
  …weighted by size of study population    1.9              0.5
Instruction is improved                    1.0              1.4
  …weighted by size of study population    0.9              0.9
Tests help align instruction               1.0              1.0
  …weighted by size of study population    0.5              0.9


Qualitative studies: Summary

(One cannot calculate an effect size.)


Qualitative studies, by methodology type

Methodology                                      Number of studies   %
Case study                                       120                 43
Experiment or pre-post study                     21                  7
Interviews (individual or group)                 75                  27
Journal                                          2                   1
Review of official records, documents, reports   33                  12
Research review                                  8                   3
Survey                                           22                  8
TOTAL                                            281                 100


Qualitative studies: Effect on student achievement

Direction of effect   Number of studies   Percent of studies   Percent excluding "positive inferred"
Positive              204                 84                   93
Positive inferred     24                  10
Mixed                 5                   2                    2
No change             8                   3                    4
Negative              3                   1                    1
TOTAL                 244                 100                  100

244 studies conducted in the past century in over 30 countries


Qualitative studies: Testing improves student achievement and teacher instruction

Achievement is improved   Number of studies   %
Yes                       200                 95
Mixed results             1                   <1
No                        10                  5
TOTAL                     211                 100

Instruction is improved   Number of studies   %
Yes                       158                 96
No                        7                   4
TOTAL                     165                 100


Qualitative studies: Variation by rigor and test stakes

                      Level of rigor
Direction of effect   high   medium   low   Total
Positive              95     67       42    204
Positive inferred     10     8        6     24
Mixed                 3      1        1     5
No change             4      3        1     8
Negative              1      1        1     3
TOTAL                 113    80       51    244

                      Test stakes
Direction of effect   high   medium   low   unknown   Total
Positive              133    27       38    6         204
Positive inferred     12     5        7               24
Mixed                 4               1               5
No change             2      1        5               8
Negative              3                               3
TOTAL                 154    33       51    6         244


Qualitative studies: Regular standardized tests and performance tests

Study results                            Regular tests    Performance tests
                                         (N = 176), %     (N = 69), %
Generally positive                       93               95
High-stakes tests                        71               42
High level of study rigor                46               48
Student attitudes toward test positive   60               71
Teacher attitudes toward test positive   55               80
Student achievement improved             95               95
Instruction improved                     92               100
Large-scale testing                      86               68


An enormous research literature

• But assertions that it does not exist at all are common

– Some claims are made by those who oppose standardized testing, and may be wishful thinking

– Others are “firstness” claims


Dismissive research reviews

• With a dismissive research literature review, a researcher assures readers that no other researcher has studied the same topic


Firstness claims

• With a firstness claim, a researcher insists that he or she is the first ever to study a topic


Social costs are enormous

• Research conducted by those without power or celebrity is dismissed: ignored and lost

• Public policies are skewed, based exclusively on the research results of those with power or celebrity

• Society pays again and again for research that has already been done
