National Center for Research on Evaluation, Standards, and Student Testing

University of Colorado at Boulder

Robert L. Linn

Paper prepared for The CRESST Conference: The Future of Test-Based Educational Accountability, January 23, 2007

Educational Accountability Systems

Test-based Accountability

• Popular tool for purposes of educational reform

• Accountability is one of the few tools available to policymakers to leverage changes in instruction

• In use in many states since the early 1990s

• Quite a range of approaches to using student test results for accountability systems

• Central component of NCLB

Some Rationales for Testing

• Clarify expectations for teaching and learning

• Motivate greater effort on the part of students, teachers, and administrators

• Monitor educational progress of schools and students

• Identify schools that need to be improved

• Provide a basis for distributing rewards and sanctions

• Monitor achievement gaps and encourage the closing of those gaps

No Child Left Behind

• NCLB is the latest in a series of re-authorizations of the Elementary and Secondary Education Act (ESEA) of 1965

• ESEA was the main educational component of President Johnson’s “Great Society” program

• ESEA, as re-authorized every few years, is the principal federal law affecting elementary and secondary education throughout the country

Assessments

• Basic skills and norm-referenced tests of 1980s and early 90s

• A Nation at Risk encouraged more ambitious tests, such as performance assessments

• NCLB increased uniformity of assessments in reading and mathematics for grades 3-8

Content Standards

• States encouraged to develop content standards by Goals 2000 and IASA

• NCLB requires all states to have academic content standards in reading/English language arts, mathematics, and science

• All states adopted content standards by 2005 to meet requirements of NCLB if they had not already done so

NCLB

States required to adopt “challenging academic content standards” that “specify what children are expected to know and be able to do; coherent and rigorous content; [and] encourage the teaching of advanced skills” (NCLB, 2001, part A, subpart 1, Sec. 1111, a (D)).

Performance Standards

• Called Academic Achievement Standards by NCLB

• Absolute rather than normative

• Establish fixed criterion of performance

• Intended to be challenging

• Relatively small number of levels

• Apply to all, or essentially all students

• Depend on judgment

Standards Movement

• High expectations of NCLB consistent with the standards movement of 1990s

• National Assessment of Educational Progress (NAEP) standards (called achievement levels) set at ambitious levels

• NAEP 1990 proficient level in mathematics set at high levels

• Grade 4: 87th percentile – 13% proficient or above

• Grade 8: 85th percentile – 15% proficient or above

• Grade 12: 88th percentile – 12% proficient or above

Figure 1. Differences Between Percentage of Students Proficient or Above on 2005 State Grade 8 Mathematics Assessments and 2005 Grade 8 State NAEP Mathematics Assessment (33 States; Source: Olsen, 2005)

[Chart not reproduced. Axes: State and Difference, with the Difference scale running from -20 to 70.]

Figure 2. Scatterplot of Percent Proficient or Above on Grade 8 State Mathematics Assessments and Grade 8 NAEP in 2005 for 33 States (r = .34) (Source: Olsen, 2005)

[Scatterplot not reproduced. Axes: State Test (0 to 100) and NAEP (0 to 100).]

States with the Highest and Lowest Percent Proficient or Above on State Assessments in 2005

Subject and Grade    Highest                Lowest
Reading: Grade 4     Mississippi: 89%       Missouri: 35%
Reading: Grade 8     North Carolina: 88%    South Carolina: 30%
Math: Grade 4        North Carolina: 92%    Maine & Wyo.: 39%
Math: Grade 8        Tennessee: 87%         Missouri: 16%

Contrasts of Percent Proficient or Above on NAEP and State Assessments (Grade 8 Mathematics)

State        NAEP    State Assessments
Missouri     21%     16%
Tennessee    26%     87%
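
The contrast above can be reduced to a single difference per state. The short sketch below uses only the four percentages from this slide; the direction of the subtraction (state assessment minus NAEP) is an assumption chosen so that a positive value means the state test reports more students as proficient.

```python
# Percent proficient or above, grade 8 mathematics, 2005 (values from the slide).
naep = {"Missouri": 21, "Tennessee": 26}
state = {"Missouri": 16, "Tennessee": 87}

# Assumed convention: state assessment minus NAEP, in percentage points.
difference = {name: state[name] - naep[name] for name in naep}

for name, diff in difference.items():
    print(f"{name}: {diff:+d} points")
```

Two states whose NAEP results differ by only 5 points end up 71 points apart on their own assessments, which is the variation in the stringency of "proficient" that the comparison is meant to expose.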

Alignment

• Alignment of assessments and content standards viewed as critical by proponents of standards-based reform

• NCLB peer review requires states to demonstrate alignment, usually through studies by independent contractors

Alignment of Assessments to Content Standards

• Webb

• Categorical concurrence

• Depth of knowledge consistency

• Range of knowledge correspondence

• Balance of representation

• Porter

• Content categories by cognitive demand matrix

Alignment of Assessments to Content Standards (Cont’d)

• Achieve

• Content centrality

• Performance centrality

• Challenge

• Balance

• Range

Approaches to Test-Based Accountability

• Status Approach: compare assessment results for a given year to fixed targets (the NCLB approach)

• Growth Approach: evaluate growth in achievement (allowed for NCLB pilot program states)

• “Growth” may be measured by comparing performance of successive cohorts of students

• Growth may be evaluated by longitudinal tracking of students from year to year

Status and Growth Approaches

• Status approach has many drawbacks when used to identify schools as successes or in need of improvement

• Does not account for differences in student characteristics, most importantly differences in prior achievement

• Growth approach has advantage of accounting for differences in prior achievement, but may set different standards for schools that start in different places
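
A minimal sketch of how the two approaches can rank the same schools differently. All scores, the cut score, and the target are hypothetical values invented for illustration; the function names are not taken from any accountability system.

```python
from statistics import mean

def status_met(scores, cut_score, target_pct):
    """Status: percent of this year's students at or above the cut vs. a fixed target."""
    pct = 100 * sum(s >= cut_score for s in scores) / len(scores)
    return pct >= target_pct

def mean_growth(prior, current):
    """Longitudinal growth: average gain for the same students tracked across years."""
    return mean(c - p for p, c in zip(prior, current))

# Hypothetical scale scores; each position is the same student in both years.
school_a_prior, school_a_now = [180, 190, 200, 210], [200, 210, 220, 230]
school_b_prior, school_b_now = [240, 250, 260, 270], [242, 251, 262, 271]
CUT, TARGET = 235, 50  # hypothetical cut score and percent-proficient target

print("A status met:", status_met(school_a_now, CUT, TARGET))  # low start, large gains
print("B status met:", status_met(school_b_now, CUT, TARGET))  # high start, small gains
print("A mean growth:", mean_growth(school_a_prior, school_a_now))
print("B mean growth:", mean_growth(school_b_prior, school_b_now))
```

School A misses the status target despite a 20-point average gain, while School B meets it with a 1.5-point gain, because status reflects where students started as well as what the school added.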

NCLB Pilot Program

• Five states have received approval to use growth model approaches to determining AYP

• Early results suggest that growth models do not radically alter the proportion of schools failing to make AYP

• Constraints on growth models are severe, most notably the retention of the requirement that they lead to the completely unrealistic goal of 100% proficiency by 2014

Multiple-Hurdle Approach

• NCLB uses multiple-hurdle approach

• Schools must meet multiple targets each year – participation and achievement separately for reading and mathematics for the total student body and for subgroups of sufficient size

• Many ways to fail to make AYP (miss any target), but only one way to make AYP (meet or exceed every target)

• Large schools with diverse student bodies at a relative disadvantage in comparison to small schools or schools with relatively homogeneous student bodies
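
The multiple-hurdle logic can be sketched as a conjunctive check over every applicable target. The targets, minimum subgroup size, group names, and results below are hypothetical; actual values are set by each state.

```python
INDICATORS = ("reading_pct", "math_pct", "reading_part", "math_part")
TARGETS = {"reading_pct": 60, "math_pct": 55, "reading_part": 95, "math_part": 95}
MIN_N = 30  # hypothetical minimum subgroup size

def school(groups):
    """Flatten {group: (n, {indicator: value})} into {(group, indicator): (value, n)}."""
    return {
        (group, ind): (values[ind], n)
        for group, (n, values) in groups.items()
        for ind in INDICATORS
    }

def applicable(results):
    """Keep the whole-school hurdles plus those of subgroups of sufficient size."""
    return {
        key: (value, n)
        for key, (value, n) in results.items()
        if key[0] == "all" or n >= MIN_N
    }

def makes_ayp(results):
    """Only one way to pass: meet or exceed every applicable target."""
    return all(
        value >= TARGETS[ind]
        for (group, ind), (value, n) in applicable(results).items()
    )

ok = {"reading_pct": 70, "math_pct": 65, "reading_part": 98, "math_part": 97}
small = school({"all": (120, ok)})
large = school({
    "all": (900, ok),
    "group_a": (200, ok),
    "group_b": (150, {**ok, "math_pct": 50}),    # misses a single target
    "group_c": (20, {**ok, "reading_pct": 40}),  # below MIN_N, so not counted
})

for name, results in (("small", small), ("large", large)):
    print(name, "hurdles:", len(applicable(results)), "makes AYP:", makes_ayp(results))
```

In this sketch the small school faces 4 hurdles and the large school 12, and a single subgroup shortfall is enough for the large school to miss AYP even though its schoolwide results match the small school's.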

Compensatory Approach

• State systems often use a compensatory approach rather than a multiple-hurdle approach

• An advantage of compensatory approach is that it creates fewer ways for a school to fall short of targets

• Hybrid models also possible that use a combination of compensatory and multiple-hurdle approaches
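
A minimal sketch contrasting the two decision rules on the same results. The weighted composite shown here is only one possible compensatory rule, assumed for illustration; the slides do not specify any state's formula, and the values, targets, and weights are hypothetical.

```python
def multiple_hurdle(values, targets):
    """Pass only if every indicator meets its own target."""
    return all(values[k] >= targets[k] for k in targets)

def compensatory(values, targets, weights):
    """Pass if the weighted composite meets the weighted composite target,
    so a surplus on one indicator can offset a shortfall on another."""
    composite = sum(weights[k] * values[k] for k in targets)
    required = sum(weights[k] * targets[k] for k in targets)
    return composite >= required

values = {"reading": 72, "math": 52}    # hypothetical percent proficient
targets = {"reading": 60, "math": 55}   # hypothetical targets
weights = {"reading": 0.5, "math": 0.5}

print("multiple-hurdle:", multiple_hurdle(values, targets))
print("compensatory:", compensatory(values, targets, weights))
```

Here the 12-point surplus in reading offsets the 3-point shortfall in mathematics under the compensatory rule, while the multiple-hurdle rule fails the school on mathematics alone.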

Disaggregation

• Critical for monitoring the closing of gaps in achievement

• No real relevance for small schools with homogeneous student bodies

• However, it leads to many hurdles that large, diverse schools must meet

Implications of Subgroup Results

• Schools with multiple subgroups are at a relative disadvantage compared to schools with homogeneous student populations

• May want to consider combining results across more than one year, as is already allowed for students with disabilities

Subgroup Gains in NAEP Mathematics Average Scale Scores (1996 to 2005)

Group Grade 4 Grade 8

White 14 8

Black 22 15

Hispanic 19 11

Closing Achievement Gaps: NAEP Mathematics Average Scale Scores (1996 to 2005)

Groups                Grade 4    Grade 8
White and Black       -8         -7
White and Hispanic    -5         -3
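
These gap changes follow directly from the subgroup gains in average scale scores reported for 1996 to 2005 (White 14 and 8, Black 22 and 15, Hispanic 19 and 11 at grades 4 and 8). A short sketch of the arithmetic, assuming the gap change is the White gain minus the comparison group's gain so that a negative value means the gap narrowed:

```python
# Gains in NAEP mathematics average scale scores, 1996 to 2005 (from the slides).
gains = {
    "White": {"Grade 4": 14, "Grade 8": 8},
    "Black": {"Grade 4": 22, "Grade 8": 15},
    "Hispanic": {"Grade 4": 19, "Grade 8": 11},
}

def gap_change(group, reference="White"):
    """Change in the reference-minus-group gap; negative means the gap narrowed."""
    return {grade: gains[reference][grade] - gains[group][grade]
            for grade in gains[reference]}

for group in ("Black", "Hispanic"):
    print(f"White and {group}:", gap_change(group))
```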

Use of Academic Achievement Standards

• Apparent closing or widening of achievement gaps using percent above cut scores can depend on choice of level, e.g., basic or above vs. proficient or above

• See, for example, Holland, P. W. (2002). Two measures of changes in gaps between CDF’s of test score distributions. JEBS, 27, 3-17.

Subgroup Gains in NAEP Mathematics Percent At or Above Basic or Proficient (1996 to 2005)

Group       Grade 4 Basic   Grade 4 Prof.   Grade 8 Basic   Grade 8 Prof.
White       14              20              7               9
Black       33              10              17              5
Hispanic    28              12              13              5

Changes in Achievement Gaps: NAEP Mathematics Percent At or Above Basic or Proficient (1996 to 2005)

Groups                Grade 4 Basic   Grade 4 Prof.   Grade 8 Basic   Grade 8 Prof.
White and Black       -19             +10             -10             +4
White and Hispanic    -14             +8              -6              +4
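
The sign pattern in this table can be checked against the subgroup gains in percent at or above each level (White 14, 20, 7, 9; Black 33, 10, 17, 5; Hispanic 28, 12, 13, 5). The sketch below assumes the gap change is the White gain minus the comparison group's gain; the helper name `direction` is hypothetical.

```python
# Gains in percent at or above each level, 1996 to 2005 (from the slides).
COLUMNS = ("G4 Basic", "G4 Prof.", "G8 Basic", "G8 Prof.")
gains = {
    "White": dict(zip(COLUMNS, (14, 20, 7, 9))),
    "Black": dict(zip(COLUMNS, (33, 10, 17, 5))),
    "Hispanic": dict(zip(COLUMNS, (28, 12, 13, 5))),
}

def gap_change(group, reference="White"):
    """Change in the reference-minus-group gap, in percentage points."""
    return {col: gains[reference][col] - gains[group][col] for col in COLUMNS}

def direction(change):
    """Label each column by whether the gap narrowed or widened."""
    return {col: "narrowed" if value < 0 else "widened"
            for col, value in change.items()}

for group in ("Black", "Hispanic"):
    print(f"White and {group}:", gap_change(group), direction(gap_change(group)))
```

For both comparisons and both grades, the same period shows the gap narrowing at the Basic cut and widening at the Proficient cut.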

Gaps and Percent Above Cuts

• “Using differences in percent above cut scores can give a confusing impression of a rather simple situation” (Holland, 2002)

• Need to look beyond percent basic or above and percent proficient or above

• Compare average scale scores, effect size statistics, and comparisons of distributions
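
A minimal sketch of why percent-above-cut gaps can mislead. It assumes two normally distributed groups with equal spread, with all means and cut scores hypothetical and expressed in standard-deviation units. Both groups gain the same amount, so the gap in average scores does not change at all.

```python
from statistics import NormalDist

def pct_above(mean, cut, sd=1.0):
    """Percent of a normal(mean, sd) population scoring above the cut."""
    return 100 * (1 - NormalDist(mean, sd).cdf(cut))

def gap(mean_high, mean_low, cut):
    """Gap in percent above the cut between the two groups."""
    return pct_above(mean_high, cut) - pct_above(mean_low, cut)

BASIC_CUT, PROFICIENT_CUT = -1.0, 1.0  # hypothetical cut scores
before = (0.0, -1.0)                   # group means, one SD apart
after = (0.5, -0.5)                    # both groups gain 0.5 SD

for label, cut in (("basic", BASIC_CUT), ("proficient", PROFICIENT_CUT)):
    print(f"{label}: gap before {gap(*before, cut):.1f}, after {gap(*after, cut):.1f}")
```

Under these assumptions the gap narrows at the lower cut and widens at the higher cut even though the mean difference is constant, which is the kind of confusion that average scale scores and effect sizes avoid.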

Comparing States on Closing Gaps

• Gaps measured in terms of percent proficient or above on state assessments can be quite misleading due to the wide variation in the stringency of state definitions of the proficient standard

Performance Indexes

• Focusing only on percent proficient or above has disadvantages

• Does not give credit to students moving from below basic to basic

• Encourages attention to students thought to be near the proficient cut, possibly at the expense of other students

• Performance Index scores avoid these problems

Illustration of MA Index Scores for a Hypothetical School in 2006 & 2007

Performance Level   Points   N 2006   N 2007   2006 Points   2007 Points
Prof +              100      50       50       5,000         5,000
NI high             75       75       100      5,625         7,500
NI low              50       100      125      5,000         6,250
W/F high            25       100      125      2,500         3,125
W/F low             0        75       50       0             0
Total                        400      400      18,125        21,875

School Index Scores

2006 Score = 18,125/400 = 45.31

2007 Score = 21,875/400 = 54.69

Percent Proficient or Above

2006 = 12.5%

2007 = 12.5%
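
The index arithmetic can be sketched as a weighted average of level points. The point values and the 2006 student counts come from the illustration; the shifted scenario at the end is a hypothetical added here to isolate the effect of movement below the proficient cut, and the function names are illustrative.

```python
# Points per performance level and 2006 counts, from the illustration.
POINTS = {"Prof +": 100, "NI high": 75, "NI low": 50, "W/F high": 25, "W/F low": 0}
counts_2006 = {"Prof +": 50, "NI high": 75, "NI low": 100, "W/F high": 100, "W/F low": 75}

def index_score(counts):
    """Average points per student across all performance levels."""
    total_students = sum(counts.values())
    total_points = sum(POINTS[level] * n for level, n in counts.items())
    return total_points / total_students

def pct_proficient(counts):
    """Percent of students at the proficient level or above."""
    return 100 * counts["Prof +"] / sum(counts.values())

# Hypothetical change: 25 students move up from W/F low to W/F high.
shifted = dict(counts_2006)
shifted["W/F low"] -= 25
shifted["W/F high"] += 25

for label, counts in (("2006", counts_2006), ("shifted", shifted)):
    print(f"{label}: index {index_score(counts):.2f}, "
          f"proficient {pct_proficient(counts):.1f}%")
```

The index rises when students move up between lower levels, while percent proficient or above stays at 12.5% because no student crossed the proficient cut.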

Score Inflation

• Defined as “... a gain in scores that substantially overstates the improvement in learning it implies” (Koretz, 2005)

• Research has found that gains in scores in high-stakes accountability systems often fail to generalize to other measures of achievement

• Narrow focus on past tests rather than the broader content standards can cause score inflation

• Emphasis on alignment and the need to repeat a substantial percentage of items on assessments for year-to-year equating may contribute to score inflation

Validity of Causal Inferences

• Status approach does not provide a defensible basis for inferring that a higher scoring school is more effective than a lower scoring school

• Making an inference about school quality requires the elimination of many alternate explanations of differences in student achievement other than differences in instructional effectiveness

• Prior achievement differences

• Differences in support from home

Inferences About Schools

• Growth models rule out the alternate explanation of differences in prior achievement

• Nonetheless, causal inferences about school effectiveness are not justified by the growth approach to test-based accountability

• Many rival explanations for between-school differences in growth besides differences in school quality or effectiveness

• Results better thought of as descriptive for generating hypotheses about school quality that need to be evaluated

School Characteristics and Instructional Practice

• School differences in achievement and in growth describe outcomes and can be the source of hypotheses about school effectiveness

• Accountability systems need to be informed by direct information about school characteristics and instructional practices

Conclusions

• Test-based accountability has become a pervasive part of efforts to improve education in the U.S.

• The features of accountability systems matter

• Requirement to include nearly all students in test-based accountability has brought needed attention to groups often ignored in the past

Conclusions (continued)

• Performance standards are supposed to define the level of achievement that students should reach, but

• The definition of proficient achievement varies so widely from state to state that it lacks any semblance of common meaning

• Using percent proficient or above as a primary indicator does not give credit for gains of students at other levels

• Using percent proficient or above to monitor gaps in achievement is not an adequate approach

Conclusions (continued)

• Status-based approach to accountability does not provide a valid way of distinguishing successful schools from schools that are in need of improvement

• Growth models have advantages over status models but are still best thought of as providing descriptive information rather than providing the basis for causal inferences about school quality