
Page 1: Do the Scores Mean the Same Thing If We Use the Computer? Randy Bennett rbennett@ets.org

Do the Scores Mean the Same Thing If We Use the Computer?

Randy Bennett

rbennett@ets.org

Page 2:

“The QCA will not be content with simply responding to developments in e-learning and e-assessment. If we do, we will be too late. We need to be on the front foot all the way, which is why I want National Curriculum tests to be available on-screen within 5 years to those schools that want to use them.”

From a 4/20/04 speech by Ken Boston, CEO, QCA

Page 3:

Chapter

I: Online testing in the states

II: Major issues

III: What is “comparability?”

IV: Comparability of delivery modes

V: Comparability of computer platforms

VI: Comparability of CR scoring

VII: What can we do?

VIII: Conclusion

Page 4:

The States and Online Testing

Source: Education Week survey of state technology contacts, Technology Counts, 2004

Page 5:

Generalizations from the State Initiatives

State efforts:

- Are being pursued at multiple grade levels, in all key content areas, and for a variety of populations
- Involve both low- and high-stakes assessments
- Vary widely in progress and target implementation dates
- Initially use multiple-choice items almost exclusively
- Are in some cases an explicit part of an integrated plan

Page 6:

Reasons for Delivering Tests Online

Speed of scoring and reporting

Mass customization

Promise of being able to measure things that can’t be measured on paper

Eventual reduction in costs

Page 7:

Chapter

I: Online testing in the states

II: Major issues

III: What is “comparability?”

IV: Comparability of delivery modes

V: Comparability of computer platforms

VI: Comparability of CR scoring

VII: What can we do?

VIII: Conclusion

Page 8:

Major Issues

Cost

Near-term costs of online assessment are higher than paper:

- Obtaining or upgrading equipment and Internet connectivity
- Licensing test delivery software
- Converting existing paper items to computer format
- Training staff to deliver electronic tests
- Providing on-call technical support

Page 9:

Major Issues

Time

Large-scale testing on computer requires a multi-year ramp-up:

- Developing bid requests and selecting a delivery contractor
- Creating and institutionalizing new test development and review processes
- Preparing staff, students, and parents
- Piloting and refining the test and delivery system
- Conducting research

Page 10:

Major Issues

Equipment, software, and network availability and dependability

- Instructional computing may be curtailed during the test administration period
- Technical problems may interfere with testing

Page 11:

Major Issues

Security

- Because testing will often be conducted in a window, active test content will be exposed
- If high-stakes tests are offered continuously, security can become a complicated and costly problem

Page 12:

Measurement and Fairness

Comparability of delivery modes

Comparability of computer platforms

Comparability of CR scoring

Page 13:

Chapter

I: Online testing in the states

II: Major issues

III: What is “comparability?”

IV: Comparability of delivery modes

V: Comparability of computer platforms

VI: Comparability of CR scoring

VII: What can we do?

VIII: Conclusion

Page 14:

What is “Comparability?”

Definition

- Commonality of score meaning across testing “conditions”
- Scores are comparable when they can be used interchangeably

Page 15:

What is “Comparability?”

Criteria

- Highly similar rank-ordering of individuals across conditions
- Highly similar distributions across conditions
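When the same examinees take both the paper and computer forms, these two criteria can be checked directly. A minimal sketch in pure Python (the paired score lists in the usage example are hypothetical):

```python
from statistics import mean, stdev

def ranks(xs):
    # Average ranks, 1-based; ties share the mean of their positions
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def pearson(x, y):
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def comparability_check(paper, online):
    # Criterion 1: rank-order agreement (Spearman correlation of paired scores)
    # Criterion 2: similarity of the two score distributions (mean and SD)
    return {
        "spearman_rho": pearson(ranks(paper), ranks(online)),
        "mean_diff": mean(online) - mean(paper),
        "sd_ratio": stdev(online) / stdev(paper),
    }
```

A rho near 1 together with a near-zero mean difference and an SD ratio near 1 supports comparability; a high rho with mismatched distributions is the case where equating becomes an option.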

Page 16:

When is Comparability Important?

When scores need to have common meaning with respect to:

- One another
- Some reference group
- A content standard

If scores are not comparable across “conditions,” then decisions may be wrong

Page 17:

What Kinds of Decisions Could Be Wrong?

Wrong decisions could be made:

- About individuals or groups
- In high- or low-stakes situations

Examples:

- Promotion or graduation
- Diagnosis or learning progress
- School effectiveness
- Group proficiency

Page 18:

Chapter

I: Online testing in the states

II: Major issues

III: What is “comparability?”

IV: Comparability of delivery modes

V: Comparability of computer platforms

VI: Comparability of CR scoring

VII: What can we do?

VIII: Conclusion

Page 19:

Comparability of Delivery Modes

Do the scores from paper and computer tests mean the same thing?

- Differences in presentation characteristics
- Differences in response requirements
- Differences in general administration characteristics

Page 20:

Comparability of Delivery Modes

Many agencies need to deliver in both modes:

- Not all schools have enough computers
- Some students don’t have computer skills

Page 21:

Comparability Research for Adults (mostly)

Mead & Drasgow (1993)

- Meta-analysis of studies that compared paper and computer versions of the same tests with respect to:
  - Rank ordering of individuals
  - The difference in mean scores

Page 22:

Comparability Research for Adults (mostly)

Mead & Drasgow (1993)

- Across 159 correlations, found values of:
  - .97 for timed power tests
  - .72 for speeded tests
- For the timed power tests, the mean difference between the modes was close to zero

Page 23:

Comparability Research for Population Groups

Gallagher, Bridgeman, & Cahalan (2000)

- Does delivery mode differentially affect particular groups?
- Analyzed data from the GRE, GMAT, SAT I, Praxis, and TOEFL
- Found that delivery mode consistently changed the size of the differences between some groups, but only by small amounts

Page 24:

Comparability Research for E&S Mathematics

Students score higher on P&P than online versions of MC tests

- Choi & Tinkler (2002): Grade 3
- Coon, McLeod, & Thissen (2002): Grade 5
- Ito & Sykes (2004): Grades 4-12
- Davis & Gardner (2004): Grade 10

No difference between modes

- Poggio et al. (2004): Grade 7

Page 25:

NAEP Math Online Study

8th grade students scored higher on P&P than online versions

In general, there was no differential impact of mode for population groups

Computer facility predicted online test score

Page 26:

Comparability Research for E&S Reading and Verbal Skills

Students score higher on P&P than online versions of MC tests

- Choi & Tinkler (2002): Grades 3 and 10
- Coon, McLeod, & Thissen (2002): Grade 3
- Ito & Sykes (2004): Grades 4-12
- Davis & Gardner (2004): Grade 10

No difference, or higher on online than P&P versions of MC tests

- Pommerich (2004): Grades 11-12

Page 27:

Comparability Research for CR Items

Constructed response

- Uncommon in online state assessment
- Should produce larger mode effects than MC items
  - CR items require more responding
  - More responding on computer suggests a need for greater technology skill

Page 28:

Comparability Research with Adults for Essay Tests

Adults score higher on P&P than online versions of essay tests of writing skill

- Bridgeman & Cooper (1998): GMAT
- Yu et al. (2004): Praxis
- Wolfe et al. (2004): TOEFL

Page 29:

Comparability Research for E&S Essay Tests

Score levels not comparable across modes

- Russell & Haney (1997): Grades 6-8
- Russell & Plati (2000): Grade 8
- Wolfe et al. (1996): Secondary

Computer experience and delivery mode may interact

Page 30:

Chapter

I: Online testing in the states

II: Major issues

III: What is “comparability?”

IV: Comparability of delivery modes

V: Comparability of computer platforms

VI: Comparability of CR scoring

VII: What can we do?

VIII: Conclusion

Page 31:

Comparability of Platforms

Item presentation can be affected by:

- Monitor size
- Screen resolution
- Operating system settings
- Browser settings

Page 32:

Monitor Size

May affect legibility

- Smaller monitors may be harder to read because the text is physically smaller

Does not affect the amount of information displayed

Page 33:

Screen Resolution

Affects the size of text and may affect the amount of information displayed

- Given the same screen size and font size, text displayed at high resolution will be smaller than text displayed at a lower resolution
- Higher resolution allows more words per line and lines per screen
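The geometry here is simple enough to compute: on a monitor of fixed physical width, a glyph rendered at a fixed pixel height shrinks physically as the pixel grid gets denser. A back-of-the-envelope sketch (the 12.5-inch visible width and the two resolutions are illustrative, not from the talk):

```python
def glyph_height_inches(font_px, screen_width_px, screen_width_in):
    # Physical height of a font_px-tall glyph on a display whose visible
    # width spans screen_width_px pixels over screen_width_in inches
    pixels_per_inch = screen_width_px / screen_width_in
    return font_px / pixels_per_inch

# Same monitor, same 16-pixel font, two common resolutions of the era:
low = glyph_height_inches(16, 800, 12.5)    # 800x600
high = glyph_height_inches(16, 1024, 12.5)  # 1024x768
assert high < low  # text is physically smaller at the higher resolution
```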

Page 34:
Page 35:
Page 36:

Screen Resolution

What’s the practical impact? Lower resolutions may:

- Require examinees to spend more time locating information
  - More scrolling
  - More visual sweeps to process shorter lines of text
- Increase processing difficulty if critical information is split between screens

Page 37:

Font Size

Affects the size of text and the amount of information displayed

- Smaller fonts permit more characters per line and lines per screen

Can be changed in multiple ways

- Operating system settings
- Browser settings
- Web page coding

May not be identical across machines under some Internet delivery models

Page 38:
Page 39:
Page 40:

Relevant Research

Bridgeman et al. (2003)

- Looked at the effect of variations in monitor size, resolution, and item-presentation latency on test performance for SAT items
- No effect for math scores
- Reading comprehension scores were lower for students using a smaller, lower-resolution screen than for students using a larger, higher-resolution display

Page 41:

Chapter

I: Online testing in the states

II: Major issues

III: What is “comparability?”

IV: Comparability of delivery modes

V: Comparability of computer platforms

VI: Comparability of CR scoring

VII: What can we do?

VIII: Conclusion

Page 42:

Comparability of CR Scoring

- Under different presentation methods
- By different processing mechanisms

Page 43:

Comparability of Scoring Presentation for CR Items

Constructed responses can be given to human examiners for scoring:

- On paper or online
- In handwritten or typewritten form

Do human examiners award the same scores under these conditions?

Page 44:

Comparability of Scoring Presentation for CR Items

Scoring on paper vs. online

Zhang, Powers, Wright, & Morgan (2003)

- Used responses from the AP English Language test and the AP Calculus test
- Compared scores awarded to the same handwritten responses presented in both paper and online modes
- Found little if any difference
- Findings consistent with other studies for college-age students taking Praxis, school-age students taking NAEP, and school-age students taking QCA National Curriculum tests in English

Page 45:

Comparability of Scoring Presentation for CR Items

Scoring handwritten vs. typewritten responses

Powers & Farnum (1997)

- College-age students
- Each essay appeared:
  - Typed on screen
  - Typed on paper
  - Handwritten on screen
  - Handwritten on paper
- No difference for scoring on-screen vs. on paper
- Typed essays graded lower than handwritten versions

Page 46:

Comparability of Scoring Presentation for CR Items

Scoring handwritten vs. typewritten responses

- Handwritten/typed effect consistent with other studies using college-age and E&S students
- Why does it occur?
  - Typed responses look like a final draft
  - Poor handwriting masks errors that are easier to detect in typed responses
  - Revisions are more obvious in handwritten answers
- Two studies suggest the effect can be reduced, or eliminated, through training

Page 47:

Comparability Across CR Scoring Mechanisms

CR scoring mechanisms

- Human judges
- Automated systems

Page 48:

Comparability Across CR Scoring Mechanisms

Comparability is required when we want to:

- Use humans for grading some students and the machine for other students
- Change over from human to machine scoring
- Generalize the validity evidence from tests scored by humans to the same test scored automatically

Page 49:

Comparability of Human and Automated Scoring

Commercially available automated essay scoring systems

- PEG (Measurement, Inc.)
- IEA (KAT)
- Intellimetric (Vantage)
- E-rater (ETS)

Page 50:

How Does a Machine Score Essays?

1. Identify the features of responses that predict human scores
2. Create a program to extract those features
3. Combine the extracted features to form a score
4. “Validate” the machine scores by comparing them to some criterion
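These steps can be made concrete with a deliberately toy pipeline: one surface feature (essay length) fit to human scores by a one-variable least-squares regression, then "validated" by correlating machine and human scores. Everything here is illustrative; operational systems use far richer linguistic features:

```python
import re
from statistics import mean

def extract_features(essay):
    # Step 2: extract surface features (a real system extracts many more)
    words = re.findall(r"[a-zA-Z']+", essay.lower())
    return {"length": len(words),
            "type_token": len(set(words)) / len(words)}

def train(essays, human_scores):
    # Step 3: learn how to combine features into a score; here a
    # one-feature least-squares fit of essay length onto human scores
    x = [extract_features(e)["length"] for e in essays]
    mx, my = mean(x), mean(human_scores)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, human_scores))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    return lambda essay: intercept + slope * extract_features(essay)["length"]

def validate(model, essays, human_scores):
    # Step 4: "validate" by correlating machine scores with the criterion
    preds = [model(e) for e in essays]
    mp, mh = mean(preds), mean(human_scores)
    num = sum((a - mp) * (b - mh) for a, b in zip(preds, human_scores))
    den = (sum((a - mp) ** 2 for a in preds)
           * sum((b - mh) ** 2 for b in human_scores)) ** 0.5
    return num / den
```

A length-only model also illustrates the risk: a system can correlate well with human scores for superficial reasons, which is exactly why the critical comparability and validity questions matter.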

Page 51:

Comparability Research

Some problems

- Undisclosed scoring methods
- Inadequately described or poorly conducted research
- Better-quality research concentrated at the adult level
- Results not published in peer-reviewed measurement journals
- Critical comparability (and validity) questions sometimes unaddressed

Page 52:

Critical Questions

- What writing dimensions are scored, and how are text features evaluated and mapped to these writing dimensions?
- How highly do the distributions and rank orders of automated and human scores agree?
- How similar are automated and human scores in their relations to other measures of writing?
- How effectively does automated scoring deal with unusual responses?

Page 53:

What Do We Know?

What writing dimensions are scored, and how are text features evaluated and mapped to these dimensions?

- We know a fair amount for some programs, nothing for others

How highly do the distributions and rank orders of automated and human scores agree?

- We know that machine scores agree reasonably well with human scores

How similar are automated and human scores in their relations to other measures of writing?

- We know relatively little

How effectively does automated scoring deal with unusual responses?

- We know very little

Page 54:

Chapter

I: Online testing in the states

II: Major issues

III: What is “comparability?”

IV: Comparability of delivery modes

V: Comparability of computer platforms

VI: Comparability of CR scoring

VII: What can we do?

VIII: Conclusion

Page 55:

What Can We Do?

General steps

- Increase research efforts to identify likely sources of irrelevant score variation
- Publish the results in high-quality assessment journals, which allows results to be:
  - Vetted through peer review
  - Challenged in a rejoinder
  - Disseminated to the field

Page 56:

What Can We Do?

Comparability of delivery modes

- If the rank-order agreement of scores is very high but the distributions don’t match, consider equating scores
- If equating is not possible:
  - Use two score scales
  - Separate cut-scores and/or norms
  - Use one delivery mode
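The equating step for that situation is commonly a linear transformation that maps scores from one mode onto the other's scale by matching means and standard deviations. A minimal sketch (the score samples are hypothetical, and a single-group or randomly equivalent groups data-collection design is assumed):

```python
from statistics import mean, stdev

def linear_equate(online_scores, paper_scores):
    # Return a function mapping an online score x onto the paper scale
    # so equated scores share the paper form's mean and standard deviation
    mo, so = mean(online_scores), stdev(online_scores)
    mp, sp = mean(paper_scores), stdev(paper_scores)
    return lambda x: mp + (sp / so) * (x - mo)

equate = linear_equate([40, 50, 60], [45, 55, 65])
# An online score at the online mean (50) maps to the paper mean (55)
```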

Page 57:

What Can We Do?

Comparability of platforms

- Establish hardware and software standards to limit presentation variation
- Manipulate presentation characteristics through test delivery software
- Have proctors set display characteristics before starting the test
- Design items for the lowest common denominator
- Render items intelligently

Page 58:

What Can We Do?

Comparability of CR scoring under different presentation conditions

Train examiners to avoid format effects by:

- Showing examples of papers that appear different in handwritten and typed form
- Having examiners practice grading, and then qualify on, pre-scored benchmark papers in both forms

Page 59:

What Can We Do?

Comparability of CR scoring by different processing mechanisms

- Demand to know how it works
- Run pilots to determine suitability for a given context
- Do research as part of the try-out process

Page 60:

Chapter

I: Online testing in the states

II: Major issues

III: What is “comparability?”

IV: Comparability of delivery modes

V: Comparability of computer platforms

VI: Comparability of CR scoring

VII: What can we do?

VIII: Conclusion

Page 61:

Conclusion

- Many education agencies are exploring or implementing online assessments
- Agencies hope to realize significant improvements from electronic delivery
- Agencies are encountering important issues, including measurement and fairness
- Of the measurement and fairness issues, score comparability is one of the most critical

Page 62:

Conclusion

- Score comparability may be affected by variation in delivery mode, computer platform, and the manner of grading CR items
- Some of these effects may diminish over time
- Such effects can have undesirable consequences for institutions and for individuals
- Agencies should:
  - Increase efforts to study the impact of variation in delivery and scoring
  - Take measures to manage variation found to affect performance

Page 63:

Do the Scores Mean the Same Thing If We Use the Computer?

Randy Bennett

rbennett@ets.org