
Six Major Challenges for Educational and Psychological Testing Practices

Ronald K. Hambleton, University of Massachusetts at Amherst

Annual APA Meeting, New Orleans, Aug. 11, 2006

In 1966 (when I began my studies at the University of Toronto, in Canada):

1. Multiple-Choice Tests
2. Relatively Simple Statistics (up to only ANOVA and linear regression)
3. Routine Psychometric Studies Could Be Published
4. Computer Cards/Tapes

In 2006 (40 years later)

1. Wide Array of Item Types
2. Complex Statistical Modeling of Data (IRT, GT, SEM)
3. Standard-Setting, DIF, CBT, CAT, Performance Testing, Automated Scoring and Test Development
4. Laptops, Desktops, Internet

It was impossible to predict the changes between 1966 and 2006, but a few initial predictions about the next 40 years seem possible because some trends are clear:

1. Wider Uses of Psychological Tests in International Markets
2. Advances in Modeling of Test Data
3. New Item Types/Scoring Are Coming
   - High-Fidelity Simulations
   - Item Algorithms, Item Cloning
   - Computer Scoring of Free Responses

State of Affairs Today, cont.:

4. Advances with Computer-Based Tests

5. Improvements in Score Reporting Practices (e.g., simpler, clearer, more informative displays)

6. Better Training in Psychometric Methods Is Needed (for Psychologists and Educational Research Specialists)

Two Goals of the Presentation

• Address these six (likely) advances and their impact on educational and psychological testing practices.

• Describe challenges that need to be addressed.

1. Use of Tests in International Markets

• Interest in test translations and test adaptations has increased tremendously in the past 15 years:

--Several IQ and personality tests have been adapted into more than 100 languages.

--Achievement tests for large-scale international assessments (PISA, TIMSS) are administered in over 30 languages.

1. Use of Tests in International Markets

--International use of credentialing exams is expanding (e.g., see Microsoft).

--Many high school graduation/college admissions tests are in multiple languages (e.g., see Israel, South Africa, USA).

--Health scientists' "Quality of Life" measures are receiving wide use in many languages and cultures.

--Marketing researchers are doing more of this work as well.

1. Use of Tests in International Markets

• But there are major misunderstandings about the difficulties of translating and adapting tests from one language and culture to another. (See Hambleton, Merenda, & Spielberger, 2006; ITC Brussels Conference, 2006)

Example 1

"Out of sight, out of mind" (back-translated from French):

"Invisible, insane"

Example 2 (IEA Study in Reading)

Are these words similar in meaning?

Pessimistic -- Sanguine

Adapted to:

Pessimistic -- Optimistic

Example 3 (1995 TIMSS Pilot)

Alex read his book for 1 hour and then used a bookmark to keep his place. How much longer will it take him to finish the book?

A. ½ hour
B. 2 hours
C. 5 hours
D. 10 hours

Common Misunderstandings:

• That almost anyone who knows two languages can do the translation.

• That a backward translation design is sufficient. (A forward design is needed.)

• That translators, if they have the correct training, can produce a valid instrument in a second language and culture.

• That the use of bilinguals to compile empirical evidence is sufficient.

Challenges Ahead:

• Hire qualified translators (and several of them).

• Use forward and backward designs (and newer designs) to review test items.

• Compile empirical evidence to address construct, method, and item bias.

Challenges Ahead, cont.:

• Integrate best methodologies and practices to guide future test adaptation studies.

• Recognize the complexity of the work, so more resources, time, and expertise are available to do the job consistent with ITC and AERA/APA/NCME test standards.

2. Advances in Statistical Modeling of Test and Item Level Data

• IRT models have become popular, and for several good reasons: lots of positive features (e.g., model parameter invariance, item and test information).

• Modern measurement theory and practices are now here.

Item Response Functions (4-category item):

[Figure: category response functions for k = 0, 1, 2, 3 plotted as probability against ability from -3 to +3, for an item with a_i = 1.00, b_i1 = -1.25, b_i2 = -0.25, b_i3 = 1.50.]

Graded Response Model:

$$P^{*}_{ix}(\theta) = \frac{e^{D a_i(\theta - b_{ix})}}{1 + e^{D a_i(\theta - b_{ix})}}, \qquad x = 0, 1, \ldots, m_i$$

$$P^{*}_{i0}(\theta) = 1.0, \qquad P^{*}_{i(m_i+1)}(\theta) = 0.0$$

$$P_{ix}(\theta) = P^{*}_{ix}(\theta) - P^{*}_{i(x+1)}(\theta)$$

Generalized Partial Credit Model:

$$P(x = k \mid \theta) = \frac{\exp\!\left[\sum_{s=1}^{k} a_i(\theta - b_{is})\right]}{1 + \sum_{r=1}^{m_i} \exp\!\left[\sum_{s=1}^{r} a_i(\theta - b_{is})\right]}, \qquad k = 1, \ldots, m_i$$

with $P(x = 0 \mid \theta) = \dfrac{1}{1 + \sum_{r=1}^{m_i} \exp\!\left[\sum_{s=1}^{r} a_i(\theta - b_{is})\right]}$.
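To make the two polytomous models above concrete, here is a minimal numerical sketch in Python (NumPy assumed available) that computes the category response probabilities for a single item under each model. The item parameters echo the illustrative values in the earlier figure and are not from any real calibration.

```python
import numpy as np

def grm_probs(theta, a, b, D=1.7):
    """Graded response model: P(X = x | theta) for x = 0..m.
    a: discrimination; b: increasing threshold parameters b_1..b_m."""
    b = np.asarray(b, dtype=float)
    # Cumulative ("starred") probabilities P*_x(theta) = P(X >= x | theta)
    p_star = 1.0 / (1.0 + np.exp(-D * a * (theta - b)))
    # Pad with P*_0 = 1 and P*_{m+1} = 0, then take adjacent differences
    p_star = np.concatenate(([1.0], p_star, [0.0]))
    return p_star[:-1] - p_star[1:]

def gpcm_probs(theta, a, b, D=1.7):
    """Generalized partial credit model: P(X = x | theta) for x = 0..m.
    b: step parameters b_1..b_m."""
    b = np.asarray(b, dtype=float)
    # Numerators: exp of cumulative sums; category 0 contributes exp(0) = 1
    z = np.concatenate(([0.0], np.cumsum(D * a * (theta - b))))
    z -= z.max()                      # guard against numerical overflow
    num = np.exp(z)
    return num / num.sum()

# Hypothetical 4-category item resembling the earlier figure
a, b = 1.00, [-1.25, -0.25, 1.50]
for theta in (-2.0, 0.0, 2.0):
    print(theta, np.round(grm_probs(theta, a, b), 3), np.round(gpcm_probs(theta, a, b), 3))
```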

New IRT Polytomous Response Models

• Partial credit model
• Generalized partial credit model
• Graded response model
• Logistic multidimensional model
• Rating scale models
• Hundreds more models exist!

Many Examples of Successful IRT Applications

• Automated test assembly (targeting)
• Computer-adaptive testing (shorten)
• Detection of potentially biased test items (see the sketch after this list)
• Equating (fairness and change)
• Test score reporting (e.g., item mapping) (IRT creates options)
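One common way to screen for potentially biased (DIF) items is to compare item response functions estimated separately in two groups. The sketch below, with hypothetical 2PL parameters and NumPy assumed, computes the unsigned area between two group-specific curves over a theta grid; it is a rough screen for illustration, not a formal significance test.

```python
import numpy as np

def icc_2pl(theta, a, b, D=1.7):
    """Two-parameter logistic item characteristic curve."""
    return 1.0 / (1.0 + np.exp(-D * a * (theta - b)))

def unsigned_area_dif(params_ref, params_focal, lo=-4.0, hi=4.0, n=801):
    """Approximate unsigned area between reference- and focal-group ICCs."""
    theta = np.linspace(lo, hi, n)
    gap = np.abs(icc_2pl(theta, *params_ref) - icc_2pl(theta, *params_focal))
    return float(np.sum(gap) * (theta[1] - theta[0]))   # simple Riemann sum

# Hypothetical separate calibrations of the same item in two subgroups
ref, focal = (1.1, 0.00), (1.1, 0.40)   # same discrimination, shifted difficulty
print(round(unsigned_area_dif(ref, focal), 3))   # larger values suggest possible DIF
```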

Challenges Ahead:

• There are questions of model choice (fit, practicality), and calibration of items with small samples.

• Identifying and handling dependencies in the data (common with new item types).

Challenges Ahead, cont.:

• Establishing invariance of item parameters over subgroups of the population of interest. (e.g., Black, Hispanic, White; Male, Female; state to state, country to country)

• More training is needed for persons to do the IRT applications, read the test manuals, etc.

Ability Estimation [0-1 vs. Testlet Scoring] (see the paper by Zenisky et al., JEM, 2002).

[Figure: scatter plot of polytomously-scored (testlet) ability estimates against dichotomously-scored (0-1) ability estimates, both on a -3 to +3 scale.]
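For readers unfamiliar with how ability estimates like those in the figure are obtained, here is a minimal sketch of expected a posteriori (EAP) scoring for dichotomously scored (0-1) items under a 2PL model, in Python with NumPy assumed; the item parameters and response pattern are made up for illustration.

```python
import numpy as np

def eap_estimate(responses, a, b, D=1.7, n_quad=61):
    """EAP ability estimate for a 0/1 response pattern under the 2PL model,
    using a standard normal prior on a fixed quadrature grid."""
    responses = np.asarray(responses, dtype=float)
    a, b = np.asarray(a, float), np.asarray(b, float)
    theta = np.linspace(-4, 4, n_quad)                  # quadrature points
    prior = np.exp(-0.5 * theta**2)                     # N(0,1) up to a constant
    # P(correct) for every (quadrature point, item) pair
    p = 1.0 / (1.0 + np.exp(-D * a * (theta[:, None] - b)))
    like = np.prod(np.where(responses == 1, p, 1.0 - p), axis=1)
    post = prior * like
    return np.sum(theta * post) / np.sum(post)

# Hypothetical 5-item test and one examinee's response pattern
a = [0.8, 1.0, 1.2, 0.9, 1.1]
b = [-1.0, -0.5, 0.0, 0.5, 1.0]
print(round(eap_estimate([1, 1, 1, 0, 0], a, b), 3))
```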

3. Generation of New Item Types

• Lots of "sizzle" here with simulations (e.g., virtual reality, performance tasks) and other item types. But:

--Can new skills be measured?
--Can old skills be measured better?
--What is the value added versus the costs of development? Measurement per minute of testing?

Site Planning Vignettes (Bejar, 1991)

Image from NCARB (2000)


Dynamic Problem Solving Simulation (Clauser, et al., 1997)

Image from NBME (2001)

Examples of Advances

• Pioneering research of Bennett and his colleagues with the architectural exams.

• Work of Clauser and Nungester with sequential problem solving tests in medicine.

Immediate, Less Costly, and Useful New Item Formats

• Multiple-Correct Answers
• Short Answer
• Extended Answer (Essay)
• Highlighting Text
• Inserting Text
• Ranking (or Ordering)
• Numerical Responses (Including Multiple)
• "Drag and Drop"
• Sequential Problems

• More than 50 new item formats.
• Complex item stems, sorting tasks, interactive graphics, audio, visual, job aids, sequential problems, joy sticks, touch screens, pattern scoring, and more.

Challenges Ahead:

• An increased commitment to validation of these new item types is needed:

--Face validity is important but not sufficient. Much more empirical validity evidence is needed to support the use of new item types.

--Need to judge increase in test score validity against extra time and costs.

4. Computer-Based Testing

• Advantages are well-known:
--Flexibility in scheduling tests
--Potential for immediate score reporting
--Assessment of higher-level thinking with new item types (in principle)
--New test designs (to reduce testing time)

• Many testing agencies now test on computer.

Computer-Based Test (CBT) Designs

LINEAR, CAT, MULTISTAGE

Fixed Length Multiple Forms (Linear)

• A Single Form (acceptable if volume is low)
• Multiple Parallel Forms
• "Linear on the Fly Tests" (LOFT)

[Figure: illustration of adaptive item selection: easy (E), medium (M), and hard (H) items drawn from an item bank, with correct (+) and incorrect (-) responses moving the provisional estimate along the proficiency scale from low to high.]
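The adaptive logic sketched in the figure can be illustrated in a few lines of Python (NumPy assumed): after each response, choose the unused item with the largest Fisher information at the current provisional ability estimate. The 2PL item parameters below are made up, and real CAT engines add content balancing and exposure controls on top of this rule.

```python
import numpy as np

def info_2pl(theta, a, b, D=1.7):
    """Fisher information of a 2PL item at ability theta."""
    p = 1.0 / (1.0 + np.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * p * (1.0 - p)

def select_next_item(theta_hat, a, b, administered):
    """Index of the most informative not-yet-administered item at theta_hat."""
    info = np.array([info_2pl(theta_hat, ai, bi) for ai, bi in zip(a, b)])
    info[list(administered)] = -np.inf          # mask items already used
    return int(np.argmax(info))

# Tiny hypothetical bank spanning easy, medium, and hard items
a = [0.9, 1.2, 1.0, 1.4, 1.1]
b = [-1.5, -0.5, 0.0, 0.6, 1.8]
used = {1}                                      # item 1 already administered
print(select_next_item(0.3, a, b, used))        # most informative remaining item
```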

Three-Stage Test Design

Stage 1: Routing Test
Stage 2: Easy (E), Medium (M), or Hard (H) module
Stage 3: modules E-E, E-M, M-E, M-M, H-M, H-H

Automated Test Construction

• Mimicking test development committees
• Content and statistical considerations, exposure controls
• Operations research methodology, linear programming, IRT (see the sketch after this list)
• van der Linden, Luecht, Stocking, and others have advanced the topic
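As a toy illustration of the operations-research approach mentioned above, the sketch below uses binary integer programming, via the open-source PuLP package (assumed installed), to pick a fixed-length form that maximizes information at a cut score subject to a simple content constraint. The item bank, information values, and content labels are hypothetical, and real assembly models carry far more constraints.

```python
import pulp

# Hypothetical 8-item mini bank: information at the cut score and content area
info    = [0.42, 0.55, 0.38, 0.61, 0.47, 0.33, 0.58, 0.50]
content = ["alg", "alg", "geom", "geom", "alg", "geom", "alg", "geom"]
n_items, test_length = len(info), 4

prob = pulp.LpProblem("mini_test_assembly", pulp.LpMaximize)
x = [pulp.LpVariable(f"x{i}", cat="Binary") for i in range(n_items)]

# Objective: maximize total information at the cut score
prob += pulp.lpSum(info[i] * x[i] for i in range(n_items))
# Constraints: fixed test length, at least 2 items from each content area
prob += pulp.lpSum(x) == test_length
for area in ("alg", "geom"):
    prob += pulp.lpSum(x[i] for i in range(n_items) if content[i] == area) >= 2

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([i for i in range(n_items) if x[i].value() == 1])   # selected item indices
```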

One Big Challenge: Item Exposure

• Items are exposed to candidates every day that testing is done.

• How serious is item exposure? When present, test score validity is lowered (e.g., the GRE example).

Moving Averages (Ning & Hambleton, 2006)

[Figure: example of an exposed item: the moving average of daily item performance drifts outside control limits placed at M + 2*SD and M - 2*SD.]
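The moving-average idea referenced above can be illustrated with a small Python sketch (NumPy assumed; window size, baseline period, and simulated daily p-values are all made up for illustration, not the specific Ning and Hambleton procedure): compute a moving average of the item's daily proportion correct and flag days where it drifts outside control limits set at the baseline mean plus or minus two standard deviations, a possible signal of exposure.

```python
import numpy as np

def flag_exposure(daily_p, baseline_days=30, window=7):
    """Flag days whose moving-average p-value falls outside M +/- 2*SD,
    where M and SD come from an initial baseline period."""
    daily_p = np.asarray(daily_p, dtype=float)
    m, sd = daily_p[:baseline_days].mean(), daily_p[:baseline_days].std()
    # Trailing moving average of the daily proportion-correct values
    kernel = np.ones(window) / window
    moving_avg = np.convolve(daily_p, kernel, mode="valid")
    days = np.arange(window - 1, len(daily_p))
    flagged = days[(moving_avg > m + 2 * sd) | (moving_avg < m - 2 * sd)]
    return m, sd, flagged

# Simulated item: stable p near 0.55, then drifting upward as if the item leaked
rng = np.random.default_rng(1)
p_values = np.concatenate([rng.normal(0.55, 0.02, 60), rng.normal(0.68, 0.02, 20)])
m, sd, flagged = flag_exposure(p_values)
print(round(m, 3), round(sd, 3), flagged[:5])   # first few flagged days
```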

One Big Challenge: Item Exposure

• How can item exposure be detected?

• How much more vulnerable are the performance-based tasks?

• How can the tasks be disguised and/or cloned? What is the impact of even minor revisions on item statistics?

• Can item types be found that may be less susceptible to exposure?

Other Challenges, cont.:

• How to make CBT cost-effective for schools?

• Researching other ways to address item exposure: increasing the size of item banks via cloning, algorithmic item writing, rotating banks, writing items to statistical specs, etc.

• Matching test designs to intended uses of the scores.

5. Improvements in Score Reporting

• One of the least studied topics in assessment today (do you know of any research?), and one of the most important:

• Lots of evidence that score users are easily confused. (The concept of measurement error is not understood; error bands are confusing.)

Score Reporting

• Critically important topic, and almost no educational research studies available.

• Substantial empirical evidence suggesting that policy-makers, educators, and the public are confused by test score scales and reports. (What are typical IQ scores?)

• Thanks to April Zenisky for the next slide:

[Slide annotations on a sample score report: "Put the results for both years for a single state together, then list the next state." "Lots of questions about the axis here."]

One Promising Advance:

• Placing meaningful points on test score scales: e.g., performance standards, defining skills at selected scores, providing averages, the "market basket" concept (e.g., explaining what respondents can do in relation to a collection of test items).

[Figure: expected score (on the 0-1 metric) plotted against the proficiency scale from -3 to +3, illustrating how a point on the score scale can be given meaning.]
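The expected-score curve in the figure is simply the test characteristic curve rescaled to the 0-1 metric. A minimal Python sketch (NumPy assumed, 2PL items with made-up parameters) shows how such a reference point, the expected proportion correct on a "market basket" of items at a given proficiency, can be computed.

```python
import numpy as np

def expected_proportion_correct(theta, a, b, D=1.7):
    """Expected score on the 0-1 metric: average P(correct) over the item set."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    p = 1.0 / (1.0 + np.exp(-D * a * (theta - b)))
    return p.mean()

# Hypothetical "market basket" of 6 items
a = [0.8, 1.0, 1.2, 0.9, 1.1, 1.3]
b = [-1.5, -1.0, -0.3, 0.2, 0.8, 1.4]
for theta in (-2, -1, 0, 1, 2):
    print(theta, round(expected_proportion_correct(theta, a, b), 2))
```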

Reporting Category   Items   Points
Topic 1                13       16
Topic 2                18       21
Topic 3                 9       12
Topic 4                 8       11
Topic 5                12       15

Item Characteristic Curves for an Item Bank

[Figure: item characteristic curves for a bank of items, with items mapped to the score scale at a response probability of P = 0.65; scale regions labeled W, N, and P.]

Candidate Diagnostic Score Report 1

Columns: Candidate Performance | Content / Skill Areas | Performance Level (example candidate score: 60)

Performance Level   Score Range   Level Descriptors
PASSING             75 to 100     Candidates in this performance level can [text to be inserted here]. Many candidates in this performance level do not [insert relevant text here].
NEAR PASSING        65 to 74      Candidates in this performance level can [text to be inserted here]. Many candidates in this performance level do not [insert relevant text here].
WEAKNESSES          55 to 64      Candidates in this performance level can [text to be inserted here]. Many candidates in this performance level do not [insert relevant text here].
MAJOR WEAKNESSES    1 to 54       Candidates in this performance level can [text to be inserted here]. Many candidates in this performance level do not [insert relevant text here].

Diagnostic Score Report No. 2

Challenges:

• Can we develop empirically-based principles to assist in the design of meaningful and useful score scales and reports?

• How can diagnostic reports be enhanced? (e.g., rule space methodology, MIRT, collateral and prior information)

Challenges, cont.:

• Evaluation of new methods for studying score reports: focus groups, "think aloud" studies, experimental studies, field tests.

• Need to commit more resources and time to this immensely important topic!

6. Improvement in Training for Specialists and Others

• Major shortage of persons with good psychometric training.

• We need to do a better job in training educators and psychologists to construct and to use tests incorporating recent advances.

--Many Schools of Education and Psychology offer only minimal training.

Challenges:

• What knowledge and skills do modern psychometricians need?

• What do counselors, teachers, and others need to learn about testing and testing practices to increase the validity of test score uses?

Conclusions

• It is easy to make the case that the emerging technology (IRT models, computers, item types, etc.) should be used to improve credentialing, selection, achievement, and personality tests; face validity is high.

• At the same time, research on the various advances must be carried out, and AERA-APA-NCME Test Standards followed, to confirm the strengths and weaknesses of these advances.

Conclusions, cont.:

• Innovations and technological advances without supporting research findings and validity evidence are simply "sizzle" and "marketing," and won't necessarily lead to more valid assessments.

Conclusions, cont.:

More important topics to study too:

• Admissions testing
• Cognition and testing
• Hierarchical modeling and analysis of test data

Conclusions, cont.:

• A strong argument has been made here for full employment of psychometricians!

• At the same time, all six topics, and many more, are critical if tests in the 21st century are going to meet the complex informational needs of our society.
