
STANDARDS for Educational and Psychological Testing

American Educational Research Association
American Psychological Association

National Council on Measurement in Education


Copyright © 2014 by the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education. All rights reserved. No part of this publication may be reproduced or distributed in any form or by any means, including, but not limited to, the process of scanning and digitization, or stored in a database or retrieval system, without the prior written permission of the publisher.

Published by the
American Educational Research Association
1430 K St., NW, Suite 1200
Washington, DC 20005

Printed in the United States of America

Prepared by the Joint Committee on the Standards for Educational and Psychological Testing of the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education

Library of Congress Cataloging-in-Publication Data

American Educational Research Association.
Standards for educational and psychological testing / American Educational Research Association, American Psychological Association, National Council on Measurement in Education.

pages cm
“Prepared by the Joint Committee on Standards for Educational and Psychological Testing of the American Educational Research Association, American Psychological Association and National Council on Measurement in Education”—T.p. verso.
Includes index.
ISBN 978-0-935302-35-6 (alk. paper)
1. Educational tests and measurements—Standards—United States. 2. Psychological tests—Standards—United States. I. American Psychological Association. II. National Council on Measurement in Education. III. Joint Committee on Standards for Educational and Psychological Testing (U.S.) IV. Title.
LB3051.A693 2014
371.26'0973—dc23

2014009333


CONTENTS

PREFACE

INTRODUCTION
    The Purpose of the Standards
    Legal Disclaimer
    Tests and Test Uses to Which These Standards Apply
    Participants in the Testing Process
    Scope of the Revision
    Organization of the Volume
    Categories of Standards
    Presentation of Individual Standards
    Cautions to Be Considered in Using the Standards

PART I: FOUNDATIONS

1. Validity
    Background
    Sources of Validity Evidence
    Integrating the Validity Evidence
    Standards for Validity
        Cluster 1. Establishing Intended Uses and Interpretations
        Cluster 2. Issues Regarding Samples and Settings Used in Validation
        Cluster 3. Specific Forms of Validity Evidence

2. Reliability/Precision and Errors of Measurement
    Background
    Implications for Validity
    Specifications for Replications of the Testing Procedure
    Evaluating Reliability/Precision
    Reliability/Generalizability Coefficients
    Factors Affecting Reliability/Precision
    Standard Errors of Measurement
    Decision Consistency
    Reliability/Precision of Group Means
    Documenting Reliability/Precision
    Standards for Reliability/Precision
        Cluster 1. Specifications for Replications of the Testing Procedure
        Cluster 2. Evaluating Reliability/Precision
        Cluster 3. Reliability/Generalizability Coefficients
        Cluster 4. Factors Affecting Reliability/Precision
        Cluster 5. Standard Errors of Measurement
        Cluster 6. Decision Consistency
        Cluster 7. Reliability/Precision of Group Means
        Cluster 8. Documenting Reliability/Precision

3. Fairness in Testing
    Background
    General Views of Fairness
    Threats to Fair and Valid Interpretations of Test Scores
    Minimizing Construct-Irrelevant Components Through Test Design and Testing Adaptations
    Standards for Fairness
        Cluster 1. Test Design, Development, Administration, and Scoring Procedures That Minimize Barriers to Valid Score Interpretations for the Widest Possible Range of Individuals and Relevant Subgroups
        Cluster 2. Validity of Test Score Interpretations for Intended Uses for the Intended Examinee Population
        Cluster 3. Accommodations to Remove Construct-Irrelevant Barriers and Support Valid Interpretations of Scores for Their Intended Uses
        Cluster 4. Safeguards Against Inappropriate Score Interpretations for Intended Uses

PART II: OPERATIONS

4. Test Design and Development
    Background
    Test Specifications
    Item Development and Review
    Assembling and Evaluating Test Forms
    Developing Procedures and Materials for Administration and Scoring
    Test Revisions
    Standards for Test Design and Development
        Cluster 1. Standards for Test Specifications
        Cluster 2. Standards for Item Development and Review
        Cluster 3. Standards for Developing Test Administration and Scoring Procedures and Materials
        Cluster 4. Standards for Test Revision

5. Scores, Scales, Norms, Score Linking, and Cut Scores
    Background
    Interpretations of Scores
    Norms
    Score Linking
    Cut Scores
    Standards for Scores, Scales, Norms, Score Linking, and Cut Scores
        Cluster 1. Interpretations of Scores
        Cluster 2. Norms
        Cluster 3. Score Linking
        Cluster 4. Cut Scores

6. Test Administration, Scoring, Reporting, and Interpretation
    Background
    Standards for Test Administration, Scoring, Reporting, and Interpretation
        Cluster 1. Test Administration
        Cluster 2. Test Scoring
        Cluster 3. Reporting and Interpretation

7. Supporting Documentation for Tests
    Background
    Standards for Supporting Documentation for Tests
        Cluster 1. Content of Test Documents: Appropriate Use
        Cluster 2. Content of Test Documents: Test Development
        Cluster 3. Content of Test Documents: Test Administration and Scoring
        Cluster 4. Timeliness of Delivery of Test Documents

8. The Rights and Responsibilities of Test Takers
    Background
    Standards for Test Takers’ Rights and Responsibilities
        Cluster 1. Test Takers’ Rights to Information Prior to Testing
        Cluster 2. Test Takers’ Rights to Access Their Test Results and to Be Protected From Unauthorized Use of Test Results
        Cluster 3. Test Takers’ Rights to Fair and Accurate Score Reports
        Cluster 4. Test Takers’ Responsibilities for Behavior Throughout the Test Administration Process

9. The Rights and Responsibilities of Test Users
    Background
    Standards for Test Users’ Rights and Responsibilities
        Cluster 1. Validity of Interpretations
        Cluster 2. Dissemination of Information
        Cluster 3. Test Security and Protection of Copyrights

PART III: TESTING APPLICATIONS

10. Psychological Testing and Assessment
    Background
    Test Selection and Administration
    Test Score Interpretation
    Collateral Information Used in Psychological Testing and Assessment
    Types of Psychological Testing and Assessment
    Purposes of Psychological Testing and Assessment
    Summary
    Standards for Psychological Testing and Assessment
        Cluster 1. Test User Qualifications
        Cluster 2. Test Selection
        Cluster 3. Test Administration
        Cluster 4. Test Interpretation
        Cluster 5. Test Security

11. Workplace Testing and Credentialing
    Background
    Employment Testing
    Testing in Professional and Occupational Credentialing
    Standards for Workplace Testing and Credentialing
        Cluster 1. Standards Generally Applicable to Both Employment Testing and Credentialing
        Cluster 2. Standards for Employment Testing
        Cluster 3. Standards for Credentialing

12. Educational Testing and Assessment
    Background
    Design and Development of Educational Assessments
    Use and Interpretation of Educational Assessments
    Administration, Scoring, and Reporting of Educational Assessments
    Standards for Educational Testing and Assessment
        Cluster 1. Design and Development of Educational Assessments
        Cluster 2. Use and Interpretation of Educational Assessments
        Cluster 3. Administration, Scoring, and Reporting of Educational Assessments

13. Uses of Tests for Program Evaluation, Policy Studies, and Accountability
    Background
    Evaluation of Programs and Policy Initiatives
    Test-Based Accountability Systems
    Issues in Program and Policy Evaluation and Accountability
    Additional Considerations
    Standards for Uses of Tests for Program Evaluation, Policy Studies, and Accountability
        Cluster 1. Design and Development of Testing Programs and Indices for Program Evaluation, Policy Studies, and Accountability Systems
        Cluster 2. Interpretations and Uses of Information From Tests Used in Program Evaluation, Policy Studies, and Accountability Systems

GLOSSARY

INDEX

PREFACE

This edition of Standards for Educational and Psychological Testing is sponsored by the American Educational Research Association (AERA), the American Psychological Association (APA), and the National Council on Measurement in Education (NCME). Earlier documents from the sponsoring organizations also guided the development and use of tests. The first was Technical Recommendations for Psychological Tests and Diagnostic Techniques, prepared by an APA committee and published by APA in 1954. The second was Technical Recommendations for Achievement Tests, prepared by a committee representing AERA and the National Council on Measurement Used in Education (NCMUE) and published by the National Education Association in 1955.

The third, which replaced the earlier two, was prepared by a joint committee representing AERA, APA, and NCME and was published by APA in 1966. It was the first edition of the Standards for Educational and Psychological Testing, also known as the Standards. Three subsequent editions of the Standards were prepared by joint committees representing AERA, APA, and NCME, published in 1974, 1985, and 1999.

The current Standards Management Committee was formed by AERA, APA, and NCME, the three sponsoring organizations, in 2005, consisting of one representative from each organization. The committee’s responsibilities included determining whether the 1999 Standards needed revision and then creating the charge, budget, and work timeline for a joint committee; appointing joint committee co-chairs and members; overseeing finances and a development fund; and performing other tasks related to the revision and publication of the Standards.

Standards Management Committee

Wayne J. Camara (Chair), appointed by APA
David Frisbie (2008–present), appointed by NCME
Suzanne Lane, appointed by AERA
Barbara S. Plake (2005–2007), appointed by NCME

The present edition of the Standards was developed by the Joint Committee on the Standards for Educational and Psychological Testing, appointed by the Standards Management Committee in 2008. Members of the Joint Committee are members of at least one of the three sponsoring organizations, AERA, APA, and NCME. The Joint Committee was charged with the revision of the Standards and the preparation of a final document for publication. It held its first meeting in January 2009.

Joint Committee on the Standards for Educational and Psychological Testing
Barbara S. Plake (Co-Chair)
Lauress L. Wise (Co-Chair)
Linda L. Cook
Fritz Drasgow
Brian T. Gong
Laura S. Hamilton
Jo-Ida Hansen
Joan L. Herman
Michael T. Kane
Michael J. Kolen
Antonio E. Puente
Paul R. Sackett
Nancy T. Tippins
Walter D. Way
Frank C. Worrell

Each sponsoring organization appointed one or two liaisons, some of whom were members of the Joint Committee, to serve as the communication conduits between the sponsoring organizations and the committee during the revision process.

Liaisons to the Joint Committee
AERA: Joan L. Herman
APA: Michael J. Kolen and Frank C. Worrell
NCME: Steve Ferrara

Marianne Ernesto (APA) served as the project director for the Joint Committee, and Dianne L. Schneider (APA) served as the project coordinator. Gerald Sroufe (AERA) provided administrative support for the Management Committee. APA’s legal counsel managed the external legal review of the Standards. Daniel R. Eignor and James C. Impara reviewed the Standards for technical accuracy and consistency across chapters.

In 2008, each of the three sponsoring organizations released a call for comments on the 1999 Standards. Based on a review of the comments received, the Management Committee identified four main content areas of focus for the revision: technological advances in testing, increased use of tests for accountability and education policy-setting, access for all examinee populations, and issues associated with workplace testing. In addition, the committee gave special attention to ensuring a common voice and consistent use of technical language across chapters.

In January 2011, a draft of the revised Standards was made available for public review and comment. Organizations that submitted comments on the draft and/or comments in response to the 2008 call for comments are listed below. Many individuals from each organization contributed comments, as did many individual members of AERA, APA, and NCME. The Joint Committee considered each comment in its revision of the Standards. These thoughtful reviews from a variety of professional vantage points helped the Joint Committee in drafting the final revisions of the present edition of the Standards.

Comments came from the following organizations:

Sponsoring Organizations
American Educational Research Association
American Psychological Association
National Council on Measurement in Education

Professional Associations
American Academy of Clinical Neuropsychology
American Board of Internal Medicine
American Counseling Association
American Institute of CPAs, Examinations Team
APA Board for the Advancement of Psychology in the Public Interest
APA Board of Educational Affairs
APA Board of Professional Affairs
APA Board of Scientific Affairs
APA Policy and Planning Board
APA Committee on Aging
APA Committee on Children, Youth, and Families
APA Committee on Ethnic Minority Affairs
APA Committee on International Relations in Psychology
APA Committee on Legal Issues
APA Committee on Psychological Tests and Assessment
APA Committee on Socioeconomic Status
APA Society for the Psychology of Women (Division 35)
APA Division of Evaluation, Measurement, and Statistics (Division 5)
APA Division of School Psychology (Division 16)
APA Ethics Committee
APA Society for Industrial and Organizational Psychology (Division 14)
APA Society of Clinical Child and Adolescent Psychology (Division 53)
APA Society of Counseling Psychology (Division 17)
Asian American Psychological Association
Association of Test Publishers
District of Columbia Psychological Association
Massachusetts Neuropsychological Society
Massachusetts Psychological Association
National Academy of Neuropsychology
National Association of School Psychologists
National Board of Medical Examiners
National Council of Teachers of Mathematics
NCME Board of Directors
NCME Diversity Issues and Testing Committee
NCME Standards and Test Use Committee

Testing Companies

ACT
Alpine Testing Solutions
The College Board
Educational Testing Service
Harcourt Assessment, Inc.
Hogan Assessment Systems
Pearson
Prometric
Vangent Human Capital Management
Wonderlic, Inc.

Academic and Research Institutions

Center for Educational Assessment, University of Massachusetts
George Washington University Center for Equity and Excellence in Education
Human Resources Research Organization (HumRRO)
National Center on Educational Outcomes, University of Minnesota

Credentialing Organizations
American Registry of Radiologic Technologists
National Board for Certified Counselors
National Board of Medical Examiners

Other Institutions
California Department of Education
Equal Employment Advisory Council
Fair Access Coalition on Testing
Instituto de Evaluación e Ingeniería Avanzada, Mexico
Qualifications and Curriculum Authority, UK Department for Education
Performance Testing Council

When the Joint Committee completed its final revision of the Standards, it submitted the revision to the three sponsoring organizations for approval and endorsement. Each organization had its own governing body and mechanism for approval, as well as a statement on the meaning of its approval:

AERA: The AERA’s approval of the Standards means that the Council adopts the document as AERA policy.

APA: The APA’s approval of the Standards means that the Council of Representatives adopts the document as APA policy.

NCME: The Standards for Educational and Psychological Testing has been endorsed by NCME, and this endorsement carries with it an ethical imperative for all NCME members to abide by these standards in the practice of measurement.

Although the Standards is prescriptive, it does not contain enforcement mechanisms. The Standards was formulated with the intent of being consistent with other standards, guidelines, and codes of conduct published by the three sponsoring organizations.

Joint Committee on the Standards for Educational and Psychological Testing

INTRODUCTION

Educational and psychological testing and assessment are among the most important contributions of cognitive and behavioral sciences to our society, providing fundamental and significant sources of information about individuals and groups. Not all tests are well developed, nor are all testing practices wise or beneficial, but there is extensive evidence documenting the usefulness of well-constructed, well-interpreted tests. Well-constructed tests that are valid for their intended purposes have the potential to provide substantial benefits for test takers and test users. Their proper use can result in better decisions about individuals and programs than would result without their use and can also provide a route to broader and more equitable access to education and employment. The improper use of tests, on the other hand, can cause considerable harm to test takers and other parties affected by test-based decisions. The intent of the Standards for Educational and Psychological Testing is to promote sound testing practices and to provide a basis for evaluating the quality of those practices. The Standards is intended for professionals who specify, develop, or select tests and for those who interpret, or evaluate the technical quality of, test results.

The Purpose of the Standards

The purpose of the Standards is to provide criteria for the development and evaluation of tests and testing practices and to provide guidelines for assessing the validity of interpretations of test scores for the intended test uses. Although such evaluations should depend heavily on professional judgment, the Standards provides a frame of reference to ensure that relevant issues are addressed. All professional test developers, sponsors, publishers, and users should make reasonable efforts to satisfy and follow the Standards and should encourage others to do so. All applicable standards should be met by all tests and in all test uses unless a sound professional reason is available to show why a standard is not relevant or technically feasible in a particular case.

The Standards makes no attempt to provide psychometric answers to questions of public policy regarding the use of tests. In general, the Standards advocates that, within feasible limits, the relevant technical information be made available so that those involved in policy decisions may be fully informed.

Legal Disclaimer

The Standards is not a statement of legal requirements, and compliance with the Standards is not a substitute for legal advice. Numerous federal, state, and local statutes, regulations, rules, and judicial decisions relate to some aspects of the use, production, maintenance, and development of tests and test results and impose standards that may be different for different types of testing. A review of these legal issues is beyond the scope of the Standards, the distinct purpose of which is to set forth the criteria for sound testing practices from the perspective of cognitive and behavioral science professionals. Where it appears that one or more standards address an issue on which established legal requirements may be particularly relevant, the standard, comment, or introductory material may make note of that fact. Lack of specific reference to legal requirements, however, does not imply the absence of a relevant legal requirement. When applying standards across international borders, legal differences may raise additional issues or require different treatment of issues.

In some areas, such as the collection, analysis, and use of test data and results for different subgroups, the law may both require participants in the testing process to take certain actions and prohibit those participants from taking other actions. Furthermore, because the science of testing is an evolving discipline, recent revisions to the Standards may not be reflected in existing legal authorities, including judicial decisions and agency guidelines. In all situations, participants in the testing process should obtain the advice of counsel concerning applicable legal requirements.

In addition, although the Standards is not enforceable by the sponsoring organizations, it has been repeatedly recognized by regulatory authorities and courts as setting forth the generally accepted professional standards that developers and users of tests and other selection procedures follow. Compliance or noncompliance with the Standards may be used as relevant evidence of legal liability in judicial and regulatory proceedings. The Standards therefore merits careful consideration by all participants in the testing process.

Nothing in the Standards is meant to constitute legal advice. Moreover, the publishers disclaim any and all responsibility for liability created by participation in the testing process.

Tests and Test Uses to Which These Standards Apply

A test is a device or procedure in which a sample of an examinee’s behavior in a specified domain is obtained and subsequently evaluated and scored using a standardized process. Whereas the label test is sometimes reserved for instruments on which responses are evaluated for their correctness or quality, and the terms scale and inventory are used for measures of attitudes, interest, and dispositions, the Standards uses the single term test to refer to all such evaluative devices.

A distinction is sometimes made between tests and assessments. Assessment is a broader term than test, commonly referring to a process that integrates test information with information from other sources (e.g., information from other tests, inventories, and interviews; or the individual’s social, educational, employment, health, or psychological history). The applicability of the Standards to an evaluation device or method is determined by substance and not altered by the label applied to it (e.g., test, assessment, scale, inventory). The Standards should not be used as a checklist, as is emphasized in the section “Cautions to Be Considered in Using the Standards” at the end of this chapter.

Tests differ on a number of dimensions: the mode in which test materials are presented (e.g., paper-and-pencil, oral, or computerized administration); the degree to which stimulus materials are standardized; the type of response format (selection of a response from a set of alternatives, as opposed to the production of a free-form response); and the degree to which test materials are designed to reflect or simulate a particular context. In all cases, however, tests standardize the process by which test takers’ responses to test materials are evaluated and scored. As noted in prior versions of the Standards, the same general types of information are needed to judge the soundness of results obtained from using all varieties of tests.

The precise demarcation between measurement devices used in the fields of educational and psychological testing that do and do not fall within the purview of the Standards is difficult to identify. Although the Standards applies most directly to standardized measures generally recognized as “tests,” such as measures of ability, aptitude, achievement, attitudes, interests, personality, cognitive functioning, and mental health, the Standards may also be usefully applied in varying degrees to a broad range of less formal assessment techniques. Rigorous application of the Standards to unstandardized employment assessments (such as some job interviews), to the broad range of unstructured behavior samples used in some forms of clinical and school-based psychological assessment (e.g., an intake interview), or to instructor-made tests that are used to evaluate student performance in education and training is generally not possible. It is useful to distinguish between devices that lay claim to the concepts and techniques of the field of educational and psychological testing and devices that represent unstandardized or less standardized aids to day-to-day evaluative decisions. Although the principles and concepts underlying the Standards can be fruitfully applied to day-to-day decisions—such as when a business owner interviews a job applicant, a manager evaluates the performance of subordinates, a teacher develops a classroom assessment to monitor student progress toward an educational goal, or a coach evaluates a prospective athlete—it would be overreaching to expect that the standards of the educational and psychological testing field be followed by those making such decisions. In contrast, a structured interviewing system developed by a psychologist and accompanied by claims that the system has been found to be predictive of job performance in a variety of other settings falls within the purview of the Standards. Adhering to the Standards becomes more critical as the stakes for the test taker and the need to protect the public increase.

Participants in the Testing Process

Educational and psychological testing and assessment involve and significantly affect individuals, institutions, and society as a whole. The individuals affected include students, parents, families, teachers, educational administrators, job applicants, employees, clients, patients, supervisors, executives, and evaluators, among others. The institutions affected include schools, colleges, businesses, industry, psychological clinics, and government agencies. Individuals and institutions benefit when testing helps them achieve their goals. Society, in turn, benefits when testing contributes to the achievement of individual and institutional goals.

There are many participants in the testing process, including, among others, (a) those who prepare and develop the test; (b) those who publish and market the test; (c) those who administer and score the test; (d) those who interpret test results for clients; (e) those who use the test results for some decision-making purpose (including policy makers and those who use data to inform social policy); (f) those who take the test by choice, direction, or necessity; (g) those who sponsor tests, such as boards that represent institutions or governmental agencies that contract with a test developer for a specific instrument or service; and (h) those who select or review tests, evaluating their comparative merits or suitability for the uses proposed. In general, those who are participants in the testing process should have appropriate knowledge of tests and assessments to allow them to make good decisions about which tests to use and how to interpret test results.

The interests of the various parties involved in the testing process may or may not be congruent. For example, when a test is given for counseling purposes or for job placement, the interests of the individual and the institution often coincide. In contrast, when a test is used to select from among many individuals for a highly competitive job or for entry into an educational or training program, the preferences of an applicant may be inconsistent with those of an employer or admissions officer. Similarly, when testing is mandated by a court, the interests of the test taker may be different from those of the party requesting the court order.

Individuals or institutions may serve several roles in the testing process. For example, in clinics the test taker is typically the intended beneficiary of the test results. In some situations the test administrator is an agent of the test developer, and sometimes the test administrator is also the test user. When an organization prepares its own employment tests, it is both the developer and the user. Sometimes a test is developed by a test author but published, marketed, and distributed by an independent publisher, although the publisher may play an active role in the test development process. Roles may also be further subdivided. For example, both an organization and a professional assessor may play a role in the provision of an assessment center. Given this intermingling of roles, it is often difficult to assign precise responsibility for addressing various standards to specific participants in the testing process. Uses of tests and testing practices are improved to the extent that those involved have adequate levels of assessment literacy.

Tests are designed, developed, and used in a wide variety of ways. In some cases, they are developed and “published” for use outside the organization that produces them. In other cases, as with state educational assessments, they are designed by the state educational agency and developed by contractors for exclusive and often one-time use by the state and not really “published” at all. Throughout the Standards, we use the general term test developer, rather than the more specific term test publisher, to denote those involved in the design and development of tests across the full range of test development scenarios.

The Standards is based on the premise that effective testing and assessment require that all professionals in the testing process possess the knowledge, skills, and abilities necessary to fulfill their roles, as well as an awareness of personal and contextual factors that may influence the testing process. For example, test developers and those selecting tests and interpreting test results need adequate knowledge of psychometric principles such as validity and reliability. They also should obtain any appropriate supervised experience and legislatively mandated practice credentials that are required to perform competently those aspects of the testing process in which they engage. All professionals in the testing process should follow the ethical guidelines of their profession.

Scope of the Revision

This volume serves as a revision of the 1999 Standards for Educational and Psychological Testing. The revision process started with the appointment of a Management Committee, composed of representatives of the three sponsoring organizations responsible for overseeing the general direction of the effort: the American Educational Research Association (AERA), the American Psychological Association (APA), and the National Council on Measurement in Education (NCME). To guide the revision, the Management Committee solicited and synthesized comments on the 1999 Standards from members of the sponsoring organizations and convened the Joint Committee for the Revision of the 1999 Standards in 2009 to do the actual revision. The Joint Committee also was composed of members of the three sponsoring organizations and was charged by the Management Committee with addressing five major areas: considering the accountability issues for use of tests in educational policy; broadening the concept of accessibility of tests for all examinees; representing more comprehensively the role of tests in the workplace; broadening the role of technology in testing; and providing for a better organizational structure for communicating the standards.

To be responsive to this charge, several actions were taken:

• The chapters “Educational Testing and Assessment” and “Testing in Program Evaluation and Public Policy,” in the 1999 version, were rewritten to attend to the issues associated with the uses of tests for educational accountability purposes.

• A new chapter, “Fairness in Testing,” was written to emphasize accessibility and fairness as fundamental issues in testing. Specific concerns for fairness are threaded throughout all of the chapters of the Standards.

• The chapter “Testing in Employment and Credentialing” (now “Workplace Testing and Credentialing”) was reorganized to more clearly identify when a standard is relevant to employment and/or credentialing.

• The impact of technology was considered throughout the volume. One of the major technology issues identified was the tension between the use of proprietary algorithms and the need for test users to be able to evaluate complex applications in areas such as automated scoring of essays, administering and scoring of innovative item types, and computer-based testing. These issues are considered in the chapter “Test Design and Development.”

• A content editor was engaged to help with the technical accuracy and clarity of each chapter and with consistency of language across chapters. As noted below, chapters in Part I (“Foundations”) and Part II (“Operations”) now have an “overarching standard” as well as themes under which the individual standards are organized. In addition, the glossary from the 1999 Standards for Educational and Psychological Testing was updated.

As stated above, a major change in the organization of this volume involves the conceptualization of fairness. The 1999 edition had a part devoted to this topic, with separate chapters titled “Fairness in Testing and Test Use,” “Testing Individuals of Diverse Linguistic Backgrounds,” and “Testing Individuals With Disabilities.” In the present edition, the topics addressed in those chapters are combined into a single, comprehensive chapter, and the chapter is located in Part I. This change was made to emphasize that fairness demands that all test takers be treated equitably. Fairness and accessibility, the unobstructed opportunity for all examinees to demonstrate their standing on the construct(s) being measured, are relevant for valid score interpretations for all individuals and subgroups in the intended population of test takers. Because issues related to fairness in testing are not restricted to individuals with diverse linguistic backgrounds or those with disabilities, the chapter was more broadly cast to support appropriate testing experiences for all individuals. Although the examples in the chapter often refer to individuals with diverse linguistic and cultural backgrounds and individuals with disabilities, they also include examples relevant to gender and to older adults, people of various ethnicities and racial backgrounds, and young children, to illustrate potential barriers to fair and equitable assessment for all examinees.

Organization of the Volume

Part I of the Standards, “Foundations,” contains standards for validity (chap. 1); reliability/precision and errors of measurement (chap. 2); and fairness in testing (chap. 3). Part II, “Operations,” addresses test design and development (chap. 4); scores, scales, norms, score linking, and cut scores (chap. 5); test administration, scoring, reporting, and interpretation (chap. 6); supporting documentation for tests (chap. 7); the rights and responsibilities of test takers (chap. 8); and the rights and responsibilities of test users (chap. 9). Part III, “Testing Applications,” treats specific applications in psychological testing and assessment (chap. 10); workplace testing and credentialing (chap. 11); educational testing and assessment (chap. 12); and uses of tests for program evaluation, policy studies, and accountability (chap. 13). Also included is a glossary, which provides definitions for terms as they are used specifically in this volume.

Each chapter begins with introductory text that provides background for the standards that follow. Although the introductory text is at times prescriptive, it should not be interpreted as imposing additional standards.

Categories of Standards

The text of each standard and any accompanying commentary include the conditions under which a standard is relevant. Depending on the context and purpose of test development or use, some standards will be more salient than others. Moreover, some standards are broad in scope, setting forth concerns or requirements relevant to nearly all tests or testing contexts, and other standards are narrower in scope. However, all standards are important in the contexts to which they apply. Any classification that gives the appearance of elevating the general importance of some standards over others could invite neglect of certain standards that need to be addressed in particular situations. Rather than differentiate standards using priority labels, such as “primary,” “secondary,” or “conditional” (as were used in the 1985 Standards), this edition emphasizes that unless a standard is deemed clearly irrelevant, inappropriate, or technically infeasible for a particular use, all standards should be met, making all of them essentially “primary” for that context.

Unless otherwise specified in a standard or commentary, and with the caveats outlined below, standards should be met before operational test use. Each standard should be carefully considered to determine its applicability to the testing context under consideration. In a given case there may be a sound professional reason that adherence to the standard is inappropriate. There may also be occasions when technical feasibility influences whether a standard can be met prior to operational test use. For example, some standards may call for analyses of data that are not available at the point of initial operational test use. In other cases, traditional quantitative analyses may not be feasible due to small sample sizes. However, there may be other methodologies that could be used to gather information to support the standard, such as small sample methodologies, qualitative studies, focus groups, and even logical analysis. In such instances, test developers and users should make a good faith effort to provide the kinds of data called for in the standard to support the valid interpretations of the test results for their intended purposes. If test developers, users, and, when applicable, sponsors have deemed a standard to be inapplicable or technically infeasible, they should be able, if called upon, to explain the basis for their decision. However, there is no expectation that documentation of all such decisions be routinely available.

Presentation of Individual Standards

Individual standards are presented after an introductory text that presents some key concepts for interpreting and applying the standards. In many cases, the standards themselves are coupled with one or more comments. These comments are intended to amplify, clarify, or provide examples to aid in the interpretation of the meaning of the standards. The standards often direct a developer or user to implement certain actions. Depending on the type of test, it is sometimes not clear in the statement of a standard to whom the standard is directed. For example, Standard 1.2 in the chapter “Validity” states:

A rationale should be presented for each intended interpretation of test scores for a given use, together with a summary of the evidence and theory bearing on the intended interpretation.

The party responsible for implementing this standard is the party or person who is articulating the recommended interpretation of the test scores. This may be a test user, a test developer, or someone who is planning to use the test scores for a particular purpose, such as making classification or licensure decisions. It often is not possible in the statement of a standard to specify who is responsible for such actions; it is intended that the party or person performing the action specified in the standard be the party responsible for adhering to the standard.

Some of the individual standards and introductory text refer to groups and subgroups. The term group is generally used to identify the full examinee population, referred to as the intended examinee group, the intended test-taker group, the intended examinee population, or the population. A subgroup includes members of the larger group who are identifiable in some way that is relevant to the standard being applied. When data or analyses are indicated for various subgroups, they are generally referred to as subgroups within the intended examinee group, groups from the intended examinee population, or relevant subgroups.

In applying the Standards, it is important to bear in mind that the intended referent subgroups for the individual standards are context specific. For example, referent ethnic subgroups to be considered during the design phase of a test would depend on the expected ethnic composition of the intended test group. In addition, many more subgroups could be relevant to a standard dealing with the design of fair test questions than to a standard dealing with adaptations of a test’s format. Users of the Standards will need to exercise professional judgment when deciding which particular subgroups are relevant for the application of a specific standard.

In deciding which subgroups are relevant for a particular standard, the following factors, among others, may be considered: credible evidence that suggests a group may face particular construct-irrelevant barriers to test performance, statutes or regulations that designate a group as relevant to score interpretations, and large numbers of individuals in the group within the general population. Depending on the context, relevant subgroups might include, for example, males and females, individuals of differing socioeconomic status, individuals differing by race and/or ethnicity, individuals with different sexual orientations, individuals with diverse linguistic and cultural backgrounds (particularly when testing extends across international borders), individuals with disabilities, young children, or older adults.

Numerous examples are provided in the Standards to clarify points or to provide illustrations of how to apply a particular standard. Many of the examples are drawn from research with students with disabilities or persons from diverse language or cultural groups; fewer, from research with other identifiable groups, such as young children or adults. There was also a purposeful effort to provide examples for educational, psychological, and industrial settings.

The standards in each chapter in Parts I and II (“Foundations” and “Operations”) are introduced by an overarching standard, designed to convey the central intent of the chapter. These overarching standards are always numbered with .0 following the chapter number. For example, the overarching standard in chapter 1 is numbered 1.0. The overarching standards summarize guiding principles that are applicable to all tests and test uses. Further, the themes and standards in each chapter are ordered to be consistent with the sequence of the material in the introductory text for the chapter. Because some users of the Standards may turn only to chapters directly relevant to a given application, certain standards are repeated in different chapters, particularly in Part III, “Testing Applications.” When such repetition occurs, the essence of the standard is the same. Only the wording, area of application, or level of elaboration in the comment is changed.

Cautions to Be Considered in Using the Standards

In addition to the legal disclaimer set forth above, several cautions are important if we are to avoid misinterpretations, misapplications, and misuses of the Standards:

• Evaluating the acceptability of a test or test application does not rest on the literal satisfaction of every standard in this document, and the acceptability of a test or test application cannot be determined by using a checklist. Specific circumstances affect the importance of individual standards, and individual standards should not be considered in isolation. Therefore, evaluating acceptability depends on (a) professional judgment that is based on a knowledge of behavioral science, psychometrics, and the relevant standards in the professional field to which the test applies; (b) the degree to which the intent of the standard has been satisfied by the test developer and user; (c) the alternative measurement devices that are readily available; (d) research and experiential evidence regarding the feasibility of meeting the standard; and (e) applicable laws and regulations.

• When tests are at issue in legal proceedings and other situations requiring expert witness testimony, it is essential that professional judgment be based on the accepted corpus of knowledge in determining the relevance of particular standards in a given situation. The intent of the Standards is to offer guidance for such judgments.

• Claims by test developers or test users that a test, manual, or procedure satisfies or follows the standards in this volume should be made with care. It is appropriate for developers or users to state that efforts were made to adhere to the Standards, and to provide documents describing and supporting those efforts. Blanket claims without supporting evidence should not be made.

• The standards are concerned with a field thatis rapidly evolving. Consequently, there is acontinuing need to monitor changes in thefield and to revise this document as knowledgedevelops. The use of older versions of theStandardsmay be a disservice to test users andtest takers.

• Requiring the use of specific technical methodsis not the intent of the Standards. For example,where specific statistical reporting requirementsare mentioned, the phrase “or generally acceptedequivalent” should always be understood.


PART I

Foundations


1. VALIDITY

BACKGROUND

Validity refers to the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests. Validity is, therefore, the most fundamental consideration in developing and evaluating tests. The process of validation involves accumulating relevant evidence to provide a sound scientific basis for the proposed score interpretations. It is the interpretations of test scores for proposed uses that are evaluated, not the test itself. When test scores are interpreted in more than one way (e.g., both to describe a test taker’s current level of the attribute being measured and to make a prediction about a future outcome), each intended interpretation must be validated. Statements about validity should refer to particular interpretations for specified uses. It is incorrect to use the unqualified phrase “the validity of the test.”

Evidence of the validity of a given interpretation of test scores for a specified use is a necessary condition for the justifiable use of the test. Where sufficient evidence of validity exists, the decision as to whether to actually administer a particular test generally takes additional considerations into account. These include cost-benefit considerations, framed in different subdisciplines as utility analysis or as consideration of negative consequences of test use, and a weighing of any negative consequences against the positive consequences of test use.

Validation logically begins with an explicit statement of the proposed interpretation of test scores, along with a rationale for the relevance of the interpretation to the proposed use. The proposed interpretation includes specifying the construct the test is intended to measure. The term construct is used in the Standards to refer to the concept or characteristic that a test is designed to measure. Rarely, if ever, is there a single possible meaning that can be attached to a test score or a pattern of test responses. Thus, it is always incumbent on test developers and users to specify the construct interpretation that will be made on the basis of the score or response pattern.

Examples of constructs currently used in assessment include mathematics achievement, general cognitive ability, racial identity attitudes, depression, and self-esteem. To support test development, the proposed construct interpretation is elaborated by describing its scope and extent and by delineating the aspects of the construct that are to be represented. The detailed description provides a conceptual framework for the test, delineating the knowledge, skills, abilities, traits, interests, processes, competencies, or characteristics to be assessed. Ideally, the framework indicates how the construct as represented is to be distinguished from other constructs and how it should relate to other variables.

The conceptual framework is partially shaped by the ways in which test scores will be used. For instance, a test of mathematics achievement might be used to place a student in an appropriate program of instruction, to endorse a high school diploma, or to inform a college admissions decision. Each of these uses implies a somewhat different interpretation of the mathematics achievement test scores: that a student will benefit from a particular instructional intervention, that a student has mastered a specified curriculum, or that a student is likely to be successful with college-level work. Similarly, a test of conscientiousness might be used for psychological counseling, to inform a decision about employment, or for the basic scientific purpose of elaborating the construct of conscientiousness. Each of these potential uses shapes the specified framework and the proposed interpretation of the test’s scores and also can have implications for test development and evaluation. Validation can be viewed as a process of constructing and evaluating arguments for and against the intended interpretation of test scores and their relevance to the proposed use. The conceptual framework points to the kinds of evidence that might be collected to evaluate the proposed interpretation in light of the purposes of testing. As validation proceeds, and new evidence regarding the interpretations that can and cannot be drawn from test scores becomes available, revisions may be needed in the test, in the conceptual framework that shapes it, and even in the construct underlying the test.

The wide variety of tests and circumstances makes it natural that some types of evidence will be especially critical in a given case, whereas other types will be less useful. Decisions about what types of evidence are important for the validation argument in each instance can be clarified by developing a set of propositions or claims that support the proposed interpretation for the particular purpose of testing. For instance, when a mathematics achievement test is used to assess readiness for an advanced course, evidence for the following propositions might be relevant: (a) that certain skills are prerequisite for the advanced course; (b) that the content domain of the test is consistent with these prerequisite skills; (c) that test scores can be generalized across relevant sets of items; (d) that test scores are not unduly influenced by ancillary variables, such as writing ability; (e) that success in the advanced course can be validly assessed; and (f) that test takers with high scores on the test will be more successful in the advanced course than test takers with low scores on the test. Examples of propositions in other testing contexts might include, for instance, the proposition that test takers with high general anxiety scores experience significant anxiety in a range of settings, the proposition that a child’s score on an intelligence scale is strongly related to the child’s academic performance, or the proposition that a certain pattern of scores on a neuropsychological battery indicates impairment that is characteristic of brain injury. The validation process evolves as these propositions are articulated and evidence is gathered to evaluate their soundness.

Identifying the propositions implied by a proposed test interpretation can be facilitated by considering rival hypotheses that may challenge the proposed interpretation. It is also useful to consider the perspectives of different interested parties, existing experience with similar tests and contexts, and the expected consequences of the proposed test use. A finding of unintended consequences of test use may also prompt a consideration of rival hypotheses. Plausible rival hypotheses can often be generated by considering whether a test measures less or more than its proposed construct. Such considerations are referred to as construct underrepresentation (or construct deficiency) and construct-irrelevant variance (or construct contamination), respectively.

Construct underrepresentation refers to the degree to which a test fails to capture important aspects of the construct. It implies a narrowed meaning of test scores because the test does not adequately sample some types of content, engage some psychological processes, or elicit some ways of responding that are encompassed by the intended construct. Take, for example, a test intended as a comprehensive measure of anxiety. A particular test might underrepresent the intended construct because it measures only physiological reactions and not emotional, cognitive, or situational components. As another example, a test of reading comprehension intended to measure children’s ability to read and interpret stories with understanding might not contain a sufficient variety of reading passages or might ignore a common type of reading material.

Construct-irrelevant variance refers to the degree to which test scores are affected by processes that are extraneous to the test’s intended purpose. The test scores may be systematically influenced to some extent by processes that are not part of the construct. In the case of a reading comprehension test, these might include material too far above or below the level intended to be tested, an emotional reaction to the test content, familiarity with the subject matter of the reading passages on the test, or the writing skill needed to compose a response. Depending on the detailed definition of the construct, vocabulary knowledge or reading speed might also be irrelevant components. On a test designed to measure anxiety, a response bias to underreport one’s anxiety might be considered a source of construct-irrelevant variance. In the case of a mathematics test, it might include overreliance on reading comprehension skills that English language learners may be lacking. On a test designed to measure science knowledge, test-taker internalizing of gender-based stereotypes about women in the sciences might be a source of construct-irrelevant variance.

Nearly all tests leave out elements that some potential users believe should be measured and include some elements that some potential users consider inappropriate. Validation involves careful attention to possible distortions in meaning arising from inadequate representation of the construct and also to aspects of measurement, such as test format, administration conditions, or language level, that may materially limit or qualify the interpretation of test scores for various groups of test takers. That is, the process of validation may lead to revisions in the test, in the conceptual framework of the test, or both. Interpretations drawn from the revised test would again need validation.

When propositions have been identified that would support the proposed interpretation of test scores, one can proceed with validation by obtaining empirical evidence, examining relevant literature, and/or conducting logical analyses to evaluate each of the propositions. Empirical evidence may include both local evidence, produced within the contexts where the test will be used, and evidence from similar testing applications in other settings. Use of existing evidence from similar tests and contexts can enhance the quality of the validity argument, especially when data for the test and context in question are limited.

Because an interpretation for a given use typically depends on more than one proposition, strong evidence in support of one part of the interpretation in no way diminishes the need for evidence to support other parts of the interpretation. For example, when an employment test is being considered for selection, a strong predictor-criterion relationship in an employment setting is ordinarily not sufficient to justify use of the test. One should also consider the appropriateness and meaningfulness of the criterion measure, the appropriateness of the testing materials and procedures for the full range of applicants, and the consistency of the support for the proposed interpretation across groups. Professional judgment guides decisions regarding the specific forms of evidence that can best support the intended interpretation for a specified use. As in all scientific endeavors, the quality of the evidence is paramount. A few pieces of solid evidence regarding a particular proposition are better than numerous pieces of evidence of questionable quality. The determination that a given test interpretation for a specific purpose is warranted is based on professional judgment that the preponderance of the available evidence supports that interpretation. The quality and quantity of evidence sufficient to reach this judgment may differ for test uses depending on the stakes involved in the testing. A given interpretation may not be warranted either as a result of insufficient evidence in support of it or as a result of credible evidence against it.

Validation is the joint responsibility of the test developer and the test user. The test developer is responsible for furnishing relevant evidence and a rationale in support of any test score interpretations for specified uses intended by the developer. The test user is ultimately responsible for evaluating the evidence in the particular setting in which the test is to be used. When a test user proposes an interpretation or use of test scores that differs from those supported by the test developer, the responsibility for providing validity evidence in support of that interpretation for the specified use is the responsibility of the user. It should be noted that important contributions to the validity evidence may be made as other researchers report findings of investigations that are related to the meaning of scores on the test.

Sources of Validity Evidence

The following sections outline various sources of evidence that might be used in evaluating the validity of a proposed interpretation of test scores for a particular use. These sources of evidence may illuminate different aspects of validity, but they do not represent distinct types of validity. Validity is a unitary concept. It is the degree to which all the accumulated evidence supports the intended interpretation of test scores for the proposed use. Like the 1999 Standards, this edition refers to types of validity evidence, rather than distinct types of validity. To emphasize this distinction, the treatment that follows does not follow historical nomenclature (i.e., the use of the terms content validity or predictive validity).

As the discussion in the prior section emphasizes, each type of evidence presented below is not required in all settings. Rather, support is needed for each proposition that underlies a proposed test interpretation for a specified use. A proposition that a test is predictive of a given criterion can be supported without evidence that the test samples a particular content domain. In contrast, a proposition that a test covers a representative sample of a particular curriculum may be supported without evidence that the test predicts a given criterion. However, a more complex set of propositions, e.g., that a test samples a specified domain and thus is predictive of a criterion reflecting a related domain, will require evidence supporting both parts of this set of propositions. Test developers are also expected to make the case that the scores are not unduly influenced by construct-irrelevant variance (see chap. 3 for detailed treatment of issues related to construct-irrelevant variance). In general, adequate support for proposed interpretations for specific uses will require multiple sources of evidence.

The position developed above also underscores the fact that if a given test is interpreted in multiple ways for multiple uses, the propositions underlying these interpretations for different uses also are likely to differ. Support is needed for the propositions underlying each interpretation for a specific use. Evidence supporting the interpretation of scores on a mathematics achievement test for placing students in subsequent courses (i.e., evidence that the test interpretation is valid for its intended purpose) does not permit inferring validity for other purposes (e.g., promotion or teacher evaluation).

Evidence Based on Test Content

Important validity evidence can be obtained from an analysis of the relationship between the content of a test and the construct it is intended to measure. Test content refers to the themes, wording, and format of the items, tasks, or questions on a test. Administration and scoring may also be relevant to content-based evidence. Test developers often work from a specification of the content domain. The content specification carefully describes the content in detail, often with a classification of areas of content and types of items. Evidence based on test content can include logical or empirical analyses of the adequacy with which the test content represents the content domain and of the relevance of the content domain to the proposed interpretation of test scores. Evidence based on content can also come from expert judgments of the relationship between parts of the test and the construct. For example, in developing a licensure test, the major facets that are relevant to the purpose for which the occupation is regulated can be specified, and experts in that occupation can be asked to assign test items to the categories defined by those facets. These or other experts can then judge the representativeness of the chosen set of items.
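To make the idea of expert content review concrete, the sketch below (not part of the Standards) tabulates hypothetical expert assignments of items to the facets of an assumed content specification and compares observed coverage with the intended blueprint. The facet names, proportions, and item assignments are invented solely for illustration.

```python
# Illustrative sketch: summarizing hypothetical expert judgments that assign
# test items to facets of a content specification, to examine how fully the
# chosen item set represents the intended domain.
from collections import Counter

# Hypothetical blueprint: facet -> intended proportion of the test
blueprint = {"algebra": 0.40, "geometry": 0.35, "data_analysis": 0.25}

# Hypothetical panel-consensus assignments: item id -> facet
assignments = {
    1: "algebra", 2: "algebra", 3: "geometry", 4: "algebra",
    5: "data_analysis", 6: "geometry", 7: "algebra", 8: "geometry",
    9: "algebra", 10: "data_analysis",
}

counts = Counter(assignments.values())
n_items = len(assignments)
for facet, intended in blueprint.items():
    observed = counts.get(facet, 0) / n_items
    print(f"{facet:14s} intended {intended:.2f}  observed {observed:.2f}")
```

In practice such a summary would be one input among many; experts would also judge item relevance and representativeness, not just the distribution of items across facets.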

Some tests are based on systematic observations of behavior. For example, a list of the tasks constituting a job domain may be developed from observations of behavior in a job, together with judgments of subject matter experts. Expert judgments can be used to assess the relative importance, criticality, and/or frequency of the various tasks. A job sample test can then be constructed from a random or stratified sampling of tasks rated highly on these characteristics. The test can then be administered under standardized conditions in an off-the-job setting.

The appropriateness of a given content domain is related to the specific inferences to be made from test scores. Thus, when considering an available test for a purpose other than that for which it was first developed, it is especially important to evaluate the appropriateness of the original content domain for the proposed new purpose. For example, a test given for research purposes to compare student achievement across states in a given domain may properly also cover material that receives little or no attention in the curriculum. Policy makers can then evaluate student achievement with respect to both content neglected and content addressed. On the other hand, when student mastery of a delivered curriculum is tested for purposes of informing decisions about individual students, such as promotion or graduation, the framework elaborating a content domain is appropriately limited to what students have had an opportunity to learn from the curriculum as delivered.

Evidence about content can be used, in part, to address questions about differences in the meaning or interpretation of test scores across relevant subgroups of test takers. Of particular concern is the extent to which construct underrepresentation or construct-irrelevant variance may give an unfair advantage or disadvantage to one or more subgroups of test takers. For example, in an employment test, the use of vocabulary more complex than needed on the job may be a source of construct-irrelevant variance for English language learners or others. Careful review of the construct and test content domain by a diverse panel of experts may point to potential sources of irrelevant difficulty (or easiness) that require further investigation.

Content-oriented evidence of validation is at the heart of the process in the educational arena known as alignment, which involves evaluating the correspondence between student learning standards and test content. Content-sampling issues in the alignment process include evaluating whether test content appropriately samples the domain set forth in curriculum standards, whether the cognitive demands of test items correspond to the level reflected in the student learning standards (e.g., content standards), and whether the test avoids the inclusion of features irrelevant to the standard that is the intended target of each test item.

Evidence Based on Response Processes

Some construct interpretations involve more or less explicit assumptions about the cognitive processes engaged in by test takers. Theoretical and empirical analyses of the response processes of test takers can provide evidence concerning the fit between the construct and the detailed nature of the performance or response actually engaged in by test takers. For instance, if a test is intended to assess mathematical reasoning, it becomes important to determine whether test takers are, in fact, reasoning about the material given instead of following a standard algorithm applicable only to the specific items on the test.

Evidence based on response processes generally comes from analyses of individual responses. Questioning test takers from various groups making up the intended test-taking population about their performance strategies or responses to particular items can yield evidence that enriches the definition of a construct. Maintaining records that monitor the development of a response to a writing task, through successive written drafts or electronically monitored revisions, for instance, also provides evidence of process. Documentation of other aspects of performance, like eye movements or response times, may also be relevant to some constructs. Inferences about processes involved in performance can also be developed by analyzing the relationship among parts of the test and between the test and other variables. Wide individual differences in process can be revealing and may lead to reconsideration of certain test formats.

Evidence of response processes can contribute to answering questions about differences in meaning or interpretation of test scores across relevant subgroups of test takers. Process studies involving test takers from different subgroups can assist in determining the extent to which capabilities irrelevant or ancillary to the construct may be differentially influencing test takers’ test performance.

Studies of response processes are not limited to the test taker. Assessments often rely on observers or judges to record and/or evaluate test takers’ performances or products. In such cases, relevant validity evidence includes the extent to which the processes of observers or judges are consistent with the intended interpretation of scores. For instance, if judges are expected to apply particular criteria in scoring test takers’ performances, it is important to ascertain whether they are, in fact, applying the appropriate criteria and not being influenced by factors that are irrelevant to the intended interpretation (e.g., quality of handwriting is irrelevant to judging the content of a written essay). Thus, validation may include empirical studies of how observers or judges record and evaluate data along with analyses of the appropriateness of these processes to the intended interpretation or construct definition.

While evidence about response processes may be central in settings where explicit claims about response processes are made by test developers or where inferences about responses are made by test users, there are many other cases where claims about response processes are not part of the validity argument. In some cases, multiple response processes are available for solving the problems of interest, and the construct of interest is only concerned with whether the problem was solved correctly. As a simple example, there may be multiple possible routes to obtaining the correct solution to a mathematical problem.

Evidence Based on Internal Structure

Analyses of the internal structure of a test can indicate the degree to which the relationships among test items and test components conform to the construct on which the proposed test score interpretations are based. The conceptual framework for a test may imply a single dimension of behavior, or it may posit several components that are each expected to be homogeneous, but that are also distinct from each other. For example, a measure of discomfort on a health survey might assess both physical and emotional health. The extent to which item interrelationships bear out the presumptions of the framework would be relevant to validity.

The specific types of analyses and their interpretation depend on how the test will be used. For example, if a particular application posited a series of increasingly difficult test components, empirical evidence of the extent to which response patterns conformed to this expectation would be provided. A theory that posited unidimensionality would call for evidence of item homogeneity. In this case, the number of items and item interrelationships form the basis for an estimate of score reliability, but such an index would be inappropriate for tests with a more complex internal structure.
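As one concrete illustration of an item-homogeneity analysis, the sketch below computes coefficient alpha from a small set of hypothetical item responses. It assumes an essentially unidimensional framework; the data and the choice of alpha (rather than another internal-consistency index) are assumptions made only for this example.

```python
# Illustrative sketch: coefficient alpha as one index of internal consistency
# for a test whose framework posits an essentially unidimensional score.
# Rows are hypothetical test takers; columns are scored (0/1) items.
import numpy as np

scores = np.array([
    [1, 1, 1, 0, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 0, 1, 1, 1],
    [0, 1, 0, 1, 0],
], dtype=float)

k = scores.shape[1]                          # number of items
item_var = scores.var(axis=0, ddof=1)        # variance of each item
total_var = scores.sum(axis=1).var(ddof=1)   # variance of total scores
alpha = (k / (k - 1)) * (1 - item_var.sum() / total_var)
print(f"coefficient alpha = {alpha:.3f}")
```

For a test with several distinct components, a single alpha computed across all items would be misleading; component-level or multidimensional analyses would be more appropriate.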

Some studies of the internal structure of tests are designed to show whether particular items may function differently for identifiable subgroups of test takers (e.g., racial/ethnic or gender subgroups). Differential item functioning occurs when different groups of test takers with similar overall ability, or similar status on an appropriate criterion, have, on average, systematically different responses to a particular item. This issue is discussed in chapter 3. However, differential item functioning is not always a flaw or weakness. Subsets of items that have a specific characteristic in common (e.g., specific content, task representation) may function differently for different groups of similarly scoring test takers. This indicates a kind of multidimensionality that may be unexpected or may conform to the test framework.
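The sketch below illustrates one common approach to flagging such items, a Mantel-Haenszel comparison of reference and focal groups matched on total score. The simulated data and the 0/1 group coding are assumptions made only for this example; operational DIF analyses typically add further steps (e.g., refinement of the matching variable and effect-size classification).

```python
# Illustrative sketch: a Mantel-Haenszel style check for differential item
# functioning on one dichotomous item, matching reference and focal group
# members on total test score. All data are simulated for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 2000
group = rng.integers(0, 2, n)                  # 0 = reference, 1 = focal
ability = rng.normal(0, 1, n)
total = np.clip(np.round(20 + 5 * ability), 0, 40)   # matching strata
# Simulated item that is harder for the focal group (DIF built in)
p_correct = 1 / (1 + np.exp(-(ability - 0.5 * group)))
item = (rng.random(n) < p_correct).astype(int)

num = den = 0.0
for s in np.unique(total):                     # one 2x2 table per score stratum
    m = total == s
    a = np.sum((group[m] == 0) & (item[m] == 1))   # reference, correct
    b = np.sum((group[m] == 0) & (item[m] == 0))
    c = np.sum((group[m] == 1) & (item[m] == 1))   # focal, correct
    d = np.sum((group[m] == 1) & (item[m] == 0))
    t = a + b + c + d
    if t > 0:
        num += a * d / t
        den += b * c / t
odds_ratio = num / den
print(f"MH common odds ratio = {odds_ratio:.2f}  (1.0 indicates no DIF)")
```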

Evidence Based on Relations to Other Variables

In many cases, the intended interpretation for a given use implies that the construct should be related to some other variables, and, as a result, analyses of the relationship of test scores to variables external to the test provide another important source of validity evidence. External variables may include measures of some criteria that the test is expected to predict, as well as relationships to other tests hypothesized to measure the same constructs, and tests measuring related or different constructs. Measures other than test scores, such as performance criteria, are often used in employment settings. Categorical variables, including group membership variables, become relevant when the theory underlying a proposed test use suggests that group differences should be present or absent if a proposed test score interpretation is to be supported. Evidence based on relationships with other variables provides evidence about the degree to which these relationships are consistent with the construct underlying the proposed test score interpretations.

Convergent and discriminant evidence. Relationships between test scores and other measures intended to assess the same or similar constructs provide convergent evidence, whereas relationships between test scores and measures purportedly of different constructs provide discriminant evidence. For instance, within some theoretical frameworks, scores on a multiple-choice test of reading comprehension might be expected to relate closely (convergent evidence) to other measures of reading comprehension based on other methods, such as essay responses. Conversely, test scores might be expected to relate less closely (discriminant evidence) to measures of other skills, such as logical reasoning. Relationships among different methods of measuring the construct can be especially helpful in sharpening and elaborating score meaning and interpretation.
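A minimal illustration of convergent and discriminant correlations is sketched below using simulated scores for two reading-comprehension measures based on different methods and one logical-reasoning measure. The variable names, noise levels, and the size of the simulated relationship are arbitrary assumptions, not values taken from the Standards.

```python
# Illustrative sketch: convergent and discriminant correlations among
# hypothetical measures of reading comprehension (two methods) and reasoning.
import numpy as np

rng = np.random.default_rng(1)
n = 300
reading_true = rng.normal(size=n)
reasoning_true = 0.3 * reading_true + rng.normal(size=n)   # modestly related construct

mc_reading = reading_true + rng.normal(scale=0.5, size=n)     # multiple-choice test
essay_reading = reading_true + rng.normal(scale=0.7, size=n)  # essay-based measure
reasoning_test = reasoning_true + rng.normal(scale=0.5, size=n)

def r(x, y):
    # Pearson correlation between two score vectors
    return np.corrcoef(x, y)[0, 1]

print(f"convergent  (two reading measures) : {r(mc_reading, essay_reading):.2f}")
print(f"discriminant (reading vs reasoning): {r(mc_reading, reasoning_test):.2f}")
```

The expected pattern, under this framework, is a noticeably higher convergent correlation than discriminant correlation; how much higher is a substantive judgment, not a fixed threshold.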

Evidence of relations with other variables can involve experimental as well as correlational evidence. Studies might be designed, for instance, to investigate whether scores on a measure of anxiety improve as a result of some psychological treatment or whether scores on a test of academic achievement differentiate between instructed and noninstructed groups. If performance increases due to short-term coaching are viewed as a threat to validity, it would be useful to investigate whether coached and uncoached groups perform differently.

Test-criterion relationships. Evidence of the relation of test scores to a relevant criterion may be expressed in various ways, but the fundamental question is always, how accurately do test scores predict criterion performance? The degree of accuracy and the score range within which accuracy is needed depend on the purpose for which the test is used.

The criterion variable is a measure of some attribute or outcome that is operationally distinct from the test. Thus, the test is not a measure of a criterion, but rather is a measure hypothesized as a potential predictor of that targeted criterion. Whether a test predicts a given criterion in a given context is a testable hypothesis. The criteria that are of interest are determined by test users, for example administrators in a school system or managers of a firm. The choice of the criterion and the measurement procedures used to obtain criterion scores are of central importance. The credibility of a test-criterion study depends on the relevance, reliability, and validity of the interpretation based on the criterion measure for a given testing application.

Historically, two designs, often called predictive and concurrent, have been distinguished for evaluating test-criterion relationships. A predictive study indicates the strength of the relationship between test scores and criterion scores that are obtained at a later time. A concurrent study obtains test scores and criterion information at about the same time. When prediction is actually contemplated, as in academic admission or employment settings, or in planning rehabilitation regimens, predictive studies can retain the temporal differences and other characteristics of the practical situation. Concurrent evidence, which avoids temporal changes, is particularly useful for psychodiagnostic tests or in investigating alternative measures of some specified construct for which an accepted measurement procedure already exists. The choice of a predictive or concurrent research strategy in a given domain is also usefully informed by prior research evidence regarding the extent to which predictive and concurrent studies in that domain yield the same or different results.
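The sketch below illustrates the basic computation behind a predictive design, correlating simulated admission-test scores with a criterion observed later. All values are fabricated for illustration; no particular test, criterion, or expected coefficient is implied.

```python
# Illustrative sketch: a predictive test-criterion correlation, using simulated
# admission-test scores and a criterion (first-year GPA) gathered later.
import numpy as np

rng = np.random.default_rng(2)
n = 500
test = rng.normal(50, 10, n)                      # predictor scores at selection
gpa = 2.0 + 0.02 * test + rng.normal(0, 0.4, n)   # criterion observed later

r = np.corrcoef(test, gpa)[0, 1]
print(f"observed test-criterion correlation = {r:.2f}")
# In a concurrent design, the criterion would instead be gathered at about the
# same time as the test scores, e.g., for current students or job incumbents.
```

Note that in an operational predictive study the available criterion data usually come only from selected (range-restricted) individuals, which tends to attenuate the observed correlation.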

Test scores are sometimes used in allocating individuals to different treatments in a way that is advantageous for the institution and/or for the individuals. Examples would include assigning individuals to different jobs within an organization, or determining whether to place a given student in a remedial class or a regular class. In that context, evidence is needed to judge the suitability of using a test when classifying or assigning a person to one job versus another or to one treatment versus another. Support for the validity of the classification procedure is provided by showing that the test is useful in determining which persons are likely to profit differentially from one treatment or another. It is possible for tests to be highly predictive of performance for different education programs or jobs without providing the information necessary to make a comparative judgment of the efficacy of assignments or treatments. In general, decision rules for selection or placement are also influenced by the number of persons to be accepted or the numbers that can be accommodated in alternative placement categories (see chap. 11).

Evidence about relations to other variables is also used to investigate questions of differential prediction for subgroups. For instance, a finding that the relation of test scores to a relevant criterion variable differs from one subgroup to another may imply that the meaning of the scores is not the same for members of the different groups, perhaps due to construct underrepresentation or construct-irrelevant sources of variance. However, the difference may also imply that the criterion has different meaning for different groups. The differences in test-criterion relationships can also arise from measurement error, especially when group means differ, so such differences do not necessarily indicate differences in score meaning. See the discussion of fairness in chapter 3 for more extended consideration of possible courses of action when scores have different meanings for different groups.
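One common way to examine differential prediction is to test whether adding a subgroup indicator and a subgroup-by-score interaction improves prediction of the criterion over a single regression line, as in the hedged sketch below. The data are simulated, and ordinary least squares is only one of several defensible modeling choices.

```python
# Illustrative sketch: checking for differential prediction by comparing a
# common regression line with a model that adds group membership and a
# group-by-score interaction (all data simulated for illustration).
import numpy as np

rng = np.random.default_rng(3)
n = 600
group = rng.integers(0, 2, n)                     # 0/1 subgroup indicator
test = rng.normal(0, 1, n)
# Criterion with a small slope difference between groups built in
criterion = 0.5 * test + 0.15 * group * test + rng.normal(0, 1, n)

def r2(X, y):
    # R-squared from an ordinary least-squares fit with an intercept column
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

r2_common = r2(test.reshape(-1, 1), criterion)
r2_moderated = r2(np.column_stack([test, group, group * test]), criterion)
print(f"R^2, common regression line   : {r2_common:.3f}")
print(f"R^2, with group & interaction : {r2_moderated:.3f}")
# A materially larger R^2 for the moderated model suggests that the intercept
# and/or slope of the test-criterion relationship differs across subgroups.
```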

Validity generalization. An important issue in educational and employment settings is the degree to which validity evidence based on test-criterion relations can be generalized to a new situation without further study of validity in that new situation. When a test is used to predict the same or similar criteria (e.g., performance of a given job) at different times or in different places, it is typically found that observed test-criterion correlations vary substantially. In the past, this has been taken to imply that local validation studies are always required. More recently, a variety of approaches to generalizing evidence from other settings has been developed, with meta-analysis the most widely used in the published literature. In particular, meta-analyses have shown that in some domains, much of this variability may be due to statistical artifacts such as sampling fluctuations and variations across validation studies in the ranges of test scores and in the reliability of criterion measures. When these and other influences are taken into account, it may be found that the remaining variability in validity coefficients is relatively small. Thus, statistical summaries of past validation studies in similar situations may be useful in estimating test-criterion relationships in a new situation. This practice is referred to as the study of validity generalization.
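A greatly simplified sketch of the arithmetic behind such a summary appears below: observed correlations from hypothetical prior studies are corrected for criterion unreliability, averaged with sample-size weights, and the between-study variance is compared with the variance expected from sampling error alone. All study values are invented, and operational validity generalization work addresses additional artifacts (e.g., range restriction) with more refined estimators.

```python
# Illustrative, simplified sketch of pooling test-criterion correlations from
# several prior studies and asking how much of their variability is plausibly
# attributable to sampling error. Study values are hypothetical.
import numpy as np

r_obs = np.array([0.22, 0.30, 0.18, 0.35, 0.27])   # observed correlations
n = np.array([120, 300, 85, 210, 150])             # study sample sizes
ryy = np.array([0.70, 0.80, 0.75, 0.85, 0.80])     # criterion reliabilities

r_corr = r_obs / np.sqrt(ryy)                       # correct for criterion unreliability
w = n / n.sum()
r_bar = np.sum(w * r_corr)                          # sample-size-weighted mean
var_obs = np.sum(w * (r_corr - r_bar) ** 2)         # observed between-study variance
var_sampling = np.sum(w * (1 - r_bar ** 2) ** 2 / (n - 1))  # expected sampling-error variance

print(f"weighted mean corrected r : {r_bar:.3f}")
print(f"variance across studies   : {var_obs:.4f}")
print(f"sampling-error variance   : {var_sampling:.4f}")
# If most of the between-study variance is attributable to sampling error,
# the case for generalizing to similar new situations is strengthened.
```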

In some circumstances, there is a strong basis for using validity generalization. This would be the case where the meta-analytic database is large, where the meta-analytic data adequately represent the type of situation to which one wishes to generalize, and where correction for statistical artifacts produces a clear and consistent pattern of validity evidence. In such circumstances, the informational value of a local validity study may be relatively limited if not actually misleading, especially if its sample size is small. In other circumstances, the inferential leap required for generalization may be much larger. The meta-analytic database may be small, the findings may be less consistent, or the new situation may involve features markedly different from those represented in the meta-analytic database. In such circumstances, situation-specific validity evidence will be relatively more informative. Although research on validity generalization shows that results of a single local validation study may be quite imprecise, there are situations where a single study, carefully done, with adequate sample size, provides sufficient evidence to support or reject test use in a new situation. This highlights the importance of examining carefully the comparative informational value of local versus meta-analytic studies.

In conducting studies of the generalizability of validity evidence, the prior studies that are included may vary according to several situational facets. Some of the major facets are (a) differences in the way the predictor construct is measured, (b) the type of job or curriculum involved, (c) the type of criterion measure used, (d) the type of test takers, and (e) the time period in which the study was conducted. In any particular study of validity generalization, any number of these facets might vary, and a major objective of the study is to determine empirically the extent to which variation in these facets affects the test-criterion correlations obtained.


The extent to which predictive or concurrent validity evidence can be generalized to new situations is in large measure a function of accumulated research. Although evidence of generalization can often help to support a claim of validity in a new situation, the extent of available data limits the degree to which the claim can be sustained.

The above discussion focuses on the use of cumulative databases to estimate predictor-criterion relationships. Meta-analytic techniques can also be used to summarize other forms of data relevant to other inferences one may wish to draw from test scores in a particular application, such as effects of coaching and effects of certain alterations in testing conditions for test takers with specified disabilities. Gathering evidence about how well validity findings can be generalized across groups of test takers is an important part of the validation process. When the evidence suggests that inferences from test scores can be drawn for some subgroups but not for others, pursuing options such as those discussed in chapter 3 can reduce the risk of unfair test use.

Evidence for Validity and Consequences of Testing

Some consequences of test use follow directly from the interpretation of test scores for uses intended by the test developer. The validation process involves gathering evidence to evaluate the soundness of these proposed interpretations for their intended uses.

Other consequences may also be part of a claim that extends beyond the interpretation or use of scores intended by the test developer. For example, a test of student achievement might provide data for a system intended to identify and improve lower-performing schools. The claim that testing results, used this way, will result in improved student learning may rest on propositions about the system or intervention itself, beyond propositions based on the meaning of the test itself. Consequences may point to the need for evidence about components of the system that will go beyond the interpretation of test scores as a valid measure of student achievement.

Still other consequences are unintended, and are often negative. For example, school district or statewide educational testing on selected subjects may lead teachers to focus on those subjects at the expense of others. As another example, a test developed to measure knowledge needed for a given job may result in lower passing rates for one group than for another. Unintended consequences merit close examination. While not all consequences can be anticipated, in some cases factors such as prior experiences in other settings offer a basis for anticipating and proactively addressing unintended consequences. See chapter 12 for additional examples from educational settings. In some cases, actions to address one consequence bring about other consequences. One example involves the notion of “missed opportunities,” as in the case of moving to computerized scoring of student essays to increase grading consistency, thus forgoing the educational benefits of addressing the same problem by training teachers to grade more consistently.

These types of consideration of consequences of testing are discussed further below.

Interpretation and uses of test scores intended by test developers. Tests are commonly administered in the expectation that some benefit will be realized from the interpretation and use of the scores intended by the test developers. A few of the many possible benefits that might be claimed are selection of efficacious therapies, placement of workers in suitable jobs, prevention of unqualified individuals from entering a profession, or improvement of classroom instructional practices. A fundamental purpose of validation is to indicate whether these specific benefits are likely to be realized. Thus, in the case of a test used in placement decisions, the validation would be informed by evidence that alternative placements, in fact, are differentially beneficial to the persons and the institution. In the case of employment testing, if a test publisher asserts that use of the test will result in reduced employee training costs, improved workforce efficiency, or some other benefit, then the validation would be informed by evidence in support of that proposition.

It is important to note that the validity of test score interpretations depends not only on the uses of the test scores but specifically on the claims that underlie the theory of action for these uses. For example, consider a school district that wants to determine children’s readiness for kindergarten, and so administers a test battery and screens out students with low scores. If higher scores do, in fact, predict higher performance on key kindergarten tasks, the claim that use of the test scores for screening results in higher performance on these key tasks is supported and the interpretation of the test scores as a predictor of kindergarten readiness would be valid. If, however, the claim were made that use of the test scores for screening would result in the greatest benefit to students, the interpretation of test scores as indicators of readiness for kindergarten might not be valid because students with low scores might actually benefit more from access to kindergarten. In this case, different evidence is needed to support different claims that might be made about the same use of the screening test (for example, evidence that students below a certain cut score benefit more from another assignment than from assignment to kindergarten). The test developer is responsible for the validation of the interpretation that the test scores assess the indicated readiness skills. The school district is responsible for the validation of the proper interpretation of the readiness test scores and for evaluation of the policy of using the readiness test for placement/admissions decisions.

Claims made about test use that are not directly based on test score interpretations. Claims are sometimes made for benefits of testing that go beyond the direct interpretations or uses of the test scores themselves that are specified by the test developers. Educational tests, for example, may be advocated on the grounds that their use will improve student motivation to learn or encourage changes in classroom instructional practices by holding educators accountable for valued learning outcomes. Where such claims are central to the rationale advanced for testing, the direct examination of testing consequences necessarily assumes even greater importance. Those making the claims are responsible for evaluation of the claims. In some cases, such information can be drawn from existing data collected for purposes other than test validation; in other cases new information will be needed to address the impact of the testing program.

Consequences that are unintended. Test score interpretation for a given use may result in unintended consequences. A key distinction is between consequences that result from a source of error in the intended test score interpretation for a given use and consequences that do not result from error in test score interpretation. Examples of each are given below.

As discussed at some length in chapter 3, one domain in which unintended negative consequences of test use are at times observed involves test score differences for groups defined in terms of race/ethnicity, gender, age, and other characteristics. In such cases, however, it is important to distinguish between evidence that is directly relevant to validity and evidence that may inform decisions about social policy but falls outside the realm of validity. For example, concerns have been raised about the effect of group differences in test scores on employment selection and promotion, the placement of children in special education classes, and the narrowing of a school’s curriculum to exclude learning objectives that are not assessed. Although information about the consequences of testing may influence decisions about test use, such consequences do not, in and of themselves, detract from the validity of intended interpretations of the test scores. Rather, judgments of validity or invalidity in the light of testing consequences depend on a more searching inquiry into the sources of those consequences.

Take, as an example, a finding of different hiring rates for members of different groups as a consequence of using an employment test. If the difference is due solely to an unequal distribution of the skills the test purports to measure, and if those skills are, in fact, important contributors to job performance, then the finding of group differences per se does not imply any lack of validity for the intended interpretation. If, however, the test measured skill differences unrelated to job performance (e.g., a sophisticated reading test for a job that required only minimal functional literacy), or if the differences were due to the test’s sensitivity to some test-taker characteristic not intended to be part of the test construct, then the intended interpretation of test scores as predicting job performance in a comparable manner for all groups of applicants would be rendered invalid, even if test scores correlated positively with some measure of job performance. If a test covers most of the relevant content domain but omits some areas, the content coverage might be judged adequate for some purposes. However, if it is found that excluding some components that could readily be assessed has a noticeable impact on selection rates for groups of interest (e.g., subgroup differences are found to be smaller on excluded components than on included components), the intended interpretation of test scores as predicting job performance in a comparable manner for all groups of applicants would be rendered invalid. Thus, evidence about consequences is relevant to validity when it can be traced to a source of invalidity such as construct underrepresentation or construct-irrelevant components. Evidence about consequences that cannot be so traced is not relevant to the validity of the intended interpretations of the test scores.

As another example, consider the case where research supports an employer’s use of a particular test in the personality domain (i.e., the test proves to be predictive of an aspect of subsequent job performance), but it is found that some applicants form a negative opinion of the organization due to the perception that the test invades personal privacy. Thus, there is an unintended negative consequence of test use, but one that is not due to a flaw in the intended interpretation of test scores as predicting subsequent performance. Some employers faced with this situation may conclude that this negative consequence is grounds for discontinuing test use; others may conclude that the benefits gained by screening applicants outweigh this negative consequence. As this example illustrates, a consideration of consequences can influence a decision about test use, even though the consequence is independent of the validity of the intended test score interpretation. The example also illustrates that different decision makers may make different value judgments about the impact of consequences on test use.

The fact that the validity evidence supports the intended interpretation of test scores for use in applicant screening does not mean that test use is thus required: Issues other than validity, including legal constraints, can play an important and, in some cases, a determinative role in decisions about test use. Legal constraints may also limit an employer’s discretion to discard test scores from tests that have already been administered, when that decision is based on differences in scores for subgroups of different races, ethnicities, or genders.

Note that unintended consequences can also be positive. Reversing the above example of test takers who form a negative impression of an organization based on the use of a particular test, a different test may be viewed favorably by applicants, leading to a positive impression of the organization. A given test use may result in multiple consequences, some positive and some negative.

In short, decisions about test use are appropriately informed by validity evidence about intended test score interpretations for a given use, by evidence evaluating additional claims about consequences of test use that do not follow directly from test score interpretations, and by value judgments about unintended positive and negative consequences of test use.

Integrating the Validity Evidence

A sound validity argument integrates various strands of evidence into a coherent account of the degree to which existing evidence and theory support the intended interpretation of test scores for specific uses. It encompasses evidence gathered from new studies and evidence available from earlier reported research. The validity argument may indicate the need for refining the definition of the construct, may suggest revisions in the test or other aspects of the testing process, and may indicate areas needing further study.

It is commonly observed that the validation process never ends, as there is always additional information that can be gathered to more fully understand a test and the inferences that can be drawn from it. In this way an inference of validity is similar to any scientific inference. However, a test interpretation for a given use rests on evidence for a set of propositions making up the validity argument, and at some point validation evidence allows for a summary judgment of the intended interpretation that is well supported and defensible. At some point the effort to provide sufficient validity evidence to support a given test interpretation for a specific use does end (at least provisionally, pending the emergence of a strong basis for questioning that judgment). Legal requirements may necessitate that the validation study be updated in light of such factors as changes in the test population or newly developed alternative testing methods.

The amount and character of evidence required to support a provisional judgment of validity often vary between areas and also within an area as research on a topic advances. For example, prevailing standards of evidence may vary with the stakes involved in the use or interpretation of the test scores. Higher stakes may entail higher standards of evidence. As another example, in areas where data collection comes at a greater cost, one may find it necessary to base interpretations on fewer data than in areas where data collection comes with less cost.

Ultimately, the validity of an intended interpretation of test scores relies on all the available evidence relevant to the technical quality of a testing system. Different components of validity evidence are described in subsequent chapters of the Standards, and include evidence of careful test construction; adequate score reliability; appropriate test administration and scoring; accurate score scaling, equating, and standard setting; and careful attention to fairness for all test takers, as appropriate to the test interpretation in question.


STANDARDS FOR VALIDITY

The standards in this chapter begin with an overarching standard (numbered 1.0), which is designed to convey the central intent or primary focus of the chapter. The overarching standard may also be viewed as the guiding principle of the chapter, and is applicable to all tests and test users. All subsequent standards have been separated into three thematic clusters labeled as follows:

1. Establishing Intended Uses and Interpretations

2. Issues Regarding Samples and Settings Used in Validation

3. Specific Forms of Validity Evidence

Standard 1.0

Clear articulation of each intended test score interpretation for a specified use should be set forth, and appropriate validity evidence in support of each intended interpretation should be provided.

Cluster 1. Establishing Intended Uses and Interpretations

Standard 1.1

The test developer should set forth clearly how test scores are intended to be interpreted and consequently used. The population(s) for which a test is intended should be delimited clearly, and the construct or constructs that the test is intended to assess should be described clearly.

Comment: Statements about validity should refer to particular interpretations and consequent uses. It is incorrect to use the unqualified phrase “the validity of the test.” No test permits interpretations that are valid for all purposes or in all situations. Each recommended interpretation for a given use requires validation. The test developer should specify in clear language the population for which the test is intended, the construct it is intended to measure, the contexts in which test scores are to be employed, and the processes by which the test is to be administered and scored.

Standard 1.2

A rationale should be presented for each intended interpretation of test scores for a given use, together with a summary of the evidence and theory bearing on the intended interpretation.

Comment: The rationale should indicate what propositions are necessary to investigate the intended interpretation. The summary should combine logical analysis with empirical evidence to provide support for the test rationale. Evidence may come from studies conducted locally, in the setting where the test is to be used; from specific prior studies; or from comprehensive statistical syntheses of available studies meeting clearly specified study quality criteria. No type of evidence is inherently preferable to others; rather, the quality and relevance of the evidence to the intended test score interpretation for a given use determine the value of a particular kind of evidence. A presentation of empirical evidence on any point should give due weight to all relevant findings in the scientific literature, including those inconsistent with the intended interpretation or use. Test developers have the responsibility to provide support for their own recommendations, but test users bear ultimate responsibility for evaluating the quality of the validity evidence provided and its relevance to the local situation.

Standard 1.3

If validity for some common or likely interpretation for a given use has not been evaluated, or if such an interpretation is inconsistent with available evidence, that fact should be made clear and potential users should be strongly cautioned about making unsupported interpretations.

Comment: If past experience suggests that a test is likely to be used inappropriately for certain kinds of decisions or certain kinds of test takers, specific warnings against such uses should be given. Professional judgment is required to evaluate the extent to which existing validity evidence supports a given test use.

Standard 1.4

If a test score is interpreted for a given use in a way that has not been validated, it is incumbent on the user to justify the new interpretation for that use, providing a rationale and collecting new evidence, if necessary.

Comment: Professional judgment is required to evaluate the extent to which existing validity evidence applies in the new situation or to the new group of test takers and to determine what new evidence may be needed. The amount and kinds of new evidence required may be influenced by experience with similar prior test uses or interpretations and by the amount, quality, and relevance of existing data.

A test that has been altered or administered in ways that change the construct underlying the test for use with subgroups of the population requires evidence of the validity of the interpretation made on the basis of the modified test (see chap. 3). For example, if a test is adapted for use with individuals with a particular disability in a way that changes the underlying construct, the modified test should have its own evidence of validity for the intended interpretation.

Standard 1.5

When it is clearly stated or implied that a recommended test score interpretation for a given use will result in a specific outcome, the basis for expecting that outcome should be presented, together with relevant evidence.

Comment: If it is asserted, for example, that interpreting and using scores on a given test for employee selection will result in reduced employee errors or training costs, evidence in support of that assertion should be provided. A given claim may be supported by logical or theoretical argument as well as empirical data. Appropriate weight should be given to findings in the scientific literature that may be inconsistent with the stated expectation.

Standard 1.6

When a test use is recommended on the grounds that testing or the testing program itself will result in some indirect benefit, in addition to the utility of information from interpretation of the test scores themselves, the recommender should make explicit the rationale for anticipating the indirect benefit. Logical or theoretical arguments and empirical evidence for the indirect benefit should be provided. Appropriate weight should be given to any contradictory findings in the scientific literature, including findings suggesting important indirect outcomes other than those predicted.

Comment: For example, certain educational testing programs have been advocated on the grounds that they would have a salutary influence on classroom instructional practices or would clarify students’ understanding of the kind or level of achievement they were expected to attain. To the extent that such claims enter into the justification for a testing program, they become part of the argument for test use. Evidence for such claims should be examined—in conjunction with evidence about the validity of intended test score interpretation and evidence about unintended negative consequences of test use—in making an overall decision about test use. Due weight should be given to evidence against such predictions, for example, evidence that under some conditions educational testing may have a negative effect on classroom instruction.

Standard 1.7

If test performance, or a decision made therefrom, is claimed to be essentially unaffected by practice and coaching, then the propensity for test performance to change with these forms of instruction should be documented.


Comment: Materials to aid in score interpretation should summarize evidence indicating the degree to which improvement with practice or coaching can be expected. Also, materials written for test takers should provide practical guidance about the value of test preparation activities, including coaching.

Cluster 2. Issues Regarding Samples and Settings Used in Validation

Standard 1.8

The composition of any sample of test takers from which validity evidence is obtained should be described in as much detail as is practical and permissible, including major relevant sociodemographic and developmental characteristics.

Comment: Statistical findings can be influenced by factors affecting the sample on which the results are based. When the sample is intended to represent a population, that population should be described, and attention should be drawn to any systematic factors that may limit the representativeness of the sample. Factors that might reasonably be expected to affect the results include self-selection, attrition, linguistic ability, disability status, and exclusion criteria, among others. If the participants in a validity study are patients, for example, then the diagnoses of the patients are important, as well as other characteristics, such as the severity of the diagnosed conditions. For tests used in employment settings, the employment status (e.g., applicants versus current job holders), the general level of experience and educational background, and the gender and ethnic composition of the sample may be relevant information. For tests used in credentialing, the status of those providing information (e.g., candidates for a credential versus already-credentialed individuals) is important for interpreting the resulting data. For tests used in educational settings, relevant information may include educational background, developmental level, community characteristics, or school admissions policies, as well as the gender and ethnic composition of the sample. Sometimes legal restrictions about privacy preclude obtaining or disclosing such population information or limit the level of particularity at which such data may be disclosed. The specific privacy laws, if any, governing the type of data should be considered, in order to ensure that any description of a population does not have the potential to identify an individual in a manner inconsistent with such standards. The extent of missing data, if any, and the methods for handling missing data (e.g., use of imputation procedures) should be described.

Standard 1.9

When a validation rests in part on the opinions or decisions of expert judges, observers, or raters, procedures for selecting such experts and for eliciting judgments or ratings should be fully described. The qualifications and experience of the judges should be presented. The description of procedures should include any training and instructions provided, should indicate whether participants reached their decisions independently, and should report the level of agreement reached. If participants interacted with one another or exchanged information, the procedures through which they may have influenced one another should be set forth.

Comment: Systematic collection of judgments or opinions may occur at many points in test construction (e.g., eliciting expert judgments of content appropriateness or adequate content representation), in the formulation of rules or standards for score interpretation (e.g., in setting cut scores), or in test scoring (e.g., rating of essay responses). Whenever such procedures are employed, the quality of the resulting judgments is important to the validation. Level of agreement should be specified clearly (e.g., whether percent agreement refers to agreement prior to or after a consensus discussion, and whether the criterion for agreement is exact agreement of ratings or agreement within a certain number of scale points). The basis for specifying certain types of individuals (e.g., experienced teachers, experienced job incumbents, supervisors) as appropriate experts for the judgment or rating task should be articulated. It may be entirely appropriate to have experts work together to reach consensus, but it would not then be appropriate to treat their respective judgments as statistically independent. Different judges may be used for different purposes (e.g., one set may rate items for cultural sensitivity while another may rate for reading level) or for different portions of a test.

Standard 1.10

When validity evidence includes statistical analyses of test results, either alone or together with data on other variables, the conditions under which the data were collected should be described in enough detail that users can judge the relevance of the statistical findings to local conditions. Attention should be drawn to any features of a validation data collection that are likely to differ from typical operational testing conditions and that could plausibly influence test performance.

Comment: Such conditions might include (but would not be limited to) the following: test-taker motivation or prior preparation, the range of test scores over test takers, the time allowed for test takers to respond or other administrative conditions, the mode of test administration (e.g., unproctored online testing versus proctored on-site testing), examiner training or other examiner characteristics, the time intervals separating collection of data on different measures, or conditions that may have changed since the validity evidence was obtained.

Cluster 3. Specific Forms of Validity Evidence

(a) Content-Oriented Evidence

Standard 1.11

When the rationale for test score interpretation for a given use rests in part on the appropriateness of test content, the procedures followed in specifying and generating test content should be described and justified with reference to the intended population to be tested and the construct the test is intended to measure or the domain it is intended to represent. If the definition of the content sampled incorporates criteria such as importance, frequency, or criticality, these criteria should also be clearly explained and justified.

Comment: For example, test developers might provide a logical structure that maps the items on the test to the content domain, illustrating the relevance of each item and the adequacy with which the set of items represents the content domain. Areas of the content domain that are not included among the test items could be indicated as well. The match of test content to the targeted domain in terms of cognitive complexity and the accessibility of the test content to all members of the intended population are also important considerations.

(b) Evidence Regarding Cognitive Processes

Standard 1.12

If the rationale for score interpretation for a given use depends on premises about the psychological processes or cognitive operations of test takers, then theoretical or empirical evidence in support of those premises should be provided. When statements about the processes employed by observers or scorers are part of the argument for validity, similar information should be provided.

Comment: If the test specification delineates the processes to be assessed, then evidence is needed that the test items do, in fact, tap the intended processes.

(c) Evidence Regarding Internal Structure

Standard 1.13

If the rationale for a test score interpretation for a given use depends on premises about the relationships among test items or among parts of the test, evidence concerning the internal structure of the test should be provided.

Comment: It might be claimed, for example, that a test is essentially unidimensional. Such a claim could be supported by a multivariate statistical analysis, such as a factor analysis, showing that the score variability attributable to one major dimension was much greater than the score variability attributable to any other identified dimension, or showing that a single factor adequately accounts for the covariation among test items. When a test provides more than one score, the interrelationships of those scores should be shown to be consistent with the construct(s) being assessed.
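
A minimal sketch, using hypothetical item responses, of one way such a unidimensionality claim might be examined: the eigenvalues of the inter-item correlation matrix are inspected for a single dominant dimension. The simulated data, sample size, and the eigenvalue-ratio heuristic are illustrative assumptions, not prescriptions of this standard.

    # Sketch (illustrative only): is one dominant dimension accounting for most
    # of the common variance among the items?
    import numpy as np

    rng = np.random.default_rng(0)
    theta = rng.normal(size=(500, 1))                      # hypothetical single factor
    loadings = rng.uniform(0.5, 0.8, size=(1, 20))
    items = theta @ loadings + rng.normal(scale=0.7, size=(500, 20))

    corr = np.corrcoef(items, rowvar=False)
    eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]

    # A first eigenvalue much larger than the second is one common (though not
    # definitive) indication of essential unidimensionality.
    print("first/second eigenvalue ratio:", eigenvalues[0] / eigenvalues[1])
    print("proportion of variance, first component:", eigenvalues[0] / eigenvalues.sum())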

Standard 1.14

When interpretation of subscores, score differences, or profiles is suggested, the rationale and relevant evidence in support of such interpretation should be provided. Where composite scores are developed, the basis and rationale for arriving at the composites should be given.

Comment: When a test provides more than one score, the distinctiveness and reliability of the separate scores should be demonstrated, and the interrelationships of those scores should be shown to be consistent with the construct(s) being assessed. Moreover, evidence for the validity of interpretations of two or more separate scores would not necessarily justify a statistical or substantive interpretation of the difference between them. Rather, the rationale and supporting evidence must pertain directly to the specific score, score combination, or score pattern to be interpreted for a given use. When subscores from one test or scores from different tests are combined into a composite, the basis for combining scores and for how scores are combined (e.g., differential weighting versus simple summation) should be specified.
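
A small sketch can make the contrast between simple summation and differential weighting concrete. The subscore names, the weights, and the choice to standardize before weighting are hypothetical illustrations; whatever rule is used, this standard asks that it be stated and justified.

    # Sketch (illustrative only): forming a composite from two hypothetical subscores.
    import numpy as np

    rng = np.random.default_rng(1)
    verbal = rng.normal(50, 10, size=1000)        # hypothetical subscore 1
    quantitative = rng.normal(50, 12, size=1000)  # hypothetical subscore 2

    def standardize(x):
        return (x - x.mean()) / x.std(ddof=1)

    # Simple summation of raw subscores (implicitly weights each by its spread).
    composite_sum = verbal + quantitative

    # Differential weighting of standardized subscores; the weights require a rationale.
    weights = np.array([0.6, 0.4])
    composite_weighted = weights[0] * standardize(verbal) + weights[1] * standardize(quantitative)

    print("correlation between the two composites:",
          np.corrcoef(composite_sum, composite_weighted)[0, 1])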

Standard 1.15

When interpretation of performance on specific items, or small subsets of items, is suggested, the rationale and relevant evidence in support of such interpretation should be provided. When interpretation of individual item responses is likely but is not recommended by the developer, the user should be warned against making such interpretations.

Comment: Users should be given sufficient guidance to enable them to judge the degree of confidence warranted for any interpretation for a use recommended by the test developer. Test manuals and score reports should discourage overinterpretation of information that may be subject to considerable error. This is especially important if interpretation of performance on isolated items, small subsets of items, or subtest scores is suggested.

(d) Evidence Regarding Relationships With Conceptually Related Constructs

Standard 1.16

When validity evidence includes empirical analyses of responses to test items together with data on other variables, the rationale for selecting the additional variables should be provided. Where appropriate and feasible, evidence concerning the constructs represented by other variables, as well as their technical properties, should be presented or cited. Attention should be drawn to any likely sources of dependence (or lack of independence) among variables other than dependencies among the construct(s) they represent.

Comment: The patterns of association between and among scores on the test under study and other variables should be consistent with theoretical expectations. The additional variables might be demographic characteristics, indicators of treatment conditions, or scores on other measures. They might include intended measures of the same construct or of different constructs. The reliability of scores from such other measures and the validity of intended interpretations of scores from these measures are an important part of the validity evidence for the test under study. If such variables include composite scores, the manner in which the composites were constructed should be explained (e.g., transformation or standardization of the variables, and weighting of the variables). In addition to considering the properties of each variable in isolation, it is important to guard against faulty interpretations arising from spurious sources of dependency among measures, including correlated errors or shared variance due to common methods of measurement or common elements.

(e) Evidence Regarding Relationships With Criteria

Standard 1.17

When validation relies on evidence that test scores are related to one or more criterion variables, information about the suitability and technical quality of the criteria should be reported.

Comment: The description of each criterion variable should include evidence concerning its reliability, the extent to which it represents the intended construct (e.g., task performance on the job), and the extent to which it is likely to be influenced by extraneous sources of variance. Special attention should be given to sources that previous research suggests may introduce extraneous variance that might bias the criterion for or against identifiable groups.

Standard 1.18

When it is asserted that a certain level of test performance predicts adequate or inadequate criterion performance, information about the levels of criterion performance associated with given levels of test scores should be provided.

Comment: For purposes of linking specific test scores with specific levels of criterion performance, regression equations are more useful than correlation coefficients, which are generally insufficient to fully describe patterns of association between tests and other variables. Means, standard deviations, and other statistical summaries are needed, as well as information about the distribution of criterion performances conditional upon a given test score. In the case of categorical rather than continuous variables, techniques appropriate to such data should be used (e.g., the use of logistic regression in the case of a dichotomous criterion). Evidence about the overall association between variables should be supplemented by information about the form of that association and about the variability of that association in different ranges of test scores. Note that data collections employing test takers selected for their extreme scores on one or more measures (extreme groups) typically cannot provide adequate information about the association.
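
The comment's reference to logistic regression for a dichotomous criterion can be sketched as follows. The simulated scores, the pass/fail criterion, and the use of scikit-learn are illustrative assumptions; the point is that probabilities of adequate performance at specific score levels are reported, not just a single correlation.

    # Sketch (illustrative only): probability of adequate criterion performance
    # at given test-score levels, estimated with logistic regression.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(2)
    test_scores = rng.normal(500, 100, size=400)
    prob_true = 1 / (1 + np.exp(-(test_scores - 500) / 50))   # hypothetical relationship
    criterion = rng.binomial(1, prob_true)                    # 1 = adequate, 0 = inadequate

    model = LogisticRegression(max_iter=1000).fit(test_scores.reshape(-1, 1), criterion)

    for score in (400, 500, 600):
        p = model.predict_proba(np.array([[score]]))[0, 1]
        print(f"score {score}: estimated P(adequate criterion performance) = {p:.2f}")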

Standard 1.19

If test scores are used in conjunction with other variables to predict some outcome or criterion, analyses based on statistical models of the predictor-criterion relationship should include those additional relevant variables along with the test scores.

Comment: In general, if several predictors of some criterion are available, the optimum combination of predictors cannot be determined solely from separate, pairwise examinations of the criterion variable with each separate predictor in turn, due to intercorrelation among predictors. It is often informative to estimate the increment in predictive accuracy that may be expected when each variable, including the test score, is introduced in addition to all other available variables. As empirically derived weights for combining predictors can capitalize on chance factors in a given sample, analyses involving multiple predictors should be verified by cross-validation or equivalent analysis whenever feasible, and the precision of estimated regression coefficients or other indices should be reported. Cross-validation procedures include formula estimates of validity in subsequent samples and empirical approaches such as deriving weights in one portion of a sample and applying them to an independent subsample.
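
One way to read this comment operationally is sketched below: the incremental predictive accuracy of the test score over another predictor is estimated with cross-validation, so the weights are never evaluated on the cases that produced them. The variables, effect sizes, and the five-fold design are hypothetical.

    # Sketch (illustrative only): cross-validated incremental validity of a test
    # score over an additional predictor of a criterion.
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(3)
    n = 500
    test_score = rng.normal(size=n)
    experience = rng.normal(size=n)                       # hypothetical additional predictor
    criterion = 0.5 * test_score + 0.3 * experience + rng.normal(scale=0.8, size=n)

    base = np.column_stack([experience])                  # predictors without the test
    full = np.column_stack([experience, test_score])      # predictors with the test

    r2_base = cross_val_score(LinearRegression(), base, criterion, cv=5, scoring="r2").mean()
    r2_full = cross_val_score(LinearRegression(), full, criterion, cv=5, scoring="r2").mean()

    print(f"cross-validated R^2 without test score: {r2_base:.3f}")
    print(f"cross-validated R^2 with test score:    {r2_full:.3f}")
    print(f"estimated increment:                    {r2_full - r2_base:.3f}")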


Standard 1.20

When effect size measures (e.g., correlations between test scores and criterion measures, standardized mean test score differences between subgroups) are used to draw inferences that go beyond describing the sample or samples on which data have been collected, indices of the degree of uncertainty associated with these measures (e.g., standard errors, confidence intervals, or significance tests) should be reported.

Comment: Effect size measures are usefully paired with indices reflecting their sampling error to make meaningful evaluation possible. There are various possible measures of effect size, each applicable to different settings. In the presentation of indices of uncertainty, standard errors or confidence intervals provide more information and thus are preferred in place of, or as supplements to, significance testing.
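
For a correlation used as an effect size, one standard way to attach the uncertainty this standard asks for is the Fisher z transformation. The sketch below assumes a simple random sample and a 95 percent interval; both are illustrative choices rather than requirements.

    # Sketch (illustrative only): a validity coefficient reported with a 95%
    # confidence interval based on the Fisher z transformation.
    import numpy as np

    def correlation_with_ci(x, y, z_crit=1.96):
        r = np.corrcoef(x, y)[0, 1]
        n = len(x)
        z = np.arctanh(r)                    # Fisher z transformation of r
        se = 1.0 / np.sqrt(n - 3)            # approximate standard error of z
        return r, np.tanh(z - z_crit * se), np.tanh(z + z_crit * se)

    rng = np.random.default_rng(4)
    scores = rng.normal(size=200)
    criterion = 0.4 * scores + rng.normal(scale=0.9, size=200)

    r, lo, hi = correlation_with_ci(scores, criterion)
    print(f"r = {r:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")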

Standard 1.21

When statistical adjustments, such as those for restriction of range or attenuation, are made, both adjusted and unadjusted coefficients, as well as the specific procedure used, and all statistics used in the adjustment, should be reported. Estimates of the construct-criterion relationship that remove the effects of measurement error on the test should be clearly reported as adjusted estimates.

Comment: The correlation between two variables, such as test scores and criterion measures, depends on the range of values on each variable. For example, the test scores and the criterion values of a selected subset of test takers (e.g., job applicants who have been selected for hire) will typically have a smaller range than the scores of all test takers (e.g., the entire applicant pool). Statistical methods are available for adjusting the correlation to reflect the population of interest rather than the sample available. Such adjustments are often appropriate, as when results are compared across various situations. The correlation between two variables is also affected by measurement error, and methods are available for adjusting the correlation to estimate the strength of the correlation net of the effects of measurement error in either or both variables. Reporting of an adjusted correlation should be accompanied by a statement of the method and the statistics used in making the adjustment.
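
Two of the adjustments named in this standard have well-known closed forms: the Case II correction for direct range restriction and the classical correction for attenuation. The sketch below applies both to hypothetical statistics; the numerical values are assumptions for illustration, and, per the standard, the unadjusted coefficient and the statistics used would be reported alongside the adjusted values.

    # Sketch (illustrative only): common statistical adjustments to a validity coefficient.
    import math

    def correct_range_restriction(r_restricted, sd_unrestricted, sd_restricted):
        # Case II correction for direct range restriction on the test variable.
        u = sd_unrestricted / sd_restricted
        return (r_restricted * u) / math.sqrt(1 - r_restricted**2 + (r_restricted**2) * u**2)

    def correct_attenuation(r_xy, rel_x, rel_y):
        # Classical correction for attenuation due to measurement error.
        return r_xy / math.sqrt(rel_x * rel_y)

    r_observed = 0.30   # hypothetical observed correlation in the selected sample
    r_range = correct_range_restriction(r_observed, sd_unrestricted=100, sd_restricted=60)
    r_disattenuated = correct_attenuation(r_observed, rel_x=0.90, rel_y=0.70)

    print(f"unadjusted r = {r_observed:.2f}")
    print(f"adjusted for range restriction = {r_range:.2f}")
    print(f"adjusted for unreliability in test and criterion = {r_disattenuated:.2f}")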

Standard 1.22

When a meta-analysis is used as evidence of the strength of a test-criterion relationship, the test and the criterion variables in the local situation should be comparable with those in the studies summarized. If relevant research includes credible evidence that any other specific features of the testing application may influence the strength of the test-criterion relationship, the correspondence between those features in the local situation and in the meta-analysis should be reported. Any significant disparities that might limit the applicability of the meta-analytic findings to the local situation should be noted explicitly.

Comment: The meta-analysis should incorporate all available studies meeting explicitly stated inclusion criteria. Meta-analytic evidence used in test validation typically is based on a number of tests measuring the same or very similar constructs and criterion measures that likewise measure the same or similar constructs. A meta-analytic study may also be limited to multiple studies of a single test and a single criterion. For each study included in the analysis, the test-criterion relationship is expressed in some common metric, often as an effect size. The strength of the test-criterion relationship may be moderated by features of the situation in which the test and criterion measures were obtained (e.g., types of jobs, characteristics of test takers, time interval separating collection of test and criterion measures, year or decade in which the data were collected). If test-criterion relationships vary according to such moderator variables, then the meta-analysis should report separate estimated effect-size distributions conditional upon levels of these moderator variables when the number of studies available for analysis permits doing so. This might be accomplished, for example, by reporting separate distributions for subsets of studies or by estimating the magnitudes of the influences of situational features on effect sizes.

This standard addresses the responsibilities of the individual who is drawing on meta-analytic evidence to support a test score interpretation for a given use. In some instances, that individual may also be the one conducting the meta-analysis; in other instances, existing meta-analyses are relied on. In the latter instance, the individual drawing on meta-analytic evidence does not have control over how the meta-analysis was conducted or reported, and must evaluate the soundness of the meta-analysis for the setting in question.

Standard 1.23

Any meta-analytic evidence used to support an intended test score interpretation for a given use should be clearly described, including methodological choices in identifying and coding studies, correcting for artifacts, and examining potential moderator variables. Assumptions made in correcting for artifacts such as criterion unreliability and range restriction should be presented, and the consequences of these assumptions made clear.

Comment: The description should include documented information about each study used as input to the meta-analysis, thus permitting evaluation by an independent party. Note also that meta-analysis inevitably involves judgments regarding a number of methodological choices. The bases for these judgments should be articulated. In the case of choices involving some degree of uncertainty, such as artifact corrections based on assumed values, the uncertainty should be acknowledged and the degree to which conclusions about validity hinge on these assumptions should be examined and reported.

As in the case of Standard 1.22, the individual who is drawing on meta-analytic evidence to support a test score interpretation for a given use may or may not also be the one conducting the meta-analysis. As Standard 1.22 addresses the reporting of meta-analytic evidence, the individual drawing on existing meta-analytic evidence must evaluate the soundness of the meta-analysis for the setting in question.

Standard 1.24

If a test is recommended for use in assigning persons to alternative treatments, and if outcomes from those treatments can reasonably be compared on a common criterion, then, whenever feasible, supporting evidence of differential outcomes should be provided.

Comment: If a test is used for classification into alternative occupational, therapeutic, or educational programs, it is not sufficient just to show that the test predicts treatment outcomes. Support for the validity of the classification procedure is provided by showing that the test is useful in determining which persons are likely to profit differentially from one treatment or another. Treatment categories may have to be combined to assemble sufficient cases for statistical analysis. It is recognized, however, that such research may not be feasible, because ethical and legal constraints on differential assignments may forbid control groups.

(f) Evidence Based on Consequences of Tests

Standard 1.25

When unintended consequences result from test use, an attempt should be made to investigate whether such consequences arise from the test's sensitivity to characteristics other than those it is intended to assess or from the test's failure to fully represent the intended construct.

Comment: The validity of test score interpretations may be limited by construct-irrelevant components or construct underrepresentation. When unintended consequences appear to stem, at least in part, from the use of one or more tests, it is especially important to check that these consequences do not arise from construct-irrelevant components or construct underrepresentation. For example, although group differences, in and of themselves, do not call into question the validity of a proposed interpretation, they may increase the salience of plausible rival hypotheses that should be evaluated as part of the validation effort. A finding of unintended consequences may also lead to reconsideration of the appropriateness of the construct in question. Ensuring that unintended consequences are evaluated is the responsibility of those making the decision whether to use a particular test, although legal constraints may limit the test user's discretion to discard the results of a previously administered test, when that decision is based on differences in scores for subgroups of different races, ethnicities, or genders. These issues are discussed further in chapter 3.


2. RELIABILITY/PRECISION AND ERRORS OF MEASUREMENT

BACKGROUND

A test, broadly defined, is a set of tasks or stimuli designed to elicit responses that provide a sample of an examinee's behavior or performance in a specified domain. Coupled with the test is a scoring procedure that enables the scorer to evaluate the behavior or work samples and generate a score. In interpreting and using test scores, it is important to have some indication of their reliability.

The term reliability has been used in two ways in the measurement literature. First, the term has been used to refer to the reliability coefficients of classical test theory, defined as the correlation between scores on two equivalent forms of the test, presuming that taking one form has no effect on performance on the second form. Second, the term has been used in a more general sense, to refer to the consistency of scores across replications of a testing procedure, regardless of how this consistency is estimated or reported (e.g., in terms of standard errors, reliability coefficients per se, generalizability coefficients, error/tolerance ratios, item response theory (IRT) information functions, or various indices of classification consistency). To maintain a link to the traditional notions of reliability while avoiding the ambiguity inherent in using a single, familiar term to refer to a wide range of concepts and indices, we use the term reliability/precision to denote the more general notion of consistency of the scores across instances of the testing procedure, and the term reliability coefficient to refer to the reliability coefficients of classical test theory.

The reliability/precision of measurement is always important. However, the need for precision increases as the consequences of decisions and interpretations grow in importance. If a test score leads to a decision that is not easily reversed, such as rejection or admission of a candidate to a professional school, or a score-based clinical judgment (e.g., in a legal context) that a serious cognitive injury was sustained, a higher degree of reliability/precision is warranted. If a decision can and will be corroborated by information from other sources or if an erroneous initial decision can be easily corrected, scores with more modest reliability/precision may suffice.

Interpretations of test scores generally depend on assumptions that individuals and groups exhibit some degree of consistency in their scores across independent administrations of the testing procedure. However, different samples of performance from the same person are rarely identical. An individual's performances, products, and responses to sets of tasks or test questions vary in quality or character from one sample of tasks to another and from one occasion to another, even under strictly controlled conditions. Different raters may award different scores to a specific performance. All of these sources of variation are reflected in the examinees' scores, which will vary across instances of a measurement procedure.

The reliability/precision of the scores depends on how much the scores vary across replications of the testing procedure, and analyses of reliability/precision depend on the kinds of variability allowed in the testing procedure (e.g., over tasks, contexts, raters) and the proposed interpretation of the test scores. For example, if the interpretation of the scores assumes that the construct being assessed does not vary over occasions, the variability over occasions is a potential source of measurement error. If the test tasks vary over alternate forms of the test, and the observed performances are treated as a sample from a domain of similar tasks, the random variability in scores from one form to another would be considered error. If raters are used to assign scores to responses, the variability in scores over qualified raters is a source of error. Variations in a test taker's scores that are not consistent with the definition of the construct being assessed are attributed to errors of measurement.


A very basic way to evaluate the consistency of scores involves an analysis of the variation in each test taker's scores across replications of the testing procedure. The test is administered and then, after a brief period during which the examinee's standing on the variable being measured would not be expected to change, the test (or a distinct but equivalent form of the test) is administered a second time; it is assumed that the first administration has no influence on the second administration. Given that the attribute being measured is assumed to remain the same for each test taker over the two administrations and that the test administrations are independent of each other, more variation across the two administrations indicates more error in the test scores and therefore lower reliability/precision.

The impact of such measurement errors can be summarized in a number of ways, but typically, in educational and psychological measurement, it is conceptualized in terms of the standard deviation in the scores for a person over replications of the testing procedure. In most testing contexts, it is not possible to replicate the testing procedure repeatedly, and therefore it is not possible to estimate the standard error for each person's score via repeated measurement. Instead, using model-based assumptions, the average error of measurement is estimated over some population, and this average is referred to as the standard error of measurement (SEM). The SEM is an indicator of a lack of consistency in the scores generated by the testing procedure for some population. A relatively large SEM indicates relatively low reliability/precision. The conditional standard error of measurement for a score level is the standard error of measurement at that score level.

To say that a score includes error implies that there is a hypothetical error-free value that characterizes the variable being assessed. In classical test theory this error-free value is referred to as the person's true score for the test procedure. It is conceptualized as the hypothetical average score over an infinite set of replications of the testing procedure. In statistical terms, a person's true score is an unknown parameter, or constant, and the observed score for the person is a random variable that fluctuates around the true score for the person.
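
In the classical model just described, the observed score is the true score plus a random error, and the SEM can be estimated from the observed-score standard deviation and a reliability coefficient. The sketch below uses hypothetical values for both quantities; the 95 percent band is one conventional way of reporting the resulting uncertainty.

    # Sketch (illustrative only): the SEM under the classical model X = T + E.
    import math

    sd_observed = 15.0   # hypothetical standard deviation of observed scores
    reliability = 0.91   # hypothetical reliability coefficient for the population

    sem = sd_observed * math.sqrt(1 - reliability)
    print(f"SEM = {sem:.1f} score points")

    # An approximate 95% band around a reported score of 100:
    score = 100
    print(f"roughly {score - 1.96 * sem:.0f} to {score + 1.96 * sem:.0f}")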

Generalizability theory provides a different framework for estimating reliability/precision. While classical test theory assumes a single distribution for the errors in a test taker's scores, generalizability theory seeks to evaluate the contributions of different sources of error (e.g., items, occasions, raters) to the overall error. The universe score for a person is defined as the expected value over a universe of all possible replications of the testing procedure for the test taker. The universe score of generalizability theory plays a role that is similar to the role of true scores in classical test theory.

Item response theory (IRT) addresses the basic issue of reliability/precision using information functions, which indicate the precision with which observed task/item performances can be used to estimate the value of a latent trait for each test taker. Using IRT, indices analogous to traditional reliability coefficients can be estimated from the item information functions and distributions of the latent trait in some population.

In practice, the reliability/precision of the scores is typically evaluated in terms of various coefficients, including reliability coefficients, generalizability coefficients, and IRT information functions, depending on the focus of the analysis and the measurement model being used. The coefficients tend to have high values when the variability associated with the error is small compared with the observed variation in the scores (or score differences) to be estimated.

Implications for Validity

Although reliability/precision is discussed here as an independent characteristic of test scores, it should be recognized that the level of reliability/precision of scores has implications for validity. Reliability/precision of data ultimately bears on the generalizability or dependability of the scores and/or the consistency of classifications of individuals derived from the scores. To the extent that scores are not consistent across replications of the testing procedure (i.e., to the extent that they reflect random errors of measurement), their potential for accurate prediction of criteria, for beneficial examinee diagnosis, and for wise decision making is limited.

Specifications for Replications of the Testing Procedure

As indicated earlier, the general notion of reliability/precision is defined in terms of consistency over replications of the testing procedure. Reliability/precision is high if the scores for each person are consistent over replications of the testing procedure and is low if the scores are not consistent over replications. Therefore, in evaluating reliability/precision, it is important to be clear about what constitutes a replication of the testing procedure.

Replications involve independent administrations of the testing procedure, such that the attribute being measured would not be expected to change. For example, in assessing an attribute that is not expected to change over an extended period of time (e.g., in measuring a trait), scores generated on two successive days (using different test forms if appropriate) would be considered replications. For a state variable (e.g., mood or hunger), where fairly rapid changes are common, scores generated on two successive days would not be considered replications; the scores obtained on each occasion would be interpreted in terms of the value of the state variable on that occasion. For many tests of knowledge or skill, the administration of alternate forms of a test with different samples of items would be considered replications of the test; for survey instruments and some personality measures, it is expected that the same questions will be used every time the test is administered, and any substantial change in wording would constitute a different test form.

Standardized tests present the same or very similar test materials to all test takers, maintain close adherence to stipulated procedures for test administration, and employ prescribed scoring rules that can be applied with a high degree of consistency. Administering the same questions or commonly scaled questions to all test takers under the same conditions promotes fairness and facilitates comparisons of scores across individuals. Conditions of observation that are fixed or standardized for the testing procedure remain the same across replications. However, some aspects of any standardized testing procedure will be allowed to vary. The time and place of testing, as well as the persons administering the test, are generally allowed to vary to some extent. The particular tasks included in the test may be allowed to vary (as samples from a common content domain), and the persons who score the results can vary over some set of qualified scorers.

Alternate forms (or parallel forms) of a standardized test are designed to have the same general distribution of content and item formats (as described, for example, in detailed test specifications), the same administrative procedures, and at least approximately the same score means and standard deviations in some specified population or populations. Alternate forms of a test are considered interchangeable, in the sense that they are built to the same specifications, and are interpreted as measures of the same construct.

In classical test theory, strictly parallel tests are assumed to measure the same construct and to yield scores that have the same means and standard deviations in the populations of interest and have the same correlations with all other variables. A classical reliability coefficient is defined in terms of the correlation between scores from strictly parallel forms of the test, but it is estimated in terms of the correlation between alternate forms of the test that may not quite be strictly parallel.

Different approaches to the estimation of reliability/precision can be implemented to fit different data-collection designs and different interpretations and uses of scores. In some cases, it may be feasible to estimate the variability over replications directly (e.g., by having a number of qualified raters evaluate a sample of test performances for each test taker). In other cases, it may be necessary to use less direct estimates of the reliability coefficient. For example, internal-consistency estimates of reliability (e.g., split-halves coefficient, KR–20, coefficient alpha) use the observed extent of agreement between different parts of one test to estimate the reliability associated with form-to-form variability. For the split-halves method, scores on two more-or-less parallel halves of the test (e.g., odd-numbered items and even-numbered items) are correlated, and the resulting half-test reliability coefficient is statistically adjusted to estimate reliability for the full-length test. However, when a test is designed to reflect rate of work, internal-consistency estimates of reliability (particularly by the odd-even method) are likely to yield inflated estimates of reliability for highly speeded tests.
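
The split-halves adjustment referred to above is the Spearman-Brown formula, and coefficient alpha can be computed directly from item scores. The sketch below applies both to a hypothetical persons-by-items matrix; treating odd- and even-numbered items as the two halves is an assumption of the example, not the only defensible split.

    # Sketch (illustrative only): two internal-consistency estimates from item scores.
    import numpy as np

    def split_half_reliability(items):
        # Correlate odd- and even-item half scores, then apply Spearman-Brown.
        odd = items[:, 0::2].sum(axis=1)
        even = items[:, 1::2].sum(axis=1)
        r_half = np.corrcoef(odd, even)[0, 1]
        return 2 * r_half / (1 + r_half)

    def coefficient_alpha(items):
        k = items.shape[1]
        item_vars = items.var(axis=0, ddof=1)
        total_var = items.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    rng = np.random.default_rng(5)
    ability = rng.normal(size=(300, 1))
    items = (ability + rng.normal(scale=1.2, size=(300, 24)) > 0).astype(float)  # 0/1 scores

    print(f"split-half (Spearman-Brown) estimate: {split_half_reliability(items):.2f}")
    print(f"coefficient alpha:                    {coefficient_alpha(items):.2f}")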

In some cases, it may be reasonable to assume that a potential source of variability is likely to be negligible or that the user will be able to infer adequate reliability from other types of evidence. For example, if test scores are used mainly to predict some criterion scores and the test does an acceptable job in predicting the criterion, it can be inferred that the test scores are reliable/precise enough for their intended use.

The definition of what constitutes a standardized test or measurement procedure has broadened significantly over the last few decades. Various kinds of performance assessments, simulations, and portfolio-based assessments have been developed to provide measures of constructs that might otherwise be difficult to assess. Each step toward greater flexibility in the assessment procedures enlarges the scope of the variations allowed in replications of the testing procedure, and therefore tends to increase the measurement error. However, some of these sacrifices in reliability/precision may reduce construct irrelevance or construct underrepresentation and thereby improve the validity of the intended interpretations of the scores. For example, performance assessments that depend on ratings of extended responses tend to have lower reliability than more structured assessments (e.g., multiple-choice or short-answer tests), but they can sometimes provide more direct measures of the attribute of interest.

Random errors of measurement are viewed as unpredictable fluctuations in scores. They are conceptually distinguished from systematic errors, which may also affect the performances of individuals or groups but in a consistent rather than a random manner. For example, an incorrect answer key would contribute systematic error, as would differences in the difficulty of test forms that have not been adequately equated or linked; examinees who take one form may receive higher scores on average than if they had taken the other form. Such systematic errors would not generally be included in the standard error of measurement, and they are not regarded as contributing to a lack of reliability/precision. Rather, systematic errors constitute construct-irrelevant factors that reduce validity but not reliability/precision.

Important sources of random error may be grouped in two broad categories: those rooted within the test takers and those external to them. Fluctuations in the level of an examinee's motivation, interest, or attention and the inconsistent application of skills are clearly internal sources that may lead to random error. Variations in testing conditions (e.g., time of day, level of distractions) and variations in scoring due to scorer subjectivity are examples of external sources that may lead to random error. The importance of any particular source of variation depends on the specific conditions under which the measures are taken, how performances are scored, and the interpretations derived from the scores.

Some changes in scores from one occasion to another are not regarded as error (random or systematic), because they result, in part, from changes in the construct being measured (e.g., due to learning or maturation that has occurred between the initial and final measures). In such cases, the changes in performance would constitute the phenomenon of interest and would not be considered errors of measurement.

Measurement error reduces the usefulness of test scores. It limits the extent to which test results can be generalized beyond the particulars of a given replication of the testing procedure. It reduces the confidence that can be placed in the results from any single measurement and therefore the reliability/precision of the scores. Because random measurement errors are unpredictable, they cannot be removed from observed scores. However, their aggregate magnitude can be summarized in several ways, as discussed below, and they can be controlled to some extent (e.g., by standardization or by averaging over multiple scores).


The standard error of measurement, as such, provides an indication of the expected level of random error over score points and replications for a specific population. In many cases, it is useful to have estimates of the standard errors for individual examinees (or for examinees with scores in certain score ranges). These conditional standard errors are difficult to estimate directly, but can be estimated indirectly. For example, the test information functions based on IRT models can be used to estimate standard errors for different values of a latent ability parameter and/or for different observed scores. In using any of these model-based estimates of conditional standard errors, it is important that the model assumptions be consistent with the data.
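
A sketch of how an IRT information function yields conditional standard errors, under the two-parameter logistic (2PL) model: item information is the squared discrimination times the probability of a correct response times its complement, test information is the sum over items, and the conditional standard error of the trait estimate is the reciprocal square root of test information. The item parameters below are hypothetical.

    # Sketch (illustrative only): conditional standard errors from a 2PL test
    # information function.
    import numpy as np

    a = np.array([1.2, 0.8, 1.5, 1.0, 0.9])    # hypothetical discrimination parameters
    b = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])  # hypothetical difficulty parameters

    def test_information(theta):
        p = 1.0 / (1.0 + np.exp(-a * (theta - b)))   # 2PL item response functions
        return np.sum(a**2 * p * (1 - p))            # sum of item information

    for theta in (-2.0, 0.0, 2.0):
        info = test_information(theta)
        se = 1.0 / np.sqrt(info)                     # conditional standard error of theta
        print(f"theta = {theta:+.1f}: information = {info:.2f}, SE = {se:.2f}")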

Evaluating Reliability/Precision

The ideal approach to the evaluation of reliability/precision would require many independent replications of the testing procedure on a large sample of test takers. The range of differences allowed in replications of the testing procedure and the proposed interpretation of the scores provide a framework for investigating reliability/precision.

For most testing programs, scores are expected to generalize over alternate forms of the test, occasions (within some period), testing contexts, and raters (if judgment is required in scoring). To the extent that the impact of any of these sources of variability is expected to be substantial, the variability should be estimated in some way. It is not necessary that the different sources of variance be estimated separately. The overall reliability/precision, given error variance due to the sampling of forms, occasions, and raters, can be estimated through a test-retest study involving different forms administered on different occasions and scored by different raters.

The interpretation of reliability/precision analyses depends on the population being tested. For example, reliability or generalizability coefficients derived from scores of a nationally representative sample may differ significantly from those obtained from a more homogeneous sample drawn from one gender, one ethnic group, or one community. Therefore, to the extent feasible (i.e., if sample sizes are large enough), reliability/precision should be estimated separately for all relevant subgroups (e.g., defined in terms of race/ethnicity, gender, language proficiency) in the population. (Also see chap. 3, "Fairness in Testing.")

Reliability/Generalizability Coefficients

In classical test theory, the consistency of test scores is evaluated mainly in terms of reliability coefficients, defined in terms of the correlation between scores derived from replications of the testing procedure on a sample of test takers. Three broad categories of reliability coefficients are recognized: (a) coefficients derived from the administration of alternate forms in independent testing sessions (alternate-form coefficients); (b) coefficients obtained by administration of the same form on separate occasions (test-retest coefficients); and (c) coefficients based on the relationships/interactions among scores derived from individual items or subsets of the items within a test, all data accruing from a single administration (internal-consistency coefficients). In addition, where test scoring involves a high level of judgment, indices of scorer consistency are commonly obtained. In formal treatments of classical test theory, reliability can be defined as the ratio of true-score variance to observed score variance, but it is estimated in terms of reliability coefficients of the kinds mentioned above.

In generalizability theory, these different reliability analyses are treated as special cases of a more general framework for estimating error variance in terms of the variance components associated with different sources of error. A generalizability coefficient is defined as the ratio of universe score variance to observed score variance. Unlike traditional approaches to the study of reliability, generalizability theory encourages the researcher to specify and estimate components of true score variance, error score variance, and observed score variance, and to calculate coefficients based on these estimates. Estimation is typically accomplished by the application of analysis-of-variance techniques. The separate numerical estimates of the components of variance (e.g., variance components for items, occasions, and raters, and for the interactions among these potential sources of error) can be used to evaluate the contribution of each source of error to the overall measurement error; the variance-component estimates can be helpful in identifying an effective strategy for controlling overall error variance.
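
To make the variance-component idea concrete, the sketch below estimates person, rater, and residual components for a fully crossed persons-by-raters design from the usual ANOVA mean squares, and then forms a generalizability coefficient for relative decisions. The simulated data and the choice of a single-facet design are illustrative assumptions; operational generalizability studies typically involve more facets.

    # Sketch (illustrative only): variance components and a generalizability
    # coefficient for a crossed persons x raters design.
    import numpy as np

    rng = np.random.default_rng(6)
    n_p, n_r = 100, 4
    person_effect = rng.normal(0, 2.0, size=(n_p, 1))   # hypothetical person effects
    rater_effect = rng.normal(0, 0.5, size=(1, n_r))    # hypothetical rater severity
    scores = 10 + person_effect + rater_effect + rng.normal(0, 1.0, size=(n_p, n_r))

    grand = scores.mean()
    ss_p = n_r * np.sum((scores.mean(axis=1) - grand) ** 2)
    ss_r = n_p * np.sum((scores.mean(axis=0) - grand) ** 2)
    ss_res = np.sum((scores - grand) ** 2) - ss_p - ss_r

    ms_p = ss_p / (n_p - 1)
    ms_r = ss_r / (n_r - 1)
    ms_res = ss_res / ((n_p - 1) * (n_r - 1))

    var_residual = ms_res                       # person x rater interaction plus error
    var_person = (ms_p - ms_res) / n_r          # universe-score variance estimate
    var_rater = (ms_r - ms_res) / n_p

    # Generalizability coefficient for relative decisions based on n_r raters.
    g_coefficient = var_person / (var_person + var_residual / n_r)
    print(f"person, rater, residual components: {var_person:.2f}, {var_rater:.2f}, {var_residual:.2f}")
    print(f"generalizability coefficient ({n_r} raters): {g_coefficient:.2f}")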

Different reliability (and generalizability) coefficients may appear to be interchangeable, but the different coefficients convey different information. A coefficient may encompass one or more sources of error. For example, a coefficient may reflect error due to scorer inconsistencies but not reflect the variation over an examinee's performances or products. A coefficient may reflect only the internal consistency of item responses within an instrument and fail to reflect measurement error associated with day-to-day changes in examinee performance.

It should not be inferred, however, that alternate-form or test-retest coefficients based on test administrations several days or weeks apart are always preferable to internal-consistency coefficients. In cases where we can assume that scores are not likely to change, based on past experience and/or theoretical considerations, it may be reasonable to assume invariance over occasions (without conducting a test-retest study). Another limitation of test-retest coefficients is that, when the same form of the test is used, the correlation between the first and second scores could be inflated by the test taker's recall of initial responses.

The test information function, an important result of IRT, summarizes how well the test discriminates among individuals at various levels of ability on the trait being assessed. Under the IRT conceptualization for dichotomously scored items, the item characteristic curve or item response function is used as a model to represent the increasing proportion of correct responses to an item at increasing levels of the ability or trait being measured. Given appropriate data, the parameters of the characteristic curve for each item in a test can be estimated. The test information function can then be calculated from the parameter estimates for the set of items in the test and can be used to derive coefficients with interpretations similar to reliability coefficients.

The information function may be viewed as a mathematical statement of the precision of measurement at each level of the given trait. The IRT information function is based on the results obtained on a specific occasion or in a specific context, and therefore it does not provide an indication of generalizability over occasions or contexts.

Coefficients (e.g., reliability, generalizability, and IRT-based coefficients) have two major advantages over standard errors. First, as indicated above, they can be used to estimate standard errors (overall and/or conditional) in cases where it would not be possible to do so directly. Second, coefficients (e.g., reliability and generalizability coefficients), which are defined in terms of ratios of variances for scores on the same scale, are invariant over linear transformations of the score scale and can be useful in comparing different testing procedures based on different scales. However, such comparisons are rarely straightforward, because they can depend on the variability of the groups on which the coefficients are based, the techniques used to obtain the coefficients, the sources of error reflected in the coefficients, and the lengths and contents of the instruments being compared.

Factors Affecting Reliability/Precision

A number of factors can have significant effects on reliability/precision, and in some cases, these factors can lead to misinterpretations of the results, if not taken into account.

First, any evaluation of reliability/precision applies to a particular assessment procedure and is likely to change if the procedure is changed in any substantial way. In general, if the assessment is shortened (e.g., by decreasing the number of items or tasks), the reliability is likely to decrease; and if the assessment is lengthened with comparable tasks or items, the reliability is likely to increase. In fact, lengthening the assessment, and thereby increasing the size of the sample of tasks/items (or raters or occasions) being employed, is an effective and commonly used method for improving reliability/precision.


Second, if the variability associated with raters is estimated for a select group of raters who have been especially well trained (and were perhaps involved in the development of the procedures), but raters are not as well trained in some operational contexts, the error associated with rater variability in these operational settings may be much higher than is indicated by the reported interrater reliability coefficients. Similarly, if raters are still refining their performance in the early days of an extended scoring window, the error associated with rater variability may be greater for examinees testing early in the window than for examinees who test later.

Reliability/precision can also depend on the population for which the procedure is being used. In particular, if variability in the construct of interest in the population for which scores are being generated is substantially different from what it is in the population for which reliability/precision was evaluated, the reliability/precision can be quite different in the two populations. When the variability in the construct being measured is low, reliability and generalizability coefficients tend to be small, and when the variability in the construct being measured is higher, the coefficients tend to be larger. Standard errors of measurement are less dependent than reliability and generalizability coefficients on the variability in the sample of test takers.

In addition, reliability/precision can vary from one population to another, even if the variability in the construct of interest in the two populations is the same. The reliability can vary from one population to another because particular sources of error (rater effects, familiarity with formats and instructions, etc.) have more impact in one population than they do in the other. In general, if any aspects of the assessment procedures or the population being assessed are changed in an operational setting, the reliability/precision may change.

Standard Errors of Measurement

The standard error of measurement can be used to generate confidence intervals around reported scores. It is therefore generally more informative than a reliability or generalizability coefficient, once a measurement procedure has been adopted and the interpretation of scores has become the user's primary concern.

Estimates of the standard errors at different score levels (that is, conditional standard errors) are usually a valuable supplement to the single statistic for all score levels combined. Conditional standard errors of measurement can be much more informative than a single average standard error for a population. If decisions are based on test scores and these decisions are concentrated in one area or a few areas of the score scale, then the conditional errors in those areas are of special interest.

Like reliability and generalizability coefficients, standard errors may reflect variation from many sources of error or only a few. A more comprehensive standard error (i.e., one that includes the most relevant sources of error, given the definition of the testing procedure and the proposed interpretation) tends to be more informative than a less comprehensive standard error. However, practical constraints often preclude the kinds of studies that would yield information on all potential sources of error, and in such cases, it is most informative to evaluate the sources of error that are likely to have the greatest impact.

Interpretations of test scores may be broadly categorized as relative or absolute. Relative interpretations convey the standing of an individual or group within a reference population. Absolute interpretations relate the status of an individual or group to defined performance standards. The standard error is not the same for the two types of interpretations. Any source of error that is the same for all individuals does not contribute to the relative error but may contribute to the absolute error.

Traditional norm-referenced reliability coefficients were developed to evaluate the precision with which test scores estimate the relative standing of examinees on some scale, and they evaluate reliability/precision in terms of the ratio of true-score variance to observed-score variance. As the range of uses of test scores has expanded and the contexts of use have been extended (e.g., diagnostic categorization, the evaluation of educational programs), the range of indices that are used to evaluate reliability/precision has also grown to include indices for various kinds of change scores and difference scores, indices of decision consistency, and indices appropriate for evaluating the precision of group means.

Some indices of precision, especially standard errors and conditional standard errors, also depend on the scale in which they are reported. An index stated in terms of raw scores or the trait-level estimates of IRT may convey a very different perception of the error if restated in terms of scale scores. For example, for the raw-score scale, the conditional standard error may appear to be high at one score level and low at another, but when the conditional standard errors are restated in units of scale scores, quite different trends in comparative precision may emerge.

Decision Consistency

Where the purpose of measurement is classification, some measurement errors are more serious than others. Test takers who are far above or far below the cut score established for pass/fail or for eligibility for a special program can have considerable error in their observed scores without any effect on their classification decisions. Errors of measurement for examinees whose true scores are close to the cut score are more likely to lead to classification errors. The choice of techniques used to quantify reliability/precision should take these circumstances into account. This can be done by reporting the conditional standard error in the vicinity of the cut score or the decision-consistency/accuracy indices (e.g., percentage of correct decisions, Cohen's kappa), which vary as functions of both score reliability/precision and the location of the cut score.

Decision consistency refers to the extent to which the observed classifications of examinees would be the same across replications of the testing procedure. Decision accuracy refers to the extent to which observed classifications of examinees based on the results of a single replication would agree with their true classification status. Statistical methods are available to calculate indices for both decision consistency and decision accuracy. These methods evaluate the consistency or accuracy of classifications rather than the consistency in scores per se. Note that the degree of consistency or agreement in examinee classification is specific to the cut score employed and its location within the score distribution.
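
For illustration only (the function, data, and cut score below are hypothetical and not part of the Standards), the percentage of consistent classifications and Cohen's kappa can be computed directly when two replications of a pass/fail classification are available:

```python
# Illustrative sketch: estimating decision consistency from two replications
# of a pass/fail classification. The scores and the cut score are hypothetical.

def decision_consistency(scores_form_a, scores_form_b, cut_score):
    """Return (percent agreement, Cohen's kappa) for pass/fail decisions
    made on two replications of the testing procedure."""
    n = len(scores_form_a)
    pass_a = [s >= cut_score for s in scores_form_a]
    pass_b = [s >= cut_score for s in scores_form_b]

    # Observed proportion of examinees classified the same way twice.
    p_observed = sum(a == b for a, b in zip(pass_a, pass_b)) / n

    # Agreement expected by chance, from the marginal pass rates.
    rate_a, rate_b = sum(pass_a) / n, sum(pass_b) / n
    p_chance = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)

    kappa = (p_observed - p_chance) / (1 - p_chance)
    return p_observed, kappa

# Example with hypothetical data:
# p0, k = decision_consistency([12, 25, 18, 31], [14, 22, 19, 30], cut_score=20)
```

Both statistics in this sketch depend on where the cut score falls in the score distribution, which is the point made in the preceding paragraph.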

Reliability/Precision of Group Means

Estimates of mean (or average) scores of groups (or proportions in certain categories) involve sources of error that are different from those that operate at the individual level. Such estimates are often used as measures of program effectiveness (and, under some educational accountability systems, may be used to evaluate the effectiveness of schools and teachers).

In evaluating group performance by estimating the mean performance or mean improvement in performance for samples from the group, the variation due to the sampling of persons can be a major source of error, especially if the sample sizes are small. To the extent that different samples from the group of interest (e.g., all students who use certain educational materials) yield different results, conclusions about the expected outcome over all students in the group (including those who might join the group in the future) are uncertain. For large samples, the variability due to the sampling of persons in the estimates of the group means may be quite small. However, in cases where the samples of persons are not very large (e.g., in evaluating the mean achievement of students in a single classroom or the average expressed satisfaction of samples of clients in a clinical program), the error associated with the sampling of persons may be a major component of overall error. It can be a significant source of error in inferences about programs even if there is a high degree of precision in individual test scores.

Standard errors for individual scores are not appropriate measures of the precision of group averages. A more appropriate statistic is the standard error for the estimates of the group means.
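
As a simple illustration (not part of the Standards text), if the n examinees tested are treated as a random sample from the group of interest and measurement errors are independent of true scores, the error variance of the observed group mean combines both sources of uncertainty:

\[ \operatorname{Var}(\bar{X}) \approx \frac{\sigma_T^2 + \sigma_E^2}{n}, \qquad \mathrm{SE}(\bar{X}) \approx \sqrt{\frac{\sigma_T^2 + \sigma_E^2}{n}}, \]

where \(\sigma_T^2\) reflects variability due to the sampling of persons and \(\sigma_E^2\) reflects individual measurement error. More complex designs (e.g., clustered samples or matrix sampling) call for correspondingly more elaborate variance estimates.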

Documenting Reliability/Precision

Typically, developers and distributors of tests have primary responsibility for obtaining and reporting evidence for reliability/precision (e.g., appropriate standard errors, reliability or generalizability coefficients, or test information functions). The test user must have such data to make an informed choice among alternative measurement approaches and will generally be unable to conduct adequate reliability/precision studies prior to operational use of an instrument.


In some instances, however, local users of a test or assessment procedure must accept at least partial responsibility for documenting the precision of measurement. This obligation holds when one of the primary purposes of measurement is to classify students using locally developed performance standards, or to rank examinees within the local population. It also holds when users must rely on local scorers who are trained to use the scoring rubrics provided by the test developer. In such settings, local factors may materially affect the magnitude of error variance and observed score variance. Therefore, the reliability/precision of scores may differ appreciably from that reported by the developer.

Reported evaluations of reliability/precision should identify the potential sources of error for the testing program, given the proposed uses of the scores. These potential sources of error can then be evaluated in terms of previously reported research, new empirical studies, or analyses of the reasons for assuming that a potential source of error is likely to be negligible and therefore can be ignored.

The reporting of indices of reliability/precision alone, with little detail regarding the methods used to estimate the indices reported, the nature of the group from which the data were derived, and the conditions under which the data were obtained, constitutes inadequate documentation. General statements to the effect that a test is "reliable" or that it is "sufficiently reliable to permit interpretations of individual scores" are rarely, if ever, acceptable. It is the user who must take responsibility for determining whether scores are sufficiently trustworthy to justify the anticipated interpretations and uses. Nevertheless, test constructors and publishers are obligated to provide sufficient data to make informed judgments possible.

If scores are to be used for classification, indices of decision consistency are useful in addition to estimates of the reliability/precision of the scores. If group means are likely to play a substantial role in the use of the scores, the reliability/precision of these mean scores should be reported.

As the foregoing comments emphasize, there is no single, preferred approach to quantification of reliability/precision. No single index adequately conveys all of the relevant information. No one method of investigation is optimal in all situations, nor is the test developer limited to a single approach for any instrument. The choice of estimation techniques and the minimum acceptable level for any index remain a matter of professional judgment.


STANDARDS FOR RELIABILITY/PRECISION

The standards in this chapter begin with an overarching standard (numbered 2.0), which is designed to convey the central intent or primary focus of the chapter. The overarching standard may also be viewed as the guiding principle of the chapter, and is applicable to all tests and test users. All subsequent standards have been separated into eight thematic clusters labeled as follows:

1. Specifications for Replications of the Testing Procedure
2. Evaluating Reliability/Precision
3. Reliability/Generalizability Coefficients
4. Factors Affecting Reliability/Precision
5. Standard Errors of Measurement
6. Decision Consistency
7. Reliability/Precision of Group Means
8. Documenting Reliability/Precision

Standard 2.0

Appropriate evidence of reliability/precision should be provided for the interpretation for each intended score use.

Comment: The form of the evidence (reliability or generalizability coefficient, information function, conditional standard error, index of decision consistency) for reliability/precision should be appropriate for the intended uses of the scores, the population involved, and the psychometric models used to derive the scores. A higher degree of reliability/precision is required for score uses that have more significant consequences for test takers. Conversely, a lower degree may be acceptable where a decision based on the test score is reversible or dependent on corroboration from other sources of information.

Cluster 1. Specifications for Replications of the Testing Procedure

Standard 2.1

The range of replications over which reliability/precision is being evaluated should be clearly stated, along with a rationale for the choice of this definition, given the testing situation.

Comment: For any testing program, some aspects of the testing procedure (e.g., time limits and availability of resources such as books, calculators, and computers) are likely to be fixed, and some aspects will be allowed to vary from one administration to another (e.g., specific tasks or stimuli, testing contexts, raters, and, possibly, occasions). Any test administration that maintains fixed conditions and involves acceptable samples of the conditions that are allowed to vary would be considered a legitimate replication of the testing procedure. As a first step in evaluating the reliability/precision of the scores obtained with a testing procedure, it is important to identify the range of conditions of various kinds that are allowed to vary, and over which scores are to be generalized.

Standard 2.2

The evidence provided for the reliability/precision of the scores should be consistent with the domain of replications associated with the testing procedures, and with the intended interpretations for use of the test scores.

Comment: The evidence for reliability/precision should be consistent with the design of the testing procedures and with the proposed interpretations for use of the test scores. For example, if the test can be taken on any of a range of occasions, and the interpretation presumes that the scores are invariant over these occasions, then any variability in scores over these occasions is a potential source of error.


If the tasks or stimuli are allowed to vary over alternate forms of the test, and the observed performances are treated as a sample from a domain of similar tasks, the variability in scores from one form to another would be considered error. If raters are used to assign scores to responses, the variability in scores over qualified raters is a source of error. Different sources of error can be evaluated in a single coefficient or standard error, or they can be evaluated separately, but they should all be addressed in some way. Reports of reliability/precision should specify the potential sources of error included in the analyses.

Cluster 2. Evaluating Reliability/Precision

Standard 2.3

For each total score, subscore, or combination of scores that is to be interpreted, estimates of relevant indices of reliability/precision should be reported.

Comment: It is not sufficient to report estimates of reliabilities and standard errors of measurement only for total scores when subscores are also interpreted. The form-to-form and day-to-day consistency of total scores on a test may be acceptably high, yet subscores may have unacceptably low reliability, depending on how they are defined and used. Users should be supplied with reliability data for all scores to be interpreted, and these data should be detailed enough to enable the users to judge whether the scores are precise enough for the intended interpretations for use. Composites formed from selected subtests within a test battery are frequently proposed for predictive and diagnostic purposes. Users need information about the reliability of such composites.

Standard 2.4

When a test score interpretation emphasizes differences between two observed scores of an individual or two averages of a group, reliability/precision data, including standard errors, should be provided for such differences.

Comment: Observed score differences are used for a variety of purposes. Achievement gains are frequently of interest for groups as well as individuals. In some cases, the reliability/precision of change scores can be much lower than the reliabilities of the separate scores involved. Differences between verbal and performance scores on tests of intelligence and scholastic ability are often employed in the diagnosis of cognitive impairment and learning problems. Psychodiagnostic inferences are frequently drawn from the differences between subtest scores. Aptitude and achievement batteries, interest inventories, and personality assessments are commonly used to identify and quantify the relative strengths and weaknesses, or the pattern of trait levels, of a test taker. When the interpretation of test scores centers on the peaks and valleys in the examinee's test score profile, the reliability of score differences is critical.
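
As an illustrative aside (the notation is not part of the Standards), the classical expression for the reliability of a simple difference D = X1 - X2 shows why such scores can be much less reliable than their components when the two measures are highly correlated; here \(\sigma_1, \sigma_2\) are observed standard deviations, \(\rho_{11}, \rho_{22}\) the component reliabilities, and \(\rho_{12}\) their correlation:

\[ \rho_{DD'} = \frac{\sigma_1^2 \rho_{11} + \sigma_2^2 \rho_{22} - 2 \rho_{12} \sigma_1 \sigma_2}{\sigma_1^2 + \sigma_2^2 - 2 \rho_{12} \sigma_1 \sigma_2}. \]

For equally reliable components with equal variances, this reduces to \((\rho_{11} - \rho_{12})/(1 - \rho_{12})\), which approaches zero as the correlation between the measures approaches their reliability.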

Standard 2.5

Reliability estimation procedures should be consistent with the structure of the test.

Comment: A single total score can be computed on tests that are multidimensional. The total score on a test that is substantially multidimensional should be treated as a composite score. If an internal-consistency estimate of total score reliability is obtained by the split-halves procedure, the halves should be comparable in content and statistical characteristics.

In adaptive testing procedures, the set of tasks included in the test and the sequencing of tasks are tailored to the test taker, using model-based algorithms. In this context, reliability/precision can be estimated using simulations based on the model. For adaptive testing, model-based conditional standard errors may be particularly useful and appropriate in evaluating the technical adequacy of the procedure.


Cluster 3. Reliability/Generalizability Coefficients

Standard 2.6

A reliability or generalizability coefficient (or standard error) that addresses one kind of variability should not be interpreted as interchangeable with indices that address other kinds of variability, unless their definitions of measurement error can be considered equivalent.

Comment: Internal-consistency, alternate-form, and test-retest coefficients should not be considered equivalent, as each incorporates a unique definition of measurement error. Error variances derived via item response theory are generally not equivalent to error variances estimated via other approaches. Test developers should state the sources of error that are reflected in, and those that are ignored by, the reported reliability or generalizability coefficients.

Standard 2.7

When subjective judgment enters into test scoring, evidence should be provided on both interrater consistency in scoring and within-examinee consistency over repeated measurements. A clear distinction should be made among reliability data based on (a) independent panels of raters scoring the same performances or products, (b) a single panel scoring successive performances or new products, and (c) independent panels scoring successive performances or new products.

Comment: Task-to-task variations in the quality of an examinee's performance and rater-to-rater inconsistencies in scoring represent independent sources of measurement error. Reports of reliability/precision studies should make clear which of these sources are reflected in the data. Generalizability studies and variance component analyses can be helpful in estimating the error variances arising from each source of error. These analyses can provide separate error variance estimates for tasks, for judges, and for occasions within the time period of trait stability. Information should be provided on the qualifications and training of the judges used in reliability studies. Interrater or interobserver agreement may be particularly important for ratings and observational data that involve subtle discriminations. It should be noted, however, that when raters evaluate positively correlated characteristics, a favorable or unfavorable assessment of one trait may color their opinions of other traits. Moreover, high interrater consistency does not imply high examinee consistency from task to task. Therefore, interrater agreement does not guarantee high reliability of examinee scores.
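
For illustration only (these symbols are not defined in the Standards), a one-facet persons-by-raters generalizability study with \(n_r\) raters per examinee yields variance components that combine into relative and absolute error terms and the corresponding coefficients:

\[ \sigma_\delta^2 = \frac{\sigma_{pr,e}^2}{n_r}, \qquad E\rho^2 = \frac{\sigma_p^2}{\sigma_p^2 + \sigma_\delta^2}, \qquad \sigma_\Delta^2 = \frac{\sigma_r^2 + \sigma_{pr,e}^2}{n_r}, \qquad \Phi = \frac{\sigma_p^2}{\sigma_p^2 + \sigma_\Delta^2}. \]

Designs that also sample tasks or occasions add analogous components, which is how separate error estimates for tasks, judges, and occasions can be obtained.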

Cluster 4. Factors Affecting Reliability/Precision

Standard 2.8

When constructed-response tests are scored locally, reliability/precision data should be gathered and reported for the local scoring when adequate-size samples are available.

Comment: For example, many statewide testing programs depend on local scoring of essays, constructed-response exercises, and performance tasks. Reliability/precision analyses can indicate that additional training of scorers is needed and, hence, should be an integral part of program monitoring. Reliability/precision data should be released only when sufficient to yield statistically sound results and consistent with applicable privacy obligations.

Standard 2.9

When a test is available in both long and short versions, evidence for reliability/precision should be reported for scores on each version, preferably based on independent administration(s) of each version with independent samples of test takers.

Comment: The reliability/precision of scores on each version is best evaluated through an independent administration of each, using the designated time limits.


Psychometric models can be used to estimate the reliability/precision of a shorter (or longer) version of an existing test, based on data from an administration of the existing test. However, these models generally make assumptions that may not be met (e.g., that the items in the existing test and the items to be added or dropped are all randomly sampled from a single domain). Context effects are commonplace in tests of maximum performance, and the short version of a standardized test often comprises a nonrandom sample of items from the full-length version. As a result, the predicted value of the reliability/precision may not provide a very good estimate of the actual value, and therefore, where feasible, the reliability/precision of both forms should be evaluated directly and independently.
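
One familiar model of this kind (offered here only as an illustration, under the assumption that items are parallel or randomly sampled from a single domain) is the Spearman-Brown formula, which projects the reliability of a form whose length is changed by a factor k from the reliability \(\rho\) of the original form:

\[ \rho_k = \frac{k\,\rho}{1 + (k - 1)\,\rho}. \]

For example, halving a test with reliability .90 (k = 0.5) yields a projected reliability of about .82, subject to the caveats about context effects and nonrandom item selection noted above.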

Standard 2.10

When significant variations are permitted in tests or test administration procedures, separate reliability/precision analyses should be provided for scores produced under each major variation if adequate sample sizes are available.

Comment: To make a test accessible to all examinees, test publishers or users might authorize, or might be legally required to authorize, accommodations or modifications in the procedures that are specified for the administration of a test. For example, audio or large-print versions may be used for test takers who are visually impaired. Any alteration in standard testing materials or procedures may have an impact on the reliability/precision of the resulting scores, and therefore, to the extent feasible, the reliability/precision should be examined for all versions of the test and testing procedures.

Standard 2.11

Test publishers should provide estimates of reliability/precision as soon as feasible for each relevant subgroup for which the test is recommended.

Comment: Reporting estimates of reliability/precision for relevant subgroups is useful in many contexts, but it is especially important if the interpretation of scores involves within-group inferences (e.g., in terms of subgroup norms). For example, test users who work with a specific linguistic and cultural subgroup or with individuals who have a particular disability would benefit from an estimate of the standard error for the subgroup. Likewise, evidence that preschool children tend to respond to test stimuli in a less consistent fashion than do older children would be helpful to test users interpreting scores across age groups.

When considering the reliability/precision of test scores for relevant subgroups, it is useful to evaluate and report the standard error of measurement as well as any coefficients that are estimated. Reliability and generalizability coefficients can differ substantially when subgroups have different variances on the construct being assessed. Differences in within-group variability tend to have less impact on the standard error of measurement.

Standard 2.12

If a test is proposed for use in several grades or over a range of ages, and if separate norms are provided for each grade or each age range, reliability/precision data should be provided for each age or grade-level subgroup, not just for all grades or ages combined.

Comment: A reliability or generalizability coefficient based on a sample of examinees spanning several grades or a broad range of ages in which average scores are steadily increasing will generally give a spuriously inflated impression of reliability/precision. When a test is intended to discriminate within age or grade populations, reliability or generalizability coefficients and standard errors should be reported separately for each subgroup.

Cluster 5. Standard Errors of Measurement

Standard 2.13

The standard error of measurement, both overall and conditional (if reported), should be provided in units of each reported score.


Comment: The standard error of measurement (overall or conditional) that is reported should be consistent with the scales that are used in reporting scores. Standard errors in scale-score units for the scales used to report scores and/or to make decisions are particularly helpful to the typical test user. The data on examinee performance should be consistent with the assumptions built into any statistical models used to generate scale scores and to estimate the standard errors for these scores.

Standard 2.14

When possible and appropriate, conditional standard errors of measurement should be reported at several score levels unless there is evidence that the standard error is constant across score levels. Where cut scores are specified for selection or classification, the standard errors of measurement should be reported in the vicinity of each cut score.

Comment: Estimation of conditional standard errors is usually feasible with the sample sizes that are used for analyses of reliability/precision. If it is assumed that the standard error is constant over a broad range of score levels, the rationale for this assumption should be presented. The model on which the computation of the conditional standard errors is based should be specified.
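
As one illustration of a model-based definition (not prescribed by the Standards), when scores are reported on an IRT trait scale, the conditional standard error at trait level \(\theta\) is commonly taken to be the inverse square root of the test information function:

\[ \mathrm{SEM}(\theta) = \frac{1}{\sqrt{I(\theta)}}, \]

so reported precision varies across the score scale wherever the information function does, and any transformation to scale scores changes the conditional standard errors accordingly.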

Standard 2.15

When there is credible evidence for expecting that conditional standard errors of measurement or test information functions will differ substantially for various subgroups, investigation of the extent and impact of such differences should be undertaken and reported as soon as is feasible.

Comment: If differences are found, they should be clearly indicated in the appropriate documentation. In addition, if substantial differences do exist, the test content and scoring models should be examined to see if there are legally acceptable alternatives that do not result in such differences.

Cluster 6. Decision Consistency

Standard 2.16

When a test or combination of measures is used to make classification decisions, estimates should be provided of the percentage of test takers who would be classified in the same way on two replications of the procedure.

Comment: When a test score or composite score is used to make classification decisions (e.g., pass/fail, achievement levels), the standard error of measurement at or near the cut scores has important implications for the trustworthiness of these decisions. However, the standard error cannot be translated into the expected percentage of consistent or accurate decisions without strong assumptions about the distributions of measurement errors and true scores. Although decision consistency is typically estimated from the administration of a single form, it can and should be estimated directly through the use of a test-retest approach, if consistent with the requirements of test security, and if the assumption of no change in the construct is met and adequate samples are available.

Cluster 7. Reliability/Precision of Group Means

Standard 2.17

When average test scores for groups are the focus of the proposed interpretation of the test results, the groups tested should generally be regarded as a sample from a larger population, even if all examinees available at the time of measurement are tested. In such cases the standard error of the group mean should be reported, because it reflects variability due to sampling of examinees as well as variability due to individual measurement error.

Comment: The overall levels of performance in various groups tend to be the focus in program evaluation and in accountability systems, and the groups that are of interest include all students/clients who could participate in the program over some period.


Therefore, the students in a particular class or school at the current time, the current clients of a social service agency, and analogous groups exposed to a program of interest typically constitute a sample in a longitudinal sense. Presumably, comparable groups from the same population will recur in future years, given static conditions. The factors leading to uncertainty in conclusions about program effectiveness arise from the sampling of persons as well as from individual measurement error.

Standard 2.18

When the purpose of testing is to measure the performance of groups rather than individuals, subsets of items can be assigned randomly to different subsamples of examinees. Data are aggregated across subsamples and item subsets to obtain a measure of group performance. When such procedures are used for program evaluation or population descriptions, reliability/precision analyses must take the sampling scheme into account.

Comment: This type of measurement program is termed matrix sampling. It is designed to reduce the time demanded of individual examinees and yet to increase the total number of items on which data can be obtained. This testing approach provides the same type of information about group performances that would be obtained if all examinees had taken all of the items. Reliability/precision statistics should reflect the sampling plan used with respect to examinees and items.
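
As a schematic sketch only (the item pool, scoring function, and rotation scheme below are hypothetical, and an operational matrix-sampling analysis would also need variance estimates that reflect both person and item sampling), the basic assign-and-aggregate step can be illustrated as follows:

```python
import random

def matrix_sample_group_mean(examinees, item_pool, n_subsets, score_fn, seed=0):
    """Partition an item pool into random subsets, give each examinee one
    subset, and aggregate proportion-correct scores into a group estimate.
    Illustrative sketch only, not an operational design."""
    rng = random.Random(seed)
    items = list(item_pool)
    rng.shuffle(items)
    subsets = [items[i::n_subsets] for i in range(n_subsets)]  # random partition

    proportions = []
    for idx, examinee in enumerate(examinees):
        subset = subsets[idx % n_subsets]            # rotate subsets over examinees
        correct = sum(score_fn(examinee, item) for item in subset)
        proportions.append(correct / len(subset))

    # Group-level estimate: mean proportion correct across subsamples and subsets.
    return sum(proportions) / len(proportions)
```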

Cluster 8. Documenting Reliability/Precision

Standard 2.19

Each method of quantifying the reliability/precision of scores should be described clearly and expressed in terms of statistics appropriate to the method. The sampling procedures used to select test takers for reliability/precision analyses and the descriptive statistics on these samples, subject to privacy obligations where applicable, should be reported.

Comment: Information on the method of data collection, sample sizes, means, standard deviations, and demographic characteristics of the groups tested helps users judge the extent to which reported data apply to their own examinee populations. If the test-retest or alternate-form approach is used, the interval between administrations should be indicated.

Because there are many ways of estimating reliability/precision, and each is influenced by different sources of measurement error, it is unacceptable to say simply, "The reliability/precision of scores on test X is .90." A better statement would be, "The reliability coefficient of .90 reported for scores on test X was obtained by correlating scores from forms A and B, administered on successive days. The data were based on a sample of 400 10th-grade students from five middle-class suburban schools in New York State. The demographic breakdown of this group was as follows: . . ." In some cases, for example, when small sample sizes or particularly sensitive data are involved, applicable legal restrictions governing privacy may limit the level of information that should be disclosed.

Standard 2.20

If reliability coefficients are adjusted for restriction of range or variability, the adjustment procedure and both the adjusted and unadjusted coefficients should be reported. The standard deviations of the group actually tested and of the target population, as well as the rationale for the adjustment, should be presented.

Comment: Application of a correction for restriction in variability presumes that the available sample is not representative (in terms of variability) of the test-taker population to which users might be expected to generalize. The rationale for the correction should consider the appropriateness of such a generalization. Adjustment formulas that presume constancy in the standard error across score levels should not be used unless constancy can be defended.
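
For illustration (this particular formula is not prescribed by the Standards), one common adjustment of this kind assumes that the standard error of measurement is the same in the tested group and in the target population. With r the reliability observed in a group with standard deviation s, the projected reliability R in a population with standard deviation S is

\[ 1 - R = \frac{s^2}{S^2}(1 - r), \qquad \text{that is,} \qquad R = 1 - \frac{s^2}{S^2}(1 - r). \]

Under this standard, both standard deviations, the adjusted and unadjusted coefficients, and the basis for assuming a constant standard error would all need to be reported.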


3. FAIRNESS IN TESTING

BACKGROUND

This chapter addresses the importance of fairness as a fundamental issue in protecting test takers and test users in all aspects of testing. The term fairness has no single technical meaning and is used in many different ways in public discourse. It is possible that individuals endorse fairness in testing as a desirable social goal, yet reach quite different conclusions about the fairness of a given testing program. A full consideration of the topic would explore the multiple functions of testing in relation to its many goals, including the broad goal of achieving equality of opportunity in our society. It would consider the technical properties of tests, the ways in which test results are reported and used, the factors that affect the validity of score interpretations, and the consequences of test use. A comprehensive analysis of fairness in testing also would examine the regulations, statutes, and case law that govern test use and the remedies for harmful testing practices. The Standards cannot hope to deal adequately with all of these broad issues, some of which have occasioned sharp disagreement among testing specialists and others interested in testing. Our focus must be limited here to delineating the aspects of tests, testing, and test use that relate to fairness as described in this chapter, which are the responsibility of those who develop, use, and interpret the results of tests, and upon which there is general professional and technical agreement.

Fairness is a fundamental validity issue and requires attention throughout all stages of test development and use. In previous versions of the Standards, fairness and the assessment of individuals from specific subgroups of test takers, such as individuals with disabilities and individuals with diverse linguistic and cultural backgrounds, were presented in separate chapters. In the current version of the Standards, these issues are presented in a single chapter to emphasize that fairness to all individuals in the intended population of test takers is an overriding, foundational concern, and that common principles apply in responding to test-taker characteristics that could interfere with the validity of test score interpretation. This is not to say that the response to test-taker characteristics is the same for individuals from diverse subgroups such as those defined by race, ethnicity, gender, culture, language, age, disability, or socioeconomic status, but rather that these responses should be sensitive to individual characteristics that otherwise would compromise validity. Nonetheless, as discussed in the Introduction, it is important to bear in mind, when using the Standards, that applicability depends on context. For example, potential threats to test validity for examinees with limited English proficiency are different from those for examinees with disabilities. Moreover, threats to validity may differ even for individuals within the same subgroup. For example, individuals with diverse specific disabilities constitute the subgroup of "individuals with disabilities," and examinees classified as "limited English proficient" represent a range of language proficiency levels, educational and cultural backgrounds, and prior experiences. Further, the equivalence of the construct being assessed is a central issue in fairness, whether the context is, for example, individuals with diverse specific disabilities, individuals with limited English proficiency, or individuals across countries and cultures.

As in the previous versions of the Standards, the current chapter addresses measurement bias as a central threat to fairness in testing. However, it also adds two major concepts that have emerged in the literature, particularly in literature regarding education, for minimizing bias and thereby increasing fairness. The first concept is accessibility, the notion that all test takers should have an unobstructed opportunity to demonstrate their standing on the construct(s) being measured.


For example, individuals with limited English proficiency may not be adequately diagnosed on the target construct of a clinical examination if the assessment requires a level of English proficiency that they do not possess. Similarly, standard print and some electronic formats can disadvantage examinees with visual impairments and some older adults who need magnification for reading, and the disadvantage is considered unfair if visual acuity is irrelevant to the construct being measured. These examples show how access to the construct the test is measuring can be impeded by characteristics and/or skills that are unrelated to the intended construct and thereby can limit the validity of score interpretations for intended uses for certain individuals and/or subgroups in the intended test-taking population. Accessibility is a legal requirement in some testing contexts.

The second new concept contained in this chapter is that of universal design. Universal design is an approach to test design that seeks to maximize accessibility for all intended examinees. Universal design, as described more thoroughly later in this chapter, demands that test developers be clear on the construct(s) to be measured, including the target of the assessment, the purpose for which scores will be used, the inferences that will be made from the scores, and the characteristics of examinees and subgroups of the intended test population that could influence access. Test items and tasks can then be purposively designed and developed from the outset to reflect the intended construct, to minimize construct-irrelevant features that might otherwise impede the performance of intended examinee groups, and to maximize, to the extent possible, access for as many examinees as possible in the intended population regardless of race, ethnicity, age, gender, socioeconomic status, disability, or language or cultural background.

Even so, for some individuals in some test contexts and for some purposes, as is described later, there may be a need for additional test adaptations to respond to individual characteristics that otherwise would limit access to the construct as measured. Some examples are creating a braille version of a test, allowing additional testing time, and providing test translations or language simplification. Any test adaptation must be carefully considered, as some adaptations may alter a test's intended construct. Responding to individual characteristics that would otherwise impede access and improving the validity of test score interpretations for intended uses are dual considerations for supporting fairness.

In summary, this chapter interprets fairness as responsiveness to individual characteristics and testing contexts so that test scores will yield valid interpretations for intended uses. The Standards' definition of fairness is often broader than what is legally required. A test that is fair within the meaning of the Standards reflects the same construct(s) for all test takers, and scores from it have the same meaning for all individuals in the intended population; a fair test does not advantage or disadvantage some individuals because of characteristics irrelevant to the intended construct. To the degree possible, characteristics of all individuals in the intended test population, including those associated with race, ethnicity, gender, age, socioeconomic status, or linguistic or cultural background, must be considered throughout all stages of development, administration, scoring, interpretation, and use so that barriers to fair assessment can be reduced. At the same time, test scores must yield valid interpretations for intended uses, and different test contexts and uses may call for different approaches to fairness. For example, in tests used for selection purposes, adaptations to standardized procedures that increase accessibility for some individuals but change the construct being measured could reduce the validity of score inferences for the intended purposes and unfairly advantage those who qualify for adaptation relative to those who do not. In contrast, for diagnostic purposes in medicine and education, adapting a test to increase accessibility for some individuals could increase the accuracy of the diagnosis.

These issues are discussed in the sections below and are represented in the standards that follow the chapter introduction.

General Views of Fairness

The first view of fairness in testing described in this chapter establishes the principle of fair and equitable treatment of all test takers during the testing process.


The second, third, and fourth views presented here emphasize issues of fairness in measurement quality: fairness as the lack or absence of measurement bias, fairness as access to the constructs measured, and fairness as validity of individual test score interpretations for the intended use(s).

Fairness in Treatment During the Testing Process

Regardless of the purpose of testing, the goal of fairness is to maximize, to the extent possible, the opportunity for test takers to demonstrate their standing on the construct(s) the test is intended to measure. Traditionally, careful standardization of tests, administration conditions, and scoring procedures have helped to ensure that test takers have comparable contexts in which to demonstrate the abilities or attributes to be measured. For example, uniform directions, specified time limits, specified room arrangements, use of proctors, and use of consistent security procedures are implemented so that differences in administration conditions will not inadvertently influence the performance of some test takers relative to others. Similarly, concerns for equity in treatment may require, for some tests, that all test takers have qualified test administrators with whom they can communicate and feel comfortable to the extent practicable. Where technology is involved, it is important that examinees have had similar prior exposure to the technology and that the equipment provided to all test takers be of similar processing speed and provide similar clarity and size for images and other media. Procedures for the standardized administration of a test should be carefully documented by the test developer and followed carefully by the test administrator.

Although standardization has been a fundamental principle for assuring that all examinees have the same opportunity to demonstrate their standing on the construct that a test is intended to measure, sometimes flexibility is needed to provide essentially equivalent opportunities for some test takers. In these cases, aspects of a standardized testing process that pose no particular challenge for most test takers may prevent specific groups or individuals from accurately demonstrating their standing with respect to the construct of interest. For example, challenges may arise due to an examinee's disability, cultural background, linguistic background, race, ethnicity, socioeconomic status, limitations that may come with aging, or some combination of these or other factors. In some instances, greater comparability of scores may be attained if standardized procedures are changed to address the needs of specific groups or individuals without any adverse effects on the validity or reliability of the results obtained. For example, a braille test form, a large-print answer sheet, or a screen reader may be provided to enable those with some visual impairments to obtain more equitable access to test content. Legal considerations may also influence how to address individualized needs.

Fairness as Lack of Measurement Bias

Characteristics of the test itself that are not related to the construct being measured, or the manner in which the test is used, may sometimes result in different meanings for scores earned by members of different identifiable subgroups. For example, differential item functioning (DIF) is said to occur when equally able test takers differ in their probabilities of answering a test item correctly as a function of group membership. DIF can be evaluated in a variety of ways. The detection of DIF does not always indicate bias in an item; there needs to be a suitable, substantial explanation for the DIF to justify the conclusion that the item is biased. Differential test functioning (DTF) refers to differences in the functioning of tests (or sets of items) for different specially defined groups. When DTF occurs, individuals from different groups who have the same standing on the characteristic assessed by the test do not have the same expected test score.
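
As one illustration of a DIF procedure (the Mantel-Haenszel approach is offered here only as an example of the "variety of ways" noted above; the data structure and function below are hypothetical and not part of the Standards), examinees can be matched on total score and the odds of a correct response compared across reference and focal groups:

```python
from collections import defaultdict
from math import log

def mantel_haenszel_dif(records):
    """Estimate the Mantel-Haenszel common odds ratio for one item.

    `records` is an iterable of (matching_score, group, correct) tuples,
    where group is "reference" or "focal" and correct is 0 or 1.
    Returns (alpha_MH, MH D-DIF on the ETS delta scale).
    Illustrative sketch only.
    """
    tables = defaultdict(lambda: [0, 0, 0, 0])  # per score level: [A, B, C, D]
    for score, group, correct in records:
        cell = tables[score]
        if group == "reference":
            cell[0 if correct else 1] += 1       # A: ref correct, B: ref incorrect
        else:
            cell[2 if correct else 3] += 1       # C: focal correct, D: focal incorrect

    num = den = 0.0
    for a, b, c, d in tables.values():
        total = a + b + c + d
        if total == 0:
            continue
        num += a * d / total
        den += b * c / total

    alpha_mh = num / den                         # common odds ratio across score levels
    return alpha_mh, -2.35 * log(alpha_mh)       # ETS delta-scale MH D-DIF
```

Values of the delta-scale statistic near zero suggest negligible DIF under this procedure, but, as noted above, a flagged item still requires a substantive explanation before it is judged to be biased.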

The term predictive bias may be used when evidence is found that differences exist in the patterns of associations between test scores and other variables for different groups, bringing with it concerns about bias in the inferences drawn from the use of test scores. Differential prediction is examined using regression analysis.


One approach examines slope and intercept differences between two targeted groups (e.g., African American examinees and Caucasian examinees), while another examines systematic deviations from a common regression line for any number of groups of interest. Both approaches provide valuable information when examining differential prediction. Correlation coefficients provide inadequate evidence for or against a differential prediction hypothesis if groups are found to have unequal means and variances on the test and the criterion.
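
As a sketch of the first approach (the variable names and data are hypothetical; this is one common regression specification, not a method prescribed by the Standards), a criterion can be regressed on the test score, a group indicator, and their interaction, so that the interaction coefficient reflects a slope difference and the indicator coefficient an intercept difference:

```python
import numpy as np

def differential_prediction_fit(test_scores, criterion, group_indicator):
    """Fit criterion = b0 + b1*score + b2*group + b3*(score*group) by least
    squares. b2 estimates an intercept difference and b3 a slope difference
    between the two groups. Illustrative sketch only."""
    x = np.asarray(test_scores, dtype=float)
    y = np.asarray(criterion, dtype=float)
    g = np.asarray(group_indicator, dtype=float)   # 0 = one group, 1 = the other

    design = np.column_stack([np.ones_like(x), x, g, x * g])
    coefficients, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefficients                            # [b0, b1, b2, b3]
```

Judgments about the size and significance of the intercept and slope terms would, of course, require the usual standard errors and adequate samples in each group.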

When credible evidence indicates potential bias in measurement (i.e., lack of consistent construct meaning across groups, DIF, DTF) or bias in predictive relations, these potential sources of bias should be independently investigated, because the presence or absence of one form of such bias may have no relationship with other forms of bias. For example, a predictor test may show no significant levels of DIF, yet show group differences in regression lines in predicting a criterion. Although it is important to guard against the possibility of measurement bias for the subgroups that have been defined as relevant in the intended test population, it may not be feasible to fully investigate all possibilities, particularly in the employment context. For example, the number of subgroup members in the field test or norming population may limit the possibility of standard empirical analyses. In these cases, previous research, a construct-based rationale, and/or data from similar tests may address concerns related to potential bias in measurement. In addition, and especially where credible evidence of potential bias exists, small-sample methodologies should be considered. For example, potential bias for relevant subgroups may be examined through small-scale tryouts that use cognitive labs and/or interviews or focus groups to solicit evidence on the validity of interpretations made from the test scores.

A related issue is the extent to which the construct being assessed has equivalent meaning across the individuals and groups within the intended population of test takers. This is especially important when the assessment crosses international borders and cultures. Evaluation of the underlying construct and properties of the test within one country or culture may not generalize across borders or cultures. This can lead to invalid test score interpretations. Careful attention to bias in score interpretations should be practiced in such contexts.

Fairness in Access to the Construct(s) as Measured

The goal that all intended test takers have a full opportunity to demonstrate their standing on the construct being measured has given rise to concerns about accessibility in testing. Accessible testing situations are those that enable all test takers in the intended population, to the extent feasible, to show their status on the target construct(s) without being unduly advantaged or disadvantaged by individual characteristics (e.g., characteristics related to age, disability, race/ethnicity, gender, or language) that are irrelevant to the construct(s) the test is intended to measure. Accessibility is actually a test bias issue because obstacles to accessibility can result in different interpretations of test scores for individuals from different groups. Accessibility also has important ethical and legal ramifications.

Accessibility can best be understood by contrasting the knowledge, skills, and abilities that reflect the construct(s) the test is intended to measure with the knowledge, skills, and abilities that are not the target of the test but are required to respond to the test tasks or test items. For some test takers, factors related to individual characteristics such as age, race, ethnicity, socioeconomic status, cultural background, disability, and/or English language proficiency may restrict accessibility and thus interfere with the measurement of the construct(s) of interest. For example, a test taker with impaired vision may not be able to access the printed text of a personality test. If the test were provided in large print, the test questions could be more accessible to the test taker and would be more likely to lead to a valid measurement of the test taker's personality characteristics. It is important to be aware of test characteristics that may inadvertently render test questions less accessible for some subgroups of the intended testing population. For example, a test question that employs idiomatic phrases unrelated to the construct being measured could have the effect of making the test less accessible for test takers who are not native speakers of English.


The accessibility of a test could also be decreased by questions that use regional vocabulary unrelated to the target construct or use stimulus contexts that are less familiar to individuals from some cultural subgroups than others.

As discussed later in this chapter, some test-taker characteristics that impede access are related to the construct being measured, for example, dyslexia in the context of tests of reading. In these cases, providing individuals with access to the construct and getting some measure of it may require some adaptation of the construct as well. In situations like this, it may not be possible to develop a measurement that is comparable across adapted and unadapted versions of the test; however, the measure obtained by the adapted test will most likely provide a more accurate assessment of the individual's skills and/or abilities (although perhaps not of the full intended construct) than that obtained without using the adaptation.

Providing access to a test construct becomes particularly challenging for individuals with more than one characteristic that could interfere with test performance; for example, older adults who are not fluent in English or English learners who have moderate cognitive disabilities.

Fairness as Validity of Individual Test Score Interpretations for the Intended Uses

It is important to keep in mind that fairness concerns the validity of individual score interpretations for intended uses. In attempting to ensure fairness, we often generalize across groups of test takers such as individuals with disabilities, older adults, individuals who are learning English, or those from different racial or ethnic groups or different cultural and/or socioeconomic backgrounds; however, this is done for convenience and is not meant to imply that these groups are homogeneous or that, consequently, all members of a group should be treated similarly when making interpretations of test scores for individuals (unless there is validity evidence to support such generalizations). It is particularly important, when drawing inferences about an examinee's skills or abilities, to take into account the individual characteristics of the test taker and how these characteristics may interact with the contextual features of the testing situation.

The complex interplay of language proficiency and context provides one example of the challenges to valid interpretation of test scores for some testing purposes. Proficiency in English not only affects the interpretation of an English language learner's test scores on tests administered in English but, more important, also may affect the individual's developmental and academic progress. Individuals who differ culturally and linguistically from the majority of the test takers are at risk for inaccurate score interpretations because of multiple factors associated with the assumption that, absent language proficiency issues, these individuals have developmental trajectories comparable to those of individuals who have been raised in an environment mediated by a single language and culture. For instance, consider two sixth-grade children who entered school as limited English speakers. The first child entered school in kindergarten and has been instructed in academic courses in English; the second also entered school in kindergarten but has been instructed in his or her native language. The two will have different developmental patterns. In the former case, the interrupted native language development has an attenuating effect on learning and academic performance, but the individual's English proficiency may not be a significant barrier to testing. In contrast, the examinee who has had instruction in his or her native language through the sixth grade has had the opportunity for fully age-appropriate cognitive, academic, and language development; but, if tested in English, the examinee will need the test administered in such a way as to minimize the language barrier if proficiency in English is not part of the construct being measured.

As the above examples show, adaptation to individual characteristics and recognition of the heterogeneity within subgroups may be important to the validity of individual interpretations of test results in situations where the intent is to understand and respond to individual performance.


Professionals may be justified in deviating from standardized procedures to gain a more accurate measurement of the intended construct and to provide more appropriate individual decisions. However, for other contexts and uses, deviations from standardized procedures may be inappropriate because they change the construct being measured, compromise the comparability of scores or use of norms, and/or unfairly advantage some individuals.

In closing this section on the meanings of fairness, note that the Standards' measurement perspective explicitly excludes one common view of fairness in public discourse: fairness as the equality of testing outcomes for relevant test-taker subgroups. Certainly, most testing professionals agree that group differences in testing outcomes should trigger heightened scrutiny for possible sources of test bias. Examination of group differences also may be important in generating new hypotheses about bias, fair treatment, and the accessibility of the construct as measured; and in fact, there may be legal requirements to investigate certain differences in the outcomes of testing among subgroups. However, group differences in outcomes do not in themselves indicate that a testing application is biased or unfair.

In many cases, it is not clear whether the differences are due to real differences between groups in the construct being measured or to some source of bias (e.g., construct-irrelevant variance or construct underrepresentation). In most cases, it may be some combination of real differences and bias. A serious search for possible sources of bias that comes up empty provides reassurance that the potential for bias is limited, but even a very extensive research program cannot rule the possibility out. It is always possible that something was missed, and therefore, prudence would suggest that an attempt be made to minimize the differences. For example, some racial and ethnic subgroups have lower mean scores on some standardized tests than do other subgroups. Some of the factors that contribute to these differences are understood (e.g., large differences in family income and other resources, differences in school quality and students' opportunity to learn the material to be assessed), but even where serious efforts have been made to eliminate possible sources of bias in test content and formats, the potential for some score bias cannot be completely ruled out. Therefore, continuing efforts in test design and development to eliminate potential sources of bias without compromising validity, and consistent with legal and regulatory standards, are warranted.

Threats to Fair and Valid Interpretations of Test Scores

A prime threat to fair and valid interpretation of test scores comes from aspects of the test or testing process that may produce construct-irrelevant variance in scores that systematically lowers or raises scores for identifiable groups of test takers and results in inappropriate score interpretations for intended uses. Such construct-irrelevant components of scores may be introduced by inappropriate sampling of test content, aspects of the test context such as lack of clarity in test instructions, item complexities that are unrelated to the construct being measured, and/or test response expectations or scoring criteria that may favor one group over another. In addition, opportunity to learn (i.e., the extent to which an examinee has been exposed to instruction or experiences assumed by the test developer and/or user) can influence the fair and valid interpretations of test scores for their intended uses.

Test Content

One potential source of construct-irrelevant variance in test scores arises from inappropriate test content, that is, test content that confounds the measurement of the target construct and differentially favors individuals from some subgroups over others. A test intended to measure critical reading, for example, should not include words and expressions especially associated with particular occupations, disciplines, cultural backgrounds, socioeconomic status, racial/ethnic groups, or geographical locations, so as to maximize the measurement of the construct (the ability to read critically) and to minimize confounding of this measurement with prior knowledge and experience that are likely to advantage, or disadvantage, test takers from particular subgroups.


Differential engagement and motivational value may also be factors in exacerbating construct-irrelevant components of content. Material that is likely to be differentially interesting should be balanced to appeal broadly to the full range of the targeted testing population (except where the interest level is part of the construct being measured). In testing, such balance extends to representation of individuals from a variety of subgroups within the test content itself. For example, applied problems can feature children and families from different racial/ethnic, socioeconomic, and language groups. Also, test content or situations that are offensive or emotionally disturbing to some test takers and may impede their ability to engage with the test should not appear in the test unless the use of the offensive or disturbing content is needed to measure the intended construct. Examples of this type of content are graphic descriptions of slavery or the Holocaust, when such descriptions are not specifically required by the construct.

Depending on the context and purpose of tests, it is both common and advisable for test developers to engage an independent and diverse panel of experts to review test content for language, illustrations, graphics, and other representations that might be differentially familiar or interpreted differently by members of different groups and for material that might be offensive or emotionally disturbing to some test takers.

Test Context

The term test context, as used here, refers tomultiple aspects of the test and testing environmentthat may affect the performance of an examineeand consequently give rise to construct-irrelevantvariance in the test scores. As research on contextualfactors (e.g., stereotype threat) is ongoing, testdevelopers and test users should pay attention tothe emerging empirical literature on these topicsso that they can use this information if and whenthe preponderance of evidence dictates that it isappropriate to do so. Construct-irrelevant variancemay result from a lack of clarity in test instructions,from unrelated complexity or language demandsin test tasks, and/or from other characteristics of

test items that are unrelated to the construct butlead some individuals to respond in particularways. For example, examinees from diverseracial/ethnic, linguistic, or cultural backgroundsor who differ by gender may be poorly assessedby a vocational interest inventory whose questionsdisproportionately ask about competencies, ac-tivities, and interests that are stereotypically asso-ciated with particular subgroups.

When test settings have an interpersonal context, the interaction of examiner with test taker can be a source of construct-irrelevant variance or bias. Users of tests should be alert to the possibility that such interactions may sometimes affect test fairness. Practitioners administering the test should be aware of the possibility of complex interactions with test takers and other situational variables. Factors that may affect the performance of the test taker include the race, ethnicity, gender, and linguistic and cultural background of both examiner and test taker; the test taker's experience with formal education; the testing style of the examiner; the level of acculturation of the test taker and examiner; the test taker's primary language; the language used for test administration (if it is not the primary language of the test taker); and the use of a bilingual or bicultural interpreter.

Testing of individuals who are bilingual or multilingual poses special challenges. An individual who knows two or more languages may not test well in one or more of the languages. For example, children from homes whose families speak Spanish may be able to understand Spanish but express themselves best in English, or vice versa. In addition, some persons who are bilingual use their native language in most social situations and use English primarily for academic and work-related activities; the use of one or both languages depends on the nature of the situation. Non-native English speakers who give the impression of being fluent in conversational English may be slower or not completely competent in taking tests that require English comprehension and literacy skills. Thus, in some settings, an understanding of an individual's type and degree of bilingualism or multilingualism is important for testing the individual appropriately. Note that this concern may not apply when the construct of interest is defined as a particular kind of language proficiency (e.g., academic language of the kind found in textbooks, language and vocabulary specific to workplace and employment testing).

Test Response

In some cases, construct-irrelevant variance may arise because test items elicit varieties of responses other than those intended or because items can be solved in ways that were not intended. To the extent that such responses are more typical of some subgroups than others, biased score interpretations may result. For example, some clients responding to a neuropsychological test may attempt to provide the answers they think the test administrator expects, as opposed to the answers that best describe themselves.

Construct-irrelevant components in test scores may also be associated with test response formats that pose particular difficulties or are differentially valued by particular individuals. For example, test performance may rely on some capability (e.g., English language proficiency or fine-motor coordination) that is irrelevant to the target construct(s) but nonetheless poses impediments to the test responses for some test takers not having the capability. Similarly, different values associated with the nature and degree of verbal output can influence test-taker responses. Some individuals may judge verbosity or rapid speech as rude, whereas others may regard those speech patterns as indications of high mental ability or friendliness. An individual of the first type who is evaluated with values appropriate to the second may be considered taciturn, withdrawn, or of low mental ability. Another example is a person with memory or language problems or depression; such a person's ability to communicate or show interest in communicating verbally may be constrained, which may result in interpretations of the outcomes of the assessment that are invalid and potentially harmful to the person being tested.

In the development and use of scoring rubrics, it is particularly important that credit be awarded for response characteristics central to the construct being measured and not for response characteristics that are irrelevant or tangential to the construct. Scoring rubrics may inadvertently advantage some individuals over others. For example, a scoring rubric for a constructed response item might reserve the highest score level for test takers who provide more information or elaboration than was actually requested. In this situation, test takers who simply follow instructions, or test takers who value succinctness in responses, will earn lower scores; thus, characteristics of the individuals become construct-irrelevant components of the test scores. Similarly, the scoring of open-ended responses may introduce construct-irrelevant variance for some test takers if scorers and/or automated scoring routines are not sensitive to the full diversity of ways in which individuals express their ideas. With the advent of automated scoring for complex performance tasks, for example, it is important to examine the validity of the automated scoring results for relevant subgroups in the test-taking population.
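As one hedged illustration of such an examination (not a procedure prescribed by the Standards), automated and human scores can be compared within each relevant subgroup. The Python sketch below assumes a small hypothetical data set with columns named subgroup, human, and automated; operational studies would use far larger samples and formal agreement indices such as weighted kappa.

```python
import pandas as pd

# Hypothetical data: one row per scored response, with a human score, an
# automated score, and a subgroup label. Column names are illustrative only.
df = pd.DataFrame({
    "subgroup":  ["A", "A", "A", "B", "B", "B"],
    "human":     [3, 4, 2, 3, 5, 4],
    "automated": [3, 4, 3, 2, 4, 3],
})

# For each subgroup, summarize how the automated scores track the human
# scores: a mean difference (automated minus human) and a correlation give a
# rough first indication of whether the scoring engine behaves consistently
# across groups.
summary = df.groupby("subgroup").apply(
    lambda g: pd.Series({
        "n": len(g),
        "mean_diff": (g["automated"] - g["human"]).mean(),
        "correlation": g["automated"].corr(g["human"]),
    })
)
print(summary)
```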

Opportunity to Learn

Finally, opportunity to learn (the extent to which individuals have had exposure to instruction or knowledge that affords them the opportunity to learn the content and skills targeted by the test) has several implications for the fair and valid interpretation of test scores for their intended uses. Individuals' prior opportunity to learn can be an important contextual factor to consider in interpreting and drawing inferences from test scores. For example, a recent immigrant who has had little prior exposure to school may not have had the opportunity to learn concepts assumed to be common knowledge by a personality inventory or ability measure, even if that measure is administered in the native language of the test taker. As another example, there has been considerable public discussion about potential inequities in school resources available to students from traditionally disadvantaged groups, for example, racial, ethnic, language, and cultural minorities and rural students. Such inequities affect the quality of education received. To the extent that inequity exists, the validity of inferences about student ability drawn from achievement test scores may be compromised. Not taking into account prior opportunity to learn could lead to misdiagnosis, inappropriate placement, and/or inappropriate assignment of services, which could have significant consequences for an individual.

Beyond its impact on the validity of test score interpretations for intended uses, opportunity to learn has important policy and legal ramifications in education. Opportunity to learn is a fairness issue when an authority provides differential access to opportunity to learn for some individuals and then holds those individuals who have not been provided that opportunity accountable for their test performance. This problem may affect high-stakes competency tests in education, for example, when educational authorities require a certain level of test performance for high school graduation. Here, there is a fairness concern that students not be held accountable for, or face serious permanent negative consequences from, their test results when their school experiences have not provided them the opportunity to learn the subject matter covered by the test. In such cases, students' low scores may accurately reflect what they know and can do, so that, technically, the interpretation of the test results for the purpose of measuring how much the students have learned may not be biased. However, it may be considered unfair to severely penalize students for circumstances that are not under their control, that is, for not learning content that their schools have not taught. It is generally accepted that before high-stakes consequences can be imposed for failing an examination in educational settings, there must be evidence that students have been provided curriculum and instruction that incorporates the constructs addressed by the test.

Several important issues arise when opportunity to learn is considered as a component of fairness. First, it is difficult to define opportunity to learn in educational practice, particularly at the individual level. Opportunity is generally a matter of degree and is difficult to quantify; moreover, the measurement of some important learning outcomes may require students to work with materials that they have not seen before. Second, even if it is possible to document the topics included in the curriculum for a group of students, specific content coverage for any one student may be impossible to determine. Third, granting a diploma to a low-scoring examinee on the grounds that the student had insufficient opportunity to learn the material tested means certificating someone who has not attained the degree of proficiency the diploma is intended to signify.

It should be noted that concerns about opportunity to learn do not necessarily apply to situations where the same authority is not responsible for both the delivery of instruction and the testing and/or interpretation of results. For example, in college admissions decisions, opportunity to learn may be beyond the control of the test users, and it may not influence the validity of test interpretations for their intended use (e.g., selection and/or admissions decisions). Chapter 12, "Educational Testing and Assessment," provides additional perspective on opportunity to learn.

Minimizing Construct-Irrelevant Components Through Test Design and Testing Adaptations

Standardized tests should be designed to facilitate accessibility and minimize construct-irrelevant barriers for all test takers in the target population, as far as practicable. Before considering the need for any assessment adaptations for test takers who may have special needs, the assessment developer first must attempt to improve accessibility within the test itself. Some of these basic principles are included in the test design process called universal design. By using universal design, test developers begin the test development process with an eye toward maximizing fairness. Universal design emphasizes the need to develop tests that are as usable as possible for all test takers in the intended test population, regardless of characteristics such as gender, age, language background, culture, socioeconomic status, or disability.

Principles of universal design include defining constructs precisely, so that what is being measured can be clearly differentiated from test-taker characteristics that are irrelevant to the construct but that could otherwise interfere with some test takers' ability to respond. Universal design avoids, where possible, item characteristics and formats, or test characteristics (for example, inappropriate test speededness), that may bias scores for individuals or subgroups due to construct-irrelevant characteristics that are specific to these test takers.

Universal design processes strive to minimize access challenges by taking into account test characteristics that may impede access to the construct for certain test takers, such as the choice of content, test tasks, response procedures, and testing procedures. For example, the content of tests can be made more accessible by providing user-selected font sizes in a technology-based test, by avoiding item contexts that would likely be unfamiliar to individuals because of their cultural background, by providing extended administration time when speed is not relevant to the construct being measured, or by minimizing the linguistic load of test items intended to measure constructs other than competencies in the language in which the test is administered.

Although the principles of universal design for assessment provide a useful guide for developing assessments that reduce construct-irrelevant variance, researchers are still in the process of gathering empirical evidence to support some of these principles. It is important to note that not all tests can be made accessible for everyone by attention to design changes such as those discussed above. Even when tests are developed to maximize fairness through the use of universal design and other practices to increase access, there will still be situations where the test is not appropriate for all test takers in the intended population. Therefore, some test adaptations may be needed for those individuals whose characteristics would otherwise impede their access to the examination.

Adaptations are changes to the original test design or administration to increase access to the test for such individuals. For example, a person who is blind may read only in braille format, and an individual with hemiplegia may be unable to hold a pencil and thus have difficulty completing a standard written exam. Students with limited English proficiency may be proficient in physics but may not be able to demonstrate their knowledge if the physics test is administered in English. Depending on testing circumstances and purposes of the test, as well as individual characteristics, such adaptations might include changing the content or presentation of the test items, changing the administration conditions, and/or changing the response processes. The term adaptation is used to refer to any such change. It is important, however, to differentiate between changes that result in comparable scores and changes that may not produce scores that are comparable to those from the original test. Although the terms may have different meanings under applicable laws, as used in the Standards the term accommodation is used to denote changes with which the comparability of scores is retained, and the term modification is used to denote changes that affect the construct measured by the test. With a modification, the changes affect the construct being measured and consequently lead to scores that differ in meaning from those from the original test.1

It is important to keep in mind that attention to design and the provision of altered tests do not always ensure that test results will be fair and valid for all examinees. Those who administer tests and interpret test scores need to develop a full understanding of the usefulness and limitations of test design procedures for accessibility and any alterations that are offered.

A Range of Test Adaptations

Rather than a simple dichotomy, potential test adaptations reflect a broad range of test changes. At one end of the range are test accommodations. As the term is used in the Standards, accommodations consist of relatively minor changes to the presentation and/or format of the test, test administration, or response procedures that maintain the original construct and result in scores comparable to those on the original test. For example, text magnification might be an accommodation for a test taker with a visual impairment who otherwise would have difficulty deciphering test directions or items. English–native language glossaries are an example of an accommodation that might be provided for limited English proficient test takers on a construction safety test to help them understand what is being asked. The glossaries would contain words that, while not directly related to the construct being measured, would help limited English test takers understand the context of the question or task being posed.

At the other end of the range are adaptations that transform the construct being measured, including the test content and/or testing conditions, to get a reasonable measure of a somewhat different but appropriate construct for designated test takers. For example, in educational testing, different tests addressing alternate achievement standards are designed for students with severe cognitive disabilities for the same subjects in which students without disabilities are assessed. Clearly, scores from these different tests cannot be considered comparable to those resulting from the general assessment, but instead represent scores from a new test that requires the same rigorous development and validation processes as would be carried out for any new assessment. (An expanded discussion of the use of such alternate assessments is found in chap. 12; alternate assessments will not be treated further in the present chapter.) Other adaptations change the intended construct to make it accessible for designated students while retaining as much of the original construct as possible. For example, a reading test adaptation might provide a dyslexic student with a screen reader that reads aloud the passages and the test questions measuring reading comprehension. If the construct is intentionally defined as requiring both the ability to decode and the ability to comprehend written language, the adaptation would require a different interpretation of the test scores as a measure of reading comprehension. Clearly, this adaptation changes the construct being measured, because the student does not have to decode the printed text; but without the adaptation, the student may not be able to demonstrate any standing on the construct of reading comprehension. On the other hand, if the purpose of the reading test is to evaluate comprehension without concern for decoding ability, the adaptation might be judged to support more valid interpretations of some students' reading comprehension, and the essence of the relevant parts of the construct might be judged to be intact. The challenge for those who report, interpret, and/or use test scores from adapted tests is to recognize which adaptations provide scores that are comparable to the scores from the original, unadapted assessment and which adaptations do not. This challenge becomes even more difficult when evidence to support the comparability of scores is not available.

1 The Americans with Disabilities Act (ADA) uses the terms accommodation and modification differently from the Standards. Title I of the ADA uses the term reasonable accommodation to refer to changes that enable qualified individuals with disabilities to obtain employment or to perform their jobs. Titles II and III use the term reasonable modification in much the same way. Under the ADA, an accommodation or modification to a test that fundamentally alters the construct being measured would not be called something different; rather, it would probably be found not "reasonable."

Test Accommodations: Comparable Measures That Maintain the Intended Construct

Comparability of scores enables test users to make comparable inferences based on the scores for all test takers. Comparability also is the defining feature for a test adaptation to be considered an accommodation. Scores from the accommodated version of the test must yield inferences comparable to those from the standard version; to make this happen is a challenging proposition. On the one hand, common, uniform procedures are a basic underpinning for score validity and comparability. On the other hand, accommodations by their very nature mean that something in the testing circumstance has been changed, because adhering to the original standardized procedures would interfere with valid measurement of the intended construct(s) for some individuals.

The comparability of inferences made from accommodated test scores rests largely on whether the scores represent the same constructs as those from the original test. This determination requires a very clear definition of the intended construct(s). For example, when non-native speakers of the language of the test take a survey of their health and nutrition knowledge, one may not know whether the test score is, in whole or in part, a measure of the ability to read in the language of the test rather than a measure of the intended construct. If the test is not intended to also be a measure of the ability to read in English, then test scores do not represent the same construct(s) for examinees who may have poor reading skills, such as limited English proficient test takers, as they do for those who are fully proficient in reading English. An adaptation that improves the accessibility of the test for non-native speakers of English by providing direct or indirect linguistic supports may yield a score that is uncontaminated by the ability to understand English.

At the same time, construct underrepresentation is a primary threat to the validity of test accommodations. For example, extra time is a common accommodation, but if speed is part of the intended construct, it is inappropriate to allow for extra time in the test administration. Scores obtained on the test with extended administration time may underrepresent the construct measured by the strictly timed test because speed will not be part of the construct measured by the extended-time test. Similarly, translating a reading comprehension test used for selection into an organization's training program is inappropriate if reading comprehension in English is important to successful participation in the program.

Claims that accommodated versions of a test yield interpretations comparable to those based on scores from the original test and that the construct being measured has not been changed need to be evaluated and substantiated with evidence. Although score comparability is easiest to establish when different test forms are constructed following identical procedures and then equated statistically, such procedures usually are not possible for accommodated and nonaccommodated versions of tests. Instead, relevant evidence can take a variety of forms, from experimental studies to assess construct equivalence, to smaller qualitative studies, and/or use of professional judgment and expert review. Whatever the case, test developers and/or users should seek evidence of the comparability of the accommodated and original assessments.

A variety of strategies for accommodating tests and testing procedures have been implemented to be responsive to the needs of test takers with disabilities and those with diverse linguistic and cultural backgrounds. Similar approaches may be adapted for other subgroups. Specific strategies depend on the purpose of the test and the construct(s) the test is intended to measure. Some strategies require changing test administration procedures (e.g., instructions, response format), whereas others alter testing medium, timing, settings, or format. Depending on the linguistic background or the nature and extent of the disability, one or more testing changes may be appropriate for a particular individual.

Regardless of the individual's characteristics that make accommodations necessary, it is important that test accommodations address the specific access issue(s) that otherwise would bias an individual's test results. For example, accommodations provided to limited English proficient test takers should be designed to address appropriate linguistic support needs; those provided to test takers with visual impairments should address the inability to see test material. Accommodations should be effective in removing construct-irrelevant barriers to an individual's test performance without providing an unfair advantage over individuals who do not receive the accommodation. Admittedly, achieving both objectives can be challenging.

Adaptations involving test translations merit special consideration. Simply translating a test from one language to another does not ensure that the translation produces a version of the test that is comparable in content and difficulty level to the original version of the test, or that the translated test produces scores that are equally reliable/precise and valid as those from the original test. Furthermore, one cannot assume that the relevant acculturation, clinical, or educational experiences are similar for test takers taking the translated version and for the target group used to develop the original version. In addition, it cannot be assumed that translation into the native language is always a preferred accommodation. Research in educational testing, for example, shows that translated content tests are not effective unless test takers have been instructed using the language of the translated test. Whenever tests are translated from one language to a second language, evidence of the validity, reliability/precision, and comparability of scores on the different versions of the tests should be collected and reported.

When the testing accommodation employs the use of an interpreter, it is desirable, where feasible, to obtain someone who has a basic understanding of the process of psychological and educational assessment, is fluent in the language of the test and the test taker's native language, and is familiar with the test taker's cultural background. The interpreter ideally needs to understand the importance of following standardized procedures, the importance of accurately conveying to the examiner a test taker's actual responses, and the role and responsibilities of the interpreter in testing. The interpreter must be careful not to provide any assistance to the candidate that might potentially compromise the validity of the interpretation for intended uses of the assessment results.

Finally, it is important to standardize procedures for implementing accommodations, as far as possible, so that comparability of scores is maintained. Standardized procedures for test accommodations must include rules for determining who is eligible for an accommodation, as well as precisely how the accommodation is to be administered. Test users should monitor adherence to the rules for eligibility and for appropriate administration of the accommodated test.

Test Modifications: Noncomparable Measures That Change the Intended Construct

There may be times when additional flexibility is required to obtain even partial measurement of the construct; that is, it may be necessary to consider a modification to a test that will result in changing the intended construct to provide even limited access to the construct that is being measured. For example, an individual with dyscalculia may have limited ability to do computations without a calculator; however, if provided a calculator, the individual may be able to do the calculations required in the assessment. If the construct being assessed involves broader mathematics skill, the individual may have limited access to the construct being measured without the use of a calculator; with the modification, however, the individual may be able to demonstrate mathematics problem-solving skills, even if he or she is not able to demonstrate computation skills. Because modified assessments are measuring a different construct from that measured by the standardized assessment, it is important to interpret the assessment scores as resulting from a new test and to gather whatever evidence is necessary to evaluate the validity of the interpretations for intended uses of the scores. For norm-based score interpretations, any modification that changes the construct will invalidate the norms for score interpretations. Likewise, if the construct is changed, criterion-based score interpretations from the modified assessment (for example, making classification decisions such as "pass/fail" or assigning categories of mastery such as "basic," "proficient," or "advanced" using cut scores determined on the original assessment) will not be valid.

Reporting Scores From Accommodated and Modified Tests

Typically, test administrators and testing professionals document steps used in making test accommodations or modifications in the test report; clinicians may also include a discussion of the validity of the interpretations of the resulting scores for intended uses. This practice of reporting the nature of accommodations and modifications is consistent with implied requirements to communicate information as to the nature of the assessment process if these changes may affect the reliability/precision of test scores or the validity of interpretations drawn from test scores.

The flagging of test score reports can be a controversial issue and subject to legal requirements. When there is clear evidence that scores from regular and altered tests or test administrations are not comparable, consideration should be given to informing score users, potentially by flagging the test results to indicate their special nature, to the extent permitted by law. Where there is credible evidence that scores from regular and altered tests are comparable, then flagging generally is not appropriate. There is little agreement in the field on how to proceed when credible evidence on comparability does not exist. To the extent possible, test developers and/or users should collect evidence to examine the comparability of regular and altered tests or administration procedures for the test's intended purposes.

Appropriate Use of Accommodations or Modifications

Depending on the construct to be measured and the test's purpose, there are some testing situations where accommodations as defined by the Standards are not needed or modifications as defined by the Standards are not appropriate. First, the reason for the possible alteration, such as English language skills or a disability, may in fact be directly relevant to the focal construct. In employment testing, it would be inappropriate to make changes to the test if the test is designed to assess essential skills required for the job and the test changes would fundamentally alter the constructs being measured. For example, despite increased automation and use of recording devices, some court reporter jobs require individuals to be able to work quickly and accurately. Speed is an important aspect of the construct that cannot be adapted. As another example, a work sample for a customer service job that requires fluent communication in English would not be translated into another language.

Second, an adaptation for a particular disability is inappropriate when the purpose of a test is to diagnose the presence and degree of that disability. For example, allowing extra time on a timed test to determine distractibility and speed-of-processing difficulties associated with attention deficit disorder would make it impossible to determine the extent to which the attention and processing-speed difficulties actually exist.

Third, it is important to note that not all individuals within a general class of examinees, such as those with diverse linguistic or cultural backgrounds or with disabilities, may require special provisions when taking tests. The language skills, cultural knowledge, or specific disabilities that these individuals possess, for example, might not influence their performance on a particular type of test. Hence, for these individuals, no changes are needed.

The effectiveness of a given accommodation also plays a role in determinations of appropriate use. If a given accommodation or modification does not increase access to the construct as measured, there is little point in using it. Evidence of effectiveness may be gathered through quantitative or qualitative studies. Professional judgment necessarily plays a substantial role in decisions about changes to the test or testing situation.

In summary, fairness is a fundamental issue for valid test score interpretation, and it should therefore be the goal for all testing applications. Fairness is the responsibility of all parties involved in test development, administration, and score interpretation for the intended purposes of the test.


STANDARDS FOR FAIRNESS

The standards in this chapter begin with an overarching standard (numbered 3.0), which is designed to convey the central intent or primary focus of the chapter. The overarching standard may also be viewed as the guiding principle of the chapter, and is applicable to all tests and test users. All subsequent standards have been separated into four thematic clusters labeled as follows:

1. Test Design, Development, Administration, and Scoring Procedures That Minimize Barriers to Valid Score Interpretations for the Widest Possible Range of Individuals and Relevant Subgroups

2. Validity of Test Score Interpretations for Intended Uses for the Intended Examinee Population

3. Accommodations to Remove Construct-Irrelevant Barriers and Support Valid Interpretations of Scores for Their Intended Uses

4. Safeguards Against Inappropriate Score Interpretations for Intended Uses

Standard 3.0

All steps in the testing process, including test design, validation, development, administration, and scoring procedures, should be designed in such a manner as to minimize construct-irrelevant variance and to promote valid score interpretations for the intended uses for all examinees in the intended population.

Comment: The central idea of fairness in testing is to identify and remove construct-irrelevant barriers to maximal performance for any examinee. Removing these barriers allows for the comparable and valid interpretation of test scores for all examinees. Fairness is thus central to the validity and comparability of the interpretation of test scores for intended uses.

Cluster 1. Test Design, Development, Administration, and Scoring Procedures That Minimize Barriers to Valid Score Interpretations for the Widest Possible Range of Individuals and Relevant Subgroups

Standard 3.1

Those responsible for test development, revision, and administration should design all steps of the testing process to promote valid score interpretations for intended score uses for the widest possible range of individuals and relevant subgroups in the intended population.

Comment: Test developers must clearly delineate both the constructs that are to be measured by the test and the characteristics of the individuals and subgroups in the intended population of test takers. Test tasks and items should be designed to maximize access and be free of construct-irrelevant barriers as far as possible for all individuals and relevant subgroups in the intended test-taker population. One way to accomplish these goals is to create the test using principles of universal design, which take account of the characteristics of all individuals for whom the test is intended and include such elements as precisely defining constructs and avoiding, where possible, characteristics and formats of items and tests (for example, test speededness) that may compromise valid score interpretations for individuals or relevant subgroups. Another principle of universal design is to provide simple, clear, and intuitive testing procedures and instructions. Ultimately, the goal is to design a testing process that will, to the extent practicable, remove potential barriers to the measurement of the intended construct for all individuals, including those individuals requiring accommodations. Test developers need to be knowledgeable about group differences that may interfere with the precision of scores and the validity of test score inferences, and they need to be able to take steps to reduce bias.

Standard 3.2

Test developers are responsible for developing tests that measure the intended construct and for minimizing the potential for tests' being affected by construct-irrelevant characteristics, such as linguistic, communicative, cognitive, cultural, physical, or other characteristics.

Comment: Unnecessary linguistic, communicative, cognitive, cultural, physical, and/or other characteristics in test item stimulus and/or response requirements can impede some individuals in demonstrating their standing on intended constructs. Test developers should use language in tests that is consistent with the purposes of the tests and that is familiar to as wide a range of test takers as possible. Avoiding the use of language that has different meanings or different connotations for relevant subgroups of test takers will help ensure that test takers who have the skills being assessed are able to understand what is being asked of them and respond appropriately. The level of language proficiency, physical response, or other demands required by the test should be kept to the minimum required to meet work and credentialing requirements and/or to represent the target construct(s). In work situations, the modality in which language proficiency is assessed should be comparable to that required on the job, for example, oral and/or written, comprehension and/or production. Similarly, the physical and verbal demands of response requirements should be consistent with the intended construct.

Standard 3.3

Those responsible for test development should include relevant subgroups in validity, reliability/precision, and other preliminary studies used when constructing the test.

Comment: Test developers should include individuals from relevant subgroups of the intended testing population in pilot or field test samples used to evaluate item and test appropriateness for construct interpretations. The analyses that are carried out using pilot and field testing data should seek to detect aspects of test design, content, and format that might distort test score interpretations for the intended uses of the test scores for particular groups and individuals. Such analyses could employ a range of methodologies, including those appropriate for small sample sizes, such as expert judgment, focus groups, and cognitive labs. Both qualitative and quantitative sources of evidence are important in evaluating whether items are psychometrically sound and appropriate for all relevant subgroups.

If sample sizes permit, it is often valuable to carry out separate analyses for relevant subgroups of the population. When it is not possible to include sufficient numbers in pilot and/or field test samples in order to do separate analyses, operational test results may be accumulated and used to conduct such analyses when sample sizes become large enough to support the analyses.

If pilot or field test results indicate that items or tests function differentially for individuals from, for example, relevant age, cultural, disability, gender, linguistic, and/or racial/ethnic groups in the population of test takers, test developers should investigate aspects of test design, content, and format (including response formats) that might contribute to the differential performance of members of these groups and, if warranted, eliminate these aspects from future test development practices.
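One commonly used quantitative screen for such differential item functioning, offered here only as an illustration and not as a procedure required by the Standards, is the Mantel-Haenszel method, which compares item performance for a reference group and a focal group after matching test takers on total score. The sketch below is a minimal Python/NumPy version with simulated data; the function name and variables are hypothetical.

```python
import numpy as np

def mantel_haenszel_odds_ratio(correct, group, total_score):
    """Minimal Mantel-Haenszel DIF screen for one dichotomous item.

    correct:     1/0 array of item responses
    group:       array of 'ref'/'focal' labels
    total_score: matching variable (e.g., total test score)
    Returns the common odds ratio across score strata; values far from 1.0
    suggest the item may function differently for the two groups.
    """
    correct = np.asarray(correct)
    group = np.asarray(group)
    total_score = np.asarray(total_score)

    num, den = 0.0, 0.0
    for s in np.unique(total_score):              # stratify by matching score
        mask = total_score == s
        ref = mask & (group == "ref")
        foc = mask & (group == "focal")
        a = np.sum(correct[ref] == 1)             # reference correct
        b = np.sum(correct[ref] == 0)             # reference incorrect
        c = np.sum(correct[foc] == 1)             # focal correct
        d = np.sum(correct[foc] == 0)             # focal incorrect
        n = a + b + c + d
        if n == 0:
            continue
        num += a * d / n
        den += b * c / n
    return num / den if den > 0 else np.nan

# Hypothetical usage with made-up data:
rng = np.random.default_rng(0)
n = 400
group = np.where(rng.random(n) < 0.5, "ref", "focal")
total_score = rng.integers(0, 31, size=n)
correct = (rng.random(n) < 0.7).astype(int)
print(mantel_haenszel_odds_ratio(correct, group, total_score))
```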

Expert and sensitivity reviews can serve to guard against construct-irrelevant language and images, including those that may offend some individuals or subgroups, and against construct-irrelevant context that may be more familiar to some than others. Test publishers often conduct sensitivity reviews of all test material to detect and remove sensitive material from tests (e.g., text, graphics, and other visual representations within the test that could be seen as offensive to some groups and possibly affect the scores of individuals from these groups). Such reviews should be conducted before a test becomes operational.


Standard 3.4

Test takers should receive comparable treatment during the test administration and scoring process.

Comment: Those responsible for testing should adhere to standardized test administration, scoring, and security protocols so that test scores will reflect the construct(s) being assessed and will not be unduly influenced by idiosyncrasies in the testing process. Those responsible for test administration should mitigate the possibility of personal predispositions that might affect the test administration or interpretation of scores.

Computerized and other forms of technology-based testing add extra concerns for standardization in administration and scoring. Examinees must have access to technology so that aspects of the technology itself do not influence scores. Examinees working on older, slower equipment may be unfairly disadvantaged relative to those working on newer equipment. If computers or other devices differ in speed of processing or movement from one screen to the next, in the fidelity of the visuals, or in other important ways, it is possible that construct-irrelevant factors may influence test performance.

Issues related to test security and fidelity of administration can also threaten the comparability of treatment of individuals and the validity and fairness of test score interpretations. For example, unauthorized distribution of items to some examinees but not others, or unproctored test administrations where standardization cannot be ensured, could provide an advantage to some test takers over others. In these situations, test results should be interpreted with caution.

Standard 3.5

Test developers should specify and document provisions that have been made to test administration and scoring procedures to remove construct-irrelevant barriers for all relevant subgroups in the test-taker population.

Comment: Test developers should specify how construct-irrelevant barriers were minimized in the test development process for individuals from all relevant subgroups in the intended test population. Test developers and/or users should also document any studies carried out to examine the reliability/precision of scores and validity of score interpretations for relevant subgroups of the intended population of test takers for the intended uses of the test scores. Special test administration, scoring, and reporting procedures should be documented and made available to test users.

Cluster 2. Validity of Test Score Interpretations for Intended Uses for the Intended Examinee Population

Standard 3.6

Where credible evidence indicates that test scores may differ in meaning for relevant subgroups in the intended examinee population, test developers and/or users are responsible for examining the evidence for validity of score interpretations for intended uses for individuals from those subgroups. What constitutes a significant difference in subgroup scores and what actions are taken in response to such differences may be defined by applicable laws.

Comment: Subgroup mean differences do not in and of themselves indicate lack of fairness, but such differences should trigger follow-up studies, where feasible, to identify the potential causes of such differences. Depending on whether subgroup differences are discovered during the development or use phase, either the test developer or the test user is responsible for initiating follow-up inquiries and, as appropriate, relevant studies. The inquiry should investigate construct underrepresentation and sources of construct-irrelevant variance as potential causes of subgroup differences, as feasible, through quantitative and/or qualitative studies. The kinds of validity evidence considered may include analysis of test content, internal structure of test responses, the relationship of test scores to other variables, or the response processes employed by the individual examinees. When sample sizes are sufficient, studies of score precision and accuracy for relevant subgroups also should be conducted. When sample sizes are small, data may sometimes be accumulated over operational administrations of the test so that suitable quantitative analyses by subgroup can be performed after the test has been in use for a period of time. Qualitative studies (e.g., expert reviews, focus groups, cognitive labs) also are relevant to supporting validity arguments. Test developers should closely consider findings from quantitative and/or qualitative analyses in documenting the interpretations for the intended score uses, as well as in subsequent test revisions.
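As one hedged illustration of a subgroup precision study, and only where item-level data and adequate sample sizes are available, coefficient alpha can be computed separately for each relevant subgroup. The Python sketch below uses simulated data and a simplified alpha formula; it sketches the idea rather than a complete reliability analysis.

```python
import numpy as np

def coefficient_alpha(item_scores):
    """Cronbach's alpha for an (examinees x items) score matrix."""
    item_scores = np.asarray(item_scores, dtype=float)
    k = item_scores.shape[1]                         # number of items
    item_vars = item_scores.var(axis=0, ddof=1)      # variance of each item
    total_var = item_scores.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical data: dichotomous item scores and a subgroup label per examinee.
rng = np.random.default_rng(1)
items = (rng.random((200, 20)) < 0.6).astype(int)
subgroup = np.where(rng.random(200) < 0.5, "group_1", "group_2")

# Report precision separately for each subgroup, sample size permitting.
for g in np.unique(subgroup):
    scores_g = items[subgroup == g]
    print(g, "n =", len(scores_g), "alpha =", round(coefficient_alpha(scores_g), 2))
```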

Analyses, where possible, may need to take into account the level of heterogeneity within relevant subgroups, for example, individuals with different disabilities, or linguistic minority examinees at different levels of English proficiency. Differences within these subgroups may influence the appropriateness of test content, the internal structure of the test responses, the relation of test scores to other variables, or the response processes employed by individual examinees.

Standard 3.7

When criterion-related validity evidence is used as a basis for test score-based predictions of future performance and sample sizes are sufficient, test developers and/or users are responsible for evaluating the possibility of differential prediction for relevant subgroups for which there is prior evidence or theory suggesting differential prediction.

Comment: When sample sizes are sufficient, differential prediction is often examined using regression analysis. One approach to regression analysis examines slope and intercept differences between targeted groups (e.g., Black and White samples), while another examines systematic deviations from a common regression line for the groups of interest. Both approaches can account for the possibility of predictive bias and/or differences in heterogeneity between groups and provide valuable information for the examination of differential predictions. In contrast, correlation coefficients provide inadequate evidence for or against a differential prediction hypothesis if groups or treatments are found to have unequal means and variances on the test and the criterion. It is particularly important in the context of testing for high-stakes purposes that test developers and/or users examine differential prediction and avoid the use of correlation coefficients in situations where groups or treatments result in unequal means or variances on the test and criterion.
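The slope-and-intercept approach described above is often carried out as a moderated regression in which the criterion is regressed on the test score, a group indicator, and their interaction; a reliable group term suggests an intercept difference, and a reliable interaction suggests a slope difference. The sketch below is a minimal illustration in Python using statsmodels with hypothetical variable names (score, criterion, group), not a prescribed analysis.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical predictor (test score), criterion, and group membership.
rng = np.random.default_rng(42)
n = 300
group = rng.integers(0, 2, size=n)                  # 0 = reference, 1 = focal
score = rng.normal(50, 10, size=n)
criterion = 0.5 * score + 5 * group + rng.normal(0, 8, size=n)
df = pd.DataFrame({"score": score, "criterion": criterion, "group": group})

# Moderated regression: the 'group' coefficient tests an intercept difference,
# and the 'score:group' coefficient tests a slope difference between groups.
model = smf.ols("criterion ~ score + group + score:group", data=df).fit()
print(model.summary())
```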

Standard 3.8

When tests require the scoring of constructed responses, test developers and/or users should collect and report evidence of the validity of score interpretations for relevant subgroups in the intended population of test takers for the intended uses of the test scores.

Comment: Subgroup differences in examinee responses and/or the expectations and perceptions of scorers can introduce construct-irrelevant variance in scores from constructed response tests. These, in turn, could seriously affect the reliability/precision, validity, and comparability of score interpretations for intended uses for some individuals. Different methods of scoring could differentially influence the construct representation of scores for individuals from some subgroups.

For human scoring, scoring procedures should be designed with the intent that the scores reflect the examinee's standing relative to the tested construct(s) and are not influenced by the perceptions and personal predispositions of the scorers. It is essential that adequate training and calibration of scorers be carried out and monitored throughout the scoring process to support the consistency of scorers' ratings for individuals from relevant subgroups. Where sample sizes permit, the precision and accuracy of scores for relevant subgroups also should be calculated.

Automated scoring algorithms may be used to score complex constructed responses, such as essays, either as the sole determiner of the score or in conjunction with a score provided by a human scorer. Scoring algorithms need to be reviewed for potential sources of bias. The precision of scores and validity of score interpretations resulting from automated scoring should be evaluated for all relevant subgroups of the intended population.

Cluster 3. Accommodations to Remove Construct-Irrelevant Barriers and Support Valid Interpretations of Scores for Their Intended Uses

Standard 3.9

Test developers and/or test users are responsible for developing and providing test accommodations, when appropriate and feasible, to remove construct-irrelevant barriers that otherwise would interfere with examinees' ability to demonstrate their standing on the target constructs.

Comment: Test accommodations are designed to remove construct-irrelevant barriers related to individual characteristics that otherwise would interfere with the measurement of the target construct and therefore would unfairly disadvantage individuals with these characteristics. These accommodations include changes in administration setting, presentation, interface/engagement, and response requirements, and may include the addition of individuals to the administration process (e.g., readers, scribes).

An appropriate accommodation is one that responds to specific individual characteristics but does so in a way that does not change the construct the test is measuring or the meaning of scores. Test developers and/or test users should document the basis for the conclusion that the accommodation does not change the construct that the test is measuring. Accommodations must address individual test takers' specific needs (e.g., cognitive, linguistic, sensory, physical) and may be required by law. For example, individuals who are not fully proficient in English may need linguistic accommodations that address their language status, while visually impaired individuals may need text magnification. In many cases when a test is used to evaluate the academic progress of an individual, the accommodation that will best eliminate construct irrelevance will match the accommodation used for instruction.

Test modifications that change the construct that the test is measuring may be needed for some examinees to demonstrate their standing on some aspect of the intended construct. If an assessment is modified to improve access to the intended construct for designated individuals, the modified assessment should be treated like a newly developed assessment that needs to adhere to the test standards for validity, reliability/precision, fairness, and so forth.

Standard 3.10

When test accommodations are permitted, test developers and/or test users are responsible for documenting standard provisions for using the accommodation and for monitoring the appropriate implementation of the accommodation.

Comment: Test accommodations should be used only when the test taker has a documented need for the accommodation, for example, an Individualized Education Plan (IEP) or documentation by a physician, psychologist, or other qualified professional. The documentation should be prepared in advance of the test-taking experience and reviewed by one or more experts qualified to make a decision about the relevance of the documentation to the requested accommodation.

Test developers and/or users should provide individuals requiring accommodations in a testing situation with information about the availability of accommodations and the procedures for requesting them prior to the test administration. In settings where accommodations are routinely provided for individuals with documented needs (e.g., educational settings), the documentation should describe permissible accommodations and include standardized protocols and/or procedures for identifying examinees eligible for accommodations, identifying and assigning appropriate accommodations for these individuals, and administering accommodations, scoring, and reporting in accordance with standardized rules.


Test administrators and users should also provide those who have a role in determining and administering accommodations with sufficient information and expertise to appropriately use accommodations that may be applied to the assessment. Instructions for administering any changes in the test or testing procedures should be clearly documented and, when necessary, test administrators should be trained to follow these procedures. The test administrator should administer the accommodations in a standardized manner as documented by the test developer. Administration procedures should include procedures for recording which accommodations were used for specific individuals and, where relevant, for recording any deviation from standardized procedures for administering the accommodations.

The test administrator or appropriate representative of the test user should document any use of accommodations. For large-scale education assessments, test users also should monitor the appropriate use of accommodations.

Standard 3.11

When a test is changed to remove barriers to the accessibility of the construct being measured, test developers and/or users are responsible for obtaining and documenting evidence of the validity of score interpretations for intended uses of the changed test, when sample sizes permit.

Comment: It is desirable, where feasible and appropriate, to pilot and/or field test any test alterations with individuals representing each relevant subgroup for whom the alteration is intended. Validity studies typically should investigate both the efficacy of the alteration for intended subgroup(s) and the comparability of score inferences from the altered and original tests.

In some circumstances, developers may not be able to obtain sufficient samples of individuals, for example, those with the same disability or similar levels of a disability, to conduct standard empirical analyses of reliability/precision and validity. In these situations, alternative ways should be sought to evaluate the validity of the changed test for relevant subgroups, for example through small-sample qualitative studies or professional judgments that examine the comparability of the original and altered tests and/or that investigate alternative explanations for performance on the changed tests.

Evidence should be provided for recommended alterations. If a test developer recommends different time limits, for example, for individuals with disabilities or those from diverse linguistic and cultural backgrounds, pilot or field testing should be used, whenever possible, to establish these particular time limits rather than simply allowing test takers a multiple of the standard time without examining the utility of the arbitrary implementation of multiples of the standard time. When possible, fatigue and other time-related issues should be investigated as potentially important factors when time limits are extended.

When tests are linguistically simplified to remove construct-irrelevant variance, test developers and/or users are responsible for documenting evidence of the comparability of scores from the linguistically simplified tests to the original test, when sample sizes permit.

Standard 3.12

When a test is translated and adapted from one language to another, test developers and/or test users are responsible for describing the methods used in establishing the adequacy of the adaptation and documenting empirical or logical evidence for the validity of test score interpretations for intended use.

Comment: The term adaptation is used here to describe changes made to tests translated from one language to another to reduce construct-irrelevant variance that may arise due to individual or subgroup characteristics. In this case the translation/adaptation process involves not only translating the language of the test so that it is suitable for the subgroup taking the test, but also addressing any construct-irrelevant linguistic and cultural subgroup characteristics that may interfere with measurement of the intended construct(s). When multiple language versions of a test are intended to provide comparable scores, test developers should describe in detail the methods used for test translation and adaptation and should report evidence of test score validity pertinent to the linguistic and cultural groups for whom the test is intended and pertinent to the scores' intended uses. Evidence of validity may include empirical studies and/or professional judgment documenting that the different language versions measure comparable or similar constructs and that the score interpretations from the two versions have comparable validity for their intended uses. For example, if a test is translated and adapted into Spanish for use with Central American, Cuban, Mexican, Puerto Rican, South American, and Spanish populations, the validity of test score interpretations for specific uses should be evaluated with members of each of these groups separately, where feasible. Where sample sizes permit, evidence of score accuracy and precision should be provided for each group, and test properties for each subgroup should be included in test manuals.

Standard 3.13

A test should be administered in the language that is most relevant and appropriate to the test purpose.

Comment: Test users should take into account the linguistic and cultural characteristics and relative language proficiencies of examinees who are bilingual or use multiple languages. Identifying the most appropriate language(s) for testing also requires close consideration of the context and purpose for testing. Except in cases where the purpose of testing is to determine test takers' level of proficiency in a particular language, the test takers should be tested in the language in which they are most proficient. In some cases, test takers' most proficient language in general may not be the language in which they were instructed or trained in relation to tested constructs, and in these cases it may be more appropriate to administer the test in the language of instruction.

Professional judgment needs to be used to determine the most appropriate procedures for establishing relative language proficiencies. Such procedures may range from self-identification by examinees to formal language proficiency testing. Sensitivity to linguistic and cultural characteristics may require the sole use of one language in testing or use of multiple languages to minimize the introduction of construct-irrelevant components into the measurement process.

Determination of a test taker's most proficient language for test administration does not automatically guarantee validity of score inferences for the intended use. For example, individuals may be more proficient in one language than another, but not necessarily developmentally proficient in either; disconnects between the language of construct acquisition and that of assessment also can compromise appropriate interpretation of the test taker's scores.

Standard 3.14

When testing requires the use of an interpreter, the interpreter should follow standardized procedures and, to the extent feasible, be sufficiently fluent in the language and content of the test and the examinee's native language and culture to translate the test and related testing materials and to explain the examinee's test responses, as necessary.

Comment: Although individuals with limited proficiency in the language of the test (including deaf and hard-of-hearing individuals whose native language may be sign language) should ideally be tested by professionally trained bilingual/bicultural examiners, the use of an interpreter may be necessary in some situations. If an interpreter is required, the test user is responsible for selecting an interpreter with reasonable qualifications, experience, and preparation to assist appropriately in the administration of the test. As with other aspects of standardized testing, procedures for administering a test when an interpreter is used should be standardized and documented. It is necessary for the interpreter to understand the importance of following standardized procedures for this test, the importance of accurately conveying to the examiner an examinee's actual responses, and the role and responsibilities of the interpreter in testing. When the translation of technical terms is important to accurately assess the construct, the interpreter should be familiar with the meaning of these terms and corresponding vocabularies in the respective languages.

Unless a test has been standardized and normed with the use of interpreters, their use may need to be viewed as an alteration that could change the measurement of the intended construct, in particular because of the introduction of a third party during testing, as well as the modification of the standardized protocol. Differences in word meaning, familiarity, frequency, connotations, and associations make it difficult to directly compare scores from any nonstandardized translations to English-language norms.

When a test is likely to require the use of interpreters, the test developer should provide clear guidance on how interpreters should be selected and their role in administration.

Cluster 4. Safeguards Against Inappropriate Score Interpretations for Intended Uses

Standard 3.15

Test developers and publishers who claim that a test can be used with examinees from specific subgroups are responsible for providing the necessary information to support appropriate test score interpretations for their intended uses for individuals from these subgroups.

Comment: Test developers should include in test manuals and instructions for score interpretation explicit statements about the applicability of the test for relevant subgroups. Test developers should provide evidence of the applicability of the test for relevant subgroups and make explicit cautions against foreseeable (based on prior experience or other relevant sources such as research literature) misuses of test results.

Standard 3.16

When credible research indicates that test scores for some relevant subgroups are differentially affected by construct-irrelevant characteristics of the test or of the examinees, when legally permissible, test users should use the test only for those subgroups for which there is sufficient evidence of validity to support score interpretations for the intended uses.

Comment: A test may not measure the same construct(s) for individuals from different relevant subgroups because different characteristics of test content or format influence scores of test takers from one subgroup to another. Any such differences may inadvertently advantage or disadvantage individuals from these subgroups. The decision whether to use a test with any given relevant subgroup necessarily involves a careful analysis of the validity evidence for the subgroup, as is called for in Standard 1.4. The decision also requires consideration of applicable legal requirements and the exercise of thoughtful professional judgment regarding the significance of any construct-irrelevant components. In cases where there is credible evidence of differential validity, developers should provide clear guidance to the test user about when and whether valid interpretations of scores for their intended uses can or cannot be drawn for individuals from these subgroups.

There may be occasions when examinees request or demand to take a version of the test other than that deemed most appropriate by the developer or user. For example, an individual with a disability may decline an altered format and request the standard form. Acceding to such requests, after fully informing the examinee about the characteristics of the test, the accommodations that are available, and how the test scores will be used, is not a violation of this standard and in some instances may be required by law.

In some cases, such as when a test will distribute benefits or burdens (such as qualifying for an honors class or denial of a promotion in a job), the law may limit the extent to which a test user may evaluate some groups under the test and other groups under a different test.

Standard 3.17

When aggregate scores are publicly reported for relevant subgroups—for example, males and females, individuals of differing socioeconomic status, individuals differing by race/ethnicity, individuals with different sexual orientations, individuals with diverse linguistic and cultural backgrounds, individuals with disabilities, young children or older adults—test users are responsible for providing evidence of comparability and for including cautionary statements whenever credible research or theory indicates that test scores may not have comparable meaning across these subgroups.

Comment: Reporting scores for relevant subgroups is justified only if the scores have comparable meaning across these groups and there is sufficient sample size per group to protect individual identity and warrant aggregation. This standard is intended to be applicable to settings where scores are implicitly or explicitly presented as comparable in meaning across subgroups. Care should be taken that the terms used to describe reported subgroups are clearly defined, consistent with common usage, and clearly understood by those interpreting test scores.

Terminology for describing specific subgroups for which valid test score inferences can and cannot be drawn should be as precise as possible, and categories should be consistent with the intended uses of the results. For example, the terms Latino or Hispanic can be ambiguous if not specifically defined, in that they may denote individuals of Cuban, Mexican, Puerto Rican, South or Central American, or other Spanish-culture origin, regardless of race/ethnicity, and may combine those who are recent immigrants with those who are U.S. native born, those who may not be proficient in English, and those of diverse socioeconomic background. Similarly, the term "individuals with disabilities" encompasses a wide range of specific conditions and background characteristics. Even references to specific categories of individuals with disabilities, such as hearing impaired, should be accompanied by an explanation of the meaning of the term and an indication of the variability of individuals within the group.

Standard 3.18

In testing individuals for diagnostic and/or special program placement purposes, test users should not use test scores as the sole indicators to characterize an individual's functioning, competence, attitudes, and/or predispositions. Instead, multiple sources of information should be used, alternative explanations for test performance should be considered, and the professional judgment of someone familiar with the test should be brought to bear on the decision.

Comment: Many test manuals point out variables that should be considered in interpreting test scores, such as clinically relevant history, medications, school record, vocational status, and test-taker motivation. Influences associated with variables such as age, culture, disability, gender, and linguistic or racial/ethnic characteristics may also be relevant.

Opportunity to learn is another variable that may need to be taken into account in educational and/or clinical settings. For instance, if recent immigrants being tested on a personality inventory or an ability measure have little prior exposure to school, they may not have had the opportunity to learn concepts that the test assumes are common knowledge or common experience, even if the test is administered in the native language. Not taking into account prior opportunity to learn can lead to misdiagnoses, inappropriate placements and/or services, and unintended negative consequences.

Inferences about test takers' general language proficiency should be based on tests that measure a range of language features, not a single linguistic skill. A more complete range of communicative abilities (e.g., word knowledge and syntax, as well as cultural variation) will typically need to be assessed. Test users are responsible for interpreting individual scores in light of alternative explanations and/or relevant individual variables noted in the test manual.

Standard 3.19

In settings where the same authority is responsible for both provision of curriculum and high-stakes decisions based on testing of examinees' curriculum mastery, examinees should not suffer permanent negative consequences if evidence indicates that they have not had the opportunity to learn the test content.

Comment: In educational settings, students' opportunity to learn the content and skills assessed by an achievement test can seriously affect their test performance and the validity of test score interpretations for intended use for high-stakes individual decisions. If there is not a good match between the content of curriculum and instruction and that of tested constructs for some students, those students cannot be expected to do well on the test and can be unfairly disadvantaged by high-stakes individual decisions, such as denying high school graduation, that are made based on test results. When an authority, such as a state or district, is responsible for prescribing and/or delivering curriculum and instruction, it should not penalize individuals for test performance on content that the authority has not provided.

Note that this standard is not applicable in situations where different authorities are responsible for curriculum, testing, and/or interpretation and use of results. For example, opportunity to learn may be beyond the knowledge or control of test users, and it may not influence the validity of test interpretations such as predictions of future performance.

Standard 3.20

When a construct can be measured in different ways that are equal in their degree of construct representation and validity (including freedom from construct-irrelevant variance), test users should consider, among other factors, evidence of subgroup differences in mean scores or in percentages of examinees whose scores exceed the cut scores, in deciding which test and/or cut scores to use.

Comment: Evidence of differential subgroup performance is one important factor influencing the choice between one test and another. However, other factors, such as cost, testing time, test security, and logistical issues (e.g., the need to screen very large numbers of examinees in a very short time), must also enter into professional judgments about test selection and use. If the scores from two tests lead to equally valid interpretations and impose similar costs or other burdens, legal considerations may require selecting the test that minimizes subgroup differences.
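For illustration only, the following minimal sketch shows one way the evidence named in this standard might be summarized: subgroup differences in mean scores and in the percentages of examinees at or above a cut score. The data, group labels, and cut score are hypothetical, and the effect-size index shown is only one of several that could reasonably be used.

# Illustrative sketch only: summarizing subgroup differences in mean scores
# and in percentages exceeding a cut score. Data and cut score are hypothetical.
import numpy as np

def subgroup_summary(scores_a, scores_b, cut_score):
    scores_a, scores_b = np.asarray(scores_a, float), np.asarray(scores_b, float)
    # Standardized mean difference (pooled SD), one common effect-size index
    pooled_sd = np.sqrt((scores_a.var(ddof=1) + scores_b.var(ddof=1)) / 2)
    d = (scores_a.mean() - scores_b.mean()) / pooled_sd
    # Difference in the proportions of examinees at or above the cut score
    pass_a = np.mean(scores_a >= cut_score)
    pass_b = np.mean(scores_b >= cut_score)
    return {"mean_diff_d": d, "pass_rate_a": pass_a, "pass_rate_b": pass_b,
            "pass_rate_diff": pass_a - pass_b}

rng = np.random.default_rng(0)
group_a = rng.normal(75, 10, 500)   # hypothetical scores for subgroup A
group_b = rng.normal(72, 10, 500)   # hypothetical scores for subgroup B
print(subgroup_summary(group_a, group_b, cut_score=70))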


PART II

Operations


4. TEST DESIGN AND DEVELOPMENT

BACKGROUND

Test development is the process of producing a measure of some aspect of an individual's knowledge, skills, abilities, interests, attitudes, or other characteristics by developing questions or tasks and combining them to form a test, according to a specified plan. The steps and considerations for this process are articulated in the test design plan. Test design begins with consideration of expected interpretations for intended uses of the scores to be generated by the test. The content and format of the test are then specified to provide evidence to support the interpretations for intended uses. Test design also includes specification of test administration and scoring procedures, and of how scores are to be reported. Questions or tasks (hereafter referred to as items) are developed following the test specifications and screened using criteria appropriate to the intended uses of the test. Procedures for scoring individual items and the test as a whole are also developed, reviewed, and revised as needed. Test design is commonly iterative, with adjustments and revisions made in response to data from tryouts and operational use.

Test design and development procedures must support the validity of the interpretations of test scores for their intended uses. For example, current educational assessments often are used to indicate students' proficiency with regard to standards for the knowledge and skill a student should exhibit; thus, the relationship between the test content and the established content standards is key. In this case, content specifications must clearly describe the content and/or cognitive categories to be covered so that evidence of the alignment of the test questions to these categories can be gathered. When normative interpretations are intended, development procedures should include a precise definition of the reference population and plans to collect appropriate normative data. Many tests, such as employment or college selection tests, rely on predictive validity evidence. Specifications for such tests should include descriptions of the outcomes the test is designed to predict and plans to collect evidence of the effectiveness of test scores in predicting these outcomes.

Issues bearing on validity, reliability, and fairness are interwoven within the stages of test development. Each of these topics is addressed comprehensively in other chapters of the Standards: validity in chapter 1, reliability in chapter 2, and fairness in chapter 3. Additional material on test administration and scoring, and on reporting and interpretation of scores and results, is provided in chapter 6. Chapter 5 discusses score scales, and chapter 7 covers documentation requirements.

In addition, test developers should respect the rights of participants in the development process, including pretest participants. In particular, developers should take steps to ensure proper notice and consent from participants and to protect participants' personally identifiable information consistent with applicable legal and professional requirements. The rights of test takers are discussed in chapter 8.

This chapter describes four phases of the test development process leading from the original statement of purpose(s) to the final product: (a) development and evaluation of the test specifications; (b) development, tryout, and evaluation of the items; (c) assembly and evaluation of new test forms; and (d) development of procedures and materials for administration and scoring. What follows is a description of typical test development procedures, although there may be sound reasons that some of the steps covered in the description are followed in some settings and not in others.

Test Specifications

General Considerations

In nearly all cases, test development is guided by a set of test specifications. The nature of these specifications and the way in which they are created may vary widely as a function of the nature of the test and its intended uses. The term test specifications is sometimes limited to description of the content and format of the test. In the Standards, test specifications are defined more broadly to also include documentation of the purpose and intended uses of the test, as well as detailed decisions about content, format, test length, psychometric characteristics of the items and test, delivery mode, administration, scoring, and score reporting.

Responsibility for developing test specifications also varies widely across testing programs. For most commercial tests, test specifications are created by the test developer. In other contexts, such as tests used for educational accountability, many aspects of the test specifications are established through a public policy process. As discussed in the introduction, the generic term test developer is used in this chapter in preference to other terms, such as test publisher, to cover both those responsible for developing and those responsible for implementing test specifications across a wide range of test development processes.

Statement of Purpose and Intended Uses

The process of developing educational and psychological tests should begin with a statement of the purpose(s) of the test, the intended users and uses, the construct or content domain to be measured, and the intended examinee population. Tests of the same construct or domain can differ in important ways because factors such as purpose, intended uses, and examinee population may vary. In addition, tests intended for diverse examinee populations must be developed to minimize construct-irrelevant factors that may unfairly depress or inflate some examinees' performance. In many cases, accommodations and/or alternative versions of tests may need to be specified to remove irrelevant barriers to performance for particular subgroups in the intended examinee population.

Specification of intended uses will include an indication of whether the test score interpretations will be primarily norm-referenced or criterion-referenced. When scores are norm-referenced, relative score interpretations are of primary interest. A score for an individual or for a definable group is ranked within a distribution of scores or compared with the average performance of test takers in a reference population (e.g., based on age, grade, diagnostic category, or job classification). When interpretations are criterion-referenced, absolute score interpretations are of primary interest. The meaning of such scores does not depend on rank information. Rather, the test score conveys directly a level of competence in some defined criterion domain. Both relative and absolute interpretations are often used with a given test, but the test developer determines which approach is most relevant to specific uses of the test.

Content Specifications

The first step in developing test specifications is to extend the original statement of purpose(s), and the construct or content domain being considered, into a framework for the test that describes the extent of the domain, or the scope of the construct to be measured. Content specifications, sometimes referred to as content frameworks, delineate the aspects (e.g., content, skills, processes, and diagnostic features) of the construct or domain to be measured. The specifications should address questions about what is to be included, such as "Does eighth-grade mathematics include algebra?" "Does verbal ability include text comprehension as well as vocabulary?" "Does self-esteem include both feelings and acts?" The delineation of the content specifications can be guided by theory or by an analysis of the content domain (e.g., an analysis of job requirements in the case of many credentialing and employment tests). The content specifications serve as a guide to subsequent test evaluation. The chapter on validity provides a more thorough discussion of the relationships among the construct or content domain, the test framework, and the purpose(s) of the test.

Format Specifications

Once decisions have been made about what the test is to measure and what meaning its scores are intended to convey, the next step is to create format specifications. Format specifications delineate the format of items (i.e., tasks or questions); the response format or conditions for responding; and the type of scoring procedures. Although format decisions are often driven by considerations of expediency, such as ease of responding or cost of scoring, validity considerations must not be overlooked. For example, if test questions require test takers to possess significant linguistic skill to interpret them but the test is not intended as a measure of linguistic skill, the complexity of the questions may lead to construct-irrelevant variance in test scores. This would be unfair to test takers with limited linguistic skills, thereby reducing the validity of the test scores as a measure of the intended content. Format specifications should include a rationale for how the chosen format supports the validity, reliability, and fairness of intended uses of the resulting scores.

The nature of the item and response formats that may be specified depends on the purposes of the test, the defined domain of the test, and the testing platform. Selected-response formats, such as true-false or multiple-choice items, are suitable for many purposes of testing. Computer-based testing allows different ways of indicating responses, such as drag-and-drop. Other purposes may be more effectively served by a short-answer format. Short-answer items require a response of no more than a few words. Extended-response formats require the test taker to write a more extensive response of one or more sentences or paragraphs. Performance assessments often seek to emulate the context or conditions in which the intended knowledge or skills are actually applied. One type of performance assessment, for example, is the standardized job or work sample, where a task is presented to the test taker in a standardized format under standardized conditions. Job or work samples might include the assessment of a medical practitioner's ability to make an accurate diagnosis and recommend treatment for a defined condition, a manager's ability to articulate goals for an organization, or a student's proficiency in performing a science laboratory experiment.

Accessibility of item formats. As described in chapter 3, designing tests to be accessible and valid for all intended examinees, to the maximum extent possible, is critical. Formats that may be unfamiliar to some groups of test takers or that place inappropriate demands should be avoided. The principles of universal design describe the use of test formats that allow tests to be taken without adaptation by as broad a range of individuals as possible, but they do not necessarily eliminate the need for adaptations. Format specifications should include consideration of alternative formats that might also be needed to remove irrelevant barriers to performance, such as large print or braille for examinees who are visually impaired or, where appropriate to the construct being measured, bilingual dictionaries for test takers who are more proficient in a language other than the language of the test. The number and types of adaptations to be specified depend on both the nature of the construct being assessed and the targeted population of test takers.

Complex item formats. Some testing programs employ more complex item formats. Examples include performance assessments, simulations, and portfolios. Specifications for more complex item formats should describe the domain from which the items or tasks are sampled, components of the domain to be assessed by the tasks or items, and critical features of the items that should be replicated in creating items for alternate forms. Special considerations for complex item formats are illustrated through the following discussion of performance assessments, simulations, and portfolios.

Performance assessments. Performance assessments require examinees to demonstrate the ability to perform tasks that are often complex in nature and generally require the test takers to demonstrate their abilities or skills in settings that closely resemble real-life situations. One distinction between performance assessments and other forms of tests is the type of response that is required from the test takers. Performance assessments require the test takers to carry out a process, such as playing a musical instrument or tuning a car's engine, or to create a product, such as a written essay. An assessment of a clinical psychologist in training may require the test taker to interview a client, choose appropriate tests, arrive at a diagnosis, and plan for therapy.

Because performance assessments typically consist of a small number of tasks, establishing the extent to which the results can be generalized to a broader domain described in the test specifications is particularly important. The test specifications should indicate critical dimensions to be measured (e.g., skills and knowledge, cognitive processes, context for performing the tasks) so that tasks selected for testing will systematically represent the critical dimensions, leading to a comprehensive coverage of the domain as well as consistent coverage across test forms. Specification of the domain to be covered is also important for clarifying potentially irrelevant sources of variation in performance. Further, both theoretical and empirical evidence are important for documenting the extent to which performance assessments (tasks as well as scoring criteria) reflect the processes or skills that are specified by the domain definition. When tasks are designed to elicit complex cognitive processes, detailed analyses of the tasks and scoring criteria and both theoretical and empirical analyses of the test takers' performances on the tasks provide necessary validity evidence.

Simulations. Simulation assessments are similar to performance assessments in that they require the examinee to engage in a complex set of behaviors for a specified period of time. Simulations are sometimes a substitute for performance assessments, when actual task performance might be costly or dangerous. Specifications for simulation tasks should describe the domain of activities to be covered by the tasks, critical dimensions of performance to be reflected in each task, and specific format considerations such as the number or duration of the tasks and essentials of how the user interacts with the tasks. Specifications should be sufficient to allow experts to judge the comparability of different sets of simulation tasks included in alternate forms.

Portfolios. Portfolios are systematic collections of work or educational products, typically gathered over time. The design of a portfolio assessment, like that of other assessment procedures, must flow from the purpose of the assessment. Typical purposes include judgment of improvement in job or educational performance and evaluation of eligibility for employment, promotion, or graduation. Portfolio specifications indicate the nature of the work that is to be included in the portfolio. The portfolio may include entries such as representative products, the best work of the test taker, or indicators of progress. For example, in an employment setting involving promotion decisions, employees may be instructed to include their best work or products. Alternatively, if the purpose is to judge students' educational growth, the students may be asked to provide evidence of improvement with respect to particular competencies or skills. Students may also be asked to provide justifications for their choices or a cover piece reflecting on the work presented and what the student has learned from it. Still other methods may call for the use of videos, exhibitions, or demonstrations.

The specifications for the portfolio indicate who is responsible for selecting its contents. For example, the specifications must state whether the test taker, the examiner, or both parties working together should be involved in the selection of the contents of the portfolio. The particular responsibilities of each party are delineated in the specifications. In employment settings, employees may be involved in the selection of their work and products that demonstrate their competencies for promotion purposes. Analogously, in educational applications, students may participate in the selection of some of their work and the products to be included in their portfolios.

Specifications for how portfolios are scored and by whom will vary as a function of the use of the portfolio scores. Centralized evaluation of portfolios is common where portfolios are used in high-stakes decisions. The more standardized the contents and procedures for collecting and scoring material, the more comparable the scores from the resulting portfolios will be. Regardless of the methods used, all performance assessments, simulations, and portfolios are evaluated by the same standards of technical quality as other forms of tests.


Test Length

Test developers frequently follow test blueprints that specify the number of items for each content area to be included in each test form. Specifications for test length must balance testing time requirements with the precision of the resulting scores, with longer tests generally leading to more precise scores. Test blueprints may provide guidance on the number or percentage of items for each area of content and may also include specification of the distribution of items by cognitive requirements or by item format. Test length and blueprint specifications are often updated based on data from tryouts on time requirements, content coverage, and score precision. When tests are administered adaptively, test length (the number of items administered to each examinee) is determined by stopping rules, which may be based on a fixed number of test questions or may be based on a desired level of score precision.
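The general relationship between test length and score precision can be illustrated with the familiar Spearman-Brown projection. The sketch below is illustrative only; the current reliability of .80 and the length factors are assumed values, and the projection presumes that added items are comparable to the existing ones.

# Illustrative sketch: Spearman-Brown projection of reliability when a test
# is lengthened or shortened by a factor k (assumes comparable added items).
def spearman_brown(reliability, k):
    return (k * reliability) / (1 + (k - 1) * reliability)

current_reliability = 0.80          # assumed reliability of the current form
for k in (0.5, 1.0, 1.5, 2.0):      # from halving to doubling the test length
    print(f"length factor {k}: projected reliability {spearman_brown(current_reliability, k):.3f}")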

Psychometric Specifications

Psychometric specifications indicate desired statistical properties of items (e.g., difficulty, discrimination, and inter-item correlations) as well as the desired statistical properties of the whole test, including the nature of the reporting scale, test difficulty and precision, and the distribution of items across content or cognitive categories. When psychometric indices of the items are estimated using item response theory (IRT), the fit of the model to the data is also evaluated. This is accomplished by evaluating the extent to which the assumptions underlying the item response model (e.g., unidimensionality and local independence) are satisfied.
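For dichotomously scored items, the classical item indices named here can be computed directly from a persons-by-items tryout matrix. The sketch below is a minimal illustration with simulated data, not a prescribed procedure; sample sizes and score ranges are assumptions.

# Illustrative sketch: classical item statistics from a persons-by-items
# matrix of 0/1 scores (simulated data; not a prescribed analysis).
import numpy as np

rng = np.random.default_rng(1)
responses = (rng.random((200, 10)) < rng.uniform(0.3, 0.9, 10)).astype(int)

difficulty = responses.mean(axis=0)            # proportion correct per item
total = responses.sum(axis=1)
discrimination = np.array([                    # corrected item-total correlation
    np.corrcoef(responses[:, j], total - responses[:, j])[0, 1]
    for j in range(responses.shape[1])
])
inter_item_r = np.corrcoef(responses, rowvar=False)  # inter-item correlation matrix

print("difficulty:", np.round(difficulty, 2))
print("discrimination:", np.round(discrimination, 2))
print("mean inter-item r:", round(inter_item_r[np.triu_indices_from(inter_item_r, k=1)].mean(), 3))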

Scoring Specifications

Test specifications will describe how individual test items are to be scored and how item scores are to be combined to yield one or more overall test scores. All types of items require some indication of how to score the responses. For selected-response items, one of the response options is considered the correct response in some testing programs. In other testing programs, each response option may yield a different item score. For short-answer items, a list of acceptable responses may suffice, although more general scoring instructions are sometimes required. Extended-response items require more detailed rules for scoring, sometimes called scoring rubrics. Scoring rubrics specify the criteria for evaluating performance and may vary in the degree of judgment entailed, the number of score levels employed, and the ways in which criteria for each score level are described. It is common practice for test developers to provide scorers with examples of performances at each of the score levels to help clarify the criteria.

For extended-response items, including performance tasks, simulations, and portfolios, two major types of scoring procedures are used: analytic and holistic. Both of the procedures require explicit performance criteria that reflect the test framework. However, the approaches lead to some differences in the scoring specifications. Under the analytic scoring procedure, each critical dimension of the performance criteria is judged independently, and separate scores are obtained for each of these dimensions in addition to an overall score. Under the holistic scoring procedure, the same performance criteria may implicitly be considered, but only one overall score is provided. Because the analytic procedure can provide information on a number of critical dimensions, it potentially provides valuable information for diagnostic purposes and lends itself to evaluating strengths and weaknesses of test takers. However, validation will be required for diagnostic interpretations for particular uses of the separate scores. In contrast, the holistic procedure may be preferable when an overall judgment is desired and when the skills being assessed are complex and highly interrelated. Regardless of the type of scoring procedure, designing the items and developing the scoring rubrics and procedures is an integrated process.

When scoring procedures require human judgment, the scoring specifications should describe essential scorer qualifications, how scorers are to be trained and monitored, how scoring discrepancies are to be identified and resolved, and how the absence of bias in scorer judgment is to be checked. In some cases, computer algorithms are used to score complex examinee responses, such as essays. In such cases, scoring specifications should indicate how scores are generated by these algorithms and how they are to be checked and validated.

Scoring specifications will also include whether test scores are simple sums of item scores, involve differential weighting of items or sections, or are based on a more complex measurement model. If an IRT model is used, specifications should indicate the form of the model, how model parameters are to be estimated, and how model fit is to be evaluated.
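As one concrete instance of a more complex measurement model, the sketch below scores a response pattern under a two-parameter logistic (2PL) IRT model by maximum likelihood over a grid of ability values rather than by summing item scores. The item parameters and the response pattern are assumed values chosen purely for illustration.

# Illustrative sketch: scoring under a two-parameter logistic (2PL) IRT model.
# Item parameters and the response pattern are assumed for illustration.
import numpy as np

a = np.array([1.2, 0.8, 1.5, 1.0, 0.9])    # discrimination parameters
b = np.array([-1.0, -0.3, 0.0, 0.6, 1.2])  # difficulty parameters
responses = np.array([1, 1, 1, 0, 0])      # one examinee's 0/1 responses

def p_correct(theta):
    # 2PL item response function: P(correct) = 1 / (1 + exp(-a(theta - b)))
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

thetas = np.linspace(-4, 4, 801)
log_lik = np.array([
    np.sum(responses * np.log(p_correct(t)) + (1 - responses) * np.log(1 - p_correct(t)))
    for t in thetas
])
theta_hat = thetas[np.argmax(log_lik)]      # maximum-likelihood ability estimate
print(f"estimated ability (theta): {theta_hat:.2f}")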

Test Administration Specifications

Test administration specifications describe how the test is to be administered. Administration procedures include mode of test delivery (e.g., paper-and-pencil or computer based), time limits, accommodation procedures, instructions and materials provided to examiners and examinees, and procedures for monitoring test taking and ensuring test security. For tests administered by computer, administration specifications will also include a description of any hardware and software requirements, including connectivity considerations for Web-based testing.

Refining the Test Specifications

There is often a subtle interplay between the process of conceptualizing a construct or content domain and the development of a test of that construct or domain. The specifications for the test provide a description of how the construct or domain will be represented and may need to be refined as development proceeds. The procedures used to develop items and scoring rubrics and to examine item and test characteristics may often contribute to clarifying the specifications. The extent to which the construct is fully defined a priori is dependent on the testing application. In many testing applications, well-defined and detailed test specifications guide the development of items and their associated scoring rubrics and procedures. In some areas of psychological measurement, test development may be less dependent on an a priori defined framework and may rely more on a data-based approach that results in an empirically derived definition of the construct being measured. In such instances, items are selected primarily on the basis of their empirical relationship with an external criterion, their relationships with one another, or the degree to which they discriminate among groups of individuals. For example, items for a test for sales personnel might be selected based on the correlations of item scores with productivity measures of current sales personnel. Similarly, an inventory to help identify different patterns of psychopathology might be developed using patients from different diagnostic subgroups. When test development relies on a data-based approach, some items will likely be selected based on chance occurrences in the data. Cross-validation studies, in which the test is administered to a comparable sample that was not involved in the original test development effort, are routinely conducted to determine the extent to which items were selected by chance.

In other testing applications, however, the test specifications are fixed in advance and guide the development of items and scoring procedures. Empirical relationships may then be used to inform decisions about retaining, rejecting, or modifying items. Interpretations of scores from tests developed by this process have the advantage of a theoretical and an empirical foundation for the underlying dimensions represented by the test.
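A minimal sketch of the kind of cross-validation check described above is given here, with all data, sample sizes, and the selection threshold being hypothetical: items are selected on a development sample by their correlation with an external criterion, and the selected composite is then re-evaluated on a holdout sample that played no part in the selection. Some shrinkage of the validity coefficient in the holdout sample is expected.

# Illustrative sketch: data-based item selection with a cross-validation check.
# All data, sample sizes, and the selection threshold are hypothetical.
import numpy as np

rng = np.random.default_rng(2)
n_dev, n_holdout, n_items = 300, 300, 40
# Simulated item scores and an external criterion (e.g., a productivity measure)
dev_items = rng.normal(size=(n_dev, n_items))
dev_criterion = dev_items[:, :5].sum(axis=1) + rng.normal(scale=1.0, size=n_dev)
hold_items = rng.normal(size=(n_holdout, n_items))
hold_criterion = hold_items[:, :5].sum(axis=1) + rng.normal(scale=1.0, size=n_holdout)

# Select items whose correlation with the criterion exceeds a chosen threshold
dev_r = np.array([np.corrcoef(dev_items[:, j], dev_criterion)[0, 1] for j in range(n_items)])
selected = np.where(dev_r > 0.15)[0]

# Re-check the selected composite on the holdout sample (shrinkage is expected)
dev_valid = np.corrcoef(dev_items[:, selected].mean(axis=1), dev_criterion)[0, 1]
hold_valid = np.corrcoef(hold_items[:, selected].mean(axis=1), hold_criterion)[0, 1]
print(f"selected {selected.size} items; development r = {dev_valid:.2f}, holdout r = {hold_valid:.2f}")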

Considerations for Adaptive Testing

In adaptive testing, test items or sets of items are selected as the test is being administered, based on the test taker's responses to prior items. Specification of item selection algorithms may involve consideration of content coverage as well as increasing the precision of the score estimate. When several items are tied to a single passage or task, more complex algorithms for selecting the next passage or task are needed. In some instances, a larger number of items are developed for each passage or task and the selection algorithm chooses specific items to administer based on content and precision considerations. Specifications must also indicate whether a fixed number of items are to be administered or whether the test is to continue until precision or content coverage criteria are met.
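A bare-bones version of one such algorithm is sketched below, with all item parameters, the simulated examinee, and the stopping values being assumptions for illustration: at each step the item providing the most Fisher information at the current ability estimate is administered, and testing stops when the estimated standard error falls below a target or a maximum test length is reached. Operational programs would add content-balancing and exposure-control constraints not shown here.

# Illustrative sketch of adaptive item selection under a 2PL model (all
# parameters, stopping values, and the simulated examinee are assumptions).
import numpy as np

rng = np.random.default_rng(3)
n_bank = 200
a = rng.uniform(0.8, 2.0, n_bank)           # discrimination parameters
b = rng.uniform(-3, 3, n_bank)              # difficulty parameters
true_theta = 0.7                            # simulated examinee ability

def prob(theta, idx):
    return 1.0 / (1.0 + np.exp(-a[idx] * (theta - b[idx])))

def info(theta, idx):
    p = prob(theta, idx)
    return a[idx] ** 2 * p * (1 - p)        # Fisher information for a 2PL item

grid = np.linspace(-4, 4, 801)
administered, responses = [], []
theta_hat, se = 0.0, np.inf

while se > 0.30 and len(administered) < 30:          # precision or length stop
    available = [i for i in range(n_bank) if i not in administered]
    nxt = max(available, key=lambda i: info(theta_hat, i))   # most informative item
    administered.append(nxt)
    responses.append(int(rng.random() < prob(true_theta, nxt)))

    # Re-estimate ability by maximum likelihood over a grid of theta values
    idx = np.array(administered)
    r = np.array(responses)
    ll = np.array([np.sum(r * np.log(prob(t, idx)) + (1 - r) * np.log(1 - prob(t, idx)))
                   for t in grid])
    theta_hat = grid[np.argmax(ll)]
    se = 1.0 / np.sqrt(np.sum(info(theta_hat, idx)))  # asymptotic standard error

print(f"items used: {len(administered)}, theta estimate: {theta_hat:.2f}, SE: {se:.2f}")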


The use of adaptive testing and related computer-based testing models also involves special considerations related to item development. When a pool of operational items is developed for a computerized adaptive test, the specifications refer both to the item pool and to the rules or procedures by which an individualized set of items is selected for each test taker. Some of the appealing features of computerized adaptive tests, such as tailoring the difficulty level of the items to the test taker's ability, place additional constraints on the design of such tests. In most cases, large numbers of items are needed in constructing a computerized adaptive test to ensure that the set of items administered to each test taker meets all of the requirements of the test specifications. Further, tests often are developed in the context of larger systems or programs. Multiple pools of items, for example, may be created for use with different groups of test takers or on different testing dates. Test security concerns are heightened when limited availability of equipment makes it impossible to test all examinees at the same time. A number of issues, including test security, the complexity of content coverage requirements, required score precision levels, and whether test takers might be allowed to retest using the same pool, must be considered when specifying the size of item pools associated with each form of the adaptive test.

The development of items for adaptive testing typically requires a greater proportion of items to be developed at high or low levels of difficulty relative to the targeted testing population. Tryout data for items developed for use in adaptive tests should be examined for possible context effects to assess how much item parameters might shift when items are administered in different orders. In addition, if items are associated with a common passage or stimulus, development should be informed by an understanding of how item selection will work. For example, the approach to developing items associated with a passage may differ depending on whether the item selection algorithm selects all of the available items related to the passage or is able to choose subsets of the available items related to the passage. Because of the issues that arise when items or tasks are nested within common passages or stimuli, variations on adaptive testing are often considered. For example, multistage testing begins with a set of routing items. Once these are given and scored, the computer branches to item groups that are explicitly targeted to appropriate difficulty levels, based on the evaluation of examinees' observed performance on the routing items. In general, the special requirements of adaptive testing necessitate some shift in the way in which items are developed and tried out. Although the fundamental principles of quality item development are no different, greater attention must be given to the interactions among content, format, and item difficulty to achieve item pools that are best suited to this testing approach.

Systems Supporting Item and Test Development

The increased reliance on technology and the need for speed and efficiency in the test development process require consideration of the systems supporting item and test development. Such systems can enhance good item and test development practice by facilitating item/task authoring and reviewing, providing item banking and automated tools to assist with test form development, and integrating item/task statistical information with item/task text and graphics. These systems can be developed to comply with interoperability and accessibility standards and frameworks that make it easier for test users to transition their testing programs from one test developer to another. Although the specifics of item databases and supporting systems are outside the scope of the Standards, the increased availability of such systems compels those responsible for developing such tests to consider applying technology to test design and development. Test developers should evaluate the costs and benefits of different applications, considering issues such as speed of development, transportability across testing platforms, and security.

Item Development and Review

The test developer usually assembles an item pool that consists of more questions or tasks than are needed to populate the test form or forms to be built. This allows the test developer to select a set of items for one or more forms of the test that meet the test specifications. The quality of the items is usually ascertained through item review procedures and item tryouts, often referred to as pretesting. Items are reviewed for content quality, clarity, and construct-irrelevant aspects of content that influence test takers' responses. In most cases, sound practice dictates that items be reviewed for sensitivity and potential offensiveness that could introduce construct-irrelevant variance for individuals or groups of test takers. An attempt is generally made to avoid words and topics that may offend or otherwise disturb some test takers, if less offensive material is equally useful (see chap. 3). For constructed-response questions and performance tasks, development includes item-specific scoring rubrics as well as prompts or task descriptions. Reviewers should be knowledgeable about test content and about the examinee groups covered by this review.

Often, new test items are administered to a group of test takers who are as representative as possible of the target population for the test and, where possible, who adequately represent individuals from intended subgroups. Item tryouts help determine some of the psychometric properties of the test items, such as an item's difficulty and ability to distinguish among test takers of different standing on the construct being assessed. Ongoing testing programs often pretest items by inserting them into existing operational tests (the tryout items do not contribute to the scores that test takers receive). Analyses of responses to these tryout items provide useful data for evaluating quality and appropriateness prior to operational use.

Statistical analyses of item tryout data commonly include studies of differential item functioning (see chap. 3, "Fairness in Testing"). Differential item functioning is said to exist when test takers from different groups (e.g., groups defined by gender, race/ethnicity, or age) who have approximately equal ability on the targeted construct or content domain differ in their responses to an item. In theory, the ultimate goal of such studies is to identify construct-irrelevant aspects of item content, item format, or scoring criteria that may differentially affect test scores of one or more groups of test takers. When differential item functioning is detected, test developers try to identify plausible explanations for the differences, and they may then replace or revise items to promote sound score interpretations for all examinees. When items are dropped due to a differential item functioning index, the test developer must take care that any replacements or revisions do not compromise coverage of the specified test content.
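One widely used index for flagging differential item functioning in dichotomously scored items is the Mantel-Haenszel common odds ratio computed across matched total-score levels. The sketch below shows the basic computation on simulated data; the sample sizes, score range, and the flagging rule of thumb mentioned in the comments are illustrative assumptions, not requirements of the Standards.

# Illustrative sketch: a Mantel-Haenszel DIF index for one dichotomous item,
# matching reference and focal group members on total test score (simulated data).
import numpy as np

rng = np.random.default_rng(4)
n = 1000
group = rng.integers(0, 2, n)                 # 0 = reference, 1 = focal
total = rng.integers(0, 31, n)                # matching variable: total score 0-30
# Simulated item responses: probability depends on total score only (no DIF built in)
item = (rng.random(n) < (0.2 + 0.6 * total / 30)).astype(int)

num, den = 0.0, 0.0
for s in np.unique(total):                    # one 2x2 table per matched score level
    at_s = total == s
    ref, foc = at_s & (group == 0), at_s & (group == 1)
    a_ = np.sum(item[ref] == 1)               # reference group, correct
    b_ = np.sum(item[ref] == 0)               # reference group, incorrect
    c_ = np.sum(item[foc] == 1)               # focal group, correct
    d_ = np.sum(item[foc] == 0)               # focal group, incorrect
    t = a_ + b_ + c_ + d_
    if t > 0:
        num += a_ * d_ / t
        den += b_ * c_ / t

alpha_mh = num / den                          # common odds ratio across score levels
delta_mh = -2.35 * np.log(alpha_mh)           # ETS delta metric; large |delta| is often flagged for review
print(f"MH odds ratio: {alpha_mh:.2f}, MH delta: {delta_mh:.2f}")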

Test developers sometimes use approaches involving structured interviews or think-aloud protocols with selected test takers. Such approaches, sometimes referred to as cognitive labs, are used to identify irrelevant barriers to responding correctly that might limit the accessibility of the test content. Cognitive labs are also used to provide evidence that the cognitive processes being followed by those taking the assessment are consistent with the construct to be measured.

Additional steps are involved in the evaluation of scoring rubrics for extended-response items or performance tasks. Test developers must identify responses that illustrate each scoring level, for use in training and checking scorers. Developers also identify responses at the borders between adjacent score levels for use in more detailed discussions during scorer training. Statistical analyses of scoring consistency and accuracy (agreement with scores assigned by experts) should be included in the analysis of tryout data.
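Such analyses often begin with simple agreement indices between an operational scorer's scores and expert-assigned scores. The sketch below computes exact agreement, adjacent agreement, and Cohen's kappa for a hypothetical set of ratings on a 0-4 rubric; the ratings themselves are invented for illustration.

# Illustrative sketch: agreement between two sets of scores on a 0-4 rubric
# (e.g., operational scorer vs. expert scores); the ratings are hypothetical.
import numpy as np

scorer = np.array([3, 2, 4, 1, 0, 2, 3, 3, 1, 2, 4, 0, 2, 3, 1])
expert = np.array([3, 2, 3, 1, 0, 2, 4, 3, 1, 1, 4, 0, 2, 3, 2])

exact = np.mean(scorer == expert)                       # identical scores
adjacent = np.mean(np.abs(scorer - expert) <= 1)        # within one score point

# Cohen's kappa: exact agreement corrected for chance agreement
levels = np.arange(5)
p_o = exact
p_e = sum(np.mean(scorer == k) * np.mean(expert == k) for k in levels)
kappa = (p_o - p_e) / (1 - p_e)

print(f"exact agreement: {exact:.2f}, adjacent agreement: {adjacent:.2f}, kappa: {kappa:.2f}")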

Assembling and Evaluating Test Forms

The next step in test development is to assemble items into one or more test forms or to identify one or more pools of items for an adaptive or multistage test. The test developer is responsible for documenting that the items selected for the test meet the requirements of the test specifications. In particular, the set of items selected for a new test form or an item pool for an adaptive test must meet both content and psychometric specifications. In addition, editorial and content reviews are commonly conducted to replace items that are too similar to other items or that may provide clues to the answers to other items in the same test form or item pool. When multiple forms of a test are prepared, the test specifications govern each of the forms.

New test forms are sometimes tried out or field tested prior to operational use. The purpose of a field test is to determine whether items function as intended in the context of the new test form and to assess statistical properties, such as score precision or reliability, of the new form. When field tests are conducted, all relevant examinee groups should be included so that results and conclusions will generalize to the intended operational use of the new test forms and support further analyses of the fairness of the new forms.
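Whether a draft form actually meets its content and psychometric specifications can be checked mechanically. The sketch below compares an assembled form's content-area counts and mean item difficulty with blueprint targets; the blueprint, target difficulty range, and item metadata are all hypothetical values used only to show the kind of check intended.

# Illustrative sketch: checking an assembled form against a (hypothetical)
# blueprint for content-area counts and a target range of mean item difficulty.
blueprint = {"algebra": 12, "geometry": 10, "data": 8}      # required item counts
target_difficulty = (0.55, 0.70)                            # acceptable mean p-value range

# Each selected item: (content area, tryout difficulty); metadata is hypothetical
form = [("algebra", 0.62)] * 12 + [("geometry", 0.58)] * 10 + [("data", 0.71)] * 8

counts = {area: sum(1 for c, _ in form if c == area) for area in blueprint}
mean_difficulty = sum(p for _, p in form) / len(form)

content_ok = all(counts[area] == blueprint[area] for area in blueprint)
difficulty_ok = target_difficulty[0] <= mean_difficulty <= target_difficulty[1]
print(f"content counts meet blueprint: {content_ok}; "
      f"mean difficulty {mean_difficulty:.2f} within target: {difficulty_ok}")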

Developing Procedures and Materials for Administration and Scoring

Many interested persons (e.g., practitioners, teachers) may be involved in developing items and scoring rubrics, and/or evaluating the subsequent performances. If a participatory approach is used, participants' knowledge about the domain being assessed and their ability to apply the scoring rubrics are of critical importance. Equally important for those involved in developing tests and evaluating performances is their familiarity with the nature of the population being tested. Relevant characteristics of the population being tested may include the typical range of expected skill levels, familiarity with the response modes required of them, typical ways in which knowledge and skills are displayed, and the primary language used.

Test development includes creation of a number of documents to support test administration as described in the test specifications. Instructions to test users are developed and tried out as part of pilot or field testing procedures. Instructions and training for test administrators must also be developed and tried out. A key consideration in developing test administration procedures and materials is that test administration should be fair to all examinees. This means that instructions for taking the test should be clear and that test administration conditions should be standardized for all examinees. It also means consideration must be given in advance to appropriate testing accommodations for examinees who need them, as discussed in chapter 3.

For computer-administered tests, administration procedures must be consistent with hardware and software requirements included in the test specifications. Hardware requirements may cover processor speed and memory; keyboard, mouse, or other input devices; monitor size and display resolution; and connectivity to local servers or the Internet. Software requirements cover operating systems, browsers, or other common tools and provisions for blocking access to, or interference from, other software. Examinees taking computer-administered tests should be informed on how to respond to questions, how to navigate through the test, whether they can skip items, whether they can revisit previously answered items later in the testing period, whether they can suspend the testing session to a later time, and other exigencies that may occur during testing.

Test security procedures should also be implemented in conjunction with both administration and scoring of the tests. Such procedures often include tracking and storage of materials; encryption of electronic transmission of exam content and scores; nondisclosure agreements for test takers, scorers, and administrators; and procedures for monitoring examinees during the testing session. In addition, for testing programs that reuse test items or test forms, security procedures should include evaluation of changes in item statistics to assess the possibility of a security breach. Test developers or users might consider monitoring of websites for possible disclosure of test content.

Test Revisions

Tests and their supporting documents (e.g., test manuals, technical manuals, user guides) should be reviewed periodically to determine whether revisions are needed. Revisions or amendments are necessary when new research data, significant changes in the domain, or new conditions of test use and interpretation suggest that the test is no longer optimal or fully appropriate for some of its intended uses. As an example, tests are revised if the test content or language has become outdated and, therefore, may subsequently affect the validity of the test score interpretations. However, outdated norms may not have the same implications for revisions as an outdated test. For example, it may be necessary to update the norms for an achievement test after a period of rising or falling achievement in the norming population, or when there are changes in the test-taking population; but the test content itself may continue to be as relevant as it was when the test was developed. The timing of the need for review will vary as a function of test content and intended use(s). For example, tests of mastery of educational or training curricula should be reviewed whenever the corresponding curriculum is updated. Tests assessing psychological constructs should be reviewed when research suggests a revised conceptualization of the construct.


STANDARDS FOR TEST DESIGN AND DEVELOPMENT

The standards in this chapter begin with an overarching standard (numbered 4.0), which is designed to convey the central intent or primary focus of the chapter. The overarching standard may also be viewed as the guiding principle of the chapter, and is applicable to all tests and test users. All subsequent standards have been separated into four thematic clusters labeled as follows:

1. Standards for Test Specifications
2. Standards for Item Development and Review
3. Standards for Developing Test Administration and Scoring Procedures and Materials

4. Standards for Test Revision

Standard 4.0

Tests and testing programs should be designed and developed in a way that supports the validity of interpretations of the test scores for their intended uses. Test developers and publishers should document steps taken during the design and development process to provide evidence of fairness, reliability, and validity for intended uses for individuals in the intended examinee population.

Comment: Specific standards for designing and developing tests in a way that supports intended uses are described below. Initial specifications for a test, intended to guide the development process, may be modified or expanded as development proceeds and new information becomes available. Both initial and final documentation of test specifications and development procedures provide a basis on which external experts and test users can judge the extent to which intended uses have been or are likely to be supported, leading to valid interpretations of test results for all individuals. Initial test specifications may be modified as evidence is collected during development and implementation of the test.

Cluster 1. Standards for Test Specifications

Standard 4.1

Test specifications should describe the purpose(s) of the test, the definition of the construct or domain measured, the intended examinee population, and interpretations for intended uses. The specifications should include a rationale supporting the interpretations and uses of test results for the intended purpose(s).

Comment: The adequacy and usefulness of test interpretations depend on the rigor with which the purpose(s) of the test and the domain represented by the test have been defined and explicated. The domain definition should be sufficiently detailed and delimited to show clearly what dimensions of knowledge, skills, cognitive processes, attitudes, values, emotions, or behaviors are included and what dimensions are excluded. A clear description will enhance accurate judgments by reviewers and others about the degree of congruence between the defined domain and the test items. Clear specification of the intended examinee population and its characteristics can help to guard against construct-irrelevant characteristics of item content and format. Specifications should include plans for collecting evidence of the validity of the intended interpretations of the test scores for their intended uses. Test developers should also identify potential limitations on test use or possible inappropriate uses.

Standard 4.2

In addition to describing intended uses of the test, the test specifications should define the content of the test, the proposed test length, the item formats, the desired psychometric properties of the test items and the test, and the ordering of items and sections. Test specifications should also specify the amount of time allowed for testing; directions for the test takers; procedures to be used for test administration, including permissible variations; any materials to be used; and scoring and reporting procedures. Specifications for computer-based tests should include a description of any hardware and software requirements.

Comment: Professional judgment plays a major role in developing the test specifications. The specific procedures used for developing the specifications depend on the purpose(s) of the test. For example, in developing licensure and certification tests, practice analyses or job analyses usually provide the basis for defining the test specifications; job analyses alone usually serve this function for employment tests. For achievement tests given at the end of a course, the test specifications should be based on an outline of course content and goals. For placement tests, developers will examine the required entry-level knowledge and skills for different courses. In developing psychological tests, descriptions and diagnostic criteria of behavioral, mental, and emotional deficits and psychopathology inform test specifications.

The types of items, the response formats, the scoring procedures, and the test administration procedures should be selected based on the purpose(s) of the test, the domain to be measured, and the intended test takers. To the extent possible, test content and administration procedures should be chosen so that intended inferences from test scores are equally valid for all test takers. Some details of the test specifications may be revised on the basis of initial pilot or field tests. For example, specifications of the test length or mix of item types might be modified based on initial data to achieve desired precision of measurement.

Standard 4.3

Test developers should document the rationale and supporting evidence for the administration, scoring, and reporting rules used in computer-adaptive, multistage-adaptive, or other tests delivered using computer algorithms to select items. This documentation should include procedures used in selecting items or sets of items for administration, in determining the starting point and termination conditions for the test, in scoring the test, and in controlling item exposure.

Comment: If a computerized adaptive test is intended to measure a number of different content subcategories, item selection procedures should ensure that the subcategories are adequately represented by the items presented to the test taker. Common rationales for computerized adaptive tests are that score precision is increased, particularly for high- and low-scoring examinees, or that comparable precision is achieved while testing time is reduced. Note that these tests are subject to the same requirements for documenting the validity of score interpretations for their intended use as other types of tests. Test specifications should include plans to collect evidence required for such documentation.
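To make the item-selection logic referred to above concrete, the following minimal sketch selects the next item in an adaptive test by maximizing Fisher information at the examinee's provisional ability estimate under a Rasch model. The item pool, parameter values, and function names are hypothetical illustrations, not part of the Standards; operational adaptive tests would add content balancing, exposure control, and documented termination rules.

    import math

    def rasch_information(theta, b):
        """Fisher information of a Rasch item with difficulty b at ability theta."""
        p = 1.0 / (1.0 + math.exp(-(theta - b)))
        return p * (1.0 - p)

    def select_next_item(theta_hat, item_difficulties, administered):
        """Pick the unadministered item with maximum information at theta_hat."""
        candidates = [i for i in range(len(item_difficulties)) if i not in administered]
        return max(candidates, key=lambda i: rasch_information(theta_hat, item_difficulties[i]))

    # Hypothetical pool of five item difficulties and a provisional ability estimate.
    pool = [-1.5, -0.5, 0.0, 0.8, 1.6]
    print(select_next_item(theta_hat=0.4, item_difficulties=pool, administered={2}))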

Standard 4.4

If test developers prepare different versions of a test with some change to the test specifications, they should document the content and psychometric specifications of each version. The documentation should describe the impact of differences among versions on the validity of score interpretations for intended uses and on the precision and comparability of scores.

Comment: Test developers may have a number of reasons for creating different versions of a test, such as allowing different amounts of time for test administration by reducing or increasing the number of items on the original test, or allowing administration to different populations by translating test questions into different languages. Test developers should document the extent to which the specifications differ from those of the original test, provide a rationale for the different versions, and describe the implications of such differences for interpreting the scores derived from the different versions. Test developers and users should monitor and document any psychometric differences among versions of the test based on evidence collected during development and implementation. Evidence of differences may involve judgments when the number of examinees receiving a particular version is small (e.g., a braille version). Note that these requirements are in addition to the normal requirements for demonstrating the equivalency of scores from different forms of the same test. When different languages are used in different test versions, the procedures used to develop and check translations into each language should be documented.

Standard 4.5

If the test developer indicates that the conditions of administration are permitted to vary from one test taker or group to another, permissible variation in conditions for administration should be identified. A rationale for permitting the different conditions and any requirements for permitting the different conditions should be documented.

Comment: Variation in conditions of administration may reflect administration constraints in different locations or, more commonly, may be designed as testing accommodations for specific examinees or groups of examinees. One example of a common variation is the use of computer administration of a test form in some locations and paper-and-pencil administration of the same form in other locations. Another example is small-group or one-on-one administration for test takers whose test performance might be limited by distractions in large group settings. Test accommodations, as discussed in chapter 3 (“Fairness in Testing”), are changes made in a test to increase fairness for individuals who otherwise would be disadvantaged by construct-irrelevant features of test items. Test developers should specify procedures for monitoring variations and for collecting evidence to show that the target construct is or is not altered by allowable variations. These procedures should be documented based on data collected during implementation.

Standard 4.6

When appropriate to documenting the validity of test score interpretations for intended uses, relevant experts external to the testing program should review the test specifications to evaluate their appropriateness for intended uses of the test scores and fairness for intended test takers. The purpose of the review, the process by which the review is conducted, and the results of the review should be documented. The qualifications, relevant experiences, and demographic characteristics of expert judges should also be documented.

Comment: A number of factors may be considered in deciding whether external review of test specifications is needed, including the extent of intended use, whether score interpretations may have important consequences, and the availability of external experts. Expert review of the test specifications may serve many useful purposes, such as helping to ensure content quality and representativeness. Use of experts external to the test development process supports objectivity in judgments of the quality of the test specifications. Review of the specifications prior to starting item development can avoid significant problems during subsequent test item reviews. The expert judges may include individuals representing defined populations of concern to the test specifications. For example, if the test is to be administered to different linguistic and cultural groups, the expert review typically includes members of these groups and experts on testing issues specific to these groups.

Cluster 2. Standards for Item Development and Review

Standard 4.7

The procedures used to develop, review, and try out items and to select items from the item pool should be documented.

Comment: The qualifications of individuals developing and reviewing items and the processes used to train and guide them in these activities are important aspects of test development documentation. Typically, several groups of individuals participate in the test development process, including item writers and individuals participating in reviews for item and test content, for sensitivity, or for other purposes.

Standard 4.8

The test review process should include empirical analyses and/or the use of expert judges to review items and scoring criteria. When expert judges are used, their qualifications, relevant experiences, and demographic characteristics should be documented, along with the instructions and training in the item review process that the judges receive.

Comment: When sample size permits, empirical analyses are needed to check the psychometric properties of test items and also to check whether test items function similarly for different groups. Expert judges may be asked to check item scoring and to identify material likely to be inappropriate, confusing, or offensive for groups in the test-taking population. For example, judges may be asked to identify whether lack of exposure to problem contexts in mathematics word problems may be of concern for some groups of students. Various groups of test takers can be defined by characteristics such as age, ethnicity, culture, gender, disability, or demographic region. When feasible, both empirical and judgmental evidence of the extent to which test items function similarly for different groups should be used in screening the items. (See chap. 3 for examples of appropriate types of evidence.)

Studies of the alignment of test forms to content specifications are sometimes conducted to support interpretations that test scores indicate mastery of targeted test content. Experts independent of the test developers judge the degree to which item content matches content categories in the test specifications and whether test forms provide balanced coverage of the targeted content.

Standard 4.9

When item or test form tryouts are conducted, the procedures used to select the sample(s) of test takers as well as the resulting characteristics of the sample(s) should be documented. The sample(s) should be as representative as possible of the population(s) for which the test is intended.

Comment: Conditions that may differentially affect performance on the test items by the tryout sample(s) as compared with the intended population(s) should be documented when appropriate. For example, test takers may be less motivated when they know their scores will not have an impact on them. Where possible, item and test characteristics should be examined and documented for relevant subgroups in the intended examinee population.

To the extent feasible, item and test form tryouts should include relevant examinee groups. Where sample size permits, test developers should determine whether item scores have different relationships to the construct being measured for different groups (differential item functioning). When testing accommodations are designed for specific examinee groups, information on item performance under accommodated conditions should also be collected. For relatively small groups, qualitative information may be useful. For example, test-taker interviews might be used to assess the effectiveness of accommodations in removing irrelevant variance.

Standard 4.10

When a test developer evaluates the psychometric properties of items, the model used for that purpose (e.g., classical test theory, item response theory, or another model) should be documented. The sample used for estimating item properties should be described and should be of adequate size and diversity for the procedure. The process by which items are screened and the data used for screening, such as item difficulty, item discrimination, or differential item functioning (DIF) for major examinee groups, should also be documented. When model-based methods (e.g., IRT) are used to estimate item parameters in test development, the item response model, estimation procedures, and evidence of model fit should be documented.

Comment: Although overall sample size is relevant, there should also be an adequate number of cases in regions critical to the determination of the psychometric properties of items. If the test is to achieve greatest precision in a particular part of the score scale and this consideration affects item selection, the manner in which item statistics are used for item selection needs to be carefully described. When IRT is used as the basis of test development, it is important to document the adequacy of fit of the model to the data. This is accomplished by providing information about the extent to which IRT assumptions (e.g., unidimensionality, local item independence, or, for certain models, equality of slope parameters) are satisfied.

Statistics used for flagging items that function differently for different groups should be described, including specification of the groups to be analyzed, the criteria for flagging, and the procedures for reviewing and making final decisions about flagged items. Sample sizes for groups of concern should be adequate for detecting meaningful DIF.

Test developers should consider how any differences between the administration conditions of the field test and the final form might affect item performance. Conditions that can affect item statistics include motivation of the test takers, item position, time limits, length of test, mode of testing (e.g., paper-and-pencil versus computer administered), and use of calculators or other tools.
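As one hedged illustration of the item statistics mentioned in the standard, the sketch below computes a classical difficulty index (proportion correct) and a corrected point-biserial discrimination for each item from a matrix of scored responses. The data are hypothetical; an operational program would also run DIF analyses and, where IRT is used, model-fit checks.

    import numpy as np

    def item_statistics(scores):
        """scores: examinees x items array of 0/1 item scores."""
        scores = np.asarray(scores, dtype=float)
        difficulty = scores.mean(axis=0)                 # proportion correct per item
        total = scores.sum(axis=1)
        stats = []
        for j in range(scores.shape[1]):
            rest = total - scores[:, j]                  # total score excluding the item itself
            r = np.corrcoef(scores[:, j], rest)[0, 1]    # corrected point-biserial discrimination
            stats.append((difficulty[j], r))
        return stats

    # Hypothetical responses for six examinees on three items.
    data = [[1, 0, 1], [1, 1, 1], [0, 0, 1], [1, 0, 0], [0, 1, 1], [1, 1, 1]]
    for j, (p, r) in enumerate(item_statistics(data)):
        print(f"item {j}: difficulty={p:.2f}, discrimination={r:.2f}")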

Standard 4.11

Test developers should conduct cross-validation studies when items or tests are selected primarily on the basis of empirical relationships rather than on the basis of content or theoretical considerations. The extent to which the different studies show consistent results should be documented.

Comment: When data-based approaches to test development are used, items are selected primarily on the basis of their empirical relationships with an external criterion, their relationships with one another, or their power to discriminate among groups of individuals. Under these circumstances, it is likely that some items will be selected based on chance occurrences in the data used. Administering the test to a comparable sample of test takers or use of a separate validation sample provides independent verification of the relationships used in selecting items.

Statistical optimization techniques such as stepwise regression are sometimes used to develop test composites or to select tests for further use in a test battery. As with the empirical selection of items, capitalization on chance can occur. Cross-validation on an independent sample or the use of a formula that predicts the shrinkage of correlations in an independent sample may provide a less biased index of the predictive power of the tests or composite.
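A minimal sketch of the two checks described in the comment follows: a composite is fit on a development sample and evaluated on an independent holdout sample, and a Wherry-type shrinkage adjustment is applied to the development-sample result. The data, sample sizes, and function names are hypothetical, and real studies would use the particular selection procedure and shrinkage formula adopted by the program.

    import numpy as np

    def adjusted_r_squared(r2, n, k):
        """Wherry-type shrinkage estimate for a composite of k predictors fit on n cases."""
        return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

    rng = np.random.default_rng(0)
    n, k = 200, 5
    X = rng.normal(size=(n, k))
    y = 0.5 * X[:, 0] + rng.normal(size=n)              # hypothetical criterion

    # Fit least-squares weights on a development half, evaluate on the holdout half.
    dev, hold = slice(0, 100), slice(100, 200)
    w, *_ = np.linalg.lstsq(X[dev], y[dev], rcond=None)
    r_dev = np.corrcoef(X[dev] @ w, y[dev])[0, 1]
    r_hold = np.corrcoef(X[hold] @ w, y[hold])[0, 1]
    print(f"development r={r_dev:.2f}, cross-validated r={r_hold:.2f}, "
          f"shrinkage-adjusted R2={adjusted_r_squared(r_dev**2, 100, k):.2f}")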

Standard 4.12

Test developers should document the extent to which the content domain of a test represents the domain defined in the test specifications.

Comment: Test developers should provide evidence of the extent to which the test items and scoring criteria yield scores that represent the defined domain. This affords a basis to help determine whether performance on the test can be generalized to the domain that is being assessed. This is especially important for tests that contain a small number of items, such as performance assessments. Such evidence may be provided by expert judges. In some situations, an independent study of the alignment of test questions to the content specifications is conducted to validate the developer's internal processes for ensuring appropriate content coverage.


Standard 4.13

When credible evidence indicates that irrelevant variance could affect scores from the test, then to the extent feasible, the test developer should investigate sources of irrelevant variance. Where possible, such sources of irrelevant variance should be removed or reduced by the test developer.

Comment: A variety of methods may be used to check for the influence of irrelevant factors, including analyses of correlations with measures of other relevant and irrelevant constructs and, in some cases, deeper cognitive analyses (e.g., use of follow-up probes to identify relevant and irrelevant reasons for correct and incorrect responses) of examinee standing on the targeted construct. A deeper understanding of irrelevant sources of variance may also lead to refinement of the description of the construct under examination.

Standard 4.14

For a test that has a time limit, test development research should examine the degree to which scores include a speed component and should evaluate the appropriateness of that component, given the domain the test is designed to measure.

Comment: At a minimum, test developers should examine the proportion of examinees who complete the entire test, as well as the proportion who fail to respond to (omit) individual test questions. Where speed is a meaningful part of the target construct, the distribution of the number of items answered should be analyzed to check for appropriate variability in the number of items attempted as well as the number of correct responses. When speed is not a meaningful part of the target construct, time limits should be determined so that examinees will have adequate time to demonstrate the targeted knowledge and skill.
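A minimal sketch of the completion and omit checks mentioned above, under the assumption that None marks an omitted or not-reached item; the response matrix and the convention of treating "reached the last item" as completion are hypothetical simplifications.

    def speededness_summary(responses):
        """responses: list of per-examinee item lists; None marks an omitted/not-reached item."""
        n = len(responses)
        completed = sum(1 for r in responses if r[-1] is not None)   # reached the final item
        total_cells = sum(len(r) for r in responses)
        omitted = sum(1 for r in responses for x in r if x is None)
        return {"prop_completing": completed / n, "omit_rate": omitted / total_cells}

    # Hypothetical 0/1 item scores for four examinees on a five-item timed test.
    data = [[1, 0, 1, 1, 0],
            [1, 1, 0, None, None],
            [0, 1, 1, 1, 1],
            [1, 1, 1, 1, None]]
    print(speededness_summary(data))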

Cluster 3. Standards for Developing Test Administration and Scoring Procedures and Materials

Standard 4.15

The directions for test administration should be presented with sufficient clarity so that it is possible for others to replicate the administration conditions under which the data on reliability, validity, and (where appropriate) norms were obtained. Allowable variations in administration procedures should be clearly described. The process for reviewing requests for additional testing variations should also be documented.

Comment: Because all people administering tests, including those in schools, industry, and clinics, need to follow test administration procedures carefully, it is essential that test administrators receive detailed instructions on test administration guidelines and procedures. Testing accommodations may be needed to allow accurate measurement of intended constructs for specific groups of test takers, such as individuals with disabilities and individuals whose native language is not English. (See chap. 3, “Fairness in Testing.”)

Standard 4.16

The instructions presented to test takers should contain sufficient detail so that test takers can respond to a task in the manner that the test developer intended. When appropriate, sample materials, practice or sample questions, criteria for scoring, and a representative item identified with each item format or major area in the test's classification or domain should be provided to the test takers prior to the administration of the test, or should be included in the testing material as part of the standard administration instructions.

Comment: For example, in a personality inventory the intent may be that test takers give the first response that occurs to them. Such an expectation should be made clear in the inventory directions. As another example, in directions for interest or occupational inventories, it may be important to specify whether test takers are to mark the activities they would prefer under ideal conditions or whether they are to consider both their opportunity and their ability realistically.

Instructions and any practice materials should be available in formats that can be accessed by all test takers. For example, if a braille version of the test is provided, the instructions and any practice materials should also be provided in a form that can be accessed by students who take the braille version.

The extent and nature of practice materials and directions depend on expected levels of knowledge among test takers. For example, in using a novel test format, it may be very important to provide the test taker with a practice opportunity as part of the test administration. In some testing situations, it may be important for the instructions to address such matters as time limits and the effects that guessing has on test scores. If expansion or elaboration of the test instructions is permitted, the conditions under which this may be done should be stated clearly in the form of general rules and by giving representative examples. If no expansion or elaboration is to be permitted, this should be stated explicitly. Test developers should include guidance for dealing with typical questions from test takers. Test administrators should be instructed on how to deal with questions that may arise during the testing period.

Standard 4.17

If a test or part of a test is intended for research use only and is not distributed for operational use, statements to that effect should be displayed prominently on all relevant test administration and interpretation materials that are provided to the test user.

Comment: This standard refers to tests that are intended for research use only. It does not refer to standard test development functions that occur prior to the operational use of a test (e.g., item and form tryouts). There may be legal requirements to inform participants of how the test developer will use the data generated from the test, including the user's personally identifiable information, how that information will be protected, and with whom it might be shared.

Standard 4.18

Procedures for scoring and, if relevant, scoring criteria, should be presented by the test developer with sufficient detail and clarity to maximize the accuracy of scoring. Instructions for using rating scales or for deriving scores obtained by coding, scaling, or classifying constructed responses should be clear. This is especially critical for extended-response items such as performance tasks, portfolios, and essays.

Comment: In scoring more complex responses, test developers must provide detailed rubrics and training in their use. Providing multiple examples of responses at each score level for use in training scorers and monitoring scoring consistency is also common practice, although these are typically added to scoring specifications during item development and tryouts. For monitoring scoring effectiveness, consistency criteria for qualifying scorers should be specified, as appropriate, along with procedures, such as double-scoring of some or all responses. As appropriate, test developers should specify selection criteria for scorers and procedures for training, qualifying, and monitoring scorers. If different groups of scorers are used with different administrations, procedures for checking the comparability of scores generated by the different groups should be specified and implemented.

Standard 4.19

When automated algorithms are to be used to score complex examinee responses, characteristics of responses at each score level should be documented along with the theoretical and empirical bases for the use of the algorithms.

Comment: Automated scoring algorithms should be supported by an articulation of the theoretical and methodological bases for their use that is sufficiently detailed to establish a rationale for linking the resulting test scores to the underlying construct of interest. In addition, the automated scoring algorithm should have empirical research support, such as agreement rates with human scorers, prior to operational use, as well as evidence that the scoring algorithms do not introduce systematic bias against some subgroups.

Because automated scoring algorithms are often considered proprietary, their developers are rarely willing to reveal scoring and weighting rules in public documentation. Also, in some cases, full disclosure of details of the scoring algorithm might result in coaching strategies that would increase scores without any real change in the construct(s) being assessed. In such cases, developers should describe the general characteristics of scoring algorithms. They may also have the algorithms reviewed by independent experts, under conditions of nondisclosure, and collect independent judgments of the extent to which the resulting scores will accurately implement intended scoring rubrics and be free from bias for intended examinee subpopulations.

Standard 4.20

The process for selecting, training, qualifying, and monitoring scorers should be specified by the test developer. The training materials, such as the scoring rubrics and examples of test takers' responses that illustrate the levels on the rubric score scale, and the procedures for training scorers should result in a degree of accuracy and agreement among scorers that allows the scores to be interpreted as originally intended by the test developer. Specifications should also describe processes for assessing scorer consistency and potential drift over time in raters' scoring.

Comment: To the extent possible, scoring processes and materials should anticipate issues that may arise during scoring. Training materials should address any common misconceptions about the rubrics used to describe score levels. When written text is being scored, it is common to include a set of prescored responses for use in training and for judging scoring accuracy. The basis for determining scoring consistency (e.g., percentage of exact agreement, percentage within one score point, or some other index of agreement) should be indicated. Information on scoring consistency is essential to estimating the precision of resulting scores.
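A hedged sketch of the agreement indices named in the comment, computing exact agreement and agreement within one score point between two scorers; the rubric range and rater scores are hypothetical, and programs may prefer other indices (e.g., weighted kappa).

    def agreement_rates(scores_a, scores_b):
        """Proportion of exact agreements and of agreements within one score point."""
        pairs = list(zip(scores_a, scores_b))
        exact = sum(a == b for a, b in pairs) / len(pairs)
        adjacent = sum(abs(a - b) <= 1 for a, b in pairs) / len(pairs)
        return exact, adjacent

    # Hypothetical 0-4 rubric scores assigned by two trained scorers to eight responses.
    rater1 = [3, 2, 4, 1, 0, 3, 2, 4]
    rater2 = [3, 3, 4, 1, 1, 2, 2, 4]
    exact, adjacent = agreement_rates(rater1, rater2)
    print(f"exact agreement = {exact:.0%}, within one point = {adjacent:.0%}")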

Standard 4.21

When test users are responsible for scoring and scoring requires scorer judgment, the test user is responsible for providing adequate training and instruction to the scorers and for examining scorer agreement and accuracy. The test developer should document the expected level of scorer agreement and accuracy and should provide as much technical guidance as possible to aid test users in satisfying this standard.

Comment: A common practice of test developers is to provide training materials (e.g., scoring rubrics, examples of test takers' responses at each score level) and procedures when scoring is done by test users and requires scorer judgment. Training provided to support local scoring should include standards for checking scorer accuracy during training and operational scoring. Training should also cover any special consideration for test-taker groups that might interact differently with the task to be scored.

Standard 4.22

Test developers should specify the procedures used to interpret test scores and, when appropriate, the normative or standardization samples or the criterion used.

Comment: Test specifications may indicate that the intended scores should be interpreted as indicating an absolute level of the construct being measured or as indicating standing on the construct relative to other examinees, or both. In absolute score interpretations, the score or average is assumed to reflect directly a level of competence or mastery in some defined criterion domain. In relative score interpretations the status of an individual (or group) is determined by comparing the score (or mean score) with the performance of others in one or more defined populations. Tests designed to facilitate one type of interpretation may function less effectively for the other type of interpretation. Given appropriate test design and adequate supporting data, however, scores arising from norm-referenced testing programs may provide reasonable absolute score interpretations, and scores arising from criterion-referenced programs may provide reasonable relative score interpretations.

Standard 4.23

When a test score is derived from the differential weighting of items or subscores, the test developer should document the rationale and process used to develop, review, and assign item weights. When the item weights are obtained based on empirical data, the sample used for obtaining item weights should be representative of the population for which the test is intended and large enough to provide accurate estimates of optimal weights. When the item weights are obtained based on expert judgment, the qualifications of the judges should be documented.

Comment: Changes in the population of test takers, along with other changes, for example in instructions, training, or job requirements, may affect the original derived item weights, necessitating subsequent studies. In many cases, content areas are weighted by specifying a different number of items from different areas. The rationale for weighting the different content areas should also be documented and periodically reviewed.

Cluster 4. Standards for Test Revision

Standard 4.24

Test specifications should be amended or revised when new research data, significant changes in the domain represented, or newly recommended conditions of test use may reduce the validity of test score interpretations. Although a test that remains useful need not be withdrawn or revised simply because of the passage of time, test developers and test publishers are responsible for monitoring changing conditions and for amending, revising, or withdrawing the test as indicated.

Comment: Test developers need to consider a number of factors that may warrant the revision of a test, including outdated test content and language, new evidence of relationships among measured or predicted constructs, or changes to test frameworks to reflect changes in curriculum, instruction, or job requirements. If an older version of a test is used when a newer version has been published or made available, test users are responsible for providing evidence that the older version is as appropriate as the new version for that particular test use.

Standard 4.25

When tests are revised, users should be informed of the changes to the specifications, of any adjustments made to the score scale, and of the degree of comparability of scores from the original and revised tests. Tests should be labeled as “revised” only when the test specifications have been updated in significant ways.

Comment: It is the test developer's responsibility to determine whether revisions to a test would influence test score interpretations. If test score interpretations would be affected by the revisions, it is appropriate to label the test “revised.” When tests are revised, the nature of the revisions and their implications for test score interpretations should be documented. Examples of changes that require consideration include adding new areas of content, refining content descriptions, redistributing the emphasis across different content areas, and even just changing item format specifications. Note that creating a new test form using the same specifications is not considered a revision within the context of this standard.

5. SCORES, SCALES, NORMS, SCORE LINKING, AND CUT SCORES

BACKGROUND

Test scores are reported on scales designed to assist in score interpretation. Typically, scoring begins with responses to separate test items. These item scores are combined, sometimes by addition, to obtain a raw score when using classical test theory or to produce an IRT score when using item response theory (IRT) or other model-based techniques. Raw scores and IRT scores often are difficult to interpret in the absence of further information. Interpretation may be facilitated by converting raw scores or IRT scores to scale scores. Examples include various scale scores used on college admissions tests and those used to report results for intelligence tests or vocational interest and personality inventories. The process of developing a score scale is referred to as scaling a test. Scale scores may aid interpretation by indicating how a given score compares with those of other test takers, by enhancing the comparability of scores obtained through different forms of a test, and by helping to prevent confusion with other scores.
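As a hedged illustration of the raw-to-scale conversion described above, the sketch below applies a simple linear transformation with rounding and truncation to a reporting range. The slope, intercept, and bounds are hypothetical placeholders; operational programs derive their conversions from the scaling design actually adopted.

    def to_scale_score(raw, slope=4.0, intercept=200.0, lo=200, hi=400):
        """Linear raw-to-scale conversion, rounded and truncated to the reporting range."""
        scaled = round(slope * raw + intercept)
        return max(lo, min(hi, scaled))

    for raw in (0, 17, 38, 50):
        print(raw, "->", to_scale_score(raw))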

Another way of assisting score interpretation is to establish cut scores that distinguish different score ranges. In some cases, a single cut score defines the boundary between passing and failing. In other cases, a series of cut scores define distinct proficiency levels. Scale scores, proficiency levels, and cut scores can be central to the use and interpretation of test scores. For that reason, their defensibility is an important consideration in test score validation for the intended purposes.

Decisions about how many scale score points to use often are based on test score reliability concerns. If too few scale score points are used, then the reliability of scale scores is decreased as information is discarded. If too many scale score points are used, then test users might attempt to interpret scale score differences that are small relative to the amount of measurement error in the scores.

In addition to facilitating interpretations of scores on a single test form, scale scores often are created to enhance comparability across alternate forms2 of the same test, by using equating methods. Score linking is a general term for methods used to develop scales with similar scale properties. Score linking includes equating and other methods for transforming scores to enhance their comparability on tests designed to measure different constructs (e.g., related subtests in a battery). Linking methods are also used to relate scale scores on different measures of similar constructs (e.g., tests of a particular construct from different test developers) and to relate scale scores on tests that measure similar constructs given under different modes of administration (e.g., computer and paper-and-pencil administrations). Vertical scaling methods sometimes are used to place scores from different levels of an achievement test on a single scale to facilitate inferences about growth or development. The degree of score comparability that results from the application of a linking procedure varies along a continuum. Equating is intended to allow scores on alternate test forms to be used interchangeably, whereas comparability of scores associated with other types of linking may be more restricted.

2 The term alternate form is used in this chapter to indicate test forms that have been built to the same content and statistical specifications and developed to measure the same construct. This term is not to be confused with the term alternate assessment as it is used in chapter 3, to indicate a test that has been modified or changed to increase access to the construct for subgroups of the population. The alternate assessment may or may not measure the same construct as the unaltered assessment.

Interpretations of Scores

An individual's raw scores or scale scores often are compared with the distribution of scores for one or more comparison groups to draw useful inferences about the person's relative performance. Test score interpretations based on such comparisons are said to be norm referenced. Percentile rank norms, for example, indicate the standing of an individual or group within a defined population of individuals or groups. An example might be the percentile scores used in military enlistment testing, which compare each applicant's score with scores for the population of 18-to-23-year-old American youth. Percentiles, averages, or other statistics for such reference groups are called norms. By showing how the test score of a given examinee compares with those of others, norms assist in the classification or description of examinees.
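A minimal sketch of a percentile rank computed against a norm sample follows; the norm data are hypothetical, and operational norm tables typically rest on large, representative samples with smoothed distributions and documented conventions for handling ties.

    def percentile_rank(score, norm_sample):
        """Percentage of the norm group scoring below the score, counting ties as half."""
        below = sum(1 for x in norm_sample if x < score)
        ties = sum(1 for x in norm_sample if x == score)
        return 100.0 * (below + 0.5 * ties) / len(norm_sample)

    # Hypothetical raw scores for a small norm group.
    norm_group = [12, 15, 15, 18, 20, 21, 21, 22, 25, 27, 30, 33]
    print(f"percentile rank of a score of 21: {percentile_rank(21, norm_group):.1f}")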

Other test score interpretations make no direct reference to the performance of other examinees. These interpretations may take a variety of forms; most are collectively referred to as criterion-referenced interpretations. Scale scores supporting such interpretations may indicate the likely proportion of correct responses that would be obtained on some larger domain of similar items, or the probability that an examinee will answer particular sorts of items correctly. Other criterion-referenced interpretations may indicate the likelihood that some psychopathology is present. Still other criterion-referenced interpretations may indicate the probability that an examinee's level of tested knowledge or skill is adequate to perform successfully in some other setting. Scale scores to support such criterion-referenced score interpretations often are developed on the basis of statistical analyses of the relationships of test scores to other variables.

Some scale scores are developed primarily to support norm-referenced interpretations; others support criterion-referenced interpretations. In practice, however, there is not always a sharp distinction. Both criterion-referenced and norm-referenced scales may be developed and used with the same test scores if appropriate methods are used to validate each type of interpretation. Moreover, a norm-referenced score scale originally developed, for example, to indicate performance relative to some specific reference population might, over time, also come to support criterion-referenced interpretations. This could happen as research and experience bring increased understanding of the capabilities implied by different scale score levels. Conversely, results of an educational assessment might be reported on a scale consisting of several ordered proficiency levels, defined by descriptions of the kinds of tasks students at each level are able to perform. That would be a criterion-referenced scale, but once the distribution of scores over levels is reported, say, for all eighth-grade students in a given state, individual students' scores will also convey information about their standing relative to that tested population.

Interpretations based on cut scores may likewise be either criterion referenced or norm referenced. If qualitatively different descriptions are attached to successive score ranges, a criterion-referenced interpretation is supported. For example, the descriptions of proficiency levels in some assessment task-scoring rubrics can enhance score interpretation by summarizing the capabilities that must be demonstrated to merit a given score. In other cases, criterion-referenced interpretations may be based on empirically determined relationships between test scores and other variables. But when tests are used for selection, it may be appropriate to rank-order examinees according to their test performance and establish a cut score so as to select a prespecified number or proportion of examinees from one end of the distribution, provided the selection use is sufficiently supported by relevant reliability and validity evidence to support rank ordering. In such cases, the cut score interpretation is norm referenced; the labels “reject” or “fail” versus “accept” or “pass” are determined primarily by an examinee's standing relative to others tested in the current selection process.

Criterion-referenced interpretations based on cut scores are sometimes criticized on the grounds that there is rarely a sharp distinction between those just below and those just above a cut score. A neuropsychological test may be helpful in diagnosing some particular impairment, for example, but the probability that the impairment is present is likely to increase continuously as a function of the test score rather than to change sharply at a particular score. Cut scores may aid in formulating rules for reaching decisions on the basis of test performance. It should be recognized, however, that the likelihood of misclassification will generally be relatively high for persons with scores close to the cut scores.

Norms

The validity of norm-referenced interpretations depends in part on the appropriateness of the reference group to which test scores are compared. Norms based on hospitalized patients, for example, might be inappropriate for some interpretations of nonhospitalized patients' scores. Thus, it is important that reference populations be carefully defined and clearly described. Validity of such interpretations also depends on the accuracy with which norms summarize the performance of the reference population. That population may be small enough that essentially the entire population can be tested (e.g., all test takers at a given grade level in a given district tested on the same occasion). Often, however, only a sample of examinees from the reference population is tested. It is then important that the norms be based on a technically sound, representative sample of test takers of sufficient size. Patients in a few hospitals in a small geographic region are unlikely to be representative of all patients in the United States, for example. Moreover, the usefulness of norms based on a given sample may diminish over time. Thus, for tests that have been in use for a number of years, periodic review is generally required to ensure the continued utility of their norms. Renorming may be required to maintain the validity of norm-referenced test score interpretations.

More than one reference population may be appropriate for the same test. For example, achievement test performance might be interpreted by reference to local norms based on sampling from a particular school district for use in making local instructional decisions, or to norms for a state or type of community for use in interpreting statewide testing results, or to national norms for use in making comparisons with national groups. For other tests, norms might be based on occupational or educational classifications. Descriptive statistics for all examinees who happen to be tested during a given period of time (sometimes called user norms or program norms) may be useful for some purposes, such as describing trends over time. But there must be a sound reason to regard that group of test takers as an appropriate basis for such inferences. When there is a suitable rationale for using such a group, the descriptive statistics should be clearly characterized as being based on a sample of persons routinely tested as part of an ongoing program.

Score Linking

Score linking is a general term that refers to relating scores from different tests or test forms. When different forms of a test are constructed to the same content and statistical specifications and administered under the same conditions, they are referred to as alternate forms or sometimes parallel or equivalent forms. The process of placing raw scores from such alternate forms on a common scale is referred to as equating. Equating involves small statistical adjustments to account for minor differences in the difficulty of the alternate forms. After equating, alternate forms of the same test yield scale scores that can be used interchangeably even though they are based on different sets of items. In many testing programs that administer tests multiple times, concerns with test security may be raised if the same form is used repeatedly. In other testing programs, the same test takers may be measured repeatedly, perhaps to measure change in levels of psychological dysfunction, attitudes, or educational achievement. In these cases, reusing the same test items may result in biased estimates of change. Score equating allows for the use of alternate forms, thereby avoiding these concerns.

Although alternate forms are built to the same content and statistical specifications, differences in test difficulty will occur, creating the need for equating. One approach to equating involves administering the forms to be equated to the same sample of examinees or to equivalent samples. Another approach involves administering a common set of items, referred to as anchor items, to the samples taking each form. Each approach has unique strengths, but also involves assumptions that could influence the equating results, and so these assumptions must be checked. Choosing among equating approaches may include the following considerations:

• Administering forms to the same sample allows for an estimate of the correlation between the scores on the two forms, as well as providing data needed to adjust for differences in difficulty. However, there could be order effects related to practice or fatigue that may affect the score distribution for the form administered second.

• Administering alternate forms to equivalent samples, usually through random assignment, avoids any order effects but does not provide a direct estimate of the correlation between the scores; other methods are needed to demonstrate that the two forms measure the same construct.

• Embedding a set of anchor items in each of the forms being equated provides a basis for adjusting for differences in the samples of examinees taking each form. The anchor items should cover the same content and difficulty range as each of the full forms being equated so that differences on the anchor items will accurately reflect differences on the full forms. Also, anchor item position and other context factors should be the same in both forms. It is important to check that the anchor items function similarly in the forms being equated. Anchor items are often dropped from the anchor if their relative difficulty is substantially different in the forms being equated.

• Sometimes an external anchor test is used in which the anchor items are administered in a separate section and do not contribute to the total score on the test. This approach eliminates some context factors as the presentation of the anchor items is identical for each examinee sample. Again, however, the anchor test must reflect the content and difficulty of the operational forms being equated. Both embedded and external anchor test designs involve strong statistical assumptions regarding the equivalence of the anchor and the forms being equated. These assumptions are particularly critical when the samples of examinees taking the different forms vary considerably on the construct being measured.
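To illustrate the kind of adjustment these designs support, here is a hedged sketch of linear equating under a randomly equivalent groups design: raw scores on a new Form X are placed on the scale of an old Form Y by matching means and standard deviations. The score data are hypothetical, and operational equating would use much larger samples and often equipercentile or IRT methods rather than this simple linear adjustment.

    import statistics as st

    def linear_equate(x_scores, y_scores):
        """Return a function mapping Form X raw scores to the Form Y scale (mean/SD matching)."""
        mx, sx = st.mean(x_scores), st.pstdev(x_scores)
        my, sy = st.mean(y_scores), st.pstdev(y_scores)
        return lambda x: my + (sy / sx) * (x - mx)

    # Hypothetical raw scores from randomly equivalent groups taking Form X and Form Y.
    form_x = [18, 22, 25, 27, 30, 31, 34, 35]
    form_y = [20, 24, 26, 29, 31, 33, 35, 38]
    eq = linear_equate(form_x, form_y)
    print(f"a Form X raw score of 28 corresponds to about {eq(28):.1f} on the Form Y scale")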

When claiming that scores on test forms are equated, it is important to document how the forms are built to the same content and statistical specifications and to demonstrate that scores on the alternate forms are measures of the same construct and have similar reliability. Equating should provide accurate score conversions for any set of persons drawn from the examinee population for which the test is designed; hence the stability of conversions across relevant subgroups should be documented. Whenever possible, the definitions of important examinee populations should include groups for which fairness may be a particular issue, such as examinees with disabilities or from diverse linguistic and cultural backgrounds. When sample sizes permit, it is important to examine the stability of equating conversions across these populations.

The increased use of tests delivered by computer raises special considerations for equating and linking because more flexible models for delivering tests become possible. These include adaptive testing as well as approaches where unique items or multiple intact sets of items are selected from a larger pool of available items. It has long been recognized that little is learned from examinees' responses to items that are much too easy or much too difficult for them. Consequently, some testing procedures use only a subset of the available items with each examinee. An adaptive test consists of a pool of items together with rules for selecting a subset of those items to be administered to an individual examinee and a procedure for placing different examinees' scores on a common scale. The selection of successive items is based in part on the examinees' responses to previous items. The item pool and item selection rules may be designed so that each examinee receives a representative set of items of appropriate difficulty. With some adaptive tests, it may happen that two examinees rarely if ever receive the same set of items. Moreover, two examinees taking the same adaptive test may be given sets of items that differ markedly in difficulty. Nevertheless, adaptive test scores can be reported on a common scale and function much like scores from a single alternate form of a test that is not adaptive.

Often, the adaptation of the test is done item by item. In other situations, such as in multistage testing, the exam process may branch from choosing among sets of items that are broadly representative of content and difficulty to choosing among sets of items that are targeted explicitly for a higher or lower level of the construct being measured, based on an interim evaluation of examinee performance.

In many situations, item pools for adaptive tests are updated by replacing some of the items in the pool with new items. In other cases, entire pools of items are replaced. In either case, statistical procedures are used to link item parameter estimates for the new items to the existing IRT scale so that scores from alternate pools can be used interchangeably, in much the same way that scores on alternate forms of tests are used when scores on the alternate forms are equated. To support comparability of scores on adaptive tests across pools, it is necessary to construct the pools to the same explicit content and statistical specifications and administer them under the same conditions. Most often, a common-item design is used in linking parameter estimates for the new items to the IRT scale used for adaptive testing. In such cases, stability checks should be made on the statistical characteristics of the common items, and the number of common items should be sufficient to yield stable results. The adequacy of the assumptions needed to link scores across pools should be checked.

Many other examples of linking exist that may not result in interchangeable scores, including the following:

• For the evaluation of examinee growth over time, it may be desirable to develop vertical scales that span a broad range of developmental or educational levels. The development of vertical scales typically requires linking of tests that are purposefully constructed to differ in difficulty.

• Test revision often brings a need to link scores obtained using newer and older test specifications.

• International comparative studies may require linking of scores on tests given in different languages.

• Scores may be linked on tests measuring different constructs, perhaps comparing an aptitude with a form of behavior, or linking measures of achievement in several content areas or across different test publishers.

• Sometimes linkings are made to compare performance of groups (e.g., school districts, states) on different measures of similar constructs, such as when linking scores on a state achievement test to scores on an international assessment.

• Results from linking studies are sometimes aligned or presented in a concordance table to aid users in estimating performance on one test from performance on another.

• In situations where complex item types are used, score linking is sometimes conducted through judgments about the comparability of item content from one test to another. For example, writing prompts built to be similar, where responses are scored using a common rubric, might be assumed to be equivalent in difficulty. When possible, these linkings should be checked empirically.

• In some situations, judgmental methods are used to link scores across tests. In these situations, the judgment processes and their reliability should be well documented and the rationale for their use should be clear.

Processes used to facilitate comparisons may be described with terms such as linking, calibration, concordance, vertical scaling, projection, or moderation. These processes may be technically sound and may fully satisfy desired goals of comparability for one purpose or for one relevant subgroup of examinees, but they cannot be assumed to be stable over time or invariant across multiple subgroups of the examinee population, nor is there any assurance that scores obtained using different tests will be equally precise. Thus, their use for other purposes or with other populations than the originally intended population may require additional support. For example, a score conversion that was accurate for a group of native speakers might systematically overpredict or underpredict the scores of a group of nonnative speakers.

Cut Scores

A critical step in the development and use of some tests is to establish one or more cut scores dividing the score range to partition the distribution of scores into categories. These categories may be used just for descriptive purposes or may be used to distinguish among examinees for whom different programs are deemed desirable or different predictions are warranted. An employer may determine a cut score to screen potential employees or to promote current employees; proficiency levels of “basic,” “proficient,” and “advanced” may be established using standard-setting methods to set cut scores on a state test of mathematics achievement in fourth grade; educators may want to use test scores to identify students who are prepared to go on to college and take credit-bearing courses; or in granting a professional license, a state may specify a minimum passing score on a licensure test.

These examples differ in important respects, but all involve delineating categories of examinees on the basis of test scores. Such cut scores provide the basis for using and interpreting test results. Thus, in some situations, the validity of test score interpretations may hinge on the cut scores. There can be no single method for determining cut scores for all tests or for all purposes, nor can there be any single set of procedures for establishing their defensibility. In addition, although cut scores are helpful for informing selection, placement, and other classifications, it should be acknowledged that such categorical decisions are rarely made on the basis of test performance alone. The examples that follow serve only as illustrations.

The first example, that of an employer interviewing all those who earn scores above a given level on an employment test, is the most straightforward. Assuming that validity evidence has been provided for scores on the employment test for its intended use, average job performance typically would be expected to rise steadily, albeit slowly, with each increment in test score, at least for some range of scores surrounding the cut score. In such a case the designation of the particular value for the cut score may be largely determined by the number of persons to be interviewed or further screened.

In the second example, a state department of education establishes content standards for what fourth-grade students are to learn in mathematics and implements a test for assessing student achievement on these standards. Using a structured, judgmental standard-setting process, committees of subject matter experts develop or elaborate on performance-level descriptors (sometimes referred to as achievement-level descriptors) that indicate what students at achievement levels of “basic,” “proficient,” and “advanced” should know and be able to do in fourth-grade mathematics. In addition, committees examine test items and student performance to recommend cut scores that are used to assign students to each achievement level based on their test performance. The final decision about the cut scores is a policy decision typically made by a policy body such as the board of education for the state.

In the third example, educators wish to use test scores to identify students who are prepared to go on to college and take credit-bearing courses. Cut scores might initially be identified based on judgments about requirements for taking credit-bearing courses across a range of colleges. Alternatively, judgments about individual students might be collected and then used to find a score level that most effectively differentiates those judged to be prepared from those judged not to be. In such cases, judges must be familiar with both the college course requirements and the students themselves. Where possible, initial judgments could be followed up with longitudinal data indicating whether former examinees did or did not have to take remedial courses.

In the final example, that of a professional licensure examination, the cut score represents an informed judgment that those scoring below it are at risk of making serious errors because they lack the knowledge or skills tested. No test is perfect, of course, and regardless of the cut score chosen, some examinees with inadequate skills are likely to pass, and some with adequate skills are likely to fail. The relative probabilities of such false positive and false negative errors will vary depending on the cut score chosen. A given probability of exposing the public to potential harm by issuing a license to an incompetent individual (false positive) must be weighed against some corresponding probability of denying a license to, and thereby disenfranchising, a qualified examinee (false negative). Changing the cut score to reduce either probability will increase the other, although both kinds of errors can be minimized through sound test design that anticipates the role of the cut score in test use and interpretation. Determining cut scores in such situations cannot be a purely technical matter, although empirical studies and statistical models can be of great value in informing the process.
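To make this tradeoff concrete, the following is a minimal, illustrative sketch in Python using simulated, hypothetical score distributions for qualified and unqualified examinees; it is not drawn from any actual licensure program. It simply shows how the false-negative and false-positive rates move in opposite directions as the cut score is raised.

```python
# Illustrative only: how false-positive and false-negative rates trade off
# as a licensure cut score is moved. Score distributions are simulated
# (hypothetical normal distributions), not drawn from any real program.
import numpy as np

rng = np.random.default_rng(0)
competent = rng.normal(loc=75, scale=8, size=10_000)      # truly qualified examinees
not_competent = rng.normal(loc=60, scale=8, size=10_000)  # truly unqualified examinees

for cut in (60, 65, 70, 75):
    false_negative = np.mean(competent < cut)       # qualified examinees who would fail
    false_positive = np.mean(not_competent >= cut)  # unqualified examinees who would pass
    print(f"cut={cut}: false-negative rate={false_negative:.2f}, "
          f"false-positive rate={false_positive:.2f}")
```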

Cut scores embody value judgments as well as technical and empirical considerations. Where the results of the standard-setting process have highly significant consequences, those involved in the standard-setting process should be concerned that the process by which cut scores are determined be clearly documented and that it be defensible. When standard setting involves judges or subject matter experts, their qualifications and the process by which they were selected are part of that documentation. Care must be taken to ensure that these persons understand what they are to do and that their judgments are as thoughtful and objective as possible. The process must be such that well-qualified participants can apply their knowledge and experience to reach meaningful and relevant judgments that accurately reflect their understandings and intentions. A sufficiently large and representative group of participants should be involved to provide reasonable assurance that the expert ratings across judges are sufficiently reliable and that the results of the judgments would not vary greatly if the process were replicated.


STANDARDS FOR SCORES, SCALES, NORMS, SCORE LINKING, AND CUT SCORES

The standards in this chapter begin with an overarching standard (numbered 5.0), which is designed to convey the central intent or primary focus of the chapter. The overarching standard may also be viewed as the guiding principle of the chapter, and is applicable to all tests and test users. All subsequent standards have been separated into four thematic clusters labeled as follows:

1. Interpretations of Scores
2. Norms
3. Score Linking
4. Cut Scores

Standard 5.0

Test scores should be derived in a way that supports the interpretations of test scores for the proposed uses of tests. Test developers and users should document evidence of fairness, reliability, and validity of test scores for their proposed use.

Comment: Specific standards for various uses and interpretations of test scores and score scales are described below. These include standards for norm-referenced and criterion-referenced interpretations, interpretations of cut scores, interchangeability of scores on alternate forms following equating, and score comparability following the use of other procedures for score linking. Documentation supporting such interpretations provides a basis for external experts and test users to judge the extent to which the interpretations are likely to be supported and can lead to valid interpretations of scores for all individuals in the intended examinee population.

Cluster 1. Interpretations of Scores

Standard 5.1

Test users should be provided with clear explanations of the characteristics, meaning, and intended interpretation of scale scores, as well as their limitations.

Comment: Illustrations of appropriate and inappropriate interpretations may be helpful, especially for types of scales or interpretations that are unfamiliar to most users. This standard pertains to score scales intended for criterion-referenced as well as norm-referenced interpretations. All scores (raw scores or scale scores) may be subject to misinterpretation. If the nature or intended uses of a scale are novel, it is especially important that its uses, interpretations, and limitations be clearly described.

Standard 5.2

The procedures for constructing scales used for reporting scores and the rationale for these procedures should be described clearly.

Comment: When scales, norms, or other interpretive systems are provided by the test developer, technical documentation should describe their rationale and enable users to judge the quality and precision of the resulting scale scores. For example, the test developer should describe any normative, content, or score precision information that is incorporated into the scale and provide a rationale for the number of score points that are used. This standard pertains to score scales intended for criterion-referenced as well as norm-referenced interpretations.

Standard 5.3

If there is sound reason to believe that specific misinterpretations of a score scale are likely, test users should be explicitly cautioned.

Comment: Test publishers and users can reduce misinterpretations of scale scores if they explicitly describe both appropriate uses and potential misuses. For example, a score scale point originally defined as the mean of some reference population should no longer be interpreted as representing average performance if the scale is held constant over time and the examinee population changes. Similarly, caution is needed if score meanings may vary for some test takers, such as the meaning of achievement scores for students who have not had adequate opportunity to learn the material covered by the test.

Standard 5.4

When raw scores are intended to be directly interpretable, their meanings, intended interpretations, and limitations should be described and justified in the same manner as is done for scale scores.

Comment: In some cases the items in a test are a representative sample of a well-defined domain of items with regard to both content and item difficulty. The proportion answered correctly on the test may then be interpreted as an estimate of the proportion of items in the domain that could be answered correctly. In other cases, different interpretations may be attached to scores above or below a particular cut score. Support should be offered for any such interpretations recommended by the test developer.

Standard 5.5

When raw scores or scale scores are designed for criterion-referenced interpretation, including the classification of examinees into separate categories, the rationale for recommended score interpretations should be explained clearly.

Comment: Criterion-referenced interpretations are score-based descriptions or inferences that do not take the form of comparisons of an examinee’s test performance with the test performance of other examinees. Examples include statements that some psychopathology is likely present, that a prospective employee possesses specific skills required in a given position, or that a child scoring above a certain score point can successfully apply a given set of skills. Such interpretations may refer to the absolute levels of test scores or to patterns of scores for an individual examinee. Whenever the test developer recommends such interpretations, the rationale and empirical basis should be presented clearly. Serious efforts should be made whenever possible to obtain independent evidence concerning the soundness of such score interpretations.

Standard 5.6

Testing programs that attempt to maintain a common scale over time should conduct periodic checks of the stability of the scale on which scores are reported.

Comment: The frequency of such checks depends on various characteristics of the testing program. In some testing programs, items are introduced into and retired from item pools on an ongoing basis. In other cases, the items in successive test forms may overlap very little, or not at all. In either case, if a fixed scale is used for reporting, it is important to ensure that the meaning of the scale scores does not change over time. When scales are based on the subsequent application of precalibrated item parameter estimates using item response theory, periodic analyses of item parameter stability should be routinely undertaken.
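As one illustration of the kind of periodic check described in this comment, the sketch below compares IRT difficulty estimates for the same items from an earlier calibration and a current administration and flags items showing large drift. The item labels, parameter values, and the 0.3-logit flagging threshold are hypothetical and illustrative, not prescribed values.

```python
# A minimal sketch of a periodic item-parameter stability check, assuming
# IRT difficulty (b) estimates are available for the same items from an
# earlier calibration and from the current administration.
import numpy as np

old_b = np.array([-1.20, -0.45, 0.10, 0.85, 1.60])   # earlier calibration (hypothetical)
new_b = np.array([-1.15, -0.40, 0.55, 0.80, 1.58])   # current administration (hypothetical)
item_ids = ["item01", "item02", "item03", "item04", "item05"]

drift = new_b - old_b
flag_threshold = 0.30   # illustrative threshold, not a prescribed criterion
for item, d in zip(item_ids, drift):
    status = "FLAG for review" if abs(d) > flag_threshold else "stable"
    print(f"{item}: drift = {d:+.2f} logits ({status})")
```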

Standard 5.7

When standardized tests or testing procedures are changed for relevant subgroups of test takers, the individual or group making the change should provide evidence of the comparability of scores on the changed versions with scores obtained on the original versions of the tests. If evidence is lacking, documentation should be provided that cautions users that scores from the changed test or testing procedure may not be comparable with those from the original version.

Comment: Sometimes it becomes necessary to change original versions of a test or testing procedure when the test is given to relevant subgroups of the testing population, for example, individuals with disabilities or individuals with diverse linguistic and cultural backgrounds. A test may be translated into braille so that it is accessible to individuals who are blind, or the testing procedure may be changed to include extra time for certain groups of examinees. These changes may or may not have an effect on the underlying constructs that are measured by the test and, consequently, on the score conversions used with the test. If scores on the changed test will be compared with scores on the original test, the test developer should provide empirical evidence of the comparability of scores on the changed and original test whenever sample sizes are sufficiently large to provide this type of evidence.

Cluster 2. Norms

Standard 5.8

Norms, if used, should refer to clearly described populations. These populations should include individuals or groups with whom test users will ordinarily wish to compare their own examinees.

Comment: It is the responsibility of test developers to describe norms clearly and the responsibility of test users to use norms appropriately. Users need to know the applicability of a test to different groups. Differentiated norms or summary information about differences between gender, racial/ethnic, language, disability, grade, or age groups, for example, may be useful in some cases. The permissible uses of such differentiated norms and related information may be limited by law. Users also need to be alerted to situations in which norms are less appropriate for some groups or individuals than others. On an occupational interest inventory, for example, norms for persons actually engaged in an occupation may be inappropriate for interpreting the scores of persons not so engaged.

Standard 5.9

Reports of norming studies should include precise specification of the population that was sampled, sampling procedures and participation rates, any weighting of the sample, the dates of testing, and descriptive statistics. Technical documentation should indicate the precision of the norms themselves.

Comment: The information provided should be sufficient to enable users to judge the appropriateness of the norms for interpreting the scores of local examinees. The information should be presented so as to comply with applicable legal requirements and professional standards relating to privacy and data security.

Standard 5.10

When norms are used to characterize examinee groups, the statistics used to summarize each group’s performance and the norms to which those statistics are referred should be defined clearly and should support the intended use or interpretation.

Comment: It is not possible to determine the percentile rank of a school’s average test score if all that is known is the percentile rank of each of that school’s students. It may sometimes be useful to develop special norms for group means, but when the sizes of the groups differ materially or when some groups are much more heterogeneous than others, the construction and interpretation of group norms is problematic. One common and acceptable procedure is to report the percentile rank of the median group member, for example, the median percentile rank of the pupils tested in a given school.
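The group-summary procedure mentioned in this comment can be illustrated with a minimal sketch; the pupil percentile ranks below are hypothetical.

```python
# A minimal sketch of the group-summary approach described above: report
# the median percentile rank of the pupils tested in a school rather than
# treating the school's mean score as if it had an individual percentile
# rank. The percentile ranks below are hypothetical.
import statistics

pupil_percentile_ranks = [12, 34, 35, 48, 52, 61, 66, 70, 88]
school_summary = statistics.median(pupil_percentile_ranks)
print(f"Median percentile rank of pupils tested: {school_summary}")
```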

Standard 5.11

If a test publisher provides norms for use in test score interpretation, then as long as the test remains in print, it is the test publisher’s responsibility to renorm the test with sufficient frequency to permit continued accurate and appropriate score interpretations.

Comment: Test publishers should ensure that up-to-date norms are readily available or provide evidence that older norms are still appropriate. However, it remains the test user’s responsibility to avoid inappropriate use of norms that are out of date and to strive to ensure accurate and appropriate score interpretations.

Cluster 3. Score Linking

Standard 5.12

A clear rationale and supporting evidence should be provided for any claim that scale scores earned on alternate forms of a test may be used interchangeably.

Comment: For scores on alternate forms to be used interchangeably, the alternate forms must be built to common detailed content and statistical specifications. Adequate data should be collected and appropriate statistical methodology should be applied to conduct the equating of scores on alternate test forms. The quality of the equating should be evaluated to assess whether the resulting scale scores on the alternate forms can be used interchangeably.

Standard 5.13

When claims of form-to-form score equivalence are based on equating procedures, detailed technical information should be provided on the method by which equating functions were established and on the accuracy of the equating functions.

Comment: Evidence should be provided to show that equated scores on alternate forms measure essentially the same construct with very similar levels of reliability and conditional standard errors of measurement and that the results are appropriate for relevant subgroups. Technical information should include the design of the equating study, the statistical methods used, the size and relevant characteristics of examinee samples used in equating studies, and the characteristics of any anchor tests or anchor items. For tests for which equating is conducted prior to operational use (i.e., pre-equating), documentation of the item calibration process should be provided and the adequacy of the equating functions should be evaluated following operational administration. When equivalent forms of computer-based tests are constructed dynamically, the algorithms used should be documented and the technical characteristics of alternate forms should be evaluated based on simulation and/or analysis of administration data. Standard errors of equating functions should be estimated and reported whenever possible. Sample sizes permitting, it may be informative to assess whether equating functions developed for relevant subgroups of examinees are similar. It may also be informative to use two or more anchor forms and to conduct the equating using each of the anchors. To be most useful, equating error should be presented in units of the reported score scale. For testing programs with cut scores, equating error near the cut score is of primary importance.
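One way to report an equating function and its precision in reported-score units, as this comment suggests, is sketched below under simplifying assumptions: a randomly-equivalent-groups design, simple linear (mean-sigma) equating, simulated scores, and a bootstrap estimate of the standard error of equating. Operational equating programs typically use more refined methods and documented designs; this is illustrative only.

```python
# A minimal sketch (not a full equating procedure) of linear equating under
# a randomly-equivalent-groups design, with a bootstrap estimate of the
# standard error of equating in reported-score units. Scores are simulated
# and purely illustrative.
import numpy as np

rng = np.random.default_rng(1)
form_x = rng.normal(50, 10, size=2_000)   # group taking the base form (hypothetical)
form_y = rng.normal(48, 11, size=2_000)   # group taking the new form (hypothetical)

def linear_equate(y_scores, x_scores, y_value):
    """Map a Form Y score onto the Form X scale by matching means and SDs."""
    slope = np.std(x_scores, ddof=1) / np.std(y_scores, ddof=1)
    return np.mean(x_scores) + slope * (y_value - np.mean(y_scores))

score_points = np.array([30, 40, 48, 56, 66])
equated = linear_equate(form_y, form_x, score_points)

# Bootstrap standard errors of the equating function at each score point.
boot = np.empty((500, len(score_points)))
for b in range(500):
    xs = rng.choice(form_x, size=form_x.size, replace=True)
    ys = rng.choice(form_y, size=form_y.size, replace=True)
    boot[b] = linear_equate(ys, xs, score_points)
se = boot.std(axis=0, ddof=1)

for y, ex, s in zip(score_points, equated, se):
    print(f"Form Y score {y:>4.0f} -> equated Form X score {ex:6.2f} (SE = {s:.2f})")
```

Reporting the standard error at score points near any operational cut score, as the comment notes, is of primary importance for programs that classify examinees.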

Standard 5.14

In equating studies that rely on the statistical equivalence of examinee groups receiving different forms, methods of establishing such equivalence should be described in detail.

Comment: Certain equating designs rely on the random equivalence of groups receiving different forms. Often, one way to ensure such equivalence is to mix systematically different test forms and then distribute them in a random fashion so that roughly equal numbers of examinees receive each form. Because administration designs intended to yield equivalent groups are not always adhered to in practice, the equivalence of groups should be evaluated statistically.
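A minimal sketch of one such statistical check follows, assuming a hypothetical background variable (prior-year scores) is available for the examinees spiraled to each form; the 0.1 standardized-mean-difference flagging value is illustrative, not a standard.

```python
# A minimal sketch of one statistical check on group equivalence under a
# random-groups (spiraled-forms) design: compare the two groups on a
# background variable (here a hypothetical prior test score) using a
# standardized mean difference. The data and the 0.1 flagging threshold
# are illustrative only.
import numpy as np

rng = np.random.default_rng(2)
group_form_a = rng.normal(500, 100, size=1_500)   # prior scores, examinees spiraled to Form A
group_form_b = rng.normal(498, 100, size=1_500)   # prior scores, examinees spiraled to Form B

pooled_sd = np.sqrt((group_form_a.var(ddof=1) + group_form_b.var(ddof=1)) / 2)
smd = (group_form_a.mean() - group_form_b.mean()) / pooled_sd
print(f"Standardized mean difference on prior scores: {smd:+.3f}")
if abs(smd) > 0.1:
    print("Groups may not be equivalent; investigate the spiraling/administration design.")
else:
    print("No practically important difference detected on this background variable.")
```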

Standard 5.15

In equating studies that employ an anchor test design, the characteristics of the anchor test and its similarity to the forms being equated should be presented, including both content specifications and empirically determined relationships among test scores. If anchor items are used in the equating study, the representativeness and psychometric characteristics of the anchor items should be presented.


Comment: Scores on tests or test forms may be equated via common items embedded within each of them, or a common test administered together with each of them. These common items or tests are referred to as linking items, common items, anchor items, or anchor tests. Statistical procedures applied to anchor items make assumptions that substitute for the equivalence achieved with an equivalent groups design. Performances on these items are the only empirical evidence used to adjust for differences in ability between groups before making adjustments for test difficulty. With such approaches, the quality of the resulting equating depends strongly on the number of the anchor items used and how well the anchor items proportionally reflect the content and statistical characteristics of the test. The content of the anchor items should be exactly the same in each test form to be equated. The anchor items should be in similar positions to help reduce error in equating due to item context effects. In addition, checks should be made to ensure that, after controlling for examinee group differences, the anchor items have similar statistical characteristics on each test form.
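The check on anchor-item behavior described at the end of this comment might be sketched as follows, using hypothetical classical item p-values for each form's group and an illustrative 0.10 flagging threshold; operational programs often use more formal drift or differential-functioning analyses.

```python
# A minimal sketch of a check that anchor items behave similarly on each
# form: compare each anchor item's proportion correct (p-value) in the two
# administrations and flag items whose difference departs markedly from
# the overall pattern. Values and the 0.10 threshold are illustrative.
import numpy as np

anchor_ids = ["A1", "A2", "A3", "A4", "A5", "A6"]
p_form_x = np.array([0.82, 0.74, 0.66, 0.58, 0.49, 0.35])  # p-values, Form X group (hypothetical)
p_form_y = np.array([0.80, 0.73, 0.52, 0.57, 0.47, 0.34])  # p-values, Form Y group (hypothetical)

diff = p_form_x - p_form_y
center = np.median(diff)        # overall shift attributable to group differences
threshold = 0.10                # illustrative flagging threshold
for item, d in zip(anchor_ids, diff):
    outlying = abs(d - center) > threshold
    note = "FLAG: check for context effects or exposure" if outlying else "consistent"
    print(f"{item}: p-value difference = {d:+.3f} ({note})")
```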

Standard 5.16

When test scores are based on model-based psychometric procedures, such as those used in computerized adaptive or multistage testing, documentation should be provided to indicate that the scores have comparable meaning over alternate sets of test items.

Comment: When model-based psychometric procedures are used, technical documentation should be provided that supports the comparability of scores over alternate sets of items. Such documentation should include the assumptions and procedures that were used to establish comparability, including clear descriptions of model-based algorithms, software used, quality control procedures followed, and technical analyses conducted that justify the use of the psychometric models for the particular test scores that are intended to be comparable.

Standard 5.17

When scores on tests that cannot be equated are linked, direct evidence of score comparability should be provided, and the examinee population for which score comparability applies should be specified clearly. The specific rationale and the evidence required will depend in part on the intended uses for which score comparability is claimed.

Comment: Support should be provided for any assertion that linked scores obtained using tests built to different content or statistical specifications, tests that use different testing materials, or tests that are administered under different test administration conditions are comparable for the intended purpose. For these links, the examinee population for which score comparability is established should be specified clearly. This standard applies, for example, to tests that differ in length, tests administered in different formats (e.g., paper-and-pencil and computer-based tests), test forms designed for individual versus group administration, tests that are vertically scaled, computerized adaptive tests, tests that are revised substantially, tests given in different languages, tests administered under various accommodations, tests measuring different constructs, and tests from different publishers.

Standard 5.18

When linking procedures are used to relate scores on tests or test forms that are not closely parallel, the construction, intended interpretation, and limitations of those linkings should be described clearly.

Comment: Various linkings have been conducted relating scores on tests developed at different levels of difficulty, relating earlier to revised forms of published tests, creating concordances between different tests of similar or different constructs, or for other purposes. Such linkings often are useful, but they may also be subject to misinterpretation. The limitations of such linkings should be described clearly. Detailed technical information should be provided on the linking methodology and the quality of the linking. Technical information about the linking should include, as appropriate, the reliability of the sets of scores being linked, the correlation between test scores, an assessment of content similarity, the conditions of measurement for each test, the data collection design, the statistical methods used, the standard errors of the linking function, evaluations of sampling stability, and assessments of score comparability.

Standard 5.19

When tests are created by taking a subset of the items in an existing test or by rearranging items, evidence should be provided that there are no distortions of scale scores, cut scores, or norms for the different versions or for score linkings between them.

Comment: Some tests and test batteries are published in both a full-length version and a survey or short version. In other cases, multiple versions of a single test form may be created by rearranging its items. It should not be assumed that performance data derived from the administration of items as part of the initial version can be used to compute scale scores, compute linked scores, construct conversion tables, approximate norms, or approximate cut scores for alternative intact tests. Caution is required in cases where context effects are likely, including speeded tests, long tests where fatigue may be a factor, adaptive tests, and tests developed from calibrated item pools. Options for gathering evidence related to context effects might include examinations of model-data fit, operational recalibrations of item parameter estimates initially derived using pretest data, and comparisons of performance on original and revised test forms as administered to randomly equivalent groups.

Standard 5.20

If test specifications are changed from one version of a test to a subsequent version, such changes should be identified, and an indication should be given that converted scores for the two versions may not be strictly equivalent, even when statistical procedures have been used to link scores from the different versions. When substantial changes in test specifications occur, scores should be reported on a new scale, or a clear statement should be provided to alert users that the scores are not directly comparable with those on earlier versions of the test.

Comment: Major shifts sometimes occur in the specifications of tests that are used for substantial periods of time. Often such changes take advantage of improvements in item types or shifts in content that have been shown to improve validity and therefore are highly desirable. It is important to recognize, however, that such shifts will result in scores that cannot be made strictly interchangeable with scores on an earlier form of the test, even when statistical linking procedures are used. To assess score comparability, it is advisable to evaluate the relationship between scores on the old and new versions.

Cluster 4. Cut Scores

Standard 5.21

When proposed score interpretations involve one or more cut scores, the rationale and procedures used for establishing cut scores should be documented clearly.

Comment: Cut scores may be established to select a specified number of examinees (e.g., to identify a fixed number of job applicants for further screening), in which case little further documentation may be needed concerning the specific question of how the cut scores are established, although attention should be paid to the rationale for using the test in selection and the precision of comparisons among examinees. In other cases, however, cut scores may be used to classify examinees into distinct categories (e.g., diagnostic categories, proficiency levels, or passing versus failing) for which there are no pre-established quotas. In these cases, the standard-setting method must be documented in more detail. Ideally, the role of cut scores in test use and interpretation is taken into account during test design. Adequate precision in regions of score scales where cut scores are established is prerequisite to reliable classification of examinees into categories. If standard setting employs data on the score distributions for criterion groups or on the relation of test scores to one or more criterion variables, those data should be summarized in technical documentation. If a judgmental standard-setting process is followed, the method employed should be described clearly, and the precise nature and reliability of the judgments called for should be presented, whether those are judgments of persons, of item or test performances, or of other criterion performances predicted by test scores. Documentation should also include the selection and qualifications of standard-setting panel participants, training provided, any feedback to participants concerning the implications of their provisional judgments, and any opportunities for participants to confer with one another. Where applicable, variability over participants should be reported. Whenever feasible, an estimate should be provided of the amount of variation in cut scores that might be expected if the standard-setting procedure were replicated with a comparable standard-setting panel.
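A minimal sketch of reporting variability over participants and the likely replication variation follows, assuming each panelist's overall recommended cut score is available; all values are hypothetical.

```python
# A minimal sketch of documenting variability in panelist judgments from a
# judgmental standard-setting study: each value is one panelist's
# recommended cut score (hypothetical data). The standard error of the
# panel mean is one rough indication of how much the cut score might vary
# if the procedure were replicated with a comparable panel.
import statistics

panelist_cuts = [32.5, 35.0, 33.0, 36.5, 34.0, 31.5, 35.5, 33.5]

mean_cut = statistics.mean(panelist_cuts)
sd_across_panelists = statistics.stdev(panelist_cuts)
se_of_mean = sd_across_panelists / len(panelist_cuts) ** 0.5

print(f"Recommended cut score (panel mean): {mean_cut:.2f}")
print(f"Variability across panelists (SD): {sd_across_panelists:.2f}")
print(f"Approximate standard error of the panel mean: {se_of_mean:.2f}")
```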

Standard 5.22

When cut scores defining pass-fail or proficiency levels are based on direct judgments about the adequacy of item or test performances, the judgmental process should be designed so that the participants providing the judgments can bring their knowledge and experience to bear in a reasonable way.

Comment: Cut scores are sometimes based on judgments about the adequacy of item or test performances (e.g., essay responses to a writing prompt) or proficiency expectations (e.g., the scale score that would characterize a borderline examinee). The procedures used to elicit such judgments should result in reasonable, defensible proficiency standards that accurately reflect the standard-setting participants’ values and intentions. Reaching such judgments may be most straightforward when participants are asked to consider kinds of performances with which they are familiar and for which they have formed clear conceptions of adequacy or quality. When the responses elicited by a test neither sample nor closely simulate the use of tested knowledge or skills in the actual criterion domain, participants are not likely to approach the task with such clear understandings of adequacy or quality. Special care must then be taken to ensure that participants have a sound basis for making the judgments requested. Thorough familiarity with descriptions of different proficiency levels, practice in judging task difficulty with feedback on accuracy, the experience of actually taking a form of the test, feedback on the pass rates entailed by provisional proficiency standards, and other forms of information may be beneficial in helping participants to reach sound and principled decisions.

Standard 5.23

When feasible and appropriate, cut scores defining categories with distinct substantive interpretations should be informed by sound empirical data concerning the relation of test performance to the relevant criteria.

Comment: In employment settings where it has been established that test scores are related to job performance, the precise relation of test and criterion may have little bearing on the choice of a cut score, if the choice is based on the need for a predetermined number of candidates. However, in contexts where distinct interpretations are applied to different score categories, the empirical relation of test to criterion assumes greater importance. For example, if a cut score is to be set on a high school mathematics test indicating readiness for college-level mathematics instruction, it may be desirable to collect empirical data establishing a relationship between test scores and grades obtained in relevant college courses. Cut scores used in interpreting diagnostic tests may be established on the basis of empirically determined score distributions for criterion groups. With many achievement or proficiency tests, such as those used in credentialing, suitable criterion groups (e.g., successful versus unsuccessful practitioners) are often unavailable. Nevertheless, when appropriate and feasible, the test developer should investigate and report the relation between test scores and performance in relevant practical settings. Professional judgment is required to determine an appropriate standard-setting approach (or combination of approaches) in any given situation. In general, one would not expect to find a sharp difference in levels of the criterion variable between those just below and those just above the cut score, but evidence should be provided, where feasible, of a relationship between test and criterion performance over a score interval that includes or approaches the cut score.
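As a minimal, illustrative sketch of the kind of evidence described here, the code below summarizes a simulated (hypothetical) data set by reporting the proportion of examinees meeting the criterion within score bands spanning an assumed cut score of 24; the bands, cut score, and data are not from any real program.

```python
# A minimal sketch of summarizing the test-criterion relationship over a
# score interval that includes the cut score: the proportion of examinees
# reaching the criterion (e.g., success in the relevant college course)
# within each test-score band. Scores, outcomes, bands, and the cut score
# of 24 are all hypothetical.
import numpy as np

rng = np.random.default_rng(3)
scores = rng.integers(15, 36, size=1_000)            # hypothetical test scores
p_success = 1 / (1 + np.exp(-(scores - 24) / 3))     # simulated criterion relation
criterion_met = rng.random(1_000) < p_success        # True = succeeded on the criterion

cut_score = 24
bands = [(15, 18), (19, 21), (22, 24), (25, 27), (28, 30), (31, 35)]
for low, high in bands:
    in_band = (scores >= low) & (scores <= high)
    rate = criterion_met[in_band].mean()
    marker = " <- interval containing the cut score" if low <= cut_score <= high else ""
    print(f"scores {low:>2}-{high:<2}: n={in_band.sum():4d}, success rate={rate:.2f}{marker}")
```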

6. TEST ADMINISTRATION, SCORING, REPORTING, AND INTERPRETATION

BACKGROUND

The usefulness and interpretability of test scores require that a test be administered and scored according to the test developer’s instructions. When directions, testing conditions, and scoring follow the same detailed procedures for all test takers, the test is said to be standardized. Without such standardization, the accuracy and comparability of score interpretations would be reduced. For tests designed to assess the test taker’s knowledge, skills, abilities, or other personal characteristics, standardization helps to ensure that all test takers have the same opportunity to demonstrate their competencies. Maintaining test security also helps ensure that no one has an unfair advantage. The importance of adherence to appropriate standardization of administration procedures increases with the stakes of the test.

Sometimes, however, situations arise in which variations from standardized procedures may be advisable or legally mandated. For example, individuals with disabilities and persons of different linguistic backgrounds, ages, or familiarity with testing may need nonstandard modes of test administration or a more comprehensive orientation to the testing process, so that all test takers can have an unobstructed opportunity to demonstrate their standing on the construct(s) being measured. Different modes of presenting the test or its instructions, or of responding, may be suitable for specific individuals, such as persons with some kinds of disability, or persons with limited proficiency in the language of the test, in order to provide appropriate access to reduce construct-irrelevant variance (see chap. 3, “Fairness in Testing”). In clinical or neuropsychological testing situations, flexibility in administration may be required, depending on the individual’s ability to comprehend and respond to test items or tasks and/or the construct required to be measured. Some situations and/or the construct (e.g., testing for memory impairment in a test taker with dementia who is in a hospital) may require that the assessment be abbreviated or altered. Large-scale testing programs typically establish specific procedures for considering and granting accommodations and other variations from standardized procedures. Usually these accommodations themselves are somewhat standardized; occasionally, some alternative other than the accommodations foreseen and specified by the test developer may be indicated. Appropriate care should be taken to avoid unfair treatment and discrimination. Although variations may be made with the intent of maintaining score comparability, the extent to which that is possible often cannot be determined. Comparability of scores may be compromised, and the test may then not measure the same constructs for all test takers.

Tests and assessments differ in their degree of standardization. In many instances, different test takers are not given the same test form but receive equivalent forms that have been shown to yield comparable scores, or alternate test forms where scores are adjusted to make them comparable. Some assessments permit test takers to choose which tasks to perform or which pieces of their work are to be evaluated. Standardization can be maintained in these situations by specifying the conditions of the choice and the criteria for evaluation of the products. When an assessment permits a certain kind of collaboration between test takers or between test taker and test administrator, the limits of that collaboration should be specified. With some assessments, test administrators may be expected to tailor their instructions to help ensure that all test takers understand what is expected of them. In all such cases, the goal remains the same: to provide accurate, fair, and comparable measurement for everyone. The degree of standardization is dictated by that goal, and by the intended use of the test score.

Standardized directions help ensure that all test takers have a common understanding of the mechanics of test taking. Directions generally inform test takers on how to make their responses, what kind of help they may legitimately be given if they do not understand the question or task, how they can correct inadvertent responses, and the nature of any time constraints. General advice is sometimes given about omitting item responses. Many tests, including computer-administered tests, require special equipment or software. Instruction and practice exercises are often presented in such cases so that the test taker understands how to operate the equipment or software. The principle of standardization includes orienting test takers to materials and accommodations with which they may not be familiar. Some equipment may be provided at the testing site, such as shop tools or software systems. Opportunity for test takers to practice with the equipment will often be appropriate, unless ability to use the equipment is the construct being assessed.

Tests are sometimes administered via technology, with test responses entered by keyboard, computer mouse, voice input, or other devices. Increasingly, many test takers are accustomed to using computers. Those who are not may require training to reduce construct-irrelevant variance. Even those test takers who are familiar with computers may need some brief explanation and practice to manage test-specific details such as the test’s interface. Special issues arise in managing the testing environment to reduce construct-irrelevant variance, such as avoiding light reflections on the computer screen that interfere with display legibility, or maintaining a quiet environment when test takers start or finish at different times from neighboring test takers. Those who administer computer-based tests should be trained so that they can deal with hardware, software, or test administration problems. Tests administered by computer in Web-based applications may require other supports to maintain standardized environments.

Standardized scoring procedures help to ensure consistent scoring and reporting, which are essential in all circumstances. When scoring is done by machine, the accuracy of the machine, including any scoring program or algorithm, should be established and monitored. When the scoring of complex responses is done by human scorers or automatic scoring engines, careful training is required. The training typically requires expert human raters to provide a sample of responses that span the range of possible score points or ratings. Within the score point ranges, trainers should also provide samples that exemplify the variety of responses that will yield the score point or rating. Regular monitoring can help ensure that every test performance is scored according to the same standardized criteria and that the test scorers do not apply the criteria differently as they progress through the submitted test responses.

Test scores, per se, are not readily interpreted without other information, such as norms or standards, indications of measurement error, and descriptions of test content. Just as a temperature of 50 degrees Fahrenheit in January is warm for Minnesota and cool for Florida, a test score of 50 is not meaningful without some context. Interpretive material should be provided that is readily understandable to those receiving the report. Often, the test user provides an interpretation of the results for the test taker, suggesting the limitations of the results and the relationship of any reported scores to other information. Scores on some tests are not designed to be released to test takers; only broad test interpretations, or dichotomous classifications, such as “pass/fail,” are intended to be reported.

Interpretations of test results are sometimes prepared by computer systems. Such interpretations are generally based on a combination of empirical data, expert judgment, and experience and require validation. In some professional applications of individualized testing, the computer-prepared interpretations are communicated by a professional, who might modify the computer-based interpretation to fit special circumstances. Care should be taken so that test interpretations provided by nonalgorithmic approaches are appropriately consistent. Automatically generated reports are not a substitute for the clinical judgment of a professional evaluator who has worked directly with the test taker, or for the integration of other information, including but not limited to other test results, interviews, existing records, and behavioral observations.


In some large-scale assessments, the primary target of assessment is not the individual test taker but rather a larger unit, such as a school district or an industrial plant. Often, different test takers are given different sets of items, following a carefully balanced matrix sampling plan, to broaden the range of information that can be obtained in a reasonable time period. The results acquire meaning when aggregated over many individuals taking different samples of items. Such assessments may not furnish enough information to support even minimally valid or reliable scores for individuals, as each individual may take only an incomplete test, while in the aggregate, the assessment results may be valid and acceptably reliable for interpretations about performance of the larger unit.

Some further issues of administration and scoring are discussed in chapter 4, “Test Design and Development.”

Test users and those who receive test materials, test scores, and ancillary information such as test takers’ personally identifiable information are responsible for appropriately maintaining the security and confidentiality of that information.


STANDARDS FOR TEST ADMINISTRATION, SCORING, REPORTING, AND INTERPRETATION

The standards in this chapter begin with an overarching standard (numbered 6.0), which is designed to convey the central intent or primary focus of the chapter. The overarching standard may also be viewed as the guiding principle of the chapter, and is applicable to all tests and test users. All subsequent standards have been separated into three thematic clusters labeled as follows:

1. Test Administration
2. Test Scoring
3. Reporting and Interpretation

Standard 6.0

To support useful interpretations of score results, assessment instruments should have established procedures for test administration, scoring, reporting, and interpretation. Those responsible for administering, scoring, reporting, and interpreting should have sufficient training and supports to help them follow the established procedures. Adherence to the established procedures should be monitored, and any material errors should be documented and, if possible, corrected.

Comment: In order to support the validity of score interpretations, administration should follow any and all established procedures, and compliance with such procedures needs to be monitored.

Cluster 1. Test Administration

Standard 6.1

Test administrators should follow carefully the standardized procedures for administration and scoring specified by the test developer and any instructions from the test user.

Comment: Those responsible for testing programs should provide appropriate training, documentation, and oversight so that the individuals who administer or score the test(s) are proficient in the appropriate test administration or scoring procedures and understand the importance of adhering to the directions provided by the test developer. Large-scale testing programs should specify accepted standardized procedures for determining accommodations and other acceptable variations in test administration. Training should enable test administrators to make appropriate adjustments if an accommodation or modification is required that is not covered by the standardized procedures.

Specifications regarding instructions to test takers, time limits, the form of item presentation or response, and test materials or equipment should be strictly observed. In general, the same procedures should be followed as were used when obtaining the data for scaling and norming the test scores. Some programs do not scale or establish norms, such as portfolio assessments and most alternate academic assessments for students with severe cognitive disabilities. However, these programs typically have specified standardized procedures for administration and scoring when they establish performance standards. A test taker with a disability may require variations to provide access without changing the construct that is measured. Other special circumstances may require some flexibility in administration, such as language support to provide access under certain conditions, or some clinical or neuropsychological evaluations, in addition to procedures related to accommodations. Judgments of the suitability of adjustments should be tempered by the consideration that departures from standard procedures may jeopardize the validity or complicate the comparability of the test score interpretations. These judgments should be made by qualified individuals and be consistent with the guidelines provided by the test user or test developer.

Policies regarding retesting should be established by the test developer or user. The test user and administrator should follow the established policy. Such retest policies should be clearly communicated by the test user as part of the conditions for standardized test administration. Retesting is intended to decrease the probability that a person will be incorrectly classified as not meeting some standard. For example, some testing programs specify that a person may retake the test; some offer multiple opportunities to take a test, for example when passing the test is required for high school graduation or credentialing.

Test developers should specify the standardized administration conditions that support intended uses of score interpretations. Test users should be aware of the implications of less controlled administration conditions. Test users are responsible for providing technical and other support to help ensure that test administrations meet these conditions to the extent possible. However, technology and the Internet have made it possible to administer tests in many settings, including settings in which the administration conditions may not be strictly controlled or monitored. Those who allow lack of standardization are responsible for providing evidence that the lack of standardization did not affect test-taker performance or the quality or comparability of the scores produced. Complete documentation would include reporting the extent to which standardized administration conditions were not met.

Characteristics such as time limits, choices about item types and response formats, complex interfaces, and instructions that potentially add construct-irrelevant variance should be scrutinized in terms of the test purpose and the constructs being measured. Appropriate usability and empirical research should be carried out, as feasible, to document and ideally minimize the impact of sources or conditions that contribute to construct-irrelevant variability.

Standard 6.2

When formal procedures have been established for requesting and receiving accommodations, test takers should be informed of these procedures in advance of testing.

Comment: When testing programs have established procedures and criteria for identifying and providing accommodations for test takers, the procedures and criteria should be carefully followed and documented. Ideally, these procedures include how to consider the instances when some alternative may be appropriate in addition to those accommodations foreseen and specified by the test developer. Test takers should be informed of any testing accommodations that may be available to them and the process and requirements, if any, for obtaining needed accommodations. Similarly, in educational settings, appropriate school personnel and parents/legal guardians should be informed of the requirements, if any, for obtaining needed accommodations for students being tested.

Standard 6.3

Changes or disruptions to standardized test administration procedures or scoring should be documented and reported to the test user.

Comment: Information about the nature of changes to standardized administration or scoring procedures should be maintained in secure data files so that research studies or case reviews based on test records can take it into account. This includes not only accommodations or modifications for particular test takers but also disruptions in the testing environment that may affect all test takers in the testing session. A researcher may wish to use only the records based on standardized administration. In other cases, research studies may depend on such information to form groups of test takers. Test users or test sponsors should establish policies specifying who secures the data files, who may have access to the files, and, if necessary, how to maintain confidentiality of respondents, for example by de-identifying respondents. Whether the information about deviations from standard procedures is reported to users of test data depends on considerations such as whether the users are admissions officers or users of individualized psychological reports in clinical settings. If such reports are made, it may be appropriate to include clear documentation of any deviation from standard administration procedures, discussion of how such administrative variations may have affected the results, and perhaps certain cautions. For example, test users may need to be informed about the comparability of scores when modifications are provided (see chap. 3, “Fairness in Testing,” and chap. 9, “The Rights and Responsibilities of Test Users”). If a deviation or change to a standardized test administration procedure is judged significant enough to adversely affect the validity of score interpretation, then appropriate action should be taken, such as not reporting the scores, invalidating the scores, or providing opportunities for readministration under appropriate circumstances. Testing environments that are not monitored (e.g., in temporary conditions or on the Internet) should meet these standardized administration conditions; otherwise, the report on scores should note that standardized conditions were not guaranteed.

Standard 6.4

The testing environment should furnish reasonable comfort with minimal distractions to avoid construct-irrelevant variance.

Comment: Test developers should provide information regarding the intended test administration conditions and environment. Noise, disruption in the testing area, extremes of temperature, poor lighting, inadequate work space, illegible materials, and malfunctioning computers are among the conditions that should be avoided in testing situations, unless measuring the construct requires such conditions. The testing site should be readily accessible. Technology-based administrations should avoid distractions such as equipment or Internet-connectivity failures, or large variations in the time taken to present test items or score responses. Testing sessions should be monitored where appropriate to assist the test taker when a need arises and to maintain proper administrative procedures. In general, the testing conditions should be equivalent to those that prevailed when norms and other interpretative data were obtained.

Standard 6.5

Test takers should be provided appropriate instructions, practice, and other support necessary to reduce construct-irrelevant variance.

Comment: Instructions to test takers should clearly indicate how to make responses, except when doing so would obstruct measurement of the intended construct (e.g., when an individual’s spontaneous approach to the test-taking situation is being assessed). Instructions should also be given in the use of any equipment or software likely to be unfamiliar to test takers, unless accommodating to unfamiliar tools is part of what is being assessed. The functions or interfaces of computer-administered tests may be unfamiliar to some test takers, who may need to be shown how to log on, navigate, or access tools. Practice opportunities should be given when equipment is involved, unless use of the equipment is being assessed. Some test takers may need practice responding with particular means required by the test, such as filling in a multiple-choice “bubble” or interacting with a multimedia simulation. Where possible, practice responses should be monitored to confirm that the test taker is making acceptable responses. If a test taker is unable to use the equipment or make the responses, it may be appropriate to consider alternative testing modes. In addition, test takers should be clearly informed on how their rate of work may affect scores, and how certain responses, such as not responding, guessing, or responding incorrectly, will be treated in scoring, unless such directions would undermine the construct being assessed.

Standard 6.6

Reasonable efforts should be made to ensure the integrity of test scores by eliminating opportunities for test takers to attain scores by fraudulent or deceptive means.

Comment: In testing programs where the results may be viewed as having important consequences, score integrity should be supported through active efforts to prevent, detect, and correct scores obtained by fraudulent or deceptive means. Such efforts may include, when appropriate and practicable, stipulating requirements for identification, constructing seating charts, assigning test takers to seats, requiring appropriate space between seats, and providing continuous monitoring of the testing process. Test developers should design test materials and procedures to minimize the possibility of cheating. A local change in the date or time of testing may offer an opportunity for cheating. Test administrators should be trained on how to take appropriate precautions against and detect opportunities to cheat, such as opportunities afforded by technology that would allow a test taker to communicate with an accomplice outside the testing area, or technology that would allow a test taker to copy test information for subsequent disclosure. Test administrators should follow established policies for dealing with any instances of testing irregularity. In general, steps should be taken to minimize the possibility of breaches in test security, and to detect any breaches. In any evaluation of work products (e.g., portfolios), steps should be taken to ensure that the product represents the test taker’s own work, and that the amount and kind of assistance provided is consistent with the intent of the assessment. Ancillary documentation, such as the date when the work was done, may be useful. Testing programs may use technologies during scoring to detect possible irregularities, such as computer analyses of erasure patterns, similar answer patterns for multiple test takers, plagiarism from online sources, or unusual item parameter shifts. Users of such technologies are responsible for their accuracy and appropriate application. Test developers and test users may need to monitor for disclosure of test items on the Internet or from other sources. Testing programs with high-stakes consequences should have defined policies and procedures for detecting and processing potential testing irregularities, including a process by which a person charged with an irregularity can qualify for and/or present an appeal, and for invalidating test scores and providing opportunity for retesting.

Standard 6.7

Test users have the responsibility of protecting the security of test materials at all times.

Comment: Those who have test materials under their control should, with due consideration of ethical and legal requirements, take all steps necessary to ensure that only individuals with legitimate needs and qualifications for access to test materials are able to obtain such access before the test administration, and afterwards as well, if any part of the test will be reused at a later time. Concerns with inappropriate access to test materials include inappropriate disclosure of test content, tampering with test responses or results, and protection of test takers’ privacy rights. Test users must balance test security with the rights of all test takers and test users. When sensitive test documents are at issue in court or in administrative agency challenges, it is important to identify security and privacy concerns and needed protections at the outset. Parties should ensure that the release or exposure of such documents (including specific sections of those documents that may warrant redaction) to third parties, experts, and the courts/agencies themselves is consistent with conditions (often reflected in protective orders) that do not result in inappropriate disclosure and that do not risk unwarranted release beyond the particular setting in which the challenge has occurred. Under certain circumstances, when sensitive test documents are challenged, it may be appropriate to employ an independent third party, using a closely supervised secure procedure to conduct a review of the relevant materials rather than placing tests, manuals, or a test taker’s test responses in the public record. Those who have confidential information related to testing, such as registration information, scheduling, and payments, have similar responsibility for protecting that information. Those with test materials under their control should use and disclose such information only in accordance with any applicable privacy laws.


Cluster 2. Test Scoring

Standard 6.8

Those responsible for test scoring should establish scoring protocols. Test scoring that involves human judgment should include rubrics, procedures, and criteria for scoring. When scoring of complex responses is done by computer, the accuracy of the algorithm and processes should be documented.

Comment: A scoring protocol should be established, which may be as simple as an answer key for multiple-choice questions. For constructed responses, scorers, whether humans or machine programs, may be provided with scoring rubrics listing acceptable alternative responses, as well as general criteria. A common practice of test developers is to provide scoring training materials, scoring rubrics, and examples of test takers' responses at each score level. When tests or items are used over a period of time, scoring materials should be reviewed periodically.

Standard 6.9

Those responsible for test scoring should establish and document quality control processes and criteria. Adequate training should be provided. The quality of scoring should be monitored and documented. Any systematic source of scoring errors should be documented and corrected.

Comment: Criteria should be established for acceptable scoring quality. Procedures should be instituted to calibrate scorers (human or machine) prior to operational scoring, and to monitor how consistently scorers are scoring in accordance with those established standards during operational scoring. Where scoring is distributed across scorers, procedures to monitor raters' accuracy and reliability may also be useful as a quality control procedure. Consistency in applying scoring criteria is often checked by independently rescoring randomly selected test responses. Periodic checks of the statistical properties (e.g., means, standard deviations, percentage of agreement with scores previously determined to be accurate) of scores assigned by individual scorers during a scoring session can provide feedback for the scorers, helping them to maintain scoring standards. In addition, analyses might monitor possible effects on scoring accuracy of variables such as scorer, task, time or day of scoring, scoring trainer, scorer pairing, and so on, to inform appropriate corrective or preventative actions. When the same items are used in multiple administrations, programs should have procedures in place to monitor consistency of scoring across administrations (e.g., year-to-year comparability). One way to check for consistency over time is to rescore some responses from earlier administrations. Inaccurate or inconsistent scoring may call for retraining, rescoring, dismissing some scorers, and/or reexamining the scoring rubrics or programs. Systematic scoring errors should be corrected, which may involve rescoring responses previously scored, as well as correcting the source of the error. Clerical and mechanical errors should be examined. Scoring errors should be minimized and, when they are found, steps should be taken promptly to minimize their recurrence.

Typically, those responsible for scoring will document the procedures followed for scoring, the procedures followed for quality assurance of that scoring, the results of the quality assurance, and any unusual circumstances. Depending on the test user, that documentation may be provided regularly or upon reasonable request. Computerized scoring applications for text, speech, or other constructed responses should provide similar documentation of accuracy and reliability, including comparisons with human scoring.

When scoring is done locally and requires scorer judgment, the test user is responsible for providing adequate training and instruction to the scorers and for examining scorer agreement and accuracy. The expected level of scorer agreement and accuracy should be documented, as feasible.
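The periodic statistical checks described above (e.g., means, standard deviations, and percentage of exact agreement with scores previously determined to be accurate) can be computed with very little tooling. The sketch below assumes a simple data layout, one operational score and one validated score per monitored response, and is only an illustration of the kind of monitoring the comment describes, not a required procedure.

```python
from statistics import mean, stdev

def scorer_monitoring_stats(assigned, validated):
    """Summary statistics for one scorer during a scoring session.

    assigned:  scores the scorer gave to a set of monitored responses
    validated: scores previously determined to be accurate for the same responses
    Returns the mean and standard deviation of the assigned scores and the
    percentage of exact agreement with the validated scores.
    """
    if len(assigned) != len(validated):
        raise ValueError("score lists must be the same length")
    exact = sum(1 for a, v in zip(assigned, validated) if a == v)
    return {
        "mean": mean(assigned),
        "sd": stdev(assigned) if len(assigned) > 1 else 0.0,
        "pct_exact_agreement": 100.0 * exact / len(assigned),
    }

# Example: one scorer's ratings on ten responses with known accurate scores
assigned  = [3, 2, 4, 3, 1, 2, 3, 4, 2, 3]
validated = [3, 2, 4, 2, 1, 2, 3, 4, 3, 3]
print(scorer_monitoring_stats(assigned, validated))  # 80% exact agreement
```

Comparable summaries computed per scorer, per task, or per scoring day support the kinds of corrective or preventative actions the comment mentions.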


Cluster 3. Reporting and Interpretation

Standard 6.10

When test score information is released, those responsible for testing programs should provide interpretations appropriate to the audience. The interpretations should describe in simple language what the test covers, what scores represent, the precision/reliability of the scores, and how scores are intended to be used.

Comment: Test users should consult the interpretive material prepared by the test developer and should revise or supplement the material as necessary to present the local and individual results accurately and clearly to the intended audience, which may include clients, legal representatives, media, referral sources, test takers, parents, or teachers. Reports and feedback should be designed to support valid interpretations and use, and to minimize potential negative consequences. Score precision might be depicted by error bands or likely score ranges, showing the standard error of measurement. Reports should include discussion of any administrative variations or behavioral observations in clinical settings that may affect results and interpretations. Test users should avoid misinterpretation and misuse of test score information. While test users are primarily responsible for avoiding misinterpretation and misuse, the interpretive materials prepared by the test developer or publisher may address common misuses or misinterpretations. To accomplish this, developers of reports and interpretive materials may conduct research to help verify that reports and materials can be interpreted as intended (e.g., focus groups with representative end users of the reports). The test developer should inform test users of changes in the test over time that may affect test score interpretation, such as changes in norms, test content frameworks, or scale score meanings.
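Error bands of the kind mentioned above are commonly derived from the standard error of measurement (SEM) using the classical relationship SEM = SD * sqrt(1 - reliability). The sketch below is a minimal illustration of that computation; the 1.96 multiplier (an approximate 95% band) and the example scale values are assumptions, and score reports often use narrower bands such as plus or minus one SEM.

```python
import math

def score_band(observed_score, sd, reliability, z=1.96):
    """Likely score range based on the standard error of measurement.

    Uses the classical test theory relationship SEM = SD * sqrt(1 - reliability)
    and returns (lower bound, upper bound, SEM) for the chosen z multiplier.
    """
    sem = sd * math.sqrt(1.0 - reliability)
    return observed_score - z * sem, observed_score + z * sem, sem

# Example: a scale with SD = 15 and reliability = .90
low, high, sem = score_band(observed_score=108, sd=15, reliability=0.90)
print(f"SEM = {sem:.1f}; likely score range {low:.0f} to {high:.0f}")
```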

Standard 6.11

When automatically generated interpretations of test response protocols or test performance are reported, the sources, rationale, and empirical basis for these interpretations should be available, and their limitations should be described.

Comment: Interpretations of test results are sometimes automatically generated, either by a computer program in conjunction with computer scoring or by manually prepared materials. Automatically generated interpretations may not be able to take into consideration the context of the individual's circumstances. Automatically generated interpretations should be used with care in diagnostic settings, because they may not take into account other relevant information about the individual test taker that provides context for test results, such as age, gender, education, prior employment, psychosocial situation, health, psychological history, and symptomatology. Similarly, test developers and test users of automatically generated interpretations of academic performance and accompanying prescriptions for instructional follow-up should report the bases and limitations of the interpretations. Test interpretations should not imply that empirical evidence exists for a relationship among particular test results, prescribed interventions, and desired outcomes unless empirical evidence is available for populations similar to those representative of the test taker.

Standard 6.12

When group-level information is obtained by aggregating the results of partial tests taken by individuals, evidence of validity and reliability/precision should be reported for the level of aggregation at which results are reported. Scores should not be reported for individuals without appropriate evidence to support the interpretations for intended uses.

Comment: Large-scale assessments often achieve efficiency by "matrix sampling" the content domain, that is, by asking different test takers different questions.


The testing then requires less time from each test taker, while the aggregation of individual results provides for domain coverage that can be adequate for meaningful group- or program-level interpretations, such as for schools or grade levels within a locality or for particular subject areas. However, because the individual is administered only an incomplete test, an individual score would have limited meaning, if any.
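The aggregation step can be illustrated with a minimal sketch. The sketch below assumes dichotomously scored items and simply pools results over whichever test takers saw each item; it is meant only to show why group-level statistics can be meaningful even though no individual answered the full item set, not to suggest a particular operational analysis.

```python
from collections import defaultdict

def group_item_statistics(records):
    """Aggregate matrix-sampled results to the group level.

    records: iterable of (test_taker_id, item_id, score) tuples, where each
             test taker answers only the subset of items in his or her booklet.
    Returns the proportion correct per item, computed over whoever saw the item.
    """
    correct = defaultdict(int)
    attempts = defaultdict(int)
    for _, item_id, score in records:
        attempts[item_id] += 1
        correct[item_id] += score
    return {item: correct[item] / attempts[item] for item in attempts}

# Example: two booklets covering different halves of a four-item domain
records = [
    ("s1", "item1", 1), ("s1", "item2", 0),   # booklet A
    ("s2", "item1", 1), ("s2", "item2", 1),   # booklet A
    ("s3", "item3", 0), ("s3", "item4", 1),   # booklet B
    ("s4", "item3", 1), ("s4", "item4", 1),   # booklet B
]
print(group_item_statistics(records))
```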

Standard 6.13

When a material error is found in test scores or other important information issued by a testing organization or other institution, this information and a corrected score report should be distributed as soon as practicable to all known recipients who might otherwise use the erroneous scores as a basis for decision making. The corrected report should be labeled as such. What was done to correct the reports should be documented. The reason for the corrected score report should be made clear to the recipients of the report.

Comment: A material error is one that could change the interpretation of the test score and make a difference in a significant way. An example is an erroneous test score (e.g., incorrectly computed or fraudulently obtained) that would affect an important decision about the test taker, such as a credentialing decision or the awarding of a high school diploma. Innocuous typographical errors would be excluded. Timeliness is essential for decisions that will be made soon after the test scores are received. Where test results have been used to inform high-stakes decisions, corrective actions by test users may be necessary to rectify circumstances affected by erroneous scores, in addition to issuing corrected reports. The reporting or corrective actions may not be possible or practicable in certain work or other settings. Test users should develop a policy for how to handle material errors in test scores and should document what was done in the case of suspected or actual material errors.

Standard 6.14

Organizations that maintain individually identifiable test score information should develop a clear set of policy guidelines on the duration of retention of an individual's records and on the availability and use over time of such data for research or other purposes. The policy should be documented and available to the test taker. Test users should maintain appropriate data security, which should include administrative, technical, and physical protections.

Comment: In some instances, test scores become obsolete over time, no longer reflecting the current state of the test taker. Outdated scores should generally not be used or made available, except for research purposes. In other cases, test scores obtained in past years can be useful, as in longitudinal assessment or the tracking of deterioration of function or cognition. The key issue is the valid use of the information. Organizations and individuals who maintain individually identifiable test score information should be aware of and comply with legal and professional requirements. Organizations and individuals who maintain test scores on individuals may be requested to provide data to researchers or other third-party users. Where data release is deemed appropriate and is not prohibited by statutes or regulations, the test user should protect the confidentiality of the test takers through appropriate policies, such as de-identifying test data or requiring nondisclosure and confidentiality of the data. Organizations and individuals who maintain or use confidential information about test takers or their scores should have and implement an appropriate policy for maintaining security and integrity of the data, including protecting it from accidental or deliberate modification as well as preventing loss or unauthorized destruction. In some cases, organizations may need to obtain test takers' consent to use or disclose records. Adequate security and appropriate protocols should be established when confidential test data are made part of a larger record (e.g., an electronic medical record) or merged into a data warehouse. If records are to be released for clinical and/or forensic evaluations, care should be taken to release them to appropriately licensed individuals, with appropriate signed release authorization by the test taker or appropriate legal authority.
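One common approach to the de-identification mentioned above is to replace direct identifiers with stable keyed pseudonyms before data are released to researchers. The sketch below is only an illustration under assumed conditions: the field names, the secret key, and the choice of HMAC-SHA-256 are hypothetical, and a real release would also need to address indirect identifiers, record-linkage risk, and applicable privacy laws.

```python
import hmac
import hashlib

SECRET_KEY = b"replace-with-a-securely-stored-secret"   # hypothetical key, kept out of released files

def pseudonymize(record, id_fields=("name", "student_id")):
    """Return a copy of a score record with direct identifiers removed and
    replaced by a stable keyed pseudonym, so linked research files can still
    be joined without exposing who the test taker is."""
    cleaned = {k: v for k, v in record.items() if k not in id_fields}
    raw_id = "|".join(str(record[f]) for f in id_fields if f in record)
    digest = hmac.new(SECRET_KEY, raw_id.encode("utf-8"), hashlib.sha256)
    cleaned["pseudonym"] = digest.hexdigest()[:16]
    return cleaned

record = {"name": "A. Learner", "student_id": "000123", "scale_score": 214}
print(pseudonymize(record))   # identifiers removed, pseudonym added
```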

Standard 6.15

When individual test data are retained, both the test protocol and any written report should also be preserved in some form.

Comment: The protocol may be needed to respond to a possible challenge from a test taker or to facilitate interpretation at a subsequent time. The protocol would ordinarily be accompanied by testing materials and test scores. Retention of more detailed records of responses would depend on circumstances and should be covered in a retention policy. Record keeping may be subject to legal and professional requirements. Policy for the release of any test information for other than research purposes is discussed in chapter 9, "The Rights and Responsibilities of Test Users."

Standard 6.16

Transmission of individually identified test scores to authorized individuals or institutions should be done in a manner that protects the confidential nature of the scores and pertinent ancillary information.

Comment: Care is always needed when communicating the scores of identified test takers, regardless of the form of communication. Similar care may be needed to protect the confidentiality of ancillary information, such as personally identifiable information on disability status for students or clinical test scores shared between practitioners. Appropriate caution with respect to confidential information should be exercised in communicating face to face, as well as by telephone, fax, and other forms of written communication. Similarly, transmission of test data through electronic media and transmission and storage on computer networks, including wireless transmission and storage or processing on the Internet, require caution to maintain appropriate confidentiality and security. Data integrity must also be maintained by preventing inappropriate modification of results during such transmissions. Test users are responsible for understanding and adhering to applicable legal obligations in their data management, transmission, use, and retention practices, including collection, handling, storage, and disposition. Test users should set and follow appropriate security policies regarding confidential test data and other assessment information. Release of clinical raw data, tests, or protocols to third parties should follow laws, regulations, and guidelines provided by professional organizations and should take into account the impact of availability of tests in public domains (e.g., court proceedings) and the potential for violation of intellectual property rights.


7. SUPPORTING DOCUMENTATION FOR TESTS

BACKGROUND

This chapter provides general standards for the preparation and publication of test documentation by test developers, publishers, and other providers of tests. Other chapters contain specific standards that should be useful in the preparation of materials to be included in a test's documentation. In addition, test users may have their own documentation requirements. The rights and responsibilities of test users are discussed in chapter 9.

The supporting documents for tests are the primary means by which test developers, publishers, and other providers of tests communicate with test users. These documents are evaluated on the basis of their completeness, accuracy, currency, and clarity and should be available to qualified individuals as appropriate. A test's documentation typically specifies the nature of the test; the use(s) for which it was developed; the processes involved in the test's development; technical information related to scoring, interpretation, and evidence of validity, fairness, and reliability/precision; scaling, norming, and standard-setting information if appropriate to the instrument; and guidelines for test administration, reporting, and interpretation. The objective of the documentation is to provide test users with the information needed to help them assess the nature and quality of the test, the resulting scores, and the interpretations based on the test scores. The information may be reported in documents such as test manuals, technical manuals, user's guides, research reports, specimen sets, examination kits, directions for test administrators and scorers, or preview materials for test takers.

Regardless of who develops a test (e.g., test publisher, certification or licensure board, employer, or educational institution) or how many users exist, the development process should include thorough, timely, and useful documentation. Although proper documentation of the evidence supporting the interpretation of test scores for proposed uses of a test is important, failure to formally document such evidence in advance does not automatically render the corresponding test use or interpretation invalid. For example, consider an unpublished employment selection test developed by a psychologist solely for internal use within a single organization, where there is an immediate need to fill vacancies. The test may properly be put to operational use after needed validity evidence is collected but before formal documentation of the evidence is completed. Similarly, a test used for certification may need to be revised frequently, in which case technical reports describing the test's development as well as information concerning item, exam, and candidate performance should be produced periodically, but not necessarily prior to every exam.

Test documentation is effective if it communicates information to user groups in a manner that is appropriate for the particular audience. To accommodate the breadth of training of those who use tests, separate documents or sections of documents may be written for identifiable categories of users such as practitioners, consultants, administrators, researchers, educators, and sometimes examinees. For example, the test user who administers the tests and interprets the results needs guidelines for doing so. Those who are responsible for selecting tests need to be able to judge the technical adequacy of the tests and therefore need some combination of technical manuals, user's guides, test manuals, test supplements, examination kits, and specimen sets. Ordinarily, these supporting documents are provided to potential test users or test reviewers with sufficient information to enable them to evaluate the appropriateness and technical adequacy of a test. The types of information presented in these documents typically include a description of the intended test-taking population, the stated purpose of the test, test specifications, item formats, administration and scoring procedures, test security protocols, cut scores or other standards, and a description of the test development process. Also typically provided are summaries of technical data such as psychometric indices of the items; reliability/precision and validity evidence; normative data; and cut scores or rules for combining scores, including those for computer-generated interpretations of test scores.

An essential feature of the documentation for every test is a discussion of the common appropriate and inappropriate uses and interpretations of the test scores and a summary of the evidence supporting these conclusions. The inclusion of examples of score interpretations consistent with the test developer's intended applications helps users make accurate inferences on the basis of the test scores. When possible, examples of improper test uses and inappropriate test score interpretations can help guard against the misuse of the test or its scores. When feasible, common negative unintended consequences of test use (including missed opportunities) should be described and suggestions given for avoiding such consequences.

Test documents need to include enough information to allow test users and reviewers to determine the appropriateness of the test for its intended uses. Other materials that provide more details about research by the publisher or independent investigators (e.g., the samples on which the research is based and summative data) should be cited and should be readily obtainable by the test user or reviewer. This supplemental material can be provided in any of a variety of published or unpublished forms and in either paper or electronic formats.

In addition to technical documentation, descriptive materials are needed in some settings to inform examinees and other interested parties about the nature and content of a test. The amount and type of information provided will depend on the particular test and application. For example, in situations requiring informed consent, information should be sufficient for test takers (or their representatives) to make a sound judgment about the test. Such information should be phrased in nontechnical language and should contain information that is consistent with the use of the test scores and is sufficient to help the user make an informed decision. The materials may include a general description and rationale for the test; intended uses of the test results; sample items or complete sample tests; and information about conditions of test administration, confidentiality, and retention of test results. For some applications, however, the true nature and purpose of the test are purposely hidden or disguised to prevent faking or response bias. In these instances, examinees may be motivated to reveal more or less of a characteristic intended to be assessed. Hiding or disguising the true nature or purpose of a test is acceptable provided that the actions involved are consistent with legal principles and ethical standards.

STANDARDS FOR SUPPORTING DOCUMENTATION FOR TESTS

The standards in this chapter begin with an overarching standard (numbered 7.0), which is designed to convey the central intent or primary focus of the chapter. The overarching standard may also be viewed as the guiding principle of the chapter, and is applicable to all tests and test users. All subsequent standards have been separated into four thematic clusters labeled as follows:

1. Content of Test Documents: Appropriate Use
2. Content of Test Documents: Test Development
3. Content of Test Documents: Test Administration and Scoring

4. Timeliness of Delivery of Test Documents

Standard 7.0

Information relating to tests should be clearly documented so that those who use tests can make informed decisions regarding which test to use for a specific purpose, how to administer the chosen test, and how to interpret test scores.

Comment: Test developers and publishers should provide general information to help test users and researchers determine the appropriateness of an intended test use in a specific context. When test developers and publishers become aware of a particular test use that cannot be justified, they should indicate this fact clearly. General information also should be provided for test takers and legal guardians who must provide consent prior to a test's administration. (See Standard 8.4 regarding informed consent.) Administrators and even the general public may also need general information about the test and its results so that they can correctly interpret the results.

Test documents should be complete, accurate, and clearly written so that the intended audience can readily understand the content. Test documentation should be provided in a format that is accessible to the population for which it is intended. For tests used for educational accountability purposes, documentation should be made publicly available in a format and language that are accessible to potential users, including appropriate school personnel, parents, students from all relevant subgroups of intended test takers, and the members of the community (e.g., via the Internet). Test documentation in educational settings might also include guidance on how users could use test materials and results to improve instruction.

Test documents should provide sufficient detail to permit reviewers and researchers to evaluate important analyses published in the test manual or technical report. For example, reporting correlation matrices in the test document may allow the test user to judge the data on which decisions and conclusions were based. Similarly, describing in detail the sample and the nature of factor analyses that were conducted may allow the test user to replicate reported studies.

Test documentation will also help those who are affected by the score interpretations to decide whether to participate in the testing program, or how to participate if participation is not optional.

Cluster 1. Content of Test Documents: Appropriate Use

Standard 7.1

The rationale for a test, recommended uses of the test, support for such uses, and information that assists in score interpretation should be documented. When particular misuses of a test can be reasonably anticipated, cautions against such misuses should be specified.

Comment: Test publishers should make every effort to caution test users against known misuses of tests. However, test publishers cannot anticipate all possible misuses of a test. If publishers do know of persistent test misuse by a test user, additional educational efforts, including providing information regarding potential harm to the individual, organization, or society, may be appropriate.



Standard 7.2

The population for whom a test is intended and specifications for the test should be documented. If normative data are provided, the procedures used to gather the data should be explained; the norming population should be described in terms of relevant demographic variables; and the year(s) in which the data were collected should be reported.

Comment: Known limitations of a test for certain populations should be clearly delineated in the test documents. For example, a test used to assess educational progress may not be appropriate for employee selection in business and industry.

Other documentation can assist the user in identifying the appropriate normative information to use to interpret test scores appropriately. For example, the time of year in which the normative data were collected may be relevant in some educational settings. In organizational settings, information on the context in which normative data were gathered (e.g., in concurrent or predictive studies; for development or selection purposes) may also have implications for which norms are appropriate for operational use.

Standard 7.3

When the information is available and appropriately shared, test documents should cite a representative set of the studies pertaining to general and specific uses of the test.

Comment: If a study cited by the test publisher is not published, summaries should be made available on request to test users and researchers by the publisher.

Cluster 2. Content of Test Documents: Test Development

Standard 7.4

Test documentation should summarize test development procedures, including descriptions and the results of the statistical analyses that were used in the development of the test, evidence of the reliability/precision of scores and the validity of their recommended interpretations, and the methods for establishing performance cut scores.

Comment: When applicable, test documents should include descriptions of the procedures used to develop items and create the item pool, to create tests or forms of tests, to establish scales for reported scores, and to set standards and rules for cut scores or combining scores. Test documents should also provide information that allows the user to evaluate bias or fairness for all relevant groups of intended test takers when it is meaningful and feasible for such studies to be conducted. In addition, other statistical data should be provided as appropriate, such as item-level information, information on the effects of various cut scores (e.g., number of candidates passing at potential cut scores, level of adverse impact at potential cut scores), information about raw scores and reported scores, normative data, the standard errors of measurement, and a description of the procedures used to equate multiple forms. (See chaps. 3 and 4 for more information on the evaluation of fairness and on procedures and statistics commonly used in test development.)
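Where the comment above refers to information on the effects of various cut scores (e.g., passing rates and potential adverse impact at candidate cuts), a minimal tabulation might look like the sketch below. The grouping variable, the candidate cut scores, and the use of a simple ratio of group passing rates are assumptions introduced for illustration; they are not requirements of this standard, and other impact summaries are possible.

```python
def cut_score_impact(scores_by_group, candidate_cuts):
    """Passing rates by group, and their ratio, at each candidate cut score.

    scores_by_group: dict mapping group label -> list of scores
    candidate_cuts:  iterable of cut scores under consideration
    The ratio of the lowest to the highest group passing rate is one common
    (though not the only) way to summarize potential adverse impact.
    """
    summary = {}
    for cut in candidate_cuts:
        rates = {
            group: sum(s >= cut for s in scores) / len(scores)
            for group, scores in scores_by_group.items()
        }
        highest = max(rates.values())
        ratio = min(rates.values()) / highest if highest > 0 else None
        summary[cut] = {"passing_rates": rates, "impact_ratio": ratio}
    return summary

# Example: hypothetical scores for two groups at three candidate cut scores
scores = {
    "group_a": [72, 80, 65, 90, 78, 85],
    "group_b": [70, 68, 88, 75, 62, 81],
}
for cut, result in cut_score_impact(scores, candidate_cuts=(70, 75, 80)).items():
    print(cut, result)
```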

Standard 7.5

Test documents should record the relevant characteristics of the individuals or groups of individuals who participated in data collection efforts associated with test development or validation (e.g., demographic information, job status, grade level); the nature of the data that were contributed (e.g., predictor data, criterion data); the nature of judgments made by subject matter experts (e.g., content validation linkages); the instructions that were provided to participants in data collection efforts for their specific tasks; and the conditions under which the test data were collected in the validity study.

Comment: Test developers should describe the relevant characteristics of those who participated in various steps of the test development process and what tasks each person or group performed. For example, the participants who set the test cut scores and their relevant expertise should be documented. Depending on the use of the test results, relevant characteristics of the participants may include race/ethnicity, gender, age, employment status, education, disability status, and primary language. Descriptions of the tasks and the specific instructions provided to the participants may help future test users select and subsequently use the test appropriately. Testing conditions, such as the extent of proctoring in the validity study, may have implications for the generalizability of the results and should be documented. Any changes to the standardized testing conditions, such as accommodations or modifications made to the test or test administration, should also be documented. Test developers and users should take care to comply with applicable legal requirements and professional standards relating to privacy and data security when providing the documentation required by this standard.

Standard 7.6

When a test is available in more than one language, the test documentation should provide information on the procedures that were employed to translate and adapt the test. Information should also be provided regarding the reliability/precision and validity evidence for the adapted form when feasible.

Comment: In addition to providing information on translation and adaptation procedures, the test documents should include the demographics of translators and samples of test takers used in the adaptation process, as well as information on any score interpretation issues for each language into which the test has been translated and adapted. Evidence of reliability/precision, validity, and comparability of translated and adapted scores should be provided in test documentation when feasible. (See Standard 3.14, in chap. 3, for further discussion of translations.)

Cluster 3. Content of Test Documents: Test Administration and Scoring

Standard 7.7

Test documents should specify user qualifications that are required to administer and score a test, as well as the user qualifications needed to interpret the test scores accurately.

Comment: Statements of user qualifications should specify the training, certification, competencies, and experience needed to allow access to a test or scores obtained with it. When user qualifications are expressed in terms of the knowledge, skills, abilities, and other characteristics required to administer, score, and interpret a test, the test documentation should clearly define the requirements so the user can properly evaluate the competence of administrators.

Standard 7.8

Test documentation should include detailed instructions on how a test is to be administered and scored.

Comment: Regardless of whether a test is to be administered in paper-and-pencil format, computer format, or orally, or whether the test is performance based, instructions for administration should be included in the test documentation. As appropriate, these instructions should include all factors related to test administration, including qualifications, competencies, and training of test administrators; equipment needed; protocols for test administrators; timing instructions; and procedures for implementation of test accommodations. When available, test documentation should also include estimates of the time required to administer the test to clinical, disabled, or other special populations for whom the test is intended to be used, based on data obtained from these groups during the norming of the test. In addition, test users need instructions on how to score a test and what cut scores to use (or whether to use cut scores) in interpreting scores. If the test user does not score the test, instructions should be given on how to have a test scored. Finally, test administration documentation should include instructions for dealing with irregularities in test administration and guidance on how they should be documented.

If a test is designed so that more than one method can be used for administration or for recording responses, such as marking responses in a test booklet, on a separate answer sheet, or via computer, then the manual should clearly document the extent to which scores arising from application of these methods are interchangeable. If the scores are not interchangeable, this fact should be reported, and guidance should be given on the comparability of scores obtained under the various conditions or methods of administration.

Standard 7.9

If test security is critical to the interpretation of test scores, the documentation should explain the steps necessary to protect test materials and to prevent inappropriate exchange of information during the test administration session.

Comment: When the proper interpretation of test scores assumes that the test taker has not been exposed to the test content or received illicit assistance, the instructions should include procedures for ensuring the security of the testing process and of all test materials at all times. Security procedures may include guidance for storing and distributing test materials as well as instructions for maintaining a secure testing process, such as identifying test takers and seating test takers to prevent exchange of information. Test users should be aware that federal and state laws, regulations, and policies may affect security procedures.

In many situations, test scores should also be maintained securely. For example, in promotional testing in some employment settings, only the candidate and the staffing personnel are authorized to see the scores, and the candidate's current supervisor is specifically prohibited from viewing them. Documentation may include information on how test scores are stored and who is authorized to see the scores.

Standard 7.10

Tests that are designed to be scored and interpreted by test takers should be accompanied by scoring instructions and interpretive materials that are written in language the test takers can understand and that assist them in understanding the test scores.

Comment: If a test is designed to be scored by test takers or its scores interpreted by test takers, the publisher and test developer should develop procedures that facilitate accurate scoring and interpretation. Interpretive material may include information such as the construct that was measured, the test taker's results, and the comparison group. The appropriate language for the scoring procedures and interpretive materials is one that meets the particular language needs of the test taker. Thus, the scoring and interpretive materials may need to be offered in the native language of the test taker to be understood.

Standard 7.11

Interpretive materials for tests that include case studies should provide examples illustrating the diversity of prospective test takers.

Comment: When case studies can assist the user in the interpretation of the test scores and profiles, the case studies should be included in the test documentation and represent members of the subgroups for which the test is relevant. To illustrate the diversity of prospective test takers, case studies might cite examples involving women and men of different ages, individuals differing in sexual orientation, persons representing various racial/ethnic or cultural groups, and individuals with disabilities. Test developers may wish to inform users that the inclusion of such examples is intended to illustrate the diversity of prospective test takers and not to promote interpretation of test scores in a manner that conflicts with legal requirements such as race or gender norming in employment contexts.

Standard 7.12

When test scores are used to make predictions about future behavior, the evidence supporting those predictions should be provided to the test user.

Comment: The test user should be informed of any cut scores or rules for combining raw or reported scores that are necessary for understanding score interpretations. A description of both the group of judges used in establishing the cut scores and the methods used to derive the cut scores should be provided. When security or proprietary reasons necessitate the withholding of cut scores or rules for combining scores, the owners of the intellectual property are responsible for documenting evidence in support of the validity of interpretations for intended uses. Such evidence might be provided, for example, by reporting the findings of an independent review of the algorithms by qualified professionals. When any interpretations of test scores, including computer-generated interpretations, are provided, a summary of the evidence supporting the interpretations should be given, as well as the rules and guidelines used in making the interpretations.

Cluster 4. Timeliness of Delivery of Test Documents

Standard 7.13

Supporting documents (e.g., test manuals, technical manuals, user's guides, and supplemental material) should be made available to the appropriate people in a timely manner.

Comment: Supporting documents should be supplied in a timely manner. Some documents (e.g., administration instructions, user's guides, sample tests or items) must be made available prior to the first administration of the test. Other documents (e.g., technical manuals containing information based on data from the first administration) cannot be supplied prior to that administration; however, such documents should be created promptly.

The test developer or publisher should judge carefully which information should be included in first editions of the test manual, technical manual, or user's guide and which information can be provided in supplements. For low-volume, unpublished tests, the documentation may be relatively brief. When the developer is also the user, documentation and summaries are still necessary.

Standard 7.14

When substantial changes are made to a test, the test's documentation should be amended, supplemented, or revised to keep information for users current and to provide useful additional information or cautions.

Comment: Supporting documents should clearly note the date of their publication as well as the name or version of the test for which the documentation is relevant. When substantial changes are made to items and scoring, information on the extent to which the old scores and new scores are interchangeable should be included in the test documentation.

Sometimes it is necessary to change a test or testing procedure to remove construct-irrelevant variance that may arise due to the characteristics of an individual that are unrelated to the construct being measured (e.g., when testing individuals with disabilities). When a test or testing procedure is altered, the documentation for the test should include a discussion of how the alteration may affect the validity and comparability of the test scores, and evidence should be provided to demonstrate the effect of the alteration on the scores obtained from the altered test or testing procedures, if sample size permits.


8. THE RIGHTS AND RESPONSIBILITIES OF TEST TAKERS

BACKGROUND

This chapter addresses issues of fairness from the point of view of the individual test taker. Most aspects of fairness affect the validity of interpretations of test scores for their intended uses. The standards in this chapter address test takers' rights and responsibilities with regard to test security, their access to test results, and their rights when irregularities in their testing process are claimed. Other issues of fairness are addressed in chapter 3 ("Fairness in Testing"). General considerations concerning reports of test results are covered in chapter 6 ("Test Administration, Scoring, Reporting, and Interpretation"). Issues related to test takers' rights and responsibilities in clinical or individual settings are also discussed in chapter 10 ("Psychological Testing and Assessment").

The standards in this chapter are directed to test providers, not to test takers. It is the shared responsibility of the test developer, test administrator, test proctor (if any), and test user to provide test takers with information about their rights and their own responsibilities. The responsibility to inform the test taker should be apportioned according to particular circumstances.

Test takers have the right to be assessed with tests that meet current professional standards, including standards of technical quality, consistent treatment, fairness, conditions for test administration, and reporting of results. The chapters in Part I, "Foundations," and Part II, "Operations," deal specifically with fair and appropriate test design, development, administration, scoring, and reporting. In addition, test takers have a right to basic information about the test and how the test results will be used. In most situations, fair and equitable treatment of test takers involves providing information about the general nature of the test, the intended use of test scores, and the confidentiality of the results in advance of testing. When full disclosure of this information is not appropriate (as is the case with some psychological or employment tests), the information that is provided should be consistent across test takers. Test takers, or their legal representatives when appropriate, need enough information about the test and the intended use of test results to reach an informed decision about their participation.

In some instances, the laws or standards of professional practice, such as those governing research on human subjects, require formal informed consent for testing. In other instances (e.g., employment testing), informed consent is implied by other actions (e.g., submission of an employment application), and formal consent is not required. The greater the consequences to the test taker, the greater the importance of ensuring that the test taker is fully informed about the test and voluntarily consents to participate, except when testing without consent is permitted by law (e.g., when participating in testing is legally required or mandated by a court order). If a test is optional, the test taker has the right to know the consequences of taking or not taking the test. Under most circumstances, the test taker has the right to ask questions or express concerns and should receive a timely response to legitimate inquiries.

When consistent with the purposes and nature of the assessment, general information is usually provided about the test's content and purposes. Some programs, in the interest of fairness, provide all test takers with helpful materials, such as study guides, sample questions, or complete sample tests, when such information does not jeopardize the validity of the interpretations of results from future test administrations. Practice materials should have the same appearance and format as the actual test. A practice test for a Web-based assessment, for example, should be available via computer. Employee selection programs may legitimately provide more training to certain classes of test takers (e.g., internal applicants) and not to others (e.g., external applicants). For example, an organization may train current employees on skills that are measured on employment tests in the context of an employee development program but not offer that training to external applicants. Advice may also be provided about test-taking strategies, including time management and the advisability of omitting a response to an item (when omitting a response is permitted). Information on various testing policies, for example about making accommodations available and determining for which individuals the accommodations are appropriate, is also provided to the test taker. In addition, communications to test takers should include policies on retesting when major disruptions of the test administration occur, when the test taker feels that the present performance does not appropriately reflect his or her true capabilities, or when the test taker improves on his or her underlying knowledge, skills, abilities, or other personal characteristics.

As participants in the assessment, test takers have responsibilities as well as rights. Their responsibilities include being prepared to take the test, following the directions of the test administrator, representing themselves honestly on the test, and protecting the security of the test materials. Requests for accommodations or modifications are the responsibility of the test taker or, in the case of minors, the test taker's guardian. In group testing situations, test takers should not interfere with the performance of other test takers. In some testing programs, test takers are also expected to inform the appropriate persons in a timely manner if they believe there are reasons that their test results will not reflect their true capabilities.

The validity of score interpretations rests on the assumption that a test taker has fairly earned a particular score or categorical decision, such as "pass" or "fail." Many forms of cheating or other malfeasant behaviors can reduce the validity of the interpretations of test scores and cause harm to other test takers, particularly in competitive situations in which test takers' scores are compared. There are many forms of behavior that affect test scores, such as using prohibited aids or arranging for someone to take the test in the test taker's place. Similarly, there are many forms of behavior that jeopardize the security of test materials, including communicating the specific content of the test to other test takers in advance. The test taker is obligated to respect the copyrights in test materials and may not reproduce the materials without authorization or disseminate in any form material that is similar in nature to the test. Test takers, as well as test administrators, have the responsibility to protect test security by refusing to divulge any details of the test content to others, unless the particular test is designed to be openly available in advance. Failure to honor these responsibilities may compromise the validity of test score interpretations for the test taker and for others. Outside groups that develop items for test preparation should base those items on publicly disclosed information and not on information that has been inappropriately shared by test takers.

Sometimes, testing programs use special scores, statistical indicators, and other indirect information about irregularities in testing to examine whether the test scores have been obtained fairly. Unusual patterns of responses, large changes in test scores upon retesting, response speed, and similar indicators may trigger careful scrutiny of certain testing protocols and test scores. The details of the procedures for detecting problems are generally kept secure to avoid compromising their use. However, test takers should be informed that in special circumstances, such as response or test score anomalies, their test responses may receive special scrutiny. Test takers should be informed that their score may be canceled or other action taken if evidence of impropriety or fraud is discovered.

STANDARDS FOR TEST TAKERS' RIGHTS AND RESPONSIBILITIES

The standards in this chapter begin with an overarching standard (numbered 8.0), which is designed to convey the central intent or primary focus of the chapter. The overarching standard may also be viewed as the guiding principle of the chapter, and is applicable to all tests and test users. All subsequent standards have been separated into four thematic clusters labeled as follows:

1. Test Takers' Rights to Information Prior to Testing

2. Test Takers' Rights to Access Their Test Results and to Be Protected From Unauthorized Use of Test Results

3. Test Takers' Rights to Fair and Accurate Score Reports

4. Test Takers' Responsibilities for Behavior Throughout the Test Administration Process

Standard 8.0

Test takers have the right to adequate information to help them properly prepare for a test so that the test results accurately reflect their standing on the construct being assessed and lead to fair and accurate score interpretations. They also have the right to protection of their personally identifiable score results from unauthorized access, use, or disclosure. Further, test takers have the responsibility to represent themselves accurately in the testing process and to respect copyright in test materials.

Comment: Specific standards for test takers' rights and responsibilities are described below. These include standards for the kinds of information that should be provided to test takers prior to testing so they can properly prepare to take the test and so that their results accurately reflect their standing on the construct being assessed. Standards also cover test takers' access to their test results; protection of the results from unauthorized access, use, or disclosure by others; and test takers' rights to fair and accurate score reports. In addition, standards in this chapter address the responsibility of test takers to represent themselves fairly and accurately during the testing process and to respect the confidentiality of, and copyright in, all test materials.

Cluster 1. Test Takers' Rights to Information Prior to Testing

Standard 8.1

Information about test content and purposes that is available to any test taker prior to testing should be available to all test takers. Shared information should be available free of charge and in accessible formats.

Comment: The intent of this standard is equitable treatment for all test takers with respect to access to basic information about a testing event, such as when and where the test will be given, what materials should be brought, what the purpose of the test is, and how the results will be used. When applicable, such offerings should be made to all test takers and, to the degree possible, should be in formats accessible to all test takers. Accessibility of formats also applies to information that may be provided on a public website. For example, depending on the format of the information, conversions can be made so that individuals with visual disabilities can access textual or graphical material. For test takers with disabilities, providing these materials in accessible formats may be required by law.

It merits noting that while general information about test content and purpose should be made available to all test takers, some organizations may supplement this information with additional training or coaching. For example, some employers may teach basic skills to workers to help them qualify for higher-level positions. Similarly, one teacher in a school may choose to drill students on a topic that will be tested while other teachers focus on other topics.



Standard 8.2

Test takers should be provided in advance with as much information about the test, the testing process, the intended test use, test scoring criteria, testing policy, availability of accommodations, and confidentiality protection as is consistent with obtaining valid responses and making appropriate interpretations of test scores.

Comment: When appropriate, test takers should be informed in advance about test content, including subject area, topics covered, and item formats. General advice should be given about test-taking strategies. For example, test takers should usually be informed about the advisability of omitting responses and made aware of any imposed time limits, so that they can manage their time appropriately. For computer administrations, test takers should be shown samples of the interface they will be expected to use during the test and be provided an opportunity to practice with those tools and master their use before the test begins. In addition, they should be told about possibilities for revisiting items they have previously answered or omitted.

In most testing situations, test takers should be informed about the intended use of test scores and the extent of the confidentiality of test results, and should be told whether and when they will have access to their results. Exceptions occur when knowledge of the purposes or intended score uses would violate the integrity of the interpretations of the scores, such as when the test is intended to detect malingering. If a record of the testing session is kept in written, video, audio, or any other form, or if other records associated with the testing event, such as scoring information, are kept, test takers are entitled to know what testing information will be released, to whom, and for what purposes the results will be used. In some cases, legal standards apply to information about the use and confidentiality of, and test-taker access to, test scores. Policies concerning retesting should also be communicated. Test takers should be warned against improper behavior and made cognizant of the consequences of misconduct, such as cheating, that could result in their being prohibited from completing the test or receiving test scores, or could make them subject to other sanctions. Test takers should be informed, at least in a general way, if there will be special scrutiny of testing protocols or score patterns to detect breaches of security, cheating, or other improper behavior.

Standard 8.3

When the test taker is offered a choice of test format, information about the characteristics of each format should be provided.

Comment: Test takers sometimes may choose between paper-and-pencil administration of a test and computer administration. Some tests are offered in different languages. Sometimes, an alternative assessment is offered. Test takers need to know the characteristics of each alternative that is available to them so that they can make an informed choice.

Standard 8.4

Informed consent should be obtained from test takers, or from their legal representatives when appropriate, before testing begins, except (a) when testing without consent is mandated by law or governmental regulation, (b) when testing is conducted as a regular part of school activities, or (c) when consent is clearly implied, such as in employment settings. Informed consent may be required by applicable law and professional standards.

Comment: Informed consent implies that the test takers or their representatives are made aware, in language that they can understand, of the reasons for testing, the types of tests to be used, the intended uses of test takers' test results or other information, and the range of material consequences of the intended use. It is generally recommended that persons be asked directly to give their formal consent rather than being asked only to indicate if they are withholding their consent.


Consent is not required when testing is legally mandated, as in the case of a court-ordered psychological assessment, although there may be legal requirements for providing information about the testing session outcomes to the test taker. Nor is consent typically required in educational settings for tests administered to all pupils. When testing is required for employment, credentialing, or educational admissions, applicants, by applying, have implicitly given consent to the testing. When feasible, the person explaining the reason for a test should be experienced in communicating with individuals within the intended population for the test (e.g., individuals with disabilities or from different linguistic backgrounds).

Cluster 2. Test Takers’ Rights to Access Their Test Results and to Be Protected From Unauthorized Use of Test Results

Standard 8.5

Policies for the release of test scores with identifying information should be carefully considered and clearly communicated to those who have access to the scores. Policies should make sure that test results containing the names of individual test takers or other personal identifying information are released only to those who have a legitimate, professional interest in the test takers and are permitted to access such information under applicable privacy laws, who are covered by the test takers’ informed consent documents, or who are otherwise permitted by law to access the results.

Comment: Test results of individuals identified by name, or by some other information by means of which a person can be readily identified, or readily identified when the information is combined with other information, should be kept confidential. In some situations, information may be provided on a confidential basis to other practitioners with a legitimate interest in the particular case, consistent with legal and ethical considerations, including, as applicable, privacy laws. Information may be provided to researchers if several conditions are all met: (a) each test taker’s confidentiality is maintained, (b) the intended use is consistent with accepted research practice, (c) the use is in compliance with current legal and institutional requirements for subjects’ rights and with applicable privacy laws, and (d) the use is consistent with the test taker’s informed consent documents that are on file or with the conditions of implied consent that are appropriate in some settings.

Standard 8.6

Test data maintained or transmitted in data files, including all personally identifiable information (not just results), should be adequately protected from improper access, use, or disclosure, including by reasonable physical, technical, and administrative protections as appropriate to the particular data set and its risks, and in compliance with applicable legal requirements. Use of facsimile transmission, computer networks, data banks, or other electronic data-processing or transmittal systems should be restricted to situations in which confidentiality can be reasonably assured. Users should develop and/or follow policies, consistent with any legal requirements, for whether and how test takers may review and correct personal information.

Comment: Risk of compromise is reduced by avoiding identification numbers or codes that are linked to individuals and used for other purposes (e.g., Social Security numbers or employee IDs). If facsimile or computer communication is used to transmit test responses to another site for scoring or if scores are similarly transmitted, reasonable provisions should be made to keep the information confidential, such as encrypting the information. In some circumstances, applicable data security laws may require that specific measures be taken to protect the data. In most cases, these policies will be developed by the owner of the data.
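For illustration only (the Standards do not prescribe any particular technology), the sketch below shows one way a testing program might encrypt a score file before electronic transmission, using symmetric encryption from the Python cryptography package. The file names and the key-handling arrangement are hypothetical; in practice, key storage and exchange would be governed by the program's own security policies and applicable law.

```python
# Illustrative sketch only: encrypt a score file before transmission so that
# intercepted data cannot be read without the key. Key management (generation,
# storage, rotation) is assumed to follow the program's own security policy;
# the file names here are hypothetical.
from cryptography.fernet import Fernet

def encrypt_score_file(plaintext_path: str, encrypted_path: str, key: bytes) -> None:
    """Encrypt the contents of a score file with a symmetric Fernet key."""
    fernet = Fernet(key)
    with open(plaintext_path, "rb") as f:
        token = fernet.encrypt(f.read())
    with open(encrypted_path, "wb") as f:
        f.write(token)

def decrypt_score_file(encrypted_path: str, key: bytes) -> bytes:
    """Decrypt a previously encrypted score file at the receiving site."""
    fernet = Fernet(key)
    with open(encrypted_path, "rb") as f:
        return fernet.decrypt(f.read())

if __name__ == "__main__":
    with open("scores.csv", "w") as f:            # tiny hypothetical score file
        f.write("taker_id,scaled_score\nA001,512\n")
    key = Fernet.generate_key()                   # in practice, stored and exchanged securely
    encrypt_score_file("scores.csv", "scores.enc", key)
    print(decrypt_score_file("scores.enc", key).decode())
```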


Cluster 3. Test Takers’ Rights to Fair and Accurate Score Reports

Standard 8.7

When score reporting assigns scores of individual test takers into categories, the labels assigned to the categories should be chosen to reflect intended inferences and should be described precisely.

Comment: When labels are associated with test results, care should be taken to avoid labels with unnecessarily stigmatizing implications. For example, descriptive labels such as “basic,” “proficient,” and “advanced” would carry less stigmatizing interpretations than terms such as “poor” or “unsatisfactory.” In addition, information should be provided regarding the accuracy of score classifications (e.g., decision accuracy and decision consistency).
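Purely as an illustrative aside (the Standards do not prescribe particular indices or notation), decision consistency and decision accuracy are often summarized as simple agreement proportions. In the sketch below, p_{jk} is an assumed notation for the proportion of examinees assigned to category j under one classification and category k under the other.

```latex
% Illustrative indices only; the notation is an assumption, not taken from the Standards.
% p_{jk}: proportion of examinees placed in category j on one administration (or by the
% observed-score classification) and in category k on a parallel administration (or by
% the true-score classification).
\[
\text{decision consistency: } p_o = \sum_{k} p_{kk},
\qquad
\text{chance-corrected consistency: } \kappa = \frac{p_o - p_e}{1 - p_e},
\quad p_e = \sum_{k} p_{k\cdot}\, p_{\cdot k}.
\]
% Decision accuracy is defined analogously, with the true-score category taking the
% place of the second administration.
```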

Standard 8.8

When test scores are used to make decisions about a test taker or to make recommendations to a test taker or a third party, the test taker should have timely access to a copy of any report of test scores and test interpretation, unless that right has been waived explicitly in the test taker’s informed consent document or implicitly through the application procedure in education, credentialing, or employment testing or is prohibited by law or court order.

Comment: In some cases, a test taker may be adequately informed when the test report is given to an appropriate third party (e.g., treating psychologist or psychiatrist) who can interpret the findings for the test taker. When the test taker is given a copy of the test report and there is a credible reason to believe that test scores might be incorrectly interpreted, the examiner or a knowledgeable third party should be available to interpret them, even if the score report is clearly written, as the test taker may misunderstand or raise questions not specifically answered in the report. In employment testing situations, when test results are used solely for the purpose of aiding selection decisions, waivers of access are often a condition of employment applications, although access to test information may often be appropriately required in other circumstances.

Cluster 4. Test Takers’ Responsibilities for Behavior Throughout the Test Administration Process

Standard 8.9

Test takers should be made aware that having someone else take the test for them, disclosing confidential test material, or engaging in any other form of cheating is unacceptable and that such behavior may result in sanctions.

Comment: Although the Standards cannot regulate test takers’ behavior, test takers should be made aware of their personal and legal responsibilities. Arranging for someone else to impersonate the test taker constitutes fraud. In tests designed to measure a test taker’s independent thinking, providing responses that make use of the work of others without attribution or that were prepared by someone other than the test taker constitutes plagiarism. Disclosure of confidential testing material for the purpose of giving other test takers advance knowledge interferes with the validity of test score interpretations; and circulation of test items in print or electronic form may constitute copyright infringement. In licensure and certification tests, such actions may compromise public health and safety. In general, the validity of test score interpretations is compromised by inappropriate test disclosure.

Standard 8.10

In educational and credentialing testing programs, when an individual score report is expected to be significantly delayed beyond a brief investigative period because of possible irregularities such as suspected misconduct, the test taker should be notified and given the reason for the investigation. Reasonable efforts should be made to expedite the review and to protect the interests of the test taker. The test taker should be notified of the disposition when the investigation is closed.

Standard 8.11

In educational and credentialing testing programs, when it is deemed necessary to cancel or withhold a test taker’s score because of possible testing irregularities, including suspected misconduct, the type of evidence and the general procedures to be used to investigate the irregularity should be explained to all test takers whose scores are directly affected by the decision. Test takers should be given a timely opportunity to provide evidence that the score should not be canceled or withheld. Evidence considered in deciding on the final action should be made available to the test taker on request.

Comment: Any form of cheating or behavior that reduces the validity and fairness of the interpretations of test results should be investigated promptly, with appropriate action taken. A test score may be withheld or canceled because of suspected misconduct by the test taker or because of some anomaly involving others, such as theft or administrative mishap. An avenue of appeal should be available and made known to candidates whose scores may be amended or withheld. Some testing organizations offer the option of a prompt and free retest or arbitration of disputes. The information provided to the test takers should be specific enough for them to understand the evidence that is being used to support the contention of a testing irregularity but not specific enough to divulge trade secrets or to facilitate cheating.

Standard 8.12

In educational and credentialing testing programs, a test taker is entitled to fair treatment and a reasonable resolution process, appropriate to the particular circumstances, regarding charges associated with testing irregularities, or challenges issued by the test taker regarding the accuracy of the scoring or the scoring key. Test takers are entitled to be informed of any available means of recourse.

Comment: When a test taker’s score is questioned and invalidated, or when a test taker seeks a review or revision of his or her score or of some other aspect of the testing, scoring, or reporting process, the test taker is entitled to some orderly process for effective input into or review of the decision making of the test administrator or test user. Depending on the magnitude of the consequences associated with the test, this process can range from an internal review of all relevant data by a test administrator, to an informal conversation with an examinee, to a full administrative hearing. The greater the consequences, the greater the extent of procedural protections that should be made available. Test takers should also be made aware of procedures for recourse, possible fees associated with recourse procedures, expected time for resolution, and any other significant related issues, including consequences for the test taker. Some testing programs advise that the test taker may be represented by an attorney, although possibly at the test taker’s expense. Depending on the circumstances and context, principles of due process under law may be relevant to the process afforded to test takers.


9. THE RIGHTS AND RESPONSIBILITIES OF TEST USERS

BACKGROUND

The previous chapters have dealt primarily with the responsibilities of those who develop, promote, evaluate, or mandate the administration of tests and with the rights and responsibilities of test takers. The present chapter centers attention on the responsibilities of those who may be considered the users of tests. Test users are professionals who select the specific instruments or supervise test administration—on their own authority or at the behest of others—as well as all other professionals who actively participate in the interpretation and use of test results. They include psychologists, educators, employers, test developers, test publishers, and other professionals. Given the reliance on test results in many settings, pressure has typically been placed on test users to explain test-based decisions and testing practices; in many circumstances, test users have legal obligations to document the validity and fairness of those decisions and practices. The standards in this chapter provide guidance with regard to test administration procedures and decision making in which tests play a part. Thus, the present chapter includes standards of a general nature that apply in almost all testing contexts.

These Standards presume that a legitimate educational, psychological, credentialing, or employment purpose justifies the time and expense of test administration. In most settings, the user communicates this purpose to those who have a legitimate interest in the measurement process and subsequently conveys the implications of examinee performance to those entitled to receive the information. Depending on the measurement setting, this group may include individual test takers, parents and guardians, educators, employers, policy makers, the courts, or the general public.

Validity and reliability are critical considerations in test selection and use, and test users should consider evidence of (a) the validity of the interpretation for intended uses of the scores, (b) the reliability/precision of the scores, (c) the applicability of the normative data available in the test manual, and (d) the potential positive and negative consequences of use. The accumulated research literature should also be considered, as well as, where appropriate, demographic characteristics (e.g., race/ethnicity; gender; age; income; socioeconomic, cultural, and linguistic background; education; and other socioeconomic variables) of the group for which the test was originally constructed and for which normative data are available. Test users can also consult with measurement professionals. The name of the test alone never provides adequate information for deciding whether to select it.

In some cases, the selection of tests and inventories is individualized for a particular client. In other settings, a predetermined battery of tests is taken by all participants. In both cases, test users should be well versed in proper administrative procedures and are responsible for understanding the validity and reliability evidence and articulating that evidence if the need arises. Test users who oversee testing and assessment are responsible for ensuring that the test administrators who administer and score tests have received the appropriate education and training needed to perform these tasks. A higher level of competence is required of the test user who interprets the scores and integrates the inferences derived from the scores and other relevant information.

Test scores ideally are interpreted in light of the available data, the psychometric properties of the scores, indicators of effort, and the effects of moderator variables and demographic characteristics on test results. Because items or tasks contained in a test that was designed for a particular group may introduce construct-irrelevant variance when used with other groups, selecting a test with demographically appropriate reference groups is important to the generalizability of the inference that the test user seeks to make. When a test developed and normed for one group is applied to other groups, score interpretations should be qualified and presented as hypotheses rather than conclusions. Further, statistical analyses conducted on only one group should be evaluated for appropriateness when generalized to other examinee populations. The test user should rely on any available extant research evidence for the test to draw appropriate inferences and should be aware of requirements restricting certain practices (e.g., norming by race or gender in certain contexts).

Moreover, where applicable, an interpretation of test takers’ scores needs to consider not only the demonstrated relationship between the scores and the criteria, but also the appropriateness of the latter. The criteria need to be subjected to an examination similar to the examination of the predictors if one is to understand the degree to which the underlying constructs are congruent with the inferences under consideration. It is important that data which are not supportive of the inferences should be acknowledged and either reconciled or noted as limits to the confidence that can be placed in the inferences. The education and experience necessary to interpret group tests are generally less stringent than the qualifications necessary to interpret individually administered tests.

Test users should follow the standardized test administration procedures outlined by the test developers. Computer administration of tests should also follow standardized procedures, and sufficient oversight should be provided to ensure the integrity of test results. When nonstandard procedures are needed, they should be described and justified. Test users are also responsible for providing appropriate testing conditions. For example, the test user may need to determine whether a test taker is capable of reading at the level required and whether a test taker with vision, hearing, or neurological disabilities is adequately accommodated. Chapter 3 (“Fairness in Testing”) addresses equal access considerations and standards in detail.

Where administration of tests or use of test data is mandated for a specific population by governmental authorities, educational institutions, licensing boards, or employers, the developer and user of an instrument may be essentially the same. In such settings, there is often no clear separation in terms of professional responsibilities between those who develop the instrument and those who administer it and interpret the results. Instruments produced by independent publishers, on the other hand, present a somewhat different picture. Typically, these will be used by different test users with a variety of populations and for diverse purposes.

The conscientious developer of a standardized test attempts to control who has access to the test and to educate potential users. Furthermore, most publishers and test sponsors work to prevent the misuse of standardized measures and the misinterpretation of individual scores and group averages. Test manuals often illustrate sound and unsound interpretations and applications. Some identify specific practices that are not appropriate and should be discouraged. Despite the best efforts of test developers, however, appropriate test use and sound interpretation of test scores are likely to remain primarily the responsibility of the test user.

Test takers, parents and guardians, legislators, policy makers, the media, the courts, and the public at large often prefer unambiguous interpretations of test data. In particular, they often tend to attribute positive or negative results, including group differences, to a single factor or to the conditions that prevail in one social institution—most often, the home or the school. These consumers of test data frequently press for score-based rationales for decisions that are based only in part on test scores. The wise test user helps all interested parties understand that sound decisions regarding test use and score interpretation involve an element of professional judgment. It is not always obvious to the consumers that the choice of various information-gathering procedures involves experience that is not easily quantified or verbalized. The user can help consumers appreciate the fact that the weighting of quantitative data, educational and occupational information, behavioral observations, anecdotal reports, and other relevant data often cannot be specified precisely. Nonetheless, test users should provide reports and interpretations of test data that are clear and understandable.

Because test results are frequently reported as numbers, they often appear to be precise, and test data are sometimes allowed to override other sources of evidence about test takers. There are circumstances in which selection based exclusively on test scores may be appropriate (e.g., in pre-employment screening). However, in educational, psychological, forensic, and some employment settings, test users are well advised, and may be legally required, to consider other relevant sources of information on test takers, not just test scores. In such situations, psychologists, educators, or other professionals familiar with the local setting and with local test takers are often best qualified to integrate this diverse information effectively.

It is not appropriate for these standards to dictate minimal levels of test-criterion correlation, classification accuracy, or reliability/precision for any given purpose. Such levels depend on factors such as the nature of the measured construct, the age of the tested individuals, and whether decisions must be made immediately on the strength of the best available evidence, however weak, or whether they can be delayed until better evidence becomes available. But it is appropriate to expect the user to ascertain what the alternatives are, what the quality and consequences of these alternatives are, and whether a delay in decision making would be beneficial. Cost-benefit compromises become necessary in test use, as they often are in test development. However, in some contexts, legal requirements may place limits on the extent to which such compromises can be made. As with standards for the various phases of test development, when relevant standards are not met in test use, the reasons should be persuasive. The greater the potential impact on test takers, for good or ill, the greater the need to identify and satisfy the relevant standards.

In selecting a test and interpreting a test score, the test user is expected to have a clear understanding of the purposes of the testing and its probable consequences. The knowledgeable user has definite ideas on how to achieve these purposes and how to avoid unfairness and undesirable consequences. In subscribing to the Standards, test publishers and agencies mandating test use agree to provide information on the strengths and weaknesses of their instruments. They accept the responsibility to warn against likely misinterpretations by unsophisticated interpreters of individual scores or aggregated data. However, the ultimate responsibility for appropriate test use and interpretation lies predominantly with the test user. In assuming this responsibility, the user must become knowledgeable about a test’s appropriate uses and the populations for which it is suitable. The test user should be prepared to develop a logical analysis that supports the various facets of the assessment and the inferences made from the assessment results. Test users in all settings (e.g., clinical, counseling, credentialing, educational, employment, forensic, psychological) must also become adept in communicating the implications of test results to those entitled to receive them.

In some instances, users may be obligated to collect additional evidence about a test’s technical quality. For example, if performance assessments are locally scored, evidence of the degree of interscorer agreement may be required. Users should also be alert to the probable local consequences of test use, particularly in the case of large-scale testing programs. If the same test material is used in successive years, users should actively monitor the program to determine if reuse has compromised the integrity of the results.

Some of the standards that follow reiterate ideas contained in other chapters, principally chapter 3 (“Fairness in Testing”), chapter 6 (“Test Administration, Scoring, Reporting, and Interpretation”), chapter 8 (“The Rights and Responsibilities of Test Takers”), chapter 10 (“Psychological Testing and Assessment”), chapter 11 (“Workplace Testing and Credentialing”), and chapter 12 (“Educational Testing and Assessment”). This repetition is intentional. It permits an enumeration in one chapter of the major obligations that must be assumed largely by the test administrator and user, although these responsibilities may refer to topics that are covered more fully in other chapters.


STANDARDS FOR TEST USERS’ RIGHTS AND RESPONSIBILITIES

The standards in this chapter begin with an overarching standard (numbered 9.0), which is designed to convey the central intent or primary focus of the chapter. The overarching standard may also be viewed as the guiding principle of the chapter, and is applicable to all tests and test users. All subsequent standards have been separated into three thematic clusters labeled as follows:

1. Validity of Interpretations
2. Dissemination of Information
3. Test Security and Protection of Copyrights

Standard 9.0

Test users are responsible for knowing the validity evidence in support of the intended interpretations of scores on tests that they use, from test selection through the use of scores, as well as common positive and negative consequences of test use. Test users also have a legal and ethical responsibility to protect the security of test content and the privacy of test takers and should provide pertinent and timely information to test takers and other test users with whom they share test scores.

Comment: Test users are professionals who fall into several categories, including those who administer tests and those who interpret and use the results of tests. Test users who interpret and use the results of tests are responsible for ascertaining that there is appropriate validity evidence supporting their interpretations and uses of test results. In some circumstances, test users are also legally responsible for ascertaining the effect of their testing practices on relevant subgroups and for considering appropriate measures if negative consequences exist. In addition, although test users are often required to share the results of tests with test takers and other groups of test users, they must also remember that test content has to be protected to maintain the integrity of test scores, and that test takers have reasonable expectations of privacy, which may be specified in certain federal or state laws and regulations.

Cluster 1. Validity of Interpretations

Standard 9.1

Responsibility for test use should be assumed by or delegated to only those individuals who have the training, professional credentials, and/or experience necessary to handle this responsibility. All special qualifications for test administration or interpretation specified in the test manual should be met.

Comment: Test users should only interpret the scores of test takers whose special needs or characteristics are within the range of the test users’ qualifications. This standard has special significance in areas such as clinical testing, forensic testing, personality testing, testing in special education, testing of people with disabilities or limited exposure to the dominant culture, testing of English language learners, and in other such situations where the potential impact is great. When the situation or test-taker group falls outside the user’s experience, assistance should be obtained. A number of professional organizations have codes of ethics that specify the qualifications required of those who administer tests and interpret scores within the organizations’ scope of practice. Ultimately, the professional is responsible for ensuring that the clinical training requirements, ethical codes, and legal standards for administering and interpreting tests are met.

Standard 9.2

Prior to the adoption and use of a published test, the test user should study and evaluate the materials provided by the test developer. Of particular importance are materials that summarize the test’s purposes, specify the procedures for test administration, define the intended population(s) of test takers, and discuss the score interpretations for which validity and reliability/precision data are available.


Comment: A prerequisite to sound test use is knowledge of the materials accompanying the instrument. At a minimum, these include manuals provided by the test developer. Ideally, the user should be conversant with relevant studies reported in the professional literature, and should be able to discriminate between appropriate and inappropriate tests for the intended use with the intended population. The level of score reliability/precision and the types of validity evidence required for sound score interpretations depend on the test’s role in the assessment process and the potential impact of the process on the people involved. The test user should be aware of legal restrictions that may constrain the use of the test. On occasion, professional judgment may lead to the use of instruments for which there is little evidence of validity of the score interpretations for the chosen use. In these situations, the user should not imply that the scores, decisions, or inferences are based on well-documented evidence with respect to reliability or validity.

Standard 9.3

The test user should have a clear rationale for the intended uses of a test or evaluation procedure in terms of the validity of interpretations based on the scores and the contribution the scores make to the assessment and decision-making process.

Comment: The test user should be clear about the reasons that a test is being given. In other words, justification for the role of each instrument in selection, diagnosis, classification, and decision making should be arrived at before test administration, not afterwards. In some cases, the reasons for the referrals provide the rationale for the choice of the tests, inventories, and diagnostic procedures to be used, and the rationale may also be supported in printed materials prepared by the test publisher. The rationale may come from other sources as well, such as the empirical literature.

Standard 9.4

When a test is to be used for a purpose for which little or no validity evidence is available, the user is responsible for documenting the rationale for the selection of the test and obtaining evidence of the reliability/precision of the test scores and the validity of the interpretations supporting the use of the scores for this purpose.

Comment: The individual who uses test scores for purposes that are not specifically recommended by the test developer is responsible for collecting the necessary validity evidence. Support for such uses may sometimes be found in the professional literature. If previous evidence is not sufficient, then additional data should be collected over time as the test is being used. The provisions of this standard should not be construed as prohibiting the generation of hypotheses from test data. However, these hypotheses should be clearly labeled as tentative. Interested parties should be made aware of the potential limitations of the test scores in such situations.

Standard 9.5

Test users should be alert to the possibility of scoring errors and should take appropriate action when errors are suspected.

Comment: The costs of scoring errors are great, particularly in high-stakes testing programs. In some cases, rescoring may be requested by the test taker. If such a test-taker right is recognized in published materials, it should be respected. However, test users should not depend entirely on test takers to alert them to the possibility of scoring errors. Monitoring scoring accuracy should be a routine responsibility of testing program administrators wherever feasible, and rescoring should be done when mistakes are suspected.
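As one hypothetical illustration of routine monitoring (not a procedure prescribed by the Standards), a program that scores selected-response answer records locally might rescore a random sample against the answer key and flag any disagreements with the scores of record for review. The field names, data layout, and sample size below are assumptions made only for the sketch.

```python
# Hypothetical spot-check: rescore a random sample of answer records and
# compare with the scores of record. Field names and data layout are
# illustrative assumptions, not taken from the Standards.
import random
from typing import Dict, List

def rescore(responses: List[str], key: List[str]) -> int:
    """Number-correct score for one examinee's responses against the key."""
    return sum(1 for resp, correct in zip(responses, key) if resp == correct)

def spot_check(records: Dict[str, dict], key: List[str],
               sample_size: int = 50, seed: int = 0) -> List[str]:
    """Return IDs whose recorded score disagrees with an independent rescoring."""
    rng = random.Random(seed)
    sample_ids = rng.sample(list(records), min(sample_size, len(records)))
    return [tid for tid in sample_ids
            if rescore(records[tid]["responses"], key) != records[tid]["recorded_score"]]

if __name__ == "__main__":
    key = list("ABCDA")
    records = {
        "T001": {"responses": list("ABCDA"), "recorded_score": 5},
        "T002": {"responses": list("ABCCA"), "recorded_score": 5},  # recorded score is wrong
    }
    print("Flag for review:", spot_check(records, key))
```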

Standard 9.6

Test users should be alert to potential misinterpretations of test scores; they should take steps to minimize or avoid foreseeable misinterpretations and inappropriate uses of test scores.

Comment: Untrained audiences may adopt simplistic interpretations of test results or may attribute high or low scores or averages to a single causal factor. Test users can sometimes anticipate such misinterpretations and should try to prevent them. Obviously, not every unintended interpretation can be anticipated, and unforeseen negative consequences can occur. What is required is a reasonable effort to encourage sound interpretations and uses and to address any negative consequences that occur.

Standard 9.7

Test users should verify periodically that their interpretations of test data continue to be appropriate, given any significant changes in the population of test takers, the mode(s) of test administration, or the purposes in testing.

Comment: Over time, a gradual change in the characteristics of an examinee population may significantly affect the accuracy of inferences drawn from group averages. Modifications in test administration in response to unforeseen circumstances also may affect interpretations.

Standard 9.8

When test results are released to the public or to policy makers, those responsible for the release should provide and explain any supplemental information that will minimize possible misinterpretations of the data.

Comment: Test users have a responsibility to report results in ways that facilitate the intended interpretations for the proposed use(s) of the scores, and this responsibility extends beyond the individual test taker to any individuals or groups who are provided with test scores. Test users in group testing situations are responsible for ensuring that the individuals who use the test results are trained to interpret the scores properly. Preliminary briefings prior to the release of test results can give reporters, policy makers, or members of the public an opportunity to assimilate relevant data. Misinterpretation often can be the result of inadequate presentation of information that bears on test score interpretation.

Standard 9.9

When a test user contemplates an alteration in test format, mode of administration, instructions, or the language used in administering a test, the user should have a sound rationale and empirical evidence, when possible, for concluding that the reliability/precision of scores and the validity of interpretations based on the scores will not be compromised.

Comment: In some instances, minor changes in format or mode of administration may be reasonably expected, without evidence, to have little or no effect on test scores, classification decisions, and/or appropriateness of norms. In other instances, however, changes in the format or administrative procedures could have significant effects on the validity of interpretations of the scores—that is, these changes modify or change the construct being assessed. If a given modification becomes widespread, evidence for validity should be gathered; if appropriate, norms should also be developed under the modified conditions.

Standard 9.10

Test users should not rely solely on computer-generated interpretations of test results.

Comment: The user of automatically generated scoring and reporting services has the obligation to be familiar with the principles on which such interpretations were derived. All users who are making inferences and decisions on the basis of these reports should have the ability to evaluate a computer-based score interpretation in the light of other relevant evidence on each test taker. Automated narrative reports can be misleading, if used in isolation, and are not a substitute for sound professional judgment.


Standard 9.11

When circumstances require that a test be administered in the same language to all examinees in a linguistically diverse population, the test user should investigate the validity of the score interpretations for test takers with limited proficiency in the language of the test.

Comment: The achievement, abilities, and traits of examinees who do not speak the language of the test as their primary language may be mismeasured by the test, even if administering an alternative test is legally unacceptable. Sound practice requires ongoing evaluation of data to provide evidence supporting the use of the test with all linguistic groups or evidence to challenge the use of the test when language proficiency is not relevant.

Standard 9.12

When a major purpose of testing is to describe the status of a local, regional, or particular examinee population, the criteria for inclusion or exclusion of individuals should be adhered to strictly.

Comment: Biased results can arise from the exclusion of particular subgroups of examinees. Thus, decisions to exclude or include examinees should be based on appropriately representing the population.

Standard 9.13

In educational, clinical, and counseling settings, a test taker’s score should not be interpreted in isolation; other relevant information that may lead to alternative explanations for the examinee’s test performance should be considered.

Comment: It is neither necessary nor feasible to make an intensive review of every test taker’s score. In some settings, there may be little or no collateral information of value. In counseling, clinical, and educational settings, however, considerable relevant information is sometimes available. Obvious alternative explanations of low scores include low motivation, limited fluency in the language of the test, limited opportunity to learn, unfamiliarity with cultural concepts on which test items are based, and perceptual or motor impairments. The test user corroborates results from testing with additional information from a variety of sources, such as interviews and results from other tests (e.g., to address the concept of reliability of performance across time and/or tests). When an inference is based on a single study or based on studies with samples that are not representative of the test takers, the test user should be more cautious about the inference that is made. In clinical and counseling settings, the test user should not ignore how well the test taker is functioning in daily life. If tests are being administered by computers and other electronic devices or via the Internet, test users still have a responsibility to provide support for the interpretation of test scores, including considerations of alternative explanations, when appropriate.

Standard 9.14

Test users should inform individuals who may need accommodations in test administration (e.g., older adults, test takers with disabilities, or English language learners) about the availability of accommodations and, when required, should see that these accommodations are appropriately made available.

Comment: Appropriate accommodations depend on the nature of the test and the needs of the test takers, and should be in keeping with the documentation provided with the test. Test users should inform test takers of the availability of accommodations, and the onus may then fall on the test takers or their guardians to request accommodations and provide documentation in support of their requests. Test users should be able to indicate the information or evidence (e.g., test manual, research study) used to choose an appropriate accommodation.


Cluster 2. Dissemination of Information

Standard 9.15

Those who have a legitimate interest in an assessment should be informed about the purposes of testing, how tests will be administered, the factors considered in scoring examinee responses, how the scores will be used, how long the records will be retained, and to whom and under what conditions the records may be released.

Comment: Individuals with a legitimate interest in assessment results include, but may not be limited to, test takers, parents or guardians of test takers, educators, and courts. This standard has greater relevance and application to educational and clinical testing than to employment testing. In most uses of tests for screening job applicants and applicants to educational programs, for licensing professionals and awarding credentials, or for measuring achievement, the purposes of testing and the uses to be made of the test scores are obvious to the test takers. Nevertheless, it is wise to communicate this information at least briefly even in these settings. In some situations, however, the rationale for the testing may be clear to relatively few test takers. In such settings, a more detailed and explicit discussion may be warranted. Retention of records, security requirements, and privacy of records are often governed by legal requirements or institutional practices, even in situations where release of records would clearly benefit the examinees. Prior to testing, where appropriate, the test user should tell the test taker who will have access to the test results and the written report, how the test results will be shared with the test taker, and whether and under what conditions the test results will be shared with a third party or the public (e.g., in court proceedings).

Standard 9.16

Unless circumstances clearly require that test results be withheld, a test user is obligated to provide a timely report of the results to the test taker and others entitled to receive this information.

Comment: The nature of score reports is often dictated by practical considerations. In some cases (e.g., with some certification or employment tests), only a brief printed report may be feasible. In other cases, it may be desirable to provide both an oral and a written report. The interpretation should vary according to the level of sophistication of the recipient. When the examinee is a young child, an explanation of the test results is typically provided to parents or guardians. Feedback in the form of a score report or interpretation is not always provided when tests are administered for personnel selection or promotion, or in certain other circumstances. In some cases, federal or state privacy laws may govern the scope of information disclosed and to whom it may be disclosed.

Standard 9.17

If a test taker or test user is concerned about the integrity of the test taker’s scores, the test user should inform the test taker of his or her relevant rights, including the possibility of appeal and representation by counsel.

Comment: Proctors in entrance or licensure testing programs may report irregularities in the test administration process that result in challenges from test takers (e.g., fire alarm in building or temporary failure of Internet access). Other challenges may be raised by test users (e.g., university admissions officers) when test scores are grossly inconsistent with other applicant information. Test takers should be apprised of their rights, if any, in such situations.

Standard 9.18

Test users should explain to test takers their opportunities, if any, to retake an examination; users should also indicate whether any earlier as well as later scores will be reported to those entitled to receive the score reports.

Comment: Some testing programs permit test takers to retake an examination several times, to cancel scores, or to have scores withheld from potential recipients. Test takers and other score recipients should be informed of such privileges, if any, and the conditions under which they apply.

Standard 9.19

Test users are obligated to protect the privacy of examinees and institutions that are involved in a testing program, unless a disclosure of private information is agreed upon or is specifically authorized by law.

Comment: Protection of the privacy of individual examinees is a well-established principle in psychological and educational measurement. Storage and transmission of this type of information should meet existing professional and legal standards, and care should be taken to protect the confidentiality of scores and ancillary information (e.g., disability status). In certain circumstances, test users and testing agencies may adopt more stringent restrictions on the communication and sharing of test results than relevant law dictates. Privacy laws may apply to certain types of information, and similar or more rigorous standards sometimes arise through the codes of ethics adopted by relevant professional organizations. In some testing programs the conditions for disclosure are stated to the examinee prior to testing, and taking the test can constitute agreement to the disclosure of test score information as specified. In other programs, the test taker or his or her parents or guardians must formally agree to any disclosure of test information to individuals or agencies other than those specified in the test administrator’s published literature. Applicable privacy laws, if any, may govern and allow (as in the case of school districts for accountability purposes) or prohibit (as in clinical settings) the disclosure of test information. It should be noted that the right of the public and the media to examine the aggregate test results of public school systems is often guaranteed by law. This may often include test scores disaggregated by demographic subgroups when the numbers are sufficient to yield statistically sound results and to prevent the identification of individual test takers.
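As a hypothetical illustration of that last point (the Standards do not specify a threshold or procedure), a reporting program might suppress any subgroup result based on fewer examinees than a locally chosen minimum n before publishing aggregate results. The threshold of 10 and the data layout below are assumptions made only for this sketch.

```python
# Hypothetical small-cell suppression before public release of aggregate
# results: subgroup means based on fewer than MIN_N examinees are withheld.
# The threshold and data layout are illustrative assumptions only.
from statistics import mean
from typing import Dict, List, Optional

MIN_N = 10  # locally chosen reporting threshold (assumption)

def suppressed_subgroup_means(
    scores_by_subgroup: Dict[str, List[float]]
) -> Dict[str, Optional[float]]:
    """Report a subgroup mean only when the subgroup meets the minimum n."""
    return {
        group: (round(mean(scores), 1) if len(scores) >= MIN_N else None)
        for group, scores in scores_by_subgroup.items()
    }

if __name__ == "__main__":
    data = {
        "Group A": [510, 498, 523, 515, 502, 530, 495, 508, 511, 520, 517],
        "Group B": [505, 512, 499],  # below threshold: suppressed (reported as None)
    }
    print(suppressed_subgroup_means(data))
```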

Standard 9.20

In situations where test results are shared with the public, test users should formulate and share the established policy regarding the release of the results (e.g., timeliness, amount of detail) and apply that policy consistently over time.

Comment: Test developers and test users should consider the practices of the communities they serve and facilitate the creation of common policies regarding the release of test results. For example, in many states, the release of data from large-scale educational tests is often required by law. However, even when the release of data is not required but is routinely done, test users should have clear policies governing the release procedures. Different policies without appropriate rationales can confuse the public and lead to unnecessary controversy.

Cluster 3. Test Security and Protection of Copyrights

Standard 9.21

Test users have the responsibility to protect the security of tests, including that of previous editions.

Comment: When tests are used for purposes of selection, credentialing, educational accountability, or for clinical diagnosis, treatment, and monitoring, the rigorous protection of test security is essential, for reasons related to validity of inferences drawn, protection of intellectual property rights, and the costs associated with developing tests. Test developers, test publishers, and individuals who hold the copyrights on tests provide specific guidelines about test security and disposal of test materials. The test user is responsible for helping to ensure the security of test materials according to the professional guidelines established for that test as well as any applicable legal standards. Resale of copyrighted materials in open forums is a violation of this standard, and audio and video recordings for training purposes must also be handled in such a way that they are not released to the public. These prohibitions also apply to outdated and previous editions of tests; test users should help to ensure that test materials are securely disposed of when no longer in use (e.g., upon retirement or after purchase of a new edition). Consistency and clarity in the definition of acceptable and unacceptable practices are critical in such situations. When tests are involved in litigation, inspection of the instruments should be restricted—to the extent permitted by law—to those who are obligated legally or by professional ethics to safeguard test security.

Standard 9.22

Test users have the responsibility to respect test copyrights, including copyrights of tests that are administered via electronic devices.

Comment: Legally and ethically, test users may not reproduce or create electronic versions of copyrighted materials for routine test use without consent of the copyright holder. These materials—in both paper and electronic form—include test items, test protocols, ancillary forms such as answer sheets or profile forms, scoring templates, conversion tables of raw scores to reported scores, and tables of norms. Storage and transmission of test information should satisfy existing legal and professional standards.

Standard 9.23

Test users should remind all test takers, including those taking electronically administered tests, and others who have access to test materials that copyright policies and regulations may prohibit the disclosure of test items without specific authorization.

Comment: In some cases, information on copyrights and prohibitions on the disclosure of test items are provided in written form or verbally as part of the procedure prior to beginning the test or as part of the administration procedures. However, even in cases where this information is not a formal part of the test administration, if materials are copyrighted, test users should inform test takers of their responsibilities in this area.


PART III

Testing Applications


10. PSYCHOLOGICAL TESTING AND ASSESSMENT

BACKGROUND

This chapter addresses issues important to professionals who use psychological tests to assess individuals. Topics covered in this chapter include test selection and administration, test score interpretation, use of collateral information in psychological testing, types of tests, and purposes of psychological testing. The types of psychological tests reviewed in this chapter include cognitive and neuropsychological, problem behavior, family and couples, social and adaptive behavior, personality, and vocational. In addition, the chapter includes an overview of five common uses of psychological tests: for diagnosis; neuropsychological evaluation; intervention planning and outcome evaluation; judicial and governmental decisions; and personal awareness, social identity, and psychological health, growth, and action. The standards in this chapter are applicable to settings where in-depth assessment of people, individually or in groups, is conducted. Psychological tests are used in several other contexts as well, most notably in employment and educational settings. Tests designed to measure specific job-related characteristics across multiple candidates for selection purposes are treated in the text and standards of chapter 11; tests used in educational settings are addressed in depth in chapter 12.

It is critical that professionals who use tests to conduct assessments of individuals have knowledge of educational, linguistic, national, and cultural factors as well as physical capabilities that influence (a) a test taker’s development, (b) the methods for obtaining and conveying information, and (c) the planning and implementation of interventions. Therefore, readers are encouraged to review chapter 3, which discusses fairness in testing; chapter 8, which focuses on rights of test takers; and chapter 9, which focuses on rights and responsibilities of test users. In chapters 1, 2, 4, 5, 6, and 7, readers will find important additional detail on validity; on reliability and precision; on test development; on scaling and equating; on test administration, scoring, reporting, and interpretation; and on supporting documentation.

The use of psychological tests provides one approach to collecting information within the larger framework of a psychological assessment of an individual. Typically, psychological assessments involve an interaction between a professional, who is trained and experienced in testing, the test taker, and a client who may be the test taker or another party. The test taker may be a child, an adolescent, or an adult. The client usually is the person or agency that arranges for the assessment. Clients may be patients, counselees, parents, children, employees, employers, attorneys, students, government agencies, or other responsible parties. The settings in which psychological tests or inventories are used include (but are not limited to) preschools; elementary, middle, and secondary schools; colleges and universities; pre-employment settings; hospitals; prisons; mental health and health clinics; and other professionals’ offices.

The tasks involved in a psychological assessment—collecting, evaluating, integrating, and reporting salient information relevant to the aspects of a test taker’s functioning that are under examination—comprise a complex and sophisticated set of professional activities. A psychological assessment is conducted to answer specific questions about a test taker’s psychological functioning or behavior during a particular time interval or to predict an aspect of a test taker’s psychological functioning or behavior in the future. Because test scores characteristically are interpreted in the context of other information about the test taker, an individual psychological assessment usually also includes interviewing the test taker; observing the test taker’s behavior in the appropriate setting; reviewing educational, health, psychological, and other relevant records; and integrating these findings with other information that may be provided by third parties. The results from tests and inventories used in psychological assessments may help the professional to understand test takers more fully and to develop more informed and accurate hypotheses, inferences, and decisions about aspects of the test taker’s psychological functioning or appropriate interventions.

The interpretation of test and inventory scores can be a valuable part of the assessment process and, if used appropriately, can provide useful information to test takers as well as to other users of the test interpretation. For example, the results of tests and inventories may be used to assess the psychological functioning of an individual; to assign diagnostic classification; to detect and characterize neuropsychological impairment, developmental delays, and learning disabilities; to determine the validity of a symptom; to assess cognitive and personality strengths or mental health and emotional behavior problems; to assess vocational interests and values; to determine developmental stages; to assist in health decision making; or to evaluate treatment outcomes. Test results also may provide information used to make decisions that have a powerful and lasting impact on people’s lives (e.g., vocational and educational decisions; diagnoses; treatment plans, including plans for psychopharmacological intervention; intervention and outcome evaluations; health decisions; disability determinations; decisions on parole sentencing, civil commitment, child custody, and competency to stand trial; personal injury litigation; and death penalty decisions).

Test Selection and Administration

The selection and administration of psychological tests and inventories often is individualized for each participant. However, in some settings predetermined tests may be taken by all participants, and interpretations of results may be provided in a group setting.

The assessment process begins by clarifying, as much as possible, the reasons why a test taker will be assessed. Guided by these reasons or other relevant concerns, the tests, inventories, and diagnostic procedures to be used are selected, and other sources of information needed to evaluate the test taker are identified. Preliminary findings may lead to the selection of additional tests. The professional is responsible for being familiar with the evidence of validity for the intended uses of scores from the tests and inventories selected, including computer-administered or online tests. Evidence of the reliability/precision of scores, and the availability of applicable normative data in the test's accumulated research literature, also should be considered during test selection. In the case of tests that have been revised, editions currently supported by the publisher usually should be selected. On occasion, use of an earlier edition of an instrument is appropriate (e.g., when longitudinal research is conducted, or when an earlier edition contains relevant subtests not included in a later edition). In addition, professionals are responsible for guarding against reliance on test scores that are outdated; in such cases, retesting is appropriate. In international applications, it is especially important to verify that the construct being assessed has equivalent meaning across international borders and cultural contexts.

Validity and reliability/precision considerations are paramount, but the demographic characteristics of the group(s) for which the test originally was constructed and for which initial and subsequent normative data are available also are important test selection considerations. Selecting a test with demographically and clinically appropriate normative groups relevant for the test taker and for the purpose of the assessment is important for the generalizability of the inferences that the professional seeks to make. Applying a test constructed for one group to other groups may not be appropriate, and score interpretations, if the test is used, should be qualified and presented as hypotheses rather than conclusions.

Tests and inventories that meet high technical standards of quality are a necessary but not a sufficient condition for the responsible administration and scoring of tests and interpretation and use of test scores. A professional conducting a psychological assessment must complete the appropriate education and training, acquire appropriate credentials, adhere to professional ethical guidelines, and possess a high degree of professional judgment and scientific knowledge.

Professionals who oversee testing and assessment should be thoroughly versed in proper test administration procedures. They are responsible for ensuring that all persons who administer and score tests have received the appropriate education and training needed to perform their assigned tasks. Test administrators should administer tests in the manner that the test manuals indicate and should adhere to ethical and professional standards. The education and experience necessary to administer group tests and/or to proctor computer-administered tests generally are less extensive than the qualifications necessary to administer and interpret scores from individually administered tests that require interactions between the test taker and the test administrator. In many situations where complex behavioral observations are required, the use of a nonprofessional to administer or score tests may be inappropriate. Prior to beginning the assessment process, the test taker or a responsible party acting on the test taker's behalf (e.g., parent, legal guardian) should understand who will have access to the test results and the written report, how test results will be shared with the test taker, and whether and when decisions based on the test results will be shared with the test taker and/or a third party or the public (e.g., in court proceedings).

Test administrators must be aware of any personal limitations that affect their ability to administer and score the test fairly and accurately. These limitations may include physical, perceptual, and cognitive factors. Some tests place considerable demands on the test administrator (e.g., recording responses rapidly, manipulating equipment, or performing complex item scoring during administration). Test administrators who cannot comfortably meet these demands should not administer such tests. For tests that require oral instructions prior to or during administration, test administrators should be sure that there are no barriers to being clearly understood by test takers.

When using a battery of tests, the professional should determine the appropriate order of tests to be administered. For example, when administering cognitive and neuropsychological tests, some professionals first administer tests to assess basic domains (e.g., attention) and end with tests to assess more complex domains (e.g., executive functions). Professionals also are responsible for establishing testing conditions that are appropriate to the test taker's needs and abilities. For example, the examiner may need to determine if the test taker is capable of reading at the level required and if vision, hearing, psychomotor, or clinical impairments or neurological deficits are adequately accommodated. Chapter 3 addresses access considerations and standards in detail.

Standardized administration is not required for all tests but is important for the interpretation of test scores for many tests and purposes. In those situations, standardized test administration procedures should be followed. When nonstandard administration procedures are needed or allowed, they should be described and justified. The interpreter of the test results should be informed if the test was unproctored or if it was administered under nonstandardized procedures. In some circumstances, test administration may provide the opportunity for skilled examiners to carefully observe the performance of test takers under standardized conditions. For example, the test administrators' observations may allow them to record behaviors being assessed, to understand the manner in which test takers arrived at their answers, to identify test-taker strengths and weaknesses, and to make modifications in the testing process. If tests are administered by computer or other technological devices or online, the professional is responsible for determining if the purpose of the assessment and the capabilities of the test taker require the presence of a proctor or support staff (e.g., to assist with the use of the computer equipment or software). Also, some computer-administered tests may require giving the test taker the opportunity to receive instructions and to practice prior to the test administration. Chapters 4 and 6 provide additional detail on technologically administered tests.

Inappropriate effort on the part of the person being assessed may affect the results of psychological assessment and may introduce error into the measurement of the construct in question. Therefore, in some cases, the importance of expending appropriate effort when taking the test should be explained to the test taker. For many tests, measures of effort can be derived from stand-alone tests or from responses embedded within a standard assessment procedure (e.g., increased numbers of errors, inconsistent responding, and unusual responses relevant to symptom patterns), and effort may be measured throughout the assessment process. When low levels of effort and motivation are evident during the test administration, continuing an evaluation may result in inappropriate score interpretations.

Professionals are responsible for protecting the confidentiality and security of the test results and the testing materials. Storage and transmission of this type of information should satisfy relevant professional and legal standards.

Test Score Interpretation

Test scores used in psychological assessment ideally are interpreted in light of a number of factors, including the available normative data appropriate to the characteristics of the test taker, the psychometric properties of the test, indicators of effort, the circumstances of the test taker at the time the test is given, the temporal stability of the constructs being measured, and the effects of moderator variables and demographic characteristics on test results. The professional rarely has the resources available to personally conduct the research or to assemble representative norms that, in some types of assessment, might be needed to make accurate inferences about each individual test taker's past, current, and future functioning. Therefore, the professional may need to rely on the research and the body of scientific knowledge available for the test that support appropriate inferences. Presentation of validity and reliability/precision evidence often is not needed in the written report summarizing the findings of the assessment, but the professional should strive to understand, and be prepared to articulate, such evidence as the need arises.

When making inferences about a test taker's past, present, and future behaviors and other characteristics from test scores, the professional should consider other available data that support or challenge the inferences. For example, the professional should review the test taker's history and information about past behaviors, as well as the relevant literature, to develop familiarity with supporting evidence. At times, the professional also should corroborate results from one testing session with results from other tests and testing sessions to address reliability/precision and validity of the inferences made about the test taker's performance across time and/or tests. Triangulation of multiple sources of information, including stylistic and test-taking behaviors inferred from observation during the test administration, may strengthen confidence in the inference. Importantly, data that are not supportive of the inferences should be acknowledged and either reconciled with other information or noted as a limitation to the confidence placed in the inference. When there is strong evidence for the reliability/precision and validity of the scores for the intended uses of a test and strong evidence for the appropriateness of the test for the test taker being assessed, then the professional's ability to draw appropriate inferences increases. When an inference is based on a single study or based on several studies whose samples are of limited generalizability to the test taker, then the professional should be more cautious about the inference and note in the report limitations regarding conclusions drawn from the inference.

Threats to the interpretability of obtained scores are minimized by clearly defining how particular psychological tests are to be used. These threats occur as a result of construct-irrelevant variance (i.e., aspects of the test and the testing process that are not relevant to the purpose of the test scores) and construct underrepresentation (i.e., failure of the test to account for important facets relevant to the purpose of the testing). Response bias and faking are examples of construct-irrelevant components that may significantly skew the obtained scores, possibly resulting in inaccurate or misleading interpretations. In situations where response bias or faking is anticipated, professionals may choose a test that has scales (e.g., percentage of "yes" answers, percentage of "no" answers; "faking good," "faking bad") that clarify the threats to validity. In so doing, the professionals may be able to assess the degree to which test takers are acquiescing to the perceived demands of the test administrator or attempting to portray themselves as impaired by "faking bad," or as well functioning by "faking good."

For some purposes, including career counseling and neuropsychological assessment, batteries of tests are frequently used. For example, career counseling batteries may include tests of abilities, values, interests, and personality. Neuropsychological batteries may include measures of orientation, attention, communication skills, executive function, fluency, visual-motor and visual-spatial skills, problem solving, organization, memory, intelligence, academic achievement, and/or personality, along with tests of effort. When psychological test batteries incorporate multiple methods and scores, patterns of test results frequently are interpreted as reflecting a construct or even an interaction among constructs underlying test performance. Interactions among the constructs underlying configurations of test outcomes may be postulated on the basis of test score patterns. The literature reporting evidence of reliability/precision and validity of configurations of scores that supports the proposed interpretations should be identified when possible. However, it is understood that little, if any, literature exists that describes the validity of interpretations of scores from highly customized or flexible batteries of tests. The professional should recognize that variability in scores on different tests within a battery commonly occurs in the general population, and should use base rate data, when available, to determine whether the observed variability is exceptional. If the literature is incomplete, the resulting inferences may be presented with the qualification that they are hypotheses for future verification rather than probabilistic statements regarding the likelihood of some behavior that imply some known validity evidence.
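
As a purely illustrative sketch (not part of the Standards), the following Python fragment shows one conventional reliability-based way of asking whether the difference between two scores in a battery exceeds what measurement error alone would commonly produce; the index scores, reliabilities, and the 1.96 criterion are hypothetical choices, and a statistically reliable difference may still be common in the general population, which is why base rate data remain important.

```python
# Illustrative only: a reliability-based check of whether the gap between two
# scores in a battery exceeds what measurement error alone would commonly
# produce. All numbers below are hypothetical.
import math

def sem(sd, reliability):
    """Standard error of measurement for a single score."""
    return sd * math.sqrt(1.0 - reliability)

def critical_difference(sd_1, rel_1, sd_2, rel_2, z=1.96):
    """Difference needed before two scores are unlikely to differ
    by measurement error alone (roughly the 95% level)."""
    se_diff = math.sqrt(sem(sd_1, rel_1) ** 2 + sem(sd_2, rel_2) ** 2)
    return z * se_diff

# Hypothetical index scores on a standard-score metric (mean 100, SD 15).
observed_gap = abs(112 - 95)
threshold = critical_difference(sd_1=15, rel_1=0.92, sd_2=15, rel_2=0.88)
print(f"Observed difference: {observed_gap}")
print(f"Critical difference (approx. 95% level): {threshold:.1f}")
print("Reliable difference" if observed_gap > threshold
      else "Within measurement error")
```

Even when such a difference is reliable in this statistical sense, published base rates for the relevant normative group indicate how often differences of that size occur in the general population.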

Collateral Information Used in Psychological Testing and Assessment

Test scores that are used as part of a psychological assessment are best interpreted in the context of the test taker's personal history and other relevant traits and personal characteristics. The quality of interpretations made from psychological tests and assessments often can be enhanced by obtaining credible collateral information from various third-party sources, such as significant others, teachers, health professionals, and school, legal, military, and employment records. The quality of collateral information is enhanced by using various methods to acquire it. Structured behavioral observations, checklists, ratings, and interviews are a few of the methods that may be used, along with objective test scores, to minimize the need for the scorer to rely on individual judgment. For example, an evaluation of career goals may be enhanced by obtaining a history of employment as well as by administering tests to assess academic aptitude and achievement, vocational interests, work values, personality, and temperament. The availability of information on multiple traits or attributes, when acquired from various sources and through the use of various methods, enables professionals to assess more accurately an individual's psychosocial functioning and facilitates more effective decision making. When using collateral data, the professional should take steps to ascertain their accuracy and reliability, especially when the data come from third parties who may have a vested interest in the outcome of the assessment.

Types of Psychological Testing and Assessment

For purposes of this chapter, the types of psychological tests have been divided into six categories: cognitive and neuropsychological tests; problem behavior tests; family and couples tests; social and adaptive behavior tests; personality tests; and vocational tests.

Cognitive and Neuropsychological Testing and Assessment

Tests often are used to assess various classes of cognitive and neuropsychological functioning, including intelligence, broad ability domains, and more focused domains (e.g., abstract reasoning and categorical thinking; academic achievement; attention; cognitive ability; executive function; language; learning and memory; motor and sensorimotor functions and lateral preferences; and perception and perceptual organization/integration). Overlap may occur in the constructs that are assessed by tests of differing functions or domains. In common with other types of tests, cognitive and neuropsychological tests require a minimally sufficient level of test-taker capacity to maintain attention as well as appropriate effort. For example, when administering cognitive and neuropsychological tests, some professionals first administer tests to assess basic domains (e.g., attention) and end with administration of tests to assess more complex domains (e.g., executive function).

Abstract reasoning and categorical thinking. Tests of reasoning and thinking measure a broad array of skills and abilities, including the examinee's ability to infer relationships, to form new concepts or strategies, to respond to changing environmental circumstances, and to act in goal-oriented situations, as well as the ability to understand a problem or a concept, to develop a strategy to solve that problem, and, as necessary, to alter such concepts or strategies as situations vary.

Academic achievement. Academic achievement tests are measures of knowledge and skills that a person has acquired in formal and informal learning situations. Two major types of academic achievement tests include general achievement batteries and diagnostic achievement tests. General achievement batteries are designed to assess a person's level of learning in multiple areas (e.g., reading, mathematics, and spelling). In contrast, diagnostic achievement tests typically focus on one subject area (e.g., reading) and assess an academic skill in greater detail. Test results are used to determine the test taker's strengths and may also help identify sources of academic difficulties or deficiencies. Chapter 12 provides additional detail on academic achievement testing in educational settings.

Attention. Attention refers to a domain that encompasses the constructs of arousal, establishment of sets, strategic deployment of attention, sustained attention, divided attention, focused attention, selective attention, and vigilance. Tests may measure (a) levels of alertness, orientation, and localization; (b) the ability to focus, shift, and maintain attention and to track one or more stimuli under various conditions; (c) span of attention; and (d) short-term information storage functioning. Scores for each aspect of attention that have been examined should be reported individually so that the nature of an attention disorder can be clarified.

Cognitive ability. Measures designed to quantify cognitive abilities are among the most widely administered tests. The interpretation of results from a cognitive ability test is guided by the theoretical constructs used to develop the test. Some cognitive ability assessments are based on results from multidimensional test batteries that are designed to assess a broad range of skills and abilities. Test results are used to draw inferences about a person's overall level of intellectual functioning and about strengths and weaknesses in various cognitive abilities, and to diagnose cognitive disorders.

Executive function. This class of functions is involved in the organized performances (e.g., cognitive flexibility, inhibitory control, multitasking) that are necessary for the independent, purposive, and effective attainment of goals in various cognitive-processing, problem-solving, and social situations. Some tests emphasize (a) reasoned plans of action that anticipate consequences of alternative solutions, (b) motor performance in problem-solving situations that require goal-oriented intentions, and/or (c) regulation of performance for achieving a desired outcome.

Language. Language deficiencies typically are identified with assessments that focus on phonology, morphology, syntax, semantics, supralinguistics, and pragmatics. Various functions may be assessed, including listening, reading, and spoken and written language skills and abilities. Language disorder assessments focus on functional speech and verbal comprehension measured through oral, written, or gestural modes; lexical access and elaboration; repetition of spoken language; and associative verbal fluency. If a multilingual person is assessed for a possible language disorder, the degree to which the disorder may be due more directly to developmental language issues (e.g., phonological, morphological, syntactic, semantic, or pragmatic delays; intellectual disabilities; peripheral, sensory, or central neurological impairment; psychological conditions; or sensory disorders) than to lack of proficiency in a given language must be addressed.

Learning and memory. This class of functions involves the acquisition, retention, and retrieval of information beyond the requirements of immediate or short-term information processing and storage. These tests may measure acquisition of new information through various sensory channels and by means of assorted test formats (e.g., word lists, prose passages, geometric figures, form boards, digits, and musical melodies). Memory tests also may require retention and recall of old information (e.g., personal data as well as commonly learned facts and skills). In addition, testing of recognition of stored information may be used in understanding memory deficits.

Motor functions, sensorimotor functions, and lateral preferences. Motor functions (e.g., finger tapping) and sensory functions (e.g., tactile stimulation) are often measured as part of a comprehensive neuropsychological evaluation. Motor tests assess various aspects of movement such as speed, dexterity, coordination, and purposeful movement. Sensory tests evaluate function in the areas of vision, hearing, touch, and sometimes smell. Testing also is done to examine the integration of perceptual and motor functions.

Perception and perceptual organization/integration. This class of functioning involves reasoning and judgment as they relate to the processing and elaboration of complex sensory combinations and inputs. Tests of perception may emphasize immediate perceptual processing but also may require conceptualizations that involve some reasoning and judgmental processes. Some tests have motor components ranging from making simple movements to building complex constructions. These tests assess activities ranging from perceptual speed to choice reaction time, to complex information processing and visual-spatial reasoning.

Problem Behavior Testing and Assessment

Problem behaviors include behavioral adjustment difficulties that interfere with a person's effective functioning in daily life situations. Tests are used to assess the individual's behavior and self-perceptions for differential diagnosis and educational classification for a variety of emotional and behavioral disorders and to aid in the development of treatment plans. In some cases (e.g., death penalty evaluations), retrospective analysis is required and multiple sources of information help provide the most comprehensive assessment possible. Observing a person in her or his environment often is helpful for understanding fully the specific demands of the environment, not only to offer a more comprehensive assessment but to provide more useful recommendations.

Family and Couples Testing and Assessment

Family testing addresses the issues of family dynamics, cohesion, and interpersonal relations among family members, including partners, parents, children, and extended family members. Tests developed to assess families and couples are distinguished by whether they measure the interaction patterns of partial or whole families, in both cases requiring simultaneous focus on two or more family members in terms of their transactions. Testing with couples may address factors such as issues of intimacy, compatibility, shared interests, trust, and spiritual beliefs.

Social and Adaptive Behavior Testing and Assessment

Measures of social and adaptive behaviors assess motivation and ability to care for oneself and relate to others. Social and adaptive behaviors are based on a repertoire of knowledge, skills, and abilities that enable a person to meet the daily demands and expectations of the environment, such as eating, dressing, working, participating in leisure activities, using transportation, interacting with peers, communicating with others, making purchases, managing money, maintaining a schedule, living independently, being socially responsive, and engaging in healthy behaviors.

Personality Testing and Assessment

The assessment of personality requires a synthesis of aspects of an individual's functioning that contribute to the formulation and expression of thoughts, attitudes, emotions, and behaviors. Some of these aspects are stable over time; others change with age or are situation specific. Cognitive and emotional functioning may be considered separately in assessing an individual, but their influences are interrelated. For example, a person whose perceptions are highly accurate, or who is relatively stable emotionally, may be able to control suspiciousness better than a person whose perceptions are inaccurate or distorted or who is emotionally unstable.

Scores or personality descriptors derived from a personality test may be regarded as reflecting the underlying theoretical constructs or empirically derived scales or factors that guided the test's construction. The stimulus-and-response formats of personality tests vary widely. Some include a series of questions (e.g., self-report inventories) to which the test taker is required to respond by choosing from multiple well-defined options; others involve being placed in a novel situation in which the test taker's response is not completely structured (e.g., responding to visual stimuli, telling stories, discussing pictures, or responding to other projective stimuli). Results may consist of themes, patterns, or diagnostic indicators, as well as scores. The responses are scored and combined into either logically or statistically derived dimensions established by previous research.

Personality tests may be designed to assess normal or abnormal attitudes, feelings, traits, and related characteristics. Tests intended to measure normal personality characteristics are constructed to yield scores reflecting the degree to which a person manifests personality dimensions empirically identified and hypothesized to be present in the behavior of most individuals. A person's configuration of scores on these dimensions is then used to infer how the person behaves presently and how she or he may behave in new situations. Test scores outside the expected range may be considered strong expressions of normal traits or may be indicative of psychopathology. Such scores also may reflect normal functioning of the person within a culture different from that of the population on which the norms are based.

Other personality tests are designed specifically to measure constructs underlying abnormal functioning and psychopathology. Developers of some of these tests use previously diagnosed individuals to construct their scales and base their interpretations on the association between the test's scale scores, within a given range, and the behavioral correlates of persons who scored within that range, as compared with clinical samples. If interpretations made from scores go beyond the theory that guided the test's construction, then evidence of the validity of the interpretations should be collected and analyzed from additional relevant data.

Vocational Testing and Assessment

Vocational testing generally includes the measurement of interests, work needs, and values, as well as consideration and assessment of related elements of career development, maturity, and indecision. Academic achievement and cognitive abilities, discussed earlier in the section on cognitive ability, also are important components in vocational testing and assessment. Results from these tests often are used to enhance personal growth and understanding and for career counseling, outplacement counseling, and vocational decision making. These interventions frequently take place in the context of educational and vocational rehabilitation. However, vocational testing may also be used in the workplace as part of corporate programs for career planning.

Interest inventories. The measurement of interests is designed to identify a person's preferences for various activities. Self-report interest inventories are widely used to assess personal preferences, including likes and dislikes for various work and leisure activities, school subjects, occupations, or types of people. The resulting scores may provide insight into types and patterns of interests in educational curricula (e.g., college majors), in various fields of work (e.g., specific occupations), or in more general or basic areas of interests related to specific activities (e.g., sales, office practices, or mechanical activities).

Work values inventories. The measurement of work values identifies a person's preferences for the various reinforcements one may obtain from work activities. Sometimes these values are identified as needs that persons seek to satisfy. Work values or needs may be categorized as intrinsic and important for the pleasure gained from the activity (e.g., being independent, using one's abilities) or as extrinsic and important for the rewards they bring (e.g., pay, promotion). The format of work values tests usually involves a self-rating of the importance of the value associated with qualities described by the items.

Measures of career development, maturity, and indecision. Additional areas of vocational assessment include measures of career development and maturity and measures of career indecision. Inventories that measure career development and maturity typically elicit self-descriptions in response to items that inquire about individuals' knowledge of the world of work; self-appraisal of their decision-making skills; attitudes toward careers and career choices; and the degree to which the individuals already have engaged in career planning. Measures of career indecision usually are constructed and standardized to assess both the level of career indecision of a test taker and the reasons for, or antecedents of, this indecision. Results from tests such as these are often used with individuals and groups to guide the design and delivery of career services and to evaluate the effectiveness of career interventions.

Purposes of Psychological Testing and Assessment

For purposes of this chapter, psychological test uses have been divided into five categories: testing for diagnosis; testing for neuropsychological evaluations; testing for intervention planning and outcome evaluation; testing for judicial and governmental decisions; and testing for personal awareness, social identity, and psychological health, growth, and action. However, these categories are not always mutually exclusive.

Testing for Diagnosis

Diagnosis refers to a process that includes the collection and integration of test results with prior and current information about a person, together with relevant contextual conditions, to identify characteristics of healthy psychological functioning as well as psychological disorders. Disorders may manifest themselves in information obtained during the testing of an individual's cognitive, emotional, adaptive, behavioral, personality, neuropsychological, physical, or social attributes.

Psychological tests are helpful to professionals involved in the diagnosis of an individual's psychological health. Testing may be performed to confirm a hypothesized diagnosis or to rule out alternative diagnoses. Diagnosis is complicated by the prevalence of comorbidity between diagnostic categories. For example, an individual diagnosed with dementia may simultaneously be diagnosed as depressed. Or a child diagnosed as having a learning disability also may be diagnosed as suffering from an attention deficit/hyperactivity disorder. The goal of diagnosis is to provide a brief description of the test taker's psychological dysfunction and to assist each test taker in receiving the appropriate interventions for the psychological or behavioral dysfunctions that the client, or a third party, views as impairing the client's expected functioning and/or enjoyment of life. When the intent of assessment is differential diagnosis, the professional should use tests for which there is evidence that the scores distinguish between two or more diagnostic groups. Group mean differences do not provide sufficient evidence for the accuracy of differential diagnosis; additional information, such as effect sizes or data indicating the degree of overlap between criterion groups, also should be provided by the test developers. In developing treatment plans, professionals often use noncategorical diagnostic descriptions of client functioning along treatment-relevant dimensions (e.g., functional capacity, degree of anxiety, amount of suspiciousness, openness to interpretations, amount of insight into behaviors, and level of intellectual functioning).

Diagnostic criteria may vary from one nomenclature system to another. Noting which nomenclature system is being used is an important initial step because different diagnostic systems may use the same diagnostic term to describe different symptoms. Even within one diagnostic system, the symptoms described by the same term may differ between editions of the manual. Similarly, a test that uses a diagnostic term in its title may differ significantly from another test using a similar title or from a subscale using the same term. For example, some diagnostic systems may define depression by behavioral symptomatology (e.g., psychomotor retardation, disturbance in appetite or sleep), by affective symptomatology (e.g., dysphoric feeling, emotional flatness), or by cognitive symptomatology (e.g., thoughts of hopelessness, morbidity). Further, rarely are the symptoms of diagnostic categories mutually exclusive. Hence, it can be expected that a given symptom may be shared by several diagnostic categories. More knowledgeable and precisely drawn inferences relating to a diagnosis may be obtained from test scores if appropriate weight is given to the symptoms included in the diagnostic category and to the suitability of each test for assessing the symptoms. Therefore, the first step in evaluating a test's suitability for yielding scores or information indicative of a particular diagnostic syndrome is to compare the construct that the test is intended to measure with the symptomatology described in the diagnostic criteria.

Different methods may be used to assess particular diagnostic categories. Some methods rely primarily on structured interviews using a "yes"/"no" or "true"/"false" format, in which the professional is interested in the presence or absence of diagnosis-specific symptomatology. Other methods often rely principally on tests of personality or cognitive functioning and use configurations of obtained scores. These configurations of scores indicate the degree to which a test taker's responses are similar to those of individuals who have been determined by prior research to belong to a specific diagnostic group.

Diagnoses made with the help of test scores typically are based on empirically demonstrated relationships between the test score and the diagnostic category. Validity studies that demonstrate relationships between test scores and diagnostic categories currently are available for some, but not all, diagnostic categories. Many more studies demonstrate evidence of validity for the relations between test scores and various subsets of symptoms that contribute to a diagnostic category. Although it often is not feasible for individual professionals to personally conduct research into relationships between obtained scores and diagnostic categories, familiarity with the research literature that examines these relationships is important.

The professional often can enhance the diagnostic interpretations derived from test scores by integrating the test results with inferences made from other sources of information regarding the test taker's functioning, such as self-reported history, information provided by significant others, or systematic observations in the natural environment or in the testing setting. In arriving at a diagnosis, a professional also looks for information that does not corroborate the diagnosis, and in those instances, places appropriate limits on the degree of confidence placed in the diagnosis. When relevant to a referral decision, the professional should acknowledge alternative diagnoses that may require consideration. Particular attention should be paid to all relevant available data before concluding that a test taker falls into a diagnostic category. Cultural competency is paramount in the effort to avoid misdiagnosing or overpathologizing culturally appropriate behavior, affect, or cognition. Tests also are used to assess the appropriateness of continuing the initial diagnosis, especially after a course of treatment or if the client's psychological functioning has changed over time.

Testing for Neuropsychological Evaluations

Neuropsychological testing analyzes the test taker's current psychological and behavioral status, including manifestations of neurological, neuropathological, and neurochemical changes that may arise during development or from psychopathology, bodily and/or brain injury, or illness. The purposes of neuropsychological testing typically include, but are not limited to, the following: differential diagnosis associated with the sources of cognitive, perceptual, and personality dysfunction; differential diagnosis between two or more suspected etiologies of cerebral dysfunction; evaluation of impaired functioning secondary to a cortical or subcortical event; establishment of neuropsychological baseline measurements for monitoring progressive cerebral disease or recovery effects; comparison of test results before and after pharmacologic, surgical, behavioral, or psychological interventions; identification of patterns of higher cortical functions and dysfunctions for the formulation of rehabilitation strategies and for the design of remedial procedures; and characterization of brain behavior functions to assist in criminal and civil legal actions.

Testing for Intervention Planning and Outcome Evaluation

Professionals often rely on test results for assistance in planning, executing, and evaluating interventions. Therefore, their awareness of validity information that supports or does not support the relationships among test results, prescribed interventions, and desired outcomes is important. Interventions may be used to prevent the onset of one or more symptoms, to remediate deficits, and to provide for a person's basic physical, psychological, and social needs to enhance quality of life. Intervention planning typically occurs following an evaluation of the nature, evolution, and severity of a disorder and a review of personal and contextual conditions that may affect its resolution. Subsequent evaluations that require the repeated administration of the same test may occur in an effort to further diagnose the nature and severity of the disorder, to review the effects of interventions, to revise the interventions as needed, and to meet ethical and legal standards.

Testing for Judicial and Governmental Decisions

Clients may voluntarily seek psychological assessment to assist in matters before a court of law or other government agency. Conversely, courts or other government agencies sometimes require a person to submit involuntarily to a psychological assessment that may involve a wide range of psychological tests. The goal of these psychological assessments is to provide important information to a third party (e.g., test taker's attorney, opposing attorney, judge, or administrative board) about the psychological functioning of the test taker that has bearing on the legal issues in question. Informed consent generally should be obtained; informed consent for children or mentally incompetent individuals (e.g., individuals with dementia) should be obtained from legal guardians. At the outset of the evaluation for judicial and government decisions, the professional should explain the intended purposes of the evaluation and identify who is expected to have access to the test results and the report. Often, the professional and the test taker are not fully aware of legal issues or parameters that impinge on the evaluation, and if the test taker declines to proceed after being notified of the nature and purpose of the examination, the professional, as appropriate, may attempt to administer the assessment, postpone the assessment, advise the test taker to contact her or his attorney, or notify the individual or agency requesting the assessment about the test taker's unwillingness to proceed.

Assessments for legal reasons may occur as part of a civil proceeding (e.g., involuntary commitment, testamentary capacity, competence to stand trial, ruling of child custody, personal injury, lawsuit), a criminal proceeding (e.g., competence to stand trial, ruling of not guilty by reason of insanity, mitigating circumstances in sentencing), determination of reasonable accommodations for employees with disabilities, or an administrative proceeding or decision (e.g., license revocation, parole, worker's compensation). The professional is responsible for explaining test scores and the interpretations made from them in terms of the legal criteria by which the jury, judge, or administrative board will decide the legal issue. In instances involving legal issues, it is important to assess the examinee's test-taking orientation, including response bias, to ensure that the legal proceedings have not affected the responses given. For example, persons seeking to obtain the greatest possible monetary award for a personal injury may be motivated to exaggerate cognitive and emotional symptoms, whereas persons attempting to forestall the loss of a professional license may attempt to portray themselves in the best possible light by minimizing symptoms or deficits. In forming an assessment opinion, it is necessary to interpret the test scores with informed knowledge relating to the available validity and reliability evidence. When forming such opinions, it also is necessary to integrate a test taker's test scores with all other sources of information that bear on the test taker's current status, including psychological, health, educational, occupational, legal, sociocultural, and other relevant collateral records.

Some tests are intended to provide information about a client's functioning that helps clarify a given legal issue (e.g., parental functioning in a child custody case or a defendant's ability to understand charges in hearings on competency to stand trial). The manuals of some tests also provide demographic and actuarial data for normative groups that are representative of persons involved in the legal system. However, many tests measure constructs that are generally relevant to the legal issues even though norms specific to the judicial or governmental context may not be available. Professionals are expected to make every effort to be aware of evidence of validity and reliability/precision that supports or does not support their interpretations and to place appropriate limits on the opinions rendered. Test users who practice in judicial and governmental settings are expected to be aware of conflicts of interest that may lead to bias in the interpretation of test results.

Protecting the confidentiality of a test taker's test results and of the test instrument itself poses particular challenges for professionals involved with attorneys, judges, jurors, and other legal decision makers. The test taker has the right to expect that test results will be communicated only to persons who are legally authorized to receive them and that other information from the testing session that is not relevant to the evaluation will not be reported. The professional should be apprised of possible threats to confidentiality and test security (e.g., releasing the test questions, the examinee's responses, or raw or standardized scores on tests to another qualified professional) and should seek, if necessary, appropriate legal and professional remedies.

Testing for Personal Awareness, Social Identity, and Psychological Health, Growth, and Action

Tests and inventories frequently are used to provide information to help individuals understand themselves, identify their own strengths and weaknesses, and clarify issues important to their own development. For example, test results from personality inventories may help test takers better understand themselves and their interactions with others. Measures of ethnic identity and acculturation, two components of social identity, that assess the cognitive, affective, and behavioral facets of the ways in which people identify with their cultural backgrounds, also may be informative.

Psychological tests are used sometimes to assess an individual's ability to understand and adapt to health conditions. In these instances, observations and checklists, as well as tests, are used to measure the understanding that an individual with a health condition (e.g., diabetes) has about the disease process and about behavioral and cognitive techniques applicable to the amelioration or control of the symptoms of the disease state.

Results from interest inventories and tests of ability may be useful to individuals who are making educational and career decisions. Appropriate cognitive and neuropsychological tests that have been normed and standardized for children may facilitate the monitoring of development and growth during the formative years, when relevant interventions may be more efficacious for recognizing and preventing potentially disabling learning difficulties. Test scores for young adults or children on these types of measures may change in later years; therefore, test users should be cautious about overreliance on results that may be outdated.

Test results may be used in several ways for self-exploration, growth, and decision making. First, the results can provide individuals with new information that allows them to compare themselves with others or to evaluate themselves by focusing on self-descriptions and self-characterizations. Test results may also serve to stimulate discussions between test taker and professional, to facilitate test-taker insights, to provide directions for future treatment considerations, to help individuals identify strengths and weaknesses, and to provide the professional with a general framework for organizing and integrating information about an individual. Testing for personal growth may take place in training and development programs, within an educational curriculum, during psychotherapy, in rehabilitation programs as part of an educational or career-planning process, or in other situations.

Summary

The responsible use of tests in psychological practice requires a commitment by the professional to develop and maintain the necessary knowledge and competence to select, administer, and interpret tests and inventories as crucial elements of the psychological testing and assessment process (see chap. 9). The standards in this chapter provide a framework for guiding the professional toward achieving relevance and effectiveness in the use of psychological tests within the boundaries or limits defined by the professional's educational, experiential, and ethical foundations. Earlier chapters and standards that are relevant to psychological testing and assessment describe general aspects of test quality (chaps. 1 and 2), fairness (chap. 3), test design and development (chap. 4), and test administration (chap. 6). Chapter 11 discusses test uses for the workplace, including credentialing, and the importance of collecting data that provide evidence of a test's accuracy for predicting job performance; chapter 12 discusses educational applications; and chapter 13 discusses test use in program evaluation and public policy.


The standards in this chapter have been separated into five thematic clusters labeled as follows:

1. Test User Qualifications
2. Test Selection
3. Test Administration
4. Test Interpretation
5. Test Security

Cluster 1. Test User Qualifications

Standard 10.1

Those who use psychological tests should confine their testing and related assessment activities to their areas of competence, as demonstrated through education, training, experience, and appropriate credentials.

Comment: Responsible use and interpretation of test scores require appropriate levels of experience, sound professional judgment, and understanding of the empirical and theoretical foundations of tests. For many assessments, competency also requires sufficient familiarity with the population of which the test taker is a member to facilitate test selection, test administration, and test score interpretation. For example, when personality tests and neuropsychological tests are administered as part of a psychological assessment of an individual, the test scores must be understood in the context of the individual's physical and psychological state; cultural and linguistic development; and educational, gender, health, and occupational background. Scoring also must take into account other evidence relevant to the tests used. Test score interpretation requires professionally responsible judgment that is exercised within the boundaries of knowledge and skill afforded by the professional's education, training, and supervised experience, as well as the context in which the assessment is being performed.

Standard 10.2

Those who select tests and draw inferences from test scores should be familiar with the relevant evidence of validity and reliability/precision for the intended uses of the test scores and assessments, and should be prepared to articulate a logical analysis that supports all facets of the assessment and the inferences made from the assessment.

Comment: A presentation and analysis of validity and reliability/precision evidence generally is not needed in a report that is provided for the test taker or a third party, because it is too cumbersome and of little interest to most report readers. However, in situations in which the selection of tests may be problematic (e.g., oral subtests with deaf test takers), a brief description of the rationale for using or not using particular measures is advisable.

When potential inferences derived from psychological test scores are not supported by current data yet may hold promise for future validation, they may be described by the test developer and test user as hypotheses for further validation in test score interpretation. Those receiving interpretations of such results should be cautioned that such inferences do not yet have adequately demonstrated evidence of validity and should not be the basis for a diagnostic decision or prognostic formulation.

Standard 10.3

Professionals should verify that persons under their supervision have appropriate knowledge and skills to administer and score tests.

Comment: Individuals administering tests but not involved in their selection or interpretation should be supervised by a professional. They should have knowledge of, as well as experience with, the test takers' presenting problems (e.g., brain injury) and the test settings (e.g., clinical, forensic).


Cluster 2. Test Selection

Standard 10.4

Tests that are combined to form a battery of tests should be appropriate for the purposes of the assessment.

Comment: For example, in a neuropsychological assessment for evidence of an injury to an area of the brain, it is necessary to select a combination of tests with known diagnostic sensitivity and specificity to impairments arising from trauma to specific regions of the brain.

Standard 10.5

Tests selected for use in psychological testing should be suitable for the characteristics and background of the test taker.

Comment: When tests are part of a psychological assessment, the professional generally should take into account characteristics of the individual test taker, including age and developmental level, race/ethnicity, gender, and linguistic and/or physical characteristics that may affect the ability of the test taker to meet the requirements of the test. The professional should also take into account the availability of norms and evidence of validity for a population representative of the test taker. If no normative or validity studies are available for a relevant population, test interpretations should be qualified and presented as hypotheses rather than conclusions.

Standard 10.6

When differential diagnosis is needed, the professional should choose, if possible, a test or tests for which there is credible evidence that the scores of the test(s) distinguish between the two or more diagnostic groups of concern rather than merely distinguishing abnormal cases from the general population.

Comment: Professionals will find it particularly helpful if evidence of validity is in a form that enables them to determine how much confidence can be placed in interpretations for an individual. Differences between group means and their statistical significance provide inadequate information regarding validity for individual diagnostic purposes. Additional information that might be considered includes effect sizes or a table showing the degree of overlap of predictor distributions among different criterion groups.
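
As an illustration of the kind of supplementary evidence this comment describes, the sketch below computes a standardized mean difference (Cohen's d) and the approximate overlap of two equal-variance normal score distributions for two criterion groups; the group statistics are hypothetical, and the Standards do not prescribe any particular index or software.

```python
# Illustrative only: summarizing how well scores separate two diagnostic
# groups with a standardized mean difference and the approximate overlap of
# two equal-variance normal distributions. Group statistics are hypothetical.
import math

def cohens_d(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = ((n_1 - 1) * sd_1 ** 2 + (n_2 - 1) * sd_2 ** 2) / (n_1 + n_2 - 2)
    return (mean_1 - mean_2) / math.sqrt(pooled_var)

def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def overlap_coefficient(d):
    """Overlap of two equal-variance normal distributions whose means
    differ by d standard deviations: 2 * Phi(-|d| / 2)."""
    return 2.0 * normal_cdf(-abs(d) / 2.0)

# Hypothetical criterion groups: clinical cases versus nonclinical comparisons.
d = cohens_d(mean_1=70.0, sd_1=10.0, n_1=60, mean_2=55.0, sd_2=10.0, n_2=80)
print(f"Cohen's d = {d:.2f}")                                  # about 1.50
print(f"Distribution overlap = {overlap_coefficient(d):.0%}")  # about 45%
```

Reporting the overlap alongside the mean difference makes clear that even a large group separation leaves substantial score overlap, which bears directly on how much confidence an individual diagnostic interpretation can carry.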

Cluster 3. Test Administration

Standard 10.7

Prior to testing, professionals and test administrators should provide the test taker, or appropriate others as applicable, with introductory information in a manner understandable to the test taker.

Comment: The goal of optimal test administration is to reduce error in the measurement of the construct. For example, the test taker should understand parameters surrounding the test, such as testing time limits, feedback or lack thereof, and opportunities to take breaks. In addition, the test taker should have an understanding of the limits of confidentiality, who will have access to the test results, whether and when test results or decisions based on the scores will be shared with the test taker, whether the test taker will have an opportunity to retest, and under what circumstances retesting could occur.

Standard 10.8

Professionals and test administrators should follow administration instructions, including calibration of technical equipment and verification of scoring accuracy and replicability, and should provide settings for testing that facilitate the performance of test takers.

Comment: Because the normative data against which a test taker's performance will be evaluated were collected under the reported standard procedures, the professional needs to be aware of and take into account the effect that any nonstandard procedures may have on the test taker's obtained score and the interpretation of that score. When using tests that employ an unstructured response format, such as some projective tests, the professional should follow the administration instructions provided and apply objective scoring criteria when available and appropriate.

In some cases, testing may be conducted in a realistic setting to determine how a test taker responds in these settings. For example, an assessment for an attention disorder may be conducted in a noisy or distracting environment rather than in an environment that typically protects the test taker from such external threats to performance efficiency.

Standard 10.9

Professionals should take into account the purpose of the assessment, the construct being measured, and the capabilities of the test taker when deciding whether technology-based administration of tests should be used.

Comment: Quality control should be integral to the administration of computerized or technology-based tests. Some technology-based tests may require that test takers have an opportunity to receive instruction and to practice prior to the test administration, unless assessing ability to use the equipment is the purpose of the test. The professional is responsible for determining whether the technology-based administration of the test should be proctored, or whether technical support staff are necessary to assist with the use of the test equipment and software. The interpreter of the test scores should be informed if the test was unproctored or if no support staff were available.

Cluster 4. Test Interpretation

Standard 10.10

Those who select tests and interpret test results should not allow individuals or groups with vested interests in the outcomes of an assessment to have an inappropriate influence on the interpretation of the assessment results.

Comment: Individuals or groups with a vested interest in the significance or meaning of the findings from psychological testing may include but are not limited to employers, health professionals, legal representatives, school personnel, third-party payers, and family members. In some instances, legal requirements may limit a professional's ability to prevent inappropriate interpretations of assessments from affecting decisions, but professionals have an obligation to document any disagreement in such circumstances.

Standard 10.11

Professionals should share test scores and interpretations with the test taker when appropriate or required by law. Such information should be expressed in language that the test taker or, when appropriate, the test taker's legal representative, can understand.

Comment: Test scores and interpretations should be expressed in terms that can be understood readily by the test taker or others entitled to the results. In most instances, a report should be generated and made available to the referral source. That report should adhere to standards required by the profession and/or the referral source, and the information should be documented in a manner that is understandable to the referral source. In some clinical situations, providing feedback to the test taker may actually cause harm. Care should be taken to minimize unintended consequences of test feedback. Any disclosure of test results to an individual or any decision not to release such results should be consistent with applicable legal standards, such as privacy laws.

Standard 10.12

In psychological assessment, the interpretation of test scores or patterns of test battery results should consider other factors that may influence a particular testing outcome. Where appropriate, a description of such factors and an analysis of the alternative hypotheses or explanations regarding what may have contributed to the pattern of results should be included in the report.

Comment: Many factors (e.g., culture, gender, race/ethnicity, educational level, effort, employment status, left- or right-handedness, current mental state, health status, linguistic preference, and testing situation) may influence individual test results and the overall outcome of the psychological assessment. When preparing test score interpretations and reports drawn from an assessment, professionals should consider the extent to which these factors may introduce construct-irrelevant variance into the test results. The interpretation of test results in the assessment process also should be informed, when possible or appropriate, by an analysis of stylistic and other qualitative features of test-taking behavior that may be obtained from observations, interviews, and historical information. Inclusion of qualitative information may assist in understanding the outcome of tests and evaluations. In addition, tests of faking or effort often are used to determine the possibility of deception or malingering.

Standard 10.13

When the validity of a diagnosis is appraised by evaluating the level of agreement between interpretations of the test scores and the diagnosis, the diagnostic terms or categories employed should be carefully defined or identified.

Comment: Two diagnostic systems typically used are psychiatric (i.e., based on the Diagnostic and Statistical Manual of Mental Disorders) and health related (i.e., based on the International Classification of Diseases). As applicable, the system used to diagnose the test taker should be noted. Some syndromes (e.g., Mild Cognitive Impairment, Social Learning Disability) do not appear in either system; for these, a description of the deficits should be used, with the closest diagnosis possible.

Standard 10.14

Criterion-related evidence of validity should be available when recommendations or decisions are presented by the professional as having an actuarial basis.

Comment: Test score interpretations should not imply that empirical evidence exists for a relationship among particular test results, prescribed interventions, and desired outcomes, unless such evidence is available for populations similar to those representative of the examinee.

Standard 10.15

The interpretation of test or test battery results for diagnostic purposes should be based on multiple sources of test and collateral information and on an understanding of the normative, empirical, and theoretical foundations, as well as the limitations, of such tests and data.

Comment: A given pattern of test performances represents a cross-sectional view of the individual being assessed within a particular context. The interpretation of findings derived from a complex battery of tests in such contexts requires appropriate education about, supervised experience with, and knowledge of procedural, theoretical, and empirical limitations of the tests and the evaluation procedure.

Standard 10.16

If a publisher suggests that tests are to be used in combination with one another, the professional should review the recommended procedures and evidence for combining tests and determine whether the rationale provided by the publisher is appropriate for the specific combination of tests and their intended uses.

Comment: For example, if measures of intelligence are packaged with measures of memory, or if measures of interests and personality styles are packaged together, then supporting reliability/precision and validity data for such combinations of the test scores and interpretations should be available.

Standard 10.17

Those who use computer-generated interpretations of test data should verify that the quality of the evidence of validity is sufficient for the interpretations.

Comment: Efforts to reduce a complex set of data into computer-generated interpretations of a given construct may yield misleading or oversimplified analyses of the meanings of test scores, which in turn may lead to faulty diagnostic and prognostic decisions. Norms on which the interpretations are based should be reviewed for their relevance and appropriateness.

Cluster 5. Test Security

Standard 10.18

Professionals and others who have access to test materials and test results should maintain the confidentiality of the test results and testing materials consistent with scientific, professional, legal, and ethical requirements. Tests (including obsolete versions) should not be made available to the public or resold to unqualified test users.

Comment: Professionals should be knowledgeable about and should conform to record-keeping and confidentiality guidelines required by applicable federal law and within the jurisdictions where they practice, as well as guidelines of the professional organizations to which they belong. The test publisher, the test user, the test taker, and third parties (e.g., school, court, employer) may have different levels of understanding or recognition of the need for confidentiality of test materials. To the extent possible, the professional who uses tests is responsible for managing the confidentiality of test information across all parties. It is important for the professional to be aware of possible threats to confidentiality and the legal and professional remedies available. Professionals also are responsible for maintaining the security of testing materials and respecting the copyrights of all tests. Distribution, display, or resale of test materials (including obsolete editions) to unauthorized recipients infringes the copyright of the materials and compromises test security. When it is necessary to reveal test content in the process of explaining results or in a court proceeding, this should happen in a controlled environment. When possible, copies of the content should not be distributed, or should be distributed in a manner that protects test security to the extent possible.

11. WORKPLACE TESTING AND CREDENTIALING

BACKGROUND

Organizations use employment testing for many purposes, including employee selection, placement, and promotion. Selection generally refers to decisions about which individuals will enter the organization; placement refers to decisions about how to assign individuals to positions within the organization; and promotion refers to decisions about which individuals within the organization will advance. What all three have in common is a focus on the prediction of future job behaviors, with the goal of influencing organizational outcomes such as efficiency, growth, productivity, and employee motivation and satisfaction.

Testing used in the processes of licensure and certification, which will here be generically called credentialing, focuses on an applicant's current skill or competence in a specified domain. In many occupations, individual practitioners must be licensed by governmental agencies. In other occupations, it is professional societies, employers, or other organizations that assume responsibility for credentialing. Although licensure typically involves provision of a credential for entry into an occupation, credentialing programs may exist at various levels, from novice to expert in a given field. Certification is usually sought voluntarily, although occupations differ in the degree to which obtaining certification influences employability or advancement. The credentialing process may include testing and other requirements, such as education or supervised experiences. The Standards applies to the use of tests as a component of the broader credentialing process.

Testing is also conducted in workplaces for a variety of purposes other than staffing decisions and credentialing. Testing as a tool for personal growth can be part of training and development programs, in which instruments measuring personality characteristics, interests, values, preferences, and work styles are commonly used with the goal of providing self-insight to employees. Testing can also take place in the context of program evaluation, as in the case of an experimental study of the effectiveness of a training program, where tests may be administered as pre- and post-measures. Some assessments conducted in employment settings, such as unstructured job interviews for which no claim of predictive validity is made, are nonstandardized in nature, and it is generally not feasible to apply standards to such assessments. The focus of this chapter, however, is on the use of testing specifically in staffing decisions and credentialing. Many additional issues relevant to uses of testing in organizational settings are discussed in other chapters: technical matters in chapters 1, 2, 4, and 5; documentation in chapter 7; and individualized psychological and personality assessment of job candidates in chapter 10.

As described in chapter 3, the ideal of fairness in testing is achieved if a given test score has the same meaning for all individuals and is not substantially influenced by construct-irrelevant barriers to individuals' performance. For example, a visually impaired person may have difficulty reading questions on a personality inventory or other vocational assessment provided in small print. Young people just entering the workforce may be less sophisticated in test-taking strategies than more experienced job applicants, and their scores may suffer. A person unfamiliar with computer technology may have difficulty with the user interface for a computer simulation assessment. In each of these cases, performance is hindered by a source of variance that is unrelated to the construct of interest. Sound testing practice involves careful monitoring of all aspects of the assessment process and appropriate action when needed to prevent undue disadvantages or advantages for some candidates caused by factors unrelated to the construct being assessed.

Employment Testing

The Influence of Context on Test Use

Employment testing involves using test information to aid in personnel decision making. Both the content and the context of employment testing vary widely. Content may cover various domains of knowledge, skills, abilities, traits, dispositions, values, and other individual characteristics. Some contextual features represent choices made by the employing organization; others represent constraints that must be accommodated by the employing organization. Decisions about the design, evaluation, and implementation of a testing system are specific to the context in which the system is to be used. Important contextual features include the following:

Internal versus external candidate pool. In some instances, such as promotional settings, the candidates to be tested are already employed by the organization. In others, applications are sought from individuals outside the organization. In yet other cases, a mix of internal and external candidates is sought.

Trained versus untrained candidates. In some instances, individuals with little training in a specialized knowledge or skill are sought, either because the job does not require the specialized knowledge or skill or because the organization plans to offer training after the point of hire. In other instances, trained or experienced workers are sought with the expectation that they can immediately perform a specialized job. Thus, a particular job may require very different selection systems, depending on whether trained or untrained individuals will be hired or promoted.

Short-term versus long-term focus. In some instances, the goal of the selection system is to predict performance immediately upon or shortly after hire. In other instances, the concern is with longer-term performance, as in the case of predictions as to whether candidates will successfully complete a multiyear overseas job assignment. Concerns about changing job tasks and job requirements also can lead to a focus on knowledge, skills, abilities, and other characteristics projected to be necessary for performance on the target job in the future, even if they are not part of the job as currently constituted.

Screening in versus screening out. In some instances, the goal of the selection system is to screen in individuals who are likely to be very high performers on one set of behavioral or outcome criteria of interest to the organization. In others, the goal is to screen out individuals who are likely to be very poor performers. For example, an organization may wish to screen out a small proportion of individuals for whom the risk of pathological, deviant, counterproductive, or criminal behavior on the job is deemed too high. The same organization may want to screen in applicants who have a high probability of superior performance.

Mechanical versus judgmental decision making. In some instances, test information is used in a mechanical, automated fashion. This is the case when scores on a test battery are combined by formula and candidates are selected in strict top-down rank order, or when only candidates above specific cut scores are eligible to continue to subsequent stages of a selection system (a brief illustrative sketch follows this list of contextual features). In other instances, information from a test is judgmentally integrated with information from other tests and with nontest information to form an overall assessment of the candidate.

Ongoing versus one-time use of a test. In some instances, a test may be used over an extended period in an organization, permitting the accumulation of data and experience using the test in that context. In other instances, concerns about test security are such that repeated use is infeasible, and a new test is required for each test administration. For example, a work-sample test for lifeguards, requiring retrieval of a mannequin from the bottom of a pool, is not compromised if candidates possess detailed knowledge of the test in advance. In contrast, a written job-knowledge test for police officers may be severely compromised if some candidates have access to the test in advance. The key question is whether advance knowledge of test content affects candidates' performance unfairly and consequently changes the constructs measured by the test and the validity of inferences based on the scores.

Fixed applicant pool versus continuous flow. In some instances, an applicant pool can be assembled prior to beginning the selection process, as when an organization's policy is to consider all candidates who apply before a specific date. In other cases, there is a continuous flow of applicants about whom employment decisions need to be made on an ongoing basis. Ranking of candidates is possible in the case of the fixed pool; in the case of a continuous flow, a decision may need to be made about each candidate independent of information about other candidates.

Small versus large sample size. Sample size affects the degree to which different lines of evidence can be used to examine validity and fairness of interpretations of test scores for proposed uses of tests. For example, relying on the local setting to establish empirical linkages between test and criterion scores is not technically feasible with small sample sizes. In employment testing, sample sizes are often small; at the extreme is a job with only a single incumbent. Large sample sizes are sometimes available when there are many incumbents for the job, when multiple jobs share similar requirements and can be pooled, or when organizations with similar jobs collaborate in developing a selection system.

A new job. A special case of the problem of small sample size exists when a new job is created and there are no job incumbents. As new jobs emerge, employers need selection procedures to staff the new positions. Professional judgment may be used to identify appropriate employment tests and provide a rationale for the selection program even though the array of methods for documenting validity may be restricted. Although validity evidence based on criterion-oriented studies can rarely be assembled prior to the creation of a new job, the methods for generalizing validity evidence in situations with small sample sizes can be used (see the discussion later in this chapter concerning settings with small samples), as well as content-oriented studies using the subject matter experts responsible for designing the job.

Size of applicant pool relative to the number of job openings. The size of an applicant pool can constrain the type of testing system that is feasible. For desirable jobs, very large numbers of candidates may compete, and short screening tests may be used to reduce the pool to a size for which the administration of more time-consuming and expensive tests is practical. Large applicant pools may also pose test security concerns, limiting the organization to testing methods that permit simultaneous test administration to all candidates.
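The mechanical use of test information described above can be made concrete with a small sketch. The sketch below is illustrative only, with hypothetical candidates, weights, and cut scores; it shows strict top-down selection on a formula-based composite and a multiple-hurdle screen using per-test cut scores.

```python
# Illustrative sketch of mechanical test use: hypothetical candidates, weights,
# and cut scores chosen only for demonstration.

candidates = {
    "A": {"cognitive": 82, "integrity": 74},
    "B": {"cognitive": 91, "integrity": 58},
    "C": {"cognitive": 77, "integrity": 88},
}
weights = {"cognitive": 0.6, "integrity": 0.4}    # hypothetical composite weights
cut_scores = {"cognitive": 80, "integrity": 60}   # hypothetical minimum scores

def composite(scores):
    """Combine test scores by formula (weighted sum)."""
    return sum(weights[test] * score for test, score in scores.items())

# Strict top-down selection: rank candidates on the composite and take the top k.
top_down = sorted(candidates, key=lambda c: composite(candidates[c]), reverse=True)[:2]

# Cut-score screening: only candidates at or above every cut score remain eligible.
eligible = [c for c, scores in candidates.items()
            if all(scores[test] >= cut_scores[test] for test in cut_scores)]

print("Top-down order:", top_down)   # ['C', 'A'] with these hypothetical values
print("Pass all cuts:", eligible)    # ['A'] with these hypothetical cut scores
```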

Thus, test use by employers is conditioned by contextual features. Knowledge of these features plays an important part in the professional judgment that will influence both the types of testing system developed and the strategies used to evaluate critically the validity of interpretations of test scores for proposed uses of the tests.

The Validation Process in Employment Testing

The validation process often begins with a job analysis in which information about job duties and tasks, responsibilities, worker characteristics, and other relevant information is collected. This information provides an empirical basis for articulating what is meant by job performance in the job under consideration, for developing measures of job performance, and for hypothesizing characteristics of individuals that may be predictive of performance.

The fundamental inference to be drawn from test scores in most applications of testing in employment settings is one of prediction: The test user wishes to make an inference from test results to some future job behavior or job outcome. Even when the validation strategy used does not involve empirical predictor-criterion linkages, as in the case of validity evidence based on test content, there is an implied criterion. Thus, although different strategies for gathering evidence may be used, the inference to be supported is that scores on the test can be used to predict subsequent job behavior. The validation process in employment settings involves the gathering and evaluation of evidence relevant to sustaining or challenging this inference. As detailed below and in chapter 1 (in the section “Evidence Based on Relations to Other Variables”), a variety of validation strategies can be used to support the inference.

It follows that establishing this predictive inference requires attention to two domains: that of the test (the predictor) and that of the job behavior or outcome of interest (the criterion). Evaluating the use of a test for an employment decision can be viewed as testing the hypothesis of a linkage between these domains. Operationally, there are many ways of linking these domains, as illustrated by the diagram below.

Figure: Alternative links between predictor and criterion measures. The diagram shows a predictor measure, a predictor construct domain, a criterion measure, and a criterion construct domain, connected by linkages numbered 1 through 5.

The diagram differentiates between a predictor construct domain and a predictor measure, and between a criterion construct domain and a criterion measure. A predictor construct domain is defined by specifying the set of behaviors, knowledge, skills, abilities, traits, dispositions, and values that will be included under particular construct labels (e.g., verbal reasoning, typing speed, conscientiousness). Similarly, a criterion construct domain specifies the set of job behaviors or job outcomes that will be included under particular construct labels (e.g., performance of core job tasks, teamwork, attendance, sales volume, overall job performance). Predictor and criterion measures are intended to assess an individual's standing on the characteristics assessed in those domains.

The diagram enumerates inferences about a number of linkages that are commonly of interest. The first linkage (labeled 1 in the diagram) is between scores on a predictor measure and scores on a criterion measure. This inference is tested through empirical examination of relationships between the two measures. The second and fourth linkages (labeled 2 and 4) are conceptually similar: Both examine the relationship of an operational measure to the construct domain of interest. Logical analysis, expert judgment, and convergence with or divergence from conceptually similar or different measures are among the forms of evidence that can be examined in testing these linkages. Linkage 3 involves the relationship between the predictor construct domain and the criterion construct domain. This inferred linkage is established on the basis of theoretical and logical analysis. It commonly draws on systematic evaluation of job content and expert judgment as to the individual characteristics linked to successful job performance. Linkage 5 examines a direct relationship of the predictor measure to the criterion construct domain.

Some predictor measures are designed explicitly as samples of the criterion construct domain of interest; thus, isomorphism between the measure and the construct domain constitutes direct evidence for linkage 5. Establishing linkage 5 in this fashion is the hallmark of approaches that rely heavily on what the Standards refers to as validity evidence based on test content. Tests in which candidates for lifeguard positions perform rescue operations, or in which candidates for word processor positions type and edit text, provide examples of test content that forms the basis for validity.

A prerequisite to the use of a predictor measure for personnel selection is that the inferences concerning the linkage between the predictor measure and the criterion construct domain be established. As the diagram illustrates, there are multiple strategies for establishing this crucial linkage. One strategy is direct, via linkage 5; a second involves pairing linkage 1 and linkage 4; and a third involves pairing linkage 2 and linkage 3.

When the test is designed as a sample of the criterion construct domain, the validity evidence can be established directly via linkage 5. Another strategy for linking a predictor measure and the criterion construct domain focuses on linkages 1 and 4: pairing an empirical link between the predictor and criterion measures with evidence of the adequacy with which the criterion measure represents the criterion construct domain. The empirical link between the predictor measure and the criterion measure is part of what the Standards refers to as validity evidence based on relationships to other variables. The empirical link of the test and the criterion measure must be supplemented by evidence of the relevance of the criterion measure to the criterion construct domain to complete the linkage between the test and the criterion construct domain. Evidence of the relevance of the criterion measure to the criterion construct domain is commonly based on job analysis, although in some cases the link between the domain and the measure is so direct that relevance is apparent without job analysis (e.g., when the criterion construct of interest is absenteeism or turnover). Note that this strategy does not necessarily rely on a well-developed predictor construct domain. Predictor measures such as empirically keyed biodata measures are constructed on the basis of empirical links between test item responses and the criterion measure of interest. Such measures may, in some instances, be developed without a fully established conception of the predictor construct domain; the basis for their use is the direct empirical link between test responses and a relevant criterion measure. Unless sample sizes are very large, capitalization on chance may be a problem, in which case appropriate steps should be taken (e.g., cross-validation).
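A minimal sketch of such a cross-validation check follows, using simulated data and arbitrary keying rules rather than any real biodata instrument. Because the simulated criterion contains no true signal, the validity estimated in the keying sample is inflated by chance, while the holdout estimate falls near zero; that drop is the shrinkage cross-validation is intended to reveal.

```python
# Minimal sketch of cross-validation for an empirically keyed score.
# All data are simulated and all keying rules are arbitrary, for illustration only.

import numpy as np

rng = np.random.default_rng(0)
n, n_items = 400, 60
items = rng.normal(size=(n, n_items))   # simulated item responses
criterion = rng.normal(size=n)          # simulated criterion with no true signal

half = n // 2
key_idx, holdout_idx = np.arange(half), np.arange(half, n)

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

# "Key" the items on the first half: retain items whose correlation with the
# criterion looks promising in that sample alone.
item_r = np.array([corr(items[key_idx, j], criterion[key_idx]) for j in range(n_items)])
keyed = np.where(np.abs(item_r) > 0.08)[0]
signs = np.sign(item_r[keyed])

score_key = items[key_idx][:, keyed] @ signs
score_holdout = items[holdout_idx][:, keyed] @ signs

print("Validity in keying sample :", round(corr(score_key, criterion[key_idx]), 2))
print("Validity in holdout sample:", round(corr(score_holdout, criterion[holdout_idx]), 2))
```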

Yet another strategy for linking predictor scores and the criterion construct domain focuses on pairing evidence of the adequacy with which the predictor measure represents the predictor construct domain (linkage 2) with evidence of the linkage between the predictor construct domain and the criterion construct domain (linkage 3). As noted above, there is no single direct route to establishing these linkages. They involve lines of evidence subsumed under “construct validity” in prior conceptualizations of the validation process. A combination of lines of evidence (e.g., expert judgment of the characteristics predictive of job success, inferences drawn from an analysis of critical incidents of effective and ineffective job performance, and interview and observation methods) may support inferences about the predictor constructs linked to the criterion construct domain. Measures of these predictor constructs may then be selected or developed, and the linkage between the predictor measure and the predictor construct domain can be established with various lines of evidence for linkage 2, discussed above.

The various strategies for linking predictor scores to the criterion construct domain may differ in their potential applicability to any given employment testing context. While the availability of certain lines of evidence may be constrained, such constraints do not reduce the importance of establishing a validity argument for the predictive inference.

For example, methods for establishing linkages are more limited in settings with only small samples available. In such situations, gathering local evidence of predictor-criterion relationships is not feasible, and approaches to generalizing evidence from other settings may be more useful. A variety of methods exist for generalizing evidence of the validity of the interpretation of the predictive inference from other settings. Validity evidence may be directly transported from another setting in a case where sound evidence (e.g., careful job analysis) indicates that the local job is highly comparable to the job for which the validity data are being imported. These methods may rely on evidence for linkage 1 and linkage 4 that have already been established in other studies, as in the case of the transportability study described previously. Evidence for linkage 1 may also be established using techniques such as meta-analysis to combine results from multiple studies, and a careful job analysis may establish evidence for linkage 4 by showing the focal job to be similar to other jobs included in the meta-analysis. At the extreme, a selection system may be developed for a newly created job with no current incumbents. Here, generalizing evidence from other settings may be especially helpful.

For many testing applications, there is a considerable cumulative body of research that speaks to some, if not all, of the inferences discussed above. A meta-analytic integration of this research can form an integral part of the strategy for linking test information to the construct domain of interest. The value of collecting local validation data varies with the magnitude, relevance, and consistency of research findings using similar predictor measures and similar criterion construct domains for similar jobs. In some cases, a small and inconsistent cumulative research record may lead to a validation strategy that relies heavily on local data; in others, a large, consistent research base may make investing resources in additional local data collection unnecessary.
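A bare-bones sketch of how results from multiple studies can be combined is shown below; the study values are hypothetical, and operational validity-generalization work involves additional corrections and judgments beyond this simple weighting.

```python
# Minimal "bare-bones" meta-analysis sketch: hypothetical study results only.

studies = [   # (observed predictor-criterion r, sample size N)
    (0.22, 120),
    (0.31, 480),
    (0.18, 95),
    (0.27, 310),
]

total_n = sum(n for _, n in studies)
mean_r = sum(r * n for r, n in studies) / total_n

# Compare the observed variability of r across studies with the variability
# expected from sampling error alone.
var_observed = sum(n * (r - mean_r) ** 2 for r, n in studies) / total_n
var_sampling = (1 - mean_r ** 2) ** 2 / (total_n / len(studies) - 1)

print(f"N-weighted mean validity: {mean_r:.3f}")
print(f"Observed vs. sampling-error variance: {var_observed:.4f} vs. {var_sampling:.4f}")
```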

Thus, multiple sources of data and multiple lines of evidence can be drawn upon to evaluate the linkage between a predictor measure and the criterion construct domain of interest. There is no single preferred method of inquiry for establishing this linkage. Rather, the test user must consider the specifics of the testing situation and apply professional judgment in developing a strategy for testing the hypothesis of a linkage between the predictor measure and the criterion domain.

Bases for Evaluating Employment Test Use

Although a primary goal of employment testing is the accurate prediction of subsequent job behaviors or job outcomes, it is important to recognize that there are limits to the degree to which such criteria can be predicted. Perfect prediction is an unattainable goal. First, behavior in work settings is influenced by a wide variety of organizational and extra-organizational factors, including supervisor and peer coaching, formal and informal training, job design, organizational structures and systems, and family responsibilities, among others. Second, behavior in work settings is also influenced by a wide variety of individual characteristics, including knowledge, skills, abilities, personality, and work attitudes, among others. Thus, any single characteristic will be only an imperfect predictor, and even complex selection systems only focus on the set of constructs deemed most critical for the job, rather than on all characteristics that can influence job behavior. Third, some measurement error always occurs, even in well-developed test and criterion measures.

Thus, testing systems cannot be judged against a standard of perfect prediction. Rather, they should be judged in terms of comparisons with available alternative selection methods. Professional judgment, informed by knowledge of the research literature about the degree of predictive accuracy relative to available alternatives, influences decisions about test use.

Decisions about test use are often influenced by additional considerations, including utility (i.e., cost-benefit) and return on investment, value judgments about the relative importance of selecting for one criterion domain versus others, concerns about applicant reactions to test content and processes, the availability and appropriateness of alternative selection methods, and statutory or regulatory requirements governing test use, fairness, and policy objectives such as workforce diversity. Organizational values necessarily come into play in decisions about test use; thus, even organizations with comparable evidence supporting an intended inference drawn from test scores may reach different conclusions about whether to use any particular test.

Testing in Professional and Occupational Credentialing

Tests are widely used in the credentialing of persons for many occupations and professions. Licensing requirements are imposed by federal, state, and local governments to ensure that those who are licensed possess knowledge and skills in sufficient degree to perform important occupational activities safely and effectively. Certification plays a similar role in many occupations not regulated by governments and is often a necessary precursor to advancement. Certification has also become widely used to indicate that a person has specific skills (e.g., operation of specialized auto repair equipment) or knowledge (e.g., estate planning), which may be only a part of their occupational duties. Licensure and certification will here generically be called credentialing.

Tests used in credentialing are intended to provide the public, including employers and government agencies, with a dependable mechanism for identifying practitioners who have met particular standards. The standards may be strict, but not so stringent as to unduly restrain the right of qualified individuals to offer their services to the public. Credentialing also serves to protect the public by excluding persons who are deemed to be not qualified to do the work of the profession or occupation. Qualifications for credentials typically include educational requirements, some amount of supervised experience, and other specific criteria, as well as attainment of a passing score on one or more examinations. Tests are used in credentialing in a broad spectrum of professions and occupations, including medicine, law, psychology, teaching, architecture, real estate, and cosmetology. In some of these, such as actuarial science, clinical neuropsychology, and medical specialties, tests are also used to certify advanced levels of expertise. Relicensure or periodic recertification is also required in some occupations and professions.

Tests used in credentialing are designed to determine whether the essential knowledge and skills have been mastered by the candidate. The focus is on the standards of competence needed for effective performance (e.g., in licensure this refers to safe and effective performance in practice). Test design generally starts with an adequate definition of the occupation or specialty, so that persons can be clearly identified as engaging in the activity. Then the nature and requirements of the occupation, in its current form, are delineated. To identify the knowledge and skills necessary for competent practice, it is important to complete an analysis of the actual work performed and then document the tasks and responsibilities that are essential to the occupation or profession of interest. A wide variety of empirical approaches may be used, including the critical incident technique, job analysis, training needs assessments, or practice studies and surveys of practicing professionals. Panels of experts in the field often work in collaboration with measurement experts to define test specifications, including the knowledge and skills needed for safe, effective performance and an appropriate way of assessing them. The Standards apply to all forms of testing, including traditional multiple-choice and other selected-response tests, constructed-response tasks, portfolios, situational judgment tasks, and oral examinations. More elaborate performance tasks, sometimes using computer-based simulation, are also used in assessing such practice components as, for example, patient diagnosis or treatment planning. Hands-on performance tasks may also be used (e.g., operating a boom crane or filling a tooth), with observation and evaluation by one or more examiners.

Credentialing tests may cover a number of related but distinct areas of knowledge or skill. Designing the testing program includes deciding what areas are to be covered, whether one or a series of tests is to be used, and how multiple test scores are to be combined to reach an overall decision. In some cases, high scores on some tests are permitted to offset (i.e., compensate for) low scores on other tests, so that an additive combination is appropriate. In other cases, a conjunctive decision model requiring acceptable performance on each test in an examination series is used. The type of pass-fail decision model appropriate for a credentialing program should be carefully considered, and the conceptual and/or empirical basis for the decision model should be articulated.
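The difference between the two decision models can be illustrated with a small sketch using hypothetical component scores and cut points: under the compensatory (additive) model a strong practical score can offset a weak written score, while under the conjunctive model it cannot.

```python
# Illustrative sketch of compensatory versus conjunctive pass-fail decisions.
# Scores and cut points are hypothetical.

candidate_scores = {"written": 68, "practical": 84}
component_cuts = {"written": 70, "practical": 75}   # per-test cut scores (conjunctive)
composite_cut = 145                                 # cut on the additive composite

compensatory_pass = sum(candidate_scores.values()) >= composite_cut
conjunctive_pass = all(candidate_scores[t] >= component_cuts[t] for t in component_cuts)

print("Compensatory decision:", "pass" if compensatory_pass else "fail")  # 152 >= 145 -> pass
print("Conjunctive decision:", "pass" if conjunctive_pass else "fail")    # written 68 < 70 -> fail
```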

Validation of credentialing tests depends mainly on content-related evidence, often in the form of judgments that the test adequately represents the content domain associated with the occupation or specialty being considered. Such evidence may be supplemented with other forms of evidence external to the test. For example, information may be provided about the process by which specifications for the content domain were developed and the expertise of the individuals making judgments about the content domain. Criterion-related evidence is of limited applicability because credentialing examinations are not intended to predict individual performance in a specific job but rather to provide evidence that candidates have acquired the knowledge, skills, and judgment required for effective performance, often in a wide variety of jobs or settings (we use the term judgment to refer to the applications of knowledge and skill to particular situations). In addition, measures of performance in practice are generally not available for those who are not granted a credential.

Defining the minimum level of knowledge and skill required for licensure or certification is one of the most important and difficult tasks facing those responsible for credentialing. The validity of the interpretation of the test scores depends on whether the standard for passing makes an appropriate distinction between adequate and inadequate performance. Often, panels of experts are used to specify the level of performance that should be required. Standards must be high enough to ensure that the public, employers, and government agencies are well served, but not so high as to be unreasonably limiting. Verifying the appropriateness of the cut score or scores on a test used for licensure or certification is a critical element of the validation process. Chapter 5 provides a general discussion of setting cut scores (see Standards 5.21–5.23 for specific topics concerning cut scores).
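One widely used judgmental procedure of the kind panels of experts employ, often associated with the Angoff method, has each panelist estimate, item by item, the probability that a minimally competent candidate would answer correctly; the recommended cut score is then derived from the averaged estimates. The sketch below uses hypothetical panelist ratings purely for illustration and omits the discussion, iteration, and impact data that accompany operational standard setting.

```python
# Minimal sketch of an Angoff-style judgmental standard-setting calculation.
# Panelist ratings are hypothetical; rows = panelists, columns = items.

ratings = [
    [0.70, 0.55, 0.90, 0.40, 0.65],
    [0.75, 0.60, 0.85, 0.50, 0.60],
    [0.65, 0.50, 0.95, 0.45, 0.70],
]

n_items = len(ratings[0])
item_means = [sum(panelist[j] for panelist in ratings) / len(ratings)
              for j in range(n_items)]
recommended_cut = sum(item_means)   # expected raw score of a minimally competent candidate

print("Item-level mean judgments:", [round(m, 2) for m in item_means])
print(f"Recommended raw cut score: {recommended_cut:.2f} out of {n_items}")
```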

Legislative bodies sometimes attempt to legislate a cut score, such as answering 70% of test items correctly. Cut scores established in such an arbitrary fashion can be harmful for two reasons. First, without detailed information about the test, job requirements, and their relationship, sound standard setting is impossible. Second, without detailed information about the format of the test and the difficulty of items, such arbitrary cut scores have little meaning.

Scores from credentialing tests need to be precise in the vicinity of the cut score. They may not need to be as precise for test takers who clearly pass or clearly fail. Computer-based mastery tests may include a provision to end the testing when it becomes clear that a decision about the candidate's performance can be made, resulting in a shorter test for candidates whose performance clearly exceeds or falls below the minimum performance required for a passing score. Because mastery tests may not be designed to provide accurate results over the full score range, many such tests report results as simply “pass” or “fail.” When feedback is given to candidates about how well or how poorly they performed, precision throughout the score range is needed. Conditional standard errors of measurement, discussed in chapter 2, provide information about the precision of specific scores.

Candidates who fail may profit from information about the areas in which their performance was especially weak. This is the reason that subscores are sometimes provided. Subscores are often based on relatively small numbers of items and can be much less reliable than the total score. Moreover, differences in subscores may simply reflect measurement error. For these reasons, the decision to provide subscores to candidates should be made carefully, and information should be provided to facilitate proper interpretation. Chapter 2 and Standard 2.3 speak to the importance of subscore reliability.
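The Spearman-Brown formula gives one way to see why a short subscore is typically much less reliable than the long total score from which it is drawn, under the simplifying assumption that the subscore items behave like the test as a whole. The values below are hypothetical.

```python
# Minimal sketch: projected reliability of a short subscore via Spearman-Brown.
# Assumes, for illustration only, that subscore items are comparable to the rest
# of the test; the reliabilities and test lengths are hypothetical.

def spearman_brown(reliability, k):
    """Projected reliability when test length is multiplied by the factor k."""
    return k * reliability / (1 + (k - 1) * reliability)

total_items, total_reliability = 120, 0.92
subscore_items = 12
k = subscore_items / total_items    # shortening factor of 0.1

print(f"Projected 12-item subscore reliability: {spearman_brown(total_reliability, k):.2f}")
# Roughly .53 here, compared with .92 for the total score, which is why
# differences among short subscores can largely reflect measurement error.
```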

Because credentialing tends to involve high stakes and is an ongoing process, with tests given on a regular schedule, it is generally not desirable to use the same test form repeatedly. Thus, new forms, or versions of the test, are generally needed on an ongoing basis. From a technical perspective, all forms of a test should be prepared to the same specifications, assess the same content domains, and use the same weighting of components or topics.

Alternate test forms should have the same score scale so that scores can retain their meaning. Various methods of linking or equating alternate forms can be used to ensure that the standard for passing represents the same level of performance on all forms. Note that release of past test forms may compromise the extent to which different test forms are comparable.
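As a simple illustration of carrying the passing standard across forms, the sketch below applies linear (mean-sigma) equating under an equivalent-groups design, using hypothetical form statistics; operational programs choose among a variety of linking designs and methods.

```python
# Minimal sketch of linear (mean-sigma) equating used to carry a passing score
# from a reference form to a new form; all statistics are hypothetical.

ref_mean, ref_sd = 61.0, 9.0    # reference form
new_mean, new_sd = 58.5, 8.4    # new form (slightly harder in this example)
ref_cut = 70                    # passing score on the reference form

def linear_equate(score, from_mean, from_sd, to_mean, to_sd):
    """Map a score to another form's scale by matching means and standard deviations."""
    return to_mean + to_sd * (score - from_mean) / from_sd

new_cut = linear_equate(ref_cut, ref_mean, ref_sd, new_mean, new_sd)
print(f"Equated passing score on the new form: {new_cut:.1f}")   # 66.9 here
```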

Practice in professions and occupations often changes over time. Evolving legal restrictions, progress in scientific fields, and refinements in techniques can result in a need for changes in test content. Each profession or occupation should periodically reevaluate the knowledge and skills measured in its examination used to meet the requirements of the credential. When change is substantial, it becomes necessary to revise the definition of the profession, and the test content, to reflect changing circumstances. These changes to the test may alter the meaning of the score scale. When major revisions are made in the test or when the score scale changes, the cut score should also be reestablished.

Some credentialing groups consider it necessary, as a practical matter, to adjust their passing score or other criteria periodically to regulate the number of accredited candidates entering the profession. This questionable procedure raises serious problems for the technical quality of the test scores and threatens the validity of the interpretation of a passing score as indicating entry-level competence. Adjusting the cut score periodically also implies that standards are set higher in some years than in others, a practice that is difficult to justify on the grounds of quality of performance. The score scale is sometimes adjusted so that a certain number or proportion of candidates will reach the passing score. This approach, while less obvious to the candidates than changing the cut score, is also technically inappropriate because it changes the meaning of the scores from year to year. Passing a credentialing examination should signify that the candidate meets the knowledge and skill standards set by the credentialing body to ensure effective practice.

Issues of cheating and test security are of special importance for testing practices in credentialing. Issues of test security are covered in chapters 6 and 9. Issues of cheating by test takers are covered in chapter 8 (see Standards 8.9–8.12, addressing testing irregularities).

Fairness and access, discussed in chapter 3, are important for licensing and certification testing. An evaluation of an accommodation or modification for a credentialing test should take into consideration the critical functions performed in the work targeted by the test. In the case of credentialing tests, the criticality of job functions is informed by the public interest as well as the nature of the work itself. When a condition limits an individual's ability to perform a critical function of a job, an accommodation or modification of the licensing or certification exam may not be appropriate (i.e., some changes may fundamentally alter factors that the examination is designed to measure for protection of the public's health, safety, and welfare).

STANDARDS FOR WORKPLACE TESTING AND CREDENTIALING

The standards in this chapter have been separated into three thematic clusters labeled as follows:

1. Standards Generally Applicable to Both Employment Testing and Credentialing
2. Standards for Employment Testing
3. Standards for Credentialing

Cluster 1. Standards Generally Applicable to Both Employment Testing and Credentialing

Standard 11.1

Prior to development and implementation of an employment or credentialing test, a clear statement of the intended interpretations of test scores for specified uses should be made. The subsequent validation effort should be designed to determine how well this has been achieved for all relevant subgroups.

Comment: The objectives of employment and credentialing tests can vary considerably. Some employment tests aim to screen out those least suited for the job in question, while others are designed to identify those best suited for the job. Employment tests also vary in the aspects of job behavior they are intended to predict, which may include quantity or quality of work output, tenure, counterproductive behavior, and teamwork, among others. Credentialing tests and some employment tests are designed to identify candidates who have met some specified level of proficiency in a target domain of knowledge, skills, or judgment.

Standard 11.2

Evidence of validity based on test content requires a thorough and explicit definition of the content domain of interest.

Comment: In general, the job content domain for an employment test should be described in terms of the tasks that are performed and/or the knowledge, skills, abilities, and other characteristics that are required on the job. They should be clearly defined so that they can be linked to test content. The knowledge, skills, abilities, and other characteristics included in the content domain should be those that qualified applicants already possess when being considered for the job in question. Moreover, the importance of these characteristics for the job under consideration should not be expected to change substantially over a specified period of time.

For credentialing tests, the target content domain generally consists of the knowledge, skills, and judgment required for effective performance. The target content domain should be clearly defined so it can be linked to test content.

Standard 11.3

When test content is a primary source of validity evidence in support of the interpretation for the use of a test for employment decisions or credentialing, a close link between test content and the job or professional/occupational requirements should be demonstrated.

Comment: For example, if the test content samples job tasks with considerable fidelity (e.g., with actual job samples such as machine operation) or, in the judgment of experts, correctly simulates job task content (e.g., with certain assessment center exercises), or if the test samples specific job knowledge (e.g., information necessary to perform certain tasks) or skills required for competent performance, then content-related evidence can be offered as the principal form of evidence of validity. If the link between the test content and the job content is not clear and direct, other lines of validity evidence take on greater importance.

When evidence of validity based on test content is presented for a job or class of jobs, the evidence should include a description of the major job characteristics that a test is meant to sample. It is often valuable to also include information about the relative frequency, importance, or criticality of the elements. For a credentialing examination, the evidence should include a description of the major responsibilities, tasks, and/or activities performed by practitioners that the test is meant to sample, as well as the underlying knowledge and skills required to perform those responsibilities, tasks, and/or activities.

Standard 11.4

When multiple test scores or test scores and nontest information are integrated for the purpose of making a decision, the role played by each should be clearly explicated, and the inference made from each source of information should be supported by validity evidence.

Comment: In credentialing, candidates may be required to score at or above a specified minimum on each of several tests (e.g., a practical, skill-based examination and a multiple-choice knowledge test) or at or above a cut score on a total composite score. Specific educational and/or experience requirements may also be mandated. A rationale and its supporting evidence should be provided for each requirement. For tests and assessments, such evidence includes, but is not necessarily limited to, the reliability/precision of scores and the correlations among the tests and assessments.

In employment testing, a decision maker may integrate test scores with interview data, reference checks, and many other sources of information in making employment decisions. The inferences drawn from test scores should be limited to those for which validity evidence is available. For example, viewing a high test score as indicating overall job suitability, and thus precluding the need for reference checks, would be an inappropriate inference from a test measuring a single narrow, albeit relevant, domain, such as job knowledge. In other circumstances, decision makers integrate scores across multiple tests, or across multiple scales within a given test.

Cluster 2. Standards for Employment Testing

Standard 11.5

When a test is used to predict a criterion, the decision to conduct local empirical studies of predictor-criterion relationships and the interpretation of the results should be grounded in knowledge of relevant research.

Comment: The cumulative literature on the relationship between a particular type of predictor and type of criterion may be sufficiently large and consistent to support the predictor-criterion relationship without additional research. In some settings, the cumulative research literature may be so substantial and so consistent that a dissimilar finding in a local study should be viewed with caution unless the local study is exceptionally sound. Local studies are of greatest value in settings where the cumulative research literature is sparse (e.g., due to the novelty of the predictor and/or criterion used), where the cumulative record is inconsistent, or where the cumulative literature does not include studies similar to the study from the local setting (e.g., a study of a test with a large cumulative literature dealing exclusively with production jobs and a local setting involving managerial jobs).

Standard 11.6

Reliance on local evidence of empirically determined predictor-criterion relationships as a validation strategy is contingent on a determination of technical feasibility.

Comment: Meaningful evidence of predictor-criterion relationships is conditional on a number of features, including (a) the job's being relatively stable rather than in a period of rapid evolution; (b) the availability of a relevant and reliable criterion measure; (c) the availability of a sample reasonably representative of the population of interest; and (d) an adequate sample size for estimating the strength of the predictor-criterion relationship. If any of these conditions is not met, some alternative validation strategy should be used. For example, as noted in the comment to Standard 11.5, the cumulative research literature may provide strong evidence of validity.

Standard 11.7

When empirical evidence of predictor-criterion relationships is part of the pattern of evidence used to support test use, the criterion measure(s) used should reflect the criterion construct domain of interest to the organization. All criteria used should represent important work behaviors or work outputs, either on the job or in job-relevant training, as indicated by an appropriate review of information about the job.

Comment: When criteria are constructed to represent job activities or behaviors (e.g., supervisory ratings of subordinates on important job dimensions), systematic collection of information about the job should inform the development of the criterion measures. However, there is no clear choice among the many available job analysis methods. Note that job analysis is not limited to direct observation of the job or direct sampling of subject matter experts; large-scale job-analytic databases often provide useful information. There is not a clear need for job analysis to support criterion use when measures such as absenteeism, turnover, or accidents are the criteria of interest.

Standard 11.8

Individuals conducting and interpreting empirical studies of predictor-criterion relationships should identify artifacts that may have influenced study findings, such as errors of measurement, range restriction, criterion deficiency, criterion contamination, and missing data. Evidence of the presence or absence of such features, and of actions taken to remove or control their influence, should be documented and made available as needed.

Comment: Errors of measurement in the criterion and restrictions on the variability of predictor or criterion scores systematically reduce estimates of the relationship between predictor measures and the criterion construct domain, but procedures for correction for the effects of these artifacts are available. When these procedures are applied, both corrected and uncorrected values should be presented, along with the rationale for the correction procedures chosen. Statistical significance tests for uncorrected correlations should not be used with corrected correlations. Other features to be considered include issues such as missing data for some variables for some individuals, decisions about the retention or removal of extreme data points, the effects of capitalization on chance in selecting predictors from a larger set on the basis of strength of predictor-criterion relationships, and the possibility of spurious predictor-criterion relationships, as in the case of collecting criterion ratings from supervisors who know selection test scores. Chapter 3, on fairness, describes additional issues that should be considered.
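Two of the corrections referred to in this comment can be sketched as follows, with hypothetical values: disattenuation for criterion unreliability and the common (Thorndike Case II) correction for direct range restriction on the predictor. The order and details of such corrections require care in practice, and corrected and uncorrected values should both be reported.

```python
# Minimal sketch of two common artifact corrections; all input values are hypothetical.

import math

def correct_for_criterion_unreliability(r_obs, criterion_reliability):
    """Disattenuate an observed validity coefficient for error in the criterion."""
    return r_obs / math.sqrt(criterion_reliability)

def correct_for_range_restriction(r_obs, sd_ratio):
    """Thorndike Case II correction; sd_ratio = unrestricted SD / restricted SD."""
    u = sd_ratio
    return u * r_obs / math.sqrt(1 - r_obs**2 + (u * r_obs) ** 2)

r_observed = 0.28   # observed predictor-criterion correlation in the selected sample
r_yy = 0.64         # reliability of the criterion ratings
sd_ratio = 1.4      # applicant-pool SD of the predictor relative to the selected group

r_step1 = correct_for_range_restriction(r_observed, sd_ratio)
r_step2 = correct_for_criterion_unreliability(r_step1, r_yy)
print(f"Observed r = {r_observed:.2f}; corrected estimate = {r_step2:.2f}")
```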

Standard 11.9

Evidence of predictor-criterion relationships in a current local situation should not be inferred from a single previous validation study unless the previous study of the predictor-criterion relationships was done under favorable conditions (i.e., with a large sample size and a relevant criterion) and the current situation corresponds closely to the previous situation.

Comment: Close correspondence means that the criteria (e.g., the job requirements or underlying psychological constructs) are substantially the same (e.g., as is determined by a job analysis), and that the predictor is substantially the same. Judgments about the degree of correspondence should be based on factors that are likely to affect the predictor-criterion relationship. For example, a test of situational judgment found to predict performance of managers in one country may or may not predict managerial performance in another country with a very different culture.


Standard 11.10

If tests are to be used to make job classification decisions (e.g., if the pattern of predictor scores will be used to make differential job assignments), evidence that scores are linked to different levels or likelihoods of success among jobs, job groups, or job levels is needed.

Comment: As noted in chapter 1, it is possible for tests to be highly predictive of performance for different jobs but not provide evidence of differential success among the jobs. For example, the same people may be predicted to be successful for each of the jobs.

Standard 11.11

If evidence based on test content is a primary source of validity evidence supporting the use of a test for selection into a particular job, a similar inference should be made about the test in a new situation only if the job and situation are substantially the same as the job and situation where the original validity evidence was collected.

Comment: Appropriate test use in this context requires that the critical job content factors be substantially the same (e.g., as is determined by a job analysis) and that the reading level of the test material not exceed that appropriate for the new job. In addition, the original meaning of the test materials should not be substantially changed in the new situation. For example, “salt is to pepper” may be the correct answer to the analogy item “white is to black” in a culture where people ordinarily use black pepper, but the item would have a different meaning in a culture where white pepper is the norm.

Standard 11.12

When the use of a given test for personnel selection relies on relationships between a predictor construct domain that the test represents and a criterion construct domain, two links need to be established. First, there should be evidence that the test scores are reliable and that the test content adequately samples the predictor construct domain; and second, there should be evidence for the relationship between the predictor construct domain and major factors of the criterion construct domain.

Comment: There should be a clear conceptual rationale for these linkages. Both the predictor construct domain and the criterion construct domain to which it is to be linked should be defined carefully. There is no single preferred route to establishing these linkages. Evidence in support of linkages between the two construct domains can include patterns of findings in the research literature and systematic evaluation of job content to identify predictor constructs linked to the criterion domain. The bases for judgments linking the predictor and criterion construct domains should be documented.

For example, a test of cognitive ability might be used to predict performance in a job that is complex and requires sophisticated analysis of many factors. Here, the predictor construct domain would be cognitive ability, and verifying the first link would entail demonstrating that the test is an adequate measure of the cognitive ability domain. The second linkage might be supported by multiple lines of evidence, including a compilation of research findings showing a consistent relationship between cognitive ability and performance on complex tasks, and by judgments from subject matter experts regarding the importance of cognitive ability for performance in the performance domain.

Cluster 3. Standards for Credentialing

Standard 11.13

The content domain to be covered by a credentialing test should be defined clearly and justified in terms of the importance of the content for credential-worthy performance in an occupation or profession. A rationale and evidence should be provided to support the claim that the knowledge or skills being assessed are required for credential-worthy performance in that occupation and are consistent with the purpose for which the credentialing program was instituted.

Comment: Typically, some form of job or practice analysis provides the primary basis for defining the content domain. If the same examination is used in the credentialing of people employed in a variety of settings and specialties, a number of different job settings may need to be analyzed. Although the job analysis techniques may be similar to those used in employment testing, the emphasis for credentialing is limited appropriately to knowledge and skills necessary for effective practice. The knowledge and skills contained in a core curriculum designed to train people for the job or occupation may be relevant, especially if the curriculum has been designed to be consistent with empirical job or practice analyses.

In tests used for licensure, knowledge and skills that may be important to success but are not directly related to the purpose of licensure (e.g., protecting the public) should not be included. For example, in accounting, marketing skills may be important for success, and assessment of those skills might have utility for organizations selecting accountants for employment. However, lack of those skills may not present a threat to the public, and thus the skills would appropriately be excluded from this licensing examination. The fact that successful practitioners possess certain knowledge or skills is relevant but not persuasive. Such information needs to be coupled with an analysis of the purpose of a credentialing program and the reasons that the knowledge or skills are required in an occupation or profession.

Standard 11.14

Estimates of the consistency of test-based credentialing decisions should be provided in addition to other sources of reliability evidence.

Comment: The standards for decision consistency described in chapter 2 are applicable to tests used for licensure and certification. Other types of reliability estimates and associated standard errors of measurement may also be useful, particularly the conditional standard error at the cut score. However, the consistency of decisions on whether to certify is of primary importance.
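
As one illustration of the kind of evidence this comment calls for, the sketch below estimates decision consistency as the proportion of candidates classified the same way (pass or fail) by two parallel forms of a test. The cut score, form names, and data are hypothetical, and operational programs typically rely on the model-based estimates described in chapter 2 rather than this simple two-form agreement rate.

```python
import numpy as np

def decision_consistency(form_a, form_b, cut_score):
    """Proportion of candidates given the same pass/fail decision by two parallel forms."""
    a_pass = np.asarray(form_a) >= cut_score
    b_pass = np.asarray(form_b) >= cut_score
    return np.mean(a_pass == b_pass)

# Hypothetical scores for six candidates on two parallel forms, with a cut score of 70.
form_a = [62, 71, 85, 69, 74, 90]
form_b = [65, 73, 82, 72, 70, 88]
print(decision_consistency(form_a, form_b, cut_score=70))  # 5 of 6 decisions agree
```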

Standard 11.15

Rules and procedures that are used to combine scores on different parts of an assessment or scores from multiple assessments to determine the overall outcome of a credentialing test should be reported to test takers, preferably before the test is administered.

Comment: In some credentialing cases, candidates may be required to score at or above a specified minimum on each of several tests. In other cases, the pass-fail decision may be based solely on a total composite score. If tests will be combined into a composite, candidates should be provided information about the relative weighting of the tests. It is not always possible to inform candidates of the exact weights prior to test administration because the weights may depend on empirical properties of the score distributions (e.g., their variances). However, candidates should be informed of the intention of weighting (e.g., test A contributes 25% and test B contributes 75% to the total score).

Standard 11.16

The level of performance required for passing a credentialing test should depend on the knowledge and skills necessary for credential-worthy performance in the occupation or profession and should not be adjusted to control the number or proportion of persons passing the test.

Comment: The cut score should be determined by a careful analysis and judgment of credential-worthy performance (see chap. 5). When there are alternate forms of a test, the cut score should refer to the same level of performance for all forms.

12. EDUCATIONAL TESTING AND ASSESSMENT

BACKGROUND

Educational testing has a long history of use for informing decisions about learning, instruction, and educational policy. Results of tests are used to make judgments about the status, progress, or accomplishments of individual students, as well as entities such as schools, school districts, states, or nations. Tests used in educational settings represent a variety of approaches, ranging from traditional multiple-choice and open-ended item formats to performance assessments, including scorable portfolios. As noted in the introductory chapter, a distinction is sometimes made between the terms test and assessment, the latter term encompassing broader sources of information than a score on a single instrument. In this chapter we use both terms, sometimes interchangeably, because the standards discussed generally apply to both.

This chapter does not explicitly address issues related to tests developed or selected exclusively to inform learning and instruction at the classroom level. Those tests often have consequences for students, including influencing instructional actions, placing students in educational programs, and affecting grades that may influence admission to colleges. The Standards provide desirable criteria of quality that can be applied to such tests. However, as with past editions, practical considerations limit the Standards’ applicability at the classroom level. Formal validation practices are often not feasible for classroom tests because schools and teachers do not have the resources to document the characteristics of their tests and are not publishing their tests for widespread use. Nevertheless, the core expectations of validity, reliability/precision, and fairness should be considered in the development of such tests.

The Standards clearly applies to formal tests whose scores or other results are used for purposes that extend beyond the classroom, such as benchmark or interim tests that schools and districts use to monitor student progress. The Standards also applies to assessments that are adopted for use across classrooms and whose developers make claims for the validity of score interpretations for intended uses. Admittedly, this distinction is not always clear. Increasingly, districts, schools, and teachers are using an array of coordinated instruction and/or assessment systems, many of which are technology based. These systems may include, for example, banks of test items that individual teachers can use in constructing tests for their own purposes, focused assessment exercises that accompany instructional lessons, or simulations and games designed for instruction or assessment purposes. Even though it is not always possible to separate measurement issues from corresponding instructional and learning issues in these systems, assessments that are part of these systems and that serve purposes beyond an individual teacher’s instruction fall within the purview of the Standards. Developers of these systems bear responsibility for adhering to the Standards to support their claims.

Both the introductory discussion and the standards provided in this chapter are organized into three broad clusters: (1) design and development of educational assessments; (2) use and interpretation of educational assessments; and (3) administration, scoring, and reporting of educational assessments. Although the clusters are related to the chapters addressing operational areas of the standards, this discussion draws upon the principles and concepts provided in the foundational chapters on validity, reliability/precision, and fairness and applies them to educational settings. It should also be noted that this chapter does not specifically address the use of test results in mandated accountability systems that may impose performance-based rewards or sanctions on institutions such as schools or school districts or on individuals such as teachers or principals. Accountability applications involving aggregates of scores are addressed in chapter 13 (“Uses of Tests for Program Evaluation, Policy Studies, and Accountability”).

Design and Development of Educational Assessments

Educational tests are designed and developed to provide scores that support interpretations for the intended test purposes and uses. Design and development of educational tests, therefore, begins by considering test purpose. Once a test’s purposes are established, considerations related to the specifics of test design and development can be addressed.

Major Purposes of Educational Testing

Although educational tests are used in a variety of ways, most address at least one of three major purposes: (a) to make inferences that inform teaching and learning at the individual or curricular level; (b) to make inferences about outcomes for individual students and groups of students; and (c) to inform decisions about students, such as certifying students’ acquisition of particular knowledge and skills for promotion, placement in special instructional programs, or graduation.

Informing teaching and learning. Assessments that inform teaching and learning start with clear goals for student learning and may involve a variety of strategies for assessing student status and progress. The goals are typically cognitive in nature, such as student understanding of rational number equivalence, but may also address affective states or psychomotor skills. For example, teaching and learning goals could include increasing student interest in science or teaching students to form letters with a pen or pencil.

Many assessments that inform teaching and learning are used for formative purposes. Teachers use them in day-to-day classroom settings to guide ongoing instruction. For example, teachers may assess students prior to starting a new unit to ascertain whether they have acquired the necessary prerequisite knowledge and skills. Teachers may then gather evidence throughout the unit to see whether students are making anticipated progress and to identify any gaps and/or misconceptions that need to be addressed.

More formal assessments used for teaching and learning purposes may not only inform classroom instruction but also provide individual and aggregated assessment data that others may use to support learning improvement. For example, teachers in a district may periodically administer commercial or locally constructed assessments that are aligned with the district curriculum or state content standards. These tests may be used to evaluate student learning over one or more units of instruction. Results may be reported immediately to students, teachers, and/or school or district leaders. The results may also be broken down by content standard or subdomain to help teachers and instructional leaders identify strengths and weaknesses in students’ learning and/or to identify students, teachers, and/or schools that may need special assistance. For example, special programs may be designed to tutor students in specific areas in which test results indicate they need help. Because the test results may influence decisions about subsequent instruction, it is important to base content domain or subdomain scores on sufficient numbers of items or tasks to reliably support the intended uses.

In some cases, assessments administered during the school year may be used to predict student performance on a year-end summative assessment. If the predicted performance on the year-end assessment is low, additional instructional interventions may be warranted. Statistical techniques, such as linear regression, may be used to establish the predictive relationships. A confounding variable in such predictions may be the extent to which instructional interventions based on interim results improve the performance of initially low-scoring students over the course of the school year; the predictive relationships will decrease to the extent that student learning is improved.
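
A minimal sketch of the kind of prediction described above, assuming a district has matched interim and year-end scores from a prior cohort; the variable names and data are hypothetical, and an operational model would be evaluated for accuracy and fairness before it informed any intervention decisions.

```python
import numpy as np

# Hypothetical prior-cohort data: fall and winter interim scores, and year-end summative scores.
interim = np.array([[410, 435], [380, 402], [455, 470], [500, 512], [430, 461]])
year_end = np.array([448, 415, 489, 530, 476])

# Fit an ordinary least-squares regression of year-end scores on the interim scores.
X = np.column_stack([np.ones(len(interim)), interim])   # add an intercept column
coef, *_ = np.linalg.lstsq(X, year_end, rcond=None)

# Predict the year-end score for a current student with fall = 420 and winter = 440.
new_student = np.array([1, 420, 440])
print(new_student @ coef)
```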

Assessing student outcomes. The assessment of student outcomes typically serves summative functions, that is, to help assess pupils’ learning at the completion of a particular instructional sequence (e.g., the end of the school year). Educational testing of student outcomes can be concerned with several types of score interpretations, including standards-based interpretations, growth-based interpretations, and normative interpretations. These outcomes may relate to the individual student or be aggregated over groups of students, for example, classes, subgroups, schools, districts, states, or nations.

Standards-based interpretations of student outcomes typically start with content standards, which specify what students are expected to know and be able to do. Such standards are typically established by committees of experts in the area to be tested. Content standards should be clear and specific and give teachers, students, and parents sufficient direction to guide teaching and learning. Academic achievement standards, which are sometimes referred to as performance standards, connect content standards to information that describes how well students are acquiring the knowledge and skills contained in academic content standards. Performance standards may include labels for levels of performance (e.g., “basic,” “proficient,” “advanced”), descriptions of what students at different performance levels know and can do, examples of student work that illustrate the range of achievement within each performance level, and cut scores specifying the levels of performance on an assessment that separate adjacent levels of achievement. The process of establishing the cut scores for the academic achievement standards is often referred to as standard setting.

Although it follows from a consideration of standards-based testing that assessments should be tightly aligned with content standards, it is usually not possible to comprehensively measure all of the content standards using a single summative test. For example, content standards that focus on student collaboration, oral argumentation, or scientific lab activities do not easily lend themselves to measurement by traditional tests. As a result, certain content standards may be underemphasized in instruction at the expense of standards that can be measured by the end-of-year summative test. Such limitations may be addressed by developing assessment components that focus on various aspects of a set of common content standards. For example, performance assessments that are more closely connected with instructional units may measure certain content standards that are not easily assessed by a more traditional end-of-year summative assessment.

The evaluation of student outcomes can also involve interpretations related to student progress or growth over time, rather than just performance at a particular time. In standards-based testing, an important consideration is measuring student growth from year to year, both at the level of the individual student and aggregated across students, for example at the teacher, subgroup, or school level. A number of educational assessments are used to monitor the progress or growth of individual students within and/or across school years. Tests used for these purposes are sometimes supported by vertical scales that span a broad range of developmental or educational levels and include (but are not limited to) both conventional multilevel test batteries and computerized adaptive assessments. In constructing vertical scales for educational tests, it is important to align standards and/or learning objectives vertically across grades and to design tests at adjacent levels (or grades) that have substantial overlap in the content measured.

However, a variety of alternative statistical models exist for measuring student growth, not all of which require the use of a vertical scale. In using and evaluating various growth models, it is important to clearly understand which questions each growth model can (and cannot) answer, what assumptions each growth model is based on, and what appropriate inferences can be derived from each growth model’s results. Missing data can create challenges for some growth models. Attention should be paid to whether some populations are being excluded from the model due to missing data (for example, students who are mobile or have poor attendance). Other factors to consider in the use of growth models are the relative reliability/precision of scores estimated for groups with different amounts of missing data, and whether the model treats students the same regardless of where they are on the performance continuum.

Student outcomes in educational testing are sometimes evaluated through norm-referenced interpretations. A norm-referenced interpretation compares a student’s performance with the performances of other students. Such interpretations may be made when assessing both status and growth. Comparisons may be made to all students, to a particular subgroup (e.g., other test takers who have majored in the test taker’s intended field of study), or to subgroups based on many other conditions (e.g., students with similar academic performance, students from similar schools). Norms can be developed for a variety of targeted populations ranging from national or international samples of students to the students in a particular school district (i.e., local norms). Norm-referenced interpretations should consider differences in the target populations at different times of a school year and in different years. When a test is routinely administered to an entire target population, as in the case of a statewide assessment, norm-referenced interpretations are relatively easy to produce and generally apply only to a single point in the school year. However, national norms for a standardized achievement test are often provided at several intervals within the school year. In that case, developers should indicate whether the norms covering a particular time interval were based on data or interpolated from data collected at other times of year. For example, winter norms are often based on an interpolation between empirical norms collected in fall and spring. The basis for calculating interpolated norms should be documented so that users can be made aware of the underlying assumptions about student growth over the school year.
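
As an illustration of the interpolation described above, the sketch below derives a winter norm from empirical fall and spring medians under an assumption of uniform growth across the school year. The scale scores and testing-window months are hypothetical; a publisher’s actual procedure and growth assumptions may differ and should be documented.

```python
# Hypothetical empirical norms: median scale scores from fall and spring norming windows.
fall_median, spring_median = 210.0, 230.0

# Assume testing windows at months 2 (fall), 5 (winter), and 8 (spring) of the school year,
# and linear (uniform) growth between the two empirical points.
fall_month, winter_month, spring_month = 2, 5, 8
fraction = (winter_month - fall_month) / (spring_month - fall_month)
winter_median = fall_median + fraction * (spring_median - fall_median)
print(winter_median)  # 220.0 under the uniform-growth assumption
```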

Because of the time and expense associated with developing national norms, many test developers report alternative user norms that consist of descriptive statistics based on all those who take their test or a demographically representative subset of those test takers over a given period of time. Although such statistics, based on people who happen to take the test, are often useful, the norms based on them will change as the makeup of the reference group changes. Consequently, user norms should not be confused with norms representative of more systematically sampled groups.

Informing decisions about students. Test results are often used in the process of making decisions about individual students, for example, about high school graduation, placement in certain educational programs, or promotion from one grade to the next. In higher education, test results inform admissions decisions and the placement of admitted students in different courses (e.g., remedial or regular) or instructional programs.

Fairness is a fundamental concern with all tests, but because decisions regarding educational placement, promotion, or graduation can have profound individual effects, fairness is paramount when tests are used to inform such decisions. Fairness in this context can be enhanced through careful consideration of conditions that affect students’ opportunities to demonstrate their capabilities. For example, when tests are used for promotion and graduation, the fairness of individual score interpretations can be enhanced by (a) providing students with multiple opportunities to demonstrate their capabilities through repeated testing with alternate forms or other construct-equivalent means; (b) providing students with adequate notice of the skills and content to be tested, along with appropriate test preparation materials; (c) providing students with curriculum and instruction that afford them the opportunity to learn the content and skills to be tested; (d) providing students with equal access to disclosed test content and responses as well as any specific guidance for test taking (e.g., test-taking strategies); (e) providing students with appropriate testing accommodations to address particular access needs; and (f) in appropriate cases, taking into account multiple criteria rather than just a single test score.

Tests informing college admissions decisions are used in conjunction with other information about students’ capabilities. Selection criteria may vary within an institution by academic specialization and may include past academic records, transcripts, and grade-point average or rank in class. Scores on tests used to certify students for high school graduation or scores on tests administered at the end of specific high school courses may be used in college admissions decisions. The interpretations inherent in these uses of high school tests should be supported by multiple lines of relevant validity evidence (e.g., both concurrent and predictive evidence). Other measures used by some institutions in making admissions decisions are samples of previous work by students, lists of academic and service accomplishments, letters of recommendation, and student-composed statements evaluated for the appropriateness of the goals and experience of the student and/or for writing proficiency.

Tests used to place students in appropriate college-level or remedial courses play an important role in both community colleges and four-year institutions. Most institutions either use commercial placement tests or develop their own tests for placement purposes. The items on placement tests are typically selected to serve this single purpose in an efficient manner and usually do not comprehensively measure prerequisite content. For example, a placement test in algebra will cover only a subset of algebra content taught in high school. Results of some placement tests are used to exempt students from having to take a course that would normally be required. Other placement tests are used by advisors for placing students in remedial courses or the most appropriate course in an introductory college-level sequence. In some cases, placement decisions are mechanized through the application of locally determined cut scores on the placement exam. Such cut scores should be established through a documented process involving appropriate stakeholders and validated through empirical research.

Results from educational tests may also inform decisions related to placing students in special instructional programs, including those for students with disabilities, English learners, and gifted and talented students. Test scores should never be used as the sole basis for including any student in special education programming or for excluding any student from such programming. Test scores should be interpreted in the context of the student’s history, functioning, and needs. Nevertheless, test results may provide an important basis for determining whether a student has a disability and what the student’s educational needs are.

Development of Educational Tests

As with all tests, once the construct and purposes of an educational test have been delineated, consideration must be given to the intended population of test takers, as well as to practical issues such as available testing time and the resources available to support the development effort. In the development of educational tests, focus is placed on measuring the knowledge, skills, and abilities of all examinees in the intended population without introducing any advantages or disadvantages because of individual characteristics (e.g., age, culture, disability, gender, language, race/ethnicity) that are irrelevant to the construct the test is intended to measure. The principles of universal design, an approach to assessment development that attempts to maximize the accessibility of a test for all of its intended examinees, provide one basis for developing educational assessments in this manner. Paramount in the process is explicit documentation of the steps taken during the development process to provide evidence of fairness, reliability/precision, and validity for the test’s intended uses. The higher the stakes associated with the assessment, the more attention needs to be paid to such documentation. More detailed considerations related to the development of educational tests are discussed in the chapters on fairness in testing (chap. 3) and test design and development (chap. 4).

A variety of formats are used in developing educational tests, ranging from traditional item formats such as multiple-choice and open-ended items to performance assessments, including scorable portfolios, simulations, and games. Examples of such performance assessments might include solving problems using manipulable materials, making complex inferences after collecting information, or explaining orally or in writing the rationale for a particular course of government action under given economic conditions. An individual portfolio may be used as another type of performance assessment. Scorable portfolios are systematic collections of educational products typically collected, and possibly revised, over time.

Technology is often used in educational settings to present testing material and to record and score test takers’ responses. Examples include enhancements of text by audio instructions to facilitate student understanding, computer-based and adaptive tests, and simulation exercises where attributes of performance assessments are supported by technology. Some test administration formats also may have the capacity to capture aspects of students’ processes as they solve test items. They may, for example, monitor time spent on items, solutions tried and rejected, or editing sequences for texts created by test takers. Technologies also make it possible to provide test administration conditions designed to accommodate students with particular needs, such as those with different language backgrounds, attention deficit disorders, or physical disabilities.

Interpretations of scores on technology-based tests are evaluated by the same standards for validity, reliability/precision, and fairness as tests administered through more traditional means. It is especially important that test takers be familiarized with the assessment technologies so that any unfamiliarity with an input device or assessment interface does not lead to inferences based on construct-irrelevant variance. Furthermore, explicit consideration of sources of construct-irrelevant variance should be part of the validation framework as new technologies or interfaces are incorporated into assessment programs. Finally, it is important to describe scoring algorithms used in technology-based tests and the expert models on which they may be based, and to provide technical data supporting their use in the testing system documentation. Such documentation, however, should stop short of jeopardizing the security of the assessment in ways that could adversely affect the validity of score interpretations.

Assessments Serving Multiple Purposes

By evaluating students’ knowledge and skills relative to a specific set of academic goals, test results may serve a variety of purposes, including improving instruction to better meet student needs; evaluating curriculum and instruction district-wide; identifying students, schools, and/or teachers who need help; and/or predicting each student’s likelihood of success on a summative assessment. It is important to validate the interpretations made from test scores on such assessments for each of their intended uses.

There are often tensions associated with using educational assessments for multiple purposes. For example, a test developed to monitor the progress or growth of individual students across school years is unlikely to also effectively provide detailed and actionable diagnostic information about students’ strengths and weaknesses. Similarly, an assessment designed to be given several times over the course of the school year to predict student performance on a year-end summative assessment is unlikely to provide useful information about student learning with respect to particular instructional units. Most educational tests will serve one purpose better than others; and the more purposes an educational test is purported to serve, the less likely it is to serve any of those purposes effectively. For this reason, test developers and users should design and/or select educational assessments to achieve the purposes they believe are most important, and they should consider whether additional purposes can be fulfilled and should monitor the appropriateness of any identified additional uses.

Use and Interpretation of Educational Assessments

Stakes and Consequences of Assessment

The importance of the results of testing programs for individuals, institutions, or groups is often referred to as the stakes of the testing program. When the stakes for an individual are high, and important decisions depend substantially on test performance, the responsibility for providing evidence supporting a test’s intended purposes is greater than might be expected for tests used in low-stakes settings. Although it is never possible to achieve perfect accuracy in describing an individual’s performance, efforts need to be made to minimize errors of measurement or errors in classifying individuals into categories such as “pass,” “fail,” “admit,” or “reject.” Further, supporting the validity of interpretations for high-stakes purposes, whether individual or institutional, typically entails collecting sound collateral information that can be used to assist in understanding the factors that contributed to test results and to provide corroborating evidence that supports inferences based on the results. For example, test results can be influenced by multiple factors, both institutional and individual, such as the quality of education provided, students’ exposure to education (e.g., through regular school attendance), and students’ motivation to perform well on the test. Collecting this type of information can contribute to appropriate interpretations of test results.

The high-stakes nature of some testing programs can create special challenges when new test versions are introduced. For example, a state may introduce a series of high school end-of-course tests that are based on new content standards and are partially tied to graduation requirements. The operational use of these new tests must be accompanied by documentation that students have indeed been instructed on content aligned to the new standards. Because of feasibility constraints, this may require a carefully planned phase-in period that includes special surveys or qualitative research studies that provide the needed opportunity-to-learn documentation. Until such documentation is available, the tests should not be used for their intended high-stakes purpose.

Many types of educational tests are viewed as tools of educational policy. Beyond any intended policy goals, it is important to consider potential unintended effects of large-scale testing programs. These possible unintended effects include (a) narrowing of curricula in some schools to focus exclusively on anticipated test content, (b) restriction of the range of instructional approaches to correspond to the testing format, (c) higher dropout rates among students who do not pass the test, and (d) encouragement of instructional or administrative practices that may raise test scores without improving the quality of education. It is essential for those who mandate and use educational tests to be aware of such potential negative consequences (including missed opportunities to improve teaching and learning), to collect information that bears on these issues, and to make decisions about the uses of assessments that take this information into account.

Assessments for Students With Disabilities and English Language Learners

In the 1999 edition of the Standards, the material on educational testing for special populations focused primarily on individualized diagnostic assessment and educational placement for students with special needs. Since then, requirements stemming from federal legislation have significantly increased the participation of special populations in large-scale educational assessment programs. Special populations have also become more diverse and now represent a larger percentage of those test takers who participate in general education programs. More students are being diagnosed with disabilities, and more of these students are included in general education programs and in state standards-based assessments. In addition, the number of students who are English language learners has grown dramatically, and the number included in educational assessments has increased accordingly.

As discussed in chapter 3 (“Fairness in Testing”), assessments for special populations involve a continuum of potential adaptations, ranging from specially developed alternate assessments to modifications and accommodations of regular assessments. The purpose of alternate assessments and adaptations is to increase the accessibility of tests that may not otherwise allow students with some characteristics to display their knowledge and skills. Assessments for special populations may also include assessments developed for English language learners and individually administered assessments that are used for diagnosis and placement.

Alternate assessments. The term alternate assessments as used here, in the context of educational testing, refers to assessments developed for students with significant cognitive disabilities. Based on performance standards different from those used for regular assessments, alternate assessments provide these students with the opportunity to demonstrate their standing and progress in learning. An alternate assessment might consist of an observation checklist, a multilevel assessment with performance tasks, or a portfolio that includes responses to selected-response and/or open-ended tasks. The assessment tasks are developed with the special characteristics of this population in mind. For example, a multilevel assessment with performance tasks might include scaffolding procedures in which the examiner eliminates question distracters when students answer incorrectly, in order to reduce question complexity. Or, in a portfolio assessment, the teacher might include work samples and other assessment information tailored specifically to the student. The teacher may assess the same English language arts standard by asking one student to write a story and another to sequence a story using picture cards, depending on which activity provides students with access to demonstrate what they know and can do.

The development and use of alternate assessments in education have been heavily influenced by federal legislation. Federal regulations may require that alternate assessments used in a given state have explicit connections to the content standards measured by the regular state assessment while allowing for content with less depth, breadth, and complexity. Such requirements clearly influence the design and development of alternate assessments in state standards-based programs.

Alternate assessments in education should be held to the same technical requirements that apply to regular large-scale assessments. These include documentation and empirical data that support test development, standard setting, validity, reliability/precision, and technical characteristics of the tests. When the number of students served under alternate assessments is too small to generate stable statistical data, the test developer and users should describe alternate judgmental or other procedures used to document evidence of the validity of score interpretations.

A variety of comparability issues may arise when alternate assessments are used in statewide testing programs, for example, in aggregating the results of alternate and regular assessments or in comparing trend data for subgroups when alternate assessments have been used in some years and regular assessments in other years.

Accommodations and modifications. To enable assessment systems to include all students, accommodations and modifications are provided to those students who need them, including those who participate in alternate assessments because of their significant cognitive disabilities. Adaptations, which include both accommodations and modifications, provide access to educational assessments.

Accommodations are adaptations to test format or administration (such as changes in the way the test is presented, the setting for the test, or the way in which the student responds) that maintain the same construct and produce results that are comparable to those obtained by students who do not use accommodations. Accommodations may be provided to English language learners to address their linguistic needs, as well as to students with disabilities to address specific, individual characteristics that otherwise would interfere with accessibility. For example, a student with extreme dyslexia may be provided with a screen reader to read aloud the scenarios and questions on a test measuring science inquiry skills. The screen reader would be considered an accommodation because reading is not part of the defined construct (science inquiry) and the scores obtained by the student on the test would be assumed to be comparable to those obtained by students testing under regular conditions.

The use of accommodations should be supported by evidence that their application does not change the construct that is being measured by the assessment. Such evidence may be available from studies of similar applications but may also require specially designed research.

Modifications are adaptations to test format or administration that change the construct being measured in order to make it accessible for designated students while retaining as much of the original construct as possible. Modifications result in scores that differ in meaning from those for the regular assessment. For example, a student with extreme dyslexia may be provided with a screen reader to read aloud the passages and questions on a reading comprehension test that includes decoding as part of the construct. In this case, the screen reader would be considered a modification because it changes the construct being measured, and scores obtained by the student on the test would not be assumed to be comparable to those obtained by students testing under regular conditions. In many cases, accommodations can meet student access needs without the use of modifications, but in some cases, modifications are the only option for providing some students with access to an educational assessment. As with alternate assessments, comparability issues arise with the use of modifications in educational testing programs.

Modified tests should be designed and developed with the same considerations of validity, reliability/precision, and fairness as regular assessments. It is not sufficient to assume that the validity evidence associated with a regular assessment generalizes to a modified version.

An extensive discussion of modifications and accommodations for special populations is provided in chapter 3 (“Fairness in Testing”).

Assessments for English language proficiency. An increasing focus on the measurement of English language proficiency (ELP) for English language learners (ELLs) has mirrored the growing presence of these students in U.S. classrooms. Like standards-based content tests, ELP tests are based on ELP standards and are held to the same standards for precision of scores and validity and fairness of score interpretations for intended uses as are other large-scale tests.

ELP tests can serve a variety of purposes. They are used to identify students as English learners and qualify them for special ELL programs and services, to redesignate students as English proficient, and for purposes of diagnosis and instruction. States, districts, and schools also use ELP tests to monitor these students’ progress and to hold schools and educators accountable for ELL learning and progress toward English proficiency.

As with any educational test, validity evidence for measures of ELP can be provided by examining the test blueprint, the alignment of content with ELP standards, construct comparability across students, classification consistency, and other claims in the validity argument. The rationale and evidence supporting the ELP domain definition and the roles/relationships of the language modalities (e.g., reading, writing, speaking, listening) to overall ELP are important considerations in articulating the validity argument for an ELP test and can inform the interpretation of test results. Since no single assessment is equally effective in serving all desired purposes, users should consider which uses of ELP tests are their highest priority and choose or develop instruments accordingly.

Accommodations associated with ELP tests should be carefully considered, as adaptations that are appropriate for regular content assessments may compromise the ELP standards being assessed. In addition, users should establish common guidelines for using ELP results in making decisions about ELL students. The guidelines should include explicit policies and procedures for using results in identifying and redesignating ELL students as English proficient, an important process because of the legal and educational importance of these designations. Local education agencies and schools should be provided with easy access to the guidelines.

Individual assessments. Individually administered tests are used by psychologists and other professionals in schools and other related settings to inform decisions about a variety of services that may be administered to students. Services are provided for students who are gifted as well as for those who encounter academic difficulties (e.g., students requiring remedial reading instruction). Still other services are provided for students who display behavioral, emotional, physical, and/or more severe learning difficulties. Services may be provided for students who are taught in regular classrooms as well as for those receiving more specialized instruction (e.g., special education students).

Aspects of the test that may result in construct-irrelevant variance for students with certain relevant characteristics should be taken into account as appropriate by qualified testing professionals when using test results to aid placement decisions. For example, students’ English language proficiency or prior educational experience may interfere with their performance on a test of academic ability and, if not taken into account, could lead to misclassification in special education. Once a student is placed, tests may be administered to monitor the progress of the student toward prescribed learning goals and objectives. Test results may also be used to inform evaluations of instructional effectiveness and determinations of whether the special services need to be continued, modified, or discontinued.

Many types of tests are used in individualized and special needs testing. These include tests of cognitive abilities, academic achievement, learning processes, visual and auditory memory, speech and language, vision and hearing, and behavior and personality. These tests typically are used in conjunction with other assessment methods, such as interviews, behavioral observations, and reviews of records, for purposes of identifying and placing students with disabilities. Regardless of the qualities being assessed and the data collection methods employed, assessment data used in making special education decisions are evaluated in terms of evidence supporting intended interpretations as related to the specific needs of the students. The data must also be judged in terms of their usefulness for designing appropriate educational programs for students who have special needs. For further information, see chapter 10 (“Psychological Testing and Assessment”).

Assessment Literacy and Professional Development

Assessment literacy can be broadly defined as knowledge about the basic principles of sound assessment practice, including terminology, the development and use of assessment methodologies and techniques, and familiarity with standards by which the quality of testing practices is judged. The results of educational assessments are used in decision making across a variety of settings in classrooms, schools, districts, and states. Given the range and complexity of test purposes, it is important for test developers and those responsible for educational testing programs to encourage educators to be informed consumers of the tests and to fully understand and appropriately use results that are reported to them. Similarly, as test users, it is the responsibility of educators to pursue and attain assessment literacy as it pertains to their roles in the education system.

Test sponsors and test developers can promote educator assessment literacy in a variety of ways, including workshops, development of written materials and media, and collaboration with educators in the test development process (e.g., development of content standards, item writing and review, and standard setting). In particular, those responsible for educational testing programs should incorporate assessment literacy into the ongoing professional development of educators. In addition, regular attempts should be made to educate other major stakeholders in the educational process, including parents, students, and policy makers.

Administration, Scoring, and Reporting of Educational Assessments

Administration of Educational Tests

Most educational tests involve standardized procedures for administration. These include directions to test administrators and examinees, specifications for testing conditions, and scoring procedures. Because educational tests typically are administered by school personnel, it is important for the sponsoring agency to provide appropriate oversight to the process and for schools to assign local roles and responsibilities (e.g., testing coordination) for training those who will administer the test. Similarly, test developers have an obligation to support the test administration process and to provide resources to help solve problems when they arise. For example, with high-stakes tests administered by computer, effective technical support to the local administration is critical and should involve personnel who understand the context of the testing program as well as the technical aspects of the delivery system.

Those responsible for educational testing programs should have formal procedures for granting testing accommodations and involve qualified personnel in the associated decision-making process. For students with disabilities, changes in both instruction and assessment are typically specified in an individualized education program (IEP). For English language learners, schools may use guidance from the state or district to match students’ language proficiency and instructional experience with appropriate language accommodations. Test accommodations should be chosen by qualified personnel on the basis of the individual student’s needs. It is particularly important in large-scale assessment programs to establish clear policies and procedures for assigning and using accommodations. These steps help to maintain the comparability of scores for students testing with accommodations on academic assessments across districts and schools. Once selected, accommodations should be used consistently for both instruction and assessment, and test administrators should be fully familiar with procedures for accommodated testing. Additional information related to test administration accommodations is provided in chapter 3 (“Fairness in Testing”).

Weighted and Composite Scoring

Scoring educational tests and assessments requires developing rules for combining scores on items and/or tasks to obtain a total score and, in some cases, for combining multiple scores into an overall composite. Scores from multiple tests are sometimes combined into linear composites using nominal weights, which are assigned to each component score in accordance with a logical judgment of its relative importance. Nominal weights may sometimes be misleading because the variance of the composite is also determined by the variances and covariances of the individual component scores. As a result, the “effective weight” of each component may not reflect the nominal weighting. When composite scores are used, differences between nominal and effective weights should be understood and documented.
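
A minimal sketch of the distinction described above, assuming component scores for a group of examinees are available. One common convention treats a component’s effective weight as its share of the composite variance (the nominal weight times the component’s covariance with the composite); other conventions exist, and the test names and score distributions here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical component scores: a low-variability test A and a high-variability test B.
test_a = rng.normal(50, 5, size=1000)    # SD about 5
test_b = rng.normal(50, 15, size=1000)   # SD about 15

nominal = np.array([0.5, 0.5])           # intended equal weighting
scores = np.vstack([test_a, test_b])
composite = nominal @ scores

# Effective weight of each component: nominal weight times covariance with the composite,
# expressed as a proportion of the composite variance.
contributions = nominal * np.array([np.cov(s, composite)[0, 1] for s in scores])
effective = contributions / composite.var(ddof=1)
print(effective)  # roughly [0.10, 0.90]: test B dominates despite equal nominal weights
```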

For a single test, total scores are often based on a simple sum of the item and task scores. However, differential weighting schemes may be applied to reflect differential emphasis on specific content or constructs. For example, in an English language arts test, more weight may be assigned to an extended essay because of the importance of the task and because it is not feasible to include more than one extended writing task in the test. In addition, scoring based on item response theory (IRT) models can result in item weights that differ from nominal or desired weights. Such applications of IRT should include consideration and explanation of item weights in scoring. In general, the scoring rules used for educational tests should be documented and include a validity-based rationale.

In addition, test developers should discuss with policy makers the various methods of combining the results from different educational tests used to make decisions about students, and should clearly document and communicate the methods, also known as decision rules. For example, as part of graduation requirements, a state may require a student to achieve established levels of performance on multiple tests measuring different content areas using either a noncompensatory or a compensatory decision rule. Under a noncompensatory decision rule, the student has to achieve a determined level of performance on each test; under a compensatory decision rule, the student may only have to achieve a certain total composite score based on a combination of scores across tests. For a high-stakes decision, such as one related to graduation, the rules used to combine scores across tests should be established with a clear understanding of the associated implications. In these situations, important consequences such as passing rates and classification error rates will differ depending on the rules for combining test results. Test developers should document and communicate these implications to policy makers to encourage policy decisions that are fully informed.
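
The two decision rules contrasted above can be made concrete with a brief sketch. The cut scores, composite requirement, and student scores below are hypothetical; an actual program would document the rationale for whichever rule is adopted and its consequences for passing rates and classification errors.

```python
# Hypothetical graduation requirement across three content-area tests.
cut_scores = {"math": 60, "reading": 60, "science": 60}
composite_cut = 190  # required total under the compensatory rule

def noncompensatory_pass(scores):
    # Student must meet or exceed the cut score on every test.
    return all(scores[test] >= cut for test, cut in cut_scores.items())

def compensatory_pass(scores):
    # A strong score on one test can offset a weaker score on another.
    return sum(scores.values()) >= composite_cut

student = {"math": 55, "reading": 72, "science": 68}
print(noncompensatory_pass(student))  # False: math is below its cut score
print(compensatory_pass(student))     # True: total of 195 meets the composite requirement
```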

Reporting Scores

Score reports for educational assessments should support the interpretations and decisions of their intended audiences, which include students, teachers, parents, principals, policy makers, and other educators. Different reports may be developed and produced for different audiences, and the score report layouts may differ accordingly. For example, reports prepared for individual students and parents may include background information about the purpose of the assessment, definitions of performance categories, and more user-friendly representations of measurement error (e.g., error bands around graphical score displays). Those who develop such reports should strive to provide information that can help students make productive decisions about their own learning. In contrast, reports prepared for principals and district-level personnel may include more detailed summaries but less foundational information because these individuals typically have a much better understanding of assessments.

As discussed in chapter 3, when modifications have been made to a test for some test takers that affect the construct being measured, consideration may be given to reporting that a modification was made, because it affects the reliability/precision of test scores or the validity of interpretations drawn from test scores. Conversely, when accommodations are made that do not affect the comparability of test scores, flagging those accommodations is not appropriate.

In general, score reports for educational tests should be designed to provide information that is understandable and useful to stakeholders without leading to unwarranted score interpretations. Test developers can significantly improve the design of score reports by conducting supporting research. For example, surveys of available reports for other educational tests can provide ideas for effectively displaying test results. In addition, usability research with consumers of score reports can provide insights into report design. A number of techniques can be used in this type of research, including focus groups, surveys, and analyses of verbal protocols. For example, the advantages and disadvantages of alternate prototype designs can be compared by gathering data about the interpretations and inferences made by users based on the data presented in each report.

Online reporting capabilities give users flexible access to test results. For example, the user can select options online to break down the results by content or subgroup. The options provided to test users for querying the results should support the test’s intended uses and interpretations. For example, online systems may discourage or disallow viewing of results, in some cases as required by law, if the sample sizes of particular subgroups fall below an acceptable number. In addition, care should be taken to allow access only to the appropriate individuals. As with score reports, the validity of interpretations from online supporting systems can be enhanced through usability research involving the intended score users.
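
As a small illustration of the kind of suppression rule mentioned above, the sketch below withholds subgroup results when the group size falls below a minimum reporting threshold. The threshold of 10 and the data are hypothetical; actual minimum n-sizes are set by policy or statute.

```python
MIN_REPORTING_N = 10  # hypothetical minimum subgroup size for public reporting

def reportable_results(subgroup_results):
    """Return subgroup averages only for groups meeting the minimum n; suppress the rest."""
    return {
        group: (round(sum(scores) / len(scores), 1) if len(scores) >= MIN_REPORTING_N else "suppressed")
        for group, scores in subgroup_results.items()
    }

# Hypothetical subgroup score lists for one school.
results = {
    "all students": [72, 85, 64, 91, 78, 69, 88, 75, 83, 70, 66, 94],
    "English learners": [61, 74, 58],  # fewer than 10 students, so the average is not reported
}
print(reportable_results(results))
```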

Technology also facilitates close alignment of instructional materials with the results of educational tests. For example, results reported for an individual student could include not only strengths and weaknesses but direct links to specific instructional materials that a teacher may use with the student in the future. Rationales and documentation supporting the efficacy of the recommended interventions should be provided, and users should be encouraged to consider such information in conjunction with other evidence and judgments about student instructional needs.

When results are reported for large-scale assessments, the test sponsors or users should prepare accompanying guidance to promote sound use and valid interpretations of the data by the media and other stakeholders in the assessment process. Such communications should address likely testing consequences (both positive and negative), as well as anticipated misuses of the results.

STANDARDS FOR EDUCATIONAL TESTING AND ASSESSMENT

The standards in this chapter have been separated into three thematic clusters labeled as follows:

1. Design and Development of Educational Assessments

2. Use and Interpretation of Educational Assessments

3. Administration, Scoring, and Reporting of Educational Assessments

Users of educational tests for evaluation, policy, or accountability should also refer to the standards in chapter 13 (“Uses of Tests for Program Evaluation, Policy Studies, and Accountability”).

Cluster 1. Design and Development of Educational Assessments

Standard 12.1

When educational testing programs are mandated by school, district, state, or other authorities, the ways in which test results are intended to be used should be clearly described by those who mandate the tests. It is also the responsibility of those who mandate the use of tests to monitor their impact and to identify and minimize potential negative consequences as feasible. Consequences resulting from the uses of the test, both intended and unintended, should also be examined by the test developer and/or user.

Comment: Mandated testing programs are often justified in terms of their potential benefits for teaching and learning. Concerns have been raised about the potential negative impact of mandated testing programs, particularly when they directly result in important decisions for individuals or institutions. There is concern that some schools are narrowing their curriculum to focus exclusively on the objectives tested, encouraging instructional or administrative practices designed simply to raise test scores rather than improve the quality of education, and losing higher numbers of students because many drop out after failing tests. The need to monitor the impact of educational testing programs relates directly to fairness in testing, which requires ensuring that scores on a given test reflect the same construct and have essentially the same meaning for all individuals in the intended test-taker population. Consistent with appropriate testing objectives, potential negative consequences should be monitored and, when identified, should be addressed to the extent possible. Depending on the intended use, the person responsible for examining the consequences could be the mandating authority, the test developer, or the user.

Standard 12.2

In educational settings, when a test is designed or used to serve multiple purposes, evidence of validity, reliability/precision, and fairness should be provided for each intended use.

Comment: In educational testing, it has become common practice to use the same test for multiple purposes. For example, interim/benchmark tests may be used for a variety of purposes, including diagnosing student strengths and weaknesses, monitoring individual student growth, providing information to assist in instructional planning for individuals or groups of students, and evaluating schools or districts. No test will serve all purposes equally well. Choices in test design and development that enhance validity for one purpose may diminish validity for other purposes. Different purposes may require different kinds of technical evidence, and appropriate evidence of validity, reliability/precision, and fairness for each purpose should be provided by the test developer. If the test user wishes to use the test for a purpose not supported by the available evidence, it is incumbent on the user to provide the necessary additional evidence. See chapter 1 (“Validity”).

Standard 12.3

Those responsible for the development and use of educational assessments should design all relevant steps of the testing process to promote access to the construct for all individuals and subgroups for whom the assessment is intended.

Comment: It is important in educational contexts to provide for all students, regardless of their individual characteristics, the opportunity to demonstrate their proficiency on the construct being measured. Test specifications should clearly specify all relevant subgroups in the target population, including those for whom the test may not allow demonstration of knowledge and skills. Items and tasks should be designed to maximize access to the test content for all individuals in the intended test-taker population. Tools and strategies should be implemented to familiarize all test takers with the technology and testing format used, and the administration and scoring approach should avoid introducing any construct-irrelevant variance into the testing process. In situations where individual characteristics such as English language proficiency, cultural or linguistic background, disability, or age are believed to interfere with access to the construct(s) that the test is intended to measure, appropriate adaptations should be provided to allow access to the content, context, and response formats of the test items. These may include both accommodations (changes that are assumed to preserve the construct being measured) and modifications (changes that are assumed to make an altered version of the construct accessible). Additional considerations related to fairness and accessibility in educational tests and assessments are provided in chapter 3 (“Fairness in Testing”).

Standard 12.4

When a test is used as an indicator of achievement in an instructional domain or with respect to specified content standards, evidence of the extent to which the test samples the range of knowledge and elicits the processes reflected in the target domain should be provided. Both the tested and the target domains should be described in sufficient detail for their relationship to be evaluated. The analyses should make explicit those aspects of the target domain that the test represents, as well as those aspects that the test fails to represent.

Comment: Tests are commonly developed to monitor the status or progress of individuals and groups with respect to local, state, national, or professional content standards. Rarely can a single test cover the full range of performances reflected in the content standards. In developing a new test or selecting an existing test, appropriate interpretation of test scores as indicators of performance on these standards requires documenting and evaluating both the relevance of the test to the standards and the extent to which the test is aligned to the standards. Such alignment studies should address multiple criteria, including not only alignment of the test with the content areas covered by the standards but also alignment with the standards in terms of the range and complexity of knowledge and skills that students are expected to demonstrate. Further, conducting studies of the cognitive strategies and skills employed by test takers, or studies of the relationships between test scores and other performance indicators relevant to the broader target domain, enables evaluation of the extent to which generalizations to that domain are supported. This information should be made available to all who use the test or interpret the test scores.

Standard 12.5

Local norms should be developed when appropriate to support test users’ intended interpretations.

Comment: Comparison of examinees’ scores to local as well as more broadly representative norm groups can be informative. Thus, sample size permitting, local norms are often useful in conjunction with published norms, especially if the local population differs markedly from the population on which published norms are based. In some cases, local norms may be used exclusively.

Standard 12.6

Documentation of design, models, and scoring algorithms should be provided for tests administered and scored using multimedia or computers.

Comment: Computer and multimedia tests need to be held to the same requirements of technical quality as other tests. For example, the use of technology-enhanced item formats should be supported with evidence that the formats are a feasible way to collect information about the construct, that they do not introduce construct-irrelevant variance, and that steps have been taken to promote accessibility for all students.

Cluster 2. Use and Interpretation of Educational Assessments

Standard 12.7

In educational settings, test users should take steps to prevent test preparation activities and distribution of materials to students that may adversely affect the validity of test score inferences.

Comment: In most educational testing contexts, the goal is to use a sample of test items to make inferences to a broader domain. When inappropriate test preparation activities occur, such as excessive teaching of items that are equivalent to those on the test, the validity of test score inferences is adversely affected. The appropriateness of test preparation activities and materials can be evaluated, for example, by determining the extent to which they reflect the specific test items and by considering the extent to which test scores may be artificially raised as a result, without increasing students’ level of genuine achievement.

Standard 12.8

When test results contribute substantially to decisions about student promotion or graduation, evidence should be provided that students have had an opportunity to learn the content and skills measured by the test.

Comment: Students, parents, and educational staff should be informed of the domains on which the students will be tested, the nature of the item types, and the criteria for determining mastery. Reasonable efforts should be made to document the provision of instruction on the tested content and skills, even though it may not be possible or feasible to determine the specific content of instruction for every student. In addition and as appropriate, evidence should also be provided that students have had the opportunity to become familiar with the mode of administration and item formats used in testing.

Standard 12.9

Students who must demonstrate mastery of certain skills or knowledge before being promoted or granted a diploma should have a reasonable number of opportunities to succeed on alternate forms of the test or be provided with technically sound alternatives to demonstrate mastery of the same skills or knowledge. In most circumstances, when students are provided with multiple opportunities to demonstrate mastery, the time interval between the opportunities should allow students to obtain the relevant instructional experiences.

Comment: The number of testing opportunities and the time between opportunities will vary with the specific circumstances of the setting. Further, policy may dictate that some students should be given opportunities to demonstrate their achievement using a different approach. For example, some states that administer high school graduation tests permit students who have participated in the regular curriculum but are unable to demonstrate the required performance level on one or more of the tests to show, through a structured portfolio of their coursework and other indicators (e.g., participation in approved assistance programs, satisfaction of other graduation requirements), that they have the knowledge and skills necessary to obtain a high school diploma. If another assessment approach is used, it should be held to the same standards of technical quality as the primary assessment. In particular, evidence should be provided that the alternative approach measures the same skills and has the same passing expectations as the primary assessment.

Standard 12.10

In educational settings, a decision or characterization that will have major impact on a student should take into consideration not just scores from a single test but other relevant information.

Comment: In general, multiple measures or data sources will often enhance the appropriateness of decisions about students in educational settings and therefore should be considered by test sponsors and test users in establishing decision rules and policy. It is important that in addition to scores on a single test, other relevant information (e.g., school coursework, classroom observation, parental reports, other test scores) be taken into account when warranted. These additional data sources should provide information relevant to the intended construct. For example, it may not be advisable or lawful to automatically accept students into a gifted program if their IQ is measured to be above 130 without considering additional relevant information about their performance. Similarly, some students with measured IQs below 130 may be accepted based on other measures or data sources, such as a test of creativity, a portfolio of student work, or teacher recommendations. In these cases, other evidence of gifted performance serves to compensate for the lower IQ test score.

Standard 12.11

When difference or growth scores are used for individual students, such scores should be clearly defined, and evidence of their validity, reliability/precision, and fairness should be reported.

Comment: The standard error of the difference between scores on the pretest and posttest, the regression of posttest scores on pretest scores, or relevant data from other appropriate methods for examining change should be reported.

In cases where growth scores are predicted for individual students, results based on different versions of tests taken over time may be used. For example, math scores in Grades 3, 4, and 5 may be used to predict the expected math score in Grade 6. In such cases, if complex statistical models are used to predict scores for individual students, the method for constructing the models should be made explicit and should be justified, and supporting technical and interpretive information should be provided to the score users. Chapter 13 (“Uses of Tests for Program Evaluation, Policy Studies, and Accountability”) addresses the application of more complex models to groups or systems within accountability settings.
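
As a purely illustrative sketch (not a model prescribed by this standard), the Python fragment below fits an ordinary least-squares regression predicting a Grade 6 mathematics score from Grade 3, 4, and 5 scores; the scores shown are invented, and operational growth models are typically far more elaborate and must be documented as described above.

import numpy as np

# Hypothetical scale scores for five students in Grades 3, 4, and 5 (predictors)
# and Grade 6 (outcome); all values are invented for illustration.
X = np.array([[420.0, 455.0, 480.0],
              [390.0, 410.0, 445.0],
              [460.0, 490.0, 510.0],
              [405.0, 430.0, 470.0],
              [435.0, 465.0, 495.0]])
y = np.array([505.0, 470.0, 535.0, 490.0, 520.0])

# Add an intercept column and estimate regression coefficients by least squares.
X1 = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)

# Predicted Grade 6 score for a new student with Grade 3-5 scores of 410, 440, and 475.
new_student = np.array([1.0, 410.0, 440.0, 475.0])
print(float(new_student @ coef))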

Standard 12.12

When an individual student’s scores from different tests are compared, any educational decision based on the comparison should take into account the extent of overlap between the two constructs and the reliability or standard error of the difference score.

Comment: When difference scores between two tests are used to aid in making educational decisions, it is important that the two tests be placed on a common scale, either by standardization or by some other means, and, if appropriate, normed on the same population at about the same time. In addition, the reliability and standard error of the difference scores between the two tests are affected by the relationship between the constructs measured by the tests as well as by the standard errors of measurement of the scores of the two tests. For example, when scores on a nonverbal ability measure are compared with achievement test scores, the overlapping nature of the two constructs may render the reliability of the difference scores lower than test users normally would expect. If the ability and/or achievement tests involve a significant amount of measurement error, this will also reduce the confidence that can be placed in the difference scores. All these factors affect the reliability of difference scores between tests and should be considered when such scores are used as a basis for making important decisions about a student. This standard is also relevant in comparisons of subscores or scores from different components of the same test, such as may be reported for multiple aptitude test batteries, educational tests, and/or selection tests.
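
A classical psychometric result makes this point concrete. For standardized scores with equal variances, the reliability of the difference between scores on tests $X$ and $Y$ can be written as

$$r_{DD} = \frac{\tfrac{1}{2}\left(r_{XX} + r_{YY}\right) - r_{XY}}{1 - r_{XY}},$$

so that, for example, two tests each with reliability .90 that correlate .80 yield difference scores with reliability of only $(.90 - .80)/(1 - .80) = .50$. The formula is stated here only to illustrate the comment; it assumes the simplified equal-variance case rather than any particular operational scaling.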

Standard 12.13

When test scores are intended to be used as part of the process for making decisions about educational placement, promotion, implementation of individualized educational programs, or provision of services for English language learners, then empirical evidence documenting the relationship among particular test scores, the instructional programs, and desired student outcomes should be provided. When adequate empirical evidence is not available, users should be cautioned to weigh the test results accordingly in light of other relevant information about the students.

Comment: The use of test scores for placement or promotion decisions should be supported by evidence about the relationship between the test scores and the expected benefits of the resulting educational programs. Thus, empirical evidence should be gathered to support the use of a test by a community college to place entering students in different mathematics courses. Similarly, in special education, when test scores are used in the development of specific educational objectives and instructional strategies, evidence is needed to show that the prescribed instruction is (a) directly linked to the test scores, and (b) likely to enhance student learning. When there is limited evidence about the relationship among test results, instructional plans, and student achievement outcomes, test developers and users should stress the tentative nature of the test-based recommendations and encourage teachers and other decision makers to weigh the usefulness of the test scores in light of other relevant information about the students.

Standard 12.14

In educational settings, those who supervise others in test selection, administration, and score interpretation should be familiar with the evidence for the reliability/precision, the validity of the intended interpretations, and the fairness of the scores. They should be able to articulate and effectively train others to articulate a logical explanation of the relationships among the tests used, the purposes served by the tests, and the interpretations of the test scores for the intended uses.

Comment: Appropriate interpretations of scores on educational tests depend on the effective training of individuals who carry out test administration and on the appropriate education of those who make use of test results. Establishing ongoing professional development programs that include a focus on improving the assessment literacy of teachers and stakeholders is one mechanism by which those who are responsible for test use in educational settings can facilitate the validity of test score interpretations. Establishing educational requirements (e.g., an advanced degree, relevant coursework, or attendance at workshops provided by the test developer or test sponsor) is another strategy that might be used to provide documentation of qualifications and expertise.

Standard 12.15

Those responsible for educational testing programs should take appropriate steps to verify that the individuals who interpret the test results to make decisions within the school context are qualified to do so or are assisted by and consult with persons who are so qualified.

Comment: When testing programs are used as a strategy for guiding instruction, the school personnel who are expected to make inferences about instructional planning may need assistance in interpreting test results for this purpose. Such assistance may consist of ongoing professional development, interpretive guides, training, information sessions, and the availability of experts to answer questions that arise as test results are disseminated.

The interpretation of some test scores is sufficiently complex to require that the user have relevant training and experience or be assisted by and consult with persons who have such training and experience. Examples of such tests include individually administered intelligence tests, interest inventories, growth scores on state assessments, projective tests, and neuropsychological tests.

Cluster 3. Administration, Scoring, and Reporting of Educational Assessments

Standard 12.16

Those responsible for educational testing programs should provide appropriate training, documentation, and oversight so that the individuals who administer and score the test(s) are proficient in the appropriate test administration and scoring procedures and understand the importance of adhering to the directions provided by the test developer.

Comment: In addition to being familiar with standardized test administration documentation and procedures (including test security protocols), it is important for test coordinators and test administrators to be familiar with materials and procedures for accommodations and modifications for testing. Test developers should therefore provide appropriate manuals and training materials that specifically address accommodated administrations. Test coordinators and test administrators should also receive information about the characteristics of the student populations included in the testing program.

Standard 12.17

In educational settings, reports of group differences in test scores should be accompanied by relevant contextual information, where possible, to enable meaningful interpretation of the differences. Where appropriate contextual information is not available, users should be cautioned against misinterpretation.

Comment: Differences in test scores between relevant subgroups (e.g., classified by gender, race/ethnicity, school/district, or geographical region) can be influenced, for example, by differences in student characteristics, in course-taking patterns, in curriculum, in teachers’ qualifications, or in parental educational levels. Differences in performance of cohorts of students across time may be influenced by changes in the population of students tested or changes in learning opportunities for students. Users should be advised to consider the appropriate contextual information and be cautioned against misinterpretation.

Standard 12.18

In educational settings, score reports should be accompanied by a clear presentation of information on how to interpret the scores, including the degree of measurement error associated with each score or classification level, and by supplementary information related to group summary scores. In addition, dates of test administration and relevant norming studies should be included in score reports.

Comment: Score information should be communicated in a way that is accessible to persons receiving the score report. Empirical research involving score report users can help to improve the clarity of reports. For instance, the degree of uncertainty in the scores might be represented by presenting standard errors of measurement graphically; or the probability of misclassification associated with performance levels might be provided. Similarly, when average or summary scores for groups of students are reported, they should be supplemented with additional information about the sample sizes and the shapes or dispersions of score distributions. Particular care should be taken to portray subscore information in score reports in ways that facilitate proper interpretation. Score reports should include the date of administration so that score users can consider the validity of inferences as time passes. Score reports should also include the dates of relevant norming studies so users can consider the age of the norms in making inferences about student performance.
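
As a simple illustration (not a prescribed reporting format), a graphical error band can be constructed directly from the standard error of measurement,

$$\hat{x} \pm z \cdot SEM, \qquad \text{e.g.,} \quad 520 \pm 1.96 \times 15 \approx [491,\ 549],$$

so a reported scale score of 520 with an SEM of 15 points would be displayed with an approximate 95% band running from about 491 to 549; the score and SEM values here are invented for the example.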

Standard 12.19

In educational settings, when score reports include recommendations for instructional intervention or are linked to recommended plans or materials for instruction, a rationale for and evidence to support these recommendations should be provided.

Comment: Technology is making it increasingly possible to assign particular instructional interventions to students based on assessment results. Specific digital content (e.g., worksheets or lessons) may be made available to students using a rules-based interpretation of their performance on a standards-based test. In such instances, documentation supporting the appropriateness of instructional assignments should be provided. Similarly, when the pattern of subscores on a test is used to assign students to particular instructional interventions, it is important to provide both a rationale and empirical evidence to support the claim that these assignments are appropriate. In addition, users should be advised to consider such pedagogical recommendations in conjunction with other relevant information about students’ strengths and weaknesses.

13. USES OF TESTS FOR PROGRAM EVALUATION, POLICY STUDIES, AND ACCOUNTABILITY

BACKGROUND

Tests are widely used to inform decisions as part of public policy. One example is the use of tests in the context of the design and evaluation of programs or policy initiatives. Program evaluation is the set of procedures used to make judgments about a program’s design, its implementation, and its outcomes. Policy studies are somewhat broader than program evaluations; they contribute to judgments about plans, principles, or procedures enacted to achieve broad public goals. Tests often provide the data that are analyzed to estimate the effect of a policy, program, or initiative on outcomes such as student achievement or motivation. A second broad category of test use in policy settings is in accountability systems, which attach consequences (e.g., rewards and sanctions) to the performance of institutions (such as schools or school districts) or individuals (such as teachers or mental health care providers). Program evaluations, policy studies, and accountability systems should not necessarily be viewed as discrete categories. They are frequently adopted in combination with one another, as is the case when accountability systems impose requirements or recommendations to use test results for evaluating programs adopted by schools or districts.

The uses of tests for program evaluations, policy studies, and accountability share several characteristics, including measurement of the performance of a group of people and use of test scores as evidence of the success or shortcomings of an institution or initiative. This chapter examines these uses of tests. The accountability discussion focuses on systems that involve aggregates of scores, such as school-wide or institution-wide averages, percentages of students or patients scoring above a certain level, or growth or value-added modeling results aggregated at the classroom, school, or institution level. Systems or programs that focus on accountability for individual students, such as through test-based promotion policies or graduation exams, are addressed in chapter 12. (However, many of the issues raised in that chapter are relevant to the use of educational tests for program evaluation or school accountability purposes.) If accountability systems or programs include tests administered to teachers, principals, or other providers for purposes of evaluating their practice or performance (e.g., for teacher pay-for-performance programs that include a test of teacher knowledge or an observation-based measure of their practices), those tests should be evaluated according to the standards related to workplace testing and credentialing in chapter 11.

The contexts in which testing for evaluation and accountability takes place vary in the stakes for test takers and for those who are responsible for promoting specific outcomes (such as teachers or health care providers). Testing programs for institutions can have high stakes when the aggregate performance of a sample or of the entire population of test takers is used to make inferences about the quality of services provided and, as a result, decisions are made about institutional status, rewards, or sanctions. For example, the quality of reading curriculum and instruction may be judged in part on the basis of results of testing for levels of attainment reached by groups of students. Similarly, aggregated scores on psychological tests are sometimes used to evaluate the effectiveness of treatment provided by mental health programs or agencies and may be included in accountability systems. Even when test results are reported in the aggregate and intended for low-stakes purposes, the public release of data may be used to inform judgments about program quality, personnel, or educational programs and may influence policy decisions.

Evaluation of Programs and Policy Initiatives

As noted earlier, program evaluation typically involves making judgments about a single program, whereas policy studies address plans, principles, or procedures enacted to achieve broad public goals. Policy studies may address policies at various levels of government, including local, state, federal, and international, and may be conducted in both public and private organizational or institutional contexts. There is no sharp distinction between policy studies and program evaluations, and in many instances there is substantial overlap between the two types of investigations. Test results are often one important source of evidence for the initiation, continuation, modification, termination, or expansion of various programs and policies.

Tests may be used in program evaluations or policy studies to provide information on the status of clients, students, or other groups before, during, or after an intervention or policy enactment, as well as to provide score information for appropriate comparison groups. Whereas many testing activities are intended to document the performance of individual test takers, program evaluation and policy studies target the performance of groups or the impact of the test results on these groups. A variety of tests can be used for evaluating programs and policies; examples include standardized achievement tests administered by states or districts, published psychological tests that measure outcomes of interest, and measures developed specifically for the purposes of the evaluation. In addition, evaluations of programs and policies sometimes synthesize results from multiple studies or tests.

It is important to evaluate any proposed test in terms of its relevance to the goals of the program or policy and/or to the particular questions its use will address. It is relatively rare for a test to be designed specifically for program evaluation or policy study purposes, and therefore it is often necessary for those who conduct such studies to rely on measures developed for other purposes. In addition, for reasons of cost or convenience, certain tests may be adopted for use in a program evaluation or policy study even though they were developed for a somewhat different population of respondents. Some tests may be selected because they are well known and thought to be especially credible in the view of clients or public consumers, or because useful data already exist from earlier administrations of the tests. Evidence for the validity of test scores for the intended uses should be provided whenever tests are used for program or policy evaluations or for accountability purposes.

Because of administrative realities, such as cost constraints and response burden, methodological refinements may be adopted to increase the efficiency of testing. One strategy is to obtain a sample of participants to be evaluated from the larger set of those exposed to a program or policy. When a sufficient number of clients are affected by the program or policy that will be evaluated, and when there is a desire to limit the time spent on testing, evaluators can create multiple forms of short tests from a larger pool of items. By constructing a number of test forms consisting of relatively few items each and assigning the test forms to different subsamples of test takers (a procedure known as matrix sampling), a larger number of items can be included in the study than could reasonably be administered to any single test taker. When it is desirable to represent a domain with a large number of test items, this approach is often used. However, in matrix sample testing, individual scores usually are not created or interpreted. Because procedures for sampling individuals or test items may vary in a number of ways, adequate analysis and interpretation of test results depend on a clear description of how samples were formed and how the tests were designed, scored, and reported. Reports of test results used for evaluation or accountability should describe the sampling strategy and the extent to which the sample is representative of the population that is relevant to the intended inferences.
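
The following Python sketch is offered only to illustrate the basic bookkeeping of matrix sampling; the pool size, form length, and spiraled assignment are hypothetical choices, and operational designs involve many additional constraints (content balancing, linking items, and so on).

import random

random.seed(2014)  # fixed seed so the illustration is reproducible

ITEM_POOL = ["item_%03d" % i for i in range(1, 61)]  # a hypothetical 60-item domain pool
ITEMS_PER_FORM = 12                                  # each examinee responds to only 12 items

def build_forms(pool, items_per_form):
    # Shuffle the pool and split it into non-overlapping short forms.
    shuffled = random.sample(pool, len(pool))
    return [shuffled[i:i + items_per_form]
            for i in range(0, len(shuffled), items_per_form)]

def assign_forms(examinee_ids, forms):
    # Spiral the forms across examinees so each form is taken by a comparable subsample.
    return {examinee: forms[i % len(forms)] for i, examinee in enumerate(examinee_ids)}

forms = build_forms(ITEM_POOL, ITEMS_PER_FORM)  # yields 5 forms of 12 items each
assignment = assign_forms(["student_%d" % j for j in range(1, 101)], forms)
print(len(forms), len(assignment["student_1"]))  # 5 forms; 12 items seen by each student

Because each examinee responds to only a fraction of the item pool, domain-level summaries can be aggregated across forms, but, as noted above, individual scores generally are not created or interpreted.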

Evaluations and policy studies sometimes rely on secondary data analysis: analysis of data previously collected for other purposes. In some circumstances, it may be difficult to ensure a good match between the existing test and the intervention or policy under examination, or to reconstruct in detail the conditions under which the data were originally collected. Secondary data analysis also requires consideration of the privacy rights of test takers and others affected by the analysis. Sometimes this requires determining whether the informed consent obtained from participants in the original data collection was adequate to allow secondary analysis to proceed without a need for additional consent. It may also require an understanding of the extent to which individually identifiable information has been redacted from the data set consistent with applicable legal standards. In selecting (or developing) a test or deciding whether to use existing data in evaluation and policy studies, careful investigators attempt to balance the purpose of the test, the likelihood that it will be sensitive to the intervention under study, its credibility to interested parties, and the costs of administration. Otherwise, test results may lead to inappropriate conclusions about the progress, impact, and overall value of programs and policies under review.

Interpretation of test scores in program evaluation and policy studies usually entails complex analysis of a number of variables. For example, some programs are mandated for a broad population; others target only certain subgroups. Some are designed to affect attitudes, beliefs, or values; others are intended to have a more direct impact on behavior, knowledge, or skills. It is important that the participants included in any study meet the specified criteria for participating in the program or policy under review, so that appropriate interpretation of test results will be possible. Test results will reflect not only the effects of rules for participant selection and the impact on the participants of taking part in programs or treatments, but also the characteristics of the participants. Relevant background information about clients or students may be obtained to strengthen the inferences derived from the test results. Valid interpretations may depend on additional considerations that have nothing to do with the appropriateness of the test or its technical quality, including study design, administrative feasibility, and the quality of other available data. This chapter focuses on testing and does not deal with these other considerations in any substantial way. In order to develop defensible conclusions, however, investigators conducting program evaluations and policy studies should supplement test results with data from other sources. These data may include information about program characteristics, delivery, costs, client backgrounds, degree of participation, and evidence of side effects. Because test results lend important weight to evaluation and policy studies, it is critical that any tests used in these investigations be sensitive to the questions of the study and appropriate for the test takers.

Test-Based Accountability Systems

The inclusion of test scores in educational accountability systems has become common in the United States and in other nations. Most test-based educational accountability in the United States takes place at the K–12 level, but many of the issues raised in the K–12 context are relevant to efforts to adopt outcomes-based accountability in postsecondary education. In addition, accountability systems may incorporate information from longitudinal data systems linking students’ performance on tests and other indicators, including systems that capture a cohort’s performance from preschool through higher education and into the workforce. Test-based accountability sometimes occurs in sectors other than education; one example is the use of psychological tests to create measures of effectiveness for providers of mental health care. These uses of tests raise issues similar to those that arise in educational contexts.

Test-based accountability systems take a variety of approaches to measuring performance and holding individuals or groups accountable for that performance. These systems vary along a number of dimensions, including the unit of accountability (e.g., district, school, teacher), the stakes attached to results, the frequency of measurement, and whether nontest indicators are included in the accountability system. One important measurement concern in accountability stems from the construction of an accountability index: a number or label that reflects a set of rules for combining scores and other information to arrive at conclusions and inform decision making. An accountability index could be as simple as an average test score for students in a particular grade in a particular school, but most systems rely on more complex indices. These may involve a set of rules (often called decision rules) for synthesizing multiple sources of information, such as test scores, graduation rates, course-taking rates, and teacher qualifications. An accountability index may also be created from applications of complex statistical models such as those used in value-added modeling approaches. As discussed in chapter 12, for high-stakes decisions, such as classification of schools or teachers into performance categories that are linked to rewards or sanctions, the establishment of rules used to create accountability indices should be informed by a consideration of the nature of the information the system is intended to provide and by an understanding of how consequences will be affected by these rules. The implications of the rules should be communicated to decision makers so that they understand the consequences of any policy decisions based on the accountability index.

Test-based accountability systems include interpretations and assumptions that go beyond those for the interpretation of the test scores on which they are based; therefore, they require additional evidence to support their validity. Accountability systems in education typically aggregate scores over the students in a class or school, and may use complex mathematical models to generate a summary statistic, or index, for each teacher or school. These indices are often interpreted as estimates of the effectiveness of the teacher or school. Users of information from accountability systems might assume that the accountability indices provide valid indicators of the intended outcomes of education (e.g., mastery of the skills and knowledge described in the state content standards), that differences among indices can be attributed to differences in the effectiveness of the teacher or school, and that these differences are reasonably stable over time and across students and items. These assumptions must be supported by evidence. Moreover, those responsible for developing or implementing test-based accountability systems often assert that these systems will lead to specific outcomes, such as increased educator motivation or improved achievement; these assertions should also be supported by evidence. In particular, efforts should be made to investigate any potential positive or negative consequences of the selected accountability system.

Similarly, the choice of specific rules and data that are used to create an accountability index should reflect the goals and values of those who are developing the accountability system, as well as the inferences that the system is designed to support. For example, if a primary goal of an accountability system is to identify teachers who are effective at improving student achievement, the accountability index should be based on assessments that are closely aligned with the content the teacher is expected to cover, and should take into account factors outside the teacher’s control. The process typically involves decisions such as whether to measure percentages above a cut score or an average of scale scores, whether to measure status or growth, how to combine information for multiple subjects and grade levels, and whether to measure performance against a fixed target or use a rank-based approach. The development of an accountability index also involves political considerations, such as how to balance technical concerns and transparency.
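
As a minimal sketch of the kinds of decision rules described above (the weights, cut score, and data are hypothetical and are not recommendations), a school-level accountability index combining a percent-above-cut achievement measure with a nontest indicator might be computed as follows in Python.

# Hypothetical decision rules for a school-level accountability index.
WEIGHTS = {"pct_proficient": 0.6, "graduation_rate": 0.4}  # illustrative nominal weights
PROFICIENT_CUT = 240                                       # illustrative scale-score cut

def pct_above_cut(scores, cut=PROFICIENT_CUT):
    # Status measure: percentage of scores at or above the proficiency cut.
    return 100.0 * sum(score >= cut for score in scores) / len(scores)

def accountability_index(scores, graduation_rate):
    # Combine the status measure with a nontest indicator using fixed weights.
    components = {"pct_proficient": pct_above_cut(scores),
                  "graduation_rate": graduation_rate}
    return sum(WEIGHTS[name] * value for name, value in components.items())

# Example: one school's invented test scores and graduation rate.
print(round(accountability_index([230, 245, 260, 238, 251, 242], 88.0), 1))

Whether such an index reflects status or growth, how the weights and cut score are chosen, and how nontest indicators enter the composite are precisely the design decisions that, as stated above, should be justified and communicated to decision makers.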

Issues in Program and Policy Evaluation and Accountability

Test results are sometimes used as one way to motivate program administrators or other service providers as well as to infer institutional effectiveness. This use of tests, including the public reporting of results, is thought to encourage an institution to improve its services for its clients. For example, in some test-based accountability systems, consistently poor results on achievement tests at the school level may result in interventions that affect the school’s staffing or operations. The interpretation of test results is especially complex when tests are used both as an institutional policy mechanism and as a measure of effectiveness. For example, a policy or program may be based on the assumption that providing clear goals and general specifications of test content (such as the types of topics, constructs, cognitive domains, and response formats included in the test) may be a reasonable strategy to communicate new expectations to educators. Yet the desire to influence test or evaluation results to show acceptable institutional performance could lead to inappropriate testing practices, such as teaching the test items in advance, modifying test administration procedures, discouraging certain students or clients from participating in the testing sessions, or focusing teaching exclusively on test-taking skills. These responses illustrate that the more an indicator is used for decision making, the more likely it is to become corrupted and distort the process that it was intended to measure. Undesirable practices such as excessive emphasis on test-taking skills might replace practices aimed at helping the test takers learn the broader domains measured by the test. Because results derived from such practices may lead to spuriously high estimates of performance, the diligent investigator should estimate the impact of changes in teaching practices that may result from testing in order to interpret the test results appropriately. Looking at possible inappropriate consequences of tests as well as their benefits will result in more accurate assessment of policy claims that particular types of testing programs lead to improved performance.

Investigators conducting policy studies and program evaluations may give no clear reasons to the test takers for participating in the testing procedure, and they often withhold the results from the test takers. When matrix sampling is used for program evaluation, it may not be feasible to provide such reports. If little effort is made to motivate the test takers to regard the test seriously (e.g., if the purpose of the test is not explained), the test takers may have little reason to maximize their effort on the test. The test results thus may misrepresent the impact of a program, institution, or policy. When there is suspicion that a test has not been taken seriously, the motivation of test takers may be explored by collecting additional information where feasible, using observation or interview methods. Issues of inappropriate preparation and unmotivated performance raise questions about the validity of interpretations of test results. In every case, it is important to consider the potential impact on the test taker of the testing process itself, including test administration and reporting practices.

Public policy decisions are rarely based solely on the results of empirical studies, even when the studies are of high quality. The more expansive and indirect the policy, the more likely it is that other considerations will come into play, such as the political and economic impact of abandoning, changing, or retaining the policy, or the reactions of various stakeholders when institutions become the targets of rewards or sanctions. Tests used in policy settings may be subjected to intense and detailed scrutiny for political reasons. When the test results contradict a favored position, attempts may be made to discredit the testing procedure, content, or interpretation. Test users should be able to defend the use of the test and the interpretation of results but should also recognize that they cannot control the reactions of stakeholder groups.

It is essential that all tests used in accountability, program evaluation, or policy contexts meet the standards for validity, reliability, and fairness appropriate to the intended test score interpretations and use. Moreover, as described in chapter 6, tests should be administered by personnel who are appropriately trained to implement the test administration procedures. It is also essential that assistance be provided to those responsible for interpreting study results for practitioners, the lay public, and the media. Careful communication about goals, procedures, findings, and limitations increases the likelihood that the interpretations of the results will be accurate and useful.

Additional Considerations

This chapter and its associated standards are directed to users of tests in program evaluations, policy studies, and accountability systems. Users include those who mandate, design, or implement these evaluations, studies, or systems and those who make decisions based on the information they provide. Users include, among others, psychologists who develop, evaluate, or enforce policies, as well as educators, administrators, and policy makers who are engaged in efforts to measure school performance or evaluate the effectiveness of education policies or programs. In addition to the standards below, users should consider other available documents containing relevant standards.

STANDARDS FOR USES OF TESTS FOR PROGRAM EVALUATION, POLICY STUDIES, AND ACCOUNTABILITY

The standards in this chapter have been separated into two thematic clusters labeled as follows:

1. Design and Development of Testing Programs and Indices for Program Evaluation, Policy Studies, and Accountability Systems

2. Interpretations and Uses of Information From Tests Used in Program Evaluation, Policy Studies, and Accountability Systems

Users of educational tests for evaluation, policy, or accountability should also refer to the standards in chapter 12 (“Educational Testing and Assessment”) and to the other standards in this volume.

Cluster 1. Design and Development of Testing Programs and Indices for Program Evaluation, Policy Studies, and Accountability Systems

Standard 13.1

Users of tests who conduct program evaluations or policy studies, or monitor outcomes, should clearly describe the population that the program or policy is intended to serve and should document the extent to which the sample of test takers is representative of that population. In addition, when matrix sampling procedures are used, rules for sampling items and test takers should be provided, and error calculations must take the sampling scheme into account. When multiple studies are combined as part of a program evaluation or policy study, information about the samples included in each individual study should be provided.

Comment: It is important to provide information about sampling weights that may need to be applied for accurate inferences about performance. When matrix sampling is used, documentation should address the limitations that stem from this sampling approach, such as the difficulty in creating individual-level scores. Test developers should also report appropriate sampling error variance estimates if simple random sampling was not used.

Standard 13.2

When change or gain scores are used, the procedures for constructing the scores, as well as their technical qualities and limitations, should be reported. In addition, the time periods between test administrations should be reported, and care should be taken to avoid practice effects.

Comment: The use of change or gain scores presumes that the same test, equivalent forms of the test, or forms of a vertically scaled test are used and that the test (or form or vertical scale) is not materially altered between administrations. The standard error of the difference between scores on pretests and posttests, the error associated with regression of posttest scores on pretest scores, or relevant data from other methods for examining change, such as those based on structural equation modeling, should be reported. In addition to technical or methodological considerations, details related to test administration may also be relevant to interpreting change or gain scores. For example, it is important to consider that the error associated with change scores is higher than the error associated with the original scores on which they are based. If change scores are used, information about the reliability/precision of these scores should be reported. It is also important to report the time period between administrations of tests; and if the same test is used on multiple occasions, the possibility of practice effects (i.e., improved performance due to familiarity with the test items) should be examined.
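
For instance, under the classical assumption that the errors of measurement on the two occasions are independent, the standard error of a simple gain score is

$$SE_{\text{gain}} = \sqrt{SEM_{\text{pre}}^{2} + SEM_{\text{post}}^{2}},$$

so that, with illustrative SEMs of 12 and 13 scale-score points, the gain score carries an error of roughly $\sqrt{144 + 169} \approx 17.7$ points, larger than the error of either score alone, which is the point made in the comment above.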

Standard 13.3

When accountability indices, indicators of effectiveness in program evaluations or policy studies, or other statistical models (such as value-added models) are used, the method for constructing such indices, indicators, or models should be described and justified, and their technical qualities should be reported.

Comment: An index that is constructed by manipulating and combining test scores should be subjected to the same validity, reliability, and fairness investigations that are expected for the test scores that underlie the index. The methods and rules for constructing such indices should be made available to users, along with documentation of their technical properties. The strengths and limitations of various approaches to combining scores should be evaluated, and information that would allow independent replication of the construction of indices, indicators, or models should be made available for use by appropriate parties.

As with regular test scores, a validity argument should be set forth to justify inferences about indices as measures of a desired outcome. It is important to help users understand the extent to which the models support causal inferences. For example, when value-added estimates are used as measures of teachers’ effectiveness in improving student achievement, evidence for the appropriateness of this inference needs to be provided. Similarly, if published ratings of health care providers are based on indices constructed from psychological test scores of their patients, the public information should include information to help users understand what inferences about provider performance are warranted. Developers and users of indices should be aware of ways in which the process of combining individual scores into an index may introduce technical problems that did not affect the original scores. Linking errors, floor or ceiling effects, differences in variability across different measures, and lack of an interval scale are examples of features that may not be problematic for the purpose of interpreting individual test scores but can become problematic when scores are combined into an aggregate measure. Finally, when evaluations or accountability systems rely on measures that combine various sources of information, such as when scores on multiple forms of a test are combined or when nontest information is included in an accountability index, the rules for combining the information need to be made explicit and must be justified. It is important to recognize that when multiple sources of data are collapsed into a single composite score or rating, the weights and distributional characteristics of the sources will affect the distribution of the composite scores. The effects of the weighting and distributional characteristics on the composite score should be investigated.

When indices combine scores from tests administered under standard conditions with those that involve modifications or other changes to administration conditions, there should be a clear rationale for combining the information into a single index, and the implications for validity and reliability should be examined.
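
The following Python fragment, a toy illustration with invented numbers, shows why nominal weights alone do not determine a composite: differences in the components' variability change each source's effective contribution, which is the kind of distributional effect the comment indicates should be investigated.

import statistics

# Two hypothetical components of a composite: scale scores and a nontest rating.
test_scores = [210.0, 260.0, 235.0, 300.0, 245.0]  # large spread
survey_rating = [3.1, 3.4, 3.2, 3.3, 3.0]          # small spread
nominal_weights = {"test": 0.5, "survey": 0.5}

def variance_contribution(weight, values):
    # Under this simplified sketch (independent components), a component's
    # contribution to composite variance is weight squared times its variance.
    return weight ** 2 * statistics.pvariance(values)

test_part = variance_contribution(nominal_weights["test"], test_scores)
survey_part = variance_contribution(nominal_weights["survey"], survey_rating)
total = test_part + survey_part

# Despite equal nominal weights, the test component dominates the composite's variability.
print(round(test_part / total, 3), round(survey_part / total, 3))

Standardizing the components before weighting, or choosing weights with the components' variances in mind, changes these effective contributions substantially; either way, the effects should be examined and documented.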

Cluster 2. Interpretations and Uses of Information From Tests Used in Program Evaluation, Policy Studies, and Accountability Systems

Standard 13.4

Evidence of validity, reliability, and fairness for each purpose for which a test is used in a program evaluation, policy study, or accountability system should be collected and made available.

Comment: Evidence should be provided of the suitability of a test for use in program evaluation, policy studies, or accountability systems, including the relevance of the test to the goals of the program, policy, or system under study and the suitability of the test for the populations involved. Those responsible for the release or reporting of test results should provide and explain any supplemental information that will minimize possible misinterpretations or misuse of the data. In particular, if an evaluation or accountability system is designed to support interpretations regarding the effectiveness of a program, institution, or provider, the validity of these interpretations for the intended uses should be investigated and documented. Reports should include cautions against making unwarranted inferences, such as holding health care providers accountable for test-score changes that may not be under their control. If the use involves a classification of persons, institutions, or programs into distinct categories, the consistency, accuracy, and fairness of the classifications should be reported. If the same test is used for multiple purposes (e.g., monitoring achievement of individual students; providing information to assist in instructional planning for individuals or groups of students; evaluating districts, schools, or teachers), evidence related to the validity of interpretations for each of these uses should be gathered and provided to users, and the potential negative effects for certain uses (e.g., improving instruction) that might result from unintended uses (e.g., high-stakes accountability) need to be considered and mitigated. When tests are used to evaluate the performance of personnel, the suitability of the tests for different groups of personnel (e.g., regular teachers, special education teachers, principals) should be examined.

Standard 13.5

Those responsible for the development and use of tests for evaluation or accountability purposes should take steps to promote accurate interpretations and appropriate uses for all groups for which results will be applied.

Comment: Those responsible for measuring outcomes should, to the extent possible, design the testing process to promote access and to maximize the validity of interpretations (e.g., by providing appropriate accommodations) for any relevant subgroups of test takers who participate in program or policy evaluation. Users of secondary data should clearly describe the extent to which the population included in the test-score database includes all relevant subgroups. The users should also document any exclusion rules that were applied and any other changes to the testing process that could affect interpretations of results. Similarly, users of tests for accountability purposes should make every effort to include all relevant subgroups in the testing program; provide documentation of any exclusion rules, testing modifications, or other changes to the test or administration conditions; and provide evidence regarding the validity of score interpretations for subgroups. When summaries of test scores are reported separately by subgroup (e.g., by racial/ethnic group), test users should conduct analyses to evaluate the reliability/precision of scores for these groups and the validity of score interpretations, and should report this information when publishing the score summaries. Analyses of complex indices used for accountability or for measuring program effectiveness should address the possibility of bias against specific subgroups or against programs or institutions serving those subgroups. If bias is detected (e.g., if scores on the index are shown to be subject to systematic error that is related to examinee characteristics such as race/ethnicity), these indices should not be used unless they are modified in a way that removes the bias. Additional considerations related to fairness and accessibility in educational tests and assessments are provided in chapter 3.

When test results are used to support actions regarding program or policy adoption or change, the professionals who are expected to make interpretations leading to these actions may need assistance in interpreting test results for this purpose. Advances in technology have led to increased availability of data and reports among teachers, administrators, and others who may not have received training in appropriate test use and interpretation or in analysis of test-score data. Those who provide the data or tools have the responsibility to offer support and assistance to users, and users have the responsibility to seek guidance on appropriate analysis and interpretation. Those responsible for the release or reporting of test results should provide and explain any supplemental information that will minimize possible misinterpretations of the data.

Often, the test results for program evaluation or policy analysis are analyzed well after the tests have been given. When this is the case, the user should investigate and describe the context in which the tests were given. Factors such as inclusion/exclusion rules, test purpose, content sampling, instructional alignment, and the attachment of high stakes can affect the aggregated results and should be made known to the audiences for the evaluation or analysis.

Standard 13.6

Reports of group differences in test performance should be accompanied by relevant contextual information, where possible, to enable meaningful interpretation of the differences. If appropriate contextual information is not available, users should be cautioned against misinterpretation.

Comment: Observed differences in average test scores between groups (e.g., classified by gender, race/ethnicity, disability, language proficiency, socioeconomic status, or geographical region) can be influenced by differences in factors such as opportunity to learn, training experience, effort, instructor quality, and level and type of parental support. In education, differences in group performance across time may be influenced by changes in the population of those tested (including changes in sample size) or changes in their experiences. Users should be advised to consider the appropriate contextual information when interpreting these group differences and when designing policies or practices to address those differences. In addition, if evaluations involve comparisons of test scores across national borders, evidence for the comparability of scores should be provided.

Standard 13.7

When tests are selected for use in evaluation or accountability settings, the ways in which the test results are intended to be used, and the consequences they are expected to promote, should be clearly described, along with cautions against inappropriate uses.

Comment: In some contexts, such as evaluation of a specific curriculum program, a test may have a limited purpose and may not be intended to promote specific outcomes other than informing the evaluation. In other settings, particularly with test-based accountability systems, the use of tests is often justified on the grounds that it will improve the quality of education by providing useful information to decision makers and by creating incentives to promote better performance by educators and students. These kinds of claims should be made explicit when the system is mandated or adopted, and evidence to support their validity should be provided when available. The collection and reporting of evidence for a particular validity claim should be incorporated into the program design. A given claim for the benefits of test use, such as improving students’ achievement, may be supported by logical or theoretical argument as well as empirical data. Due weight should be given to findings in the scientific literature that may be inconsistent with the stated claim.

Standard 13.8

Those who mandate the use of tests in policy, evaluation, and accountability contexts and those who use tests in such contexts should monitor their impact and should identify and minimize negative consequences.

Comment: The use of tests in policy, evaluation, and accountability settings may, in some cases, lead to unanticipated consequences. Particularly when high stakes are attached, those who mandate tests, as well as those who use the results, should take steps to identify potential unanticipated consequences. Unintended negative consequences may include teaching test items in advance, modifying test administration procedures, and discouraging or excluding certain test takers from taking the test. These practices can lead to spuriously high scores that do not reflect performance on the underlying construct or domain of interest. In addition, these practices may be prohibited by law. Testing procedures should be designed to minimize the likelihood of such consequences, and users should be given guidance and encouragement to refrain from inappropriate test-preparation practices.

Some consequences can be anticipated on the basis of past research and understanding of how people respond to incentives. For example, research shows that educational accountability tests influence curriculum and instruction by signaling what is important for students to know and be able to do. This influence can be positive if a test encourages a focus on valuable learning outcomes, but it is negative if it narrows the curriculum in unintended ways. These and other common negative consequences, such as possible motivational impact on teachers and students (even when test results are used as intended) and increasing dropout rates, should be studied and the results taken into consideration. The integrity of test results should be maintained by striving to eliminate practices designed to raise test scores without improving performance on the construct or domain measured by the test. In addition, administering an audit measure (i.e., another measure of the tested construct) may detect possible corruption of scores.

Standard 13.9

In evaluation or accountability settings, test results should be used in conjunction with information from other sources when the use of the additional information contributes to the validity of the overall interpretation.

Comment: Performance on indicators other than tests is almost always useful and in many cases essential. Descriptions or analyses of such variables as client selection criteria, services, client characteristics, setting, and resources are often needed to provide a comprehensive picture of the program or policy under review and to aid in the interpretation of test results. In the accountability context, a decision that will have a major impact on an individual such as a teacher or health care provider, or on an organization such as a school or treatment facility, should take into consideration other relevant information in addition to test scores. Examples of other information that may be incorporated into evaluations or accountability systems are measures of educators’ or health care providers’ practices (e.g., classroom observations, checklists) and nontest measures of student attainment (course taking, college attendance).

In the case of value-added modeling, some researchers have argued for the inclusion of student demographic characteristics (e.g., race/ethnicity and socioeconomic status) as controls, whereas other work suggests that including such variables does not improve the performance of the measures and can promote undesirable consequences such as a perception that lower standards are being set for some students than for others. Decisions regarding what variables to include in such models should be informed by empirical evidence regarding the effects of their inclusion or exclusion.

An additional type of information that is relevant to the interpretation of test results in policy settings is the degree of motivation of the test takers. It is important to determine whether test takers regard the test experience seriously, particularly when individual scores are not reported to test takers or when the scores are not associated with consequences for the test takers. Decision criteria regarding whether to include scores from individuals with questionable motivation should be clearly documented.

GLOSSARY

This glossary provides definitions of terms as used in the text and standards. For many of the terms, multiple definitions can be found in the literature; also, technical usage may differ from common usage.

ability parameter: In item response theory (IRT), atheoretical value indicating the level of a test taker onthe ability or trait measured by the test; analogous tothe concept of true score in classical test theory.

ability testing: The use of tests to evaluate the currentperformance of a person in some defined domain ofcognitive, psychomotor, or physical functioning.

accessibility: The degree to which the items or tasks on a test enable as many test takers as possible to demonstrate their standing on the target construct without being impeded by characteristics of the item that are irrelevant to the construct being measured. A test that ranks high on this criterion is referred to as accessible.

accommodations/test accommodations: Adjustments that do not alter the assessed construct that are applied to test presentation, environment, content, format (including response format), or administration conditions for particular test takers, and that are embedded within assessments or applied after the assessment is designed. Tests or assessments with such accommodations, and their scores, are said to be accommodated. Accommodated scores should be sufficiently comparable to unaccommodated scores that they can be aggregated together.

accountability index: A number or label that reflects aset of rules for combining scores and other informationto form conclusions and inform decision making in anaccountability system.

accountability system: A system that imposes studentperformance-based rewards or sanctions on institutionssuch as schools or school systems or on individualssuch as teachers or mental health care providers.

acculturation: A process related to the acquisition ofcultural knowledge and artifacts that is developmentalin nature and dependent upon time of exposure andopportunity for learning.

achievement levels/proficiency levels: Descriptions oftest takers’ levels of competency in a particular area ofknowledge or skill, usually defined in terms of categoriesordered on a continuum, for example from “basic” to“advanced,” or “novice” to “expert.” The categoriesconstitute broad ranges for classifying performance.See cut score.

achievement standards: See performance standards.

achievement test: A test to measure the extent ofknowledge or skill attained by a test taker in a contentdomain in which the test taker has received instruction.

adaptation/test adaptation: 1. Any change in test con-tent, format (including response format), or adminis-tration conditions that is made to increase a test’s ac-cessibility for individuals who otherwise would faceconstruct-irrelevant barriers on the original test. Anadaptation may or may not change the meaning of theconstruct being measured or alter score interpretations.An adaptation that changes score meaning is referredto as a modification; an adaptation that does not changethe score meaning is referred to as an accommodation(see definitions in this glossary). 2. Change made to atest that has been translated into the language of atarget group and that takes into account the nuancesof the language and culture of that group.

adaptive test: A sequential form of individual testingin which successive items, or sets of items, in the testare selected for administration based primarily on theirpsychometric properties and content, in relation to thetest taker’s responses to previous items.

adjusted validity or reliability coefficient: A validity or reliability coefficient—most often, a product-moment correlation—that has been adjusted to offset the effects of differences in score variability, criterion variability, or the unreliability of test and/or criterion scores. See restriction of range or variability.
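A common adjustment of this kind, the correction for attenuation due to criterion unreliability, can be written in standard textbook notation as:

$$ r_{x T_y} = \frac{r_{xy}}{\sqrt{r_{yy'}}} $$

where $r_{xy}$ is the observed test-criterion correlation and $r_{yy'}$ is the reliability coefficient of the criterion scores.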

aggregate score: A total score formed by combiningscores on the same test or across test components. Thescores may be raw or standardized. The components ofthe aggregate score may be weighted or not, dependingon the interpretation to be given to the aggregate score.


alignment: The degree to which the content andcognitive demands of test questions match targetedcontent and cognitive demands described in the test specifications.

alternate assessments/alternate tests: Assessments ortests used to evaluate the performance of students in ed-ucational settings who are unable to participate in stan-dardized accountability assessments, even with accom-modations. Alternate assessments or tests typically measureachievement relative to alternate content standards.

alternate forms: Two or more versions of a test that areconsidered interchangeable, in that they measure thesame constructs in the same ways, are built to the samecontent and statistical specifications, and are administeredunder the same conditions using the same directions.See equivalent forms, parallel forms.

alternate or alternative standards: Content and per-formance standards in educational assessment forstudents with significant cognitive disabilities.

analytic scoring: A method of scoring constructed re-sponses (such as essays) in which each critical dimensionof a particular performance is judged and scoredseparately, and the resultant values are combined for anoverall score. In some instances, scores on the separatedimensions may also be used in interpreting performance.Contrast with holistic scoring.

anchor items: Items administered with each of two ormore alternate forms of a test for the purpose ofequating the scores obtained on these alternate forms.

anchor test: A set of anchor items used for equating.

assessment: Any systematic method of obtaining in-formation, used to draw inferences about characteristicsof people, objects, or programs; a systematic process tomeasure or evaluate the characteristics or performanceof individuals, programs, or other entities, for purposesof drawing inferences; sometimes used synonymouslywith test.

assessment literacy: Knowledge about testing that sup-ports valid interpretations of test scores for their intendedpurposes, such as knowledge about test developmentpractices, test score interpretations, threats to validscore interpretations, score reliability and precision,test administration, and use.

automated scoring: A procedure by which constructedresponse items are scored by computer using a rules-based approach.

battery: A set of tests usually administered as a unit.The scores on the tests usually are scaled so that theycan readily be compared or used in combination fordecision making.

behavioral science: A scientific discipline, such as soci-ology, anthropology, or psychology, in which the actionsand reactions of humans and animals are studiedthrough observational and experimental methods.

benchmark assessments: Assessments administered ineducational settings at specified times during a curriculumsequence, to evaluate students’ knowledge and skillsrelative to an explicit set of longer-term learning goals.See interim assessments or tests.

bias: 1. In test fairness, construct underrepresentationor construct-irrelevant components of test scores thatdifferentially affect the performance of different groupsof test takers and consequently the reliability/precisionand validity of interpretations and uses of their testscores. 2. In statistics or measurement, systematic errorin a test score. See construct underrepresentation, con-struct-irrelevant variance, fairness, predictive bias.

bilingual/multilingual: Having a degree of proficiencyin two or more languages.

calibration: 1. In linking test scores, the process ofrelating scores on one test to scores on another thatdiffer in reliability/precision from those on the firsttest, so that scores have the same relative meaning for agroup of test takers. 2. In item response theory, theprocess of estimating the parameters of the item responsefunction. 3. In scoring constructed response tasks, pro-cedures used during training and scoring to achieve adesired level of scorer agreement.

certification: A process by which individuals are recog-nized (or certified) as having demonstrated some levelof knowledge and skill in some domain. See licensing, credentialing.

classical test theory: A psychometric theory based on the view that an individual’s observed score on a test is the sum of a true score component for the test taker and an independent random error component.
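In the usual textbook notation, the model and the resulting variance decomposition are:

$$ X = T + E, \qquad \sigma_X^2 = \sigma_T^2 + \sigma_E^2 $$

where $X$ is the observed score, $T$ the true score, and $E$ the random error, with $T$ and $E$ assumed to be uncorrelated.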

classification accuracy: Degree to which the assignmentof test takers to specific categories is accurate; thedegree to which false positive and false negative classi-fications are avoided. See sensitivity, specificity.

coaching: Planned short-term instructional activities for prospective test takers provided prior to the test administration for the primary purpose of improving their test scores. Activities that approximate the instruction provided by regular school curricula or training programs are not typically referred to as coaching.

coefficient alpha: An internal-consistency reliability coefficient based on the number of parts into which a test is partitioned (e.g., items, subtests, or raters), the interrelationships of the parts, and the total test score variance. Also called Cronbach’s alpha and, for dichotomous items, KR-20. See internal-consistency coefficient, reliability coefficient.
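A standard formulation, for a test partitioned into $k$ parts, is:

$$ \alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_i^2}{\sigma_X^2}\right) $$

where $\sigma_i^2$ is the variance of part $i$ and $\sigma_X^2$ is the variance of the total score.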

cognitive assessment: The process of systematicallycollecting test scores and related data to make judgmentsabout an individual’s ability to perform various mentalactivities involved in the processing, acquisition, retention,conceptualization, and organization of sensory, perceptual,verbal, spatial, and psychomotor information.

cognitive lab: A method of studying the cognitiveprocesses that test takers use when completing a tasksuch as solving a mathematics problem or interpreting apassage of text, typically involving test takers’ thinkingaloud while responding to the task and/or respondingto interview questions after completing the task.

cognitive science: The interdisciplinary study of learning and information processing.

comparability/score comparability: In test linking, the degree of score comparability resulting from the application of a linking procedure. Score comparability varies along a continuum that depends on the type of linking conducted. See alternate forms, equating, calibration, linking, moderation, projection, vertical scaling.

composite score: A score that combines several scoresaccording to a specified formula.

computer-administered test: A test administered by acomputer; test takers respond by using a keyboard,mouse, or other response devices.

computer-based mastery test: A test administered bycomputer that indicates whether the test taker hasachieved a specified level of competence in a certaindomain, rather than the test takers’ degree of achievementin that domain. See mastery test.

computer-based test: See computer-administered test.

computer-prepared interpretive report: A programmed interpretation of a test taker’s test results, based on empirical data and/or expert judgment using various formats such as narratives, tables, and graphs. Sometimes referred to as automated scoring or narrative report.

computerized adaptive test: An adaptive test administeredby computer. See adaptive test.

concordance: In linking test scores for tests that measuresimilar constructs, the process of relating a score on onetest to a score on another, so that the scores have thesame relative meaning for a group of test takers.

conditional standard error of measurement: Thestandard deviation of measurement errors that affectthe scores of test takers at a specified test score level.

confidence interval: An interval within which the parameter of interest will be included with a specified probability.
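As an illustrative example in the testing context (conventions for reporting score bands vary), an observed score of 50 with a standard error of measurement of 3 yields an approximate 95% band of $50 \pm 1.96 \times 3$, or roughly 44 to 56.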

consequences:The outcomes, intended and unintended,of using tests in particular ways in certain contexts andwith certain populations.

construct: The concept or characteristic that a test isdesigned to measure.

construct domain: The set of interrelated attributes(e.g., behaviors, attitudes, values) that are includedunder a construct’s label.

construct equivalence: 1. The extent to which aconstruct measured by one test is essentially the sameas the construct measured by another test. 2. Thedegree to which a construct measured by a test in onecultural or linguistic group is comparable to the constructmeasured by the same test in a different cultural or lin-guistic group.

construct-irrelevant variance: Variance in test-takerscores that is attributable to extraneous factors thatdistort the meaning of the scores and thereby decreasethe validity of the proposed interpretation.

construct underrepresentation: The extent to which atest fails to capture important aspects of the constructdomain that the test is intended to measure, resulting intest scores that do not fully represent that construct.

constructed-response items, tasks, or exercises: Items, tasks, or exercises for which test takers must create their own responses or products rather than choose a response from a specified set. Short-answer items require a few words or a number as an answer; extended-response items require at least a few sentences and may include diagrams, mathematical proofs, essays, or problem solutions such as network repairs or other work products.

content domain: The set of behaviors, knowledge,skills, abilities, attitudes, or other characteristics to bemeasured by a test, represented in detailed test specifi-cations and often organized into categories by whichitems are classified.

content-related validity evidence: Evidence based ontest content that supports the intended interpretationof test scores for a given purpose. Such evidence mayaddress issues such as the fidelity of test content to per-formance in the domain in question and the degree towhich test content representatively samples a domain,such as a course curriculum or job.

content standard: In educational assessment, a statementof content and skills that students are expected to learnin a subject matter area, often at a particular grade or atthe completion of a particular level of schooling.

convergent evidence: Evidence based on the relationshipbetween test scores and other measures of the same orrelated construct.

credentialing: Granting to a person, by some authority,a credential, such as a certificate, license, or diploma,that signifies an acceptable level of performance insome domain of knowledge or activity.

criterion domain: The construct domain of a variablethat is used as a criterion. See construct domain.

criterion-referenced score interpretation: The meaningof a test score for an individual or of an average scorefor a defined group, indicating the individual’s orgroup’s level of performance in relationship to somedefined criterion domain. Examples of criterion-referenced interpretations include comparisons to cutscores, interpretations based on expectancy tables, anddomain-referenced score interpretations. Contrast withnorm-referenced score interpretation.

cross-validation: A procedure in which a scoring systemfor predicting performance, derived from one sample,is applied to a second sample to investigate the stabilityof prediction of the scoring system.

cut score: A specified point on a score scale, such thatscores at or above that point are reported, interpreted,or acted upon differently from scores below that point.

differential item functioning (DIF): For a particularitem in a test, a statistical indicator of the extent towhich different groups of test takers who are at thesame ability level have different frequencies of correctresponses or, in some cases, different rates of choosingvarious item options.

differential test functioning (DTF): Differential per-formance at the test or dimension level indicating thatindividuals from different groups who have the samestanding on the characteristic assessed by a test do nothave the same expected test score.

discriminant evidence: Evidence indicating whethertwo tests interpreted as measures of different constructsare sufficiently independent (uncorrelated) that theydo, in fact, measure two distinct constructs.

documentation: The body of literature (e.g., testmanuals, manual supplements, research reports, publi-cations, user’s guides) developed by a test’s author, de-veloper, user, and/or publisher to support test score in-terpretations for their intended use.

domain or content sampling: The process of selectingtest items, in a systematic way, to represent the total setof items measuring a domain.

effort: The extent to which a test taker appropriatelyparticipates in test taking.

empirical evidence: Evidence based on some form ofdata, as opposed to that based on logic or theory.

English language learner (ELL): An individual who isnot yet proficient in English. An ELL may be an indi-vidual whose first language is not English, a languageminority individual just beginning to learn English, oran individual who has developed considerable proficiencyin English. Related terms include English learner (EL),limited English proficient (LEP), English as a secondlanguage (ESL), and culturally and linguistically diverse.

equated forms: Alternate forms of a test whose scoreshave been related through a statistical process knownas equating, which allows scale scores on equated formsto be used interchangeably.

equating: A process for relating scores on alternateforms of a test so that they have essentially the samemeaning. The equated scores are typically reported ona common score scale.

equivalent forms: See alternate forms, parallel forms.


error of measurement: The difference between an ob-served score and the corresponding true score. Seestandard error of measurement, systematic error, randomerror, true score.

factor: Any variable, real or hypothetical, that is anaspect of a concept or construct.

factor analysis: Any of several statistical methods ofdescribing the interrelationships of a set of variables bystatistically deriving new variables, called factors, thatare fewer in number than the original set of variables.

fairness: The validity of test score interpretations forintended use(s) for individuals from all relevant subgroups.A test that is fair minimizes the construct-irrelevantvariance associated with individual characteristics andtesting contexts that otherwise would compromise thevalidity of scores for some individuals.

fake bad: Exaggerate or falsify responses to test itemsin an effort to appear impaired.

fake good: Exaggerate or falsify responses to test itemsin an effort to present oneself in an overly positive way.

false negative: An error of classification, diagnosis, orselection leading to a determination that an individualdoes not meet the standard based on an assessment forinclusion in a particular group, when, in truth, he orshe does meet the standard (or would, absent measure-ment error). See sensitivity, specificity.

false positive: An error of classification, diagnosis, orselection leading to a determination that an individualmeets the standard based on an assessment for inclusionin a particular group, when, in truth, he or she doesnot meet the standard (or would not, absent measurementerror). See sensitivity, specificity.

field test: A test administration used to check theadequacy of testing procedures and the statistical char-acteristics of new test items or new test forms. A fieldtest is generally more extensive than a pilot test. Seepilot test.

flag: An indicator attached to a test score, a test item,or other entity to indicate a special status. A flaggedtest score generally signifies a score obtained from amodified test resulting in a change in the underlyingconstruct measured by the test. Flagged scores may notbe comparable to scores that are not flagged.

formative assessment: An assessment process used byteachers and students during instruction that providesfeedback to adjust ongoing teaching and learning withthe goal of improving students’ achievement of intendedinstructional outcomes.

gain score: In testing, the difference between two scoresobtained by a test taker on the same test or two equatedtests taken on different occasions, often before andafter some treatment.

generalizability coefficient: An index of reliability/precision based on generalizability theory (G theory). A generalizability coefficient is the ratio of universe score variance to observed score variance, where the observed score variance is equal to the universe score variance plus the total error variance. See generalizability theory.
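In standard G-theory notation, this ratio corresponds to:

$$ E\rho^2 = \frac{\sigma^2_{\text{universe}}}{\sigma^2_{\text{universe}} + \sigma^2_{\text{error}}} $$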

generalizability theory: Methodological framework forevaluating reliability/precision in which various sourcesof error variance are estimated through the applicationof the statistical techniques of analysis of variance. Theanalysis indicates the generalizability of scores beyondthe specific sample of items, persons, and observationalconditions that were studied. Also called G theory.

group testing: Testing for groups of test takers, usuallyin a group setting, typically with standardized adminis-tration procedures and supervised by a proctor or testadministrator.

growth models: Statistical models that measure students’progress on achievement tests by comparing the testscores of the same students over time. See value-addedmodeling.

high-stakes test: A test used to provide results thathave important, direct consequences for individuals,programs, or institutions involved in the testing.Contrast with low-stakes test.

holistic scoring: A method of obtaining a score on atest, or a test item, based on a judgment of overall per-formance using specified criteria. Contrast with analyticscoring.

individualized education program (IEP): A documentedplan that delineates special education services for aspecial-needs student and that includes any adaptationsthat are required in the regular classroom or for as-sessments and any additional special programs orservices.


informed consent: The agreement of a person, or thatperson’s legal representative, for some procedure to beperformed on or by the individual, such as taking a testor completing a questionnaire.

intelligence test: A test designed to measure an individual’slevel of cognitive functioning in accord with some rec-ognized theory of intelligence. See cognitive assessment.

interim assessments or tests: Assessments administeredduring instruction to evaluate students’ knowledge andskills relative to a specific set of academic goals toinform policy-maker or educator decisions at the class-room, school, or district level. See benchmark assess-ments.

internal-consistency coefficient: An index of thereliability of test scores derived from the statistical in-terrelationships among item responses or scores on sep-arate parts of a test. See coefficient alpha, split-halves re-liability coefficient.

internal structure: In test analysis, the factorial structureof item responses or subscales of a test.

interpreter: Someone who facilitates cross-cultural com-munication by converting concepts from one languageto another (including sign language).

interrater agreement/consistency: The level of consistencywith which two or more judges rate the work or per-formance of test takers. See interrater reliability.

interrater reliability: The level of consistency in rank or-dering of ratings across raters. See interrater agreement.

intrarater reliability: The level of consistency amongrepetitions of a single rater in scoring test takers’responses. Inconsistencies in the scoring process resultingfrom influences that are internal to the rater ratherthan true differences in test takers’ performances resultin low intrarater reliability.

inventory: A questionnaire or checklist that elicits in-formation about an individual’s personal opinions, in-terests, attitudes, preferences, personality characteristics,motivations, or typical reactions to situations and prob-lems.

item: A statement, question, exercise, or task on a testfor which the test taker is to select or construct aresponse, or perform a task. See prompt.

item characteristic curve (ICC): A mathematical function relating the probability of a certain item response, usually a correct response, to the level of the attribute measured by the item. Also called item response curve, item response function.
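One widely used family of such functions, the three-parameter logistic model, is shown here for illustration:

$$ P_i(\theta) = c_i + (1 - c_i)\,\frac{1}{1 + e^{-a_i(\theta - b_i)}} $$

where $\theta$ is the test taker's standing on the attribute, $a_i$ is the item's discrimination, $b_i$ its difficulty, and $c_i$ its lower asymptote; setting $c_i = 0$ yields the two-parameter logistic model.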

item context effect: Influence of item position, otheritems administered, time limits, administration conditions,and so forth, on item difficulty and other statisticalitem characteristics.

item pool/item bank: The collection or set of itemsfrom which a test or test scale’s items are selectedduring test development, or the total set of items fromwhich a particular subset is selected for a test takerduring adaptive testing.

item response theory (IRT): A mathematical model ofthe functional relationship between performance on atest item, the test item’s characteristics, and the testtaker’s standing on the construct being measured.

job analysis: The investigation of positions or jobclasses to obtain information about job duties andtasks, responsibilities, necessary worker characteristics(e.g. knowledge, skills, and abilities), working conditions,and/or other aspects of the work. See practice analysis.

job/job classification: A group of positions that aresimilar enough in duties, responsibilities, necessaryworker characteristics, and other relevant aspects thatthey may be properly placed under the same job title.

job performance measurement: Measurement of anincumbent’s observed performance of a job as evaluatedby a job sample test, an assessment of job knowledge,or ratings of the incumbent’s actual performance onthe job. See job sample test.

job sample test: A test of the ability of an individual toperform the tasks comprised by a job. See job performancemeasurement.

licensing: The granting, usually by a governmentagency, of an authorization or legal permission topractice an occupation or profession. See certification,credentialing.

linking/score linking:The process of relating scores ontests. See alternate forms, equating, calibration, moderation,projection, vertical scaling.

local evidence: Evidence (usually related to reliability/pre-cision or validity) collected for a specific test and aspecific set of test takers in a single institution or at aspecific location.

local norms: Norms by which test scores are referred to a specific, limited reference population of particular interest to the test user (e.g., population of a locale, organization, or institution). Local norms are not intended to be representative of populations beyond that limited setting.

low-stakes test: A test used to provide results that haveonly minor or indirect consequences for individuals,programs, or institutions involved in the testing.Contrast with high-stakes test.

mastery test: A test designed to indicate whether a testtaker has attained a prescribed level of competence, ormastery, in a domain. See cut score, computer-basedmastery test.

matrix sampling: A measurement format in which alarge set of test items is organized into a number ofrelatively short item sets, each of which is randomlyassigned to a subsample of test takers, thereby avoidingthe need to administer all items to all test takers.Equivalence of the short item sets, or subsets, is notassumed.

meta-analysis: A statistical method of research in whichthe results from independent, comparable studies arecombined to determine the size of an overall effect orthe degree of relationship between two variables.

moderation: A process of relating scores on differenttests so that scores have the same relative meaning.

moderator variable: A variable that affects the directionor strength of the relationship between two other vari-ables.

modification/test modification: A change in testcontent, format (including response formats), and/oradministration conditions that is made to increase ac-cessibility for some individuals but that also affects theconstruct measured and, consequently, results in scoresthat differ in meaning from scores from the unmodifiedassessment.

neuropsychological assessment: A specialized type ofpsychological assessment of normal or pathologicalprocesses affecting the central nervous system and theresulting psychological and behavioral functions ordysfunctions.

norm-referenced score interpretation: A score inter-pretation based on a comparison of a test taker’s per-formance with the distribution of performance in aspecified reference population. Contrast criterion-referenced score interpretation.

norms: Statistics or tabular data that summarize thedistribution or frequency of test scores for one or morespecified groups, such as test takers of various ages orgrades, usually designed to represent some larger popu-lation, referred to as the reference population. See localnorms.

operational use: The actual use of a test, after initialtest development has been completed, to inform an in-terpretation, decision, or action, based in part or whollyon test scores.

opportunity to learn: The extent to which test takershave been exposed to the tested constructs throughtheir educational program and/or have had exposure toor experience with the language or the majority culturerequired to understand the test.

parallel forms: In classical test theory, strictly paralleltest forms that are assumed to measure the sameconstruct and to have the same means and the samestandard deviations in the populations of interest. Seealternate forms.

percentile: The score on a test below which a givenpercentage of scores for a specified population occurs.

percentile rank: The rank of a given score based on the percentage of scores in a specified score distribution that are below the score being ranked.
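A common computational form is:

$$ PR = \frac{c_{\text{below}}}{N} \times 100 $$

where $c_{\text{below}}$ is the number of scores below the value and $N$ is the total number of scores; some conventions instead count half of the scores tied at the value, using $(c_{\text{below}} + 0.5\,f)/N \times 100$, where $f$ is the frequency at that value.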

performance assessments: Assessments for which thetest taker actually demonstrates the skills the test is in-tended to measure by doing tasks that require thoseskills.

performance level: Label or brief statement classifyinga test taker’s competency in a particular domain, usuallydefined by a range of scores on a test. For example,labels such as “basic” to “advanced,” or “novice” to “ex-pert,” constitute broad ranges for classifying proficiency.See achievement levels, cut score, performance-level descriptor,standard setting.

performance-level descriptor: Descriptions of what testtakers know and can do at specific performance levels.

performance standards: Descriptions of levels of knowledge and skill acquisition contained in content standards, as articulated through performance-level labels (e.g., “basic,” “proficient,” “advanced”); statements of what test takers at different performance levels know and can do; and cut scores or ranges of scores on the scale of an assessment that differentiate levels of performance. See cut score, performance level, performance-level descriptor.

personality inventory: An inventory that measures oneor more characteristics that are regarded generally aspsychological attributes or interpersonal tendencies.

pilot test: A test administered to a sample of test takersto try out some aspects of the test or test items, such asinstructions, time limits, item response formats, oritem response options. See field test.

policy study: A study that contributes to judgmentsabout plans, principles, or procedures enacted to achievebroad public goals.

portfolio: In assessment, a systematic collection of ed-ucational or work products that have been compiled oraccumulated over time, according to a specific set ofprinciples or rules.

position: In employment contexts, the smallest organi-zational unit, a set of assigned duties and responsibilitiesthat are performed by a person within an organization.

practice analysis: An investigation of a certain occupationor profession to obtain descriptive information aboutthe activities and responsibilities of the occupation orprofession and about the knowledge, skills, and abilitiesneeded to engage successfully in the occupation or pro-fession. See job analysis.

precision of measurement: The impact of measurementerror on the outcome of the measurement. See standarderror of measurement, error of measurement, reliability/precision.

predictive bias: The systematic under- or over-predictionof criterion performance for people belonging to groupsdifferentiated by characteristics not relevant to thecriterion performance.

predictive validity evidence: Evidence indicating howaccurately test data collected at one time can predictcriterion scores that are obtained at a later time.

proctor: In test administration, a person responsiblefor monitoring the testing process and implementingthe test administration procedures.

program evaluation: The collection and synthesis ofevidence about the use, operation, and effects of a pro-gram; the set of procedures used to make judgmentsabout a program’s design, implementation, and out-comes.

projection: A method of score linking in which scoreson one test are used to predict scores on another testfor a group of test takers, often using regression method-ology.

prompt/item prompt/writing prompt: The question,stimulus, or instruction that elicits a test taker’s response.

proprietary algorithms: Procedures, often computercode, used by commercial publishers or test developersthat are not revealed to the public for commercial rea-sons.

psychodiagnosis: Formalization or classification offunctional mental health status based on psychologicalassessment.

psychological assessment: An examination of psycho-logical functioning that involves collecting, evaluating,and integrating test results and collateral information,and reporting information about an individual.

psychological testing: The use of tests or inventories toassess particular psychological characteristics of an in-dividual.

random error: A nonsystematic error; a component oftest scores that appears to have no relationship to othervariables.

random sample: A selection from a defined populationof entities according to a random process with theselection of each entity independent of the selection ofother entities. See sample.

raw score: A score on a test that is calculated bycounting the number of correct answers, or moregenerally, a sum or other combination of item scores.

reference population: The population of test takers towhich individual test takers are compared through thetest norms. The reference population may be definedin terms of test taker age, grade, clinical status at thetime of testing, or other characteristics. See norms.

relevant subgroup: A subgroup of the population forwhich a test is intended that is identifiable in some waythat is relevant to the interpretation of test scores fortheir intended purposes.

reliability coefficient: A unit-free indicator that reflectsthe degree to which scores are free of random measure-ment error. See generalizability theory.

reliability/precision: The degree to which test scores for a group of test takers are consistent over repeated applications of a measurement procedure and hence are inferred to be dependable and consistent for an individual test taker; the degree to which scores are free of random errors of measurement for a given group. See generalizability theory, classical test theory, precision of measurement.

response bias: A test taker’s tendency to respond in aparticular way or style to items on a test (e.g., acquiescence,choice of socially desirable options, choice of “true” ona true-false test) that yields systematic, construct-irrelevant error in test scores.

response format: The mechanism that a test taker usesto respond to a test item, such as selecting from a list ofoptions (multiple-choice question) or providing awritten response (fill-in or written response to an open-ended or constructed-response question); oral response;or physical performance.

response protocol: A record of the responses given by atest taker to a particular test.

restriction of range or variability: Reduction in theobserved score variance of a test-taker sample, comparedwith the variance of the entire test-taker population, asa consequence of constraints on the process of samplingtest takers. See adjusted validity or reliability coefficient.

retesting: A repeat administration of a test, using eitherthe same test or an alternate form, sometimes with ad-ditional training or education between administrations.

rubric: See scoring rubric.

sample: A selection of a specified number of entities,called sampling units (test takers, items, etc.), from alarger specified set of possible entities, called thepopulation. See random sample, stratified random sam-ple.

scale: 1. The system of numbers, and their units, bywhich a value is reported on some dimension of meas-urement. 2. In testing, the set of items or subtests usedto measure a specific characteristic (e.g., a test of verbalability or a scale of extroversion-introversion).

scale score: A score obtained by transforming rawscores. Scale scores are typically used to facilitate interpretation.

scaling: The process of creating a scale or a scale score to enhance test score interpretation by placing scores from different tests or test forms on a common scale or by producing scale scores designed to support score interpretations. See scale.

school district: A local education agency administeredby a public board of education or other public authoritythat oversees public elementary or secondary schools ina political subdivision of a state.

score: Any specific number resulting from the assessmentof an individual, such as a raw score, a scale score, anestimate of a latent variable, a production count, anabsence record, a course grade, or a rating.

scoring rubric: The established criteria, includingrules, principles, and illustrations, used in scoring con-structed responses to individual tasks and clusters oftasks.

screening test: A test that is used to make broad cate-gorizations of test takers as a first step in selectiondecisions or diagnostic processes.

selection: The acceptance or rejection of applicants fora particular educational or employment opportunity.

sensitivity: In classification, diagnosis, and selection,the proportion of cases that are assessed as meeting orpredicted to meet the criteria and which, in truth, domeet the criteria.

specificity: In classification, diagnosis, and selection,the proportion of cases that are assessed as not meetingor predicted to not meet the criteria and which, intruth, do not meet the criteria.

speededness: The extent to which test takers’ scoresdepend on the rate at which work is performed as wellas on the correctness of the responses. The term is notused to describe tests of speed.

split-halves reliability coefficient: An internal-consistency coefficient obtained by using half the items on a test to yield one score and the other half of the items to yield a second, independent score. See internal-consistency coefficient, coefficient alpha.
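The half-test correlation is typically stepped up to estimate full-length reliability with the Spearman-Brown formula:

$$ r_{\text{full}} = \frac{2\,r_{\text{half}}}{1 + r_{\text{half}}} $$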

stability: The extent to which scores on a test are es-sentially invariant over time, assessed by correlating thetest scores of a group of individuals with scores on thesame test or an equated test taken by the same group ata later time. See test-retest reliability coefficient.

standard error of measurement: The standard deviation of an individual’s observed scores from repeated administrations of a test (or parallel forms of a test) under identical conditions. Because such data generally cannot be collected, the standard error of measurement is usually estimated from group data. See error of measurement.
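The usual group-based estimate, from classical test theory, is:

$$ SEM = \sigma_X \sqrt{1 - \rho_{XX'}} $$

where $\sigma_X$ is the standard deviation of the observed scores and $\rho_{XX'}$ is their reliability coefficient.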

standard setting: The process, often judgment based,of setting cut scores using a structured procedure thatseeks to map test scores into discrete performancelevels that are usually specified by performance-level descriptors.

standardization: 1. In test administration, maintaininga consistent testing environment and conducting testsaccording to detailed rules and specifications, so thattesting conditions are the same for all test takers on thesame and multiple occasions. 2. In test development,establishing a reporting scale using norms based on thetest performance of a representative sample of individualsfrom the population with which the test is intended tobe used.

standards-based assessment: Assessment of an individual’sstanding with respect to systematically described contentand performance standards.

stratified random sample: A set of random samples,each of a specified size, from each of several differentsets, which are viewed as strata of a population. Seerandom sample, sample.

summative assessment: The assessment of a test taker’sknowledge and skills typically carried out at the com-pletion of a program of learning, such as the end of aninstructional unit.

systematic error: An error that consistently increasesor decreases the scores of all test takers or some subsetof test takers, but is not related to the construct thatthe test is intended to measure. See bias.

technical manual: A publication prepared by test de-velopers and/or publishers to provide technical andpsychometric information about a test.

test: An evaluative device or procedure in which a sys-tematic sample of a test taker’s behavior in a specifieddomain is obtained and scored using a standardizedprocess.

test design: The process of developing detailed specifi-cations for what a test is to measure and the content,cognitive level, format, and types of test items to beused.

test developer: The person(s) or organization responsible for the design and construction of a test and for the documentation regarding its technical quality for an intended purpose.

test development: The process through which a test isplanned, constructed, evaluated, and modified, includingconsideration of content, format, administration, scoring,item properties, scaling, and technical quality for thetest’s intended purpose.

test documents: Documents such as test manuals,technical manuals, user’s guides, specimen sets, and di-rections for test administrators and scorers that provideinformation for evaluating the appropriateness and tech-nical adequacy of a test for its intended purpose.

test form: A set of test items or exercises that meet re-quirements of the specifications for a testing program.Many testing programs use alternate test forms, eachbuilt according to the same specifications but withsome or all of the test items unique to each form. Seealternate forms.

test format/mode: The manner in which test contentis presented to the test taker: with paper and pencil, viacomputer terminal or Internet, or orally by an examiner.

test information function: A mathematical function relating each level of an ability or latent trait, as defined under item response theory (IRT), to the reciprocal of the corresponding conditional measurement error variance.
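Under common IRT models, test information is the sum of the item information functions; for the two-parameter logistic model,

$$ I(\theta) = \sum_i a_i^2\, P_i(\theta)\,\bigl[1 - P_i(\theta)\bigr] $$

and the conditional standard error of the ability estimate is approximately $1/\sqrt{I(\theta)}$.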

test manual: A publication prepared by test developersand/or publishers to provide information on test ad-ministration, scoring, and interpretation and to provideselected technical data on test characteristics. See user’sguide, technical manual.

test modification: Changes made in the content, format,and/or administration procedure of a test to increasethe accessibility of the test for test takers who areunable to take the original test under standard testingconditions. In contrast to test accommodations, testmodifications change the construct being measured bythe test to some extent and hence change score inter-pretations. See adaptation/test adaptation, modification/testmodification. Contrast with accommodations/test accom-modations.

test publisher: An entity, individual, organization, oragency that produces and/or distributes a test.

test-retest reliability coefficient: A reliability coefficient obtained by administering the same test a second time to the same group after a time interval and correlating the two sets of scores; typically used as a measure of stability of the test scores. See stability.

test security: Protection of the content of a test fromunauthorized release or use, to protect the integrity ofthe test scores so they are valid for their intended use.

test specifications: Documentation of the purpose andintended uses of a test as well as of the test’s content,format, length, psychometric characteristics (of theitems and test overall), delivery mode, administration,scoring, and score reporting.

test-taking strategies: Strategies that test takers mightuse while taking a test to improve their performance,such as time management or the elimination of obviouslyincorrect options on a multiple-choice question beforeresponding to the question.

test user: A person or entity responsible for the choiceand administration of a test, for the interpretation oftest scores produced in a given context, and for any de-cisions or actions that are based, in part, on test scores.

timed test: A test administered to test takers who areallotted a prescribed amount of time to respond to thetest.

top-down selection: Selection of applicants on thebasis of rank-ordered test scores from highest to lowest.

true score: In classical test theory, the average of thescores that would be earned by an individual on an un-limited number of strictly parallel forms of the sametest.

unidimensional test: A test that measures only one di-mension or only one latent variable.

universal design: An approach to assessment developmentthat attempts to maximize the accessibility of a test forall of its intended test takers.

universe score: In generalizability theory, the expectedvalue over all possible replications of a procedure forthe test taker. See generalizability theory.

user norms: Descriptive statistics (including percentileranks) for a group of test takers that does not representa well-defined reference population, for example, allpersons tested during a certain period of time, or a setof self-selected test takers. See local norms, norms.

user’s guide: A publication prepared by test developers and/or publishers to provide information on a test’s purpose, appropriate uses, proper administration, scoring procedures, normative data, interpretation of results, and case studies. See test manual.

validation: The process through which the validity of a proposed interpretation of test scores for their intended uses is investigated.

validity: The degree to which accumulated evidence and theory support a specific interpretation of test scores for a given use of a test. If multiple interpretations of a test score for different uses are intended, validity evidence for each interpretation is needed.

validity argument: An explicit justification of the degree to which accumulated evidence and theory support the proposed interpretation(s) of test scores for their intended uses.

validity generalization: Application of validity evidence obtained in one or more situations to other similar situations on the basis of methods such as meta-analysis.

value-added modeling: Estimating the contribution of individual schools or teachers to student performance by means of complex statistical techniques that use multiple years of student outcome data, which typically are standardized test scores. See growth models.

variance components: Variances accruing from the separate constituent sources that are assumed to contribute to the overall variance of observed scores. Such variances, estimated by methods of the analysis of variance, often reflect situation, location, time, test form, rater, and related effects. See generalizability theory.
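A minimal computational sketch (not part of the Standards text), assuming a hypothetical persons-by-raters (p x r) crossed design with illustrative ratings: the variance components are estimated from the expected mean squares of a two-way random-effects analysis of variance.

```python
import numpy as np

# Hypothetical ratings: rows are persons, columns are raters (crossed p x r design).
scores = np.array([
    [4.0, 3.5, 4.5],
    [2.5, 3.0, 2.0],
    [5.0, 4.5, 5.0],
    [3.0, 3.5, 3.0],
    [4.5, 4.0, 4.0],
])
n_p, n_r = scores.shape
grand = scores.mean()
person_means = scores.mean(axis=1)
rater_means = scores.mean(axis=0)

# Mean squares for a two-way crossed design with one observation per cell.
ms_p = n_r * np.sum((person_means - grand) ** 2) / (n_p - 1)
ms_r = n_p * np.sum((rater_means - grand) ** 2) / (n_r - 1)
residual = scores - person_means[:, None] - rater_means[None, :] + grand
ms_pr = np.sum(residual ** 2) / ((n_p - 1) * (n_r - 1))

# Variance component estimates from the expected mean squares.
var_pr_e = ms_pr                  # person-by-rater interaction, confounded with error
var_p = (ms_p - ms_pr) / n_r      # persons (the object of measurement)
var_r = (ms_r - ms_pr) / n_p      # raters
print(var_p, var_r, var_pr_e)

# Relative generalizability coefficient for scores averaged over n_r raters.
g_relative = var_p / (var_p + var_pr_e / n_r)
```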

vertical scaling: In test linking, the process of relating scores on tests that measure the same construct but differ in difficulty. Typically used with achievement and ability tests with content or difficulty that spans a variety of grade or age levels.

vocational assessment: A specialized type of psychological assessment designed to generate hypotheses and inferences about interests, work needs and values, career development, vocational maturity, and indecision.

weighted scores/scoring: A method of scoring a test in which a different number of points is awarded for a correct (or diagnostically relevant) response for different items. In some cases, the scoring formula awards differing points for each different response to the same item.
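A minimal sketch (not part of the Standards text) of both variants, using hypothetical item weights and a hypothetical option-level scoring key.

```python
# Hypothetical weights: a correct response to each item earns a different
# number of points.
item_weights = {"item_1": 1, "item_2": 2, "item_3": 3}

# Hypothetical option-level key for a single item on which each response
# alternative earns a different number of points.
option_points = {"A": 0, "B": 1, "C": 2, "D": 3}

def weighted_total(correct_flags, weights):
    # Sum the weight of every item answered correctly.
    return sum(weights[item] for item, correct in correct_flags.items() if correct)

responses = {"item_1": True, "item_2": False, "item_3": True}
print(weighted_total(responses, item_weights))  # 1 + 3 = 4 points
print(option_points["C"])                       # 2 points for selecting option C
```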

INDEX

Abbreviated test form, 44–45, 107
Accommodations, 45, 59–61
    appropriateness, 62, 67–69, 115, 145, 190
    documenting, 67, 88
    English language learners (ELL), 191
    meaning, 58, 190
    score comparability, 59
    (see also Modifications)

Accountability
    index, 206, 209–211
    measures, reliability/precision, 40
    opportunity to learn, 57
    systems, 203
Achievement standards (see Performance standards)
Adaptations, 50, 58–59
    alternate assessments, 189–190
    employment testing, 177
    test-taker responsibilities, 132
    test-user responsibilities, 144
    translations, 60
    (see also Accommodations, Modifications)

Adaptive testing
    item selection, 81, 89, 98
    reliability/precision, 43
    score comparability, 106
    specifications, 80–81, 86
Admissions testing, 186–187
Aggregate scores, 71, 119–120, 190, 210
Alignment, 15, 26, 87–89, 185, 196
Alternate assessments, 189–190
Anchor test design, 98, 105–106
Assessment
    formative, 184
    meaning, 2, 183
    psychological, 151
    summative, 184

Assessment literacy, 192
Attenuation, 29, 47, 180
Bias, 49, 51–54, 211
    cultural, 52–53, 55–56, 60, 64
    predictive, 51–52, 66
    (see also Differential item functioning, Differential prediction, Differential test functioning)

Certification testing, 136, 169, 174–175
Change scores (see Growth measures)
Cheating, 116–117, 132, 136–137
Classical test theory, 33–35, 37, 88
Classification, 30, 181
    decision consistency, 40–41, 46, 136
    score labels, 136

Clinical assessment (see Psychological assessment)
Coaching (see Practice effects)
Cognitive labs, 82
Collateral information, 155, 167
Composite scores, 27, 43, 93, 182, 193, 210
Computer adaptive testing (see Adaptive testing)
Computer-administered tests, 83, 112, 116, 145, 153, 166, 188, 197
Concordance (see Score linking)
Consequential evidence (see Validation evidence, Unintended consequences)
Construct, 11
Construct irrelevance, 12, 54–56, 64, 67, 90, 154
Construct underrepresentation, 12, 154
    accommodations, 60
Content standards, 185
Content validation evidence, 14–15
Context effects, 45
Copyright protection, 147–148
Credentialing test (see Licensing, Certification testing)
Criterion variable, 17, 172, 180
Criterion-referenced interpretation, 96
Cross-validation, 28, 89
Cut scores, 46, 96, 100–101, 107–109, 129, 176
    adjusting, 177, 182
    standard setting, 176

Decision accuracy, 40, 136
Decision consistency, 40–41, 44
    estimating, 46
    reporting, 46, 136, 182
Difference scores, 43
Differential item functioning (DIF), 16, 51, 82
Differential prediction, 18, 30, 51–52, 66
Differential test functioning (DTF), 51, 65, 70–71
Dimensionality, 16, 27, 43
Disattenuated correlations, 29
Documentation, 123–126
    availability, 129
    cut scores, 107–109
    equating procedures, 105
    forms differences, 86–87
    norming procedures, 104
    psychometric item properties, 88–89
    rater qualifications, 92
    rater scoring, 92
    reliability/precision, 126
    research studies, 126–127
    score interpretation, 92
    score linking, 106
    score scale development, 102
    scoring procedures, 118, 197
    test administration, 127–128
    test development, 126
    test revision, 129

Educational testing
    accountability, 126, 147, 203–207, 209–213
    admissions, 186–187
    placement, 187
    purposes, 184–187, 195
Effect size, 29
Employment testing
    contextual factors, 170–171
    job analysis, 173, 175, 182
    validation, 175–176
    validation process, 171–174, 178–179, 181
English language proficiency, 191
Equating (see Score linking)
Errors of measurement, 33–34
Expert review, 87–88

Fairness
    accessibility, 49, 52–53, 77
    educational tests, 186
    meaning, 49
    score validity, 53–54, 63
    universal design, 50, 57–58, 187
    (see also Bias)
Faking, 154–155
Field testing, 83, 88
Flagging test scores (see Adaptations)

Gain scores (see Difference scores, Growth measures)
Generalizability theory framework, 34
Group performance
    interpretation, 66, 200, 207, 212
    norms, 104
    reliability/precision, 40, 46–47, 119
    subgroups, 72, 145, 165
    (see also Aggregate scores)

Growth measures, 185, 198, 209

High-stakes tests, 189, 203

Informed consent, 131, 134–135

Item format
    accessibility, 77
    adaptations, 77
    performance assessments, 77–78
    portfolios, 78
    simulations, 78
Item response theory (IRT), 38
    information function, 34, 37–38
Item tryout, 82, 88
Item weights, 93

Language proficiency, 53, 55, 68–69, 146, 156–157, 191 (see also Translated tests)
Licensing, 169, 175
Linking tests (see Score linking)
Local scoring, 128
Mandated tests, 195, 212–213
Matrix sampling, 47, 119–120, 204, 209
Meta-analysis, 29–30, 173–174, 209
Modifications, 24, 45, 67
    appropriateness, 62, 69
    documenting, 68
    meaning, 58, 190
    score interpretations, 68, 191
    (see also Accommodations)

Multi-stage testing, 81 (see also Adaptive testing)

Norm-referenced interpretation, 96–97, 186
Norms, 96–97, 104, 126, 186
    local, 196
    updating, 104–105
    user, 97, 186

Observed score, 34
Opportunity to learn, 56–57, 72, 197
Parallel tests, 35
Passing score (see Cut scores)
Performance standards, 185
Personality measures, 43, 142, 155, 158, 164
Personnel selection testing (see Employment testing)
Placement tests, 169, 187
Policy studies, 203, 204
Practice effects, 24–25
Practice material, 91, 116, 131
Program evaluation, 203–204
Psychological assessment
    batteries, 155, 165–167
    collateral information, 155, 167
    diagnosis, 159–160, 165, 167
    interpretation, 155
    interventions, 161
    meaning, 151
    personality, 158
    process, 151–152
    purposes, 159–163
    qualifications, 164
    types of, 155–157
    vocational, 158–159

Random errors, 36
Rater agreement, 25, 39, 44, 118
Rater training (see Scorer training)
Raw scores, 103
Records retention, 120–121, 146
Reliability coefficient
    interpretation, 44
    meaning, 33–35
Reliability/precision
    documentation, 126
    meaning, 33
Reliability/precision estimates
    adjustments with, 29, 47
    interpretations, 38–39
    reporting of results, 40–45
    reporting subscores, 43
Reliability/precision estimation procedures, 36–37
    alternate forms, 34–35, 37, 95
    generalizability coefficient, 37–38
    group means, 40, 46–47
    internal consistency, 35–37
    reporting, 47
    scorer consistency, 37, 44, 92
    test-retest, 36–38
Replications, 35–37
Response bias, 154
Restriction of range, 29, 47, 180
Retention of records, 120–121, 146
Retesting, 114–115, 132, 146–147, 152, 197

Scale drift, 107
Scale scores
    appropriate use, 102
    documentation, 102
    drift, 107
    interpretation, 102–103
Scale stability, 103
Score comparability
    adaptive testing, 106
    evidence, 60, 103, 105, 106
    interpretations, 61, 71, 95, 111, 116
    translations, 69

Score interpretation, 23–25
    absolute, 39
    automated, 119, 144, 168
    case studies, 128–129
    composite scores, 27, 43, 182
    documentation, 92
    inappropriate, 23, 27, 124, 143–144, 166
    meta-analysis, 30, 173–174
    multiple indicators, 71, 140–141, 145, 154–155, 166–167, 179, 198, 213
    qualifications, 139–142, 199–200
    relative, 39
    reliability/precision, 33–34, 42, 119, 198–199
    subgroups, 65, 70–72, 211
    subscores, 27, 176, 201
    test batteries, 155
    validation, 23, 27, 85, 199
Score linking, 99–100
    documentation, 106
    equating meaning, 97
    equating methods, 98, 105–106
    meaning, 95
Score reporting, 135
    adaptations, 61
    automated, 119, 144, 168, 194
    errors, 120, 143
    flagging, 61, 194
    release, 135, 146–147, 211–212
    supporting materials, 119, 144, 166, 194, 200
    timelines, 136–137, 146
    transmission, 121, 135

Scorer training, 112, 118
Scoring
    analytic, 79
    holistic, 79
Scoring algorithms, 66–67, 91–92, 118
    documenting, 92
Scoring bias, 66
Scoring errors, 143
Scoring portfolios, 78, 187
Scoring rubrics, 79, 82, 92, 118
    bias, 57
Security, 117, 120–121, 128, 132, 147–148, 168
Selection, 169
Sensitivity reviews, 64
Short forms of tests (see Abbreviated test form)
Standard error of measurement (SEM), 34, 37, 39–40, 45–46
    conditional, 34, 39, 46, 176, 182
Standard setting (see Cut scores)
Standardized test, 111

Systematic errors, 36

Technical manuals (see Documentation)
Test
    classroom, 183
    meaning, 2, 183
Test administration, 114, 192
    directions, 83, 90–91, 112
    documentation, 127–128
    interpreter use, 69–70
    qualifications, 127, 139, 142, 153, 164, 199–200
    security, 128
    standardized, 65, 115
    variations, 87, 90, 115
Test bias (see Bias)
Test developer, 23
    meaning, 3, 76
Test development
    accessibility, 195–196
    design, 75
    documentation, 126
    meaning, 75
    (see also Universal design)

Test manuals (see Documentation)
Test preparation, 24–25, 134, 165, 197
Test publisher, 76
Test revisions, 83–84, 93, 107, 176–177
    documentation, 129
Test security procedures, 83
Test selection, 72, 139, 142–143, 204, 212
    psychological, 152, 164–165
Test specifications, 85–86
    adaptive testing, 80–81
    administration, 80
    content, 76, 85
    employment testing, 175
    item formats, 77–78
    length, 79
    meaning, 76
    portfolios, 78
    purpose, 76
    scoring, 79–80
Test standards
    applicability, 2–3, 5–6
    cautions, 7
    enforcement, 2
    legal requirements, 1, 7
    purposes, 1

Test users, 139–141
    responsibilities, 142, 153
Testing environment, 116
Testing irregularities, 136–137, 146
Test-taker responsibilities, 131–132
    adaptations, 132
Test-taker rights, 131–133, 162
    informed consent, 131, 134–135
    irregularities, 137
    research instrument, 91
    test preparation, 133
Time limits, appropriateness, 90
Translated tests, 60–61, 68–69, 127
True score, 34

Unintended consequences, 12, 19–20, 30–31, 124, 189, 196, 207, 212

Universal design, 50, 57–58, 63, 77, 187
Universe score, 34
Validation
    meaning, 11
    process, 11–12, 19–21, 23, 85, 171–174, 210
    samples, 25, 126–127
Validation evidence, 13–19
    absence of, 143, 164
    concurrent, 17–18
    consequential, 19–21, 30–31
    construct-related, 27–28, 66
    content-oriented, 14, 26, 54–55, 87–89, 172, 175–176, 178, 181–182, 196
    convergent, 16–17
    criterion variable, 28, 172, 180
    criterion-related, 17–19, 29, 66, 167, 172, 175–176
    data collection, 26
    discriminant, 16–17
    integration of, 21–22
    internal structure, 16, 26–27
    interrelationships, 16, 27–29
    predictive, 17–18, 28, 129, 167, 172, 179
    rater variables, 25–26
    ratings, 25–26
    relations to other variables, 16–18, 172
    response processes, 15–16, 26
    statistical, 26, 28–29, 126
    subgroups, 64
    validity generalization, 18, 173, 180
Validity
    fairness, 49–57
    meaning, 11, 14
    process, 13
    reliability/precision implications, 34–35
Validity generalization, 18, 173, 180
Vertical scaling, 95, 99, 185
