
Copyright © 2006 NCS Pearson, Inc. All rights reserved.

Advanced Numerical Reasoning Appraisal™

(ANRA)

Manual

John Rust

888-298-6227 • TalentLens.com

Copyright © 2006 by NCS Pearson, Inc. All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission in writing from the copyright owner. The Pearson and TalentLens logos, and Advanced Numerical Reasoning Appraisal are trademarks, in the U.S. and/or other countries, of Pearson Education, Inc. or its affiliate(s). Portions of this work were previously published. Printed in the United States of America.


    Table of Contents

Acknowledgements

Chapter 1 Introduction............................................................................................. 1

    Numerical Reasoning and Critical Thinking ............................................................ 2

Chapter 2 History and Development of ANRA....................................................... 4

Description of the Test ............................................................................................ 4

    Adapting RANRA .................................................................................................... 4

    Development of RANRA ......................................................................................... 5

Chapter 3 Directions for Administration ................................................................ 6

General Information ................................................................................................ 6

    Preparing for Administration.................................................................................... 6

    Testing Conditions .................................................................................................. 7

    Answering Questions .............................................................................................. 7

    Administering the Test ............................................................................................ 7

    Scoring and Reporting ............................................................................................ 8

    Test Security ........................................................................................................... 8

    Concluding Test Administration .............................................................................. 8

    Administering ANRA and Watson-Glaser Critical Thinking Appraisal® in a Single Testing Session..................................................................................... 9

    Accommodating Examinees with Disabilities .......................................................... 9

Chapter 4 ANRA Norms Development.................................................................... 10

Using ANRA as a Norm- or Criterion-Referenced Test........................................... 10

    Using Norms to Interpret Scores............................................................................. 11

    Converting Raw Scores to Percentile Ranks .......................................................... 12

    Using Standard Scores to Interpret Performance ................................................... 12

    Converting z Scores to T Scores....................................................................... 13

    Using ANRA and Watson-Glaser Critical Thinking Appraisal Together .................. 14


Chapter 5 Evidence of Reliability............................................................................ 15

Reliability Coefficients and Standard Error of Measurement................................... 15

    RANRA Reliability Studies ..................................................................................... 17

    ANRA Reliability Studies......................................................................................... 17

    Evidence of Internal Consistency ...................................................................... 18

    Evidence of Test-Retest Stability ...................................................................... 20

Chapter 6 Evidence of Validity ................................................................................ 20

Face Validity............................................................................................................ 20

    Evidence Based on Test Content............................................................................ 21

    Evidence Based on Test-Criterion Relationships.................................................... 22

Correlations Between ANRA Test 1 and Test 2....................................................... 25

    Evidence of Convergent and Discriminant Validity ................................................. 25

    Correlations Between ANRA and Watson-Glaser

    Critical Thinking Appraisal—Short Form ........................................................... 25

    Correlations Between ANRA and Other Tests .................................................. 26

Chapter 7 Using ANRA as an Employment Selection Tool .................................. 27

Employment Selection ............................................................................................ 27

    Using ANRA in Making a Hiring Decision ............................................................... 27

    Differences in Reading Ability, Including the Use of English

    as a Second Language .......................................................................................... 29

    Using ANRA as a Guide for Training, Learning, and Education.............................. 29

    Fairness in Selection Testing .................................................................................. 30

    Legal Considerations......................................................................................... 30

    Group Differences and Adverse Impact ............................................................ 30

    Monitoring the Selection System....................................................................... 31

    References................................................................................................................... 32

Appendices

Appendix A Description of the Normative Sample .................................................... 35

    Appendix B ANRA Total Raw Scores, Mid-Point Percentile Ranks, and T Scores by Norm Group....................................................................... 37

Appendix C Combined Watson-Glaser and ANRA T Scores and Percentile Ranks by Norm Group.......................................................... 39


Tables

Table 5.1 Coefficient Alpha, Odd-Even Split-Half Reliability, and
    Standard Error of Measurement (SEM) for RANRA (from Rust, 2002, p. 85).......................................................................... 17

    Table 5.2 ANRA Means, Standard Deviations (SD), Standard Errors of Measurement (SEM), and Internal Consistency Reliability Coefficients (Alpha) ............................................................................... 18

    Table 5.3 ANRA Test-Retest Stability (N = 73)...................................................... 19

    Table 6.1 Evidence of ANRA Criterion-Related Validity (Total Raw Score) of Job Incumbents in Various Finance-Related Occupations and Position Levels ............................................................................... 24

    Table 6.2 Correlations Between Watson-Glaser Critical Thinking Appraisal—Short Form and ANRA (N = 452) ........................................ 25

    Table 6.3 Correlations Between ANRA, the Miller Analogies Test for Professional Selection (MAT for PS), and the Differential Aptitude Tests for Personnel and Career Assessment—Numerical Ability (DAT for PCA—NA) .................................................................... 26

Figure

Figure 4.1 The Relationship of Percentiles to T Scores ......................................... 14


    Acknowledgements

    Pearson’s Talent Assessment group would like to recognize and thank Professor John Rust, Director of the Psychometrics Center at the University of Cambridge, United Kingdom, for his seminal efforts that led to his development of the Rust Advanced Numerical Reasoning Appraisal (RANRA). This manual details our adaptation of RANRA for use in the United States—the Advanced Numerical Reasoning Appraisal (ANRA).

    We are indebted to numerous professionals and organizations for their assistance during several phases of our work—project design, data collection, statistical data analyses, editing, and publication.

We acknowledge the efforts of Julia Kearney, Sampling Projects Coordinator; Jane McDonald, Sampling Recruiter; Terri Garrard, Study Manager; David Quintero, Clinical Handscoring Supervisor; Hector Solis, Sampling Manager; and Victoria Locke, Director, Field Research, in driving the data collection activities. Nishidha Goel helped to collate and prepare the data.

    We thank Zhiming Yang, PhD, Psychometrician, and JJ Zhu, PhD, Director of Psychometrics, Clinical Products. Dr. Yang’s technical expertise in analyzing the data and Dr. Zhu's psychometric leadership ensured the high level of psychometric integrity of the results.

Our thanks also go to Toby Mahan and Troy Beehler, Project Managers, for diligently managing the logistics of this project. Toby and Troy worked with several team members from the Technology Products Group, Pearson, to ensure the high quality and accuracy of the computer interface. These dedicated individuals included Paula Oles, Manager, Software Quality Assurance; Christina McCumber, Software Quality Assurance Analyst; Matt Morris, Manager, System Development; Maurya Buchanan, Technical Writer; and Alan Anderson, Director, Technology Products Group. Dawn Dunleavy, Senior Managing Editor; Konstantin Tikhonov, Project Editor; and Marion Jones, Director, Mathematics, provided editorial guidance. Mark Cooley assisted with the design of the cover.

    Finally, we wish to acknowledge the leadership, guidance, support, and commitment of the following people through all the phases of this project: Jenifer Kihm, PhD, Senior Product Line Manager, Talent Assessment; John Toomey, Director, Talent Assessment; Paul McKeown, International Product Development Director; Judy Chartrand, PhD, Director, Test Development; Gene Bowles, Vice President, Publishing and Technology; Larry Weiss, PhD, Vice President, Psychological Assessment Products Group; and Aurelio Prifitera, PhD, Group President and CEO of Clinical Assessment/Worldwide.

    Kingsley C. Ejiogu, PhD, Research Director

    John Trent, M.S., Research Director

    Mark Rose, PhD, Research Director


    Chapter 1

    Introduction

    The Advanced Numerical Reasoning Appraisal (ANRA) measures the ability to recognize,

    understand, and apply mathematical and statistical reasoning. Specifically, ANRA measures

    numerical reasoning abilities that involve deduction, interpretation, and evaluation. Numerical

    reasoning, as measured by ANRA, is operationally defined as the ability to correctly perform the

    domain of tasks represented by two sets of items—Comparison of Quantities and Sufficiency of

    Information. Both require the use of analytical skills rather than straightforward computational

    skills. The key attribute ANRA measures is an individual’s ability to apply numerical reasoning

    to everyday problem solving in professional and business settings.

    Starkey (1992) describes numerical reasoning as comprising “a set of abilities that are used to

    operate upon or mentally manipulate representations of numerosity” (p. 94). Research suggests

    that numerical reasoning abilities exist even in infancy, before children begin to receive explicit

    instruction in mathematics in school (Brannon, 2002; Feigenson, Dehaene, & Spelke, 2004;

    Spelke, 2005; Starkey, 1992; Wynn, Bloom, & Chiang, 2002). As Spelke (2005) observed,

    children harness these core abilities when they learn mathematics, and adults use the core abilities

    to engage in mathematical and scientific thinking.

    The numerical reasoning skill is the foundation of all other numerical ability (Rust, 2002). This

    skill enables individuals to learn how to evaluate situations, how to select and apply strategies for

    problem-solving, how to draw logical conclusions using numerical data, how to describe and

develop solutions, and how to recognize when and how to apply those solutions. Eventually, one is able

    to reflect on solutions to problems and determine whether the solutions make sense.

The nature of work is changing significantly, and there is an increased demand for a new kind of

    worker—the knowledge worker (Hunt, 1995). As Facione (2006) observed, though the ability to

    think critically and make sound decisions does not absolutely guarantee a life of happiness and

    economic success, having this ability equips an individual to improve his or her future and

    contribute to society. As the Internet has transformed home life and leisure time, people have

    been deluged with data of ever-increasing complexity. They must select, interpret, digest,

    evaluate, learn, and apply information.

    Employers are typically interested in tests that measure candidates' ability to apply constructively

    and critically, rather than by rote, what they have learned. A person can be trained or educated to


    engage in numerical reasoning; as a result, tests that measure the ability to use mathematical

    reasoning within the context of work have an important function in career development. Such

    tests enable an organization to identify candidates who may need to improve their skills to

    enhance their work effectiveness and career success.

Numerical Reasoning and Critical Thinking

In a skills search of the O*Net OnLine database for “Mathematics” (defined by O*Net OnLine as

    “using mathematics to solve problems”) and “Critical Thinking” (defined by O*Net OnLine as

    “using logic and reasoning to identify the strengths and weaknesses of alternative solutions,

    conclusions, or approaches to problems”), both of these skills were rated as “Very Important” for

    as many as 99 occupations (accountant, actuary, auditor, financial analyst, government service

    executive, management analyst, occupational health and safety specialist, etc.). Numerical

    reasoning and critical thinking are essential parts of the cognitive complexity that is a basic factor

    for understanding group differences in work performance (Nijenhuis & Flier, 2005).

    Both numerical reasoning and critical thinking are higher-order thinking skills—“fundamental

    skills that are essential to being a responsible, decision-making member of the work-place”

    (Paul & Nosich, 2004, p. 5). Paul and Nosich contrasted the higher-order thinking skills with such

    lower-order thinking skills as rote memorization and recall, and they noted that critical thinking

    could be applied to any subject matter and any situation where reasoning is relevant. Such a

    subject matter or situation could range from accounting (Kealy, Holland, & Watson, 2005;

    American Institute of Certified Public Accountants, 1999), through medicine (Vandenbroucke,

    1998), to truck driving (Nijenhuis & Flier, 2005). As Paul and Nosich (2004) stated, in any

    context where we are thinking well, we are thinking critically.

    The enhancement of critical thinking in U.S. college students is a national priority (National

    Educational Goals Panel, 1991). In a paper commissioned by the United States Department of

    Education, Paul and Nosich (2004) highlighted what the National Council for Excellence in

    Critical Thinking Instruction regarded as a basic principle of critical thinking instruction as

    applied to subject-matter teaching: “to achieve knowledge in any domain, it is essential to think

    critically” (Paul & Nosich, p. 33). Critical thinking is the skill that is required to increase the

    probability of desirable outcomes in our lives, such as making the right career choice, using

    money wisely, or planning our future. Such critical thinking is reasoned, purposeful, and goal

    directed. At the cognitive level, such critical thinking involves solving problems, formulating

inferences, calculating likely outcomes, and making decisions. Once people have developed this


    critical thinking skill, they are able to apply it in a wide variety of circumstances. Critical

    thinking can involve proper language use, applied logic, and practical mathematics.

Because solving ANRA items requires higher-order numerical reasoning rather than rote calculation,

using the Watson-Glaser Critical Thinking Appraisal® (a reliable and valid test of verbal

critical thinking) in conjunction with ANRA provides a demanding, high-level measurement of both

numerical reasoning and verbal critical thinking. These two skills are especially valuable when

recruiting in a competitive talent market.

    In response to requests from Watson-Glaser Critical Thinking Appraisal customers in the United

    Kingdom, The Psychological Corporation (now Pearson) in the UK developed the Rust Advanced

    Numerical Reasoning Appraisal (RANRA) in 2000 as a companion numerical reasoning test for

    the Watson-Glaser Critical Thinking Appraisal. In 2006, Pearson adapted RANRA to enhance

    the suitability and applicability of the test in the United States. This manual contains detailed

    information on the U.S. adaptation—ANRA.


    Chapter 2

    History and Development of ANRA

    Description of the Test

    ANRA consists of a set of two tests: Test 1—Comparison of Quantities and Test 2—Sufficiency

    of Information. The candidate must apply his or her numerical reasoning skills to decisions that

    reflect the wide variety of numerical estimation and analytic tasks frequently encountered in

    many everyday situations at work or in a learning environment.

    The two ANRA tests are designed to measure different, but interdependent, aspects of numerical

    reasoning. The tests require the candidate to consider alternatives (either by comparing quantities

    or judging information to be sufficient) in relation to given problems. The examinee's task is to

    study each problem and to evaluate the appropriateness or validity of the alternatives. The ANRA

    maximum total raw score is 32.

    Because ANRA is intended as a test of numerical reasoning power rather than speed, there is no

    rigid time limit for taking the test. Candidates should be given as much time as they reasonably

    need to finish the test. An individual typically completes the test in about 45 minutes. About 90%

    of the 452 individuals in the normative group who were employed in professional, management,

    and higher-level positions completed the test within 75 minutes.

Adapting RANRA

The Rust Advanced Numerical Reasoning Appraisal (RANRA) was adapted to reflect U.S.

    English and U.S. measurement units. Because RANRA measures reasoning more than

    computation, only the measurement units were changed and the original numbers were kept,

    except in cases where it affected the realism of the situation. For example, “82 kilograms” was

    changed to “82 pounds,” though 82 kg = 180.4 lbs. Similarly, “5,000 British pounds sterling” was

    changed to “5,000 U.S. dollars,” though 5,000 British pounds sterling ≠ 5,000 U.S. dollars.

    ANRA contains the original 32 RANRA items plus additional items for continuous test

    improvement purposes. All the items were reviewed by a group comprising 16 individuals—

    researchers in test development, financial analysts, business development professionals,

    industrial/organizational psychologists, and editors in test publishing. Item sentence construction

    was modified in some items, based on input from the American reviewers.


Development of RANRA

In developing RANRA, Rust (2002) first conducted a conceptual analysis of the role of critical thinking

    in the use of mathematics. Through this conceptual analysis, he identified the two subdomains of

    comparison of quantities and sufficiency of information as the key concepts in developing an

    assessment of mathematical reasoning. Rust then constructed 80 items and had a panel of

    educators and psychologists evaluate and modify them, and then generated the pilot version of

    RANRA. This pilot version of RANRA was administered to 76 students and staff from diverse

    subject backgrounds within the University of London. The data were subjected to detailed

    analysis at the item level. Distractor analysis led to the modification of some items. Item-

    difficulty values were calculated for each item, based on the proportion of examinees passing

    each item. The discrimination index was also calculated, and those items that showed they were

    measuring a common quality in numerical reasoning were identified and retained. This approach

    led to the development of the 32-item RANRA.
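The item-analysis steps described above (item difficulty as the proportion of examinees passing each item, plus a discrimination index) can be illustrated with a short sketch. This is not the analysis code used for RANRA; it is a minimal illustration that assumes a 0/1-scored response matrix and uses the corrected item-total correlation as the discrimination index.

```python
import numpy as np

def item_analysis(responses):
    """Item difficulty and discrimination for a 0/1-scored response matrix.

    responses: array of shape (n_examinees, n_items); 1 = correct, 0 = incorrect.
    Returns one difficulty value and one discrimination value per item.
    """
    responses = np.asarray(responses, dtype=float)
    # Difficulty: proportion of examinees passing each item.
    difficulty = responses.mean(axis=0)

    total = responses.sum(axis=1)
    discrimination = np.empty(responses.shape[1])
    for j in range(responses.shape[1]):
        # Corrected item-total correlation: the item vs. the total score excluding that item.
        rest = total - responses[:, j]
        discrimination[j] = np.corrcoef(responses[:, j], rest)[0, 1]
    return difficulty, discrimination

# Simulated data standing in for the 76-person pilot sample and 80 pilot items.
rng = np.random.default_rng(0)
simulated = (rng.random((76, 80)) < 0.6).astype(int)
p_values, d_values = item_analysis(simulated)
print(p_values[:3], d_values[:3])
```

Items with very low discrimination values would be candidates for modification or removal, which is the kind of screening that reduced the 80 pilot items to the final 32.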


    Chapter 3

    Directions for Administration

    General Information

    ANRA is administered through the online testing platform at TalentLens.com, an Internet-based

testing system designed by Pearson for the administration, scoring, and reporting of professional

    assessments. Instructions for administrators on how to order and access the test online are

    provided at TalentLens.com. Instructions for accessing ANRA interpretive reports are provided

    on the website. After a candidate has taken ANRA online, the test administrator can use the link

    Pearson provides to review the candidate’s results in an interpretive report.

Preparing for Administration

Being thoroughly prepared before administering the test results in a more efficient administration

    session. Test administrators should take ANRA prior to administering the test and comply with

    the directions. Candidates are not allowed to use calculators or similar calculation devices while

    completing the test. Test administrators should provide candidates with pencils, an eraser, and a

    sheet of paper to write their calculations if needed.

    Test administration must comply with the code of practice of the testing organization, applicable

    government regulations, and the recommendations of the test publisher. Candidates should be

    informed before the testing session about the nature of the assessment, why the test is being used,

    the conditions under which they will be tested, and the nature of any feedback they will receive.

    Test administrators need to assure candidates that their test results will remain confidential.

    The test administrator must obtain informed consent from the candidate before testing. The

informed consent is a written statement, signed by the candidate, that explains the type of test to be

administered, the purpose of the test, and who will have access to the test data. It is

    the responsibility of the test user to ensure that candidates understand the testing procedure. The

    test administrator should also ensure that all relevant background information from the candidate

    is collected and verified (e.g., name, gender, educational level, current employment, occupational

    history, and so on).


Testing Conditions

The test administrator has a significant responsibility to ensure that the conditions under which

the test is taken do not introduce undesirable influences on the test performance of candidates. Such

    undesirable influences can either inflate or reduce the test scores of candidates. Poor

    administration of a test undermines the value of test scores and makes an accurate interpretation

    of results very difficult, if not impossible.

    It is important to ensure that the test is administered in a quiet, well-lit room. The following

    conditions are necessary for accurate scores and for maintaining the cooperation of the examinee:

    good lighting, comfortable seating, adequate desk or table space, comfortable positioning of the

    computer screen, keyboard and mouse, and freedom from noise and other distractions.

    Interruptions and distractions from outside should be kept to a minimum, if not eliminated.

Answering Questions

The test administrator may answer examinees' questions about the test before giving the signal to

    begin. To maintain standard testing conditions, answer such questions by re-reading the

    appropriate section of these directions. Do not volunteer new explanations or examples. The test

    administrator is responsible for ensuring that examinees understand the correct way to indicate

    their answers and what is required of the examinees. The question period should never be rushed

    or omitted.

    If any examinees have routine questions after the testing has started, try to answer them without

    disturbing the other examinees. However, questions about the test items should be handled by

    telling the examinee to do his or her best.

Administering the Test

After the examinee is seated at the computer and the initial instruction screen for ANRA appears,

    say,

    The on-screen directions will take you through the entire process that begins with some demographic questions. After you have completed these questions, the test will begin. You will have as much time as you reasonably need to complete the test items. The test ends with a few additional demographic questions. Do you have any questions before starting the test?

    Answer any questions and say, Please begin the test.


    Once the examinee clicks the “Start Your Test” button, administration begins with the first page

    of questions. The examinee may review test items at the end of the test. Allow examinees as

    much time as they reasonably need to complete the test. Average completion time is about 45

    minutes. About 90% of candidates are finished with the test within 75 minutes.

    If an examinee’s computer develops technical problems during testing, the test administrator

    should move the examinee to another suitable computer location. If the technical problems cannot

    be solved by moving to another computer location, the administrator should contact Pearson’s

    Technical Support at 1-888-298-6227 for assistance.

Scoring and Reporting

Scoring is automatic, and the report is typically available within a minute after the test is

    completed. A link to the report will be available on the online testing platform at TalentLens.com.

    Adobe® Acrobat Reader® is required to open the report. The test administrator may view, print, or

    save the candidate’s report.

Test Security

ANRA scores are confidential and should be stored in a secure location accessible only to

    authorized individuals. It is unethical and poor test practice to allow test-score access to

    individuals who do not have a legitimate need for the information. Storing test scores in a locked

    cabinet or password-protected file that can only be accessed by designated test administrators will

    help ensure the security of the test scores. The security of testing materials (e.g., access to online

    tests) and protection of copyright must also be maintained by authorized individuals. Avoid

    disclosure of test access information such as usernames or passwords, and only administer ANRA

    in proctored environments. All the computer stations used in administering ANRA must be in

locations that can be easily supervised and that provide an adequate level of security.

Concluding Test Administration

At the end of the testing session, thank the candidate(s) for their participation and check the

    computer station(s) to ensure that the test is closed.

    ANRA can be a demanding test for some candidates. It may be constructive to clarify what part

    the test plays within the context of the selection or assessment procedures. It is also constructive

    to reassure candidates about the confidentiality of their test scores.


Administering ANRA and Watson-Glaser Critical Thinking Appraisal in a Single Testing Session

When administering the ANRA and the Watson-Glaser in a single testing session, administer the

    Watson-Glaser first. Just as ANRA is intended as a test of numerical reasoning power rather than

    speed, the Watson-Glaser is intended as a test of critical thinking power rather than speed. Both

    tests are untimed; administration of ANRA and the Watson-Glaser Short Form in one session

    should take about 1 hour and 45 minutes.

Accommodating Examinees With Disabilities

The Americans with Disabilities Act (ADA) of 1990 requires an employer to reasonably

    accommodate the known disability of a qualified applicant, provided such accommodation would

    not cause an “undue hardship” to the operation of the employer’s business.

    The test administrator should provide reasonable accommodations to enable candidates with

    special needs to comfortably take the test. Reasonable accommodations may include, but are not

    limited to, modifications to the test environment (e.g., high desks) and medium (e.g., having a

    reader read questions to the examinee, or increasing the font size of questions) (Society for

    Industrial and Organizational Psychology, 2003). In situations where an examinee’s disability is

    not likely to impair his or her job performance, but may hinder the examinee’s performance on

    ANRA, the organization may want to consider waiving the test or de-emphasizing the score in

favor of other application criteria. Interpretive data as to whether scores on ANRA are comparable

    for examinees who are provided reasonable accommodations are not available at this time due to

    the small number of examinees who have requested such accommodations.


    Chapter 4

    ANRA Norms Development

    Norms provide a basis for evaluating an individual's score relative to the scores of other

    individuals who took the same test. Norms allow for the conversion of raw scores to more useful

    comparative scores, such as percentile ranks. Typically, norms are constructed from the scores of

    a large sample of other individuals who took the test under similar conditions. This group of

    individuals is called the norm group.

    The characteristics of the sample used for preparing norms are critical in determining the

    usefulness of those norms. For such purposes as selecting from among applicants to fill a

    particular job, normative information derived from a specific, relevant, well-defined group might

    be most useful. However, the composition of the sample of job applicants is influenced by a

    variety of situational factors, including the job demands and local labor market conditions.

    Because such factors can vary across jobs, locations, and over time, the limitations on the

    usefulness of any set of published norms should be recognized.

    When a test is used to make employment decisions, the most appropriate norm group is one that

    is representative of those who will be taking the test in the local situation. It is best, whenever

    possible, to prepare local norms by accumulating the test scores of applicants, trainees, or

    employees. One of the factors that must be considered in establishing norms is sample size. Data

    from small samples tend to be unstable and the presentation of percentile ranks for all possible

    scores is imprecise. As a result, the use of in-house norms is only recommended when the sample

    is sufficiently large (about 100 or more people). Until a sufficient and representative number of

    cases has been collected, the test user should consider norms based on other similar groups rather

    than from local data with a small sample size. In the absence of adequate local norms, the norms

    provided in Appendixes A and B should be used to guide the interpretation of scores.

    Using ANRA as a Norm- or Criterion-Referenced Test

    ANRA may be used as a norm-referenced or as a criterion-referenced instrument. A norm-

    referenced test enables a human resource professional to interpret an individual's test performance

    in comparison to a particular normative group. An individual's performance on a criterion-

    referenced instrument can only indicate whether or not that individual meets certain, predefined

    criteria. It is appropriate to use ANRA as a norm-referenced instrument in the process of


employment selection. For optimal results in such decisions, the overall total score, rather than

the subtest scores, should be used. Subtest scores represent fewer items and, therefore, are less

    stable than the total score. However, as a criterion-referenced measure, it is feasible to use subtest

    scores to analyze the numerical reasoning abilities of a class or larger group and to determine the

    types of numerical reasoning or critical thinking training that may be most appropriate.

    In norm-referenced situations, raw scores need to be converted before they can be compared.

    Though raw scores may be used to rank candidates in order of performance, little can be inferred

    from raw scores alone. There are two main reasons for this. First, raw scores cannot be treated as

    having equal intervals. For example, it would be incorrect to assume that the difference between

    raw scores of, say, 20 and 21 is of the same significance as the difference between raw scores of

30 and 31. Second, ANRA raw scores may not be normally distributed and, hence, may not

satisfy the assumptions of the parametric statistics required for the proper evaluation

    of validity.

Using Norms to Interpret Scores

The ANRA norms presented in Appendix B and Appendix C were derived from data collected

    February 2006 through June 2006, from 452 adults in a variety of employment settings. The

    tables in Appendix B (Tables B.1 and B.2) show the ANRA total raw scores with corresponding

    percentile ranks and T scores for the identified norm groups.

    When using the norms tables in Appendix B, look for a group that is similar to the individual or

    group tested. For example, you would compare the test score of a person who applied for a

    Manager position with norms derived from the scores of other managers. When using the norms

    in Appendix B to interpret candidates’ scores, keep in mind that norms are affected by the

    composition of the groups that participated in the normative study. Therefore, it is important to

    examine specific position level and occupational characteristics of a norm group.

    By comparing an individual’s raw score to the data in a norms table, it is possible to determine

the percentile rank corresponding to that score. The percentile rank indicates an individual's

    relative position in the norm group. Percentiles should not be confused with percentage scores

    that represent the percentage of correct items. Percentiles are derived scores that are expressed in

    terms of the percent of people in the norm group scoring equal to or below a given raw score.

    Percentiles have the advantage of being readily understood and universally applicable. However,

    although percentiles are useful for expressing an examinee’s performance relative to other

    candidates, percentiles have limitations. For example, percentile ranks do not have equal

    intervals. While percentiles indicate the relative position of each candidate in relation to the


    normative sample, they do not show the amount of difference between scores. In a normal

    distribution of scores, percentile ranks tend to cluster around the 50th percentile. This clustering

    affects scores in the average range the most because a difference of one or two raw score points

may change the percentile rank considerably. Extreme scores are affected less; a change of one or two raw

    score points at the extremes typically does not produce a large change in percentile ranks. These

    factors should be considered when interpreting percentile ranks.

Converting Raw Scores to Percentile Ranks

To find the percentile rank of a candidate’s raw score, locate the ANRA total raw score in Table

    B.1 or B.2. The corresponding percentile rank is read from the selected norm group column. For

    example, if a person applying for a job as a Director had a score of 25 on ANRA, it is appropriate

    to use the Executives/Directors norms in Table B.1 for comparison. In this case, the percentile

    rank corresponding to a raw score of 25 is 67. This percentile rank indicates that about 67% of the

    people in the norm group scored lower than or equal to a score of 25 on ANRA, and about 33%

    scored higher than a score of 25 on ANRA. The lowest raw score will lie at the 1st percentile; the

    median raw score will fall at the 50th percentile, and the highest raw score will lie at the 99th

    percentile.
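A norms-table lookup of this kind amounts to a simple mapping from raw scores to percentile ranks for the chosen norm group. In the sketch below, only the raw score of 25 mapping to the 67th percentile is taken from the example above; the other entries are invented placeholders, not values from Table B.1.

```python
# Illustrative fragment of a norms column; only 25 -> 67 comes from the example
# in the text, and the remaining entries are invented placeholders.
executives_directors_norms = {23: 55, 24: 61, 25: 67, 26: 73}

def percentile_rank(raw_score, norms):
    """Return the tabled percentile rank for a raw score, or None if untabled."""
    return norms.get(raw_score)

rank = percentile_rank(25, executives_directors_norms)
print(f"A raw score of 25 falls at the {rank}th percentile: "
      f"about {rank}% of the norm group scored at or below 25.")
```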

    Each group’s size (N), raw score mean, and raw score standard deviation (SD) are shown at the

    bottom of the norms tables. The group raw score mean or average is calculated by summing the

    raw scores and dividing the sum by the total number of examinees. The standard deviation

    indicates the amount of variation in a group of scores. In a normal distribution, approximately

    two-thirds (68.26%) of the scores are within the range of 1 SD below the mean to 1 SD above the

    mean. These statistics are often used in describing a sample and setting cut scores. For example, a

    cut score may be set as one SD below the mean. In compliance with the Civil Rights Act of 1991,

    Section 5 (a) (1), as amended, the norms provided in Appendix B and Appendix C combine data

    for males and females, and for white and minority candidates.
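As a small illustration of these descriptive statistics and of setting a cut score one standard deviation below the mean (the raw scores below are invented, not normative data):

```python
import statistics

raw_scores = [18, 21, 24, 19, 27, 22, 25, 20, 23, 26]   # invented scores for illustration

mean = statistics.mean(raw_scores)
sd = statistics.stdev(raw_scores)        # sample standard deviation
cut_score = mean - sd                    # e.g., a cut score set at 1 SD below the mean

print(f"Mean = {mean:.1f}, SD = {sd:.1f}, cut score (mean - 1 SD) = {cut_score:.1f}")
```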

    Using Standard Scores to Interpret Performance

    Test results can be reported in many different formats. Examples of these formats include raw

    scores, percentiles, and various forms of standard scores. Standard scores express the score of

    each individual in terms of its distance from the mean. Examples of standard scores are z scores

    and T scores. Standard scores do not suffer from the drawbacks associated with percentiles. The

    advantage of percentiles is that they are readily understood and, therefore, immediately

    meaningful. As indicated above, however, there is a risk of percentiles being confused with


    percentage scores, or of percentiles being interpreted as an interval scale. Standard scores avoid

    the unequal clustering of scores by adopting a scale based on standard deviation units.

    The basic type of standard score is the z score, which is a raw score converted to a standard

    deviation unit. Thus a raw score that is 0.53 standard deviations below the mean score for the

    group receives a z score of –0.53. z scores are generally in the –3.00 to + 3.00 range. However,

    there are certain disadvantages in saying that a person has a score of –0.53 on a test. From the

    point of view of presentation, the use of decimal points and the negative symbol is unappealing.

Hence, certain transformations have been adopted that present standard scores in a more

user-friendly form.

Converting z Scores to T Scores

To convert a z score to a T score, multiply the z score by 10 and add 50. Thus, a z score of –0.53

    becomes a T score of 44.7, which is then rounded, as a matter of convention, to the nearest whole

    number, that is, 45. A set of T scores has a mean of 50 and at each standard deviation point there

    is a score difference of 10. Thus, a T score of 30 is at two standard deviations below the mean,

    while a T score of 60 is one standard deviation above the mean. The T score transformation

    results in a scale that runs from 10 to 90, with each 10th interval coinciding with a standard

    deviation point. Appendix B shows ANRA T scores. Appendix C shows the sum of Watson-

    Glaser and ANRA T scores and their corresponding percentiles. Because the Watson-Glaser and

    ANRA do not measure identical constructs, their combined T scores must be derived by first

    transforming separate Watson-Glaser and ANRA raw score pairs to their respective T scores, and

    then summing the T scores. Figure 4.1 illustrates the relationship between percentiles and

    T scores.


    Figure 4.1 The Relationship of Percentiles to T Scores
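The z-to-T transformation is a single linear step, and the combined ANRA plus Watson-Glaser score described in this chapter is simply the sum of the two separate T scores. The sketch below illustrates both; the norm-group means and standard deviations are hypothetical placeholders, so it will not reproduce the tabled values in Appendixes B and C.

```python
def z_score(raw, mean, sd):
    """Distance of a raw score from the norm-group mean, in SD units."""
    return (raw - mean) / sd

def t_score(z):
    """T score: multiply the z score by 10, add 50, and round by convention."""
    return round(10 * z + 50)

# Example from the text: a z score of -0.53 gives 44.7, rounded to 45.
print(t_score(-0.53))                                          # 45

# Hypothetical norm-group statistics, for illustration only.
anra_t = t_score(z_score(raw=25, mean=20.0, sd=5.0))           # T = 60
watson_glaser_t = t_score(z_score(raw=30, mean=27.0, sd=6.0))  # T = 55
combined = anra_t + watson_glaser_t                            # sum of T scores = 115
print(anra_t, watson_glaser_t, combined)
```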

Using ANRA and Watson-Glaser Critical Thinking Appraisal Together

The ANRA and Watson-Glaser combined score provides a broader range of critical reasoning

    skills than would be obtained by the use of each test alone. Scores from ANRA and the Watson-

    Glaser can be combined by first converting each total raw score to a T score and then adding the

    two T scores together. The sum of the T scores can also be converted to percentile ranks.

    Appendix C (Tables C.1 and C.2) shows the percentile ranks of the sum of ANRA and

    Watson-Glaser Short Form T scores.

    Another potential benefit from using ANRA and the Watson-Glaser together is in the expected

    difference between scores on the two tests. This expected difference depends on the type of norm

    group to which the candidate belongs. Generally speaking, candidates in financial or scientific

    occupations are expected to score higher on ANRA than on the Watson-Glaser. On the other

    hand, managers, particularly in fields where critical thinking using language is a key skill, and

    employees in occupations that do not require a great deal of numeracy, will be expected to

    perform better on the Watson-Glaser than on ANRA. By examining the difference between a

    candidate’s Watson-Glaser and ANRA scores, the user can make appropriate development

    suggestions to the candidate.


    Chapter 5

    Evidence of Reliability

    The reliability of a measurement instrument refers to the accuracy, consistency, and precision of

    test scores across situations (Anastasi & Urbina, 1997). Test theory posits that a test score is an

    estimate of an individual’s hypothetical true score, or the score an individual would receive if the

    test were perfectly reliable. In actual practice, however, some measurement error is to be

    expected. A reliable test has relatively small measurement error.

    The methods most commonly used to estimate test reliability are test–retest (the stability of test

    scores over time), alternate forms (the consistency of scores across alternate forms of a test), and

internal consistency of the test items (e.g., Cronbach’s alpha coefficient; Cronbach, 1970).

    Decisions about the form of reliability to be used in comparing tests depend on a consideration of

    the nature of the error that is involved in each form. Different types of error can be operating at

    the same time, so it is to be expected that reliability coefficients will differ in different situations

    and on different groupings and samplings of respondents. An appropriate estimate of reliability

    can be obtained from a large representative sample of the respondents to whom the test is

    generally administered.

Reliability Coefficients and Standard Error of Measurement

The reliability of a test is expressed as a correlation coefficient, which represents the consistency

    of scores that would be obtained if a test could be given an infinite number of times. Reliability

    coefficients are a type of estimate of the amount of error associated with test scores and can range

    from .00 to 1.00. The closer the reliability coefficient is to 1.00, the more reliable the test. A

    perfectly reliable test would have a reliability coefficient of 1.00 and no measurement error. A

    completely unreliable test would have a reliability coefficient of .00. The U.S. Department of

    Labor (1999) provides the following general guidelines for interpreting a reliability coefficient:

    above .89 is considered “excellent,” .80–.89 is “good,” .70–.79 is considered “adequate,” and

    below .70 “may have limited applicability.”
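Expressed as a small helper, these guideline bands look like the following (the labels and cut points are exactly those quoted from the U.S. Department of Labor, 1999):

```python
def reliability_label(r):
    """Classify a reliability coefficient using the U.S. Department of Labor (1999) bands."""
    if r > 0.89:
        return "excellent"
    if r >= 0.80:
        return "good"
    if r >= 0.70:
        return "adequate"
    return "may have limited applicability"

for value in (0.92, 0.85, 0.74, 0.65):
    print(value, "->", reliability_label(value))
```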


    Repeated testing leads to some variation. Consequently, no single test event effectively measures

    an examinee’s actual ability with complete accuracy. Therefore, an estimate of the possible

    amount of error present in a test score, or the amount that scores would probably vary if an

examinee were tested repeatedly with the same test, is necessary. This estimate of error is known

    as the standard error of measurement (SEM). The SEM decreases as the reliability of a test

    increases. A large SEM denotes less reliable measurement and less reliable scores. The standard

    error of measurement is calculated with the formula:

SEM = SD √(1 – rxx)

    In this formula, SEM represents the standard error of measurement, SD represents the standard

    deviation of the distribution of obtained scores, and rxx represents the reliability coefficient of the

    test (Cascio, 1991, formula 7-11).
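The formula above translates directly into code. The values used here reproduce one row of Table 5.2 (Executives/Directors: SD = 6.0, alpha = .85, SEM = 2.32).

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - r_xx)  (Cascio, 1991, formula 7-11)."""
    return sd * math.sqrt(1 - reliability)

# Executives/Directors row of Table 5.2: SD = 6.0 and alpha = .85 give an SEM of about 2.32.
print(round(standard_error_of_measurement(6.0, 0.85), 2))   # 2.32
```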

    The SEM is a quantity that is added to and subtracted from an examinee’s standard test score to

    create a confidence interval or band of scores around the obtained standard score. The confidence

    interval is a score range that, in all likelihood, includes the examinee’s hypothetical “true” score

    that represents the examinee’s actual ability. A true score is a theoretical score entirely free of

    error. Since the true score is a hypothetical value that can never be obtained because testing

    always involves some measurement error, the score obtained by an examinee on any test will vary

    somewhat from administration to administration. As a result, any obtained score is considered

    only an estimate of the examinee’s “true” score. Approximately 68% of the time, the observed

    standard score will lie within +1.0 and –1.0 SEM of the true score; 95% of the time, the observed

    standard score will lie within +1.96 and –1.96 SEM of the true score; and 99% of the time, the

    observed standard score will lie within +2.58 and –2.58 SEM of the true score.

    Using the SEM means that standard scores are interpreted as bands or ranges of scores, rather

    than as precise points (Nunnally, 1978). To illustrate the use of SEM with an example, assume a

    director candidate obtained a total raw score of 25 on ANRA, with SEM = 2.32. From the

    information in Table B.1, the standard score (T score) for this candidate is 57. We can, therefore,

    infer that if this candidate were administered a large number of alternative forms of ANRA, 95%

of this candidate’s T scores would lie within the range between 57 – 1.96 × 2.32 ≈ 52 T score

points and 57 + 1.96 × 2.32 ≈ 62 T score points. We can further infer that the expected average of

    this person’s T scores from a large number of alternate forms of ANRA would be 57.
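A short sketch that reproduces the arithmetic of this example (T = 57, SEM = 2.32, 95% band):

```python
def confidence_band(t_score, sem, z=1.96):
    """Band of scores around an obtained standard score: T -/+ z * SEM."""
    return t_score - z * sem, t_score + z * sem

low, high = confidence_band(57, 2.32)    # 95% band, as in the example above
print(round(low), round(high))           # 52 62
```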


    Thinking in terms of score ranges serves as a check against overemphasizing small differences

    between scores. The SEM may be used to determine if an individual’s score is significantly

    different from a cut score, or if the scores of two individuals differ significantly. An example of

    one general rule of thumb is that the difference between two scores on the same test should not be

    interpreted as significant unless the difference is equal to at least twice the standard error of the

difference (SED), where SED = SEM × √2 (Gulliksen, as cited in Cascio, 1991, p. 143).
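The standard error of the difference and the two-SED rule of thumb can be sketched as follows (the SEM value is taken from Table 5.2; the pairs of scores compared are hypothetical):

```python
import math

def standard_error_of_difference(sem):
    """SED = SEM * sqrt(2)  (Gulliksen, as cited in Cascio, 1991)."""
    return sem * math.sqrt(2)

def differ_significantly(score_a, score_b, sem):
    """Rule of thumb: treat a difference as significant only if it is at least 2 * SED."""
    return abs(score_a - score_b) >= 2 * standard_error_of_difference(sem)

sed = standard_error_of_difference(2.32)
print(round(sed, 2))                         # about 3.28
print(differ_significantly(57, 51, 2.32))    # False: a 6-point gap is less than 2 x 3.28
print(differ_significantly(57, 49, 2.32))    # True: an 8-point gap exceeds 2 x 3.28
```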

RANRA Reliability Studies

Because ANRA is a U.S. adaptation of RANRA, the information on previous studies refers to

    RANRA. For the sample used in the initial development of RANRA in the United Kingdom

    (N = 1546), Cronbach’s alpha coefficient and split-half reliability were .78 for the overall

    RANRA score (Rust, 2002). The reliability coefficients of RANRA for both Test 1 and Test 2

    and for the overall RANRA score are shown in Table 5.1.

Table 5.1 Coefficient Alpha, Odd-Even Split-Half Reliability, and Standard Error of Measurement (SEM) for RANRA (from Rust, 2002, p. 85)

                                       Alpha    Split-Half    SEM
Test 1: Comparison of Quantities        .63        .60        6.32
Test 2: Sufficiency of Information      .70        .71        5.39
RANRA Score                             .78        .78        4.69

    The RANRA score reported in Table 5.1 is a T score transformed from the total raw score, while

    the standard error of measurement reported in the table was based on the split-half reliability

    (Rust, 2002).

ANRA Reliability Studies

Evidence of Internal Consistency

Cronbach’s alpha and the standard error of measurement (SEM) were calculated for the sample

    used for the ANRA norm groups reported in this manual. The internal consistency reliability

    estimates for ANRA total raw score and ANRA subtests are shown in Table 5.2.


    Table 5.2 ANRA Means, Standard Deviations (SD), Standard Errors of Measurement (SEM), and Internal Consistency Reliability Coefficients (Alpha)

ANRA Total Raw Score

Norm Group                                  N     Mean    SD     SEM    Alpha
Executives/Directors                        91    21.3    6.0    2.32    .85
Managers                                    88    20.1    5.6    2.38    .82
Professionals/Individual Contributors      200    22.1    6.4    2.22    .88
Employees in Financial Occupations         198    21.9    6.4    2.22    .88

ANRA Test 1: Comparison of Quantities

Norm Group                                  N     Mean    SD     SEM    Alpha
Executives/Directors                        91    10.9    3.4    1.63    .77
Managers                                    88    10.3    3.4    1.70    .75
Professionals/Individual Contributors      200    11.4    3.6    1.53    .82
Employees in Financial Occupations         198    11.3    3.5    1.57    .80

ANRA Test 2: Sufficiency of Information

Norm Group                                  N     Mean    SD     SEM    Alpha
Executives/Directors                        91    10.4    3.3    1.60    .75
Managers                                    88     9.9    2.9    1.67    .67
Professionals/Individual Contributors      200    10.7    3.3    1.62    .76
Employees in Financial Occupations         198    10.6    3.3    1.58    .77

    The values in Table 5.2 show that the ANRA total raw score possesses good internal consistency

    reliability. The ANRA subtests showed lower internal consistency reliability estimates than the

    ANRA total raw score. Consequently, the ANRA total score, not the subtest scores, should be

    used for optimal hiring results.
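For readers who want to see how an internal consistency coefficient of this kind is computed, the sketch below implements the standard Cronbach's alpha formula, alpha = (k / (k – 1)) × (1 – sum of item variances / variance of total scores). It is a generic illustration with simulated item scores, not the code used to produce Table 5.2.

```python
import numpy as np

def cronbach_alpha(responses):
    """Cronbach's alpha for an (n_examinees x n_items) matrix of item scores."""
    responses = np.asarray(responses, dtype=float)
    k = responses.shape[1]
    item_variances = responses.var(axis=0, ddof=1)
    total_variance = responses.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Simulated 0/1 item scores (32 items, 300 examinees) for illustration only.
rng = np.random.default_rng(1)
ability = rng.normal(size=(300, 1))
items = (ability + rng.normal(scale=1.5, size=(300, 32)) > 0).astype(int)
print(round(cronbach_alpha(items), 2))
```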

    Evidence of Test-Retest Stability

    ANRA was administered on two separate occasions to determine the stability of performance on

    the test over time. A sample of 73 job incumbents representing various occupations and

    organizational levels took the test twice. The average test-retest interval was two weeks. The test-

    retest stability was evaluated using Pearson’s product-moment correlation of the standardized T

    scores from the first and second testing occasions. The test-retest correlation coefficient was

    corrected for the variability of the sample (Allen & Yen, 1979). Furthermore, the standard

    difference (i.e., effect size) was calculated using the mean score difference between the first and

    second testing occasions divided by the pooled standard deviation (Cohen, 1996, Formula 10.4).

    This difference (d), proposed by Cohen (1988), is useful as an index to measure the magnitude of

    the actual difference between two means. The corrected test-retest stability coefficient was .85.

    The difference in mean scores between the first testing and the second testing was statistically

    small (d = –0.03). As the data in Table 5.3 indicate, ANRA demonstrates good test-retest stability

    over time.


    Table 5.3 ANRA Test-Retest Stability (N = 73)

                             First Testing     Second Testing
                             Mean      SD      Mean      SD       r12    Corrected r12    Standard Difference (d)
ANRA Standardized T score    50.1      9.2     49.8      10.0     .82         .85                –0.03
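The statistics in Table 5.3 combine three computations: the Pearson correlation between the two testings, a correction of that correlation for the variability of the sample, and the standardized difference d based on the pooled standard deviation. The sketch below uses simulated score pairs rather than the study data, and it applies the common restriction-of-range correction as one plausible reading of the Allen and Yen (1979) adjustment; the manual does not spell out the exact formula used.

```python
import math
import numpy as np

def corrected_r(r, sample_sd, reference_sd):
    """Correct a correlation for sample variability (restriction-of-range formula).

    This is an illustrative choice; the exact correction used for Table 5.3 is not given.
    """
    ratio = reference_sd / sample_sd
    return r * ratio / math.sqrt(1 - r**2 + (r**2) * ratio**2)

def cohens_d(first, second):
    """Standardized mean difference using the pooled standard deviation."""
    first, second = np.asarray(first, float), np.asarray(second, float)
    pooled_sd = math.sqrt((first.var(ddof=1) + second.var(ddof=1)) / 2)
    return (second.mean() - first.mean()) / pooled_sd

# Simulated T-score pairs standing in for the 73 retested examinees.
rng = np.random.default_rng(2)
true_scores = rng.normal(50, 9, 73)
first = true_scores + rng.normal(0, 4, 73)
second = true_scores + rng.normal(0, 4, 73)

r12 = np.corrcoef(first, second)[0, 1]
print(round(r12, 2),
      round(corrected_r(r12, first.std(ddof=1), 10.0), 2),   # 10.0 = T-score reference SD
      round(cohens_d(first, second), 2))
```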


    Chapter 6

    Evidence of Validity

    Validity refers to the degree to which specific data, research, or theory support the interpretation

    of test scores entailed by proposed uses of tests (American Educational Research Association

    [AERA], American Psychological Association [APA], & National Council on Measurement in

    Education [NCME], 1999). Cronbach (1970) observed that validity is high if a test gives the

    information the decision maker needs. Several sources of validity evidence are discussed next in

    relation to ANRA.

Face Validity

Face validity refers to a test's appearance and what the test seems to measure, rather than what the

    test actually measures. Face validity is not validity in any technical sense and should not be

    confused with content validity. Face validity refers to whether or not a test looks valid to

candidates, administrators, and other observers. If test content does not seem relevant to the

    candidate, the result may be lack of cooperation, regardless of the actual validity of the test. For a

test to function effectively in practical situations, it must be not only objectively valid

but also face valid.

    However, a test cannot be judged solely on whether it “looks right.” Appearance and graphic

    design of a test are no guarantee of quality. Face validity should not be considered a substitute for

    objectively determined validity. As mentioned in the chapter on the development of ANRA,

    ANRA items were reviewed by a group of individuals who provided feedback on the test. The

    reviewers provided their feedback regarding issues like clarity of the items, the extent to which

items appeared to measure numerical reasoning, the extent to which test content appeared relevant to

jobs that required numerical reasoning, and the extent to which they thought the test would yield

    useful information. From the responses by this group, it was evident that ANRA had high face

    validity and participants recognized its relevance to the skills required by employees who deal

    with numbers or project planning. Although the item content of ANRA could not reflect every

    work situation for which the test would be appropriate, the operations and processes required in

    each subtest represent abilities that are valued and readily appreciated.


Evidence Based on Test Content

Evidence based on the content of a test exists when the test includes a representative sample of

    tasks, behaviors, knowledge, skills, abilities, or other characteristics necessary to perform the job.

    Evidence of content validity is usually gathered through job analysis and is most appropriate for

    evaluating knowledge and skills tests.

    Evaluation of content-related evidence is usually a rational, judgmental process

    (Cascio & Aguinis, 2005). In employment settings, the principal concern is with making

    inferences about how well the test samples a job performance domain—a segment or aspect of

    the job performance universe that has been identified and about which inferences are to be made

    (Lawshe, 1975). Because most jobs have several performance domains, a standardized test

    generally applies only to one segment of the job performance universe (e.g., a typing test

    administered to a secretary applies to typing—one job performance domain in the job

    performance universe of a secretary). Thus, the judgment of whether content-related evidence

    exists depends on an evaluation of whether the same capabilities are required in both the job

    performance domain and the test (Cascio & Aguinis, 2005).

    When considering content validity, it is important to recognize that a test attempts to sample the

    area of behavior being measured. It is rarely the purpose of a test to be exhaustive in assessing

    every possible manifestation of a domain. While content exhaustiveness may seem feasible in

    some highly specific areas of achievement, in other measurement situations it would simply not

    be possible. Aptitude, ability and personality tests always aim to achieve representative sampling

    of the behaviors in question, and the evaluation of content validity relates to the degree to which

    this representation has been achieved.

    Evidence of content validity is most easily shown with reference to achievement tests where the

    relationship between the items and the expected manifestation of that ability in real-life situations

    is very clear. Achievement tests are designed to measure how well an individual has mastered a

    particular skill or course of study. From this perspective, it might seem that an informed

    inspection of the contents of a test would be sufficient to establish its validity for such a purpose.

    For example, a test of spelling should consist of spelling items. Nevertheless, a careful analysis of the domain is necessary to ensure that all the important features are covered by the test items, and that

    the features are appropriately represented in the test according to their significance.

    The effect of speed on test scores also needs to be checked. Participants may perform differently

    under the additional pressure of a timed test. There are also implications for test design and

    scoring arising from the interaction of speed and accuracy and from situations where candidates


    fail to finish a timed test. In any case, ANRA is not a speed test and it is unlikely that anyone

    failing to complete the test within a reasonable amount of time would improve his or her score

    significantly if given extra time.

    In an employment setting, evidence of ANRA content-related validity should be established by

    demonstrating that the jobs require the numerical reasoning skills measured by ANRA. Content-

    related validity in instructional settings may be examined for the extent to which ANRA measures

    a sample of the specified objectives of such instructional programs.

    Evidence Based on Test-Criterion Relationships

    One of the primary reasons for using tests is to be able to make an informed prediction about an

    examinee’s potential for future success. For example, selection tests are used to hire or promote

    individuals most likely to be productive employees. The rationale behind using selection tests is that the better an individual performs on the test, the better that individual will perform as an employee.

    Evidence of criterion-related validity addresses the inference that individuals who score better on

    tests will be successful on some criterion of interest. Criterion-related validity evidence indicates

    the statistical relationship (e.g., for a given sample of job applicants or incumbents) between

    scores on the test and one or more criteria, or between scores on the test and independently

    obtained measures of subsequent job performance. By collecting test scores and criterion scores

    (e.g., job performance results, grades in a training course, supervisor ratings), one can determine

    how much confidence may be placed in using test scores to predict job success. Typically,

    correlations between criterion measures and scores on the test serve as indicators of criterion-

    related validity evidence. Provided the conditions for a meaningful validity study have been met

    (e.g., sufficient sample size, and adequate criteria), these correlation coefficients are important

    indicators of the utility of the test.

    The conditions for evaluating criterion-related validity evidence are often difficult to fulfill in the

    ordinary employment setting. Studies of test-criterion relationships should involve a sufficiently

    large number of persons hired for the same job and evaluated for success using a uniform

    criterion measure. The criterion itself should be reliable and job-relevant, and should provide a

    wide range of scores. In order to evaluate the quality of studies of test-criterion relationships, it is

    essential to know at least the size of the sample and the nature of the criterion.

    Assuming that the conditions for a meaningful evaluation of criterion-related validity evidence

    had been met, Cronbach (1970) characterized validity coefficients of .30 or better as having

    “definite practical value.” The U.S. Department of Labor (1999) provides the following general


    guidelines for interpreting validity coefficients: coefficients above .35 are considered “very beneficial,” .21–.35 “likely to be useful,” .11–.20 “depends on the circumstances,” and below .11 “unlikely to be useful.” It is important to point out that even relatively low validities (e.g., .20) may justify the use of a test in a selection program (Anastasi & Urbina, 1997). This is because the practical value of a test depends not only on its validity, but also on other factors,

    such as the base rate for success on the job (i.e., the proportion of people who would be

    successful in the absence of any selection procedure). If the base rate for success on the job is low

    (i.e., few people would be successful on the job), tests with low validity can have considerable

    utility or value. When the base rate is high (i.e., selected at random, most people would succeed

    on the job), even highly valid tests may not contribute significantly to the selection process.
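    As a minimal sketch of these interpretive bands, the hypothetical Python helper below maps a validity coefficient onto the U.S. Department of Labor (1999) categories quoted above; the function name and example values are illustrative only, and the bands assume positive coefficients.

        def interpret_validity(r: float) -> str:
            """Map a validity coefficient onto the U.S. Department of Labor (1999) bands."""
            if r > .35:
                return "very beneficial"
            elif r >= .21:
                return "likely to be useful"
            elif r >= .11:
                return "depends on the circumstances"
            else:
                return "unlikely to be useful"

        print(interpret_validity(.44))  # "very beneficial" (cf. Total Performance in Table 6.1)
        print(interpret_validity(.20))  # "depends on the circumstances"

    As the surrounding text notes, the band a coefficient falls in is only part of the picture; base rates and other situational factors also determine practical value.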

    In addition to the practical value of validity coefficients, the statistical significance of coefficients

    should be noted. Statistical significance refers to the odds that a non-zero correlation could have

    occurred by chance. If the odds are 1 in 20 that a non-zero correlation could have occurred by

    chance, then the correlation is considered statistically significant. Some experts prefer even more

    stringent odds, such as 1 in 100, although the generally accepted odds are 1 in 20. In statistical

    analyses, these odds are designated by the lower case p (probability) to signify whether a non-

    zero correlation is statistically significant. When p is less than or equal to .05, the odds are

    presumed to be 1 in 20 (or less) that a non-zero correlation of that size could have occurred by

    chance. When p is less than or equal to .01, the odds are presumed to be 1 in 100 (or less) that a

    non-zero correlation of that size occurred by chance.
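    To make the preceding two points concrete, the sketch below computes a test-criterion correlation and its significance level from a small set of invented test scores and supervisory ratings; the data, sample size, and variable names are hypothetical and are not drawn from any ANRA study.

        from scipy.stats import pearsonr

        # Hypothetical local data: test raw scores and supervisory ratings
        test_scores = [28, 35, 41, 30, 38, 45, 33, 39, 26, 42]
        ratings     = [3.4, 3.1, 4.2, 3.6, 3.9, 4.1, 3.0, 4.4, 3.2, 3.8]

        r, p = pearsonr(test_scores, ratings)
        print(f"validity coefficient r = {r:.2f}, p = {p:.4f}")
        # With these invented data, r is roughly .7 and p is roughly .02,
        # i.e., significant at the conventional 1-in-20 (p <= .05) level.
        # A real local study would require a much larger sample and a
        # reliable, job-relevant criterion before the coefficient could be
        # interpreted against the guidelines above.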

    In a study of ANRA criterion-related validity, we examined the relationship between ANRA

    scores and on-the-job performance of job incumbents in various occupations (mostly finance-

    related occupations) and position levels (mainly professionals, managers, and directors). Job

    performance was defined as supervisory ratings on behaviors determined through research to be

    important to most professional, managerial, and executive jobs. The study found that ANRA

    scores correlated .32 with supervisory ratings on a dimension made up of Analysis and Problem

    Solving behaviors, and .36 with supervisory ratings on a dimension made up of Judgment and

    Decision Making behaviors (see Table 6.1). Furthermore, ANRA scores correlated .36 with

    supervisory ratings on a dimension composed of job behaviors dealing with Quantitative/

    Professional Knowledge and Expertise. Supervisory ratings from the sum of ratings on 24 job

    performance behaviors (“Total Performance”), as well as ratings on a single-item measure of

    “Overall Potential” were also obtained. The ANRA scores correlated .44 with Total Performance

    and .31 with ratings of Overall Potential. The correlation between ANRA scores and a single-item

    supervisory rating of “Overall Performance” was .38.


    Table 6.1 Evidence of ANRA Criterion-Related Validity (Total Raw Score) of Job Incumbents in Various Finance-Related Occupations and Position Levels

    Criterion                                               N      Mean      SD      r
    Analysis and Problem Solving                            89     37.6      7.0     .32**
    Judgment and Decision Making                            91     32.2      5.9     .36**
    Quantitative/Professional Knowledge and Expertise       59     53.6      8.9     .36**
    Total Performance (24 items)                            58     127.0     22.0    .44**
    Overall Performance (single item)                       94     5.6       1.1     .38**
    Overall Potential                                       94     3.4       1.1     .31**

    ** p < .01

    In Table 6.1, the column entitled N details the number of cases having valid supervisory ratings

    for every single job behavior contained in the specified criterion. The means and standard

    deviations refer to the criteria ratings shown in the table. The validity coefficients appear in the

    last column.

    The criterion-related validity coefficients reported in Table 6.1 apply to the specific sample of job

    incumbents mentioned in the table. These validity coefficients clearly indicate that ANRA is

    likely to be very beneficial as an indicator of the criteria shown in Table 6.1. However, test users

    should not automatically assume that these data constitute sole and sufficient justification for use

    of ANRA. Inferring validity for one group of employees or candidates from data reported for

    another group is not appropriate unless the organizations and job categories being compared are

    demonstrably similar.

    Careful examination of Table 6.1 can help test users make an informed judgment about the

    appropriateness of ANRA for their own organization. However, the data presented here are not

    intended to serve as a substitute for locally obtained validity data. Local validity studies, together

    with locally derived norms, provide a sound basis for determining the most appropriate use of

    ANRA. Hence, whenever technically feasible, test users should study the validity of ANRA, or

    any selection test, at their own location or organization.

    Sometimes it is not possible for a test user to conduct a local validation study. There may be too

    few incumbents in a particular job, an unbiased and reliable measure of job performance may not

    be available, or there may not be a sufficient range in the ratings of job performance to justify the

    computation of validity coefficients. In such circumstances, evidence of a test’s validity reported

    elsewhere may be relevant, provided that the data refer to comparable jobs.


    Correlations Between ANRA Test 1 and Test 2

    The correlation between Test 1 (Comparison of Quantities) and Test 2 (Sufficiency of

    Information) of ANRA was .71 (N = 452, p < .0001). This correlation is statistically significant, yet it is lower than the reliability of either test shown in Table 5.2 in chapter 5. This evidence

    suggests that ANRA effectively samples both of these reasoning domains within the broader

    conception of numerical reasoning (Rust, 2002).

    Evidence of Convergent and Discriminant Validity

    Convergent evidence is provided when scores on a test relate to scores on other tests or variables

    that purport to measure similar traits or constructs. Evidence of relations with other variables can

    involve experimental (or quasi-experimental) as well as correlational evidence (AERA et al.,

    1999). Discriminant evidence is provided when scores on a test do not relate closely to scores on

    tests or variables that measure different traits or constructs.

    Correlations Between ANRA and Watson-Glaser Critical Thinking Appraisal—Short Form

    Correlations between ANRA and the Watson-Glaser Critical Thinking Appraisal®—Short Form

    (see Table 6.2) suggest that the tests are measuring a common general ability. Evidence for the

    validity of the Watson-Glaser as a measure of critical thinking and reasoning appears in the

    Watson-Glaser Short Form Manual (Watson & Glaser, 2006). The data in Table 6.2 suggest that

    ANRA also measures reasoning ability.

    The fact that the correlations between ANRA and the Watson-Glaser Short Form tests are lower

    than the inter-correlation between the two ANRA tests suggests that ANRA also measures some

    distinct aspect of reasoning that is not measured by the Watson-Glaser (Rust, 2002).

    Table 6.2 Correlations Between Watson-Glaser Critical Thinking Appraisal—Short Form and ANRA (N = 452)

    Watson-Glaser                                    ANRA Test 1:       ANRA Test 2:        ANRA Total
                                                     Comparison of      Sufficiency of      Raw Score
                                                     Quantities         Information
    Watson-Glaser Short Form Total Raw Score         .65                .61                 .68
    Test 1: Inference                                .48                .47                 .52
    Test 2: Recognition of Assumptions               .40                .36                 .41
    Test 3: Deduction                                .53                .51                 .56
    Test 4: Interpretation                           .60                .51                 .60
    Test 5: Evaluation of Arguments                  .35                .36                 .39

    Note. For all correlations, p < .001.


    Correlations Between ANRA and Other Tests

    In addition to the correlations with the Watson-Glaser, we also examined the correlations between

    ANRA and two other tests: the Miller Analogies Test for Professional Selection (N = 67) and the DAT for Personnel and Career Assessment–Numerical Ability (N = 80). As would be expected, ANRA correlated higher with the Numerical Ability test of the DAT for PCA (r = .70, p < .001) than with the MAT for PS (r = .57, p < .001). Details of these results, which suggest convergent

    as well as discriminant validity, are shown in Table 6.3.

    Table 6.3 Correlations Between ANRA, the Miller Analogies Test for Professional Selection (MAT for PS), and the Differential Aptitude Tests for Personnel and Career Assessment—Numerical Ability (DAT for PCA—NA)

    ANRA                                           MAT for PS      DAT for PCA—NA
                                                   (N = 67)        (N = 80)
    ANRA Total Raw Score                           .57             .70
    ANRA Test 1: Comparison of Quantities          .50             .69
    ANRA Test 2: Sufficiency of Information        .50             .57

    Note. For all correlations, p < .001.


    Chapter 7

    Using ANRA as an Employment Selection Tool

    ANRA was developed for use in adult employment selection. It may be used to predict success in

    jobs that require application of numerical reasoning skills. ANRA can also be useful in

    monitoring the effectiveness of numerical reasoning instruction and training programs, and in

    researching the relationship between numerical reasoning and other abilities or skills.

    Employment Selection

    Many organizations use testing as a component of their employment selection process.

    Employment selection programs typically use cognitive ability tests, aptitude tests, personality

    tests, basic skills tests, and work values tests to screen out unqualified candidates, to categorize

    prospective employees according to their probability of success on the job, or to rank order a

    group of candidates according to merit.

    ANRA was designed to assist in the selection of employees for jobs that require numerical

    reasoning. Many finance-related, project-management, and technical professions require the type

    of numerical reasoning ability measured by ANRA. The test is useful to assess applicants for a

    variety of jobs, such as Accountant, Account Manager, Actuary, Banking Manager, Business

    Analyst, Business Development Manager, Business Unit Leader, Finance Analyst, Loan Officer,

    Project Manager, Inventory Planning Analyst, Procurement or Purchasing Manager, and

    leadership positions with financial responsibilities.

    It should not be assumed that the type of numerical reasoning required in a particular job is

    identical to that measured by ANRA. Job analysis and local validation of ANRA for selection

    purposes should follow accepted human resource research procedures, and conform to existing

    guidelines concerning fair employment practices. In addition, no single test score can possibly reflect all of the requisite knowledge and skills necessary for success in a job.

    Using ANRA in Making a Hiring Decision

    It is ultimately the responsibility of the hiring authority to determine how it uses ANRA scores.

    We recommend that if the hiring authority establishes a cut score, examinees’ scores should be

    considered in the context of appropriate measurement data for the test, such as the standard error

    of measurement and data regarding the predictive validity of the test. In addition, we recommend


    that selection decisions be based on multiple job-relevant tools rather than relying on any single

    test (e.g., using only ANRA scores to make employment decisions).
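    As one hedged illustration of how the standard error of measurement can inform such decisions, the sketch below builds an approximate 95% confidence band around an observed T score. The T-score standard deviation of 10 and the reliability of .85 (the corrected test-retest coefficient in Table 5.3) are used purely for illustration; Pearson does not prescribe these particular values for this purpose.

        import math

        sd_t = 10.0          # nominal SD of the T-score scale (illustrative)
        reliability = 0.85   # corrected test-retest coefficient from Table 5.3 (illustrative)

        # Standard error of measurement: SEM = SD * sqrt(1 - reliability)
        sem = sd_t * math.sqrt(1 - reliability)

        observed_t = 55.0
        lower = observed_t - 1.96 * sem
        upper = observed_t + 1.96 * sem
        print(f"SEM = {sem:.1f}; 95% band for T = {observed_t}: {lower:.1f} to {upper:.1f}")
        # Roughly 47.4 to 62.6 -- a reminder that scores near a cut score
        # should not be treated as precise values.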

    Human resource professionals can look at the percentile rank that corresponds to the candidate’s

    raw score in several ways. Candidates’ scores may be rank ordered by percentiles so that those

    with the highest scores are considered further. Alternatively, a cut score (e.g., the 50th percentile)

    may be established so that candidates who score below the cut score are not considered further. In

    general, the higher the cut score is set, the higher the likelihood that a given candidate who scores

    above that cut score will be successful. However, the need to select high scoring candidates

    typically needs to be balanced with situational factors, such as the need to keep jobs filled and the

    supply of talent in the local labor market.
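    The sketch below simply illustrates the two approaches just described: rank-ordering candidates by percentile and screening against a cut score. The candidate names, percentile ranks, and the 50th-percentile cut are hypothetical.

        # Hypothetical candidates and their ANRA percentile ranks
        candidates = {"Adams": 72, "Baker": 45, "Chen": 88, "Diaz": 50, "Evans": 31}

        # Rank order from highest to lowest percentile
        ranked = sorted(candidates.items(), key=lambda item: item[1], reverse=True)
        print(ranked)  # [('Chen', 88), ('Adams', 72), ('Diaz', 50), ('Baker', 45), ('Evans', 31)]

        # Alternatively, apply a cut score (e.g., the 50th percentile): candidates
        # scoring below the cut are not considered further.
        cut = 50
        passing = [name for name, pct in ranked if pct >= cut]
        print(passing)  # ['Chen', 'Adams', 'Diaz']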

    When interpreting ANRA scores, it is useful to know the specific behaviors that an applicant with

    a high ANRA score may be expected to exhibit. These behaviors, as rated by supervisors, were

    consistently found to be related to ANRA scores across different occupations requiring numerical

    reasoning. In general, candidates who score low on ANRA may find it challenging to effectively

    demonstrate these behaviors. Conversely, candidates who score high on ANRA are likely to

    display a higher level of competence in the following behaviors:

    • Uses quantitative reasoning to solve job-related problems.

    • Learns new numerical concepts quickly.

    • Applies sound logic and reasoning when making decisions.

    • Demonstrates knowledge of financial indicators and their implications.

    • Breaks down information into essential parts or underlying principles.

    • Readily integrates new information into problem-solving and decision-making

    processes.

    • Recognizes differences and similarities in situations or events.

    • Engages in a broad analysis of relevant information before making decisions.

    • Probes deeply to understand the root causes of problems.

    • Reviews financial statements, sales reports, and/or other financial data when

    planning.

    • Accurately assesses the financial value of things (e.g., worth of assets) or people

    (e.g., credit worthiness).


    Human resource professionals who use ANRA should document and examine the relationship

    between applicants’ scores and their subsequent performance on the job. Using locally obtained

    criterion-related validity information provides the best foundation for interpreting scores and

    most effectively differentiating examinees who are likely to be successful from those who are not.

    Pearson does not establish or recommend a passing score for ANRA.

    Differences in Reading Ability, Including the Use of English as a Second Language

    Though ANRA is a mathematical test, a level of reading proficiency in the English language is

    assumed and reflected in the items. Where ANRA is being used to measure the numerical

    reasoning capabilities of a group, for some of whom English is not their first language, reasonable

    precautions need to be taken. If a candidate experiences difficulty with the language or the

    reading level of the test, note this information and consider it when interpreting the test scores. In

    some cases, it may be more appropriate to test such individuals with another assessment

    procedure that fully accommodates their language of preference or familiarity.

    Using ANRA as a Guide for Training, Learning, and Education

    Critical thinking, numerical or otherwise, is trainable (Halpern, 1998; Paul & Nosich, 2004).

    Thus, when interpreting test scores on ANRA, it is important to bear in mind the extent to which

    training may have influenced the scores. The ability to think critically has long been recognized

    as a desirable educational objective and studies that have been done in educational settings

    demonstrate that critical thinking can be improved as a result of training directed to this end (Hill,

    1959; Kosonen & Winne, 1995; Nisbett, 1993; Perkins & Grotzer, 1997).

    Scores on ANRA are likely to be influenced by factors associated with training. Typically,

    individuals will differ in the extent to which such training has been made available to them.

    Although traditional classes in math and science in school are important, many of these classes

    involve computational arithmetic and other lower-order thinking skills, such as the rote

    application of rules that have been learned. Training in higher-order numerical reasoning during

    the school years will often have been indirect and largely dependent on the overall quality of

    education available to the individual. Consequently, this indirect training would likely depend on

    the amount of time spent in education or learning. Furthermore, the extent to which numerical

    reasoning skills are trainable will likely differ between individuals.


    Fairness in Selection Testing

    Fair employment regulations and their interpretation are continuously subject to changes in the

    legal, social, and political environments. Therefore, ANRA users should consult with qualified

    legal advisors and human resources professionals as appropriate.

    Legal Considerations

    Governmental and professional regulations cover the use of all personnel selection procedures.

    Relevant source documents that the user may wish to consult include the Standards for

    Educational and Psychological Testing (AERA et al., 1999); the Principles for the Validation and

    Use of Personnel Selection Procedures (Society for Industrial and Organizational Psychology,

    2003); and the federal Uniform Guidelines on Employee Selection Procedures (Equal

    Employment Opportunity Commission, 1978). For an overview of the statutes and types of legal

    proceedings that influence an organization’s equal employment opportunity obligations, the user

    is referred to Cascio and Aguinis (2005) or the U.S. Department of Labor’s (1999) Testing and

    Assessment: An Employer’s Guide to Good Practices.

    Group Differences and Adverse Impact

    Local validation is particularly important when a selection test may have adverse impact.

    According to the Uniform Guidelines on Employee Selection Procedures (Equal Employment

    Opportunity Commission, 1978), adverse impact is indicated when the selection rate for one group is less than 80% (or 4 out of 5) of the selection rate for another group. Adverse impact is likely to occur with

    cognitive ability tests such as ANRA.
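    A minimal sketch of the four-fifths (80%) comparison described above follows; the applicant and hire counts are invented, and an actual adverse-impact analysis should follow the Uniform Guidelines and qualified professional advice.

        # Hypothetical selection data for two applicant groups
        group_a_applicants, group_a_hired = 120, 60   # selection rate 0.50
        group_b_applicants, group_b_hired = 80, 28    # selection rate 0.35

        rate_a = group_a_hired / group_a_applicants
        rate_b = group_b_hired / group_b_applicants

        # Four-fifths rule: compare the lower selection rate to the higher one
        impact_ratio = min(rate_a, rate_b) / max(rate_a, rate_b)
        print(f"impact ratio = {impact_ratio:.2f}")   # 0.70

        if impact_ratio < 0.80:
            print("Adverse impact indicated under the four-fifths rule.")
        else:
            print("No adverse impact indicated under the four-fifths rule.")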

    Although it is within the law to use a test with adverse impact (Equal Employment Opportunity

    Commission, 1978), the testing organization must be prepared to demonstrate that the selection

    test is job-related and consistent with business necessity. The Civil Rights Act of 1991, as

    amended, defined “business necessity” to mean that, “in the case of employment practices

    involving selection …, the practice or group of practices must bear a significant relationship to

    successful performance of the job” (Section 3 (o) (1) (A)). In deciding whether the standards for

    business necessity have been met, the Civil Rights Act of 1991 states that “demonstrable

    evidence is required”. The Act provides examples of “demonstrable evidence” as “statistical

    reports, validation studies, expert testimony, prior successful experience and other evidence as

    permitted by the Federal Rules of Evidence” (Section 3 (o) (1) (B)).

    A local validation study, in which ANRA scores are correlated with job performance indicators,

    can provide evidence to support the use of the test in a particular job context. An evaluation that


    demonstrates that ANRA (or any employment assessment tool) is equally predictive for protected

    subgroups, as outlined by the Equal Employment Opportunity Commission, will assist in the

    demonstration of fairness of the test. For example, from the results of their review of 22 cases in

    U.S. Appellate and District Courts involving cognitive ability testing in class-action suits,

    Shoenfelt and Pedigo (2005, p. 6) reported that “organizations that utilize professionally

    developed standardized cognitive ability tests that are validated and that set cutoff scores

    supported by the validation study data are likely to fare well in court.”

    Monitoring the Selection System

    An organization’s abilities to evaluate selection strategies and to implement fair employment

    practices depend on its awareness of the demographic characteristics of applicants and

    incumbents. Monitoring these characteristics and accumulating test score data are clearly

    necessary for establishing legal defensibility of a selection system, including those systems that

    incorporate ANRA. The most effective use of ANRA is with a local norms database that is

    regularly updated and monitored.

    The hiring organization should ensure that its selection process is clearly job related and focuses

    on characteristics that are important to job success. Good tests that are appropriate to the job in

    question can contribute a great deal towards monitoring and minimizing the major sources of bias

    in the selection procedures. ANRA is a reliable and valid instrument for the assessment of

    numerical reasoning. When used for the assessment of candidates or incumbents for work that

    requires this skill, ANRA can be useful in selecting the better candidates. However, where candidates drawn from different sub-groups of the population are deficient to varying degrees in numerical reasoning skills because the necessary educational environment was not provided during their schooling, there is a risk of overlooking candidates who could develop this skill but who have not had the opportunity to do so. Employers can reasonably expect candidates to have achieved the necessary basic skills before applying for the job.

    However, in circumstances where adverse impact is manifest, an organization might wish to

    consider ways in which it can contribute to the reduction of adverse impact. This approach might

    take the form of providing training courses to employees in the deficient skill areas, or of

    increasing involvement with the local community to identify ways in which the community might

    assist, or of re-evaluating recruitment strategy, for example, by advertising job positions more

    widely or through different media.


    References

    Allen, M.J., & Yen, W.M. (1979). Introduction to measurement theory. Monterey, CA: Brooks/Cole.

    American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (1999). Standards for educational and psychological testing. Washington, DC: Author.

    American Institute of Certified Public Accountants, AICPA (1999).