
RSCH 6109: Assessment & Evaluation Methods

Classification of Items

Response Formats

Scoring Procedures

Select Item Writing Guidelines

MCQs

Likert Rating Scales

1. Purpose & Framework

2. Test Specifications or Blueprint

3. Item Construction

4. Field Testing

5. Evaluation & Revision

Test Design & Construction

AERA, APA, & NCME (1999). Standards for Educational and Psychological Testing; Washington, DC: American Educational Research Association

Drummond, R.J. (2004). Appraisal Procedures for Counselors and Helping Professionals, 5th Ed.; New Jersey: Pearson Publishing

Step 1: Delineate purpose

    Phase 1: Establish the need

    Phase 2: Define the objectives & test parameters

Step 2: Develop test specs or blueprint

    Phase 3: Seek advisory committee input

Step 3: Develop items or tasks & field test

    Phase 4: Write questions

    Phase 5: Field test

    Phase 6: Review items

Step 4: Assemble & evaluate test

    Phase 7: Assemble final version

    Phase 8: Secure technical data

Step 1: Delineate the Purpose & Framework

The purpose and framework delineate what the test is intended to measure.

Step 2: Prepare the Table of Specifications

The table of specifications typically describes the specific format of the items, the response format, and the type of scoring procedures.

Step 3: Develop Test Items or Tasks



Defining the Purpose

1. Works like a mission statement for the test

2. Define the construct to be measured

3. Define the population with whom the test is to be used

4. Determine the target audience for the information the test provides (the test users)

5. Define the nature of the decisions to be made based on the information the test provides


What is a Construct?

1. A construct is an unobservable quality, ability, or attribute

2. We believe from theory that each person possesses some “amount” of the construct

3. We can’t directly observe or measure the “amount” or level

4. We rely on outward behaviors as indicators of the latent, or underlying, construct

5. Contrast blood pressure (measured directly) with depression (inferred from behavioral indicators)


Defining the Content Domain

1. Theory

2. Literature

3. Expert opinion

4. Qualitative research

5. The goal is to include all aspects of the construct you intend to measure

Step 1: Delineate the Purpose & Framework

Example:

(Optimal) The purpose of the Counselor Achievement Test (CAT) is to assess counseling students’ knowledge, skills, and abilities for effective counseling services. The framework of the CAT is modeled after the National Counselor Examination (NCE) and includes eight content areas. The CAT will consist of 24-32 selected- and constructed-response items, as well as performance tasks. The CAT will be a criterion-referenced measure.

(Typical) The purpose of the Study Habits Scale (SHS) is to assess college students’ study habits. The SHS includes between 18 and 30 items. The framework of the SHS is based on the work of Blai (1993). The SHS is a self-report measure designed to identify students’ study attitudes and behaviors.


Step 2: Develop the test specifications or blueprint


The table of specifications or test blueprint typically describes the number of items, the specific classification of the items and response format, and the type of scoring procedures.

Sample Table of Specifications for CAT

TTL#  Content Area                       Item Classification*  Format (# of Items)
                                         (K C AP AN S E)
----  ---------------------------------  --------------------  ---------------------------------
3     Human growth and development       1 1 1                 MCQ (2), Constructed Response (1)
3     Social and cultural foundations
3     Helping relationships              1 1 1                 MCQ (2), Constructed Response (1)
3     Group work                         1 1 1                 MCQ (2), Constructed Response (1)
3     Career and lifestyle development
3     Appraisal                          1 1 1                 MCQ (2), Constructed Response (1)
3     Research and program evaluation
3     Professional orientation & ethics
24    Total                              6 6 6 2 2 2

*Refers to Bloom’s Taxonomy of Educational Objectives (1956). K = knowledge, C = comprehension, AP = application, AN = analysis, S = synthesis, and E = evaluation.
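A blueprint like the one above can be checked for internal consistency before any items are written. The sketch below is a minimal illustration only: the counts come from the sample table, while the variable and function names are assumptions made for the example.

```python
# Counts taken from the sample CAT blueprint; names are illustrative.
content_areas = {
    "Human growth and development": 3,
    "Social and cultural foundations": 3,
    "Helping relationships": 3,
    "Group work": 3,
    "Career and lifestyle development": 3,
    "Appraisal": 3,
    "Research and program evaluation": 3,
    "Professional orientation & ethics": 3,
}
classification_totals = {"K": 6, "C": 6, "AP": 6, "AN": 2, "S": 2, "E": 2}
planned_length = 24


def blueprint_is_consistent(areas, levels, planned):
    """Return True when both breakdowns add up to the planned test length."""
    return sum(areas.values()) == planned and sum(levels.values()) == planned


print(blueprint_is_consistent(content_areas, classification_totals, planned_length))
```

Both the content-area counts and the classification totals should sum to the same planned length; a mismatch suggests that a row or column of the blueprint needs revisiting.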


Developing Items

1. Determine the target length in time to administer and number of items

2. Consider intended use and practical constraints – cost, complexity of scoring, etc.

3. Consider the purpose and the stakes involved in decision making

4. Initially write at least twice as many items as needed

5. Contrast a screening test with a diagnostic test


Screening Tests

1. Short

2. Easy to administer

3. Inexpensive

4. Easy to score

5. Maximizes sensitivity

6. Makes the correct decision when the condition of interest is present – minimizes false negatives.


Diagnostic Tests

1. Longer

2. More complex to administer

3. More expensive

4. Harder to score

5. Maximizes specificity

6. Makes the correct decision when the condition of interest is not present – minimizes false positives.
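The contrast between sensitivity (screening tests) and specificity (diagnostic tests) can be made concrete with the standard formulas. The sketch below uses hypothetical counts that are not from the course material; the function names are chosen for the example.

```python
def sensitivity(true_pos, false_neg):
    """Proportion of people WITH the condition whom the test correctly flags."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Proportion of people WITHOUT the condition whom the test correctly clears."""
    return true_neg / (true_neg + false_pos)


# Hypothetical results: 50 people with the condition, 100 without it.
tp, fn = 45, 5    # condition present: 45 detected, 5 missed (false negatives)
tn, fp = 80, 20   # condition absent: 80 cleared, 20 flagged (false positives)

print(sensitivity(tp, fn))  # 0.9
print(specificity(tn, fp))  # 0.8
```

Fewer false negatives raises sensitivity, and fewer false positives raises specificity, which is why the two test types emphasize different errors.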



The table of specifications typically describes the specific classification of the items, the response format, and the type of scoring procedures.

Item Classifications: Bloom and Krathwohl (1956)

Knowledge: Define, Identify, List, Name

Comprehension: Convert, Explain, Summarize

Application: Compute, Determine, Solve

Analysis: Analyze, Differentiate, Relate

Synthesis: Design, Devise, Formulate, Plan

Evaluation: Compare, Critique, Evaluate, Judge

Bloom et al.’s Taxonomy of Educational Objectives (Cognitive Domain)

Knowledge (Define, Recall, Identify, List, Name): Remembering previously learned material. Requires recall of facts, procedures, rules, or events.

Comprehension (Convert, Explain, Summarize): Grasping the meaning of material. Requires reformulation, restatement, translation, or interpretation of content or identification of relationships.

Application (Compute, Demonstrate, Solve): Using information in concrete situations. Requires use of information in a setting or context other than where it was learned.

Analysis (Analyze, Infer, Differentiate, Relate): Breaking down material into parts. Requires recognition of logical errors, comparison of components, or differentiation between components.

Synthesis (Design, Construct, Combine, Formulate): Putting parts together into a whole. Requires production of something original, solution to an unfamiliar problem, or combination of parts in an unusual way.

Evaluation (Discriminate, Critique, Evaluate, Judge): Judging the value of a thing for a given purpose using definitive criteria. Requires formation of judgements about the worth or value of ideas, products, or procedures that have a specific purpose.

Response Formats:

Selected-Response

Response sets are provided and the user is forced to select among the choices. Examples include: MCQ, T/F, Yes/No, Matching, and Likert Ratings

Constructed-Response

No response sets are provided and the user is forced to provide a unique response. Examples include: Short Answer & Extended Answer.

Performance Tasks

No response sets are provided and the user is required to develop a product or perform some task or set of tasks. Examples include: Restricted and Extended Performance Tasks.


Selected-Response Formats

1. Multiple Choice Questions (MCQ)

Multiple choice items include a question or STEM followed by a number of possible responses or OPTIONS. These options make up the RESPONSE SET of the item.

2. True – False Questions

True – false items include a stem and two discrete options. These options can be “True-False”, “Yes-No”, “Always-Never”, etc.

3. Matching Items

Matching exercises consist of two columns of information. The student is required to select the item in the second column which best reflects the item in the first column.

4. Likert Rating Scale Items

Likert ratings include a scale ranging from one extreme to another. The anchors of the scale vary depending on the nature of the statement.

Constructed-Response Formats (Optimal Performance)

1. Short Answer Questions

Completion or short answer formats consist of questions that can be answered with a word or short phrase, or a statement having one or more omitted words.

2. Limited Essay Questions

Limited essay questions consist of tasks or items requiring students to give brief, concise responses.

3. Extended Essay Questions

Extended essay questions consist of tasks or items that allow students freedom to choose the form and scope of their responses.

Format: Advantages and Disadvantages

MCQ
Advantages: Assesses a broad range of skills in a limited amount of time. Scoring can be done quickly and objectively.
Disadvantages: Difficult and time consuming to write higher order cognitive items. Most items assess knowledge through comprehension. Guessing reduces validity of scores.

True-False
Advantages: Numerous items can be administered in a brief amount of time. Easy to write and objective to score.
Disadvantages: Limited in complexity. Guessing reduces validity of scores. Not appropriate for optimal performance measures.

Matching
Advantages: Assesses a broad range of skills in a limited time. Scoring can be done quickly and objectively.
Disadvantages: Higher order cognitive skills are difficult to assess. Guessing reduces validity of scores.

Short Answer
Advantages: Numerous items can be administered in a short time. Moderately easy to write and score items. Guessing is difficult.
Disadvantages: Limited to items that require very few words. Spelling errors can make scoring difficult.

Essay
Advantages: Assesses a broad range of skills, particularly higher order cognitive skills. Guessing is difficult.
Disadvantages: Time consuming to administer and score. Limited content can be sampled during a test period. Scoring can be subjective.



Scoring Procedures:

Selected-Response

Typically, selected-response items have one correct answer and are scored as right or wrong (dichotomous scoring). However, some tests may weight responses differently.

Rating scale items are typically added together for a total score. For example, ten 5-point Likert rating scale items would yield a score range from 10 to 50. Typically, a higher score denotes stronger agreement, satisfaction, etc. with the overall construct.
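The summed-score arithmetic for rating scales can be sketched as follows. This is an illustration under simple assumptions: every item uses the same 1-to-5 scale, no items are reverse-scored, and the response values are hypothetical.

```python
def score_range(n_items, points=5):
    """Lowest and highest possible totals when each item is rated 1..points."""
    return n_items * 1, n_items * points


def likert_total(ratings, points=5):
    """Sum the item ratings after checking each one is on the scale."""
    for rating in ratings:
        if not 1 <= rating <= points:
            raise ValueError(f"rating {rating} is outside 1..{points}")
    return sum(ratings)


print(score_range(10))  # (10, 50)

# Hypothetical responses from one respondent to ten 5-point items.
responses = [4, 5, 3, 4, 2, 5, 4, 3, 4, 5]
print(likert_total(responses))  # 39
```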



Scoring Procedures (continued):

Constructed-Response

These formats are relatively more subjective, time consuming, and expensive to score.

Short-answer items require a list of acceptable answers.

Extended response items typically require a scoring rubric. A scoring rubric is a table describing the criteria for scoring, including detailed descriptions for varying degrees of performance. The scoring rubric may yield a holistic or analytic score. Holistic scores refer to the overall impression of the response (or behavior) and analytic scores refer to the discrete dimensions of the response (or behavior). Holistic scores yield one overall score and analytic scores typically yield sub-scores as well as an overall score.

Performance tasks vary depending on the nature and complexity of the tasks. Scoring procedures may require a checklist, Likert rating scale, or rubric.
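The difference between analytic and holistic rubric scores can be illustrated with a small sketch. The rubric dimensions, the 1-to-4 rating scale, and the choice to sum sub-scores into the overall score are all assumptions made for this example; an actual rubric defines its own criteria and combination rule.

```python
def analytic_score(dimension_ratings):
    """Return the sub-scores plus an overall score (here, their simple sum)."""
    return {
        "sub_scores": dict(dimension_ratings),
        "overall": sum(dimension_ratings.values()),
    }


# Hypothetical analytic ratings for one essay, each dimension rated 1-4.
essay_ratings = {"organization": 3, "accuracy": 4, "depth": 2}
print(analytic_score(essay_ratings)["overall"])  # 9

# A holistic score is a single overall judgement with no sub-scores.
holistic_score = 3
print(holistic_score)
```

The analytic result shows where a response was strong or weak, while the holistic score reports only the rater's overall impression.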

Supplemental Information: MCQ Alternatives

1. Confidence Weighting

The student indicates which answer they believe is correct and how certain they are of it. Answers given with high confidence are weighted more heavily than answers given with low confidence.

2. Answer Until Correct (AUC)

The student chooses alternatives until the correct response is selected. Once it is selected, the student moves on to the next item.

Supplemental Information: MCQ Alternatives (continued)

3. Elimination & Inclusion Scoring

The student is asked to either cross out all the alternatives that are incorrect (elimination) or circle the alternatives that are most likely correct (inclusion).

4. Multiple-Answer Format

The student is told that any number of the options might be correct. Each item is scored by subtracting the number of incorrect answers from the number of correct answers.
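The multiple-answer scoring rule can be sketched as below. The sketch assumes that "correct answers" means keyed options the student selected and "incorrect answers" means non-keyed options the student selected; the item key and the selections are hypothetical.

```python
def multiple_answer_score(selected, keyed):
    """Correct selections minus incorrect selections for one item."""
    selected, keyed = set(selected), set(keyed)
    correct = len(selected & keyed)     # keyed options the student chose
    incorrect = len(selected - keyed)   # non-keyed options the student chose
    return correct - incorrect


# Hypothetical item where options A and C are keyed as correct.
print(multiple_answer_score({"A", "B", "C"}, {"A", "C"}))  # 1
print(multiple_answer_score({"A", "C"}, {"A", "C"}))       # 2
```

Under this rule, selecting every option does not earn full credit, because each wrong selection cancels a right one.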

Sample Item: Confidence Weighting

Please respond to the following items by circling the letter that corresponds to the correct response. In addition, please rate your level of confidence with your response to each item by circling the corresponding confidence level.

What is the main advantage of using a table of specifications when preparing an achievement test?

A. It reduces the amount of time required. (+0)

B. It improves the sampling of content. (+1)

C. It makes the construction of test items easier. (+0)

D. It increases the objectivity of the test. (+0)

Please circle the number that corresponds to the best descriptor for your level of confidence with the answer chosen:

5 = Extremely Confident
4 = Fairly Confident
3 = Neutral
2 = Fairly Unconfident
1 = Extremely Unconfident

Scoring Guide: Multiply the points for the chosen answer by the confidence rating. In this example, a student who chooses B (+1) with a confidence rating of 4 would receive 4 out of a possible 5 points.
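The confidence-weighting arithmetic for this sample item can be sketched as follows. The option points are those shown beside each option; the function name and the input checks are assumptions made for the example.

```python
# Points attached to each option in the sample item above.
option_points = {"A": 0, "B": 1, "C": 0, "D": 0}


def confidence_weighted_score(choice, confidence):
    """Multiply the chosen option's points by a 1-5 confidence rating."""
    if confidence not in (1, 2, 3, 4, 5):
        raise ValueError("confidence must be an integer from 1 to 5")
    return option_points[choice] * confidence


print(confidence_weighted_score("B", 4))  # 4 (correct answer, fairly confident)
print(confidence_weighted_score("A", 5))  # 0 (wrong answer earns nothing)
```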

Sample Item: AUC

Please answer the following items by removing the overlay that corresponds to your response. If the answer chosen reveals an “INCORRECT” response, continue selecting until you reveal the “CORRECT” response. Once you have identified the “CORRECT” response you have completed the item and should move on to the next question.


A. It reduces the amount of time required.

B. It improves the sampling of content.

C. It makes the construction of test items easier.

D. It increases the objectivity of the test.

Scoring Guide:

1st Attempt = 100%
2nd Attempt = 66%
3rd Attempt = 33%
4th Attempt = 0%
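The four credit values in this scoring guide follow a simple pattern: credit falls in equal steps from 100% on the first attempt to 0% on the last. The sketch below generalizes that pattern to any number of options, which is an assumption beyond the four values given here; percentages are rounded down to whole numbers to match the 66% and 33% shown.

```python
def auc_credit(attempts, n_options=4):
    """Percent credit for an item answered correctly on the given attempt."""
    if not 1 <= attempts <= n_options:
        raise ValueError("attempts must be between 1 and the number of options")
    # Integer division rounds down, so 2 of 3 steps remaining gives 66, not 67.
    return (n_options - attempts) * 100 // (n_options - 1)


for attempt in range(1, 5):
    print(attempt, auc_credit(attempt))
# 1 100
# 2 66
# 3 33
# 4 0
```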

Sample Item: Elimination Scoring

Please respond to the following items by circling the letter that corresponds to the correct response. In addition, please draw a line through those options that you confidently believe are incorrect.


Scoring Guide

A. It reduces the amount of time required. (+05)

B. It improves the sampling of content. (+85%)

C. It makes the construction of test items easier.

D. It increases the objectivity of the test.



