

2014; 36: 97–110

AMEE GUIDE

How to set standards on performance-based examinations: AMEE Guide No. 85

DANETTE W. MCKINLEY & JOHN J. NORCINI

FAIMER, Research and Data Resources, USA

Abstract

This AMEE Guide offers an overview of methods used in determining passing scores for performance-based assessments. A consideration of various assessment purposes will provide context for discussion of standard setting methods, followed by a description of different types of standards that are typically set in health professions education. A step-by-step guide to the standard setting process will be presented. The Guide includes detailed explanations and examples of standard setting methods, and each section presents examples of research done using the method with performance-based assessments in health professions education. It is intended for use by those who are responsible for determining passing scores on tests and need a resource explaining methods for setting passing scores. The Guide contains a discussion of reasons for assessment, defines standards, and presents standard setting methods that have been researched with performance-based tests. The first section of the Guide addresses types of standards that are set. The next section provides guidance on preparing for a standard setting study. The following sections include conducting the meeting, selecting a method, implementing the passing score, and maintaining the standard. The Guide will support efforts to determine passing scores that are based on research, matched to the assessment purpose, and reproducible.

Introduction

Standard setting is the process of defining or judging the level of knowledge and skill required to meet a typical level of performance and then identifying a score on the examination score scale that corresponds to that performance standard. Standard setting procedures are employed to provide a conceptual definition of competence for an occupation or educational domain and to operationalise the concept. When considering the conceptual definition of competence, it is helpful to think about the criteria developed in competency-based medical education. The descriptive information provided in the development of milestones or benchmarks (Holmboe et al. 2010) can be helpful in defining the performance standard. The standard setting process is designed to translate a conceptual definition of competence to an operational version, called the passing score (Kane 1994; Norcini 1994). Verification that the passing score is appropriate is another critical element in collecting evidence to support the validity of test score interpretation (American Educational Research Association et al. 1999; Kane 2006). Various approaches to determining passing scores for examinations have been developed and researched. In this Guide, an overview of the methods that have been typically used with performance-based assessments will be provided. A consideration of various assessment purposes will provide context for discussion of standard setting methods, followed by a description of different types of standards that are typically set in health professions education. A step-by-step guide to the standard setting process will be presented.

Assessment purposes

In education, it is often necessary to evaluate whether trainees are attaining the knowledge, skills, and attitudes needed to perform in the field of endeavour.

Practice points

• Although there is extensive research on standard setting with both multiple-choice and performance-based tests, there is no "right" passing score, and no "best" method.
• Different methods yield different results.
• Pre-fixed passing scores set without consideration of test content or examinee performance can vary greatly due to test difficulty and content, affecting the appropriateness of the decisions made.
• Selecting a method depends on the purpose of the examination and the resources available for the standard setting effort.
• The passing score should be determined by a group (e.g. faculty members) familiar with the assessment purpose and the domain assessed.
• The standard setting method selected should: be closely aligned with assessment goals; be easy to explain and implement; require judgments that are based on performance data; entail thoughtful effort; and be based on research.

Correspondence: Danette W. McKinley, PhD, Foundation for the Advancement of International Medical Education and Research, 3624 Market Street, 4th Floor, Philadelphia, PA 19104, USA. Tel: 1 215 823 2231; fax: 1 215 386 3309; email: [email protected]

ISSN 0142–159X print/ISSN 1466–187X online/14/020097–14 © 2014 Informa UK Ltd. DOI: 10.3109/0142159X.2013.853119


In order to determine whether "sufficient knowledge, skills and attitudes" are present, different methods are typically employed as part of a programme of assessment (Dijkstra et al. 2010). In health professions education, there are many approaches to assessing the knowledge, skills, attitudes and abilities of applicants, students, graduates, and practitioners. For some time, health professions educators used any available assessment method to evaluate the competencies of a doctor, even if it was not appropriate (Norcini & McKinley 2007). For example, although it is important for a doctor to be able to communicate effectively with the healthcare team, this aspect is not appropriately tested through the use of written examinations. Several methods of assessment have been developed and implemented, with a movement towards assessment based on performance that is tied to what is expected in practice. In the education of health professionals, standardised patients (SPs), lay people trained to portray the complaints of "real" patients, are frequently used (e.g. Patil et al. 2003). This type of assessment provides examinees the opportunity to show what they can do (e.g. correctly perform a physical examination, communicate with a patient), rather than what they know (Miller 1990).

In developing methods that assess what examinees do, other methods, or even combinations of modalities, have also been used (Nestel et al. 2006). In many healthcare professions, various workplace-based assessments, including chart reviews and 360 degree evaluations, have been instituted (Norcini & Burch 2007). These assessments are usually employed as quality improvement measures, and involve peer review, evaluation of practice outcomes, and patient or client satisfaction measures (Norcini 2003).

The variety of instruments available has been developed, at least in part, to meet different assessment goals. Assessment goals may include determining who is eligible to enter medical school (e.g. admissions testing); whether course requirements are satisfied (e.g. classroom testing); whether a student is ready to advance to the next level of training (e.g. end-of-year testing); whether the examinee is ready to enter a profession (e.g. licensure and certification testing); or whether the examinee has shown evidence of expertise (e.g. maintenance of certification). The assumption is usually made that scores obtained from assessments provide an indication of a student's ability to "use the appropriate knowledge, skills, and judgment to provide effective professional services over the domain of encounters defining the area of practice" (Kane 1992, p. 167). Scores are often used to make decisions (or interpretations) regarding whether students (or graduates) have sufficiently acquired the knowledge and skills to enter, or continue practice in, a profession. In this manner, test scores are used to classify people into two or more groups (e.g. passing or failing an examination). Test scores can be used to make decisions about who needs additional educational help, whether the test-taker will go on to the next level of training, or whether the test-taker has achieved mastery in the domain of interest.

Kane (1992) defined competence as the extent to which the individual can handle the various situations that arise in that area of practice (p. 165). In education, licensure, and certification, assessments are used to determine the proficiency of candidates. For example, when performance measures such as the objective structured clinical examination (OSCE) are developed, principles associated with domain-based assessment, including definition of characteristics denoting adequate performance, can be employed (Pell et al. 2010). Whether the test consists of multiple-choice items or task performance, it is important to consider whether the goal is to compare an individual with others taking the same assessment, or to determine the level of proficiency. In the next section of the Guide, we discuss types of standards related to these different purposes.

Types of standards

Standards can be categorised as relative (sometimes called norm-referenced) or absolute (sometimes called criterion-referenced) (Livingston & Zieky 1982). Standards that are established based on a comparison of those who take the assessment to each other are relative standards. For example, when the passing score is set based on the number or percentage of examinees that will pass, the standard is relative. This type of standard setting is typically used in selection for employment or admission to educational programmes where the positions available are limited. With high stakes examinations (e.g. graduation, certification, licensure), relative standards are not typically used, because both the ability of the groups of test takers and the content of the assessment could vary over time. When pre-determined passing scores are used with educational (classroom) tests, the ability of the students in the class and the difficulty of the test given are not considered (Cohen-Schotanus & van der Vleuten 2010). Examinee ability and test difficulty are factors that could adversely affect evidence of the appropriateness of the passing score. To avoid the disadvantages associated with relative standard setting methods, absolute standard setting approaches are more commonly used in credentialing examinations (i.e. licensure or certification).

Standards set by determining the amount of test material that must be answered (or performed) correctly in order to pass are absolute standards. For example, if the examinee must answer 75% of the items on a multiple-choice test correctly in order to pass, the standard is absolute. When absolute standards are used, it is possible that all examinees will pass, or that all will fail, the examination.

Standard setting and performance-based assessments

Several standard setting methods have been used with performance-based assessments. Because relative standards are set based on the desired result (e.g. admitting the top 75 candidates to a school) and are, therefore, determined more easily, this Guide will focus on absolute standard setting methods (see Table 1 for comparisons). These methods involve review of examination materials or examinee performance, and the resulting passing scores can be derived from the judgments of a group of subject matter experts. Absolute standard setting approaches have been referred to as either test-centred or examinee-centred (Livingston & Zieky 1982).



When using test-centred methods, the judges focus on exam content. The Angoff (1971), Ebel (1972), and Nedelsky (1954) methods are examples of test-centred standard setting methods. In contrast, when using examinee-centred methods, judges focus on the performance of examinees. These methods include the contrasting groups, borderline group, and borderline regression methods (Livingston & Zieky 1982; Wood et al. 2006). The judges' task in examinee-centred methods is to determine whether the performance they review depicts someone possessing the knowledge and skills needed to meet the standard (e.g. is minimally competent).

Aspects of test-centred and examinee-centred approaches, related to examinations containing multiple choice questions (MCQs), were presented in a previous AMEE Guide (Bandaranayake 2008), and the application of some of these approaches to the OSCE is presented here. Specifically, we will provide guidance for the use of the Angoff, borderline group, borderline regression, contrasting groups, and compromise methods of standard setting using simulated OSCE data. The remainder of the Guide is divided into four sections: preparing for the standard setting study, conducting the standard setting study, generating the passing score, and implementing and maintaining standards.

Preparing for the standard setting study

Determining the passing score is typically accomplished by a group familiar with the assessment purpose and the domain assessed. Before this group meets, there are a number of steps that should be completed. First, panellists need to be recruited. To increase objectivity and to derive plausible passing scores, panellists should be knowledgeable about the examination content area and the purpose of the test (Jaeger 1991; Raymond & Reid 2001). In addition, they should be familiar with the qualifications of the students being tested. Finally, experience with the assessment method is essential. Panellists who make judgments about clinical proficiency should be experts in the profession and should be familiar with the expectations of trainees at various stages of education, because they will readily understand the consequences associated with passing the test. Because OSCEs are often used to test clinical and communication skills, faculty members who have participated in the OSCE as examiners, or those who have assisted in the development of materials used in the administration (e.g. checklists), would be ideal recruits as panellists for the standard setting meeting. To set standards for the communication skills of physicians, for example, other members of the health care team (e.g. nursing staff, standardised patient trainers) could be considered as judges; because they are familiar with the expected performance of physicians in clinical settings, they would also be appropriate participants in the standard setting process. A suitable mix of panellists based on gender, discipline (e.g. paediatrics, general medicine), and professional activity (e.g. faculty vs. practicing physician) should be considered. This is particularly important when the passing score identifies those candidates who will enter a profession (e.g. licensure).

Table 1. Features of standard setting methods.

Relative or norm-referenced (purpose: ranking examinees)
• Description: the standard reflects group performance; the passing score is set based on test results (i.e. percent passing).
• Comments: useful when a predetermined number (or percentage) of examinees should pass the examination, e.g. admissions testing, employment testing. The passing score may fluctuate due to examinee ability or test difficulty, but the pass rate (number or percentage passing) is stable.

Absolute or criterion-referenced (purpose: determining mastery)
• Description: the standard reflects a conceptual definition of mastery, proficiency, or competence; the passing score is set independent of test results.
• Comments: useful when determining whether examinees meet requirements defined by the standard. All examinees could pass or fail using this type of standard.

Test-centred (a type of absolute standard)
• Description: the standard is defined and translated to a passing score based on review of test materials (e.g. multiple-choice questions, cases, tasks); judges focus on exam content.

Examinee-centred (a type of absolute standard)
• Description: the standard is translated to a passing score based on review of examinees' performance on tasks; judges determine whether the performance they review depicts someone possessing the knowledge and skills needed to meet the standard.

Compromise
• Description: judges consider both the passing score and the passing (or failing) rate, and give an opinion about what constitutes an "acceptable" cut score and passing rate.
• Comments: can be a useful alternative when resources are limited; the approach assumes that a passing score has been determined and test results are available. Useful when a balance between absolute and relative methods is needed.



The more panellists there are, the more likely it is that the resulting passing score will be stable (Jaeger 1991; Ben-David 2000). However, consideration of the management of a large group is also important. The number of panellists needed should be balanced against a number of factors: the consequences associated with examination decisions, the number of performances to review, a reasonable time frame for completion of the standard setting meeting, and the resources available for the meeting. Successful standard setting meetings can be conducted with as few as four panellists or as many as 20. Having a large group will provide the meeting facilitator with the opportunity to assign panellists to smaller groups so that more material can be covered. The deciding factor will be the resources (e.g. space, number of facilitators) available for the meeting.

The next step in organising the meeting is preparing materials to be used in training the panellists. Panellist training is very important; developing a clear understanding of the performance standard (e.g. student in need of remediation, graduate ready for supervised practice, practitioner ready for unsupervised practice) is essential (Ben-David 2000; Bandaranayake 2008). To promote understanding of the performance standard, the criteria that are typically part of a competency-based curriculum can be very useful. This type of information can assist in the delineation of the knowledge, skills, and abilities that comprise the performance standard in a particular context, and at a particular stage in a health professional's career (Frank et al. 2010).

To support training, examination materials may be used as part of the orientation. For example, as part of the orientation to standard setting for an OSCE, panellists could complete some of the stations as examinees. This allows them to experience the examination from the perspective of the examinee. Next, a discussion of the characteristics defining the performance standard would be conducted. Finally, the panellists would be afforded the opportunity to practice the method selected.

In the next section of the Guide, we present methods that can be used to set passing scores for performance assessments, using the OSCE as an example. These methods are commonly used to derive passing scores for OSCE and standardised patient examinations (Norcini et al. 1993; Boulet et al. 2003; Downing et al. 2006). In each section, research regarding the method as used with OSCEs or standardised patient examinations is cited. In order to provide detailed guidelines for deriving passing scores for an OSCE, we generated data for an end-of-year examination for 50 students, using five OSCE stations. Data for a multiple-choice examination consisting of 50 items was also generated. This simulated data set will be used to illustrate a number of methods that can be used to develop passing scores for an OSCE.

Conducting the standard setting meeting: modified Angoff

Checklist items

Several studies have been conducted in which the modified Angoff method was used to set standards for each item in a checklist (e.g. Downing et al. 2003, 2006). For this process, it is necessary to prepare case materials for all panellists, including checklists and a form for providing ratings. The meeting facilitator will need a way to display the ratings (flip chart, projector) and should prepare forms for data entry and, if possible, set up spreadsheets for each case checklist. This will permit simple analyses of the data, calculating averages for items, judges, and cases. In these studies, panellists review the items on the checklist, and the task is to estimate the percentage of examinees meeting the performance standard (e.g. minimally competent) who will correctly perform the action described in the OSCE checklist.

Table 2 shows a sample checklist for a case of a fall in an elderly patient. Following a discussion of the performance standard, and presentation of the case materials, it is recommended that the panellists review the first five checklist items as a group. In the example, there is a large difference between the ratings of raters 10 and 13 on item 3. Discussion of discrepancies of 15% or more would be facilitated, with the goal of reaching consensus on the definition and the rating task.

Table 2. Modified Angoff: checklist items.

Item | Text | Rater 10 | Rater 11 | Rater 12 | Rater 13 | Rater 14 | Average
1 | Describe the fall in more detail OR Tell me what happened. | 65 | 80 | 65 | 80 | 75 | 73
2 | Hit head when you fell? | 75 | 70 | 60 | 65 | 70 | 68
3 | Loss of consciousness? | 65 | 72 | 75 | 85 | 68 | 73
4 | Felt faint OR lightheaded OR dizzy recently or just before the fall? | 70 | 80 | 85 | 80 | 55 | 74
5 | Since fall, any new symptoms other than hip pain, OR double vision OR speech problems OR weakness of an arm/hand? | 75 | 65 | 60 | 70 | 45 | 63
6 | Palpitations before fall? | 80 | 55 | 40 | 62 | 60 | 59
7 | What makes pain worse? | 65 | 60 | 65 | 57 | 65 | 62
8 | Other falls? | 75 | 75 | 55 | 85 | 67 | 71
9 | Past medical history? | 65 | 68 | 75 | 50 | 70 | 66
10 | Medications? | 70 | 65 | 45 | 55 | 55 | 58
11 | Checks symmetry of strength in lower extremities (must include hip flexion on right) | 67 | 68 | 70 | 65 | 65 | 67
12 | Palpates over area(s) of pain and trochanteric area of hip. | 65 | 70 | 68 | 75 | 60 | 68
13 | Observes gait | 60 | 80 | 65 | 70 | 72 | 69
14 | Orientation to time OR assessment of recent memory. | 65 | 75 | 78 | 60 | 70 | 70
15 | Auscultates heart | 60 | 50 | 50 | 65 | 65 | 58
Passing score for case: 67



If there are items with discrepancies of 15% or more, they should be reviewed after the meeting to determine whether there is a content-related problem with the checklist item. Such an item may need to be removed from the determination of the passing score, and any discussion the panellists have can be useful in making that decision. There may be some items where discrepancies of 15% or more cannot be resolved through discussion, but there should not be many of these items. After group rating and discussion, panellists can proceed through the remaining checklist items independently.

The table used for this example shows the average rating across all panellists. To derive the passing score for the case, the average rating across all items and panellists was calculated. In this example, the passing score would be 67%. If examinees correctly complete 67% or more of the items on the checklist, they would achieve a passing score for that case. These are the steps associated with using the modified Angoff method to determine the passing score for a single case. This process would be repeated with all checklists for stations included in the OSCE administration. Once panellists have completed their rating process, data entry can begin. Data entry can be facilitated by using a spreadsheet for each case. Calculating the mean across items and judges produces a cut score for each case. Depending on the scoring and reporting of assessment results, passing scores can be calculated for the OSCE by averaging case passing scores, or for OSCE content domains (e.g. history taking, physical examination, communication).
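To make the arithmetic explicit, here is a minimal Python sketch (our own illustration, not part of the Guide) that reproduces the checklist-level calculation for the first three items of Table 2; the case cut score is simply the mean of all item-by-judge ratings.

# Minimal sketch of the checklist-level modified Angoff calculation.
# Each inner list holds one item's ratings from raters 10-14 (the estimated
# percentage of borderline examinees expected to perform the item correctly).
# Values are taken from the first three rows of Table 2 for illustration.
item_ratings = [
    [65, 80, 65, 80, 75],  # Item 1
    [75, 70, 60, 65, 70],  # Item 2
    [65, 72, 75, 85, 68],  # Item 3
]

# Average each item across judges, then average the item means to get the
# passing score for the case (in percent-correct terms).
item_means = [sum(r) / len(r) for r in item_ratings]
case_passing_score = sum(item_means) / len(item_means)

print([round(m) for m in item_means])  # [73, 68, 73], as in Table 2
print(round(case_passing_score, 1))    # 71.3 for these three items alone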

Cases

Although the modified Angoff has been used by gathering judgments at the checklist item level, it is more common to have panellists make their judgments at the case or station level. One rationale for this approach is that the items in a checklist are inter-related; the likelihood of asking a question or performing a manoeuvre depends on asking other questions or on other findings in the physical examination (Ben-David 2000; Boulet et al. 2003). If the Angoff is conducted at the case level, the panellists' task is to estimate the percentage of examinees who will meet the performance standard based on the content of the case, rather than on the individual items within the checklist for the OSCE station (e.g. Kaufman et al. 2000). Alternately, the panellists can be asked to review the checklist items and estimate the percentage of items for which the examinees meeting the performance standard (e.g. minimally qualified) will receive credit.

Case materials are reviewed by the panellists, and the preparation for data entry is similar. However, a single spreadsheet may be used, depending on the number of cases to be analysed as part of the standard setting process. Table 3 provides a sample spreadsheet format for the simulation we mentioned earlier, with five OSCE stations for an end-of-year test of 50 students. For this example, there are five judges. Once again, the characteristics of the examinees meeting the performance standard are discussed, test materials are presented and reviewed, and panellists begin the rating task. Practicing the method is essential. Using our example, a sixth case would be used for practice and discussion amongst panellists; the ratings from this case would not be used to generate the passing score. The case scores (in percent correct metric) for all students are presented in the last column of the table. If time permits, panellists could provide their ratings and then be given information on how all students performed on the stations; they can then change their estimates. Calculating the mean across judges and stations provides the passing score for the OSCE.
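The same averaging logic applies at the station level. The sketch below (ours, for illustration only) uses the judge-by-station ratings from Table 3, averages each station across judges, and then averages the station cut scores to obtain the examination-level passing score.

# Minimal sketch of the station-level modified Angoff calculation,
# using the ratings shown in Table 3 (rows = stations, columns = raters 10-14).
station_ratings = [
    [45, 55, 45, 60, 45],  # Station 1
    [65, 60, 70, 55, 65],  # Station 2
    [65, 60, 65, 50, 55],  # Station 3
    [70, 65, 60, 55, 60],  # Station 4
    [65, 60, 65, 55, 60],  # Station 5
]

# Average each station's ratings across judges ...
station_cut_scores = [sum(r) / len(r) for r in station_ratings]
# ... then average the station cut scores to get the OSCE passing score.
osce_passing_score = sum(station_cut_scores) / len(station_cut_scores)

print([round(s) for s in station_cut_scores])  # [50, 63, 59, 62, 61]
print(round(osce_passing_score))               # 59, as in Table 3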

Conducting the standard setting meeting: borderline group

While both modified Angoff approaches (checklist items and case level) are commonly used to determine passing scores for OSCE and standardised patient examinations, panellists may find it challenging to estimate the percentage of examinees meeting the performance standard who will receive credit for items or who will correctly manage the case (Boulet et al. 2003). A method that focuses on examinee performance rather than on examination materials, and that is frequently used with OSCEs, is the borderline group method. The borderline group method requires the identification of the characteristics (e.g. knowledge, skills, and abilities) of the "borderline" examinee. The "borderline" examinee is one whose knowledge and skills are not quite adequate, but are not inadequate (Livingston & Zieky 1982). Assessment materials (or actual examinee performances) are categorised by panellists as clear fail, borderline, or clear pass. The passing score is then set at the median (i.e. 50th percentile) score of the borderline group (e.g. Rothman & Cohen 1996).

One modification to this method that has been used with OSCEs is to use the judgments gathered during the OSCE administration (e.g. see Reznick et al. 1996). In this modification, a panel of judges is not used; instead, observers provide information used to derive the passing score for each station. If experts (e.g. physicians, faculty members) are used to score the OSCE stations, they can be asked whether the performance they have seen would be considered "borderline." This approach can save time by gathering examiners' judgments while the examination is being administered.

Table 3. Modified Angoff: station-level passing score.

Station | Rater 10 | Rater 11 | Rater 12 | Rater 13 | Rater 14 | Rater average | Average score (all students)
1 | 45 | 55 | 45 | 60 | 45 | 50 | 59.3
2 | 65 | 60 | 70 | 55 | 65 | 63 | 74.8
3 | 65 | 60 | 65 | 50 | 55 | 59 | 73.1
4 | 70 | 65 | 60 | 55 | 60 | 62 | 71.9
5 | 65 | 60 | 65 | 55 | 60 | 61 | 70.1
Passing score (mean of rater averages): 59; mean score across all students: 69.8



Once examiners have identified examinees whose performance is considered borderline, the passing score can be calculated by finding the median score of all examinees who were classified as "borderline."

To illustrate this approach, we will use the same data set used for the modified Angoff approach, which is presented in Table 4. Fifty students were tested, and examiners provided a rating indicating whether the observed performance was a "clear fail," "borderline," "clear pass," or "superior" at each station. Of the 50 students, nine were thought to have demonstrated "borderline" performance at the OSCE station. To derive the passing score for the station, the median (i.e. 50th percentile) score was calculated. For this example, the passing score was identified using spreadsheet software: =MEDIAN(C23,C26,C29,C36,C37,C42,C43,G2,G6), where "C" and "G" indicate the columns where the OSCE scores are located and the numbers indicate the rows containing the scores of the borderline examinees. In this example, the median score is 64%, so examinees with scores of 64% or higher would pass the station.
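For readers who prefer to script the calculation rather than use a spreadsheet formula, the short Python sketch below computes the same borderline-group cut score; the scores listed are the nine "borderline" ratings from Table 4, and the variable names are ours.

# Minimal sketch of the borderline group method: the passing score is the
# median station score of all examinees rated "borderline" by the examiners.
from statistics import median

# Station scores of the nine examinees rated "borderline" in Table 4.
borderline_scores = [64, 64, 64, 64, 68, 68, 69, 69, 63]

passing_score = median(borderline_scores)
print(passing_score)  # 64, matching the spreadsheet MEDIAN() result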

A modification of this approach is used by the Medical Council of Canada (Smee & Blackmore 2001), where a six-point rating scale is used: inferior, poor, borderline unsatisfactory, borderline satisfactory, good, and excellent. The mean station score of those examinees rated borderline unsatisfactory or borderline satisfactory is calculated to derive the passing score for the station. This modified borderline group method works well when there are enough examinees rated "borderline" for the station. However, the stability of the passing score depends on the number of examinees in the borderline unsatisfactory and borderline satisfactory categories. If few examinees are rated "borderline," then a passing score calculated from the mean of their station scores is not likely to be stable. That is, the reliability associated with a cut score derived from two or three scores is likely to be very low.

To overcome this potential disadvantage, a regression approach was studied by Wood et al. (2006). Using the entire range of OSCE scores can be particularly useful if only a small number of examinees have participated. Because the number of examinees classified as borderline could be very small, the resulting passing score could be less precise than if all scores were used (Wood et al. 2006). In this modification, the checklist score is the dependent variable and the rating is the independent variable. The goal of the regression analysis is to predict the checklist score of the examinees classified as "borderline" for the station.

The borderline regression method is straightforward, and can be done using a Microsoft Excel worksheet. Details on the method are provided in Figures 1–5, which depict a series of seven steps.

Step 1: Prepare a spreadsheet of OSCE scores and examiner ratings.
Step 2: Click on the tab labelled "Data," and when the pop-up window appears, select "Data Analysis." The analysis tool you will select is "Regression."
Step 3: Identify the "Input Y Range" – what will be predicted. In this case, it is the OSCE scores in Column C.
Step 4: Identify the "Input X Range" – what will be used to predict scores. In this case, the ratings provided by "Examiner PF1" in Column D will be selected.
Step 5: Identify the location for analysis results. In the example, we gave the spreadsheet the name "Sheet 3." Click OK in the "Regression" window (upper right side).

Table 4. Example of individual results for the borderline group method.

Date | Student number | Case 6 score | Examiner rating
12-Apr-10 | 15 | 75 | Clear Pass
12-Apr-10 | 20 | 83 | Superior
12-Apr-10 | 43 | 75 | Clear Pass
12-Apr-10 | 42 | 100 | Superior
12-Apr-10 | 13 | 75 | Clear Pass
12-Apr-10 | 38 | 92 | Superior
12-Apr-10 | 5 | 92 | Superior
12-Apr-10 | 14 | 83 | Superior
12-Apr-10 | 29 | 83 | Superior
10-Apr-11 | 37 | 60 | Clear Fail
10-Apr-11 | 36 | 40 | Clear Fail
10-Apr-11 | 16 | 50 | Clear Fail
10-Apr-11 | 47 | 60 | Clear Fail
10-Apr-11 | 24 | 70 | Clear Pass
10-Apr-11 | 40 | 80 | Superior
10-Apr-11 | 33 | 70 | Clear Pass
10-Apr-11 | 35 | 90 | Superior
10-Apr-11 | 26 | 70 | Clear Pass
10-Apr-11 | 7 | 80 | Superior
11-Apr-11 | 4 | 56 | Clear Fail
11-Apr-11 | 18 | 75 | Clear Pass
11-Apr-11 | 21 | 69 | Borderline
11-Apr-11 | 2 | 50 | Clear Fail
11-Apr-11 | 31 | 81 | Superior
11-Apr-11 | 45 | 63 | Borderline
11-Apr-11 | 49 | 50 | Clear Fail
14-Apr-11 | 12 | 64 | Borderline
14-Apr-11 | 39 | 50 | Clear Fail
14-Apr-11 | 28 | 57 | Clear Fail
14-Apr-11 | 50 | 43 | Clear Fail
14-Apr-11 | 10 | 64 | Borderline
14-Apr-11 | 48 | 71 | Clear Pass
14-Apr-11 | 30 | 71 | Clear Pass
14-Apr-11 | 23 | 71 | Clear Pass
13-Apr-11 | 19 | 89 | Superior
13-Apr-11 | 8 | 79 | Clear Pass
14-Apr-11 | 1 | 64 | Borderline
14-Apr-11 | 41 | 64 | Borderline
13-Apr-11 | 19 | 89 | Superior
13-Apr-11 | 17 | 58 | Clear Fail
13-Apr-11 | 6 | 74 | Clear Pass
13-Apr-11 | 32 | 74 | Clear Pass
13-Apr-11 | 27 | 95 | Superior
13-Apr-11 | 11 | 68 | Borderline
13-Apr-11 | 3 | 68 | Borderline
13-Apr-11 | 34 | 89 | Superior
13-Apr-11 | 46 | 84 | Superior
11-Apr-11 | 9 | 94 | Superior
11-Apr-11 | 25 | 69 | Borderline
11-Apr-11 | 22 | 75 | Clear Pass
12-Apr-11 | 44 | 92 | Superior



Figure 1. Spreadsheet of OSCE scores and examiner ratings.

Figure 2. Data analysis "Regression."



Figure 3. Define variables for analysis.

Figure 4. Location for analysis results.



Step 6: The output from the regression ("Summary Output") is in "Sheet 3."
Step 7: The formula for deriving the passing score is:

passing score = (median of ratings × "X Variable 1") + Intercept.

For this example, the passing score would be 75.4 = (2 × 11.561) + 52.326, where 2 is the median of the ratings, 11.561 is the "X Variable 1" coefficient (the slope), and 52.326 is the intercept.

The passing score could be adjusted by the standard error of estimation (labelled "Standard Error" in the Summary Output) if review leads to the conclusion that the examiner at a station was particularly harsh (or lenient).
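The same seven steps can be scripted instead of run through Excel's Analysis ToolPak. The Python sketch below is a minimal, self-contained version: it fits an ordinary least-squares line of station scores on examiner ratings and evaluates it at the median rating, as in the formula above. The rating coding (1 = clear fail to 4 = superior) and the score values are invented for illustration; they will not reproduce the 11.561 and 52.326 of the worked example.

# Minimal sketch of the borderline regression method: regress station checklist
# scores (dependent variable) on examiner ratings (independent variable), then
# take the predicted score at the median rating as the passing score.
from statistics import median

def linear_fit(x, y):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    return slope, mean_y - slope * mean_x

# Illustrative data only: ratings coded 1 = clear fail, 2 = borderline,
# 3 = clear pass, 4 = superior; scores are station percent-correct values.
ratings = [1, 1, 2, 2, 2, 3, 3, 3, 4, 4]
scores = [45, 52, 60, 64, 66, 72, 75, 78, 88, 92]

slope, intercept = linear_fit(ratings, scores)
passing_score = median(ratings) * slope + intercept
print(round(passing_score, 1))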

Conducting the standard setting meeting: contrasting groups

The contrasting groups method requires panellists to review examinee work and classify the performance as acceptable or unacceptable (Livingston & Zieky 1982). In education, information external to the test is used to classify the examinees in these categories (Hambleton et al. 2000). When other measures with similar content are available, two groups of examinees are identified. Then, scores from the test on which performance standards are being established are used to generate distributions (one for each group), and the distributions are compared to determine their degree of overlap. This is done by tabulating the percentage of test-takers in each category and at each score level who are considered "competent". The passing score is the point at which about 50% of the test-takers are considered competent. For examination programmes in health professions education, it is difficult to find an external measure that assesses the same skills as those measured in the OSCE.

Figure 5. Output from regression analysis.

Table 5. Contrasting groups example.

Score range | Examiner decision: Fail | Examiner decision: Pass | Total | Pass rate (%)
0–49 (a) | 3 | 1 | 4 | 92
50–54 | 3 | 0 | 3 | 86
55–59 | 5 | 2 | 7 | 72
60–64 | 2 | 4 | 6 | 60
65–69 | 0 | 5 | 5 | 50
70–74 | 0 | 13 | 13 | 44
75–79 | 0 | 3 | 3 | 38
85–89 | 0 | 4 | 4 | 32
90–94 | 0 | 4 | 4 | 24
95–100 (a) | 0 | 1 | 1 | 2

(a) "Smoothed" averages cannot be created for the highest and lowest score ranges.



The variation most commonly used in medical education is to have panellists decide whether the performance they review on the measure of interest (i.e. the OSCE or standardised patient examination) meets the characteristics associated with the performance standard. One example of a variation on this approach derived the passing score by regressing the number of judges rating the performance as "competent" on the test scores, and set the passing score at the point at which 50% of the panellists rated the performance as competent (Burrows et al. 1999).

In another variation of the contrasting groups method, panellists judged the performance of examinees on the test of interest without knowledge of their test scores (Clauser & Clyman 1994). The passing score was then identified as the intersection of the two score distributions. We will illustrate the use of this approach using our earlier example. Fifty students were tested, and the examiner provided a rating indicating whether the observed performance was considered "failing" or "passing." Table 5 shows the data set with score ranges and counts of examinees rated "fail" or "pass" by the examiner. In addition to the columns labelled "Examiner decision," the total number of examinees with scores within each range is provided. In this example, the examiner's ratings are separate from the score range; imagine that the scores are based on a checklist completed by another rater. The results show that the rater identified examinees considered passing even in the lowest part of the score range. The column labelled "Pass rate" indicates the percentage of examinees that would pass if the passing score were set just above that score range. For example, if the passing score were set at 50% correct, 46 examinees would pass, and the pass rate would be 92%. Figure 6 illustrates the overlap in the score distributions. Using the approach studied by Clauser & Clyman (1994), the point of intersection would generate a recommended passing score of 65%. Examinees with scores of 65% or higher would pass, and those with scores of 64% or lower would fail.
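One way to locate the crossing point from grouped data such as Table 5 is sketched in Python below. It uses a deliberately simple rule (take the lower bound of the first score range at or above which no "fail" judgments remain) rather than the smoothed graphical intersection shown in the figure, but it returns the same 65% for these counts; the bin boundaries and helper name are ours.

# Minimal sketch of a contrasting-groups cut score using the grouped data in
# Table 5: find where the "pass" distribution takes over from the "fail"
# distribution and recommend the lower bound of that score range.
bins = [(0, 49), (50, 54), (55, 59), (60, 64), (65, 69),
        (70, 74), (75, 79), (85, 89), (90, 94), (95, 100)]
fail_counts = [3, 3, 5, 2, 0, 0, 0, 0, 0, 0]
pass_counts = [1, 0, 2, 4, 5, 13, 3, 4, 4, 1]

def contrasting_groups_cut(bins, fail_counts, pass_counts):
    """Return the lower bound of the first bin with no 'fail' judgments at or
    above it, i.e. the point beyond which the two distributions no longer overlap."""
    for i, (low, _high) in enumerate(bins):
        if sum(fail_counts[i:]) == 0:
            return low
    return bins[-1][0]

print(contrasting_groups_cut(bins, fail_counts, pass_counts))  # 65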

Determining examination level passing scores

Although we have reviewed methods that derive passing scores for each task performed in a performance-based assessment, a passing score at the examination level is often needed. If examinees must receive a passing score on each task or skill tested, the standard is conjunctive. If performance across all tasks or skills is considered, the standard is compensatory (Haladyna & Hess 1999; Ben-David 2000). When deciding whether to require passing each task or to consider performance overall, there are several factors to consider. First, examinee performance is likely to vary from task to task; that is, there is likely to be inconsistency in the performance of each examinee, and on some tasks an examinee will perform better than on others. In addition, the reliability of the individual tasks is likely to be much lower than the reliability across all tasks. Because conjunctive standards are likely to result in a higher number of students failing, the consequences of failing the examination and the logistics of resitting or repeating instruction must be considered (Ben-David 2000). Compensatory standard setting involves averaging (or summing) performance across all tasks to derive the examination score. Compensatory standards allow examinees to compensate for poor performance on one task with better performance on another.

Figure 6. Example of contrasting groups: the number of examinees rated "fail" and the number rated "pass" plotted by score range.



The degree to which the tasks (or skills) correlate with each other can provide support for the compensatory vs. conjunctive decision (Ben-David 2000). Another consideration is how performance is reported to examinees. This decision is important because those managing the standard setting process will need to decide how the derived passing scores will be used. In the conjunctive model, examinees must meet or exceed the score for each task in order to pass the examination. In the compensatory model, the average passing score across tasks can be used to set the examination-level passing score. With OSCE stations, it may be that decisions can be made across tasks (compensatory) but each skill (e.g. communication, clinical decision-making) must be passed in order to pass the examination. Consideration of the feedback provided to the examinees will play an important role in determining whether a compensatory model, a conjunctive model, or a combination will be used to set the examination level passing score.
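The difference between the two models is easy to see in code. The sketch below applies both decision rules to one hypothetical set of station scores and station-level cut scores; all values and names are illustrative only.

# Minimal sketch contrasting compensatory and conjunctive pass-fail rules.
# Station scores and station cut scores below are illustrative only.
station_scores = {"station1": 72, "station2": 58, "station3": 81, "station4": 66, "station5": 70}
station_cuts   = {"station1": 60, "station2": 63, "station3": 59, "station4": 62, "station5": 61}

# Compensatory: compare the average score with the average of the station cut
# scores, so a weak station can be offset by a strong one.
compensatory_pass = (sum(station_scores.values()) / len(station_scores)
                     >= sum(station_cuts.values()) / len(station_cuts))

# Conjunctive: every station must be passed individually.
conjunctive_pass = all(station_scores[s] >= station_cuts[s] for s in station_scores)

print(compensatory_pass)  # True  (average 69.4 vs. average cut 61)
print(conjunctive_pass)   # False (station2 is below its cut score)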

Compromise methods

Although the results from the standard setting panellists are the most important elements in determining the passing score, additional information is often used to determine the final passing score that will be applied to examinations (Geisinger 1991; Geisinger & McCormick 2010). One type of information that is considered is the pass–fail rate associated with the passing score. The compromise approaches proposed by Hofstee (1983), Beuk (1984), and De Gruijter (1985) explicitly ask the panellists to consider both the passing score and the passing (or failing) rate. Each approach assumes that the judges have an opinion about what constitutes an "acceptable" passing score and passing rate.

Hofstee suggested that the chosen passing score is only one out of a universe of possible passing scores, and that it is feasible to plot all possible failure rates. To ensure that panellists have considered these data, the standard that is being set (e.g. minimal competence, proficiency, etc.) is discussed, the details of the examination process are reviewed, and the panellists are asked to answer four questions:

(1) What is the lowest acceptable percentage of students who fail the examination? (Minimum fail rate; fmin)
(2) What is the highest acceptable percentage of students who fail the examination? (Maximum fail rate; fmax)
(3) What is the lowest acceptable percent correct score that would be considered passing? (Minimum passing score; kmin)
(4) What is the highest acceptable percent correct score that would be considered passing? (Maximum passing score; kmax)

The four data points are calculated by averaging across all judges. The percentage of examinees that would pass for every possible value of the passing score on the test is graphed, and the four data points are plotted, based on the four judgments of the standard setting panel. Figure 7 provides an example of the application of the Hofstee method. In this example, 140 students took a 50-item end-of-year test.

The curve in the chart shows the projected failure rate based on percent correct scores on the test. Instructors were asked the four questions that appeared above:

(1) What is the lowest acceptable percentage of students who fail the examination? Average: 20%

Figure 7. Example of the Hofstee method. The chart plots the percentage of examinees against test scores (% correct); the judges' bounds are marked (minimum passing score 60%, maximum fail rate 30%, maximum passing score 75%, minimum fail rate 20%), and the derived passing score is 63%.



(2) What is the highest acceptable percentage of students who fail the examination? Average: 30%
(3) What is the lowest acceptable percent correct score that would be considered passing? Average: 60%
(4) What is the highest acceptable percent correct score that would be considered passing? Average: 75%

Using the information from the judges, two points are plotted: the intersection of the lowest acceptable fail rate and the highest acceptable percent correct score, and the intersection of the highest acceptable fail rate and the lowest acceptable percent correct score (see Figure 7). These two points create a line that intersects the curve defined by percent correct score and projected failure rate. The passing score is found by following the dotted line from the intersection to the x-axis (percent correct scores). The fail rate is found by following the dotted line from the intersection to the y-axis (percent fail).
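A scripted version of this graphical procedure is sketched below: it builds the observed failure-rate curve from a simulated score distribution (the 140 real scores behind Figure 7 are not reproduced here), draws the judges' line through the two plotted points, and scans for the cut score at which the two meet. The function names and the simulated distribution are our own assumptions; only the four averaged judgments match the worked example.

# Minimal sketch of the Hofstee calculation: intersect the line through
# (k_max, f_min) and (k_min, f_max) with the observed failure-rate curve.
import random

random.seed(1)
scores = [min(100, max(0, random.gauss(70, 12))) for _ in range(140)]  # simulated % correct

f_min, f_max = 20.0, 30.0   # acceptable fail-rate bounds (%)
k_min, k_max = 60.0, 75.0   # acceptable passing-score bounds (% correct)

def fail_rate(cut):
    """Observed percentage of examinees scoring below the cut score."""
    return 100.0 * sum(s < cut for s in scores) / len(scores)

def hofstee_line(cut):
    """Fail rate implied by the judges' line at a given cut score."""
    slope = (f_max - f_min) / (k_min - k_max)          # negative slope
    return f_min + slope * (cut - k_max)

# Scan candidate cut scores and pick the one where the curve and the line meet.
cut_scores = [c / 10 for c in range(int(k_min * 10), int(k_max * 10) + 1)]
passing_score = min(cut_scores, key=lambda c: abs(fail_rate(c) - hofstee_line(c)))
print(round(passing_score, 1), round(fail_rate(passing_score), 1))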

In a modification of the Hofstee method, Beuk (1984) suggested that the panellists report the extent to which each of their judgments should be considered in deriving the final passing score; that is, panellists are asked the degree to which their decisions are examinee-oriented or test-oriented. The means and standard deviations of both the passing scores and the acceptable pass rates are computed, and the point defined by the mean passing rate and the mean passing score is plotted. The compromise consists of using the ratio of the standard deviation of the pass rate to the standard deviation of the passing score as a slope: the point where the distribution of scores intersects the line generated with this slope constitutes the passing score. De Gruijter (1985) further suggested that an additional question be posed to panellists, namely the level of uncertainty regarding these two judgments. Beuk's and De Gruijter's methods have not been reported in the medical education literature, but the Hofstee method has been used by a number of researchers.

Schindler et al. (2007) reported on the use of the Hofstee approach to set passing scores for a surgery clerkship. Because the goal was to set a passing score for the clerkship as a whole rather than for individual assessments (multiple-choice examinations, OSCEs, clerkship grades, ratings of professionalism), the standard setting panel determined that the use of the Hofstee method was appropriate. The use of multiple, related assessments led the group to conclude that compensatory standards would be set, although a breach in professionalism could result in failing. Panellists reviewed score distributions for all students, as well as for those who had failed in previous years, along with scoring rubrics and examination materials, before they responded to the four questions of the Hofstee method. The authors found that there was a high level of agreement amongst the judges, and that the derived pass rate was reasonable when applied to previous clerkship data.

Selecting a standard setting method

With many methods available, it may seem difficult to decide which is the "best" method. When selecting a standard setting method, there are practical considerations to be made. The method should permit judgments that are based on information; processes that permit expert judgment in light of performance data are preferable. The method chosen should be closely aligned with the goal of assessment. The method should require thoughtful effort from those participating in the process, and it should be based on research. Finally, the method should be easy to explain to participants, and easy to implement.

It is important to keep in mind that the standard setting study will generate a recommended passing score and that the score should correspond to a level of performance that meets the purpose of the test and of the standard setting process. For example, if the test is used to identify students who may need additional training or remediation, then passing denotes the group of students ready for the next phase of study, while failing identifies the group who may repeat the course. In this case the level of performance may not be as high as the level that corresponds to competence in independent practice. If the test is used to identify those who are ready to graduate and enter a setting with supervised practice, those who pass the test possess the characteristics associated with readiness to enter supervised practice. Passing these tests has different meanings, and the final determination of the passing score will take these differences into account. While it is not possible to identify the best method, the selection should be based on the purpose of the test, as well as on the practical considerations delineated in this Guide.

Implementing the standard

Since the standard setting study will generate a "recommended" passing score, there are additional issues to be considered before implementing the results of the standard setting process. One important decision is whether the passing score will be compensatory or conjunctive. OSCEs and standardised patient examinations typically include several stations. If the assessment is averaged (or summed) across cases, the passing score should be generated in a similar fashion (i.e. averaged or summed across cases). In this example, the standard would be considered compensatory: those who meet or exceed the passing score will pass, and poor performance at one station can be compensated by better performance at another station. Alternately, a passing score could be derived for each case/station, and an additional requirement could be that a set number of cases have to be passed in order to pass the assessment. In this case, the standard would be conjunctive. Because cases often measure both clinical and interpersonal skills, passing scores could also be generated for each of these skills, and the requirement to pass would be to meet or exceed the passing score in each skill area. This approach would also be considered conjunctive.

When deciding whether the pass–fail decision will be compensatory or conjunctive, it is important to consider the research done in this area. Performance on different tasks can be quite variable (Traub 1994), and performance on a single case is not likely to be a reliable indicator of an examinee's ability (Linn & Burton 1994).



Conjunctive standards based on individual stations will result in higher failure rates, and can result in incorrect decisions due to measurement error (Hambleton & Slater 1997; Ben-David 2000). While higher failure rates will also result from conjunctive standards based on skill area, it is reasonable to require that the passing score be met for each skill area without compensation across areas. Ben-David (2000) suggests that consideration of the construct measured by the assessment is essential in making a decision about compensatory and conjunctive standards. The purpose of the assessment and the feedback given regarding the results are important criteria to include in making a decision. For example, it would be very useful for examinees to know that they need to improve their physical examination manoeuvres, but that their history taking and communication skills are adequate. In this case, it would be reasonable to set separate passing scores based on the skills measured.

Another consideration is the format in which examination results are reported to test-takers and other stakeholders. If the OSCE is administered as an end-of-year assessment, students who fail (and their instructors) may want to know about areas of strength and weakness, so that they can concentrate their efforts on skill improvement. Even students who pass may want to know whether there were any areas in which they could improve. Providing feedback is important, particularly for failing examinees (Livingston & Zieky 1982; American Educational Research Association et al. 1999).

Finally, consideration of the percentage of examinees passing is essential. Understanding the consequences of the decisions generated is vital to ensuring that decision makers comprehend and endorse the process. It is not likely to be feasible to generate the recommended passing score during the standard setting meeting itself, so a meeting with stakeholders (e.g. faculty members, heads of departments) should be conducted to inform them of the results of the study and to present the implications (i.e. the number passed).

Maintaining the standard

Once the meetings have been conducted and the passing score has been generated and endorsed, it is time to consider how the passing score will be generated for the next testing cycle. Because the performance of examinees and the difficulty of the test can change from administration to administration, the same passing score may not have the same effect over time. If test materials are revised, it is essential to conduct the standard setting meeting once again. Even if the test materials are not changed, it is important to monitor the performance of examinees and the difficulty of the test, as well as the consequences of implementing the passing score (i.e. changes in passing rates). If the test becomes easier (i.e. examinees obtain higher scores) and the passing score remains the same, the passing rate is likely to increase. Conversely, if the test becomes more difficult, the passing rate is likely to decrease. Revisiting the definition of the standard, as well as the passing score, in light of changes associated with the test on a regular basis is advised. Monitoring test performance is essential if the test is used for determining examinee qualifications, whether that means going on to the next level of training or entering independent practice.

Conclusions

Although there is extensive research on standard setting with both multiple-choice and performance-based tests, there is no "right" passing score and no "best" method. Different methods yield different results. Selecting a method depends on the purpose of the examination and the resources available for the standard setting effort. The methods presented, the guidelines provided, and the examples given are meant to inform decisions regarding selecting a method, preparing for a standard setting meeting, conducting the meeting and analysing the data obtained, and implementing and maintaining the standard.

Notes on contributors

DANETTE W. MCKINLEY, PhD, is the Director of Research and Data Resources at FAIMER. Dr McKinley determines research priorities, defines scope, and proposes methodology for studies focused on understanding and promoting international medical education. She supports research activities related to the certification of graduates of international medical programs. Her interests include educational research methodology and assessment, particularly for licensure or certification. With more than 20 years of experience in licensure and certification testing, she now concentrates her efforts on the development of research programs on international medical education and the migration of health care workers.

JOHN J. NORCINI, PhD, is President and Chief Executive Officer of FAIMER. Dr Norcini became FAIMER's first President and Chief Executive Officer in May 2002. Before joining FAIMER, Dr Norcini spent 25 years with the American Board of Internal Medicine, serving in various capacities, including Director of Psychometrics, Executive Vice President for Evaluation and Research, and finally, Executive Vice President of the Institute for Clinical Evaluation. Dr Norcini's principal academic interest is the assessment of physician performance. Current major research interests include methods for setting standards, assessing practice performance, and testing professional competence. His research also focuses on physician migration and workforce issues, as well as the impact of international medical graduates on the U.S. health care system.

Declaration of interest: The authors report no conflicts of interest. The authors alone are responsible for the content and writing of this article.

References

American Educational Research Association, American Psychological Association, National Council on Measurement in Education. 1999. Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Angoff WH. 1971. Scales, norms, and equivalent scores. In: Thorndike RL, editor. Educational measurement. Washington, DC: American Council on Education. pp 508–600.

Bandaranayake RC. 2008. Setting and maintaining standards in multiple choice examinations: AMEE Guide No. 37. Med Teach 30:836–845.

Ben-David MF. 2000. AMEE Guide No. 18: Standard setting in student assessment. Med Teach 22:120–130.

Beuk CH. 1984. A method for reaching a compromise between absolute and relative standards in examinations. J Educ Measure 21:147–152.

Boulet JR, De Champlain AF, McKinley DW. 2003. Setting defensible performance standards on OSCEs and standardized patient examinations. Med Teach 25:245–249.

Burrows PJ, Bingham L, Brailovsky CA. 1999. A modified contrasting groups method used for setting the passmark in a small scale standardised patient examination. Adv Health Sci Educ Theory Pract 4:145–154.

Clauser BE, Clyman SG. 1994. A contrasting-groups approach to standard setting for performance assessments of clinical skills. Acad Med 69:S42–S44.

Cohen-Schotanus J, van der Vleuten CPM. 2010. A standard setting method with the best performing students as point of reference: Practical and affordable. Med Teach 32:154–160.

De Gruijter DNM. 1985. Compromise models for establishing examination standards. J Educ Measure 22:263–269.

Dijkstra J, Van der Vleuten CPM, Schuwirth LWT. 2010. A new framework for designing programmes of assessment. Adv Health Sci Educ Theory Pract 15:379–393.

Downing SM, Lieska NG, Raible MD. 2003. Establishing passing standards for classroom achievement tests in medical education: A comparative study of four methods. Acad Med 78:S85–S87.

Downing SM, Tekian A, Yudkowsky R. 2006. Procedures for establishing defensible absolute passing scores on performance examinations in health professions education. Teach Learn Med 18:50–57.

Ebel R. 1972. Essentials of educational measurement. 2nd ed. Englewood Cliffs, NJ: Prentice-Hall.

Frank JR, Snell LS, Cate OT, Holmboe ES, Carraccio C, Swing SR, Harris P, Glasgow NJ, Campbell C, Dath D, et al. 2010. Competency-based medical education: Theory to practice. Med Teach 32:638–645.

Geisinger KF. 1991. Using standard-setting data to establish cutoff scores. Educ Measure: Issu Pract 10:17–22.

Geisinger KF, McCormick CM. 2010. Adopting cut scores: Post-standard-setting panel considerations for decision makers. Educ Measure: Issu Pract 29:38–44.

Haladyna T, Hess R. 1999. An evaluation of conjunctive and compensatory standard-setting strategies for test decisions. Educ Assess 6:129–153.

Hambleton RK, Jaeger RM, Plake BS, Mills C. 2000. Setting performance standards on complex educational assessments. Appl Psychol Measur 24:355–366.

Hambleton RK, Slater SC. 1997. Reliability of credentialing examinations and the impact of scoring models and standard-setting policies. Appl Measur Educ 10:19–28.

Hofstee WKB. 1983. The case for compromise in educational selection and grading. In: Anderson SB, Helmick JS, editors. On educational testing. San Francisco, CA: Jossey-Bass. pp 109–127.

Holmboe ES, Sherbino J, Long DM, Swing SR, Frank JR. 2010. The role of assessment in competency-based medical education. Med Teach 32:676–682.

Jaeger RM. 1991. Selection of judges for standard-setting. Educ Measure: Issu Pract 10:3–14.

Kane MT. 1992. The assessment of professional competence. Eval Health Prof 15:163–182.

Kane MT. 1994. Validating interpretive arguments for licensure and certification examinations. Eval Health Prof 17:133–159; discussion 236–241.

Kane MT. 2006. Validation. In: Brennan RL, editor. Educational measurement. Westport, CT: American Council on Education and Praeger Publishers. pp 17–64.

Kaufman DM, Mann KV, Muijtjens AMM, van der Vleuten CPM. 2000. A comparison of standard-setting procedures for an OSCE in undergraduate medical education. Acad Med 75:267–271.

Linn RL, Burton E. 1994. Performance-based assessment: Implications of task specificity. Educ Measure: Issu Pract 13:5–8.

Livingston SA, Zieky MJ. 1982. Passing scores: A manual for setting standards of performance on educational and occupational tests. Princeton, NJ: Educational Testing Service.

Miller GE. 1990. The assessment of clinical skills/competence/performance. Acad Med 65:S63–S67.

Nedelsky L. 1954. Absolute grading standards for objective tests. Educ Psychol Measur 14:3–19.

Nestel D, Kneebone R, Black S. 2006. Simulated patients and the development of procedural and operative skills. Med Teach 28:390–391.

Norcini J, Burch V. 2007. Workplace-based assessment as an educational tool: AMEE Guide No. 31. Med Teach 29:855–871.

Norcini JJ. 1994. Principles for setting standards on certifying and licensing examinations. In: Rothman AI, Cohen R, editors. The Sixth Ottawa Conference on Medical Education. Toronto: University of Toronto Bookstore. pp 346–347.

Norcini JJ. 2003. Work based assessment. Br Med J 326:753–755.

Norcini J, McKinley D. 2007. Assessment methods in medical education. Teach Teacher Educ 23:239–250.

Norcini JJ, Stillman PL, Sutnick AI, Regan MB, Haley HL, Williams RG, Friedman M. 1993. Scoring and standard setting with standardized patients. Eval Health Prof 16:322–332.

Patil NG, Saing H, Wong J. 2003. Role of OSCE in evaluation of practical skills. Med Teach 25:271–272.

Pell G, Fuller R, Homer M, Roberts T. 2010. How to measure the quality of the OSCE: A review of metrics – AMEE Guide No. 49. Med Teach 32:802–811.

Raymond MR, Reid J. 2001. Who made thee a judge? Selecting and training participants for standard setting. In: Cizek GJ, editor. Setting performance standards: Concepts, methods, and perspectives. Mahwah, NJ: Lawrence Erlbaum Associates. pp 119–158.

Reznick RK, Blackmore D, Dauphinee WD, Rothman AI, Smee S. 1996. Large-scale high-stakes testing with an OSCE: Report from the Medical Council of Canada. Acad Med 71:S19–S21.

Rothman AI, Cohen R. 1996. A comparison of empirically- and rationally-defined standards for clinical skills checklists. Acad Med 71:S1–S3.

Schindler N, Corcoran J, DaRosa D. 2007. Description and impact of using a standard-setting method for determining pass/fail scores in a surgery clerkship. Am J Surg 193:252–257.

Smee SM, Blackmore DE. 2001. Setting standards for an objective structured clinical examination: The borderline group method gains ground on Angoff. Med Educ 35:1009–1010.

Traub RE. 1994. Facing the challenge of multidimensionality in performance assessment. In: Rothman AI, Cohen R, editors. Proceedings of the Sixth Annual Ottawa Conference on Medical Education. Toronto: University of Toronto Bookstore. pp 9–11.

Wood TJ, Humphrey-Murto SM, Norman GR. 2006. Standard setting in a small scale OSCE: A comparison of the modified borderline-group method and the borderline regression method. Adv Health Sci Educ 11:115–122.
